
luigi.UploadExportToS3: Skip upload of already-uploaded files
ClosedPublic

Authored by vlorentz on Dec 16 2022, 3:40 PM.

Diff Detail

Repository
rDDATASET Datasets
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 33258
Build 52145: Phabricator diff pipeline on jenkins
Build 52144: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8966 (id=32303)

Could not rebase; Attempt merge onto c717f60fe0...

Updating c717f60..40394eb
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit 40394eb3c87fb999ed76267756f57b5ffa30e00d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:39:58 2022 +0100

    luigi.UploadExportToS3: Skip upload of already-uploaded files

commit b957b58c4baa2bea2ca559979a1ce1d08ff4eb3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types
    
    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
    set of object types was the same as the set of directories, which is wrong:
    
    * for the `edges` format, there is no origin_visit or origin_visit_status
      directory
    * for both `edges` and `orc` formats, this was missing relational tables.
    
    A possible fix would have been to use the `swh.dataset.relational.TABLES`
    constant and keep ignoring non-existing dirs in the `edges`, but I decided to
    simply list directories instead, as it will prevent future issues if we
    decide to add directories that do not match any table in Athena for
    whatever reason.

commit dab573172c21e67c273ab86122714e35da33465e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files
    
    Stamp files are only useful while building, and not copied to and from S3,
    so the check failed after a round-trip through S3.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/177/ for more details.
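
To illustrate the "list directories instead of using object_types" commit in the report above, here is a minimal sketch of the idea. The helper name `list_export_dirs` and the example directory names are hypothetical; this is not the code from `swh/dataset/luigi.py`, only an assumption of what "dynamically list directories" could look like with the standard library.

```python
from pathlib import Path
from typing import List


def list_export_dirs(export_root: Path) -> List[str]:
    """Return the names of the sub-directories actually present in an export.

    Hypothetical helper: it trusts the filesystem rather than a fixed list of
    object types, so a format that lacks some directories (or gains new ones)
    needs no special-casing.
    """
    return sorted(p.name for p in export_root.iterdir() if p.is_dir())
```

With this approach, a missing `origin_visit` directory in one format, or an extra directory that matches no table, is handled without touching the upload or download code.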

jayeshv added inline comments.
swh/dataset/luigi.py
421

This will do an S3 GET on every object, correct? If this scenario (objects that were already uploaded) is rare, isn't it better to just overwrite the existing object?

swh/dataset/luigi.py
421

It does a HEAD, not a GET, so it only fetches the object's metadata and not the whole object. Given that we have a couple hundred files of several gigabytes each, skipping re-uploads is worth it even if we rarely need it.
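
A rough sketch of the skip-if-present logic discussed here, for readers without the diff at hand. Everything in it is an assumption made for illustration: `ObjectStore`, `InMemoryStore`, and `upload_missing` are hypothetical names, and the in-memory store stands in for a real S3 client whose `exists` would issue a HEAD request. It is not the implementation in `swh/dataset/luigi.py`.

```python
from typing import Dict, Iterable, List, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; not the client used in luigi.py."""

    def exists(self, key: str) -> bool: ...

    def put(self, local_path: str, key: str) -> None: ...


class InMemoryStore:
    """Stand-in store that counts calls, for illustration only."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}
        self.head_calls = 0
        self.put_calls = 0

    def exists(self, key: str) -> bool:
        # A real client would issue a HEAD request here: metadata only, no body.
        self.head_calls += 1
        return key in self.objects

    def put(self, local_path: str, key: str) -> None:
        # A real client would transfer the whole file here.
        self.put_calls += 1
        self.objects[key] = local_path


def upload_missing(store: ObjectStore, files: Iterable[str], prefix: str) -> List[str]:
    """Upload only the files whose key is absent; return the keys uploaded."""
    uploaded = []
    for local_path in files:
        key = f"{prefix}/{local_path}"
        if store.exists(key):
            continue  # already uploaded by a previous run
        store.put(local_path, key)
        uploaded.append(key)
    return uploaded
```

The trade-off is one cheap metadata request per file against a multi-gigabyte transfer per file. Note that this sketch checks existence only; it does not compare sizes or checksums.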

This revision is now accepted and ready to land. Dec 19 2022, 11:14 AM

Build is green

Patch application report for D8966 (id=32319)

Could not rebase; Attempt merge onto d391394e53...

Updating d391394..a01a82f
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit a01a82fc755a6a41c2789c31f9cd6dd8655e9951
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:39:58 2022 +0100

    luigi.UploadExportToS3: Skip upload of already-uploaded files

commit 28898bbf017ddf50def798a0d2522e88a1c30019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types
    
    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
    set of object types was the same as the set of directories, which is wrong:
    
    * for the `edges` format, there is no origin_visit or origin_visit_status
      directory
    * for both `edges` and `orc` formats, this was missing relational tables.
    
    A possible fix would have been to use the `swh.dataset.relational.TABLES`
    constant and keep ignoring non-existing dirs in the `edges`, but I decided to
    simply list directories instead, as it will prevent future issues if we
    decide to add directories that do not match any table in Athena for
    whatever reason.

commit 4c432adf7b218ed57d2e1075ba7f63e48e6ab9ca
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files
    
    Stamp files are only useful while building, and not copied to and from S3,
    so the check failed after a round-trip through S3.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/180/ for more details.