Details
- Reviewers: jayeshv
- Group Reviewers: Reviewers
- Commits: rDDATASETa01a82fc755a: luigi.UploadExportToS3: Skip upload of already-uploaded files
Diff Detail
- Repository: rDDATASET Datasets
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D8966 (id=32303)
Could not rebase; Attempt merge onto c717f60fe0...
Updating c717f60..40394eb
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit 40394eb3c87fb999ed76267756f57b5ffa30e00d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100
luigi.UploadExportToS3: Skip upload of already-uploaded files
commit b957b58c4baa2bea2ca559979a1ce1d08ff4eb3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100
luigi: Dynamically list directories instead of using object_types
Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
set of object types was the same as the set of directories, which is wrong:
* for the `edges` format, there is no origin_visit or origin_visit_status
directory
* for both `edges` and `orc` formats, this was missing relational tables.
A possible fix would have been to use the `swh.dataset.relational.TABLES`
constant and keep ignoring non-existing dirs in the `edges` format, but I
decided to simply list directories instead, as this prevents future issues
if we decide to add directories that do not match any table in Athena for
whatever reason.
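For illustration, a minimal sketch of the dynamic-listing approach this commit describes, assuming a local export directory; the helper name and the exclusion of `meta/` are assumptions, not code taken from the diff:

```python
from pathlib import Path


def list_export_directories(export_path: str) -> list[str]:
    """List the object directories actually present in an export,
    instead of deriving them from a fixed set of object types."""
    return sorted(
        entry.name
        for entry in Path(export_path).iterdir()
        # Assumption: meta/ holds export metadata, not object tables.
        if entry.is_dir() and entry.name != "meta"
    )
```

Listing what is actually on disk means an `edges` export simply never yields `origin_visit`, and any future extra directory is picked up without code changes.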
commit dab573172c21e67c273ab86122714e35da33465e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100
luigi: Read meta/export.json instead of relying on stamp files
Stamp files are only useful while building, and are not copied to and from S3,
so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/177/ for more details.
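For illustration, a completeness check along the lines this commit describes might read `meta/export.json` directly; the key consulted below is hypothetical, as the build report does not show the file's schema:

```python
import json
from pathlib import Path


def export_is_complete(export_path: str) -> bool:
    """Check completeness by reading meta/export.json, which (unlike
    stamp files) travels with the rest of the export through S3."""
    meta_file = Path(export_path) / "meta" / "export.json"
    if not meta_file.exists():
        return False
    meta = json.loads(meta_file.read_text())
    # Hypothetical key; the real schema of meta/export.json may differ.
    return bool(meta.get("object_types"))
```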
| swh/dataset/luigi.py | |
|---|---|
| 420 | This will do an s3 get on every object, correct? If this scenario (repeating objects) is rare, isn't it better to overwrite the existing object? |

| swh/dataset/luigi.py | |
|---|---|
| 420 | It does a HEAD, not a GET, so it only gets stats and not the whole object. Given that we have a couple hundred files of several gigabytes each, it's worth it even if we use it rarely. |
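As the reply above notes, the existence check issues a HEAD request rather than a GET, so only the object's metadata is fetched. A minimal sketch of such a check, assuming boto3 and illustrative bucket/key names (the actual diff may use a different S3 client):

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")


def already_uploaded(bucket: str, key: str) -> bool:
    """Return True if the object already exists in S3, using a HEAD
    request (head_object) so the object body is never transferred."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise


# Upload only files that are not already present (names illustrative):
# if not already_uploaded("softwareheritage", f"graph/{name}"):
#     s3.upload_file(local_path, "softwareheritage", f"graph/{name}")
```

For a few hundred multi-gigabyte files, a HEAD per object is a negligible cost compared with re-uploading even one of them.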
Build is green
Patch application report for D8966 (id=32319)
Could not rebase; Attempt merge onto d391394e53...
Updating d391394..a01a82f
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit a01a82fc755a6a41c2789c31f9cd6dd8655e9951
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100
luigi.UploadExportToS3: Skip upload of already-uploaded files
commit 28898bbf017ddf50def798a0d2522e88a1c30019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100
luigi: Dynamically list directories instead of using object_types
Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
set of object types was the same as the set of directories, which is wrong:
* for the `edges` format, there is no origin_visit or origin_visit_status
directory
* for both `edges` and `orc` formats, this was missing relational tables.
A possible fix would have been to use the `swh.dataset.relational.TABLES`
constant and keep ignoring non-existing dirs in the `edges` format, but I
decided to simply list directories instead, as this prevents future issues
if we decide to add directories that do not match any table in Athena for
whatever reason.
commit 4c432adf7b218ed57d2e1075ba7f63e48e6ab9ca
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100
luigi: Read meta/export.json instead of relying on stamp files
Stamp files are only useful while building, and are not copied to and from S3,
so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/180/ for more details.