Software Heritage

luigi: Dynamically list directories instead of using object_types
ClosedPublic

Authored by vlorentz on Dec 16 2022, 3:37 PM.

Details

Summary

Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
set of object types was the same as the set of directories, which is wrong:

  • for the `edges` format, there is no origin_visit or origin_visit_status directory
  • for both `edges` and `orc` formats, the relational tables were missing.

A possible fix would have been to use the `swh.dataset.relational.TABLES`
constant and keep ignoring non-existent dirs in the `edges` format, but I
decided to simply list directories instead, as this prevents future issues
if we decide to add directories that do not match any table in Athena for
whatever reason.
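The approach can be sketched roughly as follows; `list_export_directories` is a hypothetical helper for illustration, not the actual code in swh/dataset/luigi.py. The point is that the upload/download tasks enumerate whatever directories the export actually contains, rather than iterating a fixed set of object types:

```python
from pathlib import Path


def list_export_directories(export_path: str) -> list[str]:
    # Hypothetical helper: enumerate the directories actually present in
    # the export, instead of assuming one directory per object type.
    # This tolerates formats (like `edges`) that omit some directories,
    # and picks up relational tables without hardcoding their names.
    return sorted(
        entry.name
        for entry in Path(export_path).iterdir()
        if entry.is_dir()
    )
```

Listing rather than enumerating a constant also means nothing breaks if a future export grows a directory with no matching Athena table.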

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8965 (id=32302)

Could not rebase; Attempt merge onto c717f60fe0...

Updating c717f60..b957b58
Fast-forward
 swh/dataset/luigi.py | 49 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 19 deletions(-)
Changes applied before test
commit b957b58c4baa2bea2ca559979a1ce1d08ff4eb3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types
    
    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
    set of object types was the same as the set of directories, which is wrong:
    
    * for the `edges` format, there is no origin_visit or origin_visit_status
      directory
    * for both `edges` and `orc` formats, this was missing relational tables.
    
    A possible fix would have been to use the `swh.dataset.relational.TABLES`
    constant and keep ignoring non-existing dirs in the `edges`, but I decided to
    simply list directories instead, as it will prevent future issues if we
    decide to add directories that do not match any table in Athena for
    whatever reason.

commit dab573172c21e67c273ab86122714e35da33465e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files
    
    Stamp files are only useful while building, and not copied to and from S3,
    so the check failed after a round-trip through S3.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/176/ for more details.
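The second commit's change can be sketched like this; `read_export_meta` is a hypothetical helper, not the actual function in swh/dataset/luigi.py. Since stamp files are not copied to and from S3, the export's own meta/export.json is used as the completeness check instead:

```python
import json
from pathlib import Path


def read_export_meta(export_path: str) -> dict:
    # Hypothetical sketch: meta/export.json travels with the export
    # through S3, unlike the build-time stamp files, so it serves as
    # the source of truth after a round-trip.
    meta_path = Path(export_path) / "meta" / "export.json"
    with meta_path.open() as f:
        return json.load(f)
```

A task would then treat a missing or unreadable meta/export.json as "export not complete" rather than checking for stamp files.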

This revision is now accepted and ready to land. Dec 19 2022, 5:04 PM

Build is green

Patch application report for D8965 (id=32318)

Could not rebase; Attempt merge onto d391394e53...

Updating d391394..28898bb
Fast-forward
 swh/dataset/luigi.py | 49 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 19 deletions(-)
Changes applied before test
commit 28898bbf017ddf50def798a0d2522e88a1c30019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types
    
    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
    set of object types was the same as the set of directories, which is wrong:
    
    * for the `edges` format, there is no origin_visit or origin_visit_status
      directory
    * for both `edges` and `orc` formats, this was missing relational tables.
    
    A possible fix would have been to use the `swh.dataset.relational.TABLES`
    constant and keep ignoring non-existing dirs in the `edges`, but I decided to
    simply list directories instead, as it will prevent future issues if we
    decide to add directories that do not match any table in Athena for
    whatever reason.

commit 4c432adf7b218ed57d2e1075ba7f63e48e6ab9ca
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files
    
    Stamp files are only useful while building, and not copied to and from S3,
    so the check failed after a round-trip through S3.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/179/ for more details.