Details
- Reviewers: jayeshv
- Group Reviewers: Reviewers
- Commits:
- rDDATASETa01a82fc755a: luigi.UploadExportToS3: Skip upload of already-uploaded files
Diff Detail
- Repository: rDDATASET Datasets
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 33310; Build 52210 (Phabricator diff pipeline on jenkins), Build 52209 (arc lint + arc unit)
Event Timeline
Build is green
Patch application report for D8966 (id=32303)
Could not rebase; Attempt merge onto c717f60fe0...
Updating c717f60..40394eb
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit 40394eb3c87fb999ed76267756f57b5ffa30e00d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100

    luigi.UploadExportToS3: Skip upload of already-uploaded files

commit b957b58c4baa2bea2ca559979a1ce1d08ff4eb3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types

    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed
    the set of object types was the same as the set of directories, which
    is wrong:

    * for the `edges` format, there is no origin_visit or
      origin_visit_status directory
    * for both `edges` and `orc` formats, this was missing relational
      tables.

    A possible fix would have been to use the
    `swh.dataset.relational.TABLES` constant and keep ignoring
    non-existing dirs in the `edges` format, but I decided to simply list
    directories instead, as it will prevent future issues if we decide to
    add directories that do not match any table in Athena for whatever
    reason.

commit dab573172c21e67c273ab86122714e35da33465e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files

    Stamp files are only useful while building, and not copied to and
    from S3, so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/177/ for more details.
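The second commit's approach (listing the export's directories dynamically rather than assuming they match the set of object types) can be sketched as below. This is a hedged illustration, not the actual swh.dataset code: `list_export_dirs` is a hypothetical helper, and skipping the `meta` directory (which holds `export.json`) is an assumption about the export layout.

```python
from pathlib import Path


def list_export_dirs(export_path: Path) -> list[str]:
    """Return the names of the data subdirectories actually present in
    an export, instead of a hardcoded list of object types.

    The ``meta`` directory (assumed to hold export.json) is skipped,
    since it is metadata rather than a data directory.
    """
    return sorted(
        p.name
        for p in export_path.iterdir()
        if p.is_dir() and p.name != "meta"
    )
```

Listing directories this way avoids the mismatch described in the commit message: `edges` exports lack some object-type directories, and both formats contain relational-table directories that are not object types.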
swh/dataset/luigi.py, line 420:
> This will do an s3 get on every object, correct? If this scenario (repeating objects) is rare, isn't it better to override the existing object?
swh/dataset/luigi.py, line 420:
> it does a HEAD, not a GET, so it only gets stats and not the whole object. Given that we have like a couple hundred files of several gigabytes each, it's worth it even if we use it rarely.
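The skip-if-already-uploaded check discussed here can be sketched as follows. This is a minimal illustration, not the actual swh.dataset implementation: `head` is a pluggable callable standing in for an S3 HEAD request (e.g. boto3's `head_object`, which returns metadata such as `ContentLength` without transferring the body), and comparing sizes is an assumed heuristic for "already uploaded".

```python
from typing import Callable, Optional


def should_upload(
    local_size: int,
    head: Callable[[str], Optional[int]],
    key: str,
) -> bool:
    """Decide whether a file needs uploading to S3.

    ``head`` stands in for an S3 HEAD request: it returns the remote
    object's size in bytes, or None if the key does not exist. A HEAD
    only fetches metadata, so this check is cheap compared to
    re-uploading multi-gigabyte files.
    """
    remote_size = head(key)
    if remote_size is None:
        return True  # never uploaded: upload it
    # Already present: re-upload only if the sizes disagree
    # (e.g. a previous upload was interrupted).
    return remote_size != local_size
```

With only a couple hundred large files per export, one HEAD per object is negligible next to the cost of re-uploading, which matches the reasoning in the comment above.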
Build is green
Patch application report for D8966 (id=32319)
Could not rebase; Attempt merge onto d391394e53...
Updating d391394..a01a82f
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit a01a82fc755a6a41c2789c31f9cd6dd8655e9951
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100

    luigi.UploadExportToS3: Skip upload of already-uploaded files

commit 28898bbf017ddf50def798a0d2522e88a1c30019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100

    luigi: Dynamically list directories instead of using object_types

    Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed
    the set of object types was the same as the set of directories, which
    is wrong:

    * for the `edges` format, there is no origin_visit or
      origin_visit_status directory
    * for both `edges` and `orc` formats, this was missing relational
      tables.

    A possible fix would have been to use the
    `swh.dataset.relational.TABLES` constant and keep ignoring
    non-existing dirs in the `edges` format, but I decided to simply list
    directories instead, as it will prevent future issues if we decide to
    add directories that do not match any table in Athena for whatever
    reason.

commit 4c432adf7b218ed57d2e1075ba7f63e48e6ab9ca
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100

    luigi: Read meta/export.json instead of relying on stamp files

    Stamp files are only useful while building, and not copied to and
    from S3, so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/180/ for more details.
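The third commit in the series replaces per-format stamp files (which are not round-tripped through S3) with a check against `meta/export.json`. A minimal sketch of that idea, assuming a hypothetical `export_formats` key in the JSON (the real export.json layout may differ):

```python
import json
from pathlib import Path


def export_is_complete(export_path: Path, fmt: str) -> bool:
    """Check export completeness by reading meta/export.json.

    Unlike stamp files, export.json travels with the export through
    S3 uploads and downloads, so this check survives a round-trip.
    The ``export_formats`` key is an assumed layout for illustration.
    """
    meta = export_path / "meta" / "export.json"
    if not meta.exists():
        return False  # no metadata: export not finished (or not started)
    data = json.loads(meta.read_text())
    return fmt in data.get("export_formats", [])
```

Because the metadata file is part of the export itself, a Luigi task's completeness check based on it gives the same answer before and after the export passes through S3, which is exactly the failure the commit fixes.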