Details
- Reviewers: jayeshv
- Group Reviewers: Reviewers
- Commits: rDDATASETa01a82fc755a: luigi.UploadExportToS3: Skip upload of already-uploaded files
Diff Detail
- Repository: rDDATASET Datasets
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D8966 (id=32303)
Could not rebase; Attempt merge onto c717f60fe0...
Updating c717f60..40394eb
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit 40394eb3c87fb999ed76267756f57b5ffa30e00d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100
luigi.UploadExportToS3: Skip upload of already-uploaded files
commit b957b58c4baa2bea2ca559979a1ce1d08ff4eb3d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100
luigi: Dynamically list directories instead of using object_types
Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
set of object types was the same as the set of directories, which is wrong:
* for the `edges` format, there is no origin_visit or origin_visit_status
directory
* for both `edges` and `orc` formats, this was missing relational tables.
A possible fix would have been to use the `swh.dataset.relational.TABLES`
constant and keep ignoring non-existing dirs in the `edges` format, but I
decided to simply list directories instead, as this prevents future issues
if we decide to add directories that do not match any table in Athena for
whatever reason.
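For illustration, a minimal sketch of the dynamic-listing approach this commit describes, assuming a local export directory; the helper name and the exclusion of `meta/` are assumptions, not code taken from the diff:

```python
from pathlib import Path


def list_export_directories(export_path: str) -> list[str]:
    """List the object directories actually present in an export,
    instead of deriving them from a fixed set of object types."""
    return sorted(
        entry.name
        for entry in Path(export_path).iterdir()
        # Assumption: meta/ holds export metadata, not object tables.
        if entry.is_dir() and entry.name != "meta"
    )
```

Listing what is actually on disk means an `edges` export simply never yields `origin_visit`, and any future extra directory is picked up without code changes.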
commit dab573172c21e67c273ab86122714e35da33465e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100
luigi: Read meta/export.json instead of relying on stamp files
Stamp files are only useful while building, and are not copied to and from S3,
so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/177/ for more details.
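For illustration, a completeness check along the lines this commit describes might read `meta/export.json` directly; the key consulted below is hypothetical, as the build report does not show the file's schema:

```python
import json
from pathlib import Path


def export_is_complete(export_path: str) -> bool:
    """Check completeness by reading meta/export.json, which (unlike
    stamp files) travels with the rest of the export through S3."""
    meta_file = Path(export_path) / "meta" / "export.json"
    if not meta_file.exists():
        return False
    meta = json.loads(meta_file.read_text())
    # Hypothetical key; the real schema of meta/export.json may differ.
    return bool(meta.get("object_types"))
```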
| swh/dataset/luigi.py | |
|---|---|
| 420 | This will do an s3 get on every object, correct? If this scenario (repeating objects) is rare, isn't it better to overwrite the existing object? |

| swh/dataset/luigi.py | |
|---|---|
| 420 | It does a HEAD, not a GET, so it only gets stats and not the whole object. Given that we have a couple hundred files of several gigabytes each, it's worth it even if we use it rarely. |
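As the reply above notes, the existence check issues a HEAD request rather than a GET, so only the object's metadata is fetched. A minimal sketch of such a check, assuming boto3 and illustrative bucket/key names (the actual diff may use a different S3 client):

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")


def already_uploaded(bucket: str, key: str) -> bool:
    """Return True if the object already exists in S3, using a HEAD
    request (head_object) so the object body is never transferred."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise


# Upload only files that are not already present (names illustrative):
# if not already_uploaded("softwareheritage", f"graph/{name}"):
#     s3.upload_file(local_path, "softwareheritage", f"graph/{name}")
```

For a few hundred multi-gigabyte files, a HEAD per object is a negligible cost compared with re-uploading even one of them.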
Build is green
Patch application report for D8966 (id=32319)
Could not rebase; Attempt merge onto d391394e53...
Updating d391394..a01a82f
Fast-forward
 swh/dataset/luigi.py | 62 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 22 deletions(-)
Changes applied before test
commit a01a82fc755a6a41c2789c31f9cd6dd8655e9951
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:39:58 2022 +0100
luigi.UploadExportToS3: Skip upload of already-uploaded files
commit 28898bbf017ddf50def798a0d2522e88a1c30019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:34:53 2022 +0100
luigi: Dynamically list directories instead of using object_types
Before this commit, UploadExportToS3 and DownloadExportFromS3 assumed the
set of object types was the same as the set of directories, which is wrong:
* for the `edges` format, there is no origin_visit or origin_visit_status
directory
* for both `edges` and `orc` formats, this was missing relational tables.
A possible fix would have been to use the `swh.dataset.relational.TABLES`
constant and keep ignoring non-existing dirs in the `edges` format, but I
decided to simply list directories instead, as this prevents future issues
if we decide to add directories that do not match any table in Athena for
whatever reason.
commit 4c432adf7b218ed57d2e1075ba7f63e48e6ab9ca
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Fri Dec 16 15:10:25 2022 +0100
luigi: Read meta/export.json instead of relying on stamp files
Stamp files are only useful while building, and are not copied to and from S3,
so the check failed after a round-trip through S3.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/180/ for more details.