This is a production-ready version of scripts I've had for years to
generate subdatasets of the SWH graph dataset from a list of SWHIDs to include.
It uploads the list to S3, then JOINs the list with the main dataset using
Amazon Athena.
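As a rough illustration of the JOIN step only (not the code in this diff): the sketch below uses an in-memory SQLite database as a local stand-in for Athena, and skips the S3 upload entirely. The table names `base` and `swhid_list`, the `payload` column, and the helper `join_subdataset` are all hypothetical. It assumes the result is the intersection of the base dataset with the supplied list, so SWHIDs absent from the base dataset simply drop out.

```python
import sqlite3


def join_subdataset(base_rows, swhids):
    """Return the rows of base_rows whose SWHID appears in swhids.

    Hypothetical helper: SQLite stands in for Athena, and the two
    tables below stand in for the main dataset and the uploaded list.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE base (swhid TEXT PRIMARY KEY, payload TEXT)")
        conn.execute("CREATE TABLE swhid_list (swhid TEXT)")
        conn.executemany("INSERT INTO base VALUES (?, ?)", base_rows)
        # Deduplicate the list so a repeated SWHID does not duplicate output rows.
        conn.executemany(
            "INSERT INTO swhid_list VALUES (?)", [(s,) for s in set(swhids)]
        )
        cur = conn.execute(
            "SELECT base.swhid, base.payload FROM base "
            "INNER JOIN swhid_list ON base.swhid = swhid_list.swhid "
            "ORDER BY base.swhid"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    base = [
        ("swh:1:rev:" + "a" * 40, "revision A"),
        ("swh:1:rev:" + "b" * 40, "revision B"),
    ]
    wanted = ["swh:1:rev:" + "b" * 40, "swh:1:rev:" + "c" * 40]
    for row in join_subdataset(base, wanted):
        print(row)
```

An inner join is used here because only objects present on both sides should end up in the subdataset; the third SWHID in `wanted` is not in `base`, so it yields no row.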
Details
- Reviewers: vlorentz
- Group Reviewers: Reviewers
- Commits: rDDATASET027235d6d46d: Add a command to generate a subdataset from a list of SWHIDs using S3
Tested manually. This is very hard to unit test because it is highly AWS-specific.
Diff Detail
- Repository: rDDATASET Datasets
- Branch: master
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 26257
  - Build 41051: Phabricator diff pipeline on jenkins
  - Build 41050: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D7010 (id=25414)
Rebasing onto ab2ebfadcf...
Current branch diff-target is up to date.
Changes applied before test
commit 538aa2f919297ac11cd82a442615aa829c571b6f
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date: Fri Jan 21 14:25:07 2022 +0100
    Add a command to generate a subdataset from a list of SWHIDs using S3
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/22/ for more details.
Inline comment on swh/dataset/cli.py, lines 202–205:
It's literally "the SWHIDs to include in the subdataset". What's computed is the intersection between the base dataset and the SWHIDs contained in the file. I will try to expand the description.
Build is green
Patch application report for D7010 (id=25472)
Rebasing onto ab2ebfadcf...
Current branch diff-target is up to date.
Changes applied before test
commit 8129a5b90ef803fb335d21c07993e8ee76ffb7ed
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date: Fri Jan 21 14:25:07 2022 +0100
    Add a command to generate a subdataset from a list of SWHIDs using S3
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/23/ for more details.
Build is green
Patch application report for D7010 (id=25474)
Rebasing onto ab2ebfadcf...
Current branch diff-target is up to date.
Changes applied before test
commit c515bc745f8d41f67a84f23520279c5bbae8d6db
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date: Fri Jan 21 14:25:07 2022 +0100
    Add a command to generate a subdataset from a list of SWHIDs using S3
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/24/ for more details.
Build is green
Patch application report for D7010 (id=25517)
Rebasing onto 027235d6d4...
First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-25-D7010.
Changes applied before test
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/25/ for more details.