
Add a command to generate a subdataset from a list of SWHIDs using S3
ClosedPublic

Authored by seirl on Jan 21 2022, 2:28 PM.

Details

Summary

This is a production-ready version of scripts I've had for years to
generate subdatasets of the SWH graph dataset from a list of SWHIDs to include.
It uploads the list to S3, then JOINs the list with the main dataset using
Amazon Athena.
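
For illustration only, here is a minimal sketch of the kind of Athena SQL such a JOIN could use once the list has been uploaded to S3. It is not the code from this diff: the function names, the `swhids` table name, the assumption that the base tables expose a hex `id` column, and the bucket paths are all hypothetical. Submitting the statements (for example with boto3's Athena client and `start_query_execution`) is left out so the sketch runs without AWS access.

```python
def swhid_list_ddl(sub_db: str, list_location: str) -> str:
    """Build DDL for an external table over the uploaded list.

    Assumes (hypothetically) one SWHID per line in text files under
    `list_location`, exposed as a single string column named `swhid`.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {sub_db}.swhids (swhid string)\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{list_location}'"
    )


def subdataset_ctas(
    base_db: str,
    sub_db: str,
    table: str,
    swhid_prefix: str,
    output_location: str,
) -> str:
    """Build a CREATE TABLE AS SELECT that keeps only the listed rows.

    Assumes (hypothetically) that `base_db.table` has a hex string column
    `id`, so that `swhid_prefix || id` is the row's SWHID.
    """
    return (
        f"CREATE TABLE {sub_db}.{table}\n"
        f"WITH (format = 'ORC', "
        f"external_location = '{output_location}/{table}/')\n"
        f"AS SELECT t.* FROM {base_db}.{table} AS t\n"
        f"JOIN {sub_db}.swhids AS s "
        f"ON s.swhid = concat('{swhid_prefix}', t.id)"
    )


if __name__ == "__main__":
    # Hypothetical database names and bucket paths.
    print(swhid_list_ddl("subdb", "s3://example-bucket/swhids/"))
    print(subdataset_ctas(
        "basedb", "subdb", "revision", "swh:1:rev:", "s3://example-bucket/out"
    ))
```

Generating the SQL as plain strings keeps the AWS-specific part (upload and query submission) separate from the part that can be checked locally.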

Test Plan

Tested manually. This is very hard to unit test because it is highly AWS-specific.

Diff Detail

Repository
rDDATASET Datasets
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26345
Build 41193: Phabricator diff pipeline on Jenkins
Build 41192: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7010 (id=25414)

Rebasing onto ab2ebfadcf...

Current branch diff-target is up to date.
Changes applied before test
commit 538aa2f919297ac11cd82a442615aa829c571b6f
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Fri Jan 21 14:25:07 2022 +0100

    Add a command to generate a subdataset from a list of SWHIDs using S3

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/22/ for more details.

seirl requested review of this revision on Jan 21 2022, 2:29 PM.
vlorentz added inline comments.
swh/dataset/cli.py
188
201–204

Please define "subdataset" better. Is it the transitive closure of the given SWHIDs, or just the subgraph induced by them?

swh/dataset/cli.py
201–204

It's literally "the SWHIDs to include in the subdataset". What's computed is the intersection between the base dataset and the SWHIDs contained in the file. I will try to expand the description more.
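
To make the distinction concrete, here is a toy sketch (not taken from the diff) contrasting the intersection with a transitive closure. The node labels and the tiny graph are hypothetical placeholders, not real SWHIDs.

```python
from typing import Dict, List, Set


def intersect(dataset_nodes: Set[str], requested: Set[str]) -> Set[str]:
    """Keep only the requested identifiers that exist in the dataset."""
    return dataset_nodes & requested


def transitive_closure(edges: Dict[str, List[str]], roots: Set[str]) -> Set[str]:
    """Collect every node reachable from the roots (depth-first traversal)."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return seen


# Hypothetical toy graph: a revision pointing to a directory pointing to a content.
EDGES = {"rev:A": ["dir:B"], "dir:B": ["cnt:C"], "cnt:C": []}
DATASET_NODES = set(EDGES)

# "rev:Z" is requested but absent from the dataset, so the intersection drops it.
selected = intersect(DATASET_NODES, {"rev:A", "rev:Z"})

# A closure would additionally pull in everything reachable from "rev:A".
closure = transitive_closure(EDGES, {"rev:A"})
```

With the intersection, objects referenced by a listed SWHID are included only if they are listed themselves.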

Expand command description

database -> database to create

Build is green

Patch application report for D7010 (id=25472)

Rebasing onto ab2ebfadcf...

Current branch diff-target is up to date.
Changes applied before test
commit 8129a5b90ef803fb335d21c07993e8ee76ffb7ed
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Fri Jan 21 14:25:07 2022 +0100

    Add a command to generate a subdataset from a list of SWHIDs using S3

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/23/ for more details.

Build is green

Patch application report for D7010 (id=25474)

Rebasing onto ab2ebfadcf...

Current branch diff-target is up to date.
Changes applied before test
commit c515bc745f8d41f67a84f23520279c5bbae8d6db
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Fri Jan 21 14:25:07 2022 +0100

    Add a command to generate a subdataset from a list of SWHIDs using S3

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/24/ for more details.

This revision is now accepted and ready to land. Jan 24 2022, 5:51 PM

Add special case for revisions, compress with ZST

This revision was landed with ongoing or failed builds. Jan 25 2022, 7:48 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7010 (id=25517)

Rebasing onto 027235d6d4...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-25-D7010.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/25/ for more details.