
Datasets
Active · Public

Details

Description

Datasets maintained by Software Heritage.

Recent Activity

Jan 8 2023

gitlab-migration closed T4551: document the license dataset on docs.s.o as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:24 PM · Documentation, Datasets
gitlab-migration closed T4550: dataset: document the AWS S3 bucket for content objects as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:24 PM · Documentation, Datasets
gitlab-migration changed the status of T4676: Add Luigi workflow in swh-dataset from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:04 PM · Datasets, Compressed graph service
gitlab-migration changed the status of T4676: Add Luigi workflow in swh-dataset, a subtask of T4677: Add support for generating subdatasets in swh.dataset.luigi, from Resolved to Migrated.
Jan 8 2023, 10:04 PM · Datasets
gitlab-migration changed the status of T3260: publish swh.dataset to pypi from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:02 PM · Continuous Integration, Datasets
gitlab-migration changed the status of T3178: document how to export the graph dataset automatically, a subtask of T1847: fully automate export of the graph dataset, from Invalid to Migrated.
Jan 8 2023, 10:02 PM · Compressed graph service, Datasets
gitlab-migration changed the status of T3178: document how to export the graph dataset automatically from Invalid to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:02 PM · Documentation, Datasets
gitlab-migration changed the status of T3021: Investigate why reading the journal of the content table takes so long from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:01 PM · Journal, Datasets
gitlab-migration changed the status of T2431: Document how to export the graph edge dataset from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 10:00 PM · Documentation, Compressed graph service, Datasets
gitlab-migration changed the status of T2431: Document how to export the graph edge dataset, a subtask of T1847: fully automate export of the graph dataset, from Resolved to Migrated.
Jan 8 2023, 10:00 PM · Compressed graph service, Datasets
gitlab-migration changed the status of T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, from Resolved to Migrated.
Jan 8 2023, 9:59 PM · Datasets
gitlab-migration changed the status of T1847: fully automate export of the graph dataset from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 9:59 PM · Compressed graph service, Datasets
gitlab-migration closed T4747: Extract sample of .c files along with their most popular file name as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:06 PM · Datasets
gitlab-migration closed T4729: collaboration graph: drop pseudo-SWHIDs and add mapping ori<->url, a subtask of T4695: Provide a collaboration graph / dataset, as Migrated.
Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4729: collaboration graph: drop pseudo-SWHIDs and add mapping ori<->url as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4714: Write Luigi tasks to generate the citation dataset, a subtask of T4713: Generate the citation dataset, as Migrated.
Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4714: Write Luigi tasks to generate the citation dataset as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4713: Generate the citation dataset as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4713: Generate the citation dataset, a subtask of T4712: Write Luigi tasks to regenerate the license dataset, as Migrated.
Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4712: Write Luigi tasks to regenerate the license dataset as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4695: Provide a collaboration graph / dataset as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4685: license dataset: add logic to convert/import dataset into a SQL database as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4683: license dataset: use a consistent file format for CSV-like files, a subtask of T4685: license dataset: add logic to convert/import dataset into a SQL database, as Migrated.
Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4683: license dataset: use a consistent file format for CSV-like files as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T4677: Add support for generating subdatasets in swh.dataset.luigi as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:05 PM · Datasets
gitlab-migration closed T3885: Filter rows of size >32MB from dataset export as Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 5:03 PM · Datasets
gitlab-migration changed the status of T4682: license dataset: missing java stuff from the replication package from Wontfix to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:38 PM · Datasets
gitlab-migration changed the status of T4586: max_matching_nodes is applied before filtering for node type, a subtask of T4469: update license blob dataset to match-ish latest compress graph, from Resolved to Migrated.
Jan 8 2023, 4:37 PM · Datasets
gitlab-migration changed the status of T4469: update license blob dataset to match-ish latest compress graph from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:37 PM · Datasets
gitlab-migration changed the status of T3626: graph API: add ?limit parameter to /leaves endpoint, a subtask of T4469: update license blob dataset to match-ish latest compress graph, from Resolved to Migrated.
Jan 8 2023, 4:35 PM · Datasets
gitlab-migration changed the status of T3329: document ORC format dataset availability from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:34 PM · Datasets
gitlab-migration changed the status of T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:30 PM · Datasets
gitlab-migration changed the status of T2003: Content replayer may try to copy objects before they are available from an objstorage, a subtask of T1914: Keep mirror of contents on S3 up to date, from Resolved to Migrated.
Jan 8 2023, 4:28 PM · Mirror, Datasets
gitlab-migration changed the status of T1956: Integrate usage docs of the graph dataset in swh-docs from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:28 PM · Datasets
gitlab-migration changed the status of T1914: Keep mirror of contents on S3 up to date, a subtask of T1899: complete object storage mirror on AWS, from Duplicate to Migrated.
Jan 8 2023, 4:28 PM · Mirror, Datasets
gitlab-migration changed the status of T1914: Keep mirror of contents on S3 up to date from Duplicate to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:28 PM · Mirror, Datasets
gitlab-migration changed the status of T1899: complete object storage mirror on AWS from Duplicate to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Mirror, Datasets
gitlab-migration changed the status of T1848: refresh graph dataset export from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1796: Datasets exported from Spark are missing some rows from Wontfix to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1783: edge dataset: re-export rev→rev edges in the right order from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1741: graph dataset: update to use persistent identifiers everywhere, a subtask of T1848: refresh graph dataset export, from Resolved to Migrated.
Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1743: create a nice landing web page for exported dataset from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1742: graph dataset: uniform file names from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets
gitlab-migration changed the status of T1741: graph dataset: update to use persistent identifiers everywhere from Resolved to Migrated.

This task has been migrated to GitLab.

Jan 8 2023, 4:27 PM · Datasets

Jan 6 2023

vlorentz added a project to T4747: Extract sample of .c files along with their most popular file name: Datasets.
Jan 6 2023, 12:14 PM · Datasets

Dec 22 2022

vlorentz closed T4682: license dataset: missing java stuff from the replication package as Wontfix.

Future versions will be generated using only code in swh-graph (bash glue code replaced by Python code, some of which shells out to bash for simplicity), so the replication package will simply be replaced by a swh-graph tag.

Dec 22 2022, 2:52 PM · Datasets
vlorentz added revisions to T4695: Provide a collaboration graph / dataset: D8970: origin_contributors: Use origin IDs instead of SWHIDs, D8971: origin_contributors: Write table mapping origin ID to origin URL (base64-encoded), D8972: origin_contributors: Rename 'person' to 'contributor' in outputs.
Dec 22 2022, 1:57 PM · Datasets
vlorentz added a comment to T4695: Provide a collaboration graph / dataset.

TODO: the deanonymized dataset should be just a <contributor_id, contributor_base64, contributor_escaped> table, rather than repeating the origin<->contributor mapping.

Dec 22 2022, 1:57 PM · Datasets
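
The table layout proposed in the comment above can be illustrated with a minimal sketch. This is not the swh-graph implementation: the function name `contributor_table`, the way IDs are assigned (position in sorted order), and the reading of `contributor_escaped` as a backslash-escaped text rendering of the raw name bytes are all assumptions made for illustration only.

```python
import base64


def contributor_table(names):
    """Return one (contributor_id, contributor_base64, contributor_escaped)
    row per distinct contributor name.

    `names` is an iterable of raw name bytes. The ID scheme and the meaning
    of the "escaped" column are hypothetical, not taken from swh-graph.
    """
    rows = []
    # Deduplicate so each contributor appears once, instead of once per
    # origin<->contributor pair.
    for contributor_id, name in enumerate(sorted(set(names))):
        # Lossless encoding of the raw bytes.
        name_b64 = base64.b64encode(name).decode("ascii")
        # Readable form: bytes that are not valid UTF-8 become \xNN escapes.
        name_escaped = name.decode("utf-8", errors="backslashreplace")
        rows.append((contributor_id, name_b64, name_escaped))
    return rows


rows = contributor_table([b"Ada", b"Bob", b"Ada"])
```

Keeping the names in a separate table keyed by ID means the origin<->contributor mapping only needs to store IDs, which is the point the comment makes.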

Dec 21 2022

vlorentz added a comment to T4683: license dataset: use a consistent file format for CSV-like files.

blobs-fileinfo.csv.zst: (no changes needed)

Dec 21 2022, 1:50 PM · Datasets

Dec 19 2022

vlorentz added a revision to T4729: collaboration graph: drop pseudo-SWHIDs and add mapping ori<->url: D8971: origin_contributors: Write table mapping origin ID to origin URL (base64-encoded).
Dec 19 2022, 5:55 PM · Datasets