Datasets
Active · Public

Details

Description

Datasets maintained by Software Heritage.

Recent Activity

Tue, Dec 6

vlorentz added revisions to T4676: Add Luigi workflow in swh-dataset: D8919: Add CLI script to generate Luigi config and call it, D8924: exporters/orc: Fix crash on visit status with no type, D8925: luigi.CreateAthena: Fix validation of DB name, D8926: luigi.RunExportAll: Default to exporting all formats.
Tue, Dec 6, 2:37 PM · Datasets, Compressed graph service

Mon, Dec 5

vlorentz triaged T4714: Write Luigi tasks to generate the citation dataset as Normal priority.
Mon, Dec 5, 10:51 AM · Datasets
vlorentz triaged T4713: Generate the citation dataset as Normal priority.
Mon, Dec 5, 10:51 AM · Datasets
vlorentz updated the task description for T4712: Write Luigi tasks to regenerate the license dataset.
Mon, Dec 5, 10:50 AM · Datasets
vlorentz triaged T4712: Write Luigi tasks to regenerate the license dataset as Low priority.
Mon, Dec 5, 10:50 AM · Datasets

Thu, Dec 1

vlorentz added revisions to T4695: Provide a collaboration graph / dataset: D8908: Add ListOriginContributors, D8910: Regenerate the test dataset to include a release with no author, D8912: ListOriginContributors: Ignore null author/committer in revisions/releases.
Thu, Dec 1, 4:15 PM · Datasets

Thu, Nov 24

vlorentz added a revision to T4695: Provide a collaboration graph / dataset: D8883: Add a script to generate a topological sort.
Thu, Nov 24, 4:20 PM · Datasets

Mon, Nov 21

vlorentz triaged T4695: Provide a collaboration graph / dataset as Normal priority.
Mon, Nov 21, 12:13 PM · Datasets

Mon, Nov 14

zack added a parent task for T4683: license dataset: use a consistent file format for CSV-like files: T4685: license dataset: add logic to convert/import dataset into a SQL database.
Mon, Nov 14, 4:50 PM · Datasets
zack added a subtask for T4685: license dataset: add logic to convert/import dataset into a SQL database: T4683: license dataset: use a consistent file format for CSV-like files.
Mon, Nov 14, 4:50 PM · Datasets
zack triaged T4685: license dataset: add logic to convert/import dataset into a SQL database as Low priority.
Mon, Nov 14, 4:49 PM · Datasets
zack changed the edit policy for P1529 import the license dataset into sqlite.
Mon, Nov 14, 4:47 PM · Datasets
zack created P1529 import the license dataset into sqlite.
Mon, Nov 14, 4:47 PM · Datasets
zack added a project to T4683: license dataset: use a consistent file format for CSV-like files: Datasets.
Mon, Nov 14, 3:09 PM · Datasets
vlorentz added a comment to T4682: license dataset: missing java stuff from the replication package.

The replication/05-earliest-revision.sh script in the replication package mentions the swh-graph version it uses and the fully qualified class name, so the class can be found in the swh-graph code.

Mon, Nov 14, 3:08 PM · Datasets
zack triaged T4682: license dataset: missing java stuff from the replication package as Low priority.
Mon, Nov 14, 2:45 PM · Datasets

Thu, Nov 10

vlorentz added revisions to T4676: Add Luigi workflow in swh-dataset: D8827: athena: Fix create_table to work with restricted permissions, D8828: cli: Move the main code of export_graph to its own function, D8829: Add luigi tasks.
Thu, Nov 10, 10:42 AM · Datasets, Compressed graph service
vlorentz added a parent task for T4676: Add Luigi workflow in swh-dataset: T4677: Add support for generating subdatasets in swh.dataset.luigi.
Thu, Nov 10, 10:42 AM · Datasets, Compressed graph service
vlorentz added a subtask for T4677: Add support for generating subdatasets in swh.dataset.luigi: T4676: Add Luigi workflow in swh-dataset.
Thu, Nov 10, 10:42 AM · Datasets
vlorentz triaged T4677: Add support for generating subdatasets in swh.dataset.luigi as Normal priority.
Thu, Nov 10, 10:42 AM · Datasets
vlorentz triaged T4676: Add Luigi workflow in swh-dataset as High priority.
Thu, Nov 10, 10:41 AM · Datasets, Compressed graph service

Nov 7 2022

vlorentz closed T4469: update license blob dataset to match-ish latest compress graph as Resolved.

It's now available at https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/

Nov 7 2022, 10:34 AM · Datasets

Oct 19 2022

gitlab-migration changed the status of T4507: Out of memory on granet, a subtask of T4469: update license blob dataset to match-ish latest compress graph, from Resolved to Migrated.
Oct 19 2022, 6:08 PM · Datasets

Oct 11 2022

vlorentz closed T4507: Out of memory on granet, a subtask of T4469: update license blob dataset to match-ish latest compress graph, as Resolved.
Oct 11 2022, 11:45 AM · Datasets

Oct 3 2022

vlorentz closed T4586: max_matching_nodes is applied before filtering for node type, a subtask of T4469: update license blob dataset to match-ish latest compress graph, as Resolved.
Oct 3 2022, 9:56 AM · Datasets

Sep 29 2022

vlorentz added a subtask for T4469: update license blob dataset to match-ish latest compress graph: T4507: Out of memory on granet.
Sep 29 2022, 3:08 PM · Datasets
vlorentz removed a subtask for T4469: update license blob dataset to match-ish latest compress graph: T4522: graph gRPC API: Add support for limiting traversals by number of results.
Sep 29 2022, 3:07 PM · Datasets
vlorentz added subtasks for T4469: update license blob dataset to match-ish latest compress graph: T4586: max_matching_nodes is applied before filtering for node type, T4522: graph gRPC API: Add support for limiting traversals by number of results, T3626: graph API: add ?limit parameter to /leaves endpoint.
Sep 29 2022, 3:06 PM · Datasets

Sep 23 2022

zack renamed T4551: document the license dataset on docs.s.o from "document the license dataset" to "document the license dataset on docs.s.o".
Sep 23 2022, 4:38 PM · Documentation, Datasets
zack triaged T4551: document the license dataset on docs.s.o as Normal priority.
Sep 23 2022, 4:38 PM · Documentation, Datasets
zack triaged T4550: dataset: document the AWS S3 bucket for content objects as Normal priority.
Sep 23 2022, 4:27 PM · Documentation, Datasets

Aug 29 2022

zack triaged T4469: update license blob dataset to match-ish latest compress graph as Normal priority.
Aug 29 2022, 11:46 AM · Datasets

May 1 2022

seirl closed T1848: refresh graph dataset export as Resolved.

Now that there is both a columnar+compressed graph from 2021 and a columnar graph from 2022 that is pending compression, this task about "refreshing the export from January 2019" is resolved.

May 1 2022, 12:08 PM · Datasets

Apr 29 2022

seirl changed the status of T1848: refresh graph dataset export from Open to Work in Progress.
Apr 29 2022, 6:23 PM · Datasets
seirl moved T1847: fully automate export of the graph dataset from Backlog to Deployed on the Compressed graph service board.
Apr 29 2022, 6:22 PM · Compressed graph service, Datasets
seirl moved T2431: Document how to export the graph edge dataset from Backlog to Deployed on the Compressed graph service board.
Apr 29 2022, 6:22 PM · Documentation, Compressed graph service, Datasets
seirl closed T3021: Investigate why reading the journal of the content table takes so long as Resolved.

Fixed in D7718

Apr 29 2022, 6:20 PM · Journal, Datasets
seirl closed T2431: Document how to export the graph edge dataset, a subtask of T1847: fully automate export of the graph dataset, as Resolved.
Apr 29 2022, 6:15 PM · Compressed graph service, Datasets
seirl closed T2431: Document how to export the graph edge dataset as Resolved.

Done here: D7693 and here: D7711

Apr 29 2022, 6:15 PM · Documentation, Compressed graph service, Datasets
seirl closed T1743: create a nice landing web page for exported dataset as Resolved.
Apr 29 2022, 6:14 PM · Datasets
seirl added a comment to T1743: create a nice landing web page for exported dataset.

Done, this page https://annex.softwareheritage.org/public/dataset/graph/ now contains a link to the detailed list of datasets: https://forge.softwareheritage.org/D7487

Apr 29 2022, 6:14 PM · Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Apr 29 2022, 5:57 PM · Datasets
seirl closed T1847: fully automate export of the graph dataset as Resolved.

Done!

Apr 29 2022, 5:57 PM · Compressed graph service, Datasets
seirl closed T3329: document ORC format dataset availability as Resolved.

Fixed in D7487

Apr 29 2022, 5:56 PM · Datasets

Apr 27 2022

seirl added a comment to T3021: Investigate why reading the journal of the content table takes so long.

Apr 27 2022, 2:58 PM · Journal, Datasets
seirl reopened T3021: Investigate why reading the journal of the content table takes so long as "Open".
Apr 27 2022, 2:57 PM · Journal, Datasets
seirl closed T3021: Investigate why reading the journal of the content table takes so long as Resolved.

This no longer happens with a more recent stack.

Apr 27 2022, 10:12 AM · Journal, Datasets

Apr 5 2022

zack changed the status of T1743: create a nice landing web page for exported dataset from Open to Work in Progress.
Apr 5 2022, 1:39 PM · Datasets
zack changed the status of T3329: document ORC format dataset availability from Open to Work in Progress.
Apr 5 2022, 1:38 PM · Datasets

Mar 30 2022

zack added a member for Datasets: seirl.
Mar 30 2022, 1:42 PM