Datasets · Folder
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

datasets maintained by Software Heritage

Recent Activity

Jun 3 2020

zack renamed T2431: Document how to export the graph edge dataset from Documentat how to export the graph edge dataset to Document how to export the graph edge dataset.
Jun 3 2020, 4:36 PM · Development documentation, Graph service, Datasets
seirl triaged T2431: Document how to export the graph edge dataset as Normal priority.
Jun 3 2020, 4:34 PM · Development documentation, Graph service, Datasets
zack changed the status of T1796: Datasets exported from Spark are missing some rows from Resolved to Wontfix.
Jun 3 2020, 4:20 PM · Datasets
seirl closed T1796: Datasets exported from Spark are missing some rows as Resolved.

We no longer export edges from Spark

Jun 3 2020, 4:14 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere, a subtask of T1848: refresh graph dataset export, as Resolved.
Jun 3 2020, 4:08 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere as Resolved.

We no longer export edges per file type.

Jun 3 2020, 4:08 PM · Datasets
seirl closed T1956: Integrate usage docs of the graph dataset in swh-docs as Resolved.
Jun 3 2020, 4:07 PM · Datasets

Apr 15 2020

seirl closed T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Resolved.
Apr 15 2020, 3:36 PM · Datasets
seirl added a comment to T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli'.

Temporary fix here until the branch that implements this entrypoint is merged: https://forge.softwareheritage.org/rDDATASETbe9e71ba1f858bbb8f44649306b919a1fa965ea2

Apr 15 2020, 3:36 PM · Datasets
zack triaged T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Normal priority.
Apr 15 2020, 1:32 PM · Datasets

Jan 23 2020

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We've now hit T2003 hard as the client caught up with the head of the local kafka cluster. That's why the curve is flattening out currently, as I stopped the replayers until the queue is implemented.

Jan 23 2020, 2:17 PM · Mirror, Datasets

Dec 7 2019

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We'll need to address T2003 before this can be closed (if we go the journal client route), so marking accordingly.

Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a subtask for T1914: Keep mirror of contents on S3 up to date: T2003: Content replayer may try to copy objects before they are available from an objstorage.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS during ingestion to Keep mirror of contents on S3 up to date.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

I don't think we're going to do this but rather use the journal client approach. (Even more so considering that writing to S3 takes 500 ms for each object, which sounds like a silly artificial limit to put on a synchronous process.)

Dec 7 2019, 6:32 PM · Mirror, Datasets
olasd merged task T1899: complete object storage mirror on AWS into T1954: Up-to-date objstorage mirror on S3.
Dec 7 2019, 6:30 PM · Mirror, Datasets

Nov 18 2019

zack raised the priority of T1848: refresh graph dataset export from Low to Normal.
Nov 18 2019, 2:50 PM · Datasets
zack lowered the priority of T1847: fully automate export of the graph dataset from High to Normal.
Nov 18 2019, 2:50 PM · Graph service, Datasets
zack added a project to T1847: fully automate export of the graph dataset: Graph service.
Nov 18 2019, 2:48 PM · Graph service, Datasets

Aug 19 2019

seirl triaged T1956: Integrate usage docs of the graph dataset in swh-docs as High priority.
Aug 19 2019, 6:19 PM · Datasets

Jul 14 2019

zack renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS to synchronously write content objects to AWS during ingestion.
Jul 14 2019, 4:48 PM · Mirror, Datasets
zack triaged T1914: Keep mirror of contents on S3 up to date as High priority.
Jul 14 2019, 4:47 PM · Mirror, Datasets

Jul 9 2019

zack triaged T1899: complete object storage mirror on AWS as Normal priority.
Jul 9 2019, 10:59 AM · Mirror, Datasets

Jun 30 2019

zack added a parent task for T1848: refresh graph dataset export: T1868: refresh compressed representation of the archive.
Jun 30 2019, 1:58 PM · Datasets

Jun 23 2019

zack added a subtask for T1848: refresh graph dataset export: T1741: graph dataset: update to use persistent identifiers everywhere.
Jun 23 2019, 10:23 PM · Datasets
zack added a parent task for T1741: graph dataset: update to use persistent identifiers everywhere: T1848: refresh graph dataset export.
Jun 23 2019, 10:23 PM · Datasets
zack triaged T1848: refresh graph dataset export as Low priority.
Jun 23 2019, 10:22 PM · Datasets
zack added a parent task for T1847: fully automate export of the graph dataset: T1848: refresh graph dataset export.
Jun 23 2019, 10:22 PM · Graph service, Datasets
zack added a subtask for T1848: refresh graph dataset export: T1847: fully automate export of the graph dataset.
Jun 23 2019, 10:22 PM · Datasets
zack created T1848: refresh graph dataset export.
Jun 23 2019, 10:21 PM · Datasets
zack triaged T1847: fully automate export of the graph dataset as High priority.
Jun 23 2019, 10:20 PM · Graph service, Datasets
zack created T1847: fully automate export of the graph dataset.
Jun 23 2019, 10:20 PM · Graph service, Datasets

Jun 11 2019

seirl triaged T1796: Datasets exported from Spark are missing some rows as Normal priority.
Jun 11 2019, 11:52 PM · Datasets

Jun 5 2019

zack claimed T1742: graph dataset: uniform file names.
Jun 5 2019, 10:07 AM · Datasets
zack closed T1742: graph dataset: uniform file names as Resolved.
Jun 5 2019, 10:07 AM · Datasets

Jun 4 2019

zack closed T1783: edge dataset: re-export rev→rev edges in the right order as Resolved.
Jun 4 2019, 10:33 PM · Datasets
zack triaged T1783: edge dataset: re-export rev→rev edges in the right order as High priority.
Jun 4 2019, 2:33 PM · Datasets

May 23 2019

zack added a project to T1741: graph dataset: update to use persistent identifiers everywhere: Datasets.
May 23 2019, 2:37 PM · Datasets
zack added a project to T1742: graph dataset: uniform file names: Datasets.
May 23 2019, 2:37 PM · Datasets
zack added a comment to T1743: create a nice landing web page for exported dataset.

A nice piece of related work here is the LAW datasets.

May 23 2019, 2:37 PM · Datasets
zack triaged T1743: create a nice landing web page for exported dataset as Low priority.
May 23 2019, 2:36 PM · Datasets
zack created Datasets.
May 23 2019, 2:29 PM