Datasets
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

datasets maintained by Software Heritage

Recent Activity

May 18 2021

zack updated the task description for T3329: document ORC format dataset availability.
May 18 2021, 9:33 AM · Datasets
zack triaged T3329: document ORC format dataset availability as High priority.
May 18 2021, 9:32 AM · Datasets

Apr 19 2021

olasd closed T2003: Content replayer may try to copy objects before they are available from an objstorage, a subtask of T1914: Keep mirror of contents on S3 up to date, as Resolved.
Apr 19 2021, 12:06 PM · Mirror, Datasets

Apr 17 2021

ardumont added a comment to T3260: publish swh.dataset to pypi.

This should help:

12:08 <+ardumont> fwiw, i don't see swh-dataset in the jenkins ci declaration so that won't get published
12:08 <+ardumont> https://forge.softwareheritage.org/source/swh-jenkins-jobs/browse/master/jobs/swh-packages.yaml
12:10 <+ardumont> (relatedly without ^ that won't show up in jenkins)
12:10 <+ardumont> related docs https://wiki.softwareheritage.org/wiki/Debian_packaging#Setting_up_the_repository_on_Phabricator
Apr 17 2021, 3:23 PM · Continuous Integration, Datasets
zack triaged T3260: publish swh.dataset to pypi as Low priority.
Apr 17 2021, 12:31 PM · Continuous Integration, Datasets

Apr 7 2021

seirl closed T3178: document how to export the graph dataset automatically, a subtask of T1847: fully automate export of the graph dataset, as Invalid.
Apr 7 2021, 3:03 PM · Graph service, Datasets
seirl closed T3178: document how to export the graph dataset automatically as Invalid.

Duplicate of T2431

Apr 7 2021, 3:03 PM · Documentation, Datasets
seirl added a subtask for T1847: fully automate export of the graph dataset: T2431: Document how to export the graph edge dataset.
Apr 7 2021, 3:03 PM · Graph service, Datasets
seirl added a parent task for T2431: Document how to export the graph edge dataset: T1847: fully automate export of the graph dataset.
Apr 7 2021, 3:03 PM · Documentation, Graph service, Datasets

Mar 26 2021

zack triaged T3178: document how to export the graph dataset automatically as Normal priority.
Mar 26 2021, 12:25 PM · Documentation, Datasets
zack reopened T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Open.
Mar 26 2021, 12:25 PM · Datasets
zack reopened T1847: fully automate export of the graph dataset as "Open".

reopening, as ideally we'd like to have run the entire ORC export once to completion before closing

Mar 26 2021, 12:25 PM · Graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset as Resolved.

The ORC exporter is done, and it's likely that we won't provide CSV exports in the future, or we'll generate them from the ORC format.

Mar 26 2021, 12:04 PM · Graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Mar 26 2021, 12:04 PM · Datasets

Mar 8 2021

rdicosmo added a parent task for T1743: create a nice landing web page for exported dataset: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:49 AM · Datasets
rdicosmo added a parent task for T1848: refresh graph dataset export: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:45 AM · Datasets

Mar 4 2021

rdicosmo merged task T1914: Keep mirror of contents on S3 up to date into T1954: Up-to-date objstorage mirror on S3.
Mar 4 2021, 5:44 PM · Mirror, Datasets

Feb 2 2021

seirl triaged T3021: Investigate why reading the journal of the content table takes so long as Normal priority.
Feb 2 2021, 2:00 PM · Journal, Datasets

Sep 22 2020

moranegg moved T2431: Document how to export the graph edge dataset from Backlog to archive-users (docs/user-guides/) on the Documentation board.
Sep 22 2020, 2:37 PM · Documentation, Graph service, Datasets

Sep 17 2020

zack changed the status of T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Datasets
zack changed the status of T1847: fully automate export of the graph dataset from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Graph service, Datasets

Sep 16 2020

seirl added a comment to T1847: fully automate export of the graph dataset.

No, only the edge part is done, we still need a parquet and a CSV exporter :/

Sep 16 2020, 10:59 PM · Graph service, Datasets
zack removed a parent task for T1848: refresh graph dataset export: T1868: refresh compressed representation of the archive.
Sep 16 2020, 8:43 PM · Datasets
zack added a comment to T1847: fully automate export of the graph dataset.

I think this is (reasonably) done now, please check and close it.

Sep 16 2020, 8:43 PM · Graph service, Datasets
zack raised the priority of T1848: refresh graph dataset export from Normal to High.
Sep 16 2020, 8:42 PM · Datasets
zack added a comment to T1848: refresh graph dataset export.
Sep 16 2020, 8:42 PM · Datasets

Sep 4 2020

ardumont added projects to T2564: migrate existing revisions metadata extra_headers to actual extra_headers field: Storage manager, Datasets.
Sep 4 2020, 11:30 AM · System administration, Storage manager

Jun 3 2020

zack renamed T2431: Document how to export the graph edge dataset from Documentat how to export the graph edge dataset to Document how to export the graph edge dataset.
Jun 3 2020, 4:36 PM · Documentation, Graph service, Datasets
seirl triaged T2431: Document how to export the graph edge dataset as Normal priority.
Jun 3 2020, 4:34 PM · Documentation, Graph service, Datasets
zack changed the status of T1796: Datasets exported from Spark are missing some rows from Resolved to Wontfix.
Jun 3 2020, 4:20 PM · Datasets
seirl closed T1796: Datasets exported from Spark are missing some rows as Resolved.

We no longer export edges from Spark

Jun 3 2020, 4:14 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere, a subtask of T1848: refresh graph dataset export, as Resolved.
Jun 3 2020, 4:08 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere as Resolved.

We no longer export edges per file type.

Jun 3 2020, 4:08 PM · Datasets
seirl closed T1956: Integrate usage docs of the graph dataset in swh-docs as Resolved.
Jun 3 2020, 4:07 PM · Datasets

Apr 15 2020

seirl closed T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Resolved.
Apr 15 2020, 3:36 PM · Datasets
seirl added a comment to T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli'.

Temporary fix here until the branch that implements this entrypoint is merged: https://forge.softwareheritage.org/rDDATASETbe9e71ba1f858bbb8f44649306b919a1fa965ea2

Apr 15 2020, 3:36 PM · Datasets
zack triaged T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Normal priority.
Apr 15 2020, 1:32 PM · Datasets

Jan 23 2020

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We've now hit T2003 hard as the client caught up with the head of the local kafka cluster. That's why the curve is flattening out currently, as I stopped the replayers until the queue is implemented.

Jan 23 2020, 2:17 PM · Mirror, Datasets

Dec 7 2019

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We'll need to address T2003 before this can be closed (if we go the journal client route), so marking accordingly.

Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a subtask for T1914: Keep mirror of contents on S3 up to date: T2003: Content replayer may try to copy objects before they are available from an objstorage.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS during ingestion to Keep mirror of contents on S3 up to date.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

I don't think we're going to do this but rather use the journal client approach. (Even more so considering that writing to S3 takes 500ms for each object, which sounds like a silly artificial limit to put on a synchronous process).

Dec 7 2019, 6:32 PM · Mirror, Datasets
olasd merged task T1899: complete object storage mirror on AWS into T1954: Up-to-date objstorage mirror on S3.
Dec 7 2019, 6:30 PM · Mirror, Datasets

Nov 18 2019

zack raised the priority of T1848: refresh graph dataset export from Low to Normal.
Nov 18 2019, 2:50 PM · Datasets
zack lowered the priority of T1847: fully automate export of the graph dataset from High to Normal.
Nov 18 2019, 2:50 PM · Graph service, Datasets
zack added a project to T1847: fully automate export of the graph dataset: Graph service.
Nov 18 2019, 2:48 PM · Graph service, Datasets

Aug 19 2019

seirl triaged T1956: Integrate usage docs of the graph dataset in swh-docs as High priority.
Aug 19 2019, 6:19 PM · Datasets

Jul 14 2019

zack renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS to synchronously write content objects to AWS during ingestion.
Jul 14 2019, 4:48 PM · Mirror, Datasets
zack triaged T1914: Keep mirror of contents on S3 up to date as High priority.
Jul 14 2019, 4:47 PM · Mirror, Datasets

Jul 9 2019

zack triaged T1899: complete object storage mirror on AWS as Normal priority.
Jul 9 2019, 10:59 AM · Mirror, Datasets