Datasets
Active · Public

Details

Description

Datasets maintained by Software Heritage.

Recent Activity

Sun, May 1

seirl closed T1848: refresh graph dataset export as Resolved.

Now that there are both a columnar+compressed graph from 2021 and a columnar graph from 2022 that is pending compression, this task about "refreshing the export from January 2019" is resolved.

Sun, May 1, 12:08 PM · Datasets

Fri, Apr 29

seirl changed the status of T1848: refresh graph dataset export from Open to Work in Progress.
Fri, Apr 29, 6:23 PM · Datasets
seirl moved T1847: fully automate export of the graph dataset from Backlog to Deployed on the Compressed graph service board.
Fri, Apr 29, 6:22 PM · Compressed graph service, Datasets
seirl moved T2431: Document how to export the graph edge dataset from Backlog to Deployed on the Compressed graph service board.
Fri, Apr 29, 6:22 PM · Documentation, Compressed graph service, Datasets
seirl closed T3021: Investigate why reading the journal of the content table takes so long as Resolved.

Fixed in D7718

Fri, Apr 29, 6:20 PM · Journal, Datasets
seirl closed T2431: Document how to export the graph edge dataset, a subtask of T1847: fully automate export of the graph dataset, as Resolved.
Fri, Apr 29, 6:15 PM · Compressed graph service, Datasets
seirl closed T2431: Document how to export the graph edge dataset as Resolved.

Done here: D7693 and here: D7711

Fri, Apr 29, 6:15 PM · Documentation, Compressed graph service, Datasets
seirl closed T1743: create a nice landing web page for exported dataset as Resolved.
Fri, Apr 29, 6:14 PM · Datasets
seirl added a comment to T1743: create a nice landing web page for exported dataset.

Done. This page, https://annex.softwareheritage.org/public/dataset/graph/, now contains a link to the detailed list of datasets: https://forge.softwareheritage.org/D7487

Fri, Apr 29, 6:14 PM · Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Fri, Apr 29, 5:57 PM · Datasets
seirl closed T1847: fully automate export of the graph dataset as Resolved.

Done!

Fri, Apr 29, 5:57 PM · Compressed graph service, Datasets
seirl closed T3329: document ORC format dataset availability as Resolved.

Fixed in D7487

Fri, Apr 29, 5:56 PM · Datasets

Wed, Apr 27

seirl added a comment to T3021: Investigate why reading the journal of the content table takes so long.

Wed, Apr 27, 2:58 PM · Journal, Datasets
seirl reopened T3021: Investigate why reading the journal of the content table takes so long as "Open".
Wed, Apr 27, 2:57 PM · Journal, Datasets
seirl closed T3021: Investigate why reading the journal of the content table takes so long as Resolved.

No longer happens with a more recent stack

Wed, Apr 27, 10:12 AM · Journal, Datasets

Apr 5 2022

zack changed the status of T1743: create a nice landing web page for exported dataset from Open to Work in Progress.
Apr 5 2022, 1:39 PM · Datasets
zack changed the status of T3329: document ORC format dataset availability from Open to Work in Progress.
Apr 5 2022, 1:38 PM · Datasets

Mar 30 2022

zack added a member for Datasets: seirl.
Mar 30 2022, 1:42 PM
zack added a watcher for Datasets: zack.
Mar 30 2022, 1:41 PM
zack added a member for Datasets: zack.
Mar 30 2022, 1:39 PM

Mar 22 2022

vlorentz created P1315 pyorc_no_zoneinfo.patch.
Mar 22 2022, 4:43 PM · Datasets

Jan 25 2022

zack triaged T3885: Filter rows of size >32MB from dataset export as Normal priority.
Jan 25 2022, 1:32 PM · Datasets

Jan 24 2022

seirl created T3885: Filter rows of size >32MB from dataset export.
Jan 24 2022, 9:18 PM · Datasets

Jan 4 2022

zack closed T3260: publish swh.dataset to pypi as Resolved.
Jan 4 2022, 1:42 PM · Continuous Integration, Datasets

Jul 29 2021

vlorentz moved T2431: Document how to export the graph edge dataset from sys-admin (docs/sysadm) to developers (docs/devel/) on the Documentation board.
Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets
vlorentz moved T2431: Document how to export the graph edge dataset from archive-users (docs/user-guides/) to sys-admin (docs/sysadm) on the Documentation board.
Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets
vlorentz added a comment to T2431: Document how to export the graph edge dataset.

It is now somewhat documented here: https://forge.softwareheritage.org/source/swh-environment/browse/master/docker/services/swh-graph/entrypoint.sh

Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets

May 18 2021

zack updated the task description for T3329: document ORC format dataset availability.
May 18 2021, 9:33 AM · Datasets
zack triaged T3329: document ORC format dataset availability as High priority.
May 18 2021, 9:32 AM · Datasets

Apr 19 2021

olasd closed T2003: Content replayer may try to copy objects before they are available from an objstorage, a subtask of T1914: Keep mirror of contents on S3 up to date, as Resolved.
Apr 19 2021, 12:06 PM · Mirror, Datasets

Apr 17 2021

ardumont added a comment to T3260: publish swh.dataset to pypi.

This should help:

12:08 <+ardumont> fwiw, i don't see swh-dataset in the jenkins ci declaration so that won't get published
12:08 <+ardumont> https://forge.softwareheritage.org/source/swh-jenkins-jobs/browse/master/jobs/swh-packages.yaml
12:10 <+ardumont> (relatedly without ^ that won't show up in jenkins)
12:10 <+ardumont> related docs https://wiki.softwareheritage.org/wiki/Debian_packaging#Setting_up_the_repository_on_Phabricator
Apr 17 2021, 3:23 PM · Continuous Integration, Datasets
zack triaged T3260: publish swh.dataset to pypi as Low priority.
Apr 17 2021, 12:31 PM · Continuous Integration, Datasets

Apr 7 2021

seirl closed T3178: document how to export the graph dataset automatically, a subtask of T1847: fully automate export of the graph dataset, as Invalid.
Apr 7 2021, 3:03 PM · Compressed graph service, Datasets
seirl closed T3178: document how to export the graph dataset automatically as Invalid.

Duplicate of T2431

Apr 7 2021, 3:03 PM · Documentation, Datasets
seirl added a subtask for T1847: fully automate export of the graph dataset: T2431: Document how to export the graph edge dataset.
Apr 7 2021, 3:03 PM · Compressed graph service, Datasets
seirl added a parent task for T2431: Document how to export the graph edge dataset: T1847: fully automate export of the graph dataset.
Apr 7 2021, 3:03 PM · Documentation, Compressed graph service, Datasets

Mar 26 2021

zack triaged T3178: document how to export the graph dataset automatically as Normal priority.
Mar 26 2021, 12:25 PM · Documentation, Datasets
zack reopened T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Open.
Mar 26 2021, 12:25 PM · Datasets
zack reopened T1847: fully automate export of the graph dataset as "Open".

Reopening, as ideally we'd like to have run the entire ORC export once to completion before closing.

Mar 26 2021, 12:25 PM · Compressed graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset as Resolved.

The ORC exporter is done, and it's likely that we won't provide CSV exports in the future, or we'll generate them from the ORC format.

Mar 26 2021, 12:04 PM · Compressed graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Mar 26 2021, 12:04 PM · Datasets

Mar 8 2021

rdicosmo added a parent task for T1743: create a nice landing web page for exported dataset: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:49 AM · Datasets
rdicosmo added a parent task for T1848: refresh graph dataset export: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:45 AM · Datasets

Mar 4 2021

rdicosmo merged task T1914: Keep mirror of contents on S3 up to date into T1954: Up-to-date objstorage mirror on S3.
Mar 4 2021, 5:44 PM · Mirror, Datasets

Feb 2 2021

seirl triaged T3021: Investigate why reading the journal of the content table takes so long as Normal priority.
Feb 2 2021, 2:00 PM · Journal, Datasets

Sep 22 2020

moranegg moved T2431: Document how to export the graph edge dataset from Backlog to archive-users (docs/user-guides/) on the Documentation board.
Sep 22 2020, 2:37 PM · Documentation, Compressed graph service, Datasets

Sep 17 2020

zack changed the status of T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Datasets
zack changed the status of T1847: fully automate export of the graph dataset from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Compressed graph service, Datasets

Sep 16 2020

seirl added a comment to T1847: fully automate export of the graph dataset.

No, only the edge part is done; we still need a Parquet and a CSV exporter :/

Sep 16 2020, 10:59 PM · Compressed graph service, Datasets
zack removed a parent task for T1848: refresh graph dataset export: T1868: refresh compressed representation of the archive.
Sep 16 2020, 8:43 PM · Datasets