Apr 29 2022

seirl closed T3329: document ORC format dataset availability as Resolved.

Fixed in D7487

Apr 29 2022, 5:56 PM · Datasets

Apr 27 2022

seirl added a comment to T3021: Investigate why reading the journal of the content table takes so long.

Apr 27 2022, 2:58 PM · Journal, Datasets
seirl reopened T3021: Investigate why reading the journal of the content table takes so long as "Open".
Apr 27 2022, 2:57 PM · Journal, Datasets
seirl closed T3021: Investigate why reading the journal of the content table takes so long as Resolved.

No longer happens with a more recent stack

Apr 27 2022, 10:12 AM · Journal, Datasets

Apr 5 2022

zack changed the status of T1743: create a nice landing web page for exported dataset from Open to Work in Progress.
Apr 5 2022, 1:39 PM · Datasets
zack changed the status of T3329: document ORC format dataset availability from Open to Work in Progress.
Apr 5 2022, 1:38 PM · Datasets

Mar 30 2022

zack added a member for Datasets: seirl.
Mar 30 2022, 1:42 PM
zack added a watcher for Datasets: zack.
Mar 30 2022, 1:41 PM
zack added a member for Datasets: zack.
Mar 30 2022, 1:39 PM

Mar 22 2022

vlorentz created P1315 pyorc_no_zoneinfo.patch.
Mar 22 2022, 4:43 PM · Datasets

Jan 25 2022

zack triaged T3885: Filter rows of size >32MB from dataset export as Normal priority.
Jan 25 2022, 1:32 PM · Datasets

Jan 24 2022

seirl created T3885: Filter rows of size >32MB from dataset export.
Jan 24 2022, 9:18 PM · Datasets

Jan 4 2022

zack closed T3260: publish swh.dataset to pypi as Resolved.
Jan 4 2022, 1:42 PM · Continuous Integration, Datasets

Jul 29 2021

vlorentz moved T2431: Document how to export the graph edge dataset from sys-admin (docs/sysadm) to developers (docs/devel/) on the Documentation board.
Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets
vlorentz moved T2431: Document how to export the graph edge dataset from archive-users (docs/user-guides/) to sys-admin (docs/sysadm) on the Documentation board.
Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets
vlorentz added a comment to T2431: Document how to export the graph edge dataset.

It is now somewhat documented here: https://forge.softwareheritage.org/source/swh-environment/browse/master/docker/services/swh-graph/entrypoint.sh

Jul 29 2021, 3:54 PM · Documentation, Compressed graph service, Datasets

May 18 2021

zack updated the task description for T3329: document ORC format dataset availability.
May 18 2021, 9:33 AM · Datasets
zack triaged T3329: document ORC format dataset availability as High priority.
May 18 2021, 9:32 AM · Datasets

Apr 19 2021

olasd closed T2003: Content replayer may try to copy objects before they are available from an objstorage, a subtask of T1914: Keep mirror of contents on S3 up to date, as Resolved.
Apr 19 2021, 12:06 PM · Mirror, Datasets

Apr 17 2021

ardumont added a comment to T3260: publish swh.dataset to pypi.

This should help:

12:08 <+ardumont> fwiw, i don't see swh-dataset in the jenkins ci declaration so that won't get published
12:08 <+ardumont> https://forge.softwareheritage.org/source/swh-jenkins-jobs/browse/master/jobs/swh-packages.yaml
12:10 <+ardumont> (relatedly without ^ that won't show up in jenkins)
12:10 <+ardumont> related docs https://wiki.softwareheritage.org/wiki/Debian_packaging#Setting_up_the_repository_on_Phabricator
Apr 17 2021, 3:23 PM · Continuous Integration, Datasets
zack triaged T3260: publish swh.dataset to pypi as Low priority.
Apr 17 2021, 12:31 PM · Continuous Integration, Datasets

Apr 7 2021

seirl closed T3178: document how to export the graph dataset automatically, a subtask of T1847: fully automate export of the graph dataset, as Invalid.
Apr 7 2021, 3:03 PM · Compressed graph service, Datasets
seirl closed T3178: document how to export the graph dataset automatically as Invalid.

Duplicate of T2431

Apr 7 2021, 3:03 PM · Documentation, Datasets
seirl added a subtask for T1847: fully automate export of the graph dataset: T2431: Document how to export the graph edge dataset.
Apr 7 2021, 3:03 PM · Compressed graph service, Datasets
seirl added a parent task for T2431: Document how to export the graph edge dataset: T1847: fully automate export of the graph dataset.
Apr 7 2021, 3:03 PM · Documentation, Compressed graph service, Datasets

Mar 26 2021

zack triaged T3178: document how to export the graph dataset automatically as Normal priority.
Mar 26 2021, 12:25 PM · Documentation, Datasets
zack reopened T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Open.
Mar 26 2021, 12:25 PM · Datasets
zack reopened T1847: fully automate export of the graph dataset as "Open".

Reopening, as ideally we'd like to have run the entire ORC export once to completion before closing.

Mar 26 2021, 12:25 PM · Compressed graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset as Resolved.

The ORC exporter is done, and it's likely that we won't provide CSV exports in the future, or we'll generate them from the ORC format.

Mar 26 2021, 12:04 PM · Compressed graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Mar 26 2021, 12:04 PM · Datasets

Mar 8 2021

rdicosmo added a parent task for T1743: create a nice landing web page for exported dataset: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:49 AM · Datasets
rdicosmo added a parent task for T1848: refresh graph dataset export: T3085: Complete and updated copy of the archive on S3 (objects+graph).
Mar 8 2021, 9:45 AM · Datasets

Mar 4 2021

rdicosmo merged task T1914: Keep mirror of contents on S3 up to date into T1954: Up-to-date objstorage mirror on S3.
Mar 4 2021, 5:44 PM · Mirror, Datasets

Feb 2 2021

seirl triaged T3021: Investigate why reading the journal of the content table takes so long as Normal priority.
Feb 2 2021, 2:00 PM · Journal, Datasets

Sep 22 2020

moranegg moved T2431: Document how to export the graph edge dataset from Backlog to archive-users (docs/user-guides/) on the Documentation board.
Sep 22 2020, 2:37 PM · Documentation, Compressed graph service, Datasets

Sep 17 2020

zack changed the status of T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Datasets
zack changed the status of T1847: fully automate export of the graph dataset from Open to Work in Progress.
Sep 17 2020, 9:04 AM · Compressed graph service, Datasets

Sep 16 2020

seirl added a comment to T1847: fully automate export of the graph dataset.

No, only the edge part is done; we still need a Parquet and a CSV exporter :/

Sep 16 2020, 10:59 PM · Compressed graph service, Datasets
zack removed a parent task for T1848: refresh graph dataset export: T1868: refresh compressed representation of the archive.
Sep 16 2020, 8:43 PM · Datasets
zack added a comment to T1847: fully automate export of the graph dataset.

I think this is (reasonably) done now, please check and close it.

Sep 16 2020, 8:43 PM · Compressed graph service, Datasets
zack raised the priority of T1848: refresh graph dataset export from Normal to High.
Sep 16 2020, 8:42 PM · Datasets
zack added a comment to T1848: refresh graph dataset export.
Sep 16 2020, 8:42 PM · Datasets

Sep 4 2020

ardumont added projects to T2564: migrate existing revisions metadata extra_headers to actual extra_headers field: Storage manager, Datasets.
Sep 4 2020, 11:30 AM · System administration, Storage manager

Jun 3 2020

zack renamed T2431: Document how to export the graph edge dataset from Documentat how to export the graph edge dataset to Document how to export the graph edge dataset.
Jun 3 2020, 4:36 PM · Documentation, Compressed graph service, Datasets
seirl triaged T2431: Document how to export the graph edge dataset as Normal priority.
Jun 3 2020, 4:34 PM · Documentation, Compressed graph service, Datasets
zack changed the status of T1796: Datasets exported from Spark are missing some rows from Resolved to Wontfix.
Jun 3 2020, 4:20 PM · Datasets
seirl closed T1796: Datasets exported from Spark are missing some rows as Resolved.

We no longer export edges from Spark

Jun 3 2020, 4:14 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere, a subtask of T1848: refresh graph dataset export, as Resolved.
Jun 3 2020, 4:08 PM · Datasets
seirl closed T1741: graph dataset: update to use persistent identifiers everywhere as Resolved.

We no longer export edges per file type.

Jun 3 2020, 4:08 PM · Datasets
seirl closed T1956: Integrate usage docs of the graph dataset in swh-docs as Resolved.
Jun 3 2020, 4:07 PM · Datasets

Apr 15 2020

seirl closed T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Resolved.
Apr 15 2020, 3:36 PM · Datasets
seirl added a comment to T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli'.

Temporary fix here until the branch that implements this entrypoint is merged: https://forge.softwareheritage.org/rDDATASETbe9e71ba1f858bbb8f44649306b919a1fa965ea2

Apr 15 2020, 3:36 PM · Datasets
zack triaged T2361: WARNING:swh.core.cli:Could not load subcommand dataset: No module named 'swh.dataset.cli' as Normal priority.
Apr 15 2020, 1:32 PM · Datasets

Jan 23 2020

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We've now hit T2003 hard as the client caught up with the head of the local Kafka cluster. That's why the curve is currently flattening out: I stopped the replayers until the queue is implemented.

Jan 23 2020, 2:17 PM · Mirror, Datasets

Dec 7 2019

olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

We'll need to address T2003 before this can be closed (if we go the journal client route), so marking accordingly.

Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a subtask for T1914: Keep mirror of contents on S3 up to date: T2003: Content replayer may try to copy objects before they are available from an objstorage.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS during ingestion to Keep mirror of contents on S3 up to date.
Dec 7 2019, 6:35 PM · Mirror, Datasets
olasd added a comment to T1914: Keep mirror of contents on S3 up to date.

I don't think we're going to do this but rather use the journal client approach. (Even more so considering that writing to S3 takes 500ms for each object, which sounds like a silly artificial limit to put on a synchronous process).

Dec 7 2019, 6:32 PM · Mirror, Datasets
olasd merged task T1899: complete object storage mirror on AWS into T1954: Up-to-date objstorage mirror on S3.
Dec 7 2019, 6:30 PM · Mirror, Datasets

Nov 18 2019

zack raised the priority of T1848: refresh graph dataset export from Low to Normal.
Nov 18 2019, 2:50 PM · Datasets
zack lowered the priority of T1847: fully automate export of the graph dataset from High to Normal.
Nov 18 2019, 2:50 PM · Compressed graph service, Datasets
zack added a project to T1847: fully automate export of the graph dataset: Compressed graph service.
Nov 18 2019, 2:48 PM · Compressed graph service, Datasets

Aug 19 2019

seirl triaged T1956: Integrate usage docs of the graph dataset in swh-docs as High priority.
Aug 19 2019, 6:19 PM · Datasets

Jul 14 2019

zack renamed T1914: Keep mirror of contents on S3 up to date from synchronously write content objects to AWS to synchronously write content objects to AWS during ingestion.
Jul 14 2019, 4:48 PM · Mirror, Datasets
zack triaged T1914: Keep mirror of contents on S3 up to date as High priority.
Jul 14 2019, 4:47 PM · Mirror, Datasets

Jul 9 2019

zack triaged T1899: complete object storage mirror on AWS as Normal priority.
Jul 9 2019, 10:59 AM · Mirror, Datasets

Jun 30 2019

zack added a parent task for T1848: refresh graph dataset export: T1868: refresh compressed representation of the archive.
Jun 30 2019, 1:58 PM · Datasets

Jun 23 2019

zack added a subtask for T1848: refresh graph dataset export: T1741: graph dataset: update to use persistent identifiers everywhere.
Jun 23 2019, 10:23 PM · Datasets
zack added a parent task for T1741: graph dataset: update to use persistent identifiers everywhere: T1848: refresh graph dataset export.
Jun 23 2019, 10:23 PM · Datasets
zack triaged T1848: refresh graph dataset export as Low priority.
Jun 23 2019, 10:22 PM · Datasets
zack added a parent task for T1847: fully automate export of the graph dataset: T1848: refresh graph dataset export.
Jun 23 2019, 10:22 PM · Compressed graph service, Datasets
zack added a subtask for T1848: refresh graph dataset export: T1847: fully automate export of the graph dataset.
Jun 23 2019, 10:22 PM · Datasets
zack created T1848: refresh graph dataset export.
Jun 23 2019, 10:21 PM · Datasets
zack triaged T1847: fully automate export of the graph dataset as High priority.
Jun 23 2019, 10:20 PM · Compressed graph service, Datasets
zack created T1847: fully automate export of the graph dataset.
Jun 23 2019, 10:20 PM · Compressed graph service, Datasets

Jun 11 2019

seirl triaged T1796: Datasets exported from Spark are missing some rows as Normal priority.
Jun 11 2019, 11:52 PM · Datasets

Jun 5 2019

zack claimed T1742: graph dataset: uniform file names.
Jun 5 2019, 10:07 AM · Datasets
zack closed T1742: graph dataset: uniform file names as Resolved.
Jun 5 2019, 10:07 AM · Datasets

Jun 4 2019

zack closed T1783: edge dataset: re-export rev→rev edges in the right order as Resolved.
Jun 4 2019, 10:33 PM · Datasets
zack triaged T1783: edge dataset: re-export rev→rev edges in the right order as High priority.
Jun 4 2019, 2:33 PM · Datasets

May 23 2019

zack added a project to T1741: graph dataset: update to use persistent identifiers everywhere: Datasets.
May 23 2019, 2:37 PM · Datasets
zack added a project to T1742: graph dataset: uniform file names: Datasets.
May 23 2019, 2:37 PM · Datasets
zack added a comment to T1743: create a nice landing web page for exported dataset.

A nice piece of related work here is the LAW datasets.

May 23 2019, 2:37 PM · Datasets
zack triaged T1743: create a nice landing web page for exported dataset as Low priority.
May 23 2019, 2:36 PM · Datasets
zack created Datasets.
May 23 2019, 2:29 PM