datasets maintained by Software Heritage
May 18 2021
Apr 19 2021
Apr 17 2021
This should help:
12:08 <+ardumont> fwiw, i don't see swh-dataset in the jenkins ci declaration so that won't get published
12:08 <+ardumont> https://forge.softwareheritage.org/source/swh-jenkins-jobs/browse/master/jobs/swh-packages.yaml
12:10 <+ardumont> (relatedly without ^ that won't show up in jenkins)
12:10 <+ardumont> related docs https://wiki.softwareheritage.org/wiki/Debian_packaging#Setting_up_the_repository_on_Phabricator and https://wiki.softwareheritage.org/wiki/Debian_packaging#Setting_up_the_repository_on_Phabricator
Apr 7 2021
Duplicate of T2431
Mar 26 2021
Reopening, as ideally we'd like to run the entire ORC export to completion once before closing.
The ORC exporter is done, and it's likely that we won't provide CSV exports in the future, or we'll generate them from the ORC format.
Mar 8 2021
Mar 4 2021
Feb 2 2021
Sep 22 2020
Sep 17 2020
Sep 16 2020
No, only the edge part is done; we still need a Parquet and a CSV exporter :/
I think this is (reasonably) done now; please check and close it.
Sep 4 2020
Jun 3 2020
We no longer export edges from Spark
We no longer export edges per file type.
Apr 15 2020
Temporary fix here until the branch that implements this entrypoint is merged: https://forge.softwareheritage.org/rDDATASETbe9e71ba1f858bbb8f44649306b919a1fa965ea2
Jan 23 2020
We've now hit T2003 hard, as the client caught up with the head of the local Kafka cluster. That's why the curve is currently flattening out: I stopped the replayers until the queue is implemented.
Dec 7 2019
We'll need to address T2003 before this can be closed (if we go the journal client route), so marking accordingly.
I don't think we're going to do this; we'll use the journal client approach instead. (Even more so considering that writing to S3 takes 500 ms per object, which would be a silly artificial limit to put on a synchronous process.)
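To make the latency point concrete: at ~500 ms per object, synchronous per-object uploads cap throughput at about 2 objects/s, so any consumer would want to pay that round trip once per batch rather than once per object. A minimal sketch of such batching (the batch size and the writer callback are hypothetical, not the actual implementation):

```python
from typing import Callable, List

def batched_writer(write_batch: Callable[[List[bytes]], None],
                   batch_size: int = 1000):
    """Accumulate objects and flush them in groups, so a slow
    per-call round trip (e.g. ~500 ms to S3) is paid once per
    batch instead of once per object."""
    buffer: List[bytes] = []

    def add(obj: bytes) -> None:
        buffer.append(obj)
        if len(buffer) >= batch_size:
            write_batch(list(buffer))
            buffer.clear()

    def flush() -> None:
        # Write out any remaining objects at shutdown.
        if buffer:
            write_batch(list(buffer))
            buffer.clear()

    return add, flush
```

In a journal-client setup, `add` would be called from the message-handling callback and `flush` on clean shutdown; `write_batch` would wrap the actual (slow) storage write.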