datasets maintained by Software Heritage
Sep 22 2020
Sep 17 2020
Sep 16 2020
No, only the edge part is done; we still need a Parquet and a CSV exporter :/
I think this is (reasonably) done now, please check and close it.
Sep 4 2020
Jun 3 2020
We no longer export edges from Spark.
We no longer export edges per file type.
Apr 15 2020
Temporary fix here until the branch that implements this entrypoint is merged: https://forge.softwareheritage.org/rDDATASETbe9e71ba1f858bbb8f44649306b919a1fa965ea2
Jan 23 2020
We've now hit T2003 hard, as the client caught up with the head of the local Kafka cluster. That's why the curve is currently flattening out: I stopped the replayers until the queue is implemented.
Dec 7 2019
We'll need to address T2003 before this can be closed (if we go the journal client route), so marking accordingly.
I don't think we're going to do this; we'll use the journal client approach instead. (Even more so considering that writing to S3 takes 500 ms for each object, which is a silly artificial limit to put on a synchronous process.)
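To see why a 500 ms synchronous write per object doesn't scale, here is a back-of-the-envelope sketch (the object count below is a hypothetical placeholder for illustration, not an actual archive figure):

```python
# Throughput of a single synchronous writer at 500 ms per S3 PUT,
# as mentioned above. The object count is a hypothetical placeholder.
WRITE_LATENCY_S = 0.5
objects = 1_000_000_000

objects_per_second = 1 / WRITE_LATENCY_S
total_days = objects * WRITE_LATENCY_S / 86_400  # seconds per day

print(f"{objects_per_second:.0f} objects/s per writer, "
      f"{total_days:,.0f} days for {objects:,} objects")
# → 2 objects/s per writer, 5,787 days for 1,000,000,000 objects
```

At that rate a single synchronous writer tops out at 2 objects/s, which is why an asynchronous, batched path (such as the journal client) is the more plausible route.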
Nov 18 2019
Aug 19 2019
Jul 14 2019
Jul 9 2019
Jun 30 2019
Jun 23 2019
Jun 11 2019
Jun 5 2019
Jun 4 2019
May 23 2019
A nice piece of related work here is the LAW datasets.