
license dataset: missing java stuff from the replication package
Closed, Migrated, Edits Locked

Description

Up to the 2021 version of the dataset, the Java source of the custom code (used, e.g., to find the earliest occurrence of a license blob) was shipped as part of the dataset, in a java/ subdir.
This seems to be gone from the 2022 version.
Ideally, we should add it back. Alternatively, we can point to the code used for that as part of swh-graph, but that would make the replication package a bit less useful on its own.

Event Timeline

zack triaged this task as Low priority. Nov 14 2022, 2:45 PM
zack created this task.

The replication/05-earliest-revision.sh script in the replication package mentions the swh-graph version it uses and the fully qualified class name, so the class can be found in the swh-graph code.

"but that would make the replication package a bit less useful on its own"

The old EarliestRevision.java was also not useful on its own, because it doesn't compile with the current version of swh-graph.

Future versions will be generated using only code in swh-graph (bash glue code replaced by Python code, some of which shells out to bash for simplicity), so the replication package will simply be replaced by a swh-graph tag.