
collaboration graph: drop pseudo-SWHIDs and add mapping ori<->url
Closed, Migrated; Edits Locked

Description

I've started looking into the current draft export of the collaboration graph (T4695), which is currently a single CSV file with two columns: <origin, author>, where origin is a pseudo-SWHID (of the ori type) and author is an integer.

It's already quite useful in this format, but a few change requests have emerged from early discussions with potential users:

  • We should have a version of the origin field that is just an integer. The rationale is that any serious/practical use of the collab graph will have to map ori SWHIDs to integers anyway, and given that we have those numbers already, we can just emit them. (Yes, doing so would be a "leak" of some internal identifiers, but that's already the case with ori SWHIDs, which we do not want users to rely upon anyway.)

We could add another, integer-based origin field alongside the existing one, but I'd rather just remove ori SWHIDs in favor of integer-based origins.

  • We should have an easy way to map origins to their URLs, without having to query the database internally. The handiest format for this would be a separate table mapping ori identifiers to full URLs. In the current format that would be a swhid,url table, but if we go with the previous suggestion it would be an even simpler int,url table.

(I still don't know if there will be reasons not to publish such an association table, but we should produce it anyway and decide later whether it should be in the public or restricted version of the dataset.)
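To make the two requests concrete, here is a minimal sketch of what a consumer might do with each format. Everything in it is an assumption for illustration: the function names, the sample rows, the example.org URLs, and the absence of a header row are all made up, and the integers produced by `densify_origins` are local to the run, not the internal identifiers discussed above.

```python
import csv
import io
from collections import defaultdict


def densify_origins(rows):
    """Map each distinct origin string to a dense integer, in first-seen order.

    This is the step a consumer of the current <ori SWHID, author> export
    has to do by hand. The integers are local to this run (hypothetical),
    not the internal identifiers the export could emit directly.
    """
    ids = {}
    edges = []
    for origin, author in rows:
        origin_id = ids.setdefault(origin, len(ids))
        edges.append((origin_id, int(author)))
    return ids, edges


def load_origin_urls(fileobj):
    """Read a hypothetical int,url table into a dict {origin_id: url}."""
    return {int(origin_id): url for origin_id, url in csv.reader(fileobj)}


def urls_by_author(edges, origin_urls):
    """Group origin URLs by author, given (origin_id, author) pairs."""
    result = defaultdict(set)
    for origin_id, author in edges:
        result[author].add(origin_urls[origin_id])
    return dict(result)


# Made-up sample data; the real files may differ (e.g. have a header row).
ORI_A = "swh:1:ori:" + "a" * 40
ORI_B = "swh:1:ori:" + "b" * 40
current_export = io.StringIO(f"{ORI_A},10\n{ORI_B},10\n{ORI_A},11\n")

# Hypothetical int,url table whose ids happen to match the ones assigned above.
url_table = io.StringIO(
    "0,https://example.org/repo-a\n"
    "1,https://example.org/repo-b\n"
)

ids, edges = densify_origins(csv.reader(current_export))
origin_urls = load_origin_urls(url_table)
by_author = urls_by_author(edges, origin_urls)
```

With integer origins in the export itself, the `densify_origins` step disappears and the edge list can be joined against the int,url table directly.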