
SWHIDv2: List issues with SWHIDv1 that should be fixed
Open, Normal, Public

Description

(Please edit this task with SWHIDv1 issues)

  • The way we format manifests for revisions/releases with negative non-integer timestamps is broken: "200000 microseconds before timestamp 0" (stored as -1 seconds plus 800000 microseconds) is represented as "-1.8" in their git-like manifest, which reads as -1.8 seconds rather than the actual -0.2 seconds. Discussed here
  • No way to represent missing DAG nodes (discussion at T1957)
  • The same "abstract" object (especially directories) can have many Git representations, therefore multiple Git identifiers, making its SWHIDv1 non-unique (even if one is "more canonical" than others)
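
To make the first issue above concrete, here is a minimal Python sketch. It is not the actual manifest code: `naive_format` and `decimal_value` are hypothetical helpers, and the formatter simply assumes that the manifest string is built by joining the seconds and microseconds fields, which reproduces the "-1.8" described above.

```python
from fractions import Fraction


def naive_format(seconds: int, microseconds: int) -> str:
    """Hypothetical formatter: joins the two stored fields with a dot."""
    if microseconds == 0:
        return str(seconds)
    # Drop trailing zeros of the fractional part, e.g. 800000 -> "8".
    frac = f"{microseconds:06d}".rstrip("0")
    return f"{seconds}.{frac}"


def decimal_value(seconds: int, microseconds: int) -> Fraction:
    """Exact value of the timestamp, in seconds."""
    return Fraction(seconds) + Fraction(microseconds, 1_000_000)


# "200000 microseconds before timestamp 0" is stored as (-1, 800000).
seconds, microseconds = -1, 800_000

formatted = naive_format(seconds, microseconds)
actual = decimal_value(seconds, microseconds)

print(formatted)               # the string that ends up in the manifest
print(float(actual))           # the value the timestamp really denotes
print(Fraction(formatted) == actual)  # False: the manifest string misreads
```

Read as a decimal number, the formatted string denotes a different instant than the stored fields do, which is the ambiguity this bullet describes.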

Event Timeline

vlorentz triaged this task as Normal priority. Sep 23 2021, 5:00 PM
vlorentz created this task.
vlorentz updated the task description.

I'd like to mention here a couple of issues that I've seen come up again over the past few days:

First an easy one:

  • Revisions and Releases in SWHIDv1 have a single author field. We should make these fields multi-valued from the get-go.

And then a more complex one:

  • Revisions and Releases include PII (Personally Identifying Information) right inside the data used for their integrity chain, which makes it really expensive to update/obfuscate said PII.

It's currently impossible to provide an anonymized dataset with good intrinsic integrity properties. It also makes it impossible to have a name change policy without compromising the integrity of the data.

One possible way to work around this, assuming that we will depart completely from git compatibility (which sounds very likely, at least to me), would be to add a level of indirection, where the "SWHID manifest" for revisions and releases includes a hash for authors/committers, and the mapping is only provided to trusted third parties.

It would be possible to update this mapping to respond to, e.g. name changes and other legal requests with a somewhat low impact to the integrity of the data (only the modified / redacted PII would be impacted, rather than the full chain of Revisions/Releases).

This would also make anonymized datasets first-class citizens of our data model, which is probably a good thing to do in practice.
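
A rough Python sketch of the indirection described above, to show why redacting the mapping would leave the integrity chain untouched. Everything here is an assumption made for illustration: the `person_key` and `revision_id` helpers, the manifest layout, and the use of plain SHA-256 are hypothetical and not a proposed SWHIDv2 format.

```python
import hashlib
from typing import Dict, List, Optional


def person_key(fullname: bytes) -> str:
    """Hypothetical opaque key for an author/committer (plain SHA-256 here)."""
    return hashlib.sha256(fullname).hexdigest()


def revision_id(directory: str, parents: List[str], author_key: str,
                committer_key: str, message: bytes) -> str:
    """Hash a made-up manifest that contains person keys, never raw PII."""
    lines = [f"directory {directory}".encode()]
    lines += [f"parent {p}".encode() for p in parents]
    lines += [f"author {author_key}".encode(),
              f"committer {committer_key}".encode(),
              b"",
              message]
    return hashlib.sha256(b"\n".join(lines)).hexdigest()


author = b"Jane Doe <jane@example.org>"
key = person_key(author)

# The mapping lives outside the manifest and is only shared with trusted parties.
pii_mapping: Dict[str, Optional[bytes]] = {key: author}

before = revision_id("d" * 40, [], key, key, b"Initial commit\n")

# Redacting the PII only touches the mapping entry.
pii_mapping[key] = None

after = revision_id("d" * 40, [], key, key, b"Initial commit\n")
print(before == after)  # the revision identifier does not depend on the mapping
```

In this sketch an anonymized dataset is simply the same graph shipped without `pii_mapping`.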

(While we will be re-computing all hashes of all objects for SWHIDv2, I think we should consider parsing trailing git pseudo-headers, e.g. Signed-off-by, Co-authored-by, etc. to be hidden behind this prospective PII mapping as well)

I agree

I think we should consider parsing trailing git pseudo-headers, e.g. Signed-off-by, Co-authored-by, etc. to be hidden behind this prospective PII mapping as well

We may have to hash each line in the commit message individually to future-proof against additional pseudo-headers being invented.
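
A minimal sketch of what per-line hashing could look like, in Python. The scheme is an assumption for illustration only (SHA-256 per line, then a digest over the list of line hashes), and `line_hashes`, `digest_of_hashes` and `verify` are hypothetical names.

```python
import hashlib
from typing import List, Optional, Tuple


def line_hashes(message: bytes) -> List[str]:
    """Hash every line of the commit message on its own."""
    return [hashlib.sha256(line).hexdigest() for line in message.split(b"\n")]


def digest_of_hashes(hashes: List[str]) -> str:
    """Message digest computed over the line hashes, not the raw text."""
    return hashlib.sha256("\n".join(hashes).encode("ascii")).hexdigest()


def verify(published: List[Tuple[str, Optional[bytes]]], expected: str) -> bool:
    """Check the disclosed lines against their hashes, then the overall digest."""
    for line_hash, line in published:
        if line is not None and hashlib.sha256(line).hexdigest() != line_hash:
            return False
    return digest_of_hashes([h for h, _ in published]) == expected


message = b"Fix parser\n\nSigned-off-by: Jane Doe <jane@example.org>"
hashes = line_hashes(message)
digest = digest_of_hashes(hashes)

# Withhold any line later found to contain PII; only its hash stays public.
published = [
    (h, None if line.startswith(b"Signed-off-by:") else line)
    for h, line in zip(hashes, message.split(b"\n"))
]

print(verify(published, digest))
```

Because the digest only depends on the line hashes, a line can be withheld later, whatever pseudo-header it turns out to carry, without recomputing the identifier.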