SWHIDv2: List issues with SWHIDv1 that should be fixed
Closed, Migrated; Edits Locked

Description

(Please edit this task with SWHIDv1 issues)

The way we format manifests for revisions/releases with negative non-integer timestamps is broken: "200000 microseconds before timestamp 0" (stored as -1 seconds plus 800000 microseconds) is represented as "-1.8" in their git-like manifest, although the actual value is -0.2 seconds. Discussed here
No way to represent missing DAG nodes (discussion at T1957)
The same "abstract" object (especially directories) can have many Git representations, therefore multiple Git identifiers, making its SWHIDv1 non-unique (even if one is "more canonical" than others)
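
The timestamp issue in the first item can be reproduced with a small sketch. It only assumes what the task states: a timestamp stored as whole seconds plus non-negative microseconds. The helper names `format_naive` and `format_exact` are hypothetical and are not taken from the Software Heritage code base.

```python
def format_naive(seconds: int, microseconds: int) -> str:
    """Concatenate the two stored fields as-is (the behaviour that yields "-1.8")."""
    frac = f"{microseconds:06d}".rstrip("0")
    return f"{seconds}.{frac}" if frac else str(seconds)


def format_exact(seconds: int, microseconds: int) -> str:
    """Format the actual value, seconds + microseconds / 1e6, as a decimal string."""
    total_us = seconds * 1_000_000 + microseconds
    sign = "-" if total_us < 0 else ""
    whole, frac_us = divmod(abs(total_us), 1_000_000)
    frac = f"{frac_us:06d}".rstrip("0")
    return f"{sign}{whole}.{frac}" if frac else f"{sign}{whole}"


# 200000 microseconds before timestamp 0, stored as (-1 s, +800000 us)
stored = (-1, 800_000)
naive = format_naive(*stored)   # "-1.8"
exact = format_exact(*stored)   # "-0.2"
```

The two fields are each well-formed; the problem only appears when they are concatenated without accounting for the sign.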

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2210 Data Model
Migrated	gitlab-migration	T3134 SWHID v2
Migrated	gitlab-migration	T3609 SWHIDv2: List issues with SWHIDv1 that should be fixed

Event Timeline

vlorentz triaged this task as Normal priority. Sep 23 2021, 5:00 PM

vlorentz created this task.

vlorentz updated the task description.

vlorentz updated the task description. Oct 14 2021, 12:15 PM

vlorentz updated the task description. Nov 26 2021, 3:16 PM

vlorentz mentioned this in T3594: Faithfully store weird git objects. Nov 26 2021, 4:40 PM

vlorentz removed a project: meta-task. Dec 17 2021, 10:51 AM

I thought I'd mention here a couple of issues that I've seen come up again over the past few days:

First an easy one:

Revisions and Releases in SWHIDv1 have a single author field. We should make these fields multi-valued from the get-go.

And then a more complex one:

Revisions and Releases include PII (Personally Identifiable Information) right inside the data used for their integrity chain, which makes it really expensive to update/obfuscate said PII.

It's currently impossible to provide an anonymized dataset with good intrinsic integrity properties. It is also impossible to have a name change policy without compromising the integrity of the data.

One possible way to work around this, assuming that we will depart completely from git compatibility (which sounds very likely, at least to me), would be to add a level of indirection, where the "SWHID manifest" for revisions and releases includes a hash for authors/committers, and the mapping is only provided to trusted third parties.

It would be possible to update this mapping to respond to, e.g. name changes and other legal requests with a somewhat low impact to the integrity of the data (only the modified / redacted PII would be impacted, rather than the full chain of Revisions/Releases).
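
A minimal sketch of this indirection, under stated assumptions: the manifest layout, the `author-ref` field name, the use of SHA-256, and the `pii_mapping` side table are all hypothetical illustrations, not a proposed SWHIDv2 format.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def revision_id(directory: str, author_ref: str, message: bytes) -> str:
    """Hash a hypothetical manifest that carries an author reference, not the PII."""
    manifest = b"\n".join([
        b"directory " + directory.encode(),
        b"author-ref " + author_ref.encode(),
        b"",
        message,
    ])
    return digest(manifest)


original_author = b"Old Name <old@example.org>"
author_ref = digest(original_author)

# Side table, only provided to trusted third parties (assumption).
pii_mapping = {author_ref: original_author}

rev_before = revision_id("0" * 40, author_ref, b"Fix bug")

# A name change only touches the side table, never the manifest.
pii_mapping[author_ref] = b"New Name <new@example.org>"
rev_after = revision_id("0" * 40, author_ref, b"Fix bug")

revision_unchanged = rev_before == rev_after
entry_still_verifies = digest(pii_mapping[author_ref]) == author_ref
```

In this sketch the revision identifier is unaffected by the name change, while the updated mapping entry no longer hashes to its reference: only the modified PII loses its integrity check, not the chain of revisions built on top of it.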

This would also make anonymized datasets first-class citizens of our data model, which is probably a good thing to do in practice.

(While we will be re-computing all hashes of all objects for SWHIDv2, I think we should consider parsing trailing git pseudo-headers, e.g. Signed-off-by, Co-authored-by, etc. to be hidden behind this prospective PII mapping as well)

I agree

"I think we should consider parsing trailing git pseudo-headers, e.g. Signed-off-by, Co-authored-by, etc. to be hidden behind this prospective PII mapping as well"

We may have to hash each line in the commit message individually, to future-proof against additional pseudo-headers being invented.
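
One way to picture per-line hashing, as a rough sketch only: the identifier is computed over the list of line hashes, so any line can later be withheld without changing it. The function names, the use of SHA-256, and the sample headers (including the made-up `New-Header`) are assumptions for illustration.

```python
import hashlib
from typing import List, Optional, Tuple


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def line_hashes(message: bytes) -> List[bytes]:
    """One hash per line of the commit message."""
    return [digest(line) for line in message.split(b"\n")]


def message_id(hashes: List[bytes]) -> str:
    """Identifier computed over the per-line hashes, not over the raw text."""
    return hashlib.sha256(b"".join(hashes)).hexdigest()


def redact(message: bytes, prefixes: Tuple[bytes, ...]) -> List[Optional[bytes]]:
    """Withhold lines starting with any of the given prefixes (None = withheld)."""
    return [None if line.startswith(prefixes) else line
            for line in message.split(b"\n")]


def verify(visible: List[Optional[bytes]], hashes: List[bytes]) -> bool:
    """Check every line that is still visible against its published hash."""
    return len(visible) == len(hashes) and all(
        line is None or digest(line) == h for line, h in zip(visible, hashes)
    )


message = (b"Fix parser\n\n"
           b"Signed-off-by: Jane Doe <jane@example.org>\n"
           b"New-Header: John Roe")
hashes = line_hashes(message)
full_id = message_id(hashes)

# A pseudo-header invented after the fact can still be withheld later.
public = redact(message, (b"Signed-off-by:", b"New-Header:"))
```

Because `message_id` depends only on `hashes`, the set of redacted prefixes can grow over time without recomputing identifiers.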

vsellier added a subscriber: vsellier. Jan 31 2022, 10:43 AM

This task has been migrated to GitLab.
