
Faithfully store weird git objects
Open, Normal, Public

Description

Our data model is based on git, and normalizes some of the data we read; this means that "weird" git objects cannot be represented.

This meta-task groups issues of this kind.

Possible options so far:

  1. extend the data model to support them (like negative_utc_offset, but somewhat generalized, e.g. storing the text representation of offsets)
  2. store a binary delta between the object we would generate from the model object and the original
  3. store the full original manifest for all objects that can't be losslessly represented in the model, alongside the main graph storage
  4. store the full original manifest for all objects, in a separate storage
  5. give up on all/some "weird objects"

Some mixes of the options are possible, especially 1 with 2, 3, or 5.

Discussion of these options:

1 -> is annoying to handle and needs continuous effort, but this is essentially what we are already doing with negative_utc_offset (a boolean that tells the difference between the "normal" timezone "+0000" for UTC and the "-0000" timezone that appears in 1.8M commits)
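To make the negative_utc_offset approach concrete, here is a minimal Python sketch. The helper name `format_offset` and its signature are assumptions made for illustration; this is not the actual swh-model code.

```python
def format_offset(minutes: int, negative_utc: bool = False) -> str:
    """Render an offset in Git's "+HHMM" form (hypothetical helper).

    An integer number of minutes cannot carry the sign of zero, so the
    extra boolean is the only way to get "-0000" back.
    """
    if minutes == 0 and negative_utc:
        return "-0000"
    sign = "-" if minutes < 0 else "+"
    hours, mins = divmod(abs(minutes), 60)
    return f"{sign}{hours:02d}{mins:02d}"


# Without the flag, both spellings of UTC serialize the same way.
print(format_offset(0))                     # "+0000"
print(format_offset(0, negative_utc=True))  # "-0000"
```

Every other unusual spelling of an offset would need yet another special case of this kind, which is the "continuous effort" mentioned above.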

2 -> brittle, as a botched migration or a bug in swh-model would make the deltas unusable

4 -> probably doubles or triples the size of the graph, but it is the only way to protect against bad migrations (short of recomputing all checksums in migrations). On the other hand, parser errors may go unnoticed because we would rely on these manifests.

2, 3, 4 -> currently, if the parsed object does not exactly match the manifest, we raise an error. This makes us notice any parsing error. If we go with any of these options, we will have to remove that error, so parser bugs may go unnoticed. (But they would be recoverable afterward, if and when we finally notice them.)
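The trade-off in the last point can be sketched as follows. This is only an illustration: `check_roundtrip`, `ManifestMismatch` and the `keep_original` switch are hypothetical names, not existing loader or swh-model APIs.

```python
from typing import Optional


class ManifestMismatch(ValueError):
    """Raised when the model cannot reproduce the original manifest."""


def check_roundtrip(
    original: bytes, regenerated: bytes, keep_original: bool
) -> Optional[bytes]:
    """Compare the original manifest with the one rebuilt from the model.

    Returns None when they match. On a mismatch, either returns the
    original bytes so they can be stored (options 2, 3, 4), or raises
    (the current strict behaviour).
    """
    if regenerated == original:
        return None
    if keep_original:
        # Lenient mode: nothing is lost, but a parser bug would also end
        # up here silently instead of failing loudly.
        return original
    raise ManifestMismatch("regenerated manifest differs from the original")
```

In lenient mode, the number of objects that take the mismatch branch would be the only remaining signal of a parser bug, so it would probably be worth monitoring.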

Event Timeline

vlorentz triaged this task as Normal priority. Sep 22 2021, 1:31 PM
vlorentz created this task.
vlorentz updated the task description.

Copy of an email I sent today:

Summary of this email: we will change the way we internally represent time offsets, and store the original manifest for all git objects that are otherwise unrepresentable. This will not affect users of the public API.

Conclusion

We just had a meeting to take a decision on the matter. Given the respective numbers of objects, we decided that we should:

  1. alter the data model to *represent time offsets as strings* (instead of the current tuple of an int and the "negative_utc" boolean), because there are many such objects
  2. for all objects that have other types of corruption (mostly directories with "040000" permissions or wrongly ordered entries; but also some revisions), *store their original git manifest next to them*. If we implemented this today, there would be under 100k such objects, mostly directories.

This will allow the archive to preserve all original objects, with a SWHIDv1 matching their Git identifier.
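As a rough illustration of the second decision, an object could carry an optional copy of its original manifest that takes precedence when computing the identifier. The class and field names below are hypothetical and are not the real swh-model definitions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchivedObject:
    """Hypothetical stand-in for a model object (not the swh-model class)."""

    object_type: str                      # "commit", "tree", "tag", ...
    canonical_manifest: bytes             # what the model serializes on its own
    raw_manifest: Optional[bytes] = None  # set only for unrepresentable objects

    def manifest(self) -> bytes:
        """Return the bytes that should be hashed for this object."""
        if self.raw_manifest is not None:
            return self.raw_manifest
        return self.canonical_manifest

    def git_id(self) -> str:
        """Git object id: SHA-1 of "<type> <length>\\0" followed by the body."""
        body = self.manifest()
        header = f"{self.object_type} {len(body)}".encode() + b"\0"
        return hashlib.sha1(header + body).hexdigest()
```

With this shape, well-formed objects cost nothing extra, and only the objects that cannot be represented losslessly pay for a stored manifest.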

Next steps and API changes

We will now need to implement these changes in the archive. It is not yet clear how this will affect the public API, but it will most likely only add two fields:

  1. we can parse the offset string in the web server, to keep serving an int through the API
  2. the manifest may be added as an extra field and/or API endpoint, but it won't affect the responses in any other way

So this should *not cause any disruption* to users of the API.
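A minimal sketch of the first point, parsing the stored offset string back into an integer on the web server side. The function name is hypothetical, and returning `None` for strings that are not in the plain "+HHMM" form is an assumption; the real API may choose a different fallback.

```python
import re
from typing import Optional

# Plain Git offset: a sign followed by exactly four ASCII digits.
_OFFSET_RE = re.compile(r"([+-])([0-9]{2})([0-9]{2})")


def offset_string_to_minutes(offset: str) -> Optional[int]:
    """Convert "+HHMM" / "-HHMM" to signed minutes, or None if malformed."""
    match = _OFFSET_RE.fullmatch(offset)
    if match is None:
        return None
    sign, hours, minutes = match.groups()
    total = int(hours) * 60 + int(minutes)
    return -total if sign == "-" else total
```

Note that "+0000" and "-0000" both map to 0 here, so the integer served through the API stays the same as today while the archive keeps the original string.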

Finally, we will inject all objects we can repair in the archive.

"Future work"

However, this raised (or rather, reminded us of) the issue of the uniqueness of SWHIDs: because the same "abstract" directory can be represented in different ways in Git, it can have many Git identifiers. And because SWHIDv1 is compatible with Git, it can have many SWHIDv1s.

It does have a SWHID that is "more canonical" than the others, though; this is the one computed by the main Git implementation and by our swh-identify tool.
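The effect can be reproduced with a few lines of Python. The sketch below builds two tree manifests for the same abstract directory (one subdirectory named "sub", pointing at the empty tree), differing only in how the directory mode is spelled. The helper names are made up for this example.

```python
import hashlib
from typing import Iterable

# Well-known Git id of the empty tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def tree_entry(mode: bytes, name: bytes, target_hex: str) -> bytes:
    """One tree entry: "<mode> <name>\\0" followed by the 20-byte target id."""
    return mode + b" " + name + b"\0" + bytes.fromhex(target_hex)


def git_tree_id(entries: Iterable[bytes]) -> str:
    """SHA-1 of "tree <length>\\0" followed by the concatenated entries."""
    body = b"".join(entries)
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


# Same abstract content, two spellings of the directory mode.
canonical_id = git_tree_id([tree_entry(b"40000", b"sub", EMPTY_TREE)])
padded_id = git_tree_id([tree_entry(b"040000", b"sub", EMPTY_TREE)])

print(canonical_id)
print(padded_id)
print(canonical_id != padded_id)  # True: one directory, two identifiers
```

The "40000" spelling is the one Git itself writes; the zero-padded one is the kind of manifest that has to be stored verbatim to keep its identifier.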

We decided to postpone a decision on this matter to the future discussion on how to design SWHIDv2, to address this issue at the same time as other ones (see https://forge.softwareheritage.org/T3134 / https://forge.softwareheritage.org/T3609).