Our data model is based on git, and normalizes some of the data we read; this means that "weird" git objects cannot be represented.
This meta-task groups issues of this kind.
Possible options so far:
- extend the data model to support them (like "negative_utc_offset", but somewhat generalized, e.g. store the text representation of offsets)
- store a binary delta between the object we would generate from the model object and the original
- store the full original manifest for all objects that can't be losslessly represented in the model, alongside the main graph storage
- store the full original manifest for all objects, in a separate storage
- give up on all/some "weird objects"
Some mixes of the options are possible, especially 1 with 2, 3, or 5.
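To make option 1 more concrete, here is a minimal sketch of what "store the text representation of offsets" could look like. The `Offset` type and the `parse_offset` / `format_offset` helpers are hypothetical names for illustration, not existing swh-model API, and the sketch assumes the offset has the usual `+HHMM` / `-HHMM` shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offset:
    """Hypothetical model type: normalized minutes plus the original text."""

    minutes: int
    raw: bytes  # offset exactly as it appeared in the git object


def parse_offset(raw: bytes) -> Offset:
    # Assumes a well-formed "+HHMM" / "-HHMM" offset; other shapes would
    # need their own handling.
    sign = -1 if raw[:1] == b"-" else 1
    hours, mins = int(raw[1:3]), int(raw[3:5])
    return Offset(minutes=sign * (hours * 60 + mins), raw=raw)


def format_offset(offset: Offset) -> bytes:
    # Re-emit the stored text instead of deriving it from the minutes,
    # so b"-0000" does not silently become b"+0000".
    return offset.raw
```

With this shape, `-0000` and `+0000` both normalize to 0 minutes for queries, but each serializes back to its original bytes, so a dedicated boolean would no longer be needed for that particular case.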
Discussion of these options:
1 -> is annoying to handle, and needs continuous effort, but this is essentially what we are already doing with negative_utc_offset (a boolean to tell the difference between the "normal" timezone "+0000" for UTC, and the "-0000" timezone that appears in 1.8M commits)
2 -> brittle, as a botched migration or a bug in swh-model would make the deltas unusable
4 -> probably doubles or triples the size of the graph; but it's the only way to protect against bad migrations (short of recomputing all checksums in migrations). On the other hand, parser errors may go unnoticed because we would rely on these manifests.
2, 3, 4 -> currently, if the parsed object does not exactly match the manifest, we raise an error. This makes us notice any parsing error. If we go with any of these options, we will have to remove that error, so parser bugs may go unnoticed. (But they would be recoverable afterward, if and when we finally notice them)
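As a rough illustration of option 3 and of the round-trip check discussed above, the sketch below keeps the raw manifest only when re-serializing the parsed object does not reproduce it. `load_with_fallback` and its `parse` / `serialize` arguments are hypothetical placeholders, not swh-model functions; `git_object_id` follows git's standard object hashing (SHA-1 over `<type> <length>\0<content>`).

```python
import hashlib
from typing import Any, Callable, Optional, Tuple


def git_object_id(obj_type: bytes, manifest: bytes) -> str:
    # Standard git object id: SHA-1 of "<type> <length>\0" + content.
    header = obj_type + b" " + str(len(manifest)).encode() + b"\0"
    return hashlib.sha1(header + manifest).hexdigest()


def load_with_fallback(
    manifest: bytes,
    parse: Callable[[bytes], Any],
    serialize: Callable[[Any], bytes],
) -> Tuple[Any, Optional[bytes]]:
    """Return (model_object, raw_manifest).

    raw_manifest is None when the object round-trips losslessly, and the
    original bytes otherwise (instead of raising an error).
    """
    obj = parse(manifest)
    if serialize(obj) == manifest:
        return obj, None
    # Mismatch: either a "weird" object or a parser bug. Keeping the raw
    # manifest makes the object recoverable in both cases.
    return obj, manifest
```

Counting or logging how often the fallback branch is taken would be one way to keep some visibility on parser bugs once the hard error is gone.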