Faithfully store weird git objects
Closed, Migrated

Authored By

	vlorentz
	Sep 22 2021, 1:31 PM

Description

Our data model is based on git, and normalizes some of the data we read; this means that "weird" git objects cannot be represented.

This meta-task groups issues of this kind.

Possible options so far:

1. extend the data model to support them (like "negative_utc_offset", but somewhat generalized, e.g. store the text representation of offsets)
2. store a binary delta between the object we would generate from the model object and the original
3. store the full original manifest for all objects that can't be losslessly represented in the model, alongside the main graph storage
4. store the full original manifest for all objects, in a separate storage
5. give up on all/some "weird objects"

Some mixes of the options are possible, especially 1 with 2, 3, or 5.

Discussion of these options:

1 -> is annoying to handle and needs continuous effort, but this is essentially what we are already doing with negative_utc_offset (a boolean that tells the difference between the "normal" timezone "+0000" for UTC and the "-0000" timezone that appears in 1.8M commits)
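
To illustrate why that boolean is needed, here is a minimal sketch of the serialization side. The function name and signature are hypothetical, not swh-model's API. An integer number of minutes has only one zero, so the sign of a zero offset has to be carried separately:

```python
def format_offset(minutes: int, negative_utc: bool = False) -> bytes:
    """Render an offset the way it appears in a git manifest, e.g. b"+0200".

    Hypothetical helper: `negative_utc` only matters when `minutes` is 0.
    """
    sign = b"-" if minutes < 0 or negative_utc else b"+"
    hours, mins = divmod(abs(minutes), 60)
    return sign + b"%02d%02d" % (hours, mins)
```

With this encoding, `format_offset(0)` and `format_offset(0, negative_utc=True)` produce different bytes, which an int alone could not do. The flag covers only this one special case, which is why the option above suggests generalizing to a text representation.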

2 -> brittle, as a botched migration or a bug in swh-model would make the deltas unusable
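
As an illustration of this brittleness, here is a small sketch that uses Python's difflib as a stand-in for a real binary delta format. The function names and the sample manifest lines are made up. The point is that a delta encodes the original only relative to one exact output of the generator:

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Encode `target` as copy/insert instructions relative to `base`."""
    ops = SequenceMatcher(None, base, target, autojunk=False).get_opcodes()
    delta = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those bytes are simply not copied
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target from `base` and a delta made by make_delta."""
    out = b""
    for op in delta:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return out


original = b"author A <a@x> 0 -0000\n"    # what was in the repository
generated = b"author A <a@x> 0 +0000\n"   # what the model regenerates today
delta = make_delta(generated, original)

# Hypothetical output of a later, slightly different generator
changed = b"author  A <a@x> 0 +0000\n"
```

Here `apply_delta(generated, delta)` gives back `original`, but `apply_delta(changed, delta)` does not: once the generated bytes shift, the stored delta no longer reconstructs the original object.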

4 -> probably doubles or triples the size of the graph, but it's the only way to protect against bad migrations (short of recomputing all checksums in migrations). On the other hand, parser errors may go unnoticed because we would rely on these manifests.

2, 3, 4 -> currently, if the parsed object does not exactly match the manifest, we raise an error. This makes us notice any parsing error. If we go with any of these options, we will have to remove that error, so parser bugs may go unnoticed. (But they would be recoverable afterward, if and when we finally notice them.)
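
The check described above can be sketched as follows. Git computes an object ID as the SHA-1 of a `<type> <length>` header, a NUL byte, and the object body. The `parse` and `serialize` arguments are placeholders for the model's real conversion code, and `check_round_trip` is a hypothetical name:

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Git's object ID: SHA-1 over "<type> <length>", a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def check_round_trip(obj_type, original_body, parse, serialize):
    """Raise if re-serializing the parsed object changes its identifier.

    `parse` and `serialize` stand in for the model's conversion functions.
    """
    regenerated = serialize(parse(original_body))
    if git_object_id(obj_type, regenerated) != git_object_id(obj_type, original_body):
        raise ValueError("object cannot be losslessly represented in the model")
    return regenerated
```

Removing the `raise` is what options 2, 3 and 4 require, since a mismatch then becomes an expected case rather than a sign of a parser bug.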

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T3096 Efficient and reliable download via the Vault
Migrated	gitlab-migration	T887 Vault: "snapshot" cooker
Migrated	gitlab-migration	T3504 Make the git-bare cooker publicly available
Migrated	gitlab-migration	T3505 Make the git-bare cooker available to the staff and beta-testers in the production webapp
Migrated	gitlab-migration	T3506 Get rid of the concept of vault "object_type"
Migrated	gitlab-migration	T3507 prod: vault: Deploy v1.0.0
Migrated	gitlab-migration	T3503 staging: vault: Deploy v1.0.0
Migrated	gitlab-migration	T843 Vault: Add a "git bare" tarball cooker
Migrated	gitlab-migration	T3412 Deploy vault "git bare" tarball cooker (swh-vault v0.6)
Migrated	gitlab-migration	T3518 Enable vault cookers to access swh-graph
Migrated	gitlab-migration	T3543 Debian package python3-swh.graph.client
Migrated	gitlab-migration	T3565 Add an option in swh-web to allow git-bare cooking via the UI for beta-testers
Migrated	gitlab-migration	T3551 Fix git-fsck errors in the git-bare cooker
Migrated	gitlab-migration	T3552 Fix corrupted releases, revisions, and directories in the storage
Migrated	gitlab-migration	T3566 git-fsck errors in the git-bare cooker are not properly reported
Migrated	gitlab-migration	T3731 Try parallelizing directory_get_entries with an async swh-storage client
Migrated	gitlab-migration	T3924 Write mailmaps after cooking git-bare archives with display names?
Migrated	gitlab-migration	T3135 Improve integrity of ingested content
Migrated	gitlab-migration	T3594 Faithfully store weird git objects
Migrated	gitlab-migration	T3595 Support disordered directory entries in git
Migrated	gitlab-migration	T3596 Support "weird" permissions in directories
Migrated	gitlab-migration	T3598 Support revisions with "extra headers" not at the end
Migrated	gitlab-migration	T3752 Store/represent time offsets as strings
Migrated	gitlab-migration	T3819 Deploy swh.model 4.1.0 / swh.storage 0.41.0 to production
Migrated	gitlab-migration	T3753 Store original git manifests

Event Timeline

vlorentz triaged this task as Normal priority. Sep 22 2021, 1:31 PM

vlorentz created this task.

vlorentz updated the task description. Sep 22 2021, 1:42 PM

vlorentz added a parent task: T3552: Fix corrupted releases, revisions, and directories in the storage. Sep 24 2021, 3:13 PM

vlorentz updated the task description. Oct 15 2021, 2:38 PM

jayeshv added a subscriber: jayeshv. Oct 25 2021, 2:24 PM

vlorentz updated the task description. Oct 27 2021, 2:03 PM

vlorentz updated the task description.

vlorentz updated the task description. Oct 27 2021, 2:08 PM

Copy of an email I sent today:

Summary of this email: we will change the way we internally represent time offsets, and store the original manifest for all git objects that are otherwise unrepresentable. This will not affect users of the public API.

Conclusion

We just had a meeting to take a decision on the matter. Given the respective numbers of objects, we decided that we should:

alter the data model to *represent time offsets as strings* (instead of the current tuple of an int and the "negative_utc" boolean), because there are many such objects

for all objects that have other types of corruption (mostly directories with "040000" permissions or misordered entries, but also some revisions), *store their original git manifest next to them*. If we implemented this today, there would be under 100k such objects, mostly directories.

This will allow the archive to preserve all original objects, with a SWHIDv1 matching their Git identifier.
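
A rough sketch of what "storing the original manifest next to the object" could look like, assuming a simplified directory model. The class and field names, including `raw_manifest`, are illustrative and not taken from this task. The identifier is computed from the stored manifest when there is one, and from the regenerated canonical form otherwise:

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entry:
    name: bytes
    is_dir: bool
    target: bytes  # 20-byte SHA-1 of the entry's target


@dataclass(frozen=True)
class Directory:
    entries: Tuple[Entry, ...]
    raw_manifest: Optional[bytes] = None  # set only for "weird" objects

    def canonical_manifest(self) -> bytes:
        # Git sorts tree entries as if directory names ended with "/".
        # Simplification: every non-directory entry is a regular file.
        def sort_key(e: Entry) -> bytes:
            return e.name + (b"/" if e.is_dir else b"")

        parts = []
        for e in sorted(self.entries, key=sort_key):
            mode = b"40000" if e.is_dir else b"100644"
            parts.append(mode + b" " + e.name + b"\0" + e.target)
        return b"".join(parts)

    def git_id(self) -> str:
        body = self.raw_manifest
        if body is None:
            body = self.canonical_manifest()
        return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()
```

For example, a directory whose original tree wrote the mode as "040000" would keep `raw_manifest=b"040000 sub\0" + target`, so `git_id()` still matches the identifier Git computed for it, while ordinary directories carry no extra data.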

Next steps and API changes

We now need to implement these changes in the archive. It is not yet clear how this will affect the public API, but it will most likely only add two fields:

we can parse the offset string in the web server, to keep serving an int through the API
the manifest may be added as an extra field and/or API endpoint, but it won't affect the responses in any other way
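
As a sketch of the first point, the web server could derive the integer from the stored string along these lines. The function name is hypothetical, and returning `None` for offsets that do not look like `+HHMM` or `-HHMM` is an assumption of this example, not a decision recorded here:

```python
import re
from typing import Optional

_OFFSET_RE = re.compile(rb"([+-])(\d{2})(\d{2})")


def offset_minutes(offset: bytes) -> Optional[int]:
    """Convert a stored offset such as b"+0200" to minutes east of UTC."""
    match = _OFFSET_RE.fullmatch(offset)
    if match is None:
        # Non-canonical offset: this sketch has no integer to serve for it
        return None
    sign, hours, minutes = match.groups()
    value = int(hours) * 60 + int(minutes)
    return -value if sign == b"-" else value
```

Note that `b"+0000"` and `b"-0000"` both map to 0 here, so the distinction lives only in the stored string and the integer served through the API stays as it is today.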

So this should *not cause any disruption* to users of the API.

Finally, we will inject all objects we can repair into the archive.

"Future work"

However, this raised (or rather, reminded us of) the issue of the uniqueness of SWHIDs: because the same "abstract" directory can be represented in different ways in Git, it can have many Git identifiers. And because SWHIDv1 is compatible with Git, it can have many SWHIDv1s.
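
A small illustration of this point, with placeholder target hashes and a hypothetical `tree_id` helper: three serializations of the same two-entry directory, each hashed the way Git hashes a tree, yield three different identifiers.

```python
import hashlib


def tree_id(entries) -> str:
    """Git ID of a tree whose entries are serialized exactly as given."""
    body = b"".join(
        mode + b" " + name + b"\0" + target for mode, name, target in entries
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


blob, subtree = bytes(20), bytes(20)  # placeholder target hashes

canonical = [(b"100644", b"a.txt", blob), (b"40000", b"lib", subtree)]
zero_padded = [(b"100644", b"a.txt", blob), (b"040000", b"lib", subtree)]
disordered = list(reversed(canonical))

ids = {tree_id(v) for v in (canonical, zero_padded, disordered)}
```

`ids` ends up with three distinct values, even though all three variants describe the same names and targets.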

It does have one SWHID that is "more canonical" than the others, though: the one computed by the main Git implementation and by our swh-identify tool.

We decided to postpone a decision on this matter to the future discussion on how to design SWHIDv2, so that this issue can be addressed at the same time as the others (see https://forge.softwareheritage.org/T3134 and https://forge.softwareheritage.org/T3609).