Check integrity of directories, revisions, and releases
Closed, MigratedEdits Locked
Actions

Assigned To

Authored By

	olasd
	Oct 6 2015, 2:36 PM

Description

Git commit and tags have been imported considering their authors' names and emails have been encoded as utf-8. This assumption is most likely wrong in some cases, and we need to figure out which repositories we need to reimport to fix them.

Related Objects
Search...

Status	Assigned	Task
Migrated	gitlab-migration	T3096 Efficient and reliable download via the Vault
Migrated	gitlab-migration	T3135 Improve integrity of ingested content
Migrated	gitlab-migration	T399 (Re-)Compute data checksums before insertion
Migrated	gitlab-migration	T3010 Enable the validating storage proxy in production
Migrated	gitlab-migration	T887 Vault: "snapshot" cooker
Migrated	gitlab-migration	T3504 Make the git-bare cooker publicly available
Migrated	gitlab-migration	T3505 Make the git-bare cooker available to the staff and beta-testers in the production webapp
Migrated	gitlab-migration	T3506 Get rid of the concept of vault "object_type"
Migrated	gitlab-migration	T3507 prod: vault: Deploy v1.0.0
Migrated	gitlab-migration	T3503 staging: vault: Deploy v1.0.0
Migrated	gitlab-migration	T843 Vault: Add a "git bare" tarball cooker
Migrated	gitlab-migration	T3412 Deploy vault "git bare" tarball cooker (swh-vault v0.6)
Migrated	gitlab-migration	T3518 Enable vault cookers to access swh-graph
Migrated	gitlab-migration	T3543 Debian package python3-swh.graph.client
Migrated	gitlab-migration	T3565 Add an option in swh-web to allow git-bare cooking via the UI for beta-testers
Migrated	gitlab-migration	T3551 Fix git-fsck errors in the git-bare cooker
Migrated	gitlab-migration	T3552 Fix corrupted releases, revisions, and directories in the storage
Migrated	gitlab-migration	T3566 git-fsck errors in the git-bare cooker are not properly reported
Migrated	gitlab-migration	T3731 Try parallelizing directory_get_entries with an async swh-storage client
Migrated	gitlab-migration	T3924 Write mailmaps after cooking git-bare archives with display names?
Migrated	gitlab-migration	T3860 Document types/sources of object corruption
Migrated	gitlab-migration	T75 Check integrity of directories, revisions, and releases
Migrated	gitlab-migration	T366 Refactor the person table

Event Timeline

olasd created this task.Oct 6 2015, 2:36 PM

olasd renamed this task from Check integrity of Revisions and Tags to Check integrity of Revisions and Releases.

olasd raised the priority of this task from to High.

olasd updated the task description. (Show Details)

olasd added projects: Staff, Developers, Storage manager.

olasd updated the task description. (Show Details)

olasd added a subscriber: olasd.

olasd mentioned this in T184: Create a tarball from swh storage.Nov 3 2015, 7:12 PM

I have done some investigations on this in light of T272. Bottom line: not good: git is very proficient in the corner cases department.

On the bright side, setting aside the issue underlying T272, we get the right result most of the time.

Here are a few of the corner cases that I stumbled upon:

git can store a positive or negative null offset when your timezone is UTC (that is, +0000 and -0000 are valid offsets)
The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000)
This would all be fine, except that sometimes, there _is_ an explicit 0 +0000 at the end of the author line.
Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.

Some example releases:

Negative UTC offset: 97c8d2573a001f88e72d75f596cf86b12b82fd01 @ https://github.com/dankamongmen/sprezzos-installer
No timestamp in author line: c00ee9a43734e2a87b71ce5f6695ea7ede24bb98 @ https://github.com/zenczykowski/klibc
Explicit epoch timestamp in author line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC
Empty message, with empty line: dfcfdbdf877f252a44a8fd35930a0cf643c2d815 @ https://github.com/CroquetMagic/DownloadRankings
Empty message, without empty line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC (combo)

I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.

Dulwich seems to handle some of those special cases just fine.

Negative UTC offset: t._tag_timezone_neg_utc
No timestamp in author line: t.tag_time is None
Explicit epoch timestamp: (t.tag_timestamp, t.tag_time) = (0, 0)

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

A gpgsig attribute records information for signed commits;
a mergetag attribute records information about gpg signatures on tags merged in the commit.

Obviously nothing prevents git upstream from adding other such attributes in the future either.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

In T75#3505, @olasd wrote:

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

Nice catch.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).

(Which also reminds me of the fact that we should document somewhere our "schemas" for arbitrary metadata, incrementally, every time we add new metadata to the DB. Where would be a right place to do this? Somewhere under swh-storage I guess…)

In T75#3503, @olasd wrote:

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

Submitted https://github.com/jelmer/dulwich/pull/413

zack removed projects: Developers, Staff.Mar 10 2016, 5:51 PM

The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.

Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.

The swh.model implementation has been exercised on the failing cases above and all seems well.

We have another snag: git supports an encoding header, that we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible as I believe the field is general-purpose enough (even if only as an advisory setting).

Ok, one more reason why we must keep the original data from the dumps.

The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).

By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.

Some fixes we're trying:

Missing newlines at the beginning or end of commit messages
Negative UTC offsets in author/committer timestamps
Trailing spaces in author/committer names

We still have some ways to go:

some commits have a completely empty author name (author <bla@bla> timestamp offset instead of author <bla@bla> timestamp offset)
some commits have trailing null bytes (eww.)
some commits have extra headers that were previously ignored by our importer

Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.

We currently have 3'431'504 revisions that need fixing (on a total of 469'428'397, that is 0.73%).

Of those, 2'035'135 have had their fix computed using the heuristics listed above, which leaves 1'396'369 to fix (0.29% of the total number of revisions).

I'm running a new pass with the "null author" heuristic.

olasd created subtask T366: Refactor the person table.Apr 7 2016, 2:28 PM

olasd closed subtask T366: Refactor the person table as Resolved.Apr 7 2016, 6:30 PM

zack added a project: Restricted Project.Apr 27 2016, 9:03 PM

zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 27 2016, 9:21 PM

olasd mentioned this in T399: (Re-)Compute data checksums before insertion.May 11 2016, 7:16 PM

olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM

zack lowered the priority of this task from High to Normal.Jun 1 2016, 7:16 PM

zack removed a project: Restricted Project.

olasd mentioned this in rDSNIPd61b531a7d13: consistency: add consistency checking script for revisions and releases.Feb 10 2017, 6:29 PM

zack added a project: Restricted Project.Feb 12 2017, 6:14 PM

zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 12 2017, 6:38 PM

zack mentioned this in T680: fix off-by-1us timestamp in revisions coming from SVN loader.Feb 14 2017, 9:50 AM

olasd mentioned this in rDMODe0dbae331f22: identifiers: properly escape newlines in author specifications.Feb 15 2017, 5:46 PM

I've looked at the 31k releases with improper checksums.

17k were imported with git-svn and were expecting a timestamp with a "-0000" timezone
325 expected a null author
4230 expected a null date
70 had a mangled author. For instance, Perl has releases from Nicholas_Clark_<...>, _Nicholas_Clark_<...> and Nicholas_Clark__<...>. Some were from authors without a name (<email>).
7 had a weird offset not handled by libgit2 (+0159, +0359 or +0859)
6 were containing binary data and got mangled by libgit2; they were manually fixed from the source using dulwich.
2 were exercising a bug in swh.model (rDMODe0dbae331f22)

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

In T75#12186, @olasd wrote:

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

Due to T683, those releases were dangling, so the fact that their identifier was wrong didn't impact any link in our graph.

olasd mentioned this in rDSNIP36f5369066d4: consistency: add a tool to fetch revisions again.Feb 16 2017, 6:15 PM

zack edited projects, added Archive content; removed Storage manager.Apr 7 2017, 10:59 AM

olasd mentioned this in D2719: storages: Add model validation server side as well.Feb 27 2020, 3:19 PM

olasd mentioned this in T3012: Check all objects in the production storage/journal have a correct hash.Feb 4 2021, 6:11 PM

vlorentz merged a task: T3012: Check all objects in the production storage/journal have a correct hash.Feb 4 2021, 6:13 PM

vlorentz added a parent task: T3010: Enable the validating storage proxy in production.

vlorentz added a subscriber: vlorentz.

Dulwich 0.19.10 (released in january 2019) changed the way they handle signatures on annotated tags, so we silently drop all signatures since we started using it (probably whenever we upgraded loaders to Buster).

Fix: D6045

vlorentz added a revision: D6045: converters: Preserve GPG signatures on releases.Jul 30 2021, 10:58 AM

vlorentz removed a revision: D6045: converters: Preserve GPG signatures on releases.

Old versions of Dulwich (eg. 0.16.3, the version in stretch), dropped newlines at the end of the gpgsig header.

Example:

$ git clone https://github.com/progval/limnoria
$ pip3 install dulwich==0.19.11  # the version in buster

$ python3
>>> from dulwich.repo import Repo
>>> r = Repo('.')
>>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef']
>>> c.gpgsig
b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----\n'

$ pip3 install dulwich==0.16.3  # the version in stretch

>>> from dulwich.repo import Repo
>>> r = Repo('.')
>>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef']
>>> c.gpgsig
b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----'

vlorentz added a parent task: T3552: Fix corrupted releases, revisions, and directories in the storage.Sep 3 2021, 6:28 PM

vlorentz claimed this task.Sep 14 2021, 4:50 PM

vlorentz mentioned this in D6281: converters: Recompute hashes and check they match the originals.Sep 16 2021, 2:03 PM

Two other sources of mismatched checksums:

GitHub generated gpgsig without indenting subsequent lines, but adding a space at the end of each line but the last (artifact of deserializing then reserializing?). eg. the initial commit of https://github.com/astronomy-commons/genesis-images.git . a dulwich deserialization + swh-model serialization preserves most of this; except the line starting with "gpgsig" gets moved at the end
directory manifests committed in the wrong order

vlorentz mentioned this in T3586: Figure out what to do with 'misordered' directories in Cassandra.Sep 17 2021, 11:38 AM

vlorentz mentioned this in rDLDG6d7a998b1093: converters: Recompute hashes and check they match the originals.Sep 17 2021, 12:27 PM

Commits ingested before https://forge.softwareheritage.org/rDLDGfb03140e169f38f5ea0d64d81b524b4b967335f7 have an extra newline at the end of the mergetag header

We always add "gpgsig" as the last header of revisions, but sometimes it is not. eg. https://github.com/kuehnelth/secure-nyancatd/commit/10011016fc08401b6d3f05c2561a5dd3eb0a2641

swh:1:dir:880034219b47a123b97161de4e0d3301746cec75 (from https://github.com/kentnl/Dist-Zilla-PluginBundle-Author-KENTNL) has a single entry; we stored it with mode 40000 on that entry, but the original mode is 40755.

vlorentz renamed this task from Check integrity of Revisions and Releases to Check integrity of directories, revisions, and releases.Sep 23 2021, 3:55 PM

some new ones:

Missing leading/trailing/middle space in author/committer name or email
missing leading 0 in timezone (I found a couple of commits like this, migrated from hg or from a bot used by vim developers)
timezone stored as "+0000", but it is "+0059"/"+0159"/"+0359"/etc in the original commit
missing leading "\n\n" in messages of commits migrated with some hg-to-git tool(s?)
commit messages truncated on the first "\x00" byte (which can happen eg. when migrating from hg)
hash in parent header is in uppercase (????)
timezone stored as "+0000" but it is "+051800" in the original (??????)

"author xxx<yyy@googlemail.com> <xxx<yyy@googlemail.com>> 1282956323 +0200" parsed as "author xxx<yyy@googlemail.com> <xxx <yy@googlemail.com>> 1282956323 +0200" (yes there is a fullname in the email)
a few commits with this timezone: "--700", parsed as "+0700"
Some weird offsets parsed as "+0000": "+1400"/"+051800"/"+1558601"
looks like old git loaders tried to decode and re-encode name and/or email, which messed it up when not UTF8

some negative UTCs don't have the boolean set to true (the only example I have is also missing a space in the same line, so this might be why)
"0000" as offset (with no sign), dulwich can't even parse this
"+0575" as offset; parsed as "+0615"
"\r" in an email address was stripped

"nonce" header is *after* gpgsig
double "author" field in the original, and another commit with three "committer"....
"mergetag" headers with an extra newline at the end (current versions of the loader strip it, looks like older ones didn't)
"author xxx <yyy@gmail.com> <type 'int'> -0200" in original commit (dulwich obviously can't parse this)

douardda added a subscriber: douardda.Oct 4 2021, 10:43 AM

Aggregated stats at the moment:

953259	fixable
7104	found_but_unparseable
141782	mismatch_misc_directory
644068	mismatch_misc_release
468913	mismatch_misc_revision_svn
9812	mismatch_misc_tar
713699	recoverable
1	revision_missing_from_storage
36610	unrecoverable_no-origin
113	unrecoverable_not-in-swh-graph
14028	weird

How to read this:

"fixable" means my script could bruteforce variants of the object to get the right hash, without fetching the origin
"recoverable" means it could get some missing data from the origin, extract some headers, and add them to the original object to fix it.
"recoverable_misc_rev" means it was not able to find this set of headers, but could reconstruct a SWH revision object from the original Git object anyway. (this typically happens with weird authors, messages transformations that I did not implement in my "bruteforcer", or because the revision had more than one type of corruption)
"weird" means the object does not fit the SWH data model
"found_but_unparseable" means dulwich crashes when trying to parse the object (so in all likelihood, it also does not fitt the SWH data model)
"unrecoverable" means my script could not find the original object
"mismatch" are buckets I did not investigate yet.

Details:

27197	fixable_add_encoding
3387	fixable_add_encoding_and_leading_newlines
35883	fixable_author_or_committer
511	fixable_leading_newlines
13	fixable_move_nonce
3105	fixable_offset
455	fixable_space_and_negative_utc
80	fixable_space_and_newline_and_negative_utc
882628	fixable_trivial
14028	fixable_unpadded_time_offset
7104	found_but_unparseable
141782	mismatch_misc_directory
644068	mismatch_misc_release
468913	mismatch_misc_revision_svn
9812	mismatch_misc_tar
437102	recoverable_extra_headers
4830	recoverable_extra_headers_and_leading_newlines
473	recoverable_hg_branch_nullbytes_truncated
997	recoverable_misc_rev
246501	recoverable_missing_gpgsig
23796	recoverable_missing_mergetag_and_maybe_gpgsig
1	revision_missing_from_storage
36610	unrecoverable_no-origin
113	unrecoverable_not-in-swh-graph
436	weird_capitalized_revision_parent
726	weird-offset-misc
232	weird-offset=+0575
239	weird-offset=+40347417
69	weird-offset=--700
52	weird_misc_rev
231	weird_unordered_dir

vlorentz mentioned this in D6453: converters: Prevent zero dates from being converted to null dates..Oct 11 2021, 2:17 PM

this commit has lots of one-line "gpgsig" headers instead of a single "gpgsig" header with a multiline value: https://github.com/git-for-windows/git/commit/5f549aa2f78314ac37bbd436c8f80aea4c752e07

My script finished running on releases. Result: all 644k releases are recoverable (mostly just missing gpg signatures), except 75k whose origin does not exist anymore.

analysis on directories (some are also part of the fixable_trivial above, but I don't have the exact number, I lost it in my analysis):

19	fine_in_storage
88117	fixable_directory_perms
10660	recoverable_misc_dir
4690	unrecoverable_dir_no-origin
280	unrecoverable_dir_not-in-swh-graph
27846	weird-padded_perms_40000
9198	weird-padded_perms_40000_except_1
960	weird_misc_dir
231	weird_unordered_dir

fine_in_storage is an interesting category: directories corrupt in Kafka, but correct in Postgresql

the bulk of recoverable_misc_dir is:

entries in a wrong order,
mix of non-normalized permissions, and/or
corrupt names (eg. \r in an entry name turned into \n)

Finally, padded_perms_40000 is a directory that can be fixed by adding a leading 0 to all 40000 permissions (ie. a directory entry), and padded_perms_40000_except_1 is doing that for all 40000 permissions except one (in any position, but often the first one), which is a surprisingly common issue.

After further investigation, I can't find any directory that is in a completely bad order; they are either ordered like git does (by adding a / at the end of dir entries) or by assuming a null byte at the end of dir entries.

Directories that seemed to be in a completely bad order were actually badly ordered on our side because we had wrong permissions that made us confuse dir entries with file entries (solved in D6504).

Here is an updated (hopefully final) binning for dirs:

19	fine_in_storage
88117	fixable_directory_perms
78026	fixable_trivial_directory                  <---- revised definition, now this is only some easy perm fixes, not reorders
9302	recoverable_misc_dir                       <---- smaller than before, because some were re-bucketed as "weird"
2210	unrecoverable_dir_no-origin
280	unrecoverable_dir_not-in-swh-graph
1	weird-dir_with_filefirst_full_git_sort
613	weird-dir_with_filefirst_partial_sort
128	weird-dir_with_nullbyte_sort
123	weird-dir_with_unknown_sort_ok_in_kafka    <---- this seems to be only directories loaded since we enabled the journal-writer
27846	weird-padded_perms_40000
9198	weird-padded_perms_40000_except_1
1402	weird_dir_perms_misc
575	weird_misc_dir

vsellier added a subscriber: vsellier.Oct 20 2021, 3:48 PM

Great news: of the 469k corrupt SVN revisions, all but 14 (yes, 14) can be fixed simply by adding 1 microsecond to their timestamp.

I am investigating the remaining ones. They all belong to two repositories:

in http://coelacanth-lib.googlecode.com/svn/ revision 2 is stored as b174bdcb08984ff0fe2a3b963262a7600aad9b61 has the right data, but the wrong hash. The hash should be ecbbce1e452fbd1bc374ffbf4d2e8dd641b3a1e7. some of the descendants (but not all!) have the same issue
in http://umsnooker.googlecode.com/svn/ , revision 181 combines both issues: after the microsecond is incremented, its data is the same as the original, but the hash is still wrong. Two of its descendants are also corrupt, I did not investigate further (I assume it's the same issue)

vsellier mentioned this in T3683: cassandra - benchmark the vault.Nov 2 2021, 12:23 PM

vlorentz mentioned this in T3860: Document types/sources of object corruption.Jan 18 2022, 12:44 PM

vlorentz added a parent task: T3860: Document types/sources of object corruption.

vlorentz mentioned this in T3878: Fix existing corrupt objects.Jan 24 2022, 11:01 AM

vlorentz mentioned this in D7011: Revert "Restore 'offset' and 'negative_utc' arguments and make them optional".Jan 26 2022, 11:37 AM

With some more heuristics, retrying failed origins, and looking at dumps on banco (great idea from @olasd), I was able to bring down the total number of unrecoverable objects to 95k:

1860	unrecoverable_dir_no-origin
222	unrecoverable_dir_not-in-swh-graph
58	unrecoverable_dir_swh-graph-crashes
73751	unrecoverable_rel_no-origin
19595	unrecoverable_rev_no-origin
112	unrecoverable_rev_swh-graph-crashes

gitlab-migration changed the status of subtask T366: Refactor the person table from Resolved to Migrated.Jan 8 2023, 4:18 PM

This task has been migrated to GitLab.

Check integrity of directories, revisions, and releasesClosed, MigratedEdits LockedActions

Description

Related ObjectsSearch...

Event Timeline

Check integrity of directories, revisions, and releases
Closed, MigratedEdits Locked
Actions

Related Objects
Search...