Page MenuHomeSoftware Heritage

Check integrity of directories, revisions, and releases
Closed, MigratedEdits Locked

Description

Git commit and tags have been imported considering their authors' names and emails have been encoded as utf-8. This assumption is most likely wrong in some cases, and we need to figure out which repositories we need to reimport to fix them.

Related Objects

StatusAssignedTask
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration

Event Timeline

olasd renamed this task from Check integrity of Revisions and Tags to Check integrity of Revisions and Releases.
olasd raised the priority of this task from to High.
olasd updated the task description. (Show Details)
olasd updated the task description. (Show Details)
olasd added a subscriber: olasd.
olasd changed the task status from Open to Work in Progress.Jan 21 2016, 7:41 PM
olasd claimed this task.

I have done some investigations on this in light of T272. Bottom line: not good: git is very proficient in the corner cases department.

On the bright side, setting aside the issue underlying T272, we get the right result most of the time.

Here are a few of the corner cases that I stumbled upon:

  • git can store a positive or negative null offset when your timezone is UTC (that is, +0000 and -0000 are valid offsets)
  • The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000)
  • This would all be fine, except that sometimes, there _is_ an explicit 0 +0000 at the end of the author line.
  • Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.

Some example releases:

I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.

Dulwich seems to handle some of those special cases just fine.

  • Negative UTC offset: t._tag_timezone_neg_utc
  • No timestamp in author line: t.tag_time is None
  • Explicit epoch timestamp: (t.tag_timestamp, t.tag_time) = (0, 0)

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

  • A gpgsig attribute records information for signed commits;
  • a mergetag attribute records information about gpg signatures on tags merged in the commit.

Obviously nothing prevents git upstream from adding other such attributes in the future either.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

In T75#3505, @olasd wrote:

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

Nice catch.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).

(Which also reminds me of the fact that we should document somewhere our "schemas" for arbitrary metadata, incrementally, every time we add new metadata to the DB. Where would be a right place to do this? Somewhere under swh-storage I guess…)

In T75#3503, @olasd wrote:

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

Submitted https://github.com/jelmer/dulwich/pull/413

The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.

Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.

The swh.model implementation has been exercised on the failing cases above and all seems well.

We have another snag: git supports an encoding header, that we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible as I believe the field is general-purpose enough (even if only as an advisory setting).

Ok, one more reason why we must keep the original data from the dumps.

The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).

By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.

Some fixes we're trying:

  • Missing newlines at the beginning or end of commit messages
  • Negative UTC offsets in author/committer timestamps
  • Trailing spaces in author/committer names

We still have some ways to go:

  • some commits have a completely empty author name (author <bla@bla> timestamp offset instead of author <bla@bla> timestamp offset)
  • some commits have trailing null bytes (eww.)
  • some commits have extra headers that were previously ignored by our importer

Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.

We currently have 3'431'504 revisions that need fixing (on a total of 469'428'397, that is 0.73%).

Of those, 2'035'135 have had their fix computed using the heuristics listed above, which leaves 1'396'369 to fix (0.29% of the total number of revisions).

I'm running a new pass with the "null author" heuristic.

zack added a project: Restricted Project.Apr 27 2016, 9:03 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 27 2016, 9:21 PM
olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM
zack lowered the priority of this task from High to Normal.Jun 1 2016, 7:16 PM
zack removed a project: Restricted Project.
zack added a project: Restricted Project.Feb 12 2017, 6:14 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 12 2017, 6:38 PM

I've looked at the 31k releases with improper checksums.

  • 17k were imported with git-svn and were expecting a timestamp with a "-0000" timezone
  • 325 expected a null author
  • 4230 expected a null date
  • 70 had a mangled author. For instance, Perl has releases from Nicholas_Clark_<...>, _Nicholas_Clark_<...> and Nicholas_Clark__<...>. Some were from authors without a name (<email>).
  • 7 had a weird offset not handled by libgit2 (+0159, +0359 or +0859)
  • 6 were containing binary data and got mangled by libgit2; they were manually fixed from the source using dulwich.
  • 2 were exercising a bug in swh.model (rDMODe0dbae331f22)

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

In T75#12186, @olasd wrote:

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

Due to T683, those releases were dangling, so the fact that their identifier was wrong didn't impact any link in our graph.

Dulwich 0.19.10 (released in january 2019) changed the way they handle signatures on annotated tags, so we silently drop all signatures since we started using it (probably whenever we upgraded loaders to Buster).

Fix: D6045

Old versions of Dulwich (eg. 0.16.3, the version in stretch), dropped newlines at the end of the gpgsig header.

Example:

$ git clone https://github.com/progval/limnoria
$ pip3 install dulwich==0.19.11  # the version in buster

$ python3
>>> from dulwich.repo import Repo
>>> r = Repo('.')
>>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef']
>>> c.gpgsig
b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----\n'

$ pip3 install dulwich==0.16.3  # the version in stretch

>>> from dulwich.repo import Repo
>>> r = Repo('.')
>>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef']
>>> c.gpgsig
b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----'

Two other sources of mismatched checksums:

  1. GitHub generated gpgsig without indenting subsequent lines, but adding a space at the end of each line but the last (artifact of deserializing then reserializing?). eg. the initial commit of https://github.com/astronomy-commons/genesis-images.git . a dulwich deserialization + swh-model serialization preserves most of this; except the line starting with "gpgsig" gets moved at the end
  2. directory manifests committed in the wrong order

swh:1:dir:880034219b47a123b97161de4e0d3301746cec75 (from https://github.com/kentnl/Dist-Zilla-PluginBundle-Author-KENTNL) has a single entry; we stored it with mode 40000 on that entry, but the original mode is 40755.

vlorentz renamed this task from Check integrity of Revisions and Releases to Check integrity of directories, revisions, and releases.Sep 23 2021, 3:55 PM

some new ones:

  1. Missing leading/trailing/middle space in author/committer name or email
  2. missing leading 0 in timezone (I found a couple of commits like this, migrated from hg or from a bot used by vim developers)
  3. timezone stored as "+0000", but it is "+0059"/"+0159"/"+0359"/etc in the original commit
  4. missing leading "\n\n" in messages of commits migrated with some hg-to-git tool(s?)
  5. commit messages truncated on the first "\x00" byte (which can happen eg. when migrating from hg)
  6. hash in parent header is in uppercase (????)
  7. timezone stored as "+0000" but it is "+051800" in the original (??????)
  1. "author xxx<yyy@googlemail.com> <xxx<yyy@googlemail.com>> 1282956323 +0200" parsed as "author xxx<yyy@googlemail.com> <xxx <yy@googlemail.com>> 1282956323 +0200" (yes there is a fullname in the email)
  2. a few commits with this timezone: "--700", parsed as "+0700"
  3. Some weird offsets parsed as "+0000": "+1400"/"+051800"/"+1558601"
  4. looks like old git loaders tried to decode and re-encode name and/or email, which messed it up when not UTF8
  1. some negative UTCs don't have the boolean set to true (the only example I have is also missing a space in the same line, so this might be why)
  2. "0000" as offset (with no sign), dulwich can't even parse this
  3. "+0575" as offset; parsed as "+0615"
  4. "\r" in an email address was stripped
  1. "nonce" header is *after* gpgsig
  2. double "author" field in the original, and another commit with three "committer"....
  3. "mergetag" headers with an extra newline at the end (current versions of the loader strip it, looks like older ones didn't)
  4. "author xxx <yyy@gmail.com> <type 'int'> -0200" in original commit (dulwich obviously can't parse this)

Aggregated stats at the moment:

953259	fixable
7104	found_but_unparseable
141782	mismatch_misc_directory
644068	mismatch_misc_release
468913	mismatch_misc_revision_svn
9812	mismatch_misc_tar
713699	recoverable
1	revision_missing_from_storage
36610	unrecoverable_no-origin
113	unrecoverable_not-in-swh-graph
14028	weird

How to read this:

  • "fixable" means my script could bruteforce variants of the object to get the right hash, without fetching the origin
  • "recoverable" means it could get some missing data from the origin, extract some headers, and add them to the original object to fix it.
  • "recoverable_misc_rev" means it was not able to find this set of headers, but could reconstruct a SWH revision object from the original Git object anyway. (this typically happens with weird authors, messages transformations that I did not implement in my "bruteforcer", or because the revision had more than one type of corruption)
  • "weird" means the object does not fit the SWH data model
  • "found_but_unparseable" means dulwich crashes when trying to parse the object (so in all likelihood, it also does not fitt the SWH data model)
  • "unrecoverable" means my script could not find the original object
  • "mismatch" are buckets I did not investigate yet.

Details:

27197	fixable_add_encoding
3387	fixable_add_encoding_and_leading_newlines
35883	fixable_author_or_committer
511	fixable_leading_newlines
13	fixable_move_nonce
3105	fixable_offset
455	fixable_space_and_negative_utc
80	fixable_space_and_newline_and_negative_utc
882628	fixable_trivial
14028	fixable_unpadded_time_offset
7104	found_but_unparseable
141782	mismatch_misc_directory
644068	mismatch_misc_release
468913	mismatch_misc_revision_svn
9812	mismatch_misc_tar
437102	recoverable_extra_headers
4830	recoverable_extra_headers_and_leading_newlines
473	recoverable_hg_branch_nullbytes_truncated
997	recoverable_misc_rev
246501	recoverable_missing_gpgsig
23796	recoverable_missing_mergetag_and_maybe_gpgsig
1	revision_missing_from_storage
36610	unrecoverable_no-origin
113	unrecoverable_not-in-swh-graph
436	weird_capitalized_revision_parent
726	weird-offset-misc
232	weird-offset=+0575
239	weird-offset=+40347417
69	weird-offset=--700
52	weird_misc_rev
231	weird_unordered_dir

this commit has lots of one-line "gpgsig" headers instead of a single "gpgsig" header with a multiline value: https://github.com/git-for-windows/git/commit/5f549aa2f78314ac37bbd436c8f80aea4c752e07

My script finished running on releases. Result: all 644k releases are recoverable (mostly just missing gpg signatures), except 75k whose origin does not exist anymore.

analysis on directories (some are also part of the fixable_trivial above, but I don't have the exact number, I lost it in my analysis):

19	fine_in_storage
88117	fixable_directory_perms
10660	recoverable_misc_dir
4690	unrecoverable_dir_no-origin
280	unrecoverable_dir_not-in-swh-graph
27846	weird-padded_perms_40000
9198	weird-padded_perms_40000_except_1
960	weird_misc_dir
231	weird_unordered_dir

fine_in_storage is an interesting category: directories corrupt in Kafka, but correct in Postgresql

the bulk of recoverable_misc_dir is:

  • entries in a wrong order,
  • mix of non-normalized permissions, and/or
  • corrupt names (eg. \r in an entry name turned into \n)

Finally, padded_perms_40000 is a directory that can be fixed by adding a leading 0 to all 40000 permissions (ie. a directory entry), and padded_perms_40000_except_1 is doing that for all 40000 permissions except one (in any position, but often the first one), which is a surprisingly common issue.

After further investigation, I can't find any directory that is in a completely bad order; they are either ordered like git does (by adding a / at the end of dir entries) or by assuming a null byte at the end of dir entries.

Directories that seemed to be in a completely bad order were actually badly ordered on our side because we had wrong permissions that made us confuse dir entries with file entries (solved in D6504).

Here is an updated (hopefully final) binning for dirs:

19	fine_in_storage
88117	fixable_directory_perms
78026	fixable_trivial_directory                  <---- revised definition, now this is only some easy perm fixes, not reorders
9302	recoverable_misc_dir                       <---- smaller than before, because some were re-bucketed as "weird"
2210	unrecoverable_dir_no-origin
280	unrecoverable_dir_not-in-swh-graph
1	weird-dir_with_filefirst_full_git_sort
613	weird-dir_with_filefirst_partial_sort
128	weird-dir_with_nullbyte_sort
123	weird-dir_with_unknown_sort_ok_in_kafka    <---- this seems to be only directories loaded since we enabled the journal-writer
27846	weird-padded_perms_40000
9198	weird-padded_perms_40000_except_1
1402	weird_dir_perms_misc
575	weird_misc_dir

Great news: of the 469k corrupt SVN revisions, all but 14 (yes, 14) can be fixed simply by adding 1 microsecond to their timestamp.

I am investigating the remaining ones. They all belong to two repositories:

  1. in http://coelacanth-lib.googlecode.com/svn/ revision 2 is stored as b174bdcb08984ff0fe2a3b963262a7600aad9b61 has the right data, but the wrong hash. The hash should be ecbbce1e452fbd1bc374ffbf4d2e8dd641b3a1e7. some of the descendants (but not all!) have the same issue
  2. in http://umsnooker.googlecode.com/svn/ , revision 181 combines both issues: after the microsecond is incremented, its data is the same as the original, but the hash is still wrong. Two of its descendants are also corrupt, I did not investigate further (I assume it's the same issue)

With some more heuristics, retrying failed origins, and looking at dumps on banco (great idea from @olasd), I was able to bring down the total number of unrecoverable objects to 95k:

1860	unrecoverable_dir_no-origin
222	unrecoverable_dir_not-in-swh-graph
58	unrecoverable_dir_swh-graph-crashes
73751	unrecoverable_rel_no-origin
19595	unrecoverable_rev_no-origin
112	unrecoverable_rev_swh-graph-crashes