Git commit and tags have been imported considering their authors' names and emails have been encoded as utf-8. This assumption is most likely wrong in some cases, and we need to figure out which repositories we need to reimport to fix them.
Description
Event Timeline
I have done some investigations on this in light of T272. Bottom line: not good: git is very proficient in the corner cases department.
On the bright side, setting aside the issue underlying T272, we get the right result most of the time.
Here are a few of the corner cases that I stumbled upon:
- git can store a positive or negative null offset when your timezone is UTC (that is, +0000 and -0000 are valid offsets)
- The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000)
- This would all be fine, except that sometimes, there _is_ an explicit 0 +0000 at the end of the author line.
- Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.
Some example releases:
- Negative UTC offset: 97c8d2573a001f88e72d75f596cf86b12b82fd01 @ https://github.com/dankamongmen/sprezzos-installer
- No timestamp in author line: c00ee9a43734e2a87b71ce5f6695ea7ede24bb98 @ https://github.com/zenczykowski/klibc
- Explicit epoch timestamp in author line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC
- Empty message, with empty line: dfcfdbdf877f252a44a8fd35930a0cf643c2d815 @ https://github.com/CroquetMagic/DownloadRankings
- Empty message, without empty line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC (combo)
I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.
Dulwich seems to handle some of those special cases just fine.
- Negative UTC offset: t._tag_timezone_neg_utc
- No timestamp in author line: t.tag_time is None
- Explicit epoch timestamp: (t.tag_timestamp, t.tag_time) = (0, 0)
It currently breaks on *completely* empty messages, but the patch seems fairly simple.
Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.
- A gpgsig attribute records information for signed commits;
- a mergetag attribute records information about gpg signatures on tags merged in the commit.
Obviously nothing prevents git upstream from adding other such attributes in the future either.
As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.
Nice catch.
As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.
Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).
(Which also reminds me of the fact that we should document somewhere our "schemas" for arbitrary metadata, incrementally, every time we add new metadata to the DB. Where would be a right place to do this? Somewhere under swh-storage I guess…)
The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.
Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.
The swh.model implementation has been exercised on the failing cases above and all seems well.
We have another snag: git supports an encoding header, that we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible as I believe the field is general-purpose enough (even if only as an advisory setting).
The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).
By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.
Some fixes we're trying:
- Missing newlines at the beginning or end of commit messages
- Negative UTC offsets in author/committer timestamps
- Trailing spaces in author/committer names
We still have some ways to go:
- some commits have a completely empty author name (author <bla@bla> timestamp offset instead of author <bla@bla> timestamp offset)
- some commits have trailing null bytes (eww.)
- some commits have extra headers that were previously ignored by our importer
Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.
We currently have 3'431'504 revisions that need fixing (on a total of 469'428'397, that is 0.73%).
Of those, 2'035'135 have had their fix computed using the heuristics listed above, which leaves 1'396'369 to fix (0.29% of the total number of revisions).
I'm running a new pass with the "null author" heuristic.
I've looked at the 31k releases with improper checksums.
- 17k were imported with git-svn and were expecting a timestamp with a "-0000" timezone
- 325 expected a null author
- 4230 expected a null date
- 70 had a mangled author. For instance, Perl has releases from Nicholas_Clark_<...>, _Nicholas_Clark_<...> and Nicholas_Clark__<...>. Some were from authors without a name (<email>).
- 7 had a weird offset not handled by libgit2 (+0159, +0359 or +0859)
- 6 were containing binary data and got mangled by libgit2; they were manually fixed from the source using dulwich.
- 2 were exercising a bug in swh.model (rDMODe0dbae331f22)
After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).
Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.
All releases should now have a proper identifier.
Due to T683, those releases were dangling, so the fact that their identifier was wrong didn't impact any link in our graph.
Dulwich 0.19.10 (released in january 2019) changed the way they handle signatures on annotated tags, so we silently drop all signatures since we started using it (probably whenever we upgraded loaders to Buster).
Fix: D6045
Old versions of Dulwich (eg. 0.16.3, the version in stretch), dropped newlines at the end of the gpgsig header.
Example:
$ git clone https://github.com/progval/limnoria $ pip3 install dulwich==0.19.11 # the version in buster $ python3 >>> from dulwich.repo import Repo >>> r = Repo('.') >>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef'] >>> c.gpgsig b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----\n' $ pip3 install dulwich==0.16.3 # the version in stretch >>> from dulwich.repo import Repo >>> r = Repo('.') >>> c = r[b'40b469ef825bec99054bec0fcaec224abe2e9cef'] >>> c.gpgsig b'-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJZ9QLsCRBK7hj4Ov3rIwAAdHIIAEcXahMkbk+vJSZyaop7EOeq\n+QJcMnF4teMAdjIeCitoXsq0iIFLK2iWCLqTUeuWyxI58IjUT5KrRK/xjo/tqEz0\ndHswrnN+vHsffx/aOGyqTQiG6Hn3GZheqyj60DgqY0WhpM38DpdEYtgSTmMXgeSm\nagAUsN9tfpXPnkKTmRQaicQadQOmTDoWNaaQsFJt33Fk9KfcnpVCY/9GFAYC04vN\niU5GFwHeSmfgHIOVN1bNEhAbdF+TgpsadHA3mu5+eywzH5a5th7IYluJXgknR4vZ\nMQ2fqY4fcH56Nv5dW7v0LN2/p1N9BLxpFHDm8Knu8hNMlt7naDyaxkh6sS5cubU=\n=Pi4N\n-----END PGP SIGNATURE-----'
Two other sources of mismatched checksums:
- GitHub generated gpgsig without indenting subsequent lines, but adding a space at the end of each line but the last (artifact of deserializing then reserializing?). eg. the initial commit of https://github.com/astronomy-commons/genesis-images.git . a dulwich deserialization + swh-model serialization preserves most of this; except the line starting with "gpgsig" gets moved at the end
- directory manifests committed in the wrong order
Commits ingested before https://forge.softwareheritage.org/rDLDGfb03140e169f38f5ea0d64d81b524b4b967335f7 have an extra newline at the end of the mergetag header
We always add "gpgsig" as the last header of revisions, but sometimes it is not. eg. https://github.com/kuehnelth/secure-nyancatd/commit/10011016fc08401b6d3f05c2561a5dd3eb0a2641
swh:1:dir:880034219b47a123b97161de4e0d3301746cec75 (from https://github.com/kentnl/Dist-Zilla-PluginBundle-Author-KENTNL) has a single entry; we stored it with mode 40000 on that entry, but the original mode is 40755.
some new ones:
- Missing leading/trailing/middle space in author/committer name or email
- missing leading 0 in timezone (I found a couple of commits like this, migrated from hg or from a bot used by vim developers)
- timezone stored as "+0000", but it is "+0059"/"+0159"/"+0359"/etc in the original commit
- missing leading "\n\n" in messages of commits migrated with some hg-to-git tool(s?)
- commit messages truncated on the first "\x00" byte (which can happen eg. when migrating from hg)
- hash in parent header is in uppercase (????)
- timezone stored as "+0000" but it is "+051800" in the original (??????)
- "author xxx<yyy@googlemail.com> <xxx<yyy@googlemail.com>> 1282956323 +0200" parsed as "author xxx<yyy@googlemail.com> <xxx <yy@googlemail.com>> 1282956323 +0200" (yes there is a fullname in the email)
- a few commits with this timezone: "--700", parsed as "+0700"
- Some weird offsets parsed as "+0000": "+1400"/"+051800"/"+1558601"
- looks like old git loaders tried to decode and re-encode name and/or email, which messed it up when not UTF8
- some negative UTCs don't have the boolean set to true (the only example I have is also missing a space in the same line, so this might be why)
- "0000" as offset (with no sign), dulwich can't even parse this
- "+0575" as offset; parsed as "+0615"
- "\r" in an email address was stripped
- "nonce" header is *after* gpgsig
- double "author" field in the original, and another commit with three "committer"....
- "mergetag" headers with an extra newline at the end (current versions of the loader strip it, looks like older ones didn't)
- "author xxx <yyy@gmail.com> <type 'int'> -0200" in original commit (dulwich obviously can't parse this)
Aggregated stats at the moment:
953259 fixable 7104 found_but_unparseable 141782 mismatch_misc_directory 644068 mismatch_misc_release 468913 mismatch_misc_revision_svn 9812 mismatch_misc_tar 713699 recoverable 1 revision_missing_from_storage 36610 unrecoverable_no-origin 113 unrecoverable_not-in-swh-graph 14028 weird
How to read this:
- "fixable" means my script could bruteforce variants of the object to get the right hash, without fetching the origin
- "recoverable" means it could get some missing data from the origin, extract some headers, and add them to the original object to fix it.
- "recoverable_misc_rev" means it was not able to find this set of headers, but could reconstruct a SWH revision object from the original Git object anyway. (this typically happens with weird authors, messages transformations that I did not implement in my "bruteforcer", or because the revision had more than one type of corruption)
- "weird" means the object does not fit the SWH data model
- "found_but_unparseable" means dulwich crashes when trying to parse the object (so in all likelihood, it also does not fitt the SWH data model)
- "unrecoverable" means my script could not find the original object
- "mismatch" are buckets I did not investigate yet.
Details:
27197 fixable_add_encoding 3387 fixable_add_encoding_and_leading_newlines 35883 fixable_author_or_committer 511 fixable_leading_newlines 13 fixable_move_nonce 3105 fixable_offset 455 fixable_space_and_negative_utc 80 fixable_space_and_newline_and_negative_utc 882628 fixable_trivial 14028 fixable_unpadded_time_offset 7104 found_but_unparseable 141782 mismatch_misc_directory 644068 mismatch_misc_release 468913 mismatch_misc_revision_svn 9812 mismatch_misc_tar 437102 recoverable_extra_headers 4830 recoverable_extra_headers_and_leading_newlines 473 recoverable_hg_branch_nullbytes_truncated 997 recoverable_misc_rev 246501 recoverable_missing_gpgsig 23796 recoverable_missing_mergetag_and_maybe_gpgsig 1 revision_missing_from_storage 36610 unrecoverable_no-origin 113 unrecoverable_not-in-swh-graph 436 weird_capitalized_revision_parent 726 weird-offset-misc 232 weird-offset=+0575 239 weird-offset=+40347417 69 weird-offset=--700 52 weird_misc_rev 231 weird_unordered_dir
this commit has lots of one-line "gpgsig" headers instead of a single "gpgsig" header with a multiline value: https://github.com/git-for-windows/git/commit/5f549aa2f78314ac37bbd436c8f80aea4c752e07
My script finished running on releases. Result: all 644k releases are recoverable (mostly just missing gpg signatures), except 75k whose origin does not exist anymore.
analysis on directories (some are also part of the fixable_trivial above, but I don't have the exact number, I lost it in my analysis):
19 fine_in_storage 88117 fixable_directory_perms 10660 recoverable_misc_dir 4690 unrecoverable_dir_no-origin 280 unrecoverable_dir_not-in-swh-graph 27846 weird-padded_perms_40000 9198 weird-padded_perms_40000_except_1 960 weird_misc_dir 231 weird_unordered_dir
fine_in_storage is an interesting category: directories corrupt in Kafka, but correct in Postgresql
the bulk of recoverable_misc_dir is:
- entries in a wrong order,
- mix of non-normalized permissions, and/or
- corrupt names (eg. \r in an entry name turned into \n)
Finally, padded_perms_40000 is a directory that can be fixed by adding a leading 0 to all 40000 permissions (ie. a directory entry), and padded_perms_40000_except_1 is doing that for all 40000 permissions except one (in any position, but often the first one), which is a surprisingly common issue.
After further investigation, I can't find any directory that is in a completely bad order; they are either ordered like git does (by adding a / at the end of dir entries) or by assuming a null byte at the end of dir entries.
Directories that seemed to be in a completely bad order were actually badly ordered on our side because we had wrong permissions that made us confuse dir entries with file entries (solved in D6504).
Here is an updated (hopefully final) binning for dirs:
19 fine_in_storage 88117 fixable_directory_perms 78026 fixable_trivial_directory <---- revised definition, now this is only some easy perm fixes, not reorders 9302 recoverable_misc_dir <---- smaller than before, because some were re-bucketed as "weird" 2210 unrecoverable_dir_no-origin 280 unrecoverable_dir_not-in-swh-graph 1 weird-dir_with_filefirst_full_git_sort 613 weird-dir_with_filefirst_partial_sort 128 weird-dir_with_nullbyte_sort 123 weird-dir_with_unknown_sort_ok_in_kafka <---- this seems to be only directories loaded since we enabled the journal-writer 27846 weird-padded_perms_40000 9198 weird-padded_perms_40000_except_1 1402 weird_dir_perms_misc 575 weird_misc_dir
Great news: of the 469k corrupt SVN revisions, all but 14 (yes, 14) can be fixed simply by adding 1 microsecond to their timestamp.
I am investigating the remaining ones. They all belong to two repositories:
- in http://coelacanth-lib.googlecode.com/svn/ revision 2 is stored as b174bdcb08984ff0fe2a3b963262a7600aad9b61 has the right data, but the wrong hash. The hash should be ecbbce1e452fbd1bc374ffbf4d2e8dd641b3a1e7. some of the descendants (but not all!) have the same issue
- in http://umsnooker.googlecode.com/svn/ , revision 181 combines both issues: after the microsecond is incremented, its data is the same as the original, but the hash is still wrong. Two of its descendants are also corrupt, I did not investigate further (I assume it's the same issue)
With some more heuristics, retrying failed origins, and looking at dumps on banco (great idea from @olasd), I was able to bring down the total number of unrecoverable objects to 95k:
1860 unrecoverable_dir_no-origin 222 unrecoverable_dir_not-in-swh-graph 58 unrecoverable_dir_swh-graph-crashes 73751 unrecoverable_rel_no-origin 19595 unrecoverable_rev_no-origin 112 unrecoverable_rev_swh-graph-crashes