Page MenuHomeSoftware Heritage

Check integrity of Revisions and Releases
Started, Work in Progress, NormalPublic

Description

Git commit and tags have been imported considering their authors' names and emails have been encoded as utf-8. This assumption is most likely wrong in some cases, and we need to figure out which repositories we need to reimport to fix them.

Event Timeline

olasd created this task.Oct 6 2015, 2:36 PM
olasd renamed this task from Check integrity of Revisions and Tags to Check integrity of Revisions and Releases.
olasd raised the priority of this task from to High.
olasd updated the task description. (Show Details)
olasd updated the task description. (Show Details)
olasd added a subscriber: olasd.
olasd changed the task status from Open to Work in Progress.Jan 21 2016, 7:41 PM
olasd claimed this task.

I have done some investigations on this in light of T272. Bottom line: not good: git is very proficient in the corner cases department.

On the bright side, setting aside the issue underlying T272, we get the right result most of the time.

Here are a few of the corner cases that I stumbled upon:

  • git can store a positive or negative null offset when your timezone is UTC (that is, +0000 and -0000 are valid offsets)
  • The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000)
  • This would all be fine, except that sometimes, there _is_ an explicit 0 +0000 at the end of the author line.
  • Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.
olasd added a comment.Jan 21 2016, 8:22 PM

Some example releases:

olasd added a comment.Jan 21 2016, 8:40 PM

I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.

Dulwich seems to handle some of those special cases just fine.

  • Negative UTC offset: t._tag_timezone_neg_utc
  • No timestamp in author line: t.tag_time is None
  • Explicit epoch timestamp: (t.tag_timestamp, t.tag_time) = (0, 0)

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

olasd added a comment.Jan 26 2016, 4:26 PM

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

  • A gpgsig attribute records information for signed commits;
  • a mergetag attribute records information about gpg signatures on tags merged in the commit.

Obviously nothing prevents git upstream from adding other such attributes in the future either.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

zack added a subscriber: zack.Jan 26 2016, 10:45 PM
In T75#3505, @olasd wrote:

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initally expected.

Nice catch.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).

(Which also reminds me of the fact that we should document somewhere our "schemas" for arbitrary metadata, incrementally, every time we add new metadata to the DB. Where would be a right place to do this? Somewhere under swh-storage I guess…)

In T75#3503, @olasd wrote:

It currently breaks on *completely* empty messages, but the patch seems fairly simple.

Submitted https://github.com/jelmer/dulwich/pull/413

olasd added a comment.Mar 30 2016, 3:11 PM

The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.

Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.

The swh.model implementation has been exercised on the failing cases above and all seems well.

We have another snag: git supports an encoding header, that we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible as I believe the field is general-purpose enough (even if only as an advisory setting).

Ok, one more reason why we must keep the original data from the dumps.

olasd added a comment.Apr 5 2016, 4:46 PM

The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).

By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.

Some fixes we're trying:

  • Missing newlines at the beginning or end of commit messages
  • Negative UTC offsets in author/committer timestamps
  • Trailing spaces in author/committer names

We still have some ways to go:

  • some commits have a completely empty author name (author <bla@bla> timestamp offset instead of author <bla@bla> timestamp offset)
  • some commits have trailing null bytes (eww.)
  • some commits have extra headers that were previously ignored by our importer

Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.

olasd added a comment.Apr 6 2016, 2:04 PM

We currently have 3'431'504 revisions that need fixing (on a total of 469'428'397, that is 0.73%).

Of those, 2'035'135 have had their fix computed using the heuristics listed above, which leaves 1'396'369 to fix (0.29% of the total number of revisions).

I'm running a new pass with the "null author" heuristic.

zack added a project: Restricted Project.Apr 27 2016, 9:03 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 27 2016, 9:21 PM
olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM
zack lowered the priority of this task from High to Normal.Jun 1 2016, 7:16 PM
zack removed a project: Restricted Project.
zack added a project: Restricted Project.Feb 12 2017, 6:14 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 12 2017, 6:38 PM
olasd added a comment.Feb 15 2017, 6:03 PM

I've looked at the 31k releases with improper checksums.

  • 17k were imported with git-svn and were expecting a timestamp with a "-0000" timezone
  • 325 expected a null author
  • 4230 expected a null date
  • 70 had a mangled author. For instance, Perl has releases from Nicholas_Clark_<...>, _Nicholas_Clark_<...> and Nicholas_Clark__<...>. Some were from authors without a name (<email>).
  • 7 had a weird offset not handled by libgit2 (+0159, +0359 or +0859)
  • 6 were containing binary data and got mangled by libgit2; they were manually fixed from the source using dulwich.
  • 2 were exercising a bug in swh.model (rDMODe0dbae331f22)

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

olasd added a comment.Feb 15 2017, 6:07 PM
In T75#12186, @olasd wrote:

After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).

Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.

All releases should now have a proper identifier.

Due to T683, those releases were dangling, so the fact that their identifier was wrong didn't impact any link in our graph.