
Compare swh-loader-svn's hash computation with git-svn's
Closed, ResolvedPublic

Description

From what I gather thus far.

git-svn by default:

  • adds revision-id and revision-uuid at the end of the svn commit message (impacts hash computation)
  • is aware of svn conventions (impacts hash computation)
  • adds @<repo-uuid> for every user it encounters (impacts hash computation)
  • does not clone empty directories (impacts hash computation)

However, swh-svn:

  • adds revision-id and revision-uuid in a git standard way (using extra-header)
  • is not svn convention aware
  • does not add @<repo-uuid> for every user it encounters
  • does take into account the empty directories that svn might throw at it (impacts hash computation)

This task is about making the two tools' behavior converge so that we can use git-svn to validate swh-svn.
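
To make the "impacts hash computation" remarks concrete, here is a minimal sketch of how a git commit id depends on the author string, the message and any extra headers. The `git-svn-id:` trailer and the `user@<repo-uuid>` author follow git-svn's default output; the extra-header names (`svn_repo_uuid`, `svn_revision`), the uuid, the url and the timestamp are assumptions made up for illustration, not necessarily what swh-svn writes.

```python
import hashlib

def git_object_id(obj_type: str, payload: bytes) -> str:
    """sha1 over '<type> <size>' + NUL + payload, the way git names objects."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

def commit_id(tree: str, author: str, message: str, extra_headers=()) -> str:
    # Fixed, arbitrary timestamp so that only the inputs under study vary.
    stamp = "1464000000 +0000"
    lines = [f"tree {tree}",
             f"author {author} {stamp}",
             f"committer {author} {stamp}"]
    # Extra headers sit between the standard headers and the message.
    lines += [f"{key} {value}" for key, value in extra_headers]
    payload = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", payload.encode())

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree
UUID = "00000000-0000-0000-0000-000000000000"            # hypothetical repo uuid

# Same tree, same log message, three ways of carrying the svn metadata.
plain = commit_id(EMPTY_TREE, "alice <alice>", "Initial import\n")
git_svn_style = commit_id(
    EMPTY_TREE,
    f"alice <alice@{UUID}>",
    f"Initial import\n\ngit-svn-id: svn://example.org/repo/trunk@1 {UUID}\n",
)
swh_style = commit_id(
    EMPTY_TREE,
    "alice <alice>",
    "Initial import\n",
    extra_headers=[("svn_repo_uuid", UUID), ("svn_revision", "1")],
)

for label, value in [("plain", plain), ("git-svn", git_svn_style), ("swh", swh_style)]:
    print(label, value)
```

The three ids all differ even though the tree is identical, which is why the metadata and author conventions have to be aligned before any revision hash can be compared.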

This may reveal errors in swh-svn's logic.

Event Timeline

swh-svn has been updated with the following behavior.

  • It adds @<repo-uuid> for every user it encounters (this seems like a good idea to borrow)
  • An option has been added to prevent revision-id and revision-uuid from being added (from my point of view, this is only a test option as it breaks the swh-svn update behavior)
  • An option has been added to ignore empty folders (to mimic git-svn)

Now, the current status.
The hashes converge at the beginning of the revision-log history, but they diverge at some point; yours truly is currently investigating why.

This may reveal errors in swh-svn's logic.

Indeed, it did.
There is at least one bug on the git-hash-update function which is in charge of updating the tree hash.
It's supposed to be smart and minimize the reads on disk (otherwise, it's a recursive deep tree walk and hash computations of every blob/folder encountered for each svn commit).

Why: There are some absolutely insanely huge svn repositories (in terms of svn logs and tree depths) where we can't possibly read trees from scratch at every commit... The first version of swh-svn did and this got slower at every new round (as expected).
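
The idea behind such a "smart" update can be sketched as follows: cache the hash of every directory, and when a path changes, invalidate only that path's ancestors so that the next computation reuses everything else. This is a minimal in-memory illustration, not the actual git-hash-update code; the class and method names are hypothetical, file contents live in a nested dict instead of on disk, and blob hashes are not cached here.

```python
import hashlib

def _oid(obj_type, payload):
    """Raw git object id: sha1 over '<type> <size>' + NUL + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()

class CachedTreeHasher:
    """Hash a nested dict (dict = directory, bytes = file) like a git tree."""

    def __init__(self, tree):
        self.tree = tree
        self.cache = {}          # directory path (tuple of names) -> digest
        self.trees_hashed = 0    # how many directories were actually rehashed

    def digest(self, path=()):
        if path in self.cache:
            return self.cache[path]
        node = self.tree
        for part in path:
            node = node[part]
        entries = []
        for name, child in node.items():
            raw = name.encode()
            if isinstance(child, dict):
                # git sorts directories as if their name ended with '/'
                entries.append((raw + b"/", b"40000", raw, self.digest(path + (name,))))
            else:
                entries.append((raw, b"100644", raw, _oid("blob", child)))
        payload = b"".join(mode + b" " + raw + b"\0" + sha
                           for _, mode, raw, sha in sorted(entries))
        self.trees_hashed += 1
        self.cache[path] = _oid("tree", payload)
        return self.cache[path]

    def write(self, path, data):
        """Create or modify a file, then invalidate only its ancestors."""
        *dirs, name = path.split("/")
        node = self.tree
        for part in dirs:
            node = node.setdefault(part, {})
        node[name] = data
        for depth in range(len(dirs) + 1):
            self.cache.pop(tuple(dirs[:depth]), None)

hasher = CachedTreeHasher({"a": {"x": b"1"}, "b": {"y": b"2"}})
before = hasher.digest()         # first pass: root, "a" and "b" are hashed
hasher.write("a/x", b"changed")
after = hasher.digest()          # second pass: only "a" and the root
print(hasher.trees_hashed)
```

The fragile part is the invalidation: any path that is changed without its ancestors being invalidated yields a stale hash, which is the kind of bug the comparison with git-svn is meant to surface.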

There is also yet another divergence that I have not had time to investigate yet.
I tried a run without the smart (and broken for now) function (so the function does walk over the trees at every svn commit) and I also had divergences in hash.
This one might be due to the svn conventions followed by git-svn and not by swh-svn.

ardumont added a comment. Edited May 24 2016, 4:14 PM

Follow up on this.

There is at least one bug on the git-hash-update function which is in charge of updating the tree hash.

It's not one bug but several; at least they all have the same root cause: the 'empty folder' option.

Context:
git svn clone does not check out empty directories.
swh-svn does (same code as the swh tarball loader), so an empty directory counts for something in the tree hash computation.
And, as svn checkout does include empty directories... we've got our first divergence.
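
Here is a small sketch of why an empty directory matters for the tree hash, and of what an "ignore empty folders" option has to do. The tree is a nested dict (dict = directory, bytes = file content); the function names and the `ignore_empty_dirs` flag are illustrative assumptions, not the swh-svn API.

```python
import hashlib

def _oid(obj_type, payload):
    """Raw git object id: sha1 over '<type> <size>' + NUL + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()

def _tree_digest(tree, ignore_empty_dirs):
    """Return the tree digest, or None when the directory must be pruned."""
    entries = []
    for name, node in tree.items():
        raw = name.encode()
        if isinstance(node, dict):
            digest = _tree_digest(node, ignore_empty_dirs)
            if digest is None:
                continue  # empty (or only made of empty dirs): skip it
            # git sorts directories as if their name ended with '/'
            entries.append((raw + b"/", b"40000", raw, digest))
        else:
            entries.append((raw, b"100644", raw, _oid("blob", node)))
    if ignore_empty_dirs and not entries:
        return None
    payload = b"".join(mode + b" " + raw + b"\0" + digest
                       for _, mode, raw, digest in sorted(entries))
    return _oid("tree", payload)

def tree_hex(tree, ignore_empty_dirs=False):
    digest = _tree_digest(tree, ignore_empty_dirs)
    if digest is None:           # the root itself ended up empty
        digest = _oid("tree", b"")
    return digest.hex()

checkout = {"README": b"hello\n", "docs": {}}   # "docs" is an empty directory
print(tree_hex(checkout))                          # counts "docs"
print(tree_hex(checkout, ignore_empty_dirs=True))  # as if "docs" did not exist
print(tree_hex({"README": b"hello\n"}))            # same as the previous line
```

Note that the pruning is recursive: a directory that only contains empty directories disappears as well.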

So the swh-svn option was to make up for that.

But then, we can hit strange cases where:

  • an empty directory appears in the changelog as deleted. But as it was empty, it was already ignored in previous runs, so that path never existed in memory; when looking up that path, boom. This could not happen in the standard mode (before the option) since nothing was ignored (except for the .svn folders, but they stay ignored all the time, so no surprises there).
  • a directory is filled only with empty directories. From git-svn's standpoint, it should then not exist either, so we must remove it from memory too.
  • a directory sees its subdirectories become empty ones...

Well some corner cases are just a mess.
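
One way to picture these corner cases (this is a hypothetical alternative bookkeeping, not what swh-svn does): only track file paths, and derive the set of "existing" directories from them. Deleting a path that was never tracked then becomes a no-op, and a directory vanishes by itself once nothing is left under it.

```python
from pathlib import PurePosixPath

def visible_dirs(files):
    """Directories that exist from git-svn's standpoint: those holding a file."""
    dirs = set()
    for path in files:
        dirs.update(str(parent) for parent in PurePosixPath(path).parents
                    if str(parent) != ".")
    return dirs

def apply_changes(files, added=(), deleted=()):
    """Apply one revision's changed paths to the set of tracked files."""
    result = set(files)
    for path in deleted:
        prefix = path.rstrip("/") + "/"
        # Removes the file itself or everything below a deleted directory;
        # a path that was never tracked (an ignored empty dir) changes nothing.
        result = {f for f in result if f != path and not f.startswith(prefix)}
    result.update(added)
    return result

files = {"src/main.c", "src/lib/util.c"}

# Case 1: the changelog deletes a directory that was empty, hence never tracked.
files = apply_changes(files, deleted=["empty-dir"])

# Cases 2 and 3: "src/lib" loses its last file, so it stops existing.
files = apply_changes(files, deleted=["src/lib/util.c"])
print(sorted(files), sorted(visible_dirs(files)))
```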

TL;DR: when I run into one of those cases, I ask for a recomputation.

There is also yet another divergence that i did not have time to investigate yet.
I tried a run without the smart (and broken for now) function (so the function does walk over the trees at every svn commit) and I also had divergences in hash.

Important detail: I meant revision hash divergences.
What the previous bug showed me is that the revision hash divergence systematically originated from a tree hash divergence.
Thus, this might, and I would even dare say should, be the same issue.
I need to check this one once I'm sure I've solved the previous issue.


Also, a conclusion on an oral discussion between @zack and me.
We want to compare the git tree hash computation between git-svn and swh-svn.

ardumont added a comment. Edited May 27 2016, 9:43 AM

Same hash tree

Some more progress on this: using the svn://svn.debian.org/svn/pkg-fox repository as the basis for comparison, I get the exact same tree hashes between git-svn with the right options (--no-metadata) and swh-svn with the right options as well (no update, ignore empty folders).

But...

But I hit one last snag using another, larger repository: svn://svn.debian.org/svn/glibc-bsd.

What new problem now?

A new case appeared, about svn's keyword expansion.
svn checkout expands keywords and thus modifies the files' content. git-svn does not do that.
Thus, yet another diff for that particular case.

For multiple reasons, we want to avoid that expansion:

  • since it's done at checkout time, it impedes performance
  • leaving the files unmodified can simplify diffing with external tools (other than svn's).

Inhibit then...

Now, I'm trying to find a way to inhibit that expansion behavior in the pysvn api (it's the lib we use to interact with svn).
I have found nothing in the documentation about an option to avoid that thus far...

I'm on it.

ardumont added a comment. Edited May 27 2016, 10:02 AM

sources

Doc

Nothing in docs about it.
pysvn docs: http://pysvn.stage.tigris.org/docs/pysvn_prog_ref.html#pysvn_client_export

Issue tracker

pysvn issue tracker: http://pysvn.tigris.org/ds/viewForumSummary.do?dsForumId=1334

The only issue that speaks about 'keyword expansion' is http://pysvn.tigris.org/issues/show_bug.cgi?id=183 and it's svn export (dump a working copy at specified revision) related.
By the way, this option reflects the svn cli...

The svn checkout cli does not refer to any keyword expansion (ignoring or otherwise) either. Only svn export does.

To give some visibility on this:

keyword expansion means that keywords like $Id$, $Date$, etc. are expanded with some values at checkout time.
This changes the file's content and, in effect, dirties the tree hash computation.
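
A minimal sketch of the effect on hashes: the blob id of a file whose `$Id$` keyword has been expanded no longer matches the id of the content as stored in the repository. The file content and the expanded values (file name, revision, date, user) are made up for illustration.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """git blob id: sha1 over 'blob <size>' + NUL + content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Content as stored in the repository (what git-svn sees).
stored = b"/* $Id$ */\nint main(void) { return 0; }\n"

# Content after an svn checkout with the Id keyword enabled on that file.
checked_out = stored.replace(
    b"$Id$", b"$Id: main.c 148 2006-07-28 21:30:43Z sally $"
)

print(blob_id(stored))
print(blob_id(checked_out))  # differs, so every parent tree hash differs too
```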

TL;DR

I hit a wall with this keyword expansion issue.
Either we avoid it and we are slower or we have the keyword expanded (and we diverge from git-svn's tree computations but not from git's).

Detail

The only way to avoid the keyword expansion (which is systematically done with the svn checkout) is to use svn export instead.

I adapted the code in a branch and tested for that:

  • hash tree comparison (ok)
  • performance (worse than with keyword expansion). I used the same local svn mirror of svn://svn.debian.org/svn/pkg-fox and compared the results.

The performance is worse than with keyword expansion... (that is, than the version currently being tested on worker01).

There are lots of difficulties to overcome. The root of them all is:

  • svn checkout cleans the tree at each revision
  • svn export does not

This forces us to take some extra steps between svn export phases.

Using the same folder to export into, either:

  • we remove the previous tree and export in the same folder

--> add the cost of cleaning up previous trees (slow)

  • we try to remove only what changed between revisions, using the changed paths.

--> This seems like the best approach but this fails.
Somehow the changed paths are not complete between revisions.
In my tests, the trees come up dirty with orphan files.

  • we use a new folder for the next export and remove (asynchronously) the previous revision folders

--> In the current state, swh.model.git.update_checksums is no longer usable since it uses absolute paths.
So we fall back to the default function, which walks and computes all the trees for each new export.
This is slow.

  • we use a new folder for the next export and remove (asynchronously) the previous folders

--> In yet another branch, I updated the swh.model.git.update_checksum function to use relative paths instead of absolute ones.
This adds extra path transformation steps, which are costly for the overall speed.
It also has other side effects I won't detail here (the delayed loading of contents, for one, which must be broken now).

In explaining all this, i see some extra possibilities:

  • diffs (i tried but got nowhere since the pysvn api is not clear to me)
  • cache

Note

I tried to check how git-svn does its computation, but the Perl code is somewhat difficult to read...
https://github.com/git/git/blob/master/git-svn.perl
So for now, I have not progressed enough on that part.

My hypothesis for git-svn is that they orchestrate svn export and diffs themselves.

I'm open for suggestion.

The first step to take is to decide whether we want keyword expansion or not.

My take is that we do not want keywords expanded when importing an SVN
repository, for the following reasons:

-> keyword expansion is a "client side" operation that might or might not be enabled
   (and here I'm very surprised that pysvn has not an option to disable it!)

-> each keyword is expanded to values corresponding to the "last modification"
   in the repository which might be reproduced more or less automatically
   (see http://svnbook.red-bean.com/en/1.4/svn.advanced.props.special.keywords.html)
   if we keep track of the svn version number in our data model

So, keeping the unexpanded version of the file would be the safest bet for
having some kind of normal form of it.

At the same time, notice that when looking up a source file that is under svn,
and from a user that uses expansion, one will not find it in Software Heritage
unless we unexpand the keywords...
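
As an illustration of what "unexpanding" could look like, here is a rough sketch that collapses expanded keywords back to their bare form with a regular expression. It is an assumption-laden approximation: it handles the standard keyword names and their aliases, ignores the svn:keywords property (which decides which keywords are actually active on a file), and would also collapse the fixed-width `$Rev::   $` syntax, which the repository stores as-is.

```python
import re

# Standard Subversion keywords and their aliases.
_KEYWORDS = (
    rb"Id|Header|Date|LastChangedDate|Revision|LastChangedRevision|Rev"
    rb"|Author|LastChangedBy|HeadURL|URL"
)
# "$Keyword$" or "$Keyword: some value $", on a single line.
_EXPANDED = re.compile(rb"\$(" + _KEYWORDS + rb")(?::[^$\n]*)?\$")

def unexpand_keywords(content: bytes) -> bytes:
    """Collapse '$Keyword: value $' to '$Keyword$'; leave everything else alone."""
    return _EXPANDED.sub(rb"$\1$", content)

expanded = b"/* $Id: main.c 148 2006-07-28 21:30:43Z sally $ */\n"
print(unexpand_keywords(expanded))
```

A lookup service could try both the submitted content and its unexpanded form, so that a file coming from a working copy with expansion enabled still matches the archived normal form.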

ardumont added a comment. Edited May 30 2016, 12:00 PM

My take is that we do not want keywords expanded when importing an SVN
So, keeping the unexpanded version of the file would be the safest bet for
having some kind of normal form of it.

Yes, that's my train of thought too.

(and here I'm very surprised that pysvn has not an option to disable it!)

Even the svn cli's svn checkout does not have one.
The only option available in this regard is in the svn export interface.
Symmetrically, the pysvn documentation mentions it for export (only in pysvn's source code repository though, not on the website).

At the same time, notice that when looking up a source file that is under svn,
and from a user that uses expansion, one will not find it in Software Heritage
unless we unexpand the keywords...

Right!

By we, do you mean the softwareheritage api?

My hypothesis for git-svn is that they orchestrate svn export and diffs themselves.

Nope, not svn export, a crafted svn client (using libsvn that is) that handles svn diffs.

git-svn spec...ish

I'm beginning to understand some of the git-svn code after browsing some more of it.

/rant By the way, Perl code is not easy to read... well, at least the big functions which mutate stuff in every possible way...

They use some self-crafted perl modules:

  • Git::SVN::Fetcher - tree delta consumer for "git svn fetch". (It uses a SVN::Delta from libsvn)
  • Git::SVN::Ra - Subversion remote access functions for git-svn

Roughly, from what I gather as of now, they have their own equivalent of a dulwich client, for svn ^^.
(context: dulwich is the git library used by swh-loader-git).
They retrieve the diffs and leverage git to store them.

In comparison, swh-svn does not speak any svn language.
Leveraging pysvn, it checks out (or updates or exports, depending on the git branch I use) on disk.
And in any case, it walks the tree to compute hashes (as in swh-loader-dir).
(Well, there is a part which minimizes the walk on disk).

Anyway, i'm looking into how to do something similar.

Also, to be fair to myself, learning from past experience (git, regarding the libgit2 and dulwich changes), I tried to come up with something similar before.

When I selected the svn lib, I tried to find one with that ability but found none, python3-compatible that is.
IIRC jelmer/subvertpy does but it's python2 --> This is dulwich's author...

Given the performance, i'll try harder.

svn keywords

Regarding svn keywords, according to the man page (which I read some more), they simply ignore all the svn keywords.

From the end of the man page:

BUGS
----

We ignore all SVN properties except svn:executable.  Any unhandled
properties are logged to $GIT_DIR/svn/<refname>/unhandled.log

Note

Note for others:

  • The module's docstring (or whatever it's named) is at the end of the perl module. I absolutely forgot...

ardumont added a comment. Edited May 31 2016, 12:38 PM

IIRC jelmer/subvertpy does but it's python2 --> This is dulwich's author...

Taking a look again, there are python3 related branches (python3, python3-dev, python3-branch) on the github repository.
So testing it (branch python3-branch) and thus far it's setup.py installable and the examples work fine ^^
What interests me is that it proposes an svn api similar to the one i saw in git-svn (based on RemoteAccess, the ra i mentioned earlier that is).

Note:
The debian package python-subvertpy would need to be updated, 'cause it's python2 only right now.
-> I'm not sure how to deal with packaging when the code to package lives on 2 different git branches in the upstream repository...
Anyway, one problem at a time.

ardumont added a comment. Edited Jun 12 2016, 10:03 PM

TL;DR

Using the same principle as git-svn (remote-access client), we are now faster than git-svn either with or without storage.

Comparison - 3

The second iteration was about finding out whether:

  • the loader was fast (it was not)
  • the loader did its computation right (it did with some adjustment options)

To improve the speed, we investigated further how git-svn does its job (we were not able to determine how to reuse git-svn directly). It uses a Remote Access approach, meaning it talks directly to the svn server.

Using the same approach as git-svn (with subvertpy instead of pysvn), we were able to adapt the code accordingly.
So now, the loader speaks to the server and computes the hashes along the way.

Here is the comparison (using the same options as comparison 2 to have the same hashes as git-svn for trees and commits):

Comparison - no swh-storage

No swh-storage is used here.
Only writes on disk and hash computations, then the results are written to the logs (for swh-loader).

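| Type       | # Revs | Url                                      | git-svn (git svn clone) | # Revs updated | swh-svn            | Ratio (swh-svn / git-svn) |
|------------+--------+------------------------------------------+-------------------------+----------------+--------------------+---------------------------|
| small      |    145 | svn://svn.debian.org/svn/pkg-fox/        |                 447.074 |            145 |  66.56195999495685 |                0.14888354 |
| medium     |   6006 | svn://svn.debian.org/svn/glibc-bsd/      |                4740.928 |           6073 | 338.06703379005194 |               0.071308198 |
| large      |  10707 | svn://svn.debian.org/svn/pkg-voip/       |                8592.398 |          10707 |  536.1971881072968 |               0.062403672 |
| very large |  34523 | svn://svn.debian.org/svn/python-modules/ |               36627.907 |          34523 |  3310.805134777911 |               0.090390235 |
| very large |  48013 | svn://svn.debian.org/svn/pkg-gnome/      |               71902.080 |          49061 | 3519.9480575090274 |               0.048954746 |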

Log extract (P82)

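Comparison - with swh-storage

| Type       | # Revs | Url                                      | git-svn (git svn clone) | # Revs updated | swh-svn            | Ratio (swh-svn / git-svn) |
|------------+--------+------------------------------------------+-------------------------+----------------+--------------------+---------------------------|
| small      |    145 | svn://svn.debian.org/svn/pkg-fox/        |                 447.074 |            145 | 117.17479287087917 |                0.26209261 |
| medium     |   6006 | svn://svn.debian.org/svn/glibc-bsd/      |                4740.928 |           6073 |  962.4779040301219 |                0.20301466 |
| large      |  10707 | svn://svn.debian.org/svn/pkg-voip/       |                8592.398 |          10707 | 3028.8734963517636 |                0.35250619 |
| very large |  34523 | svn://svn.debian.org/svn/python-modules/ |               36627.907 |          34523 | 22651.265359937213 |                0.61841550 |
| very large |  48013 | svn://svn.debian.org/svn/pkg-gnome/      |               71902.080 |          49061 | 43629.105175915174 |                0.60678502 |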

Log extract (cf. P83)
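
For reading the tables: the Ratio column appears to be the swh-svn timing divided by the git-svn timing (the numbers agree with that reading), so a ratio of 0.15 means swh-svn took about 15% of git-svn's time. A quick check on the "no swh-storage" figures, copied from the table (the unit is not stated in the task):

```python
# (git-svn timing, swh-svn timing) from the "no swh-storage" comparison.
timings = {
    "pkg-fox": (447.074, 66.56195999495685),
    "glibc-bsd": (4740.928, 338.06703379005194),
    "pkg-voip": (8592.398, 536.1971881072968),
    "python-modules": (36627.907, 3310.805134777911),
    "pkg-gnome": (71902.080, 3519.9480575090274),
}

ratios = {}
for name, (git_svn, swh_svn) in timings.items():
    ratios[name] = swh_svn / git_svn
    # The inverse of the ratio reads as "how many times faster".
    print(f"{name}: ratio={ratios[name]:.4f}, speedup={git_svn / swh_svn:.1f}x")
```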

Fully detailed comparison is at https://forge.softwareheritage.org/diffusion/DLDSVN/browse/master/docs/comparison-git-svn-swh-svn.org (org format so copy/paste and activate org-mode - M-x org-mode)

ardumont closed this task as Resolved. Jun 14 2016, 11:46 AM