diff --git a/docs/swh-loader-svn.txt b/docs/swh-loader-svn.txt
index 532b305..2f56aaa 100644
--- a/docs/swh-loader-svn.txt
+++ b/docs/swh-loader-svn.txt
@@ -1,195 +1,195 @@
swh-loader-svn
==============

The goal is to load a svn repository's lifetime logs to swh-storage.

This must be able to deal with:
- unknown svn repository (resulting in a new origin)
- known svn repository (starting from the last known svn revision and updating from that moment on)

For a detailed comparison of the versions' speed, please refer to
https://forge.softwareheritage.org/diffusion/DLDSVN/browse/master/docs/comparison-git-svn-swh-svn.org.

# v1

## Description

This is a first basic implementation, a proof-of-concept of sorts.

It checks out the svn repository on disk at each revision, then walks the tree at that revision to compute the swh hashes and store them in swh-storage.

Conclusion: It is possible but it is slow.

We used git-svn to check whether the hash computations matched, and they did not.

The swh hash computations are correct, though. They just do not rely on the same assumptions as git-svn, so the hashes mismatch.
git-svn:
- does not check out empty folders
- adds metadata at the end of the svn commit message (by default; this can be avoided, but then no update, in the swh sense, is possible afterwards)
- integrates the svn repository's uuid in the git revision for the commit author (author@)

swh-loader-svn:
- checks out empty folders (which are then used in swh hashes)
- adds metadata the git way (leveraging git's extra-header slot), so that we can deal with svn repository updates

## Pseudo

```
Checkout/Update/Export on disk the first known revision or 1 if unknown repository
When revision is not 1
  Check whether the history is altered (revision hashes won't match)
  If it is altered, log an error message and stop
  Otherwise continue
Iterate over logs from revision 1 to revision head_revision
  The revision is now rev
  checkout/update/export the revision at rev
  walk the tree directory for that revision and compute hashes
  compute the revision hash
  send the blobs for storage in swh
  send the directories for storage in swh
  send the revision for storage in swh
done
Send the occurrence pointing to the last revision seen
```

## Notes

The SVN checkout/update instructions are faster than export since they leverage svn diffs. But:
- they do keyword expansion (bad for diffs with external tools, so bad for swh)
- we need to ignore the .svn folder since it is present (this required adapting the code to ignore folders based on a pattern, which is slow as well)

The SVN export instruction is slower than the two previous ones since it does not use diffs. But:
- there is an option to ignore keyword expansion (good)
- no folder has to be omitted during hash computation from disk (good)

All in all, there is a trade-off to make here. Still, both approaches were tested (with much code adapted in the lower-level api) and both are slow.
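As an illustration of the checkout-based walk described in the v1 notes, here is a minimal sketch that hashes every file of a working copy while skipping the `.svn` folder. It assumes a git-compatible blob hash (`sha1` over a `blob <size>\0` header plus the content), which the text does not state explicitly; the function names `git_blob_sha1` and `hash_checkout` are hypothetical and not taken from the loader's code.

```python
import hashlib
import os


def git_blob_sha1(data: bytes) -> str:
    """Return the git-style blob identifier of `data` (assumed hash scheme)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def hash_checkout(root: str, ignored=(".svn",)) -> dict:
    """Map each file path (relative to `root`) to its blob hash.

    Folders whose name is listed in `ignored` are never visited.
    """
    hashes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                hashes[os.path.relpath(path, root)] = git_blob_sha1(f.read())
    return hashes
```

With an export instead of a checkout, the `ignored` tuple could simply be left empty, since no `.svn` folder exists on disk.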
# v2

## Description

The v2 is more about:
- adding options to match git-svn's hash computations
- trying to improve the performance

So, options are added:
- remove empty folders when encountered (to ignore them during hash computations)
- add an extra commit line to the svn commit message
- (de)activate the svn loader's update routine
- (de)activate the sending of contents/directories/revisions/occurrences/releases to swh-storage
- (de)activate the extra-header metadata in the revision hash (thus deactivating the svn update options altogether)

As this is meant to be a genuine implementation, we adapted the revision message to also use the repository's uuid in the author's email.

Optimizations are done as well:
- instead of walking the disk from the top level at each revision (slow for huge repositories like svn.apache.org), compute the lowest common path from the svn log's changed paths between the previous revision and the current one. Then, walk only that path to compute the updated hashes. Then update the in-memory hashes from that path up to the top level (less i/o, less RAM used).
- in the loader-core, lifting the existing swh-storage api to filter only the missing entities on the client side (there are already filters on the server side, but filtering client-side uses less RAM. Especially for blobs, since we extract the data from disk and store it in RAM, this is now done only for unknown blobs and still before updating the disk with a new revision's content)
- in the loader-core, caches are added as well

Now the computations, with the right options, match git-svn. Still, the performance compared to git-svn is bad.

Taking a closer look at git-svn, it uses a remote-access approach, that is, discussing directly with the svn server and computing the hashes at the same time.

That is the base for the v3 implementation.

## Pseudo

Relative to v1, the logic does not change, only the inner implementation.

# v3

## Description

This one is about performance only.
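As an aside on the v2 optimization (walking only the lowest common path of the changed paths, then propagating hashes upwards), the following sketch shows one way to compute that path. It assumes the changed paths are absolute svn-style paths such as `/trunk/src/a.c`; the helpers `lowest_common_dir` and `ancestors` are hypothetical names, and taking the parent folder of every changed path is a simplification that may walk slightly more than strictly needed.

```python
import posixpath


def lowest_common_dir(changed_paths):
    """Return the deepest folder containing every changed path.

    Each path is first reduced to its parent folder, so a single changed
    file yields the folder that holds it.
    """
    parents = [posixpath.dirname(p) for p in changed_paths]
    return posixpath.commonpath(parents)


def ancestors(path):
    """Yield `path`, then each parent folder up to the top level."""
    while True:
        yield path
        parent = posixpath.dirname(path)
        if parent == path:
            break
        path = parent


# Folder to walk on disk for this revision.
start = lowest_common_dir(["/trunk/src/a.c", "/trunk/src/lib/b.c"])
# Folders whose in-memory hashes must then be refreshed, bottom-up.
to_refresh = list(ancestors(start))
```

Only `start` is read from disk; the folders in `to_refresh` can be rehashed from the in-memory hashes of their children.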
It leverages another low-level library (subvertpy) to permit the use of the same approach as git-svn, the remote access.

The idea is to replay the logs and diffs on disk and compute hashes closely in time (not as close as possible though, cf. ## Note below).

## Pseudo

```
Do we know the repository (with swh-svn-update option on)?
  Yes, extract the last swh known revision from swh-storage
    set start-rev to last-swh-known-revision
    Export on disk the svn at start-rev
    Compute revision hashes (from top level tree's hashes + commit log for that revision)
    Does the revision hash match the one in swh-storage? (<=> Is the history unaltered?)
      No
        log an error message and stop
      Yes
        keep the current in-memory hashes (for the following update steps if any)
  No
    set start-rev to 1
Set head-revision to latest svn repository's head revision
When start-rev is the same as head revision, we are done.
Otherwise continue
Iterate over the stream of svn-logs from start-rev to head-rev
  The current revision is rev
  replay the diffs from previous rev (rev - 1) to rev and compute hashes along
  compute the revision hash
  send the blobs for storage in swh
  send the directories for storage in swh
  send the revision for storage in swh
done
Send the occurrence pointing to the last revision seen
```

## Note

-There could be margin for improvment in the actual implementation
+There could be margin for improvement in the actual implementation
here. We apply the diff on files first and then open the file to compute its hashes afterwards. If we applied the diff and computed the hashes directly, we would gain one round-trip. Depending on the files/directory ratio, this could be significant.

This approach also has the following benefits:
- no keyword expansion
- no need to ignore the .svn folder (since it does not exist)
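The improvement suggested in the note (hashing while the patched content is written, instead of reopening the file afterwards) can be sketched as follows. This is only an illustration under assumptions: it supposes the patched content is available as a stream of byte chunks, which is simpler than real svn deltas, and it uses a plain `sha1` rather than whatever set of hashes the loader actually computes. The name `write_and_hash` is hypothetical.

```python
import hashlib


def write_and_hash(chunks, path):
    """Write `chunks` to `path` and return the sha1 hex digest of the bytes written.

    The digest is updated as each chunk is written, so the file is never
    read back from disk.
    """
    digest = hashlib.sha1()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

The returned digest is the same as the one obtained by reading the file back and hashing it, with one fewer pass over the data per file.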