swh-loader-svn.txt
No OneTemporary
Actions

Size

6 KB

Subscribers

None

swh-loader-svn.txt
View Options

	swh-loader-svn
	==============

	The goal is to load a svn repository's lifetime logs to swh-storage.

	This must be able to deal with:
	- unknown svn repository (resulting in a new origin)
	- known svn repository (starting up from the last known svn revision
	and update from that moment on)

	For a full detailed comparison between version's speed, please refer
	to https://forge.softwareheritage.org/diffusion/DLDSVN/browse/master/docs/comparison-git-svn-swh-svn.org.


	# v1

	## Description

	This is a first basic implementation, a proof-of-concept of sort.
	Based on checkout-ing on disk the svn repository at each revision and
	walking the tree at svn revision to compute the swh hashes and store
	them in swh-storage.

	Conclusion: It is possible but it is slow.

	We use git-svn to check if the hash computations were a match, and
	they were not. The swh hashes computation are corrects though.

	It's just not the same assertions as git-svn so the hashes mismatch.

	git-svn:
	- does not checkout empty folders
	- adds metadata at the end of the svn commit message (by default, this
	can be avoided but then no update, in the swh sense, is possible
	afterwards)
	- integrates the svn repository's uuid in the git revision for the
	commit author (author@<repo-uuid>)

	swh-loader-svn:
	- checkouts empty folder (which are then used in swh hashes)
	- adds metadata the git way (leveraging git's extra-header slot), so
	that we can deal with svn repository updates

	## Pseudo

	```
	Checkout/Update/Export on disk the first known revision or 1 if unknown repository
	When revision is not 1
	Check the history is altered (revision hashes won't match)
	If it is altered, log an error message and stop
	Otherwise continue

	Iterate over logs from revision 1 to revision head_revision
	The revision is now rev
	checkout/update/export the revision at rev
	walk the tree directory for that revision and compute hashes
	compute the revision hash
	send the blobs for storage in swh
	send the directories for storage in swh
	send the revision for storage in swh
	done

	Send the occurrence pointing to the last revision seen
	```

	## Notes

	SVN checkout/update instructions are faster than export since they
	leverage svn diffs. But:
	- they do keyword expansion (so bad for diffs with external tools so
	bad for swh)
	- we need to ignore .svn folder since it's present (this needed some
	adaptation in code to ignore folder based on pattern so slow as
	well)

	SVN export instruction is slower than the 2 previous ones since they
	don't use diffs. But:
	- there is one option to ignore keyword expansion (good)
	- no folder are to be omitted during hash computation from disk (good)

	All in all, there is a trade-off here to choose from.

	Still, everything was tested (with much code adapted in the lower
	level api) and both are slow.

	# v2

	## Description

	The v2 is more about:
	- adding options to match the git-svn's hash computations
	- trying to improve the performance

	So, options are added:
	- remove empty folder when encountered (to ignore during hash
	computations)
	- add an extra commit line to the svn commit message
	- (de)activate the loader svn's update routine
	- (de)activate the sending of
	contents/directories/revisions/occurrences/releases to swh-storage
	- (de)activate the extra-header metadata in revision hash (thus
	deactivating the svn update options altogether)

	As this is thought as genuine implementation, we adapted the revision
	message to also use the repository's uuid in the author's email.

	Also, optimization are done as well:
	- instead of walking the disk from the top leve at each revision (slow
	for huge repository like svn.apache.org), compute from the svn log's
	changed paths between the previous revision and the current one, the
	lowest common path. Then, walk only that path to compute the updated
	hashes. Then update from that path to the top level the in-memory
	hashes (less i/o, less RAM are used).

	- in the loader-core, lifting the existing swh-storage api to filter
	only the missing entities on the client side (there are already
	filters on the server side but filtering client-side uses less RAM.
	Especially for blobs, since we extract the data from disk and store
	it in RAM, this is now done only for unknown blobs and still before
	updating the disk with a new revision content)

	- in the loader-core, cache are added as well

	Now the computations, with the right options, are a match with git-svn.
	Still, the performance against git-svn are bad.

	Taking a closer look at git-svn, they used a remote-access approach,
	that is discussing directly with the svn server and computing at the
	same time the hashes.
	That is the base for the v3 implementation.

	## Pseudo

	Relatively to the v1, the logic does not change, only the inner
	implementation.

	# v3

	## Description

	This one is about performance only.

	Leveraging another low-level library (subvertpy) to permit the use of
	the same git-svn approach, the remote-access.

	The idea is to replay the logs and diffs on disk and compute hashes
	closely in time (not as close as possible though, cf. ## Note below).

	## Pseudo

	```
	Do we know the repository (with swh-svn-update option on)?
	Yes, extract the last swh known revision from swh-storage
	set start-rev to last-swh-known-revision
	Export on disk the svn at start-rev
	Compute revision hashes (from top level tree's hashes + commit log for that revision)
	Does the revision hash match the one in swh-storage? (<=> Is the history altered?)
	No
	log an error message and stop
	Yes
	keep the current in-memory hashes (for the following updates steps if any)
	No
	set start-rev to 1

	Set head-revision to latest svn repository's head revision
	When start-rev is the same as head revision, we are done.
	Otherwise continue

	Iterate over the stream of svn-logs from start-rev to head-rev
	The current revision is rev
	replay the diffs from previous rev (rev - 1) to rev and compute hashes along
	compute the revision hash
	send the blobs for storage in swh
	send the directories for storage in swh
	send the revision for storage in swh
	done

	Send the occurrence pointing to the last revision seen
	```

	## Note

	There could be margin for improvement in the actual implementation
	here.

	We apply the diff on files first and then open the file to compute its
	hashes afterwards.

	If we'd apply the diff and compute the hashes directly, we'd gain one
	round-trip. Depending on the ratio files/directory, this could be
	significant.

	This approach has also the following benefits:
	- no keyword expansion
	- no need to ignore .svn folder (since it does not exist)

File Metadata

Mime Type: text/plain
Expires: Jun 4 2025, 7:07 PM (10 w, 1 d ago)
Storage Engine: blob
Storage Format: Raw Data
Storage Handle: 3375999

swh-loader-svn.txtNo OneTemporaryActions

swh-loader-svn.txtView Options

File Metadata

Event Timeline

swh-loader-svn.txt
No OneTemporary
Actions

swh-loader-svn.txt
View Options