Page Menu
Home
Software Heritage
Search
Configure Global Search
Log In
Files
F8393035
swh-loader-svn.txt
No One
Temporary
Actions
View File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Flag For Later
Size
6 KB
Subscribers
None
swh-loader-svn.txt
View Options
swh-loader-svn
==============
The goal is to load a svn repository's lifetime logs to swh-storage.
This must be able to deal with:
- unknown svn repository (resulting in a new origin)
- known svn repository (starting up from the last known svn revision
and update from that moment on)
For a full detailed comparison between version's speed, please refer
to https://forge.softwareheritage.org/diffusion/DLDSVN/browse/master/docs/comparison-git-svn-swh-svn.org.
# v1
## Description
This is a first basic implementation, a proof-of-concept of sort.
Based on checkout-ing on disk the svn repository at each revision and
walking the tree at svn revision to compute the swh hashes and store
them in swh-storage.
Conclusion: It is possible but it is slow.
We use git-svn to check if the hash computations were a match, and
they were not. The swh hashes computation are corrects though.
It's just not the same assertions as git-svn so the hashes mismatch.
git-svn:
- does not checkout empty folders
- adds metadata at the end of the svn commit message (by default, this
can be avoided but then no update, in the swh sense, is possible
afterwards)
- integrates the svn repository's uuid in the git revision for the
commit author (author@<repo-uuid>)
swh-loader-svn:
- checkouts empty folder (which are then used in swh hashes)
- adds metadata the git way (leveraging git's extra-header slot), so
that we can deal with svn repository updates
## Pseudo
```
Checkout/Update/Export on disk the first known revision or 1 if unknown repository
When revision is not 1
Check the history is altered (revision hashes won't match)
If it is altered, log an error message and stop
Otherwise continue
Iterate over logs from revision 1 to revision head_revision
The revision is now rev
checkout/update/export the revision at rev
walk the tree directory for that revision and compute hashes
compute the revision hash
send the blobs for storage in swh
send the directories for storage in swh
send the revision for storage in swh
done
Send the occurrence pointing to the last revision seen
```
## Notes
SVN checkout/update instructions are faster than export since they
leverage svn diffs. But:
- they do keyword expansion (so bad for diffs with external tools so
bad for swh)
- we need to ignore .svn folder since it's present (this needed some
adaptation in code to ignore folder based on pattern so slow as
well)
SVN export instruction is slower than the 2 previous ones since they
don't use diffs. But:
- there is one option to ignore keyword expansion (good)
- no folder are to be omitted during hash computation from disk (good)
All in all, there is a trade-off here to choose from.
Still, everything was tested (with much code adapted in the lower
level api) and both are slow.
# v2
## Description
The v2 is more about:
- adding options to match the git-svn's hash computations
- trying to improve the performance
So, options are added:
- remove empty folder when encountered (to ignore during hash
computations)
- add an extra commit line to the svn commit message
- (de)activate the loader svn's update routine
- (de)activate the sending of
contents/directories/revisions/occurrences/releases to swh-storage
- (de)activate the extra-header metadata in revision hash (thus
deactivating the svn update options altogether)
As this is thought as genuine implementation, we adapted the revision
message to also use the repository's uuid in the author's email.
Also, optimization are done as well:
- instead of walking the disk from the top leve at each revision (slow
for huge repository like svn.apache.org), compute from the svn log's
changed paths between the previous revision and the current one, the
lowest common path. Then, walk only that path to compute the updated
hashes. Then update from that path to the top level the in-memory
hashes (less i/o, less RAM are used).
- in the loader-core, lifting the existing swh-storage api to filter
only the missing entities on the client side (there are already
filters on the server side but filtering client-side uses less RAM.
Especially for blobs, since we extract the data from disk and store
it in RAM, this is now done only for unknown blobs and still before
updating the disk with a new revision content)
- in the loader-core, cache are added as well
Now the computations, with the right options, are a match with git-svn.
Still, the performance against git-svn are bad.
Taking a closer look at git-svn, they used a remote-access approach,
that is discussing directly with the svn server and computing at the
same time the hashes.
That is the base for the v3 implementation.
## Pseudo
Relatively to the v1, the logic does not change, only the inner
implementation.
# v3
## Description
This one is about performance only.
Leveraging another low-level library (subvertpy) to permit the use of
the same git-svn approach, the remote-access.
The idea is to replay the logs and diffs on disk and compute hashes
closely in time (not as close as possible though, cf. ## Note below).
## Pseudo
```
Do we know the repository (with swh-svn-update option on)?
Yes, extract the last swh known revision from swh-storage
set start-rev to last-swh-known-revision
Export on disk the svn at start-rev
Compute revision hashes (from top level tree's hashes + commit log for that revision)
Does the revision hash match the one in swh-storage? (<=> Is the history altered?)
No
log an error message and stop
Yes
keep the current in-memory hashes (for the following updates steps if any)
No
set start-rev to 1
Set head-revision to latest svn repository's head revision
When start-rev is the same as head revision, we are done.
Otherwise continue
Iterate over the stream of svn-logs from start-rev to head-rev
The current revision is rev
replay the diffs from previous rev (rev - 1) to rev and compute hashes along
compute the revision hash
send the blobs for storage in swh
send the directories for storage in swh
send the revision for storage in swh
done
Send the occurrence pointing to the last revision seen
```
## Note
There could be margin for improvement in the actual implementation
here.
We apply the diff on files first and then open the file to compute its
hashes afterwards.
If we'd apply the diff and compute the hashes directly, we'd gain one
round-trip. Depending on the ratio files/directory, this could be
significant.
This approach has also the following benefits:
- no keyword expansion
- no need to ignore .svn folder (since it does not exist)
File Metadata
Details
Attached
Mime Type
text/plain
Expires
Jun 4 2025, 7:07 PM (10 w, 1 d ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3375999
Attached To
rDLDSVN Subversion (SVN) loader
Event Timeline
Log In to Comment