
support for external definitions in the svn/subversion loader
Open, Normal, Public

Description

We need to support the svn:externals property, which is similar in spirit to Git submodules. See upstream doc.
It looks like doing so will require extending our data model for directory entries pointing to revisions.

Our directory_entry_rev entries are currently able to point only to specific revisions (by their checksum ID) that can be found in the archive.
svn:externals, OTOH, allows pointing either to specific revisions (identified by SVN revision ID...) or to repository URLs. In the latter case the semantics is: check out the most recent revision at the time of the checkout of the parent repository.

Various problems need to be tackled:

  • How to store the information in our directories. To that end we can either 1) add a new type of directory entry (repo_entries?), or 2) modify directory_entry_rev to point to either a specific rev or a repo.
  • How to point to an external repo, as that is a moving target. In our current model the most natural thing will probably be to point to an origin, but we need to take care of the caveat that we might encounter repositories pointing to URLs that we haven't yet added as origins.
  • In case a specific revision is specified, when to look up its checksum ID, as that might change over time. Here again, "never" is a legitimate answer to consider, even though that would make implementing repository checkouts (e.g., for the vault) more complex.
  • Finally, how to avoid losing svn-specific information. In particular, I think we want to keep both the URL and the optional revision in their native form (URL, revision number) even if/when we decide to resolve them to internal SWH identifiers. This might mean adding some (possibly JSON) metadata field to the relevant directory entries table.
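
To make the points above a bit more concrete, here is a minimal sketch of what option 2 could look like when combined with the last point: an entry that always keeps the native svn values and only optionally carries resolved SWH identifiers. Every name in it (`ExternalEntry`, `svn_url`, `resolved_origin`, `resolved_revision_id`, ...) is hypothetical and not part of the current data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalEntry:
    """Hypothetical directory entry describing one svn:externals definition."""

    name: bytes                                   # path of the entry in the directory
    svn_url: str                                  # native URL, always kept
    svn_revision: Optional[int] = None            # native revision number, if pinned
    resolved_origin: Optional[str] = None         # SWH origin, if the URL is a known origin
    resolved_revision_id: Optional[bytes] = None  # SWH revision checksum, if looked up

    @property
    def is_pinned(self) -> bool:
        """True when the external targets a specific svn revision."""
        return self.svn_revision is not None
```

With such a shape, "never resolve" simply means leaving the two `resolved_*` fields empty, at the cost of pushing the lookup to whoever needs a full checkout later (e.g., the vault).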

Discuss ☺


Event Timeline

zack created this task. Dec 9 2016, 11:07 AM
ardumont removed ardumont as the assignee of this task. Jan 12 2017, 12:41 PM
ardumont added a subscriber: ardumont.

To get some idea of what we can find, below are some examples of svn:externals property values from googlecode svn projects.

https://wow-xlog.googlecode.com/svn/
LibXEvent-1.0 https://wow-xlog.googlecode.com/svn/branches/LibXEvent-1.0/

https://thd-root.googlecode.com/svn/
_documents-tools https://gr4-documents.googlecode.com/svn/trunk

https://mindup.googlecode.com/svn/
symfony         http://svn.symfony-project.com/branches/1.4/

http://13ns9-1spr.googlecode.com/svn/
http://step13sgroup.googlecode.com/svn/01-Docs/WPF/ WPF

http://develenv-qametrics-plugin.googlecode.com/svn/
src/main/tools http://develenv-qametrics-plugin.googlecode.com/svn/trunk/thirdParty/tools/
src/main/webapp/tablesorter http://develenv-qametrics-plugin.googlecode.com/svn/trunk/thirdParty/web/tablesorter

http://gtm-oauth.googlecode.com/svn/
HTTPFetcher http://gtm-http-fetcher.googlecode.com/svn/trunk/Source

According to the official documentation (marked as not a smart idea to reference), the property format changed in a breaking way from svn 1.5 onwards.

So we can have something like this (prior to svn 1.5):

third-party/sounds             http://svn.example.com/repos/sounds
third-party/skins -r148        http://svn.example.com/skinproj

And after that:

      http://svn.example.com/repos/sounds third-party/sounds
-r148 http://svn.example.com/skinproj third-party/skins
-r21  http://svn.example.com/skin-maker third-party/skins/toolkit

Even:

http://svn.example.com/repos/sounds third-party/sounds
http://svn.example.com/skinproj@148 third-party/skins
http://svn.example.com/skin-maker@21 third-party/skins/toolkit
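
As an illustration of what handling these three shapes implies, here is a rough parsing sketch. It is not the loader's code: `External` and `parse_external` are hypothetical names, and it assumes absolute URLs and paths without spaces or quoting (relative URLs are not handled).

```python
import re
from typing import NamedTuple, Optional


class External(NamedTuple):
    path: str                    # local directory where the external is checked out
    url: str                     # repository URL, without any @peg suffix
    revision: Optional[int]      # operative revision given as -rN, if any
    peg_revision: Optional[int]  # peg revision given as URL@N, if any


_URL_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)
_PEG_RE = re.compile(r"(.*)@(\d+)")


def parse_external(line: str) -> External:
    """Parse one svn:externals line in either the old or the new format."""
    tokens = line.split()
    revision = None
    rest = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-r" and i + 1 < len(tokens):  # "-r 148"
            revision = int(tokens[i + 1])
            i += 2
        elif re.fullmatch(r"-r\d+", tok):        # "-r148"
            revision = int(tok[2:])
            i += 1
        else:
            rest.append(tok)
            i += 1
    if len(rest) != 2:
        raise ValueError(f"unsupported svn:externals line: {line!r}")
    first, second = rest
    peg = None
    if _URL_RE.match(first):     # svn >= 1.5: URL[@peg] path
        url, path = first, second
        match = _PEG_RE.fullmatch(url)
        if match:
            url, peg = match.group(1), int(match.group(2))
    elif _URL_RE.match(second):  # svn < 1.5: path URL
        path, url = first, second
    else:
        raise ValueError(f"no absolute URL in svn:externals line: {line!r}")
    return External(path, url, revision, peg)
```

The old and new formats are told apart only by which of the two remaining tokens looks like a URL, which is enough for the examples listed in this comment.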

@zack Can you enlighten me as to why we want to store that information at the directory level (and not say at the revision one)?

At the svn revision level, we would be symmetric with the svn revision model.

We could add that svn revision information as extra headers in the swh revision metadata field (either raw or parsed, no swh information there).
That way, the revision id hash is updated when something changes there and we do not lose that information.

Cons: We delegate the resolving of those origins to either later, never or some other layer... which might be bad, I don't know.

zack added a comment. Oct 1 2018, 4:58 PM

@zack Can you enlighten me as to why we want to store that information at the directory level (and not say at the revision one)?

I'm not sure I fully understand the question, so I'll answer at different abstraction levels.

We want to store the information at the directory level, because that's what our data model supports (via directory manifests referencing revisions). The data model is not set in stone, but that's what we currently have. It can be improved/extended if needed.

If the question is why the data model is organized this way, the reason is that it allows sharing directories more easily. Imagine you have a gazillion revisions all containing an unmodified directory that always points to the same external revision. If you store the info at the revision level you have to replicate that info a gazillion times. If you store the info at the directory level you store it only once (in the directory) and you just reference the same directory over and over again.

That said, I don't understand how storing the info at the revision level would solve your problem. You can use the same argument for directories: you store the revision id hash in the directory, and the directory ID changes every time the revision is updated, without losing the information.

Thanks for the clarification, i needed it.

... If you store the info at the revision level you have to replicate that info a gazillion times. If you store the info at the directory level you store it only once (in the directory) and you just reference the same directory over and over again.

... but of course

That said, I don't understand how storing the info at the revision level would solve your problem. You can use the same argument for directories: you store the revision id hash in the directory, and the directory ID changes every time the revision is updated, without losing the information.

It does not indeed. I need more thinking on this...

would solve *your* problem

A nitpick: it's the team's problem, not solely mine.

Cheers,

It does not indeed, I need more thinking on this...

Sorry for the long description, feel free to not read it...

My following reasoning applies to origins with svn:externals
property. Nothing changes for the other svn origins.


I tried to answer the following question: Can we keep the loader's
idempotency and solve that problem?

tl;dr We cannot.

I do not separate the svn:externals cases that work (svn revision
mentioned) from those that do not. I do not think that's a
reasonable assumption to make.

As the svn revision number is only recommended but unfortunately not
mandatory in the svn:externals property, we will come across origins
with svn:externals entries without a specified svn revision number
(probably more often than not, given how our human brain induces
us to choose the path of least resistance...) [1]

As a result, we won't be able to have an idempotent loader... E.g.,
given an origin with svn:externals to load, loading it at time t0
won't necessarily mean we will be able to load it at t1 and get exactly
the same information (e.g. if the external svn origin is still alive,
new svn revisions will occur).

So given that, we cannot use the directory model as is.
So far, that's already mentioned above (but now i understand the reasons).

In that regard, I see multiple solutions.

Solution 1

I see a possible solution, without touching anything in the model.
Generate a file holding the svn:externals property value
(either sanitized or not) [2]. That file becomes part of the
directory listing (thus the hash computation is impacted, good). That
would somewhat match what git does with submodules (.gitmodules file) [3]

Pros, we keep the following properties:

  • loader idempotency
  • update (from new svn revisions) on that file results in new directory computation
  • no svn information loss

Cons: We need to take extra care for a new visit on an already visited
origin (to have the same initial directory; by default we won't
have the same output).

For other purposes, we can still try to provide heuristics when
browsing or cooking [2]
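
A toy sketch of Solution 1, to show why the directory hash follows the property. The hashing below is a simplified git-tree-like computation over regular files only, not the actual swh implementation, and the `.svnexternals` file name as well as the helper names are made up for the example.

```python
import hashlib
from typing import Dict

EXTERNALS_FILENAME = b".svnexternals"  # hypothetical name for the generated file


def blob_id(data: bytes) -> bytes:
    """git-style blob checksum of a file content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


def directory_id(entries: Dict[bytes, bytes]) -> bytes:
    """Simplified git-tree-like checksum; entries maps file name -> blob id."""
    body = b"".join(
        b"100644 " + name + b"\0" + entries[name] for name in sorted(entries)
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).digest()


def with_externals_file(entries: Dict[bytes, bytes], externals: str) -> Dict[bytes, bytes]:
    """Return a copy of the listing with the svn:externals value added as a file."""
    new_entries = dict(entries)
    new_entries[EXTERNALS_FILENAME] = blob_id(externals.encode("utf-8"))
    return new_entries
```

Loading the same property value twice yields the same directory id, and any change to the property yields a new one, without resolving the external at all.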

Solution 2

Altering the directory model to hold metadata information. Same
as 1., plus we add that information to the new metadata field in the
directory model.

One more pro: the model holds the information, which simplifies
things downstream (browse, cooking, etc...)

Solution 3

Altering the directory model to hold metadata information. We alter
the directory hash computation to account for that optional
metadata field (without it, the current computation stays the same). That
way, we do not need to add the svn:externals file in the tree. That
seems the most reasonable.

Same pros as 2. I do not see cons.

Given that, that's the solution to retain.
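
A toy sketch of the hash change Solution 3 implies. Again this is a simplified git-tree-like computation over regular files, not the real swh one, and the way the metadata is serialized (canonical JSON appended after a marker) is purely an assumption made for the example.

```python
import hashlib
import json
from typing import Dict, Optional


def directory_id(entries: Dict[bytes, bytes], metadata: Optional[dict] = None) -> bytes:
    """Simplified directory checksum with an optional metadata field.

    entries maps file name -> blob id. Without metadata the result is
    exactly the plain computation, so existing directory ids are unchanged.
    """
    body = b"".join(
        b"100644 " + name + b"\0" + entries[name] for name in sorted(entries)
    )
    if metadata is not None:
        # Hypothetical extension: append a canonical JSON form of the metadata.
        canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
        body += b"\0metadata\0" + canonical.encode("utf-8")
    return hashlib.sha1(b"tree %d\0" % len(body) + body).digest()
```

For instance, `directory_id(entries, {"svn:externals": [{"url": "http://svn.example.com/skinproj", "revision": 148}]})` differs from `directory_id(entries)`, while directories without externals keep the id they have today.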


Answers needed though:

  • what do other dvcs do in such cases (mercurial, etc...)?

[1] We have 0.54% of the googlecode origins holding the svn:externals
property. Now the next step would be to check the distribution of
those (how many with an svn revision, how many without)...

Computation:
```
(/  (* 100 3102.0) 575835.0); 0.5386959806194482
```

[2] Sanitizing the property could help for other purposes
(browsability, cooking, etc...)

[3] However, we don't have issues with git because the git submodules
are designed to target a git revision (and in our model
`directory_entry_rev` uses the sha1_git provided by git, we do not
need to compute it).

Cheers,