Use .gitmodules to discover origins
Closed, Migrated

Description

.gitmodules is a file created at the root of a Git repository when that repository has submodules. It contains the URLs of other Git repositories. We could/should use it for two reasons:

  1. discovery of new origins
  2. completeness of the repository containing .gitmodules, as it references revisions of these other repositories.
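
For reference, a minimal sketch of how the URLs could be pulled out of a .gitmodules file, using only the Python standard library. The names `Submodule` and `parse_gitmodules` are made up for illustration and are not part of any SWH component. It assumes the file uses the simple `key = value` form and does not handle every corner of git's config syntax (inline comments, line continuations). URLs may also be relative to the parent repository's URL (e.g. `../docs.git`); resolving those is left out here.

```python
import configparser
from typing import NamedTuple


class Submodule(NamedTuple):
    """Hypothetical record for one [submodule "..."] section."""
    name: str
    path: str
    url: str


def parse_gitmodules(text: str) -> list[Submodule]:
    """Extract submodule entries from the text of a .gitmodules file."""
    # Strip per-line indentation so configparser never mistakes an
    # indented key for the continuation of the previous value.
    normalized = "\n".join(line.strip() for line in text.splitlines())

    parser = configparser.ConfigParser(
        interpolation=None,   # URLs may contain '%'
        strict=False,         # tolerate duplicated sections or keys
        delimiters=("=",),    # ':' appears inside URLs, so do not split on it
    )
    parser.read_string(normalized)

    submodules = []
    for section in parser.sections():
        # Section headers look like: submodule "name"
        if not section.startswith("submodule "):
            continue
        name = section[len("submodule "):].strip().strip('"')
        url = parser.get(section, "url", fallback=None)
        path = parser.get(section, "path", fallback=name)
        if url:
            submodules.append(Submodule(name, path, url))
    return submodules


sample = """
[submodule "libfoo"]
    path = vendor/libfoo
    url = https://example.org/libfoo.git
[submodule "docs"]
    path = docs
    url = ../docs.git
"""

for sub in parse_gitmodules(sample):
    print(sub.path, sub.url)
```

Each returned URL is a candidate origin (reason 1), and each path tells us which directory entry points into that origin (reason 2).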

Possible ways to implement it:

  1. A lister scanning through SWH's DB? (ew)
  2. Indexer?
  3. Make the git loader create these origins?

Event Timeline

vlorentz created this task.

This is a good idea, thanks for raising it.

I think 3 (make the git loader create the origin) would be the best way, because *both* other options sound like "ew" to me :-)
With (3) we would process the information at the earliest possible moment and avoid having to rely on other moving parts.

(Not sure if for (3) there are concerns about a feedback loop lister -> loader -> lister. But if there are we should probably address them anyway.)

I think the only issue with (3) is not being retroactive

I think the only issue with (3) is not being retroactive

Right. We can start making the world^W archive a better place by improving things for the future with (3).

Then we can estimate whether it's worth doing a one-off pass on the archive to catch up with the past.
Looking up and processing all files called .gitmodules should be doable as a single batch process.

I think this is worthwhile in general, at least for repositories that are still live.

I assume that when you say "create the origin", you mean "create the origin in the lister/scheduler database", rather than create it in the archive itself (which would only create an empty entry with no visits).

I do wonder how many of these repositories we'll catch as "new" repositories rather than repositories that already exist thanks to an existing lister.

I'm also wondering if it would be worth submitting these recursive origins with "save code now" so we can try to get submodule updates close to the update of the main repository (this is a bit less drastic than having the loader recurse within submodules itself). Or at least to try to do so if some of the submoduled revisions aren't available in the archive.

I also wonder if we have a somewhat common approach to handle the SVN externals as well.

if it would be worth submitting these recursive origins with "save code now" so we can try to get submodule updates close to the update of the main repository

Definitely, yes. It allows bumping their priority in the absence of a (smart) lister.

I also wonder if we have a somewhat common approach to handle the SVN externals as well.

And bzr's stacked branches. However, with bzr's stacked branches (and probably SVN externals as well), we would need to load the origin first, so we can get a sha1_git hash we can use as parent; so we would either need the loader to recurse, or pause loading until the referenced origin is loaded. (With, of course, a fallback in case the referenced origin is lost forever)

It's been more or less discussed above, but IMHO it would make sense to:

  • use the journal to handle these submodules (so we can easily recover existing .gitmodules files) instead of adding complexity / workload on the workers,
  • use 'save code now' to process the newly discovered origins

I think the approach in D7332 is interesting, but it feels a bit expensive to be doing it for every instance of a .gitmodules file found in any new directory for all git repos that are being loaded, as well as doing it again for the top level of any known branch in the git snapshot being loaded currently.

I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.

I wonder if it would be possible to restrict the parsing of .gitmodules to:

  • new revisions
  • which are known to contain directory entries pointing at unknown revisions

And to add these discovered repositories as entries of a virtual lister, instead of as high priority tasks akin to save code now.
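
To make the two conditions above concrete, here is a rough sketch of the check. Everything in it is hypothetical: `Entry` is a simplified stand-in for a directory entry (not the real data model class), the `"rev"` type is assumed to mark an entry that points at a revision in another repository (a submodule), and `known_revisions` stands in for whatever lookup tells us which revisions are already in the archive.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified directory entry."""
    name: bytes
    type: str      # assumed values: "file", "dir", or "rev" (submodule pointer)
    target: bytes  # identifier of the object the entry points at


def unknown_submodule_targets(
    entries: Iterable[Entry], known_revisions: set[bytes]
) -> set[bytes]:
    """Return the targets of "rev" entries that are not known revisions."""
    return {
        e.target
        for e in entries
        if e.type == "rev" and e.target not in known_revisions
    }


def should_parse_gitmodules(
    is_new_revision: bool,
    entries: Iterable[Entry],
    known_revisions: set[bytes],
) -> bool:
    """Parse .gitmodules only for new revisions with unknown submodule targets."""
    if not is_new_revision:
        return False
    return bool(unknown_submodule_targets(entries, known_revisions))


entries = [
    Entry(b".gitmodules", "file", b"\x01"),
    Entry(b"vendor", "rev", b"\xaa"),
]
print(should_parse_gitmodules(True, entries, known_revisions=set()))
```

With a filter like this, a revision whose submodule pointers all resolve to revisions we already hold would not trigger any parsing at all.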

However, I do like the idea of eagerly creating tasks in save code now for submodules of repositories that are loaded via save code now, as that seems to be the "least surprising" outcome in that context (solving T3923); this means adding a boolean flag to the git loader and enabling it on save code now tasks, to effectively make them recursive?

In T3311#80997, @olasd wrote:

I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.

Well, thinking about it some more, the worst case scenario is that the next load operation is a no-op, so maybe it doesn't matter that much in practice. Hmm.

The worst case scenario is that someone maliciously creates repositories generated on the fly that refer to each other via .gitmodules, so we end up in an infinite loop of loading garbage.
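
To illustrate that scenario, a small sketch of one possible way to bound it: cap both the recursion depth and the total number of origins discovered from a single starting repository. This is only an assumption about how a guard could look, not something any loader does today; `discover_origins`, `get_submodule_urls`, `max_depth` and `max_origins` are all invented names. In a real deployment the loads go through the scheduler, so the depth would presumably have to travel with the task arguments rather than live in one process.

```python
from collections import deque
from typing import Callable, Iterable


def discover_origins(
    root_url: str,
    get_submodule_urls: Callable[[str], Iterable[str]],
    max_depth: int = 3,
    max_origins: int = 100,
) -> list[str]:
    """Breadth-first discovery of submodule origins with hard limits."""
    seen = {root_url}
    queue = deque([(root_url, 0)])
    discovered: list[str] = []

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            # Do not look inside repositories that are already too deep.
            continue
        for sub_url in get_submodule_urls(url):
            if sub_url in seen:
                continue  # breaks plain A -> B -> A cycles
            if len(discovered) >= max_origins:
                return discovered
            seen.add(sub_url)
            discovered.append(sub_url)
            queue.append((sub_url, depth + 1))
    return discovered


def endless(url: str) -> list[str]:
    """Simulates a server that invents a fresh submodule for every repository."""
    return [url + "/next"]


print(len(discover_origins("https://example.org/evil", endless)))
```

The `seen` set alone is not enough here, since every generated URL is new; it is the depth and count limits that stop the loop.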