Page MenuHomeSoftware Heritage

loader: Add support for submodules discovering
Changes PlannedPublic

Authored by anlambert on Mar 10 2022, 3:25 PM.

Details

Summary

The git loader can now discover submodules while loading a repository.

That process works the following way:

  1. Before sending a new directory to archive in the storage, check if it has a .gitmodules file in its entries and add the tuple (directory_id, content_sha1git) in a global set if it is the case.
  1. During the post_load operation, process each discovered .gitmodules file the following way:
    • retrieve content metadata to get sha1 checksum of file
    • retrieve .gitmodules content bytes in objstorage from sha1
    • parse .gitmodules file content
    • for each submodule definition:
      • get git commit id associated to submodule path
      • check if git commit has been archived by SWH
      • if not, add the submodule repository URL in a set
    • for each submodule detected as not archived or partially archived, create a one shot git loading task with high priority in the scheduler database

Related to T3311
Related to T3923

Diff Detail

Repository
rDLDG Git loader
Branch
submodules-recursive-loading-with-scheduler
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 27405
Build 42882: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 42881: arc lint + arc unit

Event Timeline

anlambert edited the summary of this revision. (Show Details)

Build is green

Patch application report for D7332 (id=26514)

Rebasing onto c347e3f757...

Current branch diff-target is up to date.
Changes applied before test
commit c65c8812b65553137486695d0acebcfac44f5b7a
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Feb 15 18:04:21 2022 +0100

    loader: Add support for submodules discovering
    
    The git loader can now discover submodules while loading a
    repository.
    
    That process works the following way:
    
    1. Before sending a new directory to archive in the storage,
    check if it has a ".gitmodules" file in its entries and add
    the tuple (directory_id, content_sha1git) in a global set if
    it is the case.
    
    2. During the post_load operation, process each discovered
    ".gitmodules" file the following way:
    
      - retrieve content metadata to get sha1 checksum of file
    
      - retrieve .gitmodules content bytes in objstorage from sha1
    
      - parse .gitmodules file content
    
      - for each submodule definition:
    
        * get git commit id associated to submodule path
    
        * check if git commit has been archived by SWH
    
        * if not, add the submodule repository URL in a set
    
      - for each submodule detected as not archived or partially
        archived, create a one shot git loading task with high
        priority in the scheduler database
    
    Related to T3311
    Related to T3923

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/184/ for more details.

Could you move all this new code in a new module and call it from the loader? loader.py is already uncomfortably large.

Also, I imagined the two tasks would be solved with a journal client listening for new directories, to avoid adding more stuff to the loaders themselves. But this designs avoids introducing a new (set of) process to monitor, so why not.

swh/loader/git/loader.py
254–256

Should make it clear this is a heuristic. Actual .gitmodules apply to local paths on the computer, not to URLs of remote clones.

262–267

could you add parentheses (or nested blocks if Black won't let you), so we don't depend on the priority of or and and?

I'd need to look up the priority to understand the code...

271–300

why these checks?

It looks like this is to avoid scheduling two origins that would contain the same revision; but I think it is sufficient to just deduplicate based on origin URL

Could you move all this new code in a new module and call it from the loader? loader.py is already uncomfortably large.

Sure, should be feasible.

swh/loader/git/loader.py
254–256

Ack

262–267

Sure.

271–300

The purpose is to speedup the loading process by avoiding to query the storage too many times for a revision we know already archived.
Indeed, such revision can be derived from multiple .gitmodules file discovered during the loading process.

Build is green

Patch application report for D7332 (id=26520)

Rebasing onto c347e3f757...

Current branch diff-target is up to date.
Changes applied before test
commit 8797d1d792b8ccc54a9437755d5119ec52ae1998
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Feb 15 18:04:21 2022 +0100

    loader: Add support for submodules discovering
    
    The git loader can now discover submodules while loading a
    repository.
    
    That process works the following way:
    
    1. Before sending a new directory to archive in the storage,
       check if it has a ".gitmodules" file in its entries and add
       the tuple (directory_id, content_sha1git) in a global set if
       it is the case.
    
    2. During the post_load operation, process each discovered
       ".gitmodules" file the following way:
    
      - retrieve content metadata to get sha1 checksum of file
    
      - retrieve .gitmodules content bytes in objstorage from sha1
    
      - parse .gitmodules file content
    
      - for each submodule definition:
    
        * get git commit id associated to submodule path
    
        * check if git commit has been archived by SWH
    
        * if not, add the submodule repository URL in a set
    
      - for each submodule detected as not archived or partially
        archived, create a one shot git loading task with high
        priority in the scheduler database
    
    Related to T3311
    Related to T3923

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/185/ for more details.

Could you add tests for the various error scenarios? (or split them out of the one big test case if I missed them)

Also, I we just discussed, it should probably submit URLs to Save Code Now instead of directly to the scheduler. (To be discussed with the rest of the team)

swh/loader/git/loader.py
172

Why self.snapshot = Snapshot(branches={})? This makes it easy to accidentally write the empty snapshot instead of the actual one

swh/loader/git/submodules.py
117–118

FYI, @douardda added a comment on T3311 (I think it's the best place to discuss the design)

olasd added a subscriber: olasd.

FYI, @douardda added a comment on T3311 (I think it's the best place to discuss the design)

I've also done that.