Paths

Table of Contentst

Differential D7332

loader: Add support for submodules discovering
Changes PlannedPublic
Actions

Authored by anlambert on Mar 10 2022, 3:25 PM.

Details

Reviewers

None

Group Reviewers

Reviewers

Maniphest Tasks

T3311: Use .gitmodules to discover origins
T3923: Include submodules recursively when saving git repositories

Summary

The git loader can now discover submodules while loading a repository.

That process works the following way:

Before sending a new directory to archive in the storage, check if it has a .gitmodules file in its entries and add the tuple (directory_id, content_sha1git) in a global set if it is the case.

During the post_load operation, process each discovered .gitmodules file the following way:
- retrieve content metadata to get sha1 checksum of file
- retrieve .gitmodules content bytes in objstorage from sha1
- parse .gitmodules file content
- for each submodule definition:
  - get git commit id associated to submodule path
  - check if git commit has been archived by SWH
  - if not, add the submodule repository URL in a set
- for each submodule detected as not archived or partially archived, create a one shot git loading task with high priority in the scheduler database

Related to T3311
Related to T3923

Diff Detail

Repository

rDLDG Git loader

Branch

submodules-recursive-loading-with-scheduler

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 27405
Build 42882: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 42881: arc lint + arc unit

Event Timeline

anlambert created this revision.Mar 10 2022, 3:25 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptMar 10 2022, 3:25 PM

anlambert edited the summary of this revision. (Show Details)Mar 10 2022, 3:26 PM

anlambert edited the summary of this revision. (Show Details)

Build is green

Patch application report for D7332 (id=26514)

Rebasing onto c347e3f757...

Current branch diff-target is up to date.

Changes applied before test

commit c65c8812b65553137486695d0acebcfac44f5b7a
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Feb 15 18:04:21 2022 +0100

    loader: Add support for submodules discovering
    
    The git loader can now discover submodules while loading a
    repository.
    
    That process works the following way:
    
    1. Before sending a new directory to archive in the storage,
    check if it has a ".gitmodules" file in its entries and add
    the tuple (directory_id, content_sha1git) in a global set if
    it is the case.
    
    2. During the post_load operation, process each discovered
    ".gitmodules" file the following way:
    
      - retrieve content metadata to get sha1 checksum of file
    
      - retrieve .gitmodules content bytes in objstorage from sha1
    
      - parse .gitmodules file content
    
      - for each submodule definition:
    
        * get git commit id associated to submodule path
    
        * check if git commit has been archived by SWH
    
        * if not, add the submodule repository URL in a set
    
      - for each submodule detected as not archived or partially
        archived, create a one shot git loading task with high
        priority in the scheduler database
    
    Related to T3311
    Related to T3923

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/184/ for more details.

Harbormaster completed remote builds in B27399: Diff 26514.Mar 10 2022, 3:28 PM

anlambert requested review of this revision.Mar 10 2022, 3:28 PM

Could you move all this new code in a new module and call it from the loader? loader.py is already uncomfortably large.

Also, I imagined the two tasks would be solved with a journal client listening for new directories, to avoid adding more stuff to the loaders themselves. But this designs avoids introducing a new (set of) process to monitor, so why not.

swh/loader/git/loader.py
254–256	Should make it clear this is a heuristic. Actual `.gitmodules` apply to local paths on the computer, not to URLs of remote clones.
262–267	could you add parentheses (or nested blocks if Black won't let you), so we don't depend on the priority of `or` and `and`? I'd need to look up the priority to understand the code...
271–300	why these checks? It looks like this is to avoid scheduling two origins that would contain the same revision; but I think it is sufficient to just deduplicate based on origin URL

Could you move all this new code in a new module and call it from the loader? loader.py is already uncomfortably large.

Sure, should be feasible.

swh/loader/git/loader.py
254–256	Ack
262–267	Sure.
271–300	The purpose is to speedup the loading process by avoiding to query the storage too many times for a revision we know already archived. Indeed, such revision can be derived from multiple `.gitmodules` file discovered during the loading process.

Update: Address @vlorentz comments

Build is green

Patch application report for D7332 (id=26520)

Rebasing onto c347e3f757...

Current branch diff-target is up to date.

Changes applied before test

commit 8797d1d792b8ccc54a9437755d5119ec52ae1998
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Feb 15 18:04:21 2022 +0100

    loader: Add support for submodules discovering
    
    The git loader can now discover submodules while loading a
    repository.
    
    That process works the following way:
    
    1. Before sending a new directory to archive in the storage,
       check if it has a ".gitmodules" file in its entries and add
       the tuple (directory_id, content_sha1git) in a global set if
       it is the case.
    
    2. During the post_load operation, process each discovered
       ".gitmodules" file the following way:
    
      - retrieve content metadata to get sha1 checksum of file
    
      - retrieve .gitmodules content bytes in objstorage from sha1
    
      - parse .gitmodules file content
    
      - for each submodule definition:
    
        * get git commit id associated to submodule path
    
        * check if git commit has been archived by SWH
    
        * if not, add the submodule repository URL in a set
    
      - for each submodule detected as not archived or partially
        archived, create a one shot git loading task with high
        priority in the scheduler database
    
    Related to T3311
    Related to T3923

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/185/ for more details.

Harbormaster completed remote builds in B27405: Diff 26520.Mar 10 2022, 5:19 PM

Could you add tests for the various error scenarios? (or split them out of the one big test case if I missed them)

Also, I we just discussed, it should probably submit URLs to Save Code Now instead of directly to the scheduler. (To be discussed with the rest of the team)

swh/loader/git/loader.py
172	Why `self.snapshot = Snapshot(branches={})`? This makes it easy to accidentally write the empty snapshot instead of the actual one
swh/loader/git/submodules.py
117–118