Page MenuHomeSoftware Heritage

Improve store_data implem to allow multiple calls with partial visit
Changes PlannedPublic

Authored by ardumont on Thu, Sep 30, 5:07 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

This introduces a create_partial_visit flag to the store_data method call which
allows to write partial visit with the current snapshot built during one iteration of
the ingestion loop.

The final loop of the visit will not create a partial visit. It will create the last
visit with status "full" targeting the final snapshot as before.

The main difference is that an ingestion is actually more verbose in terms of
origin_visit_status. Which in turn allows to be more incremental in subsequent visit of
the same origin.

This is required to allow performance improvments on the loader git [1].

[1] D6386

Related to T3625

Test Plan

tox

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
allow-multiple-calls2
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24171
Build 37727: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37726: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6380 (id=23217)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit 4b9260a743cd31512134c5f0804e4876a8b94d27
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Allow multiple calls to DVCSLoader.store_data implementation
    
    This introduces a more_data_to_fetch which allows to not write immediately the snapshot
    if we need some more data to fetch in other iterations of the loop. Prior to this
    commit, it was implied that the store_data could only be called once. It's a limitation
    that needs to change for some ongoing optimizations in the loader git.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/578/ for more details.

what about naming the parameter create_snapshot instead?

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

what about naming the parameter create_snapshot instead?

Fine with me.

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

Well, yes. But even with this, that's still the case (if create_snapshot is True after the first round).

Without this though, we cannot pass into more than one iteration of the loop (in the git loader
which is the sole running subclass instance of this). The ingestion fails because it wants
to create one snapshot after the first store_data called (so only one loop is allowed
with the current implem).

As it misses references to build the snapshot, it fails.
The reading of all references is done through multiple iterations (optimization ongoing to
read the packfiles into multiple steps)

swh/loader/core/loader.py
482–483

should be in the if. We don't really want the flush to happen each time the loop occurs.
Plus, if that can already happen anyway (with the proper buffer/filter configuration).

Adapt according to suggestion

Build is green

Patch application report for D6380 (id=23219)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit 9e81b1a2895345272036b52d8494e9e884db5db5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Allow multiple calls to DVCSLoader.store_data implementation
    
    This introduces a more_data_to_fetch which allows to not write immediately the snapshot
    if we need some more data to fetch in other iterations of the loop. Prior to this
    commit, it was implied that the store_data could only be called once. It's a limitation
    that needs to change for some ongoing optimizations in the loader git.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/579/ for more details.

Adapt according to latest analysis/development

ardumont retitled this revision from Allow multiple calls to DVCSLoader.store_data implementation to Improve store_data implem to allow multiple calls with partial visit.Sun, Oct 3, 3:13 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)

what about naming the parameter create_snapshot instead?

Fine with me.

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

Well, yes. But even with this, that's still the case (if create_snapshot is True after the first round).

Without this though, we cannot pass into more than one iteration of the loop (in the git loader
which is the sole running subclass instance of this). The ingestion fails because it wants
to create one snapshot after the first store_data called (so only one loop is allowed
with the current implem).

As it misses references to build the snapshot, it fails.
The reading of all references is done through multiple iterations (optimization ongoing to
read the packfiles into multiple steps)

Given enough sleep on this, it clicked ;)

I have an ongoing adaptation for that part in the loader git to exploit this. The loader
git will make a snapshot after each iteration, only it will create a "partial" visit
targeting that snapshot only when more_data_to_fetch is True.

Only, more_data_to_fetch/create_snapshot is renamed create_partial_visit though as
that makes more sense now.

Build is green

Patch application report for D6380 (id=23254)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit cc027ed76ba140ac2656a1155e059fc905f72f98
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Improve store_data implem to allow multiple calls with partial visit
    
    This introduces a `create_partial_visit` flag to the `store_data` method call which
    allows to write partial visit with the current snapshot built during one iteration of
    the ingestion loop.
    
    The final loop of the visit will not create a partial visit. It will create the last
    visit with status "full" targeting the final snapshot as before.
    
    The main difference is that an ingestion is actually more verbose in terms of
    origin_visit_status. Which in turn allows to be more incremental in subsequent visit of
    the same origin.
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/581/ for more details.

ardumont edited the summary of this revision. (Show Details)

The buffer proxy does not buffer visit statuses, so we need to flush the buffer before creating visit statuses.

swh/loader/core/loader.py
472

maybe?

swh/loader/core/loader.py
472

I recall that's what i initially did and mypy was not happy about it.

482–483

that comment was for a previous implem, it's now irrelevant.