Page MenuHomeSoftware Heritage

Allow partial snapshot creation during ingestion
Needs ReviewPublic

Authored by ardumont on Sep 30 2021, 5:07 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3625: Reduce git loader memory footprint
Summary

This introduces a create_partial_snapshot parameter to the base loader constructor.
When activated, during each call of the store_data method, if there are more data to
fetch, this will create a partial snapshot.

The final loop behaves as before, create the last visit with status 'full' targeting the
snapshot.

The main difference between the 2 behavior is that an ingestion with that parameter on
is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
in subsequent visits for the same origin. This may especially be interesting for cases
when loading fail due to out of hand resources issues (e.g. large svn or git
repositories).

This is required to allow performance improvments on the loader git [1].

[1] D6386

Related to T3625

Test Plan

tox

Diff Detail

Event Timeline

Build is green

Patch application report for D6380 (id=23217)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit 4b9260a743cd31512134c5f0804e4876a8b94d27
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Allow multiple calls to DVCSLoader.store_data implementation
    
    This introduces a more_data_to_fetch which allows to not write immediately the snapshot
    if we need some more data to fetch in other iterations of the loop. Prior to this
    commit, it was implied that the store_data could only be called once. It's a limitation
    that needs to change for some ongoing optimizations in the loader git.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/578/ for more details.

what about naming the parameter create_snapshot instead?

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

what about naming the parameter create_snapshot instead?

Fine with me.

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

Well, yes. But even with this, that's still the case (if create_snapshot is True after the first round).

Without this though, we cannot pass into more than one iteration of the loop (in the git loader
which is the sole running subclass instance of this). The ingestion fails because it wants
to create one snapshot after the first store_data called (so only one loop is allowed
with the current implem).

As it misses references to build the snapshot, it fails.
The reading of all references is done through multiple iterations (optimization ongoing to
read the packfiles into multiple steps)

swh/loader/core/loader.py
469

should be in the if. We don't really want the flush to happen each time the loop occurs.
Plus, if that can already happen anyway (with the proper buffer/filter configuration).

Adapt according to suggestion

Build is green

Patch application report for D6380 (id=23219)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit 9e81b1a2895345272036b52d8494e9e884db5db5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Allow multiple calls to DVCSLoader.store_data implementation
    
    This introduces a more_data_to_fetch which allows to not write immediately the snapshot
    if we need some more data to fetch in other iterations of the loop. Prior to this
    commit, it was implied that the store_data could only be called once. It's a limitation
    that needs to change for some ongoing optimizations in the loader git.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/579/ for more details.

Adapt according to latest analysis/development

ardumont retitled this revision from Allow multiple calls to DVCSLoader.store_data implementation to Improve store_data implem to allow multiple calls with partial visit.Oct 3 2021, 3:13 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)

what about naming the parameter create_snapshot instead?

Fine with me.

Prior to this
commit, it was implied that the store_data could only be called once. It's a limitation
that needs to change for some ongoing optimizations in the loader git.

is it, though? it allows creating "partial" snapshots

Well, yes. But even with this, that's still the case (if create_snapshot is True after the first round).

Without this though, we cannot pass into more than one iteration of the loop (in the git loader
which is the sole running subclass instance of this). The ingestion fails because it wants
to create one snapshot after the first store_data called (so only one loop is allowed
with the current implem).

As it misses references to build the snapshot, it fails.
The reading of all references is done through multiple iterations (optimization ongoing to
read the packfiles into multiple steps)

Given enough sleep on this, it clicked ;)

I have an ongoing adaptation for that part in the loader git to exploit this. The loader
git will make a snapshot after each iteration, only it will create a "partial" visit
targeting that snapshot only when more_data_to_fetch is True.

Only, more_data_to_fetch/create_snapshot is renamed create_partial_visit though as
that makes more sense now.

Build is green

Patch application report for D6380 (id=23254)

Rebasing onto 69028b09cd...

Current branch diff-target is up to date.
Changes applied before test
commit cc027ed76ba140ac2656a1155e059fc905f72f98
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 30 16:55:54 2021 +0200

    Improve store_data implem to allow multiple calls with partial visit
    
    This introduces a `create_partial_visit` flag to the `store_data` method call which
    allows to write partial visit with the current snapshot built during one iteration of
    the ingestion loop.
    
    The final loop of the visit will not create a partial visit. It will create the last
    visit with status "full" targeting the final snapshot as before.
    
    The main difference is that an ingestion is actually more verbose in terms of
    origin_visit_status. Which in turn allows to be more incremental in subsequent visit of
    the same origin.
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/581/ for more details.

ardumont edited the summary of this revision. (Show Details)

The buffer proxy does not buffer visit statuses, so we need to flush the buffer before creating visit statuses.

swh/loader/core/loader.py
468

maybe?

swh/loader/core/loader.py
468

I recall that's what i initially did and mypy was not happy about it.

469

that comment was for a previous implem, it's now irrelevant.

ardumont edited the summary of this revision. (Show Details)
  • Rebase
  • reword commit and diff description
  • adapt parameter according to review suggestion from @vlorentz

Only, more_data_to_fetch/create_snapshot is renamed create_partial_visit though as
that makes more sense now.

create_partial_snapshot makes for even more sense (updated).

ardumont retitled this revision from Improve store_data implem to allow multiple calls with partial visit to Allow partial snapshot creation during ingestion.Nov 14 2022, 3:50 PM
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D6380 (id=31851)

Rebasing onto bf2cb039d5...

Current branch diff-target is up to date.
Changes applied before test
commit 92b14be4aa468f7164621a30695c3861544ddda5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Nov 14 14:48:07 2022 +0100

    Allow partial snapshot creation during ingestion
    
    This introduces a `create_partial_snapshot` parameter to the base loader constructor.
    When activated, during each call of the `store_data` method, if there are more data to
    fetch, this will create a partial snapshot.
    
    The final loop behaves as before, create the last visit with status 'full' targeting the
    snapshot.
    
    The main difference between the 2 behavior is that an ingestion with that parameter on
    is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
    in subsequent visits for the same origin. This may especially be interesting for cases
    when loading fail due to out of hand resources issues (e.g. large svn or git
    repositories).
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1020/ for more details.

ardumont marked an inline comment as done.

Add coverage (which is a bit convoluted but we are in loader-core so no real loader to
check that actual behavior beyond what i propose).

Build has FAILED

Patch application report for D6380 (id=31855)

Rebasing onto bf2cb039d5...

Current branch diff-target is up to date.
Changes applied before test
commit 272c48df3284445a3655d62069d931dc11df04f0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Nov 14 14:48:07 2022 +0100

    Allow partial snapshot creation during ingestion
    
    This introduces a `create_partial_snapshot` parameter to the base loader constructor.
    When activated, during each call of the `store_data` method, if there are more data to
    fetch, this will create a partial snapshot.
    
    The final loop behaves as before, create the last visit with status 'full' targeting the
    snapshot.
    
    The main difference between the 2 behavior is that an ingestion with that parameter on
    is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
    in subsequent visits for the same origin. This may especially be interesting for cases
    when loading fail due to out of hand resources issues (e.g. large svn or git
    repositories).
    
    Related to T3625

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1023/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1023/console

Build is green

Patch application report for D6380 (id=31858)

Rebasing onto bf2cb039d5...

Current branch diff-target is up to date.
Changes applied before test
commit 372822713584d39b907bacd1782671b146be39a8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Nov 14 14:48:07 2022 +0100

    Allow partial snapshot creation during ingestion
    
    This introduces a `create_partial_snapshot` parameter to the base loader constructor.
    When activated, during each call of the `store_data` method, if there are more data to
    fetch, this will create a partial snapshot.
    
    The final loop behaves as before, create the last visit with status 'full' targeting the
    snapshot.
    
    The main difference between the 2 behavior is that an ingestion with that parameter on
    is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
    in subsequent visits for the same origin. This may especially be interesting for cases
    when loading fail due to out of hand resources issues (e.g. large svn or git
    repositories).
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1024/ for more details.

swh/loader/core/loader.py
338

This should use a keyword argument, for readability (make it clear what the argument means)

338–351

The current code uses slightly ambiguous "create_partial_snapshot" parameter to store_data, which makes it create an OVS in some cases; and create the OVS here in other cases.

Instead, you can create the OVS here, simply by moving it inside the loop. This keeps store_data as it was before the diff (doesn't need to concern itself with creating objects) and avoid duplicating OVS creation.

Just because Phabricator mangles the diff, this is what the new code should look like:

while True:
    t1 = time.monotonic()
    more_data_to_fetch = self.fetch_data()
    t2 = time.monotonic()
    total_time_fetch_data += t2 - t1

    more_data_to_fetch = self.process_data() and more_data_to_fetch
    t3 = time.monotonic()
    total_time_process_data += t3 - t2
    status = "partial" if more_data_to_fetch else self.visit_status()

    self.store_data()
    visit_status = OriginVisitStatus(
        origin=self.origin.url,
        visit=self.visit.visit,
        type=self.visit_type,
        date=now(),
        status=status,
        snapshot=self.loaded_snapshot_id,
    )
    self.storage.origin_visit_status_add([visit_status])

    t4 = time.monotonic()
    total_time_store_data += t4 - t3
    if not more_data_to_fetch:
        break

self.statsd_timing("fetch_data", total_time_fetch_data * 1000.0)
self.statsd_timing("process_data", total_time_process_data * 1000.0)
self.statsd_timing("store_data", total_time_store_data * 1000.0)
swh/loader/core/loader.py
338–351

oops, you should move the status = "partial" if more_data_to_fetch else self.visit_status() line after the call to store_data in both my snippets

Adapt according to discussion

Build is green

Patch application report for D6380 (id=31885)

Rebasing onto 31ab1aa69e...

Current branch diff-target is up to date.
Changes applied before test
commit 91d7b4bf348951c26c6a08554722522305e91da5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Nov 14 14:48:07 2022 +0100

    Allow partial snapshot creation during ingestion
    
    This introduces a `create_partial_snapshot` parameter to the base loader constructor.
    When activated, during each call of the `store_data` method, if there are more data to
    fetch, this will create a partial snapshot (and an associated visit status).
    
    The final loop behaves as before, create the last visit with status 'full' targeting the
    final snapshot.
    
    The main difference between the 2 behaviors is that an ingestion with that parameter on
    is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
    in subsequent visits for the same origin. This may especially be interesting for cases
    when loading fail due to out of hand resources issues (e.g. large svn or git
    repositories).
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1028/ for more details.

Build is green

Patch application report for D6380 (id=31886)

Rebasing onto 31ab1aa69e...

Current branch diff-target is up to date.
Changes applied before test
commit ad5c8f49c1dea00e95b895811df21e3c93a6667d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Nov 14 14:48:07 2022 +0100

    Allow partial snapshot creation during ingestion
    
    This introduces a `create_partial_snapshot` parameter to the base loader constructor.
    When activated, during each call of the `store_data` method, if there are more data to
    fetch, this will create a partial snapshot (and an associated visit status).
    
    The final loop behaves as before, create the last visit with status 'full' targeting the
    final snapshot.
    
    The main difference between the 2 behaviors is that an ingestion with that parameter on
    is more verbose in terms of origin_visit_status. This, in turn, allows to be incremental
    in subsequent visits for the same origin. This may especially be interesting for cases
    when loading fail due to out of hand resources issues (e.g. large svn or git
    repositories).
    
    Related to T3625

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1029/ for more details.