Limit the number of entries in the cache
Closed, Public

Authored by vsellier on Jun 27 2022, 3:58 PM.

Details

Summary

After each revision, the implementation tests whether there are
more than 100 000 entries of any kind in the cache. If so,
it flushes the content.
It's quite naive, but memory usage seems to stay around 5 GB
at most on big repositories like linux.
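
As a rough illustration of the check described above, here is a minimal sketch, not the code from this diff. The class name `BoundedCache`, its methods, and the three entry kinds are hypothetical; only the 100 000 threshold and the "flush when any kind exceeds it" rule come from the summary.

```python
from typing import Any, Dict

# Threshold taken from the summary; the constant name is an assumption.
MAX_CACHE_ENTRIES = 100_000


class BoundedCache:
    """Toy cache holding several kinds of entries (hypothetical layout)."""

    def __init__(self, max_entries: int = MAX_CACHE_ENTRIES) -> None:
        self.max_entries = max_entries
        # Illustrative kinds only; the real cache layout is not shown here.
        self.kinds: Dict[str, Dict[Any, Any]] = {
            "content": {},
            "directory": {},
            "revision": {},
        }
        self.flush_count = 0

    def add(self, kind: str, key: Any, value: Any) -> None:
        self.kinds[kind][key] = value

    def is_full(self) -> bool:
        # True as soon as one kind holds more entries than the limit.
        return any(
            len(entries) > self.max_entries for entries in self.kinds.values()
        )

    def flush(self) -> None:
        # A real implementation would write the entries to storage first;
        # this sketch only empties the in-memory containers.
        for entries in self.kinds.values():
            entries.clear()
        self.flush_count += 1

    def flush_if_necessary(self) -> bool:
        """Flush only when the limit is exceeded; report whether it happened."""
        if self.is_full():
            self.flush()
            return True
        return False
```

In this sketch the caller would invoke `flush_if_necessary()` once per processed revision, so the cache can overshoot the limit by at most the number of entries a single revision adds.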

This can still be an issue with the current zeromq client: because
the origins are sorted by repository, a lot of snapshots of big
repositories can be ingested in parallel.
With the journal implementation, the load should be better distributed
across repositories of different sizes.

Related to T4313

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8040 (id=28949)

Rebasing onto 80434e3b21...

Current branch diff-target is up to date.
Changes applied before test
commit 6bf00a395eca96eaa04c07a2168389a8d6ab85e6
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Jun 27 15:48:47 2022 +0200

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/609/ for more details.

Build is green

Patch application report for D8040 (id=28950)

Rebasing onto 80434e3b21...

Current branch diff-target is up to date.
Changes applied before test
commit 571477be0498f2fbaf35fcca4b307447eeb430b5
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Jun 27 15:48:47 2022 +0200

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/610/ for more details.

vlorentz added inline comments.
swh/provenance/origin.py
107–111

saves a syscall when unnecessary

swh/provenance/provenance.py
101–113
116–128

would this work?

swh/provenance/origin.py
107–111

nvm, I had missed that the point was to time the function in the conditional

lgtm, especially better if you attend to val's good suggestions ;)

A couple of docstring suggestions inline.

swh/provenance/provenance.py
124
swh/provenance/interface.py
252–254
This revision is now accepted and ready to land. Jun 27 2022, 5:10 PM
vsellier marked 6 inline comments as done.

Update according to the reviews:

  • simplify the cache management
  • fix the docstrings
swh/provenance/provenance.py
101–113

nice thanks

116–128

It works with a test on the data presence, because directory_flatten and revision_before_revision are simple dicts.

Build has FAILED

Patch application report for D8040 (id=28958)

Rebasing onto 80434e3b21...

Current branch diff-target is up to date.
Changes applied before test
commit b4f9226cffe69a211dbe0204541bd49e89daa8df
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Jun 27 15:48:47 2022 +0200

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/611/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/611/console

Build has FAILED

Patch application report for D8040 (id=28959)

Rebasing onto 80434e3b21...

Current branch diff-target is up to date.
Changes applied before test
commit 35c92963798c96475165aa0e132f5936be66e6f5
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Jun 27 15:48:47 2022 +0200

Link to build: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/612/
See console output for more information: https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/612/console

Build is green

Patch application report for D8040 (id=28962)

Rebasing onto 80434e3b21...

Current branch diff-target is up to date.
Changes applied before test
commit f5f741366383cff7e7f173a79f656e9c6e159602
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Jun 27 15:48:47 2022 +0200

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/613/ for more details.