Page MenuHomeSoftware Heritage

buffer: add a threshold for the number of directory entries in one batch
ClosedPublic

Authored by olasd on Oct 8 2021, 3:06 PM.

Details

Summary

The size of individual directories is essentially unbounded. This means
that, when the buffer storage is used as a way of limiting memory use
for an ingestion process, it is still possible to go beyond the expected
memory use when adding a batch of (very) large directories.

The duration of the database operation for directory_add is also
commensurate to the number of directory entries added in a batch, so
using the buffer proxy to limit the time individual database operations
takes was not effective.

Adding a threshold on cumulated number of directory entries per batch
makes this overuse of memory and of database transaction time much less
likely.

Depends on D6427
Related to T3625

Test Plan

new test added

Diff Detail

Event Timeline

Make test slightly more robust against changes in test data

Build is green

Patch application report for D6443 (id=23400)

Could not rebase; Attempt merge onto 113088ab06...

Updating 113088ab..956186d5
Fast-forward
 swh/storage/proxies/buffer.py    | 21 +++++++++++++--
 swh/storage/proxies/filter.py    | 55 ++++++++++++++++++++++++++++++----------
 swh/storage/tests/test_buffer.py | 34 +++++++++++++++++++++++++
 swh/storage/tests/test_filter.py | 51 +++++++++++++++++++++++++++++++++++++
 4 files changed, 146 insertions(+), 15 deletions(-)
Changes applied before test
commit 956186d55eeff4160ca633cb3764e2be8d58bacc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Oct 8 14:23:01 2021 +0200

    buffer: add a threshold for the number of directory entries in one batch
    
    The size of individual directories is essentially unbounded. This means
    that, when the buffer storage is used as a way of limiting memory use
    for an ingestion process, it is still possible to go beyond the expected
    memory use when adding a batch of (very) large directories.
    
    The duration of the database operation for directory_add is also
    commensurate to the number of directory entries added in a batch, so
    using the buffer proxy to limit the time individual database operations
    takes was not effective.
    
    Adding a threshold on cumulated number of directory entries per batch
    makes this overuse of memory and of database transaction time much less
    likely.

commit abe95b34a2f9d8e16b16de7f406c61d16d753339
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit c52b7b66791166a80239710095764abfd9609c65
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 5d5d4c941eacf622de42b4454baa32efa3207d45
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Ensure that we don't send data from empty buffers
    
    This was already the case (as grouper called on an empty iterator just
    returns no batches), but add a test to enforce it.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1444/ for more details.

Build is green

Patch application report for D6443 (id=23401)

Could not rebase; Attempt merge onto 113088ab06...

Updating 113088ab..5edc0ba7
Fast-forward
 swh/storage/proxies/buffer.py    | 21 +++++++++++++--
 swh/storage/proxies/filter.py    | 55 ++++++++++++++++++++++++++++++----------
 swh/storage/tests/test_buffer.py | 34 +++++++++++++++++++++++++
 swh/storage/tests/test_filter.py | 51 +++++++++++++++++++++++++++++++++++++
 4 files changed, 146 insertions(+), 15 deletions(-)
Changes applied before test
commit 5edc0ba7ac12a6ba1a1e4200c46e08237c638671
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Oct 8 14:23:01 2021 +0200

    buffer: add a threshold for the number of directory entries in one batch
    
    The size of individual directories is essentially unbounded. This means
    that, when the buffer storage is used as a way of limiting memory use
    for an ingestion process, it is still possible to go beyond the expected
    memory use when adding a batch of (very) large directories.
    
    The duration of the database operation for directory_add is also
    commensurate to the number of directory entries added in a batch, so
    using the buffer proxy to limit the time individual database operations
    takes was not effective.
    
    Adding a threshold on cumulated number of directory entries per batch
    makes this overuse of memory and of database transaction time much less
    likely.

commit abe95b34a2f9d8e16b16de7f406c61d16d753339
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit c52b7b66791166a80239710095764abfd9609c65
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 5d5d4c941eacf622de42b4454baa32efa3207d45
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Ensure that we don't send data from empty buffers
    
    This was already the case (as grouper called on an empty iterator just
    returns no batches), but add a test to enforce it.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1445/ for more details.

olasd requested review of this revision.Oct 8 2021, 3:21 PM
This revision is now accepted and ready to land.Oct 8 2021, 3:46 PM