buffer: add a threshold for the estimated size of revision and release batches
Closed, Public

Authored by olasd on Oct 8 2021, 3:58 PM.

Details

Summary

The size of individual revisions and releases is essentially unbounded.
This means that, when the buffer storage is used as a way of limiting
memory use for an ingestion process, it is still possible to go beyond
the expected memory use when adding a batch of revisions or releases
with large messages or other metadata.

The duration of the database operations for revision_add or release_add is also
commensurate with the size of the objects added in a batch, so
using the buffer proxy to limit the time individual database operations
take was not effective.

Adding a threshold on estimated sizes for batches of revision and
release objects makes this overuse of memory and of database transaction
time much less likely.
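
A minimal sketch of the idea, not the code in swh/storage/proxies/buffer.py: a buffer that flushes when either the number of pending objects or their estimated cumulative size reaches a threshold. The names SizeBoundedBuffer, max_objects and max_bytes are hypothetical, and the size estimate here is simply len() on byte strings standing in for revision or release messages.

```python
from typing import Callable, Iterable, List


class SizeBoundedBuffer:
    """Hypothetical buffer flushing on object count or estimated byte size."""

    def __init__(
        self,
        flush_callback: Callable[[List[bytes]], None],
        max_objects: int = 1000,
        max_bytes: int = 1024 * 1024,
    ) -> None:
        self._flush_callback = flush_callback
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self._objects: List[bytes] = []
        self._estimated_size = 0

    def add(self, objects: Iterable[bytes]) -> None:
        for obj in objects:
            self._objects.append(obj)
            # Assumption: the object's length is a good enough size estimate.
            self._estimated_size += len(obj)
            if (
                len(self._objects) >= self.max_objects
                or self._estimated_size >= self.max_bytes
            ):
                self.flush()

    def flush(self) -> None:
        # Hand the pending batch to the backend, then start a fresh batch.
        if self._objects:
            self._flush_callback(self._objects)
        self._objects = []
        self._estimated_size = 0


batches: List[List[bytes]] = []
buf = SizeBoundedBuffer(batches.append, max_objects=10, max_bytes=100)
# Two 60-byte objects cross the 100-byte threshold well before the count limit.
buf.add([b"x" * 60, b"y" * 60, b"z" * 10])
buf.flush()
```

With a count-only threshold, the three objects above would have stayed in a single pending batch; the size threshold splits them as soon as the estimate reaches max_bytes.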

Depends on D6445
Related to T3625

Test Plan

New tests added for the new thresholds.

Event Timeline

Build is green

Patch application report for D6446 (id=23410)

Could not rebase; Attempt merge onto 5edc0ba7ac...

Updating 5edc0ba7..1db72a0e
Fast-forward
 swh/storage/proxies/buffer.py    | 75 +++++++++++++++++++++++++++++++++++++++-
 swh/storage/tests/test_buffer.py | 75 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 148 insertions(+), 2 deletions(-)
Changes applied before test
commit 1db72a0e005c8201dddfca1806044659aa8f87c7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Oct 8 15:44:42 2021 +0200

    buffer: add a threshold for the estimated size of revision and release batches
    
    The size of individual revisions and releases is essentially unbounded.
    This means that, when the buffer storage is used as a way of limiting
    memory use for an ingestion process, it is still possible to go beyond
    the expected memory use when adding a batch of revisions or releases
    with large messages or other metadata.
    
    The duration of the database operations for revision_add or release_add is also
    commensurate to the size of the objects added in a batch, so
    using the buffer proxy to limit the time individual database operations
    takes was not effective.
    
    Adding a threshold on estimated sizes for batches of revision and
    release objects makes this overuse of memory and of database transaction
    time much less likely.

commit 7c5b0ec15e40ce7cb91b8a50beefe29d6dc8faf7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Oct 8 15:13:59 2021 +0200

    buffer: add a threshold for the number of revision parents in one batch
    
    The size of individual revisions is essentially unbounded. This means
    that, when the buffer storage is used as a way of limiting memory use
    for an ingestion process, it is still possible to go beyond the expected
    memory use when adding a batch of revisions with extensive histories.
    
    The duration of the database operation for revision_add is also
    commensurate to the number of revision parents added in a batch, so
    using the buffer proxy to limit the time individual database operations
    takes was not effective.
    
    Adding a threshold on cumulated number of revision parents per batch
    makes this overuse of memory and of database transaction time much less
    likely.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1447/ for more details.

olasd requested review of this revision. Oct 8 2021, 4:08 PM
ardumont added inline comments.
swh/storage/proxies/buffer.py, line 200:

oops ;)

lgtm, providing the typo is fixed ;)

This revision is now accepted and ready to land. Oct 8 2021, 4:53 PM

Fix revision -> release typo in release_add flush call

Build is green

Patch application report for D6446 (id=23420)

Rebasing onto 7c5b0ec15e...

Current branch diff-target is up to date.
Changes applied before test
commit b6040142fe723771f43ffef75b2e1fc778641a42
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Oct 8 15:44:42 2021 +0200

    buffer: add a threshold for the estimated size of revision and release batches
    
    The size of individual revisions and releases is essentially unbounded.
    This means that, when the buffer storage is used as a way of limiting
    memory use for an ingestion process, it is still possible to go beyond
    the expected memory use when adding a batch of revisions or releases
    with large messages or other metadata.
    
    The duration of the database operations for revision_add or release_add is also
    commensurate to the size of the objects added in a batch, so
    using the buffer proxy to limit the time individual database operations
    takes was not effective.
    
    Adding a threshold on estimated sizes for batches of revision and
    release objects makes this overuse of memory and of database transaction
    time much less likely.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1449/ for more details.