Page MenuHomeSoftware Heritage

swh.storage filter/buffer improvements
ClosedPublic

Authored by olasd on Oct 6 2021, 6:41 PM.

Details

Summary

buffer: Don't send data from empty buffers
filter: do not call the underlying functions if there's nothing to add
filter: add filtering for release_add

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D6427 (id=23351)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit 9fdeb676868b6c927922a31f3ab72f6a3dae5690
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:30:56 2021 +0200

    buffer: add some debug logging for number of objects sent

commit ebbeff524bff6b37e55fd8b791f62bf7cfe0d195
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit 2f882f56aa92efb45b97f55094ff26f8ab55db78
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 2959c474e8af5e6c0e986207673716b10d6aef87
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Don't send data from empty buffers

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1434/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1434/console

Return proper stats in the filter shortcuts

Build has FAILED

Patch application report for D6427 (id=23353)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit 404e5a3bf893a7a626452e90f2f8ae5f9ab7c7dc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:30:56 2021 +0200

    buffer: add some debug logging for number of objects sent

commit 4129cd0d9fba6d31ee1f63a04a5dfe1cac4b70c3
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit 15029c871061231ba136e7a07fac3e8f6816b87c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 2959c474e8af5e6c0e986207673716b10d6aef87
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Don't send data from empty buffers

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1435/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1435/console

I've triggered a run in staging worker (patched venv with this) on NixOS/nixpkgs.
It's still ongoing.

But i'm already seeing it's doing the right thing about few directories
having lots of directory entries (new threshold introduced here helped).

DEBUG:swh.storage.proxies.buffer:Flushing 46 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 202727 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 46 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 203098 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 46 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 203210 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 1000 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 95783 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 1000 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 21288 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 1000 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 152126 directory entries
DEBUG:swh.storage.proxies.buffer:Flushing 327 objects of type directory
DEBUG:swh.storage.proxies.buffer:Flushed 200418 directory entries

It's not complete though, it's currently failing beyond revision.
Probably the new stuff about release.

DEBUG:swh.storage.proxies.buffer:Flushing 67 objects of type revision
DEBUG:swh.storage.proxies.buffer:Flushing 30 objects of type release
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/core/loader.py", line 339, in load
    self.store_data()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/core/loader.py", line 467, in store_data
    self.flush()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/core/loader.py", line 168, in flush
    self.storage.flush()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 193, in flush
    stats = add_fn(list(batch))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 84, in release_add
    missing_ids = self._filter_missing_ids("release", (r.id for r in releases))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 142, in _filter_missing_ids
    fn = fn_by_object_type[object_type]
KeyError: 'release'
DEBUG:swh.storage.proxies.buffer:Flushing 67 objects of type revision
DEBUG:swh.storage.proxies.buffer:Flushing 30 objects of type release
Traceback (most recent call last):
  File "/home/swhworker/profiling/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/swhworker/profiling/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/cli.py", line 105, in run
    result = loader.load()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/core/loader.py", line 382, in load
    self.flush()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/loader/core/loader.py", line 168, in flush
    self.storage.flush()
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 193, in flush
    stats = add_fn(list(batch))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 84, in release_add
    missing_ids = self._filter_missing_ids("release", (r.id for r in releases))
  File "/home/swhworker/profiling/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 142, in _filter_missing_ids
    fn = fn_by_object_type[object_type]
KeyError: 'release'
swh/storage/proxies/filter.py
37 ↗(On Diff #23353)

yes, @olasd, regarding my last comment [1], adding release here should fix the problem.

[1] D6427#167010

swh/storage/proxies/filter.py
37 ↗(On Diff #23353)

Ah, yeah, duh.

Time to clean this up and make sure everything is tested...

vlorentz added inline comments.
swh/storage/proxies/buffer.py
178–188 ↗(On Diff #23353)

small nitpick

fwiw, the complete run log (with tracemalloc) [1]

[1]

Update tests for the following changes:

  • buffer: Ensure that we don't send data from empty buffers
  • filter: do not call the underlying functions if there's nothing to add
  • filter: add filtering for release_add
olasd published this revision for review.Oct 8 2021, 2:13 PM
olasd retitled this revision from Memory usage improvements to the buffer/filter proxies to swh.storage filter/buffer improvements.
olasd edited the summary of this revision. (Show Details)

I'll split off the new buffer thresholds in a new diff. This diff now only contains the (small) improvements to the buffer/filter proxies

Build was aborted

Patch application report for D6427 (id=23399)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit abe95b34a2f9d8e16b16de7f406c61d16d753339
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit c52b7b66791166a80239710095764abfd9609c65
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 5d5d4c941eacf622de42b4454baa32efa3207d45
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Ensure that we don't send data from empty buffers
    
    This was already the case (as grouper called on an empty iterator just
    returns no batches), but add a test to enforce it.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1442/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1442/console

olasd marked an inline comment as done.Oct 8 2021, 2:35 PM

Build is green

Patch application report for D6427 (id=23399)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit abe95b34a2f9d8e16b16de7f406c61d16d753339
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:26:04 2021 +0200

    filter: add filtering for release_add

commit c52b7b66791166a80239710095764abfd9609c65
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:25:21 2021 +0200

    filter: do not call the underlying functions if there's nothing to add

commit 5d5d4c941eacf622de42b4454baa32efa3207d45
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Oct 6 18:21:26 2021 +0200

    buffer: Ensure that we don't send data from empty buffers
    
    This was already the case (as grouper called on an empty iterator just
    returns no batches), but add a test to enforce it.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1443/ for more details.

douardda added a subscriber: douardda.

looks fine to me

This revision is now accepted and ready to land.Oct 8 2021, 2:53 PM