Page MenuHomeSoftware Heritage

Use a Merkle discovery algorithm with archives
ClosedPublic

Authored by Alphare on Sep 22 2022, 12:28 PM.

Details

Summary

"Discovery" is the term used to find out the differences between two
Merkle graphs. Using such an algorithm is useful in that it drastically
reduces the amount of data that needs to be transferred.

This commit introduces an efficient but simple algorithm that is a good
starting point for improved performance: random sampling of directories,
the details of which are explained in the docstrings.

Mercurial uses a more sophisticated algorithm for its discovery, but it
is quite a bit more involved and would introduce too much complexity at
once. Also, the constraints for speed that Mercurial has (in the order
of milliseconds) don't apply as obviously to this context without
further investigation.

Benchmarks

Setup

  • With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production
  • With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
  • Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
  • only taking into account the loading (not the downloading of the tarball, or its decompression)

Result

before: ~30s
after: ~17s

Reproduced 4 times.

Diff Detail

Event Timeline

Build is green

Patch application report for D8521 (id=30687)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit cb93ebabeebbd592a791b93fc034c2433ac88b41
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/885/ for more details.

Nice, thanks.

I got some questions and suggestions inline.

swh/loader/core/discovery.py
44

Naive question, can't we optimize a bit all those directories iterations into 1 loop (including the various set initialzations)?

That won't be less clear than the current initialization here anyway.

Ideally, i'd see only 2 loops, one over content and another over directories (which would init self.{_undecided_directories, _all, undecided} at the same time as well.

What do you think?

55
58

What's a node? Is it a directory or a content or both?

Indeed as per you irc comment, a bit of type would help here ;p

111

Another naive question ;), why until only contents are undecided?

159–160

let logging do the formatting per log level.

Alphare marked an inline comment as done.

Optimize loops, add types, improve docs and fix a bug

swh/loader/core/discovery.py
44

Yep, absolutely. I even have no need for _all for directories, so I'll clean this all up.

55

Modified this and all other docstrings that use the same conjugation.

58

I've added typing information and adjusted docstrings to change "nodes" to "directory entries".

111

I've written a more expansive docstring, please tell me if it still isn't clear.

129

Caught this mistake also. :)

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/console

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/console

Build is green

Patch application report for D8521 (id=30696)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit 995ae98a51f10e0028132ad8b13b143e1eafc2ea
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/888/ for more details.

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/console

ardumont added 1 blocking reviewer(s): Reviewers.

ok, thanks, it's clearer.

2 small nitpicks inline.

I'll let some other people from the team a while to have a look too.

swh/loader/core/discovery.py
35–47

Might as well do the union computation on undecided set on the fly within the same iteration loop.

88

please make the return explicit, that avoids asking oneself questions.

Alphare marked 2 inline comments as done.

Address nits

Build is green

Patch application report for D8521 (id=30702)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit dc781b5e67d5d6b3c09794b1040f56443a3a0b47
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/890/ for more details.

There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

swh/loader/core/discovery.py
23

Could you use a namedtuple instead?

And use sets instead of list, as they are converted to sets before use anyway

39–41

This is going to handle large numbers of objects, right? It's probably be faster to use self._all.update((content.sha1_git, content) for content in ...)

50

spares a copy

62–72

Please describe the meaning of all three arguments in the docstring.

And would be more natural to have target_set as last argument, as it's the one being mutated, IMO

79–80
100

spares a copy

107–109
110–112
121

ditto

Oops, sent too soon. Two more requests:

  • Use double backticks in docstrings (we use ReST, not Markdown)
swh/loader/core/discovery.py
46

Forget the bits about |=, they are fine. (thx to @olasd for pointing it out)

swh/loader/core/discovery.py
50

https://docs.python.org/3/library/stdtypes.html#frozenset.update claims that left_set |= right_set is supposed to be strictly equivalent to left_set.update(right_set).

79–80

and to_proceed should probably be called to_process?

There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

Sure, I'll check that out after lunch, I've addressed/answered your questions in the mean time.

swh/loader/core/discovery.py
39–41

We're going to be iterating over all contents twice if we do this since we need to populate undecided as well. Re-allocating more space in the dict a few times shouldn't be a big deal in comparison for large inputs.

107–109

skipped_contents_sample is a set of ids (after your recommendation, before that a list), not a dict.

110–112

*technically* faster, I'll allow it ;)

swh/loader/core/discovery.py
110–112

right, i missed that, set comprehension!;p

Build has FAILED

Patch application report for D8521 (id=30759)

Rebasing onto e1f2135f24...

Current branch diff-target is up to date.
Changes applied before test
commit 14df70ed3cef5ab5faf26dec1827b743b221ecf5
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/console

Build is green

Patch application report for D8521 (id=30760)

Rebasing onto 26fe954bd9...

First, rewinding head to replay your work on top of it...
Applying: Use a Merkle discovery algorithm with archives
Changes applied before test
commit 5add0728637e4374a2662beb5d608f0a24b49bef
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/892/ for more details.

Add coverage of random sampling and skipped content

Build is green

Patch application report for D8521 (id=30780)

Rebasing onto 26fe954bd9...

First, rewinding head to replay your work on top of it...
Applying: Use a Merkle discovery algorithm with archives
Changes applied before test
commit 2275d3b2cbf6458591a98005547794241152adca
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/893/ for more details.

vlorentz added inline comments.
swh/loader/package/archive/tests/test_archive.py
191

add assert set(_expected_new_contents_first_visit) | set(_expected_new_skipped_contents_first_visit) == set(_expected_new_contents_first_visit) here, to make sure test data is self-consistent.

This revision is now accepted and ready to land.Sep 26 2022, 5:06 PM

Add assertion for data consistency in tests

Build is green

Patch application report for D8521 (id=30785)

Rebasing onto 26fe954bd9...

First, rewinding head to replay your work on top of it...
Applying: Use a Merkle discovery algorithm with archives
Changes applied before test
commit 5339a9d5cefe6a6efdb77cdfa9d140ebba712288
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/894/ for more details.

Build is green

Patch application report for D8521 (id=30786)

Rebasing onto 26fe954bd9...

Current branch diff-target is up to date.
Changes applied before test
commit 798f749e66c883fa3a7c962681690166586c730f
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/895/ for more details.