git_bare: Optionally access the objstorage directly
Closed · Public

Authored by vlorentz on May 10 2021, 10:24 PM.

Details

Summary

Optionally access the objstorage directly instead of going through swh-storage.

This also allows batching queries, so it should be more efficient overall.

Depends on D5730.

Unit Tests: Failed

Time    Test
7 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.vault.tests.test_cli::test_cook_directory[directory--dir]
        __wrapped_mock_method__ = <function NonCallableMock.assert_called_with at 0x7f7d1bc49048> args = (<MagicMock id='140175241859144'>, 'directory', b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', <MagicMock spec='InMemoryVaultBackend' id='140175241697208'>, <object object at 0x7f7d197e3a80>, None) kwargs = {}, __tracebackhide__ = True
6 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.vault.tests.test_cli::test_cook_directory[revision-gitfast-rev]
        __wrapped_mock_method__ = <function NonCallableMock.assert_called_with at 0x7f7d1bc49048> args = (<MagicMock id='140175241024120'>, 'revision_gitfast', b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\...00\x00\x00\x00', <MagicMock spec='InMemoryVaultBackend' id='140175239002488'>, <object object at 0x7f7d197e3fa0>, None) kwargs = {}, __tracebackhide__ = True
75 ms   Jenkins > .tox.py3.lib.python3.7.site-packages.swh.vault.tests.test_backend::test_available
98 ms   Jenkins > .tox.py3.lib.python3.7.site-packages.swh.vault.tests.test_backend::test_cache_expire_oldest
107 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.vault.tests.test_backend::test_cache_expire_until

Full test results: 2 Failed · 83 Passed · 2 Skipped

Event Timeline

Build has FAILED

Patch application report for D5731 (id=20464)

Could not rebase; Attempt merge onto 35c9f519cd...

Updating 35c9f51..b4b60b4
Fast-forward
 requirements-swh.txt                    |   1 +
 swh/vault/cli.py                        |  15 +-
 swh/vault/cookers/__init__.py           |   6 +
 swh/vault/cookers/base.py               |  17 +-
 swh/vault/cookers/git_bare.py           | 291 ++++++++++++++++++++++++++++++++
 swh/vault/in_memory_backend.py          |   2 +-
 swh/vault/tests/test_cli.py             |   1 +
 swh/vault/tests/test_cookers.py         | 272 +++++++++++++++++++++--------
 swh/vault/tests/test_git_bare_cooker.py | 178 +++++++++++++++++++
 9 files changed, 708 insertions(+), 75 deletions(-)
 create mode 100644 swh/vault/cookers/git_bare.py
 create mode 100644 swh/vault/tests/test_git_bare_cooker.py
Changes applied before test
commit b4b60b4a678b48f6c86f0360a2c44e2a91630e7a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 22:23:39 2021 +0200

    git_bare: Optionally access the objstorage directly
    
    instead of going through swh-storage.
    
    This also allows batching queries, so it should be more efficient overall.

commit 57760c2d2333eb16ce569bdce2e090d28b2b60e4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:52:37 2021 +0200

    git_bare: Use batched content_get() instead of content_find()
    
    It is considerably faster (30% less run time on an average repo)

commit 7a04e787128212aab3bc0aa55f399ec83b40e2f6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:04:33 2021 +0200

    git_bare: Use directory_get_entries instead of directory_ls, it should be faster
    
    As it does not need to join with the content table.
    
    On small repositories with a warm cache, it doesn't seem to matter much, though.
    But it's also closer to a feature swh-graph will provide in the future,
    so it's a win anyway.

commit 0c0ff0146058c2280f3cc935993af78c8be710eb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 20:09:30 2021 +0200

    git_bare: Refactor the graph descent using explicit stacks instead of the call stack.
    
    This will allow batching large groups of objects, instead of being limited
    to those given as argument from a parent.

commit 43e735a7a5fc8c7f89275df8a03124358c0c3cc3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 7 11:30:10 2021 +0200

    git_bare: When possible, use swh-graph instead of swh-storage to query revision history
    
    We expect it to be more efficient eventually; but run time is equivalent so far.

commit 8007936a8aff0b29eefdd93bbe037b996d6b743d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 6 14:43:37 2021 +0200

    Run all directory tests on the gitfast cooker
    
    1. It increases test coverage
    2. test_revision_bogus_perms is now redundant (there is test_directory_bogus_perms)

commit 3e76bc5656d0aa1eb510dcfdaa3b6196f6ee5976
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 22:22:17 2021 +0200

    git_bare: Deduplicate object downloads and writes

commit 4052f53698454ac47a01d26d470b8ab4b0f77a6d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 19:30:34 2021 +0200

    Add a naive git bare cooker
    
    It can cook directories (by adding a synthetic revision pointing to it)
    and revisions.
    
    Current limitations:
    
    * It does not deduplicate directories and files at all, and queries
      all objects one by one.
    * No support for missing/absent contents
    * No support for missing submodules
    
    Tests reuse existing tests of the DirectoryCooker and
    RevisionGitfastCooker using parametrized pytest fixtures.

Link to build: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/91/
See console output for more information: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/91/console
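
The commit "Refactor the graph descent using explicit stacks instead of the call stack" in the report above explains that an explicit stack lets the cooker batch objects regardless of which parent they came from. A minimal sketch of that idea follows. It is not the code from git_bare.py: the names walk_in_batches, get_children_batch and batch_size are hypothetical, and the graph is a plain dictionary rather than swh-storage.

```python
from typing import Callable, Dict, Iterable, List, Set


def walk_in_batches(
    root_ids: Iterable[str],
    get_children_batch: Callable[[List[str]], Dict[str, List[str]]],
    batch_size: int = 100,
) -> List[List[str]]:
    """Visit every node reachable from root_ids and return the batches requested.

    get_children_batch is a hypothetical callable that takes a list of ids and
    returns a mapping from each id to the ids of its children.
    """
    # The explicit stack replaces recursion, so pending ids from unrelated
    # parents can be grouped into the same request.
    stack: List[str] = list(dict.fromkeys(root_ids))
    seen: Set[str] = set(stack)
    batches: List[List[str]] = []
    while stack:
        # Take up to batch_size pending ids off the top of the stack.
        batch = stack[-batch_size:]
        del stack[-batch_size:]
        batches.append(batch)
        for children in get_children_batch(batch).values():
            for child in children:
                if child not in seen:  # deduplicate: request each object once
                    seen.add(child)
                    stack.append(child)
    return batches


# Toy graph: "c" is shared by "a" and "b" but is only requested once.
toy_graph = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(walk_in_batches(["root"], lambda ids: {i: toy_graph[i] for i in ids}, batch_size=2))
# prints [['root'], ['a', 'b'], ['c']]
```

With a recursive descent, each call can only batch the children of a single parent; here the batch size is bounded only by how many ids are pending.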

Harbormaster returned this revision to the author for changes because remote builds failed. May 10 2021, 10:24 PM
Harbormaster failed remote builds in B21439: Diff 20464!

I suspect some gains could come from parallelizing objstorage accesses. It's probably worth doing directly in the objstorage get_batch method for the backends that na(t)ively support parallelism, e.g. azure_prefixed.
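
To illustrate the kind of parallelism suggested here, the following is a rough sketch using a thread pool from the standard library. It is not the swh-objstorage API: parallel_get_batch, get_one and max_workers are hypothetical names, and a dictionary stands in for a real backend. Threads are assumed to be a reasonable fit because fetching objects is mostly I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Optional


def parallel_get_batch(
    get_one: Callable[[bytes], Optional[bytes]],
    obj_ids: Iterable[bytes],
    max_workers: int = 8,
) -> Iterator[Optional[bytes]]:
    """Yield get_one(obj_id) for each id, in input order.

    get_one is a hypothetical single-object getter that returns None
    when the object is missing.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map runs the calls concurrently but yields results
        # in the order of obj_ids, so callers can zip ids with results.
        yield from executor.map(get_one, obj_ids)


# A dictionary stands in for the backend in this example.
fake_store = {b"a": b"content a", b"b": b"content b"}
print(list(parallel_get_batch(fake_store.get, [b"a", b"missing", b"b"])))
# prints [b'content a', None, b'content b']
```

Keeping the results in input order means a batched caller does not need to change when the backend switches from sequential to parallel fetching.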

swh/vault/cli.py
109–117

At this point it's probably sensible to pass all of these arguments as kwargs.

swh/vault/tests/test_cookers.py
203–205

Maybe call this argument use_objstorage or objstorage_direct?

447

Same here.

481

I think Path.is_symlink is a method rather than a property, so this assert is always true (although the next call would fail if it wasn't anyway).
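
The pitfall described in this comment can be reproduced with the standard library alone. The sketch below is illustrative and not taken from the diff; the function name symlink_checks and the file names are made up, and it assumes a platform where creating symlinks is permitted.

```python
import os
import tempfile
from pathlib import Path


def symlink_checks(tmp_dir: str) -> dict:
    """Create a regular file and a symlink to it, then compare the checks."""
    regular = Path(tmp_dir) / "regular.txt"
    regular.write_text("not a symlink")
    link = Path(tmp_dir) / "link.txt"
    os.symlink(regular, link)
    return {
        # Without parentheses this is the bound method object, which is
        # always truthy, so "assert regular.is_symlink" can never fail.
        "method_object_is_truthy": bool(regular.is_symlink),
        # With parentheses the method is called and the file is inspected.
        "regular_is_symlink": regular.is_symlink(),
        "link_is_symlink": link.is_symlink(),
    }


with tempfile.TemporaryDirectory() as tmp:
    print(symlink_checks(tmp))
# prints {'method_object_is_truthy': True, 'regular_is_symlink': False, 'link_is_symlink': True}
```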

apply comments:

  • all kwargs
  • rename param to direct_objstorage
  • Path.is_symlink is a function

Build has FAILED

Patch application report for D5731 (id=20499)

Could not rebase; Attempt merge onto 35c9f519cd...

Updating 35c9f51..e2e9244
Fast-forward
 requirements-swh.txt                    |   1 +
 swh/vault/cli.py                        |  15 +-
 swh/vault/cookers/__init__.py           |   6 +
 swh/vault/cookers/base.py               |  17 +-
 swh/vault/cookers/git_bare.py           | 289 ++++++++++++++++++++++++++++++++
 swh/vault/in_memory_backend.py          |   2 +-
 swh/vault/tests/test_cli.py             |   1 +
 swh/vault/tests/test_cookers.py         | 280 +++++++++++++++++++++++--------
 swh/vault/tests/test_git_bare_cooker.py | 181 ++++++++++++++++++++
 9 files changed, 715 insertions(+), 77 deletions(-)
 create mode 100644 swh/vault/cookers/git_bare.py
 create mode 100644 swh/vault/tests/test_git_bare_cooker.py
Changes applied before test
commit e2e924430a835b30151ff78fc1904fc8d67ac5b8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 22:23:39 2021 +0200

    git_bare: Optionally access the objstorage directly
    
    instead of going through swh-storage.
    
    This also allows batching queries, so it should be more efficient overall.

commit 66b54b5cdcc21e1512cd5959023827c43aac1b1d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:52:37 2021 +0200

    git_bare: Use batched content_get() instead of content_find()
    
    It is considerably faster (30% less run time on an average repo)

commit b00c56677bc046418b8be5b731845ca07af60a30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:04:33 2021 +0200

    git_bare: Use directory_get_entries instead of directory_ls, it should be faster
    
    As it does not need to join with the content table.
    
    On small repositories with a warm cache, it doesn't seem to matter much, though.
    But it's also closer to a feature swh-graph will provide in the future,
    so it's a win anyway.

commit bfcb0af5dc20ab7cfd84e0b7621d0978aece6de8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 20:09:30 2021 +0200

    git_bare: Refactor the graph descent using explicit stacks instead of the call stack.
    
    This will allow batching large groups of objects, instead of being limited
    to those given as argument from a parent.

commit 2ec60e27c75775f7073dd51947648a999748be35
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri May 7 11:30:10 2021 +0200

    git_bare: When possible, use swh-graph instead of swh-storage to query revision history
    
    We expect it to be more efficient eventually; but run time is equivalent so far.

commit 8007936a8aff0b29eefdd93bbe037b996d6b743d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 6 14:43:37 2021 +0200

    Run all directory tests on the gitfast cooker
    
    1. It increases test coverage
    2. test_revision_bogus_perms is now redundant (there is test_directory_bogus_perms)

commit 3e76bc5656d0aa1eb510dcfdaa3b6196f6ee5976
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 22:22:17 2021 +0200

    git_bare: Deduplicate object downloads and writes

commit 4052f53698454ac47a01d26d470b8ab4b0f77a6d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 19:30:34 2021 +0200

    Add a naive git bare cooker
    
    It can cook directories (by adding a synthetic revision pointing to it)
    and revisions.
    
    Current limitations:
    
    * It does not deduplicate directories and files at all, and queries
      all objects one by one.
    * No support for missing/absent contents
    * No support for missing submodules
    
    Tests reuse existing tests of the DirectoryCooker and
    RevisionGitfastCooker using parametrized pytest fixtures.

Link to build: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/101/
See console output for more information: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/101/console

This revision is now accepted and ready to land. May 11 2021, 1:47 PM

Build has FAILED

Patch application report for D5731 (id=20515)

Could not rebase; Attempt merge onto 545246e9af...

Updating 545246e..3bf5cbc
Fast-forward
 requirements-swh.txt            |   2 +-
 swh/vault/cli.py                |  11 +++-
 swh/vault/cookers/base.py       |   2 +
 swh/vault/cookers/git_bare.py   | 131 ++++++++++++++++++++++++++--------------
 swh/vault/tests/test_cookers.py |  67 ++++++++++++++++++--
 5 files changed, 160 insertions(+), 53 deletions(-)
Changes applied before test
commit 3bf5cbc4c41bee7dcf0d047471f56b4b8a524ac5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 22:23:39 2021 +0200

    git_bare: Optionally access the objstorage directly
    
    instead of going through swh-storage.
    
    This also allows batching queries, so it should be more efficient overall.

commit bea488d2eb3c26e77dc38ee4410b0820165409d7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:52:37 2021 +0200

    git_bare: Use batched content_get() instead of content_find()
    
    It is considerably faster (30% less run time on an average repo)

commit 6fb358d6aaeb3969c41c497a9ec9d24847220c51
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 21:04:33 2021 +0200

    git_bare: Use directory_get_entries instead of directory_ls, it should be faster
    
    As it does not need to join with the content table.
    
    On small repositories with a warm cache, it doesn't seem to matter much, though.
    But it's also closer to a feature swh-graph will provide in the future,
    so it's a win anyway.

commit e77069a5e1cab7630e912ac59b0aa1242346a95f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 20:09:30 2021 +0200

    git_bare: Refactor the graph descent using explicit stacks instead of the call stack.
    
    This will allow batching large groups of objects, instead of being limited
    to those given as argument from a parent.

Link to build: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/116/
See console output for more information: https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/116/console

Build is green

Patch application report for D5731 (id=20519)

Rebasing onto bea488d2eb...

Current branch diff-target is up to date.
Changes applied before test
commit 15a16d9da01d34550363eef2bf1735d9f39b4032
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 10 22:23:39 2021 +0200

    git_bare: Optionally access the objstorage directly
    
    instead of going through swh-storage.
    
    This also allows batching queries, so it should be more efficient overall.

See https://jenkins.softwareheritage.org/job/DVAU/job/tests-on-diff/117/ for more details.