Page MenuHomeSoftware Heritage

fs: history: add by-hash/ sharded directory
ClosedPublic

Authored by haltode on Nov 5 2020, 12:21 PM.

Diff Detail

Repository
rDFUSE FUSE virtual file system
Branch
feature/add-history-by-hash-sharding
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16924
Build 26109: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 26108: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4416 (id=15649)

Rebasing onto 46a48a1907...

Current branch diff-target is up to date.
Changes applied before test
commit 994e11e59521084ea90636efa1fe16350d06cf18
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Nov 5 12:19:06 2020 +0100

    fs: history: add by-hash/ sharded directory

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/145/ for more details.

zack requested changes to this revision.Nov 5 2020, 1:41 PM
zack added a subscriber: zack.

LGTM, I've only noted down a couple of minor issues

swh/fuse/fs/entry.py
124–126 ↗(On Diff #15649)

minor: sharding["length"] is a constant from the point of view of this code, you can cache it outside the for loop to avoid looking it up twice each loop iteration

146 ↗(On Diff #15649)

minor: you can move this prefix_len cached length to outside the for loop, and also the conf lookup of sharding > length just below, I think

This revision now requires changes to proceed.Nov 5 2020, 1:41 PM

Fix zack comments (cache outside of loop)

Build is green

Patch application report for D4416 (id=15656)

Rebasing onto 46a48a1907...

Current branch diff-target is up to date.
Changes applied before test
commit db41b75df1e39cf6df021dee77617003712a81f3
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Nov 5 12:19:06 2020 +0100

    fs: history: add by-hash/ sharded directory

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/146/ for more details.

seirl added inline comments.
swh/fuse/fs/entry.py
124–126 ↗(On Diff #15649)

Also you should do:

for i in range(0, sharding['depth']):
    name += basename[i * length: (i + 1) * length]

Do not store history in memory

Build is green

Patch application report for D4416 (id=15681)

Rebasing onto 46a48a1907...

Current branch diff-target is up to date.
Changes applied before test
commit 6eca12205996992a6031a337bd168ae32afe8387
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Nov 6 10:41:10 2020 +0100

    WIP: do not store history in memory

commit db41b75df1e39cf6df021dee77617003712a81f3
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Nov 5 12:19:06 2020 +0100

    fs: history: add by-hash/ sharded directory

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/147/ for more details.

  • clean duplicate code
  • remove cache asizeof

Build is green

Patch application report for D4416 (id=15712)

Rebasing onto 46a48a1907...

Current branch diff-target is up to date.
Changes applied before test
commit 357e62ae5cae1ab60df15c3953a493f9c0eb8dde
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Nov 6 15:53:51 2020 +0100

    WIP: remove cache asizeof

commit 375f7b32bba64f19abf9efeef3cc10cfe5825428
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Nov 6 10:41:10 2020 +0100

    WIP: do not store history in memory

commit db41b75df1e39cf6df021dee77617003712a81f3
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Nov 5 12:19:06 2020 +0100

    fs: history: add by-hash/ sharded directory

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/148/ for more details.

I think I understand what your fill_direntry_cache function is trying to do: you want to avoid fetching the history multiple times by doing the request only once and writing the direntry cache of all the children recursively?
Would it be maybe better to instead have a small LRU cache for the API queries, and keep the direntry code simple and fully lazy?

swh/fuse/fs/artifact.py
235–238

More pythonic (and probably more efficient) without string concatenation:

parts = [
    basename[i * sharding_length : (i + 1) * sharding_length]
    for i in range(sharding_depth)
]
parts.append(str(swhid))
name = os.path.join(parts)
  • Fix string construction
  • Rework commit messages

Build is green

Patch application report for D4416 (id=15871)

Rebasing onto 46a48a1907...

Current branch diff-target is up to date.
Changes applied before test
commit 604665ab40ec3c5350bda1da57b6f11f5eabf077
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Nov 6 15:53:51 2020 +0100

    cache: replace asizeof() call with simpler heuristic
    
    The asizeof() introduced a very heavy overhead which made swhfs unusable
    when relying on many asizeof calls (eg: storing history exploration in
    the direntry cache with a history containing >1000 commits).
    
    The heuristic to calculate the size of a FuseEntry is based on
    experimental results from repositories with various history sizes (10K,
    100k, ...)

commit 2b1a725ad68dfa0944fb8b77667671f7fdc4c678
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Nov 6 10:41:10 2020 +0100

    fs: history: do not store history list in memory
    
    Pre-fill the direntry cache with all the sub-entries once the history
    has been retrieved.

commit db41b75df1e39cf6df021dee77617003712a81f3
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Nov 5 12:19:06 2020 +0100

    fs: history: add by-hash/ sharded directory

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/149/ for more details.

This revision is now accepted and ready to land.Nov 13 2020, 4:03 PM