Page MenuHomeSoftware Heritage

fuse: add cache on directories entries
ClosedPublic

Authored by haltode on Oct 23 2020, 4:43 PM.

Diff Detail

Repository
rDFUSE FUSE virtual file system
Branch
feature/add-direntry-lru-cache
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16628
Build 25631: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 25630: arc lint + arc unit

Event Timeline

haltode retitled this revision from fuse: add cache on directories entries to WIP fuse: add cache on directories entries.Oct 23 2020, 4:45 PM
haltode edited the summary of this revision. (Show Details)

Build is green

Patch application report for D4345 (id=15377)

Rebasing onto e8cb3c6d74...

Current branch diff-target is up to date.
Changes applied before test
commit a95ba75a1499f6ef98ab3a2db85ab5567b8b394b
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    Rationale: speed up readdir/lookup endpoints.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/128/ for more details.

zack requested changes to this revision.Oct 24 2020, 1:37 PM
zack added a subscriber: zack.

A few consideration on the open design questions:

Rationale: speed up readdir/lookup endpoints.

By how much? We need to have an idea about the speedup, to understand the dimensioning of the cache. It would be useful to have a benchmark showing how much an absent cache compares to a (potentially infinite size) cache in terms of performance.
Also, we need a memory benchmark to understand how much various workload will require in terms of actual RAM to be fully cached. E.g., how much RAM would a cache need for a trivial use case, a medium one, and one in which you have the equivalent of an entire ls -R on the linux source tree in memory.

These are all data points that we need to make informed design decisions.

Should we bound the cache by number of directories or by number of total directories entries?
Should we have a limit on what entries size a directory can be cached?

We should think about this from the point of view of the user writing the configuration file. Any absolute number, no matter if of dentries or of directories will be entirely meaningless to them.
What users can relate to is RAM usage. So, ideally, your maxsize setting here should become a maxram one, ideally expressible both in absolute terms (MB/GB) and in relative terms (e.g., "5% of available RAM").
Once we have that, there are various ways to implement such a policy. One is to actually inspect memory usage at runtime, and trim the cache when needed. Another option is via approximation, e.g., we can measure how much RAM on average a dentry requires and trim the cache when an approximate threshold is passed.

This revision now requires changes to proceed.Oct 24 2020, 1:37 PM
  • Use RAM usage instead of number of dir entries for LRU cache policy
  • Parse % of RAM or B/KB/MB/GB in config file

Build is green

Patch application report for D4345 (id=15399)

Rebasing onto f9de49aca7...

First, rewinding head to replay your work on top of it...
Applying: fuse: add cache on directories entries
Changes applied before test
commit ff473f14f256be80dbf9dae3e5fb866b68c58b72
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    Rationale: speed up readdir/lookup endpoints.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/129/ for more details.

Build is green

Patch application report for D4345 (id=15586)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit dea02223eef331e5211960bba6ab937420635029
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    Rationale: speed up readdir/lookup endpoints.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/138/ for more details.

Add offset and generator for direntries.

Build is green

Patch application report for D4345 (id=15599)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit aebe777fb468e0650bc00c2f1a06c48a3f826dcd
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    - readdir() is now O(N) instead of O(N²) since we do not go through all
      entries every time, but only those after the readdir() offset.
    - lookup() is now amortized O(1) TODO.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/139/ for more details.

Build is green

Patch application report for D4345 (id=15602)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit 3066466a8d0451203c52ed9a93e033421f5b3c24
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    - readdir() is now O(N) instead of O(N²) since we do not go through all
      entries every time, but only those after the readdir() offset.
    - lookup() is now amortized O(1) TODO.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/140/ for more details.

seirl requested changes to this revision.Nov 3 2020, 7:47 PM

One thing I don't really like here is that FuseEntries cannot easily list their own entries easily using the cache when available. I would much rather have the cache logic moved inside FuseEntry like what we discussed.

This revision now requires changes to proceed.Nov 3 2020, 7:47 PM

Build is green

Patch application report for D4345 (id=15622)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Nov 4 11:38:31 2020 +0100

    aiter -> compute_entries

commit 3066466a8d0451203c52ed9a93e033421f5b3c24
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    - readdir() is now O(N) instead of O(N²) since we do not go through all
      entries every time, but only those after the readdir() offset.
    - lookup() is now amortized O(1) TODO.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/141/ for more details.

Build is green

Patch application report for D4345 (id=15623)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit 97565bd69db2bb68c75c7addf90a10e8fa8fe349
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Nov 4 11:56:46 2020 +0100

    use Sequence

commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Nov 4 11:38:31 2020 +0100

    aiter -> compute_entries

commit 3066466a8d0451203c52ed9a93e033421f5b3c24
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    - readdir() is now O(N) instead of O(N²) since we do not go through all
      entries every time, but only those after the readdir() offset.
    - lookup() is now amortized O(1) TODO.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/142/ for more details.

seirl requested changes to this revision.Nov 4 2020, 2:41 PM

Looks good apart from two small things.

swh/fuse/fs/entry.py
93

You're doing a copy here

This revision now requires changes to proceed.Nov 4 2020, 2:41 PM
swh/fuse/fs/artifact.py
72 ↗(On Diff #15623)

You should probably return Iterator[] here, and only use the covariant Sequence[] type in the base class.

Use AsyncIterator in implementation and remove slicing

Build is green

Patch application report for D4345 (id=15625)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit df71c710abbe5e7b1822f50706d0285827263485
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Nov 4 11:56:46 2020 +0100

    use Sequence

commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Nov 4 11:38:31 2020 +0100

    aiter -> compute_entries

commit 3066466a8d0451203c52ed9a93e033421f5b3c24
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    - readdir() is now O(N) instead of O(N²) since we do not go through all
      entries every time, but only those after the readdir() offset.
    - lookup() is now amortized O(1) TODO.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/143/ for more details.

This revision is now accepted and ready to land.Nov 4 2020, 5:01 PM

Build is green

Patch application report for D4345 (id=15629)

Rebasing onto 34eed7bc90...

Current branch diff-target is up to date.
Changes applied before test
commit 46a48a190722bd1a00c93b3331aa92ec354c8cd6
Author: Thibault Allançon <haltode@gmail.com>
Date:   Fri Oct 23 14:54:11 2020 +0200

    fuse: add cache on directories entries
    
    Closes T2695.
    
    The readdir() endpoint is now O(N) instead of O(N²) since we do not go
    through all entries every time, but only those after the readdir()
    offset.

See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/144/ for more details.

haltode retitled this revision from WIP fuse: add cache on directories entries to fuse: add cache on directories entries.
haltode edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.