Closes T2695.
Details
- Reviewers
zack seirl - Group Reviewers
Reviewers - Maniphest Tasks
- T2695: Cache directory entries to make readdir/lookup more efficient
- Commits
- rDFUSE46a48a190722: fuse: add cache on directories entries
Diff Detail
- Repository
- rDFUSE FUSE virtual file system
- Branch
- feature/add-direntry-lru-cache
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 16823 Build 25947: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 25946: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D4345 (id=15377)
Rebasing onto e8cb3c6d74...
Current branch diff-target is up to date.
Changes applied before test
commit a95ba75a1499f6ef98ab3a2db85ab5567b8b394b Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. Rationale: speed up readdir/lookup endpoints.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/128/ for more details.
A few consideration on the open design questions:
Rationale: speed up readdir/lookup endpoints.
By how much? We need to have an idea about the speedup, to understand the dimensioning of the cache. It would be useful to have a benchmark showing how much an absent cache compares to a (potentially infinite size) cache in terms of performance.
Also, we need a memory benchmark to understand how much various workload will require in terms of actual RAM to be fully cached. E.g., how much RAM would a cache need for a trivial use case, a medium one, and one in which you have the equivalent of an entire ls -R on the linux source tree in memory.
These are all data points that we need to make informed design decisions.
Should we bound the cache by number of directories or by number of total directories entries?
Should we have a limit on what entries size a directory can be cached?
We should think about this from the point of view of the user writing the configuration file. Any absolute number, no matter if of dentries or of directories will be entirely meaningless to them.
What users can relate to is RAM usage. So, ideally, your maxsize setting here should become a maxram one, ideally expressible both in absolute terms (MB/GB) and in relative terms (e.g., "5% of available RAM").
Once we have that, there are various ways to implement such a policy. One is to actually inspect memory usage at runtime, and trim the cache when needed. Another option is via approximation, e.g., we can measure how much RAM on average a dentry requires and trim the cache when an approximate threshold is passed.
- Use RAM usage instead of number of dir entries for LRU cache policy
- Parse % of RAM or B/KB/MB/GB in config file
Build is green
Patch application report for D4345 (id=15399)
Rebasing onto f9de49aca7...
First, rewinding head to replay your work on top of it... Applying: fuse: add cache on directories entries
Changes applied before test
commit ff473f14f256be80dbf9dae3e5fb866b68c58b72 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. Rationale: speed up readdir/lookup endpoints.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/129/ for more details.
Build is green
Patch application report for D4345 (id=15586)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit dea02223eef331e5211960bba6ab937420635029 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. Rationale: speed up readdir/lookup endpoints.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/138/ for more details.
Build is green
Patch application report for D4345 (id=15599)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit aebe777fb468e0650bc00c2f1a06c48a3f826dcd Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. - readdir() is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset. - lookup() is now amortized O(1) TODO.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/139/ for more details.
Build is green
Patch application report for D4345 (id=15602)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit 3066466a8d0451203c52ed9a93e033421f5b3c24 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. - readdir() is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset. - lookup() is now amortized O(1) TODO.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/140/ for more details.
One thing I don't really like here is that FuseEntries cannot easily list their own entries easily using the cache when available. I would much rather have the cache logic moved inside FuseEntry like what we discussed.
Build is green
Patch application report for D4345 (id=15622)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Nov 4 11:38:31 2020 +0100 aiter -> compute_entries commit 3066466a8d0451203c52ed9a93e033421f5b3c24 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. - readdir() is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset. - lookup() is now amortized O(1) TODO.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/141/ for more details.
Build is green
Patch application report for D4345 (id=15623)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit 97565bd69db2bb68c75c7addf90a10e8fa8fe349 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Nov 4 11:56:46 2020 +0100 use Sequence commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Nov 4 11:38:31 2020 +0100 aiter -> compute_entries commit 3066466a8d0451203c52ed9a93e033421f5b3c24 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. - readdir() is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset. - lookup() is now amortized O(1) TODO.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/142/ for more details.
Looks good apart from two small things.
swh/fuse/fs/entry.py | ||
---|---|---|
93 | You're doing a copy here |
swh/fuse/fs/artifact.py | ||
---|---|---|
72 ↗ | (On Diff #15623) | You should probably return Iterator[] here, and only use the covariant Sequence[] type in the base class. |
Build is green
Patch application report for D4345 (id=15625)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit df71c710abbe5e7b1822f50706d0285827263485 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Nov 4 11:56:46 2020 +0100 use Sequence commit 7b20c6575f71bd490a50a9d3e7fc9eba3fedd8a1 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Nov 4 11:38:31 2020 +0100 aiter -> compute_entries commit 3066466a8d0451203c52ed9a93e033421f5b3c24 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. - readdir() is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset. - lookup() is now amortized O(1) TODO.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/143/ for more details.
Build is green
Patch application report for D4345 (id=15629)
Rebasing onto 34eed7bc90...
Current branch diff-target is up to date.
Changes applied before test
commit 46a48a190722bd1a00c93b3331aa92ec354c8cd6 Author: Thibault Allançon <haltode@gmail.com> Date: Fri Oct 23 14:54:11 2020 +0200 fuse: add cache on directories entries Closes T2695. The readdir() endpoint is now O(N) instead of O(N²) since we do not go through all entries every time, but only those after the readdir() offset.
See https://jenkins.softwareheritage.org/job/DFUSE/job/tests-on-diff/144/ for more details.