
FUSE: rethink the visibility of files under archive/ and meta/, and possibly add a new cache/ entrypoint
Closed, Migrated

Description

This subsumes the initial need behind D4371.

The initial idea was to make ls under archive/ and meta/ return nothing, while still making other syscalls (like stat) work properly in those dirs.
Other proposals came up later and are still in flux (see discussion below).
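To make the initial idea concrete, here is a minimal sketch of a directory whose entries resolve by name but are never listed. It is not the swh-fuse implementation: the HiddenDir class and its readdir/lookup method names are hypothetical, chosen only to mirror what ls and stat would respectively hit.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenDir:
    """Hypothetical directory: entries exist but are never listed."""

    entries: dict = field(default_factory=dict)

    def readdir(self):
        # What `ls` would see: always an empty listing.
        return []

    def lookup(self, name):
        # What `stat` or `cd` would see: a direct lookup by name.
        if name not in self.entries:
            raise FileNotFoundError(name)
        return self.entries[name]


# Placeholder SWHID taken from the discussion, not a real identifier.
archive = HiddenDir({"swh:1:XYZ:abcd": {"type": "dir"}})
print(archive.readdir())
print(archive.lookup("swh:1:XYZ:abcd"))
```

The listing is empty while the direct lookup still succeeds, which is exactly the inconsistency discussed in the comments below.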

Event Timeline

zack triaged this task as Low priority. Nov 11 2020, 3:11 PM
zack created this task.

It occurred to me that, if we accept that archive/ and meta/ will return nothing when ls'd, we're accepting a fundamental inconsistency for them: file entries in there exist but are not user-visible.
If we are OK with that, we can also go a bit further, and find what I think is a win-win middle ground between this task and T2694.

I propose the following:

  • archive/ contains non-sharded <SWHID> entries, one for each <SWHID> in the cache, as it does now; these can be accessed as usual
  • ls archive/ will not list all SWHIDs (as envisaged by this task), but will return sharded sub-directories as we were imagining for T2694. Navigating those subdirs, one will eventually reach, e.g., ab/cd/swh:1:XYZ:abcd, which will just be a symlink to ../../swh:1:XYZ:abcd

(identical considerations apply for meta/)
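A rough sketch of the sharding described above, assuming two levels of two hex characters each (as in the ab/cd example) and a core SWHID with no qualifiers. The function names shard_path and symlink_target are made up for illustration.

```python
import posixpath


def shard_path(swhid, depth=2, width=2):
    """Return the sharded path under archive/ for a core SWHID."""
    # A core SWHID looks like swh:<version>:<type>:<hash>; shard on the hash.
    object_hash = swhid.rsplit(":", 1)[1]
    if len(object_hash) < depth * width:
        raise ValueError(f"hash too short to shard: {swhid}")
    shards = [object_hash[i * width:(i + 1) * width] for i in range(depth)]
    return posixpath.join(*shards, swhid)


def symlink_target(swhid, depth=2):
    """Return the relative target pointing back at the non-sharded entry."""
    return posixpath.join(*([".."] * depth), swhid)


# Placeholder SWHID taken from the discussion, not a real identifier.
swhid = "swh:1:XYZ:abcd"
link = shard_path(swhid)
target = symlink_target(swhid)
print(link)    # ab/cd/swh:1:XYZ:abcd
print(target)  # ../../swh:1:XYZ:abcd

# Resolving the link relative to its own directory lands on archive/<SWHID>.
resolved = posixpath.normpath(posixpath.join(posixpath.dirname(link), target))
print(resolved)  # swh:1:XYZ:abcd
```

Since every link resolves back to the flat entry, symlinks coming from other directories can keep pointing at archive/<SWHID> unchanged.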

This way:

  • we no longer have an explosion in the number of entries when running ls archive/, which was the original problem we were trying to solve with T2694, before falling back to the present task
  • it is still possible to list the full content of the cache, which seems an interesting feature to me (in the future we could even allow selectively purging the cache with rm, for instance!)
  • we no longer have a problem with complex symlink handling from other directories: all symlinks will just point into archive/ and meta/, as they do now

Thoughts?

zack renamed this task from FUSE: make ls archive/ meta/ return no result to FUSE: shard entries returned by ls {archive,meta}/, hiding {archive,meta}/SWHID entries. Dec 1 2020, 11:31 AM

New proposal (lather, rinse, repeat…) based on an idea from @seirl:

  • rework the entry points to have, under the mount point, archive/, meta/, and cache/
  • archive/ will be used as starting point (e.g., cd archive/<SWHID>) and target for SWHID symlinks, but ls in it will return nothing
  • cache/ is new with this proposal and its purpose is to allow navigating what is currently in the cache. For SWHIDs in the cache, its content will be sharded (to avoid exploding the number of entries in it). There are two points still to be discussed about this:
    • what to do with stuff that is in the cache, but is not identified by a SWHID, e.g., origins. We want to be able to navigate those, but it's not clear how to organize them. Maybe, in addition to SWHIDs, we should have a cache/origin(s)/ dir that contains URL-escaped origin URLs?
    • what to do with the 'meta/' dir. In principle it should not be browsable for consistency with archive/, but then JSON files will be hard to find. Should we link JSON files from the cache/ dir too? E.g., each SWHID will appear under cache twice, as cache/01/23/swh:1:dir:0123...ff/ and as cache/01/23/swh:1:dir:0123...ff.json
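As a sketch of the two open points above, the layout could be computed as follows. This is only an illustration under assumptions: the cache/origins/ name, percent-encoding as the URL escaping scheme, and the function names are not settled by this discussion, and the hash below is a shortened made-up placeholder.

```python
import posixpath
from urllib.parse import quote


def cache_entries(swhid):
    """Return the (artifact, JSON) paths under cache/ for a core SWHID."""
    object_hash = swhid.rsplit(":", 1)[1]
    # Two-level sharding on the first four hex characters, as in cache/01/23/.
    shard = posixpath.join("cache", object_hash[0:2], object_hash[2:4])
    return (
        posixpath.join(shard, swhid),
        posixpath.join(shard, swhid + ".json"),
    )


def origin_entry(url, origins_dir="origins"):
    """Return the path under cache/ for an origin URL (assumed layout)."""
    # safe="" also escapes "/" so the URL stays a single path component.
    return posixpath.join("cache", origins_dir, quote(url, safe=""))


artifact, metadata = cache_entries("swh:1:dir:0123abcd")
print(artifact)  # cache/01/23/swh:1:dir:0123abcd
print(metadata)  # cache/01/23/swh:1:dir:0123abcd.json

origin = origin_entry("https://example.org/repo.git")
print(origin)    # cache/origins/https%3A%2F%2Fexample.org%2Frepo.git
```

Escaping the slashes matters here: without it an origin URL would spread over several nested directories instead of appearing as one entry.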
zack renamed this task from FUSE: shard entries returned by ls {archive,meta}/, hiding {archive,meta}/SWHID entries to FUSE: rethink the visibility of files under archive/ and meta/, and possibly add a new cache/ entrypoint. Dec 3 2020, 1:40 PM
zack updated the task description.

We also need to discuss what exactly we put in cache/. I thought about symlinks to archive/ and meta/, what do you think? Removing the symlinks also means removing the data from the cache.

For the open questions, I think both of your suggestions are good ideas (resp. cache/origins, and having a .json symlink next to each artifact)

In T2771#53972, @seirl wrote:

We also need to discuss what exactly we put in cache/. I thought about symlinks to archive/ and meta/, what do you think? Removing the symlinks also means removing the data from the cache.

I didn't think of that, but yeah, symlinks to archive/meta sound good to me.

And ack also on the rm part (which we discussed on IRC yesterday, but I forgot to mention above).
The semantics of cache removal might not be entirely trivial to get right, but we can worry about that later.

For the open questions, I think both of your suggestions are good ideas (resp. cache/origins, and having a .json symlink next to each artifact)

OK, thanks for the feedback.

haltode changed the task status from Open to Work in Progress. Dec 7 2020, 5:11 PM
haltode moved this task from Backlog to In progress on the Software Heritage filesystem board.

The visibility part of this task is done. I filed a new task, T2889, about supporting the rm cache/{...} syntax.