
WIP: fuse design doc
AbandonedPublic

Authored by zack on Sep 17 2020, 1:28 PM.

Details

Reviewers
seirl
haltode
Group Reviewers
Reviewers
Summary

Design documentation for the FUSE file representation.

Diff Detail

Repository
rDGRPH Compressed graph representation
Branch
fuse-design-doc
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 15217
Build 23462: Phabricator diff pipeline on jenkins
Build 23461: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D3974 (id=14004)

Rebasing onto eaf0323a1c...

Current branch diff-target is up to date.
Changes applied before test
commit 5639ac580b443be5749efa80560e0a4e20208b3f
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Sep 17 13:27:19 2020 +0200

    WIP: fuse design doc

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/32/ for more details.

Add more context information on swh-graph.

Build is green

Patch application report for D3974 (id=14006)

Rebasing onto eaf0323a1c...

Current branch diff-target is up to date.
Changes applied before test
commit ff791874538d7155d38066e1d22cd03777faba69
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Sep 17 13:27:19 2020 +0200

    WIP: fuse design doc

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/33/ for more details.

zack requested changes to this revision.Sep 17 2020, 4:31 PM
zack added a subscriber: seirl.
zack added inline comments.
docs/fuse.rst
4–7

What I meant in our IRC conversation about this is that readers don't even need to know about swh-graph. So, my suggestion for this intro text would be something like (links to be added):

The Software Heritage data model (LINK) is a directed acyclic graph with nodes of different types that correspond to source code artifacts such as directories, commits, etc. Using this FUSE (LINK) module you can locally mount, and then navigate as a virtual filesystem, a part of the archive (a subgraph) rooted at a node of your choice, identified by a SWHID (LINK).

To retrieve information about the source code artifacts the FUSE module interacts over the network with the Software Heritage archive via the archive Web API (LINK).
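
To make the "rooted at a node of your choice, identified by a SWHID" part concrete, here is a minimal sketch of how a mount root could be validated. It only assumes the core SWHID syntax (`swh:1:<type>:<40 hex digits>`) and ignores qualifiers; `parse_swhid` and the `SWHID` tuple are hypothetical names, not part of any existing module.

```python
import re
from typing import NamedTuple

# Core SWHID syntax only: swh:1:<type>:<40 lowercase hex digits>.
# Qualifiers (";origin=...", etc.) are deliberately not handled in this sketch.
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class SWHID(NamedTuple):  # hypothetical helper type
    object_type: str  # 3-letter node type, e.g. "dir"
    object_id: str    # hex-encoded intrinsic identifier


def parse_swhid(text: str) -> SWHID:
    """Split a core SWHID into node type and identifier, or raise ValueError."""
    m = SWHID_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return SWHID(m.group(1), m.group(2))
```

The node type returned here is what the rest of the document would dispatch on to pick a VFS representation.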

17–19

Aside from the reference to the fact that the graph structure is "compressed" (that should go), why this?

I think files should not be empty: when opening cnt files one should get the file content.

Also, file entry names will, at least in the general case, not be SWHIDs, but the legitimate entry names.

Maybe this "by default" part should just go away, and will describe the file naming below, case by case?

24–26

If I understand the state of the discussion with @seirl correctly, we are now going towards:

  • not needing storage at all for directory browsing (all we need are names and perms, which will be available via the /graph API)
  • needing storage for retrieving file contents

Either way, all the endpoints we will need will be accessible via the Web API, so we can drop the conditionality on having storage or not.

Also, local/remote distinction is no longer relevant, as we'll always access stuff via the Web API.

There is potentially a conditionality on whether the Web API endpoints under /graph have access (or not) to the edge labels. I think the right way to go about it is some sort of graceful degradation in the code (e.g., all files have read-only perms, and we use SWHID names instead of entry names), rather than warn the user about it upfront.
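
A rough sketch of what I mean by graceful degradation, assuming the code receives an optional edge label for each directory entry. `EdgeLabel` and `entry_metadata` are made-up names for illustration, and real entry names may well be bytes rather than `str`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

READ_ONLY = 0o444  # fallback permissions when no edge label is available


@dataclass
class EdgeLabel:  # hypothetical: what an edge label might carry
    name: str
    perms: int


def entry_metadata(target_swhid: str, label: Optional[EdgeLabel]) -> Tuple[str, int]:
    """Return (entry name, permissions), degrading when labels are missing."""
    if label is None:
        # No edge labels: use the SWHID as name and read-only permissions.
        return target_swhid, READ_ONLY
    return label.name, label.perms
```

The caller never has to know whether labels were available, so no upfront warning to the user is needed.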

58–65

do we care?
I think this should be removed

This revision now requires changes to proceed.Sep 17 2020, 4:31 PM

Build is green

Patch application report for D3974 (id=14055)

Rebasing onto bc5614a2c6...

First, rewinding head to replay your work on top of it...
Applying: WIP: fuse design doc
Changes applied before test
commit df4ed1a6fde575aae50e23da82671528633decea
Author: Thibault Allançon <haltode@gmail.com>
Date:   Thu Sep 17 13:27:19 2020 +0200

    WIP: fuse design doc

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/38/ for more details.

zack requested changes to this revision.Sep 18 2020, 2:05 PM
zack added reviewers: seirl, Reviewers.
zack added inline comments.
docs/fuse.rst
18

improvement for this one: "Each archive element (or, equivalently, node in the archive Merkle DAG) is represented as one entity in the virtual file system (VFS). The type and content of file system entities depend on the type of archive element being represented. For each supported node type we describe below the corresponding VFS representation."

20

We can now switch to a more structured/formal style.
For instance, one section per node type. (As I'm adding a bunch of suggestions in the rest of the doc, having more "space" with sections rather than a top-level bullet point will help.)
And each section would have a:

  • file type line: regular file v. directory
  • content line (which will be the byte stuff for blobs v. directory content for dirs)
23

here we need to say:

  • for the directory itself:
    • that the directory will contain one entry for each entry in the directory archived by Software Heritage
  • for each entry:
    • that the entry name will be the original entry name in the archived dir (bonus point: adding a mention that there are no guarantees that the character encoding matches the file system encoding of the FUSE user. Although this might be something that impacts other parts of this document, so it might need to be factored out somewhere else. To be checked)
    • that the permission of the entry will be the original permissions, as archived
    • that the type of the entry will be the type of the VFS representation of the pointed object (as per intro)
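
To illustrate the per-entry rules above, a small sketch under stated assumptions: archived names are raw bytes, only cnt and dir targets are covered, and `ArchivedEntry`, `VfsEntry` and `to_vfs_entry` are hypothetical names. Decoding with `surrogateescape` is one way to cope with the "no encoding guarantee" caveat without losing bytes; it is not a decision.

```python
import stat
from dataclasses import dataclass


@dataclass
class ArchivedEntry:  # hypothetical view of one archived directory entry
    name: bytes        # original entry name, arbitrary encoding
    perms: int         # original permissions, as archived
    target_type: str   # "cnt" or "dir" in this sketch


@dataclass
class VfsEntry:  # hypothetical VFS-side representation
    name: str
    mode: int  # file type bits | permission bits


# Assumed mapping from target node type to VFS file type.
VFS_FILE_TYPE = {"cnt": stat.S_IFREG, "dir": stat.S_IFDIR}


def to_vfs_entry(entry: ArchivedEntry) -> VfsEntry:
    # Undecodable bytes become lone surrogates and can be encoded back losslessly.
    name = entry.name.decode("utf-8", errors="surrogateescape")
    file_type = VFS_FILE_TYPE[entry.target_type]
    # Keep only the permission bits of the archived mode.
    return VfsEntry(name, file_type | stat.S_IMODE(entry.perms))
```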
25

Looks like an important one is missing here: the parent commit(s).
I'm not sure how to go between optimizing for the most common case (i.e., a single parent commit, in 90% of the cases) and giving consistent access to all the parents when there is a merge commit (which might have an arbitrary number of parents).
Tentative proposal:

  • there is always a parents/ VFS entry, which is a directory containing one entry for each parent commit, numbered from 1 (first parent) onwards
    • open question: what to do when there are 0 parents (e.g., for the initial commit in a repo): we can either have an empty parents/ dir, or not have the dir at all, which is easier to check programmatically than the fact that a dir is empty
  • if and only if a commit has a single parent commit, there is a parent VFS entry pointing to that commit
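
The tentative proposal could be sketched as follows. `parent_entries` is a hypothetical helper that maps VFS paths to parent SWHIDs; it deliberately does not settle the zero-parents question (it simply returns an empty mapping in that case).

```python
from typing import Dict, List


def parent_entries(parent_swhids: List[str]) -> Dict[str, str]:
    """Map VFS paths (relative to the commit dir) to parent commit SWHIDs."""
    # One numbered entry per parent under parents/, starting at 1.
    entries = {
        f"parents/{i}": swhid
        for i, swhid in enumerate(parent_swhids, start=1)
    }
    # Shortcut for the common case: exactly one parent.
    if len(parent_swhids) == 1:
        entries["parent"] = parent_swhids[0]
    return entries
```

With this layout a merge commit exposes `parents/1`, `parents/2`, ... and no `parent` entry, while an ordinary commit exposes both `parents/1` and `parent`.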
28–30

Both authorship info and timestamps recur in various types of VFS entries; it might be worth factoring them out into separate sections and pointing to them from here.

stuff that will need to go in the factored out sections is (at least):

  • the syntax of author info (it's *usually* "FIRSTNAME LASTNAME <EMAIL>", but there is no guarantee, plus the usual encoding considerations apply)
  • the timestamp syntax and semantics (ISO 8601)
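
As an illustration of why the factored-out sections matter for consumers, here is a hedged sketch of reading both fields. `split_author` and `parse_timestamp` are hypothetical helpers; the author pattern is best-effort precisely because the "NAME <EMAIL>" shape is not guaranteed, and `datetime.fromisoformat` only accepts a subset of ISO 8601.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Best-effort pattern for the usual "NAME <EMAIL>" shape.
AUTHOR_RE = re.compile(r"(?P<name>.*?)\s*<(?P<email>[^<>]*)>")


def split_author(fullname: str) -> Tuple[str, Optional[str]]:
    """Return (name, email); email is None when the usual shape is absent."""
    m = AUTHOR_RE.fullmatch(fullname)
    if m is None:
        return fullname, None  # no guarantee on the format: keep it verbatim
    return m.group("name"), m.group("email")


def parse_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2020-09-17T13:27:19+02:00."""
    return datetime.fromisoformat(text)
```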
45

There is no guarantee that target points to a source tree. In fact, rel objects can point to any kind of object. So:

  • it is fine to have a target VFS entry here, its type will depend on the type of the pointed node
  • but we also need a target_type, which will contain the 3-letter name of the type of the target object
  • (this target/target_type pattern will probably also emerge elsewhere)
  • it might be worth to optimize for the common case in which a rel points to a rev and we want to explore the source code at that time. For instance, by having a root VFS entry in that case (and only in that case) which peels down to the root dir pointed by the inner rev object. That would provide a consistent VFS interface when accessing root source dirs via a rev v. via a rel object
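
A sketch of the target/target_type pattern with the rev-only root shortcut. `release_entries` is a hypothetical helper, and `rev_root_dir` stands for the root directory SWHID of the pointed rev, which the caller is assumed to have looked up already.

```python
from typing import Dict, Optional


def release_entries(target_swhid: str,
                    rev_root_dir: Optional[str] = None) -> Dict[str, str]:
    """Map VFS entry names of a rel node to what they expose."""
    # The 3-letter type is the third colon-separated field of a SWHID.
    target_type = target_swhid.split(":")[2]
    entries = {"target": target_swhid, "target_type": target_type}
    # Only when the release points to a revision: expose its root directory.
    if target_type == "rev" and rev_root_dir is not None:
        entries["root"] = rev_root_dir
    return entries
```

A rel pointing to anything other than a rev gets no `root` entry, so its presence can itself be used as a quick check.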
52–53
  • we need to decide and describe the mangling. Unless we find a way to avoid it, which might be nice
  • snp branches can point to any kind of object, so we might need here a target/target_type split as before, and maybe some optimizations

this is going to be a tricky one...
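
Just to have something concrete to discuss, one possible mangling (an assumption, not a decision) is percent-encoding branch names so they never contain a path separator. The helper names are hypothetical; only `urllib.parse.quote` and `unquote_to_bytes` from the standard library are used.

```python
from urllib.parse import quote, unquote_to_bytes


def mangle_branch_name(name: bytes) -> str:
    """Percent-encode a branch name so it is usable as a single path component."""
    # safe="" forces "/" to be escaped as %2F as well.
    return quote(name, safe="")


def unmangle_branch_name(entry: str) -> bytes:
    """Recover the original branch name bytes from a mangled entry name."""
    return unquote_to_bytes(entry)
```

This is reversible and handles non-UTF-8 names, but it makes entries less readable and does not escape names that are exactly "." or "..", so it is only one option among others.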

This revision now requires changes to proceed.Sep 18 2020, 2:05 PM
zack edited reviewers, added: haltode; removed: zack.
This revision now requires review to proceed.Sep 23 2020, 10:43 AM

we have rewritten this from scratch and we will commit it separately