diff --git a/docs/design.md b/docs/design.md --- a/docs/design.md +++ b/docs/design.md @@ -49,20 +49,21 @@ The SwhFS mount point contain: -- `archive/`: initially empty, this directory is lazily populated with one -entry per accessed SWHID, having actual SWHIDs as names (possibly sharded into -`xy/../SWHID` paths to avoid overcrowding `archive/`). +- `archive/`: initially empty, this directory is lazily populated with one entry + per accessed SWHID, having actual SWHIDs as names (possibly sharded into + `xy/../SWHID` paths to avoid overcrowding `archive/`). - `meta/`: initially empty, this directory contains one `.json` file for -each `` entry under `archive/`. The JSON file contain all available meta -information about the given SWHID, as returned by the Software Heritage Web API -for that object. Note that, in case of pagination (e.g., snapshot objects with -many branches) the JSON file will contain a complete version with all pages -merged together. + each `` entry under `archive/`. The JSON file contain all available + meta information about the given SWHID, as returned by the Software Heritage + Web API for that object. Note that, in case of pagination (e.g., snapshot + objects with many branches) the JSON file will contain a complete version with + all pages merged together. -- `origin/`: initially empty, this directory is lazily populated with one -entry per accessed origin URL, having encoded URL as names. The URL encoding is -done using the percent-encoding mechanism described in RFC 3986. +- `origin/`: initially empty, this directory is lazily populated with one entry + per accessed origin URL, having encoded URL as names. The URL encoding is done + using the percent-encoding mechanism described in + [RFC 3986](https://tools.ietf.org/html/rfc3986.html). 
## File system representation @@ -100,19 +101,19 @@ with the following entries: - `root`: source tree at the time of the commit, as a symlink pointing into -`archive/`, to a SWHID of type `dir` + `archive/`, to a SWHID of type `dir` - `parents/` (note the plural): a virtual directory containing entries named -`1`, `2`, `3`, etc., one for each parent commit. Each of these entry is a -symlink pointing into `archive/`, to the SWHID file for the given parent commit -- `parent` (note the singular): present if and only if the current commit has a -single parent commit (which is the most common case). When present it is a -symlink pointing into `archive/` to the SWHID for the sole parent commit -- `history`: a virtual directory containing all the parents commit until the -root commit. Entries are listed as symlinks with the SWHID as directory name, -pointing into `archive/SWHID`, and are returned in a topological ordering -similar to `git log` ordering. + `1`, `2`, `3`, etc., one for each parent commit. Each of these entry is a + symlink pointing into `archive/`, to the SWHID file for the given parent + commit +- `parent` (note the singular): present if and only if the current commit has at + least one parent commit (which is the most common case). When present it is a + symlink pointing into `parents/1/` +- `history`: a virtual directory listing all its revision ancestors, sorted in + reverse topological order. The history can be listed through `by-date/`, + `by-hash/` or `by-page/` with each its own sharding policy. 
- `meta.json`: metadata for the current node, as a symlink pointing to the -relevant `meta/.json` file + relevant `meta/.json` file ### `rel` nodes (releases) @@ -121,12 +122,12 @@ following entries: - `target`: target node, as a symlink to `archive/` -- `target_type`: type of the target SWHID, as a 3-letter code +- `target_type`: regular file containing the type of the target SWHID - `root`: present if and only if the release points to something that -(transitively) resolves to a directory. When present it is a symlink pointing -into `archive/` to the SWHID of the given directory + (transitively) resolves to a directory. When present it is a symlink pointing + into `archive/` to the SWHID of the given directory - `meta.json`: metadata for the current node, as a symlink pointing to the -relevant `meta/.json` file + relevant `meta/.json` file ### `snp` nodes (snapshots) @@ -180,11 +181,13 @@ ### Metadata cache - SWHID → JSON metadata + Artifact → JSON metadata -The metadata cache map each SWHID to the complete metadata of the referenced +The metadata cache map each artifact to the complete metadata of the referenced object. This is analogous to what is available in `meta/.json` file (and generally used as data source for returning the content of those files). +Artifacts are identified using their SWHIDs, or in the case of origins visits +using their URLs. Cache location on-disk: `$XDG_CACHE_HOME/swh/fuse/metadata.sqlite` @@ -207,37 +210,6 @@ Cache location on-disk: `$XDG_CACHE_HOME/swh/fuse/blob.sqlite` -### Dentry cache - - dir SWHID → directory entries - -The dentry (directory entry) cache map SWHIDs of type `dir` to the directory -entries they contain. Each entry comes with its name as well as file attributes -(i.e., all its needed to perform a detailed directory listing). - -Additional attributes of each directory entry should be looked up on a entry by -entry basis, possibly hitting the metadata cache. 
- -The dentry cache for a given dir is populated, at the latest, when the content -of the directory is listed. More aggressive prefetching might happen. For -instance, when first opening a dir a recursive listing of it can be retrieved -from the remote backend and used to recursively populate the dentry cache for -all (transitive) sub-directories. - - -### Parents cache - - rev SWHID → parent SWHIDs - -The parents cache map SWHIDs of type `rev` to the list of their parent commits. - -The parents cache for a given rev is populated, at the latest, when the content -of the revision virtual directory is listed. More aggressive prefetching might -happen. For instance, when first opening a rev virtual directory a recursive -listing of all its ancestor can be retrieved from the remote backend and used to -recursively populate the parents cache for all ancestors. - - ### History cache rev SWHID → ancestor SWHIDs @@ -248,3 +220,25 @@ prefetched. To efficiently store the ancestor lists, the history cache represents ancestors as graph edges (a pair of two SWHID nodes), meaning the history cache is shared amongst all revisions parents. + +Cache location on-disk: `$XDG_CACHE_HOME/swh/fuse/metadata.sqlite` + + +### Direntry cache + + dir inode → directory entries + +The direntry cache map inode representing directories to the entries they +contain. Each entry comes with its name as well as file attributes (i.e., all +its needed to perform a detailed directory listing). + +Additional attributes of each directory entry should be looked up on a entry by +entry basis, possibly hitting the metadata cache. + +The direntry cache for a given dir is populated, at the latest, when the content +of the directory is listed. More aggressive prefetching might happen. For +instance, when first opening a dir a recursive listing of it can be retrieved +from the remote backend and used to recursively populate the direntry cache for +all (transitive) sub-directories. + +Cache location: in-memory. 
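Note on the `origin/` naming described in the `docs/design.md` hunk above: entries are named after the percent-encoded origin URL (RFC 3986). The following is a minimal sketch of that encoding, assuming Python's standard `urllib.parse` is an acceptable stand-in; the helper names `encode_origin_url` and `decode_origin_name` are hypothetical and do not appear in this diff, and the diff does not show which call SwhFS itself uses.

```python
from urllib.parse import quote, unquote


def encode_origin_url(url: str) -> str:
    """Percent-encode an origin URL so it can serve as a single file name.

    Hypothetical helper for illustration. `safe=""` makes `quote` escape
    "/" and ":" as well, so the result contains no path separator.
    """
    return quote(url, safe="")


def decode_origin_name(name: str) -> str:
    """Recover the original URL from an encoded directory entry name."""
    return unquote(name)


if __name__ == "__main__":
    encoded = encode_origin_url("https://github.com/rust-lang/rust")
    print(encoded)
    print(decode_origin_name(encoded))
```

Escaping every reserved character keeps each origin as one flat entry under `origin/` rather than a nested path.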
diff --git a/swh/fuse/cache.py b/swh/fuse/cache.py --- a/swh/fuse/cache.py +++ b/swh/fuse/cache.py @@ -19,7 +19,7 @@ from swh.fuse.fs.artifact import RevisionHistoryShardByDate from swh.fuse.fs.entry import FuseDirEntry, FuseEntry -from swh.fuse.fs.mountpoint import ArchiveDir, MetaDir +from swh.fuse.fs.mountpoint import ArchiveDir, MetaDir, OriginDir from swh.model.exceptions import ValidationError from swh.model.identifiers import REVISION, SWHID, parse_swhid from swh.web.client.client import ORIGIN_VISIT, typify_json @@ -47,10 +47,10 @@ self.cache_conf = cache_conf async def __aenter__(self): - # TODO: unified name for metadata/history? + # History and raw metadata share the same SQLite db self.metadata = MetadataCache(self.cache_conf["metadata"]) - self.blob = BlobCache(self.cache_conf["blob"]) self.history = HistoryCache(self.cache_conf["metadata"]) + self.blob = BlobCache(self.cache_conf["blob"]) self.direntry = DirEntryCache(self.cache_conf["direntry"]) await self.metadata.__aenter__() await self.blob.__aenter__() @@ -106,24 +106,29 @@ class MetadataCache(AbstractCache): - """ The metadata cache map each SWHID to the complete metadata of the + """ The metadata cache map each artifact to the complete metadata of the referenced object. This is analogous to what is available in `meta/.json` file (and generally used as data source for returning - the content of those files). """ + the content of those files). Artifacts are identified using their SWHIDs, or + in the case of origins visits using their URLs. 
""" async def __aenter__(self): await super().__aenter__() - await self.conn.execute( - "create table if not exists metadata_cache (swhid, metadata, date)" - ) - await self.conn.execute( - "create index if not exists idx_metadata on metadata_cache(swhid)" - ) - await self.conn.execute( - "create table if not exists visits_cache (url, metadata)" - ) - await self.conn.execute( - "create index if not exists idx_visits on visits_cache(url)" + await self.conn.executescript( + """ + create table if not exists metadata_cache ( + swhid text, + metadata blob, + date text + ); + create index if not exists idx_metadata on metadata_cache(swhid); + + create table if not exists visits_cache ( + url text, + metadata blob + ); + create index if not exists idx_visits on visits_cache(url); + """ ) await self.conn.commit() return self @@ -139,22 +144,20 @@ else: return None - async def get_visits( - self, url_encoded: str, typify: bool = True - ) -> Optional[List[Dict[str, Any]]]: + async def get_visits(self, url_encoded: str) -> Optional[List[Dict[str, Any]]]: cursor = await self.conn.execute( "select metadata from visits_cache where url=?", (url_encoded,) ) cache = await cursor.fetchone() if cache: visits = json.loads(cache[0]) - if typify: - visits = [typify_json(v, ORIGIN_VISIT) for v in visits] - return visits + visits_typed = [typify_json(v, ORIGIN_VISIT) for v in visits] + return visits_typed else: return None async def set(self, swhid: SWHID, metadata: Any) -> None: + # Fill in the date column for revisions (used as cache for history/by-date/) swhid_date = "" if swhid.object_type == REVISION: date = dateutil.parser.parse(metadata["date"]) @@ -166,7 +169,6 @@ "insert into metadata_cache values (?, ?, ?)", (str(swhid), json.dumps(metadata), swhid_date), ) - await self.conn.commit() async def set_visits(self, url_encoded: str, visits: List[Dict[str, Any]]) -> None: @@ -187,9 +189,14 @@ async def __aenter__(self): await super().__aenter__() - await self.conn.execute("create table 
if not exists blob_cache (swhid, blob)") - await self.conn.execute( - "create index if not exists idx_blob on blob_cache(swhid)" + await self.conn.executescript( + """ + create table if not exists blob_cache ( + swhid text, + blob blob + ); + create index if not exists idx_blob on blob_cache(swhid); + """ ) await self.conn.commit() return self @@ -222,18 +229,16 @@ async def __aenter__(self): await super().__aenter__() - await self.conn.execute( + await self.conn.executescript( """ create table if not exists history_graph ( src text not null, dst text not null, unique(src, dst) - ) + ); + create index if not exists idx_history on history_graph(src); """ ) - await self.conn.execute( - "create index if not exists idx_history on history_graph(src)" - ) await self.conn.commit() return self @@ -265,7 +270,7 @@ return history async def get_with_date_prefix( - self, swhid: SWHID, date_prefix + self, swhid: SWHID, date_prefix: str ) -> List[Tuple[SWHID, str]]: cursor = await self.conn.execute( f""" @@ -360,9 +365,9 @@ return self.lru_cache.get(direntry.inode, None) def set(self, direntry: FuseDirEntry, entries: List[FuseEntry]) -> None: - if isinstance(direntry, ArchiveDir) or isinstance(direntry, MetaDir): - # The `archive/` and `meta/` are populated on the fly so we should - # never cache them + if isinstance(direntry, (ArchiveDir, MetaDir, OriginDir)): + # The `archive/`, `meta/`, and `origin/` are populated on the fly so + # we should never cache them pass elif ( isinstance(direntry, RevisionHistoryShardByDate) diff --git a/swh/fuse/fs/artifact.py b/swh/fuse/fs/artifact.py --- a/swh/fuse/fs/artifact.py +++ b/swh/fuse/fs/artifact.py @@ -143,8 +143,8 @@ has at least one parent commit (which is the most common case). When present it is a symlink pointing into `parents/1/` - `history`: a virtual directory listing all its revision ancestors, sorted - in reverse topological order. Each entry is a symlink pointing into - `archive/SWHID`. + in reverse topological order. 
The history can be listed through + `by-date/`, `by-hash/` or `by-page/` with each its own sharding policy. - `meta.json`: metadata for the current node, as a symlink pointing to the relevant `meta/.json` file """ diff --git a/swh/fuse/fuse.py b/swh/fuse/fuse.py --- a/swh/fuse/fuse.py +++ b/swh/fuse/fuse.py @@ -115,6 +115,8 @@ raise async def get_history(self, swhid: SWHID) -> List[SWHID]: + """ Retrieve a revision's history using Software Heritage Graph API """ + if swhid.object_type != REVISION: raise pyfuse3.FUSEError(errno.EINVAL)
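Note on the `HistoryCache` schema change in `swh/fuse/cache.py`: the table and its index are now created in one `executescript` call, and ancestors are stored as `(src, dst)` edges. The sketch below shows how such an edge table can be created idempotently and queried for all ancestors of a revision. It rests on several assumptions: it uses the synchronous standard-library `sqlite3` module instead of the async connection used in the diff, the helper names (`open_history_db`, `add_edges`, `ancestors`) are hypothetical, and the recursive query is one possible way to walk the edges, not necessarily the query the project uses.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Same statements as the history_graph schema in the diff, run as one script.
SCHEMA = """
create table if not exists history_graph (
    src text not null,
    dst text not null,
    unique(src, dst)
);
create index if not exists idx_history on history_graph(src);
"""


def open_history_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the schema if it is missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)  # safe to repeat thanks to "if not exists"
    conn.commit()
    return conn


def add_edges(conn: sqlite3.Connection, edges: Iterable[Tuple[str, str]]) -> None:
    """Store (revision, parent) edges; duplicates are skipped by the unique constraint."""
    conn.executemany("insert or ignore into history_graph values (?, ?)", edges)
    conn.commit()


def ancestors(conn: sqlite3.Connection, rev: str) -> List[str]:
    """Return every node reachable from `rev`, sorted by identifier.

    Assumption: a recursive common table expression is used here for
    illustration; `union` (not `union all`) removes duplicates.
    """
    cursor = conn.execute(
        """
        with recursive reachable(node) as (
            select dst from history_graph where src = ?
            union
            select g.dst from history_graph g
            join reachable r on g.src = r.node
        )
        select node from reachable order by node
        """,
        (rev,),
    )
    return [row[0] for row in cursor.fetchall()]


if __name__ == "__main__":
    db = open_history_db()
    # Placeholder identifiers, not real SWHIDs.
    add_edges(db, [("rev-c", "rev-b"), ("rev-b", "rev-a")])
    print(ancestors(db, "rev-c"))
```

Because edges are shared, two revisions with a common ancestor store that part of the graph only once, which matches the design document's remark that the history cache is shared among revisions.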