diff --git a/docs/vault-blueprint.md b/docs/vault-blueprint.md
index e037365..993eb32 100644
--- a/docs/vault-blueprint.md
+++ b/docs/vault-blueprint.md
@@ -1,139 +1,139 @@
 Software Heritage Vault
 =======================

 Software source code **objects**---e.g., individual source code files,
 tarballs, commits, tagged releases, etc.---are stored in the Software Heritage
 (SWH) Archive in fully deduplicated form. That allows direct access to
 individual artifacts but require some preparation, usually in the form of
-collecting and assemblying multiple artifacts in a single **bundle**, when fast
+collecting and assembling multiple artifacts in a single **bundle**, when fast
 access to a set of related artifacts (e.g., the snapshot of a VCS repository,
 the archive corresponding to a Git commit, or a specific software release as a
 zip archive) is required.

 The **Software Heritage Vault** is a cache of pre-built source code bundles
 which are assembled opportunistically retrieving objects from the Software
 Heritage Archive, can be accessed efficiently, and might be garbage collected
-after a long period of non use.
+after a long period of non-use.


 Requirements
 ------------

 * **Shared cache** The vault is a cache shared among the various origins that
   the SWH archive tracks. If the same bundle, originally coming from different
   origins, is requested, a single entry for it in the cache shall exist.

 * **Efficient retrieval** Where supported by the desired access protocol
   (e.g., HTTP) it should be possible for the vault to serve bundles efficiently
   (e.g., as static files served via HTTP, possibly further proxied/cached at
   that level). In particular, this rules out building bundles on the fly from
   the archive DB.


 API
 ---

 All URLs below are meant to be mounted at API root, which is currently at .
 Unless otherwise stated, all API endpoints respond on HTTP GET method.


 ## Object identification

 The vault stores bundles corresponding to different kinds of objects.
 The following object kinds are supported:

 * directories
 * revisions
 * repository snapshots

 The URL fragment `:objectkind/:objectid` is used throughout the vault API to
 fully identify vault objects. The syntax and meaning of :objectid for the
 different object kinds is detailed below.


 ### Directories

 * object kind: directory
 * URL fragment: directory/:sha1git

 where :sha1git is the directory ID in the SWH data model.

 ### Revisions

 * object kind: revision
 * URL fragment: revision/:sha1git

 where :sha1git is the revision ID in the SWH data model.

 ### Repository snapshots

 * object kind: snapshot
 * URL fragment: snapshot/:sha1git

 where :sha1git is the snapshot ID in the SWH data model. (**TODO** repository
 snapshots don't exist yet as first-class citizens in the SWH data model; see
 References below.)


 ## Cooking

 Bundles in the vault might be ready for retrieval or not. When they are not,
 they will need to be **cooked** before they can be retrieved. A cooked bundle
 will remain around until it expires; at that point it will need to be cooked
 again before it can be retrieved. Cooking is idempotent, and a no-op in between
 a previous cooking operation and expiration.

 To cook a bundle:

 * POST /vault/:objectkind/:objectid

   Request body: **TODO** something here in a JSON payload that would allow
   notifying the user when the bundle is ready.

   Response: 201 Created


 ## Retrieval

 * GET /vault/:objectkind

   (paginated) list of all bundles of a given kind available in the vault; see
   Pagination. Note that, due to cache expiration, objects might disappear
-  between listing and subsequent actions on them
+  between listing and subsequent actions on them.

   Examples:

   * GET /vault/directory
   * GET /vault/revision

 * GET /vault/:objectkind/:objectid

   Retrieve a specific bundle from the vault.
   Response:

   * 200 OK: bundle available; response body is the bundle
   * 404 Not Found: missing bundle; client should request its preparation (see Cooking)


 References
 ----------

 * [Repository snapshot objects](https://wiki.softwareheritage.org/index.php?title=User:StefanoZacchiroli/Repository_snapshot_objects)
 * Amazon Web Services, [API Reference for Amazon Glacier](http://docs.aws.amazon.com/amazonglacier/latest/dev/amazon-glacier-api.html);
   specifically [Job Operations](http://docs.aws.amazon.com/amazonglacier/latest/dev/job-operations.html)


 TODO
 ====

 * **TODO** pagination using HATEOAS
 * **TODO** authorization: the cooking API should be somehow controlled to avoid
   obvious abuses (e.g., let's cache everything)
 * **TODO** finalize repository snapshot proposal
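
The cook-then-retrieve flow that the Cooking and Retrieval sections describe might look roughly like this from a client's point of view. This is a minimal Python sketch under stated assumptions, not part of the vault code: `vault_path`, `fetch_or_cook`, and the `send` transport callable are hypothetical names introduced for illustration, and because the POST request body is still marked TODO in the blueprint, no payload is sent.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport: takes (method, path) and returns (status_code, body).
# A real client would wrap an HTTP library and prepend the API root here.
Transport = Callable[[str, str], Tuple[int, bytes]]

# Object kinds listed in the blueprint's "Object identification" section.
OBJECT_KINDS = ("directory", "revision", "snapshot")


def vault_path(object_kind: str, object_id: str) -> str:
    """Build the /vault/:objectkind/:objectid path used throughout the API."""
    if object_kind not in OBJECT_KINDS:
        raise ValueError(f"unsupported object kind: {object_kind!r}")
    return f"/vault/{object_kind}/{object_id}"


def fetch_or_cook(send: Transport, object_kind: str, object_id: str) -> Optional[bytes]:
    """Return the bundle if it is ready; otherwise request cooking and return None."""
    path = vault_path(object_kind, object_id)

    status, body = send("GET", path)
    if status == 200:
        # Bundle available: the response body is the bundle itself.
        return body
    if status == 404:
        # Missing bundle: ask the vault to cook it (request body is still TODO).
        cook_status, _ = send("POST", path)
        if cook_status != 201:
            raise RuntimeError(f"cooking request failed with status {cook_status}")
        return None
    raise RuntimeError(f"unexpected status {status} for GET {path}")
```

A caller would typically invoke `fetch_or_cook` again later when it returns `None`. Since cooking is described as idempotent, repeating the call while the bundle is still being prepared should be harmless.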