diff --git a/docs/tutorial.md b/docs/tutorial.md deleted file mode 100644 --- a/docs/tutorial.md +++ /dev/null @@ -1,260 +0,0 @@ -# Software Heritage Filesystem (SwhFS) --- Tutorial - - -## Installation - -The Software Heritage virtual filesystem (SwhFS) is available from PyPI -as [swh.fuse](https://pypi.org/project/swh.fuse/). It can be installed from -there using `pip`: - - $ pip install swh.fuse - - -## Setup and teardown - -SwhFS is controlled by the `swh fs` command-line interface (CLI). - -Like all filesystems, SwhFS must be "mounted" before use and "unmounted" -afterwards. Users should first mount the archive as a whole and then browse -archived objects looking up their SWHIDs below the `archive/` entry-point. To -mount the Software Heritage archive, use the `swh fs mount` command: - - $ mkdir swhfs - $ swh fs mount swhfs/ # mount the archive - - $ ls -1F swhfs/ # list entry points - archive/ # <- start browsing from here - cache/ - origin/ - README - -By default SwhFS daemonizes into background and logs to syslog; it can be kept -in foreground, logging to the console, by passing `-f/--foreground` to `mount`. - -To unmount use `swh fs umount PATH`. Note that, since SwhFS is a *user-space* -filesystem, mounting and unmounting it are not privileged operations, any user -can do it. - -The configuration file `~/.swh/config/global.yml` is read if present. Its main -use case is inserting a per-user authentication token for the SWH API, which -might be needed in case of heavy use to bypass the default API rate limit. See -the {ref}`configuration documentation ` for details. - - -## Lazy loading - -Once mounted, the archive can be navigated as if it were locally available -on-disk. Archived objects are referenced by -{ref}`Software Heritage identifiers ` (SWHIDs). -They are loaded on-demand from the archive and populate lazily the `archive/` -directory below the SwhFS mount point. 
- -SWHIDs for source code that is not locally available can be obtained in various -ways: searching on the [Software Heritage website][webui]; finding SWHID -references in [scientific papers][citeguide], [Wikidata][wikidataswhid], and -software bills of materials using the [SPDX standard][spdx]; deriving SWHIDs -from other version control system references (e.g., as SWHIDs version 1 are -compatible with Git, a Git commit identifier like -`9d76c0b163675505d1a901e5fe5249a2c55609bc` can be turned into a SWHID by simply -prefixing it with `swh:1:rev:` to obtain -`swh:1:rev:9d76c0b163675505d1a901e5fe5249a2c55609bc`). - -[citeguide]: https://www.softwareheritage.org/save-and-reference-research-software -[spdx]: https://spdx.dev/ -[swhid]: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html -[webui]: https://archive.softwareheritage.org -[wikidataswhid]: https://www.wikidata.org/wiki/Property:P6138 - - -## Source code files - -Here is a SwhFS Hello World: - - $ cd swhfs/ - - $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2 - #include <stdio.h> - - int main(void) { - printf("Hello, World!\n"); - } - -Given the SWHID of a source code file, we can directly access it via the -filesystem. - -Metadata about archived source code artifacts is also locally available. For -each entry `archive/<SWHID>` there is a matching JSON file -`archive/<SWHID>.json`, corresponding to what the [Software Heritage Web -API][webapi] will return. 
For example, here is what the Software Heritage -archive knows about the above Hello World implementation: - - $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2.json - { - "length": 67, - "status": "visible", - "checksums": { - "sha256": "06dfb5d936f50b3cb80152aa053724e4a18417c35f745b66ab9571c25afd0f79", - "sha1": "459ee8545e5ba6cb819ba41e6ea2f0011cedd728", - "blake2s256": "87e6ab9c92681e9a022a8f4679dcd9d9b841fe4146edcbc15329fc66d8c82b4f", - "sha1_git": "c839dea9e8e6f0528b468214348fee8669b305b2" - }, - "data_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/raw/", - "filetype_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/filetype/", - "language_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/language/", - "license_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/license/" - } - -Note: JSON metadata files are indented by default when read, this can be changed -in the configuration file (see {ref}`documentation `). - - -[webapi]: https://archive.softwareheritage.org/api/ - - -## Source code trees - -In addition to individual source code files, we can also browse entire source -code directories. 
Here is the historical Apollo 11 source code, where we can -find interesting comments about the antenna during landing: - - $ cd archive/swh:1:dir:1fee702c7e6d14395bbf5ac3598e73bcbf97b030 - - $ ls | wc -l - 127 - - $ grep -i antenna THE_LUNAR_LANDING.s | cut -f 5 - # IS THE LR ANTENNA IN POSITION 1 YET - # BRANCH IF ANTENNA ALREADY IN POSITION 1 - -We can checkout the commit of a more modern code base, like jQuery, and count -its JavaScript lines of code (SLOC): - - $ cd archive/swh:1:rev:9d76c0b163675505d1a901e5fe5249a2c55609bc - - $ ls -1F - history/ - meta.json@ - parent@ - parents/ - root@ - - $ find root/src/ -type f -name '*.js' | xargs cat | wc -l - 10136 - - -## History browsing - -`meta.json` files of revision objects contain complete commit metadata, e.g.: - - $ jq '.author.name, .date, .message' meta.json - "Michal Golebiowski-Owczarek" - "2020-03-02T23:02:42+01:00" - "Data:Event:Manipulation: Prevent collisions with Object.prototype ..." - -Commit history can be browsed commit-by-commit digging into directories -`parent(s)/` directories or, more efficiently, using the history summaries -located under `history/`: - - $ ls -f history/by-page/000/ | wc -l - 6469 - - $ ls -f history/by-page/000/ | head -n 5 - swh:1:rev:358b769a00c3a09a8ec621b8dcb2d5e31b7da69a - swh:1:rev:4a7fc8544e2020c75047456d11979e4e3a517fdf - swh:1:rev:364476c3dc1231603ba61fc08068fa89fb095e1a - swh:1:rev:721744a9fab5b597febea64e466272eabfdb9463 - swh:1:rev:4592595b478be979141ce35c693dbc6b65647173 - -The jQuery commit at hand is preceded by 6469 commits, which can be listed in -`git log` order via the `by-page` view. 
The `by-hash` and `by-date` views list -commits sharded by commit identifier and timestamp: - - $ ls history/by-hash/00/ | head -n 5 - swh:1:rev:00a9c2e5f4c855382435cec6b3908eb9bd5a53b7 - swh:1:rev:005040379d8b64aacbe54941d878efa6e86df1cc - swh:1:rev:00cc67af23bf9cf2cdbaeaeee6ded76baf0292f0 - swh:1:rev:00575d4d8c7421c5119f181009374ff2e7736127 - swh:1:rev:0019a463bdcb81dc6ba3434505a45774ca27f363 - - $ ls -1F history/by-date/ - 2006/ - 2007/ - 2008/ - ... - 2018/ - 2019/ - 2020/ - - $ ls -f history/by-date/2020/03/16/ - swh:1:ref:90fed4b453a5becdb7f173d9e3c1492390a1441f - - $ jq .date history/by-date/2020/03/16/*/meta.json - "2020-03-16T21:49:29+01:00" - -Note that to populate the `by-date` view, metadata about all commits in the -history are needed. To avoid blocking on that, metadata are retrieved -asynchronously, populating the view incrementally. The hidden `by-date/.status` -file provides a progress report and is removed upon completion. - - -## Repository snapshots and branches - -Snapshot objects keep track of where each branch and release (or "tag") pointed -at archival time. Here is an example using -the [Unix history repository](https://github.com/dspinellis/unix-history-repo), -which uses historical Unix releases as branch names: - - $ cd archive/swh:1:snp:2ca5d6eff8f04a671c0d5b13646cede522c64b7d - - $ ls -f refs/heads/ | wc -l - 40 - - $ ls -f refs/heads/ | grep Bell - Bell-32V-Snapshot-Development - Bell-Release - - $ cd refs/heads/Bell-Release - $ jq .message,.date meta.json - "Bell 32V release\nSnapshot of the completed development branch\n\nSynthesized-from: 32v\n" - "1979-05-02T23:26:55-05:00" - - $ grep core root/usr/src/games/fortune.c - printf("Memory fault -- core dumped\n"); - -We can check that two of the available branches correspond to historical Bell -Labs UNIX releases. And we can dig into the `fortune` implementation of -[UNIX/32V](https://en.wikipedia.org/wiki/UNIX/32V) instantly, without having to -clone a 1.6  GiB repository first. 
- - -## Origin search - -Origins can be accessed via the `origin/` top-level directory using their -**encoded** URL (the percent-encoding mechanism described in [RFC -3986](https://tools.ietf.org/html/rfc3986.html). - - $ cd origin/https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux - $ ls - 2015-07-09/ 2016-09-14/ 2017-09-12/ 2018-03-08/ 2018-09-06/ ... - -Each directory corresponds to a visit, containing metadata and a symlink to the -visit's snapshot: - - $ ls -l origin/https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux/2020-09-21/ - total 0 - -r--r--r-- 1 haltode haltode 470 Dec 28 12:12 meta.json - lr--r--r-- 1 haltode haltode 67 Dec 28 12:12 snapshot -> ../../../archive/swh:1:snp:c7beb2432b7e93c4cf6ab09cd194c7c1998df2f9/ - -In order to find origin URLs, we can use the `web search` CLI: - - $ swh web search python --limit 5 - https://github.com/neon670/python.dev https://archive.softwareheritage.org/api/1/origin/https://github.com/neon670/python.dev/visits/ - https://github.com/aur-archive/python-werkzeug https://archive.softwareheritage.org/api/1/origin/https://github.com/aur-archive/python-werkzeug/visits/ - https://github.com/jsagon/jtradutor-web-python https://archive.softwareheritage.org/api/1/origin/https://github.com/jsagon/jtradutor-web-python/visits/ - https://github.com/zjmwqx/ipythonCode https://archive.softwareheritage.org/api/1/origin/https://github.com/zjmwqx/ipythonCode/visits/ - https://github.com/knutab/Python-BSM https://archive.softwareheritage.org/api/1/origin/https://github.com/knutab/Python-BSM/visits/ - -The `search` tool is also useful to escape URL: - - $ swh web search "torvalds linux" --limit 1 --url-encode | cut -f1 - https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux diff --git a/docs/tutorial.rst b/docs/tutorial.rst new file mode 100644 --- /dev/null +++ b/docs/tutorial.rst @@ -0,0 +1,278 @@ +Software Heritage Filesystem (SwhFS) — Tutorial +=============================================== + +Installation +------------ + +The Software Heritage virtual 
filesystem (SwhFS) is available from PyPI as `swh.fuse +`_. It can be installed from there using ``pip``: + +:: + + $ pip install swh.fuse + +Setup and teardown +------------------ + +SwhFS is controlled by the ``swh fs`` command-line interface (CLI). + +Like all filesystems, SwhFS must be “mounted” before use and “unmounted” afterwards. +Users should first mount the archive as a whole and then browse archived objects looking +up their SWHIDs below the ``archive/`` entry-point. To mount the Software Heritage +archive, use the ``swh fs mount`` command: + +:: + + $ mkdir swhfs + $ swh fs mount swhfs/ # mount the archive + + $ ls -1F swhfs/ # list entry points + archive/ # <- start browsing from here + cache/ + origin/ + README + +By default SwhFS daemonizes into background and logs to syslog; it can be kept in +foreground, logging to the console, by passing ``-f/--foreground`` to ``mount``. + +To unmount use ``swh fs umount PATH``. Note that, since SwhFS is a *user-space* +filesystem, mounting and unmounting it are not privileged operations, any user can do +it. + +The configuration file ``~/.swh/config/global.yml`` is read if present. Its main use +case is inserting a per-user authentication token for the SWH API, which might be needed +in case of heavy use to bypass the default API rate limit. See the {ref}\ +``configuration documentation `` for details. + +Lazy loading +------------ + +Once mounted, the archive can be navigated as if it were locally available on-disk. +Archived objects are referenced by {ref}\ ``Software Heritage identifiers +`` (SWHIDs). They are loaded on-demand from the archive and +populate lazily the ``archive/`` directory below the SwhFS mount point. 
+ +SWHIDs for source code that is not locally available can be obtained in various ways: +searching on the :swh_web:`Software Heritage website `; finding SWHID references in +`scientific papers +`_, `Wikidata +`_, and software bills of materials using +the `SPDX standard `_; deriving SWHIDs from other version control +system references (e.g., as SWHIDs version 1 are compatible with Git, a Git commit +identifier like ``9d76c0b163675505d1a901e5fe5249a2c55609bc`` can be turned into a SWHID +by simply prefixing it with ``swh:1:rev:`` to obtain +``swh:1:rev:9d76c0b163675505d1a901e5fe5249a2c55609bc``). + +Source code files +----------------- + +Here is a SwhFS Hello World: + +:: + + $ cd swhfs/ + + $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2 + #include <stdio.h> + + int main(void) { + printf("Hello, World!\n"); + } + +Given the SWHID of a source code file, we can directly access it via the filesystem. + +Metadata about archived source code artifacts is also locally available. For each entry +``archive/<SWHID>`` there is a matching JSON file ``archive/<SWHID>.json``, +corresponding to what the :swh_web:`Software Heritage Web API ` will return. 
For +example, here is what the Software Heritage archive knows about the above Hello World +implementation: + +:: + + $ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2.json + { + "length": 67, + "status": "visible", + "checksums": { + "sha256": "06dfb5d936f50b3cb80152aa053724e4a18417c35f745b66ab9571c25afd0f79", + "sha1": "459ee8545e5ba6cb819ba41e6ea2f0011cedd728", + "blake2s256": "87e6ab9c92681e9a022a8f4679dcd9d9b841fe4146edcbc15329fc66d8c82b4f", + "sha1_git": "c839dea9e8e6f0528b468214348fee8669b305b2" + }, + "data_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/raw/", + "filetype_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/filetype/", + "language_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/language/", + "license_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:c839dea9e8e6f0528b468214348fee8669b305b2/license/" + } + +Note: JSON metadata files are indented by default when read, this can be changed in the +configuration file (see {ref}\ ``documentation ``). + +Source code trees +----------------- + +In addition to individual source code files, we can also browse entire source code +directories. 
Here is the historical Apollo 11 source code, where we can find interesting +comments about the antenna during landing: + +:: + + $ cd archive/swh:1:dir:1fee702c7e6d14395bbf5ac3598e73bcbf97b030 + + $ ls | wc -l + 127 + + $ grep -i antenna THE_LUNAR_LANDING.s | cut -f 5 + # IS THE LR ANTENNA IN POSITION 1 YET + # BRANCH IF ANTENNA ALREADY IN POSITION 1 + +We can check out a commit of a more modern code base, like jQuery, and count its +JavaScript lines of code (SLOC): + +:: + + $ cd archive/swh:1:rev:9d76c0b163675505d1a901e5fe5249a2c55609bc + + $ ls -1F + history/ + meta.json@ + parent@ + parents/ + root@ + + $ find root/src/ -type f -name '*.js' | xargs cat | wc -l + 10136 + +History browsing +---------------- + +``meta.json`` files of revision objects contain complete commit metadata, e.g.: + +:: + + $ jq '.author.name, .date, .message' meta.json + "Michal Golebiowski-Owczarek" + "2020-03-02T23:02:42+01:00" + "Data:Event:Manipulation: Prevent collisions with Object.prototype ..." + +Commit history can be browsed commit by commit by digging into the ``parent(s)/`` +directories or, more efficiently, using the history summaries located under +``history/``: + +:: + + $ ls -f history/by-page/000/ | wc -l + 6469 + + $ ls -f history/by-page/000/ | head -n 5 + swh:1:rev:358b769a00c3a09a8ec621b8dcb2d5e31b7da69a + swh:1:rev:4a7fc8544e2020c75047456d11979e4e3a517fdf + swh:1:rev:364476c3dc1231603ba61fc08068fa89fb095e1a + swh:1:rev:721744a9fab5b597febea64e466272eabfdb9463 + swh:1:rev:4592595b478be979141ce35c693dbc6b65647173 + +The jQuery commit at hand is preceded by 6469 commits, which can be listed in ``git +log`` order via the ``by-page`` view. 
The ``by-hash`` and ``by-date`` views list commits +sharded by commit identifier and by timestamp, respectively: + +:: + + $ ls history/by-hash/00/ | head -n 5 + swh:1:rev:00a9c2e5f4c855382435cec6b3908eb9bd5a53b7 + swh:1:rev:005040379d8b64aacbe54941d878efa6e86df1cc + swh:1:rev:00cc67af23bf9cf2cdbaeaeee6ded76baf0292f0 + swh:1:rev:00575d4d8c7421c5119f181009374ff2e7736127 + swh:1:rev:0019a463bdcb81dc6ba3434505a45774ca27f363 + + $ ls -1F history/by-date/ + 2006/ + 2007/ + 2008/ + ... + 2018/ + 2019/ + 2020/ + + $ ls -f history/by-date/2020/03/16/ + swh:1:rev:90fed4b453a5becdb7f173d9e3c1492390a1441f + + $ jq .date history/by-date/2020/03/16/*/meta.json + "2020-03-16T21:49:29+01:00" + +Note that to populate the ``by-date`` view, metadata about all commits in the history +are needed. To avoid blocking on that, metadata are retrieved asynchronously, populating +the view incrementally. The hidden ``by-date/.status`` file provides a progress report +and is removed upon completion. + +Repository snapshots and branches +--------------------------------- + +Snapshot objects keep track of where each branch and release (or “tag”) pointed at +archival time. Here is an example using the `Unix history repository +`_, which uses historical Unix releases +as branch names: + +:: + + $ cd archive/swh:1:snp:2ca5d6eff8f04a671c0d5b13646cede522c64b7d + + $ ls -f refs/heads/ | wc -l + 40 + + $ ls -f refs/heads/ | grep Bell + Bell-32V-Snapshot-Development + Bell-Release + + $ cd refs/heads/Bell-Release + $ jq .message,.date meta.json + "Bell 32V release\nSnapshot of the completed development branch\n\nSynthesized-from: 32v\n" + "1979-05-02T23:26:55-05:00" + + $ grep core root/usr/src/games/fortune.c + printf("Memory fault -- core dumped\n"); + +We can check that two of the available branches correspond to historical Bell Labs UNIX +releases. And we can dig into the ``fortune`` implementation of `UNIX/32V +`_ instantly, without having to clone a 1.6 GiB +repository first. 
+ +Origin search +------------- + +Origins can be accessed via the ``origin/`` top-level directory using their **encoded** +URL (the percent-encoding mechanism described in `RFC 3986 +`_). + +:: + + $ cd origin/https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux + $ ls + 2015-07-09/ 2016-09-14/ 2017-09-12/ 2018-03-08/ 2018-09-06/ ... + +Each directory corresponds to a visit, containing metadata and a symlink to the visit’s +snapshot: + +:: + + $ ls -l origin/https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux/2020-09-21/ + total 0 + -r--r--r-- 1 haltode haltode 470 Dec 28 12:12 meta.json + lr--r--r-- 1 haltode haltode 67 Dec 28 12:12 snapshot -> ../../../archive/swh:1:snp:c7beb2432b7e93c4cf6ab09cd194c7c1998df2f9/ + +To find origin URLs, we can use the ``swh web search`` CLI: + +:: + + $ swh web search python --limit 5 + https://github.com/neon670/python.dev https://archive.softwareheritage.org/api/1/origin/https://github.com/neon670/python.dev/visits/ + https://github.com/aur-archive/python-werkzeug https://archive.softwareheritage.org/api/1/origin/https://github.com/aur-archive/python-werkzeug/visits/ + https://github.com/jsagon/jtradutor-web-python https://archive.softwareheritage.org/api/1/origin/https://github.com/jsagon/jtradutor-web-python/visits/ + https://github.com/zjmwqx/ipythonCode https://archive.softwareheritage.org/api/1/origin/https://github.com/zjmwqx/ipythonCode/visits/ + https://github.com/knutab/Python-BSM https://archive.softwareheritage.org/api/1/origin/https://github.com/knutab/Python-BSM/visits/ + +The ``search`` tool can also percent-encode URLs: + +:: + + $ swh web search "torvalds linux" --limit 1 --url-encode | cut -f1 + https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux