diff --git a/docs/contributing/sphinx.rst b/docs/contributing/sphinx.rst --- a/docs/contributing/sphinx.rst +++ b/docs/contributing/sphinx.rst @@ -3,7 +3,7 @@ Sphinx gotchas ============== -Here is a list of common gotchas when formatting Python docstrings for [http://www.sphinx-doc.org/en/stable/ Sphinx] and the [http://www.sphinx-doc.org/en/stable/ext/napoleon.html Napoleon] style. +Here is a list of common gotchas when formatting Python docstrings for `Sphinx `_ and the `Napoleon `_ style. Sphinx ------ @@ -11,12 +11,12 @@ Lists +++++ -All sorts of `lists `_ +All sorts of `lists `_ require an empty line before the first bullet and after the last one, to be properly interpreted as list. No indentation is required for list elements w.r.t. surrounding text, and line continuations should be indented like the first character -after the bullet +after the bullet. Bad:: @@ -65,7 +65,7 @@ Verbatim source code ++++++++++++++++++++ -Verbatim `code blocks `_, +Verbatim `code blocks `_, e.g., for code examples, requires double colon at the end of a line, then an empty line, and then the code block itself, indented: @@ -103,7 +103,7 @@ ``**kwargs``, ``**args`` +++++++++++++++++++++++++ -`Asterisks needs to be escaped `_ +`Asterisks needs to be escaped `_ to avoid capture by emphasis markup. In case of multiple adjacent asterisks, escaping the first one is enough. @@ -120,8 +120,8 @@ Backquotes are not enough to cross-reference a Python entity (class, function, module, etc.); you need to use -`Sphinx domains `_ for that, -and in particular the `Python domain `_ +`Sphinx domains `_ for that, +and in particular the `Python domain `_ Bad:: @@ -138,7 +138,7 @@ you can avoid a long, fully-qualified anchor setting an :func:`explicit label ` for a link -See also: the `list of Python roles `_ +See also: the `list of Python roles `_ that you can use to cross-reference Python objects. Note that you can (and should) omit the :py: prefix, as Python is the default domain. 
@@ -155,7 +155,7 @@ Docstring sections ++++++++++++++++++ -See the `list of docstring sections `_ +See the `list of docstring sections `_ supported by Napoleon. Everything else will *not* be typeset with a dedicated heading, you will have to do so explicitly using reStructuredText markup. diff --git a/docs/getting-started/api.rst b/docs/getting-started/api.rst new file mode 100644 --- /dev/null +++ b/docs/getting-started/api.rst @@ -0,0 +1,660 @@ +============================================== +Getting Started with the Software Heritage API +============================================== + +Introduction +------------ + +About Software Heritage +^^^^^^^^^^^^^^^^^^^^^^^ + +The `Software Heritage project `__ was +started in 2015 with a rather impressive goal and purpose: + + Software Heritage is an ambitious initiative that aims at collecting, + organizing, preserving and sharing all the source code publicly + available in the world. + +Yes, you read that right: all source code available in the world. This implies +building an equally impressive infrastructure to hold the huge amount of +information represented, making the archive available to the public +through a `nice web interface `__ +and even offering a :ref:`well-documented API ` to access it +seamlessly. For the record, there are also :ref:`various datasets +available ` for download, with detailed instructions +about how to set them up. And yes, it’s huge: the full graph generated +from the archive (with only metadata, content is not included) has more +than 20 billion nodes and weighs 1.2TB. The overall size of the archive is in the +hundreds of TBs. + +This article presents, and demonstrates the use of, the `Software +Heritage API `__ to query +basic information about archived content and fetch the content of a +software project. + +Terms and Concepts +^^^^^^^^^^^^^^^^^^ + +For our activity we need to define the following terms and concepts: + +- The repositories analysed by Software Heritage (SWH) are registered as **origins**.
+ Examples of origins are: https://bitbucket.org/anthroweb/apache.git, + https://github.com/apache/ant, or other types of sources (Debian + source packages, npmjs, pypi, cran, etc.). +- Analysing a repository creates **snapshots**. Snapshots + describe the state of the repository at the time of analysis, and + provide links to the repository content. As an example in the case of a git + repository, the snapshot links to the list of branches, which + themselves link to revisions and releases. +- **Revisions** are consistent sets of directories and contents + representing the repository at a given time, like in a baseline. They + can be conceptually mapped to commits in subversion, to git + references, or to source package versions in Debian or npmjs + repositories. +- Revisions are linked to a **directory**, which itself links to other + directories and **contents** (aka blobs). + +A full list of terms is provided in the `Software Heritage +doc `__. + +Preliminary steps +----------------- + +This article uses Python 3.x on the client side, and the ``requests`` +Python module to send the HTTP requests. Note however that any +language that provides HTTP requests (GET, POST) can access the API and +could be used. First, let’s make sure we have the correct Python +version and module installed:: + + boris@castalia:notebook$ python3 -V + Python 3.7.3 + boris@castalia:notebooks$ pip3 install requests + Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.21.0) + boris@castalia:notebook$ + +Initialise the script +--------------------- + +We need to import a few modules and utilities to play with the Software +Heritage API, namely ``json`` and the aforementioned ``requests`` +modules. We also define a utility function to pretty-print json data +easily: + +.. code:: python + + import json + import requests + + # Utility to pretty-print json.
+ def jprint(obj): + # create a formatted string of the Python JSON object + print(json.dumps(obj, sort_keys=True, indent=4)) + + +The syntax mentioned in the `API +documentation `__ is rather +straightforward. Since we want to read it from the main Software +Heritage server, we will use ``https://archive.softwareheritage.org/`` +as the base URL. All API calls will be built according to the same +syntax: + +:: + + https://archive.softwareheritage.org/api/1/ + +Request basic Information +------------------------- + +We want to get some basic information about the main server activity and +content. The ``stat`` endpoint provides a summary of the main indexes and +some statistics about the archive. We can request a GET on the main +counters of the archive using the counters path, as described in the +`endpoint +documentation `__: + +``/api/1/stat/counters/`` + +This API endpoint returns the following information: + +* **content** is the total number of blobs (files) in the archive. +* **directory** is the total number of directories in the archive. +* **origin** is the number of distinct origins (repositories) fetched by + the archive bots. +* **origin_visits** is the total number of visits across all origins. +* **person** is the number of persons (e.g. committers, authors) in the + archived files. +* **release** is the number of tags retrieved in the archive. +* **revision** is the number of revisions stored in the archive. +* **skipped_content** is the number of objects which could not be + imported in the archive. +* **snapshot** is the number of snapshots stored in the archive. + +Note that we use the default JSON format for the output. We could use +YAML if we wanted to, by setting the ``Accept`` request header to +``application/yaml``. + +.. code-block:: python + + resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/") + counters = resp.json() + jprint(counters) + + +.. 
code-block:: python + + { + "content": 10049535736, + "directory": 8390591308, + "origin": 156388918, + "person": 42263568, + "release": 17218891, + "revision": 2109783249 + } + + +There are just over 10bn blobs (aka files) in the archive and 8bn+ +directories already, for 156m origins analysed. + +Now, what about a specific repository? Let’s say we want to find out whether +`alambic `__ (an open-source data provider and +analysis system for software development) has already been analysed by +the archive’s bots. + +Search the archive +------------------ + +Search for a keyword +^^^^^^^^^^^^^^^^^^^^ + +The easiest way to look for a keyword in the repositories analysed by +the archive is to use the ``search`` feature of the ``origin`` endpoint. +Documentation for the endpoint is +`here `__ +and the complete syntax is: + +:: + + `/api/1/origin/search//` + +The server returns an array of objects, each with the following +attributes: + +- **origin_visits_url** is a URL that points to the API page + listing all visits (bot fetches) to this repository. +- **url** is the URL of the origin, or repository, itself. + +A (truncated) example of a result from this endpoint is shown below: + +:: + + [ + { + "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", + "url": "https://github.com/borisbaldassari/alambic" + } + ... + ] + +As an example we will look for instances of *alambic* in the archive’s +analysed repositories:: + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/") + origins = resp.json() + print(f"We found {len(origins)} entries.") + for origin in origins[1:10]: + print(f"- {origin['url']}") + + +Which produces:: + + We found 52 entries.
+ - https://github.com/royal-alambic-club/sauron + - https://github.com/scamberlin/alambic + - https://github.com/WebTales/alambic-connector-mongodb + - https://github.com/WebTales/alambic + - https://github.com/AssoAlambic/alambic-website + - https://bitbucket.org/nayoub/alambic.git + - https://github.com/Alexandru-Dobre/alambic-connector-rest + - https://github.com/WebTales/alambic-connector-diffbot + - https://github.com/WebTales/alambic-connector-firebase + + +There are obviously many projects and repositories that embed the word +alambic, and we will need to be a bit more specific if we are to +identify the origin actually related to the alambic project. + +If we want to know more about a specific origin, we can simply use the +``url`` attribute (or any known URL) as an entry for any of the +``origin`` endpoints. + +Search for a specific origin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Now say that we want to query the database for the specific repository +of Alambic, to know what information has been registered by the archive. +The API endpoint can be found `in the swh-web +documentation `__, +and has the following syntax: + +``/api/1/origin//get/`` + +It returns the same type of JSON object as the ``search`` command +seen previously: + +- **origin_visits_url** is a URL that points to the API page + listing all visits (bot fetches) to this repository. +- **url** is the URL of the origin, or repository, itself. + +We know that Alambic is hosted at +‘https://github.com/borisbaldassari/alambic/’, so the API call will look +like this: + +``/api/1/origin/https://github.com/borisbaldassari/alambic/get/`` + +.. code:: python + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/") + found = resp.json() + jprint(found) + + +.. 
parsed-literal:: + + { + "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", + "url": "https://github.com/borisbaldassari/alambic" + } + + +Get visits information +^^^^^^^^^^^^^^^^^^^^^^ + +We can use the ``origin_visits_url`` attribute to know more about when +the repository was analysed by the archive bots. The API endpoint is +fully documented on the `Software Heritage doc +site `__, +and has the following syntax: + +``/api/1/origin//visits/`` + +We will use the same query as before about the main Alambic repository. + +.. code:: python + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/") + found = resp.json() + length = len(found) + print(f"Number of visits found: {format(length)}.") + print("With dates:") + for visit in found: + print(f"- {visit['visit']} {visit['date']}") + print("\nExample of a single visit entry:") + jprint(found[0]) + + +.. parsed-literal:: + + Number of visits found: 5. + With dates: + - 5 2021-01-01T19:35:41.308336+00:00 + - 4 2020-02-06T10:41:45.700641+00:00 + - 3 2019-09-01T22:38:12.056537+00:00 + - 2 2019-06-16T04:52:18.162914+00:00 + - 1 2019-01-30T07:19:20.799217+00:00 + + Example of a single visit entry: + { + "date": "2021-01-01T19:35:41.308336+00:00", + "metadata": {}, + "origin": "https://github.com/borisbaldassari/alambic", + "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/", + "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", + "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/", + "status": "full", + "type": "git", + "visit": 5 + } + + +Get the content +--------------- + +As defined in the beginning, a snapshot is a capture of the repository +at a given time with links to all branches and releases. 
In this example +we will work on the snapshot ID of the last visit to Alambic, as returned +by the previous command we executed. + +.. code:: python + + # Store snapshot id + snapshot = found[0]['snapshot'] + print(f"Snapshot is {snapshot}.") + + +.. parsed-literal:: + + Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc. + + +Note that the latest visit to the repository can also be directly +retrieved using the `dedicated +endpoint `__ +``/api/1/origin/visit/latest/``. + +Get the snapshot +^^^^^^^^^^^^^^^^ + +We now want to retrieve the content of the project at this snapshot. For +that purpose there is the ``snapshot`` endpoint, and its documentation +is `provided +here `__. The +complete syntax is: + +``/api/1/snapshot//`` + +The snapshot endpoint returns in the ``branches`` attribute a list of +**revisions** (aka commits in a git context), which +themselves point to the set of directories and files in the branch at +the time of analysis. Let’s follow this chain of links, starting with +the snapshot’s list of revisions (branches): + +.. code:: python + + snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot)) + snapshotj = snapshotr.json() + jprint(snapshotj) + + +.. 
parsed-literal:: + + { + "branches": { + "HEAD": { + "target": "refs/heads/master", + "target_type": "alias", + "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + }, + "refs/heads/devel": { + "target": "e298b8c5692b18928013a68e41fd185419515075", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/" + }, + "refs/heads/features/cr152_anonymise_data": { + "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/" + }, + "refs/heads/features/cr164_github_project": { + "target": "0005abb080e4c67a97533ee923e9d28142877752", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" + }, + "refs/heads/features/cr165_github_its": { + "target": "0005abb080e4c67a97533ee923e9d28142877752", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" + }, + "refs/heads/features/cr89_gitlabwizard": { + "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/" + }, + "refs/heads/master": { + "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + } + }, + "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", + "next_branch": null + } + + +Get the root directory +^^^^^^^^^^^^^^^^^^^^^^ + +The revision associated to the branch can be retrieved by following the +corresponding link in the ``target_url`` attribute. 
We will follow the +``refs/heads/master`` branch and get the associated revision object. In +this case (a git repository) the revision is equivalent to a commit, with +an ID and message. + +.. code:: python + + print(f"Snapshot ID is {snapshotj['id']}.") + master_url = snapshotj['branches']['refs/heads/master']['target_url'] + masterr = requests.get(master_url) + masterj = masterr.json() + jprint(masterj) + + +.. parsed-literal:: + + Snapshot ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc. + { + "author": { + "email": "boris.baldassari@gmail.com", + "fullname": "Boris Baldassari ", + "name": "Boris Baldassari" + }, + "committer": { + "email": "boris.baldassari@gmail.com", + "fullname": "Boris Baldassari ", + "name": "Boris Baldassari" + }, + "committer_date": "2020-11-01T12:55:13+01:00", + "date": "2020-11-01T12:55:13+01:00", + "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8", + "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/", + "extra_headers": [], + "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/", + "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", + "merge": false, + "message": "#163 Fix dygraphs zero padding in forums plugin.\n", + "metadata": {}, + "parents": [ + { + "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151", + "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/" + } + ], + "synthetic": false, + "type": "git", + "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + } + + +The revision references the root directory of the project. We can +list all files and directories at the root by requesting more +information from the ``directory_url`` attribute. The endpoint is +documented +`here `__ and +has the following syntax: + +``/api/1/directory//`` + +The structure of the response is an **array of directory entries**.
+**Content entries** are represented like this: + +:: + + { + "checksums": { + "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8", + "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", + "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc" + }, + "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "length": 101, + "name": ".dockerignore", + "perms": 33188, + "status": "visible", + "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", + "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/", + "type": "file" + } + +And **directory entries** are represented with: + +:: + + { + "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "length": null, + "name": "doc", + "perms": 16384, + "target": "316468df4988351911992ecbf1866f1c1f575c23", + "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/", + "type": "dir" + } + +We will print the list of contents and directories located at the root of +the repository at the time of analysis: + +.. code:: python + + root_url = masterj['directory_url'] + rootr = requests.get(root_url) + rootj = rootr.json() + for f in rootj: + print(f"- {f['name']}.") + + +.. parsed-literal:: + + - .dockerignore + - .env + - .gitignore + - CODE_OF_CONDUCT.html + - CODE_OF_CONDUCT.md + - LICENCE.html + - LICENCE.md + - Readme.md + - doc + - docker + - docker-compose.run.yml + - docker-compose.test.yml + - dockercfg.encrypted + - mojo + - resources + + +We could follow the links up (or down) to the leaves in order to rebuild +the project structure and download all files individually to rebuild the +project locally. However the archive can do it for us, and provides a +feature to download the content of a whole project in one step: +**cooking**. The feature is described in the :ref:`swh-vault +documentation `. 
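To make the manual alternative concrete, here is a minimal sketch of what following the directory links down to the leaves could look like. The helper names ``fetch_directory`` and ``walk`` are made up for this illustration and are not part of any Software Heritage client; the sketch only assumes the ``name``, ``type`` and ``target`` attributes shown in the directory entries above.

```python
from typing import Callable, Iterator, List, Tuple

API_ROOT = "https://archive.softwareheritage.org/api/1"


def fetch_directory(dir_id: str) -> List[dict]:
    """Hypothetical helper: fetch the entries of one directory (network call)."""
    # Imported here so that walk() itself can be used without the network.
    import requests

    resp = requests.get(f"{API_ROOT}/directory/{dir_id}/")
    resp.raise_for_status()
    return resp.json()


def walk(
    dir_id: str,
    fetch: Callable[[str], List[dict]] = fetch_directory,
    prefix: str = "",
) -> Iterator[Tuple[str, str]]:
    """Yield (path, target) for every leaf entry below dir_id, depth first."""
    for entry in fetch(dir_id):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            # Sub-directory: recurse using its target as the next directory id.
            yield from walk(entry["target"], fetch, path + "/")
        else:
            # Anything that is not a directory is treated as a leaf here.
            yield path, entry["target"]
```

Calling ``walk(masterj['directory'])`` would issue one request per directory, which is slow for large projects and may be throttled by the server, so letting the archive cook the directory is usually the better option.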
+ +Download content of a project +----------------------------- + +When we ask the Archive to cook a directory for us, it invokes an +asynchronous job to recursively fetch the directories and files of the +project, following the graph up to the leaves (files) and exporting the +result as a tar.gz file. This procedure is handled by the :ref:`swh-vault +component `, and it’s all automatic. + +Order the meal +^^^^^^^^^^^^^^ + +A cooking job can be invoked for revisions, directories or snapshots +(soon). It is initiated with a POST request on the ``vault//`` +endpoint, and its complete syntax is: + +``/api/1/vault/directory//`` + +The first POST request initiates the cooking, and subsequent GET +requests can fetch the job result and download the archive. See the +`Software Heritage documentation ` on this, with useful +examples. The API endpoint is documented `here `__. + +In this example we will fetch the content of the root directory that we +previously identified. + +.. code:: python + + mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") + mealj = mealr.json() + jprint(mealj) + + +.. parsed-literal:: + + { + "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", + "id": 379321799, + "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "obj_type": "directory", + "progress_message": null, + "status": "done" + } + + +Ask if it’s ready +^^^^^^^^^^^^^^^^^ + +We can use a GET request on the same URL to get information about the +process status: + +.. code:: python + + statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") + statusj = statusr.json() + jprint(statusj) + + +.. 
parsed-literal:: + + { + "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", + "id": 379321799, + "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "obj_type": "directory", + "progress_message": null, + "status": "done" + } + + +Get the plate +^^^^^^^^^^^^^ + +Once the processing is finished (it can take up to a few minutes) the +tar.gz archive can be downloaded through the ``fetch_url`` link, and +extracted with ``tar``: + +:: + + boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz + % Total % Received % Xferd Average Speed Time Time Time Current + Dload Upload Total Spent Left Speed + 100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k + boris@castalia:downloads$ ls + myarchive.tar.gz + boris@castalia:downloads$ tar xzvf myarchive.tar.gz + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/ + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/ + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config + [SNIP] + +Conclusion +---------- + +In this article, we learned **how to explore and use the Software +Heritage archive using its API**: searching for a repository, +identifying projects and downloading specific snapshots of a repository. +There is a lot more to the Archive and its API than what we have seen, +and all features are generously documented on the `Software Heritage web +site `__.
+ + + diff --git a/docs/getting-started/index.rst b/docs/getting-started/index.rst --- a/docs/getting-started/index.rst +++ b/docs/getting-started/index.rst @@ -11,3 +11,5 @@ ../getting-started ../developer-setup using-docker + api +