				==============================================
				Getting Started with the Software Heritage API
				==============================================

				Introduction
				------------

				About Software Heritage
				^^^^^^^^^^^^^^^^^^^^^^^

				The `Software Heritage project <https://www.softwareheritage.org>`__ was
				started in 2015 with a rather impressive goal and purpose:

				Software Heritage is an ambitious initiative that aims at collecting,
				organizing, preserving and sharing all the source code publicly
				available in the world.

				Yes, you read that right: all the source code available in the world. This
				requires building an equally impressive infrastructure to hold the huge
				amount of information involved, making the archive available to the public
				through a `nice web interface <https://archive.softwareheritage.org/>`__,
				and even offering a :ref:`well-documented API <swh-web>` to access it
				seamlessly. For the record, there are also :ref:`various datasets
				available <swh-dataset>` for download, with detailed instructions
				about how to set them up. And yes, it is huge: the full graph generated
				from the archive (metadata only, content is not included) has more
				than 20 billion nodes and weighs 1.2 TB. The overall size of the archive
				is in the hundreds of terabytes.

				This article presents, and demonstrates the use of, the `Software
				Heritage API <https://archive.softwareheritage.org/api/1/>`__ to query
				basic information about archived content and fetch the content of a
				software project.

				Terms and Concepts
				^^^^^^^^^^^^^^^^^^

				For this walkthrough we need to define the following terms and concepts:

				- The repositories analysed by Software Heritage are registered as origins.
				Examples of origins are https://bitbucket.org/anthroweb/apache.git and
				https://github.com/apache/ant, as well as other types of sources (Debian
				source packages, npmjs, PyPI, CRAN, and so on).
				- Each time a repository is analysed, a snapshot is created. Snapshots
				describe the state of the repository at the time of analysis, and
				provide links to the repository content. For example, in the case of a git
				repository, the snapshot links to the list of branches, which
				themselves link to revisions and releases.
				- Revisions are consistent sets of directories and contents
				representing the repository at a given time, like a baseline. They
				can be conceptually mapped to commits in Subversion or git, or to
				source package versions in Debian or npmjs repositories.
				- Each revision is linked to a directory, which itself links to other
				directories and contents (aka blobs).

				A full list of terms is provided in the `Software Heritage
				doc <https://wiki.softwareheritage.org/index.php?title=Glossary>`__.

				Preliminary steps
				-----------------

				This article uses Python 3.x on the client side, and the ``requests``
				Python module to send HTTP requests. Note however that any
				language that can issue HTTP requests (GET, POST) can access the API and
				could be used instead. First, let’s make sure we have the correct Python
				version and module installed::

				boris@castalia:notebook$ python3 -V
				Python 3.7.3
				boris@castalia:notebook$ pip3 install requests
				Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.21.0)
				boris@castalia:notebook$

				Initialise the script
				---------------------

				We need to import a few modules and utilities to play with the Software
				Heritage API, namely ``json`` and the aforementioned ``requests``
				modules. We also define a utility function to pretty-print json data
				easily:

				.. code:: python

				import json
				import requests

				# Utility to pretty-print json.
				def jprint(obj):
				    # Print a Python object as an indented, key-sorted JSON string.
				    print(json.dumps(obj, sort_keys=True, indent=4))


				The syntax described in the `API
				documentation <https://archive.softwareheritage.org/api/1/>`__ is rather
				straightforward. Since we want to query the main Software
				Heritage server, we will use ``https://archive.softwareheritage.org/``
				as the base URL. All API calls are built according to the same
				syntax:

				::

				https://archive.softwareheritage.org/api/1/<endpoint>
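
As a small illustration of this URL scheme, the helper below joins the base URL and an endpoint path. The name ``api_url`` is an assumption made for this sketch; it is not part of the Software Heritage API or of ``requests``.

```python
# Hypothetical helper: builds a full API URL from an endpoint path.
API_BASE = "https://archive.softwareheritage.org/api/1/"


def api_url(endpoint):
    """Return the full URL for an endpoint such as 'stat/counters'."""
    # Strip stray slashes so that 'stat/counters' and '/stat/counters/'
    # both produce the same URL, always ending with a single slash.
    return API_BASE + endpoint.strip("/") + "/"


print(api_url("stat/counters"))
```

The resulting string can be passed directly to ``requests.get``.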

				Request basic Information
				-------------------------

				We want to get some basic information about the main server activity and
				content. The ``stat`` endpoint provides a summary of the main indexes and
				some statistics about the archive. We can send a GET request for the main
				counters of the archive using the ``counters`` path, as described in the
				`endpoint
				documentation <https://archive.softwareheritage.org/api/1/stat/counters/>`__:

				``/api/1/stat/counters/``

				This API endpoint returns the following information:

				* content is the total number of blobs (files) in the archive.
				* directory is the total number of directories in the archive.
				* origin is the number of distinct origins (repositories) fetched by
				the archive bots.
				* origin_visits is the total number of visits across all origins.
				* person is the number of persons (e.g. committers, authors) in the
				archived files.
				* release is the number of tags retrieved in the archive.
				* revision is the number of revisions stored in the archive.
				* skipped_content is the number of objects which could not be
				imported in the archive.
				* snapshot is the number of snapshots stored in the archive.

				Note that we use the default JSON format for the output. We could use
				YAML if we wanted to, by setting the ``Accept`` request header to
				``application/yaml``.
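
To make the header mechanism concrete, here is a minimal sketch that prepares (but does not send) a request asking for YAML output. It uses the standard library ``urllib.request`` so that it has no external dependency; the function name ``counters_request`` is an assumption made for this example.

```python
import urllib.request

COUNTERS_URL = "https://archive.softwareheritage.org/api/1/stat/counters/"


def counters_request(mime_type="application/json"):
    """Prepare a GET request whose Accept header selects the output format."""
    return urllib.request.Request(COUNTERS_URL, headers={"Accept": mime_type})


req = counters_request("application/yaml")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
# Sending the request would be: urllib.request.urlopen(req).read()
```

With ``requests``, the equivalent is to pass ``headers={"Accept": "application/yaml"}`` to ``requests.get``.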

				.. code-block:: python

				resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/")
				counters = resp.json()
				jprint(counters)


				.. code-block:: python

				{
				"content": 10049535736,
				"directory": 8390591308,
				"origin": 156388918,
				"person": 42263568,
				"release": 17218891,
				"revision": 2109783249
				}


				There are already more than 10 billion blobs (aka files) and more than
				8 billion directories in the archive, for about 156 million origins analysed.
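
Raw counters of this size are hard to read at a glance. The sketch below rounds them to a human-friendly form, using the values from the sample output above; ``humanize`` is a hypothetical helper introduced only for this illustration.

```python
# Counter values copied from the sample response shown above.
counters = {
    "content": 10049535736,
    "directory": 8390591308,
    "origin": 156388918,
    "person": 42263568,
    "release": 17218891,
    "revision": 2109783249,
}


def humanize(count):
    """Round a large integer to one decimal with a unit word."""
    units = ((10**9, "billion"), (10**6, "million"), (10**3, "thousand"))
    for threshold, unit in units:
        if count >= threshold:
            return f"{count / threshold:.1f} {unit}"
    return str(count)


for name, value in sorted(counters.items()):
    print(f"{name:10} {humanize(value)}")
```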

				Now, what about a specific repository? Let’s say we want to find if
				`alambic <https://alambic.io>`__ (an open-source data provider and
				analysis system for software development) has already been analysed by
				the archive’s bots.

				Search the archive
				------------------

				Search for a keyword
				^^^^^^^^^^^^^^^^^^^^

				The easiest way to look for a keyword in the repositories analysed by
				the archive is to use the ``search`` feature of the ``origin`` endpoint.
				Documentation for the endpoint is
				`here <https://archive.softwareheritage.org/api/1/origin/search/doc/>`__
				and the complete syntax is:

				::

				/api/1/origin/search/<keyword>/

				The server returns a JSON array of objects, each with the following
				attributes:

				- origin_visits_url is a URL that points to the API page
				listing all visits (bot fetches) to this repository.
				- url is the URL of the origin, or repository, itself.

				A (truncated) example of a result from this endpoint is shown below:

				::

				[
				{
				"origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
				"url": "https://github.com/borisbaldassari/alambic"
				}
				...
				]

				As an example we will look for instances of alambic in the archive’s
				analysed repositories::

				resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/")
				origins = resp.json()
				print(f"We found {len(origins)} entries.")
				for origin in origins[1:10]:
				    print(f"- {origin['url']}")


				Which produces::

				We found 52 entries.
				- https://github.com/royal-alambic-club/sauron
				- https://github.com/scamberlin/alambic
				- https://github.com/WebTales/alambic-connector-mongodb
				- https://github.com/WebTales/alambic
				- https://github.com/AssoAlambic/alambic-website
				- https://bitbucket.org/nayoub/alambic.git
				- https://github.com/Alexandru-Dobre/alambic-connector-rest
				- https://github.com/WebTales/alambic-connector-diffbot
				- https://github.com/WebTales/alambic-connector-firebase


				There are obviously many projects and repositories that embed the word
				alambic, and we will need to be a bit more specific if we are to
				identify the origin actually related to the alambic project.

				If we want to know more about a specific origin, we can simply use the
				``url`` attribute (or any known URL) as an entry for any of the
				``origin`` endpoints.
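
One possible way to narrow down such a result list is to keep only the origins whose repository name matches the keyword exactly. The sketch below does this on a few entries from the output above; ``repo_name`` and ``exact_matches`` are hypothetical helpers, and the filtering happens on the client side, not in the API.

```python
from urllib.parse import urlparse


def repo_name(origin_url):
    """Return the last path segment of an origin URL, without '.git'."""
    path = urlparse(origin_url).path.rstrip("/")
    name = path.rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


def exact_matches(origins, name):
    """Keep the URLs of origins whose repository name equals `name`."""
    return [o["url"] for o in origins if repo_name(o["url"]) == name]


# A few entries taken from the search output shown above.
sample = [
    {"url": "https://github.com/royal-alambic-club/sauron"},
    {"url": "https://github.com/scamberlin/alambic"},
    {"url": "https://github.com/WebTales/alambic-connector-mongodb"},
    {"url": "https://bitbucket.org/nayoub/alambic.git"},
]
for url in exact_matches(sample, "alambic"):
    print(url)
```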

				Search for a specific origin
				^^^^^^^^^^^^^^^^^^^^^^^^^^^^

				Now say that we want to query the database for the specific repository
				of Alambic, to know what information has been registered by the archive.
				The API endpoint can be found `in the swh-web
				documentation <https://archive.softwareheritage.org/api/1/origin/doc/>`__,
				and has the following syntax:

				``/api/1/origin/<origin_url>/get/``

				This returns the same type of JSON object as the ``search`` command
				seen previously:

				- origin_visits_url is a URL that points to the API page
				listing all visits (bot fetches) to this repository.
				- url is the URL of the origin, or repository, itself.

				We know that Alambic is hosted at
				``https://github.com/borisbaldassari/alambic/``, so the API call will look
				like this:

				``/api/1/origin/https://github.com/borisbaldassari/alambic/get/``

				.. code:: python

				resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/")
				found = resp.json()
				jprint(found)


				.. parsed-literal::

				{
				"origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
				"url": "https://github.com/borisbaldassari/alambic"
				}
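
Since the origin URL is embedded as-is in the path, the ``get`` and ``visits`` endpoints can be derived from it with simple string formatting. The following is a sketch under that assumption; ``origin_endpoint`` is a hypothetical name, and stripping the trailing slash mirrors the ``url`` value returned in the example above.

```python
API_BASE = "https://archive.softwareheritage.org/api/1/"


def origin_endpoint(origin_url, action):
    """Build an origin endpoint URL; `action` is e.g. 'get' or 'visits'."""
    # The archive reports this origin without a trailing slash, so the
    # slash is removed here before the URL is embedded in the path.
    return f"{API_BASE}origin/{origin_url.rstrip('/')}/{action}/"


alambic = "https://github.com/borisbaldassari/alambic/"
print(origin_endpoint(alambic, "get"))
print(origin_endpoint(alambic, "visits"))
```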


				Get visits information
				^^^^^^^^^^^^^^^^^^^^^^

				We can use the ``origin_visits_url`` attribute to know more about when
				the repository was analysed by the archive bots. The API endpoint is
				fully documented on the `Software Heritage doc
				site <https://archive.softwareheritage.org/api/1/origin/visits/doc/>`__,
				and has the following syntax:

				``/api/1/origin/<origin_url>/visits/``

				We will use the same query as before about the main Alambic repository.

				.. code:: python

				resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/")
				found = resp.json()
				print(f"Number of visits found: {len(found)}.")
				print("With dates:")
				for visit in found:
				    print(f"- {visit['visit']} {visit['date']}")
				print("\nExample of a single visit entry:")
				jprint(found[0])


				.. parsed-literal::

				Number of visits found: 5.
				With dates:
				- 5 2021-01-01T19:35:41.308336+00:00
				- 4 2020-02-06T10:41:45.700641+00:00
				- 3 2019-09-01T22:38:12.056537+00:00
				- 2 2019-06-16T04:52:18.162914+00:00
				- 1 2019-01-30T07:19:20.799217+00:00

				Example of a single visit entry:
				{
				"date": "2021-01-01T19:35:41.308336+00:00",
				"metadata": {},
				"origin": "https://github.com/borisbaldassari/alambic",
				"origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/",
				"snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
				"snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/",
				"status": "full",
				"type": "git",
				"visit": 5
				}
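
The visit entries can also be filtered and sorted on the client side. The sketch below picks the most recent visit whose ``status`` is ``full``, as in the sample entry above; ``latest_full_visit`` is a hypothetical helper, and it assumes that every entry carries the ``date`` and ``status`` fields shown in that sample.

```python
from datetime import datetime


def latest_full_visit(visits):
    """Return the most recent visit with status 'full', or None."""
    full = [v for v in visits if v["status"] == "full"]
    if not full:
        return None
    # Dates are ISO 8601 strings with a UTC offset, which
    # datetime.fromisoformat can parse directly.
    return max(full, key=lambda v: datetime.fromisoformat(v["date"]))
```

Applied to the ``found`` list from the previous request, ``latest_full_visit(found)['snapshot']`` would give the snapshot identifier of that visit.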


				Get the content
				---------------

				As defined at the beginning, a snapshot is a capture of the repository
				at a given time with links to all branches and releases. In this example
				we will work with the snapshot ID of the latest visit to Alambic, as
				returned by the previous command we executed.

				.. code:: python

				# Store snapshot id
				snapshot = found[0]['snapshot']
				print(f"Snapshot is {format(snapshot)}.")


				.. parsed-literal::

				Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.


				Note that the latest visit to the repository can also be directly
				retrieved using the `dedicated
				endpoint <https://archive.softwareheritage.org/api/1/origin/visit/latest/doc/>`__
				``/api/1/origin/<origin_url>/visit/latest/``.

				Get the snapshot
				^^^^^^^^^^^^^^^^

				We now want to retrieve the content of the project at this snapshot. For
				that purpose there is the ``snapshot`` endpoint, and its documentation
				is `provided
				here <https://archive.softwareheritage.org/api/1/snapshot/doc/>`__. The
				complete syntax is:

				``/api/1/snapshot/<snapshot_id>/``

				The snapshot endpoint returns, in the ``branches`` attribute, the
				branches of the repository. Each branch points to a revision (aka a
				commit in a git context) or, for an alias such as ``HEAD``, to another
				branch. Revisions themselves point to the set of directories and files
				in the branch at the time of analysis. Let’s follow this chain of links,
				starting with the snapshot’s list of branches:

				.. code:: python

				snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot))
				snapshotj = snapshotr.json()
				jprint(snapshotj)


				.. parsed-literal::

				{
				"branches": {
				"HEAD": {
				"target": "refs/heads/master",
				"target_type": "alias",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
				},
				"refs/heads/devel": {
				"target": "e298b8c5692b18928013a68e41fd185419515075",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/"
				},
				"refs/heads/features/cr152_anonymise_data": {
				"target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/"
				},
				"refs/heads/features/cr164_github_project": {
				"target": "0005abb080e4c67a97533ee923e9d28142877752",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
				},
				"refs/heads/features/cr165_github_its": {
				"target": "0005abb080e4c67a97533ee923e9d28142877752",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
				},
				"refs/heads/features/cr89_gitlabwizard": {
				"target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/"
				},
				"refs/heads/master": {
				"target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
				"target_type": "revision",
				"target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
				}
				},
				"id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
				"next_branch": null
				}
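
Note that ``HEAD`` has the target type ``alias`` and names another branch instead of a revision. A minimal sketch of how such aliases could be followed on the client side is given below; ``resolve_branch`` is a hypothetical helper that works on the ``branches`` mapping shown above.

```python
def resolve_branch(branches, name, max_hops=10):
    """Follow alias branches until a non-alias branch is reached."""
    for _ in range(max_hops):
        branch = branches[name]
        if branch["target_type"] != "alias":
            return branch
        # For an alias, 'target' is the name of another branch.
        name = branch["target"]
    raise ValueError(f"too many alias hops while resolving {name!r}")


# Excerpt of the 'branches' mapping from the response above.
branches = {
    "HEAD": {"target": "refs/heads/master", "target_type": "alias"},
    "refs/heads/master": {
        "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
        "target_type": "revision",
    },
}
print(resolve_branch(branches, "HEAD")["target"])
```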


				Get the root directory
				^^^^^^^^^^^^^^^^^^^^^^

				The revision associated with the branch can be retrieved by following the
				corresponding link in the ``target_url`` attribute. We will follow the
				``refs/heads/master`` branch and get the associated revision object. In
				this case (a git repository) the revision is equivalent to a commit, with
				an ID and a message.

				.. code:: python

				print(f"Snapshot ID is {snapshotj['id']}.")
				master_url = snapshotj['branches']['refs/heads/master']['target_url']
				masterr = requests.get(master_url)
				masterj = masterr.json()
				jprint(masterj)


				.. parsed-literal::

				Snapshot ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.
				{
				"author": {
				"email": "boris.baldassari@gmail.com",
				"fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
				"name": "Boris Baldassari"
				},
				"committer": {
				"email": "boris.baldassari@gmail.com",
				"fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
				"name": "Boris Baldassari"
				},
				"committer_date": "2020-11-01T12:55:13+01:00",
				"date": "2020-11-01T12:55:13+01:00",
				"directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8",
				"directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/",
				"extra_headers": [],
				"history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/",
				"id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
				"merge": false,
				"message": "#163 Fix dygraphs zero padding in forums plugin.\n",
				"metadata": {},
				"parents": [
				{
				"id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151",
				"url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/"
				}
				],
				"synthetic": false,
				"type": "git",
				"url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
				}


				The revision references the root directory of the project. We can
				list all files and directories at the root by requesting more
				information from the ``directory_url`` attribute. The endpoint is
				documented
				`here <https://archive.softwareheritage.org/api/1/directory/doc/>`__ and
				has the following syntax:

				``/api/1/directory/<directory_id>/``

				The structure of the response is an array of directory entries.
				Content entries are represented like this:

				::

				{
				"checksums": {
				"sha1": "5973b582bfaeffa71c924e3fe7150620230391d8",
				"sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
				"sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc"
				},
				"dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
				"length": 101,
				"name": ".dockerignore",
				"perms": 33188,
				"status": "visible",
				"target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
				"target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/",
				"type": "file"
				}

				And directory entries are represented with:

				::

				{
				"dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
				"length": null,
				"name": "doc",
				"perms": 16384,
				"target": "316468df4988351911992ecbf1866f1c1f575c23",
				"target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/",
				"type": "dir"
				}
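
The ``perms`` values are Unix file modes written in decimal: 33188 is ``0o100644`` (a regular file) and 16384 is ``0o040000`` (a directory). As a small sketch, the standard ``stat`` module can render them in the familiar ``ls -l`` style; ``describe_entry`` is a hypothetical helper name.

```python
import stat


def describe_entry(entry):
    """Format a directory entry as '<mode string> <name>'."""
    return f"{stat.filemode(entry['perms'])} {entry['name']}"


# The two sample entries shown above, reduced to the fields used here.
print(describe_entry({"name": ".dockerignore", "perms": 33188}))
print(describe_entry({"name": "doc", "perms": 16384}))
```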

				We will print the list of contents and directories located at the root of
				the repository at the time of analysis:

				.. code:: python

				root_url = masterj['directory_url']
				rootr = requests.get(root_url)
				rootj = rootr.json()
				for f in rootj:
				    print(f"- {f['name']}")


				.. parsed-literal::

				- .dockerignore
				- .env
				- .gitignore
				- CODE_OF_CONDUCT.html
				- CODE_OF_CONDUCT.md
				- LICENCE.html
				- LICENCE.md
				- Readme.md
				- doc
				- docker
				- docker-compose.run.yml
				- docker-compose.test.yml
				- dockercfg.encrypted
				- mojo
				- resources


				We could follow the links down to the leaves in order to rebuild
				the project structure, and download all files individually to recreate the
				project locally. However, the archive can do it for us: it provides a
				feature called cooking to download the content of a whole project in
				one step. The feature is described in the :ref:`swh-vault
				documentation <swh-vault>`.
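
For reference, a manual traversal could look like the sketch below. It is a hypothetical helper, not part of the API: ``list_directory`` is any callable that returns the list of entries for a directory ID, for example a function wrapping a GET on ``/api/1/directory/<directory_id>/``.

```python
def walk(directory_id, list_directory, prefix=""):
    """Yield the path of every file below a directory, depth first."""
    for entry in list_directory(directory_id):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            # Recurse into the sub-directory identified by 'target'.
            yield from walk(entry["target"], list_directory, path + "/")
        elif entry["type"] == "file":
            yield path
        # Any other entry type is skipped in this sketch.


# Tiny in-memory tree standing in for API responses (illustrative IDs).
fake_tree = {
    "root": [
        {"name": "Readme.md", "type": "file", "target": "c1"},
        {"name": "doc", "type": "dir", "target": "d1"},
    ],
    "d1": [{"name": "Readme.md", "type": "file", "target": "c2"}],
}
for path in walk("root", fake_tree.__getitem__):
    print(path)
```

Each directory costs one HTTP request with this approach, which is why the single-step download described next is usually preferable.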

				Download content of a project
				-----------------------------

				When we ask the Archive to cook a directory for us, it invokes an
				asynchronous job to recursively fetch the directories and files of the
				project, following the graph down to the leaves (files) and exporting the
				result as a tar.gz file. This procedure is handled by the :ref:`swh-vault
				component <swh-vault>`, and it’s all automatic.

				Order the meal
				^^^^^^^^^^^^^^

				A cooking job can be invoked for revisions, directories or (soon)
				snapshots. It is initiated with a POST request on the ``vault/<type>/``
				endpoint. For a directory, the complete syntax is:

				``/api/1/vault/directory/<directory_id>/``

				The first POST request initiates the cooking, and subsequent GET
				requests can fetch the job status and download the archive. See the
				:ref:`Software Heritage documentation <vault-primer>` on this, with useful
				examples. The API endpoint is documented `here <https://archive.softwareheritage.org/api/1/vault/directory/doc/>`__.

				In this example we will cook the directory
				``3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200``, the ``dir_id`` that appears
				in the sample directory entries shown earlier.

				.. code:: python

				mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
				mealj = mealr.json()
				jprint(mealj)


				.. parsed-literal::

				{
				"fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
				"id": 379321799,
				"obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
				"obj_type": "directory",
				"progress_message": null,
				"status": "done"
				}


				Ask if it’s ready
				^^^^^^^^^^^^^^^^^

				We can use a GET request on the same URL to get information about the
				process status:

				.. code:: python

				statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
				statusj = statusr.json()
				jprint(statusj)


				.. parsed-literal::

				{
				"fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
				"id": 379321799,
				"obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
				"obj_type": "directory",
				"progress_message": null,
				"status": "done"
				}
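
Because cooking is asynchronous, a client usually has to repeat this GET until the job is finished. The loop below is a sketch of such polling. ``wait_for_bundle`` and its parameters are hypothetical, ``get_status`` is any callable returning the decoded JSON shown above, and the ``failed`` status value is an assumption (only ``done`` appears in the sample response).

```python
import time


def wait_for_bundle(get_status, interval=5.0, max_tries=60, sleep=time.sleep):
    """Poll the cooking status until done and return the fetch URL."""
    for _ in range(max_tries):
        info = get_status()
        if info["status"] == "done":
            return info["fetch_url"]
        if info["status"] == "failed":  # assumed status value
            raise RuntimeError(f"cooking failed: {info.get('progress_message')}")
        # Not ready yet: wait before asking again.
        sleep(interval)
    raise TimeoutError("bundle not ready after polling")
```

With ``requests``, ``get_status`` could be ``lambda: requests.get(vault_url).json()``, where ``vault_url`` is the vault URL used in the request above.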


				Get the plate
				^^^^^^^^^^^^^

				Once the processing is finished (it can take up to a few minutes) the
				tar.gz archive can be downloaded through the ``fetch_url`` link and
				extracted locally:

				::

				boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz
				% Total % Received % Xferd Average Speed Time Time Time Current
				Dload Upload Total Spent Left Speed
				100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k
				boris@castalia:downloads$ ls
				myarchive.tar.gz
				boris@castalia:downloads$ tar xzvf myarchive.tar.gz
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md
				3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config
				[SNIP]
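
The same inspection can be done from Python with the standard ``tarfile`` module. The sketch below lists the members of a downloaded bundle; ``bundle_members`` is a hypothetical helper, and ``myarchive.tar.gz`` is the file name used in the ``curl`` command above.

```python
import tarfile


def bundle_members(path):
    """Return the member names of a gzip-compressed tar bundle."""
    with tarfile.open(path, "r:gz") as bundle:
        return bundle.getnames()


# Usage, once the bundle has been downloaded:
#     for name in bundle_members("myarchive.tar.gz"):
#         print(name)
```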

				Conclusion
				----------

				In this article, we learned **how to explore and use the Software
				Heritage archive using its API**: searching for a repository,
				identifying projects and downloading specific snapshots of a repository.
				There is a lot more to the Archive and its API than what we have seen,
				and all features are generously documented on the `Software Heritage web
				site <https://archive.softwareheritage.org/api/>`__.