+Getting Started with the Software Heritage API
+About Software Heritage
+The `Software Heritage project <>`__ was
+started in 2015 with a rather impressive goal and purpose:
+ Software Heritage is an ambitious initiative that aims at collecting,
+ organizing, preserving and sharing all the source code publicly
+ available in the world.
+Yes, you read that right: all source code available in the world. This
+implies building an equally impressive infrastructure to hold the huge
+amount of information involved, making the archive available to the
+public through a `nice web interface <>`__,
+and even offering a `well-documented
+API <>`__ to access it
+seamlessly. For the record, there are also `various datasets
+available <>`__
+for download, with detailed instructions on how to set them up. And
+yes, it’s huge: the full graph generated from the archive (metadata
+only, content not included) has more than 20 billion nodes and weighs
+1.2TB. The overall size of the archive is in the hundreds of TBs.
+This article presents, and demonstrates the use of, the `Software
+Heritage API <>`__ to query
+basic information about archived content and fetch the content of a
+software project.
+Terms and Concepts
+For our activity we need to define the following terms and concepts:
+- The repositories analysed by the SWH are registered as **origins**.
+ Examples of origins are:,
+, or other types of sources (debian
+ source packages, npmjs, pypi, cran..).
+- Each analysis of a repository creates a **snapshot**. Snapshots
+ describe the state of the repository at the time of analysis, and
+ provide links to the content. As an example in the case of a git
+ repository, the snapshot links to the list of branches, which
+ themselves link to revisions and content.
+- **Revisions** are consistent sets of directories and files
+ representing the repository at a given time, like in a baseline. They
+ can be conceptually mapped to commits in subversion, to git
+ references, or to source package versions in debian or npmjs
+ repositories.
+- Revisions are linked to a **directory**, which itself links to other
+ directories and **files** (aka blobs).
+A full list of terms is provided in the `Software Heritage
+doc <>`__.
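+To keep these relationships in mind, here is a minimal sketch of the chain
+of links described above. It is a simplified mental model rather than the
+archive's actual schema (for instance, directories may also link to other
+directories); the ``LINKS`` table and the ``chain`` helper are hypothetical
+names introduced only for illustration.

```python
# Simplified view of how archive objects point to one another,
# following the list of terms above (not the archive's real schema).
LINKS = {
    "origin": "snapshot",
    "snapshot": "revision",
    "revision": "directory",
    "directory": "file",
    "file": None,  # files (blobs) are the leaves of the graph
}


def chain(start):
    """Return the object types reached by following links from `start`."""
    path = [start]
    while LINKS[path[-1]] is not None:
        path.append(LINKS[path[-1]])
    return path


print(" -> ".join(chain("origin")))
```

+The rest of the article follows exactly this path, from an origin down to
+the files of a project.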
+Preliminary steps
+System requirements
+This article uses Python 3.x on the client side, and the ``requests``
+Python module to handle HTTP requests. Note, however, that any
+language able to send HTTP requests (GET, POST) can access the API.
+First, let’s make sure we have the correct Python version and module
+installed:
+ (gs_env) boris@castalia:gs$ python -V
+ Python 3.7.3
+ (gs_env) boris@castalia:notebooks$ pip install requests
+ Requirement already satisfied: requests in ./gs_env/lib/python3.7/site-packages (2.25.1)
+ Requirement already satisfied: certifi>=2017.4.17 in ./gs_env/lib/python3.7/site-packages (from requests) (2020.12.5)
+ Requirement already satisfied: chardet<5,>=3.0.2 in ./gs_env/lib/python3.7/site-packages (from requests) (4.0.0)
+ Requirement already satisfied: idna<3,>=2.5 in ./gs_env/lib/python3.7/site-packages (from requests) (2.10)
+ Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./gs_env/lib/python3.7/site-packages (from requests) (1.26.4)
+ (gs_env) boris@castalia:gs$
+Initialise the script
+We need to import a few modules and utilities to play with the Software
+Heritage API, namely ``json`` and the aforementioned ``requests``
+modules. We also define a utility function to pretty-print JSON data:
+.. code:: ipython3
+ import json
+ import requests
+ # Utility to pretty-print json.
+ def jprint(obj):
+     # create a formatted string of the Python JSON object
+     print(json.dumps(obj, sort_keys=True, indent=4))
+The syntax described in the `API
+documentation <>`__ is rather
+straightforward. Since we want to read from the main Software
+Heritage server, we will use ````
+as the base URL. All API calls will be built according to the same
+syntax.
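+As an illustration of this common structure, the sketch below builds
+endpoint URLs from path components. It assumes that the public archive's
+API lives under ``https://archive.softwareheritage.org/api/1/``;
+``API_BASE`` and ``api_url`` are hypothetical names chosen for this
+example, not part of any Software Heritage library.

```python
# Assumed base URL of the public archive's API (version 1).
API_BASE = "https://archive.softwareheritage.org/api/1/"


def api_url(*parts):
    """Join path components under API_BASE and add a trailing slash."""
    path = "/".join(str(part).strip("/") for part in parts)
    return API_BASE + path + "/"


print(api_url("stat", "counters"))
```

+Keeping the base URL in a single constant makes it easy to point the same
+script at another instance of the archive.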
+Request basic Information
+We want to get some basic information about the main server activity and
+content. The ``stat`` endpoint provides a summary of the main indexes and
+some statistics about the archive. We can send a GET request for the main
+counters of the archive using the counters path, as described in the
+`documentation <>`__:
+This API endpoint returns the following information:
+- **content** is the total number of blobs (files) in the archive.
+- **directory** is the total number of directories in the archive.
+- **origin** is the number of distinct origins (repositories) fetched by
+ the archive bots.
+- **origin_visits** is the total number of visits across all origins.
+- **person** is the number of persons (e.g. authors, committers) in the
+ archived files.
+- **release** is the number of tags retrieved in the archive.
+- **revision** is the number of revisions stored in the archive.
+- **skipped_content** is the number of objects which could not be
+ imported in the archive.
+- **snapshot** is the number of snapshots stored in the archive.
+Note that we use the default JSON format for the output. We could use
+YAML if we wanted to, with a custom ``Request Headers`` set to
+.. code:: ipython3
+ resp = requests.get("")
+ counters = resp.json()
+ jprint(counters)
+.. parsed-literal::
+ {
+ "content": 10049535736,
+ "directory": 8390591308,
+ "origin": 156388918,
+ "person": 42263568,
+ "release": 17218891,
+ "revision": 2109783249
+ }
+There are just over 10bn blobs (aka files) in the archive and 8bn+
+directories already, for more than 156m repositories analysed.
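+Raw counters of this size are hard to read at a glance. The following
+sketch formats them with the same ``m``/``bn`` shorthand used in this
+article. The ``humanize`` helper is a hypothetical name, and the values are
+simply copied from the response shown above.

```python
def humanize(count):
    """Format a large integer with a k/m/bn suffix and one decimal place."""
    for threshold, suffix in ((10**9, "bn"), (10**6, "m"), (10**3, "k")):
        if count >= threshold:
            return "{:.1f}{}".format(count / threshold, suffix)
    return str(count)


# Values copied from the counters response shown above.
counters = {
    "content": 10049535736,
    "directory": 8390591308,
    "origin": 156388918,
    "person": 42263568,
    "release": 17218891,
    "revision": 2109783249,
}

for name in sorted(counters):
    print("{:<10} {}".format(name, humanize(counters[name])))
```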
+Now, what about a specific repository? Let’s say we want to find out whether
+`alambic <>`__ (an open-source data provider and
+analysis system for software development) has already been analysed by
+the archive’s bots.
+Search the archive
+Search for a keyword
+The easiest way to look for a keyword in the repositories analysed by
+the archive is to use the ``search`` feature of the ``origin`` endpoint.
+Documentation for the endpoint is
+`here <>`__
+and the complete syntax is:
+ `/api/1/origin/search/<keyword>/`
+The server returns a JSON array of objects, with each item being formatted
+as follows:
+- **origin_visits_url** is a URL that points to the API page
+ listing all visits (bot fetches) to this repository.
+- **url** is the URL of the origin, or repository, itself.
+A (truncated) example of a result from this endpoint is shown below:
+ [
+ {
+ "origin_visits_url": "",
+ "url": ""
+ }
+ ...
+ ]
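+Keywords may contain characters that are not safe in a URL path, so it can
+help to build the search URL programmatically. The sketch below is written
+under two assumptions that should be checked against the endpoint
+documentation: that the API base is
+``https://archive.softwareheritage.org/api/1/``, and that the endpoint
+accepts ``limit`` and ``with_visit`` query parameters. ``search_url`` is a
+hypothetical helper name.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the public archive's API.
API_BASE = "https://archive.softwareheritage.org/api/1/"


def search_url(keyword, limit=None, with_visit=None):
    """Build the origin search URL for `keyword`, percent-encoding it."""
    url = "{}origin/search/{}/".format(API_BASE, quote(keyword, safe=""))
    params = {}
    # Both query parameters are assumptions; see the endpoint documentation.
    if limit is not None:
        params["limit"] = limit
    if with_visit is not None:
        params["with_visit"] = "true" if with_visit else "false"
    if params:
        url += "?" + urlencode(params)
    return url


print(search_url("alambic", limit=10))
```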
+As an example we will look for instances of *alambic* in the archive’s
+analysed repositories:
+.. code:: ipython3
+ resp = requests.get("")
+ origins = resp.json()
+ print("We found", len(origins), "entries.")
+ for origin in origins[1:10]:
+     print('- ', origin['url'])
+.. parsed-literal::
+ We found 52 entries.
+ -
+ -
+ -
+ -
+ -
+ -
+ -
+ -
+ -
+There are obviously many projects and repositories that embed the word
+alambic, and we will need to be a bit more specific if we are to
+identify the origin actually related to the alambic project.
+If we want to know more about a specific origin, we can simply use the
+``url`` attribute (or any known URL) as an entry for any of the
+``origin`` endpoints.
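+One way to narrow down such a list is to keep only the origins whose last
+URL path segment is exactly the project name. This is a rough heuristic
+sketched for illustration, not a feature of the API; the
+``origins_named`` helper and the ``example.org`` URLs are hypothetical.

```python
from urllib.parse import urlparse


def origins_named(origins, project):
    """Keep origins whose last URL path segment is `project` (or `project`.git)."""
    wanted = (project.lower(), project.lower() + ".git")
    matches = []
    for origin in origins:
        path = urlparse(origin["url"]).path.rstrip("/")
        last_segment = path.rsplit("/", 1)[-1].lower()
        if last_segment in wanted:
            matches.append(origin["url"])
    return matches


# Hypothetical search results, for illustration only.
sample = [
    {"url": "https://example.org/alice/alambic"},
    {"url": "https://example.org/bob/alambic-tools"},
    {"url": "https://example.org/carol/Alambic.git"},
]
print(origins_named(sample, "alambic"))
```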
+Search for a specific origin
+Now say that we want to query the database for the specific repository
+of Alambic, to know what information has been registered by the archive.
+The API endpoint can be found `in the swh-web
+documentation <>`__,
+and has the following syntax:
+This returns the same type of JSON object as the ``search`` command
+seen previously:
+- **origin_visits_url** is a URL that points to the API page
+ listing all visits (bot fetches) to this repository.
+- **url** is the URL of the origin, or repository, itself.
+We know that Alambic is hosted at
+‘’, so the API call will look
+like this:
+.. code:: ipython3
+ resp = requests.get("")
+ found = resp.json()
+ jprint(found)
+.. parsed-literal::
+ {
+ "origin_visits_url": "",
+ "url": ""
+ }
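+The call above succeeds because the origin is known to the archive. A more
+defensive version might look like the sketch below, which assumes that the
+API answers with a 404 status when an origin has never been archived.
+``fetch_json`` and ``StubResponse`` are hypothetical names; the HTTP
+function is passed in as a parameter so the helper can be tried offline.

```python
def fetch_json(url, http_get):
    """GET `url` with the given callable; return parsed JSON, or None on 404."""
    resp = http_get(url)
    if resp.status_code == 404:  # assumed answer for an unknown origin
        return None
    resp.raise_for_status()  # surface any other HTTP error
    return resp.json()


class StubResponse:
    """Offline stand-in exposing the response attributes used above."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError("HTTP error {}".format(self.status_code))

    def json(self):
        return self._payload


print(fetch_json("unknown-origin", lambda url: StubResponse(404)))
```

+With the ``requests`` module, the same helper would be called as
+``fetch_json(some_url, requests.get)``.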
+Get visits information
+We can use the ``origin_visits_url`` attribute to know more about when
+the repository was analysed by the archive bots. The API endpoint is
+fully documented on the `Software Heritage doc
+site <>`__,
+and has the following syntax:
+We will use the same query as before about the main Alambic repository.
+.. code:: ipython3
+ resp = requests.get("")
+ found = resp.json()
+ length = len(found)
+ print("Number of visits found: {}.".format(length))
+ print("With dates:")
+ for visit in found:
+     print("-", visit['visit'], visit['date'])
+ print("\nExample of a single visit entry:")
+ jprint(found[0])
+.. parsed-literal::
+ Number of visits found: 5.
+ With dates:
+ - 5 2021-01-01T19:35:41.308336+00:00
+ - 4 2020-02-06T10:41:45.700641+00:00
+ - 3 2019-09-01T22:38:12.056537+00:00
+ - 2 2019-06-16T04:52:18.162914+00:00
+ - 1 2019-01-30T07:19:20.799217+00:00
+ Example of a single visit entry:
+ {
+ "date": "2021-01-01T19:35:41.308336+00:00",
+ "metadata": {},
+ "origin": "",
+ "origin_visit_url": "",
+ "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
+ "snapshot_url": "",
+ "status": "full",
+ "type": "git",
+ "visit": 5
+ }
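+Not every visit necessarily ends with a usable snapshot, so a script may
+want to pick the most recent visit whose ``status`` is ``full``. The sketch
+below does this on a small sample: the entry for visit 5 is taken from the
+output above, while the two other entries are hypothetical and only there
+to exercise the filter. ``latest_full_visit`` is a hypothetical helper name.

```python
from datetime import datetime


def latest_full_visit(visits):
    """Return the most recent visit with status 'full' and a snapshot, or None."""
    usable = [v for v in visits if v.get("status") == "full" and v.get("snapshot")]
    if not usable:
        return None
    return max(usable, key=lambda v: datetime.fromisoformat(v["date"]))


visits = [
    # Hypothetical incomplete visit, more recent than the real one below.
    {"visit": 6, "date": "2021-02-01T00:00:00+00:00",
     "status": "partial", "snapshot": None},
    # Visit 5, as shown in the output above.
    {"visit": 5, "date": "2021-01-01T19:35:41.308336+00:00",
     "status": "full", "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc"},
    # Hypothetical older visit with a placeholder snapshot identifier.
    {"visit": 0, "date": "2018-01-01T00:00:00+00:00",
     "status": "full", "snapshot": "0" * 40},
]

best = latest_full_visit(visits)
print(best["visit"], best["snapshot"])
```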
+Get the content
+As defined at the beginning, a snapshot is a capture of the repository
+at a given time with links to all branches, commits and associated
+content. In this example we will work on the snapshot ID of the last
+visit to Alambic, as returned by the previous command we executed.
+.. code:: ipython3
+ # Store snapshot id
+ snapshot = found[0]['snapshot']
+ print("Snapshot is {}.".format(snapshot))
+.. parsed-literal::
+ Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.
+Note that the latest visit to the repository can also be directly
+retrieved using the `dedicated
+endpoint <>`__.
+Get the snapshot
+We now want to retrieve the content of the project at this snapshot. For
+that purpose there is the ``snapshot`` endpoint, and its documentation
+is `provided
+here <>`__. The
+complete syntax is:
+The snapshot endpoint returns in the ``branches`` attribute a list of
+**revisions** (aka commits or branch refs in a git context), which
+themselves point to the set of directories and files in the branch at
+the time of analysis. Let’s follow this chain of links, starting with
+the snapshot’s list of revisions (branches):
+.. code:: ipython3
+ snapshotr = requests.get("{}/".format(snapshot))
+ snapshotj = snapshotr.json()
+ jprint(snapshotj)
+.. parsed-literal::
+ {
+ "branches": {
+ "HEAD": {
+ "target": "refs/heads/master",
+ "target_type": "alias",
+ "target_url": ""
+ },
+ "refs/heads/devel": {
+ "target": "e298b8c5692b18928013a68e41fd185419515075",
+ "target_type": "revision",
+ "target_url": ""
+ },
+ "refs/heads/features/cr152_anonymise_data": {
+ "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162",
+ "target_type": "revision",
+ "target_url": ""
+ },
+ "refs/heads/features/cr164_github_project": {
+ "target": "0005abb080e4c67a97533ee923e9d28142877752",
+ "target_type": "revision",
+ "target_url": ""
+ },
+ "refs/heads/features/cr165_github_its": {
+ "target": "0005abb080e4c67a97533ee923e9d28142877752",
+ "target_type": "revision",
+ "target_url": ""
+ },
+ "refs/heads/features/cr89_gitlabwizard": {
+ "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d",
+ "target_type": "revision",
+ "target_url": ""
+ },
+ "refs/heads/master": {
+ "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
+ "target_type": "revision",
+ "target_url": ""
+ }
+ },
+ "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
+ "next_branch": null
+ }
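+In the response above, ``HEAD`` is not a revision but an ``alias`` pointing
+to another branch. A script that starts from ``HEAD`` therefore has to
+follow the alias first. The sketch below does so on a trimmed copy of the
+response; ``resolve_branch`` is a hypothetical helper, and the ``max_hops``
+guard is a precaution added here against alias loops.

```python
def resolve_branch(branches, name, max_hops=10):
    """Follow alias branches; return (branch name, branch) for the real target."""
    current = name
    for _ in range(max_hops):
        branch = branches[current]
        if branch["target_type"] != "alias":
            return current, branch
        current = branch["target"]  # an alias targets another branch name
    raise ValueError("too many alias hops starting from {!r}".format(name))


# Trimmed copy of the "branches" attribute shown above.
branches = {
    "HEAD": {
        "target": "refs/heads/master",
        "target_type": "alias",
    },
    "refs/heads/master": {
        "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
        "target_type": "revision",
    },
}

name, branch = resolve_branch(branches, "HEAD")
print(name, "->", branch["target"])
```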
+Get the root directory
+The revision associated with the branch can be retrieved by following the
+corresponding link in the ``target_url`` attribute. We will follow the
+``refs/heads/master`` branch and get the associated revision object. In
+this case (a git repository) the revision is equivalent to a branch ref
+or commit, with an ID and message.
+.. code:: ipython3
+ print('Snapshot ID is', snapshotj['id'])
+ master_url = snapshotj['branches']['refs/heads/master']['target_url']
+ masterr = requests.get(master_url)
+ masterj = masterr.json()
+ jprint(masterj)
+.. parsed-literal::
+ Snapshot ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc
+ {
+ "author": {
+ "email": "",
+ "fullname": "Boris Baldassari <>",
+ "name": "Boris Baldassari"
+ },
+ "committer": {
+ "email": "",
+ "fullname": "Boris Baldassari <>",
+ "name": "Boris Baldassari"
+ },
+ "committer_date": "2020-11-01T12:55:13+01:00",
+ "date": "2020-11-01T12:55:13+01:00",
+ "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8",
+ "directory_url": "",
+ "extra_headers": [],
+ "history_url": "",
+ "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
+ "merge": false,
+ "message": "#163 Fix dygraphs zero padding in forums plugin.\n",
+ "metadata": {},
+ "parents": [
+ {
+ "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151",
+ "url": ""
+ }
+ ],
+ "synthetic": false,
+ "type": "git",
+ "url": ""
+ }
+The revision is associated with the root directory of the project. We can
+list all files and directories at the root by requesting more
+information from the ``directory_url`` attribute. The endpoint is
+`here <>`__ and
+has the following syntax:
+The structure of the response is an **array of files and directories**.
+**Files** are represented like this:
+ {
+ "checksums": {
+ "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8",
+ "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
+ "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc"
+ },
+ "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
+ "length": 101,
+ "name": ".dockerignore",
+ "perms": 33188,
+ "status": "visible",
+ "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
+ "target_url": "",
+ "type": "file"
+ }
+And **directories** are represented with:
+ {
+ "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
+ "length": null,
+ "name": "doc",
+ "perms": 16384,
+ "target": "316468df4988351911992ecbf1866f1c1f575c23",
+ "target_url": "",
+ "type": "dir"
+ }
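+The ``perms`` values in these entries are Unix file modes written in
+decimal: 33188 is ``0o100644`` (a regular file) and 16384 is ``0o040000``
+(a directory). As a small sketch, Python's standard ``stat`` module can
+turn them into the familiar ``ls -l`` notation. ``describe_entry`` is a
+hypothetical helper, and the two entries are trimmed from the examples
+above.

```python
import stat


def describe_entry(entry):
    """Return an `ls -l`-style line for one directory entry."""
    mode = stat.filemode(entry["perms"])
    size = entry["length"] if entry["length"] is not None else "-"
    return "{} {:>6} {}".format(mode, size, entry["name"])


# Entries trimmed from the file and directory examples above.
entries = [
    {"name": ".dockerignore", "perms": 33188, "length": 101, "type": "file"},
    {"name": "doc", "perms": 16384, "length": None, "type": "dir"},
]

for entry in entries:
    print(describe_entry(entry))
```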
+We will print the list of files and directories located at the root of
+the repository at the time of analysis:
+.. code:: ipython3
+ root_url = masterj['directory_url']
+ rootr = requests.get(root_url)
+ rootj = rootr.json()
+ for f in rootj:
+     print('-', f['name'])
+ #jprint(rootj)
+.. parsed-literal::
+ - .dockerignore
+ - .env
+ - .gitignore
+ -
+ - LICENCE.html
+ -
+ -
+ - doc
+ - docker
+ -
+ - docker-compose.test.yml
+ - dockercfg.encrypted
+ - mojo
+ - resources
+We could follow the links down to the leaves to recover the project
+structure, then download each file individually to rebuild the
+project locally. However, the archive can do this for us, and provides a
+feature to download the content of a whole project in one step:
+**cooking**. The feature is described in the `swh-vault
+documentation <>`__.
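+For reference, the manual approach could be sketched as the recursive walk
+below. It relies only on the ``name``, ``type`` and ``target_url``
+attributes shown earlier, treats anything that is not a ``dir`` as a leaf,
+and takes the function that fetches a directory listing as a parameter.
+``walk`` and ``FAKE_API`` are hypothetical names, and the fake listings
+stand in for real API responses so the example runs offline.

```python
def walk(directory_url, fetch, prefix=""):
    """Yield the path of every non-directory entry below `directory_url`."""
    for entry in fetch(directory_url):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            # Recurse into sub-directories through their own listing URL.
            yield from walk(entry["target_url"], fetch, path + "/")
        else:
            yield path


# Offline stand-in for the API: made-up URLs mapped to directory listings.
FAKE_API = {
    "dir:root": [
        {"name": ".gitignore", "type": "file", "target_url": "content:1"},
        {"name": "doc", "type": "dir", "target_url": "dir:doc"},
    ],
    "dir:doc": [
        {"name": "config", "type": "file", "target_url": "content:2"},
    ],
}

print(list(walk("dir:root", FAKE_API.__getitem__)))
```

+Against the real API, ``fetch`` could be ``lambda url:
+requests.get(url).json()``. Keep in mind that this issues one request per
+directory, which adds up quickly on a large project.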
+Download content of a project
+When we ask the archive to cook a directory for us, it starts an
+asynchronous job that recursively fetches the directories and files of the
+project, following the graph down to the leaves (files) and exporting the
+result as a tar.gz file. This procedure is handled by the `swh-vault
+component <>`__,
+and it’s all automatic.
+Order the meal
+A cooking job can be invoked for revisions, directories or snapshots
+(soon). It is initiated with a POST request on the ``vault/<type>/``
+endpoint, and its complete syntax is:
+The first POST request initiates the cooking, and subsequent GET
+requests can fetch the job result and download the archive. See the
+`Software Heritage
+documentation <>`__
+on this, with useful examples. The API endpoint is documented
+`here <>`__.
+In this example we will fetch the content of the root directory that we
+previously identified.
+.. code:: ipython3
+ mealr = requests.post("")
+ mealj = mealr.json()
+ jprint(mealj)
+.. parsed-literal::
+ {
+ "fetch_url": "",
+ "id": 379321799,
+ "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
+ "obj_type": "directory",
+ "progress_message": null,
+ "status": "done"
+ }
+Ask if it’s ready
+We can use a GET request on the same URL to get information about the
+process status:
+.. code:: ipython3
+ statusr = requests.get("")
+ statusj = statusr.json()
+ jprint(statusj)
+.. parsed-literal::
+ {
+ "fetch_url": "",
+ "id": 379321799,
+ "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
+ "obj_type": "directory",
+ "progress_message": null,
+ "status": "done"
+ }
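+In this run the bundle was already ``done``, but a fresh cooking job may
+take a while. A script would typically poll the status URL until the job
+finishes, as in the sketch below. It assumes that an unsuccessful job
+reports a ``failed`` status, which should be checked against the vault
+documentation. ``wait_for_bundle`` is a hypothetical helper; the fetch and
+sleep functions are parameters so the loop can be tried without network
+access.

```python
import time


def wait_for_bundle(status_url, fetch, interval=10, max_tries=30, sleep=time.sleep):
    """Poll `status_url` until the cooking job is done; return its fetch_url."""
    for _ in range(max_tries):
        job = fetch(status_url)
        if job["status"] == "done":
            return job["fetch_url"]
        if job["status"] == "failed":  # assumed status name for a failed job
            raise RuntimeError("cooking failed: {}".format(job.get("progress_message")))
        sleep(interval)
    raise TimeoutError("bundle not ready after {} tries".format(max_tries))


# Scripted responses standing in for successive GET requests.
scripted = iter([
    {"status": "pending"},
    {"status": "done", "fetch_url": "bundle-url"},
])
print(wait_for_bundle("status-url", lambda url: next(scripted), sleep=lambda s: None))
```

+With the ``requests`` module, ``fetch`` could be ``lambda url:
+requests.get(url).json()``.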
+Get the plate
+Once the processing is finished (it can take up to a few minutes) the
+tar.gz archive can be downloaded through the ``fetch_url`` link and
+then extracted:
+ boris@castalia:downloads$ curl -o myarchive.tar.gz
+ % Total % Received % Xferd Average Speed Time Time Time Current
+ Dload Upload Total Spent Left Speed
+ 100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k
+ boris@castalia:downloads$ ls
+ myarchive.tar.gz
+ boris@castalia:downloads$ tar xzvf myarchive.tar.gz
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config
+ [SNIP]
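+The downloaded bundle can also be inspected from Python with the standard
+``tarfile`` module. The sketch below lists the top-level names found in a
+tar.gz stream; to stay runnable offline it builds a tiny archive in memory
+instead of using the real download. ``top_level_names`` and the ``demo``
+directory are hypothetical.

```python
import io
import tarfile


def top_level_names(fileobj):
    """Return the sorted top-level names found in a tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted({member.name.split("/", 1)[0] for member in archive.getmembers()})


# Build a tiny in-memory tar.gz so the helper can be tried without a download.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    data = b"*.log\n"
    info = tarfile.TarInfo("demo/.gitignore")
    info.size = len(data)
    archive.addfile(info, io.BytesIO(data))
buffer.seek(0)

print(top_level_names(buffer))
```

+For the bundle downloaded above, the same helper would be called with
+``open("myarchive.tar.gz", "rb")``.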
+In this article, we learned **how to explore and use the Software
+Heritage archive using its API**: searching for a repository,
+identifying projects and downloading specific snapshots of a repository.
+There is a lot more to the Archive and its API than what we have seen,
+and all features are generously documented on the `Software Heritage web
+site <>`__.
