diff --git a/docs/getting-started/getting_started_with_the_swh_api.rst b/docs/getting-started/getting_started_with_the_swh_api.rst new file mode 100644 --- /dev/null +++ b/docs/getting-started/getting_started_with_the_swh_api.rst @@ -0,0 +1,672 @@ +Getting Started with the Software Heritage API +============================================== + +Introduction +------------ + +About Software Heritage +~~~~~~~~~~~~~~~~~~~~~~~ + +The `Software Heritage project `__ was +started in 2015 with a rather impressive goal and purpose: + + Software Heritage is an ambitious initiative that aims at collecting, + organizing, preserving and sharing all the source code publicly + available in the world. + +Yes, you read it well: all source code available in the world. It implies to +build an equally impressive structure to hold the huge amount of +information represented, make the archive available to the public +through a `nice web interface `__ +and even propose a `well-documented +API `__ to access it +seamlessly. For the records, there are also `various datasets +available `__ +for download, with detailed instructions about how to set it up. And, +yes it’s huge: the full graph generated from the archive (with only +metadata, content is not included) has more than 20b nodes and weights +1.2TB. Overall size of the archive is in the hundreds of TBs. + +This article presents, and demonstrates the use of, the `Software +Heritage API `__ to query +basic information about archived content and fetch the content of a +software project. + +Terms and Concepts +~~~~~~~~~~~~~~~~~~ + +For our activity we need to define the following terms and concepts: + +- The repositories analysed by the SWH are registered as **origins**. + Examples of origins are: https://bitbucket.org/anthroweb/apache.git, + https://github.com/apache/ant, or other types of sources (debian + source packages, npmjs, pypi, cran..). +- When repositories are analysed, it creates **snapshots**. Snapshots + describe the state of the repository at the time of analysis, and + provide links to the content. As an example in the case of a git + repository, the snapshot links to the list of branches, which + themselves link to revisions and content. +- **Revisions** are consistent sets of directories and files + representing the repository at a given time, like in a baseline. They + can be conceptually mapped to commits in subversion, to git + references, or to source package versions in debian or nmpjs + repositories. +- Revisions are linked to a **directory**, which itself links to other + directories and **files** (aka blobs). + +A full list of terms is provided in the `Software Heritage +doc `__. + +Preliminary steps +----------------- + +System requirements +~~~~~~~~~~~~~~~~~~~ + +This article uses Python 3.x on the client side, and the ``requests`` +Python module to manipulate the HTTP requests. Note however that any +language that provides HTTP requests (GET, POST) can access the API and +could be used. Firstly let’s make sure we have the correct Python +version and module installed: + +:: + + (gs_env) boris@castalia:gs$ python -V + Python 3.7.3 + (gs_env) boris@castalia:notebooks$ pip install requests + Requirement already satisfied: requests in ./gs_env/lib/python3.7/site-packages (2.25.1) + Requirement already satisfied: certifi>=2017.4.17 in ./gs_env/lib/python3.7/site-packages (from requests) (2020.12.5) + Requirement already satisfied: chardet<5,>=3.0.2 in ./gs_env/lib/python3.7/site-packages (from requests) (4.0.0) + Requirement already satisfied: idna<3,>=2.5 in ./gs_env/lib/python3.7/site-packages (from requests) (2.10) + Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./gs_env/lib/python3.7/site-packages (from requests) (1.26.4) + (gs_env) boris@castalia:gs$ + +Initialise the script +--------------------- + +We need to import a few modules and utilities to play with the Software +Heritage API, namely ``json`` and the aforementioned ``requests`` +modules. We also define a utility function to pretty-print json data +easily: + +.. code:: ipython3 + + import json + import requests + + # Utility to pretty-print json. + def jprint(obj): + # create a formatted string of the Python JSON object + print(json.dumps(obj, sort_keys=True, indent=4)) + + +The syntax mentioned in the `API +documentation `__ is rather +straightforward. Since we want to read it from the main Software +Heritage server, we will use ``https://archive.softwareheritage.org/`` +as the basename. All API calls will be forged according to the same +syntax: + +:: + + https://archive.softwareheritage.org/api/1/ + +Request basic Information +------------------------- + +We want to get some basic information about the main server activity and +content. The ``stat`` endpoint provides asummary of the main indexes and +some statistics about the archive. We can request a GET on the main +counters of the archive using the counters path, as described in the +`endpoint +documentation `__: + +``/api/1/stat/counters/`` + +This API endpoint returns the following information: \* **content** is +the total number of blobs (files) in the archive. \* **directory** is +the total number of repositories in the archive. \* **origin** is the +number of distinct origins (repositories) fetched by the archive bots. +\* **origin_visits** is the total number of visits across all origins. +\* **person** is the number of authors (e.g. committers, authors) in the +archived files. \* **release** is the number of tags retrieved in the +archive. \* **revision** is the number of revisions stored in the +archive. \* **skipped_content** is the number of objects which could be +imported in the archive. \* **snapshot** is the number of snapshots +stored in the archive. + +Note that we use the default JSON format for the output. We could use +YAML if we wanted to, with a custom ``Request Headers`` set to +``application/yaml``. + +.. code:: ipython3 + + resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/") + counters = resp.json() + jprint(counters) + + +.. parsed-literal:: + + { + "content": 10049535736, + "directory": 8390591308, + "origin": 156388918, + "person": 42263568, + "release": 17218891, + "revision": 2109783249 + } + + +There are almost 10bn blobs (aka files) in the archive and 8bn+ +directories already, for 155m repositories analysed. + +Now, what about a specific repository? Let’s say we want to find if +`alambic `__ (an open-source data provider and +analysis system for software development) has already been analysed by +the archive’s bots. + +Search the archive +------------------ + +Search for a keyword +~~~~~~~~~~~~~~~~~~~~ + +The easiest way to look for a keyword in the repositories analysed by +the archive is to use the ``search`` feature of the ``origin`` endpoint. +Documentation for the endpoint is +`here `__ +and the complete syntax is: + +:: + + `/api/1/origin/search//` + +The server returns an array of hashes, with each item being formatted +as: + +- **origin_visits_url** attribute is an URL that points to the API page + listing all visits (bot fetches) to this repository. +- **url** is the url of the origin, or repository, itself. + +A (truncated) example of a result from this endpoint is shown below: + +:: + + [ + { + "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", + "url": "https://github.com/borisbaldassari/alambic" + } + ... + ] + +As an example we will look for instances of *alambic* in the archive’s +analysed repositories: + +.. code:: ipython3 + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/") + origins = resp.json() + print("We found",len(origins),"entries.") + for origin in origins[1:10]: + print('- ',origin['url']) + + +.. parsed-literal:: + + We found 52 entries. + - https://github.com/royal-alambic-club/sauron + - https://github.com/scamberlin/alambic + - https://github.com/WebTales/alambic-connector-mongodb + - https://github.com/WebTales/alambic + - https://github.com/AssoAlambic/alambic-website + - https://bitbucket.org/nayoub/alambic.git + - https://github.com/Alexandru-Dobre/alambic-connector-rest + - https://github.com/WebTales/alambic-connector-diffbot + - https://github.com/WebTales/alambic-connector-firebase + + +There are obviously many projects and repositories that embed the word +alambic, and we will need to be a bit more specific if we are to +identify the origin actually related to the alambic project. + +If we want to know more about a specific origin, we can simply use the +``url`` attribute (or any known URL) as an entry for any of the +``origin`` endpoints. + +Search for a specific origin +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now say that we want to query the database for the specific repository +of Alambic, to know what information has been registered by the archive. +The API endpoint can be found `in the swh-web +documentation `__, +and has the following syntax: + +``/api/1/origin//get/`` + +Which returns the same type of JSON object than the ``search`` command +seen previously: + +- **origin_visits_url** attribute is an URL that points to the API page + listing all visits (bot fetches) to this repository. +- **url** is the url of the origin, or repository, itself. + +We know that Alambic is hosted at +‘https://github.com/borisbaldassari/alambic/’, so the API call will look +like this: + +``/api/1/origin/https://github.com/borisbaldassari/alambic/get/`` + +.. code:: ipython3 + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/") + found = resp.json() + jprint(found) + + +.. parsed-literal:: + + { + "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", + "url": "https://github.com/borisbaldassari/alambic" + } + + +Get visits information +~~~~~~~~~~~~~~~~~~~~~~ + +We can use the ``origin_visits_url`` attribute to know more about when +the repository was analysed by the archive bots. The API endpoint is +fully documented on the `Software Heritage doc +site `__, +and has the following syntax: + +``/api/1/origin//visits/`` + +We will use the same query as before about the main Alambic repository. + +.. code:: ipython3 + + resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/") + found = resp.json() + length = len(found) + print("Number of visits found: {}.".format(length)) + print("With dates:") + for visit in found: + print("-",visit['visit'],visit['date']) + print("\nExample of a single visit entry:") + jprint(found[0]) + + +.. parsed-literal:: + + Number of visits found: 5. + With dates: + - 5 2021-01-01T19:35:41.308336+00:00 + - 4 2020-02-06T10:41:45.700641+00:00 + - 3 2019-09-01T22:38:12.056537+00:00 + - 2 2019-06-16T04:52:18.162914+00:00 + - 1 2019-01-30T07:19:20.799217+00:00 + + Example of a single visit entry: + { + "date": "2021-01-01T19:35:41.308336+00:00", + "metadata": {}, + "origin": "https://github.com/borisbaldassari/alambic", + "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/", + "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", + "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/", + "status": "full", + "type": "git", + "visit": 5 + } + + +Get the content +--------------- + +As defined in the beginning, a snapshot is a capture of the repository +at a given time with links to all branches, commits and associated +content. In this example we will work on the snapshot ID of the last +visit to Alambic, as returned by the previous command we executed. + +.. code:: ipython3 + + # Store snapshot id + snapshot = found[0]['snapshot'] + print("Snapshot is {}.".format(snapshot)) + + +.. parsed-literal:: + + Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc. + + +Note that the latest visit to the repository can also be directly +retrieved using the `dedicated +endpoint `__ +``/api/1/origin/visit/latest/``. + +Get the snapshot +~~~~~~~~~~~~~~~~ + +We want now to retrieve the content of the project at this snapshot. For +that purpose there is the ``snapshot`` endpoint, and its documentation +is `provided +here `__. The +complete syntax is: + +``/api/1/snapshot//`` + +The snapshot endpoint returns in the ``branches`` attribute a list of +**revisions** (aka commits or branch refs in a git context), which +themselves point to the set of directories and files in the branch at +the time of analysis. Let’s follow this chain of links, starting with +the snapshot’s list of revisions (branches): + +.. code:: ipython3 + + snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot)) + snapshotj = snapshotr.json() + jprint(snapshotj) + + +.. parsed-literal:: + + { + "branches": { + "HEAD": { + "target": "refs/heads/master", + "target_type": "alias", + "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + }, + "refs/heads/devel": { + "target": "e298b8c5692b18928013a68e41fd185419515075", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/" + }, + "refs/heads/features/cr152_anonymise_data": { + "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/" + }, + "refs/heads/features/cr164_github_project": { + "target": "0005abb080e4c67a97533ee923e9d28142877752", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" + }, + "refs/heads/features/cr165_github_its": { + "target": "0005abb080e4c67a97533ee923e9d28142877752", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" + }, + "refs/heads/features/cr89_gitlabwizard": { + "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/" + }, + "refs/heads/master": { + "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", + "target_type": "revision", + "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + } + }, + "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", + "next_branch": null + } + + +Get the root directory +~~~~~~~~~~~~~~~~~~~~~~ + +The revision associated to the branch can be retrieved by following the +corresponding link in the ``target_url`` attribute. We will follow the +``refs/heads/master`` branch and get the associated revision object. In +this case (a git repository) the revision is equivalent to a branch ref +or commit, with an ID and message. + +.. code:: ipython3 + + print('Revision ID is',snapshotj['id']) + master_url = snapshotj['branches']['refs/heads/master']['target_url'] + masterr = requests.get(master_url) + masterj = masterr.json() + jprint(masterj) + + +.. parsed-literal:: + + Revision ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc + { + "author": { + "email": "boris.baldassari@gmail.com", + "fullname": "Boris Baldassari ", + "name": "Boris Baldassari" + }, + "committer": { + "email": "boris.baldassari@gmail.com", + "fullname": "Boris Baldassari ", + "name": "Boris Baldassari" + }, + "committer_date": "2020-11-01T12:55:13+01:00", + "date": "2020-11-01T12:55:13+01:00", + "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8", + "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/", + "extra_headers": [], + "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/", + "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", + "merge": false, + "message": "#163 Fix dygraphs zero padding in forums plugin.\n", + "metadata": {}, + "parents": [ + { + "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151", + "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/" + } + ], + "synthetic": false, + "type": "git", + "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" + } + + +The revision is associated to the root directory of the project. We can +list all files and directories at the root by requesting more +information from the ``directory_url`` attribute. The endpoint is +documented +`here `__ and +has the following syntax: + +``/api/1/directory//`` + +The structure of the response is an **array of files and directories**. +**Files** are represented like this: + +:: + + { + "checksums": { + "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8", + "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", + "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc" + }, + "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "length": 101, + "name": ".dockerignore", + "perms": 33188, + "status": "visible", + "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", + "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/", + "type": "file" + } + +And **directories** are represented with: + +:: + + { + "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "length": null, + "name": "doc", + "perms": 16384, + "target": "316468df4988351911992ecbf1866f1c1f575c23", + "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/", + "type": "dir" + } + +We will print the list of files and directories located at the root of +the repository at the time of analysis: + +.. code:: ipython3 + + root_url = masterj['directory_url'] + rootr = requests.get(root_url) + rootj = rootr.json() + for f in rootj: + print('-',f['name']) + #jprint(rootj) + + +.. parsed-literal:: + + - .dockerignore + - .env + - .gitignore + - CODE_OF_CONDUCT.html + - CODE_OF_CONDUCT.md + - LICENCE.html + - LICENCE.md + - Readme.md + - doc + - docker + - docker-compose.run.yml + - docker-compose.test.yml + - dockercfg.encrypted + - mojo + - resources + + +We could follow the links up (or down) to the leaves in order to rebuild +the project structure and download all files individually to rebuild the +project locally. However the archive can do it for us, and provides a +feature to download the content of a whole project in one step: +**cooking**. The feature is described in the `swh-vault +documentation `__. + +Download content of a project +----------------------------- + +When we ask the Archive to cook a directory for us, it invokes an +asynchronous job to recuversively fetch the directories and files of the +project, following the graph up to the leaves (files) and exporting the +result as a tar.gz file. This procedure is handled by the `swh-vault +component `__, +and it’s all automatic. + +Order the meal +~~~~~~~~~~~~~~ + +A cooking job can be invoked for revisions, directories or snapshots +(soon). It is initiated with a POST request on the ``vault//`` +endpoint, and its complete syntax is: + +``/api/1/vault/directory//`` + +The first POST request initiates the cooking, and subsequent GET +requests can fetch the job result and download the archive. See the +`Software Heritage +documentation `__ +on this, with useful examples. The API endpoint is documented +`here `__. + +In this example we will fetch the content of the root directory that we +previously identified. + +.. code:: ipython3 + + mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") + mealj = mealr.json() + jprint(mealj) + + +.. parsed-literal:: + + { + "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", + "id": 379321799, + "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "obj_type": "directory", + "progress_message": null, + "status": "done" + } + + +Ask if it’s ready +~~~~~~~~~~~~~~~~~ + +We can use a GET request on the same URL to get information about the +process status: + +.. code:: ipython3 + + statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") + statusj = statusr.json() + jprint(statusj) + + +.. parsed-literal:: + + { + "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", + "id": 379321799, + "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", + "obj_type": "directory", + "progress_message": null, + "status": "done" + } + + +Get the plate +~~~~~~~~~~~~~ + +Once the processing is finished (it can take up to a few minutes) the +tar.gz archive can be downloaded through the ``fetch_url`` link, and +extracted as a tar.gz archive: + +:: + + boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz + % Total % Received % Xferd Average Speed Time Time Time Current + Dload Upload Total Spent Left Speed + 100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k + boris@castalia:downloads$ ls + myarchive.tar.gz + boris@castalia:downloads$ tar xzf myarchive.tar.gz + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/ + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/ + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md + 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config + [SNIP] + +Conclusion +---------- + +In this article, we learned **how to explore and use the Software +Heritage archive using its API**: searching for a repository, +identifying projects and downloading specific snapshots of a repository. +There is a lot more to the Archive and its API than what we have seen, +and all features are generously documented on the `Software Heritage web +site `__. + + + diff --git a/docs/getting-started/index.rst b/docs/getting-started/index.rst --- a/docs/getting-started/index.rst +++ b/docs/getting-started/index.rst @@ -11,3 +11,5 @@ ../getting-started ../developer-setup using-docker + getting_started_with_the_swh_api +