docs/getting-started/getting_started_with_the_swh_api.rst
- This file was added.
Getting Started with the Software Heritage API
==============================================

Introduction
------------

About Software Heritage
~~~~~~~~~~~~~~~~~~~~~~~

The `Software Heritage project <https://www.softwareheritage.org>`__ was
started in 2015 with a rather impressive goal and purpose:

   Software Heritage is an ambitious initiative that aims at collecting,
   organizing, preserving and sharing all the source code publicly
   available in the world.

Yes, you read that right: all source code available in the world. This means
we must build an equally impressive structure to hold the huge amount of
information represented, make the archive available to the public | |||||||||||||||||
through a `nice web interface <https://archive.softwareheritage.org/>`__ | |||||||||||||||||
and even propose a `well-documented | |||||||||||||||||
API <https://docs.softwareheritage.org/devel/swh-web/>`__ to access it | |||||||||||||||||
seamlessly. For the record, there are also `various datasets
available <https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html>`__ | |||||||||||||||||
vlorentz (Done): Use references instead of external links: https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#ref-role You'll need to edit swh-dataset's documentation to add an anchor there first
for download, with detailed instructions about how to set them up. And
yes, it's huge: the full graph generated from the archive (metadata only,
content is not included) has more than 20 billion nodes and weighs
1.2 TB. The overall size of the archive is in the hundreds of terabytes.
This article presents, and demonstrates the use of, the `Software | |||||||||||||||||
Heritage API <https://archive.softwareheritage.org/api/1/>`__ to query | |||||||||||||||||
basic information about archived content and fetch the content of a | |||||||||||||||||
software project. | |||||||||||||||||
Terms and Concepts | |||||||||||||||||
~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
For our activity we need to define the following terms and concepts:

- The repositories analysed by Software Heritage (SWH) are registered as
  **origins**. Examples of origins are
  https://bitbucket.org/anthroweb/apache.git and
  https://github.com/apache/ant. Origins can also be other types of
  sources (Debian source packages, npmjs, PyPI, CRAN, ...).
- Analysing a repository creates a **snapshot**. Snapshots describe the
  state of the repository at the time of analysis, and provide links to
  the content. For example, in the case of a git repository, the snapshot
  links to the list of branches, which themselves link to revisions and
  content.
- **Revisions** are consistent sets of directories and files representing
  the repository at a given time, like a baseline. They can be
  conceptually mapped to commits in subversion, to git references, or to
  source package versions in Debian or npmjs repositories.
- Revisions are linked to a **directory**, which itself links to other
  directories and **files** (aka blobs).

A full list of terms is provided in the `Software Heritage
doc <https://wiki.softwareheritage.org/index.php?title=Glossary>`__.
vlorentz (Not Done): Isn't this description redundant with the data model? https://docs.softwareheritage.org/devel/swh-model/data-model.html
borisbaldassari (Author, Done): Well, it's a subset of concepts needed to quickly understand the document, instead of the big, comprehensive but complex, data model. When I'm reading a getting started, I don't want to have to swallow the full mindblowing range of great concepts, I just want to get started, so.. If I may stand up for that, I will, otherwise I'll cut the description short as proposed (and that would be ok ;-).
Preliminary steps | |||||||||||||||||
----------------- | |||||||||||||||||
System requirements | |||||||||||||||||
~~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
This article uses Python 3.x on the client side, and the ``requests`` | |||||||||||||||||
Python module to manipulate the HTTP requests. Note however that any | |||||||||||||||||
language that provides HTTP requests (GET, POST) can access the API and | |||||||||||||||||
could be used. Firstly let’s make sure we have the correct Python | |||||||||||||||||
version and module installed: | |||||||||||||||||
:: | |||||||||||||||||
(gs_env) boris@castalia:gs$ python -V | |||||||||||||||||
Python 3.7.3 | |||||||||||||||||
(gs_env) boris@castalia:notebooks$ pip install requests | |||||||||||||||||
Requirement already satisfied: requests in ./gs_env/lib/python3.7/site-packages (2.25.1) | |||||||||||||||||
Requirement already satisfied: certifi>=2017.4.17 in ./gs_env/lib/python3.7/site-packages (from requests) (2020.12.5) | |||||||||||||||||
Requirement already satisfied: chardet<5,>=3.0.2 in ./gs_env/lib/python3.7/site-packages (from requests) (4.0.0) | |||||||||||||||||
Requirement already satisfied: idna<3,>=2.5 in ./gs_env/lib/python3.7/site-packages (from requests) (2.10) | |||||||||||||||||
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./gs_env/lib/python3.7/site-packages (from requests) (1.26.4) | |||||||||||||||||
(gs_env) boris@castalia:gs$ | |||||||||||||||||
vlorentz (Done): This assumes a virtualenv to work. Could you either mention virtualenvs in the documentation, or update it to work also without a virtualenv (you just need to rename python to python3 and pip to pip3)?
borisbaldassari (Author, Done): I'd say that "no assumption" is the way to go, so applied what you proposed (no virtualenv).
Initialise the script | |||||||||||||||||
--------------------- | |||||||||||||||||
We need to import a few modules and utilities to play with the Software
Heritage API, namely the ``json`` module and the aforementioned ``requests``
module. We also define a utility function to pretty-print JSON data
easily:

.. code:: ipython3

   import json
   import requests

   # Utility to pretty-print JSON.
   def jprint(obj):
       # Serialise the Python object as an indented, key-sorted JSON string.
       print(json.dumps(obj, sort_keys=True, indent=4))
The syntax mentioned in the `API
documentation <https://archive.softwareheritage.org/api/1/>`__ is rather
straightforward. Since we want to read from the main Software
Heritage server, we will use ``https://archive.softwareheritage.org/``
as the base URL. All API calls are built according to the same
syntax:

::

   https://archive.softwareheritage.org/api/1/<end/point>
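Since every call shares the same prefix, a tiny helper can build the URLs for us. This is only a convenience sketch: the names ``API_ROOT`` and ``api_url`` are ours and are not part of the Software Heritage API.

```python
# Base of all API calls, as given in the syntax above.
API_ROOT = "https://archive.softwareheritage.org/api/1/"


def api_url(endpoint):
    """Return the full URL for an endpoint path such as 'stat/counters'."""
    # Strip surrounding slashes so that 'stat/counters' and '/stat/counters/'
    # both produce the same URL, with exactly one trailing slash.
    return API_ROOT + endpoint.strip("/") + "/"


print(api_url("stat/counters"))
print(api_url("origin/search/alambic"))
```

The resulting strings can be passed directly to ``requests.get()``.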
Request basic Information | |||||||||||||||||
------------------------- | |||||||||||||||||
We want to get some basic information about the main server activity and
content. The ``stat`` endpoint provides a summary of the main indexes and
some statistics about the archive. We can issue a GET request for the main
counters of the archive using the ``counters`` path, as described in the
`endpoint
documentation <https://archive.softwareheritage.org/api/1/stat/counters/>`__:

``/api/1/stat/counters/``
This API endpoint returns the following information:

- **content** is the total number of blobs (files) in the archive.
- **directory** is the total number of directories in the archive.
- **origin** is the number of distinct origins (repositories) fetched by
  the archive bots.
- **origin_visits** is the total number of visits across all origins.
- **person** is the number of persons (e.g. committers, authors) in the
  archived files.
- **release** is the number of tags retrieved in the archive.
- **revision** is the number of revisions stored in the archive.
- **skipped_content** is the number of objects which could not be
  imported in the archive.
- **snapshot** is the number of snapshots stored in the archive.
vlorentz (Done): what is `\*`? Shouldn't it be a list?
borisbaldassari (Author, Done): It should indeed!
Note that we use the default JSON format for the output. We could use
YAML if we wanted to, by setting the ``Accept`` request header to
``application/yaml``.
.. code:: ipython3 | |||||||||||||||||
resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/") | |||||||||||||||||
counters = resp.json() | |||||||||||||||||
jprint(counters) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
{ | |||||||||||||||||
"content": 10049535736, | |||||||||||||||||
"directory": 8390591308, | |||||||||||||||||
"origin": 156388918, | |||||||||||||||||
"person": 42263568, | |||||||||||||||||
"release": 17218891, | |||||||||||||||||
"revision": 2109783249 | |||||||||||||||||
} | |||||||||||||||||
There are more than 10 billion blobs (aka files) and more than 8 billion
directories in the archive already, for about 156 million origins analysed.
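Raw counters of this size are hard to read at a glance. As a small illustration, the sketch below rounds them to billions and millions. The ``humanize`` helper is a hypothetical name of ours, and the values are copied from the example response above rather than fetched live.

```python
# Counter values copied from the example response shown above.
counters = {
    "content": 10049535736,
    "directory": 8390591308,
    "origin": 156388918,
    "person": 42263568,
    "release": 17218891,
    "revision": 2109783249,
}


def humanize(n):
    """Round a large count to one decimal in billions, millions or thousands."""
    for threshold, unit in ((10**9, "billion"), (10**6, "million"), (10**3, "thousand")):
        if n >= threshold:
            return f"{n / threshold:.1f} {unit}"
    return str(n)


for name, value in sorted(counters.items()):
    print(f"{name:>10}: {humanize(value)}")
```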
Now, what about a specific repository? Let’s say we want to find if | |||||||||||||||||
`alambic <https://alambic.io>`__ (an open-source data provider and | |||||||||||||||||
analysis system for software development) has already been analysed by | |||||||||||||||||
the archive’s bots. | |||||||||||||||||
Search the archive | |||||||||||||||||
------------------ | |||||||||||||||||
Search for a keyword | |||||||||||||||||
~~~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
The easiest way to look for a keyword in the repositories analysed by
the archive is to use the ``search`` feature of the ``origin`` endpoint.
Documentation for the endpoint is
`here <https://archive.softwareheritage.org/api/1/origin/search/doc/>`__
and the complete syntax is:

::

   /api/1/origin/search/<keyword>/

The server returns an array of JSON objects, each with the following
attributes:

- **origin_visits_url** is a URL that points to the API page
  listing all visits (bot fetches) to this repository.
- **url** is the URL of the origin, or repository, itself.
A (truncated) example of a result from this endpoint is shown below: | |||||||||||||||||
:: | |||||||||||||||||
[ | |||||||||||||||||
{ | |||||||||||||||||
"origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", | |||||||||||||||||
"url": "https://github.com/borisbaldassari/alambic" | |||||||||||||||||
} | |||||||||||||||||
... | |||||||||||||||||
] | |||||||||||||||||
As an example we will look for instances of *alambic* in the archive’s | |||||||||||||||||
analysed repositories: | |||||||||||||||||
.. code:: ipython3

   resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/")
   origins = resp.json()
   print("We found", len(origins), "entries.")
   for origin in origins[1:10]:
       print("- ", origin["url"])
ardumont (Done): nitpicks, we tend to:
- use black to format our base code, might as well show it the same way…
borisbaldassari (Author, Done): Fixed for the f-string syntax, but I'm not sure about the fix for the black formatting. All code blocks have been switched from ipython3 to python, hope it will fix it.
.. parsed-literal:: | |||||||||||||||||
We found 52 entries. | |||||||||||||||||
- https://github.com/royal-alambic-club/sauron | |||||||||||||||||
- https://github.com/scamberlin/alambic | |||||||||||||||||
- https://github.com/WebTales/alambic-connector-mongodb | |||||||||||||||||
- https://github.com/WebTales/alambic | |||||||||||||||||
- https://github.com/AssoAlambic/alambic-website | |||||||||||||||||
- https://bitbucket.org/nayoub/alambic.git | |||||||||||||||||
- https://github.com/Alexandru-Dobre/alambic-connector-rest | |||||||||||||||||
- https://github.com/WebTales/alambic-connector-diffbot | |||||||||||||||||
- https://github.com/WebTales/alambic-connector-firebase | |||||||||||||||||
There are obviously many projects and repositories that embed the word | |||||||||||||||||
alambic, and we will need to be a bit more specific if we are to | |||||||||||||||||
identify the origin actually related to the alambic project. | |||||||||||||||||
If we want to know more about a specific origin, we can simply use the | |||||||||||||||||
``url`` attribute (or any known URL) as an entry for any of the | |||||||||||||||||
``origin`` endpoints. | |||||||||||||||||
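One way to be more specific is to keep only the origins whose repository name is exactly the keyword. The sketch below does this on the client side, on a few URLs taken from the output above. The helpers ``repo_name`` and ``exact_matches`` are hypothetical names of ours, not API features.

```python
def repo_name(url):
    """Return the last path component of an origin URL, without '.git'."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


def exact_matches(origins, project):
    """Keep the origins whose repository name equals `project` (case-insensitive)."""
    return [o for o in origins if repo_name(o["url"]).lower() == project.lower()]


# A few entries copied from the search output above.
sample_origins = [
    {"url": "https://github.com/royal-alambic-club/sauron"},
    {"url": "https://github.com/scamberlin/alambic"},
    {"url": "https://github.com/WebTales/alambic-connector-mongodb"},
    {"url": "https://github.com/WebTales/alambic"},
    {"url": "https://bitbucket.org/nayoub/alambic.git"},
]

for origin in exact_matches(sample_origins, "alambic"):
    print("-", origin["url"])
```

Several unrelated projects can still share the same name, so the owner part of the URL remains the decisive criterion.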
Search for a specific origin | |||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
Now say that we want to query the database for the specific repository | |||||||||||||||||
of Alambic, to know what information has been registered by the archive. | |||||||||||||||||
The API endpoint can be found `in the swh-web | |||||||||||||||||
documentation <https://archive.softwareheritage.org/api/1/origin/doc/>`__, | |||||||||||||||||
and has the following syntax: | |||||||||||||||||
``/api/1/origin/<origin_url>/get/`` | |||||||||||||||||
It returns the same type of JSON object as the ``search`` command
seen previously:

- **origin_visits_url** is a URL that points to the API page
  listing all visits (bot fetches) to this repository.
- **url** is the URL of the origin, or repository, itself.
We know that Alambic is hosted at | |||||||||||||||||
‘https://github.com/borisbaldassari/alambic/’, so the API call will look | |||||||||||||||||
like this: | |||||||||||||||||
``/api/1/origin/https://github.com/borisbaldassari/alambic/get/`` | |||||||||||||||||
.. code:: ipython3 | |||||||||||||||||
resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/") | |||||||||||||||||
found = resp.json() | |||||||||||||||||
jprint(found) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
{ | |||||||||||||||||
"origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", | |||||||||||||||||
"url": "https://github.com/borisbaldassari/alambic" | |||||||||||||||||
} | |||||||||||||||||
Get visits information | |||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
We can use the ``origin_visits_url`` attribute to know more about when | |||||||||||||||||
the repository was analysed by the archive bots. The API endpoint is | |||||||||||||||||
fully documented on the `Software Heritage doc | |||||||||||||||||
site <https://archive.softwareheritage.org/api/1/origin/visits/doc/>`__, | |||||||||||||||||
and has the following syntax: | |||||||||||||||||
``/api/1/origin/<origin_url>/visits/`` | |||||||||||||||||
We will use the same query as before about the main Alambic repository. | |||||||||||||||||
.. code:: ipython3

   resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/")
   found = resp.json()
   length = len(found)
   print("Number of visits found: {}.".format(length))
   print("With dates:")
   for visit in found:
       print("-", visit["visit"], visit["date"])
   print("\nExample of a single visit entry:")
   jprint(found[0])
.. parsed-literal:: | |||||||||||||||||
Number of visits found: 5. | |||||||||||||||||
With dates: | |||||||||||||||||
- 5 2021-01-01T19:35:41.308336+00:00 | |||||||||||||||||
- 4 2020-02-06T10:41:45.700641+00:00 | |||||||||||||||||
- 3 2019-09-01T22:38:12.056537+00:00 | |||||||||||||||||
- 2 2019-06-16T04:52:18.162914+00:00 | |||||||||||||||||
- 1 2019-01-30T07:19:20.799217+00:00 | |||||||||||||||||
Example of a single visit entry: | |||||||||||||||||
{ | |||||||||||||||||
"date": "2021-01-01T19:35:41.308336+00:00", | |||||||||||||||||
"metadata": {}, | |||||||||||||||||
"origin": "https://github.com/borisbaldassari/alambic", | |||||||||||||||||
"origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/", | |||||||||||||||||
"snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", | |||||||||||||||||
"snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/", | |||||||||||||||||
"status": "full", | |||||||||||||||||
"type": "git", | |||||||||||||||||
"visit": 5 | |||||||||||||||||
} | |||||||||||||||||
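The example above relies on the most recent visit being the first item of the list. If we prefer not to depend on the ordering, we can compare the ``date`` fields explicitly. The following sketch uses the visit numbers and dates printed above. ``latest_visit`` and ``days_between_visits`` are hypothetical helper names of ours.

```python
from datetime import datetime

# Visit numbers and dates copied from the output above.
sample_visits = [
    {"visit": 5, "date": "2021-01-01T19:35:41.308336+00:00"},
    {"visit": 4, "date": "2020-02-06T10:41:45.700641+00:00"},
    {"visit": 3, "date": "2019-09-01T22:38:12.056537+00:00"},
    {"visit": 2, "date": "2019-06-16T04:52:18.162914+00:00"},
    {"visit": 1, "date": "2019-01-30T07:19:20.799217+00:00"},
]


def latest_visit(visits):
    """Return the visit with the most recent date, whatever the list order."""
    return max(visits, key=lambda v: datetime.fromisoformat(v["date"]))


def days_between_visits(visits):
    """Return the number of whole days between consecutive visits, oldest first."""
    dates = sorted(datetime.fromisoformat(v["date"]) for v in visits)
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


print("Latest visit:", latest_visit(sample_visits)["visit"])
print("Days between visits:", days_between_visits(sample_visits))
```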
Get the content | |||||||||||||||||
--------------- | |||||||||||||||||
As defined in the beginning, a snapshot is a capture of the repository
at a given time with links to all branches and releases, which in turn
lead to the revisions and associated content. In this example we will
work on the snapshot ID of the most recent visit to Alambic, as returned
by the previous command we executed.
vlorentz (Done): Only to branches and releases. Revisions and content are only linked indirectly
.. code:: ipython3 | |||||||||||||||||
# Store snapshot id | |||||||||||||||||
snapshot = found[0]['snapshot'] | |||||||||||||||||
print("Snapshot is {}.".format(snapshot)) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc. | |||||||||||||||||
Note that the latest visit to the repository can also be directly
retrieved using the `dedicated
endpoint <https://archive.softwareheritage.org/api/1/origin/visit/latest/doc/>`__
``/api/1/origin/<origin_url>/visit/latest/``.
Get the snapshot | |||||||||||||||||
~~~~~~~~~~~~~~~~ | |||||||||||||||||
We now want to retrieve the content of the project at this snapshot. For
that purpose there is the ``snapshot`` endpoint, and its documentation | |||||||||||||||||
is `provided | |||||||||||||||||
here <https://archive.softwareheritage.org/api/1/snapshot/doc/>`__. The | |||||||||||||||||
complete syntax is: | |||||||||||||||||
``/api/1/snapshot/<snapshot_id>/`` | |||||||||||||||||
The snapshot endpoint returns, in the ``branches`` attribute, the list of
branches of the repository. Each branch points to a **revision** (aka
commit in a git context), which itself points to the set of directories
and files in the branch at the time of analysis. Let’s follow this chain
of links, starting with the snapshot’s list of branches:
.. code:: ipython3 | |||||||||||||||||
snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot)) | |||||||||||||||||
snapshotj = snapshotr.json() | |||||||||||||||||
jprint(snapshotj) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
{ | |||||||||||||||||
"branches": { | |||||||||||||||||
"HEAD": { | |||||||||||||||||
"target": "refs/heads/master", | |||||||||||||||||
"target_type": "alias", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/devel": { | |||||||||||||||||
"target": "e298b8c5692b18928013a68e41fd185419515075", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/features/cr152_anonymise_data": { | |||||||||||||||||
"target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/features/cr164_github_project": { | |||||||||||||||||
"target": "0005abb080e4c67a97533ee923e9d28142877752", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/features/cr165_github_its": { | |||||||||||||||||
"target": "0005abb080e4c67a97533ee923e9d28142877752", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/features/cr89_gitlabwizard": { | |||||||||||||||||
"target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/" | |||||||||||||||||
}, | |||||||||||||||||
"refs/heads/master": { | |||||||||||||||||
"target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", | |||||||||||||||||
"target_type": "revision", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" | |||||||||||||||||
} | |||||||||||||||||
}, | |||||||||||||||||
"id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", | |||||||||||||||||
"next_branch": null | |||||||||||||||||
} | |||||||||||||||||
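Note that ``HEAD`` has the ``alias`` target type: its ``target`` is the name of another branch, not a revision ID. A minimal sketch for following such aliases is shown below, using a trimmed copy of the response above. ``resolve_branch`` is a hypothetical helper of ours, not an API call.

```python
def resolve_branch(branches, name, max_hops=10):
    """Follow alias branches until a non-alias branch is reached.

    Returns a (branch_name, branch_entry) tuple.
    """
    for _ in range(max_hops):
        branch = branches[name]
        if branch["target_type"] != "alias":
            return name, branch
        # For an alias, 'target' holds the name of the branch it points to.
        name = branch["target"]
    raise ValueError("too many alias hops; the aliases may form a loop")


# Trimmed copy of the 'branches' attribute from the response above.
sample_branches = {
    "HEAD": {"target": "refs/heads/master", "target_type": "alias"},
    "refs/heads/devel": {
        "target": "e298b8c5692b18928013a68e41fd185419515075",
        "target_type": "revision",
    },
    "refs/heads/master": {
        "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
        "target_type": "revision",
    },
}

name, branch = resolve_branch(sample_branches, "HEAD")
print(name, "->", branch["target"])
```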
Get the root directory | |||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~ | |||||||||||||||||
The revision a branch points to can be retrieved by following the
corresponding link in the ``target_url`` attribute. We will follow the
``refs/heads/master`` branch and get the associated revision object. In
this case (a git repository) the revision is equivalent to a commit,
with an ID and message.
.. code:: ipython3

   master = snapshotj["branches"]["refs/heads/master"]
   # The branch target is the revision ID; snapshotj["id"] is the snapshot ID.
   print("Revision ID is", master["target"])
   master_url = master["target_url"]
   masterr = requests.get(master_url)
   masterj = masterr.json()
   jprint(masterj)

.. parsed-literal::

   Revision ID is 6dd0504b43b4459d52e9f13f71a91cc0fc445a19
{ | |||||||||||||||||
"author": { | |||||||||||||||||
"email": "boris.baldassari@gmail.com", | |||||||||||||||||
"fullname": "Boris Baldassari <boris.baldassari@gmail.com>", | |||||||||||||||||
"name": "Boris Baldassari" | |||||||||||||||||
}, | |||||||||||||||||
"committer": { | |||||||||||||||||
"email": "boris.baldassari@gmail.com", | |||||||||||||||||
"fullname": "Boris Baldassari <boris.baldassari@gmail.com>", | |||||||||||||||||
"name": "Boris Baldassari" | |||||||||||||||||
}, | |||||||||||||||||
"committer_date": "2020-11-01T12:55:13+01:00", | |||||||||||||||||
"date": "2020-11-01T12:55:13+01:00", | |||||||||||||||||
"directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8", | |||||||||||||||||
"directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/", | |||||||||||||||||
"extra_headers": [], | |||||||||||||||||
"history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/", | |||||||||||||||||
"id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", | |||||||||||||||||
"merge": false, | |||||||||||||||||
"message": "#163 Fix dygraphs zero padding in forums plugin.\n", | |||||||||||||||||
"metadata": {}, | |||||||||||||||||
"parents": [ | |||||||||||||||||
{ | |||||||||||||||||
"id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151", | |||||||||||||||||
"url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/" | |||||||||||||||||
} | |||||||||||||||||
], | |||||||||||||||||
"synthetic": false, | |||||||||||||||||
"type": "git", | |||||||||||||||||
"url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" | |||||||||||||||||
} | |||||||||||||||||
The revision points to the root directory of the project. We can
vlorentz (Done): "associated" implies a mutual relationship
list all files and directories at the root by requesting more | |||||||||||||||||
information from the ``directory_url`` attribute. The endpoint is | |||||||||||||||||
documented | |||||||||||||||||
`here <https://archive.softwareheritage.org/api/1/directory/doc/>`__ and | |||||||||||||||||
has the following syntax: | |||||||||||||||||
``/api/1/directory/<directory_id>/`` | |||||||||||||||||
The response is an **array of directory entries**, each pointing to a
file or to a sub-directory. Entries pointing to **files** are represented
like this:
vlorentz (Done): they are not the contents themselves, as "name" and "perms" are part of the directory manifest, not the content
:: | |||||||||||||||||
{ | |||||||||||||||||
"checksums": { | |||||||||||||||||
"sha1": "5973b582bfaeffa71c924e3fe7150620230391d8", | |||||||||||||||||
"sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", | |||||||||||||||||
"sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc" | |||||||||||||||||
}, | |||||||||||||||||
"dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", | |||||||||||||||||
"length": 101, | |||||||||||||||||
"name": ".dockerignore", | |||||||||||||||||
"perms": 33188, | |||||||||||||||||
"status": "visible", | |||||||||||||||||
"target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/", | |||||||||||||||||
"type": "file" | |||||||||||||||||
} | |||||||||||||||||
And entries pointing to **directories** are represented like this:
vlorentz (Done): ditto
:: | |||||||||||||||||
{ | |||||||||||||||||
"dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", | |||||||||||||||||
"length": null, | |||||||||||||||||
"name": "doc", | |||||||||||||||||
"perms": 16384, | |||||||||||||||||
"target": "316468df4988351911992ecbf1866f1c1f575c23", | |||||||||||||||||
"target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/", | |||||||||||||||||
"type": "dir" | |||||||||||||||||
} | |||||||||||||||||
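The ``perms`` values in these entries are ordinary Unix mode bits written in decimal. As an illustration, Python's standard ``stat`` module can decode them. The ``describe`` helper is a hypothetical name of ours, and the two entries are trimmed copies of the examples above.

```python
import stat

# Trimmed copies of the two example entries above.
sample_entries = [
    {"name": ".dockerignore", "perms": 33188, "type": "file"},
    {"name": "doc", "perms": 16384, "type": "dir"},
]


def describe(entry):
    """Summarise a directory entry using the mode bits stored in 'perms'."""
    kind = "directory" if stat.S_ISDIR(entry["perms"]) else "file"
    return f"{entry['name']}: {kind}, mode {oct(entry['perms'])}"


for entry in sample_entries:
    print(describe(entry))
```

Here 33188 is ``0o100644`` (a regular file readable by all and writable by its owner) and 16384 is ``0o40000`` (a directory).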
We will print the list of files and directories located at the root of | |||||||||||||||||
the repository at the time of analysis: | |||||||||||||||||
.. code:: ipython3

   root_url = masterj["directory_url"]
   rootr = requests.get(root_url)
   rootj = rootr.json()
   for f in rootj:
       print("-", f["name"])
vlorentz (Done): Uncomment / remove?
.. parsed-literal:: | |||||||||||||||||
- .dockerignore | |||||||||||||||||
- .env | |||||||||||||||||
- .gitignore | |||||||||||||||||
- CODE_OF_CONDUCT.html | |||||||||||||||||
- CODE_OF_CONDUCT.md | |||||||||||||||||
- LICENCE.html | |||||||||||||||||
- LICENCE.md | |||||||||||||||||
- Readme.md | |||||||||||||||||
- doc | |||||||||||||||||
- docker | |||||||||||||||||
- docker-compose.run.yml | |||||||||||||||||
- docker-compose.test.yml | |||||||||||||||||
- dockercfg.encrypted | |||||||||||||||||
- mojo | |||||||||||||||||
- resources | |||||||||||||||||
We could follow the links up (or down) to the leaves in order to rebuild | |||||||||||||||||
the project structure and download all files individually to rebuild the | |||||||||||||||||
project locally. However the archive can do it for us, and provides a | |||||||||||||||||
feature to download the content of a whole project in one step: | |||||||||||||||||
**cooking**. The feature is described in the `swh-vault | |||||||||||||||||
documentation <https://docs.softwareheritage.org/devel/swh-vault/api.html#cooking-and-status-checking>`__. | |||||||||||||||||
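For the record, the manual approach of following the links down to the leaves could look like the sketch below. The function that fetches a directory listing is passed in as a parameter, so the example runs offline against a small in-memory tree. ``walk``, ``fetch_directory`` and the fake identifiers are all hypothetical names of ours.

```python
def walk(directory_id, fetch_directory, prefix=""):
    """Yield (path, target_id) for every non-directory entry below directory_id.

    `fetch_directory` is any callable that maps a directory ID to the list
    of entries returned by the /api/1/directory/<directory_id>/ endpoint.
    """
    for entry in fetch_directory(directory_id):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            # Recurse into sub-directories, extending the path prefix.
            yield from walk(entry["target"], fetch_directory, path + "/")
        else:
            yield path, entry["target"]


# In-memory stand-in for the API, with made-up identifiers.
FAKE_TREE = {
    "root": [
        {"name": "Readme.md", "type": "file", "target": "blob-1"},
        {"name": "doc", "type": "dir", "target": "doc-dir"},
    ],
    "doc-dir": [
        {"name": "index.md", "type": "file", "target": "blob-2"},
    ],
}

for path, target in walk("root", FAKE_TREE.__getitem__):
    print(path, target)
```

Against the real archive, ``fetch_directory`` could be a function that calls ``requests.get()`` on the directory endpoint and returns ``resp.json()``. That issues one request per directory, which is exactly the cost that cooking avoids.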
vlorentz (Done): Use a reference here too (you'll need to add an anchor in swh-vault's documentation)
Download content of a project | |||||||||||||||||
----------------------------- | |||||||||||||||||
When we ask the Archive to cook a directory for us, it invokes an
asynchronous job to recursively fetch the directories and files of the
project, following the graph up to the leaves (files) and exporting the
result as a tar.gz file. This procedure is handled by the `swh-vault
component <https://docs.softwareheritage.org/devel/swh-vault/getting-started.html>`__,
and it’s all automatic.
Order the meal | |||||||||||||||||
~~~~~~~~~~~~~~ | |||||||||||||||||
A cooking job can be invoked for revisions, directories or snapshots | |||||||||||||||||
(soon). It is initiated with a POST request on the ``vault/<type>/``
endpoint; for a directory, the complete syntax is:
``/api/1/vault/directory/<directory_id>/`` | |||||||||||||||||
The first POST request initiates the cooking, and subsequent GET | |||||||||||||||||
requests can fetch the job result and download the archive. See the | |||||||||||||||||
`Software Heritage | |||||||||||||||||
documentation <https://docs.softwareheritage.org/devel/swh-vault/getting-started.html#example-retrieving-a-directory>`__ | |||||||||||||||||
for more details and useful examples. The API endpoint is documented
`here <https://archive.softwareheritage.org/api/1/vault/directory/doc/>`__.
In this example we will fetch the content of the root directory that we | |||||||||||||||||
previously identified. | |||||||||||||||||
.. code:: ipython3 | |||||||||||||||||
mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") | |||||||||||||||||
mealj = mealr.json() | |||||||||||||||||
jprint(mealj) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
{ | |||||||||||||||||
"fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", | |||||||||||||||||
"id": 379321799, | |||||||||||||||||
"obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", | |||||||||||||||||
"obj_type": "directory", | |||||||||||||||||
"progress_message": null, | |||||||||||||||||
"status": "done" | |||||||||||||||||
} | |||||||||||||||||
Ask if it’s ready | |||||||||||||||||
~~~~~~~~~~~~~~~~~ | |||||||||||||||||
We can use a GET request on the same URL to get information about the | |||||||||||||||||
process status: | |||||||||||||||||
.. code:: ipython3 | |||||||||||||||||
statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") | |||||||||||||||||
statusj = statusr.json() | |||||||||||||||||
jprint(statusj) | |||||||||||||||||
.. parsed-literal:: | |||||||||||||||||
{ | |||||||||||||||||
"fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", | |||||||||||||||||
"id": 379321799, | |||||||||||||||||
"obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", | |||||||||||||||||
"obj_type": "directory", | |||||||||||||||||
"progress_message": null, | |||||||||||||||||
"status": "done" | |||||||||||||||||
} | |||||||||||||||||
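When the job is not finished yet, the status request has to be repeated until it is. A minimal sketch of such a polling loop is shown below. It relies only on the ``status``, ``fetch_url`` and ``progress_message`` fields shown above. The ``pending`` and ``failed`` status names are assumptions, and ``wait_for_cooking`` and ``get_status`` are hypothetical names introduced for illustration. The demonstration uses canned responses and a placeholder URL instead of calling the archive.

```python
import time


def wait_for_cooking(get_status, interval=5.0, max_attempts=60):
    """Poll a cooking job until it reports "done" and return its fetch_url."""
    for _ in range(max_attempts):
        job = get_status()
        if job["status"] == "done":
            return job["fetch_url"]
        if job["status"] == "failed":  # assumed name of the failure status
            raise RuntimeError(job.get("progress_message") or "cooking failed")
        # Any other status is treated as "still cooking": wait and retry.
        time.sleep(interval)
    raise TimeoutError("cooking did not finish in time")


# Offline demonstration with canned responses ("pending" is an assumed status).
responses = iter([
    {"status": "pending", "fetch_url": "https://example.org/raw/", "progress_message": None},
    {"status": "done", "fetch_url": "https://example.org/raw/", "progress_message": None},
])
fetch_url = wait_for_cooking(lambda: next(responses), interval=0)
print(fetch_url)
```

With the real API, ``get_status`` could be ``lambda: requests.get(status_url).json()``, where ``status_url`` is the vault URL used in the request above.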
Get the plate | |||||||||||||||||
~~~~~~~~~~~~~ | |||||||||||||||||
Once the processing is finished (it can take up to a few minutes), the
tar.gz archive can be downloaded through the ``fetch_url`` link and
extracted with ``tar``:
:: | |||||||||||||||||
boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz | |||||||||||||||||
% Total % Received % Xferd Average Speed Time Time Time Current | |||||||||||||||||
Dload Upload Total Spent Left Speed | |||||||||||||||||
100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k | |||||||||||||||||
boris@castalia:downloads$ ls | |||||||||||||||||
myarchive.tar.gz | |||||||||||||||||
boris@castalia:downloads$ tar xzvf myarchive.tar.gz
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/ | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/ | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md | |||||||||||||||||
3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config | |||||||||||||||||
[SNIP] | |||||||||||||||||
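The extraction step can also be done from Python, which is convenient when the rest of the workflow already lives in a notebook. The sketch below uses the standard ``tarfile`` module. ``extract_archive`` is a hypothetical helper name, and the demonstration builds a tiny stand-in archive locally rather than downloading the real one.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, destination):
    """Extract a tar.gz archive and return the sorted names of its members."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())
        archive.extractall(destination)
    return names


# Offline demonstration: create a small archive, then extract it.
with tempfile.TemporaryDirectory() as workdir:
    archive_path = Path(workdir) / "myarchive.tar.gz"
    payload = b"# demo\n"
    with tarfile.open(archive_path, "w:gz") as archive:
        info = tarfile.TarInfo("demo-directory/Readme.md")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    extracted = extract_archive(archive_path, Path(workdir) / "out")
    readme_text = (Path(workdir) / "out" / "demo-directory" / "Readme.md").read_text()

print(extracted)
```

For the archive downloaded above, the call would be ``extract_archive("myarchive.tar.gz", ".")``.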
Conclusion | |||||||||||||||||
---------- | |||||||||||||||||
In this article, we learned **how to explore and use the Software | |||||||||||||||||
Heritage archive using its API**: searching for a repository, | |||||||||||||||||
identifying projects and downloading specific snapshots of a repository. | |||||||||||||||||
There is a lot more to the Archive and its API than what we have seen, | |||||||||||||||||
and all features are generously documented on the `Software Heritage web | |||||||||||||||||
site <https://archive.softwareheritage.org/api/>`__. | |||||||||||||||||