D5784.id20681.diff

	diff --git a/docs/getting-started/getting_started_with_the_swh_api.rst b/docs/getting-started/getting_started_with_the_swh_api.rst
	new file mode 100644
	--- /dev/null
	+++ b/docs/getting-started/getting_started_with_the_swh_api.rst
	@@ -0,0 +1,672 @@
	+Getting Started with the Software Heritage API
	+==============================================
	+
	+Introduction
	+------------
	+
	+About Software Heritage
	+~~~~~~~~~~~~~~~~~~~~~~~
	+
	+The `Software Heritage project <https://www.softwareheritage.org>`__ was
	+started in 2015 with a rather impressive goal and purpose:
	+
	+ Software Heritage is an ambitious initiative that aims at collecting,
	+ organizing, preserving and sharing all the source code publicly
	+ available in the world.
	+
	+Yes, you read that right: all the source code available in the world. This
	+implies building an equally impressive infrastructure to hold the huge
	+amount of information involved, making the archive available to the public
	+through a `nice web interface <https://archive.softwareheritage.org/>`__
	+and even providing a `well-documented
	+API <https://docs.softwareheritage.org/devel/swh-web/>`__ to access it
	+seamlessly. For the record, there are also `various datasets
	+available <https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html>`__
	+for download, with detailed instructions on how to set them up. And
	+yes, it is huge: the full graph generated from the archive (metadata
	+only, content is not included) has more than 20 billion nodes and weighs
	+1.2 TB. The overall size of the archive is in the hundreds of terabytes.
	+
	+This article presents, and demonstrates the use of, the `Software
	+Heritage API <https://archive.softwareheritage.org/api/1/>`__ to query
	+basic information about archived content and fetch the content of a
	+software project.
	+
	+Terms and Concepts
	+~~~~~~~~~~~~~~~~~~
	+
	+For our activity we need to define the following terms and concepts:
	+
	+- The repositories analysed by Software Heritage are registered as
	+  origins. Examples of origins are
	+  https://bitbucket.org/anthroweb/apache.git and
	+  https://github.com/apache/ant, as well as other types of sources (Debian
	+  source packages, npmjs, PyPI, CRAN, etc.).
	+- Analysing a repository creates a snapshot. A snapshot describes the
	+  state of the repository at the time of analysis and provides links to
	+  its content. For a git repository, for example, the snapshot links to
	+  the list of branches, which themselves link to revisions and content.
	+- Revisions are consistent sets of directories and files
	+  representing the repository at a given time, like a baseline. They
	+  can be conceptually mapped to commits in Subversion, to git
	+  references, or to source package versions in Debian or npmjs
	+  repositories.
	+- Each revision is linked to a directory, which itself links to other
	+  directories and files (aka blobs).
	+
	+A full list of terms is provided in the `Software Heritage
	+doc <https://wiki.softwareheritage.org/index.php?title=Glossary>`__.
	+
	+Preliminary steps
	+-----------------
	+
	+System requirements
	+~~~~~~~~~~~~~~~~~~~
	+
	+This article uses Python 3.x on the client side, and the ``requests``
	+Python module to send the HTTP requests. Note however that any
	+language that can issue HTTP requests (GET, POST) can access the API and
	+could be used instead. First, let’s make sure we have the correct Python
	+version and module installed:
	+
	+::
	+
	+ (gs_env) boris@castalia:gs$ python -V
	+ Python 3.7.3
	+ (gs_env) boris@castalia:notebooks$ pip install requests
	+ Requirement already satisfied: requests in ./gs_env/lib/python3.7/site-packages (2.25.1)
	+ Requirement already satisfied: certifi>=2017.4.17 in ./gs_env/lib/python3.7/site-packages (from requests) (2020.12.5)
	+ Requirement already satisfied: chardet<5,>=3.0.2 in ./gs_env/lib/python3.7/site-packages (from requests) (4.0.0)
	+ Requirement already satisfied: idna<3,>=2.5 in ./gs_env/lib/python3.7/site-packages (from requests) (2.10)
	+ Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./gs_env/lib/python3.7/site-packages (from requests) (1.26.4)
	+ (gs_env) boris@castalia:gs$
	+
	+Initialise the script
	+---------------------
	+
	+We need to import a few modules to work with the Software
	+Heritage API, namely ``json`` and the aforementioned ``requests``.
	+We also define a utility function to pretty-print JSON data
	+easily:
	+
	+.. code:: ipython3
	+
	+ import json
	+ import requests
	+
	+ # Utility to pretty-print json.
	+ def jprint(obj):
	+     # create a formatted string of the Python JSON object
	+     print(json.dumps(obj, sort_keys=True, indent=4))
	+
	+
	+The syntax described in the `API
	+documentation <https://archive.softwareheritage.org/api/1/>`__ is rather
	+straightforward. Since we want to query the main Software
	+Heritage server, we will use ``https://archive.softwareheritage.org/``
	+as the base URL. All API calls are built according to the same
	+syntax:
	+
	+::
	+
	+ https://archive.softwareheritage.org/api/1/<end/point>
	+
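The URL pattern above can be wrapped in a small helper so that later calls only need to name the endpoint. This is a minimal sketch: ``API_BASE`` and ``api_url`` are names chosen for this illustration and are not part of the Software Heritage API or of ``requests``.

```python
# Base URL taken from the article; the helper name is hypothetical.
API_BASE = "https://archive.softwareheritage.org/api/1/"


def api_url(*parts):
    """Build an API URL from path parts, ending with exactly one slash."""
    path = "/".join(str(part).strip("/") for part in parts if part)
    return API_BASE + path + "/"


print(api_url("stat", "counters"))
print(api_url("origin", "https://github.com/borisbaldassari/alambic", "get"))
```

The resulting strings can be passed directly to ``requests.get``. Stripping slashes from each part avoids doubled or missing separators when an origin URL is embedded in the path.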
	+Request basic information
	+-------------------------
	+
	+We want to get some basic information about the main server activity and
	+content. The ``stat`` endpoint provides a summary of the main indexes and
	+some statistics about the archive. We can send a GET request for the main
	+counters of the archive using the ``counters`` path, as described in the
	+`endpoint
	+documentation <https://archive.softwareheritage.org/api/1/stat/counters/>`__:
	+
	+``/api/1/stat/counters/``
	+
	+This API endpoint returns the following information:
	+
	+- ``content`` is the total number of blobs (files) in the archive.
	+- ``directory`` is the total number of directories in the archive.
	+- ``origin`` is the number of distinct origins (repositories) fetched by
	+  the archive bots.
	+- ``origin_visits`` is the total number of visits across all origins.
	+- ``person`` is the number of persons (e.g. committers, authors) in the
	+  archived files.
	+- ``release`` is the number of tags retrieved in the archive.
	+- ``revision`` is the number of revisions stored in the archive.
	+- ``skipped_content`` is the number of objects which could not be
	+  imported in the archive.
	+- ``snapshot`` is the number of snapshots stored in the archive.
	+
	+Note that we use the default JSON format for the output. We could
	+request YAML instead by setting the ``Accept`` request header to
	+``application/yaml``.
	+
	+.. code:: ipython3
	+
	+ resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/")
	+ counters = resp.json()
	+ jprint(counters)
	+
	+
	+.. parsed-literal::
	+
	+ {
	+ "content": 10049535736,
	+ "directory": 8390591308,
	+ "origin": 156388918,
	+ "person": 42263568,
	+ "release": 17218891,
	+ "revision": 2109783249
	+ }
	+
	+
	+There are already more than 10 billion blobs (aka files) and more than
	+8 billion directories in the archive, for about 156 million origins analysed.
	+
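Raw counters of this size are hard to read at a glance. The sketch below scales the values from the sample response to billions or millions; ``humanize`` is a hypothetical helper written for this illustration, and the numbers are simply copied from the output shown above.

```python
# Counter values copied from the sample response shown above.
counters = {
    "content": 10049535736,
    "directory": 8390591308,
    "origin": 156388918,
    "person": 42263568,
    "release": 17218891,
    "revision": 2109783249,
}


def humanize(count):
    """Return a count scaled to billions or millions, with two decimals."""
    if count >= 1_000_000_000:
        return "{:.2f} bn".format(count / 1_000_000_000)
    if count >= 1_000_000:
        return "{:.2f} m".format(count / 1_000_000)
    return str(count)


for name, value in sorted(counters.items()):
    print("{:<10} {}".format(name, humanize(value)))
```

With a live response, the ``counters`` dictionary would simply be the result of ``resp.json()``.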
	+Now, what about a specific repository? Let’s say we want to find out whether
	+`alambic <https://alambic.io>`__ (an open-source data provider and
	+analysis system for software development) has already been analysed by
	+the archive’s bots.
	+
	+Search the archive
	+------------------
	+
	+Search for a keyword
	+~~~~~~~~~~~~~~~~~~~~
	+
	+The easiest way to look for a keyword in the repositories analysed by
	+the archive is to use the ``search`` feature of the ``origin`` endpoint.
	+Documentation for the endpoint is
	+`here <https://archive.softwareheritage.org/api/1/origin/search/doc/>`__
	+and the complete syntax is:
	+
	+::
	+
	+ /api/1/origin/search/<keyword>/
	+
	+The server returns an array of objects, each with the following
	+attributes:
	+
	+- ``origin_visits_url`` is a URL that points to the API page
	+  listing all visits (bot fetches) to this repository.
	+- ``url`` is the URL of the origin, or repository, itself.
	+
	+A (truncated) example of a result from this endpoint is shown below:
	+
	+::
	+
	+ [
	+ {
	+ "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
	+ "url": "https://github.com/borisbaldassari/alambic"
	+ }
	+ ...
	+ ]
	+
	+As an example we will look for instances of alambic in the archive’s
	+analysed repositories:
	+
	+.. code:: ipython3
	+
	+ resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/")
	+ origins = resp.json()
	+ print("We found",len(origins),"entries.")
	+ # Note: the slice [1:10] skips the first entry (index 0).
	+ for origin in origins[1:10]:
	+     print('- ',origin['url'])
	+
	+
	+.. parsed-literal::
	+
	+ We found 52 entries.
	+ - https://github.com/royal-alambic-club/sauron
	+ - https://github.com/scamberlin/alambic
	+ - https://github.com/WebTales/alambic-connector-mongodb
	+ - https://github.com/WebTales/alambic
	+ - https://github.com/AssoAlambic/alambic-website
	+ - https://bitbucket.org/nayoub/alambic.git
	+ - https://github.com/Alexandru-Dobre/alambic-connector-rest
	+ - https://github.com/WebTales/alambic-connector-diffbot
	+ - https://github.com/WebTales/alambic-connector-firebase
	+
	+
	+There are obviously many projects and repositories that embed the word
	+alambic, and we will need to be a bit more specific if we are to
	+identify the origin actually related to the alambic project.
	+
	+If we want to know more about a specific origin, we can simply use the
	+``url`` attribute (or any known URL) as an entry for any of the
	+``origin`` endpoints.
	+
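One way to be more specific, sketched here under the assumption that the project name is the last path segment of the origin URL, is to filter the search results locally. ``repo_name`` and ``exact_matches`` are hypothetical helpers, and the sample list is a subset of the output above.

```python
from urllib.parse import urlparse

# A few origins copied from the search output above.
origins = [
    {"url": "https://github.com/royal-alambic-club/sauron"},
    {"url": "https://github.com/scamberlin/alambic"},
    {"url": "https://github.com/WebTales/alambic-connector-mongodb"},
    {"url": "https://bitbucket.org/nayoub/alambic.git"},
]


def repo_name(url):
    """Return the last path segment of an origin URL, without a .git suffix."""
    name = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


def exact_matches(origins, name):
    """Keep only the origins whose repository name equals `name`."""
    return [o["url"] for o in origins if repo_name(o["url"]) == name]


for url in exact_matches(origins, "alambic"):
    print("-", url)
```

This narrows the list to repositories named exactly ``alambic``, but several owners can still share that name, so the owner part of the URL remains the deciding factor.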
	+Search for a specific origin
	+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	+
	+Now say that we want to query the database for the specific repository
	+of Alambic, to know what information has been registered by the archive.
	+The API endpoint can be found `in the swh-web
	+documentation <https://archive.softwareheritage.org/api/1/origin/doc/>`__,
	+and has the following syntax:
	+
	+``/api/1/origin/<origin_url>/get/``
	+
	+This returns the same type of JSON object as the ``search`` endpoint
	+seen previously:
	+
	+- ``origin_visits_url`` is a URL that points to the API page
	+  listing all visits (bot fetches) to this repository.
	+- ``url`` is the URL of the origin, or repository, itself.
	+
	+We know that Alambic is hosted at
	+``https://github.com/borisbaldassari/alambic/``, so the API call will look
	+like this:
	+
	+``/api/1/origin/https://github.com/borisbaldassari/alambic/get/``
	+
	+.. code:: ipython3
	+
	+ resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/")
	+ found = resp.json()
	+ jprint(found)
	+
	+
	+.. parsed-literal::
	+
	+ {
	+ "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
	+ "url": "https://github.com/borisbaldassari/alambic"
	+ }
	+
	+
	+Get visits information
	+~~~~~~~~~~~~~~~~~~~~~~
	+
	+We can use the ``origin_visits_url`` attribute to know more about when
	+the repository was analysed by the archive bots. The API endpoint is
	+fully documented on the `Software Heritage doc
	+site <https://archive.softwareheritage.org/api/1/origin/visits/doc/>`__,
	+and has the following syntax:
	+
	+``/api/1/origin/<origin_url>/visits/``
	+
	+We will use the same query as before about the main Alambic repository.
	+
	+.. code:: ipython3
	+
	+ resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/")
	+ found = resp.json()
	+ length = len(found)
	+ print("Number of visits found: {}.".format(length))
	+ print("With dates:")
	+ for visit in found:
	+     print("-",visit['visit'],visit['date'])
	+ print("\nExample of a single visit entry:")
	+ jprint(found[0])
	+
	+
	+.. parsed-literal::
	+
	+ Number of visits found: 5.
	+ With dates:
	+ - 5 2021-01-01T19:35:41.308336+00:00
	+ - 4 2020-02-06T10:41:45.700641+00:00
	+ - 3 2019-09-01T22:38:12.056537+00:00
	+ - 2 2019-06-16T04:52:18.162914+00:00
	+ - 1 2019-01-30T07:19:20.799217+00:00
	+
	+ Example of a single visit entry:
	+ {
	+ "date": "2021-01-01T19:35:41.308336+00:00",
	+ "metadata": {},
	+ "origin": "https://github.com/borisbaldassari/alambic",
	+ "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/",
	+ "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
	+ "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/",
	+ "status": "full",
	+ "type": "git",
	+ "visit": 5
	+ }
	+
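The example above relies on the first list entry being the most recent visit. A slightly more defensive sketch picks the most recent visit by date and ignores incomplete ones. ``latest_full_visit`` is a hypothetical helper, and the ``partial`` status used in the sample data is an assumption made for illustration; only ``full`` appears in the output above.

```python
from datetime import datetime

# Visit 5 is copied from the output above; the "partial" entry is a made-up
# example added only to exercise the status filter.
visits = [
    {"visit": 6, "date": "2021-02-01T00:00:00+00:00", "status": "partial",
     "snapshot": None},
    {"visit": 5, "date": "2021-01-01T19:35:41.308336+00:00", "status": "full",
     "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc"},
]


def latest_full_visit(visits):
    """Return the most recent visit whose status is "full", or None."""
    full = [v for v in visits if v["status"] == "full"]
    if not full:
        return None
    return max(full, key=lambda v: datetime.fromisoformat(v["date"]))


best = latest_full_visit(visits)
print(best["visit"], best["snapshot"])
```

Parsing the dates rather than trusting list order keeps the selection correct even if the ordering of the response were to change.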
	+
	+Get the content
	+---------------
	+
	+As defined at the beginning, a snapshot is a capture of the repository
	+at a given time with links to all branches, commits and associated
	+content. In this example we will work with the snapshot ID of the latest
	+visit to Alambic, as returned by the previous request.
	+
	+.. code:: ipython3
	+
	+ # Store snapshot id
	+ snapshot = found[0]['snapshot']
	+ print("Snapshot is {}.".format(snapshot))
	+
	+
	+.. parsed-literal::
	+
	+ Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.
	+
	+
	+Note that the latest visit to the repository can also be retrieved
	+directly using the `dedicated
	+endpoint <https://archive.softwareheritage.org/api/1/origin/visit/latest/doc/>`__
	+``/api/1/origin/<origin_url>/visit/latest/``.
	+
	+Get the snapshot
	+~~~~~~~~~~~~~~~~
	+
	+We now want to retrieve the content of the project at this snapshot. For
	+that purpose there is the ``snapshot`` endpoint, whose documentation
	+is `provided
	+here <https://archive.softwareheritage.org/api/1/snapshot/doc/>`__. The
	+complete syntax is:
	+
	+``/api/1/snapshot/<snapshot_id>/``
	+
	+The snapshot endpoint returns in the ``branches`` attribute a list of
	+revisions (aka commits or branch refs in a git context), which
	+themselves point to the set of directories and files in the branch at
	+the time of analysis. Let’s follow this chain of links, starting with
	+the snapshot’s list of revisions (branches):
	+
	+.. code:: ipython3
	+
	+ snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot))
	+ snapshotj = snapshotr.json()
	+ jprint(snapshotj)
	+
	+
	+.. parsed-literal::
	+
	+ {
	+ "branches": {
	+ "HEAD": {
	+ "target": "refs/heads/master",
	+ "target_type": "alias",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
	+ },
	+ "refs/heads/devel": {
	+ "target": "e298b8c5692b18928013a68e41fd185419515075",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/"
	+ },
	+ "refs/heads/features/cr152_anonymise_data": {
	+ "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/"
	+ },
	+ "refs/heads/features/cr164_github_project": {
	+ "target": "0005abb080e4c67a97533ee923e9d28142877752",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
	+ },
	+ "refs/heads/features/cr165_github_its": {
	+ "target": "0005abb080e4c67a97533ee923e9d28142877752",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
	+ },
	+ "refs/heads/features/cr89_gitlabwizard": {
	+ "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/"
	+ },
	+ "refs/heads/master": {
	+ "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
	+ "target_type": "revision",
	+ "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
	+ }
	+ },
	+ "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
	+ "next_branch": null
	+ }
	+
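In the response above, ``HEAD`` is an alias whose ``target`` names another branch rather than a revision. The sketch below follows such aliases until it reaches a concrete target. ``resolve_branch`` and its ``max_hops`` guard are hypothetical, and the sample data is a trimmed copy of the response.

```python
# Trimmed copy of the "branches" attribute from the response above.
branches = {
    "HEAD": {"target": "refs/heads/master", "target_type": "alias"},
    "refs/heads/master": {
        "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
        "target_type": "revision",
    },
}


def resolve_branch(branches, name, max_hops=10):
    """Follow alias entries until a non-alias target is reached."""
    for _ in range(max_hops):
        branch = branches[name]
        if branch["target_type"] != "alias":
            return branch
        name = branch["target"]
    raise ValueError("too many alias hops, possible alias loop")


head = resolve_branch(branches, "HEAD")
print(head["target_type"], head["target"])
```

Resolving ``HEAD`` this way avoids hard-coding ``refs/heads/master``, which is useful for repositories whose default branch has another name.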
	+
	+Get the root directory
	+~~~~~~~~~~~~~~~~~~~~~~
	+
	+The revision associated with a branch can be retrieved by following the
	+corresponding link in the ``target_url`` attribute. We will follow the
	+``refs/heads/master`` branch and get the associated revision object. In
	+this case (a git repository) the revision is equivalent to a branch ref
	+or commit, with an ID and a message.
	+
	+.. code:: ipython3
	+
	+ master = snapshotj['branches']['refs/heads/master']
	+ print('Revision ID is',master['target'])
	+ master_url = master['target_url']
	+ masterr = requests.get(master_url)
	+ masterj = masterr.json()
	+ jprint(masterj)
	+
	+
	+.. parsed-literal::
	+
	+ Revision ID is 6dd0504b43b4459d52e9f13f71a91cc0fc445a19
	+ {
	+ "author": {
	+ "email": "boris.baldassari@gmail.com",
	+ "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
	+ "name": "Boris Baldassari"
	+ },
	+ "committer": {
	+ "email": "boris.baldassari@gmail.com",
	+ "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
	+ "name": "Boris Baldassari"
	+ },
	+ "committer_date": "2020-11-01T12:55:13+01:00",
	+ "date": "2020-11-01T12:55:13+01:00",
	+ "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8",
	+ "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/",
	+ "extra_headers": [],
	+ "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/",
	+ "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
	+ "merge": false,
	+ "message": "#163 Fix dygraphs zero padding in forums plugin.\n",
	+ "metadata": {},
	+ "parents": [
	+ {
	+ "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151",
	+ "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/"
	+ }
	+ ],
	+ "synthetic": false,
	+ "type": "git",
	+ "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
	+ }
	+
	+
	+The revision is associated with the root directory of the project. We can
	+list all files and directories at the root by querying the URL in the
	+``directory_url`` attribute. The endpoint is
	+documented
	+`here <https://archive.softwareheritage.org/api/1/directory/doc/>`__ and
	+has the following syntax:
	+
	+``/api/1/directory/<directory_id>/``
	+
	+The structure of the response is an array of files and directories.
	+Files are represented like this:
	+
	+::
	+
	+ {
	+ "checksums": {
	+ "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8",
	+ "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
	+ "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc"
	+ },
	+ "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
	+ "length": 101,
	+ "name": ".dockerignore",
	+ "perms": 33188,
	+ "status": "visible",
	+ "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
	+ "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/",
	+ "type": "file"
	+ }
	+
	+And directories are represented with:
	+
	+::
	+
	+ {
	+ "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
	+ "length": null,
	+ "name": "doc",
	+ "perms": 16384,
	+ "target": "316468df4988351911992ecbf1866f1c1f575c23",
	+ "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/",
	+ "type": "dir"
	+ }
	+
	+We will print the list of files and directories located at the root of
	+the repository at the time of analysis:
	+
	+.. code:: ipython3
	+
	+ root_url = masterj['directory_url']
	+ rootr = requests.get(root_url)
	+ rootj = rootr.json()
	+ for f in rootj:
	+     print('-',f['name'])
	+ #jprint(rootj)
	+
	+
	+.. parsed-literal::
	+
	+ - .dockerignore
	+ - .env
	+ - .gitignore
	+ - CODE_OF_CONDUCT.html
	+ - CODE_OF_CONDUCT.md
	+ - LICENCE.html
	+ - LICENCE.md
	+ - Readme.md
	+ - doc
	+ - docker
	+ - docker-compose.run.yml
	+ - docker-compose.test.yml
	+ - dockercfg.encrypted
	+ - mojo
	+ - resources
	+
	+
	+We could follow the links down to the leaves, download all files
	+individually and rebuild the project structure locally.
	+However, the archive can do it for us, and provides a
	+feature to download the content of a whole project in one step:
	+cooking. The feature is described in the `swh-vault
	+documentation <https://docs.softwareheritage.org/devel/swh-vault/api.html#cooking-and-status-checking>`__.
	+
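For reference, following the links by hand could look roughly like the sketch below. It only uses the ``name``, ``type`` and ``target`` attributes shown in the directory entries above. ``walk`` is a hypothetical helper, and the in-memory tree with its ids is made up so that the traversal can be run without network access.

```python
def walk(directory_id, fetch_directory, prefix=""):
    """Yield (path, content_id) for every file below a directory.

    `fetch_directory` is any callable that returns the list of entries for a
    directory id, for instance a wrapper around a GET request on
    /api/1/directory/<directory_id>/.
    """
    for entry in fetch_directory(directory_id):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            # Recurse into sub-directories, extending the path prefix.
            yield from walk(entry["target"], fetch_directory, path + "/")
        elif entry["type"] == "file":
            yield path, entry["target"]


# Tiny in-memory stand-in for the API, with made-up ids, to show the traversal.
fake_tree = {
    "root": [
        {"name": "Readme.md", "type": "file", "target": "blob-1"},
        {"name": "doc", "type": "dir", "target": "doc-dir"},
    ],
    "doc-dir": [{"name": "Readme.md", "type": "file", "target": "blob-2"}],
}

for path, content_id in walk("root", fake_tree.__getitem__):
    print(path, content_id)
```

Against the real API this would issue one request per directory, which is slow for large projects and is the main reason to prefer the one-step download described in the article.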
	+Download content of a project
	+-----------------------------
	+
	+When we ask the archive to cook a directory for us, it invokes an
	+asynchronous job to recursively fetch the directories and files of the
	+project, following the graph down to the leaves (files) and exporting the
	+result as a tar.gz file. This procedure is handled by the `swh-vault
	+component <https://docs.softwareheritage.org/devel/swh-vault/getting-started.html>`__,
	+and it’s all automatic.
	+
	+Order the meal
	+~~~~~~~~~~~~~~
	+
	+A cooking job can be invoked for revisions, directories or snapshots
	+(soon). It is initiated with a POST request on the ``vault/<type>/``
	+endpoint, and its complete syntax is:
	+
	+``/api/1/vault/directory/<directory_id>/``
	+
	+The first POST request initiates the cooking, and subsequent GET
	+requests can fetch the job result and download the archive. See the
	+`Software Heritage
	+documentation <https://docs.softwareheritage.org/devel/swh-vault/getting-started.html#example-retrieving-a-directory>`__
	+on this, with useful examples. The API endpoint is documented
	+`here <https://archive.softwareheritage.org/api/1/vault/directory/doc/>`__.
	+
	+In this example we will fetch the content of the root directory that we
	+previously identified.
	+
	+.. code:: ipython3
	+
	+ mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
	+ mealj = mealr.json()
	+ jprint(mealj)
	+
	+
	+.. parsed-literal::
	+
	+ {
	+ "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
	+ "id": 379321799,
	+ "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
	+ "obj_type": "directory",
	+ "progress_message": null,
	+ "status": "done"
	+ }
	+
	+
	+Ask if it’s ready
	+~~~~~~~~~~~~~~~~~
	+
	+We can use a GET request on the same URL to get information about the
	+process status:
	+
	+.. code:: ipython3
	+
	+ statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
	+ statusj = statusr.json()
	+ jprint(statusj)
	+
	+
	+.. parsed-literal::
	+
	+ {
	+ "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
	+ "id": 379321799,
	+ "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
	+ "obj_type": "directory",
	+ "progress_message": null,
	+ "status": "done"
	+ }
	+
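Since cooking is asynchronous, the status request usually has to be repeated until the job completes. The sketch below shows one possible polling loop. ``wait_for_cooking`` is a hypothetical helper; ``done`` is the status value shown above, while ``pending`` and ``failed`` are assumed values used here for illustration. The status source is passed in as a callable so the loop can be exercised without network access.

```python
import time


def wait_for_cooking(get_status, interval=5.0, max_tries=60, sleep=time.sleep):
    """Poll `get_status` until the job is done; return the final status dict.

    `get_status` is any callable returning the decoded JSON status.
    """
    for _ in range(max_tries):
        status = get_status()
        if status["status"] == "done":
            return status
        if status["status"] == "failed":  # assumed failure value
            raise RuntimeError("cooking failed: {}".format(status))
        sleep(interval)
    raise TimeoutError("cooking did not finish in time")


# Simulated sequence of responses; "pending" is an assumed intermediate value.
responses = iter([{"status": "pending"}, {"status": "done", "fetch_url": "..."}])
final = wait_for_cooking(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"])
```

In real use, ``get_status`` could be ``lambda: requests.get(url).json()`` with the vault URL from the example above, and the default ``time.sleep`` would space out the requests.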
	+
	+Get the plate
	+~~~~~~~~~~~~~
	+
	+Once the processing is finished (it can take up to a few minutes) the
	+tar.gz archive can be downloaded through the ``fetch_url`` link and
	+extracted with ``tar``:
	+
	+::
	+
	+ boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz
	+ % Total % Received % Xferd Average Speed Time Time Time Current
	+ Dload Upload Total Spent Left Speed
	+ 100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k
	+ boris@castalia:downloads$ ls
	+ myarchive.tar.gz
	+ boris@castalia:downloads$ tar xzvf myarchive.tar.gz
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md
	+ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config
	+ [SNIP]
	+
	+Conclusion
	+----------
	+
	+In this article, we learned **how to explore and use the Software
	+Heritage archive using its API**: searching for a repository,
	+identifying projects and downloading specific snapshots of a repository.
	+There is a lot more to the Archive and its API than what we have seen,
	+and all features are generously documented on the `Software Heritage web
	+site <https://archive.softwareheritage.org/api/>`__.
	+
	+
	+
	diff --git a/docs/getting-started/index.rst b/docs/getting-started/index.rst
	--- a/docs/getting-started/index.rst
	+++ b/docs/getting-started/index.rst
	@@ -11,3 +11,5 @@
	../getting-started
	../developer-setup
	using-docker
	+ getting_started_with_the_swh_api
	+

File Metadata

Mime Type: text/plain
Expires: Nov 5 2024, 2:57 PM (34 w, 4 d ago)
Storage Engine: blob
Storage Format: Raw Data
Storage Handle: 3231312
