diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
index d9000ee..02380ce 100644
--- a/docs/architecture/overview.rst
+++ b/docs/architecture/overview.rst
@@ -1,275 +1,274 @@
.. _architecture-overview:

Software Architecture Overview
==============================

From an end-user point of view, the |swh| platform consists of the :term:`archive`,
which can be accessed using the web interface or its REST API.

Behind the scenes (and the web app) are several components/services that expose
different aspects of the |swh| :term:`archive` as internal RPC APIs. Each of these
internal APIs has a dedicated database, usually PostgreSQL_.

A global (and incomplete) view of this architecture looks like:

.. thumbnail:: ../images/general-architecture.svg

   General view of the |swh| architecture.

.. _architecture-tier-1:

Core components
---------------

The following components are the foundation of the entire |swh| architecture, as they
fetch data, store it, and make it available to every other service.

Data storage
^^^^^^^^^^^^

The :ref:`Storage ` provides an API to store and retrieve elements of the :ref:`graph
`, such as directory structure, revision history, and their respective metadata. It
relies on the :ref:`Object Storage ` service to store the content of source code files
themselves.

Both the Storage and Object Storage are designed as abstractions over possible
backends. The former supports both PostgreSQL (the current solution in production) and
Cassandra (a more scalable option we are exploring). The latter supports a large
variety of "cloud" object storages as backends, as well as a simple local filesystem.

Task management
^^^^^^^^^^^^^^^

The :ref:`Scheduler ` manages the entire choreography of jobs/tasks in |swh|, from
detecting and ingesting repositories, to extracting metadata from them, to repackaging
repositories into small downloadable archives.
It does this by managing its own database of tasks that need to run (either
periodically or only once), and by passing them to celery_ for execution on dedicated
workers.

Listers
^^^^^^^

:term:`Listers ` are a type of task, run by the Scheduler, that scrape a web site, a
forge, etc. to gather all the source code repositories they can find, also known as
:term:`origins `. For each source code repository found, a :term:`loader` task is
created.

The following sequence diagram shows the interactions between these components when a
new forge needs to be archived. This example depicts the case of a gitlab_ forge, but
any other supported source type would be very similar.

.. thumbnail:: ../images/tasks-lister.svg

As one might observe in this diagram, the lister does two things:

- it asks the forge (a gitlab_ instance in this case) for the list of known
  repositories, and
- it inserts one :term:`loader` task for each source code repository, which will be in
  charge of importing the content of that repository.

Note that most listers usually work in incremental mode, meaning they store in a
dedicated database the current state of the listing of the forge. Then, on a
subsequent execution of the lister, it will ask only for new repositories.

Also note that if the lister inserts a new loading task for a repository for which a
loading task already exists, the existing task will be updated (if needed) instead of
a new task being created.

Loaders
^^^^^^^

:term:`Loaders ` are also a type of task, but they aim at importing or updating a
source code repository. It is the loader that inserts :term:`blob` objects in the
:term:`object storage`, and inserts nodes and edges in the :ref:`graph `.

The sequence diagram below describes this second step of importing the content of a
repository. Once again, we take the example of a git repository, but any other type of
repository would be very similar.

.. thumbnail:: ../images/tasks-git-loader.svg

Journal
^^^^^^^

The last core component is the :term:`Journal `, which is a persistent logger of
every change in the archive, with publish-subscribe_ support, using Kafka.

The Storage writes to it every time a new object is added to the archive; and many
components read from it to be notified of these changes. For example, it allows the
Scheduler to know how often software repositories are updated by their developers, to
decide when next to visit these repositories.

It is also the foundation of the :ref:`mirror` infrastructure, as it allows mirrors to
stay up to date.

.. _architecture-tier-2:

Other major components
----------------------

All the components we saw above are critical to the |swh| archive, as they are in
charge of archiving source code. But they are not enough to provide other important
features of |swh|: making this archive accessible and searchable by anyone.

Archive website and API
^^^^^^^^^^^^^^^^^^^^^^^

First of all, the archive website and API, also known as :ref:`swh-web `, is the main
entry point of the archive.

-This is the component that serves https://archive.softwareheritage.org/,
-which is the window into the entire archive, as it provides access to it
-through a web browser or the HTTP API.
+This is the component that serves https://archive.softwareheritage.org/, which is the
+window into the entire archive, as it provides access to it through a web browser or the
+HTTP API.

-It does so by querying most of the internal APIs of |swh|:
-the Data Storage (to display source code repositories and their content),
-the Scheduler (to allow manual scheduling of loader tasks through the
-`Save Code Now `_ feature),
-and many of the other services we will see below.
+It does so by querying most of the internal APIs of |swh|: the Data Storage (to display
+source code repositories and their content), the Scheduler (to allow manual scheduling
+of loader tasks through the :swh_web:`Save Code Now ` feature), and many of the
+other services we will see below.

Internal data mining
^^^^^^^^^^^^^^^^^^^^

:term:`Indexers ` are a type of task aiming at crawling the content of the
:term:`archive` to extract derived information. This ranges from detecting the MIME
type or license of individual files, to reading all types of metadata files at the
root of repositories and storing them together in a unified format, CodeMeta_.

All results computed by Indexers are stored in a PostgreSQL database, the Indexer
Storage.

Vault
^^^^^

The :term:`Vault ` is an internal API, in charge of cooking compressed archives (zip
or tgz) of archived objects on request (via swh-web). These compressed objects are
typically directories or repositories.

Since this can be a rather long process, it is delegated to an asynchronous (celery)
task, through the Scheduler.

.. _architecture-tier-3:

Extra services
--------------

Finally, |swh| provides additional tools that, although not necessary to operate the
archive, provide convenient interfaces or performance benefits.

It is therefore possible to have a fully-functioning archive without any of these
services (our :ref:`development Docker environment ` disables most of these by
default).

Search
^^^^^^

The :ref:`swh-search ` service complements both the Storage and the Indexer Storage,
to provide efficient advanced reverse-index search queries, such as full-text search
on origin URLs and metadata.

This service is a recent addition to the |swh| architecture based on ElasticSearch,
and is currently in use only for URL search.

Graph
^^^^^

:ref:`swh-graph ` is also a recent addition to the architecture, designed to
complement the Storage using a specialized backend.
It leverages WebGraph_ to store a compressed in-memory representation of the entire
graph, and provides fast implementations of graph traversal algorithms.

Counters
^^^^^^^^

-The `archive's landing page `_ features
-counts of the total number of files/directories/revisions/... in the archive.
-Perhaps surprisingly, counting unique objects at |swh|'s scale is hard,
-and a performance bottleneck when implemented purely in the Storage's SQL database.
+The :swh_web:`archive's landing page ` features counts of the total number of
+files/directories/revisions/... in the archive. Perhaps surprisingly, counting unique
+objects at |swh|'s scale is hard, and a performance bottleneck when implemented purely
+in the Storage's SQL database.

:ref:`swh-counters ` provides an alternative design to solve this issue, by reading
new objects from the Journal and counting them using Redis_' HyperLogLog_ feature; it
also keeps the history of these counters over time using Prometheus_.

Deposit
^^^^^^^

The :ref:`Deposit ` is an alternative way to add content to the archive. While
listers and loaders, as we saw above, **discover** repositories and **pull** artifacts
into the archive, the Deposit allows trusted partners to **push** the content of their
repository directly to the archive, and is internally loaded by the
:mod:`Deposit Loader `.

The Deposit is centered on the SWORDv2_ protocol, which allows depositing archives
(usually TAR or ZIP) along with metadata in XML.

The Deposit has its own HTTP interface, independent of swh-web. It also has its own
SWORD client, which is specialized to interact with the Deposit server.

Authentication
^^^^^^^^^^^^^^

While the archive itself is public, |swh| reserves some features for authenticated
clients, such as higher rate limits, access to experimental APIs (currently: the
Graph service), or the Deposit.

This is managed centrally by :ref:`swh-auth ` using KeyCloak.
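The HyperLogLog-based counting mentioned in the Counters section above can be illustrated with a small, self-contained sketch. This is a toy implementation for illustration only: the production service relies on Redis_' built-in implementation, and the class name and parameters below are ours, not part of |swh|.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Toy HyperLogLog cardinality estimator (illustration only)."""

    def __init__(self, p=10):
        self.p = p                    # 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)      # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Standard HyperLogLog estimate with small-range correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

With 2^10 registers the typical relative error is around 3%, and memory use stays at a few kilobytes no matter how many objects are counted; this constant-memory property is what makes the approach attractive for archive-scale counters.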
Web Client, Fuse, Scanner
^^^^^^^^^^^^^^^^^^^^^^^^^

SWH provides a few tools to access the archive via the API:

* :ref:`swh-web-client`, a command-line interface to authenticate with SWH, and a
  library to access the API from Python programs
* :ref:`swh-fuse`, a Filesystem in USErspace implementation, which exposes the entire
  archive as a regular directory on your computer
* :ref:`swh-scanner`, a work-in-progress tool to check which of the files in a project
  are already in the archive, without submitting them

Replayers and backfillers
^^^^^^^^^^^^^^^^^^^^^^^^^

As the Journal and the various databases may be out of sync for various reasons (scrub
of either of them, migration, database addition, ...), and because some databases need
to follow the content of the Journal (mirrors), some places of the |swh| codebase
contain tools known as "replayers" and "backfillers", designed to keep them in sync:

* the :mod:`Object Storage Replayer ` copies the content of an object storage to
  another one. It first performs a full copy, then streams new objects using the
  Journal to stay up to date
* the Storage Replayer loads the entire content of the Journal into a Storage
  database, and then keeps them in sync. This is used for mirrors, and when creating a
  new database.
* the Storage Backfiller, which does the opposite. This was initially used to populate
  the Journal from the database; it is occasionally used when one needs to clear a
  topic in the Journal and recreate it.

.. _celery: https://www.celeryproject.org
.. _CodeMeta: https://codemeta.github.io/
.. _gitlab: https://gitlab.com
.. _PostgreSQL: https://www.postgresql.org/
.. _Prometheus: https://prometheus.io/
.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
.. _Redis: https://redis.io/
.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html
.. _HyperLogLog: https://redislabs.com/redis-best-practices/counting/hyperloglog/
.. _WebGraph: https://webgraph.di.unimi.it/

diff --git a/docs/faq/index.rst b/docs/faq/index.rst
index 0091a4c..633aa87 100644
--- a/docs/faq/index.rst
+++ b/docs/faq/index.rst
@@ -1,270 +1,268 @@
.. _faq:

Frequently Asked Questions
**************************

.. contents::
   :depth: 3
   :local:
..

.. _faq_prerequisites:

Prerequisites for code contributions
====================================

What are the skills required to be a code contributor?
------------------------------------------------------

Generally, only Python and basic Git knowledge are required to contribute. Other than
that, it really depends on what technical areas you want to work on. For student
internships, the `internships`_ page details specific prerequisites needed to pick up
a topic.

Feel free to contact us via our `development channels `__ to inquire about the
specific skills needed to work on any topic of your interest.

What are the minimum system requirements (hardware/software) to run SWH locally?
--------------------------------------------------------------------------------

Python 3.7 or newer is required. See the :ref:`developer setup documentation ` for
more details.

.. _faq_getting_started:

Getting Started
===============

What are the must-read docs before I start contributing?
--------------------------------------------------------

We recommend you read the top links listed on the :ref:`documentation home page ` in
order: getting started, contributing, and architecture overview, as well as the data
model.

Where can I see the getting started guide for developers?
---------------------------------------------------------

For hacking on the Software Heritage code base you should start from the
:ref:`developer-setup` tutorial.

How do I find an easy task to get started?
------------------------------------------

We maintain a `list of easy tickets `__ to work on; see the `Easy hacks page `__ for
more details.

I am skilled in one specific technology, can I find tickets requiring that skill?
---------------------------------------------------------------------------------

Unfortunately, not at the moment. But you can look at the `internships`_ list for
something matching this skill, which may give you topics to search for in the `bug
tracking system`_.

Either way, feel free to contact our developers through any of the `development
channels`_, we would love to work with you.

Where should I ask for technical help?
--------------------------------------

You can choose one of the following:

* `development channels`_
* `contact form`_ for any enquiries

.. _faq_run_swh:

Running an SWH instance locally
===============================

How do I run a local "toy version" of the archive?
--------------------------------------------------

The :ref:`getting-started` tutorial shows how to run a local instance of the Software
Heritage software infrastructure, using Docker.

I have the SWH stack running locally. How do I get some initial data to play around?
------------------------------------------------------------------------------------

You can set up a job on your local machine; for example, you can :ref:`schedule a
listing task `. Doing so on a small forge will allow you to load some repositories.
You can also trigger :ref:`loading from the cli ` directly.

I have a SWH stack running locally. How do I set up a lister/loader job?
------------------------------------------------------------------------

See the :ref:`"Managing tasks" chapter ` in the Docker environment documentation.

How can I create a user in my local instance?
---------------------------------------------

That is not possible right now. Either stay anonymous, or use the user "test"
(password "test") or the user "ambassador" (password "ambassador").

Should I run/test the web app in any particular browser?
--------------------------------------------------------

We expect the web app to work on all major browsers.
It uses mostly straightforward HTML/CSS and a little Javascript for search and source
code highlighting, so testing in a single browser is usually enough.

.. _faq_dataset:

Getting sample datasets
=======================

Is there a way to connect to the SWH (production) archive database from my local machine?
------------------------------------------------------------------------------------------

We provide the archive as a dataset on public clouds, see the :ref:`swh-dataset
documentation `. We can also provide read access to one of the main databases on
request, `contact us`_.

.. _faq_error_bugs:

Errors and bugs
===============

I found a bug/improvement in the system, where should I report it?
------------------------------------------------------------------

Please report it on our `bug tracking system`_. First create an account, then create a
bug report using the "Create task" button.

You should get some feedback within a week (at least someone triaging your issue). If
not, `get in touch with us `_ to make sure we did not miss it.

.. _faq_legal:

Legal matters
=============

Do I need to sign a form to contribute code?
--------------------------------------------

Yes, on your first diff, you will have to sign such a document. As long as it is not
signed, your diff content will not be visible.

Will my name be added to a CONTRIBUTORS file?
---------------------------------------------

You will be asked during review to add yourself.

.. _faq_code_review:

Code Review
===========

I found a straightforward typo fix, should my fix go through the entire code review process?
--------------------------------------------------------------------------------------------

You are welcome to drop us a message at one of the `development channels`_; we will
pick it up and fix it so you don't have to follow the whole :ref:`code review process
`.

Which tests should I run before committing code?
------------------------------------------------

Mostly, run `tox` (or `pytest`) to run the unit test suite. When you propose a patch
in our forge, the continuous integration will trigger a build (using `tox` as well).

I am getting errors while trying to commit. What is going wrong?
----------------------------------------------------------------

Ensure you followed the proper guide to :ref:`setup your environment ` and try again.
If the error persists, you are welcome to drop us a message at one of the
`development channels`_.

Is there a format/guideline for writing commit messages?
--------------------------------------------------------

See the :ref:`git-style-guide`.

Is there some recommended git branching strategy?
-------------------------------------------------

It's left at the developer's discretion. Mostly, people hack on their feature, then
propose a diff from a git branch or directly from the master branch. The only
imperative is that for a feature to be packaged and deployed, it needs to land in the
master branch first.

How should I document the code I contribute to SWH?
---------------------------------------------------

Any new feature should include documentation in the form of comments and/or
docstrings. Ideally, they should also be documented in plain English in the
repository's `docs/` folder if relevant to a single package, or in the main
`swh-docs` repository if it is a transversal feature.

.. _faq_api:

Software Heritage API
=====================

How do I generate API usage credentials?
----------------------------------------

See the :ref:`Authentication guide `.

Is there a page where I can see all the API endpoints?
------------------------------------------------------

-See the `API endpoint listing page`_.
+See the :swh_web:`API endpoint listing page `.

What are the usage limits for SWH APIs?
---------------------------------------

Maximum number of permitted requests per hour:

* 120 for anonymous users
* 1200 for authenticated users

-It's described in the `rate limit documentation page`_.
+It's described in the :swh_web:`rate limit documentation page `.

.. It's temporarily here but it should be moved into its own sphinx instance
   at some point in the future.

.. _faq_sysadm:

System Administration
=====================

How does SWH release?
---------------------

A release is mostly done:

- first in docker (somewhat as part of the development process)
- secondly packaged and deployed on staging (mostly)
- thirdly the same package is deployed on production

Is there a release cycle?
-------------------------

When a functionality is ready (tests ok, landed in master, docker run ok), the module
is tagged. The tag is pushed. This triggers a packaging build process. When the
package is ready, depending on the module [1], sysadmins deploy the package with the
help of puppet.

[1] The swh-web module is mostly automatic. Other modules are not yet automatic, as
internal state migrations (databases) often enter the release cycle and, due to the
data volume, may need human intervention.

-.. _API endpoint listing page: https://archive.softwareheritage.org/api/1/
-.. _rate limit documentation page: https://archive.softwareheritage.org/api/#rate-limiting
.. _bug tracking system: https://forge.softwareheritage.org/
.. _contact form: https://www.softwareheritage.org/contact/
.. _contact us: https://www.softwareheritage.org/contact/
.. _development channels: https://www.softwareheritage.org/community/developers/
.. _internships: https://wiki.softwareheritage.org/wiki/Internships

diff --git a/docs/getting-started/api.rst b/docs/getting-started/api.rst
index da1ac48..6a139c1 100644
--- a/docs/getting-started/api.rst
+++ b/docs/getting-started/api.rst
@@ -1,656 +1,637 @@
==============================================
Getting Started with the Software Heritage API
==============================================

Introduction
------------

About Software Heritage
^^^^^^^^^^^^^^^^^^^^^^^

The `Software Heritage project `__ was started in 2015 with a rather impressive goal
and purpose:

   Software Heritage is an ambitious initiative that aims at collecting, organizing,
   preserving and sharing all the source code publicly available in the world.

-Yes, you read it well: all source code available in the world. It implies to
-build an equally impressive infrastructure to hold the huge amount of
-information represented, make the archive available to the public
-through a `nice web interface `__
-and even propose a :ref:`well-documented API ` to access it
-seamlessly. For the records, there are also :ref:`various datasets
-available ` for download, with detailed instructions
-about how to set it up. And, yes it’s huge: the full graph generated
-from the archive (with only metadata, content is not included) has more
-than 20b nodes and weights 1.2TB. Overall size of the archive is in the
-hundreds of TBs.
-
-This article presents, and demonstrates the use of, the `Software
-Heritage API `__ to query
-basic information about archived content and fetch the content of a
+Yes, all source code available in the world. This implies building an equally impressive
+infrastructure to hold the huge amount of information represented, make the archive
+available to the public through a :swh_web:`nice web interface ` and even propose a
+:ref:`well-documented API ` to access it seamlessly.
For the record, there are
+also :ref:`various datasets available ` for download, with detailed
+instructions about how to set them up. And, yes, it’s huge: the full graph generated from
+the archive (with only metadata, content is not included) has more than 20b nodes and
+weighs 1.2TB. The overall size of the archive is in the hundreds of TBs.
+
+This article presents, and demonstrates the use of, the :swh_web:`Software Heritage API
+` to query basic information about archived content and fetch the content of a
software project.

Terms and Concepts
^^^^^^^^^^^^^^^^^^

For our activity we need to define the following terms and concepts:

- The repositories analysed by SWH are registered as **origins**. Examples of origins
  are: https://bitbucket.org/anthroweb/apache.git, https://github.com/apache/ant, or
  other types of sources (debian source packages, npmjs, pypi, cran..).
- When repositories are analysed, **snapshots** are created. Snapshots describe the
  state of the repository at the time of analysis, and provide links to the repository
  content. As an example, in the case of a git repository, the snapshot links to the
  list of branches, which themselves link to revisions and releases.
- **Revisions** are consistent sets of directories and contents representing the
  repository at a given time, like in a baseline. They can be conceptually mapped to
  commits in subversion, to git references, or to source package versions in debian or
  npmjs repositories.
- Revisions are linked to a **directory**, which itself links to other directories and
  **contents** (aka blobs).

A full list of terms is provided in the `Software Heritage doc `__.

Preliminary steps
-----------------

This article uses Python 3.x on the client side, and the ``requests`` Python module to
issue the HTTP requests. Note however that any language that provides HTTP requests
(GET, POST) can access the API and could be used.
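Since only plain HTTP GET calls are involved, even the Python standard library suffices. A stdlib-only variant of the ``requests`` calls used throughout this article could look like this (``get_json`` is our own helper name, not part of any Software Heritage tooling):

```python
import json
from urllib.request import urlopen


def get_json(url):
    """GET a URL and decode its JSON body (stdlib-only, no `requests`)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires network access):
# counters = get_json("https://archive.softwareheritage.org/api/1/stat/counters/")
```

Any of the ``requests.get(...).json()`` calls below can be replaced by ``get_json(...)``.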
Firstly let’s make sure we have the correct Python version and module installed:: boris@castalia:notebook$ python3 -V Python 3.7.3 boris@castalia:notebooks$ pip3 install requests Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.21.0) boris@castalia:notebook$ Initialise the script --------------------- We need to import a few modules and utilities to play with the Software Heritage API, namely ``json`` and the aforementioned ``requests`` modules. We also define a utility function to pretty-print json data easily: .. code:: python import json import requests # Utility to pretty-print json. def jprint(obj): # create a formatted string of the Python JSON object print(json.dumps(obj, sort_keys=True, indent=4)) -The syntax mentioned in the `API -documentation `__ is rather -straightforward. Since we want to read it from the main Software -Heritage server, we will use ``https://archive.softwareheritage.org/`` -as the basename. All API calls will be forged according to the same -syntax: +The syntax mentioned in the :swh_web:`API documentation ` is rather +straightforward. Since we want to read it from the main Software Heritage server, we +will use ``https://archive.softwareheritage.org/`` as the basename. All API calls will +be forged according to the same syntax: ``https://archive.softwareheritage.org/api/1/`` Request basic Information ------------------------- -We want to get some basic information about the main server activity and -content. The ``stat`` endpoint provides a summary of the main indexes and -some statistics about the archive. We can request a GET on the main -counters of the archive using the counters path, as described in the -`endpoint -documentation `__: +We want to get some basic information about the main server activity and content. The +``stat`` endpoint provides a summary of the main indexes and some statistics about the +archive. 
We can request a GET on the main counters of the archive using the counters
+path, as described in the :swh_web:`endpoint documentation `:

``/api/1/stat/counters/``

This API endpoint returns the following information:

* **content** is the total number of blobs (files) in the archive.
* **directory** is the total number of directories in the archive.
* **origin** is the number of distinct origins (repositories) fetched by the archive
  bots.
* **origin_visits** is the total number of visits across all origins.
* **person** is the number of authors (e.g. committers and authors) in the archived
  files.
* **release** is the number of tags retrieved in the archive.
* **revision** is the number of revisions stored in the archive.
* **skipped_content** is the number of objects which could not be imported into the
  archive.
* **snapshot** is the number of snapshots stored in the archive.

Note that we use the default JSON format for the output. We could use YAML if we
wanted to, with a custom ``Request Headers`` set to ``application/yaml``.

.. code-block:: python

    resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/")
    counters = resp.json()
    jprint(counters)

.. code-block:: python

    {
        "content": 10049535736,
        "directory": 8390591308,
        "origin": 156388918,
        "person": 42263568,
        "release": 17218891,
        "revision": 2109783249
    }

There are almost 10bn blobs (aka files) in the archive and 8bn+ directories already,
for 156m repositories analysed.

Now, what about a specific repository? Let’s say we want to find if `alambic `__ (an
open-source data provider and analysis system for software development) has already
been analysed by the archive’s bots.

Search the archive
------------------

Search for a keyword
^^^^^^^^^^^^^^^^^^^^

-The easiest way to look for a keyword in the repositories analysed by
-the archive is to use the ``search`` feature of the ``origin`` endpoint.
-Documentation for the endpoint is
-`here `__
-and the complete syntax is:
+The easiest way to look for a keyword in the repositories analysed by the archive is to
+use the ``search`` feature of the ``origin`` endpoint. Documentation for the endpoint is
+:swh_web:`here ` and the complete syntax is:

``/api/1/origin/search//``

The server returns an array of objects, with each item being formatted as:

- the **origin_visits_url** attribute is a URL that points to the API page listing all
  visits (bot fetches) to this repository.
- **url** is the URL of the origin, or repository, itself.

A (truncated) example of a result from this endpoint is shown below:

::

    [
        {
            "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
            "url": "https://github.com/borisbaldassari/alambic"
        }
        ...
    ]

As an example, we will look for instances of *alambic* in the archive’s analysed
repositories::

    resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/")
    origins = resp.json()
    print(f"We found {len(origins)} entries.")
    for origin in origins[1:10]:
        print(f"- {origin['url']}")

Which produces::

    We found 52 entries.
    - https://github.com/royal-alambic-club/sauron
    - https://github.com/scamberlin/alambic
    - https://github.com/WebTales/alambic-connector-mongodb
    - https://github.com/WebTales/alambic
    - https://github.com/AssoAlambic/alambic-website
    - https://bitbucket.org/nayoub/alambic.git
    - https://github.com/Alexandru-Dobre/alambic-connector-rest
    - https://github.com/WebTales/alambic-connector-diffbot
    - https://github.com/WebTales/alambic-connector-firebase

There are obviously many projects and repositories that embed the word alambic, and we
will need to be a bit more specific if we are to identify the origin actually related
to the alambic project. If we want to know more about a specific origin, we can simply
use the ``url`` attribute (or any known URL) as an entry for any of the ``origin``
endpoints.
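Since the search matches any origin URL containing the keyword, a little client-side filtering helps narrow the results down. The helper below is purely illustrative (our own name, not a feature of the API), keeping only origins under a given owner or namespace:

```python
def origins_for_owner(origins, owner):
    """Keep only search results whose URL contains /<owner>/.

    `origins` is the list of dicts returned by the origin search endpoint,
    each with at least a "url" key. Matching is case-sensitive.
    """
    return [o["url"] for o in origins if f"/{owner}/" in o["url"]]
```

For example, ``origins_for_owner(origins, "borisbaldassari")`` reduces the 52 hits above to the repositories of that single GitHub account.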
Search for a specific origin
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Now say that we want to query the database for the specific repository
-of Alambic, to know what information has been registered by the archive.
-The API endpoint can be found `in the swh-web
-documentation `__,
-and has the following syntax:
+Now say that we want to query the database for the specific repository of Alambic, to
+know what information has been registered by the archive. The API endpoint can be found
+:swh_web:`in the swh-web documentation `, and has the following
+syntax:

``/api/1/origin//get/``

This returns the same type of JSON object as the ``search`` command seen previously:

- the **origin_visits_url** attribute is a URL that points to the API page listing all
  visits (bot fetches) to this repository.
- **url** is the URL of the origin, or repository, itself.

We know that Alambic is hosted at ‘https://github.com/borisbaldassari/alambic/’, so
the API call will look like this:

``/api/1/origin/https://github.com/borisbaldassari/alambic/get/``

.. code:: python

    resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/")
    found = resp.json()
    jprint(found)

-.. parsed-literal::
+.. code::

    {
        "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
        "url": "https://github.com/borisbaldassari/alambic"
    }

Get visits information
^^^^^^^^^^^^^^^^^^^^^^

-We can use the ``origin_visits_url`` attribute to know more about when
-the repository was analysed by the archive bots. The API endpoint is
-fully documented on the `Software Heritage doc
-site `__,
-and has the following syntax:
+We can use the ``origin_visits_url`` attribute to know more about when the repository
+was analysed by the archive bots.
The API endpoint is fully documented on the
:swh_web:`Software Heritage doc site `, and has the following
syntax:

``/api/1/origin/<origin_url>/visits/``

We will use the same query as before, about the main Alambic repository.

.. code:: python

   resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/")
   found = resp.json()

   print(f"Number of visits found: {len(found)}.")
   print("With dates:")
   for visit in found:
       print(f"- {visit['visit']} {visit['date']}")

   print("\nExample of a single visit entry:")
   jprint(found[0])

.. code::

   Number of visits found: 5.
   With dates:
   - 5 2021-01-01T19:35:41.308336+00:00
   - 4 2020-02-06T10:41:45.700641+00:00
   - 3 2019-09-01T22:38:12.056537+00:00
   - 2 2019-06-16T04:52:18.162914+00:00
   - 1 2019-01-30T07:19:20.799217+00:00

   Example of a single visit entry:
   {
       "date": "2021-01-01T19:35:41.308336+00:00",
       "metadata": {},
       "origin": "https://github.com/borisbaldassari/alambic",
       "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/",
       "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
       "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/",
       "status": "full",
       "type": "git",
       "visit": 5
   }

Get the content
---------------

As defined in the beginning, a snapshot is a capture of the repository at a given time,
with links to all branches and releases. In this example we will work on the snapshot ID
of the last visit to Alambic, as returned by the previous command.

.. code:: python

   # Store the snapshot id
   snapshot = found[0]['snapshot']
   print(f"Snapshot is {snapshot}.")

.. code::

   Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.
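Rather than hard-coding ``found[0]``, one might prefer to select the most recent visit
that completed successfully. The following sketch shows one way to do that; the
``visits`` list is an illustrative, trimmed stand-in for the JSON returned by the visits
endpoint above, and ``latest_full_snapshot`` is a hypothetical helper, not part of the
archive API.

```python
# Select the most recent visit whose status is "full", instead of
# assuming the first entry is the one we want. `visits` is a trimmed,
# hard-coded stand-in for the JSON returned by the visits endpoint.
visits = [
    {"visit": 5, "date": "2021-01-01T19:35:41.308336+00:00", "status": "full",
     "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc"},
    {"visit": 4, "date": "2020-02-06T10:41:45.700641+00:00", "status": "full",
     "snapshot": None},  # snapshot ids of older visits omitted here
]

def latest_full_snapshot(visits):
    """Return the snapshot id of the most recent visit with status 'full'."""
    full = [v for v in visits if v["status"] == "full"]
    # ISO 8601 timestamps with identical UTC offsets sort correctly as strings.
    latest = max(full, key=lambda v: v["date"], default=None)
    return latest["snapshot"] if latest else None

print(f"Snapshot is {latest_full_snapshot(visits)}.")
```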
Note that the latest visit to the repository can also be retrieved directly using the
:swh_web:`dedicated endpoint ` ``/api/1/origin/<origin_url>/visit/latest/``.

Get the snapshot
^^^^^^^^^^^^^^^^

We now want to retrieve the content of the project at this snapshot. For that purpose
there is the ``snapshot`` endpoint, and its documentation is :swh_web:`provided here
`. The complete syntax is:

``/api/1/snapshot/<snapshot_id>/``

The snapshot endpoint returns, in its ``branches`` attribute, a list of **revisions**
(aka commits in a git context), which themselves point to the set of directories and
files in the branch at the time of analysis. Let’s follow this chain of links, starting
with the snapshot’s list of revisions (branches):

.. code:: python

   snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot))
   snapshotj = snapshotr.json()
   jprint(snapshotj)
.. code::

   {
       "branches": {
           "HEAD": {
               "target": "refs/heads/master",
               "target_type": "alias",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
           },
           "refs/heads/devel": {
               "target": "e298b8c5692b18928013a68e41fd185419515075",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/"
           },
           "refs/heads/features/cr152_anonymise_data": {
               "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/"
           },
           "refs/heads/features/cr164_github_project": {
               "target": "0005abb080e4c67a97533ee923e9d28142877752",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
           },
           "refs/heads/features/cr165_github_its": {
               "target": "0005abb080e4c67a97533ee923e9d28142877752",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
           },
           "refs/heads/features/cr89_gitlabwizard": {
               "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/"
           },
           "refs/heads/master": {
               "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
               "target_type": "revision",
               "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
           }
       },
       "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
       "next_branch": null
   }

Get the root directory
^^^^^^^^^^^^^^^^^^^^^^

The revision associated with the branch can be retrieved by following the corresponding
link in the ``target_url`` attribute. We will follow the ``refs/heads/master`` branch
and get the associated revision object.
In this case (a git repository) the revision is equivalent to a commit, with an ID and a
message.

.. code:: python

   print(f"Snapshot ID is {snapshotj['id']}.")
   master_url = snapshotj['branches']['refs/heads/master']['target_url']
   masterr = requests.get(master_url)
   masterj = masterr.json()
   jprint(masterj)

.. code::

   Snapshot ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.
   {
       "author": {
           "email": "boris.baldassari@gmail.com",
           "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
           "name": "Boris Baldassari"
       },
       "committer": {
           "email": "boris.baldassari@gmail.com",
           "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
           "name": "Boris Baldassari"
       },
       "committer_date": "2020-11-01T12:55:13+01:00",
       "date": "2020-11-01T12:55:13+01:00",
       "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8",
       "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/",
       "extra_headers": [],
       "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/",
       "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
       "merge": false,
       "message": "#163 Fix dygraphs zero padding in forums plugin.\n",
       "metadata": {},
       "parents": [
           {
               "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151",
               "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/"
           }
       ],
       "synthetic": false,
       "type": "git",
       "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
   }

The revision references the root directory of the project. We can list all files and
directories at the root by requesting more information from the ``directory_url``
attribute.
The endpoint is documented :swh_web:`here ` and has the
following syntax:

``/api/1/directory/<directory_id>/``

The structure of the response is an **array of directory entries**. **Content entries**
are represented like this:

::

   {
       "checksums": {
           "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8",
           "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
           "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc"
       },
       "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "length": 101,
       "name": ".dockerignore",
       "perms": 33188,
       "status": "visible",
       "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
       "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/",
       "type": "file"
   }

And **directory entries** are represented with:

::

   {
       "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "length": null,
       "name": "doc",
       "perms": 16384,
       "target": "316468df4988351911992ecbf1866f1c1f575c23",
       "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/",
       "type": "dir"
   }

We will print the list of contents and directories located at the root of the repository
at the time of analysis:

.. code:: python

   root_url = masterj['directory_url']
   rootr = requests.get(root_url)
   rootj = rootr.json()
   for f in rootj:
       print(f"- {f['name']}")

.. code::

   - .dockerignore
   - .env
   - .gitignore
   - CODE_OF_CONDUCT.html
   - CODE_OF_CONDUCT.md
   - LICENCE.html
   - LICENCE.md
   - Readme.md
   - doc
   - docker
   - docker-compose.run.yml
   - docker-compose.test.yml
   - dockercfg.encrypted
   - mojo
   - resources

We could follow the links up (or down) to the leaves in order to rebuild the project
structure, and download all files individually to rebuild the project locally. However
the archive can do it for us, and provides a feature to download the content of a whole
project in one step: **cooking**. The feature is described in the :ref:`swh-vault
documentation `.
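Because every directory entry carries a ``type`` field, a few lines of client-side code
are enough to split a listing into files and sub-directories, for example to decide
which ``target_url`` links to follow next. The ``entries`` list and ``split_entries``
helper below are illustrative stand-ins, trimmed to the fields actually used, not part
of the archive API.

```python
# Illustrative, trimmed stand-in for the array returned by the directory
# endpoint; only the "name" and "type" fields are used below.
entries = [
    {"name": ".dockerignore", "type": "file"},
    {"name": "doc", "type": "dir"},
    {"name": "mojo", "type": "dir"},
    {"name": "Readme.md", "type": "file"},
]

def split_entries(entries):
    """Split a directory listing into sorted (files, dirs) name lists."""
    files = sorted(e["name"] for e in entries if e["type"] == "file")
    dirs = sorted(e["name"] for e in entries if e["type"] == "dir")
    return files, dirs

files, dirs = split_entries(entries)
print(f"Files: {', '.join(files)}")
print(f"Directories: {', '.join(dirs)}")
```

A recursive walk would fetch each ``dir`` entry's ``target_url`` in turn, which is
exactly the work the cooking feature described below performs for us.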
Download content of a project
-----------------------------

When we ask the archive to cook a directory for us, it invokes an asynchronous job that
recursively fetches the directories and files of the project, following the graph up to
the leaves (files), and exports the result as a tar.gz file. This procedure is handled
by the :ref:`swh-vault component `, and it’s all automatic.

Order the meal
^^^^^^^^^^^^^^

A cooking job can be invoked for revisions, directories or (soon) snapshots. It is
initiated with a POST request on the ``vault/<object_type>/<object_id>`` endpoint, and
its complete syntax for directories is:

``/api/1/vault/directory/<directory_id>/``

The first POST request initiates the cooking, and subsequent GET requests can fetch the
job result and download the archive. See the `Software Heritage documentation
` on this, with useful examples. The API endpoint is documented
:swh_web:`here `.

In this example we will fetch the content of the root directory that we previously
identified.

.. code:: python

   mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
   mealj = mealr.json()
   jprint(mealj)

.. code::

   {
       "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
       "id": 379321799,
       "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "obj_type": "directory",
       "progress_message": null,
       "status": "done"
   }

Ask if it’s ready
^^^^^^^^^^^^^^^^^

We can use a GET request on the same URL to get information about the job status:

.. code:: python

   statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
   statusj = statusr.json()
   jprint(statusj)
.. code::

   {
       "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
       "id": 379321799,
       "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "obj_type": "directory",
       "progress_message": null,
       "status": "done"
   }

Get the plate
^^^^^^^^^^^^^

Once the processing is finished (it can take up to a few minutes), the tar.gz archive
can be downloaded through the ``fetch_url`` link and then extracted:

::

   boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz
     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
   100 9555k  100 9555k    0     0  1459k      0  0:00:06  0:00:06 --:--:-- 1717k
   boris@castalia:downloads$ ls
   myarchive.tar.gz
   boris@castalia:downloads$ tar xzf myarchive.tar.gz
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config
   [SNIP]

Conclusion
----------
In this article, we learned **how to explore and use the Software Heritage archive using
its API**: searching for a repository, identifying projects, and downloading specific
snapshots of a repository. There is a lot more to the archive and its API than what we
have seen, and all features are thoroughly documented on the :swh_web:`Software Heritage
web site `.