
swh-scanner: add support for local DB of known SWHIDs
Closed, Migrated (edits locked)

Description

We want to allow using swh-scanner with a local DB of known SWHIDs, as an alternative to querying the Web API over the network.

Use cases for this are: (1) reproducibility of benchmarks and experiments done with swh-scanner (with this feature it will be possible to "freeze" the archive state); (2) real use of swh-scanner without going through the network, which is a requirement in several enterprise settings (this will require having a full list of known SWHIDs locally, but that is technically doable).

The blueprint for an initial implementation of this feature is as follows:

  1. a batch importer that takes as input a list of textual SWHIDs (from a local file or standard input) and produces a SQLite database containing a single, indexed table of known SWHIDs
  2. a new, simple HTTP service exposing an API compatible with the official Web API, but implementing only the /known endpoint
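A minimal sketch of what the batch importer in step (1) could look like, using only the stdlib sqlite3 module. The function name import_swhids, the table name swhids and the column name swhid are assumptions made for illustration; the task does not specify a schema. Note that INSERT OR IGNORE is SQLite-specific syntax.

```python
import sqlite3
from typing import Iterable


def import_swhids(lines: Iterable[str], db_path: str) -> int:
    """Load textual SWHIDs (one per line) into a single-table SQLite DB.

    Returns the number of distinct SWHIDs stored in the table.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commit on success, roll back on error
            # Hypothetical schema: the PRIMARY KEY provides the lookup index.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS swhids (swhid TEXT PRIMARY KEY)"
            )
            # Skip blank lines; OR IGNORE tolerates duplicate input lines.
            rows = ((line.strip(),) for line in lines if line.strip())
            conn.executemany(
                "INSERT OR IGNORE INTO swhids (swhid) VALUES (?)", rows
            )
        (count,) = conn.execute("SELECT COUNT(*) FROM swhids").fetchone()
        return count
    finally:
        conn.close()
```

Since the function accepts any iterable of lines, it could be fed either an open file object or sys.stdin, which covers both input sources mentioned above.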

In terms of user interface, the proposal is to introduce a new CLI subcommand swh scanner db, which in turn will have two subcommands:

  1. swh scanner db import [--input SWHID_LIST.txt] [--output SWHID_DB.sqlite]: it will read SWHID_LIST.txt (or stdin, if - is given) and create the sqlite db in SWHID_DB.sqlite
  2. swh scanner db serve SWHID_DB.sqlite: it will start the API service using SWHID_DB.sqlite as the SQLite DB containing the list of known SWHIDs (generated in step (1))
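As a rough illustration of what the service behind db serve might do, here is a stdlib-only sketch. It assumes that the /known endpoint accepts a POST whose body is a JSON list of SWHIDs and answers with a JSON object mapping each SWHID to {"known": true|false}; this should be checked against the official Web API documentation. The names lookup_known, make_handler and serve, and the swhids table, are hypothetical.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def lookup_known(conn, swhids):
    """Map each queried SWHID to {"known": bool} (assumed response shape)."""
    result = {}
    cur = conn.cursor()
    try:
        for swhid in swhids:
            # Hypothetical schema: one table "swhids" with a "swhid" column.
            cur.execute("SELECT 1 FROM swhids WHERE swhid = ?", (swhid,))
            result[swhid] = {"known": cur.fetchone() is not None}
    finally:
        cur.close()
    return result


def make_handler(db_path):
    """Build a request handler class bound to the given SQLite file."""

    class KnownHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Only the /known endpoint is implemented; anything else is a 404.
            if self.path.rstrip("/") != "/api/1/known":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            swhids = json.loads(self.rfile.read(length) or b"[]")
            # One connection per request keeps the sketch thread-safe.
            conn = sqlite3.connect(db_path)
            try:
                body = json.dumps(lookup_known(conn, swhids)).encode()
            finally:
                conn.close()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return KnownHandler


def serve(db_path, port=5011):
    """Serve the /known endpoint on localhost until interrupted."""
    HTTPServer(("localhost", port), make_handler(db_path)).serve_forever()
```

The default port 5011 is simply the one used in the example URL of this task. A real implementation would also need input validation and error handling for malformed request bodies.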

With that, the scanner can then be used locally with something like:

  1. swh scanner scan -u http://localhost:5011/api/1 -x *.git ~/source/dir/to/scan/

Other requirements:

  • we should use a DB-API 2.0 compatible interface for accessing sqlite (the sqlite3 module in the stdlib should do), so that it will be easy to switch to a more robust DB for enterprise use in the future
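To illustrate the DB-API 2.0 requirement, the sketch below restricts itself to the calls defined by PEP 249 (cursor(), execute(), fetchone(), close()) and picks the query placeholder from the driver's paramstyle, which is one of the details that differs between drivers. The function name is_known, the PLACEHOLDERS mapping and the swhids table are assumptions for illustration only.

```python
import sqlite3

# Positional placeholder for a few PEP 249 paramstyles.
# sqlite3 uses "qmark"; some other drivers use "format" or "pyformat".
PLACEHOLDERS = {"qmark": "?", "format": "%s", "pyformat": "%s"}


def is_known(conn, swhid, paramstyle=sqlite3.paramstyle):
    """Check one SWHID using only DB-API 2.0 calls on the connection."""
    placeholder = PLACEHOLDERS[paramstyle]
    cur = conn.cursor()
    try:
        # Only the placeholder token is interpolated; the SWHID itself is
        # always passed as a bound parameter.
        cur.execute(
            f"SELECT 1 FROM swhids WHERE swhid = {placeholder}", (swhid,)
        )
        return cur.fetchone() is not None
    finally:
        cur.close()
```

Keeping lookups behind a small function like this would mean that switching to another DB mostly comes down to passing a different connection and paramstyle, although SQL dialect differences may still need attention.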