We want to allow using swh-scanner with a local DB of known SWHIDs, as an alternative to querying the Web API over the network.
Use cases for this are: (1) reproducibility of benchmarks/experiments done with swh-scanner (this feature makes it possible to "freeze" the archive state); (2) real-world use of swh-scanner without network access, which is a requirement in several enterprise settings (this requires having a full list of known SWHIDs locally, which is technically doable).
The blueprint for an initial implementation of this feature is as follows:
1) a batch importer that will take as input a list of textual SWHIDs (from a local file or standard input) and produce a sqlite database containing a single table of known SWHIDs (with an index)
2) a new simple HTTP service implementing an API compatible with the [[ https://archive.softwareheritage.org/api/ | official Web API ]], but implementing only the [[ https://archive.softwareheritage.org/api/1/known/doc/ | /known endpoint ]]
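As a rough illustration of item (1), the importer could look like the sketch below. The function name `import_swhids`, the table name `swhids` and its single `swhid` column are assumptions made for this example, not a settled schema; the `PRIMARY KEY` constraint is what provides the index here.

```python
import sqlite3
from typing import Iterable


def import_swhids(lines: Iterable[str], db_path: str) -> int:
    """Load textual SWHIDs (one per line) into a sqlite database.

    Returns the number of distinct SWHIDs stored after the import.
    Table and column names are assumptions made for this sketch.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # PRIMARY KEY creates the index used for lookups
            conn.execute(
                "CREATE TABLE IF NOT EXISTS swhids (swhid TEXT PRIMARY KEY)"
            )
            # Skip blank lines; duplicates are silently ignored
            conn.executemany(
                "INSERT OR IGNORE INTO swhids (swhid) VALUES (?)",
                ((line.strip(),) for line in lines if line.strip()),
            )
        (count,) = conn.execute("SELECT COUNT(*) FROM swhids").fetchone()
        return count
    finally:
        conn.close()
```

Since the input is just an iterable of lines, the same function would work with an open file or with `sys.stdin`.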
In terms of user interface, the proposal is to introduce a new CLI subcommand `swh scanner db`, which in turn will have two subcommands:
1) `swh scanner db import [--input SWHID_LIST.txt] [--output SWHID_DB.sqlite]`: it will read `SWHID_LIST.txt` (or stdin, if `-` is given) and create the sqlite db in `SWHID_DB.sqlite`
2) `swh scanner db serve SWHID_DB.sqlite`: it will start the API service using `SWHID_DB.sqlite` as the sqlite DB containing the list of known SWHIDs (generated using step (1))
With that, the scanner can then be used locally with something like:
3) `swh scanner scan -u http://localhost:5011/api/1 -x *.git ~/source/dir/to/scan/`
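For reference, a minimal sketch of what the service behind `swh scanner db serve` could do, using only the standard library. It assumes the `/known` endpoint receives a JSON list of SWHIDs via POST and answers with a JSON object mapping each SWHID to `{"known": true|false}`, and that the database has a hypothetical `swhids (swhid)` table. The helper names `make_handler` and `serve` are invented for this example; port 5011 is taken from the command above.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(db_path):
    """Build a request handler bound to the given sqlite database file."""

    class KnownHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Only the /known endpoint is implemented
            if self.path.rstrip("/") != "/api/1/known":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                swhids = json.loads(self.rfile.read(length))
            except ValueError:
                self.send_error(400, "body must be valid JSON")
                return
            if not isinstance(swhids, list) or not all(
                isinstance(s, str) for s in swhids
            ):
                self.send_error(400, "body must be a JSON list of strings")
                return
            # One connection per request: sqlite3 connections must not be
            # shared across threads by default
            conn = sqlite3.connect(db_path)
            try:
                query = "SELECT 1 FROM swhids WHERE swhid = ?"
                result = {
                    s: {"known": conn.execute(query, (s,)).fetchone() is not None}
                    for s in swhids
                }
            finally:
                conn.close()
            body = json.dumps(result).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return KnownHandler


def serve(db_path, port=5011):
    """Serve the /known endpoint on localhost until interrupted."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(db_path))
    server.serve_forever()
```

A real implementation would probably rely on a proper web framework and batch the lookups; the point here is only the shape of the request and the response.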
Other requirements:
- we should use a DB-API 2.0 compatible interface for accessing sqlite (the [[ https://docs.python.org/3.8/library/sqlite3.html | module in the stdlib ]] should do), so that it will be easy in the future to switch to a more serious DB for enterprise use
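To illustrate the DB-API 2.0 requirement, the lookup below restricts itself to calls defined by that interface (`cursor()`, `execute()`, `fetchone()`, `close()`) and picks the parameter placeholder from the driver's `paramstyle` attribute. The function name `known_swhids`, the `PLACEHOLDERS` mapping and the `swhids (swhid)` table are assumptions made for this sketch.

```python
import sqlite3

# Placeholder syntax by DB-API 2.0 paramstyle. Only positional styles are
# covered; "pyformat" is assumed to accept positional %s, as psycopg2 does.
PLACEHOLDERS = {"qmark": "?", "format": "%s", "pyformat": "%s"}


def known_swhids(dbapi_module, conn, swhids):
    """Return {swhid: bool} using only DB-API 2.0 calls.

    dbapi_module is the driver module (e.g. sqlite3), conn a connection
    opened with it. The table name is an assumption of this sketch.
    """
    placeholder = PLACEHOLDERS[dbapi_module.paramstyle]
    query = f"SELECT 1 FROM swhids WHERE swhid = {placeholder}"
    cur = conn.cursor()
    try:
        result = {}
        for swhid in swhids:
            cur.execute(query, (swhid,))
            result[swhid] = cur.fetchone() is not None
        return result
    finally:
        cur.close()
```

With sqlite this would be called as `known_swhids(sqlite3, conn, [...])`; switching to another backend would then mostly mean passing a different driver module and connection.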