We want to allow using swh-scanner with a local DB of known SWHIDs, as an alternative to querying the Web API over the network.
Use cases for this are: (1) reproducibility of benchmarks/experiments done with swh-scanner, since this feature makes it possible to "freeze" the archive state; (2) real use of swh-scanner without going through the network, which is a requirement in several enterprise settings (this requires having a full list of known SWHIDs locally, but it is technically doable).
The blueprint for an initial implementation of this feature is as follows:
- a batch importer that will take as input a list of textual SWHIDs (from a local file or standard input) and produce a sqlite database containing a single table of known SWHIDs (with an index)
- a new simple HTTP service implementing an API compatible with the official Web API, but implementing only the /known endpoint
- a CLI command, swh scanner db import [--input SWHID_LIST.txt] [--output SWHID_DB.sqlite]: it will read SWHID_LIST.txt (or stdin, if - is given) and create the sqlite DB in SWHID_DB.sqlite
- a CLI command, swh scanner db serve SWHID_DB.sqlite: it will start the API service using SWHID_DB.sqlite as the sqlite DB containing the list of known SWHIDs (as generated by swh scanner db import)
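
As an illustration of the batch importer step, here is a minimal sketch using the stdlib sqlite3 module. The function name import_swhids, the table name swhids and the column name swhid are assumptions made for this example, not part of the blueprint. Declaring the column as primary key is one way to get the index, and it also drops duplicate input lines.

```python
import sqlite3
from typing import Iterable


def import_swhids(lines: Iterable[str], db_path: str) -> int:
    """Load textual SWHIDs (one per line) into a single-table sqlite DB.

    Returns the number of distinct SWHIDs stored in the table.
    Table and column names are hypothetical.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # PRIMARY KEY provides the index on the single table.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS swhids (swhid TEXT PRIMARY KEY)"
            )
            # Skip blank lines; ignore SWHIDs that are already present.
            conn.executemany(
                "INSERT OR IGNORE INTO swhids (swhid) VALUES (?)",
                ((line.strip(),) for line in lines if line.strip()),
            )
        (count,) = conn.execute("SELECT COUNT(*) FROM swhids").fetchone()
        return count
    finally:
        conn.close()
```

Reading from standard input would then be something like import_swhids(sys.stdin, "SWHID_DB.sqlite"), and from a file import_swhids(open("SWHID_LIST.txt"), "SWHID_DB.sqlite"). This sketch does not validate the SWHID syntax; a real importer would probably want to.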
With these commands in place, the scanner can then be used locally with something like:
- swh scanner scan -u http://localhost:5011/api/1 -x *.git ~/source/dir/to/scan/
- implementation note: we should use a DB-API 2.0 compatible interface for accessing sqlite (the sqlite3 module in the stdlib should do), so that it will be easy to switch to a more robust DB for enterprise use in the future
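
To make the DB-API 2.0 point concrete, here is a rough sketch of the lookup that the /known endpoint of the local service could perform. It only uses generic DB-API 2.0 calls (cursor, execute, fetchone), so the connection could come from sqlite3 or, later, from another driver. The function name known and the swhids table are assumptions for this example. The returned shape (each queried SWHID mapped to {"known": bool}) reflects our understanding of the official Web API and should be checked against it.

```python
import sqlite3
from typing import Dict, Iterable


def known(conn, swhids: Iterable[str]) -> Dict[str, Dict[str, bool]]:
    """Map each queried SWHID to {"known": bool}.

    `conn` is any DB-API 2.0 connection; the table and column names
    (swhids, swhid) are hypothetical.
    """
    cur = conn.cursor()
    result = {}
    try:
        for swhid in swhids:
            # "?" is sqlite3's parameter style (qmark); another driver
            # may use a different one, e.g. "%s".
            cur.execute("SELECT 1 FROM swhids WHERE swhid = ?", (swhid,))
            result[swhid] = {"known": cur.fetchone() is not None}
    finally:
        cur.close()
    return result
```

The parameter placeholder is the main thing that would need adapting when switching drivers, as DB-API 2.0 lets each module choose its own paramstyle.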