Page MenuHomeSoftware Heritage

scanner: 'db import' option to create local database with known swhids
ClosedPublic

Authored by DanSeraf on Nov 18 2020, 2:24 PM.

Details

Summary

Implementation regarding the import option.

Related T2760

Diff Detail

Repository
rDTSCN Code scanner
Branch
import-option
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17271
Build 26670: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 26669: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4508 (id=15990)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit 80c92b80817b84171870eef37fa05a272cbabf6a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/75/ for more details.

vlorentz added a subscriber: vlorentz.

Hi

Could you add test for the CLI itself? (grep CliRunner on the swh codebase to find examples of ci tests)

swh/scanner/cli.py
160

use click.File("r") (I don't remember the exact syntax), so 1. click opens it for you 2. - can be passed to read from stdin

173

nitpick: the underscore should be at the end ( https://www.python.org/dev/peps/pep-0008/#function-and-method-arguments )

swh/scanner/db.py
50

the name doesn't make it clear that this method connects to swh-web and fills the db

80–85

these should cause a non-0 status code to be returned

93

nitpick: return res is not None

This revision now requires changes to proceed.Nov 18 2020, 3:11 PM

Build is green

Patch application report for D4508 (id=16008)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit db06a1058caf9d69df2bb54a5f4150530193e6f0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/76/ for more details.

zack requested changes to this revision.Nov 18 2020, 8:26 PM
zack added subscribers: seirl, zack.

Thanks, this looks great, and almost there.

I've pointed out in the code some issues about index/constraint in the DB, some minor naming issues, and a misunderstanding about the archive lookup logic (which is handled already elsewhere).

swh/scanner/db.py
42

I'd just call this table "swhids", because that is what it contains.

Additionally, you kneed here an uniqueness constraint and/or an unique index. That would avoid having to manually check for duplicates in add and also speed up lookups.

cc: @seirl in case he has performance tips for the table and/or index. For context: this table of known indexes in the use case we have in mind will end up containing ~30 million SWHIDs.

44

I wonder if you want a add_one/add_many split here, to be able to insert many SWHIDs at once. In general inserting stuff one-by-one in a table is quite slow, and will get worse with an index. So an option to insert many at once with a single query, and inserting entire "pages" of SWHIDs from the input file at once with it could be interested.

cc: @seirl for the same reason as above

48

this WHERE predicate can be removed with an index/constraint

52–70

Oh, sorry, maybe we weren't clear on this part of the spec. I don't think this logic of querying the archive to check if SWHIDs are known belongs to the scanner. The spec in T2760 assumed that the SWHIDs in the input .txt file must be considered as known, and hence added to the DB, without further querying the archive.

So I think this function can just be removed.

(It is tangential to this diff, but FWIW for the ongoing swh-scanner paper, the list of SWHIDs fed to swh-scanner db import has already been filtered to include only known SWHID. See my data/swh-known.py script in the paper repo.)

79–82

So here you can just insert all SWHIDs from the file into the DB. (Ideally paginating with the add_many methods I suggested above, assuming it's more efficient.) And avoid all the API error handling.

92

this should just be called "known", no?

This revision now requires changes to proceed.Nov 18 2020, 8:26 PM
swh/scanner/db.py
42

or mark swhid as a PRIMARY KEY

requested changes:

  • SWHID as PRIMARY KEY in db
  • SWHID insertion without query the Web API
  • bulk insert of SWHID values in db

Build is green

Patch application report for D4508 (id=16064)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit d5815b1aa15edb08d2591b74d0ba6c73554a9079
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/77/ for more details.

swh/scanner/db.py
41–43

There's a grouper function in swh-core that you can use instead of get_chunk.

66

That works, but it's not reusable as a library. Could you raise an exception and catch it in the CLI instead?

If not, that's not a big deal

Build is green

Patch application report for D4508 (id=16069)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit 05d3bb5f17e9f79455783a90b7b94bee7228f3d1
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/78/ for more details.

LGTM, I've only reported a minor typing issue.

swh/scanner/db.py
40

Minor nit: I think you want Iterable[str] here instead of List[str], as the latter would force client code to actually materialize a potentially long list in memory. (A general rule of thumb for typing is "accept abstract types, return concrete types".)

But I don't know what grouper expects, so please verify that before changing this.

swh/scanner/tests/test_db.py
39

Sorry I did not notice it earlier, but it looks like this line never runs.

minor changes:

  • mypy annotation
  • tests

Build is green

Patch application report for D4508 (id=16083)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit e8949840a9da756044f5ed6e2f4f1c471dd9fa06
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/79/ for more details.

This revision is now accepted and ready to land.Nov 21 2020, 4:50 AM