Page MenuHomeSoftware Heritage

scanner: 'db import' option to create local database with known swhids
ClosedPublic

Authored by DanSeraf on Wed, Nov 18, 2:24 PM.

Details

Summary

Implementation regarding the import option.

Related T2760

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

DanSeraf created this revision.Wed, Nov 18, 2:24 PM

Build is green

Patch application report for D4508 (id=15990)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit 80c92b80817b84171870eef37fa05a272cbabf6a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/75/ for more details.

vlorentz requested changes to this revision.Wed, Nov 18, 3:11 PM
vlorentz added a subscriber: vlorentz.

Hi

Could you add test for the CLI itself? (grep CliRunner on the swh codebase to find examples of ci tests)

swh/scanner/cli.py
161

use click.File("r") (I don't remember the exact syntax), so 1. click opens it for you 2. - can be passed to read from stdin

174

nitpick: the underscore should be at the end ( https://www.python.org/dev/peps/pep-0008/#function-and-method-arguments )

swh/scanner/db.py
51

the name doesn't make it clear that this method connects to swh-web and fills the db

81–86

these should cause a non-0 status code to be returned

94

nitpick: return res is not None

This revision now requires changes to proceed.Wed, Nov 18, 3:11 PM
DanSeraf updated this revision to Diff 16008.Wed, Nov 18, 5:43 PM

requested changes

Build is green

Patch application report for D4508 (id=16008)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit db06a1058caf9d69df2bb54a5f4150530193e6f0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/76/ for more details.

zack requested changes to this revision.Wed, Nov 18, 8:26 PM
zack added subscribers: seirl, zack.

Thanks, this looks great, and almost there.

I've pointed out in the code some issues about index/constraint in the DB, some minor naming issues, and a misunderstanding about the archive lookup logic (which is handled already elsewhere).

swh/scanner/db.py
43

I'd just call this table "swhids", because that is what it contains.

Additionally, you kneed here an uniqueness constraint and/or an unique index. That would avoid having to manually check for duplicates in add and also speed up lookups.

cc: @seirl in case he has performance tips for the table and/or index. For context: this table of known indexes in the use case we have in mind will end up containing ~30 million SWHIDs.

45

I wonder if you want a add_one/add_many split here, to be able to insert many SWHIDs at once. In general inserting stuff one-by-one in a table is quite slow, and will get worse with an index. So an option to insert many at once with a single query, and inserting entire "pages" of SWHIDs from the input file at once with it could be interested.

cc: @seirl for the same reason as above

49

this WHERE predicate can be removed with an index/constraint

53–71

Oh, sorry, maybe we weren't clear on this part of the spec. I don't think this logic of querying the archive to check if SWHIDs are known belongs to the scanner. The spec in T2760 assumed that the SWHIDs in the input .txt file must be considered as known, and hence added to the DB, without further querying the archive.

So I think this function can just be removed.

(It is tangential to this diff, but FWIW for the ongoing swh-scanner paper, the list of SWHIDs fed to swh-scanner db import has already been filtered to include only known SWHID. See my data/swh-known.py script in the paper repo.)

80–83

So here you can just insert all SWHIDs from the file into the DB. (Ideally paginating with the add_many methods I suggested above, assuming it's more efficient.) And avoid all the API error handling.

93

this should just be called "known", no?

This revision now requires changes to proceed.Wed, Nov 18, 8:26 PM
vlorentz added inline comments.Thu, Nov 19, 9:33 AM
swh/scanner/db.py
43

or mark swhid as a PRIMARY KEY

DanSeraf updated this revision to Diff 16064.Thu, Nov 19, 4:04 PM

requested changes:

  • SWHID as PRIMARY KEY in db
  • SWHID insertion without query the Web API
  • bulk insert of SWHID values in db

Build is green

Patch application report for D4508 (id=16064)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit d5815b1aa15edb08d2591b74d0ba6c73554a9079
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/77/ for more details.

vlorentz added inline comments.Thu, Nov 19, 4:45 PM
swh/scanner/db.py
42–44

There's a grouper function in swh-core that you can use instead of get_chunk.

67

That works, but it's not reusable as a library. Could you raise an exception and catch it in the CLI instead?

If not, that's not a big deal

DanSeraf updated this revision to Diff 16069.Thu, Nov 19, 5:08 PM

minor changes

Build is green

Patch application report for D4508 (id=16069)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit 05d3bb5f17e9f79455783a90b7b94bee7228f3d1
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/78/ for more details.

zack edited the summary of this revision. (Show Details)Thu, Nov 19, 9:08 PM
zack accepted this revision.Thu, Nov 19, 9:12 PM

LGTM, I've only reported a minor typing issue.

swh/scanner/db.py
41

Minor nit: I think you want Iterable[str] here instead of List[str], as the latter would force client code to actually materialize a potentially long list in memory. (A general rule of thumb for typing is "accept abstract types, return concrete types".)

But I don't know what grouper expects, so please verify that before changing this.

vlorentz added inline comments.Thu, Nov 19, 11:19 PM
swh/scanner/tests/test_db.py
40

Sorry I did not notice it earlier, but it looks like this line never runs.

DanSeraf updated this revision to Diff 16083.Fri, Nov 20, 10:03 AM

minor changes:

  • mypy annotation
  • tests

Build is green

Patch application report for D4508 (id=16083)

Rebasing onto 729334a238...

Current branch diff-target is up to date.
Changes applied before test
commit e8949840a9da756044f5ed6e2f4f1c471dd9fa06
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Wed Nov 18 14:19:17 2020 +0100

    'db import' option to create local database with known swhids

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/79/ for more details.

vlorentz accepted this revision.Sat, Nov 21, 4:50 AM
This revision is now accepted and ready to land.Sat, Nov 21, 4:50 AM