Page MenuHomeSoftware Heritage

Make snapshot_add take an iterable of snapshot.
ClosedPublic

Authored by vlorentz on Apr 1 2019, 4:23 PM.

Details

Summary

For uniformity with other _add() endpoints (except origin_visit_add).

Noone uses the new signature of snapshot_add except the tests yet,
so it's the perfect time to make this breaking change.

Depends on D1325.

Diff Detail

Repository
rDSTO Storage manager
Branch
snapshot_add-many
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5040
Build 6772: tox-on-jenkinsJenkins
Build 6771: arc lint + arc unit

Event Timeline

douardda added inline comments.
swh/storage/api/client.py
87

wouldn't be acceptable to make this method a bit more flexible and accept snapshots to be a scalar? I mean smthg like:

if isinstance(snapshots, dict):
    snapshots = [snapshots]
swh/storage/api/client.py
87

Then we should do the same for all other _add methods.

olasd requested changes to this revision.Apr 2 2019, 10:38 PM
olasd added a subscriber: olasd.

If we're doing this, it'd be cleaner to change the SQL logic to actually batch the insertions, rather than doing the loop outside of SQL. Creating temporary tables is fairly expensive and churn on the database catalog is costly (we've been trying to cut it down lately).

You'll need to:

  1. add a snapshot id (sha1_git) column to the temporary table
  2. add all distinct snapshot ids to the snapshot table
  3. add all distinct branch specs to the snapshot_branch table
  4. join all three tables to fill in the snapshot_branches table

Alternatively, you can at least truncate the table rather than dropping it at each loop (but we lose the potential scale benefits of steps 3/4 above).

This revision now requires changes to proceed.Apr 2 2019, 10:38 PM

How would you feel about using truncate temporarily, and opening a task to do the right thing later?

How would you feel about using truncate temporarily, and opening a task to do the right thing later?

Yeah, that's fine.

  • use truncate instead of drop.
  • Add docstring to DirectKafkaWriter.
  • undo push to the wrong diff

Just one last comment but I think that's good to go.

swh/storage/storage.py
814–815

Can we do the existence check only once (by getting the list of snapshots that don't exist at that point) instead of having it here and in the loop?

This revision is now accepted and ready to land.Apr 3 2019, 2:14 PM

Don't duplicate calls to db.snapshot_exists.

This revision was landed with ongoing or failed builds.Apr 3 2019, 2:47 PM
This revision was automatically updated to reflect the committed changes.