
add support for reverse lookup from swh:1:ori:... PIDs to origin URLs
Closed, Migrated

Description

Now that we have defined an intrinsic PID schema for origins and support for it in both swh identify and swh-graph (as graph roots), we need a way to reverse lookup from origin PIDs to origin URLs.

As I understand it, that means:

  • adding a column to the origin table for the origin checksum (either as a PID or, more consistently with the rest of the SQL schema, as a SHA1 checksum)
  • patching the storage functions that create new origins so that they also fill the SHA1 column
  • adding a storage function to perform the SHA1→URL lookup
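As a rough illustration of the three points above, here is a minimal in-memory sketch. It assumes the origin checksum is the SHA1 of the UTF-8 encoded origin URL (consistent with the `digest(url, 'sha1')` expression used later in this task); `OriginIndex` and its methods are hypothetical names, not the actual storage API.

```python
import hashlib
from typing import Dict, Optional

PID_PREFIX = "swh:1:ori:"


def origin_sha1(url: str) -> bytes:
    """SHA1 of the UTF-8 encoded URL (assumed definition of the checksum)."""
    return hashlib.sha1(url.encode("utf-8")).digest()


def origin_pid(url: str) -> str:
    """Build the origin PID from the URL checksum."""
    return PID_PREFIX + origin_sha1(url).hex()


class OriginIndex:
    """Hypothetical stand-in for the origin table plus its SHA1 column."""

    def __init__(self) -> None:
        self._by_sha1: Dict[bytes, str] = {}

    def add(self, url: str) -> None:
        # Creating an origin also records its checksum.
        self._by_sha1[origin_sha1(url)] = url

    def url_for_pid(self, pid: str) -> Optional[str]:
        # Reverse lookup: PID -> SHA1 -> URL, or None if unknown.
        if not pid.startswith(PID_PREFIX):
            raise ValueError(f"not an origin PID: {pid!r}")
        sha1 = bytes.fromhex(pid[len(PID_PREFIX):])
        return self._by_sha1.get(sha1)
```

For example, after `idx.add("https://example.org/repo.git")`, calling `idx.url_for_pid(origin_pid("https://example.org/repo.git"))` returns the original URL.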

For the transition we will need to:

  1. initially mark the SHA1 column as NULL-able
  2. deploy in production a storage version that fills the SHA1 for new origins
  3. perform a one-off conversion of all old origins that have NULL SHA1s
  4. mark the SHA1 column as NOT NULL (and add a B-tree index on it)
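The transition steps above can be sketched against a throwaway SQLite database. This is only an illustration under assumptions: the production database is PostgreSQL, the column name `sha1`, the index name, and the helper `backfill_null_sha1` are hypothetical, and the final NOT NULL constraint is left as a comment because SQLite cannot alter a column's constraints in place.

```python
import hashlib
import sqlite3


def url_sha1(url: str) -> bytes:
    """SHA1 of the UTF-8 encoded URL (assumed checksum definition)."""
    return hashlib.sha1(url.encode("utf-8")).digest()


def backfill_null_sha1(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill the sha1 column for rows where it is NULL, in batches.

    Returns the number of rows updated.
    """
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, url FROM origin WHERE sha1 IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        conn.executemany(
            "UPDATE origin SET sha1 = ? WHERE id = ?",
            [(url_sha1(url), oid) for oid, url in rows],
        )
        conn.commit()
        total += len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO origin (url) VALUES (?)",
    [("https://example.org/a.git",), ("https://example.org/b.git",)],
)

# Step 1: add the column as NULL-able.
conn.execute("ALTER TABLE origin ADD COLUMN sha1 BLOB")

# Step 2: new origins are inserted with their checksum already filled.
new_url = "https://example.org/c.git"
conn.execute(
    "INSERT INTO origin (url, sha1) VALUES (?, ?)", (new_url, url_sha1(new_url))
)

# Step 3: one-off conversion of the old rows.
updated = backfill_null_sha1(conn)

# Step 4: index the column. In PostgreSQL this is also where one would run
# ALTER TABLE origin ALTER COLUMN sha1 SET NOT NULL.
conn.execute("CREATE INDEX origin_sha1_idx ON origin (sha1)")
remaining = conn.execute(
    "SELECT count(*) FROM origin WHERE sha1 IS NULL"
).fetchone()[0]
```

Batching the backfill keeps each transaction short, which matters more on a large production table than in this toy example.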

Event Timeline

zack triaged this task as Normal priority. Oct 19 2019, 2:45 PM
zack created this task.
zack raised the priority of this task from Normal to High. Nov 18 2019, 5:50 PM
zack added a project: Compressed graph service.

We should consider just adding a btree index on sha1(url) and see where that takes us.

Launched on somerset:

create index concurrently on origin using btree(digest(url, 'sha1'));
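For the expression index above to serve a PID lookup, the query has to filter on the same expression, `digest(url, 'sha1')`. A possible client-side sketch follows; the function name `origin_lookup_query` is hypothetical, and the `%s` placeholder assumes a psycopg2-style driver that adapts Python `bytes` to `bytea`.

```python
import re
from typing import Tuple

# An origin PID is assumed to be "swh:1:ori:" followed by 40 hex digits.
_ORIGIN_PID_RE = re.compile(r"^swh:1:ori:([0-9a-f]{40})$")


def origin_lookup_query(pid: str) -> Tuple[str, Tuple[bytes]]:
    """Return (sql, params) for a PID -> URL lookup.

    The WHERE clause repeats the indexed expression verbatim so that the
    planner can match it against the expression index.
    """
    m = _ORIGIN_PID_RE.match(pid)
    if m is None:
        raise ValueError(f"not a valid origin PID: {pid!r}")
    sha1 = bytes.fromhex(m.group(1))
    sql = "SELECT url FROM origin WHERE digest(url, 'sha1') = %s"
    return sql, (sha1,)
```

The returned pair would typically be passed as `cursor.execute(sql, params)`; running the query under `EXPLAIN` is a simple way to check whether the index is actually used.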