Page MenuHomeSoftware Heritage

Add index to origin table on belvedere for looking up by sha1
Closed, MigratedEdits Locked

Description

Now that the Web API graph endpoint is available in production, we are able to query the Graph service and resolve origin sha1s to their associated URL.

But currently, the web application in production is sending requests to belvedere (as somerset is experiencing disk usage issues).

While somerset has an index on the origin table for fast origin lookup by sha1:

15:11 $ psql service=swh-replica
psql (13.0 (Debian 13.0-1.pgdg100+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> \d origin
                            Table "public.origin"
 Column |  Type  | Collation | Nullable |              Default               
--------+--------+-----------+----------+------------------------------------
 id     | bigint |           | not null | nextval('origin_id_seq'::regclass)
 url    | text   |           | not null | 
Indexes:
    "origin_pkey" PRIMARY KEY, btree (id)
    "origin_digest_idx" btree (digest(url, 'sha1'::text))
    "origin_url_idx" btree (url)
    "origin_url_idx1" gin (url gin_trgm_ops)

it is missing on belvedere:

12:27 $ psql service=swh
psql (13.0 (Debian 13.0-1.pgdg100+1), server 12.4 (Debian 12.4-1.pgdg100+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> \d origin
                            Table "public.origin"
 Column |  Type  | Collation | Nullable |              Default               
--------+--------+-----------+----------+------------------------------------
 id     | bigint |           | not null | nextval('origin_id_seq'::regclass)
 url    | text   |           | not null | 
Indexes:
    "origin_pkey" PRIMARY KEY, btree (id)
    "origin_url_idx" gin (url gin_trgm_ops)
    "origin_url_idx1" hash (url)
    "origin_url_idx2" UNIQUE, btree (url)
Referenced by:
    TABLE "origin_metadata" CONSTRAINT "origin_metadata_origin_fkey" FOREIGN KEY (origin_id) REFERENCES origin(id)
    TABLE "origin_visit" CONSTRAINT "origin_visit_origin_fkey" FOREIGN KEY (origin) REFERENCES origin(id)
    TABLE "skipped_content" CONSTRAINT "skipped_content_origin_fkey" FOREIGN KEY (origin) REFERENCES origin(id)
Publications:
    "softwareheritage"

Event Timeline

anlambert lowered the priority of this task from High to Normal.
anlambert created this task.

I triggered the following index creation (on db hosted on belvedere) [1]

create index concurrently on origin using btree(digest(url, 'sha1'));

which according to the same file should already be there anyway (it's not part of any conditional in the sql file).

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/60-indexes.sql$39

ardumont changed the task status from Open to Work in Progress.Oct 12 2020, 8:25 PM

it got stuck for a while...

@olasd unstuck the blocking process and the index is now done.

Note for later:
To check what's blocking:

select * from pg_stat_activity where pid = any(pg_blocking_pids($blocked_pid));

$blocked_pid is the process id (can find it with pg_activity or some such).

ardumont claimed this task.