Page MenuHomeSoftware Heritage

Clean up entries in {origin_intrinsic,revision}_metadata with no metadata
Closed, ResolvedPublic

Description

For now, there's ~90% rows whose dictionary has a single @context key.

We should delete them.

When done, the following two queries should both return 0:

softwareheritage-indexer=> select count(*) from revision_intrinsic_metadata where metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb;
 count  
--------
 742428
(1 row)

softwareheritage-indexer=> select count(*) from origin_intrinsic_metadata where metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb;
 count  
--------
 369096
(1 row)

Event Timeline

vlorentz triaged this task as Low priority.

My tentative proposal is to delete all table entries for which no metadata has been found.
The invariant will be: if an origin/revision has metadata, there will be an entry in the table(s); if not, the origin/revision will not appear.

The more I think at the other use case (keeping track of what has already been indexed and hat hasn't, for efficiency purposes), the more I get convinced that the solution to that problem should be found elsewhere, e.g.:

  • if we change an existing indexer, we need to query all revisions where the relevant metadata file paths can be found, and schedule reindexing of those. This is something that can be built on top of the provenance map
  • adding a new indexer would result in the same query (and hence solution), using the metadata file paths of the new indexer

So I'm for just DELETE-ing the "empty" lines, once the current full pass is complete. Oh, and of course update the indexers accordingly :-)

What is the provenance map?

zack added a comment.Feb 26 2019, 3:12 PM

What is the provenance map?

The service keeping track of where-we-have-seen-what, of which we have a POC developed by grouss ATM. Not yet in production, but it will be at some point.

Sounds like a solution for T1528 :)

zack added a comment.Feb 26 2019, 5:41 PM

Yes :-)
so, do we agree that the right fix for this task is just to get rid of empty-ish rows? or are there other arguments that we haven't considered yet?

and the NULL option

zack added a comment.Feb 26 2019, 6:42 PM

I don't understand your comment. What are the remaining arguments for using NULL instead of just deleting rows?

To know that we went over the origin and did not find anything, without additional tables.

But I agree that's a pretty weak argument

vlorentz claimed this task.Feb 27 2019, 3:40 PM
zack changed the task status from Open to Work in Progress.EditedMar 1 2019, 2:51 PM

I've started the first of following queries on somerset (in a screen of my user):

DELETE FROM origin_intrinsic_metadata
WHERE metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;

DELETE FROM revision_metadata
WHERE translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;

the second will follow soon thereafter.

zack claimed this task.Mar 1 2019, 3:12 PM
zack added a comment.Mar 1 2019, 4:49 PM

As discussed on IRC, even after cleaning up origin_intrinsic_metadata, the DELETE on revision_metadata fails with:

softwareheritage-indexer=# DELETE FROM revision_metadata
softwareheritage-indexer-# WHERE translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::json
b ;

ERROR:  update or delete on table "revision_metadata" violates foreign key constraint "origin_intrinsic_metadata_revision_
metadata_fkey" on table "origin_intrinsic_metadata"
DETAIL:  Key (id, indexer_configuration_id)=(\x142cdce9678d71340f8fc5398d07bf69e35d3271, 110502138) is still referenced fr
om table "origin_intrinsic_metadata".
zack added a comment.Mar 2 2019, 9:53 AM

The following fix for the above (suggested by @vlorentz ) is now running:

update revision_metadata
set translated_metadata = origin_intrinsic_metadata.metadata
from origin_intrinsic_metadata
where revision_metadata.id=origin_intrinsic_metadata.from_revision and revision_metadata.translated_metadata='{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}' and origin_intrinsic_metadata.metadata != '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}';
zack added a comment.EditedMar 2 2019, 1:18 PM

The update completed, but a first attempt at the second DELETE failed with a deadlock (?!):

softwareheritage-indexer=# DELETE FROM revision_metadata
WHERE translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;                                                                                            ERROR:  deadlock detected
DETAIL:  Process 10966 waits for ShareLock on transaction 197265813; blocked by process 11754.
Process 11754 waits for ShareLock on transaction 197264487; blocked by process 10966.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (1380733,15) in relation "revision_metadata"
Time: 170864,091 ms (02:50,864)

Update: it's repeatable, and it looks like the deadlock is with a running indexer.

D1218

Once this is landed and deployed, ordering your DELETEs by revision_metadata.id will acquire locks in the same order as the idx_storage, solving the deadlock issue.

zack added a comment.Mar 4 2019, 3:11 PM

D1218
Once this is landed and deployed, ordering your DELETEs by revision_metadata.id will acquire locks in the same order as the idx_storage, solving the deadlock issue.

Thanks, accepted.

(At the same time deploying it would require restarting the indexers, so we're gonna wait for a deployment anyway before being able to complete the cleanup.)

zack added a comment.Mar 15 2019, 9:25 PM

@vlorentz: lather, rinse, repeat.

softwareheritage-indexer=# DELETE FROM revision_intrinsic_metadata
WHERE metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;
ERROR:  deadlock detected
DETAIL:  Process 23900 waits for ShareLock on transaction 212164862; blocked by process 20175.
Process 20175 waits for ShareLock on transaction 212164381; blocked by process 23900.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (772424,55) in relation "revision_intrinsic_metadata"
Time: 33048,828 ms (00:33,049)

(just happened, after indexers have been restarted including D1218)

zack reassigned this task from zack to vlorentz.Mar 20 2019, 12:10 PM
zack raised the priority of this task from Low to Normal.
zack updated the task description. (Show Details)Apr 2 2019, 4:37 PM
zack updated the task description. (Show Details)
zack updated the task description. (Show Details)Apr 2 2019, 4:40 PM
zack updated the task description. (Show Details)
This comment was removed by vlorentz.
vlorentz updated the task description. (Show Details)Apr 3 2019, 10:41 AM
vlorentz updated the task description. (Show Details)Apr 3 2019, 11:14 AM
zack closed this task as Resolved.Wed, Apr 24, 5:18 PM

This is now done, aside from a minor issue noted below:

softwareheritage-indexer=# select count(*) from revision_intrinsic_metadata where metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb;
 count 
-------
     0
(1 row)

softwareheritage-indexer=# select count(*) from origin_intrinsic_metadata where metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb;
 count 
-------
     0
(1 row)

softwareheritage-indexer=# select count(*) from revision_intrinsic_metadata where mappings = array[]::text[];
 count 
-------
   372
(1 row)

softwareheritage-indexer=# select count(*) from origin_intrinsic_metadata where mappings = array[]::text[];
 count 
-------
     0
(1 row)