
Provide stats on indexed metadata per origin.
Closed, Public

Authored by vlorentz on Jan 30 2019, 4:09 PM.

Details

Summary

Running the pgsql query in production right now (3M indexed origins) takes
3 seconds.


'{}'::jsonb @> (metadata - '@context') looks computationally expensive,
but it allows counting the non-empty metadata dicts in the same
scan as the other counts.
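
To make the semantics of that expression concrete, here is a minimal Python sketch of what it tests, assuming each metadata row is a JSON object loaded as a dict. The helper names (is_empty_without_context, count_stats) are hypothetical and are not taken from the indexer code.

```python
def is_empty_without_context(metadata):
    """Model of '{}'::jsonb @> (metadata - '@context') for JSON objects.

    True when nothing is left once the '@context' key is removed.
    """
    remaining = {k: v for k, v in metadata.items() if k != "@context"}
    return remaining == {}


def count_stats(rows):
    """Count all rows and the non-empty ones in a single pass (hypothetical helper)."""
    total = 0
    non_empty = 0
    for metadata in rows:
        total += 1
        if not is_empty_without_context(metadata):
            non_empty += 1
    return {"total": total, "non_empty": non_empty}
```

Both counters are updated in the same loop, which is the in-memory analogue of computing them in one table scan.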

As far as I know, PostgreSQL does not offer an operator to test the strict
inclusion of the set of keys, so the only other way I can think of is to
test for string inequality with a second query.

Removing this sum() from the query and sending

select count(*) from origin_intrinsic_metadata
where metadata != '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}';

as a second query gives two queries, which take 2 seconds and
1 second respectively, so there is no overall speedup.

The two-query variant is also less correct (because the context value is
hardcoded), hence this implementation.
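
The correctness concern can be illustrated with a small Python sketch, assuming metadata rows are dicts. The function names and the second context URL (https://example.org/other-context) are hypothetical, chosen only to show a row whose context differs from the hardcoded one.

```python
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# The "empty" document that the second query compares against.
HARDCODED_EMPTY = {"@context": CODEMETA_CONTEXT}

ROWS = [
    {"@context": CODEMETA_CONTEXT},                         # empty
    {"@context": "https://example.org/other-context"},      # empty, other context (hypothetical)
    {"@context": CODEMETA_CONTEXT, "name": "swh-indexer"},  # non-empty
]


def non_empty_by_equality(rows):
    """Model of the second query: anything differing from the hardcoded dict counts."""
    return sum(1 for metadata in rows if metadata != HARDCODED_EMPTY)


def non_empty_by_key_removal(rows):
    """Model of the chosen approach: a row counts only if it has a key besides '@context'."""
    return sum(1 for metadata in rows if any(key != "@context" for key in metadata))
```

On these sample rows, the equality check reports 2 non-empty rows because it miscounts the row with a different context value, while the key-removal check reports 1.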

Diff Detail

Repository
rDCIDX Object indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Jan 30 2019, 4:09 PM
vlorentz updated this revision to Diff 3424. Feb 6 2019, 5:09 PM
  • Fix docstring.
douardda accepted this revision. Feb 7 2019, 10:17 AM
douardda added a subscriber: douardda.

LGTM

This revision is now accepted and ready to land. Feb 7 2019, 10:17 AM
vlorentz updated this revision to Diff 3448. Feb 7 2019, 3:25 PM
  • rebase
This revision was automatically updated to reflect the committed changes.