Differential D1040

Provide stats on indexed metadata per origin.
Closed, Public

Authored by vlorentz on Jan 30 2019, 4:09 PM.

Details

Reviewers

douardda

Group Reviewers

Reviewers

Maniphest Tasks

T1484: Provide stats on extracted metadata in the indexer storage api

Commits

rDCIDX78a24361f2ce: Provide stats on indexed metadata per origin.

Summary

Running the pgsql query in production right now (3M indexed origins) takes
3 seconds.

'{}'::jsonb @> (metadata - '@context') looks computationally expensive,
but it allows counting the non-empty metadata dicts in the same
scan as the other counts.
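
The predicate can be read as "metadata has no key other than @context". Below is a minimal Python sketch of the same single-pass counting, for illustration only. The names `is_empty_without_context` and `count_metadata`, and the `total`/`non_empty` result keys, are hypothetical and are not taken from the patch itself.

```python
def is_empty_without_context(metadata):
    """Mirror '{}'::jsonb @> (metadata - '@context'):
    true when no key other than '@context' is present."""
    return all(key == "@context" for key in metadata)


def count_metadata(metadata_dicts):
    """Count all dicts and the non-empty ones in a single pass."""
    total = 0
    non_empty = 0
    for metadata in metadata_dicts:
        total += 1
        if not is_empty_without_context(metadata):
            non_empty += 1
    return {"total": total, "non_empty": non_empty}
```

Both counters are updated inside one loop, which is the in-memory analogue of computing both aggregates in one table scan.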

As far as I know, PostgreSQL does not offer an operator to test the strict
inclusion of the set of keys, so the only other way I can think of is to
test for string inequality with a second query.

Removing this sum() from the query, and sending:

select count(*) from origin_intrinsic_metadata
where metadata != '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}';

as a second query results in two queries, which take 2 seconds and
1 second respectively.

The two-query variant is also less correct (because the context value is
hardcoded), hence this implementation.
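
To illustrate the correctness concern, here is a small Python sketch comparing the two tests on a dict whose only key is `@context` but whose value differs from the hardcoded one. The function names and the `example.org` context URL are assumptions made up for this example.

```python
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def non_empty_by_equality(metadata):
    """Hardcoded comparison: anything other than the exact
    context-only dict is treated as non-empty."""
    return metadata != {"@context": CODEMETA_CONTEXT}


def non_empty_by_keys(metadata):
    """Key-based test: non-empty only if a key other than
    '@context' is present, whatever the context value is."""
    return any(key != "@context" for key in metadata)


# A context-only dict with a different (hypothetical) context value.
other = {"@context": "https://example.org/other-context"}
```

With these definitions, `non_empty_by_equality(other)` reports the dict as non-empty even though it carries no metadata, while `non_empty_by_keys(other)` does not.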

Diff Detail

Repository

rDCIDX Metadata indexer

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. · Jan 30 2019, 4:09 PM

Herald added a reviewer: Reviewers. · Jan 30 2019, 4:09 PM

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/298/ for more details.

Harbormaster completed remote builds in B3890: Diff 3308. · Jan 30 2019, 4:12 PM

vlorentz added a task: T1484: Provide stats on extracted metadata in the indexer storage api. · Jan 30 2019, 4:14 PM

Fix docstring.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/326/ for more details.

Harbormaster completed remote builds in B4052: Diff 3424. · Feb 6 2019, 5:11 PM

LGTM

This revision is now accepted and ready to land. · Feb 7 2019, 10:17 AM

rebase

Closed by commit rDCIDX78a24361f2ce: Provide stats on indexed metadata per origin. (authored by vlorentz). · Feb 7 2019, 3:25 PM

This revision was automatically updated to reflect the committed changes.

Harbormaster failed remote builds in B4076: Diff 3448! · Feb 7 2019, 3:25 PM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/369/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/369/console

Revision Contents

Path                                         Size
swh/indexer/storage/__init__.py              47 lines
swh/indexer/storage/in_memory.py             31 lines
swh/indexer/tests/storage/test_storage.py    80 lines

Diff 3449
