
update indexer for storage 0.0.156
Closed, Public

Authored by douardda on Thu, Oct 31, 2:08 PM.

Details

Summary

This implies a refactoring of the db schema for origin_intrinsic_metadata: since
we neither have nor want numerical ids for origins, we use origin urls instead.

Note that IndexerStorage.origin_intrinsic_metadata_search_by_producer() still
has start/end arguments, which are expected to be strings, and thus uses
lexicographic comparisons between origin urls. This is far from
ideal, but a proper fix requires a (new?) endpoint that handles pagination
properly.
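
To make the limitation above concrete, here is a minimal sketch of what a start/end range over origin urls means when the bounds are compared as plain strings. It is an illustration only, not the IndexerStorage code: the function name urls_in_range and the sample urls are hypothetical.

```python
def urls_in_range(urls, start="", end=None):
    """Return the urls u with start <= u (and u <= end, if given), sorted.

    Comparison is plain Python string comparison, i.e. lexicographic by
    code point. Hypothetical helper, not part of swh.indexer.
    """
    return sorted(
        u for u in urls
        if u >= start and (end is None or u <= end)
    )


# Made-up origin urls, used only for this illustration.
origins = [
    "https://github.com/b/repo",
    "https://github.com/a/repo",
    "http://example.org/project",
    "https://gitlab.com/a/repo",
]

# The bounds are arbitrary strings; they need not be existing origin urls.
selected = urls_in_range(
    origins, start="https://github.com/", end="https://github.com/~"
)
```

With string bounds, a caller can only slice the url space by prefix-like ranges; there is no natural "next page" token apart from a url itself, which is why a dedicated pagination endpoint is suggested.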

Depends on D2206

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

douardda created this revision. Thu, Oct 31, 2:08 PM
vlorentz requested changes to this revision. Thu, Oct 31, 3:35 PM
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/indexer/storage/__init__.py
695

singular (same for other occurrences)

786

Make sure swh indexer schedule reindex_origin_metadata still works after this change.
You should also probably explain quickly in the docstring how to paginate properly now.

791–793

must be updated

swh/indexer/storage/in_memory.py
716

forgot to change id

swh/indexer/tests/storage/test_storage.py
453–456

self.origin_url_

1556–1571

That doesn't test how a consumer of the API would use it. It needs to start with start='url', then incrementally get new results using only the results from the previous call.

This revision now requires changes to proceed. Thu, Oct 31, 3:35 PM
douardda updated this revision to Diff 7594. Thu, Oct 31, 4:40 PM

fix almost all vlorentz' comments + fix in_memory's SubStorage.get_all()

douardda marked 4 inline comments as done. Thu, Oct 31, 4:42 PM
douardda added inline comments.
swh/indexer/storage/__init__.py
786

You should also probably explain quickly in the docstring how to paginate properly now.

Unfortunately, I have no idea how to do such a thing.

vlorentz requested changes to this revision. Mon, Nov 4, 12:05 PM
vlorentz added inline comments.
swh/indexer/storage/__init__.py
786

Use the last result of the previous call as the start. It also means that this endpoint should return only results strictly greater than start.
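
A minimal sketch of that pagination scheme, assuming a sorted in-memory list of origin urls in place of the real storage. The names search_after and iter_all, the limit parameter, and the sample urls are hypothetical and do not reflect the actual endpoint signature.

```python
from bisect import bisect_right


def search_after(sorted_urls, start="", limit=2):
    """Return up to `limit` urls strictly greater than `start`.

    bisect_right yields the index just past any element equal to `start`,
    so a url passed as `start` is never returned again.
    """
    first = bisect_right(sorted_urls, start)
    return sorted_urls[first:first + limit]


def iter_all(sorted_urls, limit=2):
    """Consumer side: begin with start='' and chain calls on the last result."""
    start = ""
    while True:
        page = search_after(sorted_urls, start, limit)
        if not page:
            return
        yield from page
        start = page[-1]


# Made-up origin urls, used only for this illustration.
urls = sorted([
    "https://example.org/a",
    "https://example.org/b",
    "https://example.org/c",
    "https://example.org/d",
    "https://example.org/e",
])
collected = list(iter_all(urls, limit=2))
```

Because the lower bound is exclusive, consecutive pages never overlap, and the consumer needs nothing beyond the previous page to request the next one.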

swh/indexer/tests/storage/test_storage.py
1556–1571

Sorry, I meant it should start with start=''.

This revision now requires changes to proceed. Mon, Nov 4, 12:05 PM
douardda added inline comments. Mon, Nov 4, 4:18 PM
swh/indexer/tests/storage/test_storage.py
1556–1571

I agree, but I just replicated what the tests used to do.
Once again, I believe the kind of modifications you ask for should be in a dedicated diff.

vlorentz accepted this revision. Mon, Nov 4, 4:31 PM

Let's not forget to fix swh.indexer.cli.list_origins_by_producer quickly, as it's broken by this change. (It should also be tested)

This revision is now accepted and ready to land. Mon, Nov 4, 4:31 PM
douardda updated this revision to Diff 7651. Tue, Nov 5, 3:07 PM

rebase + small fix

This revision was landed with ongoing or failed builds. Tue, Nov 5, 4:04 PM
This revision was automatically updated to reflect the committed changes.