
"text"-indexers: Migrate to partition index instead of range
Closed, Public

Authored by ardumont on Aug 6 2020, 9:50 AM.

Details

Summary

The deprecated storage.content-get-range api (not supported by the cassandra storage) was dropped from the storage
in favor of the more compliant storage.content-get-partition api [1] (supported by the cassandra storage).

The text indexers were the sole users of that deprecated api.
This migrates them to the same partition-based pattern.
This simplifies the setup, as no range needs to be computed anymore (the api does it \o/).
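
As an illustration of the partition pattern described above, here is a minimal sketch, not the actual swh code. FakeStorage is a hypothetical in-memory stand-in, and the content_get_partition signature, the PagedResult fields and the way ids are mapped to partitions are all assumptions made for the example. The point is that the caller only passes a partition id and a partition count, then follows an opaque page token until it runs out.

```python
from dataclasses import dataclass
from typing import Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class PagedResult(Generic[T]):
    """One page of results plus an opaque token for the next page (assumed shape)."""

    results: List[T]
    next_page_token: Optional[str] = None


class FakeStorage:
    """Hypothetical in-memory stand-in for a storage exposing a partition api."""

    def __init__(self, sha1s: List[bytes]) -> None:
        self._sha1s = sorted(sha1s)

    def content_get_partition(
        self,
        partition_id: int,
        nb_partitions: int,
        page_token: Optional[str] = None,
        limit: int = 2,
    ) -> PagedResult[bytes]:
        if not 0 <= partition_id < nb_partitions:
            raise ValueError("partition_id must be in [0, nb_partitions)")
        # Assumption for this sketch: the first byte of the id decides the partition.
        in_partition = [
            s for s in self._sha1s if s[0] * nb_partitions // 256 == partition_id
        ]
        # The token is an offset here, but callers must treat it as opaque.
        start = int(page_token) if page_token is not None else 0
        end = start + limit
        next_token = str(end) if end < len(in_partition) else None
        return PagedResult(results=in_partition[start:end], next_page_token=next_token)


def iter_partition(
    storage: FakeStorage, partition_id: int, nb_partitions: int
) -> Iterator[bytes]:
    """Yield every id of one partition, following page tokens until exhausted."""
    page_token: Optional[str] = None
    while True:
        page = storage.content_get_partition(
            partition_id, nb_partitions, page_token=page_token
        )
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break
```

With this shape, a caller never computes a (start, end) range of ids: splitting the work is only a matter of choosing nb_partitions and iterating over each partition_id.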

The production impact: the current indexers must be stopped, all their current tasks disabled in the scheduler, and their input changed in the scheduler db.
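
To give an idea of what the new scheduler input could look like, here is a hedged sketch. The argument names (partition_id, nb_partitions, skip_existing) and the helper partition_task_arguments are assumptions made for illustration; this is not the scheduler's actual schema nor a migration script. It only shows that one task per partition replaces the former one task per id range.

```python
from typing import Any, Dict, List


def partition_task_arguments(
    nb_partitions: int, skip_existing: bool = True
) -> List[Dict[str, Any]]:
    """Build one set of task keyword arguments per partition (hypothetical helper)."""
    if nb_partitions <= 0:
        raise ValueError("nb_partitions must be positive")
    return [
        {
            "partition_id": partition_id,
            "nb_partitions": nb_partitions,
            "skip_existing": skip_existing,
        }
        for partition_id in range(nb_partitions)
    ]
```

For example, partition_task_arguments(4) yields four argument sets, with partition_id going from 0 to 3 and nb_partitions set to 4 in each.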

Note that this is still wip: some tests still use the old interface and remain to be fixed.

Also, this fixes:

  • mistyped code following the migration to storage 0.12.0. This cannot be untangled from the partition migration though.
  • the build [2]

[1] D3712 D3713

[2] https://jenkins.softwareheritage.org/job/DCIDX/job/tests/1084/console

Related to T645

Test Plan

tox

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision. Aug 6 2020, 9:50 AM

Build has FAILED

Patch application report for D3718 (id=13098)

Rebasing onto 62d73ed83d...

Current branch diff-target is up to date.
Changes applied before test
commit 465fbe023f0751ef666384f025e5772a06b771b1
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Aug 6 09:48:50 2020 +0200

    textual-indexers: Migrate to partition index instead of range
    
    This also fixes mistyped codes following the migration to storage 0.12.0. This
    cannot be untangled from the partition migration though.
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/30/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/30/console

ardumont planned changes to this revision. Aug 6 2020, 9:59 AM
ardumont retitled this revision from wip: textual-indexers: Migrate to partition index instead of range to wip: "text"-indexers: Migrate to partition index instead of range. Aug 6 2020, 10:03 AM
ardumont edited the summary of this revision.
ardumont added inline comments.
swh/indexer/indexer.py
286

on indexer partition

376

to fix... PagedResult[Sha1]

451

doc to fix

473–474

doc to fix.

swh/indexer/mimetype.py
144–147

doc fix needed.

swh/indexer/storage/__init__.py
178

here because exposing it through the interface did not work for some reason.

swh/indexer/storage/in_memory.py
106

doc fix needed.

125–126

line.

swh/indexer/storage/interface.py
62–63

doc fix.

283–291

doc fix.

305–306

line.

swh/indexer/tests/storage/test_storage.py
1038

@vlorentz and now we are really using opaque token id! ;)

(mentioning it as our first discussion about those started in the indexer a long time ago, and I did not completely grasp what you said at the time; now I understand ;)

ardumont updated this revision to Diff 13100. Aug 6 2020, 10:22 AM

Fix multiple docstring issues

ardumont planned changes to this revision. Edited Aug 6 2020, 10:23 AM

still wip (still 4 tests to fix, they are using the old apis, need to migrate them).

Build has FAILED

Patch application report for D3718 (id=13100)

Rebasing onto 62d73ed83d...

Current branch diff-target is up to date.
Changes applied before test
commit a04ebc7253d925887a6c80b77f3a07c169c4815e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Aug 6 09:48:50 2020 +0200

    textual-indexers: Migrate to partition index instead of range
    
    This also fixes mistyped codes following the migration to storage 0.12.0. This
    cannot be untangled from the partition migration though.
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/31/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/31/console

ardumont updated this revision to Diff 13108. Aug 6 2020, 11:27 AM

Fix and simplify tests

Build is green

Patch application report for D3718 (id=13108)

Rebasing onto 62d73ed83d...

Current branch diff-target is up to date.
Changes applied before test
commit 3aedf90c9cb9c4f5be4d25dde1e03456045c9128
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Aug 6 09:48:50 2020 +0200

    textual-indexers: Migrate to partition index instead of range
    
    This also fixes mistyped codes following the migration to storage 0.12.0. This
    cannot be untangled from the partition migration though.
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/32/ for more details.

ardumont added inline comments. Aug 6 2020, 11:31 AM
swh/indexer/storage/__init__.py
178

"here" as in the docstring is here.

ardumont retitled this revision from wip: "text"-indexers: Migrate to partition index instead of range to "text"-indexers: Migrate to partition index instead of range. Aug 6 2020, 11:32 AM
ardumont edited the summary of this revision.
This revision was not accepted when it landed; it landed in state Needs Review. Aug 6 2020, 1:08 PM
This revision was automatically updated to reflect the committed changes.