Page MenuHomeSoftware Heritage

storage*: content_get_partition(...) -> PagedResult[Content]
ClosedPublic

Authored by ardumont on Aug 5 2020, 4:11 PM.

Details

Summary

This reworked the content_get_partition so it returns Paged Result of Content.

This also inlines the content_get_range algo within the content_get_partition since it's no longer exposed as a dedicated endpoint.

And now the indexers can be reworked properly ;)

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 14329
Build 22034: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 22033: arc lint + arc unit

Event Timeline

swh/storage/cassandra/storage.py
212

I found this not good.
I highly prefer taking one more object from the backend and take its "tok" id instead.
We assume too much by doing this...

Build is green

Patch application report for D3713 (id=13085)

Could not rebase; Attempt merge onto 334a016e44...

Updating 334a016e..a02a5263
Fast-forward
 swh/storage/cassandra/storage.py    |  32 ++++----
 swh/storage/in_memory.py            |  34 ++++++---
 swh/storage/interface.py            |  52 +++----------
 swh/storage/storage.py              |  56 ++++++--------
 swh/storage/tests/test_cassandra.py |  24 ------
 swh/storage/tests/test_storage.py   | 147 ++++++++----------------------------
 6 files changed, 104 insertions(+), 241 deletions(-)
Changes applied before test
commit a02a526349f3d6f91e315c898db43aa7f554d139
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Aug 5 16:10:52 2020 +0200

    storage*: content_get_partition(...) -> PagedResult[Content]
    
    Related to T645

commit b48d834984f776d63225e254e21d9e0619db48e1
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Aug 5 15:15:01 2020 +0200

    storage*: Drop deprecated content_get_range endpoint
    
    Related to T645

commit 472266327843b873d0541e0aa9b948d907521965
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Aug 5 12:56:51 2020 +0200

    storage*: object_find_by_sha1_git: Type remaining existing endpoints
    
    Related to T2517

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/698/ for more details.

swh/storage/cassandra/storage.py
212

I found this not good.

was talking about this:

next_page_token = str(last_id + 1)

Adapt cassandra implementation to avoid computing ourselves the next-page-token
(out of the tok id from cass)

anlambert added a subscriber: anlambert.

Looks good to me.

swh/storage/cassandra/storage.py
212

I am not a cassandra expert so I do not know which approach is better for getting the next token. At least no elements will be missed with the current one.

swh/storage/interface.py
198

We usually add a line break after Returns: in our docstrings.

This revision is now accepted and ready to land.Aug 5 2020, 4:29 PM
swh/storage/cassandra/storage.py
217–219

You should pop the tok value in the if body only

swh/storage/interface.py
198

oh yeah, my editor, when M-q (to format) removes it and i forgot to fix it.
thanks.

swh/storage/tests/test_storage.py
547

I remove the data as otherwise the comparison below will explode.
Content returned do not have their data field filled.

557–558

i'm missing an assertion on the length here.

swh/storage/cassandra/storage.py
217–219

I can't have for the Content creation line 212 so i'm popping it all the time.
We really need its value once though (for the pagination token).

Build has FAILED

Patch application report for D3713 (id=13088)

Rebasing onto b48d834984...

Current branch diff-target is up to date.
Changes applied before test
commit b36f2fa4805c2ad20427922213cad71e752b2b5b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Aug 5 16:10:52 2020 +0200

    storage*: content_get_partition(...) -> PagedResult[Content]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/699/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/699/console

swh/storage/cassandra/storage.py
217–219

Oh right, did not notice

swh/storage/cassandra/storage.py
217–219

/me mutters something about reading himself prior to submit

"I can't put the pop tok instruction in the body..."

swh/storage/cassandra/storage.py
212

oh well, apparently, tests do no like this version so i'll revert.

swh/storage/cassandra/storage.py
217–219

heads up also, it should be:

next_page_token = str(last_id)

(why doesn't mypy say anything...?)

  • cassandra implem: revert to first implementation
  • tests: rename one test so it's prefixed the same so we can run it alongside the other content_get_partition tests

Build is green

Patch application report for D3713 (id=13089)

Rebasing onto b48d834984...

Current branch diff-target is up to date.
Changes applied before test
commit 0d72ea229eccee997cf1eb31f69338afa210120b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Aug 5 16:10:52 2020 +0200

    storage*: content_get_partition(...) -> PagedResult[Content]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/700/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/storage.py
212

tok+1 is not good, it would skip some contents in case of collision on tok. (And collisions do happen, it's a noncryptographic hash on 32 bits.)

You should instead use the token of the next token. It will return the same contents twice though; but that's less of an issue. Regardless, that's something to be fixed in another diff, so keep it like this for now and open a task