Page MenuHomeSoftware Heritage

Adapt content indexer to allow journal objects processing
ClosedPublic

Authored by ardumont on Jul 20 2022, 7:18 PM.

Details

Summary

This only adds the content mimetype indexer for now.
Some extra mocking work is needed to test the fossology license one so it will go in another diff.

Note that it also refactors the tests dataset to stop hard-coding wrong ids and use proper hash from our model.

In another extra diff after that, we'll drop obsolete parts and refactor to simplify existing base code.

Related to T4273

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30507
Build 47696: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 47695: arc lint + arc unit

Unit TestsFailed

Event Timeline

Build has FAILED

Patch application report for D8147 (id=29415)

Rebasing onto fa67b73d6a...

Current branch diff-target is up to date.
Changes applied before test
commit 43ab37f7ad38ed4d07f5ab74643a8ba142be34b9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    wip: Make content indexer process data out of the journal
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/364/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/364/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 20 2022, 7:23 PM
Harbormaster failed remote builds in B30492: Diff 29415!

Build has FAILED

Patch application report for D8147 (id=29416)

Rebasing onto fa67b73d6a...

Current branch diff-target is up to date.
Changes applied before test
commit 86162040212ba8193eb537b9cfa714815f189afe
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    wip: Make content indexer process data out of the journal
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/365/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/365/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 20 2022, 7:31 PM
Harbormaster failed remote builds in B30493: Diff 29416!
swh/indexer/indexer.py
289

raise NotImplementedError

323

Write a TODO: use get_batch

Build is green

Patch application report for D8147 (id=29423)

Rebasing onto d0d9346b99...

First, rewinding head to replay your work on top of it...
Applying: Adapt content indexer to allow journal objects processing
Changes applied before test
commit 1cf9b6a4836cecca882c52093a38ffedbe05c5b2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/367/ for more details.

ardumont retitled this revision from wip: Make content indexer process data out of the journal to wip: Adapt content indexer to allow journal objects processing.Jul 21 2022, 11:56 AM

Build has FAILED

Patch application report for D8147 (id=29429)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 410c2188eb981f6114c1d14b63d8250a29b5784a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/370/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/370/console

Build is green

Patch application report for D8147 (id=29430)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 9c16a32833f72a5baac4fd06ecf6bc38eddb92f2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/371/ for more details.

swh/indexer/cli.py
218

you'll need to update that

swh/indexer/cli.py
218

yes, thx, it's done locally.
ongoing tests showed it to me ;)

Add tests

TODO: Unstuck regresssions in tests (for test reason, not for runtime change reasons)

Build has FAILED

Patch application report for D8147 (id=29444)

Could not rebase; Attempt merge onto 7cf9cf59dd...

Updating 7cf9cf5..0ac477b
Fast-forward
 swh/indexer/cli.py                          |  26 +-
 swh/indexer/indexer.py                      |  58 +++-
 swh/indexer/tests/conftest.py               |  15 +-
 swh/indexer/tests/test_cli.py               |  91 +++++-
 swh/indexer/tests/test_ctags.py             |  17 +-
 swh/indexer/tests/test_fossology_license.py |  20 +-
 swh/indexer/tests/test_metadata.py          |   6 +-
 swh/indexer/tests/test_mimetype.py          |  50 ++--
 swh/indexer/tests/utils.py                  | 427 ++++++++++++++--------------
 9 files changed, 440 insertions(+), 270 deletions(-)
Changes applied before test
commit 0ac477be06e8ec659a5ac501385b7cf933598e6a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit 9c16a32833f72a5baac4fd06ecf6bc38eddb92f2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/372/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/372/console

Add missing requirements which fails some tests on objstorage

Build has FAILED

Patch application report for D8147 (id=29447)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 6736fcab0aad2b87b900fa3188f9c4bedd6857cb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit c7bdb5b4ec17ad0cdf242119c9c5d84bbc20b4ea
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/373/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/373/console

Fix tests (i've kept some setup in the fixture 'cause that made the tests fail and i
don't want to dig in just right now)

Build has FAILED

Patch application report for D8147 (id=29448)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 58c257a6c936e8b61b224838ab24610f5acb3caf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit f069be81d76a0bd39684d6d06c415ddf4a9047c4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/374/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/374/console

swh/indexer/tests/utils.py
645–655 ↗(On Diff #29448)

There is something fishy in that setup regarding the sha1 == sha1_git ^.

I initially dropped that part as well to simplify to something consistent:

storage.content_add([Content.from_data(data) for data in OBJ_STORAGE_DATA.values()])

But then metadata tests fail after that. I don't want to dig but i'm in that rabbit hole currently to fix the tests...
It's probably related to the "discrepancy" with dir_entries using sha1-git as key id but objstorage used sha1 as id...
(directory metadata)...

@vlorentz heads up ^, any idea?

Fix inconsistency in test dataset

ardumont added a project: Indexer.
ardumont retitled this revision from wip: Adapt content indexer to allow journal objects processing to Adapt content indexer to allow journal objects processing.

Keep fossology license indexer out of the diff for now.
It will go in another diff.

swh/indexer/tests/test_metadata.py
100 ↗(On Diff #29450)

just to clarify the call below (target is sha1_git but the actual content id is the sha1, see below).

Build has FAILED

Patch application report for D8147 (id=29449)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 9dae71283d9254c144baf2cd4edc177754729713
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit ed30a6e1668396d53f202247169a38667d609a43
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/375/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/375/console

Build has FAILED

Patch application report for D8147 (id=29450)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit 9f7b4049165471624b88781dfb9ecce79a52fe1f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit 05cc6a62620e3b3337266ba9f1afc0a6013ef295
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/376/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/376/console

Fix libmagic discrepancy between ci and dev box

Build is green

Patch application report for D8147 (id=29451)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit b088f7f5ee48ef191cffa69285bb604fcbfdc6f6
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit 05cc6a62620e3b3337266ba9f1afc0a6013ef295
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/377/ for more details.

vlorentz added inline comments.
swh/indexer/indexer.py
43

What does this mean? If you want to write a docstring, it should mention these keys are names of Kafka topics and values are list of values of messages in that topic

swh/indexer/tests/conftest.py
80–83 ↗(On Diff #29451)
swh/indexer/tests/test_mimetype.py
39–40 ↗(On Diff #29451)

s/magic/libmagic/ otherwise it's confusing.

Could you approximate the version number where this changes?

88–89 ↗(On Diff #29451)

ditto

This revision is now accepted and ready to land.Jul 22 2022, 2:30 PM
swh/indexer/indexer.py
43

yes, well, i guess it should have been documented earlier.
I just wanted to stop seein' yellow warning when i saw that code but did not guess immediately what that was and then forget about it.
i"ll fix it with what you just said:

"""Typed objects whose keys are names of Kafka topics and values are list of values
of messages in that topic."""
swh/indexer/tests/conftest.py
80–83 ↗(On Diff #29451)

thx

and lol, i can even do better and simply drop that now useless bit.
It's a previous test when i used a pathslicing objstorage because i thought the memory objstorage was not doing ok.

swh/indexer/tests/test_mimetype.py
39–40 ↗(On Diff #29451)

Could you approximate the version number where this changes?

is that necessary?

if so, i guess i can explicit that buster version (1:5.35-4+deb10u2) and the bullseye version (1:5.39-3), would that be enough?

swh/indexer/tests/utils.py
645–655 ↗(On Diff #29448)

confusion comes from the use of hard-coded ids.
For directory target, it's sha1git, for content, ids are sha1.

We had some historic undocumented hack here to make sure both sha1-git and sha1 were the same in the content so tests were ok.
I've simplified this mess with something consistent and clearer (to the expanse of a bigger diff...)

swh/indexer/tests/test_mimetype.py
39–40 ↗(On Diff #29451)

Unsurprisingly, fhe initial change occurred in 11/2020 [1] to fix a debian build where test output diverged

[1] D4619

Adapt according to review and refactor a bit tests

swh/indexer/tests/test_mimetype.py
39–40 ↗(On Diff #29451)

yes

Build is green

Patch application report for D8147 (id=29452)

Rebasing onto 7cf9cf59dd...

Current branch diff-target is up to date.
Changes applied before test
commit d0b5f23300e1697435f64e703823bc4998cc9104
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 21 19:22:41 2022 +0200

    Add tests around new content-mimetype journal client indexer

commit 05cc6a62620e3b3337266ba9f1afc0a6013ef295
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 20 19:16:12 2022 +0200

    Adapt content indexer to allow journal objects processing
    
    Related to T4273

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/378/ for more details.