
rm ctags mocks + add ctags to idx db + fix doc.
Closed, Public

Authored by vlorentz on Dec 5 2018, 5:05 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
ctags-mock
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2893
Build 3659: tox-on-jenkins (Jenkins)
Build 3658: arc lint + arc unit

Event Timeline

vlorentz created this revision.
  • Remove useless var
  • Update CommonContentIndexerTest to work with non-mocked storage.
swh/indexer/storage/in_memory.py
170

This is not pointless.
This is an implementation detail of the indexer storage.
I expected the multiple ctags implementations (universal, exuberant, etc.) to be idempotent in their computations (and still do).

So in the indexer storage, the function that adds those data simply ignores the conflicting data (which should be exactly the same as before). In the end, only read operations are expected when we pass over the same content again.

Why would we pass over the same content again, you might ask?
Because not so long ago, the indexers were a pipeline, so adding a new indexer would have triggered exactly that: the orchestrator would have broadcast the same contents again to the new indexer, and possibly to the other indexers as well.

As it's an implementation detail, in theory you can implement this however you wish here, as long as the tests pass ;)
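
For illustration, here is a minimal sketch of such a conflict-ignoring add in an in-memory indexer storage. This is not the actual swh.indexer code; the class, method, and field names are assumptions, and the key simply mirrors the columns of the unique index quoted later in this thread:

    # Illustrative sketch only -- not the actual swh.indexer implementation.
    class InMemoryCtagsStorage:
        def __init__(self):
            # key mirrors the unique index columns on content_ctags
            self._rows = {}

        def content_ctags_add(self, ctags):
            for row in ctags:
                key = (row['id'], row['name'], row['kind'],
                       row['line'], row['lang'],
                       row['indexer_configuration_id'])
                # Conflicting rows are assumed identical (idempotent tools),
                # so duplicates are simply skipped -- no write happens.
                if key not in self._rows:
                    self._rows[key] = row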

vlorentz added inline comments.
swh/indexer/storage/in_memory.py
170

I expected the multiple ctags implementations (universal, exuberant, etc.) to be idempotent in their computations (and still do).

ctags implementations are registered as tools, and rows from different tools do not conflict with each other:

    create unique index on content_ctags(id, hash_sha1(name), kind, line, lang, indexer_configuration_id);
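
For illustration, a tiny usage sketch of the hypothetical InMemoryCtagsStorage above, showing why rows from different tools do not conflict: the tool id is part of the key, mirroring that index (all names and ids here are assumptions):

    # Sketch only: two tools tagging the same content yield two distinct rows,
    # because indexer_configuration_id is part of the key.
    store = InMemoryCtagsStorage()
    store.content_ctags_add([
        {'id': 'sha1:abc', 'name': 'main', 'kind': 'function',
         'line': 12, 'lang': 'C', 'indexer_configuration_id': 1},  # e.g. universal-ctags (hypothetical id)
        {'id': 'sha1:abc', 'name': 'main', 'kind': 'function',
         'line': 12, 'lang': 'C', 'indexer_configuration_id': 2},  # e.g. exuberant-ctags (hypothetical id)
    ])
    assert len(store._rows) == 2  # no conflict between different tools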

swh/indexer/storage/in_memory.py
170

as tools, and rows from different tools do not conflict with each other

Yes, I did that.
I should have avoided the reference to multiple implementations; that's noisy. I mentioned both because I tested both (or more?), and both gave me the same result for the same input (independently of each other).

What I meant was: for the same tool and the same content, the computed data is the same.
So the SQL insertion function will simply drop the conflicting data. There is no merge, since there is supposed to be no divergent new data.

The other supposed gain here is that there are no write operations with this approach, so it's supposedly faster (we'd need metrics to confirm that ;), compared to what you proposed, which would always write.
Like I said early on, it's an implementation detail.

Hoping this is clearer.
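
For illustration, a minimal sketch of the two approaches being compared (hypothetical helpers, not the project's API): skip-on-conflict performs no write for duplicate rows, whereas always-overwrite writes unconditionally:

    # Hypothetical helpers, for comparison only.
    def add_skip_on_conflict(store, key, row):
        # No write when the row already exists (data assumed identical).
        if key not in store:
            store[key] = row

    def add_always_overwrite(store, key, row):
        # Unconditional write, even when the new row equals the old one.
        store[key] = row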

This revision is now accepted and ready to land. Dec 6 2018, 9:24 PM
This revision was automatically updated to reflect the committed changes.