
Use elasticsearch aliases to simplify maintenance operations
ClosedPublic

Authored by vsellier on Mar 2 2021, 10:44 AM.

Details

Summary
  • Allow the index and alias names to be configured explicitly. This deprecates the prefix parameter.
  • Use dedicated aliases for read and write operations
  • Manage them in the initialization method

Another diff will follow to implement the initialization during server startup.
(The tests for the server part need to be added first.)

Related to T3076
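
To illustrate why separate read and write aliases simplify maintenance, here is a minimal sketch of the request bodies involved. It only builds the JSON payloads for the Elasticsearch create-index and `_aliases` endpoints and does not talk to a cluster. The function names and the index/alias names (`origin-v1`, `origin-read`, ...) are assumptions made for the example, not what this diff implements.

```python
from typing import Any, Dict


def create_index_body(read_alias: str, write_alias: str) -> Dict[str, Any]:
    """Body for creating an index with both aliases attached from the start.

    Hypothetical helper: the alias names are supplied by the caller.
    """
    return {"aliases": {read_alias: {}, write_alias: {}}}


def swap_alias_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Body for the _aliases endpoint that moves an alias to another index.

    Elasticsearch applies all actions of one _aliases request atomically,
    so clients using the alias never see it missing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Example names only; a reindex could fill origin-v2 before the swap.
    print(create_index_body("origin-read", "origin-write"))
    print(swap_alias_body("origin-read", "origin-v1", "origin-v2"))
```

With this layout, a reindexing operation can point the write alias at a new index first and move the read alias only once the new index is complete, without changing any client configuration.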

Diff Detail

Event Timeline

Build is green

Patch application report for D5179 (id=18518)

Rebasing onto 04dadef938...

Current branch diff-target is up to date.
Changes applied before test
commit a8a9409c7a2ad60015239ad7955f187146cb3e37
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Mar 1 22:08:54 2021 +0100

    Use elasticsearch aliases to simplify maintenance operations
    
    - Use dedicated aliases for read and write operations
    - manage them in the initialization method
    
    Related to T3076

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/92/ for more details.

Build is green

Patch application report for D5179 (id=18519)

Rebasing onto 04dadef938...

Current branch diff-target is up to date.
Changes applied before test
commit 7053dca24f35eddca06d66f3a31687ce3f5e88f2
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Mar 1 22:08:54 2021 +0100

    Use elasticsearch aliases to simplify maintenance operations
    
    - Allow to explicitely configure the index and aliases names. It
      depreciates the prefix parameter.
    - Use dedicated aliases for read and write operations
    - manage them in the initialization method
    
    Related to T3076

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/93/ for more details.

This revision is now accepted and ready to land.Mar 2 2021, 11:22 AM
vlorentz added a subscriber: vlorentz.

That assumes that origin is the only alias; and hopefully it won't be for long (T2073)

Instead, it shouldn't be configurable on the CLI, and the config file should have an entry with a pair of names for each index

This revision now requires changes to proceed.Mar 2 2021, 11:59 AM

Will we use different indexes for T2073?
Even with several indexes, it's not clear (to me at least) whether a single read alias over several underlying indexes would be more advantageous. It will probably depend on how the search is used from the API perspective.
Perhaps it would be more prudent to keep this diff as simple as possible and implement any possible improvements in T2073.
WDYT?

Will we use different indexes for T2073?

It should allow searching objects that aren't origins (eg. directories), so yes

Even with several indexes, it's not clear (to me at least) whether a single read alias over several underlying indexes would be more advantageous.

What does that mean? Can an alias reference multiple indexes? How does that work in terms of ids for example?

Perhaps it would be more prudent to keep this diff as simple as possible and implement any possible improvements in T2073.

We could, but I don't see how it's more prudent; allowing a config like {origin: {read: foo, write: bar}} isn't any harder than adding these CLI switches, so we might as well do it now

swh/search/cli.py
136

I don't think that rpc-server is used, by the way.
Both docker and actual production use the swh.search.api.server:make_app_from_configfile function to run.

So this raises the question of whether to keep maintaining it (it also lacks tests, by the way) or just drop it.

@vlorentz, @vsellier what do you think?

What does that mean? Can an alias reference multiple indexes? How does that work in terms of ids for example?

Yes, an alias can reference multiple indexes. If the same ids are present in several indexes, the risk is getting duplicate results when those documents match the search.
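
The duplicate-result risk can be shown without a cluster. The sketch below assumes search hits shaped like Elasticsearch hits (with `_index` and `_id` keys); the sample data and the `dedupe_hits` helper are hypothetical, and keeping the first hit per id is just one possible policy.

```python
from typing import Any, Dict, List


def dedupe_hits(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the first hit for each document id, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        if hit["_id"] not in seen:
            seen.add(hit["_id"])
            unique.append(hit)
    return unique


# Made-up hits, as a read alias covering two indexes might return them:
# the id "abc" exists in both underlying indexes, so it appears twice.
SAMPLE_HITS = [
    {"_index": "origin-v1", "_id": "abc", "_source": {"url": "https://example.org/a"}},
    {"_index": "origin-v2", "_id": "abc", "_source": {"url": "https://example.org/a"}},
    {"_index": "origin-v2", "_id": "def", "_source": {"url": "https://example.org/b"}},
]

if __name__ == "__main__":
    for hit in dedupe_hits(SAMPLE_HITS):
        print(hit["_index"], hit["_id"])
```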

We could, but I don't see how it's more prudent; allowing a config like {origin: {read: foo, write: bar}} isn't any harder than adding these CLI switches, so we might as well do it now

OK for this kind of configuration, but having only read and write aliases is not enough, as we need to know which mapping to apply to the index.
Having a configuration like {origin: {index: name, read: foo, write: bar}}, with origin as a "static" identifier, should solve the problem.

swh/search/cli.py
136

I agree, as it's used in neither the docker nor the puppetized environments

Having a configuration like {origin: {index: name, read: foo, write: bar}}, with origin as a "static" identifier, should solve the problem.

sure
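
A minimal sketch of the configuration shape agreed on above, assuming it is loaded into a plain Python dict. The keys `index`, `read` and `write` come from the discussion; the `resolve_names` helper, its error handling and the concrete names are assumptions for illustration and may differ from what the diff does.

```python
from typing import Dict, Tuple

# One entry per index type; "origin" is the static identifier that tells
# the code which mapping to apply. The names on the right are examples.
INDEXES_CONFIG: Dict[str, Dict[str, str]] = {
    "origin": {
        "index": "origin-v1",
        "read": "origin-read",
        "write": "origin-write",
    },
}


def resolve_names(
    config: Dict[str, Dict[str, str]], index_type: str
) -> Tuple[str, str, str]:
    """Return (index name, read alias, write alias) for one index type."""
    try:
        entry = config[index_type]
    except KeyError:
        raise ValueError(f"no configuration for index type {index_type!r}")
    missing = [key for key in ("index", "read", "write") if key not in entry]
    if missing:
        raise ValueError(f"{index_type!r} is missing keys: {', '.join(missing)}")
    return entry["index"], entry["read"], entry["write"]


if __name__ == "__main__":
    print(resolve_names(INDEXES_CONFIG, "origin"))
```

Adding a new searchable object type (for example directories, as mentioned for T2073) would then mean adding one more entry to the dict rather than new CLI switches.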

Configure the indexes with a Dict with an entry per index type

Build is green

Patch application report for D5179 (id=18547)

Rebasing onto e9ffac4fd1...

Current branch diff-target is up to date.
Changes applied before test
commit 7c795a603f7ac7ae43812991556c9ae574f476ce
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Mon Mar 1 22:08:54 2021 +0100

    Use elasticsearch aliases to simplify maintenance operations
    
    - Allow to explicitely configure the index and aliases names. It
      depreciates the prefix parameter.
    - Use dedicated aliases for read and write operations
    - manage them in the initialization method
    
    Related to T3076

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/100/ for more details.

This revision is now accepted and ready to land.Mar 3 2021, 10:46 AM