scheduler: tasks archival: Test run on test elasticsearch cluster
Closed, Migrated

Description

As the title says: experiment on @ftigeot's elasticsearch cluster to try and archive tasks, report newly found errors here, then fix the code accordingly.

This is currently running and has raised errors that were not foreseen during the development phase.

Note:

  • The tasks are the real ones (db: softwareheritage-scheduler)
  • The tryout run does not delete the archived/indexed tasks.

Event Timeline

ardumont changed the task status from Open to Work in Progress. (Apr 12 2018, 9:30 AM)
ardumont triaged this task as Normal priority.
ardumont created this task.

The road so far:

  • connection timeout error: the cli has been fixed (rDSCH962fd8b55e29) to cope with this and continue (before that commit, it broke and stopped). As far as I understood, that resulted in corrupted elasticsearch shards on the es server, though (@ftigeot already fixed those).
  • illegal argument exception: a quick analysis points toward the task's arguments field (args, kwargs) being too big (for the moment, that would be the debian loader tasks). The solution seems to be transforming that field to ease indexing. I need to dig further.
...
ERROR:swh.scheduler.cli.archive:Error during {'index': {'_id': 'tDy4uGIBYF8D_5oo5zw7', '_type': 'task', 'status': 400, '_index': 'swh-tasks-2017-11', 'error': {'type': 'illegal_argument_exception', 'reason': 'Limit of total fields [1000] in index [swh-tasks-2017-11] has been exceeded'}}} indexation. Skipping.
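
The "log the error and skip" behaviour visible in the line above could look roughly like the sketch below. This is only an illustration, not the actual swh.scheduler.cli code: `split_bulk_results` is a hypothetical helper, and the item shape is assumed to be the one printed in the log line.

```python
import logging

logger = logging.getLogger(__name__)


def split_bulk_results(items):
    """Separate successfully indexed ids from failed items.

    `items` is assumed to be a list of per-document results shaped like
    the one in the log line: {'index': {'_id': ..., 'status': ..., ...}}.
    """
    indexed_ids, failures = [], []
    for item in items:
        result = item["index"]
        if "error" in result or result.get("status", 200) >= 400:
            # Report the failure and carry on with the remaining items.
            logger.error("Error during %s indexation. Skipping.", item)
            failures.append(result)
        else:
            indexed_ids.append(result["_id"])
    return indexed_ids, failures
```

Only the ids returned in the first list would be candidates for later cleanup; the failed ones stay in the scheduler db.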
ardumont updated the task description.

Shard corruption issues were caused by manually restarting one of the cluster nodes during a heavy indexing period. Nothing unexpected.
Now that mmap(2) is no longer used by this particular node, the risk of shard corruption should also be lower.

Right! So I misunderstood something. Thanks for correcting me ;)

The issues so far seem to have been fixed:

  • connection timeout error

Fixed by reducing the number of documents sent per bulk index request (the default was 1000 at first, it is now down to 100).
That has been enough so far.

Another option would have been to increase the per-request timeout (which defaults to 10s). That does not seem necessary so far.
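
For illustration, cutting the stream of tasks into batches of 100 can be sketched as below. `chunked` is a hypothetical helper name, and the sketch stops short of the actual bulk request, which depends on the client code in use.

```python
from itertools import islice


def chunked(iterable, size=100):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch would then be handed to one bulk index request, so a smaller `size` means shorter requests that are less likely to hit the timeout.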

  • illegal argument exception... Limit of total fields [1000] in index [swh-tasks-2017-11] has been exceeded

Fixed by excluding the field "arguments" from the index.
Its content varies too wildly between tasks. That information is still present in the '_source' field though, meaning we can still retrieve it.

Another option was tested: increasing the limit from its default of 1000 to 10000, but that limit was also exceeded. [1]

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings
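
To get a feel for why arbitrary arguments blow through the limit, here is a rough sketch that counts the distinct dotted paths found across a set of documents. It only approximates what elasticsearch counts (the real count also includes things like multi-fields), and both function names are made up for the example.

```python
def field_paths(document, prefix=""):
    """Return the set of dotted paths for every key in a nested dict."""
    paths = set()
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.add(path)
        if isinstance(value, dict):
            # Recurse into sub-objects, which dynamic mapping also expands.
            paths |= field_paths(value, path)
    return paths


def count_dynamic_fields(documents):
    """Count the distinct paths seen across all documents."""
    seen = set()
    for document in documents:
        seen |= field_paths(document)
    return len(seen)
```

Every new key appearing in any task's kwargs adds a path, so the total grows with the variety of tasks rather than with the size of a single one.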

Can we query tasks by arguments if the fields are not indexed?

Probably not. I'll check.

If that's a prerequisite (which sounds highly probable), that means another solution is in order.

What I said was poppycock: I did not remove the data from the index.
I disabled dynamic indexing for the "args" and "kwargs" subfields of the "arguments" field (shape is {"arguments": {"args": ..., "kwargs": ...}}). [1] [2] [3]
So they should still be in the index, and thus queryable.

Now, I need to look into how we can query those fields to match the existing way of doing it in postgres.
For nested documents, see [4].

Example:

{
  "query": {
    "nested": {
      "path": "arguments.args",
      "query": {
        "match": {
          "arguments.args.0": "https://github.com/kilovolt42/.dotfiles"
        }
      }
    }
  }
}

[1] https://forge.softwareheritage.org/source/swh-scheduler/browse/master/data/elastic-template.json;f4587a3ad5637418259e85d26efcbc9742993659$21-28
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#nested-params
[3] https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html#object-params
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html#nested-inner-hits

For the remaining "arguments.kwargs" (an object-type field so far), I cannot find the correct way to query it yet.
I'll do some more tests.

In the meantime, on a hunch, I tried, as mentioned earlier, making it a nested type instead (and indexing its content).
Now, trying to query it as demoed earlier, the query either fails (wrong syntax) or simply returns no data (even though there is data ;).

Query returning no result (index: swh-tasks-2017-11):

{
  "query": {
    "nested": {
      "path": "arguments.kwargs",
      "query": {
        "match": {
          "arguments.kwargs.origin.type": "deb"
        }
      }
    }
  }
}

I'm pondering whether we should simply stringify (aka json.dumps) those fields (well, at least "arguments.kwargs").
The query should then be simpler to execute (plain text query).
The trade-off is that we might get more false positives.

Reading the object type documentation again [1], I am under the impression we have no choice there.

Everybody says 'yeah, schemaless'... In effect, it's all advertising.

All the examples in the elasticsearch documentation use structured data.
Even the dynamic ones (which amounts to: I'll use your data to determine the structure the first time, then discard whatever does not fit that mapping afterwards...).

So, if you don't already have a structure, you are kind of left on the side when it comes to querying.
And that's our case here, as the arguments.kwargs field is arbitrary.

That's not the case for arguments.args, which really is a list, so we are fine there.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html

There: I changed arguments.kwargs to the text type in the template.
With the json dumped as a string at indexing time, we can now query with something like:

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "arguments.kwargs": "mediawiki-extensions"
        }
      }
    }
  }
}
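
As a rough sketch of that last step, under the assumption that a task document is a plain dict shaped like {"arguments": {"args": [...], "kwargs": {...}}}: the two helpers below are hypothetical (not the swh-scheduler code) and only show dumping kwargs to a string before indexing and building the query shown above.

```python
import copy
import json


def stringify_kwargs(task):
    """Return a copy of `task` with arguments.kwargs dumped as a JSON string."""
    doc = copy.deepcopy(task)
    arguments = doc.get("arguments", {})
    if isinstance(arguments.get("kwargs"), dict):
        # sort_keys keeps the string stable for identical kwargs.
        arguments["kwargs"] = json.dumps(arguments["kwargs"], sort_keys=True)
    return doc


def kwargs_match_query(text):
    """Build the full-text query shown above for a given search text."""
    return {"query": {"bool": {"must": {"match": {"arguments.kwargs": text}}}}}
```

The original task is left untouched, so the scheduler db side keeps working with the structured kwargs while only the indexed copy carries the string.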