scheduler: tasks archival: Test run on test elasticsearch cluster
Closed, Migrated

Authored By

	ardumont
	Apr 12 2018, 9:30 AM

Description

As the title says, experiment on @ftigeot's elasticsearch cluster to try and archive tasks.
Then report newly found errors back here and fix the code accordingly.

This is currently running and has raised errors that were not foreseen during the development phase.

Note:

The tasks are the real ones (db: softwareheritage-scheduler)
The tryout run does not delete the archived/indexed tasks.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T723 General improvements to the scheduler
Migrated	gitlab-migration	T724 Move scheduler task run logging to a more permanent location
Migrated	gitlab-migration	T986 Scheduler: Automate completed oneshot or disabled recurring tasks archival
Migrated	gitlab-migration	T1023 scheduler: tasks archival: Test run on test elasticsearch cluster

Event Timeline

ardumont changed the task status from Open to Work in Progress. Apr 12 2018, 9:30 AM

ardumont triaged this task as Normal priority.

ardumont created this task.

The road so far:

Connection timeout error. The CLI execution has been fixed (rDSCH962fd8b55e29) to cope with this and continue (it broke and stopped prior to commit). As far as I understood, that resulted in corrupted elasticsearch shards on the ES server though (@ftigeot already fixed those).

Illegal argument exception: a quick analysis suggests the task's arguments field (args, kwargs) is too big (for the moment, that would be the debian loader tasks). The solution seems to be transforming the field to ease indexing. I need to dig further.

...
ERROR:swh.scheduler.cli.archive:Error during {'index': {'_id': 'tDy4uGIBYF8D_5oo5zw7', '_type': 'task', 'status': 400, '_index': 'swh-tasks-2017-11', 'error': {'type': 'illegal_argument_exception', 'reason': 'Limit of total fields [1000] in index [swh-tasks-2017-11] has been exceeded'}}} indexation. Skipping.
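
As an illustration of the "log and skip" behaviour visible in the log line above, here is a minimal sketch of how per-item errors in a bulk response could be separated from successes. The item shape is the one shown in the log; the function name `split_bulk_items` and the `_id` of the successful item are made up for the example, and this is not the actual swh.scheduler code.

```python
import logging

logger = logging.getLogger(__name__)


def split_bulk_items(items):
    """Split bulk response items into (succeeded, failed) lists.

    Hypothetical helper: each item is a dict keyed by its action name
    ('index' in the log above) whose value holds the per-document result.
    """
    succeeded, failed = [], []
    for item in items:
        for action, result in item.items():
            if "error" in result:
                # Same spirit as the log above: report the failure and move on.
                logger.error(
                    "Error during %s of %s: %s. Skipping.",
                    action,
                    result.get("_id"),
                    result["error"].get("reason"),
                )
                failed.append(result)
            else:
                succeeded.append(result)
    return succeeded, failed


items = [
    # Hypothetical successful item.
    {"index": {"_id": "example-ok", "_index": "swh-tasks-2017-11", "status": 201}},
    # Failing item, copied from the log line.
    {"index": {
        "_id": "tDy4uGIBYF8D_5oo5zw7",
        "_type": "task",
        "status": 400,
        "_index": "swh-tasks-2017-11",
        "error": {
            "type": "illegal_argument_exception",
            "reason": "Limit of total fields [1000] in index "
                      "[swh-tasks-2017-11] has been exceeded",
        },
    }},
]
succeeded, failed = split_bulk_items(items)
```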

ardumont updated the task description. (Show Details) Apr 12 2018, 9:51 AM

ardumont updated the task description. (Show Details)

Shard corruption issues were caused by manually restarting one of the cluster nodes during a heavy indexing period. Nothing unexpected.
Now that mmap(2) is no longer used by this particular node, the risk of shard corruption should also be lower.

ardumont mentioned this in rDSCH547eb89a70e5: swh.scheduler.cli: Add a bulk index flag to separate read from index. Apr 13 2018, 2:55 PM

ardumont mentioned this in rDSCHf4587a3ad563: data/template: Do not index the arguments field (it's in _source).

Shard corruption issues were caused by manually restarting one of the cluster nodes during a heavy indexing period. Nothing unexpected.

Right! So I misunderstood something. Thanks for correcting me ;)

The issues encountered so far seem to have been fixed:

connection timeout error

Fixed by reducing the number of documents sent per bulk index request (it defaulted to 1000 at first and is now down to 100).
That has been enough so far.

Another option would have been to increase the timeout on a per-request basis (it defaults to 10s). That has not been necessary so far.
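
The batching idea above can be sketched as follows. This is only an illustration under assumptions: `send_batch` is a hypothetical callable standing in for one bulk request, and the builtin `TimeoutError` stands in for whatever timeout exception the client library raises. The batch size of 100 is the value mentioned above.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_batches(tasks, send_batch, size=100):
    """Send tasks in small batches; on timeout, skip the batch and continue.

    `send_batch` is a hypothetical callable performing one bulk request.
    Returns (number of tasks indexed, number of tasks skipped).
    """
    indexed = skipped = 0
    for batch in batched(tasks, size):
        try:
            send_batch(batch)
            indexed += len(batch)
        except TimeoutError:
            # Do not abort the whole run because one request timed out.
            skipped += len(batch)
    return indexed, skipped


def flaky_send(batch):
    """Fake sender for the demo: the batch starting at task 100 times out."""
    if batch[0] == 100:
        raise TimeoutError("simulated connection timeout")


indexed, skipped = index_in_batches(range(250), flaky_send, size=100)
```

Smaller batches keep each request short enough to finish within the default timeout, at the cost of more round trips.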

illegal argument exception... Limit of total fields [1000] in index [swh-tasks-2017-11] has been exceeded

Fixed by excluding the "arguments" field from the index.
Its content varied too wildly between tasks. That information is still present in the '_source' field though, meaning we can still retrieve it.

Another option was tested, increasing the total fields limit from its default of 1000 to 10000, but that limit was exceeded as well. [1]

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings
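
To make the field limit problem more concrete, here is a small sketch that counts the distinct dotted field paths a set of documents would contribute to a dynamically built mapping. The sample tasks are invented for the example (only the `arguments.args` / `arguments.kwargs` shape comes from this task), and the flattening is a simplification of what elasticsearch actually does.

```python
def field_paths(value, prefix=""):
    """Return the set of dotted leaf paths found in a nested dict.

    Simplification: lists and scalars count as a single leaf field,
    and empty dicts contribute no field at all.
    """
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= field_paths(child, child_prefix)
        return paths
    return {prefix}


# Invented sample tasks: kwargs keys differ from one task to the next.
tasks = [
    {"arguments": {"args": ["https://github.com/kilovolt42/.dotfiles"],
                   "kwargs": {}}},
    {"arguments": {"args": [],
                   "kwargs": {"origin": {"type": "deb", "name": "pkg-a"}}}},
    {"arguments": {"args": [],
                   "kwargs": {"origin": {"type": "deb"},
                              "pkg-a": {"version": "1.0"}}}},
]

all_paths = set()
for task in tasks:
    all_paths |= field_paths(task)

print(len(all_paths), sorted(all_paths))
```

When keys inside kwargs depend on the data, every new key adds a path, so the total can grow without bound as more tasks are indexed, whatever the limit is set to.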

In T1023#18979, @ardumont wrote:

illegal argument exception... Limit of total fields [1000] in index [swh-tasks-2017-11] has been exceeded

Fixed by excluding the "arguments" field from the index.
Its content varied too wildly between tasks. That information is still present in the '_source' field though, meaning we can still retrieve it.

Another option was tested, increasing the total fields limit from its default of 1000 to 10000, but that limit was exceeded as well. [1]

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings

Can we query tasks by arguments if the fields are not indexed?

Can we query tasks by arguments if the fields are not indexed?

Probably not. I'll check.

If that's a prerequisite (which sounds highly probable), then another solution is in order.

Can we query tasks by arguments if the fields are not indexed?

Probably not. I'll check.

I talked poppycock: I did not remove the data from the index.
I disabled dynamic indexing for the "args" and "kwargs" sub-fields of the "arguments" field (the shape is {"arguments": {"args": ..., "kwargs": ...}}). [1] [2] [3]
So they should still be in the index and thus queryable.
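
For readers who do not want to open the template, here is a hedged sketch of what such a mapping fragment might look like, written as a Python dict. It is not copied from elastic-template.json: the types chosen ("nested" for args, "object" for kwargs) are inferred from the queries discussed in this task. One caveat worth checking against the object documentation [3]: with "dynamic" set to false, newly seen sub-fields are kept in '_source' but are not added to the mapping, so whether they remain searchable should be verified rather than assumed.

```python
import json

# Assumed mapping fragment, NOT the actual content of elastic-template.json.
arguments_mapping = {
    "arguments": {
        "type": "object",
        "properties": {
            # args is a list; "nested" is inferred from the nested query
            # tried in this task.
            "args": {"type": "nested", "dynamic": False},
            # kwargs is an arbitrary dict; new keys are not added to the
            # mapping when dynamic is false.
            "kwargs": {"type": "object", "dynamic": False},
        },
    }
}

print(json.dumps(arguments_mapping, indent=2))
```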

Now, I need to look into how we can query those fields to match the existing way of doing it in postgres.
For nested documents, see [4].

Example:

{
  "query": {
    "nested": {
      "path": "arguments.args",
      "query": {
        "match": {
          "arguments.args.0": "https://github.com/kilovolt42/.dotfiles"
        }
      }
    }
  }
}
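
When issuing this kind of query from Python, a tiny builder keeps the nested path and the field name consistent. This is only a sketch: `nested_match` is a hypothetical helper, and it reproduces the query body above without saying anything about whether the query returns hits.

```python
import json


def nested_match(path, field, value):
    """Build a nested 'match' query body (hypothetical helper)."""
    if not field.startswith(path + "."):
        raise ValueError(f"{field!r} is not located under path {path!r}")
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match": {field: value}},
            }
        }
    }


body = nested_match(
    "arguments.args",
    "arguments.args.0",
    "https://github.com/kilovolt42/.dotfiles",
)
print(json.dumps(body, indent=2))
```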

[1] https://forge.softwareheritage.org/source/swh-scheduler/browse/master/data/elastic-template.json;f4587a3ad5637418259e85d26efcbc9742993659$21-28
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#nested-params
[3] https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html#object-params
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html#nested-inner-hits

Now, I need to look into how we can query those fields to match the existing way of doing it in postgres.

For the remaining "arguments.kwargs" (an object type field so far), I have not found the correct way to query it yet.
I'll do some more tests.

In the meantime, on a hunch, I tried, as discussed earlier, to make it a nested type instead (and index its content).
Now, when trying to query it as demonstrated earlier, the query either fails (wrong syntax) or simply returns no data (even though there is data ;).

No result query (index: swh-tasks-2017-11):

{
  "query": {
    "nested": {
      "path": "arguments.kwargs",
      "query": {
        "match": {
          "arguments.kwargs.origin.type": "deb"
        }
      }
    }
  }
}

I'm pondering whether we should not simply stringify (aka json.dumps) those fields (well, at least "arguments.kwargs").
The query should then be simpler to execute (text query).
The trade-off is that we might get more false positives.

I'm pondering whether we should not simply stringify (aka json.dumps) those fields (well, at least "arguments.kwargs").

Reading the object type documentation again [1], I am under the impression we have no choice there.

Everybody says 'yeah, schemaless'... In effect, it's all advertising.

All the examples presented in the elasticsearch documentation use structured data.
Even the dynamic ones (which anyway amounts to: I'll use your data to determine the structure the first time, then discard whatever does not fit the mapping after that...).

So, if you don't already have a structure, you are kind of left by the wayside when it comes to querying.
And that's our case here, as the arguments.kwargs field is arbitrary.

It's not the case for arguments.args, which really is a list, so we are fine there.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html

There: by changing arguments.kwargs to the text type in the template
and dumping the JSON as a string at indexing time, we can now query with something like:

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "arguments.kwargs": "mediawiki-extensions"
        }
      }
    }
  }
}
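
The stringification step described above could look roughly like the sketch below. The function names and the sample task are hypothetical (the real change is in rDSCH8124229642c3), and `sort_keys=True` is an extra choice made here so that equal kwargs always produce the same string.

```python
import json


def stringify_kwargs(task):
    """Return a copy of the task with arguments.kwargs dumped as JSON text.

    Hypothetical helper; the input task is left untouched.
    """
    doc = dict(task)
    arguments = dict(doc.get("arguments", {}))
    arguments["kwargs"] = json.dumps(arguments.get("kwargs", {}), sort_keys=True)
    doc["arguments"] = arguments
    return doc


def kwargs_match_query(text):
    """Build the full-text query shown above for the given search text."""
    return {"query": {"bool": {"must": {"match": {"arguments.kwargs": text}}}}}


# Invented sample task, using the shape discussed in this task.
task = {"id": 1, "arguments": {"args": [], "kwargs": {"origin": {"type": "deb"}}}}
doc = stringify_kwargs(task)
query = kwargs_match_query("mediawiki-extensions")
print(doc["arguments"]["kwargs"])
```

Since the whole dict becomes one text field, a match no longer distinguishes keys from values, which is where the false positives mentioned earlier can come from.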

ardumont mentioned this in rDSCH8124229642c3: swh.scheduler.cli.archive: Index arguments.kwargs as text. Apr 18 2018, 12:35 PM

ardumont closed this task as Resolved. May 25 2018, 8:06 AM

This task has been migrated to GitLab.
