
indexer-license: Investigate timeouts
Closed, Migrated (edits locked)

Description

Only 1 worker is currently running.
With such a setup, I expect only 1 query at a time in the backend.
That's not what pg_activity -p 5434 (softwareheritage-indexer) currently shows.

In the meantime, the current worker shows the following stacktrace [1].

So my take on this is that the query (using index scans as designed) works on a range too large for it to finish in time.
What's not expected, though, is that the worker side blows up as in [1] while the query in the backend (the indexer-storage db) happily keeps running.
Thus the load on somerset keeps growing...

Maybe the following plan would be acceptable:

  • adding some @timeout on the indexer-storage's storage api (as we do in swh-storage)
  • and reworking the ranges defined in the scheduler for the fossology-license indexer (IMSMW, 100k-range tasks were created; we should reduce those ranges' size, thus increasing the number of tasks); a small sketch follows below
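
To make the second point concrete, here is a minimal sketch of the trade-off. This is illustrative only: split_range and the integer bounds are made up, and the real scheduler tasks carry sha1 range bounds, but the principle is the same.

  def split_range(start: int, end: int, nb_subranges: int):
      # Split [start, end) into nb_subranges contiguous sub-ranges of
      # (roughly) equal size. Hypothetical helper, not scheduler code.
      assert 0 < nb_subranges <= end - start
      step = (end - start) // nb_subranges
      bounds = [start + i * step for i in range(nb_subranges)] + [end]
      return list(zip(bounds[:-1], bounds[1:]))

  # E.g. turning one 100k-wide range into ten 10k-wide ones:
  # split_range(0, 100_000, 10)
  # -> [(0, 10000), (10000, 20000), ..., (90000, 100000)]

Each task then holds a much smaller range, so the corresponding backend query scans fewer rows and stands a better chance of finishing before any timeout.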

[1]

Jun 07 06:17:53 worker08 python3[123583]: [2019-06-07 06:17:53,918: INFO/MainProcess] Received task: swh.indexer.tasks.ContentRangeFossologyLicense[452abd0b-8db8-465c-9a2d-eb84d3ed90e5]
Jun 07 07:17:57 worker08 python3[59331]: [2019-06-07 07:17:57,176: ERROR/ForkPoolWorker-3] Problem when computing metadata.
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 516, in run
                                             n=self.config['write_batch_size']):
                                           File "/usr/lib/python3/dist-packages/swh/core/utils.py", line 48, in grouper
                                             for _data in itertools.zip_longest(*args, fillvalue=stop_value):
                                           File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 479, in _index_with_skipping_already_done
                                             indexed_page = self.indexed_contents_in_range(start, end)
                                           File "/usr/lib/python3/dist-packages/swh/indexer/fossology_license.py", line 172, in indexed_contents_in_range
                                             start, end, self.tool['id'])
                                           File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 133, in meth_
                                             return self.post(meth._endpoint_path, post_data)
                                           File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 198, in post
                                             return self._decode_response(response)
                                           File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 235, in _decode_response
                                             response.content,
                                         swh.core.api.RemoteException: Unexpected status code for API request: 504 (b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body bgcolor="white">\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx/1.10.3</center>\r\n</body>\r\n</html>\r\n')

Event Timeline

ardumont triaged this task as Normal priority. Jun 7 2019, 10:27 AM
ardumont created this task.

In the meantime, I've stopped those indexers as this impacts others (I see transactions piling up).

Note: ... comment popped off the stack ... (-> it had apparently been there a while)

adding some @timeout on the indexer-storage's storage api (as we do in swh-storage)

No; it's flaky, not configurable, and depends on the db host...

and reworking the ranges defined in the scheduler for the fossology-license indexer (IMSMW, 100k-range tasks were created; we should reduce those ranges' size, thus increasing the number of tasks)

That seems like a workaround.


I'm wondering whether yet another scheme for the indexers would not be better:
only triggering tasks to compute the metadata when something actually requires it.

Something along the lines of the vault.
That would simplify a lot of the current issues...
Then again...
the issue here is that this error should not happen, I think.


This is a common need.
For example, @vlorentz made some modifications to the indexer storage client endpoints to split the input data into smaller chunks so that the issue does not appear (D1754, D1750)...
It's not a complete solution though, as this needs to be done for all endpoints... (and it should not live on the client side either, or else we'll have to replicate it for every client).
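
For illustration only (I'm not reproducing D1754/D1750 here; the wrapper name and batch size are made up), the per-endpoint, client-side version of that chunking looks roughly like the following, and it has to be repeated for every endpoint and caller that may pass a large list:

  def _chunks(seq, n):
      # Yield successive slices of at most n items (same role as
      # swh.core.utils.grouper in the traceback above).
      for i in range(0, len(seq), n):
          yield seq[i:i + n]

  def content_mimetype_get_chunked(client, ids, batch_size=1000):
      # Hypothetical wrapper: send one moderately sized request per chunk
      # instead of a single huge one, then stitch the results together.
      results = []
      for batch in _chunks(list(ids), batch_size):
          results.extend(client.content_mimetype_get(batch))
      return results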

Discussing another, similar issue (storage clients) with @douardda, I agree it should not be the concern of the (indexer|objstorage|storage|...) clients...
As an (indexer storage|...) client, what we care about is calling it to store/read data.
And the result of that operation should then be "ok, it's done, here is what I integrated" (or some such).

We can investigate 2 things:

  • check the postgresql options to kill queries that take too long (only on the indexer db right now) -> and find some way to report those
  • push the proxy client storage idea (started in T1389 for the storage, currently WIP) up to the indexer-storage.

I'm convinced that 2. is the way forward now.

We can investigate 2 things:

  • check the postgresql options to kill queries that take too long (only on the indexer db right now) -> and find some way to report those

That's the postgresql statement_timeout variable that we set for some methods on the storage backends.

  • push the proxy client storage idea (started in T1389 for the storage, currently WIP) up to the indexer-storage.

I'm convinced that 2. is the way forward now.

I'm not sure what that means, but if that means sending the data to the backend for storage in smaller chunks, then probably, yes.

That's the postgresql statement_timeout variable that we set for some methods on the storage backends.

oh ok.

I thought this was something more static (set in the db's configuration files).

Thanks for the heads up.
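
For reference, a minimal sketch of how that works at the query level, assuming a plain psycopg2 connection (the DSN, query and timeout value are placeholders). SET LOCAL scopes the timeout to the current transaction, and a query exceeding it is cancelled server-side instead of lingering after the client has given up:

  import psycopg2

  def run_query_with_timeout(dsn, query, params, timeout='300s'):
      # SET LOCAL resets when the transaction ends, so other queries on
      # the same connection keep the server's default statement_timeout.
      # A query exceeding the timeout raises psycopg2.errors.QueryCanceled.
      with psycopg2.connect(dsn) as conn:
          with conn.cursor() as cur:
              cur.execute('SET LOCAL statement_timeout = %s', (timeout,))
              cur.execute(query, params)
              return cur.fetchall()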

I'm not sure what that means, but if that means sending the data to the backend for storage in smaller chunks, then probably, yes.

Yes, smaller chunks, but they are dealt with transparently in some (other?) client storage implementation.

I mean some kind of client storage which transparently slices the input into something manageable by the *storage server.

So we still have client code which looks like:

storage.content_mimetype_get([list_gazillion_revision_ids])  # or some other indexer storage api, nothing comes to mind here

But underneath, the client storage splits that gazillion-element list into smaller chunks so that the storage backend can actually finish its queries.
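
A rough sketch of that idea follows. The class name, batch size and method set are made up; content_mimetype_get is just the example endpoint from above, and a real implementation would need per-endpoint knowledge of how to split inputs and merge results (reads vs writes):

  class ChunkingIndexerStorageProxy:
      # Wraps a real indexer-storage client; callers keep writing
      # storage.content_mimetype_get(gazillion_ids) and the proxy is the
      # only place that knows about batch sizes.

      def __init__(self, storage, batch_size=1000):
          self.storage = storage
          self.batch_size = batch_size

      def _chunks(self, seq):
          for i in range(0, len(seq), self.batch_size):
              yield seq[i:i + self.batch_size]

      def content_mimetype_get(self, ids):
          # Split the huge list into backend-sized queries and concatenate
          # the per-chunk results.
          results = []
          for batch in self._chunks(list(ids)):
              results.extend(self.storage.content_mimetype_get(batch))
          return results

      def __getattr__(self, name):
          # Any endpoint not wrapped explicitly goes straight through to
          # the underlying client.
          return getattr(self.storage, name)

The indexers would then be configured to instantiate such a proxy in front of the actual remote client, so none of the callers need to care about batch sizes.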