
Indexers - Find and implement a proper scheduling content messages indexing method
Closed, Migrated

Description

We are currently indexing all of our ~3.8b contents.

(This is probably a prerequisite to indexing new contents on the fly, most likely by leveraging our journal stack.)

So far, I've been regularly scheduling batches of contents from a snapshot file stored on uffizi
(the latest one being /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt.gz).

source: https://forge.softwareheritage.org/diffusion/DSNIP/browse/master/ardumont/send-batch-sha1s.sh
(well, a derivative of it, running on worker01.euwest.azure).

This is:

  • ugly
  • heavy on my daily workload (it increased recently, since the mimetype implementation change, T849, boosted performance :)

Find a proper implementation to automate this (as 3.2b contents remain to be indexed).
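Whatever the automated replacement ends up being, it will need the same batching step the manual script does. A minimal sketch of that step (function names and the batch size are assumptions for illustration, not the actual script):

```python
# Hypothetical sketch: group sha1 lines from a snapshot file into
# fixed-size batches, as an automated scheduler would do before
# sending them to a queue.
from itertools import islice


def iter_batches(lines, batch_size=1000):
    """Yield lists of at most batch_size non-empty, stripped sha1 lines."""
    it = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# In-memory data standing in for the snapshot file:
sha1s = ["%040x" % i for i in range(2500)]
batches = list(iter_batches(sha1s, batch_size=1000))
# -> 3 batches of sizes 1000, 1000, 500
```

In the real setup, `lines` would be the (decompressed) snapshot file and each batch would become one scheduling message.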

Note:

  • We cannot use our rabbitmq host directly, as it does not have enough disk space for this. That would make the host blow up and break the other workers (loader, lister, checker, etc.).
  • As per discussion with the team, we cannot use the scheduler infrastructure either (with oneshot tasks): that would make the scheduler's db size explode, as we don't have a cleanup routine for that db yet.

Note 2:
Triggering it this way has been useful, though. Sometimes the queue was almost empty except for a few jobs that kept being rescheduled due to unexpected errors (the latest being T861/T862, but other issues were surfaced that way too).

Note 3:
That was also how the rehash computations were scheduled (T712), but those took "only" 2-3 months (June-August 2017).
The indexers have been running longer than that (since May or June 2017 now).
I never quite found the time to make this more appropriate...

Event Timeline

ardumont renamed this task from Find and implement a proper scheduling content messages indexing method to Indexers - Find and implement a proper scheduling content messages indexing method.Dec 2 2017, 12:42 PM
ardumont updated the task description.

Thinking more about this.

> So far, i've been scheduling regularly some batch of contents from a snapshot file stored in uffizi

That is what bugs me the most. I'd like to use the db for that (a select query).

But so far, the mimetype table starts out empty and is only populated by the indexer when a new hash is indexed.
Unlike, for example, the archiver (well, at one point that was the way; I did not check back):
the archiver's table is pre-populated from the content table, and when it runs, it updates the data.

In the current state of affairs, I think something can be improved using either:

  • Option 1: without touching the actual implementation, use some form of materialized view (a real table updated with triggers, with indexes) which aggregates the content and content_mimetype tables on (sha1, mimetype) values. Then we can use a simple select (with the right indexes) to schedule those hashes whose mimetype is null.

    Cons:

      • heavy (I have something locally, P198)
      • we triplicate yet again the postgres indexes, on both content_mimetype and content_mimetype_to_index (which must also be updated when new content is inserted or a content_mimetype row is inserted or updated -> I only see triggers for that)

  • Option 2: stop the indexers for now, alter the content_mimetype table so that the (mimetype, encoding) columns can be null, pre-populate the table from the content table, and make sure the content_mimetype table is updated regularly from content (using triggers, since we share the db :/).

    Pros: that seems the simpler solution.
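To make the "simple select" idea concrete, here is a toy sketch using sqlite in place of postgres (the real schemas are richer; the table and column names here are simplified assumptions): join content against content_mimetype and pick the hashes that still lack a mimetype.

```python
# Toy sketch of the "select unindexed hashes" idea, using sqlite
# (the production db is postgres; schemas are simplified assumptions).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (sha1 TEXT PRIMARY KEY);
    CREATE TABLE content_mimetype (sha1 TEXT PRIMARY KEY, mimetype TEXT);
""")
db.executemany("INSERT INTO content VALUES (?)",
               [("a1",), ("b2",), ("c3",)])
# Only one content has been indexed so far:
db.execute("INSERT INTO content_mimetype VALUES (?, ?)", ("a1", "text/plain"))

# Hashes present in content but with no mimetype yet -- these are the
# candidates to schedule for indexing.
todo = [row[0] for row in db.execute("""
    SELECT c.sha1
    FROM content c
    LEFT JOIN content_mimetype m ON m.sha1 = c.sha1
    WHERE m.mimetype IS NULL
    ORDER BY c.sha1
""")]
# todo == ['b2', 'c3']
```

Both options above reduce to making this query cheap at our scale: option 1 pre-aggregates it in a trigger-maintained table, option 2 pre-populates content_mimetype so the `IS NULL` check needs no join.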

But I'd rather split the db first (T867) to avoid having to use triggers (which tie us to the current model concern, which is not good, as explained in that task).

I'd rather work from a snapshot of the current contents whose indexing we aim at finalizing... (well, that's what I've been doing for so long it seems like the dawn of time).
And later on, use the same idea as T494 to permit indexing on the fly.

After adding the indexer dependency to the scheduler setup (rSPSITEfb6faecaaa928c4ddcbdbc81181bf3ffac2ace4c), this has been rescheduled through:

$ tail -n +172000000 /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt | ./schedule_with_queue_length_check.py --queue-name indexer --threshold 10000 --waiting-time 120 | tee scheduling-indexer

Heads up: the output was too verbose, so I updated that script to only show the last sha1 sent for computation:

$ tail -n +181998079 /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt | ./schedule_with_queue_length_check.py --queue-name indexer --threshold 1000 --batch-size 1000 --waiting-time 120 | tee scheduling-indexer
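For reference, the throttling behaviour of schedule_with_queue_length_check.py can be sketched roughly as below. This is not the actual script: queue_length and send_batch are stand-ins for the real rabbitmq calls, and only the control flow (send a batch, back off while the queue is above the threshold, report the last sha1 sent) is meant to match.

```python
# Hypothetical sketch of the queue-length throttling loop; queue_length()
# and send_batch(batch) are stand-ins for the real rabbitmq calls.
import time


def schedule(sha1s, queue_length, send_batch,
             threshold=1000, batch_size=1000, waiting_time=120,
             sleep=time.sleep):
    """Send sha1 batches, pausing whenever the queue is too full.

    Returns the last sha1 sent, mirroring the reduced output
    described above.
    """
    last = None
    batch = []
    for sha1 in sha1s:
        batch.append(sha1)
        if len(batch) == batch_size:
            # Back off until the workers have drained the queue enough.
            while queue_length() > threshold:
                sleep(waiting_time)
            send_batch(batch)
            last = batch[-1]
            batch = []
    if batch:  # flush the final partial batch
        send_batch(batch)
        last = batch[-1]
    return last
```

Injecting queue_length, send_batch, and sleep keeps the loop testable without a broker; the real script would query rabbitmq's management API for the queue depth.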