
Indexers - Find and implement a proper scheduling content messages indexing method
Closed, Migrated

Description

We are currently indexing all of our ~3.8b contents.

(This is probably a prerequisite to indexing new contents on the fly, most likely by leveraging our journal stack.)

So far, I've been regularly scheduling batches of contents from a snapshot file stored on uffizi
(the latest one being /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt.gz).

source: https://forge.softwareheritage.org/diffusion/DSNIP/browse/master/ardumont/send-batch-sha1s.sh
(well, a derivative of it, running on worker01.euwest.azure).

This is:

  • ugly
  • heavy on my daily workload (it increased recently, since the mimetype implementation change, T849, boosted performance :)

Find a proper implementation to automate this (as 3.2b contents remain to be indexed).
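Whatever the automated replacement ends up being, it will need the same batching step the manual script does. A minimal sketch of that step (function names and the batch size are assumptions for illustration, not the actual script):

```python
# Hypothetical sketch: group sha1 lines from a snapshot file into
# fixed-size batches, as an automated scheduler would do before
# sending them to a queue.
from itertools import islice


def iter_batches(lines, batch_size=1000):
    """Yield lists of at most batch_size non-empty, stripped sha1 lines."""
    it = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# In-memory data standing in for the snapshot file:
sha1s = ["%040x" % i for i in range(2500)]
batches = list(iter_batches(sha1s, batch_size=1000))
# -> 3 batches of sizes 1000, 1000, 500
```

In the real setup, `lines` would be the (decompressed) snapshot file and each batch would become one scheduling message.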

Note:

  • We cannot use our rabbitmq host directly, as it does not have enough disk space for this. That would make the host blow up and break the other workers (loader, lister, checker, etc.).
  • As per discussion with the team, we cannot use the scheduler infrastructure either (with oneshot tasks): that would make the scheduler's db size explode, as we don't have a cleanup routine for that db yet.

Note 2:
Triggering it this way has been useful, though. Sometimes the queue was almost empty except for a few jobs that kept being rescheduled due to unexpected errors (the latest being T861/T862, but other issues were surfaced that way too).

Note 3:
That was also how the rehash computations were scheduled (T712), but those took "only" 2-3 months (June-August 2017).
The indexers have been running longer than that (since May or June 2017 now).
I never quite found the time to make this more appropriate...

Event Timeline

ardumont renamed this task from Find and implement a proper scheduling content messages indexing method to Indexers - Find and implement a proper scheduling content messages indexing method.Dec 2 2017, 12:42 PM
ardumont updated the task description.

Thinking more about this.

> So far, i've been scheduling regularly some batch of contents from a snapshot file stored in uffizi

That is what bugs me the most. I'd like to use the db for that (a select query).

But so far, the mimetype table starts out empty and is only populated by the indexer when a new hash is indexed.
Unlike, for example, the archiver (well, at one point that was the way; I did not check back):
the archiver's table is pre-populated from the content table, and when it runs, it updates the data.

In the current state of affairs, I think something can be improved using either:

  • Option 1: without touching the actual implementation, use some form of materialized view (a real table updated with triggers, with indexes) which aggregates the content and content_mimetype tables on (sha1, mimetype) values. Then we can use a simple select (with the right indexes) to schedule those hashes whose mimetype is null.

    Cons:

      • heavy (I have something locally, P198)
      • we triplicate yet again the postgres indexes, on both content_mimetype and content_mimetype_to_index (which must also be updated when new content is inserted or a content_mimetype row is inserted or updated -> I only see triggers for that)

  • Option 2: stop the indexers for now, alter the content_mimetype table so that the (mimetype, encoding) columns can be null, pre-populate the table from the content table, and make sure the content_mimetype table is updated regularly from content (using triggers, since we share the db :/).

    Pros: that seems the simpler solution.
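To make the "simple select" idea concrete, here is a toy sketch using sqlite in place of postgres (the real schemas are richer; the table and column names here are simplified assumptions): join content against content_mimetype and pick the hashes that still lack a mimetype.

```python
# Toy sketch of the "select unindexed hashes" idea, using sqlite
# (the production db is postgres; schemas are simplified assumptions).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (sha1 TEXT PRIMARY KEY);
    CREATE TABLE content_mimetype (sha1 TEXT PRIMARY KEY, mimetype TEXT);
""")
db.executemany("INSERT INTO content VALUES (?)",
               [("a1",), ("b2",), ("c3",)])
# Only one content has been indexed so far:
db.execute("INSERT INTO content_mimetype VALUES (?, ?)", ("a1", "text/plain"))

# Hashes present in content but with no mimetype yet -- these are the
# candidates to schedule for indexing.
todo = [row[0] for row in db.execute("""
    SELECT c.sha1
    FROM content c
    LEFT JOIN content_mimetype m ON m.sha1 = c.sha1
    WHERE m.mimetype IS NULL
    ORDER BY c.sha1
""")]
# todo == ['b2', 'c3']
```

Both options above reduce to making this query cheap at our scale: option 1 pre-aggregates it in a trigger-maintained table, option 2 pre-populates content_mimetype so the `IS NULL` check needs no join.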

But I'd rather split the db first (T867) to avoid having to use triggers (which tie us to the current model concern, which is not good, as explained in that task).

I'd rather work from a snapshot of the current contents whose indexing we aim at finalizing... (well, that's what I've been doing for so long it seems like the dawn of time).
And later on, use the same idea as T494 to permit indexing on the fly.

After adding the indexer dependency to the scheduler setup (rSPSITEfb6faecaaa928c4ddcbdbc81181bf3ffac2ace4c), this has been rescheduled through:

$ tail -n +172000000 /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt | ./schedule_with_queue_length_check.py --queue-name indexer --threshold 10000 --waiting-time 120 | tee scheduling-indexer

Heads up: the output was too verbose, so I updated that script to only show the last sha1 sent for computation:

$ tail -n +181998079 /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt | ./schedule_with_queue_length_check.py --queue-name indexer --threshold 1000 --batch-size 1000 --waiting-time 120 | tee scheduling-indexer
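For reference, the throttling behaviour of schedule_with_queue_length_check.py can be sketched roughly as below. This is not the actual script: queue_length and send_batch are stand-ins for the real rabbitmq calls, and only the control flow (send a batch, back off while the queue is above the threshold, report the last sha1 sent) is meant to match.

```python
# Hypothetical sketch of the queue-length throttling loop; queue_length()
# and send_batch(batch) are stand-ins for the real rabbitmq calls.
import time


def schedule(sha1s, queue_length, send_batch,
             threshold=1000, batch_size=1000, waiting_time=120,
             sleep=time.sleep):
    """Send sha1 batches, pausing whenever the queue is too full.

    Returns the last sha1 sent, mirroring the reduced output
    described above.
    """
    last = None
    batch = []
    for sha1 in sha1s:
        batch.append(sha1)
        if len(batch) == batch_size:
            # Back off until the workers have drained the queue enough.
            while queue_length() > threshold:
                sleep(waiting_time)
            send_batch(batch)
            last = batch[-1]
            batch = []
    if batch:  # flush the final partial batch
        send_batch(batch)
        last = batch[-1]
    return last
```

Injecting queue_length, send_batch, and sleep keeps the loop testable without a broker; the real script would query rabbitmq's management API for the queue depth.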