Loader profiling : Add Measure of ignored objects
Closed, Migrated

Description

Measure the number of objects ignored by the loader (listed but already loaded)

  • global metrics (statsd)
  • metrics per task (log)

Event Timeline

bchauvet created this task.

Loaders do not have this information themselves; it is handled by the "filtering storage proxy" (swh.storage.proxies.filter), which runs in the same process as the loaders.

In practice, the loaders have this information: they send a list of objects to <foo>_add(), so they know the number of objects they've processed, and the <foo>_add() methods return counters for the number of objects that were actually added.

We could add statsd probes in the loaders *before* calling <foo>_add() to count the number of objects processed (which would give us a "global" ratio of number of objects processed to objects effectively added), or we could just make the filtering storage proxy send said statsd probes itself (a count of inbound objects, before any filtering).
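A minimal sketch of the first option (probes in the loader around the <foo>_add() call). Everything here is illustrative: FakeStatsd and FakeStorage are in-memory stand-ins, the metric names are made up, and the "<type>:add" key in the returned summary is an assumption about the shape of the counters that <foo>_add() returns.

```python
from collections import Counter


class FakeStatsd:
    """In-memory stand-in for a statsd client (hypothetical, for illustration)."""

    def __init__(self):
        self.counters = Counter()

    def increment(self, name, value=1, tags=None):
        key = (name, tuple(sorted((tags or {}).items())))
        self.counters[key] += value


class FakeStorage:
    """Stand-in storage that only adds objects it has not seen before."""

    def __init__(self):
        self.seen = set()

    def content_add(self, contents):
        new = set(contents) - self.seen
        self.seen |= new
        # Assumed summary shape: {"<type>:add": number of objects really added}
        return {"content:add": len(new)}


def add_with_probes(storage, statsd, object_type, objects):
    """Count objects before calling <object_type>_add(), then count what was added."""
    objects = list(objects)
    tags = {"object_type": object_type}
    # Probe *before* the call: inbound objects, prior to any filtering.
    statsd.increment("objects_processed_total", len(objects), tags=tags)
    summary = getattr(storage, f"{object_type}_add")(objects)
    added = summary.get(f"{object_type}:add", 0)
    statsd.increment("objects_added_total", added, tags=tags)
    return len(objects), added


storage, statsd = FakeStorage(), FakeStatsd()
add_with_probes(storage, statsd, "content", ["a", "b"])
processed, added = add_with_probes(storage, statsd, "content", ["b", "c"])
print(processed - added)  # objects ignored by the second call
```

The global ratio would then be computed on the metrics side from the two counters, rather than in the loader.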

To get the per-task metrics, each task needs to keep both the number of objects processed and the number of objects added, and to compute the ratio between them. We have multiple options there:

  • Keep track of the counts in the loader directly?
  • Add a method to the filter storage proxy to measure this
  • Create a new "counting" storage proxy which would keep track of both the inbound and outbound objects and have a method to retrieve cumulative counts. We could use this proxy explicitly in the loaders.
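The third option could look roughly like the sketch below. This is not an existing swh.storage proxy: CountingProxy, get_counts() and InMemoryStorage are hypothetical names, and the "<type>:add" summary key is an assumption about what the <foo>_add() methods return. The proxy would need to sit in front of the filtering proxy so that it sees objects before they are filtered.

```python
from collections import Counter
import functools


class InMemoryStorage:
    """Stand-in backend that ignores objects it already holds (illustration only)."""

    def __init__(self):
        self.seen = set()

    def content_add(self, contents):
        new = set(contents) - self.seen
        self.seen |= new
        return {"content:add": len(new)}  # assumed summary shape


class CountingProxy:
    """Hypothetical proxy counting inbound and effectively added objects."""

    def __init__(self, storage):
        self.storage = storage
        self.inbound = Counter()
        self.added = Counter()

    def __getattr__(self, name):
        # Only reached for attributes not defined on the proxy itself.
        attr = getattr(self.storage, name)
        if not name.endswith("_add") or not callable(attr):
            return attr
        object_type = name[: -len("_add")]

        @functools.wraps(attr)
        def counted_add(objects):
            objects = list(objects)
            summary = attr(objects)
            self.inbound[object_type] += len(objects)
            self.added[object_type] += summary.get(f"{object_type}:add", 0)
            return summary

        return counted_add

    def get_counts(self):
        """Cumulative counts per object type since the proxy was created."""
        return {
            t: {"inbound": n, "added": self.added[t], "ignored": n - self.added[t]}
            for t, n in self.inbound.items()
        }


proxy = CountingProxy(InMemoryStorage())
proxy.content_add(["a", "b"])
proxy.content_add(["b", "c"])
print(proxy.get_counts())
```

A task could instantiate one such proxy per run and log get_counts() at the end, which would cover the per-task (log) metric without touching each loader's internals.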

The Git loader now exports a swh_loader_filtered_objects_total metric. We should eventually generalize this to the other loaders, using one of the options above.