Measure the number of objects ignored by the loader (listed but already loaded)
- global metrics (statsd)
- metrics per task (log)
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T4080 Minimize archival lag w.r.t. upstream code hosting platforms
Migrated | gitlab-migration | T4185 Loader profiling : Add Measure of ignored objects
Loaders do not have this information themselves; it is handled by the "filtering storage proxy" (swh.storage.proxies.filter), which runs in the same process as the loaders.
In practice, the loaders do have this information: they send a list of objects to <foo>_add(), so they know the number of objects they've processed, and the <foo>_add() methods return counters for the number of objects that were actually added.
We could add statsd probes in the loaders *before* calling <foo>_add() to count the number of objects processed (which would give us a "global" ratio of number of objects processed to objects effectively added), or we could just make the filtering storage proxy send said statsd probes itself (a count of inbound objects, before any filtering).
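To make the second option concrete, here is a minimal sketch of a proxy that counts inbound objects before filtering. Everything in it is hypothetical: `FakeStatsd`, `InMemoryStorage`, `CountingFilterProxy`, the metric names and the `"object:add"` summary key are stand-ins for illustration, not the actual swh.storage or statsd APIs.

```python
from collections import Counter


class FakeStatsd:
    """Hypothetical stand-in for a statsd client; accumulates counters in memory."""

    def __init__(self):
        self.counters = Counter()

    def increment(self, name, value=1):
        self.counters[name] += value


class InMemoryStorage:
    """Toy backend (assumption): remembers object ids and reports how many were new."""

    def __init__(self):
        self.known = set()

    def missing(self, ids):
        # Ids that the backend has not stored yet.
        return [i for i in ids if i not in self.known]

    def add(self, ids):
        new = set(ids) - self.known
        self.known |= new
        # Summary dict with a made-up key, mimicking "counters of objects added".
        return {"object:add": len(new)}


class CountingFilterProxy:
    """Filters out already-known objects and counts what goes in and out."""

    def __init__(self, storage, statsd):
        self.storage = storage
        self.statsd = statsd

    def add(self, ids):
        ids = list(ids)
        # Count inbound objects *before* any filtering.
        self.statsd.increment("objects_inbound_total", len(ids))
        to_send = self.storage.missing(ids)
        # Objects listed by the loader but already present in the backend.
        self.statsd.increment("objects_filtered_total", len(ids) - len(to_send))
        summary = self.storage.add(to_send)
        self.statsd.increment("objects_added_total", summary["object:add"])
        return summary


statsd = FakeStatsd()
proxy = CountingFilterProxy(InMemoryStorage(), statsd)
proxy.add(["a", "b", "c"])
proxy.add(["b", "c", "d"])  # "b" and "c" are already stored
```

With counters like these, the "global" ratio is just the filtered (or added) total divided by the inbound total, and the loaders themselves need no changes.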
To get the per-task metrics, we need the tasks to keep both the number of objects processed and the number of objects added, and to compute the ratio between them. We have multiple options there:
The Git loader now exports a swh_loader_filtered_objects_total metric. We should eventually generalize this to other loaders, using one of the options above.
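For the per-task side, the bookkeeping discussed earlier (keep the processed and added counts, then compute a ratio and log it) could look roughly like the sketch below. `TaskObjectStats` and its fields are hypothetical names, not something the loaders currently define.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class TaskObjectStats:
    """Hypothetical per-task accumulator of processed vs. added object counts."""

    processed: int = 0
    added: int = 0

    def record(self, n_processed, n_added):
        # Call once per <foo>_add() batch: objects sent, objects really added.
        self.processed += n_processed
        self.added += n_added

    @property
    def ignored(self):
        # Objects listed by the loader but already present in the archive.
        return self.processed - self.added

    @property
    def ignored_ratio(self):
        # Guard against tasks that processed nothing.
        return self.ignored / self.processed if self.processed else 0.0

    def log_summary(self, task_name):
        logger.info(
            "%s: processed=%d added=%d ignored=%d (%.1f%%)",
            task_name,
            self.processed,
            self.added,
            self.ignored,
            100 * self.ignored_ratio,
        )


stats = TaskObjectStats()
stats.record(10, 4)
stats.record(10, 1)
stats.log_summary("example-task")
```

Logging the summary once at the end of the task keeps the per-task metric out of statsd, matching the "metrics per task (log)" item in the description.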