For legacy reasons, our object storage holds contents (imported from the Antelink archive) that are individually larger than the maximum size limit (100 MB) enforced by the loaders currently in production. That creates inconsistencies, as well as performance problems for very large objects such as some DVD ISO images.
We want to move objects that are larger than the current limit to a separate object storage, and move them from the content table to the content_missing table in the DB, so that all their metadata are retained.
Event Timeline
as a datapoint, at the time of writing the total (uncompressed) size occupied by content objects that are larger than our current limit is as follows:
softwareheritage=> select count(*), sum(length), avg(length) from content where length > 100 * 1024 * 1024;
 count  |        sum         |     avg
--------+--------------------+-------------
 54'803 | 18'605'873'775'395 | 339'504'658
i.e., ~55 K contents, for ~17 TB total, average size ~320 MB.
Depending on the compression level, this might be up to 5-10% of our current object storage total size.
and here's the actual disk usage (!= compressed size)
/srv/softwareheritage/scratch/lists $ xzcat oversize-contents.txt.xz | while read id ; do du $(swh-ls-obj $id) ; done | cut -f 1 | paste -sd+ | bc
16847900696
those are KB, so the total disk usage is ~15.7 TB (not bad!)
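For anyone who wants to redo the unit conversions, here is a minimal Python sketch. It only reuses the numbers quoted above; the assumption that `du` reports 1 KiB blocks, and the use of binary units (2**40 bytes per "TB"), are mine.

```python
# Figures copied from the query result and the du output above.
TOTAL_BYTES = 18_605_873_775_395   # sum(length) from the content table
AVG_BYTES = 339_504_658            # avg(length)
DU_KIB = 16_847_900_696            # du total, assuming 1 KiB blocks

KIB, MIB, TIB = 2**10, 2**20, 2**40


def to_tib(n_bytes: int) -> float:
    """Convert a byte count to binary terabytes (2**40 bytes)."""
    return n_bytes / TIB


uncompressed_tib = to_tib(TOTAL_BYTES)
on_disk_tib = to_tib(DU_KIB * KIB)
avg_mib = AVG_BYTES / MIB

print(f"uncompressed: {uncompressed_tib:.1f} TiB")
print(f"on disk:      {on_disk_tib:.1f} TiB")
print(f"average:      {avg_mib:.0f} MiB")
```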
As a byproduct of this, the actual list of objects to move away is available on uffizi, in the file mentioned above.
ok, here are the building blocks I've prepared to resolve this task, as a step by step recipe:
1. create the new object storage (DONE on uffizi:/srv/softwareheritage/oversize-objects, which is a symlink to the space partition with 46 TB available)
2. copy (safely) the oversize objects from the old object storage to the new one, at depth 2 (instead of 3), using
3. remove the oversize objects from the old object storage (safely):
4. move the content metadata (in the DB) from content to skipped_content (dealing with conflicts):
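To make the "safe" copy and removal steps concrete, here is a rough Python sketch. It is not the script that was actually used: the function names are hypothetical, and it assumes objects live under two-character hex directories (three levels in the old storage, two in the new one) and that comparing checksums of the on-disk files is an acceptable safety check.

```python
import hashlib
import os
import shutil
import tempfile


def sharded_path(root: str, hex_id: str, depth: int) -> str:
    """Return root/aa/bb/.../hex_id, using `depth` two-character levels."""
    parts = [hex_id[2 * i : 2 * i + 2] for i in range(depth)]
    return os.path.join(root, *parts, hex_id)


def file_sha256(path: str) -> str:
    """Hash a file in chunks so huge objects never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_object(old_root: str, new_root: str, hex_id: str) -> str:
    """Copy one object to the new storage, verify it, then drop the original."""
    src = sharded_path(old_root, hex_id, depth=3)
    dst = sharded_path(new_root, hex_id, depth=2)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    tmp = dst + ".tmp"
    shutil.copyfile(src, tmp)
    if file_sha256(tmp) != file_sha256(src):
        os.remove(tmp)
        raise IOError(f"copy of {hex_id} does not match the source")
    os.replace(tmp, dst)  # rename within the same directory
    os.remove(src)        # only reached once the copy is verified
    return dst


# Tiny demonstration on throwaway directories.
with tempfile.TemporaryDirectory() as tmp_root:
    old_root = os.path.join(tmp_root, "old")
    new_root = os.path.join(tmp_root, "oversize-objects")
    payload = b"pretend this is a DVD ISO image"
    hex_id = hashlib.sha1(payload).hexdigest()
    src = sharded_path(old_root, hex_id, depth=3)
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as f:
        f.write(payload)
    dst = move_object(old_root, new_root, hex_id)
    moved_ok = os.path.exists(dst) and not os.path.exists(src)
    with open(dst, "rb") as f:
        content_ok = f.read() == payload
```

Copying to a temporary name first means an interrupted run never leaves a half-written object at the final path, and the source is deleted only after the checksum comparison succeeds.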
review/feedback welcome
As a comment on 4, the object_id column is per-table, so you should avoid carrying it over to the skipped_content table.
Updated SQL to also delete objects from tables that reference them, e.g., the indexer ones.
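As an illustration of the points above (not the actual SQL that was run), here is a self-contained Python sketch using an in-memory SQLite database. The schema is a simplified, hypothetical stand-in: the real tables have more columns, `content_mimetype` merely plays the role of an indexer table, and the `reason` text is made up. It shows the metadata move with an explicit column list, so `object_id` is not carried over, plus conflict handling and cleanup of a referencing table in one transaction.

```python
import sqlite3

LIMIT = 100 * 1024 * 1024  # current maximum content size, in bytes

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical schema: the real tables have more columns.
    CREATE TABLE content (
        object_id INTEGER PRIMARY KEY AUTOINCREMENT,
        sha1 TEXT UNIQUE NOT NULL,
        length INTEGER NOT NULL
    );
    CREATE TABLE skipped_content (
        object_id INTEGER PRIMARY KEY AUTOINCREMENT,
        sha1 TEXT UNIQUE NOT NULL,
        length INTEGER NOT NULL,
        reason TEXT NOT NULL
    );
    CREATE TABLE content_mimetype (  -- stand-in for an indexer table
        sha1 TEXT PRIMARY KEY,
        mimetype TEXT
    );
""")
conn.executemany(
    "INSERT INTO content (sha1, length) VALUES (?, ?)",
    [("aa", 10), ("bb", LIMIT + 1), ("cc", 5 * LIMIT)],
)
conn.execute(
    "INSERT INTO content_mimetype VALUES ('bb', 'application/x-iso9660-image')"
)
# 'cc' is already present in skipped_content, to exercise the conflict path.
conn.execute(
    "INSERT INTO skipped_content (sha1, length, reason) VALUES ('cc', ?, 'pre-existing')",
    (5 * LIMIT,),
)

with conn:  # one transaction: commit on success, roll back on error
    # Explicit column list: object_id is NOT copied, each table assigns its own.
    # SQLite's INSERT OR IGNORE skips rows that would violate the unique key;
    # in PostgreSQL the comparable clause is ON CONFLICT DO NOTHING.
    conn.execute(
        """
        INSERT OR IGNORE INTO skipped_content (sha1, length, reason)
        SELECT sha1, length, 'Content too large' FROM content WHERE length > ?
        """,
        (LIMIT,),
    )
    # Clean up referencing rows before deleting the contents themselves.
    conn.execute(
        "DELETE FROM content_mimetype WHERE sha1 IN "
        "(SELECT sha1 FROM content WHERE length > ?)",
        (LIMIT,),
    )
    conn.execute("DELETE FROM content WHERE length > ?", (LIMIT,))

remaining = [r[0] for r in conn.execute("SELECT sha1 FROM content ORDER BY sha1")]
skipped = [r[0] for r in conn.execute("SELECT sha1 FROM skipped_content ORDER BY sha1")]
indexed = conn.execute("SELECT count(*) FROM content_mimetype").fetchone()[0]
cc_reason = conn.execute(
    "SELECT reason FROM skipped_content WHERE sha1 = 'cc'"
).fetchone()[0]
```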
This is now done on uffizi.
Potentially remaining sub-tasks before closing this:
- removing the oversize objects from banco