
Redesign the lazy loading feature for Content objects
Closed, Migrated

Description

While working on T2422, it appeared that the lazy loading of disk-file-backed Content objects suffers from design problems and, moreover, is actually no longer used (neither in loader implementations nor in any other package).

Making the lazy-loading mechanism work properly, and actually using it, could bring a significant reduction in memory usage in some parts of the stack (most probably the loaders, maybe others).

Event Timeline

douardda triaged this task as Normal priority. Jun 25 2020, 12:57 PM
douardda created this task.

Lazy loading for content objects (stored as files on the filesystem) lets us choose where to put the cursor in the eternal tradeoff between:

  • temporary storage space used on disk
  • memory usage of "materialized" Content objects
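To make the tradeoff concrete, here is a minimal sketch of a file-backed content object that only reads its bytes on demand. `LazyContent`, `from_file`, `with_data` and `demo_lazy` are hypothetical names made up for this illustration; this is not the actual Content class from the model package, just one way the idea could look.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class LazyContent:
    """Hypothetical file-backed content (illustration only)."""

    path: str
    sha1: str
    length: int
    _data: Optional[bytes] = None  # None until the bytes are materialized

    @classmethod
    def from_file(cls, path: str) -> "LazyContent":
        # Hash the file in chunks so the whole file is never held in memory.
        h = hashlib.sha1()
        length = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
                length += len(chunk)
        return cls(path=path, sha1=h.hexdigest(), length=length)

    @property
    def is_materialized(self) -> bool:
        return self._data is not None

    def with_data(self) -> "LazyContent":
        # Materializing costs memory, but afterwards the temporary file
        # is no longer needed and may be deleted.
        with open(self.path, "rb") as f:
            return LazyContent(self.path, self.sha1, self.length, f.read())


def demo_lazy():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hello.txt")
        with open(path, "wb") as f:
            f.write(b"hello world\n")
        lazy = LazyContent.from_file(path)  # metadata only, file must stay
        full = lazy.with_data()             # bytes in memory, file optional
    return lazy.is_materialized, full.is_materialized, full._data, lazy.length
```

As long as an object stays lazy, its temporary file has to stay on disk; once it is materialized, the disk space can be reclaimed at the price of holding the bytes in memory.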

To minimize impact on both axes, the Debian loader used to do "really clever" (read: hard to follow) stuff with its temporary files. For each package version, it would:

  • extract the version
  • compute the hash for all the objects in the new version (dirs and files)
  • remove all the files that have already been referenced in a previous version of the package
  • run content_missing on the remaining files
  • remove all files already loaded in the archive

After a certain threshold volume of new files, it would:

  • flush new contents to the archive (reading their bytes again from disk)
  • remove all temporary files that have been flushed to the archive
  • then continue running the package version loop.
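The steps above can be roughly sketched as follows. This is a simplified reconstruction for illustration, not the Debian loader's actual code: `FakeArchive` is an in-memory stand-in for the archive, `load_versions`, `flush` and `demo_sparsify` are hypothetical names, the threshold is counted in bytes by assumption, and only files are hashed (directories are left out).

```python
import hashlib
import os
import tempfile


class FakeArchive:
    """In-memory stand-in for the archive (hypothetical)."""

    def __init__(self):
        self.contents = {}

    def content_missing(self, hashes):
        return {h for h in hashes if h not in self.contents}

    def content_add(self, contents):
        self.contents.update(contents)


def sha1_of(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def flush(pending, archive):
    # Read the bytes again from disk, send them, then drop the temp files.
    data = {}
    for h, path in pending.items():
        with open(path, "rb") as f:
            data[h] = f.read()
    archive.content_add(data)
    for path in pending.values():
        os.remove(path)
    pending.clear()


def load_versions(version_dirs, archive, threshold):
    seen = set()     # hashes referenced by a previous version
    pending = {}     # hash -> temporary file waiting to be flushed
    pending_size = 0
    flushes = 0
    for vdir in version_dirs:
        # Hash every file of the extracted version.
        hashed = {}
        for root, _dirs, names in os.walk(vdir):
            for name in names:
                path = os.path.join(root, name)
                hashed[path] = sha1_of(path)
        # Remove files already referenced earlier (or duplicated here).
        fresh = {}
        for path, h in hashed.items():
            if h in seen or h in fresh:
                os.remove(path)
            else:
                fresh[h] = path
        seen.update(fresh)
        # Remove files the archive already has; keep the rest pending.
        missing = archive.content_missing(set(fresh))
        for h, path in fresh.items():
            if h in missing:
                pending[h] = path
                pending_size += os.path.getsize(path)
            else:
                os.remove(path)
        if pending_size >= threshold:
            flush(pending, archive)
            pending_size = 0
            flushes += 1
    if pending:
        flush(pending, archive)
        flushes += 1
    return flushes


def demo_sparsify():
    archive = FakeArchive()
    versions = [{"a": b"same", "b": b"v1"}, {"a": b"same", "b": b"v2"}]
    with tempfile.TemporaryDirectory() as tmp:
        dirs = []
        for i, files in enumerate(versions):
            vdir = os.path.join(tmp, f"v{i}")
            os.mkdir(vdir)
            for name, data in files.items():
                with open(os.path.join(vdir, name), "wb") as f:
                    f.write(data)
            dirs.append(vdir)
        flushes = load_versions(dirs, archive, threshold=1)
        leftover = sum(len(names) for _r, _d, names in os.walk(tmp))
    return len(archive.contents), flushes, leftover
```

In the small demo, the file shared by both versions is stored once and no temporary file survives the run, which is the point of the sparsification.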

This "sparsification" of the extracted package versions minimized the disk space needed for loading large amounts of large packages (e.g. when bulk importing all firefox or libreoffice versions from snapshot.debian.org).

With the current approach of buffering/deferring content insertions in the database until the explicit call to flush(), done at the point where the snapshot is constructed, we've abstracted content addition to the archive away from the lifetime of temporary files, which is why we've not re-introduced proper deferred loading yet.