
Add support for slices when getting objects from the objstorage.
Closed, Migrated, Edits Locked

Description

Currently, the get methods only support returning the full blob. We should add new parameters (e.g. start and end) to specify the range of bytes the caller wants.

These get methods are defined in swh-objstorage/swh/objstorage/objstorage.py (the abstract base class) and in three different backends:

  • swh/objstorage/objstorage_pathslicing.py manipulates a file object, so it's a matter of using seek() and read().
  • swh/objstorage/objstorage_in_memory.py manipulates Python bytes objects, so it only needs slicing.
  • swh/objstorage/objstorage_rados.py uses RADOS, so it's a bit more tedious. Fortunately, it already uses the slicing logic of RADOS (self.ioctx.read(_obj_id, offset, READ_SIZE)), so it's a matter of changing values of the arguments to self.ioctx.read.
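As a rough illustration of the first two cases, here is a minimal sketch of what a ranged get could look like for a bytes object and for a plain, uncompressed file. The function names and the start/end parameters are hypothetical and are not part of the existing swh-objstorage API; the sketch also assumes end is exclusive, like a Python slice.

```python
import os
import tempfile


def get_slice_from_bytes(blob, start=0, end=None):
    """Return blob[start:end]; hypothetical helper for the in-memory case."""
    return blob[start:end]


def get_slice_from_file(path, start=0, end=None):
    """Return bytes [start, end) of an uncompressed file using seek() and read().

    Hypothetical helper: assumes the file at `path` holds the raw object.
    """
    with open(path, "rb") as f:
        f.seek(start)
        if end is None:
            # No upper bound: read everything from `start` to end of file.
            return f.read()
        # Guard against end < start, which would otherwise read to EOF.
        return f.read(max(end - start, 0))


# Small demonstration on a temporary file.
content = b"0123456789"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
demo_path = tmp.name

from_bytes = get_slice_from_bytes(content, 2, 5)
from_file = get_slice_from_file(demo_path, 2, 5)
tail_from_file = get_slice_from_file(demo_path, 7)
os.unlink(demo_path)
```

Both helpers return the same bytes for the same range, which is the property a common start/end interface across backends would need.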

Event Timeline

vlorentz created this task.

objstorage_pathslicing manipulates a *gzipped* file object, which means that, to the best of my knowledge, efficient seeking is not supported: we will have to decompress everything from the beginning of the file up to the range that we really want to read.
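To make the cost concrete, here is a minimal sketch of a ranged read over a gzip-compressed object. It assumes each object is stored as a single gzip stream; the function name, the start/end parameters and chunk_size are hypothetical and not taken from swh-objstorage. Every byte before start still has to be decompressed, it is only discarded instead of returned.

```python
import gzip
import io


def get_slice_from_gzip(fileobj, start=0, end=None, chunk_size=4096):
    """Return decompressed bytes [start, end) from a gzip-compressed file object.

    Hypothetical helper: the prefix before `start` is decompressed chunk by
    chunk and thrown away, since gzip offers no random access.
    """
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        remaining = start
        while remaining > 0:
            chunk = gz.read(min(chunk_size, remaining))
            if not chunk:
                # The object is shorter than `start`: nothing to return.
                return b""
            remaining -= len(chunk)
        if end is None:
            return gz.read()
        return gz.read(max(end - start, 0))


# Demonstration on an in-memory compressed object.
data = bytes(range(256)) * 100
compressed = gzip.compress(data)
middle = get_slice_from_gzip(io.BytesIO(compressed), 1000, 1010)
```

The work done is proportional to end rather than to end - start, so a range near the end of a large object costs almost as much as fetching the whole object.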

Same issue for the Azure blob storage (the objects there are compressed), and likely for the S3 storage as well. Whether that's a good design decision or not (it probably isn't) is beside the point, but that's what we have to work with now.

I'm not convinced that feature is really something that we want to implement (it's not really needed for the indexer, for instance), and I'm not convinced about the Easy hack classification either :)

Same here. Not that much of an easy hack. And what is the real-life use case that drives this feature request? YAGNI?

I didn't know objects are compressed. That indeed makes the issue harder.