Page MenuHomeSoftware Heritage

Cloud-based objstorage using Apache Libcloud

Authored by qcampos on Aug 5 2016, 4:32 PM.



This implementation allows the use of an object storage from a cloud
provider (atm Amazon's S3 and OpenStack object are implemented as first

Test Plan

This diff is submitted for first code review. Tests are still
needed and will be done once a cloud emulator will be deployed.

Diff Detail

rDOBJS Object storage
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos retitled this revision from to Cloud-based objstorage using Apache Libcloud.
qcampos edited the test plan for this revision. (Show Details)
qcampos updated this object.
zack requested changes to this revision.Aug 8 2016, 10:30 AM
zack added a reviewer: zack.
zack added a subscriber: zack.

Looks like a good start! I've noted down some comments for further discussion, has we discussed F2F last Friday.


nitpick: as this is new code written in 2016, you can drop "2015-" from here


as this is an abstract class not meant to be implemented, shouldn't it use the abc machinery?


so, is this really a "should" or a "shall"? does this class do anything useful if not subclassed or not?


"correct object" is not very useful doc for people looking into how to implement a specific concrete subclass of this abstract class.
Maybe you can add a pointer to the relevant entry in the libcloud doc here?


a (even just minimal) docstring here would be nice


The warning is useful, but should probably explain in what it would be inefficient. Is it just because it's a ping-pong with remote storage over the network, or is it something worse, like in-memory caching of the whole list?

A (minimal) docstring here would be nice too.


d'oh :(

if there is really no direct method to retrieve the size of the storage, I believe we should either not implement this method, or equip it with the same warning that goes with iter, otherwise it might be a timebomb


This and a lot of other docstrings in this module seem to be copy-paste from, which would make them a pain to maintain.
Rather than doing that, just mention "see base class [CLASS NAME]", in a way that allows to hyperlink to that at document generation time.


It is actually true that integrity is guaranteed by the cloud in the sense we want? We want that the key of an object, in the way we compute it matches the associate data. Is that guarantee part of libcloud contract with all its backend implementation?

  • if it is, it's pointless to actually retrieve the object here. It would be much better to just check for object existence, e.g. with some sort of "HEAD" method
  • if it is not, then we should really retrieve the object, compute its checksum according to our checksum algo, and verify it matches. Also: keep in mind that the algorithm that we use for key checking might evolve in the future, so if by coincidence today it matches what libcloud backends do, it might not be the case in the future
This revision now requires changes to proceed.Aug 8 2016, 10:30 AM

It doesn't seems a method to compute the size of a storage or a container exists.
This implementation was the initial one in the ObjStorage, though I didn't write the same docstring saying that this method (Iter and len) wasn't meant to be used in production.

I don't know if we will need to list the full content of an objstorage, or get its size ; if so, I can search for a better implementation we may use in production.


Oh, right. I was thinking about some bite-rot corruption, not id mismatching the content.

We do compute the hash ourselves (if not provided) and use it as a key for the storage. Libcloud doesn't do anything special (Although I guess the cloud behind will do whatever it needs with a hash to store the data) it will be retrievable with the key we provided).
However, the content might be invalid if the id provided in the add method was wrong. Didn't though about that case.

As a note, the _get_object returns a Libcloud Object that is just a pointer to the object and does not retrieve the content unless a download or as_stream method is used (For what I've seen, Object.as_stream() delegates to driver.download_as_stream(object, ...).


That feature is awesome! Is there any don't use-case for this? Because it could be used in a lot of classes in the swh.objstorgae module.

qcampos edited edge metadata.
qcampos marked 10 inline comments as done.

Correct some docstring and use ABC for the base class.

zack requested changes to this revision.Aug 10 2016, 9:59 AM
zack edited edge metadata.

Looks good!
I've just noted down a nit about docstrings in subclasses/overridden methods that add nothing w.r.t. the parent class, as they're effectively useless.


The diff looks generally good. I've just a remaining nitpick about the docstrings that add nothing w.r.t. the parent class.
I've checked, and if you override a method in a subclass but *not* specify a docstring, Python 3.x will automatically find the docstring of the parent class.
So it is indeed pointless to have docstrings that just say "see parent class".

Can you remove all of them, unless they add something that is specific to the subclass / method override?

This revision now requires changes to proceed.Aug 10 2016, 9:59 AM
qcampos edited edge metadata.
qcampos marked 3 inline comments as done.

Remove pointless docstrings when abc's one can be used

zack edited edge metadata.
This revision is now accepted and ready to land.Aug 12 2016, 10:52 AM
qcampos edited edge metadata.

No changes, but current diff needs to be updated as the parent commit as changed due to a rebase in top of previously pushed diff.

This revision now requires review to proceed.Aug 12 2016, 12:35 PM
qcampos edited edge metadata.

See previous diff; I made the update from the wrong branch.

olasd added a reviewer: olasd.
This revision is now accepted and ready to land.Aug 12 2016, 1:16 PM
This revision was automatically updated to reflect the committed changes.
ardumont added inline comments.

Typo in parameter's name: chech_presence -> check_presence.