Changeset View
Standalone View
swh/objstorage/cloud/objstorage_cloud.py
- This file was added.
# Copyright (C) 2016 The Software Heritage developers | |||||
zack: nitpick: as this is new code written in 2016, you can drop "2015-" from here | |||||
# See the AUTHORS file at the top-level directory of this distribution | |||||
# License: GNU General Public License version 3, or any later version | |||||
# See top-level LICENSE file for more information | |||||
import abc | |||||
from ..objstorage import ObjStorage, ID_HASH_ALGO | |||||
from ..exc import ObjNotFoundError, Error | |||||
from swh.core import hashutil | |||||
from libcloud.storage import providers | |||||
from libcloud.storage.types import ObjectDoesNotExistError | |||||
Done Inline Actionsas this is an abstract class not meant to be implemented, shouldn't it use the abc machinery? zack: as this is an abstract class not meant to be implemented, shouldn't it use the [[ https://docs. | |||||
Done Inline ActionsThat feature is awesome! Is there any don't use-case for this? Because it could be used in a lot of classes in the swh.objstorgae module. qcampos: That feature is awesome! Is there any //don't use//-case for this? Because it could be used in… | |||||
class CloudObjStorage(ObjStorage, metaclass=abc.ABCMeta): | |||||
""" Abstract ObjStorage that allows connection to a cloud using Libcloud | |||||
Done Inline Actionsso, is this really a "should" or a "shall"? does this class do anything useful if not subclassed or not? zack: so, is this really a "should" or a "shall"? does this class do anything useful if not… | |||||
Implementations of this class must redefine the _get_provider method to | |||||
Done Inline Actions"correct object" is not very useful doc for people looking into how to implement a specific concrete subclass of this abstract class. zack: "correct object" is not very useful doc for people looking into how to implement a specific… | |||||
make it return a driver provider (i.e. object that supports `get_driver` | |||||
method) which return a LibCloud driver | |||||
(see https://libcloud.readthedocs.io/en/latest/storage/api.html). | |||||
""" | |||||
def __init__(self, api_key, api_secret_key, container_name): | |||||
self.driver = self._get_driver(api_key, api_secret_key) | |||||
self.container_name = container_name | |||||
self.container = self.driver.get_container( | |||||
container_name=container_name) | |||||
def _get_driver(self, api_key, api_secret_key): | |||||
""" Initialize a driver to communicate with the cloud | |||||
Args: | |||||
api_key: key to connect to the API. | |||||
api_secret_key: secret key for authentification. | |||||
Returns: | |||||
a Libcloud driver to a cloud storage. | |||||
""" | |||||
# Get the driver class from its description. | |||||
cls = providers.get_driver(self._get_provider()) | |||||
# Initialize the driver. | |||||
return cls(api_key, api_secret_key) | |||||
@abc.abstractmethod | |||||
def _get_provider(self): | |||||
""" Get a libcloud driver provider | |||||
Done Inline Actionsa (even just minimal) docstring here would be nice zack: a (even just minimal) docstring here would be nice | |||||
This method must be overriden by subclasses to specify which of the | |||||
native libcloud driver the current storage should connect to. | |||||
Alternatively, provider for a custom driver may be returned, in which | |||||
case the provider will have tu support `get_driver` method. | |||||
""" | |||||
raise NotImplementedError('%s must implement `get_provider` method' | |||||
% type(self)) | |||||
def __contains__(self, obj_id): | |||||
Done Inline ActionsThe warning is useful, but should probably explain in what it would be inefficient. Is it just because it's a ping-pong with remote storage over the network, or is it something worse, like in-memory caching of the whole list? A (minimal) docstring here would be nice too. zack: The warning is useful, but should probably explain in what it would be inefficient. Is it just… | |||||
try: | |||||
self._get_object(obj_id) | |||||
except ObjectDoesNotExistError: | |||||
return False | |||||
else: | |||||
return True | |||||
Done Inline Actionsd'oh :( if there is really no direct method to retrieve the size of the storage, I believe we should either not implement this method, or equip it with the same warning that goes with iter, otherwise it might be a timebomb zack: d'oh :(
if there is really no direct method to retrieve the size of the storage, I believe we… | |||||
Done Inline ActionsIt doesn't seems a method to compute the size of a storage or a container exists. I don't know if we will need to list the full content of an objstorage, or get its size ; if so, I can search for a better implementation we may use in production. qcampos: It doesn't seems a method to compute the size of a storage or a container exists.
This… | |||||
def __iter__(self): | |||||
""" Iterate over the objects present in the storage | |||||
Warning: Iteration over the contents of a cloud-based object storage | |||||
may have bad efficiency: due to the very high amount of objects in it | |||||
and the fact that it is remote, get all the contents of the current | |||||
object storage may result in a lot of network requests. | |||||
You almost certainly don't want to use this method in production. | |||||
""" | |||||
yield from map(lambda obj: obj.name, | |||||
self.driver.iterate_container_objects(self.container)) | |||||
Done Inline ActionsThis and a lot of other docstrings in this module seem to be copy-paste from objstorage.py, which would make them a pain to maintain. zack: This and a lot of other docstrings in this module seem to be copy-paste from objstorage.py… | |||||
def __len__(self): | |||||
""" Compute the number of objects in the current object storage. | |||||
Warning: this currently uses `__iter__`, its warning about bad | |||||
performance applies. | |||||
Returns: | |||||
number of objects contained in the storage. | |||||
""" | |||||
return sum(1 for i in self) | |||||
def add(self, content, obj_id=None, check_presence=True): | |||||
if obj_id is None: | |||||
# Checksum is missing, compute it on the fly. | |||||
h = hashutil._new_hash(ID_HASH_ALGO, len(bytes)) | |||||
h.update(bytes) | |||||
obj_id = h.digest() | |||||
if check_presence and obj_id in self: | |||||
return obj_id | |||||
self._put_object(content, obj_id) | |||||
return obj_id | |||||
def restore(self, content, obj_id=None): | |||||
return self.add(content, obj_id, chech_presence=False) | |||||
ardumontUnsubmitted Not Done Inline ActionsTypo in parameter's name: chech_presence -> check_presence. ardumont: Typo in parameter's name: `chech_presence` -> `check_presence`. | |||||
def get(self, obj_id): | |||||
return bytes(self._get_object(obj_id).as_stream()) | |||||
def check(self, obj_id): | |||||
# Check that the file exists, as _get_object raises ObjNotFoundError | |||||
self._get_object(obj_id) | |||||
# Check the content integrity | |||||
obj_content = self.get(obj_id) | |||||
# TODO factorize the hash computation. | |||||
h = hashutil._new_hash(ID_HASH_ALGO, len(obj_content)) | |||||
h.update(obj_content) | |||||
content_obj_id = h.digest() | |||||
if content_obj_id != obj_id: | |||||
raise Error(obj_id) | |||||
def _get_object(self, obj_id): | |||||
""" Get a Libcloud wrapper for an object pointer. | |||||
This wrapper does not retrieve the content of the object direclty. | |||||
""" | |||||
hex_obj_id = hashutil.hash_to_hex(obj_id) | |||||
try: | |||||
return self.driver.get_object(self.container_name, hex_obj_id) | |||||
except ObjectDoesNotExistError as e: | |||||
raise ObjNotFoundError(e.object_name) | |||||
def _put_object(self, content, obj_id): | |||||
""" Create an object in the cloud storage. | |||||
Created object will contains the content and be referenced by the | |||||
given id. | |||||
""" | |||||
hex_obj_id = hashutil.hash_to_hex(obj_id) | |||||
Done Inline ActionsIt is actually true that integrity is guaranteed by the cloud in the sense we want? We want that the key of an object, in the way we compute it matches the associate data. Is that guarantee part of libcloud contract with all its backend implementation?
zack: It is actually true that integrity is guaranteed by the cloud in the sense we want? We want… | |||||
Done Inline ActionsOh, right. I was thinking about some bite-rot corruption, not id mismatching the content. We do compute the hash ourselves (if not provided) and use it as a key for the storage. Libcloud doesn't do anything special (Although I guess the cloud behind will do whatever it needs with a hash to store the data) it will be retrievable with the key we provided). As a note, the _get_object returns a Libcloud Object that is just a pointer to the object and does not retrieve the content unless a download or as_stream method is used (For what I've seen, Object.as_stream() delegates to driver.download_as_stream(object, ...). qcampos: Oh, right. I was thinking about some bite-rot corruption, not id mismatching the content.
We… | |||||
self.driver.upload_object_via_stream(iter(content), self.container, | |||||
hex_obj_id) | |||||
Done Inline ActionsThe diff looks generally good. I've just a remaining nitpick about the docstrings that add nothing w.r.t. the parent class. Can you remove all of them, unless they add something that is specific to the subclass / method override? zack: The diff looks generally good. I've just a remaining nitpick about the docstrings that add… |
nitpick: as this is new code written in 2016, you can drop "2015-" from here