Differential D94 Diff 327 swh/objstorage/cloud/objstorage_cloud.py

Changeset View

Standalone View

swh/objstorage/cloud/objstorage_cloud.py

This file was added.

				# Copyright (C) 2016 The Software Heritage developers
				zackUnsubmitted Done Inline Actions nitpick: as this is new code written in 2016, you can drop "2015-" from here zack: nitpick: as this is new code written in 2016, you can drop "2015-" from here
				# See the AUTHORS file at the top-level directory of this distribution
				# License: GNU General Public License version 3, or any later version
				# See top-level LICENSE file for more information

				import abc

				from ..objstorage import ObjStorage, ID_HASH_ALGO
				from ..exc import ObjNotFoundError, Error

				from swh.core import hashutil

				from libcloud.storage import providers
				from libcloud.storage.types import ObjectDoesNotExistError

				zackUnsubmitted Done Inline Actions as this is an abstract class not meant to be implemented, shouldn't it use the abc machinery? zack: as this is an abstract class not meant to be implemented, shouldn't it use the [[ https://docs.
				qcamposAuthorUnsubmitted Done Inline Actions That feature is awesome! Is there any don't use-case for this? Because it could be used in a lot of classes in the swh.objstorgae module. qcampos: That feature is awesome! Is there any //don't use//-case for this? Because it could be used in…

				class CloudObjStorage(ObjStorage, metaclass=abc.ABCMeta):
				""" Abstract ObjStorage that allows connection to a cloud using Libcloud
				zackUnsubmitted Done Inline Actions so, is this really a "should" or a "shall"? does this class do anything useful if not subclassed or not? zack: so, is this really a "should" or a "shall"? does this class do anything useful if not…

				Implementations of this class must redefine the _get_provider method to
				zackUnsubmitted Done Inline Actions "correct object" is not very useful doc for people looking into how to implement a specific concrete subclass of this abstract class. Maybe you can add a pointer to the relevant entry in the libcloud doc here? zack: "correct object" is not very useful doc for people looking into how to implement a specific…
				make it return a driver provider (i.e. object that supports `get_driver`
				method) which return a LibCloud driver
				(see https://libcloud.readthedocs.io/en/latest/storage/api.html).
				"""

				def __init__(self, api_key, api_secret_key, container_name):
				self.driver = self._get_driver(api_key, api_secret_key)
				self.container_name = container_name
				self.container = self.driver.get_container(
				container_name=container_name)

				def _get_driver(self, api_key, api_secret_key):
				""" Initialize a driver to communicate with the cloud

				Args:
				api_key: key to connect to the API.
				api_secret_key: secret key for authentification.

				Returns:
				a Libcloud driver to a cloud storage.
				"""
				# Get the driver class from its description.
				cls = providers.get_driver(self._get_provider())
				# Initialize the driver.
				return cls(api_key, api_secret_key)

				@abc.abstractmethod
				def _get_provider(self):
				""" Get a libcloud driver provider
				zackUnsubmitted Done Inline Actions a (even just minimal) docstring here would be nice zack: a (even just minimal) docstring here would be nice

				This method must be overriden by subclasses to specify which of the
				native libcloud driver the current storage should connect to.
				Alternatively, provider for a custom driver may be returned, in which
				case the provider will have tu support `get_driver` method.
				"""
				raise NotImplementedError('%s must implement `get_provider` method'
				% type(self))

				def __contains__(self, obj_id):
				zackUnsubmitted Done Inline Actions The warning is useful, but should probably explain in what it would be inefficient. Is it just because it's a ping-pong with remote storage over the network, or is it something worse, like in-memory caching of the whole list? A (minimal) docstring here would be nice too. zack: The warning is useful, but should probably explain in what it would be inefficient. Is it just…
				try:
				self._get_object(obj_id)
				except ObjectDoesNotExistError:
				return False
				else:
				return True
				zackUnsubmitted Done Inline Actions d'oh :( if there is really no direct method to retrieve the size of the storage, I believe we should either not implement this method, or equip it with the same warning that goes with iter, otherwise it might be a timebomb zack: d'oh :( if there is really no direct method to retrieve the size of the storage, I believe we…
				qcamposAuthorUnsubmitted Done Inline Actions It doesn't seems a method to compute the size of a storage or a container exists. This implementation was the initial one in the ObjStorage, though I didn't write the same docstring saying that this method (Iter and len) wasn't meant to be used in production. I don't know if we will need to list the full content of an objstorage, or get its size ; if so, I can search for a better implementation we may use in production. qcampos: It doesn't seems a method to compute the size of a storage or a container exists. This…

				def __iter__(self):
				""" Iterate over the objects present in the storage

				Warning: Iteration over the contents of a cloud-based object storage
				may have bad efficiency: due to the very high amount of objects in it
				and the fact that it is remote, get all the contents of the current
				object storage may result in a lot of network requests.

				You almost certainly don't want to use this method in production.
				"""
				yield from map(lambda obj: obj.name,
				self.driver.iterate_container_objects(self.container))

				zackUnsubmitted Done Inline Actions This and a lot of other docstrings in this module seem to be copy-paste from objstorage.py, which would make them a pain to maintain. Rather than doing that, just mention "see base class [CLASS NAME]", in a way that allows to hyperlink to that at document generation time. zack: This and a lot of other docstrings in this module seem to be copy-paste from objstorage.py…
				def __len__(self):
				""" Compute the number of objects in the current object storage.

				Warning: this currently uses `__iter__`, its warning about bad
				performance applies.

				Returns:
				number of objects contained in the storage.
				"""
				return sum(1 for i in self)

				def add(self, content, obj_id=None, check_presence=True):
				if obj_id is None:
				# Checksum is missing, compute it on the fly.
				h = hashutil._new_hash(ID_HASH_ALGO, len(bytes))
				h.update(bytes)
				obj_id = h.digest()

				if check_presence and obj_id in self:
				return obj_id

				self._put_object(content, obj_id)
				return obj_id

				def restore(self, content, obj_id=None):
				return self.add(content, obj_id, chech_presence=False)
				ardumontUnsubmitted Not Done Inline Actions Typo in parameter's name: `chech_presence` -> `check_presence`. ardumont: Typo in parameter's name: `chech_presence` -> `check_presence`.

				def get(self, obj_id):
				return bytes(self._get_object(obj_id).as_stream())

				def check(self, obj_id):
				# Check that the file exists, as _get_object raises ObjNotFoundError
				self._get_object(obj_id)
				# Check the content integrity
				obj_content = self.get(obj_id)
				# TODO factorize the hash computation.
				h = hashutil._new_hash(ID_HASH_ALGO, len(obj_content))
				h.update(obj_content)
				content_obj_id = h.digest()
				if content_obj_id != obj_id:
				raise Error(obj_id)

				def _get_object(self, obj_id):
				""" Get a Libcloud wrapper for an object pointer.

				This wrapper does not retrieve the content of the object direclty.
				"""
				hex_obj_id = hashutil.hash_to_hex(obj_id)
				try:
				return self.driver.get_object(self.container_name, hex_obj_id)
				except ObjectDoesNotExistError as e:
				raise ObjNotFoundError(e.object_name)

				def _put_object(self, content, obj_id):
				""" Create an object in the cloud storage.

				Created object will contains the content and be referenced by the
				given id.
				"""
				hex_obj_id = hashutil.hash_to_hex(obj_id)
				zackUnsubmitted Done Inline Actions It is actually true that integrity is guaranteed by the cloud in the sense we want? We want that the key of an object, in the way we compute it matches the associate data. Is that guarantee part of libcloud contract with all its backend implementation? if it is, it's pointless to actually retrieve the object here. It would be much better to just check for object existence, e.g. with some sort of "HEAD" method if it is not, then we should really retrieve the object, compute its checksum according to our checksum algo, and verify it matches. Also: keep in mind that the algorithm that we use for key checking might evolve in the future, so if by coincidence today it matches what libcloud backends do, it might not be the case in the future zack: It is actually true that integrity is guaranteed by the cloud in the sense we want? We want…
				qcamposAuthorUnsubmitted Done Inline Actions Oh, right. I was thinking about some bite-rot corruption, not id mismatching the content. We do compute the hash ourselves (if not provided) and use it as a key for the storage. Libcloud doesn't do anything special (Although I guess the cloud behind will do whatever it needs with a hash to store the data) it will be retrievable with the key we provided). However, the content might be invalid if the id provided in the `add` method was wrong. Didn't though about that case. As a note, the _get_object returns a Libcloud `Object` that is just a pointer to the object and does not retrieve the content unless a `download` or `as_stream` method is used (For what I've seen, `Object.as_stream()` delegates to `driver.download_as_stream(object, ...)`. qcampos: Oh, right. I was thinking about some bite-rot corruption, not id mismatching the content. We…
				self.driver.upload_object_via_stream(iter(content), self.container,
				hex_obj_id)
				zackUnsubmitted Done Inline Actions The diff looks generally good. I've just a remaining nitpick about the docstrings that add nothing w.r.t. the parent class. I've checked, and if you override a method in a subclass but not specify a docstring, Python 3.x will automatically find the docstring of the parent class. So it is indeed pointless to have docstrings that just say "see parent class". Can you remove all of them, unless they add something that is specific to the subclass / method override? zack: The diff looks generally good. I've just a remaining nitpick about the docstrings that add…