swh/loader/base/loader.py
- This file was added.
# Copyright (C) 2019 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import os
import shutil

from abc import abstractmethod
from tempfile import mkdtemp

from swh.core import tarball
from swh.loader.core.utils import clean_dangling_folders
from swh.loader.core.loader import BufferedLoader
from swh.model.hashutil import MultiHash, HASH_BLOCK_SIZE
from swh.model.from_disk import Directory
from swh.model.identifiers import (
    identifier_to_bytes, revision_identifier
)
from swh.storage.algos.snapshot import snapshot_get_all_branches

from .abstractattribute import AbstractAttribute
from .build import construct_revision

DEBUG_MODE = '** DEBUG MODE **'
anlambert: How about renaming that class to `PackageLoader`? This feels more explicit about what it is intended to do.

ardumont: I like it better indeed!
About moving these files: I asked to move those back in core; keep it there at the moment. That's what we (olasd, nahimilega and me) agreed on from the start.
class BaseLoader(BufferedLoader, construct_revision):
    """
    Required Overrides:
        loader_name
        class_name
        def convert_to_standard_format

    Optional Overrides:
        def cleanup_artifact
        def extract_metadata
    """

ardumont: Improve the description here. For inspiration, you can have a look at the `swh.model.merkle.MerkleNode` class.

    loader_name = AbstractAttribute("Name of the package manager")  # e.g. pypi
    class_name = AbstractAttribute("Name of the loader class")  # e.g. PyPILoader
    def __init__(self):
        super().__init__(logging_class='swh.loader.%s.%s' % (self.loader_name,
                                                             self.class_name))
        self.TEMPORARY_DIR_PREFIX_PATTERN = 'swh.loader.%s.' % self.loader_name
        self.CONFIG_BASE_FILENAME = 'loader/%s' % self.loader_name
        self.ADDITIONAL_CONFIG = {
            'temp_directory': ('str',
                               '/tmp/swh.loader.%s/' % self.loader_name),
            'cache': ('bool', False),
            'cache_dir': ('str', ''),
            'debug': ('bool', False),  # NOT FOR PRODUCTION
        }
        self.local_cache = None
ardumont: Those should be docstrings, not values of the variables:
```
loader_name = None
"""Package manager name"""

class_name = None
"""Loader class name"""
```
etc...
        self.dir_path = None
        temp_directory = self.config['temp_directory']
        os.makedirs(temp_directory, exist_ok=True)
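The keys registered in `ADDITIONAL_CONFIG` above would typically be satisfied by a small YAML configuration file. A purely illustrative sketch (the filename and values are assumptions, not part of this diff):

```yaml
# loader-pypi.yml -- hypothetical configuration matching ADDITIONAL_CONFIG
temp_directory: /tmp/swh.loader.pypi/
cache: false
cache_dir: ''
debug: false  # NOT FOR PRODUCTION
```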
ardumont: If those are for the configuration file to load, we are moving away from this (even though that's not apparent in the code). We use the environment variable stanza (I explained it to you on IRC) to deploy those, be that in docker-dev or the production environment.

nahimilega: Shall I also mention the environment variable stanza in the docstring or someplace? Like:
```
export SWH_CONFIG_FILENAME=/path/to/loader-gnu.yml
```

ardumont: I don't really know where to define that part without duplicating... (it's already in docker-dev for other loaders and in prod, since we need it to work there ;)
        self.temp_directory = mkdtemp(
            suffix='-%s' % os.getpid(),
            prefix=self.TEMPORARY_DIR_PREFIX_PATTERN,
            dir=temp_directory)
        self.debug = self.config.get('debug', False)

ardumont: Also, I'd need to check, but the additional_config is also something to deprecate (so to remove here)... Rationale: if it's not maintained anymore, it's dead code, so that must go away.
    @abstractmethod
    def convert_to_standard_format(self, **kwargs):
        """Fetch the metadata and convert it into a standard format.

        The standard format is a dict with the keys:

        `name` (str): holding the name of the package
        `origin_url` (str): holding the origin_url of the package
        `tarballs` (list): a list of dicts where each dict contains
            information related to a single version of the package.
            The `url` key is mandatory and holds the tarball url; other
            keys are optional, as per the availability of metadata.

        Note: the keys `nature` and `response` are reserved keywords and
        cannot be used in the dicts present under the `tarballs` key.

        Args:
            **kwargs: Arbitrary keyword arguments passed by the lister.

        Returns:
            dict: Containing information as directed by the guidelines
                mentioned above.

        Example:
            {
                'name': '8sync',
                'origin_url': 'https://ftp.gnu.org/gnu/8sync/',
                'tarballs': [
                    {'url': 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz',
                     'time_modified': 1562878592},
                    {'url': 'https://ftp.gnu.org/gnu/8sync/8sync-0.2.0.tar.gz',
                     'time_modified': 1599887203},
                    ...
                ]
            }
        """
        pass
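To make the contract above concrete, here is a minimal, hypothetical subclass sketch. The `GNULoader` name and the hard-coded tarball entry are assumptions for illustration, not part of this diff, and the real class would derive from `BaseLoader`:

```python
# Hypothetical sketch of a subclass honouring the standard format
# described in the docstring above; names and data are illustrative.

class GNULoader:  # would derive from BaseLoader in the real code
    def convert_to_standard_format(self, **kwargs):
        # A real implementation would query the lister-provided URL for
        # metadata; here the single tarball entry is hard-coded.
        return {
            'name': kwargs['name'],
            'origin_url': kwargs['origin_url'],
            'tarballs': [
                {'url': kwargs['origin_url'] + '8sync-0.1.0.tar.gz',
                 'time_modified': 1562878592},
            ],
        }
```

Note that the returned dict avoids the reserved `nature` and `response` keys inside the `tarballs` entries, as required.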
    def cleanup_artifact(self, uncompressed_path):
        """Clean up unnecessary files from the downloaded tarball, and
        perform some special operations if needed.

        The implementation of this method depends on the file structure
        of the tarball. It is used to remove files from the uncompressed
        tarball that are not to be archived (e.g. binaries).

        Args:
            uncompressed_path (str): Path of the uncompressed tarball

        Returns:
            uncompressed_path (str): Path of the uncompressed tarball
                after removing unnecessary files
        """
        return uncompressed_path

anlambert: From my point of view, this method is useless and could be removed if we slightly update the package listers implementation (GNU, CRAN, npm, PyPI, ...). Those listers should provide the following parameters to the loader: name, origin_url and tarballs.

ardumont: I agree again. Note that it means a migration from the existing recurring tasks in the scheduler db (out of scope for this task ;). If we converge with a use case, we can change the lister implementation to unify…

nahimilega: I am again not sure we should do this because, as olasd once said, we parallelize the loading task. Hence we should not keep the API call (to get metadata) step in the listers.

ardumont: Something rings a bell indeed, but your sentences are a bit unclear. I'm not sure I see the causality between parallelizing the loading task and moving the API call to fetch metadata loader-side. That a lister takes some time to list is not per se a problem. In any case, if this method is the one that solves the metadata loading, that's not clear from its name.

nahimilega: My point was: when I was making the packagist lister, in the first pass I kept the metadata retrieval step (it would send an API request for each package to get metadata) in the lister itself. While testing it on my PC, it took about 3 hours to list only about 5% of the total packages. Now, if this task is done during loading, the lister will pass the metadata URL of the package to the loader, and we will have a separate loading task for each package. If we can parallelise the tasks (i.e. run many loading tasks at the same time), then we could make metadata requests for many packages at the same time, which would make the whole process fast. One more point: it will make the whole system more robust. Suppose there are exceptions in the metadata API for one or two packages, like some field not being present, or a field that was present when the lister was developed but has now been removed. Hope it is a bit clearer what I am trying to say ;)
I will try to make it more explicit, but actually the task of that method is to make the metadata request (if needed) and convert the output coming from the lister to a standard format which is used in the loader. The standard format is the one shown in the docstring.

anlambert: @nahimilega, I totally agree with you that the metadata extraction for a package must be handled loader-side. This indeed calls for a dedicated method in the PackageLoader class, which derived ones must reimplement, to perform that task. But keep in mind that metadata extraction must be performed for each version of a package, and that most of the time there is no need to perform an API call to a remote service, as package contents usually provide a file filled with this information that can be easily parsed (package.json, PKG-INFO, ...). So trying to extract metadata before iterating on each version of a package to load into the archive is not the right solution. I was also confused by the naming of the method; convert_to_standard_format is not really explicit about its role. As @ardumont explained, every processing step for loading the different versions of a package into the archive should be clearly identified, with enough granularity to cover all package managers available in the wild. This implies a lot of brainstorming to do it well and to define an API that will be easily usable and maintainable ;-) Let's continue that work next week!

nahimilega: But for the package managers where we don't even know the versions (e.g. PyPI) and we need to make an API call to get a list of package versions, convert_to_standard_format can be used in that case. And for the extraction of metadata from the provided file (package.json, PKG-INFO, ...), the extract_metadata method can be used; it iterates over each version of a package.
    def extract_metadata(self, package_path, package_source_data):
        """Fetch the metadata from the downloaded file.
        """
        return package_source_data
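As a hedged sketch of what such an override could look like for a package manager that ships a `PKG-INFO`-style metadata file: the file name, the `Key: value` parsing, and the standalone-function form are assumptions for illustration, not part of this diff.

```python
import os


def extract_pkg_info(package_path, package_source_data):
    # Hypothetical extract_metadata override: parse a PKG-INFO style
    # "Key: value" file found in the unpacked package, and attach the
    # parsed fields to the per-version dict under a 'metadata' key.
    pkg_info = os.path.join(package_path, 'PKG-INFO')
    if not os.path.exists(pkg_info):
        return package_source_data  # nothing to extract
    metadata = {}
    with open(pkg_info) as f:
        for line in f:
            if ':' in line:
                key, _, value = line.partition(':')
                metadata[key.strip().lower()] = value.strip()
    package_source_data['metadata'] = metadata
    return package_source_data
```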
    # You probably don't need to override anything below this line.
    def prepare_origin_visit(self, **kwargs):
        """Prepare package visit.

        Args:
            **kwargs: Arbitrary keyword arguments passed by the lister.
        """
        # reset statuses
        self._load_status = 'uneventful'
        self._visit_status = 'full'
        self.done = False
        # fetch the package metadata from the registry
        self.package_details = self.convert_to_standard_format(**kwargs)
        self.origin = {
            'url': self.package_details['origin_url'],
            'type': self.loader_name,
        }
        self.visit_date = None  # loader core will populate it

anlambert: The correct signature of this method is:
```lang=python
def prepare_origin_visit(self, *args, **kwargs):
```
The loader will otherwise fail when executed by a celery worker (this is what we use in production to schedule and execute the loading of software origins into the archive).
    def prepare(self, **kwargs):
        """Prepare the effective loading of source tarballs for a
        package.

        Args:
            **kwargs: Arbitrary keyword arguments passed by the lister.
        """
        self.contents = []
        self.directories = []
        self.revisions = []
        self.package_temp_dir = os.path.join(self.temp_directory,
                                             self.package_details['name'])
        last_snapshot = self.last_snapshot()
        self.known_versions = self.known_versions(last_snapshot)
        self.new_artifacts = \
            self.prepare_package_versions(self.package_details['tarballs'],
                                          self.known_versions)

anlambert: If we remove the `convert_to_standard_format` method and ensure consistent inputs provided by the listers to that loader, this could be changed into: `self.package_details = kwargs`

nahimilega: I don't think that's a good idea. `convert_to_standard_format` is provided for loaders where we need to make an API call to get the metadata (e.g. PyPI).

anlambert: Same here, the correct signature is:
```lang=python
def prepare(self, *args, **kwargs):
```
    def _prepare_package_version(self, package_source_data, tarball_request):
        """Process a package release version.

        The following operations are performed:

        1. Download the tarball
        2. Uncompress the tarball
        3. Delete unnecessary files (optional)
        4. Parse the file associated to the package version to extract
           metadata (optional)

        Args:
            package_source_data (list): a list of dicts containing
                information about the respective tarball that is
                provided by the lister.
            known_versions (dict): may be provided by the loader; it
                enables filtering out versions already ingested in the
                archive.

        Returns:
            Tuple[dict, str]: a tuple containing the following members:

                * a dict holding package tarball information and metadata
                * a string holding the path of the uncompressed package
                  to load into the archive
        """
        url = package_source_data['url']
        tarball_path, hashes = self.download_generate_hash(tarball_request,
                                                           url)
        uncompressed_path = os.path.join(self.package_temp_dir, 'uncompressed',
                                         url)  # SEE ME
        package_source_data['nature'] = self.uncompress_tarball(
            tarball_path, uncompressed_path)
        # remove tarball
        os.remove(tarball_path)
        if self.tarball_invalid:
            return None, None
        package_path = self.cleanup_artifact(uncompressed_path)
        package_source_data = self.extract_metadata(package_path,
                                                    package_source_data)
        return package_source_data, package_path

nahimilega: Change the function name from `known_versions()` to `known_version()` because there is a class variable `self.known_versions`.

anlambert: I would rather rename it to `get_known_versions`; the plural use is important here.

ardumont: I agree!

anlambert: `new_versions` seems a better name here.
    def fetch_data(self):
        """Called once per release artifact version (can be many for one
        release).

        This will, for each call:

        - retrieve a release artifact (associated to a release version)
        - compute the swh objects

        Returns:
            True as long as data to fetch exist
        """
        data = None
        if self.done:
            return False
        try:
            data = next(self.new_artifacts)
            self._load_status = 'eventful'
        except StopIteration:
            self.done = True
            return False
        package_source_data, dir_path = data
        # package release tarball was corrupted
        if self.tarball_invalid:
            return not self.done
        dir_path = dir_path.encode('utf-8')
        directory = Directory.from_disk(path=dir_path, data=True)
        objects = directory.collect()
        if 'content' not in objects:
            objects['content'] = {}
        if 'directory' not in objects:
            objects['directory'] = {}
        self.contents = objects['content'].values()
        self.directories = objects['directory'].values()
        revision = self.compute_revision(directory,
                                         package_source_data)  # FIX ME
        revision['id'] = identifier_to_bytes(
            revision_identifier(revision))
        self.revisions.append(revision)
        # Change me to be compatible with if-modified-since
        package_key = package_source_data[self.compare_field]  # check for this
        self.known_versions[package_key] = revision['id']  # SEE ME
        self.log.debug('Removing unpacked package files at %s', dir_path)
        shutil.rmtree(dir_path)
        return not self.done
    def last_snapshot(self):
        """Retrieve the last snapshot of the package, if any.
        """
        # Done
        visit = self.storage.origin_visit_get_latest(
            self.origin['url'], require_snapshot=True)
        if visit:
            return snapshot_get_all_branches(self.storage, visit['snapshot'])

nahimilega: Previously the function definition was `def check_file(self, length, filepath):` but it was called as `self.check_file(filepath, length)`, which led to assigning the filepath value to the length variable. Fixed this bug.

nahimilega: Can you please check once for this method that the docstring is appropriate and correctly explains what is done in the function.

ardumont: It's the method we use everywhere to compute hashes on a file. Given one response which holds a raw file to some request, this method does 2 things: it writes the file to disk and computes its hashes. Implementation-wise, this does this in one round-trip. It then returns the hashes (dict) of the file. IIRC, the writing part was done in a cache (on disk).
    def store_data(self):
        """Store fetched data in the database.
        """
        # Done
        self.maybe_load_contents(self.contents)
        self.maybe_load_directories(self.directories)
        self.maybe_load_revisions(self.revisions)
        if self.done:
            self.generate_and_load_snapshot()
            self.flush()

    def generate_and_load_snapshot(self):
        """
        Make me
        """
        pass
    def download_generate_hash(self, response, url):
        """Store the file in a temp directory and compute the hashes of
        its content.

        Args:
            response (Response): Server response for the url
            url (str): Url of the tarball

        Returns:
            Tuple of (local filepath, hashes of the file content)
        """
        # Done
        length = int(response.headers['content-length'])
        # SEE ME
        filepath = os.path.join(self.package_temp_dir, os.path.basename(url))
        h = MultiHash(length=length)
        with open(filepath, 'wb') as f:
            for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
                h.update(chunk)
                f.write(chunk)
        actual_length = os.path.getsize(filepath)
        if length != actual_length:
            raise ValueError('Error when checking size: %s != %s' % (
                length, actual_length))
        hashes = {
            'length': length,
            **h.hexdigest()
        }
        return filepath, hashes
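The single-pass write-and-hash pattern used above comes from `swh.model.hashutil.MultiHash`. A dependency-free sketch of the same idea using only `hashlib` (the function name and the choice of hash algorithms are assumptions for illustration) might look like:

```python
import hashlib

HASH_BLOCK_SIZE = 32768  # chunk size, mirroring the constant used above


def write_and_hash(chunks, filepath):
    # Write a stream of byte chunks to disk while feeding the same
    # chunks to several hash objects, all in a single pass.
    hashers = {name: hashlib.new(name) for name in ('sha1', 'sha256')}
    length = 0
    with open(filepath, 'wb') as f:
        for chunk in chunks:
            for h in hashers.values():
                h.update(chunk)
            f.write(chunk)
            length += len(chunk)
    hashes = {name: h.hexdigest() for name, h in hashers.items()}
    hashes['length'] = length
    return hashes
```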
    def uncompress_tarball(self, filepath, path):
        """Uncompress a tarball.

        Args:
            filepath (str): Path of the tarball to uncompress
            path (str): The destination folder where to uncompress the
                tarball

        Returns:
            The nature of the tarball, zip or tar.
        """
        # Done
        # filepath = tempdir + url
        try:
            self.tarball_invalid = False
            return tarball.uncompress(filepath, path)
        except Exception:
            self.tarball_invalid = True
            return None
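`swh.core.tarball.uncompress` does the real work here. A rough stdlib-only approximation of the idea (detect the archive nature, extract it, and return `'zip'` or `'tar'` as the docstring describes) could be sketched as follows; note it performs no path-traversal hardening and is not a substitute for the real helper:

```python
import os
import tarfile
import zipfile


def uncompress(filepath, dest):
    # Detect the archive nature, extract it into dest, and return the
    # nature string. Sketch only: no hardening against malicious paths.
    os.makedirs(dest, exist_ok=True)
    if zipfile.is_zipfile(filepath):
        with zipfile.ZipFile(filepath) as z:
            z.extractall(dest)
        return 'zip'
    if tarfile.is_tarfile(filepath):
        with tarfile.open(filepath) as t:
            t.extractall(dest)
        return 'tar'
    raise ValueError('%s is not a supported archive' % filepath)
```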
    def pre_cleanup(self):
        """To prevent disk explosion if some other workers exploded
        in mid-air (OOM killed), we try and clean up dangling files.
        """
        # Done
        if self.debug:
            self.log.warn('%s Will not pre-clean up temp dir %s' % (
                DEBUG_MODE, self.temp_directory
            ))
            return
        clean_dangling_folders(self.config['temp_directory'],
                               pattern_check=self.TEMPORARY_DIR_PREFIX_PATTERN,
                               log=self.log)
    def flush(self):
        """Flush any potential dangling data not sent to swh-storage.

        Bypass the maybe_load_* methods, which await the
        threshold-reached signal. We actually want to store those, as we
        are done loading.
        """
        # Done
        contents = self.contents.pop()
        directories = self.directories.pop()
        revisions = self.revisions.pop()
        releases = self.releases.pop()
        # and send those to storage if asked
        if self.config['send_contents']:
            self.send_batch_contents(contents)
        if self.config['send_directories']:
            self.send_batch_directories(directories)
        if self.config['send_revisions']:
            self.send_batch_revisions(revisions)
        if self.config['send_releases']:
            self.send_batch_releases(releases)
        if self.config['send_snapshot'] and self.snapshot:
            self.send_snapshot(self.snapshot)
    def cleanup(self):
        """Clean up temporary disk use after downloading and extracting
        package tarballs.
        """
        # Done
        if self.debug:
            self.log.warn('%s Will not clean up temp dir %s' % (
                DEBUG_MODE, self.temp_directory
            ))
            return
        if os.path.exists(self.temp_directory):
            self.log.debug('Clean up %s' % self.temp_directory)
            shutil.rmtree(self.temp_directory)

ardumont: Feels like those `cleanup` and `pre_cleanup` are already implemented elsewhere (in the loader-core?)...

nahimilega: https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/core/loader.

ardumont: Right, that must be in the other loaders then (mercurial, svn, pypi, npm).
ardumont: I don't understand that method.

nahimilega: There is a HEAD for every branch that is created by the loader. Determining it is possible in the NPM loader because it can find the latest release, since the version of the package is provided in the metadata. But there could be some package managers (in fact many) which do not provide the version metadata, or any metadata which could be used to determine the latest release and hence the HEAD. So I am not sure how to properly identify the head of the branch; hence, for testing purposes, I have done `if True:`. One possible way out could be to just leave this method to be implemented in all the loaders separately (as done for `convert_to_standard_format()`). Is there any better approach to solve this problem? And is it necessary to have a HEAD for the branch?

nahimilega: Improve docstring.

nahimilega: Same problem of unavailability of metadata as mentioned in the inline comment on `find_head()`. Is there any better approach to solve this problem?
anlambert: How about renaming that class to `PackageLoader`? This feels more explicit about what it is intended to do.
Also, I would rename the folder containing these new source files from `base` to `package`.
Maybe this could also be moved in a new dedicated swh-loader-package repository, but let's keep it there at the moment.