
Auto-detect indexer tool versions instead of reading them from the config
Closed, Migrated

Description

Reading them from the config is error-prone, e.g. the version declared in the config can differ from the version actually installed on the system.

Event Timeline

vlorentz triaged this task as Normal priority. Jan 8 2019, 2:37 PM
vlorentz created this task.
vlorentz added a subscriber: ardumont.

Can you elaborate on how this would be implemented?

In particular, I'd be against trusting any version self-declared by the tool in use (e.g., by invoking --version), precisely because that information might be wrong (e.g., the self-declared version might be stale in the source code). It should be on us, as the project running the tools, to declare which version of each tool we are running. Since this is important reproducibility information, we shouldn't trust the tool's output.

Maybe this is not what you plan here, but I felt it was useful to anticipate the concern :-)

> any version self-declared by the tool in use

That was my first thought, but since not all tools expose a version (e.g. file_magic and python-magic), that's not possible anyway.

I am currently investigating pkg_resources.

Proposal for Python packages:

import pkgutil # stdlib
import pkg_resources # part of setuptools

# dist_spec is the distribution name and module_name the importable module;
# both are set in the python-magic and file_magic examples below

# Get a "package" by its unique name
# (avoids name clashes, like between `python-magic` and `file_magic`)
dist = pkg_resources.get_distribution(dist_spec)

# This is usually a FileFinder for ~/.local/lib/python3.X/site-packages or /usr/lib/python3.X/site-packages
importer = pkgutil.get_importer(dist.module_path)

# Actually import the module
module = importer.find_module(module_name).load_module()

For instance, for python-magic:

>>> dist_spec = 'python-magic'
>>> module_name = 'magic'
>>> dist = pkg_resources.get_distribution(dist_spec)
>>> importer = pkgutil.get_importer(dist.module_path)
>>> magic = importer.find_module(module_name).load_module()
>>> print(dist.version)
0.4.15
>>> magic.Magic.from_buffer
<function Magic.from_buffer at 0x7fb4b08c8c80>
>>> magic.detect_from_content
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'magic' has no attribute 'detect_from_content'

And for file_magic:

>>> dist_spec = 'file_magic'
>>> module_name = 'magic'
>>> dist = pkg_resources.get_distribution(dist_spec)
>>> importer = pkgutil.get_importer(dist.module_path)
>>> magic = importer.find_module(module_name).load_module()
>>> print(dist.version)
0.3.0
>>> magic.Magic.from_buffer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'Magic' has no attribute 'from_buffer'
>>> magic.detect_from_content
<function detect_from_content at 0x7fb4b08f3c80>

These version numbers are extracted from site-packages/*.dist-info/, which is written by package managers.
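
For reference, the same metadata lookup can be sketched with only the standard library: importlib.metadata (Python 3.8+) reads the version from the installed distribution's metadata, much like pkg_resources does. The helper name installed_version is made up for illustration and is not part of the indexer.

```python
from importlib import metadata  # stdlib since Python 3.8
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the version recorded in the distribution's metadata.

    Hypothetical helper: returns None when no distribution with that
    name is installed, instead of raising.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Look up by distribution name, not by module name, so that
    # python-magic and file_magic are told apart even though both
    # provide a module called `magic`.
    for name in ("python-magic", "file_magic"):
        print(name, installed_version(name))
```

Like the pkg_resources version, this only reports what the package manager wrote down; it says nothing about which package manager or repository the package came from.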

That's as close to a unique identifier as we can get using only a name and a version.
To get a really unique identifier, we would also need to know which package manager was used, and the repository used by the package manager. Is it worth it?

Another concern is non-Python packages, like libmagic itself. Other than querying the dpkg database and hoping for the best (i.e. that no other version is installed in ~/.local or /usr/local), I don't see how to do it.
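
A minimal sketch of what "asking the dpkg db" could look like, assuming a Debian-like system with dpkg-query on the PATH. The helper name dpkg_version is hypothetical, and the package name libmagic1 in the usage lines is an assumption about how the distribution names its libmagic package.

```python
import shutil
import subprocess
from typing import Optional


def dpkg_version(package: str) -> Optional[str]:
    """Ask the dpkg database for the installed version of `package`.

    Hypothetical helper: returns None if dpkg-query is unavailable
    or the package is not installed.
    """
    if shutil.which("dpkg-query") is None:
        return None
    result = subprocess.run(
        ["dpkg-query", "--show", "--showformat=${Version}", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    if result.returncode != 0:
        return None
    # A package that is known to dpkg but not installed can yield an
    # empty version string; treat that as "not installed" too.
    return result.stdout.strip() or None


if __name__ == "__main__":
    # "libmagic1" is an assumed package name, for illustration only.
    print(dpkg_version("libmagic1"))
```

This has the limitation mentioned above: it only reflects what dpkg knows about, so a copy installed by hand under /usr/local would go unnoticed.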


Though if we decide to stick with hardcoding tool versions in the config, we could add some runtime checks for Python packages:

>>> dist_spec = 'file_magic==0.4.0'
>>> dist = pkg_resources.get_distribution(dist_spec)
[...]
pkg_resources.VersionConflict: (file-magic 0.3.0 (/usr/lib/python3/dist-packages), Requirement.parse('file_magic==0.4.0'))
>>> dist_spec = 'file_magic==0.3.0'
>>> dist = pkg_resources.get_distribution(dist_spec)
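
Such a check could be wrapped in a small function that reports every mismatch at once. The following is a sketch, not existing indexer code: the function name version_mismatches, the shape of the `declared` mapping, and the injectable `lookup` parameter are all assumptions made for illustration. It uses importlib.metadata from the standard library rather than pkg_resources, and compares version strings exactly (so "0.3" and "0.3.0" count as different).

```python
from importlib import metadata  # stdlib since Python 3.8
from typing import Callable, Dict, List, Optional


def _installed_version(dist_name: str) -> Optional[str]:
    """Version from the installed distribution's metadata, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(
    declared: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = _installed_version,
) -> List[str]:
    """Compare versions declared in the config with what is installed.

    `declared` maps distribution names to the version string from the
    config (a hypothetical layout). Returns one message per problem;
    an empty list means everything matches.
    """
    problems = []
    for name, expected in sorted(declared.items()):
        actual = lookup(name)
        if actual is None:
            problems.append(f"{name}: declared {expected}, but not installed")
        elif actual != expected:
            problems.append(f"{name}: declared {expected}, found {actual}")
    return problems


if __name__ == "__main__":
    for message in version_mismatches({"file_magic": "0.4.0"}):
        print(message)
```

Passing `lookup` explicitly makes the check easy to exercise without installing anything, e.g. with a plain dict's `.get` standing in for the real metadata lookup.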

Summary of IRL chat with @zack and @ardumont: for now, we'll use pkg_resources for Python modules, and keep the configuration for other tools.