diff --git a/PKG-INFO b/PKG-INFO
index 7c3d5af..005842a 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,69 +1,69 @@
Metadata-Version: 2.1
Name: swh.indexer
-Version: 0.0.169
+Version: 0.0.170
Summary: Software Heritage Content Indexer
Home-page: https://forge.softwareheritage.org/diffusion/78/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer
Description: swh-indexer
============
Tools to compute multiple indexes on SWH's raw contents:

- content:
  - mimetype
  - ctags
  - language
  - fossology-license
  - metadata
- revision:
  - metadata

An indexer is in charge of:

- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:

- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers

Indexation procedure:

- receive a batch of ids
- retrieve the associated data depending on the object type
- compute an index for each object
- store the result in swh's storage

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding
  and mimetype
- language (queue swh_indexer_content_language): detect the
  programming language
- ctags (queue swh_indexer_content_ctags): compute tag information
- fossology-license (queue swh_indexer_fossology_license): compute the
  license
- metadata: translate a file into a translated_metadata dict

Current revision indexers:

- metadata: detects files containing metadata and retrieves the
  translated_metadata from the content_metadata table in storage, or runs
  the content indexer to translate the files.
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
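
The "Indexation procedure" described in the PKG-INFO text above is a plain
fetch/compute/store loop. A minimal, self-contained sketch of that loop (an
editor's illustration, not part of this patch: the mimetypes-based detector
and the object dicts are hypothetical stand-ins for the real storage lookups
and libmagic-based detection):

    import mimetypes
    from typing import Any, Dict, Iterable, List

    def compute_mimetype_index(name: str, data: bytes) -> Dict[str, Any]:
        # Guess encoding/mimetype from the file name alone; the real
        # mimetype indexer inspects the raw bytes themselves.
        mimetype, encoding = mimetypes.guess_type(name)
        return {'mimetype': mimetype, 'encoding': encoding}

    def index_batch(objects: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Receive a batch of ids, retrieve the associated data, compute
        # the index, and collect the results for storage.
        results = []
        for obj in objects:
            results.append({
                'id': obj['id'],
                **compute_mimetype_index(obj['name'], obj['data']),
            })
        return results  # a real indexer would persist these to swh storage

    print(index_batch([{'id': b'\x01', 'name': 'README.md', 'data': b'# hi'}]))
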
diff --git a/swh.indexer.egg-info/PKG-INFO b/swh.indexer.egg-info/PKG-INFO
index 7c3d5af..005842a 100644
--- a/swh.indexer.egg-info/PKG-INFO
+++ b/swh.indexer.egg-info/PKG-INFO
@@ -1,69 +1,69 @@
Metadata-Version: 2.1
Name: swh.indexer
-Version: 0.0.169
+Version: 0.0.170
Summary: Software Heritage Content Indexer
Home-page: https://forge.softwareheritage.org/diffusion/78/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer
Description: swh-indexer
============
Tools to compute multiple indexes on SWH's raw contents:

- content:
  - mimetype
  - ctags
  - language
  - fossology-license
  - metadata
- revision:
  - metadata

An indexer is in charge of:

- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:

- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers

Indexation procedure:

- receive a batch of ids
- retrieve the associated data depending on the object type
- compute an index for each object
- store the result in swh's storage

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding
  and mimetype
- language (queue swh_indexer_content_language): detect the
  programming language
- ctags (queue swh_indexer_content_ctags): compute tag information
- fossology-license (queue swh_indexer_fossology_license): compute the
  license
- metadata: translate a file into a translated_metadata dict

Current revision indexers:

- metadata: detects files containing metadata and retrieves the
  translated_metadata from the content_metadata table in storage, or runs
  the content indexer to translate the files.
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
diff --git a/swh/indexer/metadata.py b/swh/indexer/metadata.py
index f62f89a..339e3cf 100644
--- a/swh/indexer/metadata.py
+++ b/swh/indexer/metadata.py
@@ -1,377 +1,377 @@
# Copyright (C) 2017-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from copy import deepcopy
from typing import Any, Callable, Dict, Iterator, List, Tuple
from swh.core.utils import grouper
from swh.indexer.codemeta import merge_documents
from swh.indexer.indexer import ContentIndexer, RevisionIndexer, OriginIndexer
from swh.indexer.origin_head import OriginHeadIndexer
from swh.indexer.metadata_dictionary import MAPPINGS
from swh.indexer.metadata_detector import detect_metadata
from swh.indexer.storage import INDEXER_CFG_KEY
from swh.model import hashutil
REVISION_GET_BATCH_SIZE = 10
ORIGIN_GET_BATCH_SIZE = 10
def call_with_batches(
        f: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
        args: List[Dict[str, str]], batch_size: int
) -> Iterator[str]:
    """Calls a function with batches of args, and concatenates the results.
    """
    groups = grouper(args, batch_size)
    for group in groups:
        yield from f(list(group))
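
# Editor's note, not in the original file: call_with_batches is how the
# origin indexer below pages its storage lookups, e.g.:
#
#     origins = list(call_with_batches(
#         self.storage.origin_get,
#         [{'url': url} for url in origin_urls], ORIGIN_GET_BATCH_SIZE))
#
# grouper() (from swh.core.utils) splits `args` into chunks of at most
# `batch_size` items, so one oversized storage request becomes several
# small ones whose results are chained back together.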


class ContentMetadataIndexer(ContentIndexer):
    """Content-level indexer

    This indexer is in charge of:

    - filtering out content already indexed in content_metadata
    - reading content from objstorage with the content's id sha1
    - computing metadata by given context
    - using the metadata_dictionary as the 'swh-metadata-translator' tool
    - storing the result in the content_metadata table
    """
    def filter(self, ids):
        """Filter out known sha1s and return only missing ones.
        """
        yield from self.idx_storage.content_metadata_missing((
            {
                'id': sha1,
                'indexer_configuration_id': self.tool['id'],
            } for sha1 in ids
        ))

    def index(self, id, data, log_suffix='unknown revision'):
        """Index sha1s' content and store result.

        Args:
            id (bytes): content's identifier
            data (bytes): raw content in bytes

        Returns:
            dict: dictionary representing a content_metadata. If the
            translation wasn't successful the metadata keys will
            be returned as None
        """
        result = {
            'id': id,
            'indexer_configuration_id': self.tool['id'],
            'metadata': None
        }
        try:
            mapping_name = self.tool['tool_configuration']['context']
            log_suffix += ', content_id=%s' % hashutil.hash_to_hex(id)
            result['metadata'] = \
                MAPPINGS[mapping_name](log_suffix).translate(data)
        except Exception:
            self.log.exception(
                "Problem during metadata translation "
                "for content %s" % hashutil.hash_to_hex(id))
        if result['metadata'] is None:
            return None
        return result

    def persist_index_computations(
            self, results: List[Dict], policy_update: str
    ) -> Dict[str, int]:
        """Persist the results in storage.

        Args:
            results: list of content_metadata, dicts with the
              following keys:

              - id (bytes): content's identifier (sha1)
              - metadata (jsonb): detected metadata

            policy_update: either 'update-dups' or 'ignore-dups' to
              respectively update duplicates or ignore them
        """
        return self.idx_storage.content_metadata_add(
            results, conflict_update=(policy_update == 'update-dups'))
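
# Editor's note, not in the original file: RevisionMetadataIndexer below
# instantiates one ContentMetadataIndexer per detected context, injecting
# the mapping name through the tool configuration:
#
#     cfg['tools'][0]['configuration']['context'] = context
#     c_metadata_indexer = ContentMetadataIndexer(config=cfg)
#
# so index() above can pick the right MAPPINGS entry out of
# self.tool['tool_configuration']['context'].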


class RevisionMetadataIndexer(RevisionIndexer):
    """Revision-level indexer

    This indexer is in charge of:

    - filtering revisions already indexed in the revision_intrinsic_metadata
      table with the defined computation tool
    - retrieving all entry_files in the root directory
    - using metadata_detector for file_names containing metadata
    - computing metadata translation if necessary and possible (depends on
      the tool)
    - sending sha1s to content indexing if possible
    - storing the results for the revision
    """
    ADDITIONAL_CONFIG = {
        'tools': ('dict', {
            'name': 'swh-metadata-detector',
            'version': '0.0.2',
            'configuration': {},
        }),
    }

    def filter(self, sha1_gits):
        """Filter out known sha1s and return only missing ones.
        """
        yield from self.idx_storage.revision_intrinsic_metadata_missing((
            {
                'id': sha1_git,
                'indexer_configuration_id': self.tool['id'],
            } for sha1_git in sha1_gits
        ))

    def index(self, rev):
        """Index rev by processing it and organizing the result.

        Uses metadata_detector to iterate on filenames:

        - if one filename is detected -> send the file to the content indexer
        - if multiple files are detected -> translation is needed at the
          revision level

        Args:
            rev (dict): revision artifact from storage

        Returns:
            dict: dictionary representing a revision_intrinsic_metadata, with
            keys:

            - id (str): rev's identifier (sha1_git)
            - indexer_configuration_id (bytes): tool used
            - metadata: dict of retrieved metadata
        """
        result = {
            'id': rev['id'],
            'indexer_configuration_id': self.tool['id'],
            'mappings': None,
            'metadata': None
        }
        try:
            root_dir = rev['directory']
            dir_ls = list(self.storage.directory_ls(root_dir, recursive=False))
            if [entry['type'] for entry in dir_ls] == ['dir']:
                # If the root is just a single directory, recurse into it
                # e.g. PyPI packages, GNU tarballs
                subdir = dir_ls[0]['target']
                dir_ls = self.storage.directory_ls(subdir, recursive=False)
            files = [entry for entry in dir_ls if entry['type'] == 'file']
            detected_files = detect_metadata(files)
            (mappings, metadata) = self.translate_revision_intrinsic_metadata(
                detected_files,
                log_suffix='revision=%s' % hashutil.hash_to_hex(rev['id']))
            result['mappings'] = mappings
            result['metadata'] = metadata
        except Exception as e:
            self.log.exception(
                'Problem when indexing rev: %r', e)
        return result
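
    # Editor's note, not in the original file: detect_metadata (from
    # swh.indexer.metadata_detector) maps a context name to the sha1s of
    # the files that match it, e.g. (hypothetical values):
    #
    #     {'npm': [sha1_of_package_json], 'codemeta': [sha1_of_codemeta_json]}
    #
    # which is exactly the `detected_files` argument consumed by
    # translate_revision_intrinsic_metadata below.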

    def persist_index_computations(
            self, results: List[Dict], policy_update: str
    ) -> Dict[str, int]:
        """Persist the results in storage.

        Args:
            results: list of revision_intrinsic_metadata, dicts with the
              following keys:

              - id (bytes): revision's identifier (sha1_git)
              - metadata (jsonb): detected metadata
              - mappings (List[str]): list of mappings used

            policy_update: either 'update-dups' or 'ignore-dups' to
              respectively update duplicates or ignore them
        """
        # TODO: add functions in storage to keep data in
        # revision_intrinsic_metadata
        return self.idx_storage.revision_intrinsic_metadata_add(
            results, conflict_update=(policy_update == 'update-dups'))

    def translate_revision_intrinsic_metadata(
            self, detected_files: Dict[str, List[Any]], log_suffix: str
    ) -> Tuple[List[str], Dict]:
        """Determine the plan of action for translating metadata from one or
        multiple detected files.

        Args:
            detected_files: dictionary mapping context names (e.g.,
              "npm", "authors") to lists of sha1

        Returns:
            (List[str], dict): list of mappings used and a dict of
            translated metadata following the CodeMeta vocabulary
        """
        used_mappings = [MAPPINGS[context].name for context in detected_files]
        metadata = []
        tool = {
            'name': 'swh-metadata-translator',
            'version': '0.0.2',
            'configuration': {},
        }
        # TODO: iterate on each context, on each file
        # -> get raw_contents
        # -> translate each content
        config = {
            k: self.config[k]
            for k in [INDEXER_CFG_KEY, 'objstorage', 'storage']
        }
        config['tools'] = [tool]
        for context in detected_files.keys():
            cfg = deepcopy(config)
            cfg['tools'][0]['configuration']['context'] = context
            c_metadata_indexer = ContentMetadataIndexer(config=cfg)
            # sha1s that are already in the content_metadata table
            sha1s_in_storage = []
            metadata_generator = self.idx_storage.content_metadata_get(
                detected_files[context])
            for c in metadata_generator:
                # extracting metadata
                sha1 = c['id']
                sha1s_in_storage.append(sha1)
                local_metadata = c['metadata']
                # local metadata is aggregated
                if local_metadata:
                    metadata.append(local_metadata)

            sha1s_filtered = [item for item in detected_files[context]
                              if item not in sha1s_in_storage]

            if sha1s_filtered:
                # content indexing
                try:
                    c_metadata_indexer.run(sha1s_filtered,
                                           policy_update='ignore-dups',
                                           log_suffix=log_suffix)
                    # on-the-fly results:
                    for result in c_metadata_indexer.results:
                        local_metadata = result['metadata']
                        metadata.append(local_metadata)
                except Exception:
                    self.log.exception(
                        "Exception while indexing metadata on contents")

        metadata = merge_documents(metadata)
        return (used_mappings, metadata)
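
    # Editor's note, not in the original file: merge_documents (from
    # swh.indexer.codemeta) folds the per-file CodeMeta documents collected
    # above into a single document, so the method returns one merged dict
    # per revision rather than one dict per metadata file.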


class OriginMetadataIndexer(OriginIndexer):
    ADDITIONAL_CONFIG = RevisionMetadataIndexer.ADDITIONAL_CONFIG

    USE_TOOLS = False

    def __init__(self, config=None, **kwargs) -> None:
        super().__init__(config=config, **kwargs)
        self.origin_head_indexer = OriginHeadIndexer(config=config)
        self.revision_metadata_indexer = RevisionMetadataIndexer(config=config)

-    def index_list(self, origin_urls):
+    def index_list(self, origin_urls, **kwargs):
        head_rev_ids = []
        origins_with_head = []
        origins = list(call_with_batches(
            self.storage.origin_get,
            [{'url': url} for url in origin_urls], ORIGIN_GET_BATCH_SIZE))
        for origin in origins:
            if origin is None:
                continue
            head_result = self.origin_head_indexer.index(origin['url'])
            if head_result:
                origins_with_head.append(origin)
                head_rev_ids.append(head_result['revision_id'])

        head_revs = list(call_with_batches(
            self.storage.revision_get,
            head_rev_ids, REVISION_GET_BATCH_SIZE))
        assert len(head_revs) == len(head_rev_ids)

        results = []
        for (origin, rev) in zip(origins_with_head, head_revs):
            if not rev:
                self.log.warning('Missing head revision of origin %r',
                                 origin['url'])
                continue

            rev_metadata = self.revision_metadata_indexer.index(rev)
            orig_metadata = {
                'from_revision': rev_metadata['id'],
                'id': origin['url'],
                'metadata': rev_metadata['metadata'],
                'mappings': rev_metadata['mappings'],
                'indexer_configuration_id':
                    rev_metadata['indexer_configuration_id'],
            }
            results.append((orig_metadata, rev_metadata))
        return results
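
    # Editor's note, not in the original file: index_list returns
    # (orig_metadata, rev_metadata) pairs, and persist_index_computations
    # below unpacks those same pairs, so the two methods must stay in sync.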

    def persist_index_computations(
            self, results: List[Dict], policy_update: str
    ) -> Dict[str, int]:
        conflict_update = (policy_update == 'update-dups')

        # Deduplicate revisions
        rev_metadata: List[Any] = []
        orig_metadata: List[Any] = []
        revs_to_delete: List[Any] = []
        origs_to_delete: List[Any] = []
        summary: Dict = {}
        for (orig_item, rev_item) in results:
            assert rev_item['metadata'] == orig_item['metadata']
            if not rev_item['metadata'] or \
                    rev_item['metadata'].keys() <= {'@context'}:
                # If we didn't find any metadata, don't store a DB record
                # (and delete existing ones, if any)
                if rev_item not in revs_to_delete:
                    revs_to_delete.append(rev_item)
                if orig_item not in origs_to_delete:
                    origs_to_delete.append(orig_item)
            else:
                if rev_item not in rev_metadata:
                    rev_metadata.append(rev_item)
                if orig_item not in orig_metadata:
                    orig_metadata.append(orig_item)

        if rev_metadata:
            summary_rev = self.idx_storage.revision_intrinsic_metadata_add(
                rev_metadata, conflict_update=conflict_update)
            summary.update(summary_rev)
        if orig_metadata:
            summary_ori = self.idx_storage.origin_intrinsic_metadata_add(
                orig_metadata, conflict_update=conflict_update)
            summary.update(summary_ori)

        # revs_to_delete should always be empty unless we changed a mapping
        # to detect fewer files or less content.
        # However, origs_to_delete may be non-empty whenever an upstream
        # deletes a metadata file.
        if origs_to_delete:
            summary_ori = self.idx_storage.origin_intrinsic_metadata_delete(
                origs_to_delete)
            summary.update(summary_ori)
        if revs_to_delete:
            summary_rev = self.idx_storage.revision_intrinsic_metadata_delete(
                revs_to_delete)
            summary.update(summary_rev)

        return summary
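
The "no metadata" test in OriginMetadataIndexer.persist_index_computations
treats a CodeMeta document whose only key is '@context' as empty. A
standalone restatement of that check (an editor's sketch, not part of this
patch):

    def is_empty_codemeta(metadata) -> bool:
        # Mirrors `not rev_item['metadata'] or
        # rev_item['metadata'].keys() <= {'@context'}` above.
        return not metadata or metadata.keys() <= {'@context'}

    assert is_empty_codemeta(None)
    assert is_empty_codemeta({})
    assert is_empty_codemeta(
        {'@context': 'https://doi.org/10.5063/schema/codemeta-2.0'})
    assert not is_empty_codemeta({'@context': '...', 'name': 'swh.indexer'})
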
diff --git a/version.txt b/version.txt
index abd41b9..b0475d0 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.169-0-g895544f
\ No newline at end of file
+v0.0.170-0-gcb29944
\ No newline at end of file