Changeset View
Standalone View
swh/indexer/metadata.py
- This file was added.
# Copyright (C) 2017 The Software Heritage developers | |||||
ardumont: Use as lower bound the creating date year. Update the upper bound when a new year arises :) | |||||
# See the AUTHORS file at the top-level directory of this distribution | |||||
# License: GNU General Public License version 3, or any later version | |||||
# See top-level LICENSE file for more information | |||||
from .indexer import BaseIndexer | |||||
from swh.indexer.metadata_dictionary import compute_metadata | |||||
class ContentMetadataIndexer(BaseIndexer): | |||||
"""Indexer in charge of: | |||||
- filtering out content already indexed | |||||
- reading content from objstorage with the content's id sha1 | |||||
- computing translated_metadata by given context | |||||
- using the MetadataDict and a tool for each context | |||||
- store result in storage | |||||
""" | |||||
CONFIG_BASE_FILENAME = 'indexer/metadata' | |||||
ADDITIONAL_CONFIG = { | |||||
'tools': ('dict', { | |||||
Done Inline ActionsAre you sure the raw_content can be None? I think it can only be empty (cf. swh.indexer.indexer.BaseIndexer.run) ardumont: Are you sure the raw_content can be None?
I think at worst, it's empty (i.e. ```b''```) but not… | |||||
Not Done Inline Actionsyou are right moranegg: you are right | |||||
'name': 'hard_mapping_npm', | |||||
'version': '0.0.1', | |||||
'configuration': { | |||||
'type': 'test', | |||||
Not Done Inline ActionsIf possible, try to use bytes because:
Note: The content language is the exception amongst indexers that uses text and it's due to implementation detail of the layer it depends on. Every other indexer use bytes. ardumont: If possible, try to use bytes because:
- objstorage gives us bytes
- we don't always know the… | |||||
Not Done Inline ActionsCan we read the content using bytes? I saw in the language indexer the _detect_encoding function and the use of decode function. leaving decode() as is for now, but we should discuss this. moranegg: Can we read the content using bytes?
At the end I need to read the content as text for… | |||||
Not Done Inline Actions
Yes, it's bytes.
Are you sure you need to?
Like in my case with the language indexer indeed. ardumont: > Can we read the content using bytes?
Yes, it's bytes.
And the python3 api about… | |||||
'debian-package': '' | |||||
}, | |||||
}), | |||||
} | |||||
Done Inline Actionsself.log.exception('Problem during the content metadata extraction or something') Changing the message with something more appropriate would also be good :) ardumont: ```
self.log.exception('Problem during the content metadata extraction or something')
```… | |||||
Not Done Inline Actionssure! I printed e to see the errors invoked moranegg: sure! I printed e to see the errors invoked | |||||
def prepare(self): | |||||
super().prepare() | |||||
def filter_contents(self, sha1s): | |||||
"""Filter out known sha1s and return only missing ones. | |||||
""" | |||||
yield from self.storage.content_metadata_missing(( | |||||
{ | |||||
'id': sha1, | |||||
'indexer_configuration_id': self.tools['id'], | |||||
} for sha1 in sha1s | |||||
)) | |||||
def index_content(self, sha1, raw_content): | |||||
"""Index sha1s' content and store result. | |||||
Args: | |||||
sha1 (bytes): content's identifier | |||||
raw_content (bytes): raw content in bytes | |||||
Done Inline ActionsThis one was used from the content language indexer. ardumont: This one was used from the content language indexer.
If not used, please remove it. | |||||
Not Done Inline Actionsack moranegg: ack | |||||
Returns: | |||||
result (dict): representing a content_metadata | |||||
if translation wasn't successful the translated_metadata keys | |||||
will be kept as None | |||||
""" | |||||
result = { | |||||
'id': sha1, | |||||
Done Inline Actionssame here. ardumont: same here. | |||||
Not Done Inline Actionsack moranegg: ack | |||||
'indexer_configuration_id': self.tools['id'], | |||||
'translated_metadata': None | |||||
} | |||||
try: | |||||
context = self.tools['name'] | |||||
result['translated_metadata'] = compute_metadata( | |||||
context, raw_content) | |||||
except: | |||||
self.log.exception( | |||||
"Problem during tool retrieval of metadata translation") | |||||
return result | |||||
def persist_index_computations(self, results, policy_update): | |||||
"""Persist the results in storage. | |||||
Args: | |||||
results ([dict]): list of content_metadata, dict with the | |||||
following keys: | |||||
- id (bytes): content's identifier (sha1) | |||||
- translated_metadata (jsonb): detected metadata | |||||
policy_update ([str]): either 'update-dups' or 'ignore-dups' to | |||||
respectively update duplicates or ignore them | |||||
""" | |||||
self.storage.content_metadata_add( | |||||
results, conflict_update=(policy_update == 'update-dups')) | |||||
Done Inline ActionsHere, you need to pay attention as to what layer is supposed to try and raise. It's either the compute_metadata function you depend on or the method's body here. As you call compute_metadata which currently already 'traps' exception without raising it again, the body try...except here is not used. ardumont: Here, you need to pay attention as to what layer is supposed to try and raise.
It's either the… | |||||
Not Done Inline Actionsack moranegg: ack | |||||
Done Inline ActionsI see you want to reuse and it's good. Also, do use the 'def prepare` method to initialize what you need in other methods. ardumont: I see you want to reuse and it's good.
But... avoid using ADDITIONAL_CONFIG, it's the default… | |||||
Not Done Inline Actionsright ! good catch.. it was also an easy fix ;-) self.tools = self.retrieve_tools_information() so I'm adding it in the MockStorage and should see that this happens also from the storage part. moranegg: right ! good catch.. it was also an easy fix ;-)
self.tools = self.retrieve_tools_information… | |||||
Not Done Inline ActionsYes, just define as you mentioned in the prepare (for the implementation). Unfortunately, we need to repeat this initialization step in the derived test indexer class as well (in its prepare function). ardumont: Yes, just define as you mentioned in the prepare (for the implementation).
Unfortunately, we… |
Use as lower bound the creating date year. Update the upper bound when a new year arises :)