
First draft of the metadata content indexer for npm (package.json) T715
Closed, Public

Authored by moranegg on Jun 16 2017, 5:22 PM.

Details

Summary

For indexing content in content_metadata, we want to use metadata tools
to extract metadata from manifest files and keep it in the same format (syntax) and with the same terms (semantics).

Temporary solution:
in the Metadata_Dictionary class, dispatch the content for parsing and translation,
using a hard-coded mapping (which should eventually be extracted from storage) to translate package.json files
to CodeMeta terms.
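For illustration only, a minimal sketch of what such a hard-coded npm-to-CodeMeta translation could look like (the mapping entries and the function name below are examples, not the actual code under review):

# hypothetical excerpt of a hard-coded npm -> CodeMeta mapping; the real
# table should eventually come from storage
npm_mapping = {
    'name': 'name',
    'description': 'description',
    'license': 'license',
}

def translate_npm(package_json):
    """Rename the keys of a parsed package.json dict into CodeMeta terms."""
    return {npm_mapping[k]: v
            for k, v in package_json.items()
            if k in npm_mapping}

# e.g. translate_npm({'name': 'yarn', 'version': '0.24.6', 'license': 'BSD-2-Clause'})
# -> {'name': 'yarn', 'license': 'BSD-2-Clause'}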

Testing:
test_metadata has 3 running tests for the compute_metadata function, with and without content,
and for the metadata indexer (the storage part of it isn't implemented yet) with a local npm mapping.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

moranegg retitled this revision from First draft of the metadata content indexer for npm (package.json) to First draft of the metadata content indexer for npm (package.json) T715. Jun 16 2017, 5:24 PM
swh/indexer/metadata.py
1

Use the creation date's year as the lower bound. Update the upper bound when a new year arrives :)

21

Are you sure the raw_content can be None?
I think at worst, it's empty (i.e. `b''`) but not None.

I think it can only be empty (cf. swh.indexer.indexer.BaseIndexer.run)

25

If possible, try to use bytes because:

  • objstorage gives us bytes
  • we don't always know the encoding used. Here, I think this will use the platform's default (utf-8), and this will raise an error if that does not work (a UnicodeDecodeError, IIRC).

Note: the content language indexer is the exception amongst indexers in that it uses text, and that's due to an implementation detail of the layer it depends on. Every other indexer uses bytes.
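For what it's worth, a hedged sketch of staying on bytes and only decoding when strictly needed (the helper name and fallback behaviour are assumptions, not existing indexer API):

def decode_if_needed(raw_content):
    # objstorage hands us bytes; decode explicitly rather than relying on
    # the platform default, and fail gracefully on undecodable content
    if isinstance(raw_content, str):
        return raw_content
    try:
        return raw_content.decode('utf-8')
    except UnicodeDecodeError:
        return None  # caller can log and skip this content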

29
self.log.exception('Problem during the content metadata extraction or something')

Changing the message to something more appropriate would also be good :)

50

This one was copied from the content language indexer.
If it is not used, please remove it.

58

same here.

89

I see you want to reuse code, and that's good.
But... avoid using ADDITIONAL_CONFIG; it's the default configuration used in case a partial configuration file is provided (a kind of fill-in-the-blanks mode).

Also, do use the `prepare` method to initialize what you need in other methods:
initialize self.tool in prepare, then simply use self.tool in the index_content method.
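A minimal sketch of that pattern, assuming the usual BaseIndexer hooks (the class and attribute names here are illustrative, not the exact diff):

class ContentMetadataIndexer(BaseIndexer):

    def prepare(self):
        super().prepare()
        # resolve the tool once, instead of reaching into the configuration
        # on every call
        self.tool = self.retrieve_tools_information()

    def index_content(self, id, raw_content):
        # self.tool is now directly available here
        ...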

96

Here, you need to pay attention to which layer is supposed to try and raise.

It's either the compute_metadata function you depend on or this method's body.

Since compute_metadata currently already 'traps' the exception without re-raising it, the try...except in this body is never exercised.
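In other words, only one layer should own the try/except. A sketch of the variant where this method catches, assuming compute_metadata were changed to re-raise (the result shape and context argument are assumptions):

def index_content(self, id, raw_content):
    result = {'id': id, 'translated_metadata': None}
    try:
        # compute_metadata would re-raise instead of trapping the exception
        result['translated_metadata'] = compute_metadata('npm', raw_content)
    except Exception:
        self.log.exception('Problem translating metadata for content %s' % id)
    return result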

swh/indexer/metadata_dictionary.py
1

2017

swh/indexer/metadata_dictionary.py
57

You could use a dict of {context: translation function} entries.

Something like:

mapping_fn = {
    "hard_mapping_fn": translate_npm,
    "pom.xml": lambda content: translate_pom(parse_xml(content)),
    ...
}

# then parse method is simplified
class MetadataDict():

    def __init__(self):
        pass

    def parse(self, context, content):
        return mapping_fn(context)(content)
...

Since I don't see any object state used in the metadata dictionary functions (translate_npm, etc.), those could be simple functions (and not object methods).

Hey, the parse method itself could be a function here :)

Personally, I tend to use a class when there needs to be shared state (accessible through instance variables on self).
Otherwise, I define functions.

62

parse_xml?

swh/indexer/tests/test_metadata.py
50

You can remove it.

69

You can remove this since we already saw the big diff shown on failure without it :)

moranegg added inline comments.
swh/indexer/metadata.py
21

you are right

25

Can we read the content using bytes?
At the end I need to read the content as text for translation.

I saw the _detect_encoding function and the use of decode() in the language indexer.
Using only decode() was an easy fix here, but I should rethink this.

Leaving decode() as is for now, but we should discuss this.

29

Sure! I printed e to see which errors were raised.

50

ack

58

ack

89

Right! Good catch.. it was also an easy fix ;-)

self.tools = self.retrieve_tools_information()
uses the MockStorage function indexer_configuration_get().
When I retrieve the information I need the tool's name,

so I'm adding it to the MockStorage, and I should check that this also happens on the storage side.

96

ack

swh/indexer/metadata_dictionary.py
1

ack

57

Very useful, thanks for this. I will rework the design of the MetaDict to decide whether the class is needed.

I'm trying to integrate the mapping_fn but I have some issues with calling it..

I will continue this tomorrow..

swh/indexer/tests/test_metadata.py
50

ack

69

Actually, when the difference is longer than a certain size it doesn't show the whole text.
So with the metadata result, which is longer than the language result, this is useful.

swh/indexer/metadata.py
25

Can we read the content using bytes?

Yes, it's bytes.
And the Python 3 reading/writing API can deal with both text and bytes:
if the input is bytes, it returns bytes (and symmetrically with text).

At the end I need to read the content as text for translation.

Are you sure you need to?
I guess it will depend on what the API you call further down expects.

I saw the _detect_encoding function and the use of decode() in the language indexer.

Indeed, like in my case with the language indexer:
I was forced to because of the API underneath (pygments deals only with text).
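As a side note (not necessarily what the indexer should do): since Python 3.6, json.loads accepts bytes directly and detects the encoding itself, so parsing a package.json does not strictly require an explicit decode step:

import json

raw_content = b'{"name": "test_metadata", "version": "0.0.1"}'
package_json = json.loads(raw_content)   # accepts bytes on Python >= 3.6
assert package_json['name'] == 'test_metadata'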

89

Yes, just define it in prepare as you mentioned (for the implementation).

Unfortunately, we need to repeat this initialization step in the derived test indexer class as well (in its prepare function).

swh/indexer/metadata_dictionary.py
57

I will continue this tomorrow..

Sure.

I'm trying to integrate the mapping_fn but I have some issues with calling it..

Calling it should be as simple as:

mapping_fn['hard_mapping_fn'](content)
  • mapping_fn['hard_mapping_fn'] being resolved to the 'translate_npm' function.
  • 'content' being defined as something that makes sense for the translate_npm function.

oh...

I see there is a typo in the gist I mentioned to you. Try using brackets instead of parentheses.

mapping_fn[context](content)
moranegg added inline comments.
swh/indexer/metadata_dictionary.py
57

Not using a class anymore; I moved the compute_metadata function into metadata_dictionary to keep the logic as follows (a rough sketch of this split follows the list):

  • the indexer retrieves data from storage and stores the result
  • compute_xxxxx() uses the tool and returns the tool's result
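A rough sketch of that split, with hypothetical signatures (the storage side is still mocked at this stage):

# swh/indexer/metadata_dictionary.py -- owns the tool logic
def compute_metadata(context, raw_content):
    """Parse and translate one manifest; return the translated dict or None."""
    ...

# swh/indexer/metadata.py -- retrieves data from storage and stores the result
from swh.indexer.indexer import BaseIndexer

class ContentMetadataIndexer(BaseIndexer):

    def index_content(self, id, raw_content):
        translated = compute_metadata('hard_mapping_npm', raw_content)
        return {'id': id, 'translated_metadata': translated}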
moranegg marked an inline comment as done.

Refactor metadata_dictionary and other edits

Refactor metadata dictionary

swh/indexer/metadata_dictionary.py
135

Since you do need a distinct mapping dict for each of your translations, you could use classes...
(Sorry for proposing yet another implementation, but there are more of them in your code now :)

To avoid having out-of-context dictionaries defined all over the place (npm_mapping, doap_mapping, etc.),
and since those are only used in a specific context (the <techno> mapping), we could define a class per mapping which encloses said mapping, something like:

class NpmMapping:
    mapping = {<npm-mapping-dict-here>}

    def translate(self, content):
        return translate(content, self.mapping)

class PomMapping:
    mapping = {<pom-mapping-here>}

    def translate(self, content):
        return translate(parse_xml(content), self.mapping)

class DoapMapping:
    mapping = {<doap-mapping-dict-here>} 

    def translate(self, content):
        return translate(parse_xml(content), self.mapping)

### Note that we could factor code again here :) ###

# mapping_tool_fn becomes:
mapping_tool_fn = {
    "hard_mapping_npm": NpmMapping(),
    "pom_xml": PomMapping(),
    "doap_xml": DoapMapping(),
}

# finally compute-metadata:
def compute_metadata(context, raw_content):
    content = convert(raw_content)
    if content is None:
        return None
    translated_metadata = mapping_tool_fn[context].translate(content)
    return translated_metadata


etc...

But feel free to continue on your initial approach for now, it's just a suggestion.

swh/indexer/metadata_dictionary.py
135

I really like this approach. I could override the "translate()" function if needed, with a parent class metadataMapping(),
and the child classes could follow your design for now with the hard-coded mapping.

Eventually I will use the CodeMeta crosswalk table to generate the mappings, so this might change a bit.

Thanks for the help with the design of the component!

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

changed logic in metadata_dictionary with a parent class and dedicated child class for each mapping

swh/indexer/metadata_dictionary.py
110

Why do you want to name it differently? The class name already carries the 'npm' part.
I found the plain 'mapping' name sufficient.

114

If that code is the same in all derived classes, that suggests it belongs in BaseMapping's translate definition,
the only override then being the mapping variable.
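A small sketch of that factoring, with the shared translate living in the parent and only the mapping overridden (the mapping excerpt is illustrative):

class BaseMapping:
    mapping = {}

    def translate(self, content):
        # shared behaviour: rename known keys into their CodeMeta terms
        return {self.mapping[k]: v for k, v in content.items()
                if k in self.mapping}

class NpmMapping(BaseMapping):
    mapping = {'name': 'name', 'description': 'description'}  # excerpt only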

135

I really like this approach

cool

and I could override the "translate()" ...

indeed :)

swh/indexer/tests/test_metadata.py
38

You will check only one element here:
if it is not found at the first position, you will always return False.

I think you want to remove the else part,
and return False after the while clause.

*cough*, this function should be tested as well :)

moranegg added inline comments.
swh/indexer/metadata_dictionary.py
114

Now it's the same because I'm lazy and I haven't started working on other file types, but in cases where the file doesn't contain JSON it should be decoded and/or parsed differently, which is why I didn't keep it in the translate method.

Also, I wanted to keep only one task in the translate method: the semantic translation from a dict with the original terms to a dict with CodeMeta terms.

swh/indexer/tests/test_metadata.py
38

changing it to

if elem not in expected:
    return False

and after the while loop return True
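Put together, the fixed helper described above would look roughly like this (the name and signature are reconstructed, not taken from the diff):

def compare(captured, expected):
    i = 0
    while i < len(captured):
        elem = captured[i]
        if elem not in expected:
            return False
        i += 1
    return True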

moranegg marked an inline comment as done.

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

swh/indexer/metadata_dictionary.py
109

If there is nothing inside, you don't need to define it :)

114

now it's the same because I'm lazy...

sure, aren't we all in some form or another? :)

swh/indexer/tests/test_metadata.py
38

...
and after the while loop return True

As a first thought, this is a nice catch-up :D

On second thought though, is it enough? I mean, is checking for the absence of unknown elements enough?

Counter examples:

  1. If captured is a subset list of expected, the test shall pass even though there is a missing element in captured.

-> Adding a length check on captured and expected would make that case appropriately fail.

  2. If captured is a subset list of expected with enough duplicated valid entries.

This one could still pass even with the length check (provided the right number of duplicated valid entries).

-> The only way around that seems to be checking in the other direction as well, from expected to captured (see the sketch after these notes).

From an algorithmic point of view, we are in unit tests, so not being optimal is not a problem here.
I mean, as long as tests are fast enough, it's not really a problem (you are not writing/reading in a db, for example :)

Notes:

  1. This function makes me uneasy.

-> Comparing/serializing nested data structures is not an easy subject.
-> Maybe there exists some kind of library which provides this functionality...

  1'. Also, your compare function seems crafted specifically for the use case covered by the actual tests.
  What about other kinds of nested data structures (a list of dicts, for example)?

  2. Or this could also suggest a data structure improvement.

-> Do we need to enforce order in the current implementation?

  3. Adding tests on this function will definitely help, though.
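A minimal sketch of the two-way check suggested in counter example 2, with a hypothetical helper name (the revision later drops the compare helper altogether):

def same_elements(captured, expected):
    # reject elements of captured unknown to expected...
    if any(elem not in expected for elem in captured):
        return False
    # ...and elements of expected missing from captured
    if any(elem not in captured for elem in expected):
        return False
    return True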

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

deleted compare function in test_metadata.py

This revision is now accepted and ready to land. Jun 28 2017, 5:22 PM
This revision was automatically updated to reflect the committed changes.