
First draft of the metadata content indexer for npm (package.json) T715
Closed, Public

Authored by moranegg on Jun 16 2017, 5:22 PM.

Details

Summary

For indexing content in content_metadata, we want to use metadata tools
to extract metadata from manifest files and keep it in the same format (syntax) and with the same terms (semantics).

Temporary solution:
in the Metadata_Dictionary class, dispatch the content for parsing and translation,
using a hard-coded mapping (which should eventually be extracted from storage) to translate package.json files
to CodeMeta terms.
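For illustration only, a minimal sketch of what such a hard-coded npm-to-CodeMeta translation could look like (the mapping entries and the function name below are examples, not the actual code under review):

# hypothetical excerpt of a hard-coded npm -> CodeMeta mapping; the real
# table should eventually come from storage
npm_mapping = {
    'name': 'name',
    'description': 'description',
    'license': 'license',
}

def translate_npm(package_json):
    """Rename the keys of a parsed package.json dict into CodeMeta terms."""
    return {npm_mapping[k]: v
            for k, v in package_json.items()
            if k in npm_mapping}

# e.g. translate_npm({'name': 'yarn', 'version': '0.24.6', 'license': 'BSD-2-Clause'})
# -> {'name': 'yarn', 'license': 'BSD-2-Clause'}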

Testing:
test_metadata has 3 running tests for the compute_metadata function, with and without content,
and for the metadata indexer (the storage part of it isn't implemented yet) with a local npm mapping.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

moranegg retitled this revision from First draft of the metadata content indexer for npm (package.json) to First draft of the metadata content indexer for npm (package.json) T715. Jun 16 2017, 5:24 PM
swh/indexer/metadata.py
1

Use the creation date's year as the lower bound. Update the upper bound when a new year arrives :)

21

Are you sure the raw_content can be None?
I think at worst, it's empty (i.e. `b''`) but not None.

I think it can only be empty (cf. swh.indexer.indexer.BaseIndexer.run)

25

If possible, try to use bytes because:

  • objstorage gives us bytes
  • we don't always know the encoding used. Here, I think this will use the platform's default (utf-8), and this will raise an error if that does not work (a UnicodeDecodeError, IIRC).

Note: the content language indexer is the exception amongst indexers in that it uses text, and that's due to an implementation detail of the layer it depends on. Every other indexer uses bytes.
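For what it's worth, a hedged sketch of staying on bytes and only decoding when strictly needed (the helper name and fallback behaviour are assumptions, not existing indexer API):

def decode_if_needed(raw_content):
    # objstorage hands us bytes; decode explicitly rather than relying on
    # the platform default, and fail gracefully on undecodable content
    if isinstance(raw_content, str):
        return raw_content
    try:
        return raw_content.decode('utf-8')
    except UnicodeDecodeError:
        return None  # caller can log and skip this content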

29
self.log.exception('Problem during the content metadata extraction or something')

Changing the message to something more appropriate would also be good :)

50

This one was copied from the content language indexer.
If it is not used, please remove it.

58

same here.

89

I see you want to reuse code, and that's good.
But... avoid using ADDITIONAL_CONFIG; it's the default configuration used in case a partial configuration file is provided (a kind of fill-in-the-blanks mode).

Also, do use the `prepare` method to initialize what you need in other methods:
initialize self.tool in prepare, then simply use self.tool in the index_content method.
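A minimal sketch of that pattern, assuming the usual BaseIndexer hooks (the class and attribute names here are illustrative, not the exact diff):

class ContentMetadataIndexer(BaseIndexer):

    def prepare(self):
        super().prepare()
        # resolve the tool once, instead of reaching into the configuration
        # on every call
        self.tool = self.retrieve_tools_information()

    def index_content(self, id, raw_content):
        # self.tool is now directly available here
        ...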

96

Here, you need to pay attention to which layer is supposed to try and raise.

It's either the compute_metadata function you depend on or this method's body.

Since compute_metadata currently already 'traps' the exception without re-raising it, the try...except in this body is never exercised.
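In other words, only one layer should own the try/except. A sketch of the variant where this method catches, assuming compute_metadata were changed to re-raise (the result shape and context argument are assumptions):

def index_content(self, id, raw_content):
    result = {'id': id, 'translated_metadata': None}
    try:
        # compute_metadata would re-raise instead of trapping the exception
        result['translated_metadata'] = compute_metadata('npm', raw_content)
    except Exception:
        self.log.exception('Problem translating metadata for content %s' % id)
    return result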

swh/indexer/metadata_dictionary.py
1

2017

swh/indexer/metadata_dictionary.py
57

You could use a dict of {context: translation function} entries.

Something like:

mapping_fn = {
    "hard_mapping_fn": translate_npm,
    "pom.xml": lambda content: translate_pom(parse_xml(content)),
    ...
}

# then parse method is simplified
class MetadataDict():

    def __init__(self):
        pass

    def parse(self, context, content):
        return mapping_fn(context)(content)
...

Since I don't see any object state used in the metadata dictionary functions (translate_npm, etc.), those could be simple functions (and not object methods).

Hey, the parse method itself could be a function here :)

Personally, I tend to use a class when there needs to be shared state (accessible through instance variables on self).
Otherwise, I define functions.

62

parse_xml?

swh/indexer/tests/test_metadata.py
50

You can remove it.

69

You can remove this since we already saw the big diff shown on failure without it :)

moranegg added inline comments.
swh/indexer/metadata.py
21

you are right

25

Can we read the content using bytes?
At the end I need to read the content as text for translation.

I saw the _detect_encoding function and the use of decode() in the language indexer.
Using only decode() was an easy fix here, but I should rethink this.

Leaving decode() as is for now, but we should discuss this.

29

Sure! I printed e to see which errors were raised.

50

ack

58

ack

89

Right! Good catch.. it was also an easy fix ;-)

self.tools = self.retrieve_tools_information()
uses the MockStorage function indexer_configuration_get().
When I retrieve the information I need the tool's name,

so I'm adding it to the MockStorage, and I should check that this also happens on the storage side.

96

ack

swh/indexer/metadata_dictionary.py
1

ack

57

Very useful, thanks for this. I will rework the design of the MetaDict to decide whether the class is needed.

I'm trying to integrate the mapping_fn but I have some issues with calling it..

I will continue this tomorrow..

swh/indexer/tests/test_metadata.py
50

ack

69

Actually, when the difference is longer than a certain size it doesn't show the whole text.
So with the metadata result, which is longer than the language result, this is useful.

swh/indexer/metadata.py
25

Can we read the content using bytes?

Yes, it's bytes.
And the Python 3 reading/writing API can deal with both text and bytes:
if the input is bytes, it returns bytes (and symmetrically with text).

At the end I need to read the content as text for translation.

Are you sure you need to?
I guess it will depend on what the API you call further down expects.

I saw the _detect_encoding function and the use of decode() in the language indexer.

Indeed, like in my case with the language indexer:
I was forced to because of the API underneath (pygments deals only with text).
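As a side note (not necessarily what the indexer should do): since Python 3.6, json.loads accepts bytes directly and detects the encoding itself, so parsing a package.json does not strictly require an explicit decode step:

import json

raw_content = b'{"name": "test_metadata", "version": "0.0.1"}'
package_json = json.loads(raw_content)   # accepts bytes on Python >= 3.6
assert package_json['name'] == 'test_metadata'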

89

Yes, just define it in prepare as you mentioned (for the implementation).

Unfortunately, we need to repeat this initialization step in the derived test indexer class as well (in its prepare function).

swh/indexer/metadata_dictionary.py
57

I will continue this tomorrow..

Sure.

I'm trying to integrate the mapping_fn but I have some issues with calling it..

Calling it should be as simple as:

mapping_fn['hard_mapping_fn'](content)
  • mapping_fn['hard_mapping_fn'] being resolved to the 'translate_npm' function.
  • 'content' being defined as something that makes sense for the translate_npm function.

oh...

I see there is a typo in the gist I mentioned to you. Try using brackets instead of parentheses.

mapping_fn[context](content)
moranegg added inline comments.
swh/indexer/metadata_dictionary.py
57

Not using a class anymore; I moved the compute_metadata function into metadata_dictionary to keep the logic as follows (a rough sketch of this split follows the list):

  • the indexer retrieves data from storage and stores the result
  • compute_xxxxx() uses the tool and returns the tool's result
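A rough sketch of that split, with hypothetical signatures (the storage side is still mocked at this stage):

# swh/indexer/metadata_dictionary.py -- owns the tool logic
def compute_metadata(context, raw_content):
    """Parse and translate one manifest; return the translated dict or None."""
    ...

# swh/indexer/metadata.py -- retrieves data from storage and stores the result
from swh.indexer.indexer import BaseIndexer

class ContentMetadataIndexer(BaseIndexer):

    def index_content(self, id, raw_content):
        translated = compute_metadata('hard_mapping_npm', raw_content)
        return {'id': id, 'translated_metadata': translated}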
moranegg marked an inline comment as done.

Refactor metadata_dictionary and other edits

Refactor metadata dictionary

swh/indexer/metadata_dictionary.py
135

Since you do need a distinct mapping dict for each of your translations, you could use classes...
(Sorry for proposing yet another implementation, but there are more of them in your code now :)

To avoid having out-of-context dictionaries defined all over the place (npm_mapping, doap_mapping, etc.),
and since those are only used in a specific context (the <techno> mapping), we could define a class per mapping which encloses said mapping, something like:

class NpmMapping:
    mapping = {<npm-mapping-dict-here>}

    def translate(self, content):
        return translate(content, self.mapping)

class PomMapping:
    mapping = {<pom-mapping-here>}

    def translate(self, content):
        return translate(parse_xml(content), self.mapping)

class DoapMapping:
    mapping = {<doap-mapping-dict-here>} 

    def translate(self, content):
        return translate(parse_xml(content), self.mapping)

### Note that we could factor code again here :) ###

# mapping_tool_fn becomes:
mapping_tool_fn = {
    "hard_mapping_npm": NpmMapping(),
    "pom_xml": PomMapping(),
    "doap_xml": DoapMapping(),
}

# finally compute-metadata:
def compute_metadata(context, raw_content):
    content = convert(raw_content)
    if content is None:
        return None
    translated_metadata = mapping_tool_fn[context].translate(content)
    return translated_metadata


etc...

But feel free to continue on your initial approach for now, it's just a suggestion.

swh/indexer/metadata_dictionary.py
135

I really like this approach. I could override the "translate()" function if needed, with a parent class metadataMapping(),
and the child classes could follow your design for now with the hard-coded mapping.

Eventually I will use the CodeMeta crosswalk table to generate the mappings, so this might change a bit.

Thanks for the help with the design of the component!

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

changed logic in metadata_dictionary with a parent class and dedicated child class for each mapping

swh/indexer/metadata_dictionary.py
110

Why do you want to name it differently? The class name already carries the 'npm' part.
I found the plain 'mapping' name sufficient.

114

If that code is the same in all derived classes, that suggests it belongs in BaseMapping's translate definition,
the only override then being the mapping variable.
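A small sketch of that factoring, with the shared translate living in the parent and only the mapping overridden (the mapping excerpt is illustrative):

class BaseMapping:
    mapping = {}

    def translate(self, content):
        # shared behaviour: rename known keys into their CodeMeta terms
        return {self.mapping[k]: v for k, v in content.items()
                if k in self.mapping}

class NpmMapping(BaseMapping):
    mapping = {'name': 'name', 'description': 'description'}  # excerpt only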

135

I really like this approach

cool

and I could override the "translate()" ...

indeed :)

swh/indexer/tests/test_metadata.py
38

You will check only one element here:
if it is not found at the first position, you will always return False.

I think you want to remove the else part,
and return False after the while clause.

*cough*, this function should be tested as well :)

moranegg added inline comments.
swh/indexer/metadata_dictionary.py
114

Now it's the same because I'm lazy and I haven't started working on other file types, but in cases where the file doesn't contain JSON it should be decoded and/or parsed differently, which is why I didn't keep it in the translate method.

Also, I wanted to keep only one task in the translate method: the semantic translation from a dict with the original terms to a dict with CodeMeta terms.

swh/indexer/tests/test_metadata.py
38

changing it to

if elem not in expected:
    return False

and after the while loop return True
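Put together, the fixed helper described above would look roughly like this (the name and signature are reconstructed, not taken from the diff):

def compare(captured, expected):
    i = 0
    while i < len(captured):
        elem = captured[i]
        if elem not in expected:
            return False
        i += 1
    return True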

moranegg marked an inline comment as done.

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

swh/indexer/metadata_dictionary.py
109

If there is nothing inside, you don't need to define it :)

114

now it's the same because I'm lazy...

sure, aren't we all in some form or another? :)

swh/indexer/tests/test_metadata.py
38

...
and after the while loop return True

As a first thought, this is a nice catch-up :D

On second thought though, is it enough? I mean, is checking for the absence of unknown elements enough?

Counter examples:

  1. If captured is a subset list of expected, the test shall pass even though there is a missing element in captured.

-> Adding a length check on captured and expected would make that case appropriately fail.

  2. If captured is a subset list of expected with enough duplicated valid entries.

This one could still pass even with the length check (provided the right number of duplicated valid entries).

-> The only way around that seems to be checking in the other direction as well, from expected to captured (see the sketch after these notes).

From an algorithmic point of view, we are in unit tests, so not being optimal is not a problem here.
I mean, as long as tests are fast enough, it's not really a problem (you are not writing/reading in a db, for example :)

Notes:

  1. This function makes me uneasy.

-> Comparing/serializing nested data structures is not an easy subject.
-> Maybe there exists some kind of library which provides this functionality...

  1'. Also, your compare function seems crafted specifically for the use case covered by the actual tests.
  What about other kinds of nested data structures (a list of dicts, for example)?

  2. Or this could also suggest a data structure improvement.

-> Do we need to enforce order in the current implementation?

  3. Adding tests on this function will definitely help, though.
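A minimal sketch of the two-way check suggested in counter example 2, with a hypothetical helper name (the revision later drops the compare helper altogether):

def same_elements(captured, expected):
    # reject elements of captured unknown to expected...
    if any(elem not in expected for elem in captured):
        return False
    # ...and elements of expected missing from captured
    if any(elem not in captured for elem in expected):
        return False
    return True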

Updating D215: First draft of the metadata content indexer for npm (package.json) T715

deleted compare function in test_metadata.py

This revision is now accepted and ready to land. Jun 28 2017, 5:22 PM
This revision was automatically updated to reflect the committed changes.