Details

Reviewers

Maniphest Tasks

Commits

rDCIDX12750558ccef: Changed usage of storage for filter and get content_metadata
rDCIDX8b8066002660: Refactor fetch content_metadata procedure in RevisionMetadataIndexer
rDCIDX8517ed6363c3: Added tests for RevisionMetadataIndexer with Mockstorage
rDCIDX51ede3e32c7a: Added level of abstraction in indexers to use BaseIndexer for revisions
rDCIDX0f0ade3a9b3b: Added RevisionMetadataIndexer with a detection tool for metadata
rDCIDX5fbcca11425c: Added level of abstraction in indexers to use BaseIndexer for revisions

Summary

renaming methods filter_contents to filter and index_content to index

in all sub-classes and orchestrator

renaming dependencies to ContentIndexer instead of BaseIndexer
renaming in tests

Added RevisionMetadataIndexer with a detection tool for metadata

RevisionMetadataIndexer takes a list of revisions and detects

in the root directory all the file names supported by the
swh-metadata-detector version 0.0.1 that can contain metadata

checks if files where translated before in the content_metadata

table

if not: sends the files to indexation

aggregates results

resolves T738

Note: should keep results in revision_metadata but this part
is not ready in the storage

also, changed init of ContentMetadataIndexer with tool in args

Updated documentation with new revision indexer

Test Plan

testing swh-metadata-detector
and RevisionMetadataIndexer with MockStorage

Diff Detail

Repository

rDCIDX Metadata indexer

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

moranegg created this revision.Jul 25 2017, 3:11 PM

moranegg edited the summary of this revision. (Show Details)Jul 25 2017, 4:23 PM

moranegg added a project: Metadata implementation.

moranegg added a task: T738: create Revision Indexer.

I'm not done yet, but here are some tiny remarks :)

README
4 ↗	(On Diff #764)	sentence adaptation proposal: ... SWH's data (content, revision, etc...): - content: - mimetype - ctags - language - fossology_license - metadata - revision: - metadata
swh/indexer/indexer.py
83	`directory`
242	s/content/data/
306	`Inherit from this class and override the methods mentioned in the BaseIndexer class.`?
351	`Inherit from this class and override the methods mentioned in the BaseIndexer class`.?
369	Since the storage api permits to retrieve information on a list of revisions, you could leverage this here: revs = self.storage.revision_get(sha1_gits) for rev in revs: if not rev: self.log.warn('Revision %s not found in storage' % hashutil.hash_to_hex(sha1_git)) continue res = self.index(rev) if res: # If no results, skip it results.append(res) ... ... I don't think that api raises ValueError. It could return None entry though.

Updating D233: Added level of abstraction in indexers to use BaseIndexer for revisions
Added tests for RevisionMetadataIndexer with Mockstorage

Harbormaster completed remote builds in B964: Diff 765.Jul 25 2017, 5:54 PM

Updating D233: Added level of abstraction in indexers to use BaseIndexer for revisions

Harbormaster completed remote builds in B965: Diff 766.Jul 26 2017, 2:14 PM

Working on storage part today, I hope I'll have a diff ready this afternoon..

swh/indexer/indexer.py
242	i wrote object type bellow so i need clarification for this comment

moranegg edited the test plan for this revision. (Show Details)Jul 26 2017, 2:19 PM

Hello, i've added some more comments.

My main concern are in regards to storage api calls using list of 1 element instead of using list of elements.
Other than that, this seems fine :)

swh/indexer/metadata.py
229	The same remark as https://forge.softwareheritage.org/D233#inline-1221 could be applied here. If the storage api can deal with multiple sha1s, this is probably better to use it that way :)
243	self.log.exception will log the exception correctly when running in production. So it's better to use that instead of print statement. (cf. below if problem to see logs during your run tryouts)
256	You can improve your main entry point using the click library. Something like: import click @click.command() @click.option('--git_sha1s', default='8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f,026040ea79dec1b49b4e3e7beda9132b6b26b51b,9699072e21eded4be8d45e3b8d543952533fa190', help='Default sha1_git to lookup') def main(git_sha1s): _git_sha1s = list(map(hashutil.hash_to_bytes, git_sha1s.split(','))) rev_metadata_indexer = RevisionMetadataIndexer() rev_metadata_indexer.run(_git_sha1s, 'update-dups') This avoid hardcoding too much :)
269	You could also add some logging setup in the main entry if you don't see enough log when running your tryouts, something like: if __name__ == '__main__': import logging logging.basicConfig(level=logging.INFO) main()
swh/indexer/orchestrator.py
101	ids instead of sha1s ids_filtered instead of sha1s_filtered If i'm not mistaken, at the orchestrator level now, we could be able to deal with any kind of indexer (content, revision)...

ardumont added inline comments.Jul 26 2017, 2:43 PM

swh/indexer/indexer.py
242	Sorry about that. I was making a reference to the `sed` tool which means `replace all occurrences of 'content' by 'data'`. In the docstring, you let the term content appear as one of the parameter. But this has been renamed to 'data' so it should be renamed in the docstring as well :)

Changed usage of storage for filter and get content_metadata

moranegg marked 5 inline comments as done.Jul 26 2017, 4:52 PM

moranegg added inline comments.

swh/indexer/metadata.py
229	here is a version with filter() to get all sha1s needing indexing and retrieve information about all the ones who don't in a batch but it feels slower than before so as i wrote on devel i'm thinking of refactoring this and directly get the content_metadata?
256	great option !

ardumont added inline comments.Jul 26 2017, 6:10 PM

swh/indexer/metadata.py
229	In my initial comment, I was more pushing to get the content_metadata as a batch :) (since content_metadata_get seem to be able to deal with batch) ... But i'm not sure of when we are at that moment, can you please clarify that part?

Yes but i tried a first version with filter (which is a bad idea globaly, i think it takes more time)
so now i'm updating new version, where first i try and get all sha1s and for the one i don't find i use the ContentMetadataIndexer
which keeps the results it computes and i get the result after computation. As you can see in the chart bellow:

But i'm thinking now that all of this should be in the ContentMetadataIndexer run method:
on a list of sha1s, first it fetches the results in storage and for the ones missing it computes the index while keeping all results internally

Then, with the get_results() method, the RevisionMetadataIndexer gets all results

mmm... i'm dwelling on that while implementing the storage part of revision_metadata and origin_metadata

Refactor fetch content_metadata procedure in RevisionMetadataIndexer

Harbormaster completed remote builds in B968: Diff 769.Jul 27 2017, 12:39 PM

added CodeRepository to translated_metadata for revision_metadata

Harbormaster completed remote builds in B970: Diff 771.Jul 27 2017, 3:42 PM

ardumont added inline comments.Jul 27 2017, 3:43 PM

swh/indexer/metadata.py

247

beware the print statement :)

You could use self.log.debug instead, this way, it's less problematic in production.
And adapt your run tryouts (in main endpoint) with:

logging.basicConfig(level=logging.DEBUG)

265

I wanted to propose you to use array (which i did but scratched and used the comma separated string instead).
I scratched it because i did not know how to use array in the command line call (when using other values than the default).

Looking into it, there seem to exist an option for that using http://click.pocoo.org/5/options/#multiple-options.
(TL;DR we need to add the flag multiple=True to the @click.option call)

Something like this then does the trick:

@click.command()
@click.option('--rev-id',
              default=['8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f',
                       '026040ea79dec1b49b4e3e7beda9132b6b26b51b',
                       '9699072e21eded4be8d45e3b8d543952533fa190'],
              help='Default sha1_git to lookup', multiple=True)
def main(rev_id):
    _git_sha1s = list(map(hashutil.hash_to_bytes, rev_id))
    rev_metadata_indexer = RevisionMetadataIndexer()
    rev_metadata_indexer.run(_git_sha1s, 'update-dups')

I know it's for test purpose and it's not high priority.

Indeed, in the actual state, this won't work if we try to pass stuff to it:

arc patch D233 ; python3 -m swh.indexer.metadata --revs_ids 8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-indexer/swh/indexer/metadata.py", line 281, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-indexer/swh/indexer/metadata.py", line 273, in main
    _git_sha1s = list(map(hashutil.hash_to_bytes, revs_ids))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/hashutil.py", line 230, in hash_to_bytes
    return bytes.fromhex(hash)
ValueError: non-hexadecimal number found in fromhex() arg at position 0

ardumont added inline comments.Jul 27 2017, 3:45 PM

swh/indexer/metadata.py
265	Oh and the cli call which works becomes something like: python3 -m swh.indexer.metadata --rev-id 8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f --rev-id 026040ea79dec1b49b4e3e7beda9132b6b26b51b --rev-id ...

moranegg marked 5 inline comments as done.Jul 27 2017, 4:44 PM

moranegg added inline comments.

swh/indexer/metadata.py
247	i thought i have erased all print().. sorry :-p and thanks for the tip

Updating D233: to run click with multiple args

ardumont accepted this revision.Jul 28 2017, 12:41 PM

This revision is now accepted and ready to land.Jul 28 2017, 12:41 PM

Closed by commit rDCIDX5fbcca11425c: Added level of abstraction in indexers to use BaseIndexer for revisions (authored by moranegg). · Explain WhyJul 28 2017, 12:45 PM

This revision was automatically updated to reflect the committed changes.

Added level of abstraction in indexers to use BaseIndexer for revisions
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 775

swh/indexer/ctags.py

swh/indexer/fossology_license.py

swh/indexer/indexer.py

swh/indexer/language.py

swh/indexer/metadata.py

swh/indexer/mimetype.py

swh/indexer/orchestrator.py

swh/indexer/tests/test_metadata.py

Added level of abstraction in indexers to use BaseIndexer for revisionsClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 775

swh/indexer/ctags.py

swh/indexer/fossology_license.py

swh/indexer/indexer.py

swh/indexer/language.py

swh/indexer/metadata.py

swh/indexer/mimetype.py

swh/indexer/orchestrator.py

swh/indexer/tests/test_metadata.py

Added level of abstraction in indexers to use BaseIndexer for revisions
ClosedPublic
Actions

Revision Contents
Changeset List