Page MenuHomeSoftware Heritage

Added level of abstraction in indexers to use BaseIndexer for revisions
ClosedPublic

Authored by moranegg on Jul 25 2017, 3:11 PM.

Details

Summary
  • renaming methods filter_contents to filter and index_content to index

in all sub-classes and orchestrator

  • renaming dependencies to ContentIndexer instead of BaseIndexer
  • renaming in tests

Added RevisionMetadataIndexer with a detection tool for metadata

  • RevisionMetadataIndexer takes a list of revisions and detects

in the root directory all the file names supported by the
swh-metadata-detector version 0.0.1 that can contain metadata

  • checks if files where translated before in the content_metadata

table

  • if not: sends the files to indexation
  • aggregates results

resolves T738

Note: should keep results in revision_metadata but this part
is not ready in the storage

  • also, changed init of ContentMetadataIndexer with tool in args

Updated documentation with new revision indexer

Test Plan

testing swh-metadata-detector
and RevisionMetadataIndexer with MockStorage

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 963
Build 1278: arc lint + arc unit

Event Timeline

I'm not done yet, but here are some tiny remarks :)

README
4

sentence adaptation proposal:

... SWH's data (content, revision, etc...):
- content:
  - mimetype
  - ctags
  - language
  - fossology_license
  - metadata
- revision:
  - metadata
swh/indexer/indexer.py
84

directory

243

s/content/data/

307

Inherit from this class and override the methods mentioned in the BaseIndexer class.?

352

Inherit from this class and override the methods mentioned in the BaseIndexer class.?

370

Since the storage api permits to retrieve information on a list of revisions, you could leverage this here:

    revs = self.storage.revision_get(sha1_gits)
    for rev in revs:
        if not rev:
            self.log.warn('Revision %s not found in storage' %
                                   hashutil.hash_to_hex(sha1_git))
            continue    
        res = self.index(rev)
        if res:  # If no results, skip it
            results.append(res)        ...
...

I don't think that api raises ValueError.
It could return None entry though.

Updating D233: Added level of abstraction in indexers to use BaseIndexer for revisions
Added tests for RevisionMetadataIndexer with Mockstorage

moranegg marked 4 inline comments as done.

Updating D233: Added level of abstraction in indexers to use BaseIndexer for revisions

Working on storage part today, I hope I'll have a diff ready this afternoon..

swh/indexer/indexer.py
243

i wrote object type bellow so i need clarification for this comment

Hello, i've added some more comments.

My main concern are in regards to storage api calls using list of 1 element instead of using list of elements.
Other than that, this seems fine :)

swh/indexer/metadata.py
219

The same remark as https://forge.softwareheritage.org/D233#inline-1221 could be applied here.
If the storage api can deal with multiple sha1s, this is probably better to use it that way :)

233

self.log.exception will log the exception correctly when running in production.
So it's better to use that instead of print statement.

(cf. below if problem to see logs during your run tryouts)

246

You can improve your main entry point using the click library.

Something like:

import click

@click.command()
@click.option('--git_sha1s',
              default='8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f,026040ea79dec1b49b4e3e7beda9132b6b26b51b,9699072e21eded4be8d45e3b8d543952533fa190',
              help='Default sha1_git to lookup')
def main(git_sha1s):
    _git_sha1s = list(map(hashutil.hash_to_bytes, git_sha1s.split(',')))
    rev_metadata_indexer = RevisionMetadataIndexer()
    rev_metadata_indexer.run(_git_sha1s, 'update-dups')

This avoid hardcoding too much :)

259

You could also add some logging setup in the main entry if you don't see enough log when running your tryouts, something like:

if __name__ == '__main__':
    import logging
    logging.basicConfig(level=logging.INFO)
    main()
swh/indexer/orchestrator.py
101

ids instead of sha1s
ids_filtered instead of sha1s_filtered

If i'm not mistaken, at the orchestrator level now, we could be able to deal with any kind of indexer (content, revision)...

swh/indexer/indexer.py
243

Sorry about that.
I was making a reference to the sed tool which means replace all occurrences of 'content' by 'data'.
In the docstring, you let the term content appear as one of the parameter.
But this has been renamed to 'data' so it should be renamed in the docstring as well :)

moranegg marked 3 inline comments as done.

Changed usage of storage for filter and get content_metadata

moranegg added inline comments.
swh/indexer/metadata.py
219

here is a version with filter() to get all sha1s needing indexing and retrieve information about all the ones who don't in a batch
but it feels slower than before
so as i wrote on devel i'm thinking of refactoring this and directly get the content_metadata?

246

great option !

swh/indexer/metadata.py
219

In my initial comment, I was more pushing to get the content_metadata as a batch :)
(since content_metadata_get seem to be able to deal with batch)
...

But i'm not sure of when we are at that moment, can you please clarify that part?

Yes but i tried a first version with filter (which is a bad idea globaly, i think it takes more time)
so now i'm updating new version, where first i try and get all sha1s and for the one i don't find i use the ContentMetadataIndexer
which keeps the results it computes and i get the result after computation. As you can see in the chart bellow:

But i'm thinking now that all of this should be in the ContentMetadataIndexer run method:
on a list of sha1s, first it fetches the results in storage and for the ones missing it computes the index while keeping all results internally

Then, with the get_results() method, the RevisionMetadataIndexer gets all results

mmm... i'm dwelling on that while implementing the storage part of revision_metadata and origin_metadata

Refactor fetch content_metadata procedure in RevisionMetadataIndexer

added CodeRepository to translated_metadata for revision_metadata

swh/indexer/metadata.py
237

beware the print statement :)

You could use self.log.debug instead, this way, it's less problematic in production.
And adapt your run tryouts (in main endpoint) with:

logging.basicConfig(level=logging.DEBUG)
255

I wanted to propose you to use array (which i did but scratched and used the comma separated string instead).
I scratched it because i did not know how to use array in the command line call (when using other values than the default).

Looking into it, there seem to exist an option for that using http://click.pocoo.org/5/options/#multiple-options.
(TL;DR we need to add the flag multiple=True to the @click.option call)

Something like this then does the trick:

@click.command()
@click.option('--rev-id',
              default=['8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f',
                       '026040ea79dec1b49b4e3e7beda9132b6b26b51b',
                       '9699072e21eded4be8d45e3b8d543952533fa190'],
              help='Default sha1_git to lookup', multiple=True)
def main(rev_id):
    _git_sha1s = list(map(hashutil.hash_to_bytes, rev_id))
    rev_metadata_indexer = RevisionMetadataIndexer()
    rev_metadata_indexer.run(_git_sha1s, 'update-dups')

I know it's for test purpose and it's not high priority.

Indeed, in the actual state, this won't work if we try to pass stuff to it:

arc patch D233 ; python3 -m swh.indexer.metadata --revs_ids 8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-indexer/swh/indexer/metadata.py", line 281, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-indexer/swh/indexer/metadata.py", line 273, in main
    _git_sha1s = list(map(hashutil.hash_to_bytes, revs_ids))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/hashutil.py", line 230, in hash_to_bytes
    return bytes.fromhex(hash)
ValueError: non-hexadecimal number found in fromhex() arg at position 0
swh/indexer/metadata.py
255

Oh and the cli call which works becomes something like:

python3 -m swh.indexer.metadata --rev-id 8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f --rev-id 026040ea79dec1b49b4e3e7beda9132b6b26b51b --rev-id ...
moranegg added inline comments.
swh/indexer/metadata.py
237

i thought i have erased all print()..
sorry :-p

and thanks for the tip

moranegg marked an inline comment as done.

Updating D233: to run click with multiple args

This revision is now accepted and ready to land.Jul 28 2017, 12:41 PM
This revision was automatically updated to reflect the committed changes.