Page MenuHomeSoftware Heritage

tasks: Open new fossology license range indexer task
ClosedPublic

Authored by ardumont on Nov 19 2018, 4:10 PM.

Details

Summary

indexer: Remove redundant conditional check

indexer.storage: Filter out textual data when needed

  • mimetype: no filtering is done.
  • fossology_license: Filtering to only compute on textual contents.

Related T1312

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
add-filtering-step
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2453
Build 3022: tox-on-jenkinsJenkins
Build 3021: arc lint + arc unit

Event Timeline

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/indexer/storage/db.py
136

cm.mimetype like 'text/%'

binary only means that libmagic has no idea what the mimetype is; so cm.mimetype != 'binary' would not exclude binary content that libmagic could recognize.

148

^^

This revision is now accepted and ready to land.Nov 19 2018, 4:40 PM
ardumont marked 2 inline comments as done.

Remove spurrious print statement

swh/indexer/storage/db.py
136

binary only means that libmagic has no idea what the mimetype is; so cm.mimetype != 'binary' would not exclude binary content that libmagic could recognize.

Right!

I was just re-implementing as it was before [1]
Not that this cannot change, of course ;)

[1]D630#inline-3605

cm.mimetype like 'text/%'

Seems better and this seems to be using index as well so \m/.

148

barf!

swh/indexer/storage/db.py
136

It used to be on the encoding, not on the mimetype.

swh/indexer/storage/db.py
136

Seems better and this seems to be using index as well so \m/.

Oh, I did not consider the performance issues of doing that. It's nice that it works out of the box :)

  • Improve filtering instructions as per @vlorentz's suggestion ;)
  • Fix hypothesis warning about twice @given wrapping (I did not see earlier)
ardumont added inline comments.
swh/indexer/tests/storage/test_storage.py
1753

Double @given -> warnings.

This revision was automatically updated to reflect the committed changes.

huh, right!
Bug!?

No, it's ok. I see that now that i slept ;)
We used to check for the encoding not being binary and now we check for the 'mimetype' in the form 'text/%'.
I thought that we detected an old bug. It was late ;)