Page MenuHomeSoftware Heritage

Web API: move content checksums under a common JSON key
ClosedPublic

Authored by anlambert on Sep 18 2017, 11:51 AM.

Details

Reviewers
olasd
ardumont
Group Reviewers
Reviewers
Summary

Groups all computed content checksums under a 'checksums' JSON key (T779)

All computed content checksums are now grouped under a common JSON key when reaching the
/api/1/content/<q: hash_type:hash>/ endpoint.

I have implemented this change on the service layer as other module clients could
benefit from it.

Diff Detail

Repository
rDWAPPS Web applications
Branch
group-content-checksums
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1013
Build 1343: Software Heritage Python tests
Build 1342: arc lint + arc unit

Event Timeline

Looks good as a fix T779.

But I took the freedom to comment about making this part of the Web UI code less dependent on hardcoded lists of checksum algorithms. If you prefer not to fix that in this diff that's totally fine, but in that case please file a separate task about switching to use swh.model.hashutil.ALGORITHMS in the future.

swh/web/api/utils.py
181–182

isn't blake2s256 missing here?

and, more generally, rather than hardcoding the list of hashing algorithms everywhere in the web UI, can't we use swh.model.hashutil.ALGORITHMS (and/or DEFAULT_ALGORITHMS) where appropriate?

swh/web/tests/api/views/test_content.py
391

shouldn't we also test blake2 here?

if it's missing from our testdata, we should probably update those

ardumont added inline comments.
swh/web/api/utils.py
181–182

To clarify the use of those variables.

  • swh.model.hashutil.ALGORITHMS references the supported algorithms for hashing.
  • swh.model.hashutil.DEFAULT_ALGORITHMS references the algorithms used by default by the hashing functions when nothing is specified. The loaders for example, mostly do not specify those, thus using DEFAULT_ALGORITHMS. Thus, that should be the the one to use here.
swh/web/api/utils.py
181–182

... Thus, that should be the the one to use here.

I assumed wrongly that the hashes won't change in time.
So the full list should be used (ALGORITHMS) instead.

swh/web/api/utils.py
181–182

Ok, currently testing using the swh.model.hashutil.ALGORITHMS set.

Nevertheless, there is a side effect of now using it. As hashutil.ALGORITHMS is a set,
the order of iteration on its elements can vary between Python sessions (while the order
of iteration is guaranteed to be the same on the current one).

So enriched content data with /filetype/language/license url will now use the first checksum
returned when iterating on ALGORITHMS. We should have a hint in swh.model.hashutil
on which default hash to use.

I keep current implementation of enrich_content until we find a proper solution.

Updating D247: Web API: move content checksums under a common JSON key

Integrate zack reviews

olasd added inline comments.
swh/web/api/utils.py
181–182

If you want a reproducible order, you can just use sorted(ALGORITHMS) ?

In my opinion, this "enrichment" function should really reuse the query string that we used for getting to the content in the first place, rather than picking an arbitrary default.

swh/web/common/service.py
652

dictionnary -> dict(ionary)

swh/web/api/utils.py
181–182

I agree about reusing the same query string (with default one 'sha1' as before), it would be more coherent.

zack requested changes to this revision.Sep 19 2017, 11:28 AM

(no change or additional comment, just updating the review status to reflect the discussion)

This revision now requires changes to proceed.Sep 19 2017, 11:28 AM

Updating D247: Web API: move content checksums under a common JSON key

Follow up on comments from the last diff: now content enrich function uses the same hash method as in the
query string provided to the content endpoint.

Also group content checksums for a file when browsing a directory through the /directory/
and /revision/directory/ endpoints for more coherency (by the way, blake2 hash is not returned
through them).

swh/web/api/utils.py
169–175

for napoleon/rst, the above should be formatted as:

Args:
    content: dict of data associated to a swh content object
    top_url: whether or not to include the content url in
        the enriched data
    query_string: optionnal query string of type '<algo>:<hash>'
        used when requesting the content, it acts as a hint
        for picking the same hash method when computing
        the url listed above

i.e., the indentation level for continuation lines is constant, does not depend on the length of the identifier

unrelated: "optionnal" -> "optional"

I think I'm happy with that diff as it stands (zack's remarks on docs formatting still stand). Thanks @anlambert !

As a side note, it seems that the tests involving directory entries don't have a blake2s256 hash, this should probably be fixed :)

This revision is now accepted and ready to land.Sep 19 2017, 2:55 PM

The diff is quite fine.

Just a note about the checksums grouping, this could have gone inside the converters module:

  1. the returned result (dict mostly) passes there already
  2. It's a structure conversion so it makes sense

That way, this avoids multiple map calls.

swh/web/api/utils.py
165

language ;p

Thanks to all for the reviews.

@olasd: I will add the test regarding the missing blake2 hash in the directory entries. But when browsing directories with the api, the blake2 hash is missing, it seems that hash value is not returned by the underlying database request for listing a directory.

@ardumont: you're right, I should move the _group_checksums to the converters module, currently looking into it.

anlambert added a subscriber: zack.

Updating D247: Web API: move content checksums under a common JSON key

final differential taking into account reviews from @zack, @ardumont and @olasd