
swh-web/coverage: Add origin count for each referenced code provider
Closed, Public

Authored by anlambert on Feb 4 2019, 4:53 PM.

Details

Summary

For each referenced code provider in the archive coverage list, count
the associated number of origins and display it in the coverage widget.

As this operation takes some time (between 1 and 2 minutes to get
all counts), execute it once per day and cache the results in the database.
The cached counts are then served instead of running the
underlying long storage queries on every request.
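
For readers who want the caching idea spelled out, here is a minimal sketch of a one-day cache in front of a slow count query. It is not the code from this diff: the `CountCache` class, its `get_or_compute` method and the in-memory dict are assumptions made for illustration, whereas the summary says the real results are cached in the database.

```python
import time
from typing import Callable, Dict, Tuple

ONE_DAY = 24 * 60 * 60  # cache lifetime in seconds


class CountCache:
    """Hypothetical helper: keep each origin count for one day."""

    def __init__(self, ttl: float = ONE_DAY,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        # pattern -> (timestamp of computation, cached count)
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get_or_compute(self, pattern: str,
                       compute: Callable[[str], int]) -> int:
        now = self._clock()
        entry = self._entries.get(pattern)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the slow storage query
        count = compute(pattern)  # slow path, taken at most once per TTL
        self._entries[pattern] = (now, count)
        return count
```

The injectable `clock` is only there so that expiry can be exercised without waiting a day.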

Depends on D1075

Related T1463

Diff Detail

Repository
rDWAPPS Web applications
Branch
coverage-origin-count
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 4176
Build 5505: tox-on-jenkins (Jenkins)
Build 5504: arc lint + arc unit

Event Timeline

Nitpick on the code style of swh/web/misc/coverage.py: each dict in the list should end with a trailing comma after its last entry (so that line doesn't need to change the next time we add a key-value pair to that dict).
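
To illustrate the style being asked for, a small sketch follows. The `code_providers` name and the `provider_id` key are hypothetical; only `origin_url_pattern` appears in this review.

```python
# Each dict ends with a trailing comma after its last entry, so adding
# a new key later only adds one line to the diff instead of also
# touching the previous line.
code_providers = [
    {
        'provider_id': 'framagit',  # hypothetical key, for illustration
        'origin_url_pattern': 'https://framagit.org/',
    },
    {
        'provider_id': 'pypi',
        'origin_url_pattern': 'https://pypi.org/',
    },
]
```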

swh/web/assets/src/bundles/webapp/webapp.css
439

As the extra height is for the text, it should be calc(65px + 1em). Or we could remove this property of .swh-coverage and set it to .swh-coverage-logo instead.

swh/web/misc/coverage.py
81

Add an extra slash at the end (you don't want to match https://hal.archives-ouvertes.fr.foobar.com)

swh/web/misc/coverage.py
81

Same for gitlab.com, gitlab.inria.fr, and pypi.org.

Since the storage allows regexps, we can use them to make sure each origin_url_pattern is a prefix, e.g. ^https://framagit.org/ instead of https://framagit.org/. (And something like `[a-z]+://[^/]+.googlecode.com/`)
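
A quick sketch of why the trailing slash and the `^` anchor matter. It uses Python's `re` module as a stand-in; the storage's own regexp engine may differ in details. The sample URLs and the `matching` helper are made up for the example, and the googlecode pattern is a stricter variant of the one suggested above (anchored, with the dots escaped).

```python
import re

# Made-up sample origins: one legitimate, one look-alike host.
urls = [
    "https://hal.archives-ouvertes.fr/hal-01234567",
    "https://hal.archives-ouvertes.fr.foobar.com/project",
]


def matching(pattern, candidates):
    """Return the candidates in which the regexp matches somewhere."""
    return [u for u in candidates if re.search(pattern, u)]


# No trailing slash, no anchor: the look-alike host is counted too.
loose = matching(re.escape("https://hal.archives-ouvertes.fr"), urls)

# Anchored prefix ending with a slash: only the real host matches.
strict = matching("^" + re.escape("https://hal.archives-ouvertes.fr/"), urls)

# Any scheme, any subdomain of googlecode.com, dots escaped.
googlecode = r"^[a-z]+://[^/]+\.googlecode\.com/"
```

Here `loose` contains both URLs, while `strict` contains only the first one.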

swh/web/templates/coverage.html
44–58

Why use Javascript to retrieve these counters instead of doing it while rendering the page?

Since the storage allows regexps, we can use them to make sure each origin_url_pattern is a prefix, e.g. ^https://framagit.org/ instead of https://framagit.org/. (And something like `[a-z]+://[^/]+.googlecode.com/`)

Indeed, results will be more accurate when using regexps, but the count queries will take a little longer to execute.
But as the count queries are only executed once a day and their results are cached, I have no objection to using regexps.

swh/web/assets/src/bundles/webapp/webapp.css
439

Thanks for the tip! It works great across all browsers.

swh/web/misc/coverage.py
81

ack

swh/web/templates/coverage.html
44–58

Because when the count results are not in the cache, or when the cache expires, the count queries need to be executed again, and this can take a couple of minutes.

So, to avoid making the user wait until the queries complete, display the coverage page right away and update the count labels once the results are available.

vlorentz added inline comments.
swh/web/templates/coverage.html
44–58

ok

Update:

  • address vlorentz comments
  • rework cache management for origin counts to avoid sending the same count query twice to the storage database

Update:

  • slightly rework cache mechanism: return the previous count value instead of -1 while a new count query is still running
  • only display origin counts in the UI if all have been computed

This revision is now accepted and ready to land. Feb 7 2019, 3:30 PM
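
The cache changes listed in the two updates above can be sketched roughly as follows. This is an illustrative reconstruction, not the committed code: the `OriginCountCache` class and its method names are assumptions, and the sketch keeps `-1` only for a pattern that has never been counted.

```python
import threading
from typing import Callable, Dict, Iterable, Set


class OriginCountCache:
    """Hypothetical cache serving the last known count during a refresh."""

    def __init__(self, compute: Callable[[str], int]):
        self._compute = compute
        self._counts: Dict[str, int] = {}
        self._in_progress: Set[str] = set()
        self._lock = threading.Lock()

    def get(self, pattern: str) -> int:
        """Return the last known count, or -1 if none was ever computed."""
        with self._lock:
            return self._counts.get(pattern, -1)

    def refresh(self, pattern: str) -> bool:
        """Run the count query unless one is already running for this
        pattern. Return True if this call actually ran the query."""
        with self._lock:
            if pattern in self._in_progress:
                return False  # same query already sent: do not send twice
            self._in_progress.add(pattern)
        try:
            count = self._compute(pattern)  # slow; lock is not held here
            with self._lock:
                self._counts[pattern] = count
        finally:
            with self._lock:
                self._in_progress.discard(pattern)
        return True

    def all_computed(self, patterns: Iterable[str]) -> bool:
        """True once every pattern has a count, so the UI can show them."""
        with self._lock:
            return all(p in self._counts for p in patterns)
```

While `refresh` is running, `get` keeps returning the previous value, and a view could call `all_computed` before deciding whether to display any counts at all.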

Update: Rebase, bump storage version, add configuration key to enable/disable the origin counts in the coverage page

This revision was automatically updated to reflect the committed changes.