The archive content grows continually and new visit types are added over time.
For instance, cvs and opam visit types will be added to production in the near future.
Those visit types are used in the swh-web origin search form, but they are currently hardcoded,
which means the list must be updated manually each time a new visit type is introduced.
We should have a way to get the list of visit types dynamically instead.
It turns out the list can be extracted efficiently with the following Elasticsearch query:
anlambert@carnavalet:~/tmp$ cat es_query.sh
#!/bin/bash
curl -X POST http://localhost:9200/origin-production/_search?pretty -H 'Content-Type: application/json' -d '
{
  "aggs" : {
    "vist_types" : {
      "terms" : { "field" : "visit_types", "size":10000 }
    }
  },
  "size" : 0
}'
anlambert@carnavalet:~/tmp$ time ./es_query.sh
{
  "took" : 429,
  "timed_out" : false,
  "_shards" : {
    "total" : 90,
    "successful" : 90,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "vist_types" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "git", "doc_count" : 151794432 },
        { "key" : "npm", "doc_count" : 1660597 },
        { "key" : "svn", "doc_count" : 678797 },
        { "key" : "hg", "doc_count" : 381103 },
        { "key" : "pypi", "doc_count" : 326793 },
        { "key" : "deb", "doc_count" : 72303 },
        { "key" : "cran", "doc_count" : 18019 },
        { "key" : "ftp", "doc_count" : 1205 },
        { "key" : "deposit", "doc_count" : 1079 },
        { "key" : "tar", "doc_count" : 389 },
        { "key" : "nixguix", "doc_count" : 2 }
      ]
    }
  }
}

real    0m0,553s
user    0m0,013s
sys     0m0,014s
We could then add a new method named origin_visit_types to the swh-search interface,
wrapping that request and returning a dict mapping each visit type to its count.
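
A minimal sketch of what such a method could look like, using only the Python standard library. This is not the actual swh-search implementation: the es_search helper, the search and url parameters, and the default URL are assumptions made for illustration, and the aggregation is named visit_types here rather than reusing the name from the script above. The search callable is injectable so the parsing logic can be exercised without a running Elasticsearch instance.

```python
import json
from typing import Any, Callable, Dict
from urllib.request import Request, urlopen

# Same terms aggregation as the shell script above; "size": 0 skips the hits
# so only the aggregation buckets are returned.
VISIT_TYPES_QUERY: Dict[str, Any] = {
    "aggs": {
        "visit_types": {"terms": {"field": "visit_types", "size": 10000}}
    },
    "size": 0,
}

# Assumed default, copied from the script above; adapt to the real deployment.
DEFAULT_URL = "http://localhost:9200/origin-production/_search"


def es_search(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: POST a JSON body to Elasticsearch, decode the reply."""
    request = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)


def origin_visit_types(
    search: Callable[[str, Dict[str, Any]], Dict[str, Any]] = es_search,
    url: str = DEFAULT_URL,
) -> Dict[str, int]:
    """Return a dict mapping each visit type to its origin count."""
    response = search(url, VISIT_TYPES_QUERY)
    buckets = response["aggregations"]["visit_types"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

swh-web could then build the origin search form choices from the keys of the returned dict, for example sorted(origin_visit_types()), instead of relying on the hardcoded list.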