
Implement query to get origin visit types dynamically
Open, NormalPublic

Description

The archive content grows continually and new visit types are added over time.
For instance, in the near future the cvs and opam visit types will be added in production.

Those visit types are used in the swh-web origin search form, but they are currently hardcoded,
which means the list must be updated manually each time a new visit type is introduced.

We should have a way to get that list of visit types dynamically, for convenience.
It turns out we can extract that list efficiently using the following Elasticsearch query:

anlambert@carnavalet:~/tmp$ cat es_query.sh 
#!/bin/bash

curl -X POST 'http://localhost:9200/origin-production/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "aggs" : {
        "vist_types" : {
            "terms" : { "field" : "visit_types", "size":10000 }
        }
    },
    "size" : 0
}'
anlambert@carnavalet:~/tmp$ time ./es_query.sh 
{
  "took" : 429,
  "timed_out" : false,
  "_shards" : {
    "total" : 90,
    "successful" : 90,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "vist_types" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "git",
          "doc_count" : 151794432
        },
        {
          "key" : "npm",
          "doc_count" : 1660597
        },
        {
          "key" : "svn",
          "doc_count" : 678797
        },
        {
          "key" : "hg",
          "doc_count" : 381103
        },
        {
          "key" : "pypi",
          "doc_count" : 326793
        },
        {
          "key" : "deb",
          "doc_count" : 72303
        },
        {
          "key" : "cran",
          "doc_count" : 18019
        },
        {
          "key" : "ftp",
          "doc_count" : 1205
        },
        {
          "key" : "deposit",
          "doc_count" : 1079
        },
        {
          "key" : "tar",
          "doc_count" : 389
        },
        {
          "key" : "nixguix",
          "doc_count" : 2
        }
      ]
    }
  }
}

real    0m0,553s
user    0m0,013s
sys     0m0,014s

We could then add a new method named origin_visit_types to the swh-search interface,
wrapping that request and returning a dict mapping each visit type to its origin count.
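
A minimal sketch of what such a method could look like, using only the Python standard library. This is an illustration under stated assumptions, not the actual swh-search implementation: the helper names visit_types_query and parse_visit_types, the base_url and index parameters and the AGG_NAME label are all hypothetical, and the real backend would presumably go through its existing Elasticsearch client rather than raw HTTP.

```python
import json
from typing import Any, Dict
from urllib.request import Request, urlopen

# Arbitrary label for the aggregation (assumption: any name works,
# as long as the same one is used to read the response).
AGG_NAME = "visit_types"


def visit_types_query(max_buckets: int = 10000) -> Dict[str, Any]:
    """Build a terms aggregation over the visit_types field.

    "size": 0 at the top level asks Elasticsearch to return no hits,
    only the aggregation buckets.
    """
    return {
        "aggs": {
            AGG_NAME: {"terms": {"field": "visit_types", "size": max_buckets}}
        },
        "size": 0,
    }


def parse_visit_types(response: Dict[str, Any]) -> Dict[str, int]:
    """Map each bucket key (visit type) to its document count."""
    buckets = response["aggregations"][AGG_NAME]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def origin_visit_types(
    base_url: str = "http://localhost:9200",  # hypothetical default
    index: str = "origin-production",  # index name taken from the query above
) -> Dict[str, int]:
    """Send the aggregation query and return {visit_type: count}."""
    request = Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(visit_types_query()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as resp:
        return parse_visit_types(json.load(resp))
```

With the response shown above, parse_visit_types would yield a dict such as {"git": 151794432, "npm": 1660597, ...}; swh-web could then use its keys to populate the origin search form instead of a hardcoded list.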

Event Timeline

anlambert triaged this task as Normal priority. Thu, Jul 22, 2:41 PM
anlambert created this task.