Implement query to get origin visit types dynamically
Closed, Migrated

Description

The archive content grows continually and new visit types are added over time.
For instance, in the near future cvs and opam visit types will be added in production.

Those visit types are used in the swh-web origin search form, but they are currently hardcoded,
which means the list must be updated manually each time a new visit type is introduced.

We should have a way to get the list of visit types dynamically, for convenience.
It turns out we can extract that list efficiently using the following Elasticsearch query:

anlambert@carnavalet:~/tmp$ cat es_query.sh 
#!/bin/bash

curl -X POST http://localhost:9200/origin-production/_search?pretty -H 'Content-Type: application/json' -d '
{
    "aggs" : {
        "vist_types" : {
            "terms" : { "field" : "visit_types", "size":10000 }
        }
    },
    "size" : 0
}'
anlambert@carnavalet:~/tmp$ time ./es_query.sh 
{
  "took" : 429,
  "timed_out" : false,
  "_shards" : {
    "total" : 90,
    "successful" : 90,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "vist_types" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "git",
          "doc_count" : 151794432
        },
        {
          "key" : "npm",
          "doc_count" : 1660597
        },
        {
          "key" : "svn",
          "doc_count" : 678797
        },
        {
          "key" : "hg",
          "doc_count" : 381103
        },
        {
          "key" : "pypi",
          "doc_count" : 326793
        },
        {
          "key" : "deb",
          "doc_count" : 72303
        },
        {
          "key" : "cran",
          "doc_count" : 18019
        },
        {
          "key" : "ftp",
          "doc_count" : 1205
        },
        {
          "key" : "deposit",
          "doc_count" : 1079
        },
        {
          "key" : "tar",
          "doc_count" : 389
        },
        {
          "key" : "nixguix",
          "doc_count" : 2
        }
      ]
    }
  }
}

real    0m0,553s
user    0m0,013s
sys     0m0,014s

We could then add a new method named origin_visit_types to the swh-search interface,
wrapping that request and returning a dict mapping each visit type to its count.
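
As a rough illustration of that proposal, here is a minimal Python sketch of what such a method could look like. It is not the actual swh-search implementation. The URL and index name are taken from the query above; the helper names (build_visit_types_query, post_json), the injectable search parameter and the aggregation name "visit_types" are assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict
from urllib.request import Request, urlopen

# Assumed values: URL and index name come from the example query above.
ES_SEARCH_URL = "http://localhost:9200/origin-production/_search"
# The aggregation name is arbitrary; this one is chosen for the sketch.
AGG_NAME = "visit_types"


def build_visit_types_query(max_types: int = 10000) -> Dict[str, Any]:
    """Build a terms aggregation over the visit_types field, returning no hits."""
    return {
        "aggs": {AGG_NAME: {"terms": {"field": "visit_types", "size": max_types}}},
        "size": 0,
    }


def post_json(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical transport helper: POST a JSON body and decode the JSON reply."""
    request = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)


def origin_visit_types(
    search: Callable[[str, Dict[str, Any]], Dict[str, Any]] = post_json,
    url: str = ES_SEARCH_URL,
) -> Dict[str, int]:
    """Return a dict mapping each visit type to its number of origins."""
    response = search(url, build_visit_types_query())
    buckets = response["aggregations"][AGG_NAME]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

Passing the transport as a parameter keeps the parsing logic testable without a running Elasticsearch instance. swh-web could then build the search form's visit type list from the keys of the returned dict.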