
search: Add count_visit_types to interface
Closed, Public

Authored by anlambert on Aug 25 2021, 6:29 PM.

Details

Summary

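It returns the origin counts per visit type.

It also lets other components, such as swh-web, get all available
visit types dynamically.

The underlying Elasticsearch query has been tested on the production
cluster and is fairly efficient: in the run below it took about 1.2 s
of wall-clock time (940 ms reported by Elasticsearch) across 90 shards
and roughly 162 million non-blocklisted origins.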

(swh) ✔ ~/swh/swh-environment/swh-search [count-visit-types L|⚑ 3] 
18:27 $ ssh -L 9200:192.168.100.86:9200 search-esnode4.internal.softwareheritage.org
Linux search-esnode4 5.10.0-0.bpo.5-amd64 #1 SMP Debian 5.10.24-1~bpo10+1 (2021-03-29) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Aug 25 16:26:42 2021 from 192.168.101.15
anlambert@search-esnode4:~$
anlambert@carnavalet:~/tmp$ time curl -X POST http://localhost:9200/origin-production/_search?pretty -H 'Content-Type: application/json' -d '
{
    "aggs" : {
      "not_blocklisted" : {
        "filter": {
          "bool": {
            "must_not": [
                {"term": {"blocklisted": true}}
            ]
          }
        },
        "aggs": {
          "visit_types": {
            "terms" : { "field" : "visit_types", "size": 1000 }
          }
        }
      }
    },
    "size" : 0
}'
{
  "took" : 940,
  "timed_out" : false,
  "_shards" : {
    "total" : 90,
    "successful" : 90,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "not_blocklisted" : {
      "doc_count" : 162289904,
      "visit_types" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "git",
            "doc_count" : 154006431
          },
          {
            "key" : "npm",
            "doc_count" : 1660597
          },
          {
            "key" : "svn",
            "doc_count" : 679040
          },
          {
            "key" : "hg",
            "doc_count" : 415270
          },
          {
            "key" : "pypi",
            "doc_count" : 398714
          },
          {
            "key" : "deb",
            "doc_count" : 72303
          },
          {
            "key" : "cran",
            "doc_count" : 18019
          },
          {
            "key" : "ftp",
            "doc_count" : 1205
          },
          {
            "key" : "deposit",
            "doc_count" : 1114
          },
          {
            "key" : "tar",
            "doc_count" : 390
          },
          {
            "key" : "nixguix",
            "doc_count" : 2
          }
        ]
      }
    }
  }
}

real    0m1,168s
user    0m0,012s
sys     0m0,005s
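
For readers who want to see how the aggregation above maps to code, here is a minimal Python sketch. It is not the implementation from this diff: the function names `build_visit_types_count_query` and `parse_visit_types_count` and the `max_types` parameter are hypothetical, and only the query body and the response layout are taken from the session above.

```python
from collections import Counter
from typing import Any, Dict


def build_visit_types_count_query(max_types: int = 1000) -> Dict[str, Any]:
    """Build the same request body as the curl call above (hypothetical helper)."""
    return {
        "aggs": {
            "not_blocklisted": {
                # Keep only documents that are not flagged as blocklisted.
                "filter": {
                    "bool": {"must_not": [{"term": {"blocklisted": True}}]}
                },
                "aggs": {
                    # One bucket per distinct value of the visit_types field.
                    "visit_types": {
                        "terms": {"field": "visit_types", "size": max_types}
                    }
                },
            }
        },
        # No search hits are needed, only the aggregation result.
        "size": 0,
    }


def parse_visit_types_count(response: Dict[str, Any]) -> Counter:
    """Turn the aggregation buckets into a visit type -> origin count mapping."""
    buckets = response["aggregations"]["not_blocklisted"]["visit_types"]["buckets"]
    return Counter({bucket["key"]: bucket["doc_count"] for bucket in buckets})


# Trimmed copy of the production response shown above (two buckets only).
sample_response = {
    "aggregations": {
        "not_blocklisted": {
            "doc_count": 162289904,
            "visit_types": {
                "buckets": [
                    {"key": "git", "doc_count": 154006431},
                    {"key": "nixguix", "doc_count": 2},
                ]
            },
        }
    }
}

visit_type_counts = parse_visit_types_count(sample_response)
```

Returning a `Counter` means a visit type that is absent from the buckets reads as 0 rather than raising, which is convenient for callers that only need to know which visit types exist.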

Related to T3441.

Diff Detail

Repository
rDSEA Archive search
Branch
count-visit-types
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23176
Build 36160: Phabricator diff pipeline on jenkins
Build 36159: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6137 (id=22208)

Rebasing onto 26f800cde3...

Current branch diff-target is up to date.
Changes applied before test
commit 44493ba15bc316d82434415a4e234071f08ba2bb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Aug 25 18:23:42 2021 +0200

    search: Add count_visit_types to interface
    
    It enables to return the origin counts per visit type.
    
    It also enables to get all available visit types dynamically in
    other components like swh-web.
    
    The underlying elasticsearch query has been tested on production
    cluster and it is pretty efficient.
    
    Related to T3441.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/281/ for more details.

Rename count_visit_types to visit_types_count.

Build is green

Patch application report for D6137 (id=22210)

Rebasing onto 26f800cde3...

Current branch diff-target is up to date.
Changes applied before test
commit 9c18b776bca4c811fdf05a3a064676f4e2b22cb2
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Aug 25 18:23:42 2021 +0200

    search: Add visit_types_count to interface
    
    It enables to return the origin counts per visit type.
    
    It also enables to get all available visit types dynamically in
    other components like swh-web.
    
    The underlying elasticsearch query has been tested on production
    cluster and it is pretty efficient.
    
    Related to T3441.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/282/ for more details.

ardumont added a subscriber: ardumont.

lgtm, one suggestion inline.

swh/search/tests/test_search.py
1181

Maybe also add a blocklisted entry here to check that it is indeed not counted (the assertion should stay the same, iiuc).

This revision is now accepted and ready to land. Aug 26 2021, 9:35 AM
swh/search/tests/test_search.py
1181

good idea, will update test
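
The check being discussed can be illustrated with a small self-contained sketch. This is not the test in swh/search/tests/test_search.py: `count_visit_types` is a hypothetical in-memory stand-in for the search backend, and the origin URLs are made up for the example. It only shows why adding a blocklisted origin should leave the expected counts unchanged.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_visit_types(origins: Iterable[Dict[str, Any]]) -> Counter:
    """Count origins per visit type, skipping blocklisted ones (hypothetical helper)."""
    counts: Counter = Counter()
    for origin in origins:
        if origin.get("blocklisted", False):
            continue  # blocklisted origins must not contribute to any count
        # An origin counts once per distinct visit type it has.
        counts.update(set(origin.get("visit_types", [])))
    return counts


# Made-up origins for illustration; the last one is blocklisted.
ORIGINS = [
    {"url": "https://example.org/repo1", "visit_types": ["git"]},
    {"url": "https://example.org/repo2", "visit_types": ["git", "hg"]},
    {"url": "https://example.org/pkg", "visit_types": ["pypi"]},
    {"url": "https://example.org/blocked", "visit_types": ["git"], "blocklisted": True},
]

counts = count_visit_types(ORIGINS)
# The expected value is the same with or without the blocklisted entry.
assert counts == Counter({"git": 2, "hg": 1, "pypi": 1})
```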

Update test according to @ardumont's suggestion.

Build is green

Patch application report for D6137 (id=22218)

Rebasing onto 26f800cde3...

Current branch diff-target is up to date.
Changes applied before test
commit 3893e39edef24f432c2bb068b09043b6c804c2f9
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Aug 25 18:23:42 2021 +0200

    search: Add visit_types_count to interface
    
    It enables to return the origin counts per visit type.
    
    It also enables to get all available visit types dynamically in
    other components like swh-web.
    
    The underlying elasticsearch query has been tested on production
    cluster and it is pretty efficient.
    
    Related to T3441.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/283/ for more details.