Page MenuHomeSoftware Heritage

Add "swh web search" command to perform archive searches via the CLI
ClosedPublic

Authored by haltode on Dec 9 2020, 2:39 PM.

Details

Summary

Close T2822.

TODO (probably in separate diffs):

  • Add metadata-search, in a separate endpoint/command?
  • Add config file option for "swh web" commands (T2872)
  • Add tests for new CLI commands (require config file first)

Diff Detail

Repository
rDWCLI Web client
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4696 (id=16659)

Rebasing onto 814715517f...

Current branch diff-target is up to date.
Changes applied before test
commit 852fa217805287ecefa81c2b36e483f5cab68915
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:17:58 2020 +0100

    cli: add 'swh web search' subcommand
    
    Close T2822.

commit 0541dbb7cdd48478cd15b938764dbd9dc2fc3773
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:16:31 2020 +0100

    client: add origin_search() method

See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/30/ for more details.

haltode retitled this revision from Add "swh web search" command to perform archive searches via the CLI to WIP: Add "swh web search" command to perform archive searches via the CLI.Dec 9 2020, 2:41 PM
haltode edited the summary of this revision. (Show Details)
haltode retitled this revision from WIP: Add "swh web search" command to perform archive searches via the CLI to Add "swh web search" command to perform archive searches via the CLI.

That's a good start but documentation should be improved and pagination support should be added.
I also think outputting JSON should be preferred over raw CSV.

Also error handling should be taken care of, currently a traceback is displayed when encountering an HTTP error,
that's not really user friendly.

11:45 $ swh web search python
Traceback (most recent call last):
  File "/home/anlambert/.virtualenvs/swh/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/cli.py", line 156, in search
    results = client.origin_search(url_pattern, limit, with_visit)
  File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/client.py", line 586, in origin_search
    f"origin/search/{url_pattern}/", params=params, **req_args
  File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/client.py", line 191, in _call
    r.raise_for_status()
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Proxy Error for url: https://archive.softwareheritage.org/api/1/origin/search/python/
swh/web/client/cli.py
147 ↗(On Diff #16659)

You should add a description about the output format of that command.

Also maybe JSON output could be used instead of CSV one ?

155 ↗(On Diff #16659)

I think you should move the client instantiation in the web group directly in that diff

swh/web/client/client.py
559

You should add a page_token parameter to return a specific page of the found origins. Currently, only the first page is returned.
I also think the page_token value to get the next page of search results should be part of the command response.
FYI, that value must be considered as opaque, currently it is a number but once we will switch to elasticsearch as search backend the format will change.

See https://archive.softwareheritage.org/api/1/origin/search/python/?limit=2 as an example.

I also just noticed it is not documented, will fix that today.

This revision now requires changes to proceed.Dec 10 2020, 12:20 PM
zack requested changes to this revision.Dec 10 2020, 2:17 PM
zack added inline comments.
swh/web/client/cli.py
132 ↗(On Diff #16659)

This is not really a pattern. Rather, it's a list of keywords. They should be described as such to the user, and I think you should make the argument n-ary, as something that can be passed multiple times, one per keyword.

135 ↗(On Diff #16659)

I don't think we want an explicit limit in the CLI.
I'd rather make the client retrieve everything, and let the user filter to a given limit downstream of the tool, e.g., with head, or with jq if we add JSON output.

May the UNIX philosophy be with you :-)

As long as it's lazy, the user can also always interrupt with Ctrl-C, and we have server-side rate limit to counter abuse anyway.

137 ↗(On Diff #16659)

this is tricky to name in an intuitive manner, the best I came up with is

@click.option("--include-unvisited/--exclude-unvisited", default=False, show_default=True, help="...")

(also, it needs a help string)

138 ↗(On Diff #16659)

add a help string, which also tells what's the default

also, my favorite form would be more explicit than this, e.g.:

@click.option("--url-encode/--no-url-encode", default=False, show_default=True, help="....")

(also, it needs a help string)

147 ↗(On Diff #16659)

I'd like to have a textual/tabular (= CSV, with tabs as separator) output as default, because it's easier to read and manipulate with standard UNIX tools, such as cut.

I'm not against an optional output in other formats, including JSON. I'd see that as an additional feature though, not necessarily a blocker for a first implementation with another output format. YMMV.

swh/web/client/client.py
559

+1, we really want pagination support for this one

swh/web/client/cli.py
132 ↗(On Diff #16659)

For the url_pattern naming I followed the API doc of the endpoint here: https://archive.softwareheritage.org/api/1/origin/search/doc/

Rework documentation and CLI interface

Build is green

Patch application report for D4696 (id=16709)

Rebasing onto 814715517f...

Current branch diff-target is up to date.
Changes applied before test
commit 423bc5bcb65be5ffd25a83d80dc50f2e51f1d82d
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:17:58 2020 +0100

    cli: add 'swh web search' subcommand
    
    Close T2822.

commit ee72da1c8890900efae4915415c2b96a6b990d1a
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:16:31 2020 +0100

    client: add origin_search() method

See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/31/ for more details.

Support pagination in origin_search.

Build is green

Patch application report for D4696 (id=16721)

Rebasing onto 814715517f...

Current branch diff-target is up to date.
Changes applied before test
commit 71c0573f4526b1b540f942154b68c90ead03b28b
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:17:58 2020 +0100

    cli: add 'swh web search' subcommand
    
    Close T2822.

commit 472adb752c0fc913efc904a517866010d751dad4
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:16:31 2020 +0100

    client: add origin_search() method

See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/32/ for more details.

LGTM! just added a few minor comments that can be taken care of before landing

swh/web/client/cli.py
156–157 ↗(On Diff #16721)

better: "if true, escape origin URLs in results with percent encoding (RFC 3986)"

Build is green

Patch application report for D4696 (id=16723)

Rebasing onto 814715517f...

Current branch diff-target is up to date.
Changes applied before test
commit 096d8f1e307c32d42ac8013a5458ed7b44cb897f
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:17:58 2020 +0100

    cli: add 'swh web search' subcommand
    
    Close T2822.

commit 472adb752c0fc913efc904a517866010d751dad4
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:16:31 2020 +0100

    client: add origin_search() method

See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/33/ for more details.

That's better but this still misses error handling for the CLI command and some tests.

swh/web/client/cli.py
127 ↗(On Diff #16723)

imports should be at the top of the file

169 ↗(On Diff #16723)

same here

174 ↗(On Diff #16723)

you should add error handling here and prints exception message to give some feedback to the user.

176–180 ↗(On Diff #16723)

This code is not covered, could you add tests for the CLI command ?

181 ↗(On Diff #16723)

why is it needed ?

185 ↗(On Diff #16723)

in which context the BrokenPipeError error is raised ?

swh/web/client/client.py
584–603

You could make that block more compact the following way, also renaming json variable to origins makes the code easier to read:

nb_returned = 0
q = f"origin/search/{query}/"
while q and nb_returned < limit:
    r = self._call(q, params=params, **req_args)
    origins = r.json()
    if limit and nb_returned + len(origins) > limit:
        origins = origins[: limit - nb_returned]

    nb_returned += len(origins)
    yield from origins

    q = None
    if "next" in r.links and "url" in r.links["next"]:
        params = []
        q = r.links["next"]["url"]
599–601

This code is not covered, could you add a test for pagination ?

This revision now requires changes to proceed.Dec 11 2020, 3:23 PM
zack added inline comments.
swh/web/client/cli.py
127 ↗(On Diff #16723)

nah, in a cli.py module moving import to the top would make @douardda sad :-)

swh/web/client/cli.py
127 ↗(On Diff #16723)

ah right I forgot about that, this will import requests by transitivity indeed

haltode added inline comments.
swh/web/client/cli.py
176–180 ↗(On Diff #16723)

Without having the WebAPIClient configurable (T2872) i cannot add offline tests as it is the case for the client.py endpoints.

181 ↗(On Diff #16723)

This is needed to pipe the results into head (or other commands), see https://stackoverflow.com/a/26738736

swh/web/client/client.py
584–603

Hm limit is an optional argument so it can be None, meaning i cannot compare it as the while loop you are suggesting (or adding another check for None before but I am not sure it will be clearer in the end).

Build is green

Patch application report for D4696 (id=16747)

Rebasing onto 814715517f...

Current branch diff-target is up to date.
Changes applied before test
commit 176c6b134cbb0dfdf6de6e0afd18ec8ada5c3766
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:17:58 2020 +0100

    cli: add 'swh web search' subcommand
    
    Close T2822.

commit 472adb752c0fc913efc904a517866010d751dad4
Author: Thibault Allançon <haltode@gmail.com>
Date:   Wed Dec 9 14:16:31 2020 +0100

    client: add origin_search() method

See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/34/ for more details.

This revision is now accepted and ready to land.Dec 14 2020, 1:20 PM
anlambert added inline comments.
swh/web/client/client.py
584–603
while q and (limit is None or nb_returned < limit):

is totally readable to me.