Details
- Reviewers
zack - Group Reviewers
Reviewers - Maniphest Tasks
- T2822: add "swh web search" command to perform archive searches via the CLI
- Commits
- rDWCLI176c6b134cbb: cli: add 'swh web search' subcommand
rDWCLI472adb752c0f: client: add origin_search() method
Diff Detail
- Repository
- rDWCLI Web client
- Branch
- feature/origin-search
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 17833 Build 27557: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 27556: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D4696 (id=16659)
Rebasing onto 814715517f...
Current branch diff-target is up to date.
Changes applied before test
commit 852fa217805287ecefa81c2b36e483f5cab68915 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:17:58 2020 +0100 cli: add 'swh web search' subcommand Close T2822. commit 0541dbb7cdd48478cd15b938764dbd9dc2fc3773 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:16:31 2020 +0100 client: add origin_search() method
See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/30/ for more details.
That's a good start but documentation should be improved and pagination support should be added.
I also think outputting JSON should be preferred over raw CSV.
Also error handling should be taken care of, currently a traceback is displayed when encountering an HTTP error,
that's not really user friendly.
11:45 $ swh web search python Traceback (most recent call last): File "/home/anlambert/.virtualenvs/swh/bin/swh", line 8, in <module> sys.exit(main()) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 135, in main return swh(auto_envvar_prefix="SWH") File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 829, in __call__ return self.main(*args, **kwargs) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 782, in main rv = self.invoke(ctx) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1066, in invoke return ctx.invoke(self.callback, **ctx.params) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 610, in invoke return callback(*args, **kwargs) File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func return f(get_current_context(), *args, **kwargs) File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/cli.py", line 156, in search results = client.origin_search(url_pattern, limit, with_visit) File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/client.py", line 586, in origin_search f"origin/search/{url_pattern}/", params=params, **req_args File "/home/anlambert/swh/swh-environment/swh-web-client/swh/web/client/client.py", line 191, in _call r.raise_for_status() File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/requests/models.py", line 943, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 502 Server Error: Proxy Error for url: https://archive.softwareheritage.org/api/1/origin/search/python/
swh/web/client/cli.py | ||
---|---|---|
147 | You should add a description about the output format of that command. Also maybe JSON output could be used instead of CSV one ? | |
155 | I think you should move the client instantiation in the web group directly in that diff | |
swh/web/client/client.py | ||
560 | You should add a page_token parameter to return a specific page of the found origins. Currently, only the first page is returned. See https://archive.softwareheritage.org/api/1/origin/search/python/?limit=2 as an example. I also just noticed it is not documented, will fix that today. |
swh/web/client/cli.py | ||
---|---|---|
132 | This is not really a pattern. Rather, it's a list of keywords. They should be described as such to the user, and I think you should make the argument n-ary, as something that can be passed multiple times, one per keyword. | |
135 | I don't think we want an explicit limit in the CLI. May the UNIX philosophy be with you :-) As long as it's lazy, the user can also always interrupt with Ctrl-C, and we have server-side rate limit to counter abuse anyway. | |
137 | this is tricky to name in an intuitive manner, the best I came up with is @click.option("--include-unvisited/--exclude-unvisited", default=False, show_default=True, help="...") (also, it needs a help string) | |
138 | add a help string, which also tells what's the default also, my favorite form would be more explicit than this, e.g.: @click.option("--url-encode/--no-url-encode", default=False, show_default=True, help="....") (also, it needs a help string) | |
147 | I'd like to have a textual/tabular (= CSV, with tabs as separator) output as default, because it's easier to read and manipulate with standard UNIX tools, such as cut. I'm not against an optional output in other formats, including JSON. I'd see that as an additional feature though, not necessarily a blocker for a first implementation with another output format. YMMV. | |
swh/web/client/client.py | ||
560 | +1, we really want pagination support for this one |
swh/web/client/cli.py | ||
---|---|---|
132 | For the url_pattern naming I followed the API doc of the endpoint here: https://archive.softwareheritage.org/api/1/origin/search/doc/ |
Build is green
Patch application report for D4696 (id=16709)
Rebasing onto 814715517f...
Current branch diff-target is up to date.
Changes applied before test
commit 423bc5bcb65be5ffd25a83d80dc50f2e51f1d82d Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:17:58 2020 +0100 cli: add 'swh web search' subcommand Close T2822. commit ee72da1c8890900efae4915415c2b96a6b990d1a Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:16:31 2020 +0100 client: add origin_search() method
See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/31/ for more details.
Build is green
Patch application report for D4696 (id=16721)
Rebasing onto 814715517f...
Current branch diff-target is up to date.
Changes applied before test
commit 71c0573f4526b1b540f942154b68c90ead03b28b Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:17:58 2020 +0100 cli: add 'swh web search' subcommand Close T2822. commit 472adb752c0fc913efc904a517866010d751dad4 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:16:31 2020 +0100 client: add origin_search() method
See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/32/ for more details.
LGTM! just added a few minor comments that can be taken care of before landing
swh/web/client/cli.py | ||
---|---|---|
155–156 | better: "if true, escape origin URLs in results with percent encoding (RFC 3986)" |
Build is green
Patch application report for D4696 (id=16723)
Rebasing onto 814715517f...
Current branch diff-target is up to date.
Changes applied before test
commit 096d8f1e307c32d42ac8013a5458ed7b44cb897f Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:17:58 2020 +0100 cli: add 'swh web search' subcommand Close T2822. commit 472adb752c0fc913efc904a517866010d751dad4 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:16:31 2020 +0100 client: add origin_search() method
See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/33/ for more details.
That's better but this still misses error handling for the CLI command and some tests.
swh/web/client/cli.py | ||
---|---|---|
126 | imports should be at the top of the file | |
168 | same here | |
173 | you should add error handling here and prints exception message to give some feedback to the user. | |
175–179 | This code is not covered, could you add tests for the CLI command ? | |
180 | why is it needed ? | |
184 | in which context the BrokenPipeError error is raised ? | |
swh/web/client/client.py | ||
584–603 | You could make that block more compact the following way, also renaming json variable to origins makes the code easier to read: nb_returned = 0 q = f"origin/search/{query}/" while q and nb_returned < limit: r = self._call(q, params=params, **req_args) origins = r.json() if limit and nb_returned + len(origins) > limit: origins = origins[: limit - nb_returned] nb_returned += len(origins) yield from origins q = None if "next" in r.links and "url" in r.links["next"]: params = [] q = r.links["next"]["url"] | |
599–601 | This code is not covered, could you add a test for pagination ? |
swh/web/client/cli.py | ||
---|---|---|
126 | ah right I forgot about that, this will import requests by transitivity indeed |
swh/web/client/cli.py | ||
---|---|---|
175–179 | Without having the WebAPIClient configurable (T2872) i cannot add offline tests as it is the case for the client.py endpoints. | |
180 | This is needed to pipe the results into head (or other commands), see https://stackoverflow.com/a/26738736 |
swh/web/client/client.py | ||
---|---|---|
584–603 | Hm limit is an optional argument so it can be None, meaning i cannot compare it as the while loop you are suggesting (or adding another check for None before but I am not sure it will be clearer in the end). |
Build is green
Patch application report for D4696 (id=16747)
Rebasing onto 814715517f...
Current branch diff-target is up to date.
Changes applied before test
commit 176c6b134cbb0dfdf6de6e0afd18ec8ada5c3766 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:17:58 2020 +0100 cli: add 'swh web search' subcommand Close T2822. commit 472adb752c0fc913efc904a517866010d751dad4 Author: Thibault Allançon <haltode@gmail.com> Date: Wed Dec 9 14:16:31 2020 +0100 client: add origin_search() method
See https://jenkins.softwareheritage.org/job/DWCLI/job/tests-on-diff/34/ for more details.
swh/web/client/client.py | ||
---|---|---|
584–603 | while q and (limit is None or nb_returned < limit): is totally readable to me. |