Use the HTTP API endpoint to get package names and build origin URLs.
Details
- Reviewers: anlambert
- Group Reviewers: Reviewers
- Maniphest Tasks: T4597: Create a Hackage Lister
- Commits: rDLS6696a8424ad1: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Diff Detail
- Repository: rDLS Listers
- Branch: hackage
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 31335
  - Build 49018: Phabricator diff pipeline on jenkins
  - Build 49017: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D8338 (id=30110)
Rebasing onto b7b11887a0...
Current branch diff-target is up to date.
Changes applied before test
commit b3c640c54121c55286d0fa0ecf8c41670bbcbe56
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    [WIP] Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/628/ for more details.
Build is green
Patch application report for D8338 (id=30146)
Rebasing onto c6ce862d32...
First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Resolve all conflicts manually, mark them as resolved with "git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!
Could not rebase; Attempt merge onto c6ce862d32...
Already up to date.
Changes applied before test
commit c8b66bfea3a125cbb558200e0757038c5811713c
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/634/ for more details.
@ardumont @vlorentz This one is quite simple but like pubdev we do not have access to coherent data to set a last_update. See https://hackage.haskell.org/packages/
To retrieve origins we could alternatively download an index.tar.gz, which lists package names, their versions and, for each version, a cabal file with some metadata. Nothing in it is date related, though, so the only benefit would be getting the list of versions.
Example for the package 4Blocks in index/4Blocks/0.1/4Blocks.cabal:
-- 4Blocks.cabal auto-generated by cabal init. For additional options, see
-- http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/authors.html#pkg-descr.
-- The name of the package.
Name: 4Blocks
Version: 0.1
Synopsis: A tetris-like game (works with GHC 6.8.3 and Gtk2hs 0.9.13)
Description: A tetris-like game implemented in Haskell and making use of Gtkh2s (works with GHC 6.8.3 and Gtk2hs 0.9.13)
Homepage: http://lambdacolyte.wordpress.com/2009/08/06/tetris-in-haskell/
License: BSD3
License-file: LICENSE
Author: Andrew Calleja
Maintainer: drewcalleja@gmail.com
Category: Game
Build-type: Simple
Cabal-version: >=1.2
Tested-with: GHC == 6.8.3
Executable 4Blocks
  Main-is: 4Blocks.hs
  Build-depends: base >= 2 && <= 4,gtk>=0.9.13,haskell98,cairo>=0.9.13,containers>=0.1.0.2,mtl>=1.1.0.1
There is an API that provides access to the lastUpload:
$ curl "https://hackage.haskell.org/packages/search" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    --data '{"page": 0, "sortColumn": "default", "sortDirection": "ascending", "searchQuery": "(deprecated:any)"}' \
    -X POST | jq . | head -n 50
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23907    0 23806  100   101  40145    170 --:--:-- --:--:-- --:--:-- 40315
{
  "numberOfResults": 16711,
  "pageContents": [
    {
      "description": "Haskell package for easy integration with the 2captcha API.",
      "downloads": 1,
      "lastUpload": "2021-09-09T05:13:30.343509948Z",
      "maintainers": [
        {
          "display": "qwbarch",
          "uri": "/user/qwbarch"
        }
      ],
      "name": {
        "display": "2captcha",
        "uri": "/package/2captcha"
      },
      "tags": [
        {
          "display": "deprecated",
          "uri": "/packages/tag/deprecated"
        },
        {
          "display": "library",
          "uri": "/packages/tag/library"
        },
        {
          "display": "mit",
          "uri": "/packages/tag/mit"
        },
        {
          "display": "network",
          "uri": "/packages/tag/network"
        }
      ],
      "votes": 1.5
    },
    {
      "description": "Examples of 3D graphics programming with OpenGL",
      "downloads": 8,
      "lastUpload": "2016-07-22T14:26:23.038905Z",
      "maintainers": [
        {
          "display": "WolfgangJeltsch",
          "uri": "/user/WolfgangJeltsch"
        }
      ],
      "name": {
        "display": "3d-graphics-examples",
        "uri": "/package/3d-graphics-examples"
You can also use the same API for incremental listing by filtering on lastUpload in the search query.
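To make the shape of that response concrete, here is a minimal sketch of turning one pageContents entry into an origin URL plus a timezone-aware last_update. The helper names (parse_last_upload, entry_to_origin) are hypothetical and not taken from the lister under review, and joining the base URL with name.uri is an assumption about how origin URLs could be built. The timestamp is parsed by hand because some lastUpload values carry nanosecond precision, which is more digits than datetime keeps.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, Tuple
from urllib.parse import urljoin

# Matches timestamps such as "2021-09-09T05:13:30.343509948Z"; the
# fractional part is optional and may be longer than six digits.
_LAST_UPLOAD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$"
)


def parse_last_upload(value: str) -> datetime:
    """Parse a lastUpload string into a UTC datetime (hypothetical helper)."""
    match = _LAST_UPLOAD_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected lastUpload value: {value!r}")
    seconds, fraction = match.groups()
    # Keep at most microsecond precision: truncate or right-pad to 6 digits.
    micros = int((fraction or "")[:6].ljust(6, "0"))
    parsed = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def entry_to_origin(
    entry: Dict[str, Any], base_url: str = "https://hackage.haskell.org/"
) -> Tuple[str, datetime]:
    """Build (origin_url, last_update) from one search result entry.

    Assumption: the origin URL is the base URL joined with name.uri.
    """
    url = urljoin(base_url, entry["name"]["uri"])
    return url, parse_last_upload(entry["lastUpload"])


# Entry trimmed from the response shown above.
entry = {
    "lastUpload": "2021-09-09T05:13:30.343509948Z",
    "name": {"display": "2captcha", "uri": "/package/2captcha"},
}
url, last_update = entry_to_origin(entry)
print(url, last_update.isoformat())
```

With a last_update available per origin, an incremental run could skip entries that have not changed since the previous listing.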
I now understand why I did not try this endpoint in the first place. It is not documented as a POST endpoint (and it does not seem natural to use POST to fetch something that would usually be a GET with query parameters).
I've made an implementation this way but now I have to manage pagination. The endpoint returns only 50 entries and I did not find a way to bypass that (using pageSize has no effect).
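A minimal sketch of the pagination loop described here, assuming the response fields shown earlier in the thread (numberOfResults, pageContents) and a fixed page size of 50. This is not the lister's actual implementation: iter_pages, build_params and make_fake_fetch are hypothetical names, and the HTTP call is passed in as a function so the loop can be exercised without network access.

```python
from typing import Any, Callable, Dict, Iterator, List

PAGE_SIZE = 50  # observed in the thread; pageSize is reported to have no effect

Params = Dict[str, Any]
Entry = Dict[str, Any]


def build_params(page: int) -> Params:
    """JSON body for one search request, mirroring the curl example."""
    return {
        "page": page,
        "sortColumn": "default",
        "sortDirection": "ascending",
        "searchQuery": "(deprecated:any)",
    }


def iter_pages(fetch: Callable[[Params], Dict[str, Any]]) -> Iterator[List[Entry]]:
    """Yield pageContents lists until numberOfResults entries have been seen."""
    page = 0
    seen = 0
    while True:
        data = fetch(build_params(page))
        entries = data["pageContents"]
        if not entries:
            break  # defensive stop if the server returns an empty page
        yield entries
        seen += len(entries)
        if seen >= data["numberOfResults"]:
            break
        page += 1


def make_fake_fetch(total: int) -> Callable[[Params], Dict[str, Any]]:
    """In-memory stand-in for the search endpoint, for illustration only."""
    packages = [
        {"name": {"display": f"pkg{i}", "uri": f"/package/pkg{i}"}}
        for i in range(total)
    ]

    def fetch(params: Params) -> Dict[str, Any]:
        start = params["page"] * PAGE_SIZE
        return {
            "numberOfResults": total,
            "pageContents": packages[start : start + PAGE_SIZE],
        }

    return fetch


page_sizes = [len(page) for page in iter_pages(make_fake_fetch(51))]
print(page_sizes)
```

Against the real endpoint, fetch could wrap something like requests.post(url, json=params, headers={"Accept": "application/json"}).json(). Stopping on numberOfResults avoids one extra request when the total is an exact multiple of 50.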
The endpoint is documented: https://hackage.haskell.org/api#search/browse%20backend
But since Hackage's documentation is clearly spotty, I used Firefox's debugger to see which API the GUI uses; that's how I found this endpoint.
Change the HTTP API endpoint used for search in order to retrieve a last_update.
Switch from GET to POST to get results.
The lister is no longer single-page; each page lists 50 origins.
Build is green
Patch application report for D8338 (id=30239)
Rebasing onto 7638f2028b...
First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Resolve all conflicts manually, mark them as resolved with "git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!
Could not rebase; Attempt merge onto 7638f2028b...
Already up to date.
Changes applied before test
commit 2eb481a71b73ae93271da0a8f7bd8b4246d2c295
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/637/ for more details.
Lister runs fine on docker
swh-lister_1 | [2022-09-01 16:34:17,233: INFO/ForkPoolWorker-1] Task swh.lister.hackage.tasks.HackageListerTask[5e1d7981-0aca-4ee2-a8c7-1520ef28d959] succeeded in 97.8861533490126s: {'pages': 334, 'origins': 16700}
swh/lister/hackage/lister.py | ||
---|---|---|
91–108 |
|
Build is green
Patch application report for D8338 (id=30246)
Rebasing onto 7638f2028b...
First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Resolve all conflicts manually, mark them as resolved with "git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!
Could not rebase; Attempt merge onto 7638f2028b...
Already up to date.
Changes applied before test
commit 201bb0c8249abdef51e89c99708ba4df18de50eb
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/638/ for more details.
swh/lister/hackage/lister.py | ||
---|---|---|
91–108 | Ok, thanks, better now |
Testing Docker with that last commit
Task swh.lister.hackage.tasks.HackageListerTask[3c406dfb-7671-4413-8dda-13e27fd8a175] succeeded in 97.21437864698237s: {'pages': 335, 'origins': 16714}
Build is green
Patch application report for D8338 (id=30303)
Rebasing onto 44560c2383...
First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Resolve all conflicts manually, mark them as resolved with "git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!
Could not rebase; Attempt merge onto 44560c2383...
Already up to date.
Changes applied before test
commit 67d8cee5d2506695e396946ae90367ed3d66dc6f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/648/ for more details.
Build is green
Patch application report for D8338 (id=30350)
Rebasing onto 44560c2383...
First, rewinding head to replay your work on top of it... Fast-forwarded diff-target to base-revision-653-D8338.
Changes applied before test
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/653/ for more details.
Build is green
Patch application report for D8338 (id=30357)
Rebasing onto c819cc237d...
Current branch diff-target is up to date.
Changes applied before test
commit 9ee0432b0992e1955ac5987672d9e02fcdbcd23b
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/654/ for more details.
Build is green
Patch application report for D8338 (id=30753)
Rebasing onto d5c30a3ce3...
Current branch diff-target is up to date.
Changes applied before test
commit fecadff078b7439baf2a897169da65ba8f0c8d7f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/693/ for more details.
swh/lister/hackage/lister.py | ||
---|---|---|
25–28 | You can remove the user agent setting code; it is now handled in the base Lister class. | |
54 | You can remove that line; the session is now created in the base Lister class. | |
65–80 | You can remove that method; I added an http_request method to the base Lister class to deduplicate some code. | |
97–99 | Use this instead: data = self.http_request(url=self.PACKAGE_NAMES_URL_PATTERN.format(base_url=self.url), method="POST", json=params).json() | |
109–112 | Same as my latest comment above. | |
swh/lister/hackage/tests/test_lister.py | ||
1–125 | Nitpicks about the test implementation: it is better to use the requests_mock fixture, plus a couple of improvements; see the diff below: |
diff --git a/swh/lister/hackage/tests/test_lister.py b/swh/lister/hackage/tests/test_lister.py
index eada037..93bb6f4 100644
--- a/swh/lister/hackage/tests/test_lister.py
+++ b/swh/lister/hackage/tests/test_lister.py
@@ -3,21 +3,16 @@
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
+import functools
 import json
-from os import path
 from pathlib import Path
 from urllib.parse import unquote, urlparse
 
-import pytest
-import requests_mock
-
 from swh.lister.hackage.lister import HackageLister
 
 
-def json_callback(request, context):
+def json_callback(request, context, datadir):
     """Callback for requests_mock that load a json file regarding a page number"""
-    here = path.abspath(path.dirname(__file__))
-    datadir = Path(here, "data")
     page = request.json()["page"]
 
     unquoted_url = unquote(request.url)
@@ -31,19 +26,13 @@ def json_callback(request, context):
     return json.loads(Path(datadir, dirname, f"{filename}_{page}").read_text())
 
 
-@pytest.fixture
-def mock_post():
-    """Mock `https://hackage.haskell.org/packages/search`"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://hackage.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
+def test_hackage_lister(swh_scheduler, requests_mock, datadir):
 
-def test_hackage_lister(swh_scheduler, mock_post, datadir):
+    requests_mock.post(
+        url="https://hackage.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
 
     expected_origins = []
 
@@ -63,7 +52,7 @@ def test_hackage_lister(swh_scheduler, mock_post, datadir):
     res = lister.run()
 
     assert res.pages == 3
-    assert res.origins == 50 + 50 + 50
+    assert res.origins == res.pages * 50
 
     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
 
@@ -84,19 +73,12 @@ def test_hackage_lister(swh_scheduler, mock_post, datadir):
     }
 
 
-@pytest.fixture
-def mock_post_49():
-    """Mock 49 entries"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://fake49.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
-
-def test_hackage_lister_pagination_49(swh_scheduler, mock_post_49, datadir):
+def test_hackage_lister_pagination_49(swh_scheduler, requests_mock, datadir):
+    requests_mock.post(
+        url="https://fake49.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
     lister = HackageLister(scheduler=swh_scheduler, url="https://fake49.haskell.org/")
     pages = list(lister.get_pages())
     # there should be 1 page with 49 entries
@@ -104,19 +86,12 @@ def test_hackage_lister_pagination_49(swh_scheduler, mock_post_49, datadir):
     assert len(pages[0]) == 49
 
 
-@pytest.fixture
-def mock_post_51():
-    """Mock 51 entries"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://fake51.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
-
-def test_hackage_lister_pagination_51(swh_scheduler, mock_post_51, datadir):
+def test_hackage_lister_pagination_51(swh_scheduler, requests_mock, datadir):
+    requests_mock.post(
+        url="https://fake51.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
     lister = HackageLister(scheduler=swh_scheduler, url="https://fake51.haskell.org/")
     pages = list(lister.get_pages())
     # there should be 2 pages with 50 + 1 entries
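The key idea in the suggested change is binding datadir into the callback with functools.partial, so that requests_mock can still call it with its usual (request, context) arguments. The sketch below illustrates that pattern with the standard library only. The directory and file naming inside json_callback is an assumption made for illustration (the diff elides those lines), the fake request object is a stand-in for what requests_mock would pass, and demo is a hypothetical helper.

```python
import functools
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace
from urllib.parse import urlparse


def json_callback(request, context, datadir):
    """Load the canned JSON file matching the requested page number."""
    page = request.json()["page"]
    parsed = urlparse(request.url)
    # Assumed layout: <datadir>/<scheme>_<host>/<path with "/" as "_">_<page>
    dirname = f"{parsed.scheme}_{parsed.hostname}"
    filename = parsed.path.strip("/").replace("/", "_")
    return json.loads(Path(datadir, dirname, f"{filename}_{page}").read_text())


def demo():
    """Exercise the bound callback against a temporary data directory."""
    payload = {"numberOfResults": 0, "pageContents": []}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp, "https_hackage.haskell.org")
        target.mkdir()
        (target / "packages_search_0").write_text(json.dumps(payload))

        # partial() fixes datadir, leaving the two-argument signature
        # that a mocking library would call.
        callback = functools.partial(json_callback, datadir=tmp)

        # Stand-in for the request object handed to the callback.
        fake_request = SimpleNamespace(
            url="https://hackage.haskell.org/packages/search",
            json=lambda: {"page": 0},
        )
        return callback(fake_request, None)


result = demo()
print(result)
```

Binding the directory this way removes the need for the callback to compute its own data path, which is what lets the tests reuse the shared datadir fixture.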
Improvements after review
Make use of http_retry instead of throttling_retry decorator after D8519
Rewrite test implementation
Adapt docker documentation usage example
swh/lister/hackage/tests/test_lister.py | ||
---|---|---|
1–125 | Nice, thanks |
Build is green
Patch application report for D8338 (id=30782)
Rebasing onto fd1a4244a0...
Current branch diff-target is up to date.
Changes applied before test
commit 31188925b579b29bb25ab05c901e95824d174599
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/698/ for more details.
Build is green
Patch application report for D8338 (id=30823)
Rebasing onto 8ff418fbc2...
Current branch diff-target is up to date.
Changes applied before test
commit 6696a8424ad19feb137429ffb66ba08cc77a2e34
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

    Use http api point to get package names and build origin urls.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/707/ for more details.