
Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Closed, Public

Authored by franckbret on Aug 29 2022, 6:57 PM.

Details

Summary

Use the HTTP API endpoint to get package names and build origin URLs.

Diff Detail

Repository
rDLS Listers
Branch
hackage
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31383
Build 49093: Phabricator diff pipeline on jenkins
Build 49092: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8338 (id=30110)

Rebasing onto b7b11887a0...

Current branch diff-target is up to date.
Changes applied before test
commit b3c640c54121c55286d0fa0ecf8c41670bbcbe56
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    [WIP] Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/628/ for more details.

Remove a forgotten and now useless test file; remove the WIP prefix

franckbret retitled this revision from "[WIP] Hackage: List origins from hackage.haskell.org, The Haskell Package Repository" to "Hackage: List origins from hackage.haskell.org, The Haskell Package Repository". Aug 30 2022, 4:04 PM

Build is green

Patch application report for D8338 (id=30146)

Rebasing onto c6ce862d32...

First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto c6ce862d32...

Already up to date.
Changes applied before test
commit c8b66bfea3a125cbb558200e0757038c5811713c
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/634/ for more details.

@ardumont @vlorentz This one is quite simple, but as with pubdev, we do not have access to coherent data to set a last_update. See https://hackage.haskell.org/packages/

To retrieve origins we could alternatively download an index.tar.gz, which lists package names, their versions, and a cabal file with some metadata for each version. Nothing in it is date related, though, so the only benefit would be getting the versions.

Example for the package 4Blocks in index/4Blocks/0.1/4Blocks.cabal:

-- 4Blocks.cabal auto-generated by cabal init. For additional options, see
-- http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/authors.html#pkg-descr.
-- The name of the package.
Name:                4Blocks
Version:             0.1
Synopsis:            A tetris-like game (works with GHC 6.8.3 and Gtk2hs 0.9.13)
Description:         A tetris-like game implemented in Haskell and making use of Gtkh2s (works with GHC 6.8.3 and Gtk2hs 0.9.13)
Homepage:            http://lambdacolyte.wordpress.com/2009/08/06/tetris-in-haskell/
License:             BSD3
License-file:        LICENSE
Author:              Andrew Calleja
Maintainer:          drewcalleja@gmail.com
Category:            Game
Build-type:          Simple
Cabal-version:       >=1.2
Tested-with:         GHC == 6.8.3

Executable 4Blocks
  Main-is:    4Blocks.hs
  Build-depends:     base >= 2 && <= 4,gtk>=0.9.13,haskell98,cairo>=0.9.13,containers>=0.1.0.2,mtl>=1.1.0.1

There is an API that provides access to the lastUpload:

$ curl "https://hackage.haskell.org/packages/search" -H "Accept: application/json" -H "Content-Type: application/json" --data '{"page": 0, "sortColumn": "default", "sortDirection": "ascending", "searchQuery": "(deprecated:any)"}' -X POST | jq . | head -n 50 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23907    0 23806  100   101  40145    170 --:--:-- --:--:-- --:--:-- 40315
{
  "numberOfResults": 16711,
  "pageContents": [
    {
      "description": "Haskell package for easy integration with the 2captcha API.",
      "downloads": 1,
      "lastUpload": "2021-09-09T05:13:30.343509948Z",
      "maintainers": [
        {
          "display": "qwbarch",
          "uri": "/user/qwbarch"
        }
      ],
      "name": {
        "display": "2captcha",
        "uri": "/package/2captcha"
      },
      "tags": [
        {
          "display": "deprecated",
          "uri": "/packages/tag/deprecated"
        },
        {
          "display": "library",
          "uri": "/packages/tag/library"
        },
        {
          "display": "mit",
          "uri": "/packages/tag/mit"
        },
        {
          "display": "network",
          "uri": "/packages/tag/network"
        }
      ],
      "votes": 1.5
    },
    {
      "description": "Examples of 3D graphics programming with OpenGL",
      "downloads": 8,
      "lastUpload": "2016-07-22T14:26:23.038905Z",
      "maintainers": [
        {
          "display": "WolfgangJeltsch",
          "uri": "/user/WolfgangJeltsch"
        }
      ],
      "name": {
        "display": "3d-graphics-examples",
        "uri": "/package/3d-graphics-examples"

You can also use the same API for incremental listing by filtering on lastUpload in the search query.
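
As a rough sketch of how a lister could consume this endpoint, the Python below builds the same POST body as the curl command and turns one response page into (origin URL, last update) pairs. The helper names (`build_search_payload`, `fetch_page`, `parse_last_upload`, `extract_origins`) are hypothetical, and joining the base URL with `name.uri` to form the origin URL is an assumption based on the sample output above; this is not the code from the diff.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone

SEARCH_URL = "https://hackage.haskell.org/packages/search"


def build_search_payload(page, search_query="(deprecated:any)"):
    """Build the JSON body used in the curl example above."""
    return {
        "page": page,
        "sortColumn": "default",
        "sortDirection": "ascending",
        "searchQuery": search_query,
    }


def fetch_page(page, url=SEARCH_URL):
    """POST the search payload and return the decoded JSON page (network call)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_search_payload(page)).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_last_upload(value):
    """Parse a lastUpload timestamp, truncating the fraction to microseconds."""
    match = re.fullmatch(
        r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", value
    )
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    base, fraction = match.groups()
    # Some entries carry nanoseconds; datetime only stores microseconds.
    microsecond = int((fraction or "0")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=microsecond, tzinfo=timezone.utc)


def extract_origins(data, base_url="https://hackage.haskell.org"):
    """Yield (origin_url, last_update) for each entry of one response page."""
    for entry in data["pageContents"]:
        # Assumption: the origin URL is the base URL plus the package URI.
        yield base_url + entry["name"]["uri"], parse_last_upload(entry["lastUpload"])
```

Only `fetch_page` touches the network; the other helpers can be exercised against saved JSON pages.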

Thanks, I missed this one

I now understand why I did not try this endpoint in the first place. It is not documented as a POST (and it does not seem natural to use POST to fetch something that usually uses GET with query params).
I've implemented it this way, but now I have to handle pagination. The endpoint returns only 50 entries per page and I did not find a way to change that (using pageSize has no effect).

I now understand why I did not try this endpoint in the first place. It is not documented as a POST (and it does not seem natural to use POST to fetch something that usually uses GET with query params).

It is documented: https://hackage.haskell.org/api#search/browse%20backend

But since Hackage's documentation is clearly spotty, I used Firefox's debugger to see which API the GUI uses; that's how I found this endpoint.

Change the HTTP API endpoint used for search in order to retrieve a last_update

Switch from GET to POST to get results.
The lister is no longer single-page; each page lists 50 origins.

Build is green

Patch application report for D8338 (id=30239)

Rebasing onto 7638f2028b...

First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 7638f2028b...

Already up to date.
Changes applied before test
commit 2eb481a71b73ae93271da0a8f7bd8b4246d2c295
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/637/ for more details.

The lister runs fine in Docker

swh-lister_1                        | [2022-09-01 16:34:17,233: INFO/ForkPoolWorker-1] Task swh.lister.hackage.tasks.HackageListerTask[5e1d7981-0aca-4ee2-a8c7-1520ef28d959] succeeded in 97.8861533490126s: {'pages': 334, 'origins': 16700}
swh/lister/hackage/lister.py
91–108
  1. avoids dropping the last page because // rounds down (using divmod + remainder) (please add a test for that)
  2. avoids fetching the first page twice (by yielding before the loop and starting iteration at page=1)
  3. uses fewer variables
  4. more pythonic (for loop instead of while, renames *_qty to nb_*)
  5. removes unnecessary type casts
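
The points above can be sketched as follows, assuming the response shape shown earlier in this thread (`numberOfResults`, `pageContents`) and the 50-entry page size reported above. `nb_pages_for`, `iter_pages` and the `fetch_page` callable are hypothetical names used for illustration, not the code under review.

```python
PAGE_SIZE = 50  # page size observed in this thread; pageSize had no effect


def nb_pages_for(nb_results, page_size=PAGE_SIZE):
    """Number of pages needed, counting a final partial page."""
    nb_pages, remainder = divmod(nb_results, page_size)
    if remainder:
        # Plain floor division would drop the last, partially filled page.
        nb_pages += 1
    return nb_pages


def iter_pages(fetch_page, page_size=PAGE_SIZE):
    """Yield the entries of every page, fetching the first page only once.

    fetch_page is any callable taking a page index and returning the decoded
    JSON response for that page.
    """
    first = fetch_page(0)
    yield first["pageContents"]
    # Page 0 was already yielded, so iteration starts at page 1.
    for page in range(1, nb_pages_for(first["numberOfResults"], page_size)):
        yield fetch_page(page)["pageContents"]
```

With 16714 results this gives 335 pages, whereas floor division alone would give 334 and skip the final 14 entries.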

Better implementation of pagination

Build is green

Patch application report for D8338 (id=30246)

Rebasing onto 7638f2028b...

First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 7638f2028b...

Already up to date.
Changes applied before test
commit 201bb0c8249abdef51e89c99708ba4df18de50eb
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/638/ for more details.

franckbret added inline comments.
swh/lister/hackage/lister.py
91–108

Ok, thanks, better now

Testing Docker with that last commit

Task swh.lister.hackage.tasks.HackageListerTask[3c406dfb-7671-4413-8dda-13e27fd8a175] succeeded in 97.21437864698237s: {'pages': 335, 'origins': 16714}

@ardumont @vlorentz If you don't have other comments or suggestions I think we can merge this one

Build is green

Patch application report for D8338 (id=30303)

Rebasing onto 44560c2383...

First, rewinding head to replay your work on top of it...
Applying: Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Using index info to reconstruct a base tree...
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 Hackage: List origins from hackage.haskell.org, The Haskell Package Repository

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 44560c2383...

Already up to date.
Changes applied before test
commit 67d8cee5d2506695e396946ae90367ed3d66dc6f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/648/ for more details.

You didn't add the test I asked for in D8338#inline-59471

Is the last diff with 2 new tests ok for you?

Build is green

Patch application report for D8338 (id=30350)

Rebasing onto 44560c2383...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-653-D8338.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/653/ for more details.

@franckbret fyi you have updated the wrong diff (pubdev instead of haskell)

Oh! thx didn't see. Rebasing again

Build is green

Patch application report for D8338 (id=30357)

Rebasing onto c819cc237d...

Current branch diff-target is up to date.
Changes applied before test
commit 9ee0432b0992e1955ac5987672d9e02fcdbcd23b
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/654/ for more details.

Make use of http_retry instead of throttling_retry decorator after D8519

Build is green

Patch application report for D8338 (id=30753)

Rebasing onto d5c30a3ce3...

Current branch diff-target is up to date.
Changes applied before test
commit fecadff078b7439baf2a897169da65ba8f0c8d7f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/693/ for more details.

anlambert added inline comments.
swh/lister/hackage/lister.py
25–28

You can remove the user agent setting code, it is now handled in base Lister class.

54

You can remove that line; the session is now created in the base Lister class.

65–80

You can remove that method, I added an http_request method in base lister class to deduplicate some code.

97–99

Use this instead:

data = self.http_request(
    url=self.PACKAGE_NAMES_URL_PATTERN.format(base_url=self.url),
    method="POST",
    json=params,
).json()
109–112

same as my latest comment above

swh/lister/hackage/tests/test_lister.py
1–125

Nitpicks about the test implementation: it is better to use the requests_mock fixture, plus a couple of other improvements; see the diff below:

diff --git a/swh/lister/hackage/tests/test_lister.py b/swh/lister/hackage/tests/test_lister.py
index eada037..93bb6f4 100644
--- a/swh/lister/hackage/tests/test_lister.py
+++ b/swh/lister/hackage/tests/test_lister.py
@@ -3,21 +3,16 @@
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
+import functools
 import json
-from os import path
 from pathlib import Path
 from urllib.parse import unquote, urlparse
 
-import pytest
-import requests_mock
-
 from swh.lister.hackage.lister import HackageLister
 
 
-def json_callback(request, context):
+def json_callback(request, context, datadir):
     """Callback for requests_mock that load a json file regarding a page number"""
-    here = path.abspath(path.dirname(__file__))
-    datadir = Path(here, "data")
     page = request.json()["page"]
 
     unquoted_url = unquote(request.url)
@@ -31,19 +26,13 @@ def json_callback(request, context):
     return json.loads(Path(datadir, dirname, f"{filename}_{page}").read_text())
 
 
-@pytest.fixture
-def mock_post():
-    """Mock `https://hackage.haskell.org/packages/search`"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://hackage.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
+def test_hackage_lister(swh_scheduler, requests_mock, datadir):
 
-def test_hackage_lister(swh_scheduler, mock_post, datadir):
+    requests_mock.post(
+        url="https://hackage.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
 
     expected_origins = []
 
@@ -63,7 +52,7 @@ def test_hackage_lister(swh_scheduler, mock_post, datadir):
     res = lister.run()
 
     assert res.pages == 3
-    assert res.origins == 50 + 50 + 50
+    assert res.origins == res.pages * 50
 
     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
 
@@ -84,19 +73,12 @@ def test_hackage_lister(swh_scheduler, mock_post, datadir):
     }
 
 
-@pytest.fixture
-def mock_post_49():
-    """Mock 49 entries"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://fake49.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
-
-def test_hackage_lister_pagination_49(swh_scheduler, mock_post_49, datadir):
+def test_hackage_lister_pagination_49(swh_scheduler, requests_mock, datadir):
+    requests_mock.post(
+        url="https://fake49.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
     lister = HackageLister(scheduler=swh_scheduler, url="https://fake49.haskell.org/")
     pages = list(lister.get_pages())
     # there should be 1 page with 49 entries
@@ -104,19 +86,12 @@ def test_hackage_lister_pagination_49(swh_scheduler, mock_post_49, datadir):
     assert len(pages[0]) == 49
 
 
-@pytest.fixture
-def mock_post_51():
-    """Mock 51 entries"""
-    with requests_mock.Mocker() as requests_mocker:
-        requests_mocker.post(
-            url="https://fake51.haskell.org/packages/search",
-            status_code=200,
-            json=json_callback,
-        )
-        yield
-
-
-def test_hackage_lister_pagination_51(swh_scheduler, mock_post_51, datadir):
+def test_hackage_lister_pagination_51(swh_scheduler, requests_mock, datadir):
+    requests_mock.post(
+        url="https://fake51.haskell.org/packages/search",
+        status_code=200,
+        json=functools.partial(json_callback, datadir=datadir),
+    )
     lister = HackageLister(scheduler=swh_scheduler, url="https://fake51.haskell.org/")
     pages = list(lister.get_pages())
     # there should be 2 pages with 50 + 1 entries
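
The core of the change above is binding `datadir` to the callback with `functools.partial`, so the callback keeps the two-argument `(request, context)` shape that requests_mock calls it with. Below is a minimal stand-alone sketch of that pattern. The `FakeRequest` class, the `make_callback` helper and the `search_<page>` file naming are hypothetical stand-ins for illustration, not part of the diff.

```python
import functools
import json
from pathlib import Path


class FakeRequest:
    """Hypothetical stand-in for the request object passed to the callback."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def json_callback(request, context, datadir):
    """Load the canned JSON response matching the requested page number."""
    page = request.json()["page"]
    # Hypothetical naming scheme: one file per page inside datadir.
    return json.loads(Path(datadir, f"search_{page}").read_text())


def make_callback(datadir):
    """Bind datadir so the result only needs (request, context)."""
    return functools.partial(json_callback, datadir=datadir)
```

Passing `datadir` this way removes the need for the callback to work out the data directory from `__file__`, which is what the removed lines in the diff did.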
This revision now requires changes to proceed. Sep 26 2022, 2:50 PM
franckbret marked 6 inline comments as done.

Improvements after review

Make use of http_retry instead of throttling_retry decorator after D8519
Rewrite test implementation
Adapt docker documentation usage example

swh/lister/hackage/tests/test_lister.py
1–125

Nice, thanks

Build is green

Patch application report for D8338 (id=30782)

Rebasing onto fd1a4244a0...

Current branch diff-target is up to date.
Changes applied before test
commit 31188925b579b29bb25ab05c901e95824d174599
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/698/ for more details.

This revision is now accepted and ready to land. Sep 27 2022, 10:47 AM
This revision was landed with ongoing or failed builds. Sep 27 2022, 2:25 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D8338 (id=30823)

Rebasing onto 8ff418fbc2...

Current branch diff-target is up to date.
Changes applied before test
commit 6696a8424ad19feb137429ffb66ba08cc77a2e34
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Aug 29 18:53:31 2022 +0200

    Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
    
    Use http api point to get package names and build origin urls.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/707/ for more details.