Page MenuHomeSoftware Heritage

Hackage: Loads Hackage Listed origins
ClosedPublic

Authored by franckbret on Sep 2 2022, 9:10 AM.

Details

Summary

The loader make an http api call to retrieve package related versions.
It then download tar.gz archive for each version.

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/loader/package/hackage/loader.py
175–176

Last commit goes this way and the last commented test pass. Even if the cabal file is not parsed correctly, we have acceptable data to load the archive.
Regarding cabal file, if I understand correctly it looks like its the base data authority. The hackage api has an endpoint that returns cabal file and the content is exactly the same as the one in tar.gz. So if there is human errors the api won't help.
We should have less error now!

artifacts are now list

various code simplification after review

its a follow up of lister evolution D8588

Build is green

Patch application report for D8379 (id=31009)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit d355e045f520f635a092e848f5e41ea759bc219b
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/928/ for more details.

@franckbret , you updated the wrong diff (conda instead of hackage)

restore previous state after I've arc diff to bad differential number

Build is green

Patch application report for D8379 (id=31012)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 959a1c684f29d032b69fc9a1839909ae9961ff3b
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/930/ for more details.

@franckbret , you updated the wrong diff (conda instead of hackage)

Yeo sorry for this should I remove/delete comment mentionning conda from here?

LGTM, just a couple of nitpicks to handle and I will accept it.

swh/loader/package/hackage/loader.py
185–200

Nitpicks for shorter code and empty description handling.

author_str = intrinsic_metdata.get("author") or p_info.author
author = Person.from_fullname(author_str.encode())

description = intrinsic_metadata.get("synopsis", intrinsic_metadata.get("description"))

message = (
    f"Synthetic release for Haskell source package {p_info.name} "
    f"version {p_info.version}"
)

if description:
    message += f"\n\n{description}"
This revision now requires changes to proceed.Sep 30 2022, 4:37 PM
vlorentz added inline comments.
swh/loader/package/hackage/loader.py
53–66

I don't understand why this loop is designed this way.

First, it assumes field names are either lower-case or capitalized, but "Case is not significant in field names, but is significant in field values." according to https://cabal.readthedocs.io/en/3.4/cabal-package.html#package-descriptions

Also I find the logic of checking keys one by one with both forms, and using them to split is confusing. I think you can simplify the loop to something like this:

keys = ["name", "version", "synopsis", "author", "description"]
for line in lines:
    (key, value) = line.split(":", 1)
    key = key.lower()
    if key in keys:
        result[key] = value.strip()
        keys.remove(key)

Finally, it does not support multi-line values, which seem to be common for descriptions (and are valid as synopsis too, I assume)

franckbret marked 2 inline comments as done.

Manage empty description and simplify .cabal parsing

Build is green

Patch application report for D8379 (id=31017)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 86002252c626ca3bdf3586b28f73b8c47738a8a5
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/933/ for more details.

swh/loader/package/hackage/loader.py
53–66

That code still does not parse multiline values, for instance we would miss full description by parsing that cabal file.

You should add tests for that function to cover the edge cases.

Add some tests to check that extract_intrinsic_metadata works as expected

test fallback for author
test key and value not on the same line
test multiline value

Build is green

Patch application report for D8379 (id=31081)

Rebasing onto c631349aea...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Loads Hackage Listed origins
Using index info to reconstruct a base tree...
M	.pre-commit-config.yaml
Falling back to patching base and 3-way merge...
Auto-merging .pre-commit-config.yaml
CONFLICT (content): Merge conflict in .pre-commit-config.yaml
Patch failed at 0001 Hackage: Loads Hackage Listed origins

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto c631349aea...

Already up to date.
Changes applied before test
commit 7c9e08f48eed2d5c51f31f9935b5801d886d2be7
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/952/ for more details.

franckbret added inline comments.
swh/loader/package/hackage/loader.py
53–66

I don't understand why this loop is designed this way.

First, it assumes field names are either lower-case or capitalized, but "Case is not significant in field names, but is significant in field values." according to https://cabal.readthedocs.io/en/3.4/cabal-package.html#package-descriptions

Also I find the logic of checking keys one by one with both forms, and using them to split is confusing. I think you can simplify the loop to something like this:

keys = ["name", "version", "synopsis", "author", "description"]
for line in lines:
    (key, value) = line.split(":", 1)
    key = key.lower()
    if key in keys:
        result[key] = value.strip()
        keys.remove(key)

ok, simpler. I added a check against ":" if not it failed on comment and multilines

Finally, it does not support multi-line values, which seem to be common for descriptions (and are valid as synopsis too, I assume)

Last commit fix the issues. By the way please note that "synopsis" is the field used on the forge to display a title and description is more like a README content.
We add description field when we discover that in some old versions synopsis were missing but description was present.

The current implementation use synopsis and fallback to description if any, and description may be multiline. Can you confirm its ok this way or do you prefer we use description as a default?

Also description is the only multiline field that we need to manage.
At this point I did not take to time to find a .cabal file that have a multiline description and and empty synopsis, so the last test is quite useless.

185–200

Do agree for description part, but for author the problem is that p_info.author is of type Person, so I let it like this.
I'm not sure if \n is mandatory after description, in previous loaders its always there, but its missing from your code, should I remove it?

swh/loader/package/hackage/loader.py
185–200

Ah right, use this instead then:

author_str = intrinsic_metdata.get("author")
author = Person.from_fullname(author_str.encode()) if author_str else p_info.author

description = intrinsic_metadata.get("synopsis", intrinsic_metadata.get("description"))

message = (
    f"Synthetic release for Haskell source package {p_info.name} "
    f"version {p_info.version}\n"
)

if description:
    message += f"\n{description}"

description should not be special-cased, multi-line values are allowed in plenty of fields. The spec says "To continue a field value, indent the next line relative to the field name." without specifically mentioning description.

Plus, it doesn't work if there is a : character in the description (other than the first line):

>>> (path / "foo.cabal").open("w").write("description:\n  foo\n  bar: baz"); extract_intrinsic_metadata(path, "foo")
{'description': 'foo'}

Just in case you are reading these comments before the chat:

10:25:24 <+vlorentz> franckbret: I just realized something: the "description" and "synopsis" in .cabal files is a description of the package, right?
10:25:28 <+vlorentz> not the release
10:25:35 <+vlorentz> so it shouldn't go in the release message
10:28:33 <+vlorentz> we should also remove descriptions from other loaders...

Remove release description from message

Remove anything related to 'description' that was extrated from intrinsic metadata
Update docstring and documentation, remove some test data fixtures
Uses a simple regex to parse .cabal file

Build is green

Patch application report for D8379 (id=31124)

Could not rebase; Attempt merge onto b26b98810d...

Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                            |   2 +-
 docs/package-loader-specifications.rst             |  18 +
 setup.py                                           |   2 +
 swh/loader/package/conda/__init__.py               |  17 +
 swh/loader/package/conda/loader.py                 | 168 +++++
 swh/loader/package/conda/tasks.py                  |  14 +
 swh/loader/package/conda/tests/__init__.py         |   0
 swh/loader/package/conda/tests/data/fake_conda.sh  | 231 ++++++
 ...inux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 | Bin 0 -> 1742 bytes
 ...inux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 | Bin 0 -> 1286 bytes
 swh/loader/package/conda/tests/test_conda.py       | 133 ++++
 swh/loader/package/conda/tests/test_tasks.py       |  24 +
 swh/loader/package/hackage/__init__.py             |  17 +
 swh/loader/package/hackage/loader.py               | 194 +++++
 swh/loader/package/hackage/tasks.py                |  14 +
 swh/loader/package/hackage/tests/__init__.py       |   0
 .../package/hackage/tests/data/fake_hackage.sh     | 796 +++++++++++++++++++++
 .../data/https_hackage.haskell.org/package_Hs2lib  |   2 +
 .../package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz       | Bin 0 -> 2236 bytes
 .../package_Hs2lib-0.6.3_revisions                 |   8 +
 .../data/https_hackage.haskell.org/package_aeson   |   2 +
 .../package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz     | Bin 0 -> 2231 bytes
 .../package_aeson-2.1.0.0_revisions                |  18 +
 .../data/https_hackage.haskell.org/package_colors  |   2 +
 .../package_colors-0.1_colors-0.1.tar.gz           | Bin 0 -> 560 bytes
 .../package_colors-0.1_revisions                   |   8 +
 .../package_colors-0.3.0.2_colors-0.3.0.2.tar.gz   | Bin 0 -> 531 bytes
 .../package_colors-0.3.0.2_revisions               |  18 +
 .../https_hackage.haskell.org/package_haskell2010  |   2 +
 ..._haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz | Bin 0 -> 930 bytes
 .../package_haskell2010-1.0.0.0_revisions          |   8 +
 .../https_hackage.haskell.org/package_numeric-qq   |   2 +
 ...ackage_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz | Bin 0 -> 1026 bytes
 .../package_numeric-qq-0.1.0_revisions             |   8 +
 swh/loader/package/hackage/tests/test_hackage.py   | 287 ++++++++
 swh/loader/package/hackage/tests/test_tasks.py     |  23 +
 36 files changed, 2017 insertions(+), 1 deletion(-)
 create mode 100644 swh/loader/package/conda/__init__.py
 create mode 100644 swh/loader/package/conda/loader.py
 create mode 100644 swh/loader/package/conda/tasks.py
 create mode 100644 swh/loader/package/conda/tests/__init__.py
 create mode 100644 swh/loader/package/conda/tests/data/fake_conda.sh
 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2
 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2
 create mode 100644 swh/loader/package/conda/tests/test_conda.py
 create mode 100644 swh/loader/package/conda/tests/test_tasks.py
 create mode 100644 swh/loader/package/hackage/__init__.py
 create mode 100644 swh/loader/package/hackage/loader.py
 create mode 100644 swh/loader/package/hackage/tasks.py
 create mode 100644 swh/loader/package/hackage/tests/__init__.py
 create mode 100644 swh/loader/package/hackage/tests/data/fake_hackage.sh
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_colors-0.1.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_colors-0.3.0.2.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/test_hackage.py
 create mode 100644 swh/loader/package/hackage/tests/test_tasks.py
Changes applied before test
commit 34fe949f822fa35308ff0a99f032316110ad586e
Merge: b26b988 c3623ea
Author: Jenkins user <jenkins@localhost>
Date:   Wed Oct 5 05:50:40 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c3623eac528781679645e0c1cd47c3ce96893e67
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

commit e418aab490115d77bb91d58457a0b013b35d63a5
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/957/ for more details.

shorter code for author fallback + use packaging.version parse

Build is green

Patch application report for D8379 (id=31125)

Could not rebase; Attempt merge onto b26b98810d...

Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                            |   2 +-
 docs/package-loader-specifications.rst             |  18 +
 setup.py                                           |   2 +
 swh/loader/package/conda/__init__.py               |  17 +
 swh/loader/package/conda/loader.py                 | 168 +++++
 swh/loader/package/conda/tasks.py                  |  14 +
 swh/loader/package/conda/tests/__init__.py         |   0
 swh/loader/package/conda/tests/data/fake_conda.sh  | 231 ++++++
 ...inux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 | Bin 0 -> 1742 bytes
 ...inux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 | Bin 0 -> 1286 bytes
 swh/loader/package/conda/tests/test_conda.py       | 133 ++++
 swh/loader/package/conda/tests/test_tasks.py       |  24 +
 swh/loader/package/hackage/__init__.py             |  17 +
 swh/loader/package/hackage/loader.py               | 194 +++++
 swh/loader/package/hackage/tasks.py                |  14 +
 swh/loader/package/hackage/tests/__init__.py       |   0
 .../package/hackage/tests/data/fake_hackage.sh     | 796 +++++++++++++++++++++
 .../data/https_hackage.haskell.org/package_Hs2lib  |   2 +
 .../package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz       | Bin 0 -> 2236 bytes
 .../package_Hs2lib-0.6.3_revisions                 |   8 +
 .../data/https_hackage.haskell.org/package_aeson   |   2 +
 .../package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz     | Bin 0 -> 2231 bytes
 .../package_aeson-2.1.0.0_revisions                |  18 +
 .../data/https_hackage.haskell.org/package_colors  |   2 +
 .../package_colors-0.1_colors-0.1.tar.gz           | Bin 0 -> 560 bytes
 .../package_colors-0.1_revisions                   |   8 +
 .../package_colors-0.3.0.2_colors-0.3.0.2.tar.gz   | Bin 0 -> 531 bytes
 .../package_colors-0.3.0.2_revisions               |  18 +
 .../https_hackage.haskell.org/package_haskell2010  |   2 +
 ..._haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz | Bin 0 -> 930 bytes
 .../package_haskell2010-1.0.0.0_revisions          |   8 +
 .../https_hackage.haskell.org/package_numeric-qq   |   2 +
 ...ackage_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz | Bin 0 -> 1026 bytes
 .../package_numeric-qq-0.1.0_revisions             |   8 +
 swh/loader/package/hackage/tests/test_hackage.py   | 287 ++++++++
 swh/loader/package/hackage/tests/test_tasks.py     |  23 +
 36 files changed, 2017 insertions(+), 1 deletion(-)
 create mode 100644 swh/loader/package/conda/__init__.py
 create mode 100644 swh/loader/package/conda/loader.py
 create mode 100644 swh/loader/package/conda/tasks.py
 create mode 100644 swh/loader/package/conda/tests/__init__.py
 create mode 100644 swh/loader/package/conda/tests/data/fake_conda.sh
 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2
 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2
 create mode 100644 swh/loader/package/conda/tests/test_conda.py
 create mode 100644 swh/loader/package/conda/tests/test_tasks.py
 create mode 100644 swh/loader/package/hackage/__init__.py
 create mode 100644 swh/loader/package/hackage/loader.py
 create mode 100644 swh/loader/package/hackage/tasks.py
 create mode 100644 swh/loader/package/hackage/tests/__init__.py
 create mode 100644 swh/loader/package/hackage/tests/data/fake_hackage.sh
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_colors-0.1.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_colors-0.3.0.2.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz
 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_revisions
 create mode 100644 swh/loader/package/hackage/tests/test_hackage.py
 create mode 100644 swh/loader/package/hackage/tests/test_tasks.py
Changes applied before test
commit 2ff73cd030d396eb7a3ad89d34316e330c1d1689
Merge: b26b988 02e1472
Author: Jenkins user <jenkins@localhost>
Date:   Wed Oct 5 06:05:02 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 02e147207e9bb1200e663dce11de4efdec9e0314
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

commit e418aab490115d77bb91d58457a0b013b35d63a5
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/958/ for more details.

vlorentz added inline comments.
docs/package-loader-specifications.rst
121–123

loader specification documentation, update Hackage section

franckbret added inline comments.
swh/loader/package/hackage/loader.py
185–200

description has been removed

Build is green

Patch application report for D8379 (id=31314)

Rebasing onto a13e3e6f35...

Current branch diff-target is up to date.
Changes applied before test
commit 04da51dc6b8c6e156d8cbed7d78bcca2081ed6bc
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/997/ for more details.

@vlorentz @anlambert We can get archive checksums by making an HEAD call to archive download url :

franck@debian-franck:/tmp$ http HEAD http://hackage.haskell.org/package/binary-bits-0.4/binary-bits-0.4.tar.gz
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3379
Cache-Control: public, no-transform, max-age=2592000
Connection: keep-alive
Content-Length: 11534
Content-MD5: nF+DoGYGh/0oSokoIFwmtw==
Content-Type: application/x-gzip
Date: Thu, 13 Oct 2022 07:04:47 GMT
ETag: "9c5f83a0660687fd284a8928205c26b7"
Last-modified: Fri, 09 Jan 2015 13:41:41 GMT
Server: nginx/1.18.0 (Ubuntu)
Via: 1.1 varnish
X-Cache: HIT
X-Cache-Hits: 1
X-Served-By: cache-cdg20749-CDG
X-Timer: S1665644688.529240,VS0,VE1

We can extract Content-Length and Content-MD5. Will do so in another patch to manage checksums and extrinsic raw metadata.
Regarding raw extrinsic metadata, we can store original-artifacts-json and another one for revisions ( see swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions)?
What do you think?

@vlorentz @anlambert We can get archive checksums by making an HEAD call to archive download url :

franck@debian-franck:/tmp$ http HEAD http://hackage.haskell.org/package/binary-bits-0.4/binary-bits-0.4.tar.gz
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3379
Cache-Control: public, no-transform, max-age=2592000
Connection: keep-alive
Content-Length: 11534
Content-MD5: nF+DoGYGh/0oSokoIFwmtw==
Content-Type: application/x-gzip
Date: Thu, 13 Oct 2022 07:04:47 GMT
ETag: "9c5f83a0660687fd284a8928205c26b7"
Last-modified: Fri, 09 Jan 2015 13:41:41 GMT
Server: nginx/1.18.0 (Ubuntu)
Via: 1.1 varnish
X-Cache: HIT
X-Cache-Hits: 1
X-Served-By: cache-cdg20749-CDG
X-Timer: S1665644688.529240,VS0,VE1

We can extract Content-Length and Content-MD5. Will do so in another patch to manage checksums and extrinsic raw metadata.

Nice!

Regarding raw extrinsic metadata, we can store original-artifacts-json and another one for revisions ( see swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions)?
What do you think?

You lost me. original-artifacts-json should already be stored for every archive the loader loads. Am I missing something?

docs/package-loader-specifications.rst
121–123

Oh, I just noticed that it seems to be in the wrong direction. Shouldn't it prioritize extrinsic metadata over intrinsic?

Get md5 checksums from HEAD request on archive url

Build is green

Patch application report for D8379 (id=31506)

Rebasing onto 6d8e2abac5...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 3cf53eb2e18af6761054eae9fd43c4f3ef7adf60
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1001/ for more details.

docs/package-loader-specifications.rst
121–123

I think its ok this way. The most accurate data is intrinsic, but sometime author is missing. In this case it falls back to get the username who published new revisions of the package (we use the revisions endpoint to get last update). If getting that username as a fallback does not make sense, let's remove it.

Apart the md5 decoding issue, loader looks pretty stable when testing it in docker.

I will accept the diff once the inline comments handled.

swh/loader/package/hackage/loader.py
176

Got the following warnings when testing the loader:

docker-swh-loader-1  | [2022-10-19 15:43:19,789: DEBUG/ForkPoolWorker-32] Fetching https://hackage.haskell.org/package/hoodle
docker-swh-loader-1  | [2022-10-19 15:43:19,928: DEBUG/ForkPoolWorker-32] Fetching https://hackage.haskell.org/package/hoodle-0.1/revisions/
docker-swh-loader-1  | [2022-10-19 15:43:20,211: WARNING/ForkPoolWorker-32] Can not decode md5 checksum b'y\x0b\x141S\x14\x93%\xa8\xe5s\x87!W_\xad' for 'https://hackage.haskell.org/package/hoodle-0.1/hoodle-0.1.tar.gz'

You need to use this instead to get the hexadecimal representation of the md5 hash:

from swh.model.hashutil import hash_to_hex

checksums = {"md5": hash_to_hex(md5)}
swh/loader/package/hackage/tests/test_tasks.py
1–23

Please replace that test by the following:

# Copyright (C) 2022  The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import uuid

import pytest

from swh.scheduler.model import ListedOrigin, Lister

NAMESPACE = "swh.loader.package.hackage"


@pytest.fixture
def hackage_lister():
    return Lister(name="hackage", instance_name="example", id=uuid.uuid4())


@pytest.fixture
def hackage_listed_origin(hackage_lister):
    return ListedOrigin(
        lister_id=hackage_lister.id,
        url="https://hackage.haskell.org/package/package_name",
        visit_type="hackage",
    )


def test_hackage_loader_task_for_listed_origin(
    loading_task_creation_for_listed_origin_test,
    hackage_lister,
    hackage_listed_origin,
):

    loading_task_creation_for_listed_origin_test(
        loader_class_name=f"{NAMESPACE}.loader.HackageLoader",
        task_function_name=f"{NAMESPACE}.tasks.LoadHackage",
        lister=hackage_lister,
        listed_origin=hackage_listed_origin,
    )
This revision now requires changes to proceed.Oct 19 2022, 5:51 PM
franckbret marked 2 inline comments as done.

fix issue with hexadecimal representation of the checksum md5 hash

swh/loader/package/hackage/loader.py
176

Thanks I was not sure about how to deal with that

Build is green

Patch application report for D8379 (id=31555)

Rebasing onto 6d8e2abac5...

Current branch diff-target is up to date.
Changes applied before test
commit 496a218e03663b603f14ca9a8b1f7cfbec08cc02
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1005/ for more details.

docs/package-loader-specifications.rst
121–123

The most accurate data is intrinsic

Is it?

Usually, extrinsic metadata is provided/validated by the forge, while intrinsic metadata is provided by any package author. So the former is more reliable.

Other package loaders prefer extrinsic metadata to build release fields for this reason, so we should also do it here for consistency, unless there is a good reason to special-case Hackage.

docs/package-loader-specifications.rst
121–123

Based on what I see, the extrinsic metadata for a hackage package version contains the username of the hackage account that uploaded the package while intrinsic metadata can provide the fullname of the package author.

See for instance the Dish package, returned extrinsic metadata are:

$ curl -H "Accept: application/json" https://hackage.haskell.org/package/Dish-0.0.0.6/revisions/ | jq
[
  {
    "number": 0,
    "time": "2017-11-29T13:15:29Z",
    "user": "zcourts"
  }
]

while the associated cabal file contains the full author name.

So I think @franckbret approach is the right one here.

docs/package-loader-specifications.rst
121–123

yep the hackage extrinsic metadata are the same than intrinsic one. Everything in the hackage server is related to parse .cabal file.

We could have use another api endpoint call for each version of a package as extrinsic metadata but its useless as the data is the same. By the way the endpoint that returns the most information about the package is the .cabal file itself.

http -b --json https://hackage.haskell.org/package/Dish-0.0.0.6
{
    "author": "Courtney Robinson",
    "copyright": "Courtney Robinson",
    "description": "A group of Hash related utilities (currently wraps MurmurHash3 C implementation)",
    "homepage": "http://github.com/zcourts/Dish",
    "license": "BSD-3-Clause",
    "metadata_revision": 0,
    "synopsis": "Hash modules (currently Murmur3)"
}

In the hackage server everything is parsing the .cabal file. If the cabal file is not valid, one can not upload a new package or revision.

See:

https://github.com/haskell/hackage-server/blob/master/src/Distribution/Server/Packages/Types.hs

https://github.com/haskell/hackage-server/blob/b4ea1c73ba85f24985259fd1df4664747ceb0159/src/Distribution/Server/Features/Upload.hs#L295

https://github.com/haskell/hackage-server/blob/c9952843dcdb02df7ef965c0368923f37af2f326/src/Distribution/Server/Features/PackageInfoJSON.hs#L132

docs/package-loader-specifications.rst
121–123

alright then

@anlambert hi, can we merge this one?

@franckbret , sure looks good to me !

swh/loader/package/hackage/loader.py
162–163

could be simplified to:

last_modified  = max(item["time"] for item in revisions)
This revision is now accepted and ready to land.Mon, Nov 21, 11:40 AM

Simplify some code and rebase

This revision was landed with ongoing or failed builds.Mon, Nov 21, 3:07 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D8379 (id=31949)

Rebasing onto a196c85d5a...

Current branch diff-target is up to date.
Changes applied before test
commit b9bd1287e8ad43f0a1d16d20a26976f30c234230
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Sep 2 09:06:15 2022 +0200

    Hackage: Loads Hackage Listed origins
    
    The loader make an http api call to retrieve package related versions.
    It then download tar.gz archive for each version.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1033/ for more details.