The loader make an http api call to retrieve package related versions.
It then download tar.gz archive for each version.
Details
- Reviewers
anlambert vlorentz - Group Reviewers
Reviewers - Maniphest Tasks
- T4598: Create a Hackage Loader
- Commits
- rDLDBASEb9bd1287e8ad: Hackage: Loads Hackage Listed origins
Diff Detail
- Repository
- rDLDBASE Generic VCS/Package Loader
- Branch
- hackage
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 32452 Build 50828: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 50827: arc lint + arc unit
Event Timeline
swh/loader/package/hackage/loader.py | ||
---|---|---|
175–176 | Last commit goes this way and the last commented test pass. Even if the cabal file is not parsed correctly, we have acceptable data to load the archive. |
artifacts are now list
various code simplification after review
its a follow up of lister evolution D8588
Build is green
Patch application report for D8379 (id=31009)
Rebasing onto f774aba59e...
First, rewinding head to replay your work on top of it... Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit d355e045f520f635a092e848f5e41ea759bc219b Author: Franck Bret <franck.bret@octobus.net> Date: Wed Sep 28 16:23:45 2022 +0200 Conda: Anaconda packages archive loader For each origin it takes advantage of 'artifacts' data send through 'extra_loader_arguments' of the conda lister, providing versions, archive url, checksum, etc. Author and description are extracted from intrinsic metadata. Related T4579
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/928/ for more details.
Build is green
Patch application report for D8379 (id=31012)
Rebasing onto f774aba59e...
First, rewinding head to replay your work on top of it... Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 959a1c684f29d032b69fc9a1839909ae9961ff3b Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/930/ for more details.
LGTM, just a couple of nitpicks to handle and I will accept it.
swh/loader/package/hackage/loader.py | ||
---|---|---|
185–200 | Nitpicks for shorter code and empty description handling. author_str = intrinsic_metdata.get("author") or p_info.author author = Person.from_fullname(author_str.encode()) description = intrinsic_metadata.get("synopsis", intrinsic_metadata.get("description")) message = ( f"Synthetic release for Haskell source package {p_info.name} " f"version {p_info.version}" ) if description: message += f"\n\n{description}" |
swh/loader/package/hackage/loader.py | ||
---|---|---|
53–66 | I don't understand why this loop is designed this way. First, it assumes field names are either lower-case or capitalized, but "Case is not significant in field names, but is significant in field values." according to https://cabal.readthedocs.io/en/3.4/cabal-package.html#package-descriptions Also I find the logic of checking keys one by one with both forms, and using them to split is confusing. I think you can simplify the loop to something like this: keys = ["name", "version", "synopsis", "author", "description"] for line in lines: (key, value) = line.split(":", 1) key = key.lower() if key in keys: result[key] = value.strip() keys.remove(key) Finally, it does not support multi-line values, which seem to be common for descriptions (and are valid as synopsis too, I assume) |
Build is green
Patch application report for D8379 (id=31017)
Rebasing onto f774aba59e...
First, rewinding head to replay your work on top of it... Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 86002252c626ca3bdf3586b28f73b8c47738a8a5 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/933/ for more details.
swh/loader/package/hackage/loader.py | ||
---|---|---|
53–66 | That code still does not parse multiline values, for instance we would miss full description by parsing that cabal file. You should add tests for that function to cover the edge cases. |
Add some tests to check that extract_intrinsic_metadata works as expected
test fallback for author
test key and value not on the same line
test multiline value
Build is green
Patch application report for D8379 (id=31081)
Rebasing onto c631349aea...
First, rewinding head to replay your work on top of it... Applying: Hackage: Loads Hackage Listed origins Using index info to reconstruct a base tree... M .pre-commit-config.yaml Falling back to patching base and 3-way merge... Auto-merging .pre-commit-config.yaml CONFLICT (content): Merge conflict in .pre-commit-config.yaml Patch failed at 0001 Hackage: Loads Hackage Listed origins Resolve all conflicts manually, mark them as resolved with "git add/rm <conflicted_files>", then run "git rebase --continue". You can instead skip this commit: run "git rebase --skip". To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!
Could not rebase; Attempt merge onto c631349aea...
Already up to date.
Changes applied before test
commit 7c9e08f48eed2d5c51f31f9935b5801d886d2be7 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/952/ for more details.
swh/loader/package/hackage/loader.py | ||
---|---|---|
53–66 |
ok, simpler. I added a check against ":" if not it failed on comment and multilines
Last commit fix the issues. By the way please note that "synopsis" is the field used on the forge to display a title and description is more like a README content. The current implementation use synopsis and fallback to description if any, and description may be multiline. Can you confirm its ok this way or do you prefer we use description as a default? Also description is the only multiline field that we need to manage. | |
185–200 | Do agree for description part, but for author the problem is that p_info.author is of type Person, so I let it like this. |
swh/loader/package/hackage/loader.py | ||
---|---|---|
185–200 | Ah right, use this instead then: author_str = intrinsic_metdata.get("author") author = Person.from_fullname(author_str.encode()) if author_str else p_info.author description = intrinsic_metadata.get("synopsis", intrinsic_metadata.get("description")) message = ( f"Synthetic release for Haskell source package {p_info.name} " f"version {p_info.version}\n" ) if description: message += f"\n{description}" |
description should not be special-cased, multi-line values are allowed in plenty of fields. The spec says "To continue a field value, indent the next line relative to the field name." without specifically mentioning description.
Plus, it doesn't work if there is a : character in the description (other than the first line):
>>> (path / "foo.cabal").open("w").write("description:\n foo\n bar: baz"); extract_intrinsic_metadata(path, "foo") {'description': 'foo'}
Just in case you are reading these comments before the chat:
10:25:24 <+vlorentz> franckbret: I just realized something: the "description" and "synopsis" in .cabal files is a description of the package, right? 10:25:28 <+vlorentz> not the release 10:25:35 <+vlorentz> so it shouldn't go in the release message 10:28:33 <+vlorentz> we should also remove descriptions from other loaders...
Remove release description from message
Remove anything related to 'description' that was extrated from intrinsic metadata
Update docstring and documentation, remove some test data fixtures
Uses a simple regex to parse .cabal file
Build is green
Patch application report for D8379 (id=31124)
Could not rebase; Attempt merge onto b26b98810d...
Merge made by the 'recursive' strategy. .pre-commit-config.yaml | 2 +- docs/package-loader-specifications.rst | 18 + setup.py | 2 + swh/loader/package/conda/__init__.py | 17 + swh/loader/package/conda/loader.py | 168 +++++ swh/loader/package/conda/tasks.py | 14 + swh/loader/package/conda/tests/__init__.py | 0 swh/loader/package/conda/tests/data/fake_conda.sh | 231 ++++++ ...inux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 | Bin 0 -> 1742 bytes ...inux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 | Bin 0 -> 1286 bytes swh/loader/package/conda/tests/test_conda.py | 133 ++++ swh/loader/package/conda/tests/test_tasks.py | 24 + swh/loader/package/hackage/__init__.py | 17 + swh/loader/package/hackage/loader.py | 194 +++++ swh/loader/package/hackage/tasks.py | 14 + swh/loader/package/hackage/tests/__init__.py | 0 .../package/hackage/tests/data/fake_hackage.sh | 796 +++++++++++++++++++++ .../data/https_hackage.haskell.org/package_Hs2lib | 2 + .../package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz | Bin 0 -> 2236 bytes .../package_Hs2lib-0.6.3_revisions | 8 + .../data/https_hackage.haskell.org/package_aeson | 2 + .../package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz | Bin 0 -> 2231 bytes .../package_aeson-2.1.0.0_revisions | 18 + .../data/https_hackage.haskell.org/package_colors | 2 + .../package_colors-0.1_colors-0.1.tar.gz | Bin 0 -> 560 bytes .../package_colors-0.1_revisions | 8 + .../package_colors-0.3.0.2_colors-0.3.0.2.tar.gz | Bin 0 -> 531 bytes .../package_colors-0.3.0.2_revisions | 18 + .../https_hackage.haskell.org/package_haskell2010 | 2 + ..._haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz | Bin 0 -> 930 bytes .../package_haskell2010-1.0.0.0_revisions | 8 + .../https_hackage.haskell.org/package_numeric-qq | 2 + ...ackage_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz | Bin 0 -> 1026 bytes .../package_numeric-qq-0.1.0_revisions | 8 + swh/loader/package/hackage/tests/test_hackage.py | 287 ++++++++ swh/loader/package/hackage/tests/test_tasks.py | 23 + 36 files changed, 2017 insertions(+), 1 deletion(-) create mode 100644 swh/loader/package/conda/__init__.py create mode 100644 swh/loader/package/conda/loader.py create mode 100644 swh/loader/package/conda/tasks.py create mode 100644 swh/loader/package/conda/tests/__init__.py create mode 100644 swh/loader/package/conda/tests/data/fake_conda.sh create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 create mode 100644 swh/loader/package/conda/tests/test_conda.py create mode 100644 swh/loader/package/conda/tests/test_tasks.py create mode 100644 swh/loader/package/hackage/__init__.py create mode 100644 swh/loader/package/hackage/loader.py create mode 100644 swh/loader/package/hackage/tasks.py create mode 100644 swh/loader/package/hackage/tests/__init__.py create mode 100644 swh/loader/package/hackage/tests/data/fake_hackage.sh create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_colors-0.1.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_colors-0.3.0.2.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_revisions create mode 100644 swh/loader/package/hackage/tests/test_hackage.py create mode 100644 swh/loader/package/hackage/tests/test_tasks.py
Changes applied before test
commit 34fe949f822fa35308ff0a99f032316110ad586e Merge: b26b988 c3623ea Author: Jenkins user <jenkins@localhost> Date: Wed Oct 5 05:50:40 2022 +0000 Merge branch 'diff-target' into HEAD commit c3623eac528781679645e0c1cd47c3ce96893e67 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version. commit e418aab490115d77bb91d58457a0b013b35d63a5 Author: Franck Bret <franck.bret@octobus.net> Date: Wed Sep 28 16:23:45 2022 +0200 Conda: Anaconda packages archive loader For each origin it takes advantage of 'artifacts' data send through 'extra_loader_arguments' of the conda lister, providing versions, archive url, checksum, etc. Author and description are extracted from intrinsic metadata. Related T4579
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/957/ for more details.
Build is green
Patch application report for D8379 (id=31125)
Could not rebase; Attempt merge onto b26b98810d...
Merge made by the 'recursive' strategy. .pre-commit-config.yaml | 2 +- docs/package-loader-specifications.rst | 18 + setup.py | 2 + swh/loader/package/conda/__init__.py | 17 + swh/loader/package/conda/loader.py | 168 +++++ swh/loader/package/conda/tasks.py | 14 + swh/loader/package/conda/tests/__init__.py | 0 swh/loader/package/conda/tests/data/fake_conda.sh | 231 ++++++ ...inux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 | Bin 0 -> 1742 bytes ...inux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 | Bin 0 -> 1286 bytes swh/loader/package/conda/tests/test_conda.py | 133 ++++ swh/loader/package/conda/tests/test_tasks.py | 24 + swh/loader/package/hackage/__init__.py | 17 + swh/loader/package/hackage/loader.py | 194 +++++ swh/loader/package/hackage/tasks.py | 14 + swh/loader/package/hackage/tests/__init__.py | 0 .../package/hackage/tests/data/fake_hackage.sh | 796 +++++++++++++++++++++ .../data/https_hackage.haskell.org/package_Hs2lib | 2 + .../package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz | Bin 0 -> 2236 bytes .../package_Hs2lib-0.6.3_revisions | 8 + .../data/https_hackage.haskell.org/package_aeson | 2 + .../package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz | Bin 0 -> 2231 bytes .../package_aeson-2.1.0.0_revisions | 18 + .../data/https_hackage.haskell.org/package_colors | 2 + .../package_colors-0.1_colors-0.1.tar.gz | Bin 0 -> 560 bytes .../package_colors-0.1_revisions | 8 + .../package_colors-0.3.0.2_colors-0.3.0.2.tar.gz | Bin 0 -> 531 bytes .../package_colors-0.3.0.2_revisions | 18 + .../https_hackage.haskell.org/package_haskell2010 | 2 + ..._haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz | Bin 0 -> 930 bytes .../package_haskell2010-1.0.0.0_revisions | 8 + .../https_hackage.haskell.org/package_numeric-qq | 2 + ...ackage_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz | Bin 0 -> 1026 bytes .../package_numeric-qq-0.1.0_revisions | 8 + swh/loader/package/hackage/tests/test_hackage.py | 287 ++++++++ swh/loader/package/hackage/tests/test_tasks.py | 23 + 36 files changed, 2017 insertions(+), 1 deletion(-) create mode 100644 swh/loader/package/conda/__init__.py create mode 100644 swh/loader/package/conda/loader.py create mode 100644 swh/loader/package/conda/tasks.py create mode 100644 swh/loader/package/conda/tests/__init__.py create mode 100644 swh/loader/package/conda/tests/data/fake_conda.sh create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2 create mode 100644 swh/loader/package/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_lifetimes-0.11.1-py36hc560c46_1.tar.bz2 create mode 100644 swh/loader/package/conda/tests/test_conda.py create mode 100644 swh/loader/package/conda/tests/test_tasks.py create mode 100644 swh/loader/package/hackage/__init__.py create mode 100644 swh/loader/package/hackage/loader.py create mode 100644 swh/loader/package/hackage/tasks.py create mode 100644 swh/loader/package/hackage/tests/__init__.py create mode 100644 swh/loader/package/hackage/tests/data/fake_hackage.sh create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_Hs2lib-0.6.3.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_Hs2lib-0.6.3_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_aeson-2.1.0.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_aeson-2.1.0.0_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_colors-0.1.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.1_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_colors-0.3.0.2.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010 create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_haskell2010-1.0.0.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_haskell2010-1.0.0.0_revisions create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_numeric-qq-0.1.0.tar.gz create mode 100644 swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_numeric-qq-0.1.0_revisions create mode 100644 swh/loader/package/hackage/tests/test_hackage.py create mode 100644 swh/loader/package/hackage/tests/test_tasks.py
Changes applied before test
commit 2ff73cd030d396eb7a3ad89d34316e330c1d1689 Merge: b26b988 02e1472 Author: Jenkins user <jenkins@localhost> Date: Wed Oct 5 06:05:02 2022 +0000 Merge branch 'diff-target' into HEAD commit 02e147207e9bb1200e663dce11de4efdec9e0314 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version. commit e418aab490115d77bb91d58457a0b013b35d63a5 Author: Franck Bret <franck.bret@octobus.net> Date: Wed Sep 28 16:23:45 2022 +0200 Conda: Anaconda packages archive loader For each origin it takes advantage of 'artifacts' data send through 'extra_loader_arguments' of the conda lister, providing versions, archive url, checksum, etc. Author and description are extracted from intrinsic metadata. Related T4579
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/958/ for more details.
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 |
swh/loader/package/hackage/loader.py | ||
---|---|---|
185–200 | description has been removed |
Build is green
Patch application report for D8379 (id=31314)
Rebasing onto a13e3e6f35...
Current branch diff-target is up to date.
Changes applied before test
commit 04da51dc6b8c6e156d8cbed7d78bcca2081ed6bc Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/997/ for more details.
@vlorentz @anlambert We can get archive checksums by making an HEAD call to archive download url :
franck@debian-franck:/tmp$ http HEAD http://hackage.haskell.org/package/binary-bits-0.4/binary-bits-0.4.tar.gz HTTP/1.1 200 OK Accept-Ranges: bytes Age: 3379 Cache-Control: public, no-transform, max-age=2592000 Connection: keep-alive Content-Length: 11534 Content-MD5: nF+DoGYGh/0oSokoIFwmtw== Content-Type: application/x-gzip Date: Thu, 13 Oct 2022 07:04:47 GMT ETag: "9c5f83a0660687fd284a8928205c26b7" Last-modified: Fri, 09 Jan 2015 13:41:41 GMT Server: nginx/1.18.0 (Ubuntu) Via: 1.1 varnish X-Cache: HIT X-Cache-Hits: 1 X-Served-By: cache-cdg20749-CDG X-Timer: S1665644688.529240,VS0,VE1
We can extract Content-Length and Content-MD5. Will do so in another patch to manage checksums and extrinsic raw metadata.
Regarding raw extrinsic metadata, we can store original-artifacts-json and another one for revisions ( see swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions)?
What do you think?
Nice!
Regarding raw extrinsic metadata, we can store original-artifacts-json and another one for revisions ( see swh/loader/package/hackage/tests/data/https_hackage.haskell.org/package_colors-0.3.0.2_revisions)?
What do you think?
You lost me. original-artifacts-json should already be stored for every archive the loader loads. Am I missing something?
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 | Oh, I just noticed that it seems to be in the wrong direction. Shouldn't it prioritize extrinsic metadata over intrinsic? |
Build is green
Patch application report for D8379 (id=31506)
Rebasing onto 6d8e2abac5...
First, rewinding head to replay your work on top of it... Applying: Hackage: Loads Hackage Listed origins
Changes applied before test
commit 3cf53eb2e18af6761054eae9fd43c4f3ef7adf60 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1001/ for more details.
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 | I think its ok this way. The most accurate data is intrinsic, but sometime author is missing. In this case it falls back to get the username who published new revisions of the package (we use the revisions endpoint to get last update). If getting that username as a fallback does not make sense, let's remove it. |
Apart the md5 decoding issue, loader looks pretty stable when testing it in docker.
I will accept the diff once the inline comments handled.
swh/loader/package/hackage/loader.py | ||
---|---|---|
176 | Got the following warnings when testing the loader: docker-swh-loader-1 | [2022-10-19 15:43:19,789: DEBUG/ForkPoolWorker-32] Fetching https://hackage.haskell.org/package/hoodle docker-swh-loader-1 | [2022-10-19 15:43:19,928: DEBUG/ForkPoolWorker-32] Fetching https://hackage.haskell.org/package/hoodle-0.1/revisions/ docker-swh-loader-1 | [2022-10-19 15:43:20,211: WARNING/ForkPoolWorker-32] Can not decode md5 checksum b'y\x0b\x141S\x14\x93%\xa8\xe5s\x87!W_\xad' for 'https://hackage.haskell.org/package/hoodle-0.1/hoodle-0.1.tar.gz' You need to use this instead to get the hexadecimal representation of the md5 hash: from swh.model.hashutil import hash_to_hex checksums = {"md5": hash_to_hex(md5)} | |
swh/loader/package/hackage/tests/test_tasks.py | ||
1–23 | Please replace that test by the following: # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import uuid import pytest from swh.scheduler.model import ListedOrigin, Lister NAMESPACE = "swh.loader.package.hackage" @pytest.fixture def hackage_lister(): return Lister(name="hackage", instance_name="example", id=uuid.uuid4()) @pytest.fixture def hackage_listed_origin(hackage_lister): return ListedOrigin( lister_id=hackage_lister.id, url="https://hackage.haskell.org/package/package_name", visit_type="hackage", ) def test_hackage_loader_task_for_listed_origin( loading_task_creation_for_listed_origin_test, hackage_lister, hackage_listed_origin, ): loading_task_creation_for_listed_origin_test( loader_class_name=f"{NAMESPACE}.loader.HackageLoader", task_function_name=f"{NAMESPACE}.tasks.LoadHackage", lister=hackage_lister, listed_origin=hackage_listed_origin, ) |
swh/loader/package/hackage/loader.py | ||
---|---|---|
176 | Thanks I was not sure about how to deal with that |
Build is green
Patch application report for D8379 (id=31555)
Rebasing onto 6d8e2abac5...
Current branch diff-target is up to date.
Changes applied before test
commit 496a218e03663b603f14ca9a8b1f7cfbec08cc02 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1005/ for more details.
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 |
Is it? Usually, extrinsic metadata is provided/validated by the forge, while intrinsic metadata is provided by any package author. So the former is more reliable. Other package loaders prefer extrinsic metadata to build release fields for this reason, so we should also do it here for consistency, unless there is a good reason to special-case Hackage. |
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 | Based on what I see, the extrinsic metadata for a hackage package version contains the username of the hackage account that uploaded the package while intrinsic metadata can provide the fullname of the package author. See for instance the Dish package, returned extrinsic metadata are: $ curl -H "Accept: application/json" https://hackage.haskell.org/package/Dish-0.0.0.6/revisions/ | jq [ { "number": 0, "time": "2017-11-29T13:15:29Z", "user": "zcourts" } ] while the associated cabal file contains the full author name. So I think @franckbret approach is the right one here. |
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 | yep the hackage extrinsic metadata are the same than intrinsic one. Everything in the hackage server is related to parse .cabal file. We could have use another api endpoint call for each version of a package as extrinsic metadata but its useless as the data is the same. By the way the endpoint that returns the most information about the package is the .cabal file itself. http -b --json https://hackage.haskell.org/package/Dish-0.0.0.6 { "author": "Courtney Robinson", "copyright": "Courtney Robinson", "description": "A group of Hash related utilities (currently wraps MurmurHash3 C implementation)", "homepage": "http://github.com/zcourts/Dish", "license": "BSD-3-Clause", "metadata_revision": 0, "synopsis": "Hash modules (currently Murmur3)" } In the hackage server everything is parsing the .cabal file. If the cabal file is not valid, one can not upload a new package or revision. See: https://github.com/haskell/hackage-server/blob/master/src/Distribution/Server/Packages/Types.hs |
docs/package-loader-specifications.rst | ||
---|---|---|
112–114 | alright then |
@franckbret , sure looks good to me !
swh/loader/package/hackage/loader.py | ||
---|---|---|
162–163 | could be simplified to: last_modified = max(item["time"] for item in revisions) |
Build is green
Patch application report for D8379 (id=31949)
Rebasing onto a196c85d5a...
Current branch diff-target is up to date.
Changes applied before test
commit b9bd1287e8ad43f0a1d16d20a26976f30c234230 Author: Franck Bret <franck.bret@octobus.net> Date: Fri Sep 2 09:06:15 2022 +0200 Hackage: Loads Hackage Listed origins The loader make an http api call to retrieve package related versions. It then download tar.gz archive for each version.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1033/ for more details.