Page MenuHomeSoftware Heritage

Hackage: Implement incremental mode
ClosedPublic

Authored by franckbret on Oct 12 2022, 10:13 AM.

Details

Summary

Use http api lastUpload argument in search query to retrieve new or
updated origins since last run

Related to T4597

Diff Detail

Repository
rDLS Listers
Branch
hackage-incremental
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32780
Build 51356: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51355: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8663 (id=31286)

Rebasing onto 05cd1de1cd...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Implement incremental mode
Changes applied before test
commit 2390eefdd68e0adb58e758bc691b8ce95464a986
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/787/ for more details.

vlorentz added a subscriber: vlorentz.
for entry in page:
    last_update = iso8601.parse_date(entry["lastUpload"])
    if not self.earliest_update or last_update > self.earliest_update:
        self.earliest_update = last_update

This makes self.earliest_update the latest lastUpload, not the earliest.

Either way, I don't think this is the right way to do it, because:

  1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental
  2. if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)

Assuming Hackage has no lag between the lastUpload timestamp and the moment packages are available, you should use the time the lister started.

This revision now requires changes to proceed.Oct 12 2022, 10:47 AM
for entry in page:
    last_update = iso8601.parse_date(entry["lastUpload"])
    if not self.earliest_update or last_update > self.earliest_update:
        self.earliest_update = last_update

This makes self.earliest_update the latest lastUpload, not the earliest.

Ok earliest is a too ambiguous term. I wanted to have something to store the max update_date of a complete listing.

Either way, I don't think this is the right way to do it, because:

  1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental

For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?

  1. if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)

Assuming Hackage has no lag between the lastUpload timestamp and the moment packages are available, you should use the time the lister started.

ok got it, will use the time the lister started
Thanks

  1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental

For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?

No, I was just confused by the term "earliest" and the use of greater-than

Incremental operations are now related to the last_listing_date

Build is green

Patch application report for D8663 (id=31304)

Rebasing onto 05cd1de1cd...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Implement incremental mode
Changes applied before test
commit 88f8389a8506e03e311f801609febb231eb03f5a
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related to T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/788/ for more details.

lastUpload is documented as working on dates rather than datetime, so we are going to need to do the same stuff as with NuGet.

Also, this code is sending datetimes, instead of dates, so the API ignores the filter entirely. For example:

curl "https://hackage.haskell.org/packages/search" -H "Accept: application/json" -H "Content-Type: application/json" --data '{"page": 0, "sortColumn": "default", "sortDirection": "ascending", "searchQuery": "(lastUpload > 2022-11-10T01:00:00Z)"}' -X POST | jq . | head
{
  "numberOfResults": 15644,
  "pageContents": [
    {
      "description": "Examples of 3D graphics programming with OpenGL",
      "downloads": 18,
      "lastUpload": "2016-07-22T14:26:23.038905Z",
      "maintainers": [
        {
          "display": "WolfgangJeltsch",
swh/lister/hackage/__init__.py
37–39

double backticks for monospace

This revision now requires changes to proceed.Nov 10 2022, 1:18 PM

Oh, I missed that you actually convert using .date() in the code, my bad.

This revision is now accepted and ready to land.Nov 10 2022, 1:19 PM
vlorentz requested changes to this revision.EditedNov 10 2022, 1:23 PM

buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

(Sorry for the back-and-forth; hopefully I'm done now.)

This revision now requires changes to proceed.Nov 10 2022, 1:23 PM
franckbret marked an inline comment as done.

Use greater than or equal instead of strict comparison when building http api query params for incremental listing

buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

(Sorry for the back-and-forth; hopefully I'm done now.)

The http api is able to do greater than equal filtering. Now the filter operator is ">=" instead of ">".

Build is green

Patch application report for D8663 (id=31843)

Rebasing onto ea146ce297...

Current branch diff-target is up to date.
Changes applied before test
commit d475c1529985162f7c4e0b098e73f958b349decb
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related to T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/859/ for more details.

sweet.

One last thing: could you make tests check the request body is as expected? See https://requests-mock.readthedocs.io/en/latest/history.html

Improve test for incremental listing, ensure the http searchQuery/lastUpload value is a is a date

Build is green

Patch application report for D8663 (id=31865)

Rebasing onto ea146ce297...

Current branch diff-target is up to date.
Changes applied before test
commit 518cb4b9761dde8b38c2fe99e268268ced15cb5f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related to T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/861/ for more details.

Looks good, but please make this last change before landing.

Thanks!

swh/lister/hackage/tests/test_lister.py
143–144

tests more accurately (number of requests + full body)

165–168

ditto

This revision is now accepted and ready to land.Nov 17 2022, 6:42 PM

Check the number of http requests done in incremental tests

Build is green

Patch application report for D8663 (id=31924)

Rebasing onto 6ad61aec23...

First, rewinding head to replay your work on top of it...
Applying: Hackage: Implement incremental mode
Changes applied before test
commit e48a1817ab3f340f64e8d4ce4df2dfe52ae474fa
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related to T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/864/ for more details.

Build is green

Patch application report for D8663 (id=31925)

Rebasing onto 6ad61aec23...

Current branch diff-target is up to date.
Changes applied before test
commit 065b3f81a1e8769095251c6dbd88f8661a577366
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Oct 12 10:08:38 2022 +0200

    Hackage: Implement incremental mode
    
    Use http api lastUpload argument in search query to retrieve new or
    updated origins since last run
    
    Related to T4597

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/865/ for more details.

This revision was automatically updated to reflect the committed changes.