Page MenuHomeSoftware Heritage

Add a successive_visits counter to origin visit stats
ClosedPublic

Authored by ardumont on Jul 7 2021, 5:26 PM.

Details

Summary

This maintains the number of successive visits resulting in the same status. This will
help implementing disabling of too many successive failed or not_found visits for a
given origin [1]

Related to T2345
Depends on D5956

[1] D5980

Supersedes D4895 (which I plan to commandeer and close)

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
arcpatch-D4895
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22520
Build 35092: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35091: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5978 (id=21541)

Could not rebase; Attempt merge onto 1006f0aee4...

Updating 1006f0a..7d71637
Fast-forward
 sql/updates/29.sql                         |  31 +++
 swh/scheduler/backend.py                   |  41 ++-
 swh/scheduler/interface.py                 |  19 ++
 swh/scheduler/journal_client.py            | 168 ++++++++++-
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |  23 +-
 swh/scheduler/tests/test_api_client.py     |   2 +
 swh/scheduler/tests/test_journal_client.py | 429 ++++++++++++++++++++---------
 swh/scheduler/tests/test_scheduler.py      | 173 +++++++++++-
 9 files changed, 744 insertions(+), 152 deletions(-)
 create mode 100644 sql/updates/29.sql
Changes applied before test
commit 7d71637db7c6521b4cacbafd4e63e2bbd6432bd3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to OriginVisitStats
    
    which (tries to) maintain the number of successive visits resulting in the same status.
    
    For example, if the last 3 visits are successful evenful ones, this field value will be
    3. Then, an uneventful visit coming right after the latest eventful one will reset this
    counter to one (last event is "uneventful" and only one such event happened
    successively).
    
    This will help when we will implement the disabling of too many failed visit attempt on
    a given origin
    
    Related to T2345

commit b02db7ce6222feeb5db7a7aff83a11c3a3697bd3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 1 12:18:49 2021 +0200

    Introduce new scheduling policy to grab origins without last update
    
    This is in charge of scheduling origins without last update. This also updates the
    global queue position so the journal client can initialize correctly the next position
    per origin and visit type.
    
    Related to T2345

commit 8c4ae9f14d6abdca41a4f01b438310501ecb6259
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 16:00:01 2021 +0200

    journal_client: Compute next position for origin visit
    
    For origin without any last_update information [1], the journal client is now also in
    charge of moving their next position in the queue for rescheduling. Depending on their
    status, the next position offset and next_visit_queue_position are updated after each
    visit completes:
    
    - if the visit has failed, increase the next visit target by the minimal visit
      interval (to take into account transient loading issues)
    - if the visit is successful, and records some changes, decrease the visit interval
      index by 2 (visit the origin *way* more often).
    - if the visit is successful, and records no changes, increase the visit interval index
      by 1 (visit the origin less often).
    
    We then set the next visit target to its current value + the new visit interval
    multiplied by a random fudge factor (picked in the -/+ 10% range).
    
    The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
    e.g. when a number of origins from a single hoster are processed at once.
    
    Note that the computations happen for all origins for simplicity and code maintenance
    but it will only be used by a new soon-to-be scheduling policy.
    
    [1] Lister cannot provide it for some reason.

commit cb1edf1ab24d1c8db5821578a7fb2633fab50ff4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 18:07:59 2021 +0200

    Introduce storage for the recurrent visit scheduler queue position

commit ec6e69f6415a007611c46f25e7c48e909a793d53
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:42:26 2021 +0200

    Start handling of recurrent loading tasks in scheduler
    
    This deals first and foremost with the next_position_offset update done by the scheduler
    journal client.

commit c486b28ece7c0b127fea10bbb4d7f5d1ad5c50ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 14:41:07 2021 +0200

    journal_client: Explicit docstring

commit 98f99b9fd457820dc2d4b5dab7e89cb8261a34a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:39:40 2021 +0200

    journal_client: Only check last_* fields for some permutation tests
    
    In a future commit, we will add new fields whose values will be permutation dependent.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/410/ for more details.

I got the idea but I am wondering if it would not be simpler to store the list of visit statuses directly in the model ?
This would allow finer processing for the scheduling policies, but this will be more costly to store.

swh/scheduler/journal_client.py
180

recent

swh/scheduler/journal_client.py
180

lol i rebased a diff from david but i kept the typo, nice or what!

Build is green

Patch application report for D5978 (id=21544)

Could not rebase; Attempt merge onto 1006f0aee4...

Updating 1006f0a..5c7bd83
Fast-forward
 sql/updates/29.sql                         |  31 +++
 swh/scheduler/backend.py                   |  41 ++-
 swh/scheduler/interface.py                 |  19 ++
 swh/scheduler/journal_client.py            | 168 ++++++++++-
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |  23 +-
 swh/scheduler/tests/test_api_client.py     |   2 +
 swh/scheduler/tests/test_journal_client.py | 429 ++++++++++++++++++++---------
 swh/scheduler/tests/test_scheduler.py      | 173 +++++++++++-
 9 files changed, 744 insertions(+), 152 deletions(-)
 create mode 100644 sql/updates/29.sql
Changes applied before test
commit 5c7bd832476083020f2ac71c4d3aaf415a579b50
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to OriginVisitStats
    
    which (tries to) maintain the number of successive visits resulting in the same status.
    
    For example, if the last 3 visits are successful evenful ones, this field value will be
    3. Then, an uneventful visit coming right after the latest eventful one will reset this
    counter to one (last event is "uneventful" and only one such event happened
    successively).
    
    This will help when we will implement the disabling of too many failed visit attempt on
    a given origin
    
    Related to T2345

commit b02db7ce6222feeb5db7a7aff83a11c3a3697bd3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 1 12:18:49 2021 +0200

    Introduce new scheduling policy to grab origins without last update
    
    This is in charge of scheduling origins without last update. This also updates the
    global queue position so the journal client can initialize correctly the next position
    per origin and visit type.
    
    Related to T2345

commit 8c4ae9f14d6abdca41a4f01b438310501ecb6259
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 16:00:01 2021 +0200

    journal_client: Compute next position for origin visit
    
    For origin without any last_update information [1], the journal client is now also in
    charge of moving their next position in the queue for rescheduling. Depending on their
    status, the next position offset and next_visit_queue_position are updated after each
    visit completes:
    
    - if the visit has failed, increase the next visit target by the minimal visit
      interval (to take into account transient loading issues)
    - if the visit is successful, and records some changes, decrease the visit interval
      index by 2 (visit the origin *way* more often).
    - if the visit is successful, and records no changes, increase the visit interval index
      by 1 (visit the origin less often).
    
    We then set the next visit target to its current value + the new visit interval
    multiplied by a random fudge factor (picked in the -/+ 10% range).
    
    The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
    e.g. when a number of origins from a single hoster are processed at once.
    
    Note that the computations happen for all origins for simplicity and code maintenance
    but it will only be used by a new soon-to-be scheduling policy.
    
    [1] Lister cannot provide it for some reason.

commit cb1edf1ab24d1c8db5821578a7fb2633fab50ff4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 18:07:59 2021 +0200

    Introduce storage for the recurrent visit scheduler queue position

commit ec6e69f6415a007611c46f25e7c48e909a793d53
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:42:26 2021 +0200

    Start handling of recurrent loading tasks in scheduler
    
    This deals first and foremost with the next_position_offset update done by the scheduler
    journal client.

commit c486b28ece7c0b127fea10bbb4d7f5d1ad5c50ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 14:41:07 2021 +0200

    journal_client: Explicit docstring

commit 98f99b9fd457820dc2d4b5dab7e89cb8261a34a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:39:40 2021 +0200

    journal_client: Only check last_* fields for some permutation tests
    
    In a future commit, we will add new fields whose values will be permutation dependent.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/411/ for more details.

ardumont edited the summary of this revision. (Show Details)

reword commit message

Build is green

Patch application report for D5978 (id=21545)

Could not rebase; Attempt merge onto 1006f0aee4...

Updating 1006f0a..4723b0c
Fast-forward
 sql/updates/29.sql                         |  31 +++
 swh/scheduler/backend.py                   |  41 ++-
 swh/scheduler/interface.py                 |  19 ++
 swh/scheduler/journal_client.py            | 168 ++++++++++-
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |  23 +-
 swh/scheduler/tests/test_api_client.py     |   2 +
 swh/scheduler/tests/test_journal_client.py | 429 ++++++++++++++++++++---------
 swh/scheduler/tests/test_scheduler.py      | 173 +++++++++++-
 9 files changed, 744 insertions(+), 152 deletions(-)
 create mode 100644 sql/updates/29.sql
Changes applied before test
commit 4723b0c0e08e7c2dd942881c3f3dad78ec143e5f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to OriginVisitStats
    
    This maintains the number of successive visits resulting in the same status.
    
    For example, when the last 3 visits are successful evenful ones, this field value will
    be 3. Then, an uneventful visit coming right after the latest eventful one will reset
    this counter to 1 (last event is "uneventful" and only one such event happened
    successively)...
    
    This will help when implementing disabling of too many failed visit attempts on a given
    origin.
    
    Related to T2345

commit b02db7ce6222feeb5db7a7aff83a11c3a3697bd3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 1 12:18:49 2021 +0200

    Introduce new scheduling policy to grab origins without last update
    
    This is in charge of scheduling origins without last update. This also updates the
    global queue position so the journal client can initialize correctly the next position
    per origin and visit type.
    
    Related to T2345

commit 8c4ae9f14d6abdca41a4f01b438310501ecb6259
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 16:00:01 2021 +0200

    journal_client: Compute next position for origin visit
    
    For origin without any last_update information [1], the journal client is now also in
    charge of moving their next position in the queue for rescheduling. Depending on their
    status, the next position offset and next_visit_queue_position are updated after each
    visit completes:
    
    - if the visit has failed, increase the next visit target by the minimal visit
      interval (to take into account transient loading issues)
    - if the visit is successful, and records some changes, decrease the visit interval
      index by 2 (visit the origin *way* more often).
    - if the visit is successful, and records no changes, increase the visit interval index
      by 1 (visit the origin less often).
    
    We then set the next visit target to its current value + the new visit interval
    multiplied by a random fudge factor (picked in the -/+ 10% range).
    
    The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
    e.g. when a number of origins from a single hoster are processed at once.
    
    Note that the computations happen for all origins for simplicity and code maintenance
    but it will only be used by a new soon-to-be scheduling policy.
    
    [1] Lister cannot provide it for some reason.

commit cb1edf1ab24d1c8db5821578a7fb2633fab50ff4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 18:07:59 2021 +0200

    Introduce storage for the recurrent visit scheduler queue position

commit ec6e69f6415a007611c46f25e7c48e909a793d53
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:42:26 2021 +0200

    Start handling of recurrent loading tasks in scheduler
    
    This deals first and foremost with the next_position_offset update done by the scheduler
    journal client.

commit c486b28ece7c0b127fea10bbb4d7f5d1ad5c50ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 14:41:07 2021 +0200

    journal_client: Explicit docstring

commit 98f99b9fd457820dc2d4b5dab7e89cb8261a34a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:39:40 2021 +0200

    journal_client: Only check last_* fields for some permutation tests
    
    In a future commit, we will add new fields whose values will be permutation dependent.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/412/ for more details.

I got the idea but I am wondering if it would not be simpler to store the list of visit statuses directly in the model ?

It's in the model ;) (but not in ListedOrigin if that's what you meant ;)

This would allow finer processing for the scheduling policies, but this will be more costly to store.

Hum, yes, i think that will be too much to store and i'm not sure what the other benefits would be
aside from the current need [1]. Also, that duplicates the information we already have in the archive?

[1] Right now, the need is a simple one, a counter to ease determine if too many failed
attempts occurred or not. If so, disable the origin in the model here (it should come soon, before the end of the week, i'd say ;)

I got the idea but I am wondering if it would not be simpler to store the list of visit statuses directly in the model ?

It's in the model ;) (but not in ListedOrigin if that's what you meant ;)

This would allow finer processing for the scheduling policies, but this will be more costly to store.

Hum, yes, i think that will be too much to store and i'm not sure what the other benefits would be
aside from the current need [1]. Also, that duplicates the information we already have in the archive?

[1] Right now, the need is a simple one, a counter to ease determine if too many failed
attempts occurred or not. If so, disable the origin in the model here (it will come during the week

Ack, I guess you will find the visit status associated to the counter based on the dates saved in the stats model ?

I got the idea but I am wondering if it would not be simpler to store the list of visit statuses directly in the model ?

It's in the model ;) (but not in ListedOrigin if that's what you meant ;)

This would allow finer processing for the scheduling policies, but this will be more costly to store.

Hum, yes, i think that will be too much to store and i'm not sure what the other benefits would be
aside from the current need [1]. Also, that duplicates the information we already have in the archive?

[1] Right now, the need is a simple one, a counter to ease determine if too many failed
attempts occurred or not. If so, disable the origin in the model here (it will come during the week

Ack, I guess you will find the visit status associated to the counter based on the dates saved in the stats model ?

yes, something like that ;)

ardumont retitled this revision from Add a successive_visits counter to OriginVisitStats to Add a successive_visits counter to origin visit stats.Jul 8 2021, 10:29 AM
ardumont edited the summary of this revision. (Show Details)

Refactor (to extract a function) and add tests around that function

Build has FAILED

Patch application report for D5978 (id=21551)

Could not rebase; Attempt merge onto 1006f0aee4...

Updating 1006f0a..196254c
Fast-forward
 sql/updates/29.sql                         |  31 ++
 swh/scheduler/backend.py                   |  41 ++-
 swh/scheduler/interface.py                 |  19 ++
 swh/scheduler/journal_client.py            | 183 ++++++++++-
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |  23 +-
 swh/scheduler/tests/test_api_client.py     |   2 +
 swh/scheduler/tests/test_journal_client.py | 473 +++++++++++++++++++++--------
 swh/scheduler/tests/test_scheduler.py      | 173 ++++++++++-
 9 files changed, 802 insertions(+), 153 deletions(-)
 create mode 100644 sql/updates/29.sql
Changes applied before test
commit 196254ce6b8e974d73ee7d3285dcd5f1ed32c017
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help when implementing the disabling of too many failed visit attempts for a given
    origin.
    
    Related to T2345

commit b02db7ce6222feeb5db7a7aff83a11c3a3697bd3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 1 12:18:49 2021 +0200

    Introduce new scheduling policy to grab origins without last update
    
    This is in charge of scheduling origins without last update. This also updates the
    global queue position so the journal client can initialize correctly the next position
    per origin and visit type.
    
    Related to T2345

commit 8c4ae9f14d6abdca41a4f01b438310501ecb6259
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 16:00:01 2021 +0200

    journal_client: Compute next position for origin visit
    
    For origin without any last_update information [1], the journal client is now also in
    charge of moving their next position in the queue for rescheduling. Depending on their
    status, the next position offset and next_visit_queue_position are updated after each
    visit completes:
    
    - if the visit has failed, increase the next visit target by the minimal visit
      interval (to take into account transient loading issues)
    - if the visit is successful, and records some changes, decrease the visit interval
      index by 2 (visit the origin *way* more often).
    - if the visit is successful, and records no changes, increase the visit interval index
      by 1 (visit the origin less often).
    
    We then set the next visit target to its current value + the new visit interval
    multiplied by a random fudge factor (picked in the -/+ 10% range).
    
    The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
    e.g. when a number of origins from a single hoster are processed at once.
    
    Note that the computations happen for all origins for simplicity and code maintenance
    but it will only be used by a new soon-to-be scheduling policy.
    
    [1] Lister cannot provide it for some reason.

commit cb1edf1ab24d1c8db5821578a7fb2633fab50ff4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 18:07:59 2021 +0200

    Introduce storage for the recurrent visit scheduler queue position

commit ec6e69f6415a007611c46f25e7c48e909a793d53
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:42:26 2021 +0200

    Start handling of recurrent loading tasks in scheduler
    
    This deals first and foremost with the next_position_offset update done by the scheduler
    journal client.

commit c486b28ece7c0b127fea10bbb4d7f5d1ad5c50ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 14:41:07 2021 +0200

    journal_client: Explicit docstring

commit 98f99b9fd457820dc2d4b5dab7e89cb8261a34a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:39:40 2021 +0200

    journal_client: Only check last_* fields for some permutation tests
    
    In a future commit, we will add new fields whose values will be permutation dependent.

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/413/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/413/console

Build is green

Patch application report for D5978 (id=21552)

Could not rebase; Attempt merge onto 1006f0aee4...

Updating 1006f0a..64aa445
Fast-forward
 sql/updates/29.sql                         |  31 ++
 swh/scheduler/backend.py                   |  41 ++-
 swh/scheduler/interface.py                 |  19 ++
 swh/scheduler/journal_client.py            | 181 ++++++++++-
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |  23 +-
 swh/scheduler/tests/test_api_client.py     |   2 +
 swh/scheduler/tests/test_journal_client.py | 473 +++++++++++++++++++++--------
 swh/scheduler/tests/test_scheduler.py      | 173 ++++++++++-
 9 files changed, 801 insertions(+), 152 deletions(-)
 create mode 100644 sql/updates/29.sql
Changes applied before test
commit 64aa4458ddd2bb5bd9a913c17950732d962129e6
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help when implementing the disabling of too many failed visit attempts for a given
    origin.
    
    Related to T2345

commit b02db7ce6222feeb5db7a7aff83a11c3a3697bd3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 1 12:18:49 2021 +0200

    Introduce new scheduling policy to grab origins without last update
    
    This is in charge of scheduling origins without last update. This also updates the
    global queue position so the journal client can initialize correctly the next position
    per origin and visit type.
    
    Related to T2345

commit 8c4ae9f14d6abdca41a4f01b438310501ecb6259
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 16:00:01 2021 +0200

    journal_client: Compute next position for origin visit
    
    For origin without any last_update information [1], the journal client is now also in
    charge of moving their next position in the queue for rescheduling. Depending on their
    status, the next position offset and next_visit_queue_position are updated after each
    visit completes:
    
    - if the visit has failed, increase the next visit target by the minimal visit
      interval (to take into account transient loading issues)
    - if the visit is successful, and records some changes, decrease the visit interval
      index by 2 (visit the origin *way* more often).
    - if the visit is successful, and records no changes, increase the visit interval index
      by 1 (visit the origin less often).
    
    We then set the next visit target to its current value + the new visit interval
    multiplied by a random fudge factor (picked in the -/+ 10% range).
    
    The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
    e.g. when a number of origins from a single hoster are processed at once.
    
    Note that the computations happen for all origins for simplicity and code maintenance
    but it will only be used by a new soon-to-be scheduling policy.
    
    [1] Lister cannot provide it for some reason.

commit cb1edf1ab24d1c8db5821578a7fb2633fab50ff4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 18:07:59 2021 +0200

    Introduce storage for the recurrent visit scheduler queue position

commit ec6e69f6415a007611c46f25e7c48e909a793d53
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:42:26 2021 +0200

    Start handling of recurrent loading tasks in scheduler
    
    This deals first and foremost with the next_position_offset update done by the scheduler
    journal client.

commit c486b28ece7c0b127fea10bbb4d7f5d1ad5c50ba
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 29 14:41:07 2021 +0200

    journal_client: Explicit docstring

commit 98f99b9fd457820dc2d4b5dab7e89cb8261a34a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 23 16:39:40 2021 +0200

    journal_client: Only check last_* fields for some permutation tests
    
    In a future commit, we will add new fields whose values will be permutation dependent.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/414/ for more details.

... it will come before the end of the week

Ack, I guess you will find the visit status associated to the counter based on the dates saved in the stats model ?

yes, something like that ;)

done in D5980 ;)

Build is green

Patch application report for D5978 (id=21881)

Rebasing onto 4fa29fe128...

Current branch diff-target is up to date.
Changes applied before test
commit cdc2af4733752ddccea01bdbd70b9805fbdaf6f1
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help implementing disabling of too many successive failed or not_found visits for a
    given origin.
    
    Related to T2345

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/420/ for more details.

ardumont edited the summary of this revision. (Show Details)

Start successive_visits to 0 like in the model declaration

Build is green

Patch application report for D5978 (id=21885)

Rebasing onto 4fa29fe128...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-422-D5978.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/422/ for more details.

Use correct commit range...

Build is green

Patch application report for D5978 (id=21888)

Rebasing onto 4fa29fe128...

Current branch diff-target is up to date.
Changes applied before test
commit d616934db615e6f53ea89c629a6b660bd24176e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help implementing disabling of too many successive failed or not_found visits for a
    given origin.
    
    Related to T2345

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/425/ for more details.

Add missing assertions on successive_visits

Build is green

Patch application report for D5978 (id=21889)

Rebasing onto 4fa29fe128...

Current branch diff-target is up to date.
Changes applied before test
commit 015d16158df9a87cdea29d76a55381d6798ee4e3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help implementing disabling of too many successive failed or not_found visits for a
    given origin.
    
    Related to T2345

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/426/ for more details.

vlorentz added a subscriber: vlorentz.

Informally, if there is a single visit with a given status, then the number of successive visits with that status is 1; but with the semantics defined here, it's 0. This seems confusing to me; could you make counters start at 1 (unless there are no visits at all, in which case 0 is correct)

swh/scheduler/journal_client.py
180–195

I find this more readable

284

if increment_successive_visits == False resets the counter, it's probably not the best variable name.

What about same_visit_status?

This revision now requires changes to proceed.Aug 3 2021, 11:59 AM

Informally, if there is a single visit with a given status, then the number of
successive visits with that status is 1; but with the semantics defined here, it's 0.
This seems confusing to me; could you make counters start at 1 (unless there are no
visits at all, in which case 0 is correct)

yes, I sense it is confusing. Although starting the counter at 1 then resets it at 0
after when we switch to a new status is what's confusing to me...

Shouldn't we make that counter always start at 1 and then be reset to 1 instead?

My trail of thought is that if we have an entry in origin_visit_stats it means there is
at least one visit so 1 sound fine to me (it's also in accord with the start of your
sentence).

thoughts?

Informally, if there is a single visit with a given status, then the number of
successive visits with that status is 1; but with the semantics defined here, it's 0.
This seems confusing to me; could you make counters start at 1 (unless there are no
visits at all, in which case 0 is correct)

yes, I sense it is confusing. Although starting the counter at 1 then resets it at 0
after when we switch to a new status is what's confusing to me...

Shouldn't we make that counter always start at 1 and then be reset to 1 instead?

That's what I said. It's 0 before we have any visit, but as soon as we have a visit, it will never go back to 0 again.

My trail of thought is that if we have an entry in origin_visit_stats it means there is
at least one visit so 1 sound fine to me (it's also in accord with the start of your
sentence).

Then 0 is never recorded in the DB, fine with me.

ardumont marked 2 inline comments as done.

Adapt according to review:

  • start increment to 1 then adapt tests
  • rename variable to a more meaningful name

Build is green

Patch application report for D5978 (id=21894)

Rebasing onto 4fa29fe128...

Current branch diff-target is up to date.
Changes applied before test
commit 1bcf84d5e66d02c006698a89d2571911d3fd0764
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 7 16:55:57 2021 +0200

    Add a successive_visits counter to origin visit stats
    
    This maintains the number of successive visits resulting in the same status. This will
    help implementing disabling of too many successive failed or not_found visits for a
    given origin.
    
    Related to T2345

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/428/ for more details.

This revision is now accepted and ready to land.Aug 3 2021, 1:08 PM