Page MenuHomeSoftware Heritage

origin_visits/get_origin_visit: Improve default visit picking strategy
ClosedPublic

Authored by anlambert on Jun 30 2020, 6:37 PM.

Details

Summary

When no visit info (timestamp, id or snapshot) is provided to that function
return the last full visit with a valid snapshot or the last partial visit
with a valid snapshot otherwise.

This fixes a browsing issue when an origin is no more available but swh keeps
visiting it afterwards.

For instance the origin https://github.com/baifanwudi/PythonCode has been removed
since the first swh visit. Currently, when trying to browse it, we get
and empty snapshot view.
After that change, the last full visit with non empty snapshot will be displayed instead.

Diff Detail

Repository
rDWAPPS Web applications
Branch
improve-get-origin-visit
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 13248
Build 20231: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 20230: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D3385 (id=12012)

Rebasing onto f433684afd...

Current branch diff-target is up to date.
Changes applied before test
commit 774c25ab2e227ff207f8eed55e128935afff53f3
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Jun 30 18:25:58 2020 +0200

    origin_visits/get_origin_visit: Improve default visit picking strategy
    
    When no visit info (timestamp, id or snapshot) is provided to that function
    return the last full visit with a valid snapshot or the last partial visit
    with a valid snapshot otherwise.
    
    This fixes a browsing issue when an origin is no more available but swh keeps
    visiting it afterwards.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/257/ for more details.

Build is green

Patch application report for D3385 (id=12017)

Rebasing onto c2726ac124...

Current branch diff-target is up to date.
Changes applied before test
commit 8f66f393a0d770338576ffc44473d4c4477f05a6
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Jun 30 18:25:58 2020 +0200

    origin_visits/get_origin_visit: Improve default visit picking strategy
    
    When no visit info (timestamp, id or snapshot) is provided to that function
    return the last full visit with a valid snapshot or the last partial visit
    with a valid snapshot otherwise.
    
    This fixes a browsing issue when an origin is no more available but swh keeps
    visiting it afterwards.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/260/ for more details.

Looks good, some remarks above though (proposal to improve coverage).

swh/web/common/origin_visits.py
166

Having the snapshot-id set does not mean that it's necessarily valid currently (unfortunately) [1]

[1] D3322

swh/web/tests/common/test_origin_visits.py
229

That does not really test the second part (fallback on partial visit with a snapshot ;).

For that maybe make a first pass on your local visits list without the visit id 3.
That should then found the partial visit with id 6 (if i'm not mistaken).

Then append visit id 3 to your list.
Trigger a search again which should be the current scenario ;)

Also, i don't know if the list is already ordered or not.

What do you think?

swh/web/tests/common/test_origin_visits.py
229

Also, i don't know if the list is already ordered or not.

I mean it apparently is as you reverse it...
No idea if the order should be enforced "more" by sorting on the key "date" there...


Also, it kinda feel redundant with the
`swh.storage.algos.origin_get_latest_visit_status(require_snapshot=True,
allowed_statuses=["full", "partial"])` somehow (<- except you don't have
priority granularity you coded here regarding the status, full first, then
partial as fallback...)

This needs more work in order to reuse utility functions from storage and rewrite tests using hypothesis.

swh/web/tests/common/test_origin_visits.py
229

Indeed, this is the same processing but that visit picking strategy was in place since a long time in swh-web and the origin_get_latest_visit_status function has landed recently in storage so I did not really have it in mind.

I think I might be able to use it in the get_origin_visit function.

229

I think the best option here is to rewrite the tests using hypothesis to avoid hardcoding visits data and mocking functions.

Update:

  • Use new swh.storage.algos.origin.origin_get_latest_visit_status utility function in revised get_origin_visit implementation
  • Rewrite tests using swh models and hypothesis
  • Update some function documentation

Build is green

Patch application report for D3385 (id=12029)

Rebasing onto b8eb774184...

Current branch diff-target is up to date.
Changes applied before test
commit 8df76362f8a0a6e14ba49d551ca0e5b85b6332f9
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Jun 30 18:25:58 2020 +0200

    origin_visits/get_origin_visit: Improve default visit picking strategy
    
    When no visit info (timestamp, id or snapshot) is provided to that function
    return the last full visit with a valid snapshot or the last partial visit
    with a valid snapshot otherwise.
    
    This fixes a browsing issue when an origin is no more available but swh keeps
    visiting it afterwards.
    
    Use new swh.storage.algos.origin.origin_get_latest_visit_status utility
    function in the revised get_origin_visit_implementation.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/262/ for more details.

This revision is now accepted and ready to land.Jul 1 2020, 8:21 PM

Update: Improve tests implementation

Build is green

Patch application report for D3385 (id=12031)

Rebasing onto b8eb774184...

Current branch diff-target is up to date.
Changes applied before test
commit 76d9162807e06e8ab86d2b9f517aad343df22845
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Jun 30 18:25:58 2020 +0200

    origin_visits/get_origin_visit: Improve default visit picking strategy
    
    When no visit info (timestamp, id or snapshot) is provided to that function
    return the last full visit with a valid snapshot or the last partial visit
    with a valid snapshot otherwise.
    
    This fixes a browsing issue when an origin is no more available but swh keeps
    visiting it afterwards.
    
    Use new swh.storage.algos.origin.origin_get_latest_visit_status utility
    function in the revised get_origin_visit_implementation.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/263/ for more details.