
Schedule save code now as recurring origins to ingest when successful
ClosedPublic

Authored by ardumont on Jun 11 2021, 4:06 PM.

Details

Summary

So that users only need to submit a save code now request once, the origin is also
marked as a recurring origin to crawl once its first ingestion has succeeded.

Implementation-wise, this is done by the scheduling routine that updates the save code
now statuses reported in the save code now UI.
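
A minimal sketch of the selection step described above, not the actual swh-web code. `SaveRequest`, `origins_to_schedule` and the status strings are hypothetical stand-ins chosen for illustration; only the idea (keep the origins whose ingestion succeeded) comes from this revision.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SaveRequest:
    """Hypothetical stand-in for a save code now request."""

    origin_url: str
    visit_type: str
    status: str  # assumed values: "succeeded", "failed", "not found"


def origins_to_schedule(requests: Iterable[SaveRequest]) -> List[Tuple[str, str]]:
    """Return (origin_url, visit_type) pairs for requests that succeeded.

    Failed or not-found requests are skipped, so only origins known to be
    valid would be recorded as recurring origins.
    """
    return [
        (request.origin_url, request.visit_type)
        for request in requests
        if request.status == "succeeded"
    ]
```

In this sketch the returned pairs are what the status-refresh routine would then hand over for recurrent scheduling.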

Related to T1524

Test Plan

tox fails for now

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

swh/web/common/origin_save.py
514–520 ↗(On Diff #20965)

Not sure we need to reschedule these kinds of visit types, as the list of artifacts will not change here.

WIP as tests need updating. Also open to discussion about where that part should go. I've
chosen to do it immediately when creating the oneshot task, but it could also be done after
the first ingestion has actually occurred and succeeded (and not, for example, ended with a
plain not_found or failed status)...

I would go for the second option: we are sure that the origin is valid in that case.

Build has FAILED

Patch application report for D5858 (id=20965)

Rebasing onto da39599f34...

Current branch diff-target is up to date.
Changes applied before test
commit 681af839d2708759d30d9ad424a7818a2c013e25
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest as well
    
    To avoid for the user to come back and schedule again the same origin later, this also
    manage origins as if the save code now was a lister and record the origin submitted by
    the user.
    
    Related to T1524

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/863/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/863/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 11 2021, 4:14 PM
Harbormaster failed remote builds in B21944: Diff 20965!

WIP as tests need updating. Also open to discussion about where that part should go. I've
chosen to do it immediately when creating the oneshot task, but it could also be done after
the first ingestion has actually occurred and succeeded (and not, for example, ended with a
plain not_found or failed status)...

I would go for the second option: we are sure that the origin is valid in that case.

If that's not too much work (with the current status polling, I guess it's not), I agree with that.

swh/web/common/origin_save.py
512 ↗(On Diff #20965)

Should we make this instance name the name of the frontend that sent the request? (e.g. archive.softwareheritage.org)

514–520 ↗(On Diff #20965)

Yeah, I'd agree with that.

Thanks for the inputs both ;)

swh/web/common/origin_save.py
512 ↗(On Diff #20965)

yes, sounds reasonable

514–520 ↗(On Diff #20965)

oh yeah, good point ;)

Adapt according to suggestions (still no tests)

Build has FAILED

Patch application report for D5858 (id=20968)

Rebasing onto da39599f34...

Current branch diff-target is up to date.
Changes applied before test
commit 3c5199803577dfe934c47f3ea1e667769a1be006
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest as well
    
    To avoid for the user to come back and schedule again the same origin later, this also
    manage origins as if the save code now was a lister and record the origin submitted by
    the user.
    
    Related to T1524

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/865/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/865/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 11 2021, 5:16 PM
Harbormaster failed remote builds in B21948: Diff 20968!
swh/web/common/management/commands/refresh_savecodenow_statuses.py
32

^ fixme: find the archive's instance name instead of host

Build has FAILED

Patch application report for D5858 (id=21015)

Rebasing onto a0db251b32...

Current branch diff-target is up to date.
Changes applied before test
commit b12529f49d83e173502d470317699a93424eb204
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest as well
    
    To avoid for the user to come back and schedule again the same origin later, this also
    manage origins as if the save code now was a lister and record the origin submitted by
    the user.
    
    Related to T1524

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/871/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/871/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 15 2021, 11:39 AM
Harbormaster failed remote builds in B21994: Diff 21015!
ardumont retitled this revision from "wip: Schedule save code now as recurring origins to ingest as well" to "Schedule save code now as recurring origins to ingest as well". Jun 15 2021, 11:40 AM
ardumont edited the summary of this revision.
  • rework commit message
  • make mypy happier
ardumont retitled this revision from "Schedule save code now as recurring origins to ingest as well" to "Schedule save code now as recurring origins to ingest when successful". Jun 15 2021, 11:47 AM
ardumont edited the summary of this revision.

Build was aborted

Patch application report for D5858 (id=21017)

Rebasing onto a0db251b32...

Current branch diff-target is up to date.
Changes applied before test
commit c2723ffa1bd5409e69a9a23c708dd66f9326fbcf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest as well
    
    To allow users to request only once their save code now origins. We also schedule the
    origin when the ingestion is successfully ingested in the archive.
    
    Implementation wise, this is the scheduling routine in charge of updating the save code
    now status which is in charge of this.
    
    Related to T1524

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/872/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/872/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 15 2021, 11:53 AM
Harbormaster failed remote builds in B21995: Diff 21017!
ardumont edited the summary of this revision.

rework commit message

Build is green

Patch application report for D5858 (id=21020)

Rebasing onto a0db251b32...

Current branch diff-target is up to date.
Changes applied before test
commit 3b16bc9051b34b1afdfbdabfbdbe7ad3d746311c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest when successful
    
    To allow users to request only once their save code now origins, once the first
    ingestion is successfully ingested, we also mark it as recurrent origin to crawl
    
    Implementation wise, this is the scheduling routine in charge of updating the save code
    now status which is in charge of this.
    
    Related to T1524

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/874/ for more details.

Possible duplicate origins should be handled, but this looks good otherwise.

swh/web/common/management/commands/refresh_savecodenow_statuses.py
35–46

You can have duplicate origins in the final list, as multiple save code now requests can be submitted for the same origin URL. You should avoid that duplication, as it will result in errors when trying to insert the data into the scheduler database.

Also, maybe the last_update field of ListedOrigin could be set to the visit date?
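
One possible way to filter out the duplicates before recording the origins, as a sketch only. The helper name `deduplicate_origins` is hypothetical, and keying on the `(origin_url, visit_type)` pair is an assumption about what the scheduler database treats as a duplicate.

```python
from typing import Iterable, List, Tuple


def deduplicate_origins(
    origins: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Drop repeated (origin_url, visit_type) pairs, keeping first-seen order.

    dict preserves insertion order, so dict.fromkeys acts as an
    order-preserving set here.
    """
    return list(dict.fromkeys(origins))
```

Several save code now requests for the same origin URL and visit type would then collapse into a single entry before anything is inserted.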

swh/web/tests/common/test_django_command.py
55–63

I would have split that fixture in two separate ones but not a big deal.

This revision now requires changes to proceed. Jun 15 2021, 2:06 PM
swh/web/common/management/commands/refresh_savecodenow_statuses.py
35–46

definitely a good idea for the first part.

The second part sounds good as well but I'm not sure what that entails.

swh/web/tests/common/test_django_command.py
55–63

I agree. I did not spend much time trying to refactor the tests after getting them green ;)

swh/web/common/management/commands/refresh_savecodenow_statuses.py
35–46

The second part sounds good as well but I'm not sure what that entails.

Thinking about it again, this should be set only if the visit was eventful, as we know for sure that we ingested some new code in that case.
If I recall correctly, the last update date can be of interest for some new scheduling strategies, but I'm also not sure it makes
sense for save code now.

swh/web/common/management/commands/refresh_savecodenow_statuses.py
35–46

The second part sounds good as well but I'm not sure what that entails.

I played the bonus card and asked a friend ;)

14:24 <+olasd> ardumont: if you set last_update, then it won't get visited again until that field is updated

So no.

swh/web/common/management/commands/refresh_savecodenow_statuses.py
35–46

Yep, using that field only seems relevant for a real lister that uses a forge API.

  • Rebase
  • adapt according to review (split fixtures, filter out duplicates prior to record listed origins)
  • Add a configuration key instance_name for naming the lister (webapp1 and moma need the same configuration, as they are 2 instances of the same webapp)
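
A rough illustration of the last point, assuming the webapp configuration is available as a plain mapping. `lister_instance_name` and the value `"archive-production"` are hypothetical; only the `instance_name` key comes from this revision.

```python
from typing import Mapping


def lister_instance_name(config: Mapping[str, str]) -> str:
    """Return the lister instance name from a webapp configuration mapping."""
    try:
        return config["instance_name"]
    except KeyError:
        # Failing loudly avoids silently naming the lister after the host.
        raise ValueError("missing 'instance_name' configuration key") from None


# Both instances of the same webapp carry the same (hypothetical) value,
# so they would resolve to the same lister name.
webapp1_config = {"instance_name": "archive-production"}
moma_config = {"instance_name": "archive-production"}
```
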
This revision is now accepted and ready to land. Jun 15 2021, 3:42 PM

Build is green

Patch application report for D5858 (id=21040)

Rebasing onto 8b4e77cae6...

Current branch diff-target is up to date.
Changes applied before test
commit 01ffb31ff72a9dbb1ea3eabad32ecd7fa1e231cc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jun 11 16:01:36 2021 +0200

    Schedule save code now as recurring origins to ingest when successful
    
    To allow users to request only once their save code now origins, once the first
    ingestion is successfully ingested, we also mark it as recurrent origin to crawl.
    
    Implementation wise, the scheduling routine in charge of updating the save code now
    statuses reported in the save code now ui is in charge of this.
    
    Related to T1524

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/877/ for more details.