
Define high level load-git-high task
AbandonedPublic

Authored by ardumont on Apr 13 2021, 10:22 AM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3084: Fast track save code now requests
Summary

These tasks will be used by save code now from now on.

The same logic will be applied to svn and mercurial loaders (probably without diff
though).

Related to T3084

Diff Detail

Repository
rDLDG Git loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 20668
Build 32072: Phabricator diff pipeline on jenkins
Build 32071: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5488 (id=19625)

Rebasing onto 8327ec6d52...

Current branch diff-target is up to date.
Changes applied before test
commit 5ad24ab5fd015509295951f881dbb74d58bead86
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Apr 13 10:13:32 2021 +0200

    Define high level load-git-high task
    
    Related to T3084

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/95/ for more details.

I don't understand why this is needed. Aren't we able to explicitly send instances of the existing swh.loader.git.tasks.UpdateGitRepository task to a separate queue, and have a celery process consume the "regular" tasks from that queue directly?
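
To make the suggestion above concrete, here is a minimal sketch using only the standard library. The in-memory `broker`, the `send_task` and `consume` helpers, the `save_code_now` queue name and the example URLs are all hypothetical stand-ins, and the default of "queue named after the task" is an assumption. In Celery itself, the equivalent would presumably be passing `queue=...` to `apply_async` and starting a worker bound to that queue.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-in for a message broker:
# queue name -> pending messages.
broker = defaultdict(deque)

TASK_NAME = "swh.loader.git.tasks.UpdateGitRepository"


def send_task(name, kwargs, queue=None):
    """Publish a task message; default to a queue named after the task (assumption)."""
    broker[queue or name].append({"task": name, "kwargs": kwargs})


def consume(queue):
    """Pop every pending message from one queue, as a worker bound to it would."""
    messages = []
    while broker[queue]:
        messages.append(broker[queue].popleft())
    return messages


# Same task name, two destinations: the default queue and a dedicated one.
send_task(TASK_NAME, {"url": "https://example.org/regular.git"})
send_task(
    TASK_NAME,
    {"url": "https://example.org/save-code-now.git"},
    queue="save_code_now",  # hypothetical queue name
)

fast_track = consume("save_code_now")
regular = consume(TASK_NAME)
```

The point of the sketch is that the task itself stays unchanged; only the destination queue differs at send time.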

I don't understand why this is needed. Aren't we able to explicitly send instances
of the existing swh.loader.git.tasks.UpdateGitRepository task to a separate queue, and
have a celery process consume the "regular" tasks from that queue directly?

I thought we could but in the end, I don't really see how. I initially intended to use
the priority task property to help detect and trigger a reroute of messages but...

The scheduler runner works on one specific queue per task type, not multiple queues.
The task type has no notion of priority at all, so we need to determine the priority late
in the loop (when actually grabbing tasks for a given task type).

At this time though, if we reroute messages to the new queue, we will have:

  • incomplete data in the new queue: we'll only take a ratio of a given difference from the standard queue. That queue has more space than what we will route. (That may not be a problem.)
  • incomplete data in the standard queue: as we will reroute part of its initially intended messages, we could have enqueued more messages there. (That may not be a problem either.)
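
A rough illustration of both effects, as a simplified model rather than the actual scheduler runner code. The `grab_and_reroute` function, the task dictionaries and the slot counts are hypothetical; the only assumption taken from the discussion is that tasks are grabbed per task type, sized on the standard queue, and that priority is only known once they are grabbed.

```python
def grab_and_reroute(pending, standard_free, priority_free):
    """Grab tasks for one task type, then reroute the prioritized ones.

    Hypothetical model: the grab is sized on the standard queue's free
    slots only, because the task type carries no notion of priority.
    """
    grabbed = pending[:standard_free]
    # Priority is only discovered late, on the tasks already grabbed.
    rerouted = [t for t in grabbed if t["priority"] is not None][:priority_free]
    rerouted_ids = {t["id"] for t in rerouted}
    standard = [t for t in grabbed if t["id"] not in rerouted_ids]
    return standard, rerouted


# One task out of five carries a priority in this made-up backlog.
pending = [
    {"id": i, "priority": "high" if i % 5 == 0 else None} for i in range(100)
]
standard, rerouted = grab_and_reroute(pending, standard_free=10, priority_free=10)

print(len(standard), "sent to the standard queue (10 slots free)")
print(len(rerouted), "sent to the new queue (10 slots free)")
```

With these numbers, both queues end up under-filled: the new queue only receives the prioritized tasks that happened to be in the grab, and the standard queue loses the slots those tasks occupied.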

Overall though, that feels quite incomplete and it will make the code less readable.
(Also, that part is not tested so ugh...)

Furthermore, note that I don't understand what setup is required for the new worker to
consume from a different queue than the one it's hard-coded for in the
swh.loader.<type>.tasks module.

All in all, it feels to me that the simplest option would be to open new messages for a
dedicated queue.

I'm open to suggestions as I'm hitting a wall :)

(Your latest suggestion in swh-site doesn't work out for now and I'm also struggling to
make the scheduler tests go green...)

(Please excuse the "brevity", but I lost my first response.)

Your latest suggestion in swh-site doesn't work out for now

Finally it's ok. I hit some strange behavior (because registering a new task type with
the same name is transparently ignored by the task type registration process...),
so nothing too obvious... and lots of head->desk debugging in between ;)
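
For readers who have not run into this, a minimal sketch of the behavior described above. The dictionary-based `registry`, the `register_task_type` helper and the `backend_name` values are hypothetical; the real registration process is not shown in this thread. The only property illustrated is that a second registration under an existing name is silently dropped.

```python
def register_task_type(registry, task_type):
    """Register a task type unless its name is already known.

    Hypothetical model: a duplicate name is ignored without any error,
    so the first definition wins.
    """
    registry.setdefault(task_type["type"], task_type)
    return registry[task_type["type"]]


registry = {}
register_task_type(
    registry, {"type": "load-git-high", "backend_name": "old.module.TaskName"}
)
# Re-registering the same name with a new definition changes nothing.
stored = register_task_type(
    registry, {"type": "load-git-high", "backend_name": "new.module.TaskName"}
)

print(stored["backend_name"])
```

Under this model the stored definition is still the first one, which is why an updated task type can appear to have no effect.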

ardumont mentioned this in D5493: scheduler: Redirect priority task in their own dedicated task queue.

@olasd I've opened ^ to show you how I've implemented the task so far (don't mind the failing tests for now
as I'm still working on them; it works otherwise in the docker stack)

Update task name according to latest change in D5486

Build is green

Patch application report for D5488 (id=19640)

Rebasing onto 8327ec6d52...

Current branch diff-target is up to date.
Changes applied before test
commit 2b7ade1a2e2714fe69b116676aefb346710829f1
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Apr 13 10:13:32 2021 +0200

    Define high level load-git-high task
    
    Related to T3084

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/96/ for more details.

We should have a need for this.