Decide on the semantics of origin-visit status(es)
Closed, Resolved, Public

Description

For T2310, we want to add a new object type: origin visit status. Each visit would have multiple statuses, but all of these objects are immutable:

  • origin_visit stores the initial state of the visit
  • origin_visit_statuses are created at each visit state change (e.g. when a loader starts or updates a visit).

There are two meanings we can give to origin_visit_status:

  1. a snapshot of the state of the loading at a given date (i.e. all fields are present)
  2. a "patch" over the previous status update
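
The difference between the two semantics can be sketched as follows. This is only an illustration, not the actual data model: the class names, the field names (`status`, `snapshot`, `metadata`) and the `resolve` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class FullStatus:
    """Semantic 1: every status carries all fields (hypothetical names)."""
    date: datetime
    status: str
    snapshot: Optional[str]
    metadata: Optional[dict]


@dataclass(frozen=True)
class PatchStatus:
    """Semantic 2: only changed fields are set; None means "unchanged"."""
    date: datetime
    status: Optional[str] = None
    snapshot: Optional[str] = None
    metadata: Optional[dict] = None


def latest_full(statuses: Sequence[FullStatus]) -> FullStatus:
    # Semantic 1: the most recent object alone is the current state.
    return max(statuses, key=lambda s: s.date)


def resolve(initial: FullStatus, patches: Sequence[PatchStatus]) -> FullStatus:
    # Semantic 2: fold every patch, oldest first, over the initial state.
    state = initial
    for p in sorted(patches, key=lambda p: p.date):
        state = FullStatus(
            date=p.date,
            status=p.status if p.status is not None else state.status,
            snapshot=p.snapshot if p.snapshot is not None else state.snapshot,
            metadata=p.metadata if p.metadata is not None else state.metadata,
        )
    return state
```

With semantic 1 a reader needs one object; with semantic 2 it needs the initial state plus every patch since.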

Cons of 1:

  • More duplication, so possibly much more space needed if we start using the metadata field of visits (we currently don't)

Cons of 2:

  • harder to implement, especially for the pg storage
  • less efficient, as you may need to read multiple statuses to get the current state (although not that many)
  • state is not inherently consistent, e.g. if we're missing an older update (e.g. while replaying); although we could add a "pointer" to the previous update to detect when it's missing, and temporarily return an error
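
The "pointer" idea in the last point could look roughly like this. The `previous` field, the string ids and the `MissingUpdate` exception are hypothetical names for the sketch; nothing here reflects an existing storage API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class MissingUpdate(Exception):
    """Raised when the chain of updates has a gap (e.g. during a replay)."""


@dataclass(frozen=True)
class LinkedPatch:
    id: str
    previous: Optional[str]  # id of the update this one patches; None for the first
    changes: Dict[str, object]  # only the fields that changed


def resolve_chain(head_id: str, known: Dict[str, LinkedPatch]) -> Dict[str, object]:
    # Walk back from the newest update to the first one.
    chain: List[LinkedPatch] = []
    current: Optional[str] = head_id
    while current is not None:
        if current not in known:
            # An older update has not arrived yet: refuse to return a
            # partial, possibly inconsistent state.
            raise MissingUpdate(current)
        patch = known[current]
        chain.append(patch)
        current = patch.previous
    # Apply the patches oldest first.
    state: Dict[str, object] = {}
    for patch in reversed(chain):
        state.update(patch.changes)
    return state
```

The caller can treat `MissingUpdate` as a temporary error and retry once the missing update has been replayed.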

Event Timeline

vlorentz renamed this task from Semantics of origin-visit updates to Decide on the semantics of origin-visit updates. Apr 2 2020, 10:39 AM
vlorentz triaged this task as Normal priority.
vlorentz created this task.

Thanks for recording this.

After consideration, I tend to agree with the choice of having origin visit updates contain the full data. My concern about that was two-fold: loaders having to carry this state over their lifetime, instead of just sending a message to update this or that field, and the storage space. But the fields are tiny, and our current loader framework is pretty monolithic/single-process, so neither of these is a big concern.
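
As a rough sketch of what "carrying this state" means for a loader under semantic 1: it keeps the last status it emitted and derives the next full status from it. The class, its fields and the status values below are assumptions for illustration, and no real loader or storage call is shown.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VisitStatus:
    # Hypothetical fields, for illustration only.
    status: str
    snapshot: Optional[str] = None
    metadata: Optional[dict] = None


# The loader holds on to the last status it emitted...
current = VisitStatus(status="ongoing")
# ...and builds each new immutable status from it, changing only what moved.
partial = replace(current, status="partial", snapshot="snapshot-id-1")
final = replace(partial, status="full", snapshot="snapshot-id-2")
```

Each object is complete on its own, and the earlier ones are never modified.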

As a side note, we don't have any visits with metadata currently, so there's a case for just dropping the field altogether, which would make the storage space argument even less salient.

The only concern I have about removing the metadata field, is that at some point I'd like the "size" of the visit to enter into consideration in the feedback loop of the scheduler (T2345). A metadata field in the visit with the count of objects added (or even just a "visit score") could be a way of recording that info. It would also help the web frontend show the activity for a given repository.

In T2346#43055, @olasd wrote:

The only concern I have about removing the metadata field, is that at some point I'd like the "size" of the visit to enter into consideration in the feedback loop of the scheduler (T2345). A metadata field in the visit with the count of objects added (or even just a "visit score") could be a way of recording that info. It would also help the web frontend show the activity for a given repository.

I'd say, let's keep the metadata field for now, just to avoid migrating back and forth.

And if we want to pack it with lots of data, we can switch from semantic 1 to semantic 2 later, which shouldn't be too much trouble.

Yeah, I think that's fair enough.

vlorentz reopened this task as Open.
vlorentz claimed this task.

@olasd So we agree to go with #1, right?

ardumont renamed this task from Decide on the semantics of origin-visit updates to Decide on the semantics of origin-visit status(es). Apr 28 2020, 3:21 PM
ardumont closed this task as Resolved.
ardumont updated the task description.
ardumont added a subscriber: ardumont.

We are going with option 1: each status is a full snapshot of the visit state.

Closing.