Page MenuHomeSoftware Heritage

Make origin visits immutable
Started, Work in Progress, NormalPublic

Description

Current problem

Currently, origin visits are the only mutable object type in our data model. First it's created with "ongoing" status, then optionally updated multiple times with a partial snapshot, then has its status set to either "full" or "partial" with an optional snapshot.

This causes an issue with messages pushed to Kafka, which does not guarantee the last update of a message will be the one preserved when compacting; so the journal may end up with an outdated version of a visit.

Solution

We should split a visit object from its updates in the journal: have one topic for the visits themselves (identified by (origin_url, visit_id)), and one for the successive updates (identified by (origin_url, visit_id, update_id); where update_id must be totally ordered). This way, all visit updates would be stored in the journal; these updates and their order would be preserved by the replayer, even though they will be replayed in an arbitrary order.

The current fields of origin visits would be split this way:

  • the new "visit" objects would get the type (git/tar/...), maybe (?) metadata, and maybe a new start_date field
  • the "visit_update" objects would get the other fields (date, status, metadata (?), snapshot).

There are two ways to make the replayer work with this:

  1. simply add a "version" field to origin visits, so that when the replayer passes origin_visit_upsert this version, and swh-storage would discard the upsert if it's older than what is currently in the DB
  2. make visit updates a first class citizen of the data model and swh-storage, and always keep all updates

New data model for the visits

an origin-visit represents a run of a loader. It currently carries the information:

  • origin url
  • visit id: unique for a given origin
  • type (git, hg, ...)
  • start_date: when the loader was started, shortly before it created the origin visit
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (unused)

Note: The left-member wrapping (parenthesis) conveys the optional nature of the property.

and origin-visit-status represents a snapshot of a visit's loader at a point in time (sent from time to time by the loader, like a heartbeat). It has the fields:

  • origin url
  • visit id
  • date: the timestamp of the snapshot of the loader task
  • status: Status of the visit (possible values: created, ongoing, full, partial)
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (not used, kept for future update)

The following operations are supported on origin visits:

  • creating a visit (to get a unique id from the storage)
  • getting a visit from its origin url + id
  • listing visits of an origin (with filters, order, etc.)
  • upsert a visit (to add an origin with a predetermined id, needed for the replayer)

(note that there is no (need for an) equivalent for the current "origin_visit_update()" endpoint, as origin visits are now immutable.)

and on origin visit updates:

  • adding a new status (using the origin url and visit id; there must be an override to allow the replayer to add an update for a visit that doesn't exist yet)
  • getting the last status of a visit (so one can know if it's completed and get the id of its snapshot)
  • listing all statuses of a visit (?) (there is no need for it yet)

Example

Loaders will use the storage API like this:

# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))

and readers (mostly just swh-web) would get this from the API:

{'origin': origin_url, 'visit': id, 'start_date': ...} = storage.origin_visit_get_latest(origin_url)
{'origin': origin_url, 'visit': id, 'date': ..., status: ..., snapshot: ..., metadata: ...} = storage.origin_visit_status_get_latest(origin_url, id)

Work plan

  • Amend origin_visit_update to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
  • D2880: D3001: add OriginVisitStatus to swh-model
  • D2941: Update swh-model documentation
  • D2879: storage*: Split origin_visit table in backends (internal change): new origin_visit_status table
  • D3101: pg-storage: Write both origin-visit and origin-visit-status in parallel
  • Deploy storage (and migrate data, this now can occur while loaders are running)
  • migrate remaining data (data that did not get migrated during the first migration, while loaders continued their work)
  • D3180: pg-storage: Switch over the read queries to the new tables (=> revert D3101)
  • Deploy storage
  • storage*: add origin_visit_status_add endpoint
  • storage*: remove origin_visit_upsert
    • journal: adapt swh.journal.replayer to use origin_visit_status_add
  • model: remove fields of OriginVisit
    • write: migrate loader code
    • read: swh-web
  • storage/journal: produce to new kafka topic (origin_visit_add, origin_visit_status)
  • kafka topics: backfill origin_visit_status, overwrite origin_visit to remove fields (can be done at the end)
  • journal: adapt replayer to new topics

Related Objects

StatusAssignedTask
Work in ProgressNone
Resolvedvlorentz

Event Timeline

vlorentz updated the task description. (Show Details)Mar 25 2020, 1:57 PM
vlorentz added a comment.EditedMar 25 2020, 2:23 PM

Current plan:

  • Amend origin_visit_update to make the status mandatory, like in the model: D2886 D2887 D2888 D2889 D2891
  • split the origin_visit table in the backends (with no change in the API)
    • implementation wise, internal changes in origin_visit_add, origin_visit_update, and origin_visit_upsert, and origin_visit_get*
    • migration scripts (sql,...)
  • add OriginVisitUpdate to swh-model
  • add origin_visit_update_add
  • remove origin_visit_upsert
    • adapt swh.journal.replayer
  • remove fields of OriginVisit
    • write: migrate loader code
    • read: swh-web
  • produce to new kafka topic (origin_visit_add, origin_visit_update)
  • kafka topics: backfill origin_visit_update, overwrite origin_visit to remove fields (can be done at the end)
  • adapt replayer to the new topics
vlorentz updated the task description. (Show Details)Apr 1 2020, 4:37 PM
ardumont updated the task description. (Show Details)Apr 1 2020, 4:40 PM
ardumont updated the task description. (Show Details)Apr 2 2020, 10:23 AM
vlorentz renamed this task from Mutability of origin visits to Make origin visits immutable.Apr 2 2020, 10:32 AM
vlorentz updated the task description. (Show Details)Apr 2 2020, 11:27 AM
vlorentz updated the task description. (Show Details)Apr 2 2020, 11:38 AM
vlorentz updated the task description. (Show Details)Apr 2 2020, 11:40 AM
vlorentz updated the task description. (Show Details)Apr 2 2020, 3:04 PM
vlorentz updated the task description. (Show Details)

As I understand this, an origin visit, consisting in one OriginVisit object plus a list of OriginVisitUpdate represent the process of visiting an origin to load its content in the archive.

This process is a simple state-machine with only a few steps (according to the task):

  • created: this is the initial step and is currently the case when the OriginVisit has been created with no OriginVisitUpdate;
  • ongoing: this is the case when an OriginVisit exists, and the latest OriginVisitUpdate object has the 'ongoing' status field,
  • partial: this is a final step; the OriginVisit exists and its last OriginVisitUpdate object has the 'partial' status field; meaning the loading is over, but some part of the loaded artifact could not be loaded for some reason;
  • full: this is a final step; the OriginVisit exists and its last OriginVisitUpdate object has the 'full' status field; meaning the loading is over and no error occurred during the process.

With the following transitions:

  • start (created -> ongoing): currently implicit by adding a first OriginVisitUpdate(status='ongoing') to the OriginVisit,
  • update (ongoing -> ongoing): represented by a new OriginVisitUpdate(status='ongoing') while there was already a previous OriginVisitUpdate(status='ongoing')
  • fail (ongoing -> partial)
  • success (ongoing -> full)

Regarding this model, a few questions come to my mind:

  • do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would be the meaning of this? or do we enforce one just after the created step to model the start transition?
  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the same snapshot xxx as the previous update? If so, what would be the meaning of it?
  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the snapshot yyy not a superset of a previous update? Or should the snapshot be considered as what has been added between the previous update and this one?
  • how should we consider OriginVisit that never reach a final step but have valid ongoing ones (with snapshots)?

Also, one more question/interrogation:

What information the OriginVisit object carries that cannot be found on an OriginVisitUpdate one? Currently, only the type field seems missing from the OriginVisitUpdate model description, maybe we could just get rid of the OriginVisit table/entity and only have the (currently called) OriginVisitUpdate (with a type column) ? The start_date is then the first OriginVisitUpdate timestamp.

We currently don't have "created" (so no "start" either), but it would make sense to create it.

Regarding this model, a few questions come to my mind:

  • do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would be the meaning of this? or do we enforce one just after the created step to model the start transition?

This could mean these things:

  1. on a first update, to mean the visit was created (but we don't need it if we have a "created" state)
  2. on subsequent updates, for the loader to say it's still alive but doesn't have enough data for a snapshot yet (though I'm not sure that's useful)
  3. to void a previous snapshot (I don't see any use for that)
  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the same snapshot xxx as the previous update? If so, what would be the meaning of it?

Same as point 2 before.

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the snapshot yyy not a superset of a previous update?

It doesn't make sense to have this, but I'm not sure we should care.

Or should the snapshot be considered as what has been added between the previous update and this one?

Now, that makes the data model, especially reads, way more complicated; and I don't see any reason to do it (other than for the sake of purity of the model)

  • how should we consider OriginVisit that never reach a final step but have valid ongoing ones (with snapshots)?

It means either the loader is still running, or there is a bug in the loader. Telling the difference is out of the scope of the data model.

What information the OriginVisit object carries that cannot be found on an OriginVisitUpdate one? Currently, only the type field seems missing from the OriginVisitUpdate model description, maybe we could just get rid of the OriginVisit table/entity and only have the (currently called) OriginVisitUpdate (with a type column) ?

Indeed. We could move type to OriginVisitUpdate, but I wanted to keep it on OriginVisit for consistency (ie. to make sure updates of the same visit don't have different types).
It's not any more useful than checking snapshots are an increasing sequence, but it's cheaper to check so we kept it.

The start_date is then the first OriginVisitUpdate timestamp.

That kind of makes sense, but also kind of doesn't. There is (currently) no reason for a reader to care about updates other than the last one except for that particular field on the first update.

ardumont added a subscriber: ardumont.EditedApr 6 2020, 12:13 PM

Thanks for the questions. I'm unsure about some questions and i replied as best
i could.

do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would
be the meaning of this?

Yes. It means "loading started, so no snapshot yet".
That sounds sensible ;)

Also that's currently how it's done.

or do we enforce one just after the created step to model the start
transition?

Do you mean enforce a snapshot with all "origin visit updates"? I think that
means we won't be able to add visit update unless we are done ingesting
(because we don't have snapshot before that time).

And so there is no longer a need for origin-visit/origin-visit-update
separation at all (your last point/question iiuc).

I don't remember how we came to create origin visits early in the loading
process but i think there is a good reason there. If there aren't then we will
have an answer \m/.


I initially replied on "do we enforce on visit update after the created step"
(as i understood it the first time). In that regard, we implemented adding a
new visit update (with status "ongoing" and snapshot None) at first. See
current diffs D2937, D2938, D2939. That behavior maps with the current loader
implementations.

do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the
same snapshot xxx as the previous update? If so, what would be the meaning of
it?

I don't see why not.

That could be meaning, we started a new visit starting from the last visit's
snapshot "xxx". Some loaders, package ones for example, could use that to
always reference the last snapshot in case of dramatic failures on their latest
loading visits. If that makes sense.

While technically, that could improve performance on querying the snapshot of
the last visit for some loaders, i'm unsure about the functional benefits.

do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the
snapshot yyy not a superset of a previous update?

no idea.

How do we compare snapshot in the first place? With their branch names?

Or should the snapshot be considered as what has been added between the
previous update and this one?

That would change how we currently see and manipulate a snapshot.

And possibly would complexify how to reify a full snapshot out of multiple
smaller ones.

Well, i think it means we need to actually have some more improved dsl to do
computations on snapshots.

What information the OriginVisit object carries that cannot be found on an
OriginVisitUpdate one? Currently, only the type field seems missing from the
OriginVisitUpdate model description, maybe we could just get rid of the
OriginVisit table/entity and only have the (currently called)
OriginVisitUpdate (with a type column) ? The start_date is then the first
OriginVisitUpdate timestamp.

The type and date are only in origin_visit.
The rest is only in origin_visit_update.

If we go that way, isn't it simpler to keep origin_visit table. And:

  • make sure the loaders are calling only once the current "origin_visit_add" storage endpoint (which they now mostly do, need to make sure of that).
  • drop "storage.origin_visit_update"
  • rework "storage.origin_visit_upsert" to make sure it's only working with origin_visit_add.
  • Enforce the order on the current origin_visit date field

The rest is only in origin_visit_update.

origin_visit_update also has a date

If we go that way, isn't it simpler to keep origin_visit table. And:

With such drastic changes in meaning, I think it's best to consider we're creating a new one; and recycling the old one is just an implementation detail in the pg backend

ardumont added a comment.EditedApr 6 2020, 4:24 PM

(also, i agree with @vlorentz's faster first reply ;)

origin_visit_update also has a date

yes, i meant it's not the same as the one in origin_visit_update table

With such drastic changes in meaning, I think it's best to consider we're
creating a new one; and recycling the old one is just an implementation
detail in the pg backend

right


So in the end, what do we do?

My main concern regarding the current task's description might be the name we
initially chose when orally discussing it. (And that we kept when starting the
implementation). So, do we keep origin_visit_update? In another oral
discussion from last week, the name change has come up.

And, I do found the future origin_visit_update_add storage endpoint
confusing... (that can be read as 2 verbs next to each other). Plus it
overwrote the current origin_visit_update which is also confusing.

I recall the current proposition on a name change are (it may be incomplete):

  1. origin-visit-update
  2. origin-visit-state
  3. origin-visit-heartbeat
  4. origin-visit-snapshot
  5. origin-visit-snap
  6. origin-visit-status
  7. origin-visit-transition

My current choice would be 2. (origin-visit-state).

We currently don't have "created" (so no "start" either), but it would make sense to create it.

Regarding this model, a few questions come to my mind:

  • do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would be the meaning of this? or do we enforce one just after the created step to model the start transition?

This could mean these things:

  1. on a first update, to mean the visit was created (but we don't need it if we have a "created" state)

If we go that way, I'd much rather like an explicit "created" state...

  1. on subsequent updates, for the loader to say it's still alive but doesn't have enough data for a snapshot yet (though I'm not sure that's useful)
  2. to void a previous snapshot (I don't see any use for that)

I believe these 2 possibilities do not make much sense and should then be explicitly forbidden.

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the same snapshot xxx as the previous update? If so, what would be the meaning of it?

Same as point 2 before.

Same response, if it does not really make sense, let's forbid it.

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the snapshot yyy not a superset of a previous update?

It doesn't make sense to have this, but I'm not sure we should care.

I think this is a rather simple check to implement so I don't see why not do it. Intrinsic robustness is always (if not over complex) a good thing add.

Or should the snapshot be considered as what has been added between the previous update and this one?

Now, that makes the data model, especially reads, way more complicated; and I don't see any reason to do it (other than for the sake of purity of the model)

Agreed. I added this for the sake of "completeness" but did not though it's a good idea back then.

  • how should we consider OriginVisit that never reach a final step but have valid ongoing ones (with snapshots)?

It means either the loader is still running, or there is a bug in the loader. Telling the difference is out of the scope of the data model.

Yes but having though of what to do with origin visit that never reach a final step now is better than not having this edge case in mind. This will happen. How do we handle it.

What information the OriginVisit object carries that cannot be found on an OriginVisitUpdate one? Currently, only the type field seems missing from the OriginVisitUpdate model description, maybe we could just get rid of the OriginVisit table/entity and only have the (currently called) OriginVisitUpdate (with a type column) ?

Indeed. We could move type to OriginVisitUpdate, but I wanted to keep it on OriginVisit for consistency (ie. to make sure updates of the same visit don't have different types).

This consistency is easy to enforce at API level (we need sanity checks at each transition, this is one of them). Could also be that this type field is only possible to fill when creating an OriginVisitUpdate (or whatevet we call this entity) with 'status="created"') for example.

It's not any more useful than checking snapshots are an increasing sequence, but it's cheaper to check so we kept it.

Don't understand what's cheaper than what here.

The start_date is then the first OriginVisitUpdate timestamp.

That kind of makes sense, but also kind of doesn't. There is (currently) no reason for a reader to care about updates other than the last one except for that particular field on the first update.

Thanks for the questions. I'm unsure about some questions and i replied as best
i could.

do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would
be the meaning of this?

Yes. It means "loading started, so no snapshot yet".
That sounds sensible ;)

but then (as already said) why not make it explicit? (via the created step for example).

Also that's currently how it's done.

It's exactly the kind of argument I do not want to see in this discussion :-)

or do we enforce one just after the created step to model the start
transition?

Do you mean enforce a snapshot with all "origin visit updates"? I think that
means we won't be able to add visit update unless we are done ingesting
(because we don't have snapshot before that time).

Not sure I get what you mean here...

And so there is no longer a need for origin-visit/origin-visit-update
separation at all (your last point/question iiuc).

(I believe, but...)

I don't remember how we came to create origin visits early in the loading
process but i think there is a good reason there. If there aren't then we will
have an answer \m/.


I initially replied on "do we enforce on visit update after the created step"
(as i understood it the first time). In that regard, we implemented adding a
new visit update (with status "ongoing" and snapshot None) at first. See
current diffs D2937, D2938, D2939. That behavior maps with the current loader
implementations.

do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the
same snapshot xxx as the previous update? If so, what would be the meaning of
it?

I don't see why not.

Because if it makes no sense, then we should not allow it.

That could be meaning, we started a new visit starting from the last visit's
snapshot "xxx". Some loaders, package ones for example, could use that to
always reference the last snapshot in case of dramatic failures on their latest
loading visits. If that makes sense.

It looks you misread what I meant: I was talking about a new OriginVisitUpdate with a snapshot "inconsistent" with the previous snapshot reported by the previous OriginVisitUpdate for the same visit.

While technically, that could improve performance on querying the snapshot of
the last visit for some loaders, i'm unsure about the functional benefits.

do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the
snapshot yyy not a superset of a previous update?

no idea.

How do we compare snapshot in the first place? With their branch names?

Just comparing a simple set of couples (branchname, target) should do the trick.

Or should the snapshot be considered as what has been added between the
previous update and this one?

That would change how we currently see and manipulate a snapshot.

And possibly would complexify how to reify a full snapshot out of multiple
smaller ones.

Yeah, as already stated by @vlorentz this is probably not a good idea (too much added complexity for no real value).

Well, i think it means we need to actually have some more improved dsl to do
computations on snapshots.

What information the OriginVisit object carries that cannot be found on an
OriginVisitUpdate one? Currently, only the type field seems missing from the
OriginVisitUpdate model description, maybe we could just get rid of the
OriginVisit table/entity and only have the (currently called)
OriginVisitUpdate (with a type column) ? The start_date is then the first
OriginVisitUpdate timestamp.

The type and date are only in origin_visit.
The rest is only in origin_visit_update.

If we go that way, isn't it simpler to keep origin_visit table. And:

  • make sure the loaders are calling only once the current "origin_visit_add" storage endpoint (which they now mostly do, need to make sure of that).
  • drop "storage.origin_visit_update"
  • rework "storage.origin_visit_upsert" to make sure it's only working with origin_visit_add.
  • Enforce the order on the current origin_visit date field

I'm not there yet :-)

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the snapshot yyy not a superset of a previous update?

It doesn't make sense to have this, but I'm not sure we should care.

I think this is a rather simple check to implement so I don't see why not do it. Intrinsic robustness is always (if not over complex) a good thing add.

It needs to fully read all branches of a snapshot. But let's implement it and see if it's an issue.

Don't understand what's cheaper than what here.

Having a type on origin visits is cheaper than reading a complete snapshot

I recall the current proposition on a name change are (it may be incomplete):

  1. origin-visit-update
  2. origin-visit-state
  3. origin-visit-heartbeat
  4. origin-visit-snapshot
  5. origin-visit-snap
  6. origin-visit-status

My current choice would be 2. (origin-visit-state).

Not 2, because it doesn't represent a state, but a state change / triggered transition (plus some extra data outside the transition itself). As someone pointed out, it's not a heartbeat so not 3; and 4 and 5 are confusing.

ardumont added a comment.EditedApr 7 2020, 11:38 AM

It looks you misread what I meant: I was talking about a new OriginVisitUpdate with a snapshot "inconsistent" with the previous snapshot reported by the previous OriginVisitUpdate for the same visit.

Ok.
I have difficulties seeing how we determine what an inconsistent snapshot is.

Not 2, because it doesn't represent a state, but a state change / triggered transition (plus some extra data outside the transition itself). As someone pointed out, it's not a heartbeat so not 3; and 4 and 5 are confusing.

ok.
I'm not sure we should care about the state / state transition nuance you are making here though.
As you already mentioned at some point, we are interested in the last state of the visit after all.

  1. (status) is also confusing in some ways.

i forgot (7.) origin-visit-transition which i added to my initial comment.

Yeah, as already stated by @vlorentz this is probably not a good idea (too much added complexity for no real value).

It passed incognito yesterday on irc #swh-devel but that may be adding values to the journal... (given the current discussion on very large objects there).
Having smaller snapshots (because they are incremental/diff ones) there could maybe help in decreasing the current errors...


Exciting discussions / questions, thanks ;)

olasd added a subscriber: olasd.Apr 7 2020, 12:43 PM

I'll start with a general reasoning about origin visit vs. origin visit state objects in our "conceptual" data model, as it was sprinkled throughout my comment initially.

I think it's useful to keep an OriginVisit object in our data model, even if it ends up as a shallow shell containing a list of successive States. The origin visit is still the main operation that materializes the archival of an origin in our archive: it makes sense to be able to list all the visits of an origin, with their start time and current state, but it also makes sense to have the full list of successive states of a given visit if we need it (e.g. for deep introspection, which we currently suck at.)

There's some "backend" arguments against having separate origin visit and origin visit update objects: having separate objects creates a (soft) dependency when reading the journal: you'd only get a consistent state if you have read the origin visit object before its associated origin visit updates. But this consistency issue on the backend doesn't negate that the conceptual "origin visit" object is critical to reason about the archival process in our data model.

Once we have reasoned about higher-level data models and the associated APIs, we can consider adding shortcuts in the different storage backends, to make some common operations more efficient, and to make sure we always get consistent data.

  • it makes sense to tune the storage (e.g. by adding a "materialized view"-like table) to make "when retrieving an origin visit, prefetch its latest state" as cheap as possible.
  • we can solve the journal issue by inlining the full visit (which, in the end, is a tiny object) in all the "origin visit update" messages. Once we do that, we can even completely drop the topic for origin_visits themselves.

One of my proposals is adding a new component (yes, again), the "origin visit reaper". We have a somewhat common (but fairly infrequent) pattern of visits crashing hard, and lingering forever in an ongoing state.

If we allow noop state transitions, we can use them as a heartbeat process. We can then specify that ongoing visits with no state update in the last $timeframe are dead, and should be set as failed by an external process. This process would not depend on the worker scaffolding, but just be a janitorial process that runs once in a while to keep the archive tidy.


Then, here comes the painful part because of the deep quoting; I answer to @douardda's messages because they quote messages from @vlorentz and @ardumont I'll try to liberally trim the quotes and to keep my comments to this single message though, as well as try to avoid repeating myself too much...

As I understand this, an origin visit, consisting in one OriginVisit object plus a list of OriginVisitUpdate represent the process of visiting an origin to load its content in the archive.

This process is a simple state-machine with only a few steps (according to the task):

  • created: this is the initial step and is currently the case when the OriginVisit has been created with no OriginVisitUpdate;
  • ongoing: this is the case when an OriginVisit exists, and the latest OriginVisitUpdate object has the 'ongoing' status field,
  • partial: this is a final step; the OriginVisit exists and its last OriginVisitUpdate object has the 'partial' status field; meaning the loading is over, but some part of the loaded artifact could not be loaded for some reason;
  • full: this is a final step; the OriginVisit exists and its last OriginVisitUpdate object has the 'full' status field; meaning the loading is over and no error occurred during the process.

With the following transitions:

  • create (ø -> created)
  • start (created -> ongoing): currently implicit by adding a first OriginVisitUpdate(status='ongoing') to the OriginVisit,
  • update (ongoing -> ongoing): represented by a new OriginVisitUpdate(status='ongoing') while there was already a previous OriginVisitUpdate(status='ongoing')
  • fail (ongoing -> partial)
  • success (ongoing -> full)
  • reaped (ongoing -> partial, but by an external process)

Regarding this model, a few questions come to my mind:

  • do we allow an OriginVisitUpdate(status='ongoing', snaphost=None)? what would be the meaning of this? or do we enforce one just after the created step to model the start transition?

[ heartbeat for the reaper proposal ]

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=xxx) with the same snapshot xxx as the previous update? If so, what would be the meaning of it?

[ heartbeat for the reaper proposal ]

  • do we allow an OriginVisitUpdate(status='ongoing', snapshot=yyy) with the snapshot yyy not a superset of a previous update? Or should the snapshot be considered as what has been added between the previous update and this one?

Defining the "superset" operation between snapshots in general terms is probably expensive. I can see a few cases for "partial" snapshots (which, btw, should probably be recorded as such intrinsically):

  • recording the current state of loading a VCS repository, where we have loaded only some of the commits. We can expect the HEAD of such a snapshot to have the HEAD of a previous iteration as parent.
  • recording the current state of loading a VCS repository, where we have loaded only some of the branches. We can expect the set of branches of such a snapshot to contain the set of branches of a previous iteration.
  • recording the current state of loading packages from a package manager, where we have loaded only some of the URLs. This is pretty much the same thing as the previous case (the tip of the branches is not expected to change).

I could see us start to load DVCSs incrementally as well (starting from the bottom of the history and building up); such an incremental, costly process could use recording partial snapshots as well.

  • how should we consider OriginVisit that never reach a final step but have valid ongoing ones (with snapshots)?

[ pointer to the reaper proposal ]

Also, one more question/interrogation:

What information the OriginVisit object carries that cannot be found on an OriginVisitUpdate one? Currently, only the type field seems missing from the OriginVisitUpdate model description, maybe we could just get rid of the OriginVisit table/entity and only have the (currently called) OriginVisitUpdate (with a type column) ? The start_date is then the first OriginVisitUpdate timestamp.

[ pointer to the origin visit as a critical SWH concept argument ]

I agree with getting rid of the table, but not about getting read of the entity. I also think that after reasoning about concepts, the table will need to be resurrected to make common operations on the entity efficient.

  1. on a first update, to mean the visit was created (but we don't need it if we have a "created" state)

If we go that way, I'd much rather like an explicit "created" state...

I think it wouldn't hurt to fully materialize the origin visit state machine at this point, indeed.

[ trimmed parts where I think I already replied ]

Or should the snapshot be considered as what has been added between the previous update and this one?

Now, that makes the data model, especially reads, way more complicated; and I don't see any reason to do it (other than for the sake of purity of the model)

Agreed. I added this for the sake of "completeness" but did not though it's a good idea back then.

Yeah, let's not start to store diffs instead of fully internally consistent (well, you know...) objects.

[ origin visit with final state == 'ongoing' ]

Yes but having though of what to do with origin visit that never reach a final step now is better than not having this edge case in mind. This will happen. How do we handle it.

This already happens indeed. [ pointer to the reaper proposal ]

[usefulness of the origin visit table / object]

This consistency is easy to enforce at API level (we need sanity checks at each transition, this is one of them). Could also be that this type field is only possible to fill when creating an OriginVisitUpdate (or whatevet we call this entity) with 'status="created"') for example.

We'll need to be able to turn off the sanity checks in the replayer, as there's no guarantee that we won't replay the states out of order.

[...]

Don't understand what's cheaper than what here.

If you encode the type in the OriginVisit model, then you don't have to check that it doesn't change during your state transitions. That costs 0. Whatever the definition of "superset", checking whether a snapshot is a superset of another is more costly than not doing anything ;)

The start_date is then the first OriginVisitUpdate timestamp.

That kind of makes sense, but also kind of doesn't. There is (currently) no reason for a reader to care about updates other than the last one except for that particular field on the first update.

I agree the last state its going to be used way more than others, but this is an implementation detail of the high-level APIs.

[ I had another bit of the origin visit as a conceptual object argument here ]

or do we enforce one just after the created step to model the start
transition?

Do you mean enforce a snapshot with all "origin visit updates"? I think that
means we won't be able to add visit update unless we are done ingesting
(because we don't have snapshot before that time).

Not sure I get what you mean here...

Lots of loaders could add partial, incremental snapshots throughout their loading process, but we currently don't really do that.

And so there is no longer a need for origin-visit/origin-visit-update
separation at all (your last point/question iiuc).

(I believe, but...)

[ repeat the concept vs tables argument once more :D ]

[ trim repeat argument about state transitions, already commented above; trim incremental snapshots argument too ]

What information the OriginVisit object carries that cannot be found on an
OriginVisitUpdate one? Currently, only the type field seems missing from the
OriginVisitUpdate model description, maybe we could just get rid of the
OriginVisit table/entity and only have the (currently called)
OriginVisitUpdate (with a type column) ? The start_date is then the first
OriginVisitUpdate timestamp.

The type and date are only in origin_visit.
The rest is only in origin_visit_update.

If we go that way, isn't it simpler to keep origin_visit table. And:

  • make sure the loaders are calling only once the current "origin_visit_add" storage endpoint (which they now mostly do, need to make sure of that).
  • drop "storage.origin_visit_update"
  • rework "storage.origin_visit_upsert" to make sure it's only working with origin_visit_add.
  • Enforce the order on the current origin_visit date field

I'm not there yet :-)

[ I'll save you one more reiteration of my schtick ]

We have a somewhat common (but fairly infrequent) pattern of visits crashing hard, and lingering forever in an ongoing state.

technical side-track but hey. The worker lost error, memory error, etc... which ends up with a reaped worker are at least one source for this...

And the new janitorial process "reaper" dealing with that is a good idea indeed.

I can see a few cases for "partial" snapshots (which, btw, should probably be recorded as such intrinsically):

do you mean adding a "state/status" to the snapshot?

I agree with getting rid of the table, but not about getting read of the entity. I also think that after reasoning about concepts, the table will need to be resurrected to make common operations on the entity efficient.

So in the end, we keep the entity and the table.
And i guess the model.


fwiw, agreed with olasd's proposal.

Thanks.

ardumont updated the task description. (Show Details)Apr 21 2020, 2:22 PM
ardumont updated the task description. (Show Details)Apr 28 2020, 6:00 PM
ardumont updated the task description. (Show Details)Apr 29 2020, 10:47 AM
ardumont updated the task description. (Show Details)Apr 29 2020, 11:24 AM
ardumont updated the task description. (Show Details)
ardumont updated the task description. (Show Details)Apr 29 2020, 11:27 AM
ardumont added a comment.EditedApr 30 2020, 11:31 AM

Status on this.

D2937, D2938, D2939 landed. To ensure deployment be ok, couple of questions got
raised in irc which led to further discussions (P656).

After discussion on how to properly deployed, it has been decided it would
be best to actually write both in parallel in the new origin visit status and
the "old" origin visit schema tables (1).

After further discussion, for testing purposes, we agreed on dumping and
restoring origins and origin-visits from production to staging. And try to
check what the behavior is (2).

It actually took some time to do that. It's now currently migrating to the new
schema (147.sql). That should give a time order for the production migration
time. So that's a win.

In the mean time, I've started to work on 1. as i'm a bit uneasy about letting
master in a state we cannot really deploy yet. As there is some work nearly
ready to be deployed (new metadata api endpoints) and i don't want to prevent
its deployment.

ardumont added a comment.EditedApr 30 2020, 1:58 PM

In the mean time, I've started to work on 1. as i'm a bit uneasy about letting
master in a state we cannot really deploy yet. As there is some work nearly
ready to be deployed (new metadata api endpoints) and i don't want to prevent
its deployment.

That's D3101.

ardumont updated the task description. (Show Details)Apr 30 2020, 2:48 PM

After further discussion, for testing purposes, we agreed on dumping and
restoring origins and origin-visits from production to staging. And try to
check what the behavior is (2).

for information:

  • dumping out of production origin/origin-visits: ~ 2 hours
  • restoring in staging (~dropping indexers, restoring, creating index): ~ 3-4 hours
  • migration 147 (~ insert origin-visit-status out of origin-visits): ~ 12 to 24 hours
ardumont added a comment.EditedWed, May 6, 8:39 PM
  • Deploy storage (and migrate data, this now can occur while loaders are running)

without breaking a sweat... (not)

Migration run through P664
(Thank olasd for the replication issues and fixes)

ardumont updated the task description. (Show Details)Wed, May 6, 8:39 PM
ardumont updated the task description. (Show Details)Mon, May 25, 7:09 PM
  • migrate remaining data (data that did not get migrated during the first migration, while loaders continued their work)

That's been running since 15:30 or so.

ardumont updated the task description. (Show Details)Mon, May 25, 8:18 PM

That's been running since 15:30 or so.

It was done not long after I commented.

ardumont updated the task description. (Show Details)Wed, May 27, 1:35 PM
ardumont updated the task description. (Show Details)Wed, May 27, 4:18 PM
ardumont updated the task description. (Show Details)Thu, May 28, 5:18 PM
ardumont changed the task status from Open to Work in Progress.Fri, May 29, 12:40 PM
ardumont updated the task description. (Show Details)Mon, Jun 1, 2:29 PM