Page MenuHomeSoftware Heritage

Add method to efficiently retrieve latest statuses of origin visits
Closed, MigratedEdits Locked

Description

Currently, the following methods are available in the storage interface regarding origin visits retrieval:

  • origin_visit_get(self, origin: str, page_token: Optional[str] = None, order: ListOrder = ListOrder.ASC, limit: int = 10) -> PagedResult[OriginVisit]: returns all visits of an origin in a paginated way
  • origin_visit_status_get(self, origin: str, visit: int, page_token: Optional[str] = None, order: ListOrder = ListOrder.ASC, limit: int = 10) -> PagedResult[OriginVisitStatus]: return all statuses for a given visit in a paginated way
  • origin_visit_status_get_latest(self, origin_url: str, visit: int, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False) -> Optional[OriginVisitStatus]: return the latest status for a given visit

For the use case when one wants to list all visits and their latest statuses (for instance in the visits view of the webapp), the following process must be applied:

  1. Get all visits by calling origin_visit_get
  2. For each visit, get its latest status by calling origin_visit_status_get_latest

The second step is a real bottleneck when an origin has a lot of visits as we have to send a lot of queries to the storage to get their latest statuses.

We should add a new method to the storage interface to efficiently retrieve the latest statuses of origin visits.
The signature could be the following:

origin_visit_status_latest_get_all(self, origin: str, page_token: Optional[str] = None, order: ListOrder = ListOrder.ASC, limit: int = 10) -> PagedResult[OriginVisitStatus]

Event Timeline

anlambert triaged this task as Normal priority.Mar 25 2022, 11:40 AM
anlambert created this task.
anlambert renamed this task from Add method to efficiently retrieve origin visits and their latest statuses to Add method to efficiently retrieve latest statuses of origin visits .Mar 28 2022, 10:41 AM
anlambert updated the task description. (Show Details)

I have the feeling that, in terms of API extensibility, we'll want to be returning both the OriginVisit and its latest OriginVisitStatus.

If we don't do that, we might find ourselves in a situation where we want to combine the fields from both objects into one, which feels a bit clunky when we could just return a proper composite type.

So, I would suggest an API of the form:

@attr.s
class OriginVisitWithStatuses:
    visit: OriginVisit
    statuses: List[OriginVisitStatus]

def origin_visit_get_with_latest_status(
    self,
    origin: str,
    allowed_statuses: Optional[List[str]] = None,
    require_snapshot: bool = False, 
    page_token: Optional[str] = None,
    order: ListOrder = ListOrder.ASC,
    limit: int = 10
) -> PagedResult[OriginVisitWithStatuses]

This way, this new API will be getting the extensions to both model objects without needing to change, and we can reuse the combined result type for OV + OVS in other relevant APIs (e.g. ones that would return all statuses instead of a single one).

Alternatively, we could define a one-shot return type such as:

@attr.s
class OriginVisitWithLastStatus:
    visit: OriginVisit
    last_status: Optional[OriginVisitStatus]

Or even drop the Optional, if we make the API return visits only if the latest status exists and matches the filters supplied in the request.

We should probably be deprecating origin_visit_get, while we're here; I don't see a compelling argument for getting origin_visits alone, if we have an API that will easily return them alongside the latest status.

I have the feeling that, in terms of API extensibility, we'll want to be returning both the OriginVisit and its latest OriginVisitStatus.
If we don't do that, we might find ourselves in a situation where we want to combine the fields from both objects into one, which feels a bit clunky when we could just return a proper composite type.

That was my initial idea but in my first implementation returned a PagedResult[Tuple[OriginVisit, OriginVisitStatus]] and tuples could not be serialized by the RPC layer.
I did not think using a composite type, that's a great idea.

So I will update D7442 to add a method with that signature:

@attr.s
class OriginVisitWithStatuses:
    visit: OriginVisit
    statuses: List[OriginVisitStatus]

def origin_visit_get_with_statuses(
    self,
    origin: str,
    allowed_statuses: Optional[List[str]] = None,
    require_snapshot: bool = False, 
    page_token: Optional[str] = None,
    order: ListOrder = ListOrder.ASC,
    limit: int = 10
) -> PagedResult[OriginVisitWithStatuses]

Better returning all statuses as there are only a couple of them for each visit.

We should probably be deprecating origin_visit_get, while we're here; I don't see a compelling argument for getting origin_visits alone, if we have an API that will easily return them alongside the latest status.

Agreed, visit without status info is not really interesting to retrieve.

anlambert claimed this task.

Feature has been implemented and deployed, closing this.