Page MenuHomeSoftware Heritage

Type swh-storage endpoints with swh.model objects
Started, Work in Progress, NormalPublic

Description

TL; DR

Our data is typed in the database but it is not when consumed from swh-storage's internal api.
Resulting in too much conversion code in the swh-storage's internal api consumers (api mainly).
We could benefit from having types to help.

Details

The possible data structure returned by swh-storage are dictionary, list, bytes, int, date, etc...
Some of those data structure are not serializable in x (x in {json, yaml, ...}).

The thread connection the api's client and swh-storage's internal api use custom types for the data which is not natively serializable (bytes, datetime, etc...) to permit transit of that data.

However, as soon as the api's swh-storage's client has consumed the data, we are back to where we started. That is with possibly non serializable data structure (bytes, date, etc...)

It is then up to the api to convert those data structure to something serializable (json, yaml, etc...) before returning the results to api consumers.

Implementation wise, today, the api converts the values based on key names.
This is not a sustainable model.
Indeed, if a new key with non-serializable data arises (and it will), we need to update the base code to deal with that case.
Furthermore, this is dealt with at endpoint's type level ({content, directory, revision, release, occurrence, etc...}). So, if that key is redundant between endpoints, all the more things to adapt.

If we were to have types in the output from swh-storage, up to after the consumption from swh-storage's internal api, we could generically transform that data according to that type.

We then, would only need to update the base code if a new type arises (which must be rarer than new key with non-serializable values).

Note:
I think this is swh-model's goal but it's not pushed there yet.

Revisions and Commits

rDLDG Git loader
D3608
rDSCH Scheduling utilities
D3682
rDSEA Archive search
D3657
rDWAPPS Web applications
D3877
D3854
D3737
D3693
D3675
D3661
D3647
D3626
D3618
rDLDHG Mercurial loader
D3869
D3853
rDDEP Push deposit
D3864
D3645
D3607
rDLDBASE Generic VCS/Package Loader
D3868
D3735
rDCIDX Metadata indexer
D3865
D3734
D3718
D3619
rDSTO Storage manager
Abandoned
Closed
D3883
D3863
D3852
D3733
D3715
D3713
D3712
D3708
D3707
D3707
D3707
D3706
D3706
D3706
D3705
D3704
D3704
D3703
D3702
D3701
D3700
D3699
D3698
D3697
D3692
D3687
D3681
D3671
D3669
D3651
D3650
D3648
D3641
D3629
D3627
D3625
D3622
D3620
D3605
D3594
rDMOD Data model
D3738
rDCORE Foundations and core functionalities
D3664
D3632
rDLDSVN Subversion (SVN) loader
D3870
D3609

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ardumont changed the task status from Open to Work in Progress.Aug 4 2020, 10:17 AM

Current status, related endpoints to origin, origin-visit and origin-visit-status are done now both read/write.
Remains dag model objects (content, directory, revision, release, snapshot) reading endpoints to align and type.

ardumont added a comment.EditedAug 4 2020, 5:26 PM

Next:

I'll type with what we have right now, that will simplify the next diffs which introduce type changes.
But also demonstrates the inconsistencies we have right now.

Then we'll propose a better consistent api for the remaining endpoints especially {content,release,revision, ...}_get endpoints.
temporary draft is at [1]

[1] https://hebdo.framapad.org/p/2l52re8k5w-9i73?lang=en

ardumont renamed this task from Type swh-storage endpoints to Type swh-storage endpoints with swh.model objects.Aug 5 2020, 2:06 PM

I'll type with what we have right now, that will simplify the next diffs which introduce type changes.
But also demonstrates the inconsistencies we have right now.

done

Then we'll propose a better consistent api for the remaining endpoints especially {content,release,revision, ...}_get

Proposition on its way.

ardumont added a comment.EditedAug 5 2020, 5:45 PM
{object}_missing for object in {content, directory, revision, release, snapshot}

Some inconsistencies in the *_missing endpoints:

def content_missing(self, contents: List[Dict[str, Any]], key_hash: str = "sha1") -> Iterable[bytes]
def content_missing_per_sha1(self, contents: List[bytes]) -> Iterable[bytes]
def content_missing_per_sha1_git(elf, contents: List[Sha1Git]) -> Iterable[Sha1Git]
def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]
def revision_missing(self, revisions: List[Sha1Git]) -> Iterable[Sha1Git]
def release_missing(self, releases: List[Sha1Git]) -> Iterable[Sha1Git]
def snapshot_missing(self, snapshots: List[Sha1Git]) -> Iterable[Sha1Git]
def skipped_content_missing(self, contents: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]

The main part to change would be to make them return List[Sha1Git] and not Iterable[Sha1Git].
As we cannot stream result with our current rpc layer anyway.

The most inconsistent thing to readapt is then skipped_content_missing:

def skipped_content_missing(self, contents: List[Sha1Git]) -> List[Sha1Git]
{object}_get* for object in {content, directory, revision, release, snapshot}

For the remaining part:

def content_get(self, contents: List[bytes]) -> Iterable[Optional[Dict[str, bytes]]]:
def content_get_metadata(self, contents: List[bytes]) -> Dict[bytes, List[Dict]]:
def revision_get(self, revisions: List[Sha1Git]) -> Iterable[Optional[Dict[str, Any]]]:
def release_get(self, releases: List[Sha1Git]) -> Iterable[Optional[Dict[str, Any]]]:
def snapshot_get(self, snapshot_id: Sha1Git) -> Optional[Dict[str, Any]]:
def snapshot_get_branches(
    self,
    snapshot_id: Sha1Git,
    branches_from: bytes = b"",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
) -> Optional[Dict[str, Any]]:

Proposal is to:

  • rename content_get to content_get_data to clarify it's the raw data we are interested in:
def content_get_data(self, content_id: Sha1Git) -> Optional[bytes]:
  • rename content_get_metadata to content_get to clarify it's the content object we are interested in:
def content_get_metadata(self, contents: List[bytes]) -> List[Optional[Content]]

Use the data model for:

def revision_get(self, revisions: List[Sha1Git]) -> List[Optional[Revision]]:
def release_get(self, releases: List[Sha1Git]) -> List[Optional[Release]]:
  • drop snapshot_get because for some snapshots, we cannot have it full anyway.

Instead rely on snapshot_get_branches to retrieve an
Optional[PagedResult[SnapshotBranch]] for a given snapshot. Optional because
the snapshot_id could not exist:

def snapshot_get_branches(
    self,
    snapshot_id: Sha1Git,
    branches_from: bytes = b"",
    page_token: Optional[str] = None,
    limit: int = 1000,
    target_types: Optional[List[str]] = None,  # <- drop?
) -> Optional[PagedResult[SnapshotBranch]]:
specific endpoints
directory_ls
def directory_ls(self, directory: Sha1Git, recursive: bool = False) -> Iterable[Dict[str, Any]]:

to:

def directory_ls(self, directory: Sha1Git, recursive: bool = False) -> Optional[PagedResult[DirectoryEntry]]

Note: Optional because the directory could not exist as well (or raise if it
does not exist?)

revision_*log
def revision_log(self, revisions: List[Sha1Git], limit: Optional[int] = None) -> Iterable[Optional[Dict[str, Any]]]:

to

def revision_log(self, revisions: List[Sha1Git], limit: Optional[int] = None) -> PagedResult[Revision]]]:

Note: Raise in case some root revisions does not exist to keep the interface clearer.

For revision_shortlog, we probably need a new type or an alias or we keep it as is, ymmv:

what's missing for this task to be closed?

ardumont added a comment.EditedAug 27 2020, 5:11 PM

well, storage is typed now but some endpoints remains inconsistent (as T645#47156 explicits with a unification proposal which is not done yet, aside content_get_data and content_get_metadata)

So we could close it but open a new one about making the type consistent... (which I'm trying to get back to).