Type swh-storage endpoints with swh.model objects
Closed, Migrated

Authored By

	ardumont
	Jan 26 2017, 12:32 PM

Description

TL;DR

Our data is typed in the database, but not when it is consumed through swh-storage's internal api.
This results in too much conversion code in the consumers of swh-storage's internal api (mainly the api).
Having types on that data would help.

Details

The data structures returned by swh-storage can be dictionaries, lists, bytes, ints, dates, etc.
Some of those data structures are not serializable in x (x in {json, yaml, ...}).

The link between the api's client and swh-storage's internal api uses custom types for the data which is not natively serializable (bytes, datetime, etc.), so that this data can transit.
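As an illustration of that transit step, here is a minimal sketch of a round trip for bytes and datetime values. The actual wire format is not described in this task; the JSON encoding, the `__type__` tag, and the `encode_hook`/`decode_hook` names are assumptions made up for the example.

```python
import base64
import datetime
import json
from typing import Any


def encode_hook(obj: Any) -> Any:
    """`default` hook for json.dumps: wrap non-native values in a tagged dict.

    The "__type__" tag is a hypothetical convention, not swh's wire format.
    """
    if isinstance(obj, bytes):
        return {"__type__": "bytes", "value": base64.b64encode(obj).decode("ascii")}
    if isinstance(obj, datetime.datetime):
        return {"__type__": "datetime", "value": obj.isoformat()}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_hook(obj: dict) -> Any:
    """`object_hook` for json.loads: unwrap tagged dicts, pass others through."""
    tag = obj.get("__type__")
    if tag == "bytes":
        return base64.b64decode(obj["value"])
    if tag == "datetime":
        return datetime.datetime.fromisoformat(obj["value"])
    return obj


payload = {
    "sha1": b"\x00\xff",
    "date": datetime.datetime(2020, 8, 5, 14, 6, tzinfo=datetime.timezone.utc),
}
wire = json.dumps(payload, default=encode_hook)  # safe to send as text
back = json.loads(wire, object_hook=decode_hook)  # bytes and datetime again
```

After decoding, the client holds bytes and datetime values again, which is exactly the situation the next paragraph describes.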

However, as soon as the api's swh-storage client has consumed the data, we are back to where we started, that is, with possibly non-serializable data structures (bytes, date, etc.).

It is then up to the api to convert those data structures to something serializable (json, yaml, etc.) before returning the results to api consumers.

Implementation-wise, today, the api converts the values based on key names.
This is not a sustainable model.
Indeed, if a new key with non-serializable data arises (and it will), we need to update the base code to deal with that case.
Furthermore, this is dealt with per endpoint type ({content, directory, revision, release, occurrence, etc.}). So, if that key is shared between endpoints, there is all the more to adapt.

If the output of swh-storage were typed, and stayed typed after being consumed through swh-storage's internal api, we could generically transform that data according to its type.

We would then only need to update the base code when a new type arises (which should be rarer than a new key with non-serializable values).
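A minimal sketch of what such a type-driven conversion could look like, as opposed to one driven by key names. The `to_serializable` helper and the choice of hex and ISO 8601 output are assumptions for illustration, not the api's actual converters.

```python
import datetime
from typing import Any


def to_serializable(value: Any) -> Any:
    """Recursively convert a value to something json.dumps accepts.

    Dispatch is on the value's type only, never on the name of the key
    holding it. Dictionary keys are assumed to be strings already.
    """
    if isinstance(value, bytes):
        return value.hex()
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {k: to_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(v) for v in value]
    return value


revision = {
    "id": b"\x01\xab",
    "date": datetime.datetime(2017, 1, 26, 12, 32, tzinfo=datetime.timezone.utc),
    "parents": [b"\x00"],
    "message": "initial import",
}
converted = to_serializable(revision)
```

With this shape, a new key holding bytes needs no code change; only a new type would require a new branch.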

Note:
I think this is swh-model's goal but it's not pushed there yet.

Revisions and Commits

rDLDG Git loader
		D3608	rDLDG68e30d5cc636 loader: Update swh.storage.origin_get call to latest api change
rDSCH Scheduling utilities
		D3682	rDSCH642620845ac2 cli.task: Migrate scheduler cli to latest storage change on iter_origins
rDSEA Archive search
		D3657	rDSEA0912f9504161 search*: Type origin_search(...) -> PagedResult[Dict]
rDWAPPS Web applications
		D3877	rDWAPPS8545e0a44589 Adapt storage.revision_get calls according to latest api change
		D3854	rDWAPPS9c41d9f90f58 Adapt to latest storage release_get api change
		D3737	rDWAPPS478a7acb46e0 Adapt type and rename content_get_metadata calls to content_get
		D3693	rDWAPPS8f1e003bc78f service: Adapt according to the latest storage.content_find changes
		D3675	rDWAPPSb19fb3f661fe origin: Migrate use to storage.origin_list instead of origin_get_range
		D3661	rDWAPPSddfb658d2502 common/service: Migrate origin_search to latest apis change
		D3647	rDWAPPSb712b2679eae service: Migrate to latest origin_visit_get api change
		D3626	rDWAPPSc61a7a2ffba5 Update swh.storage.origin_visit_get_by calls to latest api change
		D3618	rDWAPPS26622a3425f4 Update swh.storage.origin_get calls to latest api change
rDLDHG Mercurial loader
		D3869	rDLDHG0e86a987ead9 test_loader: Adapt to latest storage revision_get change
		D3853	rDLDHG3baeea1fab5e test_loader: Adapt to latest storage release_get change
rDDEP Push deposit
		D3864	rDDEP7c5c84b595c9 deposit/migrations: Adapt according to latest storage api change
		D3645	rDDEPe189378e1353 deposit.migrations: Migrate to latest storage api change
		D3607	rDDEP402e4b83bb46 migrations: Update swh.storage.origin_get calls to latest api change
rDLDBASE Generic VCS/Package Loader
		D3868	rDLDBASE079b7ecf2bd7 loader: Adapt to latest storage revision_get change
		D3735	rDLDBASE57d3e372fad8 test_npm: Adapt content_get_metadata call to content_get
rDCIDX Metadata indexer
		D3865	rDCIDX32994d21d715 metadata: Adapt to latest storage revision_get change
		D3734	rDCIDXe45d76d12dd8 indexer.rehash: Adapt content_get_metadata call to content_get
		D3718	rDCIDX3aedf90c9cb9 textual-indexers: Migrate to partition index instead of range
		D3619	rDCIDXbd86eb7184c6 metadata: Update swh.storage.origin_get call to latest api change
rDSTO Storage manager
	Abandoned		D3683 storage.in_memory: Fix origin_list implementation
	Closed		D3643 storage*: Simplify next-page-token computation
		D3883	rDSTO374e01cf3634 algos.diff: Add missed revision_get conversion
		D3863	rDSTO356eacd763d6 Refactor revision_get storage API to return Revision objects
		D3852	rDSTOe6fcfb931a7c storage*: release_get(...) -> List[Optional[Release]]
		D3733	rDSTOd9ff3912d5ab storage*: Rename and type content_get(List[Sha1]) -> List[Optional[Content]]
		D3715	rDSTObe9e958d6c75 in_memory: Drop dead code
		D3713	rDSTO0d72ea229ecc storage*: content_get_partition(...) -> PagedResult[Content]
		D3712	rDSTOb48d834984f7 storage*: Drop deprecated content_get_range endpoint
		D3708	rDSTOc5d63ada5f51 storage*: origin_get_by_sha1: Drop generator from pgstorage
		D3707	rDSTO760cbf6db540 storage: revision_log: Type remaining existing endpoints
		D3707	rDSTO38ee5255b0e0 storage*: revision_get: Type remaining existing endpoints
		D3707	rDSTO8b6d18ef2f9e storage*: revision_missing: Type remaining existing endpoints
		D3706	rDSTO9f214bc745ee storage*: directory_entry_get_by_path: Type remaining existing endpoints
		D3706	rDSTOf9d09527a6ba storage*: directory_ls: Type remaining existing endpoints
		D3706	rDSTOfd5fd86b11a8 storage*: directory_missing: Type remaining existing endpoints
		D3705	rDSTO5d13cd7c5451 storage*: skipped_content_missing: Type remaining existing endpoints
		D3704	rDSTO1a2aa70c687c storage*: content_missing_per_sha1_git: Type remaining existing endpoints
		D3704	rDSTO15e48633a6a7 storage*: content_missing_per_sha1: Type remaining existing endpoints
		D3703	rDSTOb62afbbbdd95 storage*: content_missing: Unify and type remaining existing endpoints
		D3702	rDSTOd6f26e45e9a6 storage*: content_get_partition: Type remaining existing endpoints
		D3701	rDSTO864473370acd storage*: content_get_range: Type remaining existing endpoints
		D3700	rDSTOc6da28289ec4 storage*: content_get: Type remaining existing endpoints
		D3699	rDSTO25ebc48198d6 storage*: content_update: Type remaining existing endpoints
		D3698	rDSTOc32e224d2399 storage*: origin_get_by_sha1: Type remaining existing endpoints
		D3697	rDSTO26ef01563f05 storage*: check_config: Type remaining existing endpoints
		D3692	rDSTO15e8c996d441 storage*: Type content_find(...) -> List[Content]
		D3687	rDSTO3c2e5a3d7d43 storage*: Type {cnt,dir,rev,rel,snp}_get_random(...) -> Sha1Git
		D3681	rDSTOaa58e1092ffb storage*: Drop origin-get-range in favor of origin-list
		D3671	rDSTO92f1183de0c8 storage*: Add type annotation to origin_count
		D3669	rDSTO3466e48be8c5 Reuse swh.core stream_results function
		D3651	rDSTOcf9f44e80532 storage*: Type origin_search(...) -> PagedResult[Origin]
		D3650	rDSTO4d52fc1d076f storage*: Adapt origin_list(...) -> PagedResult[Origin]
		D3648	rDSTO7beba93a9702 algos.snapshot: Open snapshot_id_get_from_revision
		D3641	rDSTOb81f928fa7b9 storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
		D3629	rDSTO21b77304a044 storage*: use an enum to explicit the order in origin_visit_get
		D3627	rDSTO643ebc6e7eb9 storage*: origin_visit_get(...) -> PagedResult[OriginVisit]
		D3625	rDSTO119d01e41620 storage*: origin_visit_get_by -> Optional[OriginVisit]
		D3622	rDSTO57e305e03b66 storage*: origin_visit_find_by_date -> Optional[OriginVisit]
		D3620	rDSTO5344a6ff9c40 storage*: type origin_visit_get_latest endpoint result
		D3605	rDSTO7e94767a51a7 storage*: origin_get(Iterable[str]) -> Iterable[Optional[Origin]]
		D3594	rDSTOd8583eb4685d storage*.origin_visit_get_random: Read model objects
rDMOD Data model
		D3738	rDMODb1a16b168b58 model: Add Sha1 alias
rDCORE Foundations and core functionalities
		D3664	rDCORE6d49f04a6b84 api.classes: Open swh.core.api.classes.stream_results
		D3632	rDCORE1ec4df82e477 core.api: Expose serializable PagedResult object for pagination api
rDLDSVN Subversion (SVN) loader
		D3870	rDLDSVN34565424f521 Adapt storage.revision_get calls according to latest api change
		D3609	rDLDSVNe99c68b8df9a loader: Update swh.storage.origin_get call to latest api change

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2221 Development workflow & code quality
Migrated	gitlab-migration	T2223 Type checking
Migrated	gitlab-migration	T645 Type swh-storage endpoints with swh.model objects
Migrated	gitlab-migration	T2494 tests: Use data model objects within tests (drop dicts)
Migrated	gitlab-migration	T2499 Drop the storage validate proxy used only in tests
Migrated	gitlab-migration	T2517 Add remaining missing types to swh.storage.interface

Event Timeline

ardumont mentioned this in rDSEA46af49019c22: swh.search.get_search: Simplify instantiation.Aug 3 2020, 10:38 AM

ardumont mentioned this in rDSEA56f94586de06: swh.search: Define an interface for search backends and use it.

ardumont added a commit: rDSTOaa58e1092ffb: storage*: Drop origin-get-range in favor of origin-list.Aug 3 2020, 11:53 AM

ardumont added a commit: rDSCH642620845ac2: cli.task: Migrate scheduler cli to latest storage change on iter_origins.Aug 3 2020, 12:24 PM

ardumont added a revision: D3687: storage*: Type {cnt,dir,rev,rel,snp}_get_random(...) -> Sha1Git.Aug 3 2020, 1:27 PM

ardumont added a commit: rDSTO3c2e5a3d7d43: storage*: Type {cnt,dir,rev,rel,snp}_get_random(...) -> Sha1Git.Aug 3 2020, 6:06 PM

ardumont added a revision: D3692: storage*: Type content_find(...) -> List[Content].Aug 4 2020, 10:11 AM

Current status: the endpoints related to origin, origin-visit and origin-visit-status are now done, both read and write.
What remains is to align and type the reading endpoints for the dag model objects (content, directory, revision, release, snapshot).

ardumont added a revision: D3693: service: Adapt according to the latest storage.content_find changes.Aug 4 2020, 11:15 AM

ardumont added a commit: rDSTO15e8c996d441: storage*: Type content_find(...) -> List[Content].Aug 4 2020, 1:23 PM

ardumont added a commit: rDWAPPS8f1e003bc78f: service: Adapt according to the latest storage.content_find changes.Aug 4 2020, 2:51 PM

I'll type with what we have right now, that will simplify the next diffs which introduce type changes.
But also demonstrates the inconsistencies we have right now.

Then we'll propose a better, more consistent api for the remaining endpoints, especially the {content,release,revision, ...}_get endpoints.
A temporary draft is at [1].

[1] https://hebdo.framapad.org/p/2l52re8k5w-9i73?lang=en

ardumont added a revision: D3697: storage*: check_config: Type remaining existing endpoints.Aug 4 2020, 6:00 PM

ardumont added a revision: D3698: storage*: origin_get_by_sha1: Type remaining existing endpoints.

ardumont added a revision: D3699: storage*: content_update: Type remaining existing endpoints.Aug 4 2020, 6:40 PM

ardumont added a revision: D3700: storage*: content_get: Type remaining existing endpoints.

ardumont added a revision: D3701: storage*: content_get_range: Type remaining existing endpoints.Aug 4 2020, 6:47 PM

ardumont added a revision: D3702: storage*: content_get_partition: Type remaining existing endpoints.

ardumont added a revision: D3703: storage*: content_missing: Unify and type remaining existing endpoints.

ardumont added a revision: D3704: storage*: content_missing_per_sha1(_git): Type remaining existing endpoints.Aug 4 2020, 6:58 PM

ardumont added a revision: D3705: storage*: skipped_content_missing: Type remaining existing endpoints.Aug 4 2020, 11:11 PM

ardumont added a revision: D3706: storage*: directory_*: Type remaining existing endpoints.

ardumont added a revision: D3707: storage*: revision_*: Type remaining existing endpoints.Aug 5 2020, 8:38 AM

ardumont added a revision: D3708: storage*: origin_get_by_sha1: Drop generator from pgstorage.Aug 5 2020, 9:55 AM

ardumont added a commit: rDSTO26ef01563f05: storage*: check_config: Type remaining existing endpoints.Aug 5 2020, 11:02 AM

ardumont added a commit: rDSTOc32e224d2399: storage*: origin_get_by_sha1: Type remaining existing endpoints.Aug 5 2020, 11:17 AM

ardumont added a commit: rDSTO25ebc48198d6: storage*: content_update: Type remaining existing endpoints.

ardumont added a commit: rDSTOc6da28289ec4: storage*: content_get: Type remaining existing endpoints.Aug 5 2020, 11:31 AM

ardumont added a commit: rDSTO864473370acd: storage*: content_get_range: Type remaining existing endpoints.Aug 5 2020, 11:39 AM

ardumont added a commit: rDSTOd6f26e45e9a6: storage*: content_get_partition: Type remaining existing endpoints.Aug 5 2020, 11:48 AM

ardumont added a commit: rDSTOb62afbbbdd95: storage*: content_missing: Unify and type remaining existing endpoints.Aug 5 2020, 11:54 AM

ardumont added a commit: rDSTO15e48633a6a7: storage*: content_missing_per_sha1: Type remaining existing endpoints.

ardumont added a commit: rDSTO1a2aa70c687c: storage*: content_missing_per_sha1_git: Type remaining existing endpoints.

ardumont added a commit: rDSTO5d13cd7c5451: storage*: skipped_content_missing: Type remaining existing endpoints.

ardumont added a commit: rDSTOfd5fd86b11a8: storage*: directory_missing: Type remaining existing endpoints.

ardumont added a commit: rDSTOf9d09527a6ba: storage*: directory_ls: Type remaining existing endpoints.

ardumont added a commit: rDSTO9f214bc745ee: storage*: directory_entry_get_by_path: Type remaining existing endpoints.

ardumont added a commit: rDSTO8b6d18ef2f9e: storage*: revision_missing: Type remaining existing endpoints.

ardumont added a commit: rDSTO38ee5255b0e0: storage*: revision_get: Type remaining existing endpoints.

ardumont added a commit: rDSTO760cbf6db540: storage*: revision_*log: Type remaining existing endpoints.

ardumont added a commit: rDSTOc5d63ada5f51: storage*: origin_get_by_sha1: Drop generator from pgstorage.Aug 5 2020, 12:49 PM

ardumont renamed this task from Type swh-storage endpoints to Type swh-storage endpoints with swh.model objects.Aug 5 2020, 2:06 PM

ardumont added a revision: D3712: storage*: Drop deprecated content_get_range endpoint.Aug 5 2020, 3:18 PM

ardumont added a revision: D3713: storage*: content_get_partition(...) -> PagedResult[Content].Aug 5 2020, 4:11 PM

ardumont added a commit: rDSTOb48d834984f7: storage*: Drop deprecated content_get_range endpoint.Aug 5 2020, 4:25 PM

ardumont closed subtask T2517: Add remaining missing types to swh.storage.interface as Resolved.Aug 5 2020, 5:29 PM

I'll type with what we have right now, that will simplify the next diffs which introduce type changes.
But also demonstrates the inconsistencies we have right now.

done

Then we'll propose a better consistent api for the remaining endpoints especially {content,release,revision, ...}_get

Proposition on its way.

{object}_missing for object in {content, directory, revision, release, snapshot}

Some inconsistencies in the *_missing endpoints:

def content_missing(self, contents: List[Dict[str, Any]], key_hash: str = "sha1") -> Iterable[bytes]
def content_missing_per_sha1(self, contents: List[bytes]) -> Iterable[bytes]
def content_missing_per_sha1_git(self, contents: List[Sha1Git]) -> Iterable[Sha1Git]
def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]
def revision_missing(self, revisions: List[Sha1Git]) -> Iterable[Sha1Git]
def release_missing(self, releases: List[Sha1Git]) -> Iterable[Sha1Git]
def snapshot_missing(self, snapshots: List[Sha1Git]) -> Iterable[Sha1Git]
def skipped_content_missing(self, contents: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]

The main change would be to make them return List[Sha1Git] rather than Iterable[Sha1Git], since we cannot stream results with our current rpc layer anyway.

The most inconsistent one to adapt is then skipped_content_missing:

def skipped_content_missing(self, contents: List[Sha1Git]) -> List[Sha1Git]
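To make the unified shape concrete, here is a small in-memory sketch of a `*_missing` helper that returns a fully materialized list. The `missing_ids` name, the `known` set, and the choice to keep request order and drop duplicates are all assumptions for the example; `Sha1Git` is declared locally as a stand-in alias.

```python
from typing import List, Set

Sha1Git = bytes  # local stand-in for the model alias


def missing_ids(requested: List[Sha1Git], known: Set[Sha1Git]) -> List[Sha1Git]:
    """Return the requested ids absent from `known`.

    The result is a plain list (not a generator), so it can be serialized
    in one piece. Order follows the request; duplicates are dropped.
    """
    seen: Set[Sha1Git] = set()
    result: List[Sha1Git] = []
    for obj_id in requested:
        if obj_id not in known and obj_id not in seen:
            seen.add(obj_id)
            result.append(obj_id)
    return result


missing = missing_ids([b"a", b"b", b"a", b"c"], known={b"b"})
```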

{object}_get* for object in {content, directory, revision, release, snapshot}

For the remaining part:

def content_get(self, contents: List[bytes]) -> Iterable[Optional[Dict[str, bytes]]]:
def content_get_metadata(self, contents: List[bytes]) -> Dict[bytes, List[Dict]]:
def revision_get(self, revisions: List[Sha1Git]) -> Iterable[Optional[Dict[str, Any]]]:
def release_get(self, releases: List[Sha1Git]) -> Iterable[Optional[Dict[str, Any]]]:
def snapshot_get(self, snapshot_id: Sha1Git) -> Optional[Dict[str, Any]]:
def snapshot_get_branches(
    self,
    snapshot_id: Sha1Git,
    branches_from: bytes = b"",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
) -> Optional[Dict[str, Any]]:

Proposal is to:

rename content_get to content_get_data to clarify it's the raw data we are interested in:

def content_get_data(self, content_id: Sha1Git) -> Optional[bytes]:

rename content_get_metadata to content_get to clarify it's the content object we are interested in:

def content_get(self, contents: List[bytes]) -> List[Optional[Content]]

Use the data model for:

def revision_get(self, revisions: List[Sha1Git]) -> List[Optional[Revision]]:
def release_get(self, releases: List[Sha1Git]) -> List[Optional[Release]]:
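One property of a List[Optional[...]] return type is that results can line up positionally with the requested ids, with None marking unknown objects. The sketch below shows that reading with a toy `Revision` dataclass and a dict-backed store; both are stand-ins invented for the example, not the swh.model classes or the storage backend.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Sha1Git = bytes  # local stand-in for the model alias


@dataclass(frozen=True)
class Revision:
    """Toy stand-in for a model object; the real class has more fields."""

    id: Sha1Git
    message: bytes


def revision_get(
    store: Dict[Sha1Git, Revision], revisions: List[Sha1Git]
) -> List[Optional[Revision]]:
    """Return one entry per requested id, None where the id is unknown."""
    return [store.get(rev_id) for rev_id in revisions]


store = {b"r1": Revision(id=b"r1", message=b"first")}
found = revision_get(store, [b"r1", b"r2"])
```

Callers can then zip the request with the result instead of rebuilding a lookup by key.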

drop snapshot_get because, for some snapshots, we cannot retrieve them in full anyway.

Instead, rely on snapshot_get_branches to retrieve an
Optional[PagedResult[SnapshotBranch]] for a given snapshot. Optional because
the snapshot_id might not exist:

def snapshot_get_branches(
    self,
    snapshot_id: Sha1Git,
    branches_from: bytes = b"",
    page_token: Optional[str] = None,
    limit: int = 1000,
    target_types: Optional[List[str]] = None,  # <- drop?
) -> Optional[PagedResult[SnapshotBranch]]:
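For readers unfamiliar with token-based pagination, here is a self-contained sketch of how a caller could walk such pages. The `PagedResult` dataclass below is a local stand-in with `results` and `next_page_token` fields, and `get_page`, `iter_all`, and the use of a stringified offset as the token are assumptions for the example. The Optional wrapper for a missing snapshot is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class PagedResult(Generic[T]):
    """Local stand-in: one page of results plus a token for the next page."""

    results: List[T] = field(default_factory=list)
    next_page_token: Optional[str] = None


def get_page(
    items: List[T], page_token: Optional[str] = None, limit: int = 1000
) -> PagedResult[T]:
    """Hypothetical paginated endpoint over an in-memory list."""
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    token = str(end) if end < len(items) else None  # None means last page
    return PagedResult(results=items[start:end], next_page_token=token)


def iter_all(fetch: Callable[[Optional[str]], PagedResult[T]]) -> Iterator[T]:
    """Call `fetch` repeatedly, following tokens until there are no more."""
    token: Optional[str] = None
    while True:
        page = fetch(token)
        yield from page.results
        token = page.next_page_token
        if token is None:
            return


branches = [f"refs/heads/b{i}" for i in range(5)]
first_page = get_page(branches, limit=2)
everything = list(iter_all(lambda token: get_page(branches, token, limit=2)))
```

The token stays opaque to the caller, which leaves the backend free to encode an offset, a last-seen branch name, or anything else.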

specific endpoints

directory_ls

def directory_ls(self, directory: Sha1Git, recursive: bool = False) -> Iterable[Dict[str, Any]]:

to:

def directory_ls(self, directory: Sha1Git, recursive: bool = False) -> Optional[PagedResult[DirectoryEntry]]

Note: Optional because the directory might not exist either (or should it raise if it
does not exist?)

revision_*log

def revision_log(self, revisions: List[Sha1Git], limit: Optional[int] = None) -> Iterable[Optional[Dict[str, Any]]]:

to:

def revision_log(self, revisions: List[Sha1Git], limit: Optional[int] = None) -> PagedResult[Revision]:

Note: Raise if some root revisions do not exist, to keep the interface clearer.

For revision_shortlog, we probably need a new type or an alias or we keep it as is, ymmv:

ardumont added a commit: rDSTO0d72ea229ecc: storage*: content_get_partition(...) -> PagedResult[Content].Aug 5 2020, 5:59 PM

ardumont added a revision: D3715: in_memory: Drop dead code.Aug 5 2020, 8:02 PM

ardumont added a commit: rDSTObe9e958d6c75: in_memory: Drop dead code.Aug 5 2020, 8:43 PM

ardumont added a revision: D3718: "text"-indexers: Migrate to partition index instead of range.Aug 6 2020, 9:50 AM

ardumont added a commit: rDCIDX3aedf90c9cb9: textual-indexers: Migrate to partition index instead of range.Aug 6 2020, 1:08 PM

ardumont added a revision: D3733: storage*: Rename and type content_get(List[Sha1]) -> List[Optional[Content]].Aug 7 2020, 12:45 AM

ardumont added a revision: D3734: indexer.rehash: Adapt content_get_metadata call to content_get.Aug 7 2020, 12:53 AM

ardumont added a revision: D3735: test_npm: Adapt content_get_metadata call to content_get.Aug 7 2020, 12:58 AM

ardumont added a revision: D3737: Adapt type and rename content_get_metadata calls to content_get.Aug 7 2020, 6:40 AM

ardumont added a revision: D3738: model: Add Sha1 alias.Aug 7 2020, 9:54 AM

ardumont added a commit: rDMODb1a16b168b58: model: Add Sha1 alias.Aug 7 2020, 10:01 AM

ardumont mentioned this in rDSTObfa8f46ea44d: storage*: Rename content_get_data(Sha1) -> Optional[bytes].Aug 7 2020, 12:35 PM

ardumont added a commit: rDSTOd9ff3912d5ab: storage*: Rename and type content_get(List[Sha1]) -> List[Optional[Content]].

vlorentz mentioned this in D3740: Make snapshot_get_branches return a TypedDict containing SnapshotBranch objects..Aug 7 2020, 2:04 PM

vlorentz mentioned this in D3741: Update for the new type of snapshot_get_branches in swh-storage > 0.13.0.Aug 7 2020, 2:26 PM

ardumont added a commit: rDLDBASE57d3e372fad8: test_npm: Adapt content_get_metadata call to content_get.Aug 7 2020, 2:53 PM

ardumont added a commit: rDCIDXe45d76d12dd8: indexer.rehash: Adapt content_get_metadata call to content_get.Aug 7 2020, 3:00 PM

ardumont added a commit: rDWAPPS478a7acb46e0: Adapt type and rename content_get_metadata calls to content_get.Aug 7 2020, 3:02 PM

vlorentz mentioned this in rDSTO4918759fc843: Make snapshot_get_branches return a TypedDict containing SnapshotBranch objects..Aug 7 2020, 6:13 PM

what's missing for this task to be closed?

Well, storage is typed now but some endpoints remain inconsistent (as T645#47156 spells out with a unification proposal which is not done yet, aside from content_get_data and content_get_metadata).

So we could close it but open a new one about making the types consistent... (which I'm trying to get back to).

ardumont added a revision: D3852: storage*: release_get(...) -> List[Optional[Release]].Aug 31 2020, 3:42 PM

ardumont added a revision: D3853: test_loader: Adapt to latest storage release_get change.Aug 31 2020, 3:49 PM

ardumont added a revision: D3854: swh.web: Adapt to latest storage release_get api change.Aug 31 2020, 4:44 PM

ardumont added a commit: rDSTOe6fcfb931a7c: storage*: release_get(...) -> List[Optional[Release]].Sep 1 2020, 2:27 PM

ardumont added a commit: rDLDHG3baeea1fab5e: test_loader: Adapt to latest storage release_get change.Sep 1 2020, 2:51 PM

ardumont added a commit: rDWAPPS9c41d9f90f58: Adapt to latest storage release_get api change.Sep 1 2020, 3:11 PM

ardumont added a revision: D3863: Refactor revision_get storage API to return Revision objects.Sep 2 2020, 3:25 PM

ardumont added a revision: D3864: migrations: Adapt according to latest storage revision_get api change.Sep 2 2020, 3:49 PM

ardumont added a revision: D3865: metadata: Adapt to latest storage revision_get change.Sep 2 2020, 4:08 PM

ardumont added a revision: D3868: loader: Adapt to latest storage revision_get change.Sep 2 2020, 6:39 PM

ardumont added a revision: D3869: test_loader: Adapt to latest storage revision_get change.Sep 3 2020, 1:20 PM