Store content -> revision cache in azure table storage
Closed, Migrated; Edits Locked

Authored By

	olasd
	Nov 7 2016, 4:04 PM

Description

We've been hitting PostgreSQL limitations for storage of the content -> revision cache. Azure table storage looks like a relevant candidate to store that cache.

Table storage provides a schemaless storage API which uses a compound primary key made of a PartitionKey and a RowKey: entries are clustered by PartitionKey, and query results are ordered by RowKey. Each entry can have up to 255 properties and be up to 1MB in size.

A good candidate for PartitionKey would be the content identifier (well distributed except for corner cases).
We need to figure out a RowKey that's intrinsic to the row being stored (properties: revision identifier, path), and that gives us a relevant ordering for files with multiple entries.

Limitations:

PartitionKey and RowKey are strings, and a bunch of control characters aren't allowed. Better use some kind of ASCII I suppose. Both can be up to 1KB in size.
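To make the key limitations concrete, here is a minimal sketch of a key check. The forbidden character set ('/', '\', '#', '?' and the control ranges) is the one listed in the table service data model page linked below. The function name `is_valid_key` is hypothetical, the ASCII-only rule is just the suggestion above, and measuring the 1KB limit in UTF-16 bytes is a conservative assumption, not something established in this task.

```python
import re

# Characters rejected in PartitionKey / RowKey values:
# '/', '\', '#', '?', and control characters U+0000-U+001F, U+007F-U+009F.
_FORBIDDEN = re.compile(r"[/\\#?\x00-\x1f\x7f-\x9f]")

# Assumption: count the 1KB limit in UTF-16 bytes, which is the stricter reading.
MAX_KEY_BYTES = 1024


def is_valid_key(key: str) -> bool:
    """Return True if `key` looks safe to use as a PartitionKey or RowKey."""
    if not key.isascii():
        # Stick to ASCII, as suggested above.
        return False
    if _FORBIDDEN.search(key):
        return False
    return len(key.encode("utf-16-le")) <= MAX_KEY_BYTES
```

Hex-encoded content identifiers pass this check trivially. Raw paths would not (they contain '/'), which is one reason to keep paths out of the keys.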

Resources:

Azure table storage patterns: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
How to use Azure Table Storage from Python: https://azure.microsoft.com/en-us/documentation/articles/storage-python-how-to-use-table-storage/
Understanding the table service data model: https://msdn.microsoft.com/en-us/library/dd179338.aspx
Scalable partitioning strategy for azure table storage: https://msdn.microsoft.com/en-us/library/hh508997.aspx

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T547 Azure prototype: Content provenance information API
Migrated	gitlab-migration	T598 Store content -> revision cache in azure table storage

Event Timeline

olasd created this task. Nov 7 2016, 4:04 PM

olasd updated the task description. Nov 7 2016, 4:12 PM

We need to generate RowKeys that:

Are intrinsic / reproducible
Give a relevant sort order for queries with lots of results (think kernel entries)

I think ordering first by "revision date" would help give relevant results, so we could imagine using a compound RowKey such as:

{revision timestamp}_{revision identifier}_{ordering number in revision}
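A rough sketch of how such a key could be built. Since RowKeys are strings and compared as strings, the numeric fields need a fixed width for the sort order to follow the revision date. The function name `make_row_key`, the field widths, and the choice of a non-negative Unix timestamp in seconds are all assumptions made for illustration.

```python
def make_row_key(revision_timestamp: int, revision_id: str, order: int) -> str:
    """Build a RowKey that sorts by revision date, then revision, then order.

    Assumes a non-negative Unix timestamp in seconds and a hex revision ID.
    """
    if revision_timestamp < 0:
        raise ValueError("negative timestamps would break the string ordering")
    # Zero-pad the numbers so that string comparison matches numeric comparison.
    return f"{revision_timestamp:010d}_{revision_id}_{order:06d}"
```

Without the padding, a timestamp of 999 would sort after 1478534640 as a string, which defeats the purpose of leading with the date.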

So, after discussion with @olasd:

PartitionKey: content ID
RowKey: {revision_timestamp}_{revision ID} (without trailing sequential no)

With the invariant that each PartitionKey/RowKey compound key is associated with a composite value: the list of paths that, in that specific revision, point to the content with that specific ID.
Ideally, the list of paths should be complete (i.e., no path is left out). But we might decide to trim it arbitrarily if we end up with crazy cases.
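As a sketch of what one entry could look like under that scheme: the path list is serialized into a single property and trimmed when it gets too large. Everything beyond PartitionKey and RowKey is hypothetical here, including the function name `make_entity`, the `paths` and `complete` property names, the JSON encoding, and the `max_chars` cap. Paths are assumed to be already decoded to strings.

```python
import json


def make_entity(content_id, revision_timestamp, revision_id, paths, max_chars=30000):
    """Build one table entry: all paths of `content_id` within one revision."""
    kept = sorted(paths)
    # Trim arbitrarily (from the end) until the serialized list fits the cap.
    while kept and len(json.dumps(kept)) > max_chars:
        kept.pop()
    return {
        "PartitionKey": content_id,
        # Zero-padded timestamp so string order follows revision date.
        "RowKey": f"{revision_timestamp:010d}_{revision_id}",
        "paths": json.dumps(kept),
        # Records whether the invariant "no path is left out" still holds.
        "complete": len(kept) == len(paths),
    }
```

Keeping a `complete` flag next to the trimmed list lets readers of the cache tell a full answer from a truncated one.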

zack lowered the priority of this task from High to Low. Feb 12 2017, 6:17 PM

We're taking a different route for this now, based on @grouss's WIP.

This task has been migrated to GitLab.
