Store content -> revision cache in azure table storage
Closed, Migrated, Edits Locked

Description

We've been hitting PostgreSQL limitations for storage of the content -> revision cache. Azure table storage looks like a relevant candidate to store that cache.

Table storage provides a schemaless storage API which uses a compound primary key containing a PartitionKey and a RowKey, clustering on PartitionKeys and ordering queries on RowKeys. Each entry can have up to 255 properties and weigh up to 1MB.

A good candidate for PartitionKey would be the content identifier (well distributed except for corner cases).
We need to figure out a RowKey that's intrinsic to the line provided (properties: revision identifier, path) and gives us a relevant ordering for files with multiple entries.

Limitations:

PartitionKey and RowKey are strings. Control characters aren't allowed in them, and neither are '/', '\', '#' and '?'. Better to stick to some kind of ASCII, I suppose. Both can be up to 1KB in size.
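
A minimal sketch of a check for these limitations, in Python. The name `is_valid_key` and the constant `MAX_KEY_SIZE` are made up for illustration. Counting the 1KB limit in ASCII characters is an assumption, as is restricting keys to ASCII (Azure itself accepts more than that).

```python
import re

# Characters Azure Table storage rejects in PartitionKey / RowKey values:
# '/', '\', '#', '?', and the control ranges U+0000-U+001F and U+007F-U+009F.
_FORBIDDEN = re.compile(r"[/\\#?\x00-\x1f\x7f-\x9f]")

# The 1KB limit from the task description; counting it in ASCII
# characters is an assumption of this sketch.
MAX_KEY_SIZE = 1024


def is_valid_key(key: str) -> bool:
    """Return True if `key` looks usable as a PartitionKey or RowKey."""
    if not key.isascii():
        # Self-imposed restriction: we only want ASCII keys.
        return False
    if len(key) > MAX_KEY_SIZE:
        return False
    return _FORBIDDEN.search(key) is None
```

Hex-encoded identifiers and decimal timestamps joined with '_' pass this check, which is one reason to prefer them over raw bytes or paths in the keys.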

Event Timeline

We need to generate RowKeys that are:

  • Intrinsic / reproducible
  • Sorted in a relevant order for queries with lots of results (think kernel entries)

I think ordering first by "revision date" would help surface relevant results, so we could imagine using a compound RowKey such as:

{revision timestamp}_{revision identifier}_{ordering number in revision}
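
A rough sketch of how such a key could be built, in Python. RowKeys are strings, so they sort as strings; for that order to match the revision date, the numeric parts need a fixed width. The function name `make_row_key`, the 12-digit and 6-digit paddings, and the use of a UTC Unix timestamp are all assumptions made for this example, not something decided in this task.

```python
from datetime import datetime, timezone


def make_row_key(revision_date: datetime, revision_id: bytes, position: int) -> str:
    """Build {revision timestamp}_{revision identifier}_{ordering number}."""
    if revision_date.tzinfo is None:
        raise ValueError("revision_date must be timezone-aware")
    ts = int(revision_date.astimezone(timezone.utc).timestamp())
    if ts < 0 or position < 0:
        # A leading '-' would break the string ordering; pre-1970 dates
        # would need a different encoding, left out of this sketch.
        raise ValueError("negative timestamps and positions are not supported")
    # Zero-padding keeps string order identical to numeric order.
    return f"{ts:012d}_{revision_id.hex()}_{position:06d}"
```

Without the padding, a timestamp of 1000 would sort before 999 as a string, so the key would stay reproducible but would not give the ordering we want.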

So, after discussion with @olasd:

  • PartitionKey: content ID
  • RowKey: {revision_timestamp}_{revision ID} (without the trailing sequence number)

The invariant is that each PartitionKey/RowKey compound key is associated with a composite value: the list of paths that, in a specific revision, point to content with a specific ID.
Ideally, the list of paths should be complete (i.e., no path is left out), but we might decide to trim it arbitrarily if we run into crazy cases.
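
To make the layout concrete, here is a sketch in Python of what one entry could look like under this scheme. Everything beyond PartitionKey and RowKey is an assumption for illustration: the property names `paths` and `truncated`, the JSON encoding of the list, the `MAX_PATHS` cap, the 12-digit timestamp padding, and paths being given as `str`. No Azure SDK call is involved; the function only builds a plain dict.

```python
import json
from typing import Any, Dict, Iterable

# Hypothetical cap for the "crazy cases"; the real value is undecided.
MAX_PATHS = 1000


def make_entity(content_id: bytes, revision_ts: int, revision_id: bytes,
                paths: Iterable[str]) -> Dict[str, Any]:
    """Build one entry: all paths to a content within a given revision."""
    # Deduplicate and sort so the same input always yields the same entry.
    unique_paths = sorted(set(paths))
    kept = unique_paths[:MAX_PATHS]
    return {
        "PartitionKey": content_id.hex(),
        "RowKey": f"{revision_ts:012d}_{revision_id.hex()}",
        # The list is stored as one JSON string property.
        "paths": json.dumps(kept),
        # Records whether the list is complete or was trimmed.
        "truncated": len(kept) < len(unique_paths),
    }
```

Paths can contain '/', which is fine here because they live in a property value and not in either key. Keeping a `truncated` flag would let readers of the cache tell a complete list from a trimmed one.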

zack lowered the priority of this task from High to Low. Feb 12 2017, 6:17 PM
zack added a subscriber: grouss.

We're taking a different route for this now, based on @grouss's WIP.