git-loading-design.txt
No OneTemporary
Actions

Size

5 KB

Subscribers

None

git-loading-design.txt
View Options

	Design considerations
	=====================

	* Caching: our storage contains two main parts: a file storage, and a git
	object storage. Both parts are accessible as key-value storage. Whenever
	possible we want to avoid checking in content that is provably already in
	there.

	* Concurrency: our storage will be accessed concurrently for both read and
	write purposes. In particular for writing, it is possible that multiple
	workers will be in the process, at the same time, of loading into our storage
	different git repositories that have a lot of overlap, if not completely
	identical. Whenever possible they should be able to benefit from each other
	work, collaborating de facto.

	* Robustness: workers that load content into our storage might crash before
	completion. Wherever possible, the work done before completion should be
	preserved by the storage. Eventually another worker (possibly the same as
	before) will pickup the same git repository, try it again, and drive it to
	completion.

	* Ctime and Atimes: for every piece of content we are interested in both
	creation time ("ctime", i.e., the first time we have seen a given content and
	added it to our storage) and access times ("atime", i.e., every time we see
	the same content elsewhere we want to be able to store the fact we have
	seen it again, and again, and again...).

	* Transactionality: every content addition should be transactional: only
	after having stored the content in full we will tell the world (and other
	workers) that the content is available. (Without locking, which is desirable)
	This might result in temporary races where multiple workers trying to add the
	same content without knowing of each other---this situation should be handled
	gracefully.

	Transactionality should apply across different storage media: in particular
	the filesystem used to store file content and the DB used to store the
	corresponding metadata should cooperate. It is OK for the filesystem to have
	content that is not indexed in the DB; but for all purposes that should be
	equivalent to not having stored the content at all.


	Git traversal
	=============

	To load the whole content of a git repo in our storage we need to traverse the
	git object graph, and inject every single version of every single file we find.
	We discuss below how we should traverse the git graph to that end.

	For the sake of conciseness we do not distinguish git object types. The actual
	code does need to treat differently different kind of git objects though (and
	in particular commits -> trees -> and blobs); see the implementation for
	details about this.


	Top-down
	--------

	* Top-down, topological traversal of the git object graph starting from the
	current refs is optimal from the point of view of caching. Once a given
	object is found in the cache we know that we have already loaded it in the
	storage and we do not need to treat its descendants any further.

	* Top-down however is not good for robustness. If we store the current node
	before its descendants and the loading fails to complete, in the future we
	will believe to have stored all its descendants whereas we have note. FAIL.

	Conclusion: pure top-down traversal is bad for us.


	Bottom-up
	---------

	* Bottom-up, topological traversal is good for robustness. Once we reach the
	top we know we have stored all its descendants, so in the future we can
	benefit from caching.

	* However, bottom-up is bad for caching. If we always treat descendants before
	ancestors we will benefit from caching only at the level of individual
	objects, and never at the level of whole subgraphs.

	Conclusion: pure bottom-up traversal is OK, but does not allow to benefit from
	subgraph caching.


	Mixed top-down/bottom-up
	------------------------

	To get the best of both worlds we need a mixed approach, something like
	(pseudocode):

	let rec load_git_graph node =
	if not (node in storage) then
	for node' in children(node)
	load_git_graph(node')
	add_to_storage(node, storage)

	Note: non tail-recursive.

	Conclusion: the above offers both robustness w.r.t. loading crashes and
	subgraph caching.


	Atime maintenance
	-----------------

	Bad news: with the mixed approach it's easy to maintain ctimes, but atimes
	cannot be maintained (because we do not visit at all subgraphs). More
	generally: subgraph caching or atime maintenance <- choose one.

	If we do want to maintain atimes (at this level---as opposed to, say, do that
	separately from git repo loading) we need to give up on subgraph caching. If we
	do that, top-down vs bottom-up doesn't really matter.


	Cross file system + DB transactions
	===================================

	To ensure file system DB transaction, to add a single file to our storage we
	proceed as follows, where KEY is the key of the file to be added, and WORKER_ID
	the unique identifier of the worker that is updating the storage:

	1. BEGIN TRANSACTION
	2. create file KEY.WORKER_ID, overwriting destination if needed
	3. write file content to KEY.WORKER_ID
	4. rename(KEY.WORKER_ID, KEY), overwriting destination if needed
	5. INSERT KEY INTO ...
	6. COMMIT

	any error in the above would cause a transaction ABORT.

	Failure scenarios (that should all be handled properly by the above protocol):

	* worker crash, at any moment during the above
	* parallel execution, resulting in one worker failing due to key duplication
	upon step (5) or (6)

File Metadata

Mime Type: text/plain
Expires: Thu, Jul 3, 11:45 AM (4 d, 15 h ago)
Storage Engine: blob
Storage Format: Raw Data
Storage Handle: 3362435

git-loading-design.txtNo OneTemporaryActions

git-loading-design.txtView Options

File Metadata

Event Timeline

git-loading-design.txt
No OneTemporary
Actions

git-loading-design.txt
View Options