r+d-challenges.org
#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
# R&D challenges
#+INCLUDE: "prelude.org" :minlevel 1
* R&D challenges
:PROPERTIES:
:CUSTOM_ID: main
:END:
** Data model
*** The real world sucks
- corrupted repositories
- takedown notices
- partial irrecoverable data losses
#+BEAMER: \pause
*** /Incomplete/ Merkle DAGs
- nodes can go missing at archival time or disappear later on
- top-level hash(es) no longer capture the full state of the archive
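The problem can be illustrated with a toy Merkle scheme (hypothetical helper, not Software Heritage code): once a child node is lost, the stored root hash still /names/ the full state, but can no longer be recomputed from, and hence no longer certifies, what is actually held.

```python
import hashlib

def node_hash(payload: bytes, child_hashes: list) -> str:
    """Hash a node over its payload plus its children's hashes (toy Merkle scheme)."""
    h = hashlib.sha1()
    h.update(payload)
    for c in sorted(child_hashes):
        h.update(bytes.fromhex(c))
    return h.hexdigest()

# two leaf contents and a directory pointing at both
leaf_a = node_hash(b"content A", [])
leaf_b = node_hash(b"content B", [])
root = node_hash(b"dir", [leaf_a, leaf_b])

# if leaf_b is lost (takedown, corruption), the root recomputed from
# what we still hold no longer matches the archived root hash
recomputed = node_hash(b"dir", [leaf_a])
assert recomputed != root
```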
#+BEAMER: \pause
*** Open questions
- how do we then capture the full state of the archive?
- how do we efficiently check whether something is to be re-archived?
- ultimately, what is our notion of having "fully archived" something?

** Storage
*** Archive stats
- as a graph: ~10 B nodes, ~100 B edges
- nodes breakdown: ~40% contents, ~40% directories, ~10% commits
- content size: ~400 TB (raw), ~200 TB compressed (content by content)
- median compressed size: 3 KB
- i.e., *a lot of very small files*
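A back-of-envelope check on these numbers (assuming ~40% of the ~10 B nodes are contents, as in the breakdown above) shows how skewed the size distribution is:

```python
nodes = 10e9
contents = 0.40 * nodes                      # ~4 B content objects
raw_tb, compressed_tb = 400, 200

mean_raw = raw_tb * 1e12 / contents          # mean raw bytes per content
mean_compressed = compressed_tb * 1e12 / contents

print(f"mean raw size:        ~{mean_raw / 1e3:.0f} KB")
print(f"mean compressed size: ~{mean_compressed / 1e3:.0f} KB")
# mean compressed (~50 KB) vs. median compressed (3 KB): a long tail of
# large files dominates the volume while most objects are tiny --
# hence "a lot of very small files"
```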
#+BEAMER: \pause
*** Current storage solution (unsatisfactory)
- contents: ad hoc object storage with multiple backends
- file-system, Azure, AWS, etc.
- rest of the graph: Postgres (~6 TB)
- rationale: recursive queries to traverse the graph
- (no, it doesn't work at this scale)
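The recursive-query approach can be sketched with SQLite's recursive CTEs (the same mechanism as Postgres's ~WITH RECURSIVE~; table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
db.executemany("INSERT INTO edge VALUES (?, ?)",
               [("rev1", "dir1"), ("dir1", "fileA"),
                ("dir1", "dir2"), ("dir2", "fileB")])

# collect everything reachable from rev1 -- fine on a toy graph,
# not at 100 B edges
reachable = db.execute("""
    WITH RECURSIVE closure(node) AS (
        VALUES ('rev1')
        UNION
        SELECT e.dst FROM edge e JOIN closure c ON e.src = c.node
    )
    SELECT node FROM closure
""").fetchall()
print(sorted(n for (n,) in reachable))
# each recursion step is a join against the full edge table, whose
# working set at archive scale fits in no cache
```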
** Storage (cont.)
*** Requirements
- long-term storage
- suitable for distribution/replica
- suitable for scale-out processing
#+BEAMER: \pause
*** Graph
- early experiences with Ceph (RADOS)
- not a good fit out of the box
- 7x size increase over the target retention policy, due to the large
  minimum chunk size (64 KB)
- ad-hoc object packing (?)
- .oO( do we really have to re-invent a file-system? )
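One way around the 64 KB minimum allocation is to pack many small objects into one larger blob with an offset index; a minimal sketch, not Ceph-specific:

```python
def pack(objects: dict) -> tuple:
    """Concatenate small objects into one blob; index maps id -> (offset, length)."""
    blob, index, offset = bytearray(), {}, 0
    for obj_id, data in objects.items():
        index[obj_id] = (offset, len(data))
        blob += data
        offset += len(data)
    return bytes(blob), index

def unpack(blob: bytes, index: dict, obj_id: str) -> bytes:
    off, length = index[obj_id]
    return blob[off:off + length]

# three ~3 KB objects, i.e. the median content size above
objs = {"a": b"x" * 3000, "b": b"y" * 2500, "c": b"z" * 1000}
blob, index = pack(objs)
assert unpack(blob, index, "b") == b"y" * 2500
# one 6.5 KB blob instead of three 64 KB-minimum chunks (192 KB):
# packing is what recovers the ~7x blowup
```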
** Storage (cont.)
*** Contents --- size considerations
- a few hundred TB is not /that/ big, but it cuts off volunteer mirrors
#+BEAMER: \pause
*** Content compression
- low compression ratio (~2x) with 1-by-1 compression
- typical Git/VCS packing heuristics do not work here, because contents
occur in many different contexts
- early experiences with Rabin-style compression & co. were unsatisfactory
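The gap between 1-by-1 and grouped compression can be demonstrated with zlib on synthetic, near-identical contents (illustrative only; the hard part at archive scale is /finding/ which contents to group, since Git's similarity-sorting heuristic assumes a single repository context):

```python
import zlib

# many near-identical contents, as when the same file recurs across repos
base = bytes(range(256)) * 40                    # ~10 KB pseudo-content
contents = [base + bytes([i]) for i in range(100)]

one_by_one = sum(len(zlib.compress(c)) for c in contents)
together = len(zlib.compress(b"".join(contents)))

print(one_by_one, together)
# compressing similar contents within one zlib window captures the
# cross-content redundancy that 1-by-1 compression cannot see
assert together < one_by_one
```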
#+BEAMER: \pause
*** Distributed archival
- massively distributed archival (e.g., P2P) would be nice
- but most P2P technologies behave more like CDNs than archives and offer
  no retention guarantees (e.g., self-healing)
** Efficient graph processing
*** Use cases
- Vault: recursive visits to collect archived objects
- Provenance: single-destination shortest path
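Both use cases are plain graph traversals; a minimal sketch of the Vault-style recursive visit over an in-memory successor map (toy data; the provenance case is the same traversal on reversed edges):

```python
from collections import deque

# toy successor map: revision -> directories -> contents
succ = {
    "rev1": ["dir1"],
    "dir1": ["cnt1", "dir2"],
    "dir2": ["cnt2"],
    "cnt1": [], "cnt2": [],
}

def recursive_visit(root: str) -> set:
    """Collect every object reachable from root (Vault use case)."""
    seen, todo = {root}, deque([root])
    while todo:
        node = todo.popleft()
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

print(sorted(recursive_visit("rev1")))
# → ['cnt1', 'cnt2', 'dir1', 'dir2', 'rev1']
```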
#+BEAMER: \pause
*** Technology
- beyond the capabilities of off-the-shelf graph DBs
- graph topology: scale-free, but not small world
- /probably/ a bad fit for Pregel/Chaos/etc.
- are webgraph-style compression techniques suitable for storing and
  processing the Merkle DAG in memory? (unclear)
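The webgraph techniques in question boil down to storing each adjacency list as sorted gaps in a variable-length encoding (plus reference compression, omitted here); a minimal sketch with hypothetical helper names:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative int as LEB128-style variable-length bytes."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_adjacency(neighbors: list) -> bytes:
    """Sort, then store successive gaps instead of absolute node ids."""
    prev, out = 0, bytearray()
    for n in sorted(neighbors):
        out += varint(n - prev)
        prev = n
    return bytes(out)

# node ids are large, but gaps between sorted neighbors are small
neighbors = [1_000_000, 1_000_003, 1_000_017, 1_000_020]
compact = encode_adjacency(neighbors)
naive = 4 * 8                    # four 64-bit ids
print(len(compact), naive)       # 6 bytes vs. 32
assert len(compact) < naive
```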
** Provenance tracking :noexport:
- TODO
