r+d-challenges.org
#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
# R&D challenges
#+INCLUDE: "prelude.org" :minlevel 1
* R&D challenges
:PROPERTIES:
:CUSTOM_ID: main
:END:
** Data model
*** The real world sucks
- corrupted repositories
- takedown notices
- partial irrecoverable data losses
#+BEAMER: \pause
*** /Incomplete/ Merkle DAGs
- nodes can go missing at archival time or disappear later on
- top-level hash(es) no longer capture the full state of the archive
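The problem can be illustrated with a toy Merkle scheme (hypothetical helper, not Software Heritage code): once a child node is lost, the stored root hash still /names/ the full state, but can no longer be recomputed from, and hence no longer certifies, what is actually held.

```python
import hashlib

def node_hash(payload: bytes, child_hashes: list) -> str:
    """Hash a node over its payload plus its children's hashes (toy Merkle scheme)."""
    h = hashlib.sha1()
    h.update(payload)
    for c in sorted(child_hashes):
        h.update(bytes.fromhex(c))
    return h.hexdigest()

# two leaf contents and a directory pointing at both
leaf_a = node_hash(b"content A", [])
leaf_b = node_hash(b"content B", [])
root = node_hash(b"dir", [leaf_a, leaf_b])

# if leaf_b is lost (takedown, corruption), the root recomputed from
# what we still hold no longer matches the archived root hash
recomputed = node_hash(b"dir", [leaf_a])
assert recomputed != root
```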
#+BEAMER: \pause
*** Open questions
- how do we then capture the full state of the archive?
- how do we efficiently check whether something is to be re-archived?
- ultimately, what is our notion of having "fully archived" something?

** Storage
*** Archive stats
- as a graph: ~10 B nodes, ~100 B edges
- nodes breakdown: ~40% contents, ~40% directories, ~10% commits
- content size: ~400 TB (raw), ~200 TB compressed (content by content)
- median compressed size: 3 KB
- i.e., *a lot of very small files*
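A back-of-envelope check on these numbers (assuming ~40% of the ~10 B nodes are contents, as in the breakdown above) shows how skewed the size distribution is:

```python
nodes = 10e9
contents = 0.40 * nodes                      # ~4 B content objects
raw_tb, compressed_tb = 400, 200

mean_raw = raw_tb * 1e12 / contents          # mean raw bytes per content
mean_compressed = compressed_tb * 1e12 / contents

print(f"mean raw size:        ~{mean_raw / 1e3:.0f} KB")
print(f"mean compressed size: ~{mean_compressed / 1e3:.0f} KB")
# mean compressed (~50 KB) vs. median compressed (3 KB): a long tail of
# large files dominates the volume while most objects are tiny --
# hence "a lot of very small files"
```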
#+BEAMER: \pause
*** Current storage solution (unsatisfactory)
- contents: ad hoc object storage with multiple backends
- file-system, Azure, AWS, etc.
- rest of the graph: Postgres (~6 TB)
- rationale: recursive queries to traverse the graph
- (no, it doesn't work at this scale)
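The recursive-query approach can be sketched with SQLite's recursive CTEs (the same mechanism as Postgres's ~WITH RECURSIVE~; table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
db.executemany("INSERT INTO edge VALUES (?, ?)",
               [("rev1", "dir1"), ("dir1", "fileA"),
                ("dir1", "dir2"), ("dir2", "fileB")])

# collect everything reachable from rev1 -- fine on a toy graph,
# not at 100 B edges
reachable = db.execute("""
    WITH RECURSIVE closure(node) AS (
        VALUES ('rev1')
        UNION
        SELECT e.dst FROM edge e JOIN closure c ON e.src = c.node
    )
    SELECT node FROM closure
""").fetchall()
print(sorted(n for (n,) in reachable))
# each recursion step is a join against the full edge table, whose
# working set at archive scale fits in no cache
```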
** Storage (cont.)
*** Requirements
- long-term storage
- suitable for distribution/replica
- suitable for scale-out processing
#+BEAMER: \pause
*** Graph
- early experiences with Ceph (RADOS)
- not a good fit out of the box
- 7x size increase over the target retention policy, due to the large
  minimum chunk size (64 KB)
- ad-hoc object packing (?)
- .oO( do we really have to re-invent a file-system? )
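One way around the 64 KB minimum allocation is to pack many small objects into one larger blob with an offset index; a minimal sketch, not Ceph-specific:

```python
def pack(objects: dict) -> tuple:
    """Concatenate small objects into one blob; index maps id -> (offset, length)."""
    blob, index, offset = bytearray(), {}, 0
    for obj_id, data in objects.items():
        index[obj_id] = (offset, len(data))
        blob += data
        offset += len(data)
    return bytes(blob), index

def unpack(blob: bytes, index: dict, obj_id: str) -> bytes:
    off, length = index[obj_id]
    return blob[off:off + length]

# three ~3 KB objects, i.e. the median content size above
objs = {"a": b"x" * 3000, "b": b"y" * 2500, "c": b"z" * 1000}
blob, index = pack(objs)
assert unpack(blob, index, "b") == b"y" * 2500
# one 6.5 KB blob instead of three 64 KB-minimum chunks (192 KB):
# packing is what recovers the ~7x blowup
```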
** Storage (cont.)
*** Contents --- size considerations
- a few hundred TB is not /that/ big, but it cuts off volunteer mirrors
#+BEAMER: \pause
*** Content compression
- low compression ratio (~2x) with 1-by-1 compression
- typical Git/VCS packing heuristics do not work here, because contents
occur in many different contexts
- early experiences with Rabin-style compression & co. were unsatisfactory
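The gap between 1-by-1 and grouped compression can be demonstrated with zlib on synthetic, near-identical contents (illustrative only; the hard part at archive scale is /finding/ which contents to group, since Git's similarity-sorting heuristic assumes a single repository context):

```python
import zlib

# many near-identical contents, as when the same file recurs across repos
base = bytes(range(256)) * 40                    # ~10 KB pseudo-content
contents = [base + bytes([i]) for i in range(100)]

one_by_one = sum(len(zlib.compress(c)) for c in contents)
together = len(zlib.compress(b"".join(contents)))

print(one_by_one, together)
# compressing similar contents within one zlib window captures the
# cross-content redundancy that 1-by-1 compression cannot see
assert together < one_by_one
```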
#+BEAMER: \pause
*** Distributed archival
- massively distributed archival (e.g., P2P) would be nice
- but most P2P technologies behave more like CDNs than archives and offer
  no retention guarantees (e.g., self-healing)
** Efficient graph processing
*** Use cases
- Vault: recursive visits to collect archived objects
- Provenance: single-destination shortest path
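Both use cases are plain graph traversals; a minimal sketch of the Vault-style recursive visit over an in-memory successor map (toy data; the provenance case is the same traversal on reversed edges):

```python
from collections import deque

# toy successor map: revision -> directories -> contents
succ = {
    "rev1": ["dir1"],
    "dir1": ["cnt1", "dir2"],
    "dir2": ["cnt2"],
    "cnt1": [], "cnt2": [],
}

def recursive_visit(root: str) -> set:
    """Collect every object reachable from root (Vault use case)."""
    seen, todo = {root}, deque([root])
    while todo:
        node = todo.popleft()
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

print(sorted(recursive_visit("rev1")))
# → ['cnt1', 'cnt2', 'dir1', 'dir2', 'rev1']
```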
#+BEAMER: \pause
*** Technology
- beyond the capabilities of off-the-shelf graph DBs
- graph topology: scale-free, but not small world
- /probably/ a bad fit for Pregel/Chaos/etc.
- are webgraph-style compression techniques suitable for storing and
  processing the Merkle DAG in memory? (unclear)
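The webgraph techniques in question boil down to storing each adjacency list as sorted gaps in a variable-length encoding (plus reference compression, omitted here); a minimal sketch with hypothetical helper names:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative int as LEB128-style variable-length bytes."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_adjacency(neighbors: list) -> bytes:
    """Sort, then store successive gaps instead of absolute node ids."""
    prev, out = 0, bytearray()
    for n in sorted(neighbors):
        out += varint(n - prev)
        prev = n
    return bytes(out)

# node ids are large, but gaps between sorted neighbors are small
neighbors = [1_000_000, 1_000_003, 1_000_017, 1_000_020]
compact = encode_adjacency(neighbors)
naive = 4 * 8                    # four 64-bit ids
print(len(compact), naive)       # 6 bytes vs. 32
assert len(compact) < naive
```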
** Provenance tracking :noexport:
- TODO
