diff --git a/talks-public/2018-02-13-inria-saclay/2018-02-13-inria-saclay.org b/talks-public/2018-02-13-inria-saclay/2018-02-13-inria-saclay.org
new file mode 100644
index 0000000..f15d681
--- /dev/null
+++ b/talks-public/2018-02-13-inria-saclay/2018-02-13-inria-saclay.org
@@ -0,0 +1,221 @@
+#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
+#+TITLE: Software Heritage: Preserving the Free Software Commons
+# org does not allow a short title, so we override it for beamer as follows:
+#+BEAMER_HEADER: \title[Software Heritage]{Software Heritage\\Preserving the Free Software Commons}
+#+BEAMER_HEADER: \author{Nicolas Dandrimont}
+#+BEAMER_HEADER: \date[2018-02-13 Inria Saclay]{13 February 2018\\Demandez le Programme! - Inria Saclay}
+#+AUTHOR: Nicolas Dandrimont
+#+DATE: 13 February 2018
+#+EMAIL: nicolas@dandrimont.eu
+#+DESCRIPTION: Software Heritage: Preserving the Free Software Commons
+#+KEYWORDS: software heritage legacy preservation knowledge mankind technology
+
+#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
+#+INCLUDE: "../../common/modules/169.org"
+#+BEAMER_HEADER: \institute[Software Heritage]{Software Engineer - Software Heritage\\\href{mailto:nicolas@dandrimont.eu}{\tt nicolas@dandrimont.eu}}
+
+#+LATEX_HEADER: \usepackage{bbding}
+#+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar}
+* The Software Commons
+  #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :minlevel 2
+** Our Software Commons
+   #+INCLUDE: "../../common/modules/foss-commons.org::#commonsdef" :only-contents t
+   #+BEAMER: \pause
+*** Source code is /a precious part/ of our commons
+    \hfill are we taking care of it?
+# #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#main" :only-contents t :minlevel 2
+  #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#fragile" :minlevel 2
+  #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#research" :minlevel 2
+* Software Heritage
+  #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2
+** Our principles
+   #+latex: \begin{center}
+   #+ATTR_LATEX: :width .9\linewidth
+   file:SWH-as-foundation-slim.png
+   #+latex: \end{center}
+#+BEAMER: \pause
+*** Open approach :B_block:BMCOL:
+    :PROPERTIES:
+    :BEAMER_col: 0.4
+    :BEAMER_env: block
+    :END:
+    - 100% FOSS
+    - transparency
+*** In for the long haul :B_block:BMCOL:
+    :PROPERTIES:
+    :BEAMER_col: 0.4
+    :BEAMER_env: block
+    :END:
+    - replication
+    - non-profit
+
+* Architecture
+  #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t
+# #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#giantdag" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
+  #+INCLUDE: "../../common/modules/status-extended.org::#technology" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#development" :only-contents t
+  #+INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2
+
+* Gory details
+** Technology: how do you store the SWH DAG?
+
+*** Problem statement
+- How would you store and query a graph with 10 billion nodes and 60 billion edges?
+- How would you store the contents of more than 3 billion files, 300TB of raw data?
+- on a limited budget (100 000 € of hardware overall)?
+
+#+BEAMER: \pause
+
+*** Our hardware stack
+- two hypervisors with 512GB RAM and 20TB SSD each, sharing access to a storage array (60 x 6TB spinning rust)
+- one backup server with 48GB RAM and another storage array
+
+*** Our software stack
+- an RDBMS (PostgreSQL, what else?) for storage of the graph nodes and edges
+- filesystems for storing the actual file contents
+
+** Technology: archive storage components
+
+*** Metadata storage
+- Python module *swh.storage*
+- thin Python API over a pile of PostgreSQL functions
+- motivation: keep relational integrity at the lowest layer
+
+*** Content ("object") storage
+- Python module *swh.objstorage*
+- very thin object storage abstraction layer (PUT, APPEND and GET) over regular storage technologies
+- separate layer for asynchronous replication and integrity management (*swh.archiver*)
+- motivation: stay as technology-neutral as possible for future mirrors
+
+** Technology: object storage
+*** Current primary deployment
+- storage on 16 sharded XFS filesystems; key = /sha1/ of the content, value = /gzip/ of the content
+- if sha1 = *abcdef01234...*, file path = / srv / storage / *a* / *ab* / *cd* / *ef* / *abcdef01234...*
+- 3 directory levels deep, each level 256-wide = 16 777 216 directories (1 048 576 per partition); code sketch on a later slide
+*** Secondary deployment
+- storage on Azure blob storage
+- 16 storage containers, objects stored there in a flat structure
+
+** Technology: object storage review
+
+*** Generic model is fine
+The abstraction layer is fairly simple and generic, and the implementation of
+the upper layers (replication, integrity checking) was a breeze.
+
+*** Filesystem implementation is bad
+Slow spinning storage + little RAM (48GB) + 16 million dentries = (very) bad performance
+
+** Technology: metadata storage
+*** Current deployment
+- PostgreSQL deployed in primary/replica mode, using pg\under{}logical for replication: different indexes on the primary (tuned for writes) and on the replicas (tuned for reads)
+- most logic done in SQL
+- thin Pythonic API over the SQL functions
+
+*** End goals
+- proper handling of relations between objects at the lowest level
+- fast recursive queries on the graph (e.g., find the provenance info for a content by walking up the whole graph in a single query; code sketch on a later slide)
+
+** Technology: metadata storage review
+
+*** Limited resources
+PostgreSQL works really well
+#+BEAMER: \pause
+... until your indexes don't fit in RAM
+
+#+BEAMER: \pause
+***
+Our recursive queries jump between different object types, and between evenly
+distributed hashes. Data locality doesn't exist. Caches break down.
+
+#+BEAMER: \pause
+***
+Massive deduplication = efficient storage
+#+BEAMER: \pause
+
+*but* massive deduplication = exponential fan-out for recursive queries
+
+#+BEAMER: \pause
+*** Reality check
+
+Referential integrity?
+#+BEAMER: \pause
+Real repositories downloaded from the internet are all kinds of broken.
+
+** Technology: outlook
+
+*** Object storage
+
+Our Azure prototype shows that using a scale-out "cloudy" technology for our
+object storage works really well. Plain filesystems on spinning rust, not so
+much.
+#+BEAMER: \pause
+
+We have started working on a prototype Ceph infrastructure for our main copy
+of the archive, as our budget ramps up.
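+
+** Sketch: the object storage layout
+   For illustration only --- a minimal sketch of the key = /sha1/,
+   value = /gzip/ sharded layout, with made-up names, /not/ the actual
+   /swh.objstorage/ API:
+   #+BEGIN_SRC python
+   import gzip
+   import hashlib
+   from pathlib import Path
+
+   class FSObjStorage:
+       # Toy content-addressed store: key = sha1 of the content,
+       # value = the gzipped content, sharded over a shallow tree.
+       def __init__(self, root="/srv/storage"):
+           self.root = Path(root)
+
+       def _path(self, sha1):
+           # abcdef01234... -> /srv/storage/a/ab/cd/ef/abcdef01234...
+           # one hex digit picks the shard, then three 256-wide levels
+           return (self.root / sha1[0] / sha1[:2] / sha1[2:4]
+                   / sha1[4:6] / sha1)
+
+       def put(self, content):
+           key = hashlib.sha1(content).hexdigest()
+           self._path(key).parent.mkdir(parents=True, exist_ok=True)
+           self._path(key).write_bytes(gzip.compress(content))
+           return key
+
+       def get(self, key):
+           return gzip.decompress(self._path(key).read_bytes())
+   #+END_SRC
+   The upper layers only ever see put/get, so swapping this class for an
+   Azure- or Ceph-backed one leaves replication and integrity checking
+   untouched.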
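+
+** Sketch: the provenance query we wanted
+   Roughly the shape of the "single recursive query" provenance goal: start
+   from a content, climb the directory DAG, stop at the revisions. Table and
+   column names are made up for the sketch, /not/ the actual schema:
+   #+BEGIN_SRC python
+   # Hypothetical schema: dir_entry(dir_id, target) links a directory
+   # to one of its entries; revision(id, directory) points a revision
+   # at its root directory.
+   PROVENANCE_SQL = """
+   WITH RECURSIVE parents(id) AS (
+       SELECT dir_id FROM dir_entry WHERE target = %(sha1)s
+     UNION
+       SELECT e.dir_id FROM dir_entry e JOIN parents p ON e.target = p.id
+   )
+   SELECT r.id FROM revision r JOIN parents p ON r.directory = p.id
+   """
+
+   # e.g., cursor.execute(PROVENANCE_SQL, {"sha1": content_sha1})
+   #+END_SRC
+   Deduplication is what hurts here: every shared directory multiplies the
+   rows the recursive UNION has to chase, across evenly distributed hashes,
+   so data locality and caches cannot help.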
+** Technology: outlook (continued)
+*** Metadata storage
+Our initial assumption that we wanted referential integrity and built-in
+recursive queries was wrong.
+#+BEAMER: \pause
+
+We could probably migrate to "dumb" object storages for each type of object,
+with another layer to check metadata integrity regularly.
+
+* Come in, we're open!
+** You can help!
+*** Coding
+    - \url{forge.softwareheritage.org} --- *our own code*
+    #+BEAMER: \vspace{-5mm}
+    | ٭٭٭ | listers for unsupported forges, distros, pkg. managers |
+    | ٭٭٭ | loaders for unsupported VCS, source package formats    |
+    | ٭٭  | Web UI: eye candy wrapper around the Web API           |
+    #+BEAMER: \pause
+*** Community
+    | ٭٭  | spread the news, help us with long-term sustainability |
+    | ٭٭٭ | document endangered source code                        |
+    #+BEAMER: \vspace{-3mm} \scriptsize \centering
+    \url{wiki.softwareheritage.org/index.php?title=Suggestion_box}
+
+** The Software Heritage community
+*** Core team
+    10 people working on the project full-time, split across engineering,
+    research, and fundraising/management topics.
+    #+BEAMER: \pause
+*** Inria as initiator :B_picblock:
+    :PROPERTIES:
+    :BEAMER_env: picblock
+    :BEAMER_opt: pic=inria-logo-new,leftpic=true,width=\extblockscale{.2\linewidth}
+    :END:
+    - .fr national computer science research entity
+    - strong Free Software culture
+    # - creating a non profit, international organisation
+    #+BEAMER: \vspace{-2mm}
+    #+BEAMER: \pause
+*** Early Sponsors and Supporters
+    *Société Générale, Microsoft, Huawei, Nokia, DANS, Univ. Bologna,*
+    #+latex: ~~
+    ACM, Creative Commons, Eclipse, Engineering, FSF, Gandi, GitHub, IEEE, OIN,
+    OSI, OW2, Software Freedom Conservancy, SFLC, The Document Foundation, ...
+
+* Conclusion
+  #+INCLUDE: "../../common/modules/swh-backmatter.org::#conclusion" :minlevel 2
+* FAQ :B_appendix:
+  :PROPERTIES:
+# :BEAMER_env: appendix
+  :END:
+** Q: do you archive /only/ Free Software?
+   - We only crawl origins /meant/ to host source code (e.g., forges)
+   - Most (~90%) of what we /actually/ retrieve is textual content
+   #+BEAMER: \vfill
+*** Our goal
+    Archive *the entire Free Software Commons*
+
+    #+BEAMER: \vfill
+***
+    - Large parts of what we retrieve are /already/ Free Software, today
+    - Most of the rest /will become/ Free Software in the long term
+      - e.g., at copyright expiration
+** Q: how about SHA1 collisions?
+   #+BEAMER: \lstinputlisting[language=SQL,basicstyle=\small]{../../common/source/swh-content.sql}
diff --git a/talks-public/2018-02-13-inria-saclay/Makefile b/talks-public/2018-02-13-inria-saclay/Makefile
new file mode 100644
index 0000000..68fbee7
--- /dev/null
+++ b/talks-public/2018-02-13-inria-saclay/Makefile
@@ -0,0 +1 @@
+include ../Makefile.slides