diff --git a/talks-public/2017-08-10-debconf/2017-08-10-debconf.org b/talks-public/2017-08-10-debconf/2017-08-10-debconf.org index add2ec1..f7aae30 100644 --- a/talks-public/2017-08-10-debconf/2017-08-10-debconf.org +++ b/talks-public/2017-08-10-debconf/2017-08-10-debconf.org @@ -1,122 +1,204 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage: Our Software Commons, Forever. #+SUBTITLE: a status update #+BEAMER_HEADER: \date[DebConf]{10 August 2017\\DebConf17 --- Montreal, CA} #+AUTHOR: Nicolas Dandrimont, Stefano Zacchiroli #+DATE: 10 August 2017 #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" #+BEAMER_HEADER: \institute{Inria, Software Heritage} #+LATEX_HEADER: \usepackage{bbding} #+LATEX_HEADER: \DeclareUnicodeCharacter{66D}{\FiveStar} #+BEAMER_HEADER: \pgfdeclareimage[height=90mm,width=160mm]{bgd}{swh-world-169.png} #+BEAMER_HEADER: \setbeamertemplate{background}{\pgfuseimage{bgd}} -#+LaTeX_CLASS_OPTIONS: [aspectratio=169,handout,xcolor=table] +#+LaTeX_CLASS_OPTIONS: [aspectratio=169,xcolor=table] * The Software Commons ** Free Software is everywhere #+latex: \begin{center} #+ATTR_LATEX: :width .7\linewidth file:software-center.pdf #+latex: \end{center} #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :minlevel 2 ** Our Software Commons #+INCLUDE: "../../common/modules/foss-commons.org::#commonsdef" :only-contents t #+BEAMER: \pause *** Source code is /a precious part/ of our commons \hfill are we taking care of it? 
#+INCLUDE: "../../common/modules/swh-motivations-foss.org::#main" :only-contents t :minlevel 2 * Software Heritage #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2 ** Our principles #+latex: \begin{center} #+ATTR_LATEX: :width .9\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} #+BEAMER: \pause *** Open approach :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - 100% FOSS - transparency *** In for the long haul :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - replication - non profit * Technical overview #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#giantdag" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2 +* Dive into the Software Heritage technical stack +** Technology: how do you store the SWH DAG? +*** Problem statement +- How would you store and query a graph with 10 billion nodes and 60 billion edges? +- How would you store the contents of more than 3 billion files, 300TB of raw data? 
+- on a limited budget (100 000 € of hardware overall) + +#+BEAMER: \pause + +*** Our hardware stack +- two hypervisors with 512GB RAM, 20TB SSD each, sharing access to a storage array (60 x 6TB spinning rust) +- one backup server with 48GB RAM and another storage array + +*** Our software stack +- An RDBMS (PostgreSQL, what else?), for storage of the graph nodes and edges +- filesystems for storing the actual file contents + +** Technology: archive storage components + +*** Metadata storage +- Python module *swh.storage* +- thin Python API over a pile of PostgreSQL functions +- motivation: keeping relational integrity at the lowest layer + +*** Content ("object") storage +- Python module *swh.objstorage* +- very thin object storage abstraction layer (PUT, APPEND and GET) over regular storage technologies +- separate layer for asynchronous replication and integrity management (*swh.archiver*) +- motivation: stay as technology neutral as possible for future mirrors + +** Technology: object storage +*** Current primary deployment +- Storage on 16 sharded XFS filesystems; key = /sha1/(content), value = /gzip/(content) +- if sha1 = *abcdef01234...*, file path = / srv / storage / *a* / *ab* / *cd* / *ef* / *abcdef01234...* +- 3 directory levels deep, each level 256-wide = 16 777 216 directories (1 048 576 per partition) +*** Secondary deployment +- Storage on Azure blob storage +- 16 storage containers, objects stored in a flat structure there + +** Technology: object storage review + +*** Generic model is fine +The abstraction layer is pretty trivial +*** Filesystem implementation is bad +slow spinning storage + little RAM + 16 million dentries = (very) bad performance +** Technology: metadata storage +*** Current deployment +- PostgreSQL deployed in primary/replica mode, using pg\_logical for replication: different indexes on primary (tuned for writes) and replicas (tuned for reads). 
+- most logic done in SQL +- thin Pythonic API over the SQL functions + +*** end goals +- proper handling of relations between objects at the lowest level +- doing fast recursive queries on the graph (e.g. find the provenance info for a content, walking up the whole graph, in one single query) + + +** Technology: metadata storage review + +*** Limited resources +PostgreSQL works really well +#+BEAMER: \pause +... until your indexes don't fit in RAM + +#+BEAMER: \pause +*** +Our recursive queries jump between different object types, and between evenly distributed hashes. Data locality doesn't exist. Caches break down. + +#+BEAMER: \pause +*** +Massive deduplication = efficient storage +#+BEAMER: \pause + +*but* Massive deduplication = exponential width for recursive queries + +#+BEAMER: \pause +*** Reality check + +Referential integrity? +#+BEAMER: \pause +Real repositories downloaded from the internet are all kinds of broken. + +** Technology: outlook * Community ** You can help! #+BEAMER: \vspace{-1mm} *** Coding - \url{www.softwareheritage.org/community/developers/} - \url{forge.softwareheritage.org} --- *our own code* #+BEAMER: \vspace{-3mm} *** Current development priorities | ٭٭٭ | listers for unsupported forges, distros, pkg. managers | | ٭٭٭ | loaders for unsupported VCS, source package formats | | ٭٭ | Web UI: eye candy wrapper around the Web API | | ٭ | content indexing and search | #+BEAMER: \vspace{-2mm} … /all/ contributions equally welcome! #+BEAMER: \pause \vspace{-1mm} *** Join us - \url{www.softwareheritage.org/jobs} --- *job openings* - \url{wiki.softwareheritage.org} --- *internships* #+INCLUDE: "../../common/modules/endorsement.org::#endorsement" :minlevel 2 #+INCLUDE: "../../common/modules/swh-sponsors.org::#sponsors" :minlevel 2 ** Going global *** April 3rd, 2017: landmark UNESCO/Inria agreement... 
#+BEGIN_EXPORT latex \includegraphics[width=\extblockscale{.25\linewidth}]{inria-logo-new} \hfill \includegraphics[width=\extblockscale{.35\linewidth}]{unesco-accord} \hfill \includegraphics[width=\extblockscale{.2\linewidth}]{unesco}\\[1em] \mbox{}\hfill \includegraphics[width=\extblockscale{.2\linewidth}]{rdc-fh-ib} \hfill \includegraphics[width=\extblockscale{.15\linewidth}]{SWH-logo_share} \hfill \includegraphics[width=\extblockscale{.2\linewidth}]{swh-team-2017-04-03}\hfill \mbox{}\\ \begin{center} \footnotesize \url{www.softwareheritage.org/?p=11623} \end{center} #+END_EXPORT *** *Next step:* 27-28 Sep 2017: UNESCO/Inria conference in Paris\hfill * Conclusion ** Conclusion *** Software Heritage is - a /reference archive/ of /all/ FOSS ever written # - a fantastic new tool for /research/ software - a unique /complement/ for /development platforms/ - an international, open, nonprofit, /mutualized infrastructure/ - at the service of our community, at the service of society *** Come in, we're open! \url{www.softwareheritage.org} --- /sponsoring/, /job openings/ \\ \url{wiki.softwareheritage.org} --- /internships/, /leads/ \\ \url{forge.softwareheritage.org} --- /our own code/ #+BEAMER: \vfill \flushright {\Huge Questions?} \vfill * FAQ :B_appendix: :PROPERTIES: :BEAMER_env: appendix :END: ** Q: how about SHA1 collisions? #+BEAMER: \lstinputlisting[language=SQL,basicstyle=\footnotesize]{../../common/source/swh-content.sql}