diff --git a/common/modules/status-extended.org b/common/modules/status-extended.org
index 9c6d111..ddb1433 100644
--- a/common/modules/status-extended.org
+++ b/common/modules/status-extended.org
@@ -1,218 +1,216 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+INCLUDE: "prelude.org" :minlevel 1
 * Status
   :PROPERTIES:
   :CUSTOM_ID: main
   :END:
 ** The people
    :PROPERTIES:
    :CUSTOM_ID: people
    :END:
 *** The core team :B_picblock:
     :PROPERTIES:
     :BEAMER_env: picblock
     :BEAMER_opt: pic=team,width=.4\linewidth
     :END:
     - Roberto Di Cosmo
     - Stefano Zacchiroli
     - Nicolas Dandrimont (Engineer)
     - Antoine Dumont (Engineer)
     - and /Jordi, Quentin and Guillaume/
 *** Scientific advisors
     - Serge Abiteboul (French Science Academy)
     - Jean-François Abramatic (former W3C director)
     - Gerard Berry (CNRS Gold Medal, French Science Academy)
     - Julia Lawall (Coccinelle, Linux Kernel, Outreachy)
-** Archive status
+** Archive coverage
    :PROPERTIES:
    :CUSTOM_ID: archive
    :END:
 *** Our sources
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
-    - GitHub --- all public repositories as of October 2016
+    - GitHub --- full, up-to-date mirror
     - Debian --- daily snapshots of all suites since 2005--2015
     - GNU --- all releases as of August 2015
-    - Gitorious --- retrieved full mirror from Archive Team
-    - Google Code --- retrieved full mirror from Google
+    - Gitorious, Google Code --- local copy (Archive Team & Google)
 *** Some numbers
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
-#+latex: \begin{center}
-#+ATTR_LATEX: :width \extblockscale{.8\linewidth}
-file:growth.png
-#+latex: \end{center}
-    # - 25 million repositories ingested (10M next in line)
-    # - 12 million people, 5 million releases
-    # - 600 million commits, 2.2 billion directories
-    # - 2.9 billion unique source files / 200 TB of raw source code
+    #+latex: \centering
+    #+ATTR_LATEX: :width \extblockscale{.8\linewidth}
+    file:growth.png
+    #+latex: \footnotesize\vspace{-3mm}
+    150 TB blobs, 5 TB database (as a graph: 4 B nodes + 40 B edges)
 ***
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     \hfill The /richest/ source code archive already, ... and growing daily!
 ** The structure of the archive :noexport:
 *** On-disk storage
     - flat file storage for contents
     - postgres database for the metadata
 *** Data model: /one/ big Merkle DAG, inspired by the git model
     - Origins (= repositories)
     - Occurrences (= branches)
     - Releases (= tags)
     - Revisions (= commits)
     - Directories (= trees)
     - Contents (= blobs)
 ** Architecture :noexport:
    :PROPERTIES:
    :CUSTOM_ID: architecture
    :END:
 *** Data flow
     :PROPERTIES:
     :CUSTOM_ID: dataflow
     :END:
     #+BEAMER: \hspace*{-0.7cm}\includegraphics[width=1.15\textwidth]{swh-dataflow.pdf}
 ** Data model :noexport:
 *** General schema
     - VCS-independent
     - fully deduplicated
       + files, directories and commits are /shared/
     - biggest git-like /graph/ in the world
 ***
     \begin{center}
     \url{http://deb.li/swhdm}
     \end{center}
 *** full hash index (sha1, sha256, ...)
     Some funny facts:
     - the GPL2 licence appears under more than 500 names
       + including /aa.css.txt/ and /FullSync.txt/ ~ :-)
 ** Merkle structure :noexport:
    :PROPERTIES:
    :CUSTOM_ID: merkle
    :END:
 *** Merkle trees
     :PROPERTIES:
     :CUSTOM_ID: merkletree
     :END:
-    # R. C. Merkle, A digital signature based on a conventional encryption function, Crypto '87
+    # R. C. Merkle, A digital signature based on a conventional encryption
+    # function, Crypto '87
+    #+BEAMER: \vspace{-3mm}
 **** Merkle tree (R. C. Merkle, Crypto 1979) :B_picblock:
      :PROPERTIES:
      :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth
      :BEAMER_env: picblock
      :BEAMER_act:
      :END:
      Combination of
      - tree
      - hash function
-     #+BEAMER: \pause
+    #+BEAMER: \pause
 **** Classical cryptographic construction
      - fast, parallel signature of large data structures
      - widely used (e.g., Git, Bitcoin, IPFS, ...)
      - built-in deduplication
 *** The archive in a few pictures
     :PROPERTIES:
     :CUSTOM_ID: merkledemo
     :END:
 **** A giant (extended) Merkle DAG
      #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}}
      #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/contents.pdf}}}
      #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_2_contents.pdf}}}
      #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/directories.pdf}}}
      #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_3_directories.pdf}}}
      #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/revisions.pdf}}}
      #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_4_revisions.pdf}}}
      #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/releases.pdf}}}
      #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_5_releases.pdf}}}
      # #+LATEX: {\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}}
 ** Merkle structure (short) :noexport:
    :PROPERTIES:
    :CUSTOM_ID: giantdag
    :END:
 *** The archive: a (giant) Merkle DAG
     # Using an empty frame because the image is difficult to read on swh bg.
     # Finding a way to override image bg for just this frame would be better.
 ****
      #+BEAMER: \includegraphics[width=\textwidth]{git-merkle/merkle_5_releases}
 ** Technology :noexport:
    :PROPERTIES:
    :CUSTOM_ID: technology
    :END:
 *** Hardware
 **** hosted by Inria
      - Hypervisor with a dozen virtual machines
      - High density storage array (60 * 6TB => 300TB usable)
      - Copy in another server room; logical leader/follower mirroring
      - Soon to enable a mirror network to duplicate our contents
 **** Azure cloud (work in progress prototype)
      - full mirror using distributed object storage
      - workers for batch analyses and crawling
 *** Software
 **** 3rd party FOSS
      - Debian distribution, orchestrated with Puppet
      - PostgreSQL for metadata storage
      - RabbitMQ for task scheduling
      - Python3 and psycopg2 for the backend
      - Flask and Bootstrap for the web apps
        #+BEAMER: \\ $\to$ \alert{\footnotesize \url{https://www.softwareheritage.org/jobs/}}
      - Phabricator forge
 **** in-house FOSS
      - ~50 Git repositories (~20 Python packages, ~10 Puppet modules)
      - ~20 kSLOC Python / ~10 kSLOC SQL / ~1 kSLOC Puppet
      - licence choice: GPLv3 (backend) / AGPLv3 (frontend)
      - https://forge.softwareheritage.org/
 *** Software architecture
 **** Module dependencies (internal + external) :B_picblock:
      :PROPERTIES:
      :BEAMER_env: picblock
      :BEAMER_opt: pic=swh-modules-deps-all,width=\linewidth
      :END:
 **** let's zoom in: http://deb.li/swhdeps
 ** Software development :noexport:
    :PROPERTIES:
    :CUSTOM_ID: development
    :END:
 *** Software development
 **** classic FOSS development
      - language: English
      - development mailing list
        #+BEAMER: \\{\small \url{https://sympa.inria.fr/sympa/info/swh-devel}}
      - IRC
        #+BEAMER: \\ #swh-devel / FreeNode
      - Forge
        #+BEAMER: \\{\small \url{https://forge.softwareheritage.org}}
        - Git, tasks, code review, etc.
 **** for more information
      #+BEAMER: \scriptsize
      https://www.softwareheritage.org/community/developers/
 ** The road ahead
    :PROPERTIES:
    :CUSTOM_ID: features
    :END:
 *** Planned features...
     - /lookup/ by content hash (done)
     - /download/: wget and git clone from Software Heritage
     - /provenance information/ for all archived code and metadata
     - /browsing/: wayback machine for archived code and its history
     - /full-text search/ on all archived source code files
     #+BEAMER: \pause
 *** ... and much more than one could possibly imagine
     all the world's software development history in a single graph!
     # \hfill /that makes a 150TB archive / 5TB database already.../
 ** Some technical challenges
    :PROPERTIES:
    :CUSTOM_ID: techchallenges
    :END:
 *** Expanding the archive
     - discover and classify /all/ the software sources
     - importers for other VCSs (SVN, Hg, ...)
     \hfill /We need your help!/
 *** Staying current
     get new repositories and commits ASAP\\
     \hfill /We need reliable, standardised event feeds./
 *** Handling the backlog
     ingesting all the pre-existing data\\
     \hfill /Decades of software development are waiting!/