#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
#+INCLUDE: "prelude.org" :minlevel 1
* Status
:PROPERTIES:
:CUSTOM_ID: main
:END:
** The people
:PROPERTIES:
:CUSTOM_ID: people
:END:
*** The core team :B_picblock:
:PROPERTIES:
:BEAMER_env: picblock
:BEAMER_opt: pic=team,width=.4\linewidth
:END:
- Roberto Di Cosmo
- Stefano Zacchiroli
- Nicolas Dandrimont (Engineer)
- Antoine Dumont (Engineer)
- and /Jordi, Quentin and Guillaume/
*** Scientific advisors
- Serge Abiteboul (French Science Academy)
- Jean-François Abramatic (former W3C director)
- Gérard Berry (CNRS Gold Medal, French Science Academy)
- Julia Lawall (Coccinelle, Linux Kernel, Outreachy)
** The archive
:PROPERTIES:
:CUSTOM_ID: archive
:END:
*** Our sources
:PROPERTIES:
:BEAMER_act: +-
:END:
- GitHub --- all public repositories as of August 2016
- Debian --- daily snapshots of all suites, 2005--2015
- GNU --- all releases as of August 2015
- Gitorious --- retrieved full mirror from Archive Team
- Google Code --- retrieved full mirror from Google
*** Some numbers
:PROPERTIES:
:BEAMER_act: +-
:END:
#+latex: \begin{center}
#+ATTR_LATEX: :width .8\linewidth
file:growth.png
#+latex: \end{center}
# - 25 million repositories ingested (10M next in line)
# - 12 million people, 5 million releases
# - 600 million commits, 2.2 billion directories
# - 2.9 billion unique source files / 200 TB of raw source code
***
:PROPERTIES:
:BEAMER_act: +-
:END:
\hfill The /richest/ source code archive already, ... and growing daily!
** The structure of the archive :noexport:
*** On-disk storage
- flat file storage for contents
- postgres database for the metadata
*** Data model: /one/ big Merkle DAG, inspired by the git model
- Origins (= repositories)
- Occurrences (= branches)
- Releases (= tags)
- Revisions (= commits)
- Directories (= trees)
- Contents (= blobs)
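
A minimal sketch of how such a git-inspired Merkle DAG can be built, with every object identified by the hash of its own serialization. The in-memory =store=, the =put_*= helpers and the serialization format are hypothetical and chosen for illustration only; they are not the archive's actual object format.
#+BEGIN_SRC python
import hashlib

# Hypothetical in-memory object store: object id -> serialized object.
store = {}

def put(kind, payload):
    """Store an object under the SHA-1 of its kind-prefixed payload."""
    data = kind.encode() + b"\0" + payload
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # identical bytes give the identical id: duplicates collapse
    return oid

def put_content(raw):
    return put("content", raw)

def put_directory(entries):
    # entries maps a name to a child object id; sorting makes the id
    # independent of insertion order.
    lines = [f"{name} {oid}" for name, oid in sorted(entries.items())]
    return put("directory", "\n".join(lines).encode())

def put_revision(directory_id, parents, message):
    lines = [f"directory {directory_id}"]
    lines += [f"parent {p}" for p in parents]
    lines += ["", message]
    return put("revision", "\n".join(lines).encode())

# Two directories share the same licence content; it is stored only once.
licence = put_content(b"GPL text\n")
main_c = put_content(b"int main(){}\n")
dir_a = put_directory({"COPYING": licence, "main.c": main_c})
dir_b = put_directory({"LICENSE.txt": licence})
rev1 = put_revision(dir_a, [], "initial import")
rev2 = put_revision(dir_b, [rev1], "second revision")
#+END_SRC
Because a parent's id covers the ids of its children, changing any content changes the id of every directory and revision above it, while unchanged objects keep their ids and are shared.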
** Data model :noexport:
*** General schema
- VCS-independent
- fully deduplicated
  + files, directories and commits are /shared/
- biggest git-like /graph/ in the world
***
\begin{center}
\url{http://deb.li/swhdm}
\end{center}
*** full hash index (sha1, sha256, ...)
Some funny facts:
- the GPL2 licence appears under more than 500 names
  + including /aa.css.txt/ and /FullSync.txt/ ~ :-)
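
A minimal sketch of a multi-hash index, to illustrate how one content can be looked up by any of its checksums and how a single content can turn out to carry many file names. The =index= and =names= structures, the =add_file= and =lookup= helpers and the placeholder bytes are assumptions made for this example, not the archive's schema.
#+BEGIN_SRC python
import hashlib
from collections import defaultdict

ALGOS = ("sha1", "sha256")

# Hypothetical index: algorithm -> digest -> primary key (the sha1 here).
index = {algo: {} for algo in ALGOS}
# Primary key -> set of file names under which the content was seen.
names = defaultdict(set)

def checksums(raw):
    """Compute every indexed checksum of a byte string."""
    return {algo: hashlib.new(algo, raw).hexdigest() for algo in ALGOS}

def add_file(name, raw):
    sums = checksums(raw)
    key = sums["sha1"]
    for algo, digest in sums.items():
        index[algo][digest] = key
    names[key].add(name)
    return key

def lookup(algo, digest):
    """Return the sorted file names recorded for a digest, or [] if unknown."""
    key = index[algo].get(digest)
    return sorted(names[key]) if key is not None else []

# Placeholder bytes standing in for the licence text.
gpl = b"placeholder licence text\n"
for name in ("COPYING", "aa.css.txt", "FullSync.txt"):
    add_file(name, gpl)
#+END_SRC
Calling =lookup("sha256", hashlib.sha256(gpl).hexdigest())= then lists all three names, although the content itself is recorded only once.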
** Merkle structure :noexport:
:PROPERTIES:
:CUSTOM_ID: merkle
:END:
*** Merkle trees
# R. C. Merkle, A digital signature based on a conventional encryption function, Crypto '87
**** Merkle tree (R. C. Merkle, Crypto 1979) :B_picblock:
:PROPERTIES:
:BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth
:BEAMER_env: picblock
:BEAMER_act:
:END:
Combination of
- tree
- hash function
#+BEAMER: \pause
**** Classical cryptographic construction
- fast, parallel signature of large data structures
- widely used by /Git/, /Bitcoin/, etc.
- natural extension: Merkle /DAG/
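
A minimal sketch of the "tree + hash function" combination: the root hash of a binary Merkle tree over a list of byte strings. The choice of SHA-256, the one-byte leaf/node prefixes and the duplication of the last node on odd-sized levels are assumptions made for this example; real systems differ on these conventions.
#+BEGIN_SRC python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Return the root hash of a binary Merkle tree over byte strings."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # Leaf hashes use a different prefix than inner nodes, so a leaf
    # cannot be confused with an inner node.
    level = [h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left = level[i]
            # Assumed convention: on an odd-sized level, pair the last
            # node with itself.
            right = level[i + 1] if i + 1 < len(level) else left
            nxt.append(h(b"\x01" + left + right))
        level = nxt
    return level[0]

root = merkle_root([b"a", b"b", b"c"])
tampered = merkle_root([b"a", b"B", b"c"])
#+END_SRC
The two roots differ, since a change in any leaf propagates up to the root. Hashes within one level do not depend on each other, so they can be computed in parallel.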
*** The archive in a few pictures
**** A giant (extended) Merkle DAG
#+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_1.pdf}}}
#+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/contents.pdf}}}
#+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_2_contents.pdf}}}
#+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/directories.pdf}}}
#+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_3_directories.pdf}}}
#+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/revisions.pdf}}}
#+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_4_revisions.pdf}}}
#+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/releases.pdf}}}
#+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_5_releases.pdf}}}
# #+LATEX: {\colorbox{white}{\includegraphics[width=.9\linewidth]{git-merkle/merkle_1.pdf}}}
** Technology :noexport:
*** Hardware (currently hosted by Inria)
- Hypervisor with a dozen virtual machines
- High density storage array (60 * 6TB => 300TB usable)
- Copy in another server room; logical leader/follower mirroring
- Soon to enable a mirror network to duplicate our contents
*** Software
- Debian distribution
- PostgreSQL for metadata storage
- RabbitMQ for task scheduling
# - Python3 and psycopg2 for the backend
# - Flask for the web apps
*** Licences
- GPLv3 for the backend code
- AGPLv3 for the frontend
- Apache2 for the Puppet manifests
***
https://forge.softwareheritage.org/
** The road ahead
:PROPERTIES:
:CUSTOM_ID: features
:END:
*** Planned features...
- /lookup/ by hashes for contents (done)
- /download/: git clone from Software Heritage
- /provenance information/ for all the content
- /browsing/: wayback machine for software source code
- /full text search/: dive into the Software Heritage archive
#+BEAMER: \pause
*** ... and much more than one could possibly imagine
all the world's software development history in a single graph!\\
\hfill /that makes a 150TB archive / 5TB database already.../
** Some technical challenges
:PROPERTIES:
:CUSTOM_ID: techchallenges
:END:
*** Expanding the archive
- discover and classify /all/ the software sources
- importers for other VCSs (SVN, Hg, ...)
\hfill /We need your help!/
*** Staying current
get new repositories and commits ASAP\\
\hfill /We need reliable, standardised event feeds./
*** Handling the backlog
ingesting all the pre-existing data\\
\hfill /Decades of software development are waiting!/
