diff --git a/common/modules/status-extended.org b/common/modules/status-extended.org index 0412f97..8750407 100644 --- a/common/modules/status-extended.org +++ b/common/modules/status-extended.org @@ -1,364 +1,367 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 * Status :PROPERTIES: :CUSTOM_ID: main :END: ** The people :PROPERTIES: :CUSTOM_ID: people :END: *** The core team :B_picblock: :PROPERTIES: :CUSTOM_ID: coreteam :BEAMER_env: picblock :BEAMER_opt: pic=team,width=.4\linewidth :END: - Roberto Di Cosmo - Stefano Zacchiroli - Nicolas Dandrimont (Engineer) - Antoine Dumont (Engineer) - and /Jordi, Quentin and Guillaume/ *** Scientific advisors - Serge Abiteboul (French Science Academy) - Jean-François Abramatic (former W3C director) - Gerard Berry (CNRS Gold Medal, French Science Academy) - Julia Lawall (Coccinelle, Linux Kernel, Outreachy) ** Archive coverage :PROPERTIES: :CUSTOM_ID: archive :END: *** Our sources :PROPERTIES: :BEAMER_act: +- :END: - GitHub --- full, up-to-date mirror - Debian --- daily snapshots of all suites since 2005--2015 - GNU --- all releases as of August 2015 - Gitorious, Google Code --- local copy (Archive Team & Google) *** Some numbers :PROPERTIES: :BEAMER_act: +- :END: #+latex: \centering #+ATTR_LATEX: :width \extblockscale{.8\linewidth} file:growth.png #+latex: \footnotesize\vspace{-3mm} 150 TB blobs, 6 TB database (as a graph: 5 B nodes + 50 B edges) *** :PROPERTIES: :BEAMER_act: +- :END: \hfill The /richest/ source code archive already, ... and growing daily! 
** The structure of the archive :noexport: *** On-disk storage - flat file storage for contents - postgres database for the metadata *** Data model: /one/ big Merkle DAG, inspired by the git model - Origins (= repositories) - Occurrences (= branches) - Releases (= tags) - Revisions (= commits) - Directories (= trees) - Contents (= blobs) ** Architecture :noexport: :PROPERTIES: :CUSTOM_ID: architecture :END: *** Data flow :PROPERTIES: :CUSTOM_ID: dataflow :END: #+BEAMER: \hspace*{-0.7cm}\includegraphics[width=1.15\textwidth]{swh-dataflow.pdf} ** Data model :noexport: *** General schema - VCS-independent - fully deduplicated + files, directories and commits are /shared/ - biggest git-like /graph/ in the world *** \begin{center} \url{http://deb.li/swhdm} \end{center} *** full hash index (sha1, sha256, ...) Some funny facts: - the GPL2 licence appears under more than 500 names + including /aa.css.txt/ and /FullSync.txt/ ~ :-) ** Merkle DAG *** Merkle structure :PROPERTIES: :CUSTOM_ID: merkle :END: **** Merkle trees :PROPERTIES: :CUSTOM_ID: merkletree :END: # R. C. Merkle, A digital signature based on a conventional encryption # function, Crypto '87 #+BEAMER: \vspace{-3mm} ***** Merkle tree (R. C. Merkle, Crypto 1979) :B_picblock: :PROPERTIES: :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth :BEAMER_env: picblock :BEAMER_act: :END: Combination of - tree - hash function #+BEAMER: \pause ***** Classical cryptographic construction - fast, parallel signature of large data structures - widely used (e.g., Git, Bitcoin, IPFS, ...) 
- built-in deduplication **** The archive in a few pictures :PROPERTIES: :CUSTOM_ID: merkledemo :END: ***** A giant (extended) Merkle DAG #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/contents.pdf}}} #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_2_contents.pdf}}} #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/directories.pdf}}} #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_3_directories.pdf}}} #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/revisions.pdf}}} #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_4_revisions.pdf}}} #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/releases.pdf}}} #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_5_releases.pdf}}} # #+LATEX: {\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} *** A revision node :PROPERTIES: :CUSTOM_ID: merklerevision :END: **** Example: a Software Heritage revision ***** #+BEAMER: \vspace{-.5cm}\includegraphics[width=0.95\textwidth]{git-merkle/revisions} ***** - Note: most object kinds currently use Git-compatible identifiers + Note: most object kinds currently have Git-compatible identifiers *** Giant DAG :PROPERTIES: :CUSTOM_ID: giantdag :END: **** The archive: a (giant) Merkle DAG # Using an empty frame because the image is difficult to read on swh bg. # Finding a way to override image bg for just this frame would be better. 
***** #+BEAMER: \includegraphics[width=\textwidth]{git-merkle/merkle_5_releases} ** Technology :noexport: :PROPERTIES: :CUSTOM_ID: technology :END: *** Hardware **** hosted by Inria - Hypervisor with a dozen virtual machines - High density storage array (60 * 6TB => 300TB usable) - Copy in another server room; logical leader/follower mirroring - Soon to enable a mirror network to duplicate our contents **** Azure cloud (work in progress prototype) - full mirror using distributed object storage - workers for batch analyses and crawling *** Software **** 3rd party FOSS - Debian distribution, orchestrated with Puppet - PostgreSQL for metadata storage - RabbitMQ for task scheduling - Python3 and psycopg2 for the backend - Flask and Bootstrap for the web apps #+BEAMER: \\ $\to$ \alert{\footnotesize \url{https://www.softwareheritage.org/jobs/}} - Phabricator forge **** in-house FOSS - ~50 Git repositories (~20 Python packages, ~10 Puppet modules) - ~20 kSLOC Python / ~10 kSLOC SQL / ~1 kSLOC Puppet - licence choice: GPLv3 (backend) / AGPLv3 (frontend) - https://forge.softwareheritage.org/ *** Software architecture **** Module dependencies (internal + external) :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=swh-modules-deps-all,width=\linewidth :END: **** let's zoom in: http://deb.li/swhdeps ** Software development :noexport: :PROPERTIES: :CUSTOM_ID: development :END: *** Software development **** classic FOSS development - language: English - development mailing list #+BEAMER: \\{\small \url{https://sympa.inria.fr/sympa/info/swh-devel}} - IRC #+BEAMER: \\ #swh-devel / FreeNode - Forge #+BEAMER: \\{\small \url{https://forge.softwareheritage.org}} - Git, tasks, code review, etc. **** for more information #+BEAMER: \scriptsize https://www.softwareheritage.org/community/developers/ ** Roadmap :PROPERTIES: :CUSTOM_ID: features :END: *** Features... 
- (done) *lookup* by content hash - *browsing*: "wayback machine" for archived code - (done) via Web API - (todo) via Web UI - (todo) *download*: =wget= / =git clone= from the archive - (todo) *provenance information* for all archived content - (todo) *full-text search* on all archived source code files #+BEAMER: \pause *** ... and much more than one could possibly imagine all the world's software development history in a single graph! ** Web API :noexport: :PROPERTIES: :CUSTOM_ID: api :END: *** Web API (FOSDEM'17 release) :PROPERTIES: :CUSTOM_ID: apiintro :END: **** Fresh from the oven: first public version of our Web API\\ *\url{https://archive.softwareheritage.org/api/}* + #+BEAMER: \pause **** Features - - pointwise browsing of the Software Heritage archive + - pointwise *browsing* of the Software Heritage archive - … releases → revisions → directories → contents … - - full access to the metadata of archived objects - - crawling information + - full access to the *metadata* of archived objects + - *crawling* information - /when have you last visited this Git repository I care about?/ - /where were its branches/tags pointing to at the time?/ - derived information about archived contents (WIP) - MIME type, programming language, license, etc. + #+BEAMER: \pause **** Complete endpoint index \url{https://archive.softwareheritage.org/api/1/} *** A tour of the Web API --- origins & visits #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ git/url/https://github.com/hylang/hy { "id": 1, "origin_visits_url": "/api/1/origin/1/visits/", "type": "git", "url": "https://github.com/hylang/hy" } #+END_SRC #+BEAMER: \vfill #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visits/ [ ..., { "date": 1473851066.769266, "origin": 1, "origin_visit_url": "/api/1/origin/1/visit/13/", "status": "full", "visit": 13 }, ... 
] #+END_SRC *** A tour of the Web API --- snapshots #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visit/13/ { ..., "occurrences": { ..., "refs/heads/master": { "target": "b94211251...", "target_type": "revision", "target_url": "/api/1/revision/b94211251.../" }, "refs/tags/0.10.0": { "target": "7045404f3...", "target_type": "release", "target_url": "/api/1/release/7045404f3.../" }, ... }, "origin": 1, "origin_url": "/api/1/origin/1/", "status": "full", "visit": 13 } #+END_SRC *** A tour of the Web API --- releases :noexport: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/release/ \ 7045404f3d1c54e6473c71bbb716529fbad4be24/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte ", "id": 96, "name": "Paul Tagliamonte" }, "date": "2014-04-10T23:01:28-04:00", "message": "0.10: The Oh f*ck it's PyCon release", "name": "0.10.0", "synthetic": false, "target": "6072557b6...", "target_type": "revision", "target_url": "/api/1/revision/6072557b6.../", ... } #+END_SRC *** A tour of the Web API --- revisions #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/revision/ \ 6072557b6c10cd9a21145781e26ad1f978ed14b9/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte ", "id": 96, "name": "Paul Tagliamonte" }, "committer": { ... }, "date": "2014-04-10T23:01:11-04:00", "committer_date": "2014-04-10T23:01:11-04:00", "directory": "2df4cd84e...", "directory_url": "/api/1/directory/2df4cd84e.../", "history_url": "/api/1/revision/6072557b6.../log/", "merge": false, "message": "0.10: The Oh f*ck it's PyCon release", "parent_urls": [ "/api/1/revision/10149f66e.../" ], "parents": [ "10149f66e..." ], ... 
} #+END_SRC *** A tour of the Web API --- contents #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/content/ \ adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/ { "data_url": "/api/1/content/sha1:adc83b19e.../raw/", "filetype_url": "/api/1/content/sha1:.../filetype/", "language_url": "/api/1/content/sha1:.../language/", "length": 1, "license_url": "/api/1/content/sha1:.../license/", "sha1": "adc83b19e...", "sha1_git": "8b1378917...", "sha256": "01ba4719c...", "status": "visible" } #+END_SRC -#+BEAMER: \normalsize \vfill - Note: rate limits apply throughout the API +#+BEAMER: \normalsize \vfill \pause + - rate limits apply throughout the API + - blob download not available yet ** Some technical challenges :PROPERTIES: :CUSTOM_ID: techchallenges :END: *** Expanding the archive - discover and classify /all/ the software sources - importers for other VCSs (SVN, Hg, ...) \hfill /We need your help!/ *** Staying current get new repositories and commits ASAP\\ \hfill /We need reliable, standardised event feeds./ *** Handling the backlog ingesting all the pre-existing data\\ \hfill /Decades of software development are waiting!/ diff --git a/talks-public/2017-02-04-FOSDEM/2017-02-04-FOSDEM.org b/talks-public/2017-02-04-FOSDEM/2017-02-04-FOSDEM.org index 97cf3a1..756a069 100644 --- a/talks-public/2017-02-04-FOSDEM/2017-02-04-FOSDEM.org +++ b/talks-public/2017-02-04-FOSDEM/2017-02-04-FOSDEM.org @@ -1,215 +1,214 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage #+SUBTITLE: Preserving the Free Software Commons #+AUTHOR: Roberto Di Cosmo and Stefano Zacchiroli #+DATE: 4 February 2017 #+DESCRIPTION: Preserving the Free Software Commons #+KEYWORDS: software heritage legacy preservation knowledge mankind technology #+BEAMER_HEADER: \date[FOSDEM'17]{4 February 2017\\ FOSDEM'17\\ Brussels, Belgium} #+BEAMER_HEADER: \author[R. Di Cosmo, S. 
Zacchiroli]{Roberto Di Cosmo and Stefano Zacchiroli} # # prelude.org contains all the information needed to export the main beamer latex source # use prelude-toc.org to get the table of contents # #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 #+BEAMER_HEADER: \institute[Irill/INRIA/UPD]{\url{roberto@dicosmo.org, zack@upsilon.cc}} # #+LATEX_HEADER: \usepackage{enumitem} # # Part I: vision # * Software, Source Code, and the Software Commons ** Free Software is everywhere :PROPERTIES: :CUSTOM_ID: softwareispervasive :END: #+latex: \begin{center} #+ATTR_LATEX: :width .9\linewidth file:software-center.pdf #+latex: \end{center} # # Source code # #+INCLUDE: "../../common/modules/source-code-different-short.org::#softwareisdifferent" :minlevel 2 ** Our Software Commons #+INCLUDE: "../../common/modules/foss-commons.org::#commonsdef" :only-contents t #+BEAMER: \pause *** Source code is /a precious part/ of our commons \hfill we need to take care of it! # # Negative presentation (what we are missing) # # *** Our source code is /precious knowledge/ # \hfill are we taking care of it? 
# #+INCLUDE: "../../common/modules/swh-motivations.org::#main" :only-contents t :minlevel 2 # # The project # #+INCLUDE: "../../common/modules/swh-overview-sourcecode.org::#mission" :minlevel 2 # # Positive presentation (what we are building) # #+INCLUDE: "../../common/modules/swh-goals.org::#exhaustive" :minlevel 2 #+INCLUDE: "../../common/modules/swh-goals.org::#longterm" :minlevel 2 ** Our principles #+latex: \begin{center} #+ATTR_LATEX: :width .9\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} *** Open approach :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - 100% FOSS - transparency *** In for the long haul :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: - replication - non profit ** Archiving goals Targets: VCS repositories & source code releases (e.g., tarballs) *** We DO archive - file *content* (= blobs) - *revisions* (= commits), with full metadata - *releases* (= tags), ditto - - (project metadata) - where (*origin*) & when (*visit*) we found any of the above # - time-indexed repo *snapshots* (i.e., we never delete anything) … in a VCS-/archive-agnostic *canonical data model* *** We DON'T archive (UNIX philosophy) # - diffs → derived data from related contents - - homepages, wikis → collaboration with the Internet Archive + - homepages, wikis - BTS/issues/code reviews/etc. 
- mailing lists Long term vision: play our part in a /"semantic wikipedia of software"/ # # Part II: roadmap # * Where we are today: technical overview #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#giantdag" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2 # # Part III: # * Come in, we're open! ** You can help! *** Coding TO DO #+BEAMER: \vfill *** Join us - \url{www.softwareheritage.org/jobs} --- *job openings* - \url{wiki.softwareheritage.org} --- *internships* ** There is a whole lot to do! :noexport: #+latex: \begin{center} #+ATTR_LATEX: :width \extblockscale{\textwidth} file:SWH-as-foundation-block.png #+latex: \end{center} #+BEAMER: \pause *** Collect :B_exampleblock: :PROPERTIES: :BEAMER_env: exampleblock :BEAMER_COL: .3 :END: - discover + sources - harvest + protocols - ingest + VCS + data models *** Organise and Preserve :B_block: :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .4 :END: - enrich + metadata - analyze + traits - replicate + locations + technologies + stakeholders *** Share :B_block: :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .3 :END: - download - browse + wayback machine - search + facets - watch + trends #+BEAMER: \pause *** \hfill we need *your* help! 
** The Software Heritage community *** A small, but dedicated core team :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=team,width=.4\linewidth :END: *** Inria as initiator :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=inria-logo-new,leftpic=true,width=\extblockscale{.4\linewidth} :END: - .fr national CS research institution - strong FOSS culture, W3C founding partner # - creating a non profit, international organisation *** Supporters and /first partners/ *Société Générale, Microsoft, Huawei, Nokia Bell Labs, DANS,* ACM, Adullact, Creative Commons, Eclipse, Free Software Foundation, Open Source Initiative, GitHub, IEEE, OIN, OW2, Software Freedom Conservancy, SFLC, The Document Foundation, ... * Conclusion ** Conclusion *** Software Heritage is - a /reference archive/ of /all/ FOSS ever written # - a fantastic new tool for /research/ software - a unique /complement/ for /development platforms/ - an international, open, nonprofit, /mutualized infrastructure/ - at the service of our community, at the service of society *** Come in, we're open! \url{www.softwareheritage.org} --- /sponsoring/, /*job openings*/ \\ \url{wiki.softwareheritage.org} --- /*internships*/, /leads/ \\ \url{forge.softwareheritage.org} --- /*our own code*/ #+BEAMER: \vfill \flushright {\Huge Questions?} \vfill * FAQ :B_appendix: :PROPERTIES: :BEAMER_env: appendix :END: ** Q: how about SHA1 collisions? #+BEAMER: \lstinputlisting[language=SQL,basicstyle=\small]{../../common/source/swh-content.sql} ** Q: do you archive /only/ Free Software? - We only crawl origins /meant/ to host source code (e.g., forges) - Most (~90%) of what we /actually/ retrieve is textual content #+BEAMER: \vfill - Our goal: archive /the entire Free Software commons/ #+BEAMER: \vfill - Large parts of what we retrieve are /already/ Free Software, today - Most of the rest /will become/ Free Software in the long term - e.g., at copyright expiration
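Side note on the hash fields in the content endpoint example (sha1, sha1_git, sha256, length 1) and on the SHA1 collision question: the sketch below shows one way such a set of identifiers could be computed for a blob. It is only an illustration under stated assumptions. The helper name content_hashes is hypothetical and not part of any Software Heritage package. sha1_git is assumed to follow Git's blob convention (SHA-1 over the header "blob <length>\0" followed by the raw bytes). The one-byte input, a single newline, is a guess at the content behind the truncated digests on the slide.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute several digests for one blob (hypothetical helper, for illustration)."""
    # Assumption: sha1_git uses Git's blob convention, i.e. SHA-1 over
    # the header b"blob <length>\0" followed by the raw bytes.
    git_header = b"blob %d\x00" % len(data)
    return {
        "length": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


if __name__ == "__main__":
    # Assumed input: a one-byte file holding a single newline.
    for name, value in content_hashes(b"\n").items():
        print(f"{name}: {value}")
```

Keeping several independent digests per content means that a collision in one hash function alone would not make two different blobs indistinguishable, which is presumably the point of the full hash index mentioned in the data model slides.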