diff --git a/common/modules/status-extended.org b/common/modules/status-extended.org index 1e81467..b1d0135 100644 --- a/common/modules/status-extended.org +++ b/common/modules/status-extended.org @@ -1,410 +1,410 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 # not to be included as a whole, just pick individual slides as you see fit * Status :PROPERTIES: :CUSTOM_ID: main :END: ** The people :PROPERTIES: :CUSTOM_ID: people :END: *** The core team :B_picblock: :PROPERTIES: :CUSTOM_ID: core-team-formal :BEAMER_env: picblock :BEAMER_opt: pic=team,width=.4\linewidth :END: - Roberto Di Cosmo - Stefano Zacchiroli - Nicolas Dandrimont (Engineer) - Antoine Dumont (Engineer) # - and /Jordi, Quentin and Guillaume/ *** Scientific advisors - Serge Abiteboul (French Science Academy) - Jean-François Abramatic (former W3C director) - Gerard Berry (CNRS Gold Medal, French Science Academy) - Julia Lawall (Coccinelle, Linux Kernel, Outreachy) ** Archive coverage :PROPERTIES: :CUSTOM_ID: archive :END: #+BEAMER: \vspace{-2mm} - #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\linewidth}]{growth.png}\end{center} + #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\linewidth}]{2017-09-archive-growth.png}\end{center} #+BEAMER: \vspace{-2mm} *** Our current sources - GitHub - Debian, GNU - WIP: Gitorious, Google Code, Bitbucket #+BEAMER: \pause *** 150 TB blobs, 5 TB database (as a graph: 7 B nodes + 60 B edges) #+BEAMER: \pause *** \hfill The /richest/ source code archive already, ... and growing daily! 
** The structure of the archive :noexport: *** On-disk storage - flat file storage for contents - postgres database for the metadata *** Data model: /one/ big Merkle DAG, inspired by the git model - Origins (= repositories) - Occurrences (= branches) - Releases (= tags) - Revisions (= commits) - Directories (= trees) - Contents (= blobs) ** Archiving goals :PROPERTIES: :CUSTOM_ID: archivinggoals :END: Targets: VCS repositories & source code releases (e.g., tarballs) *** We DO archive - file *content* (= blobs) - *revisions* (= commits), with full metadata - *releases* (= tags), ditto - where (*origin*) & when (*visit*) we found any of the above # - time-indexed repo *snapshots* (i.e., we never delete anything) … in a VCS-/archive-agnostic *canonical data model* *** We DON'T archive # - diffs → derived data from related contents - homepages, wikis - BTS/issues/code reviews/etc. - mailing lists Long term vision: play our part in a /"semantic wikipedia of software"/ ** Architecture :PROPERTIES: :CUSTOM_ID: architecture :END: *** Data flow :PROPERTIES: :CUSTOM_ID: dataflow :END: # #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.2\textwidth}]{swh-dataflow.pdf}\end{center} ** Data model :noexport: *** General schema - VCS-independent - fully deduplicated + files, directories and commits are /shared/ - biggest git-like /graph/ in the world *** \begin{center} \url{http://deb.li/swhdm} \end{center} *** full hash index (sha1, sha256, ...) Some funny facts: - the GPL2 licence appears under more than 500 names + including /aa.css.txt/ and /FullSync.txt/ ~ :-) ** Merkle DAG *** Merkle structure :PROPERTIES: :CUSTOM_ID: merkle :END: **** Merkle trees :PROPERTIES: :CUSTOM_ID: merkletree :END: # R. C. Merkle, A digital signature based on a conventional encryption # function, Crypto '87 #+BEAMER: \vspace{-3mm} ***** Merkle tree (R. C. 
Merkle, Crypto 1979) :B_picblock: :PROPERTIES: :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth :BEAMER_env: picblock :BEAMER_act: :END: Combination of - tree - hash function #+BEAMER: \pause #+BEAMER: \footnotesize ***** Classical cryptographic construction - fast, parallel signature of large data structures - widely used (e.g., Git, blockchains, IPFS, ...) - built-in deduplication **** The archive in a few pictures :PROPERTIES: :CUSTOM_ID: merkledemo :END: ***** A giant (extended) Merkle DAG #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/contents.pdf}}} #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_2_contents.pdf}}} #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/directories.pdf}}} #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_3_directories.pdf}}} #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/revisions.pdf}}} #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_4_revisions.pdf}}} #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/releases.pdf}}} #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_5_releases.pdf}}} # #+LATEX: {\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} *** A revision node :PROPERTIES: :CUSTOM_ID: merklerevision :END: **** Example: a Software Heritage revision ***** #+BEAMER: \vspace{-.5cm}\centering\includegraphics[width=0.9\textwidth]{git-merkle/revisions} ***** Note: most object kinds currently have Git-compatible identifiers *** Giant DAG 
:PROPERTIES: :CUSTOM_ID: giantdag :END: **** The archive: a (giant) Merkle DAG # Using an empty frame because the image is difficult to read on swh bg. # Finding a way to override image bg for just this frame would be better. ***** #+BEAMER: \centering \includegraphics[width=\extblockscale{\textwidth}]{git-merkle/merkle_5_releases} *** Giant DAG (single slide) :PROPERTIES: :CUSTOM_ID: giantdag1slide :END: **** The Software Heritage archive: a gigantic Merkle DAG #+LATEX: \centering #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_1}}} #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/contents}}} #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_2_contents}}} #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/directories}}} #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_3_directories}}} #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/revisions}}} #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_4_revisions}}} #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/releases}}} #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_5_releases}}} ** Technology :noexport: :PROPERTIES: :CUSTOM_ID: technology :END: *** Software stack **** 3rd party - Debian, Puppet - PostgreSQL for metadata storage, with barman & pglogical - Celery (RabbitMQ backend) for task scheduling - Python3 and psycopg2 for the backend - Flask and Bootstrap for Web stuff - Phabricator **** in house - /ad hoc/ object storage (to avoid imposing tech to mirrors) - data model implementation, listers, loaders, scheduler - ~50 Git repositories (~20 Python packages, ~10 Puppet modules) - ~30 kSLOC Python / ~12 kSLOC SQL / ~4 kSLOC Puppet - licence choice: GPLv3 (backend) / 
AGPLv3 (frontend) *** Hardware stack **** in house - 2x hypervisors with ~20 VMs - 2x high density storage array (60 * 6TB => 300TB usable) **** on Azure - full object storage mirror - workers for content indexing *** Software architecture :noexport: **** Module dependencies (internal + external) :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=swh-modules-deps-all,width=\linewidth :END: **** let's zoom in: http://deb.li/swhdeps ** Software development :noexport: :PROPERTIES: :CUSTOM_ID: development :END: *** Software development **** classic FOSS development - language: English - development mailing list #+BEAMER: \\{\small \url{https://sympa.inria.fr/sympa/info/swh-devel}} - IRC #+BEAMER: \\ #swh-devel / FreeNode - Forge #+BEAMER: \\{\small \url{https://forge.softwareheritage.org}} - Git, tasks, code review, etc. **** for more information #+BEAMER: \scriptsize https://www.softwareheritage.org/community/developers/ ** Roadmap :PROPERTIES: :CUSTOM_ID: features :END: *** Features... - (done) *lookup* by content hash - *browsing*: "wayback machine" for archived code - (done) via Web API - (todo) via Web UI - (todo) *download*: =wget= / =git clone= from the archive - (todo) *deposit* of source code bundles directly to the archive - (todo) *provenance* lookup for all archived content - (todo) *full-text search* on all archived source code files #+BEAMER: \pause *** ... and much more than one could possibly imagine all the world's software development history in a single graph! 
** Web API :noexport: :PROPERTIES: :CUSTOM_ID: api :END: *** Web API :PROPERTIES: :CUSTOM_ID: apiintro :END: **** First public version of our Web API (Feb 2017) \\ *\url{https://archive.softwareheritage.org/api/}* **** Features - pointwise *browsing* of the Software Heritage archive - … releases → revisions → directories → contents … - full access to the *metadata* of archived objects - *crawling* information - /when have you last visited this Git repository I care about?/ - /where were its branches/tags pointing to at the time?/ # - derived information about archived contents (WIP) # - MIME type, programming language, license, etc. **** Complete endpoint index \url{https://archive.softwareheritage.org/api/1/} *** A tour of the Web API --- origins & visits :PROPERTIES: :CUSTOM_ID: apitourvisits :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ git/url/https://github.com/hylang/hy { "id": 1, "origin_visits_url": "/api/1/origin/1/visits/", "type": "git", "url": "https://github.com/hylang/hy" } #+END_SRC #+BEAMER: \vfill #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visits/ [ ..., { "date": "2016-09-14T11:04:26.769266+00:00", "origin": 1, "origin_visit_url": "/api/1/origin/1/visit/13/", "status": "full", "visit": 13 }, ... ] #+END_SRC *** A tour of the Web API --- snapshots :PROPERTIES: :CUSTOM_ID: apitoursnapshots :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visit/13/ { ..., "occurrences": { ..., "refs/heads/master": { "target": "b94211251...", "target_type": "revision", "target_url": "/api/1/revision/b94211251.../" }, "refs/tags/0.10.0": { "target": "7045404f3...", "target_type": "release", "target_url": "/api/1/release/7045404f3.../" }, ... 
}, "origin": 1, "origin_url": "/api/1/origin/1/", "status": "full", "visit": 13 } #+END_SRC *** A tour of the Web API --- releases :noexport: :PROPERTIES: :CUSTOM_ID: apitourreleases :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/release/ \ 7045404f3d1c54e6473c71bbb716529fbad4be24/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte <tag@pault.ag>", "id": 96, "name": "Paul Tagliamonte" }, "date": "2014-04-10T23:01:28-04:00", "message": "0.10: The Oh f*ck it's PyCon release", "name": "0.10.0", "synthetic": false, "target": "6072557b6...", "target_type": "revision", "target_url": "/api/1/revision/6072557b6.../", ... } #+END_SRC *** A tour of the Web API --- revisions :PROPERTIES: :CUSTOM_ID: apitourrevisions :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/revision/ \ 6072557b6c10cd9a21145781e26ad1f978ed14b9/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte <tag@pault.ag>", "id": 96, "name": "Paul Tagliamonte" }, "committer": { ... }, "date": "2014-04-10T23:01:11-04:00", "committer_date": "2014-04-10T23:01:11-04:00", "directory": "2df4cd84e...", "directory_url": "/api/1/directory/2df4cd84e.../", "history_url": "/api/1/revision/6072557b6.../log/", "merge": false, "message": "0.10: The Oh f*ck it's PyCon release", "parents": [ { "id": "10149f66e...", "url": "/api/1/revision/10149f66e.../" } ], ...
} #+END_SRC *** A tour of the Web API --- contents :PROPERTIES: :CUSTOM_ID: apitourcontents :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/content/ \ adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/ { "data_url": "/api/1/content/sha1:adc83b19e.../raw/", "filetype_url": "/api/1/content/sha1:.../filetype/", "language_url": "/api/1/content/sha1:.../language/", "length": 1, "license_url": "/api/1/content/sha1:.../license/", "sha1": "adc83b19e...", "sha1_git": "8b1378917...", "sha256": "01ba4719c...", "status": "visible" } #+END_SRC #+BEAMER: \normalsize \vfill \pause **** Caveats - rate limits apply throughout the API - blob download available for selected contents ** Some technical challenges :PROPERTIES: :CUSTOM_ID: techchallenges :END: *** Expanding the archive - discover and classify /all/ the software sources - importers for other VCSs (SVN, Hg, ...) \hfill /We need your help!/ *** Staying current get new repositories and commits ASAP\\ \hfill /We need reliable, standardised event feeds./ *** Handling the backlog ingesting all the pre-existing data\\ \hfill /Decades of software development are waiting!/ diff --git a/common/modules/swh-organisation-roadmap.org b/common/modules/swh-organisation-roadmap.org index 8c0f833..f93c60e 100644 --- a/common/modules/swh-organisation-roadmap.org +++ b/common/modules/swh-organisation-roadmap.org @@ -1,79 +1,79 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) # # Software is all around us # #+INCLUDE: "prelude.org" :minlevel 1 * Growing a sustainable organisation :PROPERTIES: :CUSTOM_ID: main :END: ** Growing a sustainable common digital infrastructure :PROPERTIES: :CUSTOM_ID: phases :END: #+latex: \setbeamercolor{block title alert}{bg=SwhLightRed,fg=white} #+latex: \setbeamercolor{block title}{bg=SwhLightRed!75,fg=white} #+latex: \setbeamercolor{block title example}{bg=SwhLightRed!50,fg=white} 
#+latex: \setbeamercolor{block body example}{bg=SwhLightRed!10,fg=black} #+latex: \setbeamercolor{block body alert}{bg=SwhLightRed!10,fg=black} #+latex: \setbeamercolor{block body}{bg=SwhLightRed!10,fg=black} # #+latex: \begin{center} # #+ATTR_LATEX: :width \extblockscale{\textwidth} # file:SWH-as-foundation-block.png # #+latex: \end{center} # #+BEAMER: \pause -*** Ignition (3 to 5 Y) \alert{\em Inria} :B_exampleblock: +*** Ignition (3 Y) \alert{\em Inria} :B_exampleblock: :PROPERTIES: :BEAMER_env: exampleblock :BEAMER_COL: .3 :BEAMER_ACT: +- :END: - Project design + SOTA + plan - Vision + focus + collaboration + openness - Core resources + team, infra - Legitimacy + prototype + awareness -*** Scale up (4 to 8 Y) :B_block: +*** Scale up (5 Y) :B_block: :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .35 :BEAMER_ACT: +- :END: - Engineer + automatise + clear 50 Y backlog - Extend + coverage - Connect + research, industry + culture, education - Build community + mirrors + partners + contributors *** Stable Operation :B_block: :PROPERTIES: :BEAMER_env: alertblock :BEAMER_COL: .38 :BEAMER_ACT: +- :END: - Maintain + archive, community + bylaws, organisation - Interact+Engage + research + industry + education + culture - Sustainability + /key/ \alert{infrastructure} + /ecosystem/ \alert{diversity} + /foundation/ \alert{endowment} diff --git a/talks-public/2017-09-19-RDA-IG/2017-09-19-RDA-IG-intro.org b/talks-public/2017-09-19-RDA-IG/2017-09-19-RDA-IG-intro.org index c3b9378..65759a6 100644 --- a/talks-public/2017-09-19-RDA-IG/2017-09-19-RDA-IG-intro.org +++ b/talks-public/2017-09-19-RDA-IG/2017-09-19-RDA-IG-intro.org @@ -1,214 +1,214 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Source Code Interest Group #+SUBTITLE: Introduction # does not allow short title, so we override it for beamer as follows : #+BEAMER_HEADER: \title[Software Source Code Interest Group 
(CC-BY-SA 4.0)]{Software Source Code Interest Group} #+BEAMER_HEADER: \author[{\bf Roberto Di Cosmo (INRIA)}, Neil Chue Hong (SSI)]{{\bf Roberto Di Cosmo (INRIA)}, Neil Chue Hong (SSI)} #+AUTHOR: *Roberto Di Cosmo (Inria)*, Neil Chue Hong (SSI) #+DATE: September 19th, 2017 #+EMAIL: roberto@dicosmo.org #+DESCRIPTION: Preserving the technological knowledge of mankind #+KEYWORDS: software heritage legacy preservation knowledge mankind technology # # # Prelude contains all the information needed to export the main beamer latex source # #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 # #+INCLUDE: "../../common/modules/169.org" ** Why we are here *** Software is /an essential component/ of modern scientific research :B_picblock: # Deep knowledge embodied in complex software systems :PROPERTIES: :BEAMER_opt: pic=papermountain,width=.25\linewidth,leftpic=true :BEAMER_env: picblock :BEAMER_act: +- :END: Top 100 papers (Nature, October 2014)\\ #+BEGIN_QUOTE [...] the vast majority describe experimental methods or software that have become essential in their fields.\\ #+END_QUOTE http://www.nature.com/news/the-top-100-papers-1.16224 #+BEAMER: \pause *** The /source code/ is essential - it contains the /real knowledge/, - it is currently poorly accounted for ** Reminder: the /source code/ of a software artefact :PROPERTIES: :CUSTOM_ID: thesourcecode :END: #+LATEX: \includegraphics[width=.10\linewidth]{software.png} #+BEGIN_QUOTE “The source code for a work means the preferred form of the work for making modifications to it.”
\hfill GPL Licence #+END_QUOTE #+Beamer: \pause *** :PROPERTIES: :BEAMER_env: block :BEAMER_act: +- :END: #+latex: \begin{center} Hello World \end{center} *** Program (excerpt of binary) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.5 :BEAMER_env: block :BEAMER_act: +- :END: #+begin_src hex :exports code 4004e6: 55 4004e7: 48 89 e5 4004ea: bf 84 05 40 00 4004ef: b8 00 00 00 00 4004f4: e8 c7 fe ff ff 4004f9: 90 4004fa: 5d 4004fb: c3 #+end_src *** Program (source code) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.55 :BEAMER_env: block :BEAMER_act: +- :END: #+begin_src c :exports code /* Hello World program */ #include <stdio.h> void main() { printf("Hello World"); } #+end_src -** R1: Software Source Code is /special/ +** Software Source Code is /special/ :PROPERTIES: :CUSTOM_ID: softwareisdifferent :END: *** Harold Abelson, Structure and Interpretation of Computer Programs /“Programs must be written for people to read, and only incidentally for machines to execute.”/ *** Quake 2 source code (excerpt) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.45 :BEAMER_env: block :END: #+LATEX: \includegraphics[width=\linewidth]{quake-carmack-sqrt-1.png} # smart efficient implementation of 1/sqrt(x) on a CPU without special support *** Net. queue in Linux (excerpt) :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.45 :BEAMER_env: block :END: #+LATEX: \includegraphics[width=\linewidth]{juliusz-sfb-short.png} # Juliusz implementation of stochastic fair blue in the Linux Kernel linux/net/sched/sch_sfb.c *** :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: *** Len Shustek, Computer History Museum \hfill /“Source code provides a view into the mind of the designer.”/ -** R2: Source code is not ... just data +** Source code is not ...
just data #+BEAMER: \pause *** /executable/ and /human readable/ knowledge (an /all time new/) + written /by humans for humans/ + formats not really an issue: /text files are forever/ #+BEAMER: \pause *** the /development history/ is key to its /understanding/ + version history + literate programming #+BEAMER: \pause *** complexity: + large /web of dependencies/ + millions of SLOCs #+BEAMER: \pause *** \hfill *Bottomline:* software source code /is not just another/ sequence of bits -** R3: we are not taking care of it +** we are not taking care of it *** No universal catalog :B_block: :PROPERTIES: :BEAMER_COL: .4 :BEAMER_env: block :END: #+ATTR_LATEX: :width \extblockscale{\linewidth} file:myriadsources.png *** No universal archive :B_block: :PROPERTIES: :BEAMER_COL: .4 :BEAMER_env: block :END: #+ATTR_LATEX: :width \extblockscale{\linewidth} file:fragilecloud.png #+BEAMER:\pause *** The Knowledge Conservancy Magic Triangle :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=PreservationTriangle, leftpic=true, width=.4\linewidth :END: - Articles: HAL, ArXiv, 100s of inst. repositories - Data: Zenodo, Figshare, 100s of various repositories - Software: \pause - + *R4*: GitHub does not fit the bill \pause - + *R5*: we want to avoid duplication of efforts + + GitHub does not fit the bill \pause + + we want to avoid duplication of efforts ** Question 6: what are key properties for Software Source code archives? :noexport: *** Availability :PROPERTIES: :BEAMER_act: +- :END: - /all/ the /history/ of /all/ the software - no restrictions (technical, legal, ... ) on /content/ or /metadata/ *** Traceability :PROPERTIES: :BEAMER_act: +- :END: # - /unique/ identifiers : /one/ name for each object - /persistent/ and /intrinsic/ identifiers : no DOI, no URL, no middle man, no dangling pointers! 
- full /provenance/ information *** Uniformity :PROPERTIES: :BEAMER_act: +- :END: - one /standard/ metadata structure, /irrespective of the origins/ - /uniform/ naming /schema/ ** RDA is a good place for starting the conversation on... *** Metadata :PROPERTIES: :BEAMER_act: +- :END: - what kinds of /ontologies/ exist for software? - what would be appropriate for Source Code? *** Use cases :PROPERTIES: :BEAMER_act: +- :END: - discovery - citation - classification - documentation, ... *** Relation to professional software development :PROPERTIES: :BEAMER_act: +- :END: - is scientific software different from, say, usual open source software? - can we learn from the experience of millions of open source developers? -** Objectives and Agenda +** Objectives and Agenda :noexport: *** Objectives - \alert{metadata} frameworks for source code - analyze and identify gaps - collect \alert{use cases} *** Agenda 1. Introduction (done) 2. Overview of metadata frameworks for source code 3. Parallel discussion and gap identification 4. Collection of potential use cases 5.
Summary of results and wrap up * References -** Reminder +** Reminder :noexport: *** RDA SCIG page [[https://www.rd-alliance.org/groups/software-source-code-ig][https://www.rd-alliance.org/groups/software-source-code-ig]] *** Working document used during the session [[https://bit.ly/RDA10SoftwareIGNotes][https://bit.ly/RDA10SoftwareIGNotes]] diff --git a/talks-public/2017-09-19-RDA-PID/2017-09-19-RDA-PID.org b/talks-public/2017-09-19-RDA-PID/2017-09-19-RDA-PID.org index ab3bf41..2cefc84 100644 --- a/talks-public/2017-09-19-RDA-PID/2017-09-19-RDA-PID.org +++ b/talks-public/2017-09-19-RDA-PID/2017-09-19-RDA-PID.org @@ -1,121 +1,157 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Identifying 3.5 billions source code files #+SUBTITLE: Intrinsic identifiers, the Software Heritage experience # does not allow short title, so we override it for beamer as follows : # +BEAMER_HEADER: \title[Availability and traceability]{Preserving Software and Data} #+BEAMER_HEADER: \author[Roberto Di Cosmo]{Roberto Di Cosmo (Software Heritage, INRIA)} #+AUTHOR: Roberto Di Cosmo (Software Heritage, Inria) #+DATE: September 19th, 2017 #+EMAIL: roberto@dicosmo.org #+DESCRIPTION: Intrinsic identifiers for digital objects #+KEYWORDS: software heritage legacy preservation knowledge mankind technology # # # Prelude contains all the information needed to export the main beamer latex source # #+INCLUDE: "../../common/modules/prelude.org" :minlevel 1 # #+INCLUDE: "../../common/modules/169.org" * The quest for a PID ** The Software Heritage Project \hfill www.softwareheritage.org :PROPERTIES: :CUSTOM_ID: mission :END: #+latex: \begin{center} #+ATTR_LATEX: :width \linewidth # file:SWH-logo+motto.pdf file:SWH-logo.pdf #+latex: \end{center} *** Our mission *Collect*, *preserve* and *share* the /source code/ of /all the software/ that is publicly available *** Past, present and future \hfill 
/Preserving/ the past, /enhancing/ the present, /preparing/ the future \hfill # Better society, better education, better science, better industry -# +# +** Our principles +#+latex: \begin{center} +#+ATTR_LATEX: :width .7\linewidth +file:SWH-as-foundation-slim.png +#+latex: \end{center} +#+BEAMER: \pause +*** Open approach :B_block:BMCOL: + :PROPERTIES: + :BEAMER_col: 0.3 + :BEAMER_env: block + :END: + open source, transparency +*** Unix philosophy :B_block:BMCOL:noexport:noexport: + :PROPERTIES: + :BEAMER_opt: + :BEAMER_env: block + :BEAMER_col: 0.3 + :END: + - do /one/ thing + - do it /well/ +*** In for the long haul :B_block:BMCOL: + :PROPERTIES: + :BEAMER_col: 0.3 + :BEAMER_env: block + :END: + non profit, replication ** Archive coverage :PROPERTIES: :CUSTOM_ID: archive :END: #+BEAMER: \vspace{-2mm} *** Our sources - GitHub --- full, up-to-date mirror - Debian --- daily snapshots of all suites since 2005--2015 - GNU --- all releases as of August 2015 - - Gitorious, Google Code --- processing (Archive Team & Google) + - Gitorious, Google Code --- almost done (Archive Team & Google) - Bitbucket --- WIP #+BEAMER: \pause #+BEAMER: \vspace{-1mm} *** Some numbers #+latex: \centering #+ATTR_LATEX: :width \extblockscale{.8\linewidth} file:growth.png #+latex: \footnotesize\vspace{-3mm} - 150 TB blobs, 5 TB database (as a graph: 7 B nodes + 60 B edges) - #+BEAMER: \vspace{-1mm} + as a graph: 7 B nodes + 60 B edges + #+BEAMER: \vspace{-2mm} *** \hfill The /richest/ source code archive already, ... and growing daily! ** Our challenge in the PID arena *** Long term Identifiers must be there for the long term *** No middle man Identifiers must be meaningful even if resolvers go away *** Integrity, not just naming Identifier must ensure that the retrieved object is the intended one +*** Uniqueness by design + only one name for each object, each object has only one name ** Exploring the PID landscape *** A lot of options out there... URL, URI, PURL, URN, ARK, DOI, ... -*** ... 
used out of (the original) scope - promoted for all data and software artefacts +*** ... some are widely used + - articles + - data + - even software artefacts! #+BEAMER: \pause -*** And yet, ... we can get no satisfaction - of all the key criteria -** The Software Heritage approach +*** We can get no satisfaction + \hfill of all the key criteria +#+BEAMER: \pause +*** + \hfill we adopted something radically different \hfill +** Intrinsic identifiers in Software Heritage # R. C. Merkle, A digital signature based on a conventional encryption # function, Crypto '87 #+BEAMER: \vspace{-3mm} ***** Merkle tree (R. C. Merkle, Crypto 1979) :B_picblock: :PROPERTIES: :BEAMER_opt: pic=merkle, leftpic=true, width=.5\linewidth :BEAMER_env: picblock :BEAMER_act: :END: Combination of - tree - hash function ***** Classical cryptographic construction fast, parallel signature of large data structures, built-in deduplication -***** No (longer) rocket science +#+BEAMER: \pause - satisfies all three criteria - - widely used in industry (e.g., Git, blockchains, IPFS, ...) + - widely used in industry (e.g., Git, nix, blockchains, IPFS, ...) ** Back to basics: DIOs vs. IDOs - Where does the confusion come from? *** DIO (digital identifier of an object) - digital identifiers for traditional (non digital) objects - - with all the epistemic complications + - epistemic complications (manifestations, versions, locations, etc.) + - significant governance issues, ... 
#+BEAMER: \pause *** IDO (identifier of a digital object) - (digital) identifier for digital objects - much simpler to build/handle + - can (and must) be intrinsic #+BEAMER: \pause *** Separation of concerns - yes, we \alert{need both} DIOs and IDOs - - no, we \alert{must not mix} DIOs and IDOs + - no, we \alert{must not mistake} DIOs for IDOs (and vice versa) #+BEAMER: \pause ** Working together *** Example: links to /software source code/ in an article - Leveraging Software Heritage as universal archive: + Leveraging the Software Heritage universal archive: - set of files :: \small\url{swh:1:tree:06741c8c37c5a384083082b99f4c5ad94cd0cd1f}\\ id of tree object listing all the files in a project (at a given time) - revision :: \url{swh:1:rev:7598fb94d59178d65bd8d2892c19356290f5d4e3}\\ id of a commit object, which references a tree and (a pointer to) the history +#+BEAMER: \pause - metadata :: this /will/ involve some form of DIO + - and we get all the complications back #+BEAMER: \pause *** Come in, we're open - http://www.softwareheritage.org - - + http://www.softwareheritage.org \hfill (position paper at iPres 2017) +** A look at the internals + #+INCLUDE: "../../common/modules/status-extended.org::#merkledemo" :only-contents t
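Note on intrinsic identifiers: the contents endpoint in the Web API tour returns three checksums per blob (`sha1`, `sha1_git`, `sha256`), and the slides state that most object kinds currently have Git-compatible identifiers. The following is a minimal Python sketch of how such identifiers can be recomputed from the bytes alone, with no resolver involved. The helper name `content_hashes` is hypothetical, and the sketch assumes that `sha1_git` follows the Git blob convention (SHA-1 over a `blob <length>\0` header followed by the data); it is an illustration, not the archive's actual implementation.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute intrinsic checksums for a blob (hypothetical helper).

    Assumption: "sha1_git" is the Git blob identifier, i.e. the SHA-1 of
    the header b"blob <length>\\0" followed by the raw bytes.
    """
    git_header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return {
        "length": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# A one-byte blob holding a single newline character.
hashes = content_hashes(b"\n")
for name, value in hashes.items():
    print(name, value)
```

For the one-byte input `b"\n"`, the three digests begin with `adc83b19e`, `8b1378917` and `01ba4719c`, which agrees with the prefixes and the `"length": 1` field in the contents response shown in the slides. Anyone holding the bytes can redo this computation, which is the sense in which the identifier checks integrity as well as naming.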