diff --git a/talks-public/2018-03-12-team/2018-03-12-team.org b/talks-public/2018-03-12-team/2018-03-12-team.org index 2d3489a..9c45591 100644 --- a/talks-public/2018-03-12-team/2018-03-12-team.org +++ b/talks-public/2018-03-12-team/2018-03-12-team.org @@ -1,725 +1,851 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+TITLE: Software Heritage #+SUBTITLE: Vision and outlook #+AUTHOR: Roberto Di Cosmo -#+DATE: 18/7/2017 +#+DATE: 12/3/2018 #+EMAIL: roberto@dicosmo.org #+DESCRIPTION: Preserving the technological knowledge of mankind #+KEYWORDS: software heritage legacy preservation knowledge mankind technology #+BEAMER_HEADER: \title[Strategic team meeting]{Software Heritage: vision and outlook} -#+BEAMER_HEADER: \date[18/7/2017]{July 18th 2017\\ Paris} +#+BEAMER_HEADER: \date[12/3/2018]{March 12th 2018\\ Paris} #+LATEX_HEADER: \usepackage{color} #+LATEX_HEADER: \usepackage{colortbl} #+LATEX_HEADER: \usepackage[table]{xcolor}% http://ctan.org/pkg/xcolor #+LATEX_HEADER: \usepackage{array} #+LATEX_HEADER: \usepackage{supertabular} # # prelude.org contains all the information needed to export the main beamer latex source # use prelude-toc.org to get the table of contents # #+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 #+INCLUDE: "../../common/modules/169.org" # # Some context: where we come from # # +INCLUDE: "../../common/modules/mancoosi-background.org::#main" :minlevel 1 # # Basic properties for software studies # # +INCLUDE: "../../common/modules/software-studies-stepback-properties.org::#main" :minlevel 2 :only-contents t * Context and motivations ** Software Heritage in a nutshell #+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :only-contents t :minlevel 3 ** Why now *** Looking at the past - a lot of old software misplaced, lost, or behind barriers, but... 
- most founding fathers are still here, and willing to share - \alert{urgent} to collect their knowledge \hfill Only a few years left. #+BEAMER: \pause *** Looking at the future - software development skyrockets - \alert{essential} to provide a platform for the future \hfill Every year that goes by makes the problem worse. ** Approach and principles \hfill \url{http://bit.ly/swhpaper} #+latex: \begin{center} #+ATTR_LATEX: :width 0.8\linewidth file:SWH-as-foundation-slim.png #+latex: \end{center} #+BEAMER: \pause *** Technology :PROPERTIES: :BEAMER_col: 0.34 :BEAMER_env: block :END: - transparency and FOSS - - replication all the way down + - replication all around *** Content :PROPERTIES: :BEAMER_col: 0.32 :BEAMER_env: block :END: - intrinsic identifiers - facts and provenance *** Organization :PROPERTIES: :BEAMER_col: 0.33 :BEAMER_env: block :END: - non-profit - multi-stakeholder ** A great ambition... in a few taglines *** Culture (catalog+archive) \hfill The Library of Alexandria of Source Code +*** Science (pillar of Open Science) + \hfill The reference archive of research software *** Science (research instrument) \hfill The CERN of Computer Science *** Industry (reference catalog) \hfill The universal software knowledge base * Key properties, and principles ** Three properties are key for Software Heritage's mission :PROPERTIES: :CUSTOM_ID: keyproperties :END: *** Availability :PROPERTIES: :BEAMER_act: +- :END: - /all/ the /history/ of /all/ the software - no restrictions (technical, legal, ... ) on /content/ or /metadata/ *** Traceability :PROPERTIES: :BEAMER_act: +- :END: - know /what/ we get, /when/, from /where/ and /how/ - [ ] /persistent/ and /intrinsic/ identifiers : no middle man, no dangling pointers! 
*** Uniformity :PROPERTIES: :BEAMER_act: +- :END: - one /standard/ metadata structure, /irrespective of the origins/ - /uniform/ naming /schema/ ** Software Heritage's approach :PROPERTIES: :CUSTOM_ID: keyproperties :END: *** Availability :PROPERTIES: :BEAMER_act: +- :END: - collect /all/ software from /all/ possible places - /replicate/ the archive in a network of mirrors *** Traceability :PROPERTIES: :BEAMER_act: +- :END: - keep /provenance/ information, systematically + [ ] keep incoming sources until full testing succeeds (and more if possible) - /unique/ identifiers : use /cryptographic hashes/, derived from the software itself - - [ ] *NEW*: accountability /for all changes/ (see [[https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf][CRDT]] Shapiro et al., blockchains) + + [ ] *NEW*: accountability /for all changes/ (see [[https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf][CRDT]] Shapiro et al., blockchains) *** Uniformity :PROPERTIES: :BEAMER_act: +- :END: - version control data model designed to /represent all the others/ * Yes, we really mean all the source code ** All the source code #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{\linewidth}]{swh-collect-axes}\end{center} ** All the source code, strategies #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{\linewidth}]{swh-collect-strategies}\end{center} ** Strategy to collect all the source code *** Different unit cost for each sector #+BEGIN_EXPORT latex \begin{center} \tablefirsthead{} \tablehead{} \tabletail{} \tablelasttail{} \begin{supertabular}{|c|c|c|} \cline{2-3} %\rowcolor{blue!25} \multicolumn{1}{c}{~} & \multicolumn{1}{|c|}{\cellcolor{yellow}Closed} & \multicolumn{1}{c|}{\cellcolor{yellow}Open}\\\hline \cellcolor{yellow} Online & SWH: {\bf \$\$}, ~~~ extern: {\bf \$\$} & \cellcolor{yellow} SWH: {\bf \$}, ~~~ extern: {\bf \$} \\\hline \cellcolor{yellow} Offline & SWH:{\bf \$\$}, ~~~ extern: {\bf \$\$\$} & SWH:{\bf \$}, ~~~ extern: {\bf \$\$} \\\hline 
\end{supertabular} \end{center} #+END_EXPORT #+BEAMER: \pause *** Different approaches for each sector :noexport: #+BEGIN_EXPORT latex \begin{center} \tablefirsthead{} \tablehead{} \tabletail{} \tablelasttail{} \begin{supertabular}{|c|c|c|} \cline{2-3} %\rowcolor{blue!25} \multicolumn{1}{c}{~} & \multicolumn{1}{|c|}{\cellcolor{yellow}Open} & \multicolumn{1}{c|}{\cellcolor{yellow}Proprietary}\\\hline \cellcolor{yellow} Current and future & \cellcolor{yellow}{{\bf Automation}} & {\bf Embargo} \\\hline \cellcolor{yellow} Legacy & {\bf Crowdsourcing} & {\bf Focused search} \\\hline \end{supertabular} \end{center} #+END_EXPORT #+BEAMER: \pause # IMPACTS *** We started on the first quadrant, we need all four! - [ ] *technical*: security, identification, authorization, access control - *legal*: policies, contracts - *community*: network, standards, endorsement #+BEAMER: \pause *** Important technical issues - [ ] setup space for "/collections/" (staging area waiting for curation) + make it simple for contributors to donate! - [ ] keep the embargo/takedown issue in mind #+INCLUDE: "../../common/modules/swh-functional-architecture.org::#phases" :minlevel 2 * Community is essential -** TODO! - - share part: API/hooks? 
/ feed (mirrors) - - functionalities (see sponsor meeting) - - collection staging area # IMPACTS -*** Daunting task - - challenge: extreme variability of sources and technologies - - opportunity: highly parallelisable, /if we provide good abstractions/ +** A daunting task: + - challenge :: extreme variability of sources and technologies + - opportunity :: highly parallelisable, /if we provide good abstractions/ + and welcome contributors #+BEAMER: \pause -*** Collect phase entry points - - forge listers (e.g.: Avi's and Sushant's work) - - forge protocol extensions (e.g.: Adullact's work on FusionForge) - - VCS loaders (e.g.: Avi's work) - - Web crawler connection (e.g.: Internet Archive discussions) +*** Collect entry points :B_block: + :PROPERTIES: + :BEAMER_COL: .43 + :BEAMER_env: block + :END: + - listers (see Avi's blog post) + - protocols (Adullact+FusionForge) + - [ ] VCS loaders (e.g.: Avi's work) + - [ ] Web crawlers (IA, Qwant) + - [ ] curation of the collections #+BEAMER: \pause -*** Archive phase entry points - - storage and indexing backends - - application specific data representations - +*** Preserve entry points + :PROPERTIES: + :BEAMER_COL: .3 + :BEAMER_env: block + :END: + - [ ] mirrors + - [ ] storage and indexing backends + - [ ] event feeds + - [ ] data compression +*** Share entry points + :PROPERTIES: + :BEAMER_COL: .27 + :BEAMER_env: block + :END: +# application specific data representation + - [ ] data representation + - [ ] APIs + - [ ] WebHooks + - [ ] indexes +*** + \hfill tag tasks with Collect, Preserve, Share when possible * Building for the long term ** Three pillars *** Awareness, visibility, endorsement - promote public and private policies - attract users, unlock funds - turn copycats into partners #+BEAMER: \pause *** Resources - fund the long term effort: people, collaborators, organisation, infrastructure... 
#+BEAMER: \pause *** Science and technology - build on a sound basis: /we need external help/ + [ ] be prepared to learn from others! \hfill /"Seul on va plus vite, mais ensemble on va plus loin"/ # Where we are today: endorsement # #+INCLUDE: "../../common/modules/endorsement.org::#endorsement" :minlevel 2 ** Political awareness *** April 3rd, 2017: landmark Inria Unesco agreement... #+BEGIN_EXPORT latex \includegraphics[width=\extblockscale{.25\linewidth}]{inria-logo-new} \hfill \includegraphics[width=\extblockscale{.35\linewidth}]{unesco-accord} \hfill \includegraphics[width=\extblockscale{.2\linewidth}]{unesco}\\[1em] \mbox{}\hfill \includegraphics[width=\extblockscale{.2\linewidth}]{rdc-fh-ib} \hfill \includegraphics[width=\extblockscale{.15\linewidth}]{SWH-logo_share} \hfill \includegraphics[width=\extblockscale{.2\linewidth}]{swh-team-2017-04-03}\hfill % \mbox{}\\ % \url{https://www.softwareheritage.org/blog} #+END_EXPORT *** September 27-28: Mauritius Call \hfill mentions the importance of software heritage *** Sometime in 2018 \hfill opening of the archive (we'll come back to this) ** Resources #+INCLUDE: "../../common/modules/swh-sponsors.org::#sponsors" :only-contents t #+BEAMER: \pause *** Breaking news! :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=Qwant_Logo,leftpic=true,width=\extblockscale{.2\linewidth} :END: \hfill contract awarded for building together the source code search engine ** Science *** Communication - CACM Viewpoint *accepted!!!* (thanks Moshe Vardi) - RDA 2018 - Keynote Devoxx (April), ICSE (May), and ASE (September) *** Collaboration - Qwant and Almanach (search/classification, AP+Zack+Roberto) - Crossminer (MG) and Linked Data (MG and Roberto) - RDFox (Zack and Roberto), H2020 (Zack is on the deck) - [ ] distributed storage, databases, graphs, crypto, blockchains, etc... 
#+BEAMER: \pause *** Essential - [ ] reliable interface with scientific community (human and technical) * Roadmap for a sustainable organisation :PROPERTIES: :CUSTOM_ID: main :END: -** Growing a sustainable common digital infrastructure +** Growing a sustainable common digital infrastructure :noexport: :PROPERTIES: :CUSTOM_ID: phases :END: *** Ignition (3 Y) \alert{\em Inria} :B_exampleblock: :PROPERTIES: :BEAMER_env: exampleblock :BEAMER_COL: .3 :BEAMER_ACT: +- :END: - Vision - Team - Core infrastructure - Identity + communication + community - Legitimacy + awareness + support *** Scale up (5 Y) :B_block: :PROPERTIES: :BEAMER_env: block :BEAMER_COL: .35 :BEAMER_ACT: +- :END: - Core Infra (engineer) - Collect (4 strategies) - Preserve + mirrors, multiple techs - Share + search, browse, APIs - Connect + community - Organisation + build the foundation *** Stable Operation ($\infty$) :B_block: :PROPERTIES: :BEAMER_env: alertblock :BEAMER_COL: .38 :BEAMER_ACT: +- :END: - Maintain+Evolve + archive, community + bylaws, organisation - Interact+Engage + research + industry + education + culture - Sustainability + /key/ \alert{infrastructure} + /ecosystem/ \alert{diversity} + /foundation/ \alert{endowment} +** Towards a sustainable common digital infrastructure + :PROPERTIES: + :CUSTOM_ID: phases + :END: +*** Launching (2015-2017) :B_exampleblock: + :PROPERTIES: + :BEAMER_env: exampleblock + :BEAMER_COL: .3 + :BEAMER_ACT: +- + :END: + - Vision + - Team + - Core infrastructure + - Identity + - Legitimacy +*** Building (2018-2022) :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .35 + :BEAMER_ACT: +- + :END: + - Expand collection + - Support use cases + - Build community + - Grow mirror network + - Independent Foundation +*** Stable Operation (2023-$\infty$) :B_block: + :PROPERTIES: + :BEAMER_env: alertblock + :BEAMER_COL: .38 + :BEAMER_ACT: +- + :END: + - Maintain+Evolve + + archive, community + + bylaws, organisation + - Interact+Engage + + research and industry 
+ + culture and education +*** Sustainability + :PROPERTIES: + :BEAMER_ACT: +- + :END: + + /key/ \alert{infrastructure} + + /ecosystem/ \alert{diversity} + + /foundation/ \alert{endowment} + ** Today: team *** Management - Roberto and Stefano (CEO/CTO) - Jean-Fran\c{c}ois Abramatic (Head of Advisory Board) - Magali Fitzgibbon (Legal, Contracts) *** R and D, Ops - 5 engineers (Morane thanks to Crossminer) - 1 PhD - 1 visiting scientist *** Everything else \hfill provided by Inria ** Today: funding *** Baseline Inria engagement (~ 500Ke/year) *** Sponsoring - 3 platinum sponsors (Microsoft, Intel, SocGen) - 1 silver sponsor (Huawei), 4 bronze sponsors (DANS, Nokia, DISI, GitHub) *** Partnerships - HAL and Intel - Crossminer - Qwant - ClearlyDefined *** \hfill a /huge/ part of my time +** Today: sponsor's view +*** Features +#+BEGIN_EXPORT latex + \begin{columns}[t] + \begin{column}{0.48\linewidth} + In production + \begin{itemize} + \item \emph{lookup} a content using its hash + \item \emph{navigation} of the archive with an API: \url{http://archive.softwareheritage.org/api} + \end{itemize} + \end{column}\pause + \begin{column}{0.48\linewidth} + Work in progress + \begin{itemize} + \item \emph{browsing}: "wayback machine" for archived code via Web UI (demo?) + \item \emph{download}: copy from the archive + \item \emph{deposit}: into the archive + \item \emph{reverse index}: map hashes to origins/commits + \item \emph{classification}: (very early stage) + \end{itemize} + \end{column} + \end{columns} +#+END_EXPORT + * The transition has started +** Organisation +*** The Software Heritage Foundation + - legal :: contract ongoing + - funding :: will accept donations as soon as possible + + [ ] updated website (AL+RDC+Zack) + + [ ] /donate/ button (AL+RDC) + + from 1 euro to 1Me :-) +#+BEAMER: \pause +*** Foundation vs. 
Inria: separation of concerns (transitional) + - the Foundation collects funds for Software Heritage + - Inria operates Software Heritage ** Operations *** Software Heritage is /no longer/ a "project" - they *depend on us* + HAL *now*, /mirrors/ and /Intel use case/ soon + UNESCO event requires ~24/7 stable operation + [ ] state of Azure clone? - - [ ] APIs need to be maintained - - PURLs must be carefully defined - + [ ] /cite me button/ - + [ ] /documentation/ and /rationale/ (part is ongoing, Morane+Zack+Roberto) - + [ ] "/software citation/" (we need Inria teams onboard!) #+BEAMER: \pause *** Moving to ~24/7 - think about a way of implementing /in production/ stable operation - - TODO send me (cc: Zack) /privately/ your ideas by *Friday, March 15* + - TODO send me (cc: Zack) /privately/ your ideas by *Friday, March 23rd* ** Mirror network *** Terminology - copy :: instance of the archive under SWH own control - mirror :: instance of the archive outside SWH own control *** How it works - legal :: 5 documents + [X] contract (RDC+MF), technical annex (RDC+ND), ethical charter (RDC), + [ ] CLA, Code of conduct - - technical :: quite a lot of work (ND) + - technical :: quite a lot of work to do (ND) *** Status - advanced :: Grenoble - exploratory :: 2 more in France, 1 in Norway -** Organisation -*** The Software Heritage Foundation - - legal :: contract ongoing - - funding :: will accept donations as soon as possible - + [ ] updated website (AL+RDC+Zack) - + [ ] /donate/ button (AL+RDC) - + from 1 euro to 1Me :-) -#+BEAMER: \pause -*** Foundation vs. 
Inria: separation of concerns (transitional) - - the Foundation collects funds for Software Heritage - - Inria operates Software Heritage ** Technology *** Evolutions ongoing - move to more flexible in-house storage (Ceph, FT, ND) - experiment data compression - [ ] explore NoSQL solutions #+BEAMER: \pause *** Evolutions forthcoming - [ ] blockchain - [ ] embargo/escrow #+BEAMER: \pause *** Memento - *modular* software stack: we need to enable - other programming languages - other backends/frontends +** Technology, cont'd (interfacing with the world) +**** Existing line of work + - APIs (must be maintained!) + - PURLs (must be carefully defined!) + + [ ] /cite me button/ + + [ ] /documentation/ and /rationale/ (part is ongoing, Morane+Zack+Roberto) + + [ ] "/software citation/" (we need Inria teams onboard!) +**** Forthcoming + - Journal / blockchain + + [ ] Mirrors feed, trust and accountability (blockchain) + - Web hooks + + [ ] allow others to build Software Heritage integrated services ** Team and Community *** Expanding core team in 2018 - 2 new hires (TBD) *** Community - [ ] we need to bring in contributors + software collectors + developers + partner platforms + curators ** The next 5 years *** Collect - *stable process* for adding new listers/loaders - community of contributors *** Preserve - *stable process* for mirror network - at least 10 mirrors worldwide *** Share - - *browse/download/upload/search/*, automatic classification + - in production *browse/download/upload/search/index/automatic classification* - support for research and industry use *** Process - continuous improvement (tech, community) ** The next 5 years, cont'd *** Team 30 full time people on SWH core\\ management, dev/ops, fundraising, comm, product, liaison\hfill \alert{structured} *** Funding - 4 or 5 Me/year + ~5 Me/year *** Organisation - Independent international foundation - International network of peers *** Community - research, industry, culture, ... 
- collectors/curators/scholars/museums ... ** Pause *** Yes, it is - a huge challenge - an unprecedented effort - much more than just technology - high risk, high gain #+BEAMER: \pause *** \hfill I believe we can make it! ** What we need to succeed *** Operations - stability, reliability, efficiency #+BEAMER: \pause *** Engineering - modularity (platform/plugins, tech oecumenism) - replicability (mirrors, contributors, \alert{docs}) - evolvability (testing env, sandbox, exps) #+BEAMER: \pause *** Product vision - "users" and "clients" are coming #+BEAMER: \pause *** Mindset - make the principles guide the technology\\ - /not the other way around/ + \hfill /not the other way around/ * Conclusion ** Come in, we're open *** Software Heritage is - a /reference archive/ of /all available/ source code - a fantastic new tool for /research/ software - a unique /complement/ for /development platforms/ - an international, open, nonprofit, /mutualized infrastructure/ - at the service of our community, at the service of society *** Questions :B_ignoreheading: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+BEAMER: {\vfill\begin{center}\Huge{Questions ?}\end{center}\vfill} +* Team report +** Task priorities (established November 2017) +*** short term :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .45 + :END: + - browse (lead AL) + - ideal ETA beta open Q4 2017 + + - deposit (lead AD+MG) + - ideal ETA + + state diagram/high level specs for [2017-12-05 Tue] + + working pipeline [2017-12-06 Wed 23:00 CET] + + - download (lead AP) + - ideal ETA working pipeline Q4 2017 +*** short/medium term :B_block: + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .45 + :END: + - mirrors (lead ND+MF) + - ideal ETA Q2 2018 + - preliminary work on legal+tech specs needed by Jan 16th 2018 + - provenance (lead GR) + - ideal ETA production index Q2 2018 + - preliminary Azure experiment ETA Q4 2017 * Appendix :B_appendix: :PROPERTIES: :BEAMER_env: appendix :END: # # How we want to work, 
including core properties # * Zoom on science :noexport: # # Software Research # ** Multiple facets *** Scientists as users - reproducibility via SWH (all) - SWH as dataset (computer science) *** Scientists as providers/partners - research on SWH challenges ** A Universal Archive of Software Development :PROPERTIES: :CUSTOM_ID: main :END: #+LATEX: \includegraphics[width=\extblockscale{.15\linewidth}]{universal.png} *** /Repeatable/ Software Studies :PROPERTIES: :BEAMER_act: +- :END: - vulnerability detection - dependency analysis - pattern elicitation - study of the development graph - ... the sky is the limit *** Prerequisites clean, evolvable data and metadata model ** How we built our scientific knowledge # # Scientific method, reproducibility # #+INCLUDE: "../../common/modules/scientific-method.org::#short" :only-contents t # # Connection with Open Access # #+INCLUDE: "../../common/modules/conservancy.org::#main" :minlevel 2 # # URLS are not good tracers # #+INCLUDE: "../../common/modules/urls-decay.org::#main" :only-contents t :minlevel 2 # # DOI is not a solution # #+INCLUDE: "../../common/modules/doi-analysis.org::#main" :only-contents t :minlevel 2 ** What could good links look like? *** Links to /software source code/ in an article Leverage Software Heritage as a universal archive: - set of files :: \small\url{swh:1:tree:06741c8c37c5a384083082b99f4c5ad94cd0cd1f}\\ id of tree object listing all the files in a project (at a given time) - revision :: \url{swh:1:rev:7598fb94d59178d65bd8d2892c19356290f5d4e3}\\ id of commit object, which points to a tree and (a pointer to) the history - metadata :: this /may/ involve a DOI *** \hfill this is also of /industrial/ relevance! 
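The =swh:1:tree:...= and =swh:1:rev:...= links above are intrinsic identifiers: they are derived from the archived objects themselves using cryptographic hashes. The sketch below illustrates the idea, assuming the identifiers follow Git's object hashing (SHA-1 over a typed, length-prefixed payload). The helper names, the =swh:1:cnt:= prefix for single files, and the restriction to flat directories of regular files are assumptions made for illustration; they are not taken from the slides.

```python
import hashlib
from typing import Mapping


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash a payload the way Git does: '<type> <size>\\0' header + payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def content_swhid(data: bytes) -> str:
    """Identifier of a single file content (assumed 'cnt' object type)."""
    return f"swh:1:cnt:{git_object_id('blob', data)}"


def tree_swhid(files: Mapping[str, bytes]) -> str:
    """Identifier of a flat directory holding only regular files.

    Simplification: no sub-directories, symlinks, or executable bits.
    Each entry is '<mode> <name>\\0' followed by the raw 20-byte hash of
    the file content; entries are sorted by name.
    """
    payload = b""
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        blob_id = bytes.fromhex(git_object_id("blob", files[name]))
        payload += b"100644 " + name.encode("utf-8") + b"\x00" + blob_id
    return f"swh:1:tree:{git_object_id('tree', payload)}"


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
    print(tree_swhid({"README": b"hello world\n"}))
```

Because the identifier depends only on the bytes, anyone holding a copy can recompute it and check integrity without asking a resolver: this is the "no middle man, no dangling pointers" property. Changing a single byte of any file changes the identifier of the file and of every directory that contains it.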
*** Links to /data/ in /software source code/ :noexport: - external linking mechanisms /that guarantee integrity/ + git lfs + git annex - need to extend them into a generic, VCS independent solution ** The SWH - HAL connector *** Strategic First open access / open source archival process *** Opportunity - HAL is one of a kind - ArXiv uses the same tech * Selected research challenges : building the archive :noexport: ** Data compression Deduplication is performed at the file level /across all projects in the world/ *** Pros - very efficient to cope with file clones - quite resilient to technology changes *** Cons - a minor edit creates two different files #+BEAMER: \pause *** Challenge: exploit file similarities - adapt / improve variable size checksums / diff detection - compression rates of up to 100 to 1 may arise ** Metadata alignment :noexport: *** Many concepts related to source code - project, archive, source, language, licence, bts, mailing list, ... - developer, committer, author, architect, ... *** Many existing ontologies DOAP, FOAF, Appstream, schema.org, ADMS.SW, ... *** Many disparate catalogs :PROPERTIES: :BEAMER_act: +- :END: # mostly manual Freecode (40.000+), Plume (400+), Debian (25.000+), OpenHub (670.000+), ... # FramaSoft (1500+), # OpenHub is mostly automatic # Wikipedia ? 
*** Challenge : scale up metadata to millions of projects :PROPERTIES: :BEAMER_act: +- :END: - /reconcile/ existing ontologies - /link/ and /check/ existing catalogs with Software Heritage - handle /inconsistent data/ and /provenance information/ - synthesise missing information (machine learning) ** Software phylogenetics :noexport: *** The Software Diaspora :PROPERTIES: :BEAMER_act: +- :END: - Code often /migrates/ across projects : forks, copy-paste - Code gets /cloned/ : reuse, language limitations, code smells - Projects /migrate/ across forges : fashion, functionality - Projects get /cloned/ : mirrors, packages *** Challenge: tracing software evolution across billions of files :PROPERTIES: :BEAMER_act: +- :END: - rebuild the history of software artefacts - identify code origins - spot code clones - build project impact graphs ** Distributed infrastructure *** The software graph - files - directories - commits - projects all de-duplicated in Software Heritage *** Challenge: design efficient architectures and algorithms - replication and availability (CAP?) - navigation - query - path analysis * Selected research challenges : using the archive :noexport: ** Code search *** A natural need :PROPERTIES: # :BEAMER_act: +- :END: - Find the definition of a function/class/procedure/type/structure - Search examples of code usage in an archive of source code - you name it... *** Approaches :PROPERTIES: # :BEAMER_act: +- :END: - language specific /patterns/ - working on /abstract syntax trees/ Regular expressions are a nice /swiss-army knife/ approximation, can we build a specific tool that scales? *** What about /all the source code/ in the world? :PROPERTIES: :BEAMER_act: +- :END: - /hundreds/ of billions of LOCs We need new insight for handling this. ** Software as Big Data *** Remember the numbers - 60+ million repositories ingested - 700+ million commits - 3+ billion unique source files / 200 TB of raw source code and growing by the day! 
*** Challenge: what can machines learn here? - programming patterns / trends - developer skills - vulnerabilities - bugs and fixes ** Efficient data representation :noexport: *** Remember the numbers - 60+ million repositories ingested - 700+ million commits - 3+ billion unique source files / 200 TB of raw source code and growing by the day! *** Challenge: can we make this fit in memory? - efficient graph representation - fast non-local queries - mitigate the size/speed tradeoff * A glimpse of the archive :noexport: #+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t * Bits from the drawing board :noexport: #+INCLUDE: "../../common/modules/bits-drawing-board.org::#keyproperties" :minlevel 2 #+INCLUDE: "../../common/modules/bits-drawing-board.org::#foss" :minlevel 2 #+INCLUDE: "../../common/modules/bits-drawing-board.org::#intrinsicids" :minlevel 2 #+INCLUDE: "../../common/modules/bits-drawing-board.org::#replication" :minlevel 2 ** Some planned working groups #+INCLUDE: "../../common/modules/your-help-wg.org::#sodi" :minlevel 3 #+INCLUDE: "../../common/modules/your-help-wg.org::#sapi" :minlevel 3 #+INCLUDE: "../../common/modules/your-help-wg.org::#opad" :minlevel 3 * Tech bits :noexport: ** More details on the internals #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t # # Contributing to the great picture # ** The team :noexport: #+latex: \begin{center} #+ATTR_LATEX: :width .35\linewidth file:core-team-formal.png #+latex: \end{center} #+BEAMER: \pause * Technical status :noexport: # #+INCLUDE: "../../common/modules/status-extended.org::#people" :minlevel 2 #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 ** Archiving goals Targets: VCS repositories & source code releases (e.g., tarballs) *** We DO archive - file *content* (= blobs) - *revisions* (= commits), with full metadata - *releases* 
(= tags), ditto - where (*origin*) & when (*visit*) we found any of the above # - time-indexed repo *snapshots* (i.e., we never delete anything) … in a VCS-/archive-agnostic *canonical data model* *** We DON'T archive (for now) # - diffs → derived data from related contents - homepages, wikis - BTS/issues/code reviews/etc. - mailing lists Long term vision: play our part in a /"semantic wikipedia of software"/ ** Dataflow #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{.9\textwidth}]{swh-dataflow.pdf}\end{center} # # Key properties of the system # ** Much more than an archive! #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :only-contents t #+INCLUDE: "../../common/modules/status-extended.org::#merkledemo" :minlevel 2 # +INCLUDE: "../../common/modules/status.org::#datamodel" :minlevel 2 # +INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 # +INCLUDE: "../../common/modules/status-extended.org::#merkledemo" :minlevel 2 # +INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t # +INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t # +INCLUDE: "../../common/modules/status-extended.org::#giantdag" :only-contents t # +INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2