2018-03-12-team.org
#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
#+TITLE: Software Heritage
#+SUBTITLE: Vision and outlook
#+AUTHOR: Roberto Di Cosmo
#+DATE: 12/3/2018
#+EMAIL: roberto@dicosmo.org
#+DESCRIPTION: Preserving the technological knowledge of mankind
#+KEYWORDS: software heritage legacy preservation knowledge mankind technology
#+BEAMER_HEADER: \title[Strategic team meeting]{Software Heritage: vision and outlook}
#+BEAMER_HEADER: \date[12/3/2018]{March 12th 2018\\ Paris}
#+LATEX_HEADER: \usepackage{color}
#+LATEX_HEADER: \usepackage{colortbl}
#+LATEX_HEADER: \usepackage[table]{xcolor}% http://ctan.org/pkg/xcolor
#+LATEX_HEADER: \usepackage{array}
#+LATEX_HEADER: \usepackage{supertabular}
#
# prelude.org contains all the information needed to export the main beamer latex source
# use prelude-toc.org to get the table of contents
#
#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1
#+INCLUDE: "../../common/modules/169.org"
#
# Some context: where we come from
#
# +INCLUDE: "../../common/modules/mancoosi-background.org::#main" :minlevel 1
#
# Basic properties for software studies
#
# +INCLUDE: "../../common/modules/software-studies-stepback-properties.org::#main" :minlevel 2 :only-contents t
* Context and motivations
** Software Heritage in a nutshell
#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :only-contents t :minlevel 3
** Why now
*** Looking at the past
- a lot of old software misplaced, lost, or behind barriers, but...
- most founding fathers are still here, and willing to share
- \alert{urgent} to collect their knowledge
\hfill Only a few years left.
#+BEAMER: \pause
*** Looking at the future
- software development skyrockets
- \alert{essential} to provide a platform for the future
\hfill Every year that goes by makes the problem worse.
** Approach and principles \hfill \url{http://bit.ly/swhpaper}
#+latex: \begin{center}
#+ATTR_LATEX: :width 0.8\linewidth
file:SWH-as-foundation-slim.png
#+latex: \end{center}
#+BEAMER: \pause
*** Technology
:PROPERTIES:
:BEAMER_col: 0.34
:BEAMER_env: block
:END:
- transparency and FOSS
- replication all around
*** Content
:PROPERTIES:
:BEAMER_col: 0.32
:BEAMER_env: block
:END:
- intrinsic identifiers
- facts and provenance
*** Organization
:PROPERTIES:
:BEAMER_col: 0.33
:BEAMER_env: block
:END:
- non-profit
- multi-stakeholder
** A great ambition... in a few taglines
*** Culture (catalog+archive)
\hfill The Library of Alexandria of Source Code
*** Science (pillar of Open Science)
\hfill The reference archive of research software
*** Science (research instrument)
\hfill The CERN of Computer Science
*** Industry (reference catalog)
\hfill The universal software knowledge base
* Key properties and principles
** Three properties are key for Software Heritage's mission
:PROPERTIES:
:CUSTOM_ID: keyproperties
:END:
*** Availability
:PROPERTIES:
:BEAMER_act: +-
:END:
- /all/ the /history/ of /all/ the software
- no restrictions (technical, legal, ...) on /content/ or /metadata/
*** Traceability
:PROPERTIES:
:BEAMER_act: +-
:END:
- know /what/ we get, /when/, from /where/ and /how/
- [ ] /persistent/ and /intrinsic/ identifiers: no middleman, no dangling pointers!
*** Uniformity
:PROPERTIES:
:BEAMER_act: +-
:END:
- one /standard/ metadata structure, /irrespective of the origins/
- /uniform/ naming /schema/
** Software Heritage's approach
:PROPERTIES:
:CUSTOM_ID: keyproperties
:END:
*** Availability
:PROPERTIES:
:BEAMER_act: +-
:END:
- collect /all/ software from /all/ possible places
- /replicate/ the archive in a network of mirrors
*** Traceability
:PROPERTIES:
:BEAMER_act: +-
:END:
- keep /provenance/ information, systematically
+ [ ] keep incoming sources until full testing succeeds (and longer if possible)
- /unique/ identifiers: use /cryptographic hashes/, derived from the software itself
+ [ ] *NEW*: accountability /for all changes/ (see [[https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf][CRDT]], Shapiro et al., and blockchains)
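A minimal sketch of what an identifier /derived from the software itself/ can look like. It assumes Git's blob hashing convention (SHA-1 over a =blob <size>= header, a NUL byte, then the content), chosen here only as an illustration; the slides do not name the hash function, and =intrinsic_id= is a hypothetical helper.

```python
import hashlib


def intrinsic_id(content: bytes) -> str:
    """Return a hex identifier computed only from the bytes of a file.

    Assumption: Git's blob convention, i.e. SHA-1 over the header
    'blob <size>' followed by a NUL byte and the content itself.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The same bytes give the same identifier wherever they are found,
# so no registry (no "middle man") is needed to assign or resolve names.
copy_on_forge_a = intrinsic_id(b"hello world\n")
copy_on_forge_b = intrinsic_id(b"hello world\n")
print(copy_on_forge_a == copy_on_forge_b)

# Any change to the content yields a different identifier.
print(intrinsic_id(b"hello world!\n") != copy_on_forge_a)
```

Because the identifier is recomputable from the content, anyone holding a copy can check it without trusting the party that served it.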
*** Uniformity
:PROPERTIES:
:BEAMER_act: +-
:END:
- version control data model designed to /represent all the others/
* Yes, we really mean all the source code
** All the source code
#+BEAMER: \begin{center}\includegraphics[width=\extblockscale{\linewidth}]{swh-collect-axes}\end{center}
** All the source code, strategies
#+BEAMER: \begin{center}\includegraphics[width=\extblockscale{\linewidth}]{swh-collect-strategies}\end{center}
** Strategy to collect all the source code
*** Different unit cost for each sector
#+BEGIN_EXPORT latex
\begin{center}
\tablefirsthead{}
\tablehead{}
\tabletail{}
\tablelasttail{}
\begin{supertabular}{|c|c|c|}
\cline{2-3}
%\rowcolor{blue!25}
\multicolumn{1}{c}{~}
&
\multicolumn{1}{|c|}{\cellcolor{yellow}Closed} &
\multicolumn{1}{c|}{\cellcolor{yellow}Open}\\\hline
\cellcolor{yellow} Online &
SWH: {\bf \$\$}, ~~~ extern: {\bf \$\$} &
\cellcolor{yellow} SWH: {\bf \$}, ~~~ extern: {\bf \$}
\\\hline
\cellcolor{yellow} Offline &
SWH:{\bf \$\$}, ~~~ extern: {\bf \$\$\$} &
SWH:{\bf \$}, ~~~ extern: {\bf \$\$}
\\\hline
\end{supertabular}
\end{center}
#+END_EXPORT
#+BEAMER: \pause
*** Different approaches for each sector :noexport:
#+BEGIN_EXPORT latex
\begin{center}
\tablefirsthead{}
\tablehead{}
\tabletail{}
\tablelasttail{}
\begin{supertabular}{|c|c|c|}
\cline{2-3}
%\rowcolor{blue!25}
\multicolumn{1}{c}{~}
&
\multicolumn{1}{|c|}{\cellcolor{yellow}Open} &
\multicolumn{1}{c|}{\cellcolor{yellow}Proprietary}\\\hline
\cellcolor{yellow} Current and future &
\cellcolor{yellow}{{\bf Automation}} &
{\bf Embargo}
\\\hline
\cellcolor{yellow} Legacy &
{\bf Crowdsourcing} &
{\bf Focused search}
\\\hline
\end{supertabular}
\end{center}
#+END_EXPORT
#+BEAMER: \pause
# IMPACTS
*** We started on the first quadrant, we need all four!
- [ ] *technical*: security, identification, authorization, access control
- *legal*: policies, contracts
- *community*: network, standards, endorsement
#+BEAMER: \pause
*** Important technical issues
- [ ] setup space for "/collections/" (staging area waiting for curation)
+ make it simple for contributors to donate!
- [ ] keep the embargo/takedown issue in mind
#+INCLUDE: "../../common/modules/swh-functional-architecture.org::#phases" :minlevel 2
* Community is essential
# IMPACTS
** A daunting task:
- challenge :: extreme variability of sources and technologies
- opportunity :: highly parallelisable, /if we provide good abstractions/ and welcome contributors
#+BEAMER: \pause
*** Collect entry points :B_block:
:PROPERTIES:
:BEAMER_COL: .43
:BEAMER_env: block
:END:
- listers (see Avi's blog post)
- protocols (Adullact+FusionForge)
- [ ] VCS loaders (e.g.: Avi's work)
- [ ] Web crawlers (IA, Qwant)
- [ ] curation of the collections
#+BEAMER: \pause
*** Preserve entry points
:PROPERTIES:
:BEAMER_COL: .3
:BEAMER_env: block
:END:
- [ ] mirrors
- [ ] storage and indexing backends
- [ ] event feeds
- [ ] data compression
*** Share entry points
:PROPERTIES:
:BEAMER_COL: .27
:BEAMER_env: block
:END:
# application specific data representation
- [ ] data representation
- [ ] APIs
- [ ] WebHooks
- [ ] indexes
***
\hfill tag tasks with Collect, Preserve, Share when possible
* Building for the long term
** Three pillars
*** Awareness, visibility, endorsement
- promote public and private policies
- attract users, unlock funds
- turn copycats into partners
#+BEAMER: \pause
*** Resources
- fund the long term effort: people, collaborators, organisation, infrastructure...
#+BEAMER: \pause
*** Science and technology
- build on a sound basis: /we need external help/
+ [ ] be prepared to learn from others!
\hfill /"Alone we go faster, but together we go further"/
# Where we are today: endorsement
#
#+INCLUDE: "../../common/modules/endorsement.org::#endorsement" :minlevel 2
** Political awareness
*** April 3rd, 2017: landmark Inria-UNESCO agreement...
#+BEGIN_EXPORT latex
\includegraphics[width=\extblockscale{.25\linewidth}]{inria-logo-new} \hfill
\includegraphics[width=\extblockscale{.35\linewidth}]{unesco-accord} \hfill
\includegraphics[width=\extblockscale{.2\linewidth}]{unesco}\\[1em]
\mbox{}\hfill
\includegraphics[width=\extblockscale{.2\linewidth}]{rdc-fh-ib} \hfill
\includegraphics[width=\extblockscale{.15\linewidth}]{SWH-logo_share} \hfill
\includegraphics[width=\extblockscale{.2\linewidth}]{swh-team-2017-04-03}\hfill
% \mbox{}\\
% \url{https://www.softwareheritage.org/blog}
#+END_EXPORT
*** September 27-28: Mauritius Call
\hfill mentions the importance of software heritage
*** Sometime in 2018
\hfill opening of the archive (we'll come back to this)
** Resources
#+INCLUDE: "../../common/modules/swh-sponsors.org::#sponsors" :only-contents t
#+BEAMER: \pause
*** Breaking news! :B_picblock:
:PROPERTIES:
:BEAMER_env: picblock
:BEAMER_opt: pic=Qwant_Logo,leftpic=true,width=\extblockscale{.2\linewidth}
:END:
\hfill contract awarded for building together the source code search engine
** Science
*** Communication
- CACM Viewpoint *accepted!!!* (thanks Moshe Vardi)
- RDA 2018
- Keynotes at Devoxx (April), ICSE (May), and ASE (September)
*** Collaboration
- Qwant and Almanach (search/classification, AP+Zack+Roberto)
- Crossminer (MG) and Linked Data (MG and Roberto)
- RDFox (Zack and Roberto), H2020 (Zack is on deck)
- [ ] distributed storage, databases, graphs, crypto, blockchains, etc.
#+BEAMER: \pause
*** Essential
- [ ] reliable interface with scientific community (human and technical)
* Roadmap for a sustainable organisation
:PROPERTIES:
:CUSTOM_ID: main
:END:
** Growing a sustainable common digital infrastructure :noexport:
:PROPERTIES:
:CUSTOM_ID: phases
:END:
*** Ignition (3 Y) \alert{\em Inria} :B_exampleblock:
:PROPERTIES:
:BEAMER_env: exampleblock
:BEAMER_COL: .3
:BEAMER_ACT: +-
:END:
- Vision
- Team
- Core infrastructure
- Identity
+ communication
+ community
- Legitimacy
+ awareness
+ support
*** Scale up (5 Y) :B_block:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_COL: .35
:BEAMER_ACT: +-
:END:
- Core Infra (engineer)
- Collect (4 strategies)
- Preserve
+ mirrors, multiple techs
- Share
+ search, browse, APIs
- Connect
+ community
- Organisation
+ build the foundation
*** Stable Operation ($\infty$) :B_block:
:PROPERTIES:
:BEAMER_env: alertblock
:BEAMER_COL: .38
:BEAMER_ACT: +-
:END:
- Maintain+Evolve
+ archive, community
+ bylaws, organisation
- Interact+Engage
+ research
+ industry
+ education
+ culture
- Sustainability
+ /key/ \alert{infrastructure}
+ /ecosystem/ \alert{diversity}
+ /foundation/ \alert{endowment}
** Towards a sustainable common digital infrastructure
:PROPERTIES:
:CUSTOM_ID: phases
:END:
*** Launching (2015-2017) :B_exampleblock:
:PROPERTIES:
:BEAMER_env: exampleblock
:BEAMER_COL: .3
:BEAMER_ACT: +-
:END:
- Vision
- Team
- Core infrastructure
- Identity
- Legitimacy
*** Building (2018-2022) :B_block:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_COL: .35
:BEAMER_ACT: +-
:END:
- Expand collection
- Support use cases
- Build community
- Grow mirror network
- Independent Foundation
*** Stable Operation (2023-$\infty$) :B_block:
:PROPERTIES:
:BEAMER_env: alertblock
:BEAMER_COL: .38
:BEAMER_ACT: +-
:END:
- Maintain+Evolve
+ archive, community
+ bylaws, organisation
- Interact+Engage
+ research and industry
+ culture and education
*** Sustainability
:PROPERTIES:
:BEAMER_ACT: +-
:END:
+ /key/ \alert{infrastructure}
+ /ecosystem/ \alert{diversity}
+ /foundation/ \alert{endowment}
** Today: team
*** Management
- Roberto and Stefano (CEO/CTO)
- Jean-Fran\c{c}ois Abramatic (Head of Advisory Board)
- Magali Fitzgibbon (Legal, Contracts)
*** R and D, Ops
- 5 engineers (Morane thanks to Crossminer)
- 1 PhD
- 1 visiting scientist
*** Everything else
\hfill provided by Inria
** Today: funding
*** Baseline
Inria engagement (~ 500Ke/year)
*** Sponsoring
- 3 platinum sponsors (Microsoft, Intel, SocGen)
- 1 silver sponsor (Huawei), 4 bronze sponsors (DANS, Nokia, DISI, GitHub)
*** Partnerships
- HAL and Intel
- Crossminer
- Qwant
- ClearlyDefined
***
\hfill a /huge/ part of my time
** Today: sponsor's view
*** Features
#+BEGIN_EXPORT latex
\begin{columns}[t]
\begin{column}{0.48\linewidth}
In production
\begin{itemize}
\item \emph{lookup} a content using its hash
\item \emph{navigation} of the archive with an API: \url{http://archive.softwareheritage.org/api}
\end{itemize}
\end{column}\pause
\begin{column}{0.48\linewidth}
Work in progress
\begin{itemize}
\item \emph{browsing}: ``wayback machine'' for archived code via Web UI (demo?)
\item \emph{download}: copy from the archive
\item \emph{deposit}: into the archive
\item \emph{reverse index}: map hashes to origins/commits
\item \emph{classification}: (very early stage)
\end{itemize}
\end{column}
\end{columns}
#+END_EXPORT
* The transition has started
** Organisation
*** The Software Heritage Foundation
- legal :: contract ongoing
- funding :: will accept donations as soon as possible
+ [ ] updated website (AL+RDC+Zack)
+ [ ] /donate/ button (AL+RDC)
+ from 1 euro to 1Me :-)
#+BEAMER: \pause
*** Foundation vs. Inria: separation of concerns (transitional)
- the Foundation collects funds for Software Heritage
- Inria operates Software Heritage
** Operations
*** Software Heritage is /no longer/ a "project"
- they *depend on us*
+ HAL *now*, /mirrors/ and /Intel use case/ soon
+ UNESCO event requires ~24/7 stable operation
+ [ ] state of Azure clone?
#+BEAMER: \pause
*** Moving to ~24/7
- think about how to implement stable operation /in production/
- TODO send me (cc: Zack) /privately/ your ideas by *Friday, March 23rd*
** Mirror network
*** Terminology
- copy :: instance of the archive under SWH's own control
- mirror :: instance of the archive outside SWH's own control
*** How it works
- legal :: 5 documents
+ [X] contract (RDC+MF), technical annex (RDC+ND), ethical charter (RDC)
+ [ ] CLA, Code of conduct
- technical :: quite a lot of work to do (ND)
*** Status
- advanced :: Grenoble
- exploratory :: 2 more in France, 1 in Norway
** Technology
*** Evolutions ongoing
- move to more flexible in-house storage (Ceph, FT, ND)
- experiment with data compression
- [ ] explore NoSQL solutions
#+BEAMER: \pause
*** Evolutions forthcoming
- [ ] blockchain
- [ ] embargo/escrow
#+BEAMER: \pause
*** Memento
- *modular* software stack: we need to enable
  + other programming languages
  + other backends/frontends
** Technology, cont'd (interfacing with the world)
**** Existing line of work
- APIs (must be maintained!)
- PURLs (must be carefully defined!)
+ [ ] /cite me button/
+ [ ] /documentation/ and /rationale/ (part is ongoing, Morane+Zack+Roberto)
+ [ ] "/software citation/" (we need Inria teams onboard!)
**** Forthcoming
- Journal / blockchain
+ [ ] Mirrors feed, trust and accountability (blockchain)
- Web hooks
+ [ ] allow others to build Software Heritage integrated services
** Team and Community
*** Expanding core team in 2018
- 2 new hires (TBD)
*** Community
- [ ] we need to bring in contributors
+ software collectors
+ developers
+ partner platforms
+ curators
** The next 5 years
*** Collect
- *stable process* for adding new listers/loaders
- community of contributors
*** Preserve
- *stable process* for mirror network
- at least 10 mirrors worldwide
*** Share
- in production *browse/download/upload/search/index/automatic classification*
- support for research and industry use
*** Process
- continuous improvement (tech, community)
** The next 5 years, cont'd
*** Team
30 full time people on SWH core\\
management, dev/ops, fundraising, comm, product, liaison\hfill \alert{structured}
*** Funding
~5 Me/year
*** Organisation
- Independent international foundation
- International network of peers
*** Community
- research, industry, culture, ...
- collectors/curators/scholars/museums ...
** Pause
*** Yes, it is
- a huge challenge
- an unprecedented effort
- much more than just technology
- high risk, high gain
#+BEAMER: \pause
***
\hfill I believe we can make it!
** What we need to succeed
*** Operations
- stability, reliability, efficiency
#+BEAMER: \pause
*** Engineering
- modularity (platform/plugins, tech oecumenism)
- replicability (mirrors, contributors, \alert{docs})
- evolvability (testing env, sandbox, exps)
#+BEAMER: \pause
*** Product vision
- "users" and "clients" are coming
#+BEAMER: \pause
*** Mindset
- make the principles guide the technology\\
\hfill /not the other way around/
* Conclusion
** Come in, we're open
*** Software Heritage is
- a /reference archive/ of /all available/ source code
- a fantastic new tool for /research/ software
- a unique /complement/ for /development platforms/
- an international, open, nonprofit, /mutualized infrastructure/
- at the service of our community, at the service of society
*** Questions :B_ignoreheading:
:PROPERTIES:
:BEAMER_env: ignoreheading
:END:
#+BEAMER: {\vfill\begin{center}\Huge{Questions?}\end{center}\vfill}
* Team report
** Task priorities (established November 2017)
*** short term :B_block:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_COL: .45
:END:
- browse (lead AL)
- ideal ETA beta open Q4 2017
- deposit (lead AD+MG)
- ideal ETA
+ state diagram/high level specs for [2017-12-05 Tue]
+ working pipeline [2017-12-06 Wed 23:00 CET]
- download (lead AP)
- ideal ETA working pipeline Q4 2017
*** short/medium term :B_block:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_COL: .45
:END:
- mirrors (lead ND+MF)
- ideal ETA Q2 2018
- preliminary work on legal+tech specs needed by Jan 16th 2018
- provenance (lead GR)
- ideal ETA production index Q2 2018
- preliminary Azure experiment ETA Q4 2017
* Appendix :B_appendix:
:PROPERTIES:
:BEAMER_env: appendix
:END:
#
# How we want to work, including core properties
#
* Zoom on science :noexport:
#
# Software Research
#
** Multiple facets
*** Scientists as users
- reproducibility via SWH (all)
- SWH as dataset (computer science)
*** Scientists as providers/partners
- research on SWH challenges
** A Universal Archive of Software Development
:PROPERTIES:
:CUSTOM_ID: main
:END:
#+LATEX: \includegraphics[width=\extblockscale{.15\linewidth}]{universal.png}
*** /Repeatable/ Software Studies
:PROPERTIES:
:BEAMER_act: +-
:END:
- vulnerability detection
- dependency analysis
- pattern elicitation
- study of the development graph
- ... the sky is the limit
*** Prerequisites
clean, evolvable data and metadata model
** How we built our scientific knowledge
#
# Scientific method, reproducibility
#
#+INCLUDE: "../../common/modules/scientific-method.org::#short" :only-contents t
#
# Connection with Open Access
#
#+INCLUDE: "../../common/modules/conservancy.org::#main" :minlevel 2
#
# URLS are not good tracers
#
#+INCLUDE: "../../common/modules/urls-decay.org::#main" :only-contents t :minlevel 2
#
# DOI is not a solution
#
#+INCLUDE: "../../common/modules/doi-analysis.org::#main" :only-contents t :minlevel 2
** What could the good links look like?
*** Links to /software source code/ in an article
Leverage Software Heritage as universal archive:
- set of files :: \small\url{swh:1:tree:06741c8c37c5a384083082b99f4c5ad94cd0cd1f}\\
id of tree object listing all the files in a project (at a given time)
- revision :: \url{swh:1:rev:7598fb94d59178d65bd8d2892c19356290f5d4e3}\\
id of the commit object, which references a tree and (a pointer to) the history
- metadata :: this /may/ involve a DOI
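The two identifiers above share a visible layout: a =swh= prefix, a version number, an object type, and a 40-character hexadecimal hash, separated by colons. A minimal sketch of a parser for that layout follows; the field names and the =parse_swh_id= helper are assumptions made for illustration, inferred from the two examples rather than taken from a published specification.

```python
import re
from typing import NamedTuple


class SwhId(NamedTuple):
    version: int      # "1" in both examples
    object_type: str  # "tree" or "rev" in the examples
    object_hash: str  # 40 hexadecimal characters


# Layout inferred from the examples: swh:<version>:<type>:<40 hex digits>
_SWH_ID = re.compile(r"swh:(\d+):([a-z]+):([0-9a-f]{40})")


def parse_swh_id(text: str) -> SwhId:
    """Split an identifier into its fields, rejecting malformed input."""
    match = _SWH_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a well-formed identifier: {text!r}")
    version, object_type, object_hash = match.groups()
    return SwhId(int(version), object_type, object_hash)


tree = parse_swh_id("swh:1:tree:06741c8c37c5a384083082b99f4c5ad94cd0cd1f")
rev = parse_swh_id("swh:1:rev:7598fb94d59178d65bd8d2892c19356290f5d4e3")
print(tree.object_type, tree.object_hash)
print(rev.object_type, rev.object_hash)
```

Rejecting anything that does not match the full pattern keeps a truncated or mistyped identifier in an article from silently resolving to nothing.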
***
\hfill this is also of /industrial/ relevance!
*** Links to /data/ in /software source code/ :noexport:
- external linking mechanisms /that guarantee integrity/
+ git lfs
+ git annex
- need to extend them into a generic, VCS independent solution
** The SWH - HAL connector
*** Strategic
First open access / open source archival process
*** Opportunity
- HAL is one of a kind
- ArXiv uses the same tech
* Selected research challenges : building the archive :noexport:
** Data compression
Deduplication is performed at the file level /across all projects in the world/
*** Pros
- very efficient at coping with file clones
- quite resilient to technology changes
*** Cons
- a minor edit creates two different files
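A minimal sketch of file-level deduplication, showing both the benefit and the drawback listed above. The =FileStore= class is hypothetical, and keying objects by the SHA-1 of the raw content is an assumption made for the example.

```python
import hashlib


class FileStore:
    """Toy content-addressed store: one object per distinct file content."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self.objects.setdefault(key, content)
        return key


store = FileStore()
original = b"int main(void) { return 0; }\n" * 1000
edited = original + b"/* one more line */\n"  # a minor edit

a = store.add(original)  # found in project A
b = store.add(original)  # exact clone found in project B
c = store.add(edited)    # near-duplicate

# The clone costs nothing, but the near-duplicate is stored in full.
print(a == b, a != c, len(store.objects))
stored_bytes = sum(len(v) for v in store.objects.values())
print(stored_bytes, len(original) + len(edited))
```

The second object repeats almost all of the first one byte for byte, which is the redundancy that similarity-aware compression would aim to remove.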
#+BEAMER: \pause
*** Challenge: exploit file similarities
- adapt / improve variable size checksums / diff detection
- compression ratios of up to 100 to 1 may be achievable
** Metadata alignment :noexport:
*** Many concepts related to source code
- project, archive, source, language, licence, bts, mailing list, ...
- developer, committer, author, architect, ...
*** Many existing ontologies
DOAP, FOAF, Appstream, schema.org, ADMS.SW, ...
*** Many disparate catalogs
:PROPERTIES:
:BEAMER_act: +-
:END:
# mostly manual
Freecode (40,000+), Plume (400+), Debian (25,000+), OpenHub (670,000+), ...
# FramaSoft (1500+),
# OpenHub is mostly automatic
# Wikipedia ?
*** Challenge : scale up metadata to millions of projects
:PROPERTIES:
:BEAMER_act: +-
:END:
- /reconcile/ existing ontologies
- /link/ and /check/ existing catalogs with Software Heritage
- handle /inconsistent data/ and /provenance information/
- synthesise missing information (machine learning)
** Software phylogenetics :noexport:
*** The Software Diaspora
:PROPERTIES:
:BEAMER_act: +-
:END:
- Code often /migrates/ across projects : forks, copy-paste
- Code gets /cloned/ : reuse, language limitations, code smells
- Projects /migrate/ across forges : fashion, functionality
- Projects get /cloned/ : mirrors, packages
*** Challenge: tracing software evolution across billions of files
:PROPERTIES:
:BEAMER_act: +-
:END:
- rebuild the history of software artefacts
- identify code origins
- spot code clones
- build project impact graphs
** Distributed infrastructure
*** The software graph
- files
- directories
- commits
- projects
all de-duplicated in Software Heritage
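A minimal sketch of how de-duplication can extend from files to directories: if a directory's identifier is computed from the names and identifiers of its children, identical subtrees collapse onto a single node of the graph. The encoding below is invented for illustration and is neither Git's nor Software Heritage's actual format; commits and projects are left out for brevity.

```python
import hashlib


def file_id(content: bytes) -> str:
    """Identifier of a file, computed from its content only."""
    return hashlib.sha1(b"file\x00" + content).hexdigest()


def directory_id(entries: dict) -> str:
    """Identifier of a directory, from its sorted (name, child id) pairs."""
    lines = [f"{name} {child}" for name, child in sorted(entries.items())]
    payload = "dir\x00" + "\n".join(lines)
    return hashlib.sha1(payload.encode()).hexdigest()


readme = file_id(b"hello\n")
main_v1 = file_id(b"print('v1')\n")
main_v2 = file_id(b"print('v2')\n")

src_v1 = directory_id({"main.py": main_v1})
src_v2 = directory_id({"main.py": main_v2})
root_v1 = directory_id({"README": readme, "src": src_v1})
root_v2 = directory_id({"README": readme, "src": src_v2})
# The same tree found in another project, listed in a different order.
fork = directory_id({"src": src_v1, "README": readme})

print(root_v1 == fork)     # identical trees share one node
print(root_v1 != root_v2)  # a change in a leaf propagates up to the root
```

Between the two versions only =main.py=, =src= and the root get new identifiers; the unchanged =README= node is shared.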
*** Challenge: design efficient architectures and algorithms
- replication and availability (CAP?)
- navigation
- query
- path analysis
* Selected research challenges : using the archive :noexport:
** Code search
*** A natural need
:PROPERTIES:
# :BEAMER_act: +-
:END:
- Find the definition of a function/class/procedure/type/structure
- Search examples of code usage in an archive of source code
- you name it...
*** Approaches
:PROPERTIES:
# :BEAMER_act: +-
:END:
- language specific /patterns/
- working on /abstract syntax trees/
Regular expressions are a nice /Swiss-army knife/ approximation; can we build a specific tool that scales?
*** What about /all the source code/ in the world?
:PROPERTIES:
:BEAMER_act: +-
:END:
- /hundreds/ of billions of LOCs
We need new insight for handling this.
** Software as Big Data
*** Remember the numbers
- 60+ million repositories ingested
- 700+ million commits
- 3+ billion unique source files / 200 TB of raw source code
and growing by the day!
*** Challenge: what can machines learn here?
- programming patterns / trends
- developer skills
- vulnerabilities
- bugs and fixes
** Efficient data representation :noexport:
*** Remember the numbers
- 60+ million repositories ingested
- 700+ million commits
- 3+ billion unique source files / 200 TB of raw source code
and growing by the day!
*** Challenge: can we make this fit in memory?
- efficient graph representation
- fast non-local queries
- mitigate the size/speed tradeoff
* A glimpse of the archive :noexport:
#+INCLUDE: "../../common/modules/status-extended.org::#api" :only-contents t
* Bits from the drawing board :noexport:
#+INCLUDE: "../../common/modules/bits-drawing-board.org::#keyproperties" :minlevel 2
#+INCLUDE: "../../common/modules/bits-drawing-board.org::#foss" :minlevel 2
#+INCLUDE: "../../common/modules/bits-drawing-board.org::#intrinsicids" :minlevel 2
#+INCLUDE: "../../common/modules/bits-drawing-board.org::#replication" :minlevel 2
** Some planned working groups
#+INCLUDE: "../../common/modules/your-help-wg.org::#sodi" :minlevel 3
#+INCLUDE: "../../common/modules/your-help-wg.org::#sapi" :minlevel 3
#+INCLUDE: "../../common/modules/your-help-wg.org::#opad" :minlevel 3
* Tech bits :noexport:
** More details on the internals
#+INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t
#
# Contributing to the great picture
#
** The team :noexport:
#+latex: \begin{center}
#+ATTR_LATEX: :width .35\linewidth
file:core-team-formal.png
#+latex: \end{center}
#+BEAMER: \pause
* Technical status :noexport:
# #+INCLUDE: "../../common/modules/status-extended.org::#people" :minlevel 2
#+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2
** Archiving goals
Targets: VCS repositories & source code releases (e.g., tarballs)
*** We DO archive
- file *content* (= blobs)
- *revisions* (= commits), with full metadata
- *releases* (= tags), ditto
- where (*origin*) & when (*visit*) we found any of the above
# - time-indexed repo *snapshots* (i.e., we never delete anything)
… in a VCS-/archive-agnostic *canonical data model*
*** We DON'T archive (for now)
# - diffs → derived data from related contents
- homepages, wikis
- BTS/issues/code reviews/etc.
- mailing lists
Long term vision: play our part in a /"semantic wikipedia of software"/
** Dataflow
#+BEAMER: \begin{center}\includegraphics[width=\extblockscale{.9\textwidth}]{swh-dataflow.pdf}\end{center}
#
# Key properties of the system
#
** Much more than an archive!
#+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :only-contents t
#+INCLUDE: "../../common/modules/status-extended.org::#merkledemo" :minlevel 2
# +INCLUDE: "../../common/modules/status.org::#datamodel" :minlevel 2
# +INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2
# +INCLUDE: "../../common/modules/status-extended.org::#merkledemo" :minlevel 2
# +INCLUDE: "../../common/modules/status-extended.org::#architecture" :only-contents t
# +INCLUDE: "../../common/modules/status-extended.org::#merklerevision" :only-contents t
# +INCLUDE: "../../common/modules/status-extended.org::#giantdag" :only-contents t
# +INCLUDE: "../../common/modules/status-extended.org::#features" :minlevel 2
