diff --git a/common/modules/graph-compression.org b/common/modules/graph-compression.org index 8c2d452..64598b2 100644 --- a/common/modules/graph-compression.org +++ b/common/modules/graph-compression.org @@ -1,306 +1,306 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 # Depends: \usepackage{pdfpages} * Graph Compression :PROPERTIES: :CUSTOM_ID: main :END: ** Graph compression on the Software Heritage archive :PROPERTIES: :CUSTOM_ID: intro :END: *** #+BEGIN_EXPORT latex \vspace{-3mm} \begin{thebibliography}{Foo Bar, 1969} \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli \newblock Ultra-Large-Scale Repository Analysis via Graph Compression \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE \end{thebibliography} #+END_EXPORT *** Research question Is it possible to efficiently perform software development history analyses at ultra large scale (= the scale of Software Heritage archive or more), on a single, relatively cheap machine? *** Idea Apply state-of-the-art graph compression techniques from the field of Web graph / social network analysis. ** Background --- (Web) graph compression :PROPERTIES: :CUSTOM_ID: background1 :END: *** The graph of the Web :PROPERTIES: :BEAMER_env: definition :END: Directed graph that has Web pages as nodes and hyperlinks between them as edges. *** Properties (1) - - **Locality:** pages links to pages whose URLs are lexicographically + - **Locality:** pages link to pages whose URLs are lexicographically similar. URLs share long common prefixes. → use *D-gap compression* *** Adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.51 :END: #+BEAMER: \scriptsize | *Node* | *Outdegree* | *Successors* | |--------+-------------+--------------------------------------| | ... | ... | ... | | 15 | 11 | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 10 | 15,16,17,22,23,24,315,316,317,3041 | | 17 | 0 | | | 18 | 5 | 13,15,16,17,50 | | ... | ... | ... | *** D-gapped adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.48 :END: #+BEAMER: \scriptsize | *Node* | *Outdegree* | *Successors* | |--------+-------------+-----------------------------| | ... | ... | ... | | 15 | 11 | 3,1,0,0,0,0,3,0,178,111,718 | | 16 | 10 | 1,0,0,4,0,0,290,0,0,2723 | | 17 | 0 | | | 18 | 5 | 9,1,0,0,32 | | ... | ... | ... | ** Background --- (Web) graph compression (cont.) :PROPERTIES: :CUSTOM_ID: background2 :END: *** The graph of the Web :PROPERTIES: :BEAMER_env: definition :END: Directed graph that has Web pages as nodes and hyperlinks between them as edges. *** Properties (2) - **Similarity:** pages that are close together in lexicographic order tend to have many common successors. → use *reference compression* *** Adjacency lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.47 :END: #+BEAMER: \scriptsize | *Node* | *Outd.* | *Successors* | |--------+---------+--------------------------------------| | ... | ... | ... | | 15 | 11 | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 10 | 15,16,17,22,23,24,315,316,317,3041 | | 17 | 0 | | | 18 | 5 | 13,15,16,17,50 | | ... | ... | ... | *** Copy lists :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.60 :END: #+BEAMER: \scriptsize | *Node* | *Ref.* | *Copy list* | *Extra nodes* | |--------+--------+-------------+--------------------------------------| | ... | ... | ... | ... | | 15 | 0 | | 13,15,16,17,18,19,23,24,203,315,1034 | | 16 | 1 | 01110011010 | 22,316,317,3041 | | 17 | | | | | 18 | 3 | 11110000000 | 50 | | ... | ... | ... | | ** Background --- Web graph compression (OLD) :noexport: Borrowing (great!) slides from: #+BEGIN_EXPORT latex \begin{thebibliography}{} \bibitem{Pibiri2018} Giulio Ermanno Pibiri \newblock Effective Web Graph Representations, 2018 \newblock \url{http://pages.di.unipi.it/pibiri/slides/webgraphs\_compression.pdf} \end{thebibliography} #+END_EXPORT ** Background -- Web graph compression (imported slides) (OLD) :noexport: :PROPERTIES: :BEAMER_env: ignoreheading :END: #+BEGIN_EXPORT latex { \setbeamercolor{background canvas}{bg=} \setbeamertemplate{background}{} \includepdf[pages={4,11,12,13}]{webgraphs_compression.pdf} \addtocounter{framenumber}{4} } #+END_EXPORT ** Corpus :PROPERTIES: :CUSTOM_ID: corpus :END: *** Nodes :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.5 :END: | *Node type* | *N. of nodes* | |-------------+---------------| | origins | 88 M | | snapshots | 57 M | | releases | 9.9 M | | revisions | 1.1 B | | directories | 4.9 B | | contents | 5.5 B | |-------------+---------------| | Total nodes | 12 B | *** Edges :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.5 :END: | *Edge type* | *N. of edges* | |-----------------------+---------------| | origin → snapshot | 195 M | | snapshot → revision | 616 M | | snapshot → release | 215 M | | release → revision | 9.9 M | | revision → revision | 1.2 B | | revision → directory | 1.1 B | | directory → directory | 48 B | | directory → revisiony | 482 M | | directory → content | 112 B | |-----------------------+---------------| | Total edges | 165 B | *** :PROPERTIES: :BEAMER_env: ignoreheading :END: Archive snapshot 2018-09-25, from the Software Heritage graph dataset.\\ - Growth rate: exponential, doubling every 22-30 months (Rousseau, Di Cosmo, - Zacchiroli; ESE 2020, to appear). + Growth rate: exponential, doubling every 22-30 months [Rousseau, Di Cosmo, + Zacchiroli, Empirical Software Engineering 2020]. ** Graph compression pipeline :PROPERTIES: :CUSTOM_ID: pipeline :END: #+BEAMER: \hspace*{-0.1\linewidth} \includegraphics[width=1.2\linewidth]{compression/compression-steps} #+BEAMER: \vspace{-1cm} *** - *MPH*: minimal perfect hash, mapping Merkle IDs to 0..N-1 integers - *BV* compress: Boldi-Vigna compression (based on MPH order) - *BFS*: breadth-first visit to renumber - *Permute*: update BV compression according to BFS order *** (Re)establishing locality - key for good compression is a node ordering that ensures locality and similarity - which is very much /not/ the case with Merkle IDs, ...but is the case /again/ after BFS reordering ** Compression experiment :PROPERTIES: :CUSTOM_ID: compexp :END: | *Step* | *Wall time* (hours) | |-------------+---------------------| | MPH | 2 | | BV Compress | 84 | | BFS | 19 | | Permute | 18 | | Transpose | 15 | |-------------+---------------------| | Total | 138 (6 days) | - server equipped with 24 CPUs and 750 GB of RAM - RAM mostly used as I/O cache for the BFS step - /minimum/ memory requirements are close to the RAM needed to load the final compressed graph in memory ** Compression efficiency (space) :PROPERTIES: :CUSTOM_ID: spaceefficiency :END: *** :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.4 :END: | *Forward graph* | | | total size | 91 GiB | | bits per edge | 4.91 | | compression ratio | 15.8% | *** :PROPERTIES: :BEAMER_env: block :BEAMER_col: 0.4 :END: | *Backward graph* | | | total size | 83 GiB | | bits per edge | 4.49 | | compression ratio | 14.4% | *** Operating cost The structure of a full bidirectional archive graph fits in less than 200 GiB of RAM, for a hardware cost of ~300 USD. ** Compression efficiency (time) :PROPERTIES: :CUSTOM_ID: timeefficiency :END: *** Benchmark --- Full BFS visit (single thread) #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth} | *Forward graph* | | |------------------+----------------| | wall time | 1h48m | | throughput | 1.81 M nodes/s | | | (553 ns/node) | #+BEAMER: \end{column}\begin{column}{0.45\textwidth} | *Backward graph* | | |------------------+----------------| | wall time | 3h17m | | throughput | 988 M nodes/s | | | (1.01 µs/node) | #+BEAMER: \end{column}\end{columns} *** Benchmark --- Edge lookup random sample: 1 B nodes (8.3% of entire graph); then enumeration of all successors #+BEAMER: \begin{columns}\begin{column}{0.45\textwidth} | *Forward graph* | | |------------------+----------------| | visited edges | 13.6 B | | throughput | 12.0 M edges/s | | | (83 ns/edge) | #+BEAMER: \end{column}\begin{column}{0.45\textwidth} | *Backward graph* | | |------------------+----------------| | visited edges | 13.6 B | | throughput | 9.45 M edges/s | | | (106 ns/edge) | #+BEAMER: \end{column}\end{columns} *** :PROPERTIES: :BEAMER_env: ignoreheading :END: Note how edge lookup time is close to DRAM random access time (50-60 ns). ** Discussion :PROPERTIES: :CUSTOM_ID: discussion :END: *** Incrementality compression is *not incremental*, due to the use of contiguous integer ranges - but the graph is append-only, so... - ...based on expected graph growth rate it should be possible to pre-allocate enough free space in the integer ranges to support *amortized incrementality* (future work) #+BEAMER: \pause *** In-memory v. on-disk the compressed in-memory graph structure has *no attributes* - usual design is to exploit the 0..N-1 integer ranges to *memory map node attributes* to disk for efficient access - works well for queries that does graph traversal first and "join" node attributes last; ping-pong between the two is expensive - edge attributes are more problematic diff --git a/common/modules/status-extended.org b/common/modules/status-extended.org index 78440f5..c555370 100644 --- a/common/modules/status-extended.org +++ b/common/modules/status-extended.org @@ -1,505 +1,504 @@ #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) #+INCLUDE: "prelude.org" :minlevel 1 # not to be included as a whole, just pick individual slides as you see fit * Status :PROPERTIES: :CUSTOM_ID: main :END: ** The people :PROPERTIES: :CUSTOM_ID: people :END: *** The core team :B_picblock: :PROPERTIES: :CUSTOM_ID: core-team-formal :BEAMER_env: picblock :BEAMER_opt: pic=team,width=.4\linewidth :END: - Roberto Di Cosmo - Stefano Zacchiroli - Nicolas Dandrimont (Engineer) - Antoine Dumont (Engineer) # - and /Jordi, Quentin and Guillaume/ *** Scientific advisors - Serge Abiteboul (French Science Academy) - Jean-François Abramatic (former W3C director) - Gerard Berry (CNRS Gold Medal, French Science Academy) - Julia Lawall (Coccinelle, Linux Kernel, Outreachy) ** Archive coverage --- archive.softwareheritage.org :PROPERTIES: :CUSTOM_ID: archive :END: #+BEAMER: \vspace{-1mm} #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1\linewidth}]{archive-growth.png}\end{center} #+BEAMER: \vspace{-2mm} *** #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1\linewidth}]{archive-coverage.png}\end{center} # #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/github} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.13\linewidth]{coverage/gitlab} # #+BEAMER: \hfill # #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.14\linewidth]{coverage/bitbucket}} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/googlecode} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.14\linewidth]{coverage/gitorious} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/framagit} # #+BEAMER: \\ # #+BEAMER: \includegraphics[width=0.10\linewidth]{coverage/hal} # #+BEAMER: \hfill # #+BEAMER: \raisebox{2mm}{\includegraphics[width=0.12\linewidth]{coverage/debian}} # #+BEAMER: \hfill # #+BEAMER: \raisebox{1mm}{\includegraphics[width=0.11\linewidth]{coverage/npm}} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.06\linewidth]{coverage/cran} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/gnu} # #+BEAMER: \hfill # #+BEAMER: \includegraphics[width=0.12\linewidth]{coverage/inria} # #+BEAMER: \hfill # #+BEAMER: \raisebox{-1mm}{\includegraphics[width=0.11\linewidth]{coverage/pypi}} #+BEAMER: \pause *** - on disk: ~700 TB (uncompressed); as a graph ~20 B nodes, ~200 B edges - the largest public source code archive in the world (and growing!) ** The structure of the archive :noexport: *** On-disk storage - flat file storage for contents - postgres database for the metadata *** Data model: /one/ big Merkle DAG, inspired by the git model - Origins (= repositories) - Occurrences (= branches) - Releases (= tags) - Revisions (= commits) - Directories (= trees) - Contents (= blobs) ** Archiving goals :PROPERTIES: :CUSTOM_ID: archivinggoals :END: Targets: VCS repositories & source code releases (e.g., tarballs, packages) *** We DO archive - file *content* (= blobs) - *revisions* (= commits), with full metadata - *releases* (= tags), ditto - where (*origin*) & when (*visit*) we found any of the above # - time-indexed repo *snapshots* (i.e., we never delete anything) … in a VCS-/archive-agnostic *canonical data model* *** We DON'T archive (yet) # - diffs → derived data from related contents - homepages, wikis - BTS/issues/code reviews/etc. - mailing lists Long term vision: play our part in a /"semantic wikipedia of software"/ ** Architecture :PROPERTIES: :CUSTOM_ID: architecture :END: *** Data flow :PROPERTIES: :CUSTOM_ID: dataflow :END: # #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1.4\textwidth}]{swh-dataflow.pdf}\end{center} ** Data model :noexport: *** General schema - VCS-independent - fully deduplicated + files, directories and commits are /shared/ - biggest git-like /graph/ in the world *** \begin{center} \url{http://deb.li/swhdm} \end{center} *** full hash index (sha1, sha256, ...) Some funny facts: - the GPL2 licence appears under more than 500 names + including /aa.css.txt/ and /FullSync.txt/ ~ :-) ** Merkle DAG *** Merkle structure :PROPERTIES: :CUSTOM_ID: merkle :END: **** Merkle trees :PROPERTIES: :CUSTOM_ID: merkletree :END: # R. C. Merkle, A digital signature based on a conventional encryption # function, Crypto '87 - #+BEAMER: \vspace{-3mm} + #+BEAMER: \vspace{-2mm} ***** Merkle tree (R. C. Merkle, CRYPTO 1987) :B_picblock: :PROPERTIES: :BEAMER_opt: pic=merkle, leftpic=true, width=.7\linewidth :BEAMER_env: picblock :BEAMER_act: :END: Combination of - tree - hash function #+BEAMER: \pause #+BEAMER: \footnotesize ***** Classical cryptographic construction - fast, parallel signature of large data structures - widely used (e.g., Git, blockchains, IPFS, ...) - built-in deduplication #+BEAMER: \vspace{-1mm} **** Data Model :PROPERTIES: :CUSTOM_ID: datamodel :END: ***** The archive: a (giant) Merkle DAG - #+BEAMER: \vspace{-3mm} #+BEAMER: \centering \includegraphics[width=\textwidth]{swh-data-model-h} **** The archive in a few pictures :PROPERTIES: :CUSTOM_ID: merkledemo :END: ***** A giant (extended) Merkle DAG #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/contents.pdf}}} #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_2_contents.pdf}}} #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/directories.pdf}}} #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_3_directories.pdf}}} #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/revisions.pdf}}} #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_4_revisions.pdf}}} #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/releases.pdf}}} #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_5_releases.pdf}}} # #+LATEX: {\colorbox{white}{\includegraphics[width=\extblockscale{.9\linewidth}]{git-merkle/merkle_1.pdf}}} *** A revision node :PROPERTIES: :CUSTOM_ID: merklerevision :END: **** Example: a Software Heritage revision ***** #+BEAMER: \vspace{-.5cm}\centering\includegraphics[width=0.9\textwidth]{git-merkle/revisions} ***** Note: most object kinds currently have Git-compatible identifiers *** Giant DAG :PROPERTIES: :CUSTOM_ID: giantdag :END: **** The archive: a (giant) Merkle DAG # Using an empty frame because the image is difficult to read on swh bg. # Finding a way to override image bg for just this frame would be better. ***** #+BEAMER: \centering \includegraphics[width=\extblockscale{\textwidth}]{git-merkle/merkle_5_releases} *** Giant DAG (single slide) :PROPERTIES: :CUSTOM_ID: giantdag1slide :END: **** The Software Heritage archive: a gigantic Merkle DAG #+LATEX: \centering\forcebeamerstart{} #+LATEX: \only<1>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_1}}} #+LATEX: \only<2>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/contents}}} #+LATEX: \only<3>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_2_contents}}} #+LATEX: \only<4>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/directories}}} #+LATEX: \only<5>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_3_directories}}} #+LATEX: \only<6>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/revisions}}} #+LATEX: \only<7>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_4_revisions}}} #+LATEX: \only<8>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/releases}}} #+LATEX: \only<9>{\colorbox{white}{\includegraphics[width=.75\linewidth]{git-merkle/merkle_5_releases}}} #+LATEX: \forcebeamerend{} *** Giant DAG (detailed) :PROPERTIES: :CUSTOM_ID: dagdetail :END: **** The archive: a (giant) Merkle DAG #+BEAMER: \vspace{-3mm} #+BEAMER: \centering \includegraphics[width=\textwidth]{swh-merkle-dag-wide} *** Giant DAG (detailed) :PROPERTIES: :CUSTOM_ID: dagdetailsmall :END: **** The archive: a (giant) Merkle DAG #+BEAMER: \centering #+BEAMER: \only<1>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit1}} #+BEAMER: \only<2>{\includegraphics[width=\textwidth]{swh-merkle-dag-small-visit2}} #+BEAMER: \only<3>{\includegraphics[width=\textwidth]{swh-merkle-dag-small}} ** Technology :noexport: :PROPERTIES: :CUSTOM_ID: technology :END: *** Software stack :PROPERTIES: :CUSTOM_ID: swstack :END: **** 3rd party - Debian, Proxmox, ZFS on Linux, Puppet - PostgreSQL for metadata storage, with barman & pglogical - Celery (RabbitMQ backend) for task scheduling - Python3 and psycopg2 for the backend - Django, Bootstrap, D3.js for Web stuff **** in house - /ad hoc/ object storage (to avoid imposing tech to mirrors) - data model implementation, listers, loaders, scheduler - ~60 Git repositories (~20 Python packages, ~30 Puppet modules) - ~60 kSLOC Python / ~12 kSLOC SQL / ~4 kSLOC Puppet - licence choice: GPLv3 (backend) / AGPLv3 (frontend) *** Deployment architecture #+BEAMER: \vspace{1mm} #+BEAMER: \centering \includegraphics[height=.9\textheight]{general-architecture} *** Hardware stack :PROPERTIES: :CUSTOM_ID: hwstack :END: **** in house - 3x hypervisors with ~40 VMs - 1x high performance database server; read-only replica on a container - 2x dedicated storage servers, one of them using ZFS. - 3x high density storage array (2 x 60 x 6TB; 1 x 60 x 10TB) - 3x nodes for a kafka+elasticsearch cluster **** on Azure - full object storage mirror - full mirror of the database containing the graph - workers for content indexing - workers for download bundle preparation *** Software architecture :noexport: **** Module dependencies (internal + external) :B_picblock: :PROPERTIES: :BEAMER_env: picblock :BEAMER_opt: pic=swh-modules-deps-all,width=\linewidth :END: **** let's zoom in: http://deb.li/swhdeps ** Technology :noexport: :PROPERTIES: :CUSTOM_ID: technology-short :END: *** Deployment and resource usage **** Software - around 60k SLOC of custom Python code, running on Debian Stable - PostgreSQL database for the metadata storage - Full docker-compose development environment - Work in progress: scale-out metadata storage (Cassandra?) - Work in progress: mirroring infrastructure (Kafka) **** Hardware - 12 servers (hypervisors, database, storage, staging and testing infrastructure) / 40 virtual machines with mass storage and a backup server at Inria - In-kind sponsorship of cloud and storage resources (Microsoft, University of Bologna) ** Software development :noexport: :PROPERTIES: :CUSTOM_ID: development :END: *** Software development **** classic FOSS development - language: English - development mailing list #+BEAMER: \\{\small \url{https://sympa.inria.fr/sympa/info/swh-devel}} - IRC #+BEAMER: \\ #swh-devel / FreeNode - Forge #+BEAMER: \\{\small \url{https://forge.softwareheritage.org}} - Git, tasks, code review, etc. **** for more information #+BEAMER: \scriptsize https://www.softwareheritage.org/community/developers/ ** Roadmap :PROPERTIES: :CUSTOM_ID: features :END: *** Features... - (done) *lookup* by content hash - (done) *browsing*: "wayback machine" for source code (API + UI) - (early access) *deposit* of source code bundles directly to the archive - (early access) *save code now*, on-demand archive - (done) *download*: =wget= / =git clone= from the archive - (todo) *provenance* lookup for all archived content - (todo) *full-text search* on all archived source code files #+BEAMER: \pause *** ... and much more than one could possibly imagine all the world's software development history at hand's reach! ** Web API :noexport: :PROPERTIES: :CUSTOM_ID: api :END: *** Web API :PROPERTIES: :CUSTOM_ID: apiintro :END: **** RESTful API to programmatically access the Software Heritage archive \\ *\url{https://archive.softwareheritage.org/api/}* **** Features - pointwise *browsing* of the archive - … snapshots → revisions → directories → contents … - full access to the *metadata* of archived objects - *crawling* information - /when have you last visited this Git repository I care about?/ - /where were its branches/tags pointing to at the time?/ # - derived information about archived contents (WIP) # - MIME type, programming language, license, etc. **** Endpoint index \url{https://archive.softwareheritage.org/api/1/} *** A tour of the Web API --- origins & visits :PROPERTIES: :CUSTOM_ID: apitourvisits :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ git/url/https://github.com/hylang/hy { "id": 1, "origin_visits_url": "/api/1/origin/1/visits/", "type": "git", "url": "https://github.com/hylang/hy" } #+END_SRC #+BEAMER: \vfill #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visits/ [ ..., { "date": "2016-09-14T11:04:26.769266+00:00", "origin": 1, "origin_visit_url": "/api/1/origin/1/visit/13/", "status": "full", "visit": 13 }, ... ] #+END_SRC *** A tour of the Web API --- snapshots :PROPERTIES: :CUSTOM_ID: apitoursnapshots :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/origin/ \ 1/visit/13/ { ..., "occurrences": { ..., "refs/heads/master": { "target": "b94211251...", "target_type": "revision", "target_url": "/api/1/revision/b94211251.../" }, "refs/tags/0.10.0": { "target": "7045404f3...", "target_type": "release", "target_url": "/api/1/release/7045404f3.../" }, ... }, "origin": 1, "origin_url": "/api/1/origin/1/", "status": "full", "visit": 13 } #+END_SRC *** A tour of the Web API --- releases :noexport: :PROPERTIES: :CUSTOM_ID: apitourreleases :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/release/ \ 7045404f3d1c54e6473c71bbb716529fbad4be24/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte ", "id": 96, "name": "Paul Tagliamonte" }, "date": "2014-04-10T23:01:28-04:00", "message": "0.10: The Oh f*ck it's PyCon release", "name": "0.10.0", "synthetic": false, "target": "6072557b6...", "target_type": "revision", "target_url": "/api/1/revision/6072557b6.../", ... } #+END_SRC *** A tour of the Web API --- revisions :PROPERTIES: :CUSTOM_ID: apitourrevisions :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/revision/ \ 6072557b6c10cd9a21145781e26ad1f978ed14b9/ { "author": { "email": "tag@pault.ag", "fullname": "Paul Tagliamonte ", "id": 96, "name": "Paul Tagliamonte" }, "committer": { ... }, "date": "2014-04-10T23:01:11-04:00", "committer_date": "2014-04-10T23:01:11-04:00", "directory": "2df4cd84e...", "directory_url": "/api/1/directory/2df4cd84e.../", "history_url": "/api/1/revision/6072557b6.../log/", "merge": false, "message": "0.10: The Oh f*ck it's PyCon release", "parents": [ { "id": "10149f66e...", "url": "/api/1/revision/10149f66e.../" } ], ... } #+END_SRC *** A tour of the Web API --- contents :PROPERTIES: :CUSTOM_ID: apitourcontents :END: #+BEAMER: \footnotesize #+BEGIN_SRC GET https://archive.softwareheritage.org/api/1/content/ \ adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/ { "data_url": "/api/1/content/sha1:adc83b19e.../raw/", "filetype_url": "/api/1/content/sha1:.../filetype/", "language_url": "/api/1/content/sha1:.../language/", "length": 1, "license_url": "/api/1/content/sha1:.../license/", "sha1": "adc83b19e...", "sha1_git": "8b1378917...", "sha256": "01ba4719c...", "status": "visible" } #+END_SRC #+BEAMER: \normalsize \vfill \pause **** Caveats - rate limits apply throughout the API - raw download available for textual contents ** Accessing the archive :noexport: :PROPERTIES: :CUSTOM_ID: accessing-short :END: *** Browse :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: #+BEAMER: \begin{center}\includegraphics[width=0.5\textwidth]{archive-browse}\end{center} - https://archive.softwareheritage.org/browse - way back machine for software source code #+BEAMER: \pause *** Web API :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: #+BEAMER: \begin{center}\includegraphics[width=0.5\textwidth]{archive-webapi}\end{center} - https://archive.softwareheritage.org/api - point-wise navigation of the archive as a graph ** Some technical challenges :PROPERTIES: :CUSTOM_ID: techchallenges :END: *** Expanding the archive - discover and classify /all/ the software sources - importers for other VCSs (SVN, Hg, ...) \hfill /We need your help!/ *** Staying current get new repositories and commits ASAP\\ \hfill /We need reliable, standardised event feeds./ *** Handling the backlog ingesting all the pre-existing data\\ \hfill /Decades of software development are waiting!/ diff --git a/talks-public/2021-03-30-cergy/2020-03-30-cergy.org b/talks-public/2021-03-30-cergy/2020-03-30-cergy.org new file mode 100644 index 0000000..e5b67bc --- /dev/null +++ b/talks-public/2021-03-30-cergy/2020-03-30-cergy.org @@ -0,0 +1,185 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Software Heritage +#+SUBTITLE: Analyzing the Global Graph of Public Software Development +#+BEAMER_HEADER: \date[2021-03-30, CYU]{30 Mars 2021\\Lab. ETIS --- CY Cergy Paris Université\\ (online)\\[-2ex]} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 30 March 2021 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[UParis \& Inria]{Université de Paris \& Inria --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \author{Stefano Zacchiroli} + +# Required by graph-compression.org module +#+LATEX_HEADER_EXTRA: \usepackage{pdfpages} + +# Syntax highlighting setup +#+LATEX_HEADER_EXTRA: \usepackage{minted} +#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango} +#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize} +#+name: setup-minted +#+begin_src emacs-lisp :exports results :results silent + (setq org-latex-listings 'minted) + (setq org-latex-minted-options + '(("fontsize" "\\scriptsize") + ("linenos" ""))) + (setq org-latex-to-pdf-process + '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")) +#+end_src +# End syntax highlighting setup + +* About me :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+INCLUDE: "this/zack.org" :minlevel 2 +* Software Heritage +#+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2 +** An international, non profit initiative + :PROPERTIES: + :CUSTOM_ID: support + :END: +*** Sharing the vision :B_block: + :PROPERTIES: + :CUSTOM_ID: endorsement + :BEAMER_COL: .5 + :BEAMER_env: block + :END: + #+LATEX: \begin{center}{\includegraphics[width=\extblockscale{.4\linewidth}]{unesco_logo_en_285}}\end{center} + #+LATEX: \vspace{-0.8cm} + #+LATEX: \begin{center}\vskip 1em \includegraphics[width=\extblockscale{1.4\linewidth}]{support.pdf}\end{center} + #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/testimonials} +*** Donors, members, sponsors :B_block: + :PROPERTIES: + :CUSTOM_ID: sponsors + :BEAMER_COL: .5 + :BEAMER_env: block + :END: + #+LATEX: \begin{center}\includegraphics[width=\extblockscale{.4\linewidth}]{inria-logo-new}\end{center} + #+LATEX: \begin{center} + #+LATEX: \colorbox{white}{\includegraphics[width=\extblockscale{1.4\linewidth}]{sponsors.pdf}} + #+latex:\mbox{}~~~~~~~\tiny\url{www.softwareheritage.org/support/sponsors} + #+LATEX: \end{center} +** Status :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: +#+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 +#+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t +#+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 +#+INCLUDE: "../../common/modules/status-extended.org::#datamodel" :minlevel 2 :only-contents t +#+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :minlevel 2 :only-contents t +#+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 +* Querying the archive +** Exploitation + #+BEAMER: \LARGE \centering + How do you query the Software Heritage archive? + #+BEAMER: \Large \\ + (on a budget) + +** Use cases --- product needs + e.g., for https://archive.softwareheritage.org +*** Browsing + - =ls= + - =git log= (Linux kernel: 800K+ commits) +*** Wayback machine + - tarball + - =git bundle= (Linux kernel: 7M+ nodes) +*** Provenance tracking + - commit provenance (one/all contexts) (note: requires backtracking) + - origin provenance (one/all contexts) +*** Note :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + Note: we therefore need both the direct Merkle DAG graph and its + *transposed* + +** Use cases --- research questions +*** For the sake of it + - local graph topology + - connected component size + - enabling question to identify the best approach (e.g., scale-up + v. scale-out) to conduct large-scale analyses + - any other emerging property +*** Software Engineering topics + - software provenance analysis at this scale is pretty much unexplored yet + - industry frontier: increase granularity down to the individual line of + code + - replicate at this scale (famous) studies that have generally been + conducted on (much) smaller version control system samples to + confirm/refute their findings + - ... + + #+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t + #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t +** Sample study --- 50 years of gender differnces in code contributions + - start from the Software Heritage graph dataset + - detect gender of author names using standard tooling (=gender-guesser=) + # - caveat: how to identify /first/ name? + - analyze both authors and commits over time, bucketing by commit timestamp + #+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female commiters over time (right)\end{center} +*** + #+BEGIN_EXPORT latex + \vspace{-1mm} + \begin{thebibliography}{} \footnotesize + \bibitem{Zacchiroli2021} Stefano Zacchiroli + \newblock Gender Differences in Public Code Contributions: a 50-year Perspective + \newblock IEEE Software, March/April 2021 + \end{thebibliography} + #+END_EXPORT + +** Discussion + - one /can/ query such a corpus SQL-style + - but relational representation shows its limits at this scale + - ...at least as deployed on commercial SQL offerings such as Athena + - note: (naive) sharding is ineffective, due to the pseudo-random + distribution of node identifiers + - experiments with Google BigQuery are ongoing + - (we broke it at the first import attempt..., due to very large arrays in + directory entry tables) + +* Graph compression + #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t + +* Conclusion + #+INCLUDE: "this/roadmap.org" :minlevel 2 +** Wrapping up + #+latex: \vspace{-2mm} +*** + - Software Heritage archives all public source code as a huge Merkle DAG + - Querying and analyzing it at scale (20/200 B nodes/edges) is an open + problem + - Gold mine of research leads in sw. eng., complex networks, big code, + reproducibility + #+latex: \vspace{-2mm} +*** References + #+latex: \vspace{-1mm} + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \scriptsize + + \bibitem{Abramatic2018} Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli + \newblock Building the Universal Archive of Source Code + \newblock Communications of the ACM, October 2018 + + \bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli + \newblock The Software Heritage graph dataset: public software development under one roof + \newblock MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE + + \bibitem{Boldi2020} Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli + \newblock Ultra-Large-Scale Repository Analysis via Graph Compression + \newblock SANER 2020, 27th Intl. Conf. on Software Analysis, Evolution and Reengineering. IEEE + + \end{thebibliography} + #+END_EXPORT +*** Contacts + Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] + +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: diff --git a/talks-public/2021-03-30-cergy/Makefile b/talks-public/2021-03-30-cergy/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2021-03-30-cergy/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides diff --git a/talks-public/2021-03-30-cergy/this/commits-pie.pdf b/talks-public/2021-03-30-cergy/this/commits-pie.pdf new file mode 100644 index 0000000..8b0e856 Binary files /dev/null and b/talks-public/2021-03-30-cergy/this/commits-pie.pdf differ diff --git a/talks-public/2021-03-30-cergy/this/ratio-female-authors.pdf b/talks-public/2021-03-30-cergy/this/ratio-female-authors.pdf new file mode 100644 index 0000000..bcf325a Binary files /dev/null and b/talks-public/2021-03-30-cergy/this/ratio-female-authors.pdf differ diff --git a/talks-public/2021-03-30-cergy/this/ratio-female-commits.pdf b/talks-public/2021-03-30-cergy/this/ratio-female-commits.pdf new file mode 100644 index 0000000..baab5bf Binary files /dev/null and b/talks-public/2021-03-30-cergy/this/ratio-female-commits.pdf differ diff --git a/talks-public/2021-03-30-cergy/this/roadmap.org b/talks-public/2021-03-30-cergy/this/roadmap.org new file mode 100644 index 0000000..0559610 --- /dev/null +++ b/talks-public/2021-03-30-cergy/this/roadmap.org @@ -0,0 +1,77 @@ +** A (brief) research roadmap --- 1 +*** Graph compression + - incremental, amortized compression → ongoing UniMi collaboration + - graph query languages on top of the compressed representation → LIRIS + collaboration (early stages) + #+BEAMER: \vfill \pause +*** Complex networks + - local topology of the global VCS graph + - emergent properties (the "classics": scale-free, small world, etc.) + - dynamic modeling of graph evolution over time → collab. with physics @ + UParis +*** + #+BEGIN_EXPORT latex + \vspace{-2mm} + \begin{thebibliography}{} + \small + \bibitem{Pietri2019} Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli\newblock + Determining the Intrinsic Structure of Public Software Development History\newblock + MSR 2020: 17th Intl. Conf. on Mining Software Repositories. IEEE\newblock + registered study protocol + \end{thebibliography} + #+END_EXPORT +** A (brief) research roadmap --- 2 + #+BEAMER: \vspace{-2mm} +*** Very-large-scale "big code" + - /big code/ = apply ML/DL to source code and other development byproducts + - current results are language-specific and limited in scale; even the + simplest problems become challenging at this scale and heterogeneity + - lead: scalable language detection → collaboration with UniBo + - lead: project classification → collaboration with CELI + - the VCS graph remains largely unexplored in big code + - lead: apply GNN (Graph Neural Network) for VCS node classification + #+BEAMER: \vspace{-1mm} +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEAMER: \vspace{-1mm} \pause +*** Very-large-scale source code indexing + - common AST-based approaches for code indexing are not viable here due do + maximum heterogeneity + - alternative: treat code as text and full-text index it + - previous exp.: 3-gram based indexing in Debsources, supporting regexp + matching + - goal: find a sweet spot between the two + #+BEAMER: \vspace{-1mm} +** Recent results on (visual) programming language identification + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \small + + \bibitem{DelBonifro2021a} Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli + \newblock Content-Based Textual File Type Detection at Scale + \newblock ICMLC 2021: The 13th International Conference on Machine Learning and Computing. ACM, 2021 + + \bibitem{DelBonifro2021b} Francesca Del Bonifro, Maurizio Gabbrielli, Antonio Lategano, Stefano Zacchiroli + \newblock Image-based many-language programming language identification + \newblock (under review) + + \end{thebibliography} + #+END_EXPORT +** A (brief) research roadmap --- 3 +*** Very-large-scale reproducibility in software engineering + - most results in empirical software engineering are determined on corpuses + significantly smaller than Software Heritage + + → external validity threat; do results generalize to the full body of + public code? + - 2-year research plan + 1) identify impactful sw. eng. studies that can be reproduced using + Software Heritage + - selected topics (tentative): code reuse, code quality, project + classification, technical debt, developer productivity + 2) reproduce selected studies one-by-one, at Software Heritage scale + 3) document findings, e.g., via RENE (REproducibility Studies and + NEgative Results) scientific initiatives + - collaboration with Microsoft Research (early stages) diff --git a/talks-public/2021-03-30-cergy/this/zack.org b/talks-public/2021-03-30-cergy/this/zack.org new file mode 100644 index 0000000..25b290b --- /dev/null +++ b/talks-public/2021-03-30-cergy/this/zack.org @@ -0,0 +1,25 @@ + +** About me + - Associate Professor (/Maître de conférences/), Université de Paris + - on leave (/délégation/) at Inria + - Free/Open Source Software activist (20+ years) + - Debian Developer & Former 3x Debian Project Leader + - Former Open Source Initiative (OSI) director + - Software Heritage co-founder & CTO + + #+BEAMER: \vfill \pause + +*** Research path + 1) Formal methods for ensuring the quality of software upgrades (Mancoosi + project) + + Industry adoption: Debian, OPAM, Eclipse P2 + 2) Formal methods for automated upgrade planning in the cloud (Aeolus + project) + + Industry adoption: Mandriva, Kyriba + 3) Large-scale software evolution analysis (Debsources platform) + 4) Very-large-scale source code analysis and preservation (Software + Heritage) + + → this talk