diff --git a/talks-public/2021-06-17-graphrm/2021-06-17-graphrm.org b/talks-public/2021-06-17-graphrm/2021-06-17-graphrm.org new file mode 100644 index 0000000..ec451ba --- /dev/null +++ b/talks-public/2021-06-17-graphrm/2021-06-17-graphrm.org @@ -0,0 +1,157 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Software Heritage +#+SUBTITLE: The Global Graph of Public Code +#+BEAMER_HEADER: \date[2021-06-17, GraphRM]{17 June 2021\\ Meetup GraphRM, Roma \\ (online)\\[-2ex]} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 17 June 2021 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[Software Heritage]{Université de Paris \& Software Heritage --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \author{Stefano Zacchiroli} + +# Syntax highlighting setup +#+LATEX_HEADER_EXTRA: \usepackage{minted} +#+LaTeX_HEADER_EXTRA: \usemintedstyle{tango} +#+LaTeX_HEADER_EXTRA: \newminted{sql}{fontsize=\scriptsize} +#+name: setup-minted +#+begin_src emacs-lisp :exports results :results silent + (setq org-latex-listings 'minted) + (setq org-latex-minted-options + '(("fontsize" "\\scriptsize") + ("linenos" ""))) + (setq org-latex-to-pdf-process + '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" + "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f")) +#+end_src +# End syntax highlighting setup + +* About me :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+INCLUDE: "this/zack.org" :minlevel 2 +* Software Heritage + #+INCLUDE: "../../common/modules/swh-motivations-foss.org::#fragile" :minlevel 2 + #+INCLUDE: "../../common/modules/swh-goals-oneslide-vertical.org::#goals" :minlevel 2 + #+INCLUDE: 
"../../common/modules/principles-compact.org::#principles" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#archivinggoals" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#architecture" :minlevel 2 :only-contents t + #+INCLUDE: "../../common/modules/status-extended.org::#merkletree" :minlevel 2 + #+INCLUDE: "../../common/modules/data-model.org::#merklestruct" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#dagdetailsmall" :minlevel 2 :only-contents t + #+INCLUDE: "../../common/modules/status-extended.org::#archive" :minlevel 2 +** Web interface + #+BEAMER: \centering \huge + #+BEAMER: \href{https://archive.softwareheritage.org}{archive.softwareheritage.org} + #+BEAMER: \vfill + #+BEAMER: DEMO TIME ! + + # searches: + # - gnu gcc git + # - git-annex + # - apollo11 chrislgarry + # - unix history dspinellis + +* Querying the archive +** Use cases --- product needs + e.g., for https://archive.softwareheritage.org +*** Browsing + - =ls= + - =git log= (Linux kernel: 800K+ commits) +*** Wayback machine + - tarball + - =git bundle= (Linux kernel: 7M+ nodes) +*** Provenance tracking + - commit provenance (one/all contexts) \hfill note: requires backtracking + - origin provenance (one/all contexts) +*** Note :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + Note: we therefore need both the direct Merkle DAG graph and its + *transposed* + +** Use cases --- research questions +*** For the sake of it + - local graph topology + - connected component size + - enabling question to identify the best approach (e.g., scale-up + v. 
scale-out) to conduct large-scale analyses + - any other emerging property +*** Software Engineering topics + - software provenance analysis at this scale is still largely unexplored + - industry frontier: increase granularity down to the individual line of + code + - replicate at this scale (famous) studies that have generally been + conducted on (much) smaller version control system samples to + confirm/refute their findings + - ... +** Exploitation + #+BEAMER: \LARGE \centering + How do you query the Software Heritage archive? + #+BEAMER: \Large \\ + (on a budget) + +** The Software Heritage Graph Dataset :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+INCLUDE: "../../common/modules/dataset.org::#main" :minlevel 2 :only-contents t + #+INCLUDE: "../../common/modules/dataset.org::#morequery" :minlevel 2 :only-contents t + +** Sample study --- 50 years of gender differences in code contributions + - start from the Software Heritage graph dataset + - detect gender of author names using standard tooling (=gender-guesser=) + # - caveat: how to identify /first/ name? + - analyze both authors and commits over time, bucketing by commit timestamp + #+BEAMER: \begin{center} \includegraphics[height=0.45\textheight]{this/commits-pie.pdf} \includegraphics[height=0.45\textheight]{this/ratio-female-authors.pdf} \\ \scriptsize total commits by author gender (left), ratio of active female committers over time (right)\end{center} +*** + #+BEGIN_EXPORT latex + \vspace{-1mm} + \begin{thebibliography}{} \footnotesize + \bibitem{Zacchiroli2021} Stefano Zacchiroli + \newblock Gender Differences in Public Code Contributions: a 50-year Perspective + \newblock IEEE Softw.
38(2): 45-50 (2021) + \end{thebibliography} + #+END_EXPORT + +** Discussion + - one /can/ query such a corpus SQL-style + - but the relational representation shows its limits at this scale + - ...at least as deployed on commercial SQL offerings such as Athena + - note: (naive) sharding is ineffective, due to the pseudo-random + distribution of node identifiers + - experiments with Google BigQuery are ongoing + - (we broke it at the first import attempt..., due to very large arrays in + directory entry tables) + +* Graph compression + #+INCLUDE: "../../common/modules/graph-compression.org::#main" :minlevel 2 :only-contents t + +* Graph challenges + #+INCLUDE: "this/graph-challenges.org" :minlevel 2 +* Conclusion +** Wrapping up +*** + - Software Heritage archives all public source code as a huge Merkle DAG + - The Software Heritage graph offers a unified view of the software + commons + - Querying and analyzing it at scale (20/200 B nodes/edges) is an open + problem + - Gold mine of R&D challenges and leads for graph geeks +*** References + - homepage: [[https://www.softwareheritage.org/][www.softwareheritage.org]] + - development info: [[https://www.softwareheritage.org/community/developers][www.softwareheritage.org/community/developers]] +*** Contacts + Stefano Zacchiroli / [[https://upsilon.cc/~zack/][upsilon.cc]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] + +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: + #+INCLUDE: "../../common/modules/swhid.org::#oneslide" :minlevel 2 + #+INCLUDE: "../../common/modules/status-extended.org::#apiintro" :minlevel 2 + #+INCLUDE: "../../common/modules/swh-fuse.org::#oneslide" :minlevel 2 diff --git a/talks-public/2021-06-17-graphrm/Makefile b/talks-public/2021-06-17-graphrm/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2021-06-17-graphrm/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides diff --git
a/talks-public/2021-06-17-graphrm/this/commits-pie.pdf b/talks-public/2021-06-17-graphrm/this/commits-pie.pdf new file mode 100644 index 0000000..8b0e856 Binary files /dev/null and b/talks-public/2021-06-17-graphrm/this/commits-pie.pdf differ diff --git a/talks-public/2021-06-17-graphrm/this/graph-challenges.org b/talks-public/2021-06-17-graphrm/this/graph-challenges.org new file mode 100644 index 0000000..c535ddf --- /dev/null +++ b/talks-public/2021-06-17-graphrm/this/graph-challenges.org @@ -0,0 +1,51 @@ +** Technical leads for graph geeks (selected) + +*** A graph query language for Software Heritage + + - Design and implement a *graph query language* suitable for Software + Heritage use cases and *specificities* (e.g., read-only, compressed + representation, split in-memory/on-disk representation, etc.). + + - Possibly building on top of existing graph technologies: e.g., an *Apache + TinkerPop backend* for WebGraph + + - Learn more: + https://wiki.softwareheritage.org/wiki/Graph_query_language_for_the_archive_(internship) + +** Technical leads for graph geeks (selected) (cont.) + +*** Python bindings for WebGraph (Java) + + - Enable researchers and *data scientists* to exploit the graph in a + *familiar ecosystem* (NumPy, Pandas, iGraph, etc.) + + - Requires *careful performance trade-offs* to minimize jumping between the + two runtimes + + - Learn more: + https://wiki.softwareheritage.org/wiki/Python_bindings_for_WebGraph_(internship) + +** Graph storage challenge + +*** Can your graph database do better? + + - A recent dump of the Software Heritage graph is available in *Apache ORC* + data format *from AWS S3 as an open dataset* (8.4 TiB) + - Schema documentation: + https://docs.softwareheritage.org/devel/swh-dataset/graph/schema.html + - Try it out: load it into *your favorite graph database* and see if it is + *up to the challenge*. We'd love to hear about it!
+ +*** + #+BEAMER: \scriptsize + #+BEGIN_EXAMPLE +$ aws s3 ls s3://softwareheritage/graph/2021-03-23/orc/ + PRE content/ + PRE directory/ + PRE origin/ + PRE release/ + PRE revision/ + [...] +$ aws s3 ls s3://softwareheritage/graph/2021-03-23/orc/directory/ | head -n 1 +2021-04-13 06:59:21 3004055177 graph-03891ef6-18c0-46e7-88db-42056fa33282.orc + #+END_EXAMPLE diff --git a/talks-public/2021-06-17-graphrm/this/r-b-approach.drawio b/talks-public/2021-06-17-graphrm/this/r-b-approach.drawio new file mode 100644 index 0000000..a06de60 --- /dev/null +++ b/talks-public/2021-06-17-graphrm/this/r-b-approach.drawio @@ -0,0 +1 @@ +7Vxbc5s4G/41mdm9MIMECLhMnLRp2k76bdrudm92OMg2CUYu4Njpr/8kECch29gGJ23j3WmMBDLofd7To1ecaeP5+m3sLGYfiY/DM6j66zPt8gxCoAOD/mEtT3mLadh5wzQO/LxJrRrugh+YX1m0LgMfJ40TU0LCNFjwRpA3eiSKsJc22pw4JqvmaRMS+o2GhTPFrYY7zwnbrX8HfjrLWy1DrdqvcTCdFb8MVN4zd4qTeUMyc3yyqjVpV2faOCYkzb/N12McsslrzsubDb3ljcU4SrtcMHmPbr9HweUVJLeL5GZ+k3jXI50P8+iES/7EZxCFdMCLCaHj0ttOn/hcoO9LUnSMkkxS5/QEe7Gu+ui3Kfv7LvLxAtN/sjHcZUCnnY9L7zAfOj+TT075KzAmS3odu2lAu1ezIMV3C8djvSuKMdo2S+ch754EYTgmIYmzazXfwdbEo+1JGpMHXOtBnoXdCe0JHReHn0gSpAGJaJ9HbxHTky4ecZwGVPAfhBPmge+ze7twwmAqveKcd7gkTcm8fKa6aIppplfgda2Ji+otJnOcxk/0FN6rQa44XG80TVVs3rSqgKgVcJvVQGjxNodjf1qOXsGDfuEI2Qct6itaXiZaCnP5JByfAio3P+4fvlyv1X+9e3/+cLv6MgbWCGoSqLzKTiI7KpiG7ExTaeu5bkiEB4aTnnG0ogMg0/SryB8tEzqndISnJMXzYlg3rlR8U8ugtsDAlq/L8GRBV0Po58GTqdkNPAEEWnACuimDkzYQnHR9IDi5QeSw+xMR4wdUjIG7zASxN3pU2k3idEamJKJSJWTBMXOP0/SJx6jOMiVNROF1kP5T+/6NDUV1OT+6XPORs4MnfpCQZezhTzgO6EzjmME1iKa0k0ExdeIpTiWdqgRCKbvLzfjJf2ibyvMT8x/dJkyOEuw3IueNeBypiqoCs4FJg9uXGIdOGjw2Y20Z/Pjon0iQwaU4hUwmCb1ZEZ/lTRxhAY8PjDdAth7dVIh1qDQnjpcmXeFK04kF+0pl7YQhDsk0duZNQOamrEhb6CReLAosiRd+qjqkhrNpWp3Y41oA2Kj5I1OsU6ffiwkzDFVBWgMxlswrAllIgwZziu0IhunAHT+sLMZV1XrxLFaFHdQkmrUdbhmKWHKnwh+qzZXUm44LWM0huD3Mr+pf5c2NGp8snOhwjecTvC2syX8gb/aIj38rbYeGrWhWB20Hp9R2cHyuC9Q9HMAiJh5Oupv/lrTrUOBoma+njJVTJiFZeTPqYpQ5pir0H4n/S1ISZ9SXFAB7BhfdZa2rdifLXlAdpxH1sxj2jbO5MwI7lUHWTNEgm80hePQ6lEHWZRxCHxZ5jxisbpl/K6tsoiYBCS
ygaPCZQzBDZpT7zuP6zsmKCE5VbL0RxLGs6HRhXJHy75O3Iatpq0eQD3KkZQFQVUwBXxXvtcO8tEYr8VYMpXcLHE+ZTxonzifJMl0su2eT1CikMsNV8GARibBAmvGmndxWyYJtt2lbwb1HgGFbYoABoJRRldgtOJjdgqeVfhD9nsJXoQJE4VsnE718zWwgyb+umR2bdtpNr1EeP9vyasF3HJiHgJ7DHahui3dg1xRG+qhWx1jk6AxGWEUXh+gvg5E+5mZK6TgOYROllHNHh66f/crpTKnJT8JxTdurCLSu7jpUoLkZeEdpvDUQPvZyBz8tlYQEibIEVa0+7ZgPwTLrqEtYqzip/m16D5FfJuI8RFLDIMKj4uZZL+NojNdlpgNzBSECsBW7/tFfgIkoCihPGxT0teo0RHBRZzqeM7gQ4dMKLvpbr5I+pibzHiJUIv+cVcGyoDp0kiTw5HLO6JhK1GC7oCN64//UD2rwYIfVZdlRcV0niW7lrnbGlV05rppKyyqrirZjVzShmGIczHEZAhcP9dZQ/cGN3P7vbvXuxrz/PL4Ht/jdzf2br6MeSsKgLFbxyJxB4nCyok8SoizSZgdTpjH8O71LpjvZd3Z/3HJCvUh4LxzvYZoNJnAkeS+JfRyLhAodp5ZA28jUzlE/rg1ZgmszzJYvG4oGcUYP449v5vq3Ufh3AB+mqmuZo6HoT2+GvYdkKakjVNcX5+eX+8Y5FDYhWaa7eZF6jMMvqkc3fhBjj/MbK5xkK7yc8YC5sWSDuE6ShTcSomViediTEi2uZbBy0H5WeSzBrGjtYkGkt3GCkGh9ekPKoV4tofYv3ershBnGwDewKZthpokOaoW2W8POF+JyEIJNgVqGAg1Lh8imXTYskqZ9/Y+4HEgbhvI+cvsxFI9K/zpzptuRm+SZqSo2qWugaeZLsCRUWgxDgimxzcqUoDbOO5Q2t1xoHy6otSdBtgxT+KmGE4KDGRcwVGH7K4yGgpG4EcqwlXYsMxyMMPr3/V8ODiZv35nJ6vrjp8C9KhZ1nqcStMrDv53VsjNpqrZRAPWUS/qIsKNbOzaJNpqV4UCsHenPr2x7zFMtzLK9q97MCTrvSij5N14euNMMuGUecrtMGVvI25uUUJuz60FVNZreisuv8kpOOFDRkFTEbQJtmKLsrYq4tW7ocC3tuhfjWC0VfDkwBmQftj3oqfS0WFv3AvxKlTNfKtbmy/VaG0iv0RuIL6+dz9c3GtbdR9v7+tdlQSm8BCe8nS/tpN5b92HuLAkuTuzPEHQVzfqrfXvvfj4HdvrBfW9ML358n/WgrcI+cdUrw82qUUOY/ddW6j+WURovkxT7f/6MNVAttZSAZ3O8jISASmuvYoGhqD8pGECHreAsYllsfHr+cg/HLU5X954VwXyZ7RQC6JKVPYAYfXL8xMhXNYZKRr8sKHgxdT7MxEzSVUavqz5+pL6FOpyk7e/2aRmyvGwymUA56+kjFxmn2obd9NO8DqJaGKgvBtQXCdqo3K4Qu93v8+Kzh/KlvXbE1cuXfqFo61gQFDGYrgBU/whLPDQiQ/XqBa1l4mykNAaQvASgj2hN+ryb+duNe3m4z2c4iqfuH9n+P/r/OMNV4/ufZ0V9TIa/iTMPwqf8ypKFozZEY/TVDIePmKl9q6c5iBCLNLryW2V9EYnnTtjsXtXKc3RVLTtDGlnieJQUuZ/kehaGjIK8qjfrr9f+ZJ1p7ETJhF5VXB/h8oQVif3m8PXL/SBZhA6fliDKuIHyvkPipPURRZXFU4X2sbcpqG+D9Hrpdt0jxQOrHW5gA3O4IxLaolgbtchuqgBqFpEB2g9lhIWxU3O0ofiLthW+q5z6I01TqZK85HD2uFcBaGZrP4clZZQUbSChyFGGnj+oNWATvNTaKKi93QHoqKTlGnPTQ82jdG6O51SlbwnDSmaDPgTRcp2bs9o7XpgjcBaLbHByTDXJi037tsBwI0KAQg1a7WNINEcWVg6XCx6/jN
sTM8BmN+fUf0WOYC+wlEyBBBwnJQpedulHSf2ddSX9N2/gFXi6nezglpqI56o0MYWlHu3w0pLmC8ugNVhpiRx29ivs9oKd8bywA4ptmyawoW21lo+1dmFaZxiKFbb2YCtc8qiphxfOvJam7FmachTVbph6EzGm3fKgwxWmyEE0VJlc1zrbw3cdnhBNTTC1oNFC0UVRcSvg7o01vhqP+wETKl6uwsGE2qG6jdpY0tDeSKKH1VvHc+tVvbtdu/o/7Vxbc5s4FP41mWkf7OFisPOYpGm63e40bZrZ5qkjg4xpBWKEiJ399SuBMBfJNgkGnEs600gHXeB856ZzCCfmRbC+IiBa/oNdiE4MzV2fmB9ODEPXpjP2i1MeMop1amQEj/iuGFQQbvz/YD5TUBPfhXFlIMUYUT+qEh0chtChFRogBK+qwxYYVXeNgAclwo0DkEz913fpMqPOLK2gf4K+t8x31jVxJQD5YEGIl8DFqxLJvDwxLwjGNGsF6wuIOPNyvnxKPs6/BXe/vqPb5dfzi8/ffkfLUbbYx8dM2TwCgSF98tLXPz5//Txzzq6W36Mfn3/q8Mvf30f6JFv7HqBEMCwE/j1kpHniM1ZnT04fcnZSuGa3cL6kAWIEnTVjSvAfeIERJowS4pCNPF/4CNVIAPleyLoOewjI6Of3kFCfAXUmLgS+6/JtzldLn8KbCDh8zxUTS0YjOAldyB9F48vjkApRMyasLx6CLQjXNdj38EzfAMk0AOIAUvLA5olVpqbAXgi/LbqrQpImgrQsCVEuMUDIrrdZuMCHNQREj4DL1CW4CARotMKEY/WGmQIzU5uOLQk2u1fYZC2TQPIYt6KtPBHWEczz4dpBeKVbVV5NFPKtEnC7K04ZEqO4SFMu2n4IxAZltjGjHPEmYwRACCLsERCohD03/owb5xEkPrtfLtHVidfFhT0SzRcGxBECrfNV46ytjQ2zqXjvkJatmJkKkDZOqxeQTAkF6DKfK7qY0CX2cAjQZUGt2YJizBeMI8HO35DSB8FQkFBcRTHbk2/0FG6yu8UJceCOgbYITgDx4K4FDTU6BCJAueesBDAK1oup19gPaYGqMa1breoK2e2LSTX8NnfxdEhlA3Vzy1huI+445oS1PN56J7hoaBxNIg9IHdD716ylZa3U+9RKW4LQwUHkIw5TDY7ci38Bc4iucexTH1e8+VY3XwYsx5RgB8ZxIyw2zNcnFWRs1kO1m9nEHFIQ0plp1adDm9bZEKYVrn36s9S+SyGyRO/DWqycdh5Ep3NzPGloju1OzLE+tXaa4+y2OjPHuhzni9C+bnAJjBNEJak5jpC/M0U1ZvbeuLXPAF9Xx60KvPwwSl4dXBtHeCRwmZYElwTJMOcxo8ooQ5M5ZRkqVnXmkXQ5rhCircHw3ic4DGAoS3TV76tkryTsZbk+MUwXwNnCkZSAXbGdGZwvdgULe0ObvQozx5TioJ06WHtjw75RnClRfGUH6y2wrHMTVFM+Vd6o33BQP30u8SDvlEBuFyNaDWPETUFhX5CYiX+LKLFdUkuTlO/tdN1KVe0mmtrr6dtoHLHn52UJRQkixWE7WHu8TDdeILxyloDQcQCZ7P/C5FdMMUlrYUrU6l6OclvQHUCGae1H6NTqE6CtIXpaj2TM9EMJk9ebKNkHrzGuAqwfgbc0TDXEI8oMmqyILowgY3no+FChja/OptaOHrpuqAC1FICedgboRMLlSMOf7kMdrWGoY2w5YfYU6lhDIlagdFcGaSDEjOeBmHy+lyEM3TP+Kg73LQjEse+cHKY215iXZktelk/eCguW01pmjScTfWzUEsfTjV/cU8qTVzuVVzOk1TrORBvTpnHtS81E7/aa5qSaibZVYVCvyU1DnepRIPZCc9G7AdNrhfbhAdOnrXym/jSfKaxQHphyMPzQYxf5gUMYFfkih8YF8XKzcy11u7D4P2XqNv3pw1nkuc79zsJs6XjTqcwzgofSgIib4niHk6i9c5UL1j
Y3cGrtGs4a2Q1s2ayWKdF7fq/EaJfAfJPtJ8u2cQyyrc9aCaumPzV20rWa1hSn2Z4CJ7NBXH0UNUYrx6wcI6sc4OQQHvCvy08Po1t4tfaie7r+rUPyLRrJ+fFXVWLcbx7sLUIyUIlRCaI6v/p8K4yHROUoKoxK1AZ5l/ewBcbGQJW9q5IZluxcd4l6zc/3kMFR3s3rfne3AzUdtLq4SzCfY3GxA3wGLS4q8dn+jtQLry12gu7AtUUlwlvSoC+wtNgFpIOWFpV4PrMX7Q8X4yjeq9/lco4kxhn0PbjHlxUPB5fir9KeAVy6fKKXAWxUVDwcJ6cH5uRbSfGpwvGS/rilA285aElRjZg6t6NA7PhLih0ANmhJUQ1YK3f5PKsuh/MURkNPMT20z32rJ+4V7Hb5yjfJbibZ9lFK9uuoJqrlvt2bsKmIJ0GUjxfeta/UfcgevnSQ4t27fA3eKaalvXxezJhK5TNCSv7oo73aaWi71NOQ8iypjhZfKUozJLxf0kwt/UnvdHN6wSwsyiilm6rr+gLajrLe6E5P51oXJ0e9aTVDO7Sytwtd5Fw3k9aRw3AgGCHIk6pZlGmeyfHniF0NIV1h8kd9UZ2v41cqpWa2PfF55T1WD87TuxGgyy1DkhDwrS54zIdjWvSSGJKsp5oWs7gW0IRs2zkJQLzl2RBOP3unvLYExF0x2WLNd3DsjbNbubi+ZSBd8f+1Bcy2lStAcvieh+UILmi7oHzzUT3e8biOi7aDA67taTvNSp8D54+XTq4fG+oq27XqNTgZ5I5iJr9bYRTfIyufBfRJZ4cBW0J0WN/xmHxayXdM+3ce+6x8R87lCJxH02zZwSPFdqIul2m483A5OIEf+nGw1WswnnPjiInL0GQYKgcRELo4CNMKq/ZujhmyjOsxTFzMHyd00ycF6L16emputTiCUJE6SY2/w04LCUr3fzPCh8qnSQXFR5jl6eOtMusWH4XNDgDFp3XNy/8B \ No newline at end of file diff --git a/talks-public/2021-06-17-graphrm/this/r-b-approach.pdf b/talks-public/2021-06-17-graphrm/this/r-b-approach.pdf new file mode 100644 index 0000000..d1799ac Binary files /dev/null and b/talks-public/2021-06-17-graphrm/this/r-b-approach.pdf differ diff --git a/talks-public/2021-06-17-graphrm/this/ratio-female-authors.pdf b/talks-public/2021-06-17-graphrm/this/ratio-female-authors.pdf new file mode 100644 index 0000000..bcf325a Binary files /dev/null and b/talks-public/2021-06-17-graphrm/this/ratio-female-authors.pdf differ diff --git a/talks-public/2021-06-17-graphrm/this/ratio-female-commits.pdf b/talks-public/2021-06-17-graphrm/this/ratio-female-commits.pdf new file mode 100644 index 0000000..baab5bf Binary files /dev/null and b/talks-public/2021-06-17-graphrm/this/ratio-female-commits.pdf differ diff --git a/talks-public/2021-06-17-graphrm/this/security.org b/talks-public/2021-06-17-graphrm/this/security.org new file mode 100644 index 0000000..d394279 
--- /dev/null +++ b/talks-public/2021-06-17-graphrm/this/security.org @@ -0,0 +1,122 @@ +** Securing the open source supply chain + + *Software supply chain attacks* are becoming more and more common and + rising in profile. → Cf. /SolarWinds attacks/ (2021), breaching several US + govt. branches + +*** Definition --- Reproducible Builds (R-B) + The build process of a software product is *reproducible* if, after + designating a specific version of its source code and all of its build + dependencies, every build produces *bit-for-bit identical artifacts*, no + matter the environment in which the build is performed. + +*** + - R-B makes it possible to *increase trust in binary executables* built from trusted + (open source) code by untrusted 3rd-party software vendors (e.g., app + stores, distros) + + - The *[[https://reproducible-builds.org/][reproducible-builds.org project]]* has popularized the notion, is + backed by major open source industry players, and has made large open + source software collections reproducible (e.g., 95% of Debian packages) + +*** References :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEGIN_EXPORT latex + \begin{thebibliography}{} + \footnotesize + \bibitem{Lamb2021RB} Chris Lamb, Stefano Zacchiroli + \newblock Reproducible Builds: Increasing the Integrity of Software Supply Chains + \newblock IEEE Software 2021 (to appear, DOI 10.1109/MS.2021.3073045) + \end{thebibliography} + #+END_EXPORT + +** Securing the open source supply chain (cont.) + #+BEAMER: \begin{center}\includegraphics[width=\textwidth]{this/r-b-approach}\end{center} + +** Securing the open source supply chain (cont.)
+*** + - Software Heritage provides key ingredients for R-B pipelines: on-demand + archival (e.g., of VCS commits referenced by build recipes) + long-term + availability + - We have implemented this by integrating the GNU Guix package manager with + Software Heritage + +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + #+BEAMER: \begin{center}\hfill\includegraphics[height=0.4\textheight]{swh-guix-1}\hfill\includegraphics[height=0.4\textheight]{swh-guix-2}\hfill~\end{center} + #+BEAMER: \scriptsize + - \url{https://www.softwareheritage.org/2019/04/18/software-heritage-and-gnu-guix-join-forces-to-enable-long-term-reproducibility/} + - \url{https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/} + +** Tracking of vulnerable source code artifacts + +*** + Software Heritage provides a unique observatory on (the best approximation + of) the entire /Software Commons/, i.e., all software published in source + code form + +*** Software provenance tracking at the scale of the world + - by following the /transposed/ Software Heritage graph we can locate *all + known public occurrences* of source code artifacts (individual source + files, entire source trees, commits) in other commits or repositories + + - we have developed two approaches to do that: + + 1. database-based (Rousseau et al. EMSE 2020): incremental, answers a + fixed set of queries, requires significant disk space + + 2. compressed-graph-based (Boldi et al. SANER 2020): non-incremental, + flexible graph-based querying, fits in RAM + + - current applications: "intellectual property"/prior art, open source + license compliance, software composition analysis (SCA) → collab. with + CAST + +** Tracking of vulnerable source code artifacts (cont.) + +*** Adding in-memory commit timestamps (experimental) + Idea: in-memory timestamp array (microsecond precision, 8 bytes each), indexed by + revision node id.
This makes it possible to exploit timestamp information efficiently + during graph visits. + +*** Finding the /earliest/ commit referencing a source file/dir + Early experiment: finding the earliest revision containing a given file + using in-memory commit timestamps, on 10 M randomly selected blobs. + + Mean lookup time: 4.1 ms (avg on 95th percentile: 2.2 ms) + +*** Tracking vulnerable source code files/trees + Given a source file/tree affected by a known vulnerability (e.g., + identified by a CVE) we can efficiently identify /all/ commits (and + repositories, extending the traversals) that reference it, triggering + further inspection. Furthermore, we can efficiently select which commits to + filter out during visits (e.g., "recent" ones, only in selected repos, + etc.), based on timestamps or other attributes (that fit in memory or are + mmap()-ed to disk). + +** Tracking of vulnerable source code artifacts (cont.) + +*** v. State-of-the-art industry offerings + Similar to what GitHub/GitLab offer as a service, but: + + - without having to rely on repository scanning, because the "big picture" + is already present in the Software Heritage archive by design + + - independent from the development platform vendor (e.g., a "vulnerable + file" primarily hosted on GitHub can be spotted in GitLab repositories + and vice-versa) + + - complementary and synergistic with analyses of vulnerable dependency + information (which are also available in Software Heritage via metadata + mining) + +*** Caveats + + - current granularity stops at the file level and traceability breaks with + even just whitespace changes. Increasing tracking granularity to the + snippet/line of code level is possible, but still untested at this scale + (cf.
research roadmap) diff --git a/talks-public/2021-06-17-graphrm/this/zack.org b/talks-public/2021-06-17-graphrm/this/zack.org new file mode 100644 index 0000000..9c29ca7 --- /dev/null +++ b/talks-public/2021-06-17-graphrm/this/zack.org @@ -0,0 +1,7 @@ + +** About me + - Associate Professor, Université de Paris, on leave at Inria + - Free/Open Source Software activist (20+ years) + - Debian Developer & Former 3x Debian Project Leader + - Former Open Source Initiative (OSI) director + - Software Heritage co-founder & CTO