diff --git a/talks-public/2021-11-30-sponsors-tech/2021-11-30-sponsors-tech.org b/talks-public/2021-11-30-sponsors-tech/2021-11-30-sponsors-tech.org new file mode 100644 index 0000000..5ff4bd3 --- /dev/null +++ b/talks-public/2021-11-30-sponsors-tech/2021-11-30-sponsors-tech.org @@ -0,0 +1,360 @@ +#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt) +#+TITLE: Source Code Tracking at Software Heritage Scale +#+SUBTITLE: for compliance, open science, and security +#+BEAMER_HEADER: \date[30 Nov 2021, \#swh5years]{30 Nov 2021\\\#swh5years --- Sponsors meeting\\UNESCO\\[-2ex]} +#+AUTHOR: Stefano Zacchiroli +#+DATE: 30 November 2021 +#+EMAIL: zack@upsilon.cc + +#+INCLUDE: "../../common/modules/prelude-toc.org" :minlevel 1 +#+INCLUDE: "../../common/modules/169.org" +#+BEAMER_HEADER: \institute[Software Heritage]{Software Heritage --- {\tt zack@upsilon.cc, @zacchiro}} +#+BEAMER_HEADER: \title[Source Code Tracking at SWH Scale]{Source Code Tracking at Software Heritage Scale} +#+BEAMER_HEADER: \author{Stefano Zacchiroli} + +#+LATEX_HEADER_EXTRA: \usepackage{pifont} +#+LATEX_HEADER_EXTRA: \usepackage{xspace} +#+LATEX_HEADER_EXTRA: \def\OK{\mbox{\ding{51}}\xspace} +#+LATEX_HEADER_EXTRA: \def\KO{\mbox{\ding{55}}\xspace} +#+LATEX_HEADER: \definecolor{links}{HTML}{2A1B81} +#+LATEX_HEADER: \hypersetup{colorlinks,linkcolor=,urlcolor=links} + +** About the speaker :noexport: + #+INCLUDE: "this/zack.org::#bio" :only-contents t + +* FOSS Source Code Tracking... 
+** The largest free/open source software archive + # #+INCLUDE: "../../common/modules/status-extended.org::#archive" :only-contents t :minlevel 3 + #+BEAMER: \vspace{-1mm} + #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{.95\linewidth}]{archive-growth.png}\end{center} + #+BEAMER: \vspace{-2mm} +*** + #+BEAMER: \begin{center}\includegraphics[width=\extblockscale{1\linewidth}]{archive-coverage.png}\end{center} +*** + The largest public source code archive in the world (and growing!) + +** Automation and storage + #+BEAMER: \begin{center} + #+BEAMER: \includegraphics[width=\extblockscale{1.3\textwidth}]{swh-dataflow-merkle.pdf} + #+BEAMER: \end{center} + Full development history *permanently archived* in a *uniform data model*. +** Meet the Software Heritage Identifiers (SWHIDs) \hfill [[https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html][(full spec)]] + #+INCLUDE: "../../common/modules/swhid.org::#oneslide" :only-contents t + +** "It's +Turtles+ SWHIDs all the way down" +*** + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .3 + :END: + #+BEAMER: \centering \includegraphics[width=\linewidth]{git-merkle/merkle-vertical} +*** + :PROPERTIES: + :BEAMER_env: block + :BEAMER_COL: .73 + :END: + Reference */any/ source code artifact* that has ever been shared---source + code file, tree, commit, release, repository state---using the same, + standard identifier. + #+BEAMER: \end{block} \begin{block}{} + Try it out: + #+BEAMER: \footnotesize + #+BEGIN_SRC + $ pip install swh.model[cli] + $ swh identify /srv/src/linux/kernel/ + swh:1:dir:b770a2aed8db52df737f88f18ca6bf39a1582240 + #+END_SRC + +* ... for Open Compliance +** Source Code Tracking for... 
--- Open Compliance + # (Open Compliance, /noun/ --- the reason we are gathered here today)\\ + # More seriously, here is + #+BEAMER: \begin{definition}[Open Compliance] + The *pursuit of compliance* with /license obligations/ and other /best + practices/ for the management of open source software components, *using + only open technologies* such as: _open source_ software, _open data_ + information, and _open access_ documentation. + #+BEAMER: \end{definition} + #+BEAMER: \pause +*** Why + Reduced lock-in risks, lower total cost of ownership (TCO), crowdsourcing, + alignment with FOSS community ethos. + #+BEAMER: \pause +*** + #+BEAMER: \bfseries + We still lack a source code scanning tool that is compliant with Open + Compliance principles and addresses practical industry needs. + #+BEAMER: \\[2ex] + Can we build one on top of Software Heritage? + +** Tech preview: swh-scanner + #+BEAMER: \vspace{-1mm} +*** Vision + =swh-scanner= is an *open source* and *open data* source code scanner for + *open compliance* workflows, backed by the *largest public archive* of FOSS + source code. + #+BEAMER: \pause +*** Design (of the current prototype) + - queries the Software Heritage archive as the source of truth about public code + - leverages the Merkle DAG model and SWHIDs for maximum scanning efficiency + - e.g., no need to query the back-end for files contained in a known + directory + - file-level granularity + - output: partition of the source tree into known (= published before) vs. unknown +*** :B_ignoreheading: + :PROPERTIES: + :BEAMER_env: ignoreheading + :END: + Code: [[https://forge.softwareheritage.org/source/swh-scanner/][forge.softwareheritage.org/source/swh-scanner]] (GPL 3+)\\ + Package: [[https://pypi.org/project/swh.scanner/][pypi.org/project/swh.scanner]] + +** swh-scanner --- Example + #+BEAMER: \scriptsize \vspace{-1mm} + #+BEGIN_SRC + $ pip install swh.scanner + + $ time swh scanner scan -f json /srv/src/linux/kernel + { + [...] 
+ "/srv/src/linux/kernel/auditsc.c": { + "known": true, + "swhid": "swh:1:cnt:814406a35db163080bbf937524d63690861ff750" }, + "/srv/src/linux/kernel/backtracetest.c": { + "known": true, + "swhid": "swh:1:cnt:a2a97fa3071b1c7ee6595d61a172f7ccc73ea40b" }, + "/srv/src/linux/kernel/bounds.c": { + "known": true, + "swhid": "swh:1:cnt:9795d75b09b2323306ad6a058a6350a87a251443" }, + "/srv/src/linux/kernel/bpf": { + "known": true, + "swhid": "swh:1:dir:fcd9987804d26274fee1eb6711fac38036ccaee7" }, + "/srv/src/linux/kernel/capability.c": { + "known": true, + "swhid": "swh:1:cnt:1444f3954d750ba685b9423e94522e0243175f90" }, + [...] + } + 0,53s user 0,61s system 145% cpu 1,867 total + #+END_SRC + +** swh-scanner --- Example (cont.) + #+BEAMER: \scriptsize + #+BEGIN_SRC + $ du -sh --exclude=.git /srv/src/linux + 1,1G /srv/src/linux + + $ time swh scanner scan -f json -x *.git /srv/src/linux + { + [...] + "/srv/src/linux/arch": { + "known": true, + "swhid": "swh:1:dir:590c329d3548b7d552fc913a51965353f01c9e2f" }, + [...] + "/srv/src/linux/scripts/kallsyms.c": { + "known": true, + "swhid": "swh:1:cnt:0096cd9653327584fe62ce56ba158c68875c5067" }, + "/srv/src/linux/scripts/kconfig": { + "known": false, + "swhid": "swh:1:dir:548afc93bd01d2fba0dfcc0fd8c69f4b082ab8c6" }, + "/srv/src/linux/scripts/kconfig/.conf.o.cmd": { + "known": false, + "swhid": "swh:1:cnt:0d8be19e430c082ece6a3803923ad6ecb9e7d413" }, + [...] + } + 20,84s user 1,52s system 103% cpu 21,540 total + #+END_SRC + +** swh-scanner --- Example (cont.) + Interactive mode to drill-down and inspect unknown files: + #+BEAMER: \footnotesize + #+BEGIN_SRC + $ swh scanner scan -f sunburst -x *.git /srv/src/linux + #+END_SRC + #+BEAMER: \begin{center} \includegraphics[width=0.6\linewidth]{swh-scanner-sunburst} \end{center} + +** swh-scanner --- Going further +*** + swh-scanner shows that /it is possible/ to create a source code scanner + that is both open source and backed by the most comprehensive open data + FOSS archive. 
+ #+BEAMER: \pause +*** Roadmap + swh-scanner is /not a production-ready scanner/. The following features are + still missing: + - license information \hfill $\to$ in-house scanning + ClearlyDefined + - provenance information \hfill $\to$ Software Heritage crawling info + - increase granularity to snippet/SLOC + Some of these are low-hanging fruit; others require substantial R&D + investments. + + #+BEAMER: \pause +*** Feedback welcome + - feel free to play with swh-scanner, feedback is very welcome! + - caveat: intensive use will result in hitting the API rate-limit + +* ... for Open Science +** Prior Art Search & Plagiarism Detection +*** Use case 1: Researcher (Prior Art Search) + Verify that the novelty status of the replication package of a paper under + submission matches expectations: + - Original code written for the experiment should be novel + - Reused 3rd-party FOSS components should not be novel (verifying this also + helps with spotting undesirable local patches) +*** Use case 2: Open Science Publisher (Plagiarism Detection) + - Verify that the source code parts of papers, submitted as original work by + the authors, are in fact original. + - This is already standard publisher procedure for the /textual part/ of + submitted papers, but it isn't yet for software source code. + +** Prior Art Search & Plagiarism Detection --- Example +*** + Let's verify that in the replication package of our ICSE 2021 paper about + swh-fuse we have used a public, archived version of the package: + #+BEAMER: \small + #+BEGIN_SRC + $ swh scanner scan -f ndjson replication-package/swh-fuse/ + {".": {"swhid": "swh:1:dir:3d4f903b[...]", "known": true}} + ... + #+END_SRC +*** + Let's check that the rest of the replication package is novel at + submission time (it will be archived in Software Heritage at camera-ready + time): + #+BEAMER: \small + #+BEGIN_SRC + $ swh scanner scan -f ndjson replication-package + {".": {"swhid": "swh:1:dir:14ecd6[...]", "known": false}} + ... 
+ #+END_SRC +*** + - *Researchers* can integrate these checks in their pre-submission + checklists + - *Publishers* can integrate these checks into existing plagiarism + detection pipelines, making results available to scientific editors and + reviewers + +* ... for Security +** Tracking unfixed vulnerabilities +*** OpenSSL (prior to 1.0.1i) has vulnerabilities fixed on 2015-01-01 + Let's find software that still uses unpatched versions after 2015-01-01 + #+BEAMER: \pause +*** Obtain SWHIDs of known vulnerable versions of d1\under{}both.c: + # $ for f in openssl-1.0.1*/ssl/d1_both.c; do swh-identify $f; done + # swh:1:cnt:0a84f957118afa9804451add380eca4719a9765e openssl-1.0.1-beta1/ssl/d1_both.c + # swh:1:cnt:7a5596a6b373aeabbd6d8d674f0e20b1618c5012 openssl-1.0.1f/ssl/d1_both.c + # swh:1:cnt:2e8cf681ed0976e2b16460170fda27c77cfec6cc openssl-1.0.1g/ssl/d1_both.c + # swh:1:cnt:04aa23107ec53c184505e98091306c7391091bb5 openssl-1.0.1h/ssl/d1_both.c + # swh:1:cnt:de8bab873f2cf114d0d1b3e49acfa09bb9d0e4f7 openssl-1.0.1/ssl/d1_both.c + $ for f in openssl-1.0.1*/ssl/d1\under{}both.c; do swh-identify $f; done + #+BEAMER: \pause + #+BEAMER: \small + |----------------------------------------------------+------------------------------------------| + | swh:1:cnt:0a84f957118afa9804451add380eca4719a9765e | openssl-1.0.1-beta1/ssl/d1\under{}both.c | + | swh:1:cnt:7a5596a6b373aeabbd6d8d674f0e20b1618c5012 | openssl-1.0.1f/ssl/d1\under{}both.c | + | swh:1:cnt:2e8cf681ed0976e2b16460170fda27c77cfec6cc | openssl-1.0.1g/ssl/d1\under{}both.c | + | swh:1:cnt:04aa23107ec53c184505e98091306c7391091bb5 | openssl-1.0.1h/ssl/d1\under{}both.c | + | swh:1:cnt:de8bab873f2cf114d0d1b3e49acfa09bb9d0e4f7 | openssl-1.0.1/ssl/d1\under{}both.c | + |----------------------------------------------------+------------------------------------------| + #+BEAMER: \pause +*** + What repositories still ship a vulnerable version of OpenSSL after the fix? 
+ +** Example: vulnerable software + #+ATTR_LATEX: :width .8\linewidth + file:openssl-track.png + +** How do we go from file/dir SWHIDs back to repositories? +*** + At the scale of Software Heritage, efficiently querying /all/ the places + where a given SWHID can be found is a challenging R&D problem that + nobody has ever tackled in its full generality (see Rousseau et al., + Empir. Softw. Eng. (2020) for the scientific details). +*** + - We are working on a *complete software provenance index* for the archive + - Meanwhile we have building blocks in place that address specific parts of + the problem, based on a *compressed graph* representation of the archive + (see Boldi et al., SANER (2020) for sci. details). They come with an API + you can play with. +*** + Code: [[https://forge.softwareheritage.org/source/swh-graph/][forge.softwareheritage.org/source/swh-graph]] (GPL 3+)\\ + Doc: [[https://docs.softwareheritage.org/devel/swh-graph/][docs.softwareheritage.org/devel/swh-graph]]\\ + API: [[https://docs.softwareheritage.org/devel/swh-graph/api.html][docs.softwareheritage.org/devel/swh-graph/api.html]] + +** Graph API --- Example +*** + (All) repositories containing vulnerable versions of d1\under{}both.c: + #+BEAMER: \footnotesize + #+BEGIN_SRC + $ swh-whereis swh:1:cnt:0a84f957118afa9804451add380eca4719a9765e + https://github.com/mathiassamuelson/openssl-dane-ms + https://gitorious.org/baserock-morphs/openssl.git + https://github.com/tack/openssl_tack + https://gitorious.org/myopenssl/myopenssl.git + https://bitbucket.org/xreach/android-external-openssl.git + [...] 
+ #+END_SRC +*** + Where swh-whereis is a trivial wrapper around the swh-graph API: + #+BEAMER: \scriptsize + #+BEGIN_SRC + curl --silent --fail --location \ + "${API_URL}/graph/leaves/${SWHID}/?direction=backward&resolve_origins=true" + #+END_SRC +*** Caveats + - no filtering/sorting on commit timestamps (yet) + - no filtering on path (e.g., OpenSSL should be included in a sub-dir, etc.) + +** Graph API --- Example (cont.) +*** Same API, compliance use case + Find software that has been licensed under the original 1.0 version of the + Affero GPL license (SWHID: + swh:1:cnt:8f3209754390bbc58953d49701ed45c9d4a1a47f). + #+BEAMER: \footnotesize + #+BEGIN_SRC + $ curl "https://archive.softwareheritage.org/api/1/graph/randomwalk/\ + swh:1:cnt:8f3209754390bbc58953d49701ed45c9d4a1a47f/ori/\ + ?direction=backward&limit=-1&resolve_origins=true" + https://github.com/uwsampa/grappa/ + #+END_SRC +*** + - note the random walk, for spot checks, examples, etc. + - (Grappa has since been re-released under the BSD 3-Clause license) + +* Wrapping up +** Wrapping up + #+BEGIN_EXPORT latex + \begin{center} + \includegraphics[width=.45\linewidth]{SWH-logo+motto.pdf}\\ + \hfill \href{https://www.softwareheritage.org}{www.softwareheritage.org} + \hfill \href{https://twitter.com/swheritage}{@swheritage} \hfill~ + \end{center} + #+END_EXPORT +*** + - thanks to its data model, Software Heritage provides a *global view* on + the largest public collection of software source code artifacts + - this global view enables *source code tracking* of public code (FOSS, and + more) at an unprecedented scale + - global source code tracking is a key building block for addressing use + cases in domains such as *license compliance*, *open science*, and + *security* + - we are laying the foundations for addressing these use cases +*** Contacts + [[https://upsilon.cc/~zack/][Stefano Zacchiroli]] / [[mailto:zack@upsilon.cc][zack@upsilon.cc]] / [[https://twitter.com/zacchiro][@zacchiro]] / 
[[https://mastodon.xyz/@zacchiro][@zacchiro@mastodon.xyz]] + +* Appendix :B_appendix: + :PROPERTIES: + :BEAMER_env: appendix + :END: +** Web API --- Integrate your tools with the Software Heritage archive + #+INCLUDE: "../../common/modules/status-extended.org::#apiintro" :only-contents t + +** Anatomy of a KYSW toolchain + #+BEAMER: \begin{center}{\includegraphics[width=0.8\textwidth]{compliance-toolchain}}\end{center} + #+BEAMER: {\tiny \vspace{-1mm} + source: [[https://upsilon.cc/~zack/talks/2016/2016-01-31-fosdem-compliance.pdf][/A Community Take on the License Compliance Industry/]], Stefano + Zacchiroli, FOSDEM 2016, Legal and Policy Issues devroom, + https://upsilon.cc/~zack/talks/2016/2016-01-31-fosdem-compliance.pdf + #+BEAMER: } +*** + A source *code scanner* is the key ingredient of all KYSW toolchains: it + scans a local /source/ code base and compares it to a FOSS knowledge base, + summarizing findings. diff --git a/talks-public/2021-11-30-sponsors-tech/Makefile b/talks-public/2021-11-30-sponsors-tech/Makefile new file mode 100644 index 0000000..68fbee7 --- /dev/null +++ b/talks-public/2021-11-30-sponsors-tech/Makefile @@ -0,0 +1 @@ +include ../Makefile.slides diff --git a/talks-public/2021-11-30-sponsors-tech/this/zack.org b/talks-public/2021-11-30-sponsors-tech/this/zack.org new file mode 100644 index 0000000..e509805 --- /dev/null +++ b/talks-public/2021-11-30-sponsors-tech/this/zack.org @@ -0,0 +1,12 @@ + +** Short Bio: Stefano Zacchiroli + :PROPERTIES: + :CUSTOM_ID: bio + :END: +*** + - Professor of Computer Science, Télécom Paris, Institut Polytechnique de + Paris + - Free/Open Source Software activist (20+ years) + - Debian Developer & Former 3x Debian Project Leader + - Former Open Source Initiative (OSI) director + - Software Heritage co-founder & CTO