diff --git a/common/modules/help-cs.org b/common/modules/help-cs.org
index b944ba7..a211f4c 100644
--- a/common/modules/help-cs.org
+++ b/common/modules/help-cs.org
@@ -1,135 +1,151 @@
 #+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
 #+INCLUDE: "prelude.org" :minlevel 1
 #
 # Research challenges
 #
 * Selected research challenges: building the archive
   :PROPERTIES:
   :CUSTOM_ID: main
   :END:
 ** Metadata alignment
+   :PROPERTIES:
+   :CUSTOM_ID: metadata
+   :END:
 *** Many concepts related to source code
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - project, archive, source, language, licence, BTS, mailing list, ...
     - developer, committer, author, architect, ...
 *** Many existing ontologies, catalogs
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - DOAP, FOAF, AppStream, schema.org, ADMS.SW, ...
 # mostly manual
     - Freecode (40,000+), Plume (400+), Debian (25,000+), FramaSoft (1,500+), OpenHub (670,000+), ...
 # OpenHub is mostly automatic
 # Wikipedia?
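The /link/ and /check/ steps of the metadata-alignment challenge can be illustrated with a toy record-linkage sketch: join two catalog exports on a normalized project URL. Catalog names, field names, and records below are hypothetical, and real catalogs need far richer matching (aliases, moved repositories, forks, ...); this only shows the basic join-key idea.

```python
# Hypothetical sketch: linking entries from two software catalogs by
# normalizing the project URL they point to. The catalogs, field names,
# and records are made up for illustration.
from urllib.parse import urlparse

def normalize_origin(url: str) -> str:
    """Reduce a project URL to a canonical form usable as a join key."""
    p = urlparse(url.strip().lower())
    host = p.netloc.removeprefix("www.")
    path = p.path.rstrip("/").removesuffix(".git")
    return f"{host}{path}"

catalog_a = [{"name": "GNU Hello", "homepage": "https://github.com/example/hello.git"}]
catalog_b = [{"title": "hello", "repo": "https://www.github.com/example/hello/"}]

# Index one catalog by normalized origin, then probe it with the other.
index = {normalize_origin(r["homepage"]): r for r in catalog_a}
links = [(r, index[normalize_origin(r["repo"])])
         for r in catalog_b if normalize_origin(r["repo"]) in index]
print(len(links))  # both records point at the same origin
```

Even this trivial normalization (scheme, `www.`, trailing `/`, `.git` suffix) already reconciles records that naive string equality would miss, which hints at why cross-catalog linking at millions-of-projects scale is a research problem rather than a join query.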
 *** Challenge: scale up metadata to millions of projects
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - /reconcile/ existing ontologies
     - /link/ and /check/ existing catalogs with Software Heritage
     - handle /inconsistent data/ and /provenance information/
     - synthesise missing information (machine learning)
 ** Software phylogenetics
+   :PROPERTIES:
+   :CUSTOM_ID: phylogenetics
+   :END:
 *** The Software Diaspora
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - Code often /migrates/ across projects: forks, copy-paste
     - Code gets /cloned/: reuse, language limitations, code smells
     - Projects /migrate/ across forges: fashion, functionality
     - Projects get /cloned/: mirrors, packages
 *** Challenge: tracing software evolution across billions of files
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - rebuild the history of software artefacts
     - identify code origins
     - spot code clones
     - build project impact graphs
 ** Distributed infrastructure
+   :PROPERTIES:
+   :CUSTOM_ID: infrastructure
+   :END:
 *** The software graph
     - files
     - directories
     - commits
     - projects
     all de-duplicated in Software Heritage
 *** Challenge: design efficient architectures and algorithms
     - replication and availability
     - navigation
     - what happens to CAP? (updates are nondestructive!)
     - query
+** Efficient updates
+*** Many sources
+    - GitHub (tens of millions of projects)
+    - GitLab (tens of thousands of instances)
+*** Frequent changes
+    - millions of commits/pushes per day
+*** Keeping up
+    - what is the most efficient way to keep up with this stream of data?
 * Selected research challenges: using the archive
   :PROPERTIES:
   :CUSTOM_ID: using
   :END:
 ** Code search: an old problem
 *** A natural need
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - Find the definition of a function/class/procedure/type/structure
     - Search for examples of code usage in an archive of source code
     - you name it...
 *** A natural approach
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - Regular expressions
 *** We have all used /grep/ since the 1970s!
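The de-duplicated software graph rests on content addressing: an artifact's identifier is a hash of its content, so identical files or directories anywhere in the archive collapse to a single node, and updates only ever add nodes (which is why they are nondestructive). A minimal sketch of the idea, using plain SHA-256 and an in-memory dict; actual Software Heritage identifiers are computed differently, so treat this as an illustration of the principle, not the real format:

```python
# Minimal sketch of content-addressed de-duplication: files and
# directories are stored under the hash of their content, Merkle-style.
# Plain SHA-256 over ad-hoc payloads; NOT the real SWHID computation.
import hashlib

storage = {}  # content hash -> raw bytes

def add_blob(data: bytes) -> str:
    """Store content under its own hash; identical content is stored once."""
    key = hashlib.sha256(data).hexdigest()
    storage.setdefault(key, data)
    return key

def add_tree(entries) -> str:
    """A directory is identified by the hash of its sorted (name, id)
    entries, so identical subtrees share one identifier."""
    payload = "\n".join(f"{name} {oid}" for name, oid in sorted(entries)).encode()
    return add_blob(payload)

f1 = add_blob(b"int main(void) { return 0; }\n")
f2 = add_blob(b"int main(void) { return 0; }\n")  # same file in another project
d1 = add_tree([("main.c", f1)])
d2 = add_tree([("main.c", f2)])
assert f1 == f2 and d1 == d2 and len(storage) == 2  # one blob + one tree
```

Because identifiers are pure functions of content, two independently ingested projects containing the same file converge on the same node, which is exactly what makes cross-project questions (origins, clones, impact) expressible as graph queries.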
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     \hfill where is the challenge?
 ** Finding a needle in a haystack: size matters!
    How do we search in /millions/ of source code files?
 *** Google Code Search (opened 2006, closed 2011)
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     #+latex: see {\small \url{https://swtch.com/~rsc/regexp/regexp4.html}}
     #+latex: reborn in Debian {\small \url{http://codesearch.debian.net/}}
 *** How
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     - build an inverted index of /trigrams/ from all source files
     - /map/ regexps to trigrams
     - /filter/ files that may match
     - run /grep/ on each file (using the "cloud")
 *** Performance
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     \hfill scaled reasonably well up to /1 billion lines of code/
 ** Challenge: scaling up code search
 *** What about /all the source code/ in the world?
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     Software Heritage is already /two orders of magnitude/ bigger:
     - over /two billion/ unique source files
     - /hundreds of billions/ of LOCs
     We need new insights to handle this.
 *** Beyond regular expressions?
     :PROPERTIES:
     :BEAMER_act: +-
     :END:
     Advanced code search requires
     - language-specific /patterns/
     - working on /abstract syntax trees/
     Regular expressions are a nice /Swiss Army knife/ approximation; can we build a specific tool that scales?
 ** Software as Big Data = "Big Code"
 *** Remember the numbers
-    - 30 million repositories ingested (10M next in line)
-    - 700 million commits
-    - 3 billion unique source files / 200 TB of raw source code
-    ... and growing by the day!
+    - tens of millions of repositories ingested
+    - over 1 billion commits
+    - billions of unique source files ... and growing by the day!
 *** Challenge: what can machines learn here?
     - programming patterns
     - developer skills
     - vulnerabilities
     - bugs and fixes
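The trigram-filtering pipeline on the code-search slides (index, map, filter, grep) can be sketched for literal patterns as follows. Translating arbitrary regexps into trigram queries, as described in the Cox article linked on the slide, is considerably more involved; the file names and contents here are made up, and a real index would store posting lists on disk, not in a Python dict.

```python
# Sketch of trigram filtering for literal search patterns: an inverted
# index from trigrams to files narrows the candidates, and the real
# pattern match runs only on those. Files and contents are illustrative.
import re
from collections import defaultdict

def trigrams(text: str) -> set:
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

files = {
    "a.c": "int parse_args(int argc, char **argv)",
    "b.c": "void print_usage(void)",
    "c.py": "def parse_config(path):",
}

# Index step: trigram -> set of files containing it.
index = defaultdict(set)
for name, body in files.items():
    for t in trigrams(body):
        index[t].add(name)

def search(literal_pattern: str) -> list:
    """Filter step: any match must contain every trigram of the pattern,
    so candidates are the intersection of the posting lists; the "grep"
    step then runs the actual match on candidates only.
    (Assumes the pattern is at least 3 characters long.)"""
    candidates = set.intersection(*(index[t] for t in trigrams(literal_pattern)))
    return sorted(f for f in candidates if re.search(literal_pattern, files[f]))

print(search("parse_"))  # matches a.c and c.py without scanning b.c
```

The payoff is that the expensive scan touches only files whose posting lists intersect, which is what let the trigram approach scale to a billion lines; whether a similar pre-filter exists for AST-level patterns is precisely the open question the slide raises.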