dataset.org
No OneTemporary
Actions

Size

8 KB

Subscribers

None

dataset.org
View Options

	#+COLUMNS: %40ITEM %10BEAMER_env(Env) %9BEAMER_envargs(Env Args) %10BEAMER_act(Act) %4BEAMER_col(Col) %10BEAMER_extra(Extra) %8BEAMER_opt(Opt)
	#+INCLUDE: "prelude.org" :minlevel 1

	* Open Datasets
	:PROPERTIES:
	:CUSTOM_ID: main
	:END:
	** Software Heritage Graph dataset
	:PROPERTIES:
	:CUSTOM_ID: graphdataset
	:END:
	#+BEAMER: \vspace{-1mm}

	Use case: large scale analyses of the most comprehensive corpus on the
	development history of free/open source software.

	***
	#+BEGIN_EXPORT latex
	\vspace{-2mm}
	\begin{thebibliography}{Foo Bar, 1969}
	\small
	\bibitem{Pietri2019} Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli\newblock
	The Software Heritage Graph Dataset: Public software development under one roof\newblock
	MSR 2019: 16th Intl. Conf. on Mining Software Repositories. IEEE\newblock
	preprint: \url{http://deb.li/swhmsr19}
	\end{thebibliography}
	#+END_EXPORT

	#+BEAMER: \vspace{-1mm}
	*** Dataset
	- Relational representation of the full graph as a set of tables
	- Available as open data: https://doi.org/10.5281/zenodo.2583978
	- Chosen as subject for the MSR 2020 Mining Challenge

	#+BEAMER: \vspace{-1mm}
	*** Formats
	- Local use: PostgreSQL dumps, or Apache Parquet files (~1 TiB each)
	- Live usage: Amazon Athena (SQL-queriable), Azure Data Lake (soon)

	** Sample query --- most frequent first commit words
	:PROPERTIES:
	:CUSTOM_ID: graphquery1
	:END:
	***
	#+begin_src sql
	SELECT COUNT(*) AS c, word FROM (
	SELECT LOWER(REGEXP_EXTRACT(FROM_UTF8(
	message), 'ˆ\w+')) AS word FROM revision)
	WHERE word != ''
	GROUP BY word ORDER BY COUNT(*) DESC LIMIT 5;
	#+end_src
	#+BEAMER: \pause
	***
	\| Count \| Word \|
	\|------------+--------\|
	\| 71 338 310 \| update \|
	\| 64 980 346 \| merge \|
	\| 56 854 372 \| add \|
	\| 44 971 954 \| added \|
	\| 33 222 056 \| fix \|

	** Sample query --- fork and merge arities
	:PROPERTIES:
	:CUSTOM_ID: graphquery2
	:END:
	*** Fork arity
	:PROPERTIES:
	:BEAMER_env: block
	:BEAMER_COL: 0.5
	:END:
	i.e., how often is a commit based upon?
	#+BEAMER: \scriptsize
	#+begin_src sql
	SELECT fork_deg, count(*) FROM (
	SELECT id, count(*) AS fork_deg
	FROM revision_history GROUP BY id) t
	GROUP BY fork_deg ORDER BY fork_deg;
	#+end_src
	#+BEAMER: \includegraphics[width=\linewidth]{fork-degree}
	#+BEAMER: \pause
	*** Merge arity
	:PROPERTIES:
	:BEAMER_env: block
	:BEAMER_COL: 0.5
	:END:
	i.e., how large are merges?
	#+BEAMER: \scriptsize
	#+begin_src sql
	SELECT merge_deg, COUNT(*) FROM (
	SELECT parent_id, COUNT(*) AS merge_deg
	FROM revision_history GROUP BY parent_id) t
	GROUP BY deg ORDER BY deg;
	#+end_src
	#+BEAMER: \includegraphics[width=\linewidth]{merge-degree}
	* Other queries
	:PROPERTIES:
	:CUSTOM_ID: morequery
	:END:
	** Sample query --- ratio of commits performed during weekends
	:PROPERTIES:
	:CUSTOM_ID: weekendsrc
	:END:
	#+BEGIN_SRC sql
	WITH revision_date AS
	(SELECT FROM_UNIXTIME(date / 1000000) AS date
	FROM revision)
	SELECT yearly_rev.year AS year,
	CAST(yearly_weekend_rev.number AS DOUBLE)
	/ yearly_rev.number * 100.0 AS weekend_pc
	FROM
	(SELECT YEAR(date) AS year, COUNT(*) AS number
	FROM revision_date
	WHERE YEAR(date) BETWEEN 1971 AND 2018
	GROUP BY YEAR(date) ) AS yearly_rev
	JOIN
	(SELECT YEAR(date) AS year, COUNT(*) AS number
	FROM revision_date
	WHERE DAY_OF_WEEK(date) >= 6
	AND YEAR(date) BETWEEN 1971 AND 2018
	GROUP BY YEAR(date) ) AS yearly_weekend_rev
	ON yearly_rev.year = yearly_weekend_rev.year
	ORDER BY year DESC;
	#+END_SRC

	** Sample query --- ratio of commits performed during weekends (cont.)
	:PROPERTIES:
	:CUSTOM_ID: weekendout
	:END:
	\| Year \| Weekend \| Total \| Weekend percentage \|
	\|--------+-----------+-----------+----------------------\|
	\| 2018 \| 15130065 \| 78539158 \| 19.26 \|
	\| 2017 \| 33776451 \| 168074276 \| 20.09 \|
	\| 2016 \| 43890325 \| 209442130 \| 20.95 \|
	\| 2015 \| 35781159 \| 166884920 \| 21.44 \|
	\| 2014 \| 24591048 \| 122341275 \| 20.10 \|
	\| 2013 \| 17792778 \| 88524430 \| 20.09 \|
	\| 2012 \| 12794430 \| 64516008 \| 19.83 \|
	\| 2011 \| 9765190 \| 48479321 \| 20.14 \|
	\| 2010 \| 7766348 \| 38561515 \| 20.14 \|
	\| 2009 \| 6352253 \| 31053219 \| 20.45 \|
	\| 2008 \| 4568373 \| 22474882 \| 20.32 \|
	\| 2007 \| 3318881 \| 16289632 \| 20.37 \|
	\| 2006 \| 2597142 \| 12224905 \| 21.24 \|
	\| 2005 \| 2086697 \| 9603804 \| 21.72 \|
	\| 2004 \| 1752400 \| 7948104 \| 22.04 \|
	\| 2003 \| 1426033 \| 6941593 \| 20.54 \|
	\| 2002 \| 1159294 \| 5378538 \| 21.55 \|
	\| 2001 \| 849905 \| 4098587 \| 20.73 \|
	\| 2000 \| 2091770 \| 4338842 \| 48.21 \|
	\| 1999 \| 438540 \| 2026906 \| 21.63 \|
	\| 1998 \| 311888 \| 1430567 \| 21.80 \|
	\| 1997 \| 263995 \| 1129249 \| 23.37 \|
	\| 1996 \| 192543 \| 795827 \| 24.19 \|
	\| 1995 \| 176270 \| 670417 \| 26.29 \|
	\| 1994 \| 137811 \| 581563 \| 23.69 \|
	\| 1993 \| 169767 \| 697343 \| 24.34 \|
	\| 1992 \| 74923 \| 422068 \| 17.75 \|
	\| 1991 \| 92782 \| 484547 \| 19.14 \|
	\| 1990 \| 113201 \| 340489 \| 33.24 \|
	\| 1989 \| 31742 \| 182325 \| 17.40 \|
	\| 1988 \| 44983 \| 206275 \| 21.80 \|
	\| 1987 \| 27892 \| 146157 \| 19.08 \|
	\| 1986 \| 54200 \| 237330 \| 22.83 \|
	\| 1985 \| 75595 \| 306564 \| 24.65 \|
	\| 1984 \| 26391 \| 95506 \| 27.63 \|
	\| 1983 \| 89776 \| 370687 \| 24.21 \|
	\| 1982 \| 51524 \| 191933 \| 26.84 \|
	\| 1981 \| 32995 \| 123618 \| 26.69 \|
	\| 1980 \| 31832 \| 133733 \| 23.80 \|
	\| 1979 \| 20943 \| 175164 \| 11.95 \|
	\| 1978 \| 3773 \| 33677 \| 11.20 \|
	\| 1977 \| 4783 \| 19376 \| 24.68 \|
	\| 1976 \| 1907 \| 7048 \| 27.05 \|
	\| 1975 \| 2089 \| 26579 \| 7.85 \|
	\| 1974 \| 2095 \| 14290 \| 14.66 \|
	\| 1973 \| 2988 \| 15580 \| 19.17 \|
	\| 1972 \| 1755 \| 6552 \| 26.78 \|
	\| 1971 \| 1723 \| 6125 \| 28.13 \|

	** Sample query --- average size of the most popular file types
	:PROPERTIES:
	:CUSTOM_ID: popfilesrc
	:END:
	#+BEGIN_SRC sql
	SELECT suffix,
	ROUND(COUNT() 100 / 1e6) AS Million_files,
	ROUND(AVG(length) / 1024) AS Average_k_length
	FROM
	(SELECT length, suffix
	FROM -- File length in joinable form
	(SELECT TO_BASE64(sha1_git) AS sha1_git64, length
	FROM content ) AS content_length
	JOIN -- Sample of files with popular suffixes
	(SELECT target64, file_suffix_sample.suffix AS suffix
	FROM -- Popular suffixes
	(SELECT suffix FROM (
	SELECT REGEXP_EXTRACT(FROM_UTF8(name),
	'\.[^.]+$') AS suffix
	FROM directory_entry_file) AS file_suffix
	GROUP BY suffix
	ORDER BY COUNT(*) DESC LIMIT 20 ) AS pop_suffix
	JOIN -- Sample of files and suffixes
	(SELECT TO_BASE64(target) AS target64,
	REGEXP_EXTRACT(FROM_UTF8(name),
	'\.[^.]+$') AS suffix
	FROM directory_entry_file TABLESAMPLE BERNOULLI(1))
	AS file_suffix_sample
	ON file_suffix_sample.suffix = pop_suffix.suffix)
	AS pop_suffix_sample
	ON pop_suffix_sample.target64 = content_length.sha1_git64)
	GROUP BY suffix
	ORDER BY AVG(length) DESC;
	#+END_SRC

	* Discussion
	:PROPERTIES:
	:CUSTOM_ID: discussion
	:END:
	** Discussion
	- one /can/ query such a corpus SQL-style
	- but relational representation shows its limits at this scale
	- ...at least as deployed on commercial SQL offerings such as Athena
	- note: (naive) sharding is ineffective, due to the pseudo-random
	distribution of node identifiers
	- experiments with Google BigQuery are ongoing
	- (we broke it at the first import attempt..., due to very large arrays in
	directory entry tables)

File Metadata

Mime Type: text/x-tex
Expires: Thu, Jul 3, 10:58 AM (1 w, 3 d ago)
Storage Engine: blob
Storage Format: Raw Data
Storage Handle: 3291575

dataset.orgNo OneTemporaryActions

dataset.orgView Options

File Metadata

Event Timeline

dataset.org
No OneTemporary
Actions

dataset.org
View Options