diff --git a/docs/report/circle.pdf b/docs/report/circle.pdf
index ddd0342..c317cf2 100644
Binary files a/docs/report/circle.pdf and b/docs/report/circle.pdf differ
diff --git a/docs/report/missfont.log b/docs/report/missfont.log
deleted file mode 100644
index bb465b6..0000000
--- a/docs/report/missfont.log
+++ /dev/null
@@ -1,72 +0,0 @@
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm song/OT
-mktextfm song/OT
-mktextfm song/OT
-mktextfm song/BI/OT
-mktextfm song/OT
-mktextfm song/B/OT
-mktextfm song/OT
-mktextfm song/I/OT
-mktextfm song/OT
-mktextfm song/OT
-mktextfm song/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/BI/OT
-mktextfm SimSun/OT
-mktextfm SimSun/B/OT
-mktextfm SimSun/OT
-mktextfm SimSun/I/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
-mktextfm SimSun/OT
diff --git a/docs/report/rapport.pdf b/docs/report/rapport.pdf
deleted file mode 100644
index 954bac2..0000000
Binary files a/docs/report/rapport.pdf and /dev/null differ
diff --git a/docs/report/rapport.tex b/docs/report/rapport.tex
deleted file mode 100644
index 5b8b9b0..0000000
--- a/docs/report/rapport.tex
+++ /dev/null
@@ -1,103 +0,0 @@
-\documentclass[a4paper,12pt]{article}
-\usepackage[french]{babel}
-\usepackage[parfill]{parskip}
-\usepackage{graphicx}
-\usepackage{amssymb}
-\usepackage{amsmath}
-\usepackage{amsthm}
-\usepackage{xunicode}
-\usepackage[T1]{fontenc}
-%\usepackage{times}
-
-\title{Détection de langage de programmation à grande échelle}
-\author{Yuan YIN}
-\date{}
-
-\begin{document}
-\maketitle
- \begin{abstract}
- (à compléter)
- \end{abstract}
-
-\section{Introduction}
-
-La détection de langage de programmation est un problème qui consiste à identifier en quel langage un morceau de code source est écrit. Le code source est désigné par une représentation séquentielle textuelle d'un artéfact du logiciel, qui est souvent sous forme de chaînes de caractères ou, dans un sens plus général, chaînes de bytes. Plus précisément, il s'agit de introduire un modèle qui peut prendre en entrée une chaîne de bytes et puis prédire le langage utilisé.
-
-Celle-ci peut être utilisée dans de nombreux contextes. Voici quelques scénarios de son application : Dans les EDI, la détection automatique de langage leur permet de réaliser les fonctionnalités telles que la coloration syntaxique, l'auto-complétion du code, \emph{etc}. Dans les archives comme GitHub, ces techniques facilite l'automatisation de l'indexation des dépôts. Ces techniques sont aussi pratiquées dans les projets
-
-Dans le cadre du TRE, la recherche se déroule dans le contexte de l'archive \emph{Software Heritage} de l'Inria, dans lequel sont stockés plus de 4 milliards fichiers provenant de plus de 80 millions projets. La structure de données de \emph{Software Heritage}, conçue pour éviter la duplication, ne contient que des artéfacts, c'est-à-dire des fichiers sous forme de métadonnée. Chaque artéfact dispose d'un seul identifiant généré depuis soi-même, il est donc impossible de récupérer le nom d'un fichier. C'est la raison pour laquelle, au lieu d'observer l'extension de fichier, nous ne pouvons qu'analyser son contenu pour identifier le langage d'un artéfact.
-
-Ce travail est composé de deux étapes : 1) étudier les différentes méthodes existantes du domaine en rédigeant une review technique des articles scientifiques, 2) évaluer ces méthodes à grande échelle et adapter à un sous-ensemble de l'archive.
-
-Ce rapport est organisé de la manière suivante. La Section 2 traite des approches existantes théoriques et pratiques. Dans la Section 3, les méthodes testées dans les expérimentations sont respectivement présentées. Les principaux résultats issus des expérimentations sont présentés et analysés dans la Section 4. La Section 5 donne une vision d'extension possible quand nous devons adapter les méthodes aux besoins réels rencontrés dans l'archive \emph{Software Heritage}.
-
-\section{État de l'art}
-
-Vu que la détection de langage de programmation peut être considérée comme un sous-problème de la classification de textes, les approches relatives sont soit des applications des algorithmes bien conçus pour le traitement automatique du langage naturel, soit des outils regroupant de différentes heuristiques.
-
-Les heuristiques primitives sont largement pratiquées dans les logiciels libres tels que \textsf{Ohcount} et \textsf{SLOCCount}. Afin de distinguer un langage des autres, les développeurs cherchent des motifs spécifiques caractérisant le langage. Ces heuristiques sont souvent purement empiriques, par exemple, un fichier commence par \texttt{\#include} est vraisemblablement écrit en C. Malgré la compréhensibilité des sens de motifs, les approches qui dependent des expressions régulières restent sous-performantes et inefficaces.
-
-À part les heuristiques, les approches de l'apprentissage supervisé, notamment celles qui sont étudiées dans le domaine du traitement automatique du langage naturel, s'appliquent au problème et donnent des résultats.
-
-\section{Méthodes}
-
-\subsection{Jeu de données}
-
-Etant donné qu'aucun jeu de données de tel genre n'est publiquement disponible de la part des auteurs des articles précédents, nous avons besoin d'en construire un nouveau nous permettant de disposer d'un ensemble de fichiers avec les étiquettes d'identité indiquant le langage utilisé. Comme GitHub est l'un des services de gestion de développement de logiciels les plus populaires, nous pouvons récupérer un nombre considérable de fichiers pour chaque langage pris en compte.
-
-\subsubsection{Supposition sur la vérité au sol}
-
-\textsf{Linguist}, logiciel développé par l'équipe GitHub, est capable de reconnaître plus de 400 langages de programmation, de représentation de données et de balisage pour les fichiers contenus dans un dépôt Git.
-
-\subsubsection{Langages}
-
-\subsubsection{Récupération de données}
-
-Pour ce faire,
-
-\subsection{Méthode de référence : classification en comparant les fréquences de n-grams}
-
-\subsection{Classification avec le model de langage n-grams}
-
-\subsection{Modèle bayésien naïf multinominal}
-
-\subsection{Réseau neuronal convolutif au niveau de tokens}
-
-\subsection{Réseau neuronal convolutif au niveau de bytes}
-
-\section{Résultats expérimentaux}
-
-\subsection{Evaluation des modeles}
-
-\subsubsection{Impact de nombre de classes}
-
-\subsubsection{Confusion d'inter-classes}
-
-\subsection{Benchmark}
-
-\section{Défis à grande échelle}
-
-\subsection{Déployabilité des approches dans les systèmes distribués}
-
-\subsection{Découverte de nouvelles classes}
-
-\subsubsection{Regroupement hiérarchique agglomératif}
-
-\subsection{Apprentissage incrémental (partie optionelle)}
-
-\section{Conclusion}
-
-\bibliography{bib-rapport}
-\bibliographystyle{plain}
-
-%Rapport
-%
-%Il doit faire de 15 à 30 pages et, dans la mesure du possible, doit être en grande part lisible par des non-spécialistes. Son plan peut être par exemple :
-%présentation du domaine de recherche (le jury n'est pas constitué seulement de spécialistes du domaine, tenez-en compte !) ;
-%énoncé et motivation du sujet ;
-%résultats existants s'y rapportant (état de l'art, commentaire d'article, ...) ;
-%vos résultats personnels (clairement identifiés comme tels).
-%Le rapport devra être assorti d'un résumé d'une page compréhensible par quiconque.
-
-\tableofcontents
-\end{document}
\ No newline at end of file
diff --git a/docs/report/report-en.pdf b/docs/report/report-en.pdf
index 1f94fd7..153457d 100644
Binary files a/docs/report/report-en.pdf and b/docs/report/report-en.pdf differ
diff --git a/docs/report/report-en.tex b/docs/report/report-en.tex
index 25d9f46..2a6c3ef 100644
--- a/docs/report/report-en.tex
+++ b/docs/report/report-en.tex
@@ -1,716 +1,716 @@
\documentclass[a4paper,12pt]{article}
\usepackage[a4paper,left=3cm,right=3cm,top=3cm,bottom=3cm]{geometry}
\usepackage[english]{babel}
\usepackage[parfill]{parskip}
\usepackage{graphicx}
\usepackage{xeCJK}
\setCJKmainfont{Songti SC Light}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{xunicode}
\usepackage[utf8]{inputenc}
\usepackage[charter]{mathdesign}
\usepackage{url}
\usepackage{hyperref}
\usepackage{multirow}
\usepackage[toc,page]{appendix}
\usepackage{tabularx}
\usepackage{longtable}
\usepackage{listings}
\lstset{basicstyle=\footnotesize\ttfamily,breaklines=true, upquote=true}
\usepackage{textcomp}
\usepackage{graphicx}
\usepackage{subfig}
\usepackage[labelfont={small,sc}, font={small}]{caption}
+\hyphenation{ConvNet}
+
\DeclareTextCommand{\nobreakspace}{T1}{\leavevmode\nobreak\ }
\title{Large-scale Programming Language Detection}
-\author{Yuan YIN \\ \small{Supervisor: Stefano Zacchiroli}}
+\author{Yuan Yin \\~\\\normalsize{Supervisor: Stefano Zacchiroli}}
\begin{document}
\maketitle
\thispagestyle{empty}
\begin{abstract}
In large code bases, it is important to recognise the language of the files in the archive for further applications. However, the only available way to detect the language in which a plain text document is written is to infer it from the document content itself, without filename or extension. Programming Language Detection is the problem of building a function that decides which language is used in a document. In this report, we investigate current techniques and tools for the detection task, then test several relevant approaches.
- For the evaluations of supervised learning methods, we build a dataset containing more than 5 million files from GitHub, written in 374 languages. We then practice some of the text categorisation methods in Natural Language Processing (NLP) in our experimentations, \emph{i.e.} $n$-gram frequency profile distance, Multinomial Naïve Bayes (MNB), $n$-gram model, word-level and byte-level convolution neural networks (ConvNet). The most performing method, byte-level ConvNet method, achieves an accuracy of 88.64\% in our test set with full language list. We then apply the classifier to a subset of the Software Heritage archive and check manually if document's inherent language corresponds to its projection by ConvNet. It reaches only around 66\% in the real-world application, allegedly due to the inequality between classes of the model and the impurity of the training set.
+ For the evaluations of supervised learning methods, we build a dataset containing more than 5 million files from GitHub, written in 374 languages. We then apply some of the text categorisation methods from Natural Language Processing (NLP) in our experiments, \emph{i.e.} $n$-gram frequency profile distance, Multinomial Naïve Bayes (MNB), the $n$-gram model, and word-level and byte-level convolutional neural networks (ConvNet). The most effective method, the byte-level ConvNet, achieves an accuracy of 88.64\% on our test set with the full language list. We then apply the classifier to a subset of the Software Heritage archive and check manually whether each document's actual language corresponds to the ConvNet prediction. It reaches only around 66\% in this real-world application, allegedly due to its weakness in recognising popular languages.
Besides, we investigate several further subjects arising from more complex demands of Software Heritage, \emph{e.g.} discovering newly incoming languages by unsupervised learning, and integrating these new languages into the model (with or without regenerating the function from the training set). We know that with currently available techniques it is inevitable that the quality of the models will degrade after adding new classes. Nevertheless, these possible development paths of our methods remain interesting for further exploration.
\end{abstract}
\clearpage
\tableofcontents
\pagenumbering{Roman}
\clearpage
\pagenumbering{arabic}
\setcounter{page}{1}
\section{Introduction}
-Programming Language Detection is the problem of identifying in which programming language a piece of source code is written. \cite{Klein11, vanDam16} We here define the piece of source code as a textual sequential representation of an artefact, which is normally in the form of character sequence or, more generally, byte sequence. More precisely, the objective is to build a model that predicts the language of a given sequence.
+Programming Language Detection is the problem of identifying in which programming language a piece of source code is written \cite{Klein11, vanDam16}. We here define a piece of source code as a textual sequential representation of an artefact, normally in the form of a character sequence or, more generally, a byte sequence. More precisely, the objective is to build a model that predicts the language of a given sequence.
The formal definition of the problem is given as follows: On the input, a byte sequence $d$ and a set of $n$ languages $\{l_1, ..., l_n\}$ are given. On the output,
\[l_d = \underset{l_i\in \{l_1, ..., l_n\}}{\arg \max}\ m(d, l_i)\]
is calculated, where $l_d$ is the predicted language and $m$ is a model that calculates a value indicating the likelihood that the document $d$ is written in language $l_i$. The most likely language $l_d$ is chosen by the $\arg \max$ and attributed to the document as its recognised language.
In general, Programming Language Detection can be utilised in different situations; here are several example applications: (a) determining the language composition of software projects in version control systems, for instance, the GitHub team develops the project Linguist to report which languages a project is written in; (b) searching code in plain text, in order to track the popularity of a language; (c) helping IDEs choose the language for which support functionalities, like syntax highlighting, are applied.
-We dive into this problem in the context of Software Heritage\footnote{\url{https://www.softwareheritage.org/}}. Software Heritage, initiated by Inria, is an archive in which more than 4.5 billions unique source code files from over 83 millions projects are collected (June 2018). Its aim is to collect and preserve the precious but fragile source code containing human readable knowledge.~\cite{DiCosmo17} The ``large scale'' profile of the problem is not only about the size of the archive but also the number of languages that should be recognised.
+We dive into this problem in the context of Software Heritage\footnote{\url{https://www.softwareheritage.org/}}. Software Heritage, initiated by Inria, is an archive in which more than 4.5 billion unique source code files from over 83 million projects are collected (June 2018). Its aim is to collect and preserve the precious but fragile source code containing human readable knowledge~\cite{DiCosmo17}. The ``large scale'' profile of the problem is not only about the size of the archive but also the number of languages that should be recognised.
-The fact that the language of a file could not be found from its filename extension in Software Heritage adds more extra challenge on such task. In Software Heritage, every source code file is a blob which contains raw content of the file, that means a sequence of bytes without any extra information, such as filename (including filename extension), metadata, \emph{etc}. Since each blob could be represented by an intrinsic identifier generated from the blob itself, the duplication of files is avoided. For this reason, all existing tools depending on filenames fail in our context, and the methods for recognising the language from a sequence of bytes is strongly demanded.
+The fact that the language of a file cannot be deduced from its filename extension in Software Heritage adds an extra challenge to the task. In Software Heritage, every source code file is a blob containing the raw content of the file, that is, a sequence of bytes without any extra information, such as filename (including filename extension), metadata, \emph{etc}. Since each blob is identified by an intrinsic identifier generated from the blob itself, the duplication of files is avoided but the filenames are missing as a consequence. For this reason, all existing tools depending on filenames fail in our context, and methods for recognising the language from a sequence of bytes are strongly demanded.
-Given such specific constraints in Software Heritage, our objective is to investigate several state-of-the-art methods of Programming Language Detection, to experiment them on our dataset for quantitative evaluation and qualitative analysis and to extend the model to meet the future demand of Software Heritage.
+Given such specific constraints in Software Heritage, our objective is to investigate several state-of-the-art methods of Programming Language Detection, to experiment with them on our dataset for quantitative evaluation and qualitative analysis, and to extend the model to meet the future demands of Software Heritage.
-In this report, we introduce briefly the state-of-the-art methods in Section~\ref{sec:related-works}. In Section~\ref{sec:dataset}, the procedure of making a feasible dataset is explained. In Section~\ref{sec:methods-for-evaluation}, we explicate the methods took in account for the evaluation. Experimental results including the comparison between methods and the observation on certain questions are described in Section~\ref{sec:experimentations}. The best performing method is finally applied to a subset of Software Heritage, Section~\ref{sec:application-in-software-heritage} gives a preview of its performance in the real world. Section~\ref{sec:challenges} draws several possible tracks for the future amelioration of the tools.
+In this report, we briefly introduce the state-of-the-art methods in Section~\ref{sec:related-works}. In Section~\ref{sec:dataset}, the procedure of building a feasible dataset is explained. In Section~\ref{sec:methods-for-evaluation}, we explain the methods taken into account for the evaluation. Experimental results, including the comparison between methods and observations on certain questions, are described in Section~\ref{sec:experimentations}. The most effective method is finally applied to a subset of Software Heritage; Section~\ref{sec:application-in-software-heritage} gives a preview of its performance in the real world. Section~\ref{sec:challenges} draws several possible tracks for the future improvement of the tools.
We provide the implemented methods and more detailed results in a git repository on the Software Heritage Forge\footnote{\url{https://forge.softwareheritage.org/source/internship-lang-detection}}.
\section{Related Works}
\label{sec:related-works}
The existing approaches for Programming Language Detection could be divided into two categories: heuristic methods and machine learning methods.
Heuristic methods are mostly based on empirical rules or external information; their basic ideas are presented as follows:
\begin{itemize}
\item Judging from the filename extension. Ohcount~\cite{ohcount} and Linguist~\cite{linguist} perform the detection by mapping the filename extension to a corresponding language. The problem of this straightforward method is that some extensions are related to several languages, \emph{e.g.} \texttt{*.m} refers to a file written in Objective-C or MATLAB, and \texttt{*.pl} points to Perl or Prolog.
\item Grammar-based approaches. The principle is to try parsing the file against each language's grammar, which is complex to model and demands heavy computation time.
- \item Heuristics approaches. Most of them, such as SLOCCount~\cite{sloccount}, use predefined regular expressions to capture empirically discovered features, \emph{e.g.} a file start with ``\texttt{\#include}'' is probably written in C. Some other looks for hints in the file, such as shebang lines, Vim modelines, Emacs modelines, \emph{etc}.
+ \item Heuristics other than filename extension. Most of them, such as SLOCCount~\cite{sloccount}, use predefined regular expressions to capture empirically discovered features, \emph{e.g.} a file starting with ``\texttt{\#include}'' is probably written in C. Some others look for hints in the file, such as shebang lines, Vim modelines, Emacs modelines, \emph{etc}.
\end{itemize}
-In Machine learning, the problem is regarded as a variant problem of \emph{text categorisation} or \emph{natural language identification}, which means that given a piece of text, we find a function that predicts which category the text belongs to \cite{vanDam16, Gilda17}. The state-of-the-art methods build such function based on example input-output pairs, which are categorised as \emph{supervised learning}.
+In machine learning, the problem is regarded as a variant of \emph{text categorisation} or \emph{natural language identification}, which means that given a piece of text, we find a function that predicts which category the text belongs to \cite{vanDam16, Gilda17}. The state-of-the-art methods build such a function from example input-output pairs, an approach categorised as \emph{supervised learning}.
-Ugurel \emph{et al.}~\cite{Ugurel02} selects firstly the features by Expected Entropy Loss for each language, then vectorise the tested document into a vector representing the presence of a selected feature. The $n$-class classification is resolved by training $n \choose 2$ Support Vector Machine (SVM) binary classifier in the form of decision tree. Van~Dam and Zaytsev~\cite{vanDam16} test several popular and performant methods in Natural Language Processing. Multi-nominal Naïve Bayes (MNB), one of the variants of Naïve Bayes Classifiers, utilises unified frequency of a word or a sequence of words in a byte-sequence to decide the most possibly corresponding programming language. $N$-gram model and skip-gram model calculate for each gram the possibility of its appearance after $N$ grams. Normalised Compression Distance compares a piece of compressed code to the examples in the training set, then chooses the nearest language on as projection. MNB and $N$-gram model outperform others according to the experimental results. Gilda~\cite{Gilda17} and Aylien~\cite{Aylien16} adopt a general setup of Convolutional Neural Network (ConvNet) in NLP at word-level and proofs its performance.
+Ugurel \emph{et al.}~\cite{Ugurel02} first select features by Expected Entropy Loss for each language, then turn the tested document into a vector representing the presence of the selected features. The $n$-class classification is resolved by training $n \choose 2$ Support Vector Machine (SVM) binary classifiers arranged as a decision tree. Van~Dam and Zaytsev~\cite{vanDam16} test several popular and effective methods from Natural Language Processing. Multinomial Naïve Bayes (MNB), one of the variants of Naïve Bayes classifiers, uses the frequencies of words or sequences of words in a byte sequence to decide the most probable programming language. The $n$-gram model and the skip-gram model calculate for each gram the probability of its appearance after the $n-1$ preceding grams. Normalised Compression Distance compares a piece of compressed code to the examples in the training set, then chooses the nearest language as the prediction. MNB and the $n$-gram model outperform the others according to the experimental results. Gilda~\cite{Gilda17} and Aylien~\cite{Aylien16} adopt a general setup of Convolutional Neural Networks (ConvNet) in NLP at word level and demonstrate its performance.
-Our tested methods are partly issued from the preceding state of the art. They will be detailed with other applied algorithms in Section~\ref{sec:methods-for-evaluation}.
+Our tested methods are partially derived from the preceding state of the art. They will be detailed together with the other applied algorithms in Section~\ref{sec:methods-for-evaluation}.
\section{Dataset}
\label{sec:dataset}
We considered applying both supervised and unsupervised learning to the problem. However, the usage of unsupervised learning is quite limited in classification problems (we will discuss it later in Section~\ref{sec:challenges}). We therefore focus on supervised methods.
-Supervised learning methods require a dataset containing labeled inputs to train and evaluate the model. Unfortunately, the articles are rarely accompanied by a publicly available dataset. Therefore, we natively build a novel dataset for our experiments.
+Supervised learning methods require a dataset containing labeled inputs to train and evaluate the model. Unfortunately, the articles are rarely accompanied by a publicly available dataset. Therefore, we built our own novel dataset for our experiments.
-GitHub\footnote{\url{https://www.github.com/}} is one of the most popular web-based hosting service for Git version control system, reporting having 85 million repositories~\cite{GitHub}. We build the dataset using GitHub for its large amount of languages included and its popularity in free software community.
+GitHub\footnote{\url{https://www.github.com/}} is one of the most popular web-based hosting services for the Git version control system, reporting 85 million repositories~\cite{GitHub}. We built the dataset from GitHub for the large number of languages it covers and its popularity in the free software community.
\paragraph{Ground Truth Assumption}
-In the context of Software Heritage, our aim is to cover as many languages as possible for classification, thus the dataset we build possesses inevitably a large amount of files, which is unaffordable to be labeled manually. We thus seek help from automatic labelling tools.
+In the context of Software Heritage, our aim is to cover as many languages as possible for classification, thus the dataset we build inevitably contains a large number of files, which cannot affordably be labeled manually. We thus seek help from automatic labelling tools.
-Linguist~\cite{linguist} is the tool of language detection developed by the GitHub team for unveiling the language composition in git repository, service provided on GitHub through their API. There exists a command line version Linguist producing list of files by language for repository. Given that filename extensions are visible for Linguist and such features boost the accuracy of classification (we will show this claim in later experiment), we assume that the language recognised by Linguist is the ground truth language attributed to it. Since the original Linguist did not give detailed results of some data description languages, \emph{i.e.} XML, JSON, we slightly modified Linguist to make them visible on the file list.
+Linguist~\cite{linguist} is a language detection tool developed by the GitHub team for unveiling the language composition of a git repository, a service provided on GitHub through their API. A command line version of Linguist produces a list of files by language for a repository. Given that filename extensions are visible to Linguist and that such features boost the accuracy of classification (we will support this claim in a later experiment), we assume that the language recognised by Linguist is the ground truth language. Since the original Linguist did not give detailed results for some data description languages, \emph{e.g.} XML and JSON, it has been modified to make such files visible in the file list.
\paragraph{Source Code Retrieval and Languages Included}
-As stated above, the list of languages we consider should cover as many languages as possible. Therefore, we initially took the entire language list of Linguist into account for repository fetching. For each language, we fetched the first 75 repositories which top on the list ordered by number of stars, manifesting the popularity of the repository. To avoid huge repositories, we ignore all repositories whose size is superior to 150~MiB.
+As stated above, the list of languages we consider should cover as many languages as possible. Therefore, we took the entire language list of Linguist into account for repository fetching. For each language, we fetched the top 75 repositories in the list ordered by number of stars, a measure of the popularity of a repository. To avoid huge repositories, we ignored all repositories whose size exceeds 150~MiB.
-Some languages were eliminated, for the reason that we could not fetch enough files from GitHub. We successfully fetched {5,162,128} files for 374 valid languages showed in Table~\ref{tab:lan} in Appendices.
+Some languages were eliminated because we could not fetch any repository or enough files from GitHub. In total, {5,162,128} files covering 374 valid languages were fetched for the later experiments. The language list is shown in Table~\ref{tab:lan} in the Appendices.
\section{Methods for Evaluation}
\label{sec:methods-for-evaluation}
In this section, we describe several NLP methods tested on our dataset:
\begin{itemize}
\item $n$-gram-based frequency distance,
\item $n$-gram model,
\item Multinomial Naïve Bayes (MNB), and
\item Convolutional Neural Networks (ConvNet).
\end{itemize}
The first approach is regarded as the baseline method for the evaluation of the accuracy and the efficiency of the model.
-Given that every file in Software Heritage is only a sequence of bytes which we are not able to determine its character encoding, even unable to judge whether it is a binary file or not, we focus at byte level for these methods.
+Given the considerable number of languages, finding a general tokeniser is less feasible than relying solely on bytes without tokenisation. Therefore, we focus on the byte-level versions of the methods.
-\subsection{Baseline: $n$-gram-based frequency distance}
+\subsection{Baseline: $n$-gram-based Frequency Distance}
\paragraph{$n$-gram}
An $n$-gram is a slice of a larger sequence with $n$ units. In NLP, the sequence is naturally a string. Depending on different problems, a unit represents a character or a word.
For example, the string ``\texttt{print(n)}'' with 8 characters could generate the following character-based $n$-grams:
\begin{itemize}
\item unigrams: \texttt{p, r, ..., )}
\item bigrams: \texttt{\textvisiblespace p, pr, ri, ..., n), )\textvisiblespace}
\item trigrams: \texttt{\textvisiblespace\textvisiblespace p, \textvisiblespace pr, pri, rin, ..., n)\textvisiblespace, )\textvisiblespace\textvisiblespace}
\item ...
\end{itemize}
or word-based $n$-grams:
\begin{itemize}
\item unigrams: \texttt{<s>, print, (, n, ), </s>}
\item bigrams: \texttt{<s> print, print (, ( n, n ), ) </s>}
\item trigrams: \texttt{<s> print (, print ( n, ( n ), n ) </s>}
\item ...
\end{itemize}
Strings are often padded with a start marker \texttt{<s>} and an end marker \texttt{</s>}. In general, a $k$-unit sequence generates exactly $k-(n-1)$ $n$-grams.
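+
+A minimal sketch of this extraction, applicable to character strings as well as to byte sequences and omitting padding, could look as follows; it is purely illustrative and not taken verbatim from our implementation.
+\begin{lstlisting}
+def ngrams(seq, n):
+    """All n-grams of a sequence (no padding): a k-unit sequence yields k-(n-1) of them."""
+    return [seq[i:i + n] for i in range(len(seq) - n + 1)]
+
+print(ngrams('print(n)', 3))   # character trigrams of the example string above
+print(ngrams(b'print(n)', 2))  # the same function works on byte sequences
+\end{lstlisting}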
Cavnar and Trenkle~\cite{Cavnar94} introduce an early NLP method for text categorisation using the distance between two $n$-gram frequency profiles.
Zipf's law is an empirical observation, expressing that the $n$-th most common word in a human language occurs with a frequency inversely proportional to $n$. According to Zipf's law, by retaining the most common words, it is possible to obtain a list describing the characteristics of the language.
-Given a training set, at the training phase, a bag of $n$-grams is generated for each document in the training set. By gathering all bags of a language and counting the occurrences of each $n$-gram, a list of $n$-grams ordered by number of occurrences is created as the \emph{category profile} of the class. Only the most frequent 300 $n$-grams are kept, since they are highly correlated to the language.
+Given a training set, at the training phase, a bag of $n$-grams is generated for each document in the training set. By gathering all bags of a language and counting the occurrences of each $n$-gram, a list of $n$-grams ordered by the number of occurrences is created as the \emph{category profile} of the class. Only the most frequent 300 $n$-grams are kept, since they are highly correlated to the language.
The \emph{distance} between category profile and document profile is defined as follows:
Given trained category profiles $p_{l_1}, ..., p_{l_k}$ for $k$ languages, and document profile $p_{d}$ of test document $d$,
\[
distance(p_{l_i}, p_{d}) = \sum_{w\in p_{d}} | rankdist(w, p_d, p_{l_i})|
\]
\[
rankdist(w, p_d, p_{l_i})=
\begin{cases}
|rank(w, p_d) - rank(w, p_{l_i})| & \text{if }rank(w, p_{l_i}) \text{ exists,} \\
|p_d| & \text{else}
\end{cases}
\]
where $p$ contains an ordered list of words and $rank(w, p)$ returns the rank of $w$ in the list $p$. $rankdist(w, p_d, p_{l_i})$ returns the out-of-place distance between the two profiles if $w$ appears in $p_{l_i}$. If $w$ is an out-of-vocabulary word, the distance is the length of the document profile $p_d$.
The document is then categorised as the language with the minimum distance.
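+
+Putting the pieces together, a minimal sketch of the training and classification steps could look as follows; it reuses the \texttt{ngrams} helper sketched earlier, and the names are illustrative rather than those of our implementation.
+\begin{lstlisting}
+from collections import Counter
+
+def category_profile(documents, n=3, size=300):
+    """Most frequent n-grams of one language, ordered by occurrence count."""
+    counts = Counter()
+    for doc in documents:
+        counts.update(ngrams(doc, n))   # ngrams() as sketched above
+    return [g for g, _ in counts.most_common(size)]
+
+def out_of_place_distance(doc_profile, cat_profile):
+    """Sum of rank differences; out-of-vocabulary n-grams cost len(doc_profile)."""
+    rank = {g: i for i, g in enumerate(cat_profile)}
+    return sum(abs(i - rank[g]) if g in rank else len(doc_profile)
+               for i, g in enumerate(doc_profile))
+
+def classify(doc_profile, profiles):
+    """profiles maps each language to its category profile."""
+    return min(profiles, key=lambda l: out_of_place_distance(doc_profile, profiles[l]))
+\end{lstlisting}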
-\subsection{Multinominal Naïve Bayes}
+\subsection{Multinomial Naïve Bayes (MNB)}
This approach was introduced by van Dam and Zaytsev~\cite{vanDam16} into Programming Language Detection.
In the Naïve Bayes model, we assume that the words of the document are independent of each other. According to Bayes' Theorem,
\begin{eqnarray*}
P(l|w_1w_2...w_n) & = & \frac{P(w_1w_2...w_n|l)P(l)}{P(w_1w_2...w_n)} \\
& = & c\cdot P(l) P(w_1w_2...w_n|l)\\
& = & c\cdot P(l) \prod_{i = 1}^n P(w_i|l)
\end{eqnarray*}
The probability of $w_i$ in language $l$ is estimated from its occurrences in the language under the bag-of-words assumption:
\[
P(w_i|l) = \frac{C_l(w_i) + 1}{\sum_{w\in V}C_l(w) + |V|}
\]
where $C_l$ gives the frequency of a word in language $l$ and $V$ is the vocabulary over all languages.
The independence assumption on words in the original model is quite limiting for classification; in practice we use unigrams up to 5-grams instead of single words in order to take the context of words into account.
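+
+As a rough illustration, such a byte-level MNB classifier can be assembled with Scikit-Learn, which we use for our implementation (see Section~\ref{sec:experimentations}); the preprocessing shown here (latin-1 decoding so that each byte maps to one character) and the toy data are assumptions for the sake of the example, not our exact configuration.
+\begin{lstlisting}
+from sklearn.feature_extraction.text import CountVectorizer
+from sklearn.naive_bayes import MultinomialNB
+from sklearn.pipeline import make_pipeline
+
+def to_chars(blobs):
+    # Decode bytes with latin-1 so that every byte maps to exactly one character.
+    return [b.decode('latin-1') for b in blobs]
+
+train_blobs = [b'#include <stdio.h>\nint main() { return 0; }',
+               b'def main():\n    return 0']
+train_labels = ['C', 'Python']
+
+model = make_pipeline(
+    CountVectorizer(analyzer='char', ngram_range=(1, 5)),  # unigrams up to 5-grams
+    MultinomialNB(alpha=1.0))                              # add-one smoothing as above
+model.fit(to_chars(train_blobs), train_labels)
+print(model.predict(to_chars([b'int f(int x) { return x; }'])))
+\end{lstlisting}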
\subsection{$n$-gram Model}
The approach was also tested by van Dam and Zaytsev~\cite{vanDam16}. Like the preceding method, the $n$-gram model also uses statistical properties of $n$-grams, but in another way.
Originally, the $n$-gram model aims at predicting the probability of a unit given the $n-1$ units that occurred before it. Given a unit $w_i$, the probability of its occurrence in a sequence is defined as:
\[
P(w_i | w_1...w_{i-1})
\]
According to Markov assumption, we omit older context in the sequence,
\[
P(w_i | w_1...w_{i-1}) \approx P(w_i | w_{i-(n-1)}...w_{i-1})
\]
In reality, the probability could be estimated by maximum likelihood estimation (MLE):
\[
P(w_i | w_{i-(n-1)}...w_{i-1}) = \frac{C(w_{i-(n-1)}...w_{i-1}w_{i})}{C(w_{i-(n-1)}...w_{i-1})}
\]
where $C$ gives the count of given $n$-gram.
By chain rule of probability and precedent estimation,
\[
P(w_1w_2...w_n)\approx \prod_{i = 1}^n P(w_i|w_{i-(n-1)}...w_{i-1})
\]
Now we transform such a model into a classifier. Given a sequence $w_1w_2...w_n$, we assume that each language $l$ appears with the same probability and that the probability of a given sequence is fixed.
According to Bayes' Theorem,
\begin{eqnarray*}
P(l|w_1w_2...w_n) & = & \frac{P(w_1w_2...w_n|l)P(l)}{P(w_1w_2...w_n)} \\
& = & c\cdot P(w_1w_2...w_n|l)\\
& = & c\cdot \prod_{i = 1}^n P(w_i|w_{i-(n-1)}...w_{i-1}, l)
\end{eqnarray*}
Rather than counting $n$-grams in the document, the probability of an $n$-gram is estimated from the $n$-gram frequencies of the language, obtained from the training set.
\[
P(w_i | w_{i-(n-1)}...w_{i-1}, l) = \frac{C_l(w_{i-(n-1)}...w_{i-1}w_{i})}{C_l(w_{i-(n-1)}...w_{i-1})}
\]
where $C_l$ gives the $n$-gram counts for language $l$ in the training set.
While estimating the probability of $n$-grams, smoothing techniques are required because of the possible occurrence of \emph{out-of-vocabulary (OOV)} $n$-grams. In our case, Modified Kneser-Ney smoothing is applied since it is one of the methods that gives better experimental results in~\cite{vanDam16}.
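+
+To illustrate how the resulting classifier is assembled, the sketch below assumes that one $n$-gram language model per language has been trained beforehand with KenLM, which we rely on for our implementation (see Section~\ref{sec:experimentations}); the model paths and the two-language dictionary are hypothetical examples.
+\begin{lstlisting}
+import kenlm
+
+# One pre-trained model per language (paths are hypothetical).
+models = {
+    'C': kenlm.Model('models/c.arpa'),
+    'Python': kenlm.Model('models/python.arpa'),
+}
+
+def classify(tokens):
+    """Attribute the language whose model gives the highest log-probability."""
+    text = ' '.join(tokens)
+    return max(models, key=lambda lang: models[lang].score(text, bos=True, eos=True))
+\end{lstlisting}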
\subsection{Convolutional Neural Network (ConvNet)}
-Convolutional Neural Network is one of the most popular machine learning branches usually used for image classification. It is a class of deep feed-forward artificial neural networks. Recent years, many researches showed the capability of ConvNet for NLP problems.
-
-The following two architectures are tested in Section~\ref{sec:experimentations}.
+Convolutional Neural Networks (ConvNet) are among the most popular machine learning approaches, usually used for image classification. They form a class of deep feed-forward artificial neural networks. In recent years, much research has shown the capability of ConvNets for NLP problems. We consider the following architectures, for word level and byte level respectively. They are tested in Section~\ref{sec:experimentations}.
-\subsubsection{Word-level Approach}
+\subsubsection{Word-level Architecture}
\label{sec:word-conv}
Although Gilda~\cite{Gilda17} showed the performance of his own architecture, we were not able to rebuild the same network due to the lack of network architecture details and hyper-parameter configuration. We therefore focused on other architectures.
Kim~\cite{Kim14} introduced a ConvNet for natural language sentence classification. Figure~\ref{fig:word-convnet} from \cite{Kim14} illustrates the architecture of the network.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{cnn.png}
\caption{\label{fig:word-convnet}Model architecture with two channels for an example sentence~\cite{Kim14}.}
\end{figure}
\paragraph{Word Embedding}
In this architecture, the word is the unit of input. The $i$-th word $w_i$ is transformed into a vector $\mathbf{x}_i \in \mathbb{R}^k$ by the word embedding layer. The word vectors are then concatenated to form the representation of the document, an $n\times k$ matrix.
The number of words $n$ of the document is fixed by the model. Therefore, a document longer than $n$ words needs to be pruned, and a shorter one needs padding, by concatenating zero-vectors at the beginning or the end of the matrix.
\paragraph{Feature Extraction}
In the convolutional layers, by using a \emph{filter} $\mathbf{w_h} \in \mathbb{R}^{hk}$, a \emph{feature} $c_i$ is generated,
\[c_i = f(\mathbf{w_h}\cdot(\mathbf{x}_i\ ||\ \mathbf{x}_{i+1}\ ||\ ...\ ||\ \mathbf{x}_{i+h-1}) + b)\]
where $||$ is vector concatenation operator, $b\in \mathbb{R}$ is a bias term, $f$ is an \emph{activation function} outputting a feature from a set of inputs.
This procedure follows a principle similar to the $n$-gram model. However, rather than extracting features from the original words, the ConvNet works on their vector representations.
Each filter produces a \emph{feature map}, \emph{i.e.} a vector $\mathbf{c}^h\in \mathbb{R}^{n - h+1}$. A max-over-time-pooling is then applied on the feature map $\mathbf{c}^h$, aiming at choosing the most important features with the highest values and avoiding overfitting at training stage. We then obtain the final feature map of this $h\times k$ filter.
Several filters are usually applied to obtain the corresponding feature map, representing a \emph{channel}. They are then concatenated vertically into a final feature map $\mathbf{c}$.
\paragraph{Classification}
A \emph{fully connected layer} is a traditional multi-layer perceptron layer whose neurons are connected to every neuron of the preceding and the following layers. The feature map $\mathbf{c}$ obtained above is fed into such fully connected layers to extract higher-level features for the final classification.
The softmax activation function, outputting a vector whose values sum to 1, is used in the output layer, in order to output the probabilities of membership in the language classes. The classification is done by attributing the most probable language, indicated by the highest output probability.
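+
+To make the architecture concrete, here is a minimal Keras sketch of such a multi-filter ConvNet classifier (we use Keras with the TensorFlow backend, see Section~\ref{sec:experimentations}); all hyper-parameters below (vocabulary size, sequence length, embedding dimension, filter widths and counts) are illustrative assumptions rather than our exact configuration. The byte-level variant described next is obtained essentially by taking the 256 byte values as the vocabulary.
+\begin{lstlisting}
+from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
+                          Concatenate, Dropout, Dense)
+from keras.models import Model
+
+vocab_size, seq_len, embed_dim, n_classes = 20000, 400, 128, 374
+
+inputs = Input(shape=(seq_len,), dtype='int32')       # padded/pruned word indices
+x = Embedding(vocab_size, embed_dim)(inputs)          # n x k document representation
+# One group of filters per width h, each followed by max-over-time pooling.
+pooled = [GlobalMaxPooling1D()(Conv1D(100, h, activation='relu')(x))
+          for h in (3, 4, 5)]
+x = Concatenate()(pooled)
+x = Dropout(0.5)(x)
+x = Dense(128, activation='relu')(x)                  # fully connected layer
+outputs = Dense(n_classes, activation='softmax')(x)   # one probability per language
+
+model = Model(inputs, outputs)
+model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
+              metrics=['accuracy'])
+\end{lstlisting}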
-\subsubsection{Byte-level Approach}
+\subsubsection{Byte-level Architecture}
Kim~\cite{Kim15} introduced a character-level ConvNet for language modelling. The original architecture was adapted by Chaitanya Joshi\footnote{\url{https://github.com/chaitjo/character-level-cnn}} for classification by replacing the recurrent layers with the same fully connected layers as in the word-level approach of Section~\ref{sec:word-conv}.
-Instead of using word or token as feature, character-level approach could make use of character (or byte) without building a large vocabulary. Although the size of vocabulary is commonly considerably small, \emph{e.g.} 256 when we use bytes as vocabulary.
+Instead of using words or tokens as features, the character-level approach can make use of characters (or bytes) without building a large vocabulary. The vocabulary size is commonly considerably small, \emph{e.g.} 256 when we use bytes as the vocabulary.
Feature extraction and classification are similar to the word-level approach.
\section{Experimentations}
\label{sec:experimentations}
In this section, we present several questions that we would like to answer by experiments on our customised dataset.
\subsection{Implementation and System Setup}
-We implement the methods described in Section~\ref{sec:methods-for-evaluation} in Python 3, in order to finally integrate one of them in Software Heritage.
+We implemented the methods described in Section~\ref{sec:methods-for-evaluation} in Python 3, in order to finally integrate one of them in Software Heritage.
-Baseline method is implemented natively in Python. We implement MNB using Scikit-Learn. $n$-gram model is implemented with KenLM~\cite{kenlm}. The last two ConvNets are implemented with Keras~\cite{keras} using TensorFlow~\cite{tensorflow2015-whitepaper} as backend.
+The baseline method was implemented natively in Python. We implemented MNB using Scikit-Learn. The $n$-gram model was implemented with KenLM~\cite{kenlm}. The last two ConvNets were implemented with Keras~\cite{keras} using TensorFlow~\cite{tensorflow2015-whitepaper} as the backend.
-We principally execute the training and test phase on a portable computer with 2.7 GHz Intel Core i5 processor running macOS 10.13. The training phase of two ConvNet methods is executed in an virtual machine running Ubuntu 16.04 with one Intel Sandy Bridge virtual CPU, equipped with one NVIDIA Tesla K80 GPU on Google Cloud Platform. The VM is configured for making use of TensorFlow backend with GPU acceleration using CUDA Deep Neural Network Library (cuDNN).
+We principally executed the training and test phases on a portable computer with a 2.7~GHz Intel Core i5 processor running macOS 10.13. The training phase of the two ConvNet methods was executed in a virtual machine running Ubuntu 16.04 with one Intel Sandy Bridge virtual CPU, equipped with one NVIDIA Tesla K80 GPU, on Google Cloud Platform. The VM was configured to make use of the TensorFlow backend with GPU acceleration using the CUDA Deep Neural Network library (cuDNN).
\subsection{Training Set and Test Set}
-Firstly, files of the training set are randomly picked from the dataset. To avoid the imbalance of the training set that impacts the performance of several methods in Section~\ref{sec:methods-for-evaluation}, we restrain the maximum number of training files to 500 for each language. The test set is then built from remaining samples, it includes up to {1,000} files for testing.
+Firstly, the files of the training set were randomly picked from the dataset. To avoid an imbalance of the training set that would impact the performance of several methods in Section~\ref{sec:methods-for-evaluation}, we restrained the maximum number of training files to 500 for each language. The test set was then built randomly from the remaining samples; it includes up to {1,000} files per language for testing.
We built three series of training set and test set of different sizes:
\begin{itemize}
\item \texttt{mini}: 20 languages in \cite{vanDam16}, 10,000 training files, 20,000 test files.
\item \texttt{less}: 127 languages with more than 5,000 files collected in the dataset, 63,500 training files, 127,000 test files.
\item \texttt{total}: 374 languages in Table~\ref{tab:lan} in Appendices, 157,897 training files, 286,224 test files.
\end{itemize}
\subsection{Tokenisation}
In our case, tokenisation is not needed for the byte-level applications of the methods. The point of introducing a simple general tokeniser is to break a document into words so that word-based methods can be used.
It is difficult to summarise the relationship between the alphabet of a programming language and its byte representation. We empirically suppose that most programming languages share some basic characters, \emph{e.g.} the latin alphabet, parentheses, spaces, \emph{etc.}, and that most encoding standards cover these characters in common.
A binary document is broken down on a set of characters (operators, punctuation, spaces, \emph{etc.}) and numbers (integers, floats, \emph{etc.}). All separators are kept after splitting. For example, for the string ``\verb|print ("Hello world! 你好,世界!")|'' with UTF-8 encoding, its byte representation is
\begin{lstlisting}
'print ("Hello world! \xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81")'
\end{lstlisting}
It is then tokenised to a sequence of 12 words:
\begin{lstlisting}
'print', ' ', '(', '"', 'Hello', ' ', 'world', '!', ' ', '\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81', '"', ')'
\end{lstlisting}
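+
+The following regular-expression sketch approximates such a tokeniser; the exact separator classes are an illustrative assumption rather than our implementation verbatim (underscores, for instance, are kept inside identifiers here).
+\begin{lstlisting}
+import re
+
+# ASCII whitespace, punctuation/operators (underscore excluded) and numbers as separators.
+SEPARATOR = re.compile(rb'(\d+(?:\.\d+)?|[\s!-/:-@\[-^`{-~])')
+
+def tokenise(data: bytes):
+    return [tok for tok in SEPARATOR.split(data) if tok]
+
+print(tokenise(b'print ("Hello world!")'))
+# [b'print', b' ', b'(', b'"', b'Hello', b' ', b'world', b'!', b'"', b')']
+\end{lstlisting}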
\subsection{Model Quality Metrics}
For a language $l$, the test results of documents can be grouped into four categories; we write $\hat{y_i}$ for the ground truth class label and $y_i$ for the predicted label:
\begin{itemize}
\item True Positive (TP): when $\hat{y_i} = l$ and $y_i = l$, \emph{i.e.} document written in $l$ is recognised as the same language.
\item False Positive (FP): when $\hat{y_i} \neq l$ and $y_i = l$, \emph{i.e.} a document not written in language $l$ is incorrectly recognised as $l$.
\item True Negative (TN): when $\hat{y_i} \neq l$ and $y_i \neq l$, \emph{i.e.} document not written in $l$ is rejected by $l$.
\item False Negative (FN): when $\hat{y_i} = l$ and $y_i \neq l$, \emph{i.e.} document written in $l$ is incorrectly rejected by $l$.
\end{itemize}
-In the context of classification, the quality of methods is measured by Precision, Recall and $F_1$ score.
+In the context of classification, the quality of methods is measured by precision, recall and $F_1$ score.
Recall is also called the True Positive Rate (TPR). It is the fraction of correctly classified samples over all samples that should be predicted as $l$:
\[\text{recall} = \frac{\text{\#TP}}{\text{\#TP}+\text{\#FN}}\]
Precision is also called the Positive Predictive Value (PPV). It is the fraction of correctly classified samples over all samples predicted as $l$:
\[\text{precision} = \frac{\text{\#TP}}{\text{\#TP}+\text{\#FP}}\]
-The harmonic mean of precision and recall is called $F_1$ score, introduced for balancing two metrics:
+The harmonic mean of precision and recall is called the $F_1$ score; it is introduced to balance the two metrics:
\[
F_1 = \left(\frac{\text{precision}^{-1} + \text{recall}^{-1}}{2}\right)^{-1} = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\]
In the following subsections, we use $F_1$ to measure the model quality on each class.
Global model quality is evaluated by accuracy score:
\[
\text{accuracy}(y,\hat{y}) = \frac{1}{n}\sum_{i=0}^{n-1}1(y_i = \hat{y}_i)
\]
where $y$ denotes the predicted labels, $\hat{y}$ the ground truth labels, $n$ the number of samples, and $1(\cdot)$ the indicator function. The score is the ratio of the number of samples whose predicted label matches the ground truth to the total number of samples.
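+
+For reference, these scores can be computed directly with Scikit-Learn, which we already use elsewhere; the labels below are a toy example.
+\begin{lstlisting}
+from sklearn.metrics import accuracy_score, f1_score
+
+y_true = ['C', 'C++', 'Python', 'C']   # ground truth labels (toy example)
+y_pred = ['C++', 'C', 'Python', 'C']   # predicted labels
+
+print(accuracy_score(y_true, y_pred))          # overall accuracy
+print(f1_score(y_true, y_pred, average=None))  # per-class F1 scores
+\end{lstlisting}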
\subsection{Experimental Results}
-This subsection contains the overall results obtained during the experimentation. For more detailed results (confusion matrices, comparison of $F_1$ for each language between all tested methods), please check out the repository of our work.
+This subsection contains the overall results obtained during the experimentation. For more detailed results, including confusion matrices and comparison of $F_1$ for each language between all tested methods, please check out the repository of our work.
\subsubsection{Quality of Models}
-The evaluation of the quality of models utilises the entire list of 374 languages.
+The evaluation of the quality of models utilised the entire list of 374 languages.
\paragraph{Overall Quality}
Table~\ref{tab:total-comp} shows that the baseline method reaches only 46.68\% overall accuracy. The byte-level ConvNet marks the best accuracy at 88.64\%, much higher than the word-level ConvNet. Both MNB and the $n$-gram model reach acceptable results, respectively 85.70\% and 84.42\%.
\begin{table}[t]
\centering
\begin{tabular}{|c|c|}
\hline
& Accuracy / \% \\ \hline
Baseline & 46.48 \\
MNB & 85.70 \\
$n$-gram model & 84.42 \\
Word-level ConvNet & 77.19 \\
Byte-level ConvNet & 88.64 \\ \hline
\end{tabular}
\caption{\label{tab:total-comp} Comparison of accuracy between evaluation methods.}
\end{table}
-\paragraph{Inequality Between Classes} Although the overall score of Byte-level ConvNet reaches 88.64\%, $F_1$ score of several classes is much lower than the average. For instance, $F_1$ of Eagle reaches 99.6\%, meanwhile C achieves only 58.8\%. Figure~\ref{fig:ineq} illustrates huge gap between best and worst results.
+\paragraph{Inequality Between Classes} Although the overall score of Byte-level ConvNet reaches 88.64\%, $F_1$ scores of several classes are much lower than the average. For instance, $F_1$ of Eagle reaches 99.6\%, meanwhile C achieves only 58.8\%. Figure~\ref{fig:ineq} illustrates the huge gap between the best and the worst results.
\begin{figure}[t!]
\centering
\subfloat[][25 languages with the highest $F_1$]{
\includegraphics[height=0.4\textwidth]{./comparison_cnn_f1_above}
}
\subfloat[][25 languages with the lowest $F_1$]{
\includegraphics[height=0.4\textwidth]{./comparison_cnn_f1_below}
}
- \caption{\label{fig:ineq} Inequality between the most performing classes and least performing classes.}
+ \caption{\label{fig:ineq} Inequality between the best performing classes and the worst performing classes.}
\end{figure}
\paragraph{Interclass Confusion}
-Some languages are especially difficult to be distinguished from each other for these methods. We visualise the confusion matrices of methods in our repository in order to give several intuitive observations.
+We visualised the confusion matrices of methods in our repository in order to give several intuitive observations. Some languages are especially difficult to be distinguished from each other for these methods.
-There are significant confusions between similar languages, \emph{i.e.} C and C++; Objective-C, Objective-C++ and Objective-J; Befunge and HyPhy; Java and Processing; NetLinx, PAWN and Ruby; Javascript and Cycript, \emph{etc}.
+We noticed significant confusions between similar languages, which are often in the same family, \emph{e.g.} C and C++; Objective-C, Objective-C++ and Objective-J. Other confusions happen when a language is derived from or inspired by another language, \emph{e.g.} Java and Processing; Javascript and Cycript, \emph{etc}.
\subsubsection{Benchmark and Model Sizes}
-Table~\ref{tab:ben-train} shows that the first three basic NLP methods could be rapidly trained on CPU even when a large number of classes are considered. ConvNet methods demand more computing power in training stage. In classification stage, on the contrary, ConvNets classify a document over 10 times faster than other $n$-gram based approaches, especially when there is a significant amount of languages.
+Table~\ref{tab:ben-train} shows that the first three basic NLP methods can be rapidly trained on CPU even when a large number of classes is considered. ConvNet methods demand more computing power in the training stage. In the classification stage, on the contrary, ConvNets classify a document over 10 times faster (even hundreds of times faster) than the other $n$-gram based approaches, especially when there is a significant number of languages.
\begin{table}[t]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
& \multirow{2}{*}{Training Time} & Test Time & \multirow{2}{*}{Model Size}\\
& & (per file)&\\
\hline
Baseline & 2.1 h & 0.14 s & 4.4 MiB \\
MNB & 0.8 h & 3.0 s & 375 MiB \\
$n$-gram model & 0.9 h & 2.1 s & 7.9 GiB \\
\multirow{2}{*}{Word-level ConvNet} & 44 h & \multirow{2}{*}{0.01 s} & \multirow{2}{*}{349 MiB}\\
& (20 h*) & & \\
\multirow{2}{*}{Byte-level ConvNet} & 23 h & \multirow{2}{*}{0.01 s} & \multirow{2}{*}{34 MiB} \\
& (2 h*) & & \\
\hline
\end{tabular}
\footnotesize{*: Training time on distant VM using GPU.}
\caption{\label{tab:ben-train} Comparison of training time, test time (benchmarked on the same computer) and model size.}
\end{table}
\subsubsection{Filename Extension Is Important}
-We know empirically that filename extension is a critical feature of Programming Language Detection. However, we would like to evaluate its importance by tests. Knowing that ConvNet is good at highlighting features that distinguish mostly the inputs, we test the performance using Byte-level ConvNet by adding the extension of the file to the input.
+We know empirically that the filename extension is a critical feature for Programming Language Detection. However, we would like to evaluate its importance by tests. Knowing that a ConvNet is good at highlighting the features that best distinguish the inputs, we tested the performance of the Byte-level ConvNet when the extension of the file is added to the input.
-For convenience, the ConvNets are tested only in \texttt{mini} dataset, but the results are already meaningful. Table~\ref{tab:ext} shows that by adding the extension into the code the detection accuracy could be visibly improved. Inversely, in the context of Software Heritage, the missing of extensions impacts concretely the performance of Programming Language Detection.
+For convenience, the ConvNets were tested only on the \texttt{mini} dataset, but the results are already meaningful. Table~\ref{tab:ext} shows that, by adding the extension to the code, the detection accuracy can be visibly improved. Conversely, in the context of Software Heritage, the absence of extensions concretely impacts the performance of detection.
\begin{table}[t]
\centering
\begin{tabular}{|c|c|}
\hline
& Accuracy / \%\\
\hline
Without Extension & 94.53 \\
With Extension & \textbf{97.61} \\
\hline
\end{tabular}
\caption{\label{tab:ext} Comparison of accuracy with and without extension for Byte-level ConvNet classification on 20 classes.}
\end{table}
\subsubsection{Word or Byte}
-Our choice of applying tested methods at byte-level is competitive with the word-level applications. Each family of methods at these two levels are experimented on \texttt{mini} training and test sets. Table~\ref{tab:w-b} indicates that byte-level methods perform better for MNB and ConvNet, baseline model and $n$-gram model drops slightly after switched to byte-level.
+Our choice of applying the tested methods at byte level is competitive with the word-level applications. Each family of methods was experimented at these two levels on the \texttt{mini} training and test sets. Table~\ref{tab:w-b} indicates that the byte-level methods perform better for MNB and ConvNet, while the baseline model and the $n$-gram model drop slightly after switching to byte level.
\begin{table}[t]
\centering
\begin{tabular}{|c|c|c|}
\hline
&\multicolumn{2}{c|}{Accuracy / \%} \\
\cline{2-3}
& Word & Byte \\
\hline
Baseline & 68.47 & 62.14 \\
MNB & 89.19 & 92.04\\
$n$-gram model & 92.22 & 88.42 \\
ConvNet & 90.42 & 94.53 \\
\hline
\end{tabular}
\caption{\label{tab:w-b} Comparison of accuracy between word-level and byte-level applications of the tested methods on the \texttt{mini} test set.}
\end{table}
\section{Application in Software Heritage}
\label{sec:application-in-software-heritage}
-We have applied Byte-level ConvNet, the most effective method, to a subset of the Software Heritage archive, containing more than 16 millions files (around 0.1\% of the archive). However, we were not able to check the results one by one because we were not equipped with other reliable automatic checking tool providing such large amount of languages and we should evaluate them manually by human efforts. For this reason, only several folders are picked for evaluation by hand.
+We have applied the Byte-level ConvNet, the most effective method, to a subset of the Software Heritage archive containing more than 16 million files (around 0.1\% of the archive). However, we were not able to check the results one by one: no other reliable automatic checking tool covers such a large number of languages, so the results had to be evaluated manually. For this reason, only a few folders were picked for evaluation by hand.
\paragraph{Manual Verification Results}
Since we have nothing but the content of each file to judge its language in this dataset, the following results are based on the author's knowledge, assisted by search engines and other tools; they should be taken as indicative.
Five folders containing 1,259 files were picked from the subset and evaluated through a graphical interface. Unfortunately, the overall accuracy reached only 66.05\%, fairly far from our experimental accuracy (88.64\%).
After analysing the tested samples, the errors fall mostly into the following cases:
\begin{itemize}
\item Short files. Files containing only a short code snippet are hard to classify, even for a human.
\item Non-text files. Documentation often consists of PDF documents or PNG/JPEG images, which are systematically misclassified.
\item The ConvNet does not work well for several popular languages. As shown in the previous section, frequently occurring languages such as C, C++ and HTML are more often wrongly classified.
\end{itemize}
\paragraph{Recourse}
-Libmagic is an efficient library differentiating plain text files and other formatted files. It is also reliable while recognising the popular languages. We decide to abandon some particular classes which is often misclassified by ConvNet and cover these result by Libmagic's.
+Libmagic is an efficient library for differentiating plain-text files from other formatted files. It is also reliable at recognising popular languages. We decided to abandon certain classes that are often misclassified by the ConvNet and to replace those predictions with Libmagic's.
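A minimal sketch of such a fallback is given below, using the \texttt{python-magic} bindings to Libmagic; the set of classes declared unreliable is only an illustrative assumption, not the exact list used in our deployment.
\begin{verbatim}
import magic  # python-magic bindings to libmagic

# Illustrative set of classes we chose not to trust from the ConvNet.
UNRELIABLE = {"C", "C++", "HTML"}

def final_label(content: bytes, convnet_label: str) -> str:
    """Override unreliable ConvNet predictions with libmagic's description."""
    description = magic.from_buffer(content)   # e.g. "C source, ASCII text"
    if convnet_label in UNRELIABLE or "text" not in description.lower():
        return description
    return convnet_label
\end{verbatim}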
\section{Challenges of Large-scale Deployment}
\label{sec:challenges}
\subsection{Balancing Dataset}
Imbalance between classes in the dataset can affect the performance of the different models in several ways. For the approaches essentially based on statistics, \emph{i.e.} $n$-gram frequency and the $n$-gram model, a small training set means that we may not extract enough features. For ConvNet approaches, in addition to the former reason, ConvNets tend to ignore smaller classes in order to avoid errors.
In spite of our efforts to balance the number of repositories for each class, a significant imbalance is eventually observed between language classes. Figure~\ref{fig:distribution} (in Appendices) shows that the first half of the dataset consists of 13 languages, while 361 other languages share the other half. Nearly half of the languages have fewer than 5,000 files, and two-thirds of these have fewer than 1,000 files.
In future work, we plan to collect a more balanced dataset for each language and to enrich the weaker classes during the real deployment on the archive, for instance as sketched below.
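As a crude software-side complement to collecting more data, under-represented classes could be oversampled before training, as in the following sketch; the threshold and the function name are arbitrary choices for illustration.
\begin{verbatim}
import random

def oversample(files_by_language: dict, min_count: int = 5000) -> dict:
    """Duplicate samples of under-represented languages up to min_count."""
    balanced = {}
    for language, files in files_by_language.items():
        balanced[language] = list(files)
        if len(files) < min_count:
            balanced[language] += random.choices(files, k=min_count - len(files))
    return balanced
\end{verbatim}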
\subsection{Enhancing Current Models}
Outside the context of NLP, more complex tasks have been tackled in Image Classification. Deng~\emph{et al.} demonstrated that classification based on a hierarchical cost could be significantly more informative~\cite{Deng10}. The same reasoning could be adopted for Programming Language Detection by studying the grammars of different languages and the development history of programming languages, in order to build a series of models with a hierarchical structure. The model could start by distinguishing ``programming language'' from ``data description language'', then go deeper until the final classification into a specific language.
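A minimal sketch of such a two-level scheme follows; the family grouping and the two classifier objects are hypothetical and only illustrate the intended control flow.
\begin{verbatim}
# Hypothetical two-level hierarchy: a coarse classifier first picks a family,
# then a per-family classifier picks the concrete language.
FAMILIES = {
    "programming language":      ["C", "C++", "Java", "Python"],
    "data description language": ["JSON", "XML", "YAML", "HTML"],
}

def hierarchical_predict(content: bytes, family_clf, language_clfs) -> str:
    family = family_clf.predict(content)           # e.g. "data description language"
    return language_clfs[family].predict(content)  # e.g. "YAML"
\end{verbatim}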
\subsection{Recognising New Languages}
-The real challenges come from the changes over time. In order to recognise as many languages as possible, our language list should be able to grow through the time. Unfortunately, the existing performing methods fix \emph{a priori} a list of languages and focus on distinguishing between them.
+The real challenges come from changes over time. In order to recognise as many languages as possible, our language list should be able to grow over time. Unfortunately, the existing effective methods fix \emph{a priori} a list of languages and focus on distinguishing between them.
On the one hand, in spite of our efforts for fetching as many languages as possible, it is already impossible to list all existing languages. On the other hand, we have no idea about how many new languages will appear in the archive.
-Therefore, in this subsection, we will note several attempts on discovering new classes and discuss the extensibility of models in following parts.
+Therefore, in this subsection, we describe several attempts at discovering new classes and then discuss the extensibility of the existing models.
\subsubsection{Discovering New Languages}
Unsupervised Learning is the machine learning task of finding a model that reveals the inherent structure of unlabelled data. Clustering is one way of finding potential new, self-forming classes in feature space. Since new languages are still unknown to us, we focus here on hierarchical clustering, which does not demand \emph{a priori} a fixed number of new classes.
\paragraph{Agglomerative Hierarchical Clustering (AHC)}
AHC, the most widely applied Hierarchical Clustering method, is a bottom-up approach which starts from individual files and progressively combines them into larger clusters. We call a file without a label an \emph{observation}. Given $n$ observations $\{o_1,o_2,...,o_n\}$, a distance is calculated with a \emph{pairwise metric} for each pair of documents, resulting in $O(n^2)$ distances. At first, every single observation is its own cluster. By applying a \emph{linkage criterion}, the two closest clusters are merged into a single cluster. The algorithm terminates when only one cluster, containing all $n$ observations, remains.
-The clustering is firstly tested on the \texttt{mini} training set. Unfortunately, it does not work as we expected. By varying pairwise metric and linkage criteria, we obtained a slightly more performing combination: euclidean distance and average linkage. However, Figure~\ref{fig:clusters} shows that no language is capable to form a huge (as original number of observations of a class), visible and pure agglomeration of the same language. Most of them are equally mixed up inside a cluster. Clearly, other methods are needed to be tested for this task in the future.
+The clustering was first tested on the \texttt{mini} training set. Unfortunately, it did not work as we expected. By varying the pairwise metric and the linkage criterion, we obtained a slightly more effective combination: Euclidean distance with average linkage (sketched below). However, Figure~\ref{fig:clusters} shows that no language manages to form a large (at least as large as the original number of observations of a class) and pure agglomeration of a single language. Most languages are mixed up roughly evenly inside the clusters. Clearly, other methods will need to be tested for this task in the future.
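For reference, the retained configuration (Euclidean distance, average linkage) can be reproduced with SciPy roughly as follows; the feature matrix here is a random placeholder standing in for our actual byte-level features.
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder feature vectors for the unlabelled observations.
observations = np.random.rand(2000, 256)

# Average linkage over Euclidean pairwise distances.
Z = linkage(observations, method="average", metric="euclidean")

# Cut the dendrogram into a chosen number of clusters for inspection.
labels = fcluster(Z, t=20, criterion="maxclust")
\end{verbatim}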
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{clusters}
\caption{\label{fig:clusters} Top 20 clusters of Agglomerative Hierarchical Clustering. AHC is applied to the \texttt{mini} training set for 20 languages, with 500 observations per language.}
\end{figure}
\subsubsection{Extensibility of Existing Models}
-Once discovered, new classes need to be integrated into the existing model. Since the Baseline method, $n$-gram model and MNB demand a profile stocking statistics for each language, it suffices to train the incoming supplementary training sets and simply add the profiles into the model. On the contrary, ConvNet approaches should be retrained with a new network. However, no matter how we integrate these classes into original models, the quality of models will drop after adding them.
+Once discovered, new classes need to be integrated into the existing model. Since the Baseline method, the $n$-gram model and MNB keep a profile of statistics for each language, it suffices to train on the incoming supplementary training sets and simply add the new profiles to the model. On the contrary, ConvNet approaches must be retrained by building a new network. However, no matter how we integrate these classes into the original models, the quality of the models drops after adding them.
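The following sketch illustrates why profile-based models extend so cheaply: adding a language amounts to counting its $n$-grams and storing one more profile, while the existing profiles are left untouched. The counting shown here is deliberately simplified and the names are illustrative.
\begin{verbatim}
from collections import Counter

def byte_ngrams(content: bytes, n: int = 3):
    """Enumerate overlapping byte n-grams of a file."""
    return [content[i:i + n] for i in range(len(content) - n + 1)]

def add_language(profiles: dict, language: str, training_files: list) -> None:
    """Add one more language profile without retraining the others."""
    profile = Counter()
    for content in training_files:
        profile.update(byte_ngrams(content))
    profiles[language] = profile
\end{verbatim}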
\paragraph{Impact of Retraining with More Classes}
-In the context of Software Heritage, we need to be able to recognise as many languages as possible. Therefore, it is inevitable to integrate new languages to older classifier. We test three series of training and test sets in order to discover the impact of number of classes on global results and the deterioration of $F_1$ for commonly appeared languages.
+In the context of Software Heritage, we need to be able to recognise as many languages as possible. Therefore, integrating new languages into an existing classifier is inevitable. We tested three series of training and test sets in order to measure the impact of the number of classes on the global results and the deterioration of $F_1$ for commonly occurring languages.
\begin{table}[t]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
&\multicolumn{3}{c|}{Accuracy / \%} \\
\cline{2-4}
& \texttt{mini} & \texttt{less} & \texttt{total} \\
& \footnotesize{(20 languages)} & \footnotesize{(127 languages)} & \footnotesize{(374 languages)}\\
\hline
Baseline & 63.13 & 51.06 & 46.48 \\
MNB & 92.04 & 87.38 & 85.72 \\
$n$-gram model & 88.42 & 85.33 & 84.42 \\
Word-level ConvNet & 90.42 & 85.78 & 77.19 \\
Byte-level ConvNet & 94.53 & 91.71 & 88.64 \\
\hline
\end{tabular}
\caption{\label{tab:size} Comparison of accuracy score for each method on three series of training and test sets.}
\end{table}
Table~\ref{tab:size} compares the global accuracy scores for each series and each approach. We observe that, as the number of classes grows, the accuracy drops for all methods. From 20 to 374 languages, the baseline method loses 16.65 percentage points, while the $n$-gram model loses only 4.
Figure~\ref{fig:size} shows that the recognition quality of earlier integrated languages drops in most cases, especially for languages that are often the root of later introduced ones, such as C and JavaScript.
\begin{figure}[t!]
\centering
\subfloat[Baseline]{
\includegraphics[height=8cm]{./comparison_ngram_dist_size.pdf}
}
\subfloat[MNB]{
\includegraphics[height=8cm]{./comparison_bayes_size.pdf}
}
\subfloat[$n$-gram]{
\includegraphics[height=8cm]{./comparison_ngram_prob_size.pdf}
}
\subfloat[Word ConvNet]{
\includegraphics[height=8cm]{./comparison_cnn_word_size.pdf}
}
\subfloat[Byte ConvNet]{
\includegraphics[height=8cm]{./comparison_cnn_size.pdf}
}
\caption{\label{fig:size} Comparison of $F_1$ score for each method on three series of training and test sets. (Blue: \texttt{mini}, Red: \texttt{less}, Cyan: \texttt{total})}
\end{figure}
\subsubsection{Incremental Learning}
Incremental learning is another trend in supervised learning; it is capable of taking new classes into account as they appear in the training data stream. This online procedure means that earlier learned knowledge is preserved, reused and enriched, in contrast to offline retraining, which completely discards the acquired knowledge. Nowadays, there exist several deep incremental learning models, \emph{e.g.} Gepperth's GeppNet~\cite{Gepperth16}, Rebuffi \emph{et al.}'s iCaRL~\cite{RebuffiKL16}, Kemker and Kanan's FearNet~\cite{Kemker17}, \emph{etc.}
-Although the online version learning is favourable for the use cases of Software Heritage, the performance shown in \cite{Kemker17} points out that it is inevitable that the overall accuracy of these incremental learning method degrades after adding new classes. In addition, the online learning is shown in \cite{Kemker17} to always underperform with reference to offline version.
+Although online learning is favourable for the use cases of Software Heritage, the results reported in \cite{Kemker17} show that the overall accuracy of these incremental learning methods inevitably degrades after adding new classes. In addition, online learning is shown in \cite{Kemker17} to consistently underperform its offline counterpart.
\section{Conclusion}
-In the frame of TRE, we investigated existing NLP methods of text categorisation for applying them to source code files in Software Heritage. A dataset covering 374 language classes was originally created for tested machine learning methods. We compared several mature NLP methods for a large scale and proposed the most performing one, byte-level ConvNet method, in the possible large-scale deployment in archive. Despite achieving an accuracy of 88.64\% on a full-size corpus of 374 languages, the performance of byte-level ConvNet is not convincing for official deployment. Besides, the existing methods are insufficient to meet all demands of Software Heritage. For these reasons, we traced a potential roadmap for future enhancement on discovering new languages, building hierarchical model to enhance current models and engaging more languages classes into the model.
+In the frame of the TRE, we investigated existing NLP text-categorisation methods and applied them to source code files in Software Heritage. A dataset covering 374 language classes was created specifically for the tested machine learning methods. We compared several mature NLP methods at large scale and proposed the most effective one, the byte-level ConvNet, for a possible large-scale deployment on the archive. Despite achieving an accuracy of 88.64\% on the full-size corpus of 374 languages, the byte-level ConvNet is not convincing enough for official deployment. Moreover, the existing methods are insufficient to meet all the demands of Software Heritage. For these reasons, we traced a potential roadmap for future work: discovering new languages, building hierarchical models to enhance the current ones, and integrating more language classes into the model.
\clearpage
\begin{appendices}
\section{Language List}
\begin{table}[h!]
\centering
\tiny
\begin{tabularx}{\textwidth}{|X|X|X|X|X|X|}
\hline
1C Enterprise & ABAP & ActionScript & Ada & Adobe Font Metrics & Agda\\ \hline
AGS Script & Alloy & AMPL & AngelScript & Ant Build System & ANTLR\\ \hline
ApacheConf & Apex & API Blueprint & APL & AppleScript & Arc\\ \hline
AsciiDoc & ASP & AspectJ & Assembly & ATS & Augeas\\ \hline
AutoHotkey & AutoIt & Awk & Ballerina & Batchfile & BitBake\\ \hline
BlitzBasic & BlitzMax & Bluespec & Boo & Brainfuck & Brightscript\\ \hline
Bro & C & C\# & C++ & Cap'n Proto & CartoCSS\\ \hline
Ceylon & Chapel & Charity & ChucK & Cirru & Clarion\\ \hline
Clean & Click & CLIPS & Clojure & CMake & COBOL\\ \hline
CoffeeScript & ColdFusion & COLLADA & Common Lisp & Common Workflow Language & Component Pascal\\ \hline
CoNLL-U & Cool & Coq & Crystal & Csound & Csound Document\\ \hline
Csound Score & CSS & CSV & Cuda & CWeb & Cycript\\ \hline
D & Dart & DataWeave & desktop & Diff & DIGITAL Command Language\\ \hline
DM & Dockerfile & Dogescript & DTrace & Dylan & E\\ \hline
Eagle & eC & ECL & edn & Eiffel & Elixir\\ \hline
Elm & Emacs Lisp & EmberScript & EQ & Erlang & F\#\\ \hline
Factor & Fancy & Fantom & Filebench WML & FLUX & Forth\\ \hline
Fortran & FreeMarker & Frege & G-code & Game Maker Language & GAMS\\ \hline
GAP & GDB & GDScript & Genie & Genshi & Gerber Image\\ \hline
Gettext Catalog & Gherkin & GLSL & Glyph & Gnuplot & Go\\ \hline
Golo & Gosu & Grace & Gradle & Grammatical Framework & Graph Modeling Language\\ \hline
GraphQL & Graphviz (DOT) & Groovy & Hack & Harbour & Haskell\\ \hline
Haxe & HCL & HLSL & HTML & HXML & Hy\\ \hline
HyPhy & IDL & Idris & IGOR Pro & Inform 7 & INI\\ \hline
Inno Setup & Io & Ioke & Isabelle & J & Jasmin\\ \hline
Java & JavaScript & Jolie & JSON & JSON5 & JSONiq\\ \hline
Julia & Jupyter Notebook & KiCad Layout & KiCad Legacy Layout & KiCad Schematic & Kit\\ \hline
Kotlin & KRL & LabVIEW & Lasso & Lean & Lex\\ \hline
LFE & LilyPond & Limbo & Linker Script & Linux Kernel Module & Liquid\\ \hline
LiveScript & LLVM & Logos & Logtalk & LOLCODE & LookML\\ \hline
LoomScript & LSL & Lua & M & M4 & Makefile\\ \hline
Mako & Markdown & Mask & Mathematica & Matlab & Maven POM\\ \hline
Max & MAXScript & MediaWiki & Mercury & Meson & Metal\\ \hline
Mirah & Modelica & Modula-2 & Module Management System & Monkey & Moocode\\ \hline
MoonScript & MQL4 & MQL5 & MTML & mupad & NCL\\ \hline
Nemerle & nesC & NetLinx & NetLogo & NewLisp & Nextflow\\ \hline
Nginx & Nim & Nit & Nix & NSIS & Nu\\ \hline
Objective-C & Objective-C++ & Objective-J & OCaml & ooc & Opa\\ \hline
OpenEdge ABL & OpenSCAD & OpenType Feature File & Org & Ox & Oz\\ \hline
P4 & Pan & Papyrus & Parrot & Pascal & PAWN\\ \hline
Pep8 & Perl & Perl 6 & PHP & Pickle & PicoLisp\\ \hline
PigLatin & Pike & PLpgSQL & PLSQL & Pod & PogoScript\\ \hline
Pony & PostScript & POV-Ray SDL & PowerBuilder & PowerShell & Processing\\ \hline
Prolog & Propeller Spin & Protocol Buffer & Public Key & Puppet & Pure Data\\ \hline
PureBasic & PureScript & Python & q & QMake & QML\\ \hline
R & Racket & Ragel & RAML & Rascal & Raw token data\\ \hline
RDoc & REALbasic & Rebol & Red & Redcode & Ren'Py\\ \hline
RenderScript & reStructuredText & REXX & Ring & RMarkdown & RobotFramework\\ \hline
Roff & Rouge & RPC & RPM Spec & Ruby & Rust\\ \hline
SaltStack & SAS & Scala & Scheme & Scilab & sed\\ \hline
Self & ShaderLab & Shell & Shen & Slash & Smali\\ \hline
Smalltalk & Smarty & SMT & Solidity & SourcePawn & SPARQL\\ \hline
SQF & SQL & SQLPL & Squirrel & SRecode Template & Stan\\ \hline
Standard ML & Stata & SubRip Text & SuperCollider & SVG & Swift\\ \hline
SystemVerilog & Tcl & Tea & Terra & TeX & Text\\ \hline
Textile & Thrift & TI Program & TLA & TOML & Turing\\ \hline
Turtle & TXL & TypeScript & Unity3D Asset & Uno & UnrealScript\\ \hline
UrWeb & Vala & VCL & Verilog & VHDL & Vim script\\ \hline
Visual Basic & Volt & Vue & Wavefront Material & Wavefront Object & wdl\\ \hline
Web Ontology Language & WebAssembly & WebIDL & wisp & X10 & xBase\\ \hline
XC & XML & Xojo & XPages & XProc & XQuery\\ \hline
XS & XSLT & Xtend & Yacc & YAML & YANG\\ \hline
Zephir & Zimpl \\ \cline{1-2}
\end{tabularx}
\caption{\label{tab:lan} Language list of the dataset (374 languages in total).}
\end{table}
\section{File Distribution in Dataset of Each Language}
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{circle}
\caption{\label{fig:distribution}File distribution by language in the dataset.}
\end{figure}
\section{Hyperparameters of ConvNets}
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|}
\hline
Hyperparameter & Value \\
\hline
input size & 400 \\
vocabulary size & 15000 \\
character embedding size & 128 \\
filter sizes & [3, 4, 5] \\
nb. of filter matrices & 100 \\
dropout rate & 0.5 \\
activation function & ReLU \\
nb. of neurons in fully connected level & 1024 \\
nb. of classes & 374 \\
\hline
\end{tabular}
\caption{\label{tab:hyp-word} Details of the hyperparameter configuration of the word-level ConvNet architecture for the \texttt{total} training and test sets, following \cite{Kim14}.}
\end{table}
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|}
\hline
Hyperparameter & Value \\
\hline
input size & 4,096 \\
vocabulary size & 256 \\
character embedding size & 64 \\
filter sizes & [3, 5, 7, 9, 10] \\
nb. of filter matrices & 256 \\
activation function & ReLU \\
nb. of neurons in fully connected level & 1,024 \\
nb. of classes & 374 \\
\hline
\end{tabular}
\caption{\label{tab:hyp-byte} Details of the hyperparameter configuration of the byte-level ConvNet architecture for the \texttt{total} training and test sets, following Chaitanya Joshi's adaptation.}
\end{table}
\end{appendices}
\bibliography{bib-rapport}
\bibliographystyle{unsrt}
\end{document}
\ No newline at end of file