
\section{Dataset}
\label{section:dataset}

The dataset is an important factor in training the model.
To detect vulnerabilities at function-level granularity and to classify them by type, each function must be labeled as either secure or insecure, and each insecure function must also be labeled with its vulnerability type.

Despite the large amount of available source code, labeling it is challenging.
Tera-PROMISE \cite{promiserepo} is a research dataset repository for software engineering. Although it has been widely used in the area of metric-based bug detection, its datasets are labeled with code metrics, which are not usable for our model.
Neuhaus et al. used the vulnerability database of the Mozilla project to predict vulnerable components based on function calls, but the granularity remains at the module level \cite{neuhaus2007predicting}.

Deciding whether a given function is secure or not is a challenge in itself.
A concrete attack vector clearly demonstrates that the code is vulnerable, whereas strictly proving the absence of vulnerabilities is possible only through formal verification.
I therefore assumed that code no longer contains vulnerabilities once they have been patched.

Due to the limitations of the existing datasets described above, I collected a new source code dataset from GitHub.
I collected merged pull requests whose titles contain a vulnerability type (e.g., buffer overflow).
For each such pull request, I then labeled the merge commit as secure and the base commit as insecure.
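
The labeling rule above can be illustrated with a minimal sketch. It is not the actual collection script: it assumes that each pull request is available as a dictionary with the fields \texttt{title}, \texttt{merged\_at}, \texttt{merge\_commit\_sha}, and \texttt{base.sha}, which mirror the pull request objects of the GitHub REST API. The keyword list, the \texttt{LabeledCommit} record, and the function names are hypothetical and serve only as illustration; of the keywords, only ``buffer overflow'' appears in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumption: illustrative keyword list. Only "buffer overflow" is named
# in the text; the other entries are placeholders.
VULNERABILITY_TYPES = ("buffer overflow", "integer overflow", "use after free")


@dataclass(frozen=True)
class LabeledCommit:
    """Hypothetical record: one commit with its label and vulnerability type."""
    sha: str
    label: str  # "secure" or "insecure"
    vulnerability_type: str


def match_vulnerability_type(title: str) -> Optional[str]:
    """Return the first vulnerability type found in the title, if any."""
    lowered = title.lower()
    for vuln_type in VULNERABILITY_TYPES:
        if vuln_type in lowered:
            return vuln_type
    return None


def label_pull_requests(pull_requests: Iterable[dict]) -> List[LabeledCommit]:
    """Label the base commit insecure and the merge commit secure."""
    labeled = []
    for pr in pull_requests:
        # Skip pull requests that were never merged.
        if pr.get("merged_at") is None:
            continue
        vuln_type = match_vulnerability_type(pr["title"])
        if vuln_type is None:
            continue
        # Code before the patch is assumed to be vulnerable.
        labeled.append(LabeledCommit(pr["base"]["sha"], "insecure", vuln_type))
        # Code after the patch is assumed to be free of the vulnerability.
        labeled.append(LabeledCommit(pr["merge_commit_sha"], "secure", vuln_type))
    return labeled
```

Matching on the title alone keeps the filter simple, but it relies on the assumption stated earlier that a merged patch actually removes the vulnerability; pull requests with misleading titles would be mislabeled under this rule.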