Datasets Privacy

Introduction

This document presents the datasets generated for Scava, discusses their implications for privacy, and describes what has been done to ensure the data is safe.

All datasets are anonymised: fields that could be used to identify individuals or companies either directly or indirectly have been transformed using the Anonymise::Utility Perl module.

The intended audience of the datasets is composed of:

If you have questions or remarks about the datasets, please feel free to contact us. All cases related to privacy will be handled with the utmost diligence.

Description of the datasets

Three types of datasets are generated, each with its own schema and attributes. The first step in preserving privacy is to describe the various datasets and their attributes, and to identify which fields could pose a threat.

AERI stacktraces

The AERI stacktraces dataset contains information about exceptions encountered by users in the Eclipse IDE. It includes data about the exception itself and the environment in which it occurred.

The incidents dataset offers the following attributes:

The problems dataset offers the following attributes:

The incidents bundle offers the following attributes:

Eclipse Mailing lists

The Eclipse mailing lists dataset offers the following attributes:

Eclipse projects extracts

The Eclipse projects extracts contain different sets of data depending on the sources available for each project. The full list of extracts is given below, highlighting the attributes that include privacy-related information.

Git (Software Configuration Management)

Bugzilla (Issue tracking)

Forums (User-oriented communication)

PMI (project metadata)

SonarQube (code analysis)

Anonymisation

The data is anonymised with the Anonymise::Utility Perl module. It uses asymmetric encryption to generate a one-off mapping between clear IDs and obfuscated strings.

Data transformation

The private key is thrown away, which prevents any recovery of the encrypted IDs. This technique has several advantages:

The resulting datasets contain no email addresses, names, user IDs or machine IDs.
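
The one-off mapping and discarded key described above can be illustrated with a minimal Python sketch. It is not the Anonymise::Utility API: the class and method names (`OneOffAnonymiser`, `scramble`, `discard_secret`) and the sample rows are hypothetical, and a keyed hash (HMAC-SHA256) with a throwaway secret stands in for the asymmetric encryption used by the real module. The property it aims to show is the same: identical IDs map to identical obfuscated strings within one run, and once the secret is gone the mapping cannot be recomputed.

```python
import hashlib
import hmac
import secrets


class OneOffAnonymiser:
    """Map clear IDs to obfuscated strings using a throwaway secret.

    Assumption: HMAC-SHA256 replaces the asymmetric encryption of the
    real module; this class is illustrative only.
    """

    def __init__(self):
        # Random secret that exists only in memory for this run.
        self._secret = secrets.token_bytes(32)

    def scramble(self, clear_id):
        """Return the obfuscated string for clear_id (stable within a run)."""
        if self._secret is None:
            raise RuntimeError("secret already discarded")
        digest = hmac.new(self._secret, clear_id.encode("utf-8"), hashlib.sha256)
        # Keep 16 hex characters to get a short, opaque identifier.
        return digest.hexdigest()[:16]

    def discard_secret(self):
        """Drop the secret so the mapping cannot be recomputed later."""
        self._secret = None


# Hypothetical rows standing in for a dataset extract.
rows = [
    {"author": "alice@example.org", "commits": 12},
    {"author": "bob@example.org", "commits": 3},
    {"author": "alice@example.org", "commits": 7},
]

anonymiser = OneOffAnonymiser()
anonymised = [{**row, "author": anonymiser.scramble(row["author"])} for row in rows]
anonymiser.discard_secret()
```

Because the same contributor always receives the same obfuscated string within a run, per-person analyses (for example, counting commits per author) remain possible on the anonymised rows, while a separate run with a new secret yields unrelated strings.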

Privacy compliance

The management and publication of data in the European Union is governed by the General Data Protection Regulation (GDPR), which also addresses the export of data outside the EU and the EEA. Since we are EU citizens -- and considering also that the Crossminer project is funded by the EU H2020 research programme -- we must abide by this regulation. Beyond the legal implications of publishing open datasets, we want to make sure that everybody involved in the data, individuals and companies alike, is safe.

In the case of software engineering data, a huge amount of public information is readily available without any restrictions. Most, if not all, tools used in the open-source world provide information about who did what and when -- which is undoubtedly useful for collaboration and community. It is also mandatory for intellectual property processes: when one contributes a file to an open-source project, it is at the very least good practice to put one's name (and possibly email address) in the header of the file, alongside the licence used. Where intellectual property is an important concern, as it is for the Eclipse Foundation, this is simply required, since we need to know who the work belongs to in the event of IP issues or lawsuits.

The publication of open data in this context, i.e. where the original data is already publicly available from public tools, is a specific case under the GDPR, and it is hard to find reliable information about how it should be conducted. As a result, we relied on similar studies and articles and proceeded on a best-effort basis to provide our users with datasets that are as useful and as safe as possible.

Considering that:

Considering also that:

We assume that both the data itself and its publication are safe, with regard to both the users and the current regulation.

References