The data revolution continues to transform every sector of science, industry, and government. Given the profound impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly, in accordance with laws and ethical norms. In this article we discuss three recent regulatory frameworks that aim to protect the rights of individuals impacted by data collection and analysis: the European Union's General Data Protection Regulation (GDPR), the New York City Automated Decision Systems (ADS) Law, and the net neutrality principle. These frameworks are prominent examples of a global trend: governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this paper is to bring these regulatory frameworks to the attention of the data management community and to underscore the technical challenges they raise, which we, as a community, are well equipped to address. The main take-away of this article is that legal norms cannot be incorporated into data-driven systems as an afterthought. Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.
Providing a 360-degree view of a given data item, especially for sensitive data, is essential not only for protecting the data and the associated privacy but also for assuring trust, compliance, and ethics in the systems that use or manage such data. With the advent of the GDPR, the California data privacy law, and other such regulatory requirements, it is essential to support data transparency along all of these dimensions. Moreover, data transparency should not violate privacy and security requirements. In this paper, we put forward a vision for how data transparency can be achieved in a decentralized fashion using blockchain technology.
Data is a cornerstone of empirical software engineering (ESE) research and practice. It underpins numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study we assess the quality of thirteen datasets that have been used extensively in research on software effort estimation. The quality issues considered in this paper draw on a taxonomy that we published previously, based on a systematic mapping of data quality issues in ESE. Our contributions are: (1) an evaluation of the "fitness for purpose" of these commonly used datasets; and (2) an assessment of the utility of the taxonomy for dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, to the availability and use of higher-quality datasets.
Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine-learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees with, disagrees with, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined and compared, complemented with similarity features inspired by other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, in order to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations with similarity features, slightly outperforms the previous state of the art.
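To make the described architecture concrete, the following is a minimal sketch, not the authors' released implementation: bi-directional GRU encoders with neural attention produce vectors for the headline and the article body, these vectors are concatenated with hand-crafted similarity features, and a final layer predicts one of the four FNC-1 stance labels. The embedding size, hidden size, number of similarity features, and the use of only two encoders (the paper also encodes the first two sentences separately) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveBiGRU(nn.Module):
    """Encode a padded token-id sequence into a single vector via BiGRU + attention."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):                       # (batch, seq_len)
        states, _ = self.gru(self.emb(token_ids))       # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.att(states), dim=1) # attention over time steps
        return (weights * states).sum(dim=1)             # (batch, 2*hidden)

class StanceClassifier(nn.Module):
    """Combine headline and article encodings with similarity features (sketch only)."""
    def __init__(self, vocab_size, n_sim_feats=10, hidden=128, n_classes=4):
        super().__init__()
        self.headline_enc = AttentiveBiGRU(vocab_size, hidden=hidden)
        self.article_enc = AttentiveBiGRU(vocab_size, hidden=hidden)
        self.out = nn.Linear(4 * hidden + n_sim_feats, n_classes)

    def forward(self, headline_ids, article_ids, sim_feats):
        h = self.headline_enc(headline_ids)
        a = self.article_enc(article_ids)
        return self.out(torch.cat([h, a, sim_feats], dim=-1))  # logits over the 4 stances
```

Under this sketch, the pre-training idea would amount to first fitting the AttentiveBiGRU encoders on a natural language inference sentence-pair dataset and then reusing their weights when training the stance classifier.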
In this article, we study the problem of automatic fact checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: the detection of check-worthy claims (here, in the context of political debates) and the verification of factual claims (here, answers to questions in a community question answering forum). We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues (encoding the discourse relations produced by a discourse parser) and contextual features. For the claim identification problem, we model the target claim in the context of the full intervention of a participant and of the previous and following turns in the debate, also taking into account contextual meta-information. For the answer verification problem, we model the answer with respect to the entire question-answer thread in which it occurs and with respect to other related posts from the entire forum. We develop annotated datasets for both tasks and run an extensive experimental evaluation of the models, confirming that both types of information, and especially the contextual features, play an important role in the performance of our claim check-worthiness prediction and answer verification systems.
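As a rough illustration of the feature-combination side of this setup (not the paper's actual pipeline), the sketch below trains a kernel SVM over a concatenation of three hypothetical feature blocks for a claim: features of the claim sentence itself, contextual features from the surrounding turns, and discourse-cue features. The feature extractors and the block sizes are assumptions; only the toy data makes the example self-contained.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_feature_vector(sentence_feats, context_feats, discourse_feats):
    """Concatenate the per-claim feature blocks into a single input vector."""
    return np.concatenate([sentence_feats, context_feats, discourse_feats])

# Toy data: 200 claims, each with 50 sentence, 30 context, and 20 discourse features.
rng = np.random.default_rng(0)
X = np.stack([build_feature_vector(rng.normal(size=50),
                                   rng.normal(size=30),
                                   rng.normal(size=20)) for _ in range(200)])
y = rng.integers(0, 2, size=200)        # 1 = check-worthy, 0 = not

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)
print(model.predict(X[:5]))
```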
The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. Despite a great deal of research in this area, academia still lacks a concrete plan to confront this troublesome phenomenon. In this research, we focus on this point by assessing the readiness of academia to confront false and unverified information. To this end, we adopt the emergence framework and measure its different dimensions over more than 21,000 articles published by academia about false and unverified information. Our results show that the current body of research has so far grown organically, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external-push strategy which, compared to the early stage of the field, reinforces the emergence dimensions and leads to a higher level in every dimension.
We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated, and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and by 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing, and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of the data, and the constantly evolving needs. While multiple data sketching, summarization, and deletion techniques have been developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges, with ad hoc solutions that are application specific and rarely shareable. In this vision paper we discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for the enforcement of constraints, and for the development of applications over the retained information. In particular, we overview relevant related work, highlighting new research challenges and potential reuse of existing techniques.
Curated, labelled, high-quality data is a valuable commodity for tasks such as business analytics or machine learning. Open data is a common source of such data; for example, retail analytics draws on open demographic data, and weather forecast systems draw on open atmospheric and ocean data. This data is released openly by governments to achieve various objectives, such as transparency, informing citizen engagement, or supporting private enterprise, and is generally trusted. Critical examination of ongoing social changes, including the post-truth phenomenon, suggests that the quality, integrity, and authenticity of open data may be at risk. We describe these risks, with examples, and identify mechanisms to mitigate them. As an initial assessment of awareness of these risks, we compare our analysis to perspectives captured during open data stakeholder consultations in Canada.
Deep learning requires significant amounts of training data. Experiments have shown that both the volume and the quality of training data can significantly impact the effectiveness of value extraction. In some cases, the volume of training data is not large enough to effectively train a deep learning model. In other cases, the quality of training data is not high enough to achieve optimal performance. Many approaches have been proposed for augmenting training data to mitigate these deficiencies. However, whether the augmented data are "fit for purpose" for deep learning is still an open question. In this paper, we first discuss a data augmentation approach for deep learning. The approach includes two components: the first removes noisy data from a dataset using machine-learning-based classification to improve its quality, and the second increases the volume of the dataset so that a deep learning model can be trained effectively. In order to evaluate the quality of the augmented data in terms of fidelity, variety, and veracity, a data quality evaluation framework is proposed. We demonstrate the effectiveness of the data augmentation approach and the data quality evaluation framework through a study of automated classification of biology cell images using deep learning. The experimental results clearly demonstrate the impact of the volume and quality of training data on the performance of deep learning, as well as the importance of data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning studies in other domains.
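A minimal sketch of the two-component idea, under assumed interfaces: a noise classifier (any scikit-learn-style model exposing predict_proba, an assumption made here) filters out samples likely to be noisy, and simple label-preserving transformations (horizontal and vertical flips) then enlarge the remaining set before training the deep model. The exact classifier, threshold, and transformations used in the paper may differ.

```python
import numpy as np

def remove_noisy(images, labels, noise_classifier, threshold=0.5):
    """Keep only samples whose predicted probability of being noise is below threshold."""
    p_noise = noise_classifier.predict_proba(images.reshape(len(images), -1))[:, 1]
    keep = p_noise < threshold
    return images[keep], labels[keep]

def augment(images, labels):
    """Increase volume with label-preserving horizontal and vertical flips."""
    flipped_h = images[:, :, ::-1]          # flip each image left-right
    flipped_v = images[:, ::-1, :]          # flip each image top-bottom
    augmented_x = np.concatenate([images, flipped_h, flipped_v])
    augmented_y = np.concatenate([labels, labels, labels])
    return augmented_x, augmented_y

# Usage: clean first, then augment, before feeding the data to the deep model.
# images: (N, H, W) array of cell images, labels: (N,) class ids.
# images, labels = remove_noisy(images, labels, noise_classifier)
# images, labels = augment(images, labels)
```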
With Internet video gaining popularity and soaring to dominate network traffic, extensive study is being carried out on how to achieve higher Quality of Experience (QoE) in its content delivery. Built on chunk-based streaming protocols, Adaptive Bitrate (ABR) algorithms have emerged to cope with diverse and fluctuating network conditions by dynamically adjusting the bitrates of future chunks. This inevitably involves predicting the future throughput of a video session. Parameterized ABR simplifies the ABR design by abstracting all or part of the prediction of network uncertainty into its parameters. In this paper, we consider the issue of learning the best settings of these parameters from the logged throughput traces of previous video sessions. Essential to our study is how to properly partition the logged sessions according to the critical features that affect network conditions, e.g., Internet Service Provider (ISP) and geographical location, so that different parameter settings can be adopted in different situations to reach better predictions. We present a greedy approach to this feature-based partitioning, following the strategy used in decision-tree learning. The performance of our partition algorithm has been evaluated on our throughput dataset with a sample parameterized ABR algorithm. The experiments show that our approach can improve the average bitrate of the sample ABR algorithm by 36.1% without increasing the rebuffering ratio, with 99% of the sessions seeing an improvement. It can also improve the rebuffering ratio by 87.7% without decreasing the average bitrate; among the sessions that experienced rebuffering, 82% improve and 18% remain the same.
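The greedy, decision-tree-style partitioning can be sketched as follows. The helper names fit_best_parameter (searches the ABR parameter space for one group of sessions) and qoe_score (replays the logged throughput traces of a group under a given parameter setting and returns its total QoE) are illustrative assumptions; the recursion splits on whichever feature yields the largest QoE gain over keeping a single shared setting, and stops when no split helps.

```python
def greedy_partition(sessions, features, fit_best_parameter, qoe_score, min_size=100):
    """Recursively split sessions on the feature whose per-group parameter settings
    give the largest total-QoE gain over a single shared setting (sketch only)."""
    shared_param = fit_best_parameter(sessions)
    base_qoe = qoe_score(sessions, shared_param)

    best = None
    for feat in features:
        groups = {}
        for s in sessions:                       # group sessions by the feature value
            groups.setdefault(s[feat], []).append(s)
        if any(len(g) < min_size for g in groups.values()):
            continue                             # refuse splits that create tiny groups
        gain = sum(qoe_score(g, fit_best_parameter(g)) for g in groups.values()) - base_qoe
        if best is None or gain > best[0]:
            best = (gain, feat, groups)

    if best is None or best[0] <= 0:             # stop: no split improves total QoE
        return {"param": shared_param}
    _, feat, groups = best
    remaining = [f for f in features if f != feat]
    return {"split_on": feat,
            "children": {v: greedy_partition(g, remaining, fit_best_parameter,
                                             qoe_score, min_size)
                         for v, g in groups.items()}}
```

Each session here is assumed to be a dict carrying its feature values (e.g., "isp", "location") alongside its throughput trace; the returned tree maps feature values to the parameter setting to use for matching future sessions.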
With the increasing popularity of online review systems, a large volume of user-generated content helps people make reasonable judgments about the quality of services or products from unknown providers. However, these platforms can easily be abused as entry points for misinformation, since malicious users can freely insert information into these systems without validation. Consequently, online review systems become targets of individual or professional spammers who insert deceptive reviews by manipulating the ratings and content of the reviews. In this work, we propose a review-spam detection scheme based on aspect-specific opinions extracted from individual reviews and their deviations from the aggregated aspect-specific opinions. We model the influence of a user's opinion deviation from the majority on that user's trustworthiness in the form of a deviation-based penalty, and we integrate this penalty into a three-layer trust propagation framework to iteratively compute trust scores for users, reviews, and target entities, respectively. The trust scores are effective indicators of spammers, since they reflect the overall deviation of a user from the aggregated aspect-specific opinions across all targets and all aspects. Experiments on a dataset collected from Yelp.com show that the proposed detection scheme, based on aspect-specific content-aware trust propagation, is able to measure users' trustworthiness from the opinions expressed in their reviews.
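The following is a minimal sketch (not the paper's exact formulation) of three-layer trust propagation with a deviation-based penalty. Each review carries its author, its target entity, and a vector of aspect-specific opinion scores; the penalty is the distance between a review's opinions and the trust-weighted aggregate opinion on the same target, and it lowers the trust of the review and, in turn, of its author. The exponential penalty form and the update order are assumptions made for illustration.

```python
import numpy as np

def propagate_trust(reviews, n_aspects, iters=20):
    """reviews: list of dicts with keys 'user', 'target', and 'opinions' (length n_aspects)."""
    users = {r["user"] for r in reviews}
    targets = {r["target"] for r in reviews}
    user_trust = {u: 1.0 for u in users}
    review_trust = [1.0] * len(reviews)

    for _ in range(iters):
        # 1) Aggregate aspect-specific opinions per target, weighted by review trust.
        agg = {t: np.zeros(n_aspects) for t in targets}
        weight = {t: 1e-9 for t in targets}
        for i, r in enumerate(reviews):
            agg[r["target"]] += review_trust[i] * np.asarray(r["opinions"])
            weight[r["target"]] += review_trust[i]
        for t in targets:
            agg[t] /= weight[t]

        # 2) Review trust = author's trust discounted by the deviation penalty.
        for i, r in enumerate(reviews):
            deviation = np.abs(np.asarray(r["opinions"]) - agg[r["target"]]).mean()
            review_trust[i] = user_trust[r["user"]] * np.exp(-deviation)

        # 3) User trust = average trust of the user's reviews.
        sums = {u: 0.0 for u in users}
        counts = {u: 0 for u in users}
        for i, r in enumerate(reviews):
            sums[r["user"]] += review_trust[i]
            counts[r["user"]] += 1
        user_trust = {u: sums[u] / counts[u] for u in users}

    return user_trust, review_trust
```

Users whose opinions consistently deviate from the trust-weighted aggregates end up with low trust scores, which is the property the scheme relies on to flag likely spammers.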