The data revolution continues to transform every sector of science, industry, and government. Given the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly, in accordance with laws and ethical norms. In this article we discuss three recent regulatory frameworks that aim to protect the rights of individuals impacted by data collection and analysis: the European Union's General Data Protection Regulation (GDPR), the New York City Automated Decision Systems (ADS) Law, and the net neutrality principle. These frameworks are prominent examples of a global trend: governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this paper is to bring these regulatory frameworks to the attention of the data management community, and to underscore the technical challenges they raise, challenges that we, as a community, are well equipped to address. The main take-away of this article is that legal norms cannot be incorporated into data-driven systems as an afterthought. Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.
Providing a 360-degree view of a given data item, especially sensitive data, is essential not only for protecting the data and the associated privacy but also for assuring trust, compliance, and ethics in the systems that use or manage such data. With the advent of the GDPR, California's data privacy law, and other such regulatory requirements, it is essential to support data transparency along all of these dimensions. Moreover, data transparency should not violate privacy and security requirements. In this paper, we put forward a vision for how data transparency can be achieved in a decentralized fashion using blockchain technology.
Governance of misinformation is a serious concern on social media platforms. Based on three case studies, we offer policy makers insights on managing misinformation in social media. These studies are essential because of the misinformation prevalent on existing social media platforms; managing misinformation, for instance fake news, is a challenge for both policy makers and the platforms. The paper explores the factors behind the rapid propagation of misinformation. On average, about 1.5 million tweets were analysed for each of the three cases surrounding misinformation. This study provides insights for managing misinformation in social media, with a specific focus on cognitive factors that may emerge as drivers of misinformation virality. We highlight findings from three separate case studies focusing on the intrinsic properties of the content, the personality attributes of misinformation propagators, and the network-related attributes that facilitate the rapid propagation of misinformation. Findings indicate that these three aspects collectively catalyse misinformation diffusion and its subsequent virality. Policy makers can utilize the findings of this experience study for the governance of misinformation: tracking and disrupting any one of the identified drivers could act as a control mechanism to manage misinformation propagation.
Data is a cornerstone of empirical software engineering (ESE) research and practice. It underpins numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study we assess the quality of thirteen datasets that have been used extensively in research on software effort estimation. The quality issues considered in this paper draw on a taxonomy that we published previously, based on a systematic mapping of data quality issues in ESE. Our contributions are: 1. an evaluation of the 'fitness for purpose' of these commonly used datasets; and 2. an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher quality datasets.
The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. Despite a great deal of research in this area, academia still lacks a concrete plan for confronting this troublesome phenomenon. In this research, we address this gap by assessing the readiness of academia to confront false and unverified information. To this end, we adopt the emergence framework and measure its different dimensions over more than 21,000 academic articles about false and unverified information. Our results show that the current body of research has so far grown organically, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external-push strategy which, compared to the field's early organic stage, reinforces the emergence dimensions and helps the field achieve a higher level in every dimension.
We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and by 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of the data, and constantly evolving needs. While multiple data sketching, summarization and deletion techniques have been developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges with ad hoc solutions that are application specific and rarely sharable. In this vision paper we discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for constraint enforcement, and for the development of applications over the retained information. In particular, we overview relevant related work, highlighting new research challenges and the potential reuse of existing techniques.
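As a minimal illustration of the sketching and summarization techniques alluded to above (an example of the general idea, not a method from the paper itself), a count-min sketch retains approximate frequency information about a data stream in a fixed amount of storage, letting the raw records be disposed of while aggregate knowledge is kept. All class and parameter names here are our own.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed space (width * depth counters)."""

    def __init__(self, width=512, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One deterministic hash position per row.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates; may overestimate on hash collisions.
        return min(self.table[row][col] for row, col in self._cells(item))
```

The storage cost is fixed by `width * depth` regardless of how many items stream through, which is exactly the trade-off a disposal policy might exploit: bounded space in exchange for approximate answers.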
Curated, labelled, high quality data is a valuable commodity for tasks such as business analytics or machine learning. Open data is a common source of such data; for example, retail analytics draws on open demographic data, and weather forecast systems draw on open atmospheric and ocean data. This data is released openly by governments to achieve various objectives, such as transparency, informing citizen engagement, or supporting private enterprise, and is generally trusted. Critical examination of ongoing social changes, including the post-truth phenomenon, suggests that the quality, integrity, and authenticity of open data may be at risk. We describe these risks, with examples, and identify mechanisms to mitigate them. As an initial assessment of awareness of these risks, we compare our analysis to perspectives captured during open data stakeholder consultations in Canada.
Disinformation spreads rapidly through social media. Bots and trolls are often viewed as promulgators of such information. However, there are many types of disinformation. An in-depth study of the spread of disinformation following the release of the Black Panther movie is described. We demonstrate that there were multiple disinformation campaigns that varied in type. We find that these different disinformation campaigns vary in the speed with which they spread, the utilization of images, whether that diffusion is supported by bots, and whether digital democracy groundswells can counter the spread. These results suggest that countering disinformation is not as simple as spotting fake news and suspending perpetrators, but rather that distinctions are important when considering the design of responses and detection algorithms.
Deep learning requires significant training data. Experiments have shown that both the volume and the quality of training data can significantly impact the effectiveness of value extraction. In some cases, the volume of training data is not sufficiently large for effectively training a deep learning model. In other cases, the quality of the training data is not high enough to achieve optimal performance. Many approaches have been proposed for augmenting training data to mitigate these deficiencies. However, whether the augmented data are 'fit for purpose' for deep learning remains an open question. In this paper, we first discuss a data augmentation approach for deep learning. The approach includes two components: the first removes noisy data from a dataset using machine learning based classification to improve its quality, and the second increases the volume of the dataset for effectively training a deep learning model. To evaluate the quality of the augmented data in terms of fidelity, variety and veracity, a data quality evaluation framework is proposed. We demonstrate the effectiveness of the data augmentation approach and the data quality evaluation framework through a study of automated classification of biology cell images using deep learning. The experimental results clearly demonstrate the impact of the volume and quality of training data on the performance of deep learning and the importance of the data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning studies in other domains.
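The volume-increasing component described above can be sketched with the most common label-preserving image transformations, rotations and flips. This is a generic illustration under our own naming, not the paper's actual pipeline; images are represented as plain 2D lists of pixel values to keep the sketch dependency-free.

```python
def rotate90(img):
    # Rotate a 2D grid (list of rows) 90 degrees clockwise.
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    # Mirror each row left-to-right.
    return [row[::-1] for row in img]

def augment(dataset):
    """Expand a labelled image dataset with rotations and flips.

    Each (image, label) pair yields 4 variants, increasing volume 4x
    while preserving labels -- one standard way to grow training data
    when the original volume is insufficient.
    """
    out = []
    for img, label in dataset:
        r90 = rotate90(img)
        for variant in (img, r90, hflip(img), rotate90(r90)):
            out.append((variant, label))
    return out
```

In practice the transformations must be chosen so the label truly survives them (cell images tolerate rotation; digit images may not), which is precisely the 'fidelity' concern the proposed evaluation framework targets.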
With internet video soaring in popularity and coming to dominate network traffic, extensive study is being carried out on how to achieve higher Quality of Experience (QoE) in its content delivery. Built on chunk-based streaming protocols, Adaptive Bitrate (ABR) algorithms have recently emerged to cope with diverse and fluctuating network conditions by dynamically adjusting the bitrates of future chunks. This inevitably involves predicting the future throughput of a video session. Parameterized ABR simplifies ABR design by abstracting all or part of the prediction of network uncertainty into its parameters. In this paper, we consider the problem of learning the best settings of these parameters from the logged throughput traces of previous video sessions. Essential to our study is how to properly partition the logged sessions according to the critical features that affect network conditions, e.g. Internet Service Provider (ISP) and geographical location, so that different parameter settings can be adopted in different situations to reach better predictions. We present a greedy approach to this feature-based partitioning, following the strategy explored in decision trees. The performance of our partition algorithm has been evaluated on our throughput dataset with a sample parameterized ABR algorithm. The experiments show that our approach can improve the average bitrate of the sample ABR algorithm by 36.1% without increasing the rebuffering ratio, with 99% of sessions seeing improvement. It can also improve the rebuffering ratio by 87.7% without decreasing the average bitrate, where, among the sessions that experienced rebuffering, 82% see improvement and 18% remain the same.
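The greedy, decision-tree-style partitioning described above can be sketched as follows. This is our own simplified reconstruction under stated assumptions, not the paper's algorithm: we stand in for "fitting one ABR parameter setting per group" with a per-group constant throughput predictor, and split a group on a feature only when doing so reduces total prediction error.

```python
from statistics import mean

def group_error(throughputs):
    # Error of predicting each session's throughput with the group mean
    # (a stand-in for tuning one parameter setting per group).
    m = mean(throughputs)
    return sum(abs(t - m) for t in throughputs)

def greedy_partition(sessions, features):
    """Greedily split sessions on features that reduce prediction error.

    sessions: list of dicts, e.g. {"isp": "A", "geo": "east", "tput": 3.1}
    features: feature keys to try, in order (e.g. ["isp", "geo"])
    Returns a list of session groups, the leaves of the greedy splits.
    """
    groups = [sessions]
    for feat in features:
        refined = []
        for g in groups:
            base = group_error([s["tput"] for s in g])
            split = {}
            for s in g:
                split.setdefault(s[feat], []).append(s)
            split_err = sum(group_error([s["tput"] for s in sub])
                            for sub in split.values())
            # Keep the split only if it strictly reduces total error.
            refined.extend(split.values() if split_err < base else [g])
        groups = refined
    return groups
```

Each resulting group would then get its own learned parameter setting; the greedy, error-driven splitting criterion mirrors how a decision tree chooses which feature to branch on.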