The data revolution continues to transform every sector of science, industry and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly -- in accordance with laws and ethical norms. In this article we discuss three recent regulatory frameworks: the European Union's General Data Protection Regulation (GDPR), the New York City Automated Decisions Systems (ADS) Law, and the Net neutrality principle, that aim to protect the rights of individuals who are impacted by data collection and analysis. These frameworks are prominent examples of a global trend: Governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this paper is to bring these regulatory frameworks to the attention of the data management community, and to underscore the technical challenges they raise and which we, as a community, are well-equipped to address. The main take-away of this article is that legal norms cannot be incorporated into data-driven systems as an afterthought. Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.
Providing a 360-degree view of a given data item especially for sensitive data is essential towards not only protecting the data and associated privacy but also assuring trust, compliance, and ethics of the systems that use or manage such data. With the advent of GDPR, California Data Privacy Law and such other regulatory requirements, it is essential to support data transparency in all such dimensions. Moreover, data transparency should not violate privacy and security requirements. In this paper, we put forward a vision for how data transparency would be achieved in a de-centralized fashion using blockchain technology.
A new era of Information Warfare has arrived. Various actors, including state-sponsored ones, are weaponizing information on Online Social Networks to run false information campaigns with targeted manipulation of public opinion on specific topics. These false information campaigns can have dire consequences to the public: mutating their opinions and actions, especially with respect to critical world events like major elections. Evidently, the problem of false information on the Web is a crucial one, and needs increased public awareness, as well as immediate attention from law enforcement agencies, public institutions, and in particular, the research community. In this paper, we make a step in this direction by providing a taxonomy of the Web's false information ecosystem, comprising various types of false information, actors, and their motives. We report a comprehensive overview of existing research on the false information ecosystem by identifying several lines of work: 1) how the public perceives false information; 2) understanding the propagation of false information; 3) detecting and containing false information on the Web; and 4) false information on the political stage.Finally, for each of these lines of work, we report several future research directions that can help us better understand and mitigate the emerging problem of false information dissemination on the Web.
This paper studies a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. (1) We propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover GFCs in large graphs with optimality guarantees. (a) We develop an algorithm that performs localized search to generate a stream of graph patterns, and dynamically assemble best GFCs from multiple GFCs sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a 1/2-µ approximation when it (early) terminates. (b) We also develop a space-efficient alternative that dynamically spawns prioritized patterns with best marginal gains to verified GFCs. It guarantees a 1-1/e approximation. Both strategies guarantee a bounded time cost independent with the size of underlying graph. (3) To support fact checking, we develop two classifiers, which make use of top ranked GFCs as predictive rules, or instance-level features of the pattern matches induced by GFCs, respectively. Using real-world data, we experimentally verify the efficiency and the effectiveness of GFC-based techniques for fact checking in knowledge graphs, and verify its application in knowledge exploration and news prediction.
Fake news are nowadays an issue of pressing concern, given their recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees, disagrees, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to tackle this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined/compared, complemented with similarity features inspired on other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, in order to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations together with similarity features, slightly outperforms the previous state-of-the-art.
In this article, we study the problem of automatic fact checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: the detection of check-worthy claims (here, in the context of political debates), and the verification of factual claims (here, answers to questions in a community question answering forum). We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues (encoding the discourse relations from a discourse parser) and contextual features. In the claim identification problem, we model the target claim in the context of the full intervention of a participant and the previous and the following turns in the debate, taking into account also contextual meta information. In the answer verification problem, we model the answer with respect to the entire question--answer thread in which it occurs, and with respect to other related posts from the entire forum. We develop annotated datasets for both tasks and we run an extensive experimental evaluation of the models, confirming that both types of information ---but especially contextual features--- play an important role for the performance of our claim check-worthiness prediction and of our answer verification systems.
Mappings of first name to gender have been widely recognized as a critical tool for the completion, study and validation of data records in a range of areas. In this study, we investigate how organizations with large databases of existing entities can create their own mappings between first names and gender and how these mappings can be improved and utilized. Therefore, we first explore a dataset with demographic information on more than 4 million people, which was provided by a car insurance company. Then, we study how naming conventions have changed over time and how they differ by nationality. Next, we build a probabilistic first-name-to-gender mapping and augment the mapping by adding nationality and decade of birth to improve the mappings performance. We test our mapping in two-label and three-label settings and further validate our mapping by categorizing patent filings by gender of the inventor. We compare the results with previous studies outcomes and find that our mapping produces high-precision results. We validate that the additional information of nationality and year of birth improve the precision scores of name-to-gender mappings. Therefore, the proposed approach constitutes an efficient process for improving the data quality of organizations records, if the gender attribute is missing or unreliable.
The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. Despite a great deal of research in this area, the academia still does not have a particular plan to confront this troublesome phenomenon. In this research, we focus on this point by assessing the readiness of academia against false and unverified information. To this end, we adopt the emergence framework and measure its different dimensions over more than 21000 articles, published by academia about false and unverified information. Our results show the current body of research had an organic growth so far, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external push strategy which compared to the early stage of the field, reinforces the emergence dimensions and cause to achieve a higher level in every dimension.
Deep learning requires significant training data. Experiments have shown both the volume and the quality of training data can significantly impact the effectiveness of the value extraction. In some cases, the volume of training data is not sufficiently large for effectively training a deep learning model. In other cases, the quality of training data is not high enough to achieve the optimal performance. Many approaches have been proposed for augmenting training data to mitigate the deficiency. However, whether the augmented data are ``fit for purpose" of deep learning is still a question. In this paper, we first discuss a data augmentation approach for deep learning. The approach includes two components: the first one is to remove noisy data in a dataset using a machine learning based classification to improve its quality, and the second one is to increase the volume of the dataset for effectively training a deep learning model. In order to evaluate the quality of the augmented data in fidelity, variety and veracity, a data quality evaluation framework is proposed. We demonstrated the effectiveness of the data augmentation approach and the data quality evaluation framework through studying an automated classification of biology cell images using deep learning. The experimental results clearly demonstrated the impact of the volume and quality of training data to the performance of deep learning and the importance of the data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning study in other domains.
With the internet video gaining increasing popularity and soaring to dominate the network traffic, extensive study is being carried out on how to achieve higher Quality of Experience (QoE) in its content delivery. Associated with the chunk-based streaming protocol, the Adaptive Bitrate (ABR) algorithms have recently emerged to cope with the diverse and fluctuating network conditions by dynamically adjusting bitrates for future chunks. This inevitably involves predicting the future throughput of a video session. Parameterized ABR simplifies the ABR design by abstracting all or part of the prediction of the network uncertainty into its parameters. In this paper, we consider the issue of learning the best settings of these parameters from the study of the backlogged throughput traces of previous video sessions. Essential to our study is how to properly partition the logged sessions according to those critical features that affect the network conditions, e.g. Internet Service Provider (ISP), geographical location etc. so that different parameter settings could be adopted in different situations to reach better prediction. We present our greedy approach to the feature-based partition. It follows the strategy explored in the Decision Tree. The performance of our partition algorithm has been evaluated on our throughput dataset with a sample parameterized ABR algorithm. The experiment shows that our approach can improve the average bitrate of the sample ABR algorithm by 36.1% without causing the increase of the rebuffering ratio where 99% of the sessions can get improvement. It can also improve the rebuffering ratio by 87.7% without causing the decrease of the average bitrate where, among those sessions involved in rebuffering, 82% receives improvement and 18% remains the same.
With increasing popularity of online review systems, a large volume of user-generated content helps people to make reasonable judgments about the quality of services or products of unknown providers. However, these platforms can be easily abused to become entrances for misinformation since malicious users can freely insert information into these systems without validation. Consequently, online review systems become targets of individual or professional spammers who insert deceptive reviews by manipulating ratings and content of the reviews. In this work, we propose a review spamming detection scheme based on aspect-specific opinions extracted from individual reviews and their deviations from the aggregated aspect-specific opinions. We propose to model the influence on the trustworthiness of the user due to his opinion deviation from the majority in the form of a deviation-based penalty, and integrate this penalty into the three-layer trust propagation framework to iteratively compute the trust scores for users, reviews, and target entities, respectively. The trust scores are effective indicators of spammers, since they reflect the overall deviation of a user from the aggregated aspect-specific opinions across all targets and all aspects. Experiments on the dataset collected from Yelp.com show that the proposed detection scheme based on aspect-specific content-aware trust propagation is able to measure users' trustworthiness based on opinions expressed in reviews.