In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that incorporating uncertainty into the classification process in this way is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
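To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of a Bayes-style relational classifier over an uncertain graph, where each edge carries an existence probability that discounts the vote of the neighbor's label. The toy graph, the homophily strength of 0.8, and the uniform prior are illustrative assumptions; the paper selects its parameters automatically.

```python
import math

# Hypothetical toy uncertain graph: (u, v) -> probability that the edge exists.
edges = {("a", "b"): 0.9, ("b", "c"): 0.4, ("a", "d"): 0.7, ("d", "c"): 0.8}
labels = {"a": "spam", "c": "ham"}   # nodes with known labels
classes = ["spam", "ham"]

def neighbors(node):
    for (u, v), p in edges.items():
        if u == node:
            yield v, p
        elif v == node:
            yield u, p

def classify(node, homophily=0.8, prior=0.5):
    """Naive-Bayes-style vote: each labeled neighbor is evidence for its own
    class, but the evidence is weighted by the edge's existence probability
    rather than counted as a certain link."""
    score = {c: math.log(prior) for c in classes}
    for nbr, p in neighbors(node):
        if nbr in labels:
            for c in classes:
                like = homophily if labels[nbr] == c else 1 - homophily
                # Mixture: with prob. p the link exists and the label is
                # evidence; with prob. 1-p the pair is uninformative (0.5).
                score[c] += math.log(p * like + (1 - p) * 0.5)
    return max(score, key=score.get)

for node in ["b", "d"]:
    print(node, "->", classify(node))
```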
Challenge Paper (no abstract). Excerpt: Financial markets respond to information. That information can be accurate or inaccurate (misinformation), but investors make rapid buy and sell decisions, often acting before verifying authenticity. The challenge for data and information quality researchers is to develop tools that detect fraud early, and strategies or decision rules that regulators can use to determine whether to suspend trading.
This paper studies a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. (1) We propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover GFCs in large graphs with optimality guarantees. (a) We develop an algorithm that performs localized search to generate a stream of graph patterns, and dynamically assembles the best GFCs from multiple GFC sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a (1/2 - µ)-approximation when it (early) terminates. (b) We also develop a space-efficient alternative that dynamically spawns prioritized patterns with the best marginal gains to the verified GFCs. It guarantees a (1 - 1/e)-approximation. Both strategies guarantee a time cost that is bounded and independent of the size of the underlying graph. (3) To support fact checking, we develop two classifiers, which make use of top-ranked GFCs as predictive rules, or of instance-level features of the pattern matches induced by GFCs, respectively. Using real-world data, we experimentally verify the efficiency and effectiveness of GFC-based techniques for fact checking in knowledge graphs, and demonstrate their applications in knowledge exploration and news prediction.
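To illustrate the flavor of the selection step, the sketch below shows the standard greedy skeleton for monotone submodular maximization, the setting from which (1 - 1/e)-style bounds classically arise. The rule identifiers and coverage sets are hypothetical, and the paper's actual algorithms operate on streams of graph patterns rather than a static dictionary.

```python
# Greedy selection of k rules under a monotone submodular coverage
# objective -- the classical setting behind (1 - 1/e) guarantees.
# `covers` maps a hypothetical rule id to the set of facts it classifies
# correctly; the marginal gain of a rule is the number of newly covered facts.
def greedy_top_k(covers, k):
    chosen, covered = [], set()
    for _ in range(k):
        best = max(covers, key=lambda r: len(covers[r] - covered), default=None)
        if best is None or not (covers[best] - covered):
            break   # no rule adds new coverage
        chosen.append(best)
        covered |= covers.pop(best)
    return chosen, covered

covers = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5}, "r4": {1, 5}}
print(greedy_top_k(covers, k=2))   # -> (['r1', 'r2'], {1, 2, 3, 4})
```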
Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine-learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees with, disagrees with, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined and compared, complemented with similarity features inspired by other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, in order to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations together with similarity features, slightly outperforms the previous state of the art.
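The following PyTorch sketch illustrates the overall encoder/classifier shape described above: a bi-directional GRU with neural attention summarizes each text, and the headline and article summaries are combined with similarity features before the final stance layer. All layer sizes, the elementwise-product combination, and the four similarity features are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttnGRUEncoder(nn.Module):
    """Bi-directional GRU with additive attention over time steps."""
    def __init__(self, vocab=5000, emb=100, hid=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)   # attention score per time step

    def forward(self, tokens):               # tokens: (batch, seq_len) ids
        out, _ = self.gru(self.emb(tokens))  # (batch, seq, 2*hid)
        weights = torch.softmax(self.attn(out).squeeze(-1), dim=1)
        return (weights.unsqueeze(-1) * out).sum(dim=1)   # attended summary

class StanceClassifier(nn.Module):
    def __init__(self, hid=64, n_sim_feats=4, n_classes=4):
        super().__init__()
        self.enc = AttnGRUEncoder(hid=hid)
        # headline repr + article repr + elementwise product + similarity feats
        self.out = nn.Linear(3 * 2 * hid + n_sim_feats, n_classes)

    def forward(self, headline, article, sim_feats):
        h, a = self.enc(headline), self.enc(article)
        combined = torch.cat([h, a, h * a, sim_feats], dim=-1)
        return self.out(combined)   # logits: agree/disagree/discuss/unrelated

model = StanceClassifier()
logits = model(torch.randint(0, 5000, (2, 12)),   # toy headline token ids
               torch.randint(0, 5000, (2, 80)),   # toy article token ids
               torch.rand(2, 4))                  # toy similarity features
print(logits.shape)   # torch.Size([2, 4])
```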
A multi-criteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users' needs. These needs are expressed as several criteria, which are used to evaluate the candidate data sources. An MCSS problem can be solved using multi-dimensional optimisation techniques that trade off the different objectives. Sometimes one may have uncertain knowledge regarding how well the candidate data sources meet the criteria. To overcome this uncertainty, one may rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. In this paper, a Targeted Feedback Collection (TFC) approach is proposed that aims to identify those data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criteria values, with a view to increasing the confidence in the estimates that are most relevant to the multi-dimensional optimisation. Variants of the TFC approach have been developed for use where feedback is expected to be reliable (e.g., where it is provided by trusted experts) and where it is expected to be unreliable (e.g., from crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments that involve real-world data sets and crowdsourcing.
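As a rough illustration of interval-guided targeting, the sketch below computes a Wald confidence interval for each (source, criterion) estimate and picks the widest interval as the next feedback target. The toy statistics and the choice of a Wald interval are assumptions for illustration, not the TFC algorithm itself, which additionally weighs relevance to the multi-dimensional optimisation.

```python
import math

def wald_interval(pos, n, z=1.96):
    """95% Wald confidence interval for a proportion of pos successes in n trials."""
    if n == 0:
        return 0.0, 1.0
    p = pos / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def next_target(stats):
    """Pick the (source, criterion) whose estimate has the widest interval,
    i.e., the estimate that contributes the most uncertainty."""
    def width(key):
        lo, hi = wald_interval(*stats[key])
        return hi - lo
    return max(stats, key=width)

# Hypothetical feedback tallies: (source, criterion) -> (positives, trials).
stats = {("src1", "accuracy"): (8, 10),
         ("src2", "accuracy"): (3, 4),
         ("src1", "freshness"): (40, 50)}
print("collect feedback on", next_target(stats))   # -> ('src2', 'accuracy')
```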
In this article, we study the problem of automatic fact checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: the detection of check-worthy claims (here, in the context of political debates) and the verification of factual claims (here, answers to questions in a community question answering forum). We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues (encoding the discourse relations produced by a discourse parser) and contextual features. In the claim identification problem, we model the target claim in the context of the full intervention of a participant and the previous and following turns in the debate, also taking into account contextual meta-information. In the answer verification problem, we model the answer with respect to the entire question-answer thread in which it occurs, and with respect to other related posts from the entire forum. We develop annotated datasets for both tasks and run an extensive experimental evaluation of the models, confirming that both types of information, but especially contextual features, play an important role in the performance of our claim check-worthiness prediction and answer verification systems.
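A minimal sketch of the feature-combination idea follows, assuming a handful of invented claim-level and contextual cues fed to a kernel SVM; the real systems use much richer discourse and metadata representations, and neural models besides.

```python
import numpy as np
from sklearn.svm import SVC

# Represent each claim as a vector concatenating claim-level cues with
# contextual cues from the surrounding debate turns. All feature names
# and values here are invented for illustration.
def featurize(claim):
    return np.array([
        claim["n_numbers"],             # claim-level: numeric mentions
        claim["sentiment"],             # claim-level: sentiment score
        claim["prev_turn_overlap"],     # contextual: overlap with prior turn
        claim["speaker_is_moderator"],  # contextual meta-information
    ], dtype=float)

claims = [
    {"n_numbers": 3, "sentiment": -0.2, "prev_turn_overlap": 0.6, "speaker_is_moderator": 0, "checkworthy": 1},
    {"n_numbers": 0, "sentiment": 0.4, "prev_turn_overlap": 0.1, "speaker_is_moderator": 1, "checkworthy": 0},
    {"n_numbers": 2, "sentiment": -0.5, "prev_turn_overlap": 0.7, "speaker_is_moderator": 0, "checkworthy": 1},
    {"n_numbers": 1, "sentiment": 0.1, "prev_turn_overlap": 0.2, "speaker_is_moderator": 1, "checkworthy": 0},
]
X = np.stack([featurize(c) for c in claims])
y = np.array([c["checkworthy"] for c in claims])

clf = SVC(kernel="rbf").fit(X, y)   # kernel-based SVM as in the abstract
print(clf.predict(X[:1]))
```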
Mappings from first name to gender have been widely recognized as a critical tool for the completion, study, and validation of data records in a range of areas. In this study, we investigate how organizations with large databases of existing entities can create their own mappings between first names and gender, and how these mappings can be improved and utilized. To this end, we first explore a dataset with demographic information on more than 4 million people, provided by a car insurance company. We then study how naming conventions have changed over time and how they differ by nationality. Next, we build a probabilistic first-name-to-gender mapping and augment it with nationality and decade of birth to improve its performance. We test our mapping in two-label and three-label settings and further validate it by categorizing patent filings by the gender of the inventor. We compare the results with previous studies' outcomes and find that our mapping produces high-precision results. We confirm that the additional information on nationality and year of birth improves the precision of name-to-gender mappings. The proposed approach therefore constitutes an efficient process for improving the data quality of organizations' records when the gender attribute is missing or unreliable.
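The sketch below shows one way such a probabilistic mapping could be organized: counts conditioned on (name, nationality, decade), with a backoff to name-only counts when the finer-grained cell is sparse. The toy records, the minimum-support threshold, and the backoff rule are illustrative assumptions, not the paper's construction.

```python
from collections import Counter, defaultdict

# Toy stand-ins for the demographic records: (name, nationality, decade, gender).
records = [
    ("andrea", "IT", 1970, "m"), ("andrea", "DE", 1980, "f"),
    ("andrea", "DE", 1980, "f"), ("andrea", "IT", 1970, "m"),
    ("kim", "US", 1990, "f"),
]

by_cell = defaultdict(Counter)   # (name, nationality, decade) -> gender counts
by_name = defaultdict(Counter)   # name -> gender counts
for name, nat, decade, gender in records:
    by_cell[(name, nat, decade)][gender] += 1
    by_name[name][gender] += 1

def p_gender(name, nat=None, decade=None, min_support=2):
    """P(gender | name [, nationality, decade]), backing off to the
    unconditioned counts when the finer cell has too little data."""
    cell = by_cell.get((name, nat, decade), Counter())
    counts = cell if sum(cell.values()) >= min_support else by_name[name]
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

print(p_gender("andrea"))                          # ambiguous without context
print(p_gender("andrea", nat="DE", decade=1980))   # context resolves it
```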
Governance, risk, and compliance (GRC) managers often struggle to document the current state of their organization, owing to the complexity of their information systems landscape, a complex regulatory and organizational environment, and frequent changes. GRC tools seek to support them by integrating existing information sources. However, a comprehensive analysis of how data is managed in such tools, as well as of the impact of its quality, is still missing. To build an empirical basis, we conducted a series of interviews with information security managers responsible for GRC management activities in their organizations. The results of a qualitative content analysis of these interviews suggest that decision-makers depend heavily on high-quality documentation but struggle to maintain their documentation at the required level for longer periods of time. Besides discussing factors affecting the quality of GRC data and information, this work also provides insights into approaches implemented by organizations to analyze, improve, and maintain the quality of their GRC data and information.
High-quality data is critical for effective data science. With the increasing use of data science in all modern endeavors, concern about irresponsible data use has led to a push for responsible data science, to ensure that the insights gained do not come at the steep cost of violating the privacy of individuals. This has led to the development of data protection regulations around the globe and to the use of sophisticated anonymization techniques to protect privacy. Such measures make it harder for the data scientist to understand the data, exacerbating the issue of ensuring high data quality. In this paper, we pose the high-level problem: how can a data scientist develop the needed trust that private data has high quality? We then identify a series of challenges for various data-centric communities and outline research questions for data quality and privacy researchers that would need to be addressed to effectively answer this problem.