
ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Ensuring High-Quality Private Data for Responsible Data Science: Vision and Challenges

High-quality data is critical for effective data science. As the use of data science has grown, so too have concerns that individuals’ rights... (more)

Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection

A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’... (more)

Improving Classification Quality in Uncertain Graphs

In many real applications that use and analyze networked data, the links in the network graph may be erroneous or derived from probabilistic... (more)

NEWS

October 2018 - Call for papers:

Special Issue on Quality Assessment of Knowledge Graphs
Initial submission deadline: 3 March 2019

Other news:

Special Issue on Combating Digital Misinformation and Disinformation
Status: Review in progress

Special Issue on Reproducibility in Information Retrieval
Two-part special issue:
- Evaluation Campaigns, Collections and Analyses (Vol. 10, Issue 3, Oct. 2018)
- Tools and Infrastructures (Vol. 10, Issue 4, Oct. 2018)

On the Horizon challenge papers

From 2019, JDIQ will accept a new type of contribution called "On the Horizon". These manuscripts, which can be submitted by invitation only, will be written by top researchers in the field of Data Quality. Their aim is to introduce emerging topics in the field of Data Quality, discussing why they are rising, their challenging aspects, and the envisioned solutions.

Discovering Patterns for Fact Checking in Knowledge Graphs

This paper studies a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. (1) We propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover GFCs in large graphs with optimality guarantees. (a) We develop an algorithm that performs localized search to generate a stream of graph patterns and dynamically assembles the best GFCs from multiple GFC sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a (1/2 - µ)-approximation even when it terminates early. (b) We also develop a space-efficient alternative that dynamically spawns prioritized patterns with the best marginal gains to the verified GFCs. It guarantees a (1 - 1/e)-approximation. Both strategies guarantee a bounded time cost independent of the size of the underlying graph. (3) To support fact checking, we develop two classifiers, which make use of top-ranked GFCs as predictive rules or of instance-level features of the pattern matches induced by GFCs, respectively. Using real-world data, we experimentally verify the efficiency and effectiveness of GFC-based techniques for fact checking in knowledge graphs, and we verify their applicability to knowledge exploration and news prediction.
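To make the rule-selection step above concrete, here is a minimal, hypothetical Python sketch of greedy selection by marginal gain, balancing a discriminancy score against a diversity term. The rule representation, the scoring weights, and the toy data are illustrative assumptions, not the authors' exact GFC discovery algorithm.

def diversity(rule, selected):
    """Toy diversity term: fraction of a rule's matched facts not already
    covered by the rules selected so far."""
    covered = set().union(*(r["matches"] for r in selected))
    return len(rule["matches"] - covered) / max(len(rule["matches"]), 1)

def marginal_gain(rule, selected, alpha=0.5):
    """Weighted mix of how discriminant a rule is and how much it adds
    to the diversity of the current selection."""
    return alpha * rule["discriminancy"] + (1 - alpha) * diversity(rule, selected)

def select_rules(candidates, k):
    """Greedily pick k rules with the best marginal gain; greedy selection of
    this kind is the standard route to (1 - 1/e)-style guarantees for monotone
    submodular objectives, which the paper's streaming variants refine."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda r: marginal_gain(r, selected))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy candidate rules: a name, a discriminancy score, and the fact ids they match.
rules = [
    {"name": "bornIn->nationality", "discriminancy": 0.9, "matches": {1, 2, 3}},
    {"name": "worksAt->livesIn",    "discriminancy": 0.7, "matches": {3, 4}},
    {"name": "spouse->spouse",      "discriminancy": 0.6, "matches": {1, 2}},
]
print([r["name"] for r in select_rules(rules, 2)])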

Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News

Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees with, disagrees with, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined and compared, complemented with similarity features inspired by other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations with similarity features, slightly outperforms the previous state of the art.
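As a rough illustration of the kind of architecture described above (not the authors' exact configuration), the following PyTorch sketch combines bi-directional GRU encoders with a simple attention layer and concatenates the resulting headline and article representations with hand-crafted similarity features before a four-way stance prediction. Dimensions, the vocabulary handling, and the similarity features themselves are assumptions; the paper additionally encodes the first two article sentences separately.

import torch
import torch.nn as nn

class AttnBiGRUEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        h, _ = self.gru(self.emb(tokens))             # (batch, seq_len, 2*hid)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        return (weights * h).sum(dim=1)               # (batch, 2*hid)

class StanceClassifier(nn.Module):
    def __init__(self, vocab_size, n_sim_features=5, n_classes=4):
        super().__init__()
        self.headline_enc = AttnBiGRUEncoder(vocab_size)
        self.article_enc = AttnBiGRUEncoder(vocab_size)
        # two 256-dim encoder outputs plus the similarity features
        self.out = nn.Linear(512 + n_sim_features, n_classes)

    def forward(self, headline, article, sim_features):
        h = self.headline_enc(headline)
        a = self.article_enc(article)
        return self.out(torch.cat([h, a, sim_features], dim=1))

model = StanceClassifier(vocab_size=20000)
headline = torch.randint(1, 20000, (2, 12))    # toy batch of 2 headlines
article = torch.randint(1, 20000, (2, 300))    # toy batch of 2 article bodies
sims = torch.rand(2, 5)                        # toy similarity features
print(model(headline, article, sims).shape)    # -> torch.Size([2, 4])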

Automatic Fact Checking Using Context and Discourse Information

In this article, we study the problem of automatic fact checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: the detection of check-worthy claims (here, in the context of political debates) and the verification of factual claims (here, answers to questions in a community question answering forum). We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues (encoding the discourse relations from a discourse parser) and contextual features. In the claim identification problem, we model the target claim in the context of the full intervention of a participant and the previous and following turns in the debate, also taking into account contextual meta-information. In the answer verification problem, we model the answer with respect to the entire question-answer thread in which it occurs and with respect to other related posts from the entire forum. We develop annotated datasets for both tasks and run an extensive experimental evaluation of the models, confirming that both types of information, but especially the contextual features, play an important role in the performance of our claim check-worthiness prediction and answer verification systems.
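As a minimal sketch of combining textual and contextual features in a kernel-based classifier, the snippet below concatenates TF-IDF claim representations with simple contextual features and feeds them to an RBF-kernel SVM. The feature names and the toy data are hypothetical and do not reproduce the authors' annotated datasets or their discourse-parser features.

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy debate sentences, labelled 1 = check-worthy claim, 0 = not check-worthy.
claims = ["We cut the deficit in half.",
          "Thank you all for coming tonight.",
          "Unemployment fell below four percent last year.",
          "It is great to be back in this city."]
labels = [1, 0, 1, 0]

# Hypothetical contextual features: relative position of the sentence in the
# speaker's turn and a count of numeric tokens in the sentence.
context = np.array([[0.1, 1], [0.9, 0], [0.3, 2], [0.8, 0]], dtype=float)

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(claims)          # sparse bag-of-words features
X = hstack([X_text, csr_matrix(context)])     # concatenate text and context

clf = SVC(kernel="rbf", gamma="scale")        # kernel-based SVM classifier
clf.fit(X, labels)
print(clf.predict(X))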

Augmenting Data Quality through High-Precision Gender Categorization

Mappings of first name to gender have been widely recognized as a critical tool for the completion, study, and validation of data records in a range of areas. In this study, we investigate how organizations with large databases of existing entities can create their own mappings between first names and gender, and how these mappings can be improved and utilized. To this end, we first explore a dataset with demographic information on more than 4 million people, provided by a car insurance company. Then, we study how naming conventions have changed over time and how they differ by nationality. Next, we build a probabilistic first-name-to-gender mapping and augment it with nationality and decade of birth to improve the mapping's performance. We test our mapping in two-label and three-label settings and further validate it by categorizing patent filings by gender of the inventor. We compare the results with previous studies' outcomes and find that our mapping produces high-precision results. We validate that the additional information on nationality and year of birth improves the precision scores of name-to-gender mappings. The proposed approach therefore constitutes an efficient process for improving the data quality of organizations' records when the gender attribute is missing or unreliable.
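A minimal sketch of the core idea, assuming a simple record layout and a hypothetical precision threshold: count gender frequencies per (first name, nationality, decade of birth) key and abstain in the three-label setting when the dominant gender does not clear the threshold.

from collections import Counter, defaultdict

# Toy records: (first name, gender, year of birth, nationality).
records = [
    ("andrea", "f", 1982, "DE"), ("andrea", "f", 1985, "DE"),
    ("andrea", "m", 1984, "IT"), ("maria",  "f", 1971, "ES"),
]

counts = defaultdict(Counter)
for name, gender, year, country in records:
    decade = (year // 10) * 10
    counts[(name, country, decade)][gender] += 1   # augmented mapping key

def predict_gender(name, country, year, threshold=0.9):
    """Return 'f', 'm', or 'unknown' (three-label setting) depending on whether
    the dominant gender for the augmented key clears the precision threshold."""
    c = counts.get((name, country, (year // 10) * 10))
    if not c:
        return "unknown"
    gender, n = c.most_common(1)[0]
    return gender if n / sum(c.values()) >= threshold else "unknown"

print(predict_gender("andrea", "DE", 1983))   # -> 'f'
print(predict_gender("andrea", "IT", 1983))   # -> 'm'
print(predict_gender("andrea", "FR", 1983))   # -> 'unknown'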

Experience: Data and Information Quality Challenges in Governance, Risk and Compliance Management

Governance, risk, and compliance (GRC) managers often struggle with the documentation of the current state of their organization due to the complexity of their information systems landscape, the complex regulatory and organizational environment, and frequent changes. GRC tools seek to support them by integrating existing information sources. However, a comprehensive analysis of how data is managed in such tools, as well as of the impact of its quality, is still missing. To build an empirical basis, we conducted a series of interviews with information security managers responsible for GRC management activities in their organizations. The results of a qualitative content analysis of these interviews suggest that decision-makers depend heavily on high-quality documentation but struggle to maintain their documentation at the required level for longer periods of time. Besides discussing factors affecting the quality of GRC data and information, this work also provides insights into approaches implemented by organizations to analyze, improve, and maintain the quality of their GRC data and information.

Content-Aware Trust Propagation Towards Online Review Spam Detection

With the increasing popularity of online review systems, a large volume of user-generated content helps people to make reasonable judgments about the quality of services or products from unknown providers. However, these platforms can easily be abused as entry points for misinformation, since malicious users can freely insert information into these systems without validation. Consequently, online review systems become targets of individual or professional spammers who insert deceptive reviews by manipulating the ratings and content of reviews. In this work, we propose a review spam detection scheme based on aspect-specific opinions extracted from individual reviews and their deviations from the aggregated aspect-specific opinions. We model the influence of a user's opinion deviation from the majority on that user's trustworthiness in the form of a deviation-based penalty, and we integrate this penalty into a three-layer trust propagation framework to iteratively compute trust scores for users, reviews, and target entities, respectively. The trust scores are effective indicators of spammers, since they reflect the overall deviation of a user from the aggregated aspect-specific opinions across all targets and all aspects. Experiments on a dataset collected from Yelp.com show that the proposed detection scheme based on aspect-specific, content-aware trust propagation is able to measure users' trustworthiness based on the opinions expressed in their reviews.
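A toy sketch of the propagation idea, under assumed penalty and aggregation forms rather than the paper's exact model: target opinions are trust-weighted averages of review opinions, and user trust is reduced by the average deviation of the user's opinions from those aggregates, iterated to a fixed point.

# Toy reviews: (user, target entity, aspect-specific opinion score in [0, 1]).
reviews = [
    ("u1", "restaurantA", 0.90),
    ("u2", "restaurantA", 0.85),
    ("u3", "restaurantA", 0.10),   # deviates strongly from the consensus
]

user_trust = {u: 1.0 for u, _, _ in reviews}

for _ in range(10):                                   # iterate towards a fixed point
    # Target opinion = trust-weighted average of the opinions in its reviews.
    target_opinion = {}
    for t in {t for _, t, _ in reviews}:
        num = sum(user_trust[u] * op for u, tt, op in reviews if tt == t)
        den = sum(user_trust[u] for u, tt, op in reviews if tt == t)
        target_opinion[t] = num / den
    # User trust = 1 minus the average deviation of the user's opinions
    # from the aggregated opinions (the deviation-based penalty).
    for u in user_trust:
        devs = [abs(op - target_opinion[t]) for uu, t, op in reviews if uu == u]
        user_trust[u] = 1.0 - sum(devs) / len(devs)

print(user_trust)   # u3 ends up with a noticeably lower trust score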

Data Quality Challenges with Missing Values and Mixed Types in Joint Sequence Analysis

The goal of this paper is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in socio-demographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering, on life course trajectories is challenging due to the categorical and multi-dimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were previously investigated for single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset with mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an edit distance based on Optimal Matching (OM). Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can introduce systematic biases into dissimilarity matrices and subsequently produce artificial clusters as well as unrealistic interpretations of the associated data domains. We demonstrate the use of t-distributed Stochastic Neighbor Embedding (t-SNE) to visually guide the mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
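For readers unfamiliar with Optimal Matching, the sketch below shows a standard dynamic-programming edit distance with a separate, tunable substitution cost for a missing-value state. The cost values, the missing-value marker '*', and the toy sequences are illustrative assumptions, not the paper's configuration.

def om_distance(a, b, indel=1.0, sub=2.0, missing_sub=1.0):
    """Classic dynamic-programming edit distance between two categorical
    sequences; substituting against a missing value ('*') is charged
    missing_sub instead of the full sub cost, which is the kind of parameter
    the paper suggests tuning to curb missing-value bias."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0.0
            elif a[i - 1] == "*" or b[j - 1] == "*":
                cost = missing_sub
            else:
                cost = sub
            d[i][j] = min(d[i - 1][j] + indel,
                          d[i][j - 1] + indel,
                          d[i - 1][j - 1] + cost)
    return d[n][m]

# Two toy employment trajectories; '*' marks a missing year.
print(om_distance(list("EEUUE"), list("EE*UE")))   # small distance despite the gap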
