ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Ensuring High-Quality Private Data for Responsible Data Science: Vision and Challenges

High-quality data is critical for effective data science. As the use of data science has grown, so too have concerns that individuals’ rights...

Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection

A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’...

Improving Classification Quality in Uncertain Graphs

In many real applications that use and analyze networked data, the links in the network graph may be erroneous or derived from probabilistic...


October, 2018 - Call for papers:

Special Issue on Quality Assessment of Knowledge Graphs
Initial submission deadline:
- 3 March 2019

Other news:

Special Issue on Combating Digital Misinformation and Disinformation
Status: Reviewing in progress

Special Issue on Reproducibility in Information Retrieval
Two-part special issue:
- Evaluation Campaigns, Collections and Analyses (Vol. 10, Issue 3, Oct. 2018)
- Tools and Infrastructures (Vol. 10, Issue 4, Oct. 2018)

On the Horizon challenge papers

From 2019, JDIQ will accept a new type of contribution called "On the Horizon". These manuscripts, which can be submitted by invitation only, will be written by top researchers in the field of Data Quality. They aim to introduce emerging topics in the field of Data Quality and to discuss why these topics are emerging, their challenging aspects, and the envisioned solutions.

Dependencies for Graphs: Challenges and Opportunities

What are graph dependencies? What do we need them for? What new challenges do they introduce? This paper tackles these questions. It aims to incite curiosity and interest in this emerging area of research.

Transparency, Fairness, Data Protection, Neutrality: Data Management Challenges in the Face of New Regulation

The data revolution continues to transform every sector of science, industry and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly, in accordance with laws and ethical norms. In this article we discuss three recent regulatory frameworks that aim to protect the rights of individuals who are impacted by data collection and analysis: the European Union's General Data Protection Regulation (GDPR), the New York City Automated Decisions Systems (ADS) Law, and the net neutrality principle. These frameworks are prominent examples of a global trend: governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this paper is to bring these regulatory frameworks to the attention of the data management community, and to underscore the technical challenges they raise, which we, as a community, are well-equipped to address. The main take-away of this article is that legal norms cannot be incorporated into data-driven systems as an afterthought. Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.

Data Transparency with Blockchain and AI Ethics

Providing a 360-degree view of a given data item, especially for sensitive data, is essential not only for protecting the data and associated privacy but also for assuring trust, compliance, and ethics of the systems that use or manage such data. With the advent of GDPR, the California Data Privacy Law, and other such regulatory requirements, it is essential to support data transparency in all such dimensions. Moreover, data transparency should not violate privacy and security requirements. In this paper, we put forward a vision for how data transparency would be achieved in a decentralized fashion using blockchain technology.

The Web of False Information: Rumors, Fake News, Hoaxes, Clickbait, and Various Other Shenanigans

A new era of Information Warfare has arrived. Various actors, including state-sponsored ones, are weaponizing information on Online Social Networks to run false information campaigns with targeted manipulation of public opinion on specific topics. These false information campaigns can have dire consequences for the public: mutating their opinions and actions, especially with respect to critical world events like major elections. Evidently, the problem of false information on the Web is a crucial one, and needs increased public awareness, as well as immediate attention from law enforcement agencies, public institutions, and in particular, the research community. In this paper, we make a step in this direction by providing a taxonomy of the Web's false information ecosystem, comprising various types of false information, actors, and their motives. We report a comprehensive overview of existing research on the false information ecosystem by identifying several lines of work: 1) how the public perceives false information; 2) understanding the propagation of false information; 3) detecting and containing false information on the Web; and 4) false information on the political stage. Finally, for each of these lines of work, we report several future research directions that can help us better understand and mitigate the emerging problem of false information dissemination on the Web.

Discovering Patterns for Fact Checking in Knowledge Graphs

This paper studies a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. (1) We propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover GFCs in large graphs with optimality guarantees. (a) We develop an algorithm that performs localized search to generate a stream of graph patterns, and dynamically assembles the best GFCs from multiple GFC sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a 1/2-µ approximation when it (early) terminates. (b) We also develop a space-efficient alternative that dynamically spawns prioritized patterns with best marginal gains to verified GFCs. It guarantees a 1-1/e approximation. Both strategies guarantee a bounded time cost independent of the size of the underlying graph. (3) To support fact checking, we develop two classifiers, which make use of top ranked GFCs as predictive rules, or instance-level features of the pattern matches induced by GFCs, respectively. Using real-world data, we experimentally verify the efficiency and the effectiveness of GFC-based techniques for fact checking in knowledge graphs, and verify their application in knowledge exploration and news prediction.
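As an illustration of where a 1-1/e guarantee typically comes from, the following minimal sketch shows the classic greedy rule for a monotone submodular objective (here, coverage) under a size limit k: repeatedly pick the candidate with the best marginal gain. This is a generic textbook technique, not the paper's GFC discovery algorithm. The function name `greedy_select`, the pattern names, and the idea of scoring a pattern by the set of facts it covers are all assumptions made for the example.

```python
def greedy_select(candidates, k):
    """Pick up to k candidates, each time taking the best marginal gain.

    candidates: dict mapping a (hypothetical) pattern name to the set of
    fact ids it covers. Coverage is monotone submodular, so this greedy
    rule achieves at least a (1 - 1/e) fraction of the optimal coverage.
    """
    remaining = dict(candidates)
    selected, covered = [], set()
    for _ in range(k):
        # Candidate whose uncovered facts add the most to current coverage.
        best = max(remaining, key=lambda name: len(remaining[name] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new coverage
        covered |= remaining.pop(best)
        selected.append(best)
    return selected, covered


if __name__ == "__main__":
    patterns = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5, 6, 7}, "p4": {1, 2}}
    print(greedy_select(patterns, k=2))
```

With the toy input above, the first pick is the pattern covering four new facts, and the second pick is the one that adds three more.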

Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News

Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees, disagrees, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to tackle this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined/compared, complemented with similarity features inspired by other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, in order to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations together with similarity features, slightly outperforms the previous state-of-the-art.

Automatic Fact Checking Using Context and Discourse Information

In this article, we study the problem of automatic fact checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: the detection of check-worthy claims (here, in the context of political debates), and the verification of factual claims (here, answers to questions in a community question answering forum). We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues (encoding the discourse relations from a discourse parser) and contextual features. In the claim identification problem, we model the target claim in the context of the full intervention of a participant and the previous and the following turns in the debate, also taking into account contextual meta information. In the answer verification problem, we model the answer with respect to the entire question-answer thread in which it occurs, and with respect to other related posts from the entire forum. We develop annotated datasets for both tasks and we run an extensive experimental evaluation of the models, confirming that both types of information, but especially contextual features, play an important role in the performance of our claim check-worthiness prediction and of our answer verification systems.

Augmenting Data Quality through High-Precision Gender Categorization

Mappings of first name to gender have been widely recognized as a critical tool for the completion, study and validation of data records in a range of areas. In this study, we investigate how organizations with large databases of existing entities can create their own mappings between first names and gender and how these mappings can be improved and utilized. To this end, we first explore a dataset with demographic information on more than 4 million people, which was provided by a car insurance company. Then, we study how naming conventions have changed over time and how they differ by nationality. Next, we build a probabilistic first-name-to-gender mapping and augment the mapping by adding nationality and decade of birth to improve the mapping's performance. We test our mapping in two-label and three-label settings and further validate our mapping by categorizing patent filings by gender of the inventor. We compare the results with previous studies' outcomes and find that our mapping produces high-precision results. We validate that the additional information of nationality and year of birth improves the precision scores of name-to-gender mappings. Therefore, the proposed approach constitutes an efficient process for improving the data quality of organizations' records if the gender attribute is missing or unreliable.
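To make the idea of a probabilistic name-to-gender mapping augmented with nationality and decade of birth more concrete, here is a minimal sketch. It is not the authors' implementation: the function names `build_mapping` and `predict`, the record layout, the `threshold` and `min_support` parameters, the fallback from the specific key to the name-only key, and the reading of the third label as "unknown" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_mapping(records):
    """Count genders per key from (first_name, nationality, birth_year, gender) rows."""
    counts = defaultdict(Counter)
    for name, nationality, birth_year, gender in records:
        name = name.strip().lower()
        decade = birth_year // 10 * 10
        counts[(name,)][gender] += 1                      # name only
        counts[(name, nationality, decade)][gender] += 1  # name + context
    return counts


def predict(counts, name, nationality=None, birth_year=None,
            threshold=0.9, min_support=5):
    """Return (label, probability); the label is "unknown" below the threshold."""
    name = name.strip().lower()
    keys = []
    if nationality is not None and birth_year is not None:
        keys.append((name, nationality, birth_year // 10 * 10))
    keys.append((name,))  # fall back to the name-only statistics
    for key in keys:
        tally = counts.get(key)
        if tally and sum(tally.values()) >= min_support:
            gender, n = tally.most_common(1)[0]
            p = n / sum(tally.values())
            return (gender, p) if p >= threshold else ("unknown", p)
    return ("unknown", 0.0)


if __name__ == "__main__":
    rows = ([("Andrea", "IT", 1975, "M")] * 6
            + [("Andrea", "DE", 1982, "F")] * 10)
    mapping = build_mapping(rows)
    print(predict(mapping, "Andrea"))
    print(predict(mapping, "Andrea", "IT", 1971))
```

In this toy data the name alone is ambiguous, while adding nationality and decade of birth resolves it, which is the kind of precision gain the abstract describes.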

Experience: Data and Information Quality Challenges in Governance, Risk and Compliance Management

Governance, risk and compliance (GRC) managers often struggle with the documentation of the current state of their organization due to the complexity of their information systems landscape, the complex regulatory and organizational environment and frequent changes. Governance, risk and compliance tools seek to support them by integrating existing information sources. However, a comprehensive analysis of how the data is managed in such tools as well as the impact of its quality is still missing. To build an empirical basis, we conducted a series of interviews with information security managers responsible for GRC management activities in their organizations. The results of a qualitative content analysis of these interviews suggest that decision-makers largely depend on high-quality documentation but struggle to maintain their documentation at the required level for longer periods of time. Besides discussing factors affecting the quality of GRC data and information, this work also provides insights into approaches implemented by organizations to analyze, improve and maintain the quality of their GRC data and information.

Assessing the Readiness of the Academia in the Topic of False and Unverified Information

The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. Despite a great deal of research in this area, academia still does not have a particular plan to confront this troublesome phenomenon. In this research, we focus on this point by assessing the readiness of academia against false and unverified information. To this end, we adopt the emergence framework and measure its different dimensions over more than 21,000 articles published by academia about false and unverified information. Our results show that the current body of research has grown organically so far, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external push strategy which, compared to the early stage of the field, reinforces the emergence dimensions and leads to a higher level in every dimension.

Improving Adaptive Video Streaming through Session Classification

With internet video gaining increasing popularity and soaring to dominate network traffic, extensive study is being carried out on how to achieve higher Quality of Experience (QoE) in its content delivery. Associated with the chunk-based streaming protocol, Adaptive Bitrate (ABR) algorithms have recently emerged to cope with diverse and fluctuating network conditions by dynamically adjusting bitrates for future chunks. This inevitably involves predicting the future throughput of a video session. Parameterized ABR simplifies the ABR design by abstracting all or part of the prediction of the network uncertainty into its parameters. In this paper, we consider the issue of learning the best settings of these parameters from the study of the backlogged throughput traces of previous video sessions. Essential to our study is how to properly partition the logged sessions according to those critical features that affect the network conditions, e.g., Internet Service Provider (ISP), geographical location, etc., so that different parameter settings can be adopted in different situations to reach better prediction. We present our greedy approach to the feature-based partition. It follows the strategy explored in the Decision Tree. The performance of our partition algorithm has been evaluated on our throughput dataset with a sample parameterized ABR algorithm. The experiment shows that our approach can improve the average bitrate of the sample ABR algorithm by 36.1% without increasing the rebuffering ratio, with 99% of the sessions seeing an improvement. It can also improve the rebuffering ratio by 87.7% without decreasing the average bitrate; among those sessions involved in rebuffering, 82% receive an improvement and 18% remain the same.
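The greedy, decision-tree-style partition described above can be pictured with a minimal sketch of a single split step. The paper's actual objective is the quality of the ABR parameter settings; here, as a stated stand-in, the split criterion is the reduction in squared error of per-group mean throughput. The session layout, the field names `isp`, `city` and `throughput`, and the functions `sse` and `best_split` are assumptions made for the example.

```python
from collections import defaultdict


def sse(values):
    """Sum of squared errors around the mean of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(sessions, features):
    """Greedily pick the feature whose partition most reduces throughput SSE.

    sessions: list of dicts with categorical features and a "throughput" value.
    Returns (feature, gain); feature is None when no split helps.
    """
    base = sse([s["throughput"] for s in sessions])
    best_feature, best_gain = None, 0.0
    for feature in features:
        groups = defaultdict(list)
        for s in sessions:
            groups[s[feature]].append(s["throughput"])
        gain = base - sum(sse(g) for g in groups.values())
        if gain > best_gain:
            best_feature, best_gain = feature, gain
    return best_feature, best_gain


if __name__ == "__main__":
    logged = [
        {"isp": "A", "city": "X", "throughput": 1.0},
        {"isp": "A", "city": "Y", "throughput": 1.0},
        {"isp": "B", "city": "X", "throughput": 5.0},
        {"isp": "B", "city": "Y", "throughput": 5.0},
    ]
    print(best_split(logged, ["isp", "city"]))
```

Applying `best_split` recursively to each resulting group would grow the full partition; a separate parameter setting could then be learned per leaf.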

Content-Aware Trust Propagation Towards Online Review Spam Detection

With increasing popularity of online review systems, a large volume of user-generated content helps people to make reasonable judgments about the quality of services or products of unknown providers. However, these platforms can be easily abused to become entrances for misinformation since malicious users can freely insert information into these systems without validation. Consequently, online review systems become targets of individual or professional spammers who insert deceptive reviews by manipulating ratings and content of the reviews. In this work, we propose a review spamming detection scheme based on aspect-specific opinions extracted from individual reviews and their deviations from the aggregated aspect-specific opinions. We propose to model the influence on the trustworthiness of the user due to his opinion deviation from the majority in the form of a deviation-based penalty, and integrate this penalty into the three-layer trust propagation framework to iteratively compute the trust scores for users, reviews, and target entities, respectively. The trust scores are effective indicators of spammers, since they reflect the overall deviation of a user from the aggregated aspect-specific opinions across all targets and all aspects. Experiments on the dataset collected from show that the proposed detection scheme based on aspect-specific content-aware trust propagation is able to measure users' trustworthiness based on opinions expressed in reviews.

Data Quality Challenges with Missing Values and Mixed Types in Joint Sequence Analysis

The goal of this paper is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in socio-demographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multi-dimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were investigated previously on single variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, together with dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an edit distance using Optimal Matching (OM). Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce artificial clusters as well as unrealistic interpretations of associated data domains. We demonstrate the usage of t-distributed Stochastic Neighbor Embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
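The role of the missing value substitution cost can be seen in a minimal sketch of an Optimal Matching style edit distance. This is a generic dynamic-programming edit distance, not the paper's dissimilarity measure: the function name `om_distance`, the use of `None` as the missing marker, the default costs, and the rule that two identical states (including two missing values) cost nothing to align are all assumptions made for illustration.

```python
MISSING = None  # assumed marker for a missing state in a sequence


def om_distance(a, b, sub_cost=2.0, indel_cost=1.0, missing_cost=0.0):
    """Edit distance between two categorical sequences.

    Substituting a missing state for an observed one costs `missing_cost`;
    substituting two different observed states costs `sub_cost`.
    """
    def substitution(x, y):
        if x == y:
            return 0.0
        if x is MISSING or y is MISSING:
            return missing_cost
        return sub_cost

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel_cost] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + indel_cost,       # delete a[i-1]
                         cur[j - 1] + indel_cost,    # insert b[j-1]
                         prev[j - 1] + substitution(a[i - 1], b[j - 1]))
        prev = cur
    return prev[len(b)]


if __name__ == "__main__":
    observed = ["S", "S", "M", "M"]
    leading_gap = [MISSING, MISSING, "M", "M"]
    print(om_distance(leading_gap, observed, missing_cost=0.0))
    print(om_distance(leading_gap, observed, missing_cost=1.0))
```

With a zero missing cost, a sequence with leading missing values becomes indistinguishable from any sequence sharing its observed tail, which hints at how such alignments can bias a dissimilarity matrix; raising the cost separates them again.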
