ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Data Transparency with Blockchain and AI Ethics

Providing a 360° view of a given data item, especially for sensitive data, is essential not only for protecting the data and the associated privacy but also for assuring trust, compliance, and ethics in the systems that use or manage such data. With the advent of the General Data Protection Regulation, the California Data Privacy Law, and other such... (more)

Assessing the Readiness of Academia in the Topic of False and Unverified Information

The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking... (more)

Different Faces of False: The Spread and Curtailment of False Information in the Black Panther Twitter Discussion

The task of combating false information online appears daunting, in part due to a public focus on how quickly it can spread and the clamor for automated platform-based interventions. While such concerns can be warranted, threat analysis and intervention design both benefit from a fuller understanding of different types of false information and of... (more)

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data... (more)

A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning

Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires significant... (more)

Improving Adaptive Video Streaming through Session Classification

With internet video gaining increasing popularity and soaring to dominate network traffic, extensive... (more)


July 2019 - Call for papers:
Special Issue on Metadata Discovery for Assessing Data Quality
Submission deadline:
- 9 October 2019 (extended)

Call for papers:

Special Issue on Quality Assessment of Knowledge Graphs
Status: Review in progress

Other news:

Special Issue on Combating Digital Misinformation and Disinformation
Status: Publication in progress

Special Issue on Reproducibility in Information Retrieval
Two-part special issue:
- Evaluation Campaigns, Collections and Analyses (Vol. 10, Issue 3, Oct. 2018)
- Tools and Infrastructures (Vol. 10, Issue 4, Oct. 2018)

On the Horizon challenge papers

From 2019, JDIQ accepts a new type of contribution called "On the Horizon". These manuscripts are written by top researchers in the field of Data Quality. They aim to introduce emerging topics in the field, discussing why they are rising, their challenging aspects, and the envisioned solutions.



Ethical Dimensions for Data Quality

Ethics-aware data processing is a pressing need, considering that data are often used within critical decision processes (e.g., staff evaluation, college admission, criminal sentencing) as well as in everyday life. This poses new challenges across the whole information extraction process: in general there is no guarantee as to the character of the input data, the algorithms are written by human beings, and the models obtained from the data are often opaque and difficult to interpret. We investigate the introduction of such ethical principles, including fairness and transparency, as first-class citizens among the dimensions of data quality.
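To make the idea of fairness as a measurable data-quality dimension concrete, here is a minimal illustrative sketch (not from the paper; all names and the toy data are hypothetical) that computes a demographic-parity gap, i.e., the largest difference in positive-outcome rates between groups in a decision dataset:

```python
from collections import defaultdict

def demographic_parity_gap(records, group_key, outcome_key):
    """Return the largest difference in positive-outcome rates
    between any two groups (0.0 = perfectly balanced)."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        positives[g] += 1 if r[outcome_key] else 0
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)

# Hypothetical college-admission records (one of the critical
# decision processes mentioned in the abstract).
applicants = [
    {"group": "A", "admitted": True},
    {"group": "A", "admitted": True},
    {"group": "A", "admitted": False},
    {"group": "B", "admitted": True},
    {"group": "B", "admitted": False},
    {"group": "B", "admitted": False},
]
print(round(demographic_parity_gap(applicants, "group", "admitted"), 3))  # 0.333
```

A gap near 0 would suggest balanced outcomes across groups; a large gap is one simple signal that fairness, as a quality dimension, may be violated.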

Experience: Managing Misinformation in Social Media - Insights for Policy Makers from Twitter Analytics

Governance of misinformation is a serious concern on social media platforms. Based on three case studies, we offer policy makers insights on managing misinformation in social media. These studies are essential because of the misinformation prevalent on existing social media platforms. Managing misinformation, for instance fake news, is a challenge for policy makers and the platforms alike. The paper explores the factors behind the rapid propagation of misinformation. An average of about 1.5 million (15 lakh) tweets were analysed for each of the three cases surrounding misinformation. This study provides insights for managing misinformation in social media, with a specific focus on cognitive factors that may emerge as drivers of misinformation virality. We highlight findings from three separate case studies focusing on the intrinsic properties of the content, the personality attributes of misinformation propagators, and the network-related attributes that facilitate the rapid propagation of misinformation. Findings indicate that these three aspects collectively catalyze misinformation diffusion and its subsequent virality. Policy makers can use the findings of this experience study for the governance of misinformation: tracking and disrupting any one of the identified drivers could act as a control mechanism to manage misinformation propagation.

Anatomy of Metadata for Data Curation

Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed so far. However, data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to improve the tedious process of data preparation because data errors are detectable through metadata. This paper investigates the intrinsic connection between metadata and data errors. In this work, we establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a new metadata classification and taxonomy based on a closed grammar that can also maintain very complex metadata.
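The core idea above, that data errors become detectable once metadata is extracted, can be illustrated with a small sketch (not the paper's method; the column profile and sample data are hypothetical): profile a trusted sample of a column to obtain metadata, then flag values that violate it.

```python
def profile_column(values):
    """Extract simple metadata (type, value range) from a trusted sample."""
    nums = [float(v) for v in values]
    return {"type": "numeric", "min": min(nums), "max": max(nums)}

def flag_errors(values, meta):
    """Flag values that violate the extracted metadata."""
    errors = []
    for i, v in enumerate(values):
        try:
            x = float(v)
        except ValueError:
            errors.append((i, v, "type violation"))
            continue
        if not (meta["min"] <= x <= meta["max"]):
            errors.append((i, v, "out of range"))
    return errors

# Hypothetical "age" column: profile a clean sample, then check new data.
meta = profile_column(["18", "35", "67", "24"])
print(flag_errors(["29", "abc", "240"], meta))
# [(1, 'abc', 'type violation'), (2, '240', 'out of range')]
```

Richer metadata (patterns, functional dependencies, distributions) supports correspondingly richer error classes, which is exactly the mapping the paper sets out to establish.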

Transforming Pairwise Duplicates to Entity Clusters for High Quality Duplicate Detection

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm and many other clustering algorithms focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm Extended Maximum Clique Clustering (EMCC) and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
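The inconsistency the abstract describes comes from taking the transitive closure of pairwise duplicate decisions. A minimal union-find sketch (illustrative only, not one of the paper's algorithms) shows how two pairwise matches pull a third, unmatched pair into the same cluster:

```python
def transitive_clusters(records, duplicate_pairs):
    """Union-find: merge records connected by pairwise duplicate decisions."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# r1-r2 and r2-r3 were classified as duplicates, but r1-r3 was NOT:
# transitivity still puts all three into one, possibly inconsistent, cluster.
pairs = [("r1", "r2"), ("r2", "r3")]
result = transitive_clusters(["r1", "r2", "r3", "r4"], pairs)
print(sorted(sorted(c) for c in result))
# [['r1', 'r2', 'r3'], ['r4']]
```

Subsequent clustering algorithms, such as the clique-based EMCC mentioned above, refine these transitive clusters so that every record pair inside a cluster is actually sufficiently similar.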

Assessing the Quality of Primary Studies: A Systematic Literature Review

Researchers rely on systematic literature reviews to synthesize existing evidence regarding a research topic. While systematic literature reviews are an important means to condense knowledge, conducting one requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are to select primary studies and to assess their quality. In this article, we report a systematic literature review in which we identify, discuss, and synthesize existing techniques of the software engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identify 8 primary studies that report unique techniques and have been published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real systematic literature reviews is missing for most of them. Moreover, our results indicate the necessity of developing more reliable techniques and of extending their scope to further activities to facilitate the selection and quality assessment of primary studies.

Robustness of Word and Character N-grams Combinations in Detecting Deceptive and Truthful Opinions

Opinions in reviews about the quality of products or services can be important information for readers. Unfortunately, such opinions may include deceptive ones posted for business reasons. To keep opinions a valuable and trusted source of information, we propose an approach to detecting deceptive and truthful opinions. Specifically, we explore the use of word and character n-gram combinations, function words, and word syntactic n-grams (word sn-grams) as features for classifiers for this task. We also consider applying word correction to our dataset. Our experiments show that classification with the word and character n-gram combination features can perform better than with the other features. Although the experiments indicate that the effect of word correction may be insignificant, we note that deceptive opinions tend to contain fewer error words than truthful ones. To examine the robustness of our features, we then perform cross-classification tests. The results of these latter experiments suggest that the word and character n-gram combination features work well in detecting deceptive and truthful opinions. Interestingly, they also indicate that using word sn-grams in the combinations can give good performance.
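As a quick illustration of what "word and character n-gram combination" features look like, here is a minimal pure-Python extractor (a sketch under assumed settings: word unigrams and bigrams plus character trigrams; the paper's exact n-gram orders and preprocessing may differ):

```python
def word_ngrams(text, n):
    """Contiguous sequences of n words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Contiguous sequences of n characters (including spaces)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def combined_features(text):
    """Combine word 1-/2-grams with character 3-grams into one feature list."""
    return word_ngrams(text, 1) + word_ngrams(text, 2) + char_ngrams(text, 3)

print(combined_features("great room"))
# ['great', 'room', 'great room', 'gre', 'rea', 'eat', 'at ', 't r', ' ro', 'roo', 'oom']
```

In practice such features would be vectorized (e.g., as counts or TF-IDF weights) and fed to a standard classifier; the combination lets the model pick up both lexical choices and sub-word spelling patterns.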


Mining Expressive Rules in Knowledge Graphs

We describe RuDiK, a system for the discovery of declarative rules over knowledge bases (KBs). RuDiK discovers rules expressing positive relationships between KB elements, such as "if two persons have the same parent, they are likely to be siblings", and negative rules, i.e., patterns that identify contradictions in the data, such as "if two persons are married, one cannot be the child of the other" or "the birth year of a person cannot be greater than her graduation year". While the former class infers new facts in the KB, the latter class is crucial for other tasks, such as detecting erroneous triples in data cleaning. Our approach satisfies two main requirements. First, it increases the expressive power of the supported rule language with respect to existing systems: RuDiK discovers rules containing (i) comparisons among literal values and (ii) selection conditions with constants. Richer rules increase the accuracy and the coverage over the facts in the KB. This is achieved with aggressive pruning of the search space and with disk-based algorithms, effectively enabling rule mining on commodity machines. Second, RuDiK is robust to errors and incompleteness in the input KB: it discovers approximate rules based on a measure of support that is aware of these quality issues. We model the mining process as an incremental graph exploration problem and prove that our search strategy has optimality guarantees. Extensive experiments on real-world KBs show that RuDiK outperforms previous proposals in terms of efficiency and discovers more effective rules for the application at hand.
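To see how a negative rule flags erroneous triples, here is a hand-written sketch of applying the birth-year/graduation-year rule from the abstract to a toy KB (illustrative only: RuDiK mines such rules automatically, whereas this code merely evaluates one, and the predicate names are hypothetical):

```python
def violations_of_negative_rule(triples):
    """Find subjects whose birth year exceeds their graduation year —
    one instance of the contradiction-detecting negative rules above."""
    birth, grad = {}, {}
    for s, p, o in triples:
        if p == "birthYear":
            birth[s] = int(o)
        elif p == "graduationYear":
            grad[s] = int(o)
    # The rule involves a comparison among literal values, the kind of
    # expressiveness the abstract highlights.
    return [s for s in birth if s in grad and birth[s] > grad[s]]

kb = [
    ("alice", "birthYear", "1970"),
    ("alice", "graduationYear", "1992"),
    ("bob", "birthYear", "1995"),
    ("bob", "graduationYear", "1990"),  # erroneous triple
]
print(violations_of_negative_rule(kb))  # ['bob']
```

Each violation points at a contradictory pair of triples, which is exactly what makes negative rules useful for data cleaning.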
