ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Data Transparency with Blockchain and AI Ethics

Providing a 360° view of a given data item, especially for sensitive data, is essential toward not only protecting the data and associated privacy but also assuring trust, compliance, and ethics of the systems that use or manage such data. With the advent of the General Data Protection Regulation, the California Data Privacy Law, and other such...

Assessing the Readiness of Academia in the Topic of False and Unverified Information

The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking...

Different Faces of False: The Spread and Curtailment of False Information in the Black Panther Twitter Discussion

The task of combating false information online appears daunting, in part due to a public focus on how quickly it can spread and the clamor for automated platform-based interventions. While such concerns can be warranted, threat analysis and intervention design both benefit from a fuller understanding of different types of false information and of...

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data...

A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning

Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires significant...

Improving Adaptive Video Streaming through Session Classification

With internet video gaining increasing popularity and soaring to dominate network traffic, extensive...


July 2019 - Call for papers:
Special Issue on Metadata Discovery for Assessing Data Quality
Submission deadline:
- 9 October 2019 (extended)

Call for papers:

Special Issue on Quality Assessment of Knowledge Graphs
Status: Review in progress

Other news:

Special Issue on Combating Digital Misinformation and Disinformation
Status: Publication in progress

Special Issue on Reproducibility in Information Retrieval
Two parts special issue:
- Evaluation Campaigns, Collections and Analyses (Vol. 10, Issue 3, Oct. 2018)
- Tools and Infrastructures (Vol. 10, Issue 4, Oct. 2018)

On the Horizon challenge papers

From 2019, JDIQ is accepting a new type of contribution called "On the Horizon". These manuscripts are written by top researchers in the field of Data Quality. They aim to introduce rising topics in the field, discussing why they are emerging, their challenging aspects, and the envisioned solutions.



Ethical Dimensions for Data Quality

Ethics-aware data processing is a pressing need, considering that data are often used within critical decision processes (e.g., staff evaluation, college admission, criminal sentencing) as well as in everyday life. This poses new challenges across the whole information extraction process, since in general there is no guarantee as to the character of the input data, the algorithms are written by human beings, and the models obtained from the data are often opaque and difficult to interpret. We investigate the introduction of such ethical principles, including fairness and transparency, as first-class citizens among the dimensions of data quality.

Experience: Managing Misinformation in Social Media - Insights for Policy Makers from Twitter Analytics

Governance of misinformation is a serious concern for social media platforms. Based on three case studies, we offer policy makers insights on managing misinformation in social media. These studies are essential because of the misinformation prevalent on existing social media platforms. Managing misinformation, for instance fake news, is a challenge for both policy makers and the platforms. The paper explores the factors behind the rapid propagation of misinformation. An average of about 1.5 million tweets was analysed for each of the three cases surrounding misinformation. This study provides insights for managing misinformation in social media, with a specific focus on cognitive factors that may emerge as drivers of misinformation virality. We highlight findings from three separate case studies focusing on the intrinsic properties of the content, the personality attributes of misinformation propagators, and the network-related attributes that facilitate the rapid propagation of misinformation. Findings indicate that these three aspects collectively catalyse misinformation diffusion and subsequent virality. Policy makers can utilize the findings of this experience study for the governance of misinformation. Tracking and disrupting any one of the identified drivers could act as a control mechanism to manage misinformation propagation.

Transforming Pairwise Duplicates to Entity Clusters for High Quality Duplicate Detection

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, focuses on the edge weights instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm Extended Maximum Clique Clustering (EMCC) and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
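The inconsistency the abstract describes can be made concrete with a minimal sketch (not the paper's implementation): when pairwise duplicate decisions are closed transitively, e.g. via union-find, a cluster can contain records that the pairwise classifier never judged similar to each other. All names and functions below are illustrative.

```python
# Transitive clustering of pairwise duplicate decisions via union-find.
def find(parent, x):
    # Find the root of x with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def transitive_clusters(records, duplicate_pairs):
    """Merge records connected by any chain of pairwise duplicate decisions."""
    parent = {r: r for r in records}
    for a, b in duplicate_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    clusters = {}
    for r in records:
        clusters.setdefault(find(parent, r), set()).add(r)
    return list(clusters.values())

records = ["J. Smith", "John Smith", "John Smyth"]
# A pairwise classifier accepts (1,2) and (2,3) but never compared or
# rejected (1,3) -- the transitive closure still merges all three:
pairs = [("J. Smith", "John Smith"), ("John Smith", "John Smyth")]
print(transitive_clusters(records, pairs))
```

Post-processing algorithms like those surveyed in the article then split or repair such clusters so that every member pair is sufficiently similar.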

Assessing the Quality of Primary Studies: A Systematic Literature Review

Researchers rely on systematic literature reviews to synthesize existing evidence regarding a research topic. While an important means of condensing knowledge, conducting a systematic literature review requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are to select primary studies and to assess their quality. In this article, we report a systematic literature review in which we identify, discuss, and synthesize existing techniques of the software engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identify 8 primary studies that report unique techniques and were published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real systematic literature reviews is missing for most of them. Moreover, our results indicate the necessity of developing more reliable techniques and of extending their scope to further activities to facilitate the selection and quality assessment of primary studies.

Robustness of Word and Character N-grams Combinations in Detecting Deceptive and Truthful Opinions

Opinions in reviews about the quality of products or services can be important information for readers. Unfortunately, such opinions may include deceptive ones posted for business reasons. To keep opinions a valuable and trusted source of information, we propose an approach to detecting deceptive and truthful opinions. Specifically, we explore the use of word and character n-gram combinations, function words, and word syntactic n-grams (word sn-grams) as features for classifiers to deal with this task. We also consider applying word correction to our dataset. Our experiments show that classification using the word and character n-gram combination features could perform better than classification employing other features. Although the experiments indicate that applying word correction might be insignificant, we note that deceptive opinions tend to have fewer error words than truthful ones. To examine the robustness of our features, we then perform cross-classification tests. The results of these latter experiments suggest that using the word and character n-gram combination features could work well in detecting deceptive and truthful opinions. Interestingly, they also indicate that using the word sn-grams as combination features could give good performance.
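To illustrate the kind of features the abstract refers to, a minimal sketch of extracting a combined bag of word and character n-grams follows. The n-gram ranges and feature-tagging scheme are illustrative assumptions, not the authors' exact configuration.

```python
# Extract word n-grams and character n-grams from a text, then combine
# them into one feature list, tagging each feature with its type.
def word_ngrams(text, n):
    """All contiguous sequences of n whitespace-separated tokens."""
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def char_ngrams(text, n):
    """All contiguous substrings of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def combined_features(text):
    """Word unigrams/bigrams plus character 2- and 3-grams, type-tagged."""
    feats = []
    for n in (1, 2):
        feats += [("w%d" % n, g) for g in word_ngrams(text, n)]
    for n in (2, 3):
        feats += [("c%d" % n, g) for g in char_ngrams(text, n)]
    return feats

print(combined_features("great food")[:4])
```

A classifier would then be trained on vectors built from such tagged features, with the word and character views contributing jointly.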

Getting Rid of Data

We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated, and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and by 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing, and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of data, and the constantly evolving needs. While multiple data sketching, summarization, and deletion techniques have been developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges, with ad hoc solutions that are application specific and rarely sharable. In this vision paper we discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for constraint enforcement, and for the development of applications over the retained information. In particular, we overview relevant related work, highlighting new research challenges and potential reuse of existing techniques.
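One simple instance of the summarization techniques the abstract alludes to is reservoir sampling, which retains a fixed-size uniform random sample of a stream while discarding the rest of the data. This is an illustrative sketch, not a technique proposed by the paper.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length,
    using O(k) memory regardless of how much data flows past."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive bounds
            if j < k:
                reservoir[j] = item         # replace with decreasing probability
    return reservoir

# Summarize a stream of 10,000 records down to 5 retained items.
print(reservoir_sample(range(10_000), 5))
```

Techniques in this family let an organization bound storage while preserving statistical properties of the disposed data, one of the trade-offs the paper's vision aims to systematize.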

Characterizing Disinformation Risk to Open Data in the Post-truth Era

Curated, labelled, high quality data is a valuable commodity for tasks such as business analytics or machine learning. Open data is a common source of such data; for example, retail analytics draws on open demographic data, and weather forecast systems draw on open atmospheric and ocean data. This data is released openly by governments to achieve various objectives, such as transparency, informing citizen engagement, or supporting private enterprise, and is generally trusted. Critical examination of ongoing social changes, including the post-truth phenomenon, suggests quality, integrity, and authenticity of open data may be at risk. We describe these risks, with examples, and identify mechanisms to mitigate them. As an initial assessment of awareness of these risks, we compare our analysis to perspectives captured during open data stakeholder consultations in Canada.

