ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Challenges in Enabling Quality of Analytics in the Cloud

Experience: Learner Analytics Data Quality for an eTextbook System

Validating Data Quality Actions in Scoring Processes

Requirements for Data Quality Metrics


January, 2018 - Call for papers:
Special Issue on Combating Digital Misinformation and Disinformation

Initial submission deadline:
- April 1st, 2018

NEW Non-CS Initial submission:
- May 1st, 2018

March, 2017 -- Call for Papers:
Special Issue on Reproducibility in Information Retrieval
Extended Submission deadline: October 6, 2017

Feb. 2017 -- Call for Papers: 
Special Issue on Improving the Veracity and Value of Big Data 
Extended Submission deadline: April 1st, 2017

Jan. 2016 -- New Book Announcement
Carlo Batini and Monica Scannapieco have a new book:

Data and Information Quality: Dimensions, Principles and Techniques 

Springer Series: Data-Centric Systems and Applications, soon available from the Springer shop


Experience and Challenge papers: JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets and other experiences in handling poor quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See Author Guidelines for details.

Forthcoming Articles
Machine Reading of Biomedical Data Dictionaries

This paper describes an approach for automated ingestion of biomedical data dictionaries. Automated ingestion, or reading, is the process of extracting the details of each data element from a data dictionary in a document format (such as PDF) and converting them to a completely structured format. The structured format is essential if the data dictionary metadata is to be used in applications such as data integration and in evaluating the quality of the associated data. We present a machine-learning classification solution to the problem using conditional random field (CRF) classifiers and leveraging multiple text- and character-based features of text rows in the document. We present an evaluation on several actual data dictionary documents that demonstrates the effectiveness of our approach.

Challenge Paper: Data Quality Issues in Queue Mining

Queue mining is a novel research area of data mining that learns queueing models from data logs. These models are then used for performance prediction in queueing-oriented systems. Queue mining combines techniques from process mining, queueing theory, statistics, and optimization. This paper reviews challenges that stem from data quality issues in queue mining, as well as some existing solutions to these challenges.

Comparative analysis of sequence clustering methods for de-duplication of biological databases

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular problem is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular the use of simple similarity thresholds. This heuristic introduces a trade-off between efficiency and accuracy that we explore in this paper: if the similarity threshold is very high, the methods are accurate but slow; if the similarity threshold is too low, the methods are fast but inaccurate. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users for more effective use of the sequence clustering tools for de-duplication.
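
As a rough illustration of how a similarity threshold drives cluster formation, the sketch below implements a generic greedy clustering loop in Python. It is not the CD-HIT or UCLUST implementation: the function name greedy_cluster is hypothetical, and difflib.SequenceMatcher is used as a stand-in similarity measure purely for illustration (it is not a biological sequence identity score).

```python
from difflib import SequenceMatcher


def greedy_cluster(sequences, threshold):
    """Greedy threshold clustering (illustrative sketch only).

    Sequences are visited longest first. Each one joins the first cluster
    whose representative is at least `threshold` similar to it; otherwise
    it starts a new cluster and becomes that cluster's representative.
    The similarity measure is difflib's ratio, an assumption made here
    for simplicity.
    """
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if SequenceMatcher(None, representative, seq).ratio() >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: open a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    for t in (0.95, 0.85):
        print(t, [members for _, members in greedy_cluster(toy, t)])
```

Each sequence is compared only against cluster representatives, so a lower threshold tends to produce fewer, larger clusters (and fewer representatives to compare against), at the cost of grouping less similar records together.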

Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the LOD (Linked Open Data) cloud is. Measurements (and indexes) that involve more than two datasets are not available although they are important: (a) for obtaining complete information about a set of entities (with provenance for aiding data cleaning and checking), (b) for aiding dataset discovery and selection, and (c) for assessing the connectivity between any set of datasets for checking quality and for monitoring their evolution over time, as well as for various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naive way, in this paper we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and finally (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms. We evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out so far.
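
To make the idea of a symmetric and transitive closure over owl:sameAs concrete, here is a minimal Python sketch that groups URIs into equivalence classes with a union-find structure. This is an assumption about one common way to compute such a closure, not the paper's sameAs catalog algorithm; the function name same_as_classes and the example URIs are hypothetical.

```python
from collections import defaultdict


def same_as_classes(pairs):
    """Group URIs into equivalence classes from (uri, uri) sameAs pairs.

    Union-find treats every pair as symmetric, and merging sets gives the
    transitive closure implicitly. Illustrative sketch only.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    classes = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    return list(classes.values())


if __name__ == "__main__":
    links = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
    print(sorted(sorted(c) for c in same_as_classes(links)))
```

Here ex:a and ex:c end up in the same class even though no pair links them directly, which is the effect of the transitive closure.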

Towards Veracity Assessment in RDF Knowledge Bases: An Exploratory Analysis

Among the different aspects of Knowledge Bases, data quality is one of the most relevant for obtaining the benefit of such information. Knowledge Base quality assessment poses a number of big data challenges such as high volume, variety, velocity and veracity. In this paper we focus on questions related to assessing the veracity of facts through Deep Fact Validation (DeFacto), a fact checking framework designed to validate facts in RDF Knowledge Bases. Despite current development in the research area, the underlying framework for such a task still faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of its pipeline, aiming at reducing the error propagation through its components. As a result of this exploratory analysis, we give insights for an enhanced architecture which is able to better execute the complex task of fact checking, moving towards a better engineering of DeFacto.

Editorial: Special Issue on Improving the Veracity and Value of Big Data

Ontological Multidimensional Data Models and Contextual Data Quality

Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context, for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, and dimensional rules, and to define predicates for quality data specification. Query answering relies upon and triggers navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality. It allows for a logic-based and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.

Information Quality Awareness and Information Quality Practice

Healthcare organizations increasingly rely on electronic information to optimize their operations. Information of high diversity from various sources accentuates the relevance and importance of information quality (IQ). The quality of information needs to be improved to support a more efficient and reliable utilization of healthcare information systems (IS). This can only be achieved through the implementation of initiatives followed by most users across an organization. The purpose of this study is to examine how the awareness of IS users about IQ issues affects their actual practices toward IQ initiatives. Influenced by the awareness of beneficial and problematic situations generated by IQ practices, users' motivation is found to influence their IQ-related behavior. In addition, social influences and facilitating conditions moderate the relationship between user intention and actual practice. The theoretical and practical implications of the findings are discussed, especially IQ best practices in healthcare settings.


Publication Years 2009-2018
Publication Count 134
Citation Count 250
Available for Download 134
Downloads (6 weeks) 983
Downloads (12 Months) 12316
Downloads (cumulative) 84510
Average downloads per article 631
Average citations per article 2
First Name Last Name Award
Peter Aiken ACM Senior Member (2011)
Mikhail Atallah ACM Fellows (2006)
Ahmed Elmagarmid ACM Fellows (2012), ACM Distinguished Member (2009)
Wenfei Fan ACM Fellows (2012)
Matthias Jarke ACM Fellows (2013)
Daniel S Katz ACM Senior Member (2011)
Beth A. Plale ACM Senior Member (2006)

First Name Last Name Paper Counts
Yang Lee 4
Peter Christen 3
Stuart Madnick 3
John Talburt 3
Nan Tang 3
G Shankaranarayanan 3
Peter Edwards 3
Roman Lukyanenko 3
Eitel Lauría 2
Carolyn Matheus 2
Xiaobai Li 2
Ali Sunyaev 2
Vassilios Verykios 2
Wenfei Fan 2
Roger Blake 2
Felix Naumann 2
Kewei Sha 2
Daisy Zhe Wang 2
Wolfgang Lehner 2
Monica Tremblay 2
Ross Gayler 2
Christan Grant 2
Arnon Rosenthal 2
Sherali Zeadally 2
Dinusha Vatsalan 2
Robert Meusel 1
Maurice Van Keulen 1
Irit Askira Gelman 1
Stephen Chong 1
Edoardo Pignotti 1
Eric Medvet 1
Fabiano Tarlao 1
John Herbert 1
Paul McCullagh 1
Juan Augusto 1
Maurice Mulvenna 1
Fabio Mercorio 1
Laure Berti-Équille 1
Fei Chiang 1
Siddharth Sitaramachandran 1
J Jha 1
Sven Weber 1
Richard Briotta 1
Johann Freytag 1
Saad Alaboodi 1
María Bermúdez-Edo 1
Maria Alvarez 1
Panagiotis Ipeirotis 1
Milan Markovic 1
Justin St-Maurice 1
Wenyuan Yu 1
Jürgen Umbrich 1
Fabian Panse 1
Fumiko Kobayashi 1
Paolo Missier 1
Kristin Weber 1
Paul Glowalla 1
Wenyuan Yu 1
Xu Pu 1
Benjamin Ngugi 1
Beverly Kahn 1
Fausto Giunchiglia 1
Wan Fokkink 1
Christoph Quix 1
Matthias Jarke 1
Jeffrey Fisher 1
Jeremy Millar 1
Adriane Chapman 1
Hilko Donker 1
Heiko Müller 1
H Nehemiah 1
Steven Brown 1
Terry Clark 1
Matthew Jensen 1
Jay Nunamaker 1
Adir Even 1
Rachid Chalal 1
Fons Wijnhoven 1
Jeremy Debattista 1
Sushovan De 1
Heiko Paulheim 1
Dominique Ritze 1
Dezhao Song 1
Rabia Nuray-Turan 1
Dmitri Kalashnikov 1
Yinle Zhou 1
Daniel Dalip 1
Pável Calado 1
Tobias Vogel 1
Arvid Heise 1
Uwe Draisbach 1
Youwei Cheah 1
Olivier Curé 1
Claire Collins 1
Ioannis Anagnostopoulos 1
Huan Liu 1
Willem Van Hage 1
Patricia Franklin 1
Gilbert Peterson 1
Hongwei Zhu 1
Peter Aiken 1
Len Seligman 1
Robert Ulbricht 1
Martin Hahmann 1
Michael Zack 1
Nitin Joglekar 1
Mikhail Atallah 1
Ulf Leser 1
Irit Gelman 1
Yanjuan Yang 1
Paul Bowen 1
Min Chen 1
Dennis Wei 1
Aleksandra Mojsilović 1
Ion Todoran 1
Ali Khenchaf 1
D Elizabeth 1
Trent Rosenbloom 1
Shawn Hardenbrook 1
Subhash Bhalla 1
Kaushik Dutta 1
Jeffrey Parsons 1
Valerie Sessions 1
Kresimir Duretec 1
Leena Al-Hussaini 1
Pim Dietz 1
Eric Nelson 1
Manoranjan Dash 1
David Becker 1
M Kaiser 1
Floris Geerts 1
Thomas Redman 1
Xiaoming Fan 1
Giannis Haralabopoulos 1
Kyle Niemeyer 1
Arfon Smith 1
Archana Nottamkandath 1
Darryl Ahner 1
Hongwei Zhu 1
Claudio Hartmann 1
Cihan Varol 1
Coşkun Bayrak 1
David Robb 1
Rosella Gennari 1
Ezra Kahn 1
Adam Kriesberg 1
Mark Braunstein 1
Marta Zárraga-Rodríguez 1
C Raj 1
Peter Elkin 1
Amitava Bagchi 1
Hema Meda 1
Matteo Magnani 1
Sufyan Ababneh 1
Craig Fisher 1
Jiannan Wang 1
Jianing Wang 1
Sebastian Neumaier 1
Norbert Ritter 1
Ayush Singhania 1
R Greenwood 1
George Moustakides 1
Bernd Heinrich 1
Mathias Klier 1
Marcos Gonçalves 1
Hongwei Zhu 1
Bing Lv 1
Paul Mangiameli 1
Dirk Ahlers 1
Alberto Bartoli 1
James McNaull 1
Kelly Janssens 1
Mouhamadou Lamine Ba 1
Judith Gelernter 1
Ciro D'Urso 1
Hua Zheng 1
Ahmed Elmagarmid 1
Fiona Rohde 1
Michael Mannino 1
David Corsar 1
Elliot Fielstein 1
Theodore Speroff 1
Yang Lee 1
Judee Burgoon 1
Josh Attenberg 1
Sean Goldberg 1
Andreas Rauber 1
Marco Valtorta 1
Sabrina Abdellaoui 1
Catherine Burns 1
Subbarao Kambhampati 1
Jeff Heflin 1
Boris Otto 1
Alan March 1
Alun Preece 1
Anja Klein 1
Marco Cristo 1
Richard Wang 1
Marilyn Tremaine 1
Christian Skalka 1
Andrea Lorenzo 1
Maurizio Murgia 1
Mario Mezzanzanica 1
Roberto Boselli 1
Luvai Motiwalla 1
Daniel Katz 1
Sandra Geisler 1
Douglas Hodson 1
Dov Biran 1
Edward Anderson 1
Aseel Basheer 1
Pierpaolo Vittorini 1
Ralf Tönjes 1
Karthikeyan Ramamurthy 1
Laurent Lecornu 1
Shelly Sachdeva 1
Stuart Madnick 1
Debra VanderMeer 1
Foster Provost 1
Nicola Ferro 1
Christian Becker 1
Chintan Amrit 1
Sören Auer 1
Christoph Lange 1
Sharad Mehrotra 1
Sandra Sampaio 1
Dustin Lange 1
Therese Williams 1
Beth Plale 1
Jianyong Wang 1
Chris Baillie 1
John Krogstie 1
Banda Ramadan 1
John O’Donoghue 1
Wenjun Li 1
Davide Ceolin 1
Khoi Tran 1
Lan Cao 1
Diego Marcheggiani 1
Nour El Mawass 1
Payam Barnaghi 1
Arputharaj Kannan 1
Jean Caillec 1
Anupkumar Sen 1
Rashid Ansari 1
Fahima Nader 1
Philip Woodall 1
Shuai Ma 1
Nigel Martin 1
Venkata Meduri 1
Axel Polleres 1
Suzanne Embury 1
Hubert Österle 1
Erhard Rahm 1
Lizhu Zhou 1
Jeffrey Vaughan 1
Melanie Herschel 1
Huizhi Liang 1
Paolo Coletti 1
Mirko Cesarini 1
Hongjiang Xu 1
Vincenzo Maltese 1
Fred Morstatter 1
Xiaoping Liu 1
Paul Groth 1
Valentina Maccatrozzo 1
Mohamed Yakout 1
A Borthick 1
Fabrizio Sebastiani 1
Sara Tonelli 1
Peter Arbuckle 1
Kush Varshney 1
Rahul Basole 1
Jimeng Sun 1
Dmitry Chornyi 1
Danilo Montesi 1
Omar Alonso 1
Ashfaq Khokhar 1
Alan Labouseur 1
Alexandra Poulovassilis 1
Yuheng Hu 1
Yi Chen 1

Affiliation Paper Counts
University of Padua 1
Facebook, Inc. 1
Federal University of Amazonas 1
Florida State University 1
Virginia Commonwealth University 1
University of Amsterdam 1
Vanderbilt University 1
Instituto Superior Tecnico 1
University of Houston 1
Google Inc. 1
University of Leipzig 1
Hospital Universitario Austral 1
Harvard University 1
University of Colorado at Denver 1
Oklahoma City University 1
University of Rhode Island 1
State University of New York at Albany 1
Georgia State University 1
University of Antwerp 1
University of Texas at Austin 1
Oregon State University 1
Beihang University 1
University of Massachusetts System 1
Indian Institute of Science, Bangalore 1
University of Saskatchewan 1
Elsevier 1
University of Augsburg 1
The College of William and Mary 1
Vienna University of Technology 1
University of South Carolina 1
Simon Fraser University 1
Memorial University of Newfoundland 1
Boston University 1
Technical University of Munich 1
Butler University 1
University of Maryland 1
Italian National Research Council 1
New Jersey Institute of Technology 1
National Institute of Standards and Technology 1
Cardiff University 1
Sam Houston State University 1
University College Cork 1
Microsoft Corporation 1
Ben-Gurion University of the Negev 1
Charleston Southern University 1
Commonwealth Scientific and Industrial Research Organization 1
Rutgers, The State University of New Jersey 1
University of Cambridge 1
University of Patras 1
Hellenic Open University 1
University of Baghdad 1
Universite Paris-Est 1
University of Illinois at Urbana-Champaign 1
Lehigh University 2
USDA ARS Beltsville Agricultural Research Center 2
Humboldt University of Berlin 2
Fraunhofer Institute for Applied Information Technology 2
Nanyang Technological University 2
Old Dominion University 2
Suffolk University 2
Free University of Bozen-Bolzano 2
University of Innsbruck 2
University of Arizona 2
Norwegian University of Science and Technology 2
King Saud University 2
University of Waterloo 2
University of Kentucky 2
University of Trento 2
RWTH Aachen University 2
University of Toronto 2
University of Surrey 2
Indiana University 2
New York University 2
Massachusetts Institute of Technology 2
University of Massachusetts Boston 2
University of Bologna 2
University of Hamburg 2
Federal University of Minas Gerais 2
University of Oklahoma 2
University of Queensland 2
University of Aizu 2
McMaster University 2
Universidad de Navarra 2
Indian Institute of Management Calcutta 2
University of Mannheim 3
Telecom Bretagne 3
Northeastern University 3
University of Cologne 3
Babson College 3
Vienna University of Economics and Business Administration 3
University of California, Irvine 3
University of Bonn 3
University of Massachusetts Medical School 3
University of St. Gallen 3
Birkbeck University of London 3
Purdue University 3
Ecole nationale superieure d'Informatique 3
University of Thessaly 3
Georgia Institute of Technology 3
University of Trieste 4
IBM Thomas J. Watson Research Center 4
University of Florida 4
University of Milan - Bicocca 4
Vrije Universiteit Amsterdam 4
University of Manchester 4
Technical University of Dresden 4
Qatar Computing Research Institute 4
University of Illinois at Chicago 4
University of Edinburgh 4
United States Department of Veterans Affairs 4
Anna University 4
University of Ulster 4
University of Twente 4
United States Air Force Institute of Technology 4
Marist College 5
Tsinghua University 5
Arizona State University 5
MITRE Corporation 5
University of Massachusetts Lowell 5
Hasso-Plattner-Institut für Softwaresystemtechnik GmbH 6
Florida International University 6
University of Aberdeen 7
University of Arkansas at Little Rock 8
Australian National University 9

Journal of Data and Information Quality (JDIQ) - Challenge Paper, Experience Paper and Research Paper

Volume 9 Issue 2, January 2018 Challenge Paper, Experience Paper and Research Paper

Volume 9 Issue 1, October 2017 Research Papers and Challenge Papers
Volume 8 Issue 3-4, July 2017 Challenge Papers, Experience Paper and Research Papers
Volume 8 Issue 2, February 2017 Challenge Papers and Research Papers

Volume 8 Issue 1, November 2016 Special Issue on Web Data Quality
Volume 7 Issue 4, October 2016 Challenge Papers and Regular Papers
Volume 7 Issue 3, September 2016 Research Paper, Challenge Papers and Experience Paper
Volume 7 Issue 1-2, June 2016 Challenge Papers, Regular Papers and Experience Paper

Volume 6 Issue 4, October 2015 Challenge Papers and Regular Papers
Volume 6 Issue 2-3, July 2015
Volume 6 Issue 1, March 2015
Volume 5 Issue 4, February 2015
Volume 5 Issue 3, February 2015 Special Issue on Provenance, Data and Information Quality

Volume 5 Issue 1-2, August 2014
Volume 4 Issue 4, May 2014

Volume 4 Issue 3, May 2013
Volume 4 Issue 2, March 2013 Special Issue on Entity Resolution

Volume 4 Issue 1, October 2012
Volume 3 Issue 4, September 2012
Volume 3 Issue 3, August 2012
Volume 3 Issue 2, May 2012
Volume 3 Issue 1, April 2012
Volume 2 Issue 4, February 2012

Volume 2 Issue 3, December 2011
Volume 2 Issue 2, February 2011

Volume 2 Issue 1, July 2010

Volume 1 Issue 3, December 2009
Volume 1 Issue 2, September 2009
Volume 1 Issue 1, June 2009