ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Preserving Patient Privacy When Sharing Same-Disease Data

Medical and health data are often collected for studying a specific disease. For such same-disease microdata, a privacy disclosure occurs as long as...

Ontology-Based Data Quality Management for Data Streams

Data Stream Management Systems (DSMS) provide real-time data processing in an effective way, but there is always a tradeoff between data quality (DQ)...


Jan. 2016 -- New book announcement


Carlo Batini and Monica Scannapieco have a new book:

Data and Information Quality: Dimensions, Principles and Techniques 

Springer Series: Data-Centric Systems and Applications, soon available from the Springer shop

The Springer flyer is available here

Experience and Challenge papers:  JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets, and other experiences in handling poor-quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See Author Guidelines for details.

Forthcoming Articles
The Challenge of Test Data Quality in Data Processing

The need for robust test data sets with test oracles presents challenging questions in data and information quality research. The profound lack of high-quality test data sets to enable the dynamic testing of data processing components highlights open research challenges in data quality related to (1) sample data quality, (2) test data synthesis and (3) quality models.

Automated Quality Assessment of Metadata across Open Data Portals

The Open Data movement has become a driver for publicly available data on the Web. More and more data, from governments and public institutions but also from the private sector, is made available online, mainly through so-called Open Data portals. However, with the increasing number of published resources, there are growing concerns about the quality of the data sources and the corresponding metadata, which compromises the searchability, discoverability, and usability of resources. In order to get a more complete picture of the severity of these issues, the present work develops a generic metadata quality assessment framework for various Open Data portals: we treat data portals independently of the portal software frameworks by mapping the specific metadata of three widely used portal software frameworks (CKAN, Socrata, OpenDataSoft) to the standardized DCAT metadata schema. We subsequently define several quality metrics, which can be evaluated automatically and in a scalable manner. Finally, we report findings based on monitoring a set of over 250 Open Data portals. This includes a discussion of general quality issues, e.g. the retrievability of data, and an analysis of our specific quality metrics.
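
A metric of the kind the abstract describes could, once portal metadata has been mapped to a common DCAT-like schema, be evaluated automatically as a simple ratio. The following minimal sketch assumes invented key names; they are illustrative, not the paper's actual schema.

```python
# Hypothetical completeness metric over DCAT-like metadata records:
# the fraction of assumed core keys that carry a non-empty value.

CORE_KEYS = ["title", "description", "license", "publisher", "modified"]

def completeness(record: dict) -> float:
    """Fraction of core keys with a non-empty value, in [0, 1]."""
    filled = sum(1 for k in CORE_KEYS if record.get(k) not in (None, "", []))
    return filled / len(CORE_KEYS)

record = {"title": "Air quality 2015", "description": "Hourly NO2 readings",
          "license": "CC-BY-4.0", "publisher": "", "modified": None}
print(completeness(record))  # 3 of 5 core keys filled -> 0.6
```

Because such a metric needs only one pass over each record, it can be evaluated in parallel across hundreds of portals, which is what makes the monitoring scalable.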

A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction

The amount of text data has been growing exponentially in recent years, and state-of-the-art statistical text extraction methods over this data are likely to produce errors. Recent work has shown that probabilistic databases can store and query uncertainty over extraction results; however, these systems do not natively reduce that error. In this paper we propose pi-CASTLE, a system that uses a probabilistic database as an anchor to execute, optimize, and integrate machine and human computing. Uncertain fields are crowdsourced with the goal of reducing uncertainty and improving accuracy. We use information theory to optimize the set of questions and a Bayesian probabilistic model to integrate uncertain crowd answers back into the database. Experiments show promising results in significantly reducing machine error using very small amounts of crowdsourced human input. Additionally, probabilistic integration is shown to resolve conflicting crowd answers more effectively and to give users the flexibility to tune the trade-off between accuracy and recall to the needs of their applications. Using crowds to assist machine-learned models proves to be a cost-effective way to close the last mile in terms of accuracy for text labeling and extraction tasks.
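
The information-theoretic question selection the abstract mentions can be sketched as follows: given per-field label distributions from a statistical extractor, crowdsource the fields whose distributions have the highest Shannon entropy, i.e. where the machine is least certain. The field names and probabilities below are invented for illustration and are not taken from the paper.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a {label: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Assumed extractor output: one label distribution per uncertain field.
fields = {
    "author": {"Smith": 0.98, "Smyth": 0.02},                 # near-certain
    "venue":  {"VLDB": 0.40, "SIGMOD": 0.35, "ICDE": 0.25},   # very uncertain
    "year":   {"2014": 0.70, "2015": 0.30},
}

budget = 1  # number of crowd questions we can afford
to_ask = sorted(fields, key=lambda f: entropy(fields[f]), reverse=True)[:budget]
print(to_ask)  # the highest-entropy field is crowdsourced first -> ['venue']
```

Spending the question budget on high-entropy fields is what lets a small amount of human input remove a disproportionate share of the machine error.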

Towards More Accurate Statistical Profiling of Deployed Microdata

Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using the schema.org vocabulary, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which makes it difficult to estimate the data volume and create a data profile. In addition, as the use of global identifiers is not common, the real number of entities described in this format on the Web is hard to assess. In this article, we discuss how the subsequent application of data cleaning steps leads, step by step, to a more realistic view of the data. The cleaning steps applied include both heuristics for fixing errors and means for duplicate detection and elimination. Using the Web Data Commons Microdata corpus, we show that applying such quality improvement methods can substantially change the statistics of the dataset and lead to different estimates of both the number of entities and the class distribution within the data.
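
The two kinds of cleaning steps named above can be sketched in miniature: a heuristic that normalises misspelled class names to a canonical form, followed by duplicate elimination that keys entities on their normalised properties. The canonical map and the records are invented examples, not the paper's actual rules.

```python
# Assumed heuristic: map common lowercase/garbled class names to their
# canonical schema.org-style spelling before counting classes.
CANONICAL = {"postaladdress": "PostalAddress", "localbusiness": "LocalBusiness"}

def fix_class(name: str) -> str:
    return CANONICAL.get(name.strip().lower(), name.strip())

def deduplicate(records):
    """Keep one record per (normalised class, normalised name) key."""
    seen, unique = set(), []
    for r in records:
        key = (fix_class(r["type"]), r.get("name", "").strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

raw = [
    {"type": "localbusiness", "name": "Cafe Roma"},
    {"type": "LocalBusiness", "name": "cafe roma "},  # same entity, different casing
    {"type": "postaladdress", "name": "Main St 1"},
]
print(len(deduplicate(raw)))  # 3 raw records, 2 distinct entities
```

Even this toy version shows why each cleaning step shifts the statistics: the class distribution changes after normalisation, and the entity count drops after deduplication.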

BayesWipe: A Scalable Probabilistic Framework for Improving Data Quality

Recent efforts in data cleaning of structured data have focused exclusively on problems like data de-duplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
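
The core idea, a generative model for clean values combined with an error model for how noise corrupts them, can be illustrated with a loud simplification: here the "generative model" is just the empirical value frequency in the column, and the "error model" scores candidates by character-level proximity. Both are assumptions for illustration, not the paper's actual models.

```python
from collections import Counter

def edit_likelihood(observed: str, candidate: str, p_err: float = 0.1) -> float:
    """Crude error model: each differing character position costs p_err."""
    diffs = sum(a != b for a, b in zip(observed, candidate))
    diffs += abs(len(observed) - len(candidate))
    return (p_err ** diffs) * ((1 - p_err) ** max(len(observed) - diffs, 0))

def correct(observed: str, column_values: list[str]) -> str:
    """Pick the candidate maximising P(candidate) * P(observed | candidate)."""
    prior = Counter(column_values)          # learned from the noisy data itself
    total = sum(prior.values())
    return max(prior, key=lambda c: (prior[c] / total) * edit_likelihood(observed, c))

# A noisy city column: the frequent value acts as the prior.
city_column = ["Boston"] * 40 + ["Bostan"] * 2 + ["Austin"] * 30
print(correct("Bostan", city_column))  # prior pulls the rare typo toward "Boston"
```

Note that the prior is estimated from the dirty column itself, which mirrors the paper's point that no domain expert or clean master data is required.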

Luzzu - A Methodology and Framework for Linked Data Quality Assessment

The increasing variety of Linked Data on the Web makes it challenging to determine the quality of this data and, subsequently, to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data quality, the output of such tools is not suitable for machine consumption, so consumers can hardly compare and rank datasets in order of fitness for use. This paper describes a conceptual methodology for assessing Linked Datasets, and Luzzu, a framework for Linked Data Quality Assessment. Luzzu is based on four major components: (1) an extensible interface for defining new quality metrics; (2) an interoperable, ontology-driven back-end for representing quality metadata and quality problems that can be reused within different semantic frameworks; (3) a scalable stream processor for data dumps and SPARQL endpoints; and (4) a customisable ranking algorithm taking into account user-defined weights. We show that Luzzu scales linearly with the number of triples in a dataset. We also demonstrate the applicability of the Luzzu framework by evaluating and analysing a number of statistical datasets against a variety of metrics. This article contributes towards the definition of a holistic data quality lifecycle, in terms of the co-evolution of linked datasets, with the final aim of improving their quality.
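
Components (1) and (3) above suggest a streaming metric interface: a metric sees one triple at a time and reports a score at the end, so memory use stays constant regardless of dataset size. The sketch below is a Python analogue of that design; the class and method names are assumptions, not Luzzu's actual (Java-based) API.

```python
class StreamingMetric:
    """Assumed extensible interface: one call per triple, one final score."""
    def compute(self, triple): ...        # called once per (s, p, o) triple
    def value(self) -> float: ...         # final score in [0, 1]

class DereferenceablePredicates(StreamingMetric):
    """Toy metric: fraction of predicates that use HTTP(S) IRIs."""
    def __init__(self):
        self.total = self.http = 0
    def compute(self, triple):
        _, p, _ = triple
        self.total += 1
        self.http += p.startswith(("http://", "https://"))
    def value(self):
        return self.http / self.total if self.total else 0.0

metric = DereferenceablePredicates()
for t in [("ex:a", "http://xmlns.com/foaf/0.1/name", "Alice"),
          ("ex:a", "ex:age", "42")]:
    metric.compute(t)
print(metric.value())  # 1 of 2 predicates is an HTTP IRI -> 0.5
```

Because each metric keeps only counters, many metrics can be run side by side over the same stream, which is consistent with the linear scaling the paper reports.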

Editorial: Special Issue on Web Data Quality

From Content to Context: The Evolution and Growth of Data Quality Research

Research in data and information quality has made significant strides over the last twenty years. It has become a unified body of knowledge incorporating techniques, methods, and applications from a variety of disciplines including information systems, computer science, operations management, organizational behavior, psychology, and statistics. With organizations viewing Big Data, social media data, data-driven decision-making, and analytics as critical, data quality has never been more important. We believe that data quality research is reaching the threshold of significant growth and a metamorphosis, from a focus on measuring and assessing data quality content toward a focus on usage and context. At this stage, it is vital to understand the identity of this research area in order to recognize its current state and to effectively identify an increasing number of research opportunities within it. Using Latent Semantic Analysis (LSA) to analyze the abstracts of 972 peer-reviewed journal and conference articles published over the past 20 years, this paper contributes by identifying the core topics and themes that define the identity of data quality research. It further explores their trends over time, pointing to the data quality dimensions that have, and have not, been well studied, and offering insights into topics that may provide significant opportunities in this area.

Author Paper Counts
Yang Lee 4
Peter Christen 3
John Talburt 3
Stuart Madnick 3
Wolfgang Lehner 2
Ross Gayler 2
Dinusha Vatsalan 2
Ali Sunyaev 2
Vassilios Verykios 2
G Shankaranarayanan 2
Roman Lukyanenko 2
Nan Tang 2
Sherali Zeadally 2
Eitel LauríA 2
Xiaobai Li 2
Arnon Rosenthal 2
Irit Askira Gelman 1
Alexandra Poulovassilis 1
Danilo Montesi 1
Omar Alonso 1
John Herbert 1
Juan Augusto 1
Maurice Mulvenna 1
Paul Mccullagh 1
Sven Weber 1
Siddharth Sitaramachandran 1
J Jha 1
Laure Berti-Équille 1
Fabian Panse 1
Fumiko Kobayashi 1
Fabio Mercorio 1
Fei Chiang 1
Richard Briotta 1
Johann Freytag 1
María Bermúdez-Edo 1
Maria Alvarez 1
Kristin Weber 1
Panagiotis Ipeirotis 1
Paolo Missier 1
Benjamin Ngugi 1
Beverly Kahn 1
Wenyuan Yu 1
Xu Pu 1
Paul Glowalla 1
Felix Naumann 1
Christoph Quix 1
Matthias Jarke 1
Fausto Giunchiglia 1
Wan Fokkink 1
Adriane Chapman 1
Jeremy Millar 1
Hilko Donker 1
Dezhao Song 1
Rabia Nuray-Turan 1
Dmitri Kalashnikov 1
Jeffrey Fisher 1
Yinle Zhou 1
Youwei Cheah 1
Heiko Müller 1
Adir Even 1
Steven Brown 1
Terry Clark 1
H Nehemiah 1
Matthew Jensen 1
Daniel Dalip 1
Pável Calado 1
Tobias Vogel 1
Arvid Heise 1
Uwe Draisbach 1
Fons Wijnhoven 1
Olivier Curé 1
Claire Collins 1
Patricia Franklin 1
Huan Liu 1
Ioannis Anagnostopoulos 1
Willem Van Hage 1
Len Seligman 1
Gilbert Peterson 1
Robert Ulbricht 1
Martin Hahmann 1
Peter Aiken 1
Eric Nelson 1
Hongwei Zhu 1
Nitin Joglekar 1
Michael Zack 1
Ulf Leser 1
Irit Gelman 1
Mikhail Atallah 1
Paul Bowen 1
Christan Grant 1
Yanjuan Yang 1
Dennis Wei 1
Aleksandra Mojsilović 1
Ion Todoran 1
Ali Khenchaf 1
Trent Rosenbloom 1
Shawn Hardenbrook 1
Subhash Bhalla 1
Valerie Sessions 1
D Elizabeth 1
Kaushik Dutta 1
M Kaiser 1
Floris Geerts 1
Thomas Redman 1
David Becker 1
Pim Dietz 1
Jeffrey Parsons 1
Manoranjan Dash 1
Xiaoming Fan 1
Wenfei Fan 1
Kyle Niemeyer 1
Arfon Smith 1
Giannis Haralabopoulos 1
Archana Nottamkandath 1
Darryl Ahner 1
Claudio Hartmann 1
Norbert Ritter 1
Cihan Varol 1
Coşkun Bayrak 1
David Robb 1
Daisy Zhe Wang 1
Mark Braunstein 1
Rosella Gennari 1
Marta Zárraga-Rodríguez 1
Sufyan Ababneh 1
Peter Elkin 1
C Raj 1
Craig Fisher 1
Amitava Bagchi 1
Hema Meda 1
Bernd Heinrich 1
Mathias Klier 1
Dirk Ahlers 1
Alberto Bartoli 1
R Greenwood 1
Marcos Gonçalves 1
Jianing Wang 1
Matteo Magnani 1
Ayush Singhania 1
George Moustakides 1
Bing Lv 1
Paul Mangiameli 1
James McNaull 1
Kelly Janssens 1
Hua Zheng 1
Judith Gelernter 1
Mouhamadoulamine Ba 1
Ciro D'Urso 1
Jeff Heflin 1
Christian Skalka 1
Ahmed Elmagarmid 1
Fiona Rohde 1
Kewei Sha 1
Michael Mannino 1
Elliot Fielstein 1
Theodore Speroff 1
Marco Valtorta 1
Yang Lee 1
Judee Burgoon 1
Boris Otto 1
Andrea Lorenzo 1
Maurizio Murgia 1
Josh Attenberg 1
Alun Preece 1
Marilyn Tremaine 1
Alan March 1
Marco Cristo 1
Anja Klein 1
Richard Wang 1
Luvai Motiwalla 1
Sandra Geisler 1
Daniel Katz 1
Mario Mezzanzanica 1
Roberto Boselli 1
Douglas Hodson 1
Sharad Mehrotra 1
Edward Anderson 1
Chris Baillie 1
Peter Edwards 1
Dov Biran 1
Beth Plale 1
Ralf Tönjes 1
Pierpaolo Vittorini 1
Karthikeyan Ramamurthy 1
Laurent Lecornu 1
Shelly Sachdeva 1
Stuart Madnick 1
Monica Tremblay 1
Debra Vandermeer 1
John Krogstie 1
Foster Provost 1
Sandra Sampaio 1
Dustin Lange 1
Therese Williams 1
Chintan Amrit 1
Banda Ramadan 1
Jianyong Wang 1
Roger Blake 1
John O’Donoghue 1
Wenjun Li 1
Davide Ceolin 1
Khoi Tran 1
Lan Cao 1
Jeffrey Vaughan 1
Melanie Herschel 1
Payam Barnaghi 1
Jean Caillec 1
Rashid Ansari 1
Arputharaj Kannan 1
Anupkumar Sen 1
Hubert Österle 1
Paolo Coletti 1
Suzanne Embury 1
Erhard Rahm 1
Nigel Martin 1
Huizhi Liang 1
Lizhu Zhou 1
Shuai Ma 1
Xiaoping Liu 1
Fred Morstatter 1
Mirko Cesarini 1
Hongjiang Xu 1
Vincenzo Maltese 1
Paul Groth 1
Valentina Maccatrozzo 1
Maurice Van Keulen 1
Stephen Chong 1
Edoardo Pignotti 1
Mohamed Yakout 1
A Borthick 1
Rahul Basole 1
Jimeng Sun 1
Sara Tonelli 1
Kush Varshney 1
Ashfaq Khokhar 1
Carolyn Matheus 1
Dmitry Chornyi 1
Eric Medvet 1
Fabiano Tarlao 1

Affiliation Paper Counts
University of Illinois at Urbana-Champaign 1
Qatar Computing Research Institute 1
Florida State University 1
Virginia Commonwealth University 1
Vanderbilt University 1
Instituto Superior Tecnico 1
Google Inc. 1
University of Leipzig 1
Hospital Universitario Austral 1
Harvard University 1
University of Colorado at Denver 1
Oklahoma City University 1
University of Rhode Island 1
State University of New York at Albany 1
Georgia State University 1
University of Antwerp 1
University of Texas at Austin 1
Oregon State University 1
Beihang University 1
University of Massachusetts System 1
Indian Institute of Science 1
Elsevier 1
University of Augsburg 1
University of South Carolina 1
Memorial University of Newfoundland 1
Boston University 1
Technical University of Munich 1
Butler University 1
National Institute of Standards and Technology 1
Cardiff University 1
University of Massachusetts Boston 1
Sam Houston State University 1
University College Cork 1
Microsoft 1
Ben-Gurion University of the Negev 1
Charleston Southern University 1
Commonwealth Scientific and Industrial Research Organization 1
Rutgers, The State University of New Jersey 1
University of Oklahoma 1
University of Patras 1
Hellenic Open University 1
Universite Paris-Est 1
Federal University of Amazonas 1
Lehigh University 2
Humboldt University of Berlin 2
Fraunhofer Institute for Applied Information Technology 2
Arizona State University 2
Nanyang Technological University 2
Old Dominion University 2
Suffolk University 2
Free University of Bozen-Bolzano 2
University of Innsbruck 2
University of Arizona 2
Norwegian University of Science and Technology 2
University of Florida 2
University of Kentucky 2
University of Trento 2
RWTH Aachen University 2
University of Surrey 2
Indiana University 2
New York University 2
Massachusetts Institute of Technology 2
Babson College 2
University of Bologna 2
University of Hamburg 2
Federal University of Minas Gerais 2
University of Queensland 2
University of Aizu 2
McMaster University 2
Universidad de Navarra 2
Indian Institute of Management Calcutta 2
Hamad bin Khalifa University 2
Marist College 3
University of Cologne 3
University of Illinois at Chicago 3
University of Massachusetts Medical School 3
Purdue University 3
Birkbeck University of London 3
University of California, Irvine 3
Georgia Institute of Technology 3
University of Thessaly 3
Northeastern University 3
Telecom Bretagne 3
University of Edinburgh 3
University of Aberdeen 3
University of St. Gallen 3
United States Department of Veterans Affairs 4
Anna University 4
University of Ulster 4
University of Twente 4
United States Air Force Institute of Technology 4
Technical University of Dresden 4
IBM Thomas J. Watson Research Center 4
University of Milan - Bicocca 4
Vrije Universiteit Amsterdam 4
University of Manchester 4
University of Trieste 4
University of Massachusetts Lowell 5
Florida International University 5
MITRE Corporation 5
Tsinghua University 5
University of Arkansas at Little Rock 8
Australian National University 9