ACM Journal of Data and Information Quality (JDIQ)

Latest Articles


Data Challenges in Disease Response

Challenges for Quality of Data in Smart Cities

A Challenge for Long-Term Knowledge Base Maintenance

Challenges in Quality of Temporal Data — Starting with Gold Standards

Data and Analytics Challenges for a Learning Healthcare System

A Methodology to Evaluate Important Dimensions of Information Quality in Systems

Assessing the quality of the information proposed by an information system has become one of the major research topics in the last two decades. A...


Information is a strategic company resource, but there is no consensus in the literature regarding the set of dimensions to be considered when measuring the quality of the information. Most measures of information quality depend on user perception. Using multiple correlation analysis, we obtain a model that allows us to explain how information...


Special issue on Web Data Quality

The goal of this special issue is to present innovative research in the areas of Web Data Quality Assessment and Web Data Cleansing. The editors of this special issue are Christian Bizer, Xin Luna Dong, Ihab Ilyas, and Maria-Esther Vidal. See the call for papers for more details.



New options for ACM authors to manage rights and permissions for their work

ACM introduces a new publishing license agreement, an updated copyright transfer agreement, and a new author-pays option which allows for perpetual open access through the ACM Digital Library. For more information, visit the ACM Author Rights webpage.


ICIQ 2015, the International Conference on Information Quality, will take place on July 24 in Cambridge, MA, at MIT.

Experience and Challenge papers: JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets, and other experiences in handling poor-quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See the calls for papers for details.

Special Issue on Provenance and Quality of Data and Information: The term provenance refers broadly to information about the origin, context, derivation, lineage, ownership or history of some artifact. The provenance of data is more specifically a form of structured metadata that records the activities involved in data production. The notion applies to a broad variety of data types, from database records, to scientific datasets, business transaction logs, web pages, social media messages, and more. At the same time, different definitions and measures of quality apply to each of these data types, in different domains.

The JDIQ guest editors are Paolo Missier (Newcastle University, UK) and Paolo Papotti (Qatar Computing Research Institute, Qatar).

Forthcoming Articles

Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution

Real-time entity resolution is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing techniques are generally used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are to be compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has successfully been used for entity resolution of large static databases. However, because it is based on static sorted arrays and is designed for batch entity resolution that resolves all records in a database rather than resolving those relating to a single query record, this technique is not suitable for real-time entity resolution on dynamic databases that are updated constantly. We propose a tree-based dynamic sorted neighborhood index that facilitates resolving a stream of query records against a large and dynamic database in real-time, and investigate both static and adaptive window approaches. We also propose a technique to reduce query matching times by precalculating the similarities between attribute values stored in neighboring tree nodes. We experimentally evaluate our proposed techniques on two large data sets, as well as on synthetic data with different data quality characteristics. Our results show that, as the index grows, no appreciable increase occurs in either record insertion or query times. Compared to earlier real-time entity resolution techniques, our approach achieves significantly reduced indexing and query matching times while maintaining high matching accuracy.
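
As a rough illustration of the classic (static, batch) sorted neighborhood method that the abstract contrasts itself with, the sketch below sorts records by a key and emits candidate pairs from a sliding window. It is not the tree-based dynamic index the article proposes; the function name, the record layout (dicts with an "id" field), and the sample data are all assumptions made for this example.

```python
from itertools import combinations


def sorted_neighborhood_pairs(records, key, window=3):
    """Return candidate pairs of record ids that share a sliding window.

    Illustrative only: records are assumed to be dicts with an "id" field,
    and `key` is a hypothetical function producing the sorting key.
    """
    ordered = sorted(records, key=key)
    pairs = set()
    # Slide a fixed-size window over the sorted records.
    for start in range(max(1, len(ordered) - window + 1)):
        block = ordered[start:start + window]
        # Every pair inside the window becomes a candidate for detailed comparison.
        for a, b in combinations(block, 2):
            pairs.add(tuple(sorted((a["id"], b["id"]))))
    return pairs


# Made-up sample data, used only to exercise the sketch.
records = [
    {"id": 1, "name": "smith john"},
    {"id": 2, "name": "smyth john"},
    {"id": 3, "name": "jones mary"},
    {"id": 4, "name": "jonas mary"},
    {"id": 5, "name": "taylor ann"},
]
candidates = sorted_neighborhood_pairs(records, key=lambda r: r["name"], window=2)
```

Because the whole list is re-sorted on every call, this batch form illustrates the limitation the abstract points out for constantly updated databases.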

Combining User Reputation and Provenance Analysis for Trust Assessment

Trust is a broad concept which, in many systems, is often reduced to user reputation alone. However, user reputation is just one way to determine trust. The estimation of trust can be tackled from other perspectives as well, including by looking at provenance. Here, we present a complete pipeline for estimating the trustworthiness of artifacts given their provenance and a set of sample evaluations. The pipeline is composed of a series of algorithms for: (1) extracting relevant provenance features, (2) generating stereotypes of user behavior from provenance features, (3) estimating the reputation of both stereotypes and users, (4) using a combination of user and stereotype reputations to estimate the trustworthiness of artifacts and, (5) selecting sets of artifacts to trust. These algorithms rely on the W3C PROV recommendations for provenance and on evidential reasoning by means of subjective logic. We evaluate the pipeline over two tagging datasets: tags and evaluations from the Netherlands Institute for Sound and Vision's Waisda? video tagging platform; and crowdsourced annotations from the project. The approach achieves up to 85% precision when predicting tag trustworthiness. Perhaps more importantly, the pipeline provides satisfactory results using relatively little evidence through the use of provenance.
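
To make the "evidential reasoning by means of subjective logic" step more concrete, the sketch below shows the standard mapping from counts of positive and negative evidence to a binomial opinion (belief, disbelief, uncertainty) and its expected value. It is only a generic illustration of that mapping, not the article's pipeline; the function name, the default base rate of 0.5, and the prior weight of 2 are assumptions chosen for the example.

```python
def opinion_from_evidence(positive, negative, base_rate=0.5, prior_weight=2.0):
    """Map evidence counts to a binomial subjective-logic opinion.

    Returns (belief, disbelief, uncertainty, expectation).
    The defaults (base_rate=0.5, prior_weight=2.0) are assumptions
    for this sketch, not values taken from the article.
    """
    total = positive + negative + prior_weight
    belief = positive / total
    disbelief = negative / total
    uncertainty = prior_weight / total
    # Expected value: belief plus the share of uncertainty assigned by the base rate.
    expectation = belief + base_rate * uncertainty
    return belief, disbelief, uncertainty, expectation


# Hypothetical example: 8 positive evaluations and none negative.
b, d, u, e = opinion_from_evidence(8, 0)
```

With little evidence the uncertainty term dominates and the expectation stays near the base rate; as evidence accumulates, uncertainty shrinks toward zero.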


Data and Analytics Challenges for a Learning Healthcare System

Digital health data is both big and wide. We discuss three distinct challenges in applying data analytics toward the development of a learning healthcare system: data access, data curation, and development of new analytic techniques. We conclude with some interim approaches and future opportunities.

Data Quality Challenge: Toward a tool for string processing by examples

Many data-related activities at organizations of all sizes are concerned with low-level string processing, such as format transformation and validation, data cleaning, substring extraction and classification, and so on. Problems of this sort occur routinely in a one-off fashion as part of specific processes or activities that cannot be integrated in long-lived workflows, such as analysis of data gathered from the web or from other enterprise sources. These tasks are a vital ingredient of virtually every organization but are difficult to address efficiently: they are usually too simple to justify the cost and latency of a full-blown IT project, yet they are not simple enough to be solved by non-IT specialists. Several proposals have been made for tools capable of string processing by examples. Tools of this sort would constitute a significant step forward for many daily IT-related activities, but actually delivering their potential in practice is indeed very challenging.

Document and Corpus Quality Challenges for Knowledge-Management in Engineering Enterprises

Enterprise data is an amalgam of structured, semi-structured, and unstructured data and documents stored in heterogeneous systems. However, this does not necessarily mean that it is completely without structure, simply that in many cases the structure is not readily apparent or modelled to be useful. Formats such as PDF, CAD, Excel, or Word are standardised but offer a high grade of flexibility; the issue is rather that their freeform content does not readily divulge its structure and meaning. When taking a step back, we see a challenging quality issue not based on the individual documents, but on the whole corpus of documents available in an enterprise.


Publication Years 2009-2015
Publication Count 84
Citation Count 118
Available for Download 84
Downloads (6 weeks) 1268
Downloads (12 Months) 10404
Downloads (cumulative) 57200
Average downloads per article 681
Average citations per article 1
First Name Last Name Award
Mikhail Atallah ACM Fellows (2006)
Ahmed Elmagarmid ACM Fellows (2012), ACM Distinguished Member (2009)
Wenfei Fan ACM Fellows (2012)
Beth A. Plale ACM Senior Member (2006)

First Name Last Name Paper Counts
John Talburt 3
Yang Lee 3
Stuart Madnick 3
Vassilios Verykios 2
Ali Sunyaev 2
Eitel Lauría 2
Nan Tang 2
G Shankaranarayanan 2
Pierpaolo Vittorini 1
Karthikeyan Ramamurthy 1
Ralf Tönjes 1
Laurent Lecornu 1
Chris Baillie 1
Peter Edwards 1
Wenfei Fan 1
Dustin Lange 1
Sharad Mehrotra 1
Edward Anderson 1
Sandra Sampaio 1
Jianyong Wang 1
Roger Blake 1
Beth Plale 1
Mario Mezzanzanica 1
Roberto Boselli 1
Chintan Amrit 1
Dov Biran 1
Shelly Sachdeva 1
Stuart Madnick 1
Monica Tremblay 1
Debra Vandermeer 1
Foster Provost 1
Roman Lukyanenko 1
Payam Barnaghi 1
Ion Todoran 1
Jean Caillec 1
Stephen Chong 1
Jeffrey Vaughan 1
Shuai Ma 1
Nigel Martin 1
Lan Cao 1
Lizhu Zhou 1
Arputharaj Kannan 1
Melanie Herschel 1
John O’Donoghue 1
Erhard Rahm 1
Suzanne Embury 1
Rashid Ansari 1
Anupkumar Sen 1
Hubert Österle 1
Sara Tonelli 1
Kush Varshney 1
Edoardo Pignotti 1
Alexandra Poulovassilis 1
Maurice Van Keulen 1
Mohamed Yakout 1
A Borthick 1
Irit Askira Gelman 1
Carolyn Matheus 1
Dmitry Chornyi 1
Mirko Cesarini 1
Hongjiang Xu 1
Ashfaq Khokhar 1
Danilo Montesi 1
Omar Alonso 1
Xiaobai Li 1
Dennis Wei 1
María Bermúdez-Edo 1
Fabio Mercorio 1
Wenyuan Yu 1
Peter Christen 1
Felix Naumann 1
Wenyuan Yu 1
Fabian Panse 1
Paolo Missier 1
Xu Pu 1
Benjamin Ngugi 1
Beverly Kahn 1
John Herbert 1
Juan Augusto 1
Maurice Mulvenna 1
Paul Mccullagh 1
Paul Glowalla 1
Fumiko Kobayashi 1
Richard Briotta 1
Johann Freytag 1
Kristin Weber 1
Panagiotis Ipeirotis 1
Tobias Vogel 1
Arvid Heise 1
Uwe Draisbach 1
Dezhao Song 1
Rabia Nuray-Turan 1
Dmitri Kalashnikov 1
Adir Even 1
Terry Clark 1
H Nehemiah 1
Youwei Cheah 1
Fons Wijnhoven 1
Yinle Zhou 1
Steven Brown 1
Jay Nunamaker 1
Heiko Müller 1
Matthew Jensen 1
Daniel Dalip 1
Pável Calado 1
Christan Grant 1
Aleksandra Mojsilović 1
Sherali Zeadally 1
Ali Khenchaf 1
Claire Collins 1
Floris Geerts 1
David Becker 1
Wenfei Fan 1
Nitin Joglekar 1
Mikhail Atallah 1
Paul Bowen 1
Manoranjan Dash 1
Xiaoming Fan 1
Trent Rosenbloom 1
Shawn Hardenbrook 1
D Elizabeth 1
Olivier Curé 1
Thomas Redman 1
Pim Dietz 1
Eric Nelson 1
Hongwei Zhu 1
Michael Zack 1
Yanjuan Yang 1
Valerie Sessions 1
Subhash Bhalla 1
Ulf Leser 1
Irit Gelman 1
Kaushik Dutta 1
M Kaiser 1
Jeffrey Parsons 1
Rosella Gennari 1
Daisy Zhe Wang 1
Dinusha Vatsalan 1
Jianing Wang 1
David Robb 1
R Greenwood 1
Ayush Singhania 1
George Moustakides 1
Paul Mangiameli 1
Craig Fisher 1
C Raj 1
Norbert Ritter 1
Cihan Varol 1
Coşkun Bayrak 1
Wolfgang Lehner 1
Bing Lv 1
Sufyan Ababneh 1
Peter Elkin 1
Amitava Bagchi 1
Matteo Magnani 1
Mathias Klier 1
Bernd Heinrich 1
Marcos Gonçalves 1
Hongwei Zhu 1
Kewei Sha 1
Christian Skalka 1
Felix Naumann 1
Jeff Heflin 1
Ahmed Elmagarmid 1
Michael Mannino 1
Fiona Rohde 1
Alun Preece 1
Marilyn Tremaine 1
James McNaull 1
Kelly Janssens 1
Therese Williams 1
Anja Klein 1
Marco Valtorta 1
Elliot Fielstein 1
Ted Speroff 1
Yang Lee 1
Hema Meda 1
Judee Burgoon 1
Alan March 1
Boris Otto 1
Richard Wang 1
Josh Attenberg 1
Marco Cristo 1

Affiliation Paper Counts
Federal University of Amazonas 1
Florida State University 1
Vanderbilt University 1
Instituto Superior Tecnico 1
Google Inc. 1
University of Leipzig 1
Hospital Universitario Austral 1
Harvard University 1
University of Colorado at Denver 1
University of Rhode Island 1
State University of New York at Albany 1
Georgia State University 1
MITRE Corporation 1
University of Antwerp 1
University of Texas at Austin 1
Beihang University 1
University of Massachusetts System 1
Indian Institute of Science 1
University of Augsburg 1
University of South Carolina 1
Technical University of Dresden 1
Memorial University of Newfoundland 1
Boston University 1
Technical University of Munich 1
Butler University 1
Cardiff University 1
University of Massachusetts Boston 1
Sam Houston State University 1
University College Cork 1
University of Thessaly 1
Microsoft 1
Ben-Gurion University of the Negev 1
Charleston Southern University 1
Commonwealth Scientific and Industrial Research Organization 1
Rutgers University 1
University of Oklahoma 1
University of Patras 1
University of Massachusetts Lowell 1
Hellenic Open University 1
Universite Paris-Est 1
Qatar Computing Research Institute 1
University of Aizu 2
University of Bologna 2
Massachusetts Institute of Technology 2
New York University 2
Nanyang Technological University 2
Lehigh University 2
Old Dominion University 2
Suffolk University 2
Australian National University 2
Humboldt University of Berlin 2
University of Innsbruck 2
University of Arizona 2
University of Queensland 2
Federal University of Minas Gerais 2
Indian Institute of Management Calcutta 2
Northeastern University 2
University of Hamburg 2
Indiana University 2
Babson College 2
University of Cologne 3
University of Aberdeen 3
Purdue University 3
University of California, Irvine 3
University of London 3
University of St. Gallen 3
University of Illinois at Chicago 3
University of Edinburgh 3
Marist College 3
University of Manchester 4
Florida International University 4
Anna University 4
University of Ulster 4
University of Twente 4
United States Department of Veterans Affairs 4
University of Milan - Bicocca 4
Tsinghua University 5
University of Arkansas at Little Rock 8