ACM Journal of

Data and Information Quality (JDIQ)

Latest Articles

The Challenge of “Quick and Dirty” Information Quality

Data Quality Challenges in Distributed Live-Virtual-Constructive Test Environments

Information Quality Research Challenge

As information technology becomes an integral part of daily life, increasingly, people understand the world around them by turning to digital sources as opposed to directly interacting with objects in the physical world. This has ushered in the age of Ubiquitous Digital Intermediation (UDI). With the explosion of UDI, the scope of Information... (more)

Data Standards Challenges for Interoperable and Quality Data

Challenges for Context-Driven Time Series Forecasting

Predicting time series is a crucial task for organizations, since decisions are often based on uncertain information. Many forecasting models are... (more)

Combining User Reputation and Provenance Analysis for Trust Assessment

Trust is a broad concept that in many systems is often reduced to user reputation alone. However, user reputation is just one way to determine trust.... (more)

Automatic Discovery of Abnormal Values in Large Textual Databases

Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media... (more)


In a manner similar to most organizations, BigCompany (BigCo) was determined to benefit strategically from its widely recognized and vast quantities of data. (U.S. government agencies make regular visits to BigCo to learn from its experiences in this area.) When faced with an explosion in data volume, increases in complexity, and a need to respond... (more)


Jan. 2016 -- New book announcement


Carlo Batini and Monica Scannapieco have a new book:

Data and Information Quality: Dimensions, Principles and Techniques  

Springer Series: Data-Centric Systems and Applications, soon available from the Springer shop

The Springer flyer is available here

Special issue on Web Data Quality

The goal of this special issue is to present innovative research in the areas of Web Data Quality Assessment and Web Data Cleansing. The editors of this special issue are Christian Bizer, Xin Luna Dong, Ihab Ilyas, and Maria-Esther Vidal. See the call for papers for more details.



New options for ACM authors to manage rights and permissions for their work

ACM introduces a new publishing license agreement, an updated copyright transfer agreement, and a new author-pays option which allows for perpetual open access through the ACM Digital Library. For more information, visit the ACM Author Rights webpage.


ICIQ 2015, the International Conference on Information Quality, will take place on July 24 in Cambridge, MA, at MIT.

Experience and Challenge papers: JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets, and other experiences in handling poor-quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See the calls for papers for details.

Special Issue on Provenance and Quality of Data and Information: The term provenance refers broadly to information about the origin, context, derivation, lineage, ownership or history of some artifact. The provenance of data is more specifically a form of structured metadata that records the activities involved in data production. The notion applies to a broad variety of data types, from database records, to scientific datasets, business transaction logs, web pages, social media messages, and more. At the same time, different definitions and measures of quality apply to each of these data types, in different domains.

The JDIQ guest editors are Paolo Missier (Newcastle University, UK) and Paolo Papotti (Qatar Computing Research Institute, Qatar).

Forthcoming Articles

EXPERIENCE: Glitches in Databases, How to Ensure Data Quality by Outlier Detection Techniques

Data are a strategic asset in every organization. The quality of data can make a difference in a number of scenarios, for example, using data mining techniques to gain market share, effectively managing customer relationships, providing the proper service to citizens, or applying econometric tools on reliable data to provide program evaluation and inform policy makers on a range of decisions. Moreover, taking the semantics of data into account to discover and correct errors is too expensive and complex, particularly for a database containing information that pertains to different domains (e.g., customers' personal data, prospective projects, billing, and accounting). This paper proposes the application of a new method to analyse the quality of datasets stored in the tables of a database with no knowledge of the semantics of the data and without the need to define repositories of rules. The proposed method is based on proper revisions of different approaches that are combined to boost overall performance and accuracy. A novel transformation algorithm is conceived that treats the items of database tables as data points in a real coordinate space of n dimensions. The application of the method to a set of archives, some of which have been studied extensively in the literature, provides very promising experimental results and outperforms the individual application of other techniques. Finally, future research directions are highlighted.
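The row-as-point idea can be sketched as follows. This is a toy illustration, not the paper's algorithm: the string-to-length encoding and the distance threshold are assumptions made purely for the example.

```python
import math

def rows_to_points(rows):
    """Map heterogeneous table items to numeric vectors: numbers pass
    through, strings are represented by their length (a crude stand-in
    for a real encoding)."""
    return [[v if isinstance(v, (int, float)) else len(str(v)) for v in row]
            for row in rows]

def outliers_by_distance(rows, threshold=2.0):
    """Flag rows whose Euclidean distance from the column-wise mean
    exceeds `threshold` standard deviations of all such distances."""
    pts = rows_to_points(rows)
    dims = len(pts[0])
    mean = [sum(p[d] for p in pts) / len(pts) for d in range(dims)]
    dists = [math.dist(p, mean) for p in pts]
    mu = sum(dists) / len(dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists)
            if sigma > 0 and d > mu + threshold * sigma]
```

A real system would use a principled encoding of non-numeric values and a more robust outlier criterion; the point here is only that, once rows live in R^n, generic distance-based detection needs no knowledge of the data's semantics.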

Automated Quality Assessment of Metadata across Open Data Portals

The Open Data movement has become a driver for publicly available data on the Web. More and more data, from governments and public institutions but also from the private sector, is made available online, mainly published in so-called Open Data portals. However, with the increasing number of published resources, there are growing concerns about the quality of the data sources and the corresponding metadata, which compromise the searchability, discoverability, and usability of resources. To get a more complete picture of the severity of these issues, the present work develops a generic metadata quality assessment framework for various Open Data portals: we treat data portals independently of the underlying portal software by mapping the specific metadata of three widely used portal software frameworks (CKAN, Socrata, OpenDataSoft) to the standardized DCAT metadata schema. We then define several quality metrics that can be evaluated automatically and in a scalable manner. Finally, we report findings based on monitoring a set of over 250 Open Data portals, including a discussion of general quality issues, e.g., the retrievability of data, and an analysis of our specific quality metrics.
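One such automatically evaluable metric is metadata completeness over the mapped DCAT fields. The sketch below is illustrative only; which DCAT properties to check, and the equal weighting, are assumptions, not the paper's metric definitions.

```python
# Assumed subset of DCAT properties to check, for illustration only.
DCAT_KEYS = ["title", "description", "license", "publisher", "modified"]

def completeness(dataset_metadata):
    """Fraction of expected DCAT fields that are present and non-empty
    in one dataset's metadata record."""
    filled = sum(1 for k in DCAT_KEYS if dataset_metadata.get(k))
    return filled / len(DCAT_KEYS)

def portal_completeness(datasets):
    """Average completeness over all dataset records in a portal."""
    return sum(completeness(d) for d in datasets) / len(datasets)
```

Because the metric is computed per record and averaged, it scales linearly with portal size, which is what makes monitoring hundreds of portals feasible.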

Replacing Mechanical Turkers? How to Evaluate Learning Results with Semantic Properties

Some machine learning algorithms offer more than just superior predictive power. They often generate additional information about the dataset upon which they were trained, providing insight into the underlying data. Examples are topic modeling algorithms such as Latent Dirichlet Allocation (LDA)~\cite{blei2003latent}, whose topics are often inspected as part of the analysis that many researchers do on their data. More recently, deep learning algorithms such as the word embedding method Word2Vec~\cite{mikolov2013distributed} have produced models with semantic properties. These algorithms are immensely useful; they tell us something about the environment from which they generate their predictions. One pressing challenge is how to evaluate the quality of the information produced by these algorithms. This evaluation (if done at all) is usually carried out via user studies. In the context of LDA topics, researchers ask human subjects questions and observe how they understand different aspects of the topics~\cite{chang2009reading}. While this type of evaluation is sound, it is expensive in both time and money, and thus cannot easily be reproduced independently. These experiments have the additional drawback of being hard to scale up and difficult to generalize. We pose the challenging question of evaluating the information quality of these semantic properties: could we find automatic methods of evaluating information quality as easily as we evaluate predictive power using accuracy, precision, and recall?
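One existing automatic probe of embedding semantics is the analogy test (a : b :: c : ?), which needs no human subjects. The sketch below uses hand-made toy vectors chosen so the analogy holds; real evaluation would score a trained model against a benchmark set of analogy questions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(emb, a, b, c):
    """Solve a : b :: c : ? by ranking vocabulary words against
    vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy 2-d vectors, constructed by hand for illustration.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.2, 1.0],
    "woman": [0.2, -1.0],
}
```

Accuracy over a large question set gives a single number comparable across models, which is exactly the kind of cheap, reproducible proxy the challenge asks for; whether such proxies capture information quality in general is the open question.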

Preserving Patient Privacy When Sharing Same-Disease Data

Medical and health data are often collected for studying a specific disease. For such same-disease microdata, a privacy disclosure occurs as long as an individual is known to be in the microdata. Individuals in same-disease microdata are thus subject to higher disclosure risk than those in microdata covering different diseases. This important problem has been overlooked in data privacy research and practice, and no prior study has addressed it. In this study, we analyze the disclosure risk for individuals in same-disease microdata and propose a new metric appropriate for measuring disclosure risk in this situation. An efficient algorithm is designed and implemented for anonymizing same-disease data to minimize the disclosure risk while keeping data utility as high as possible. An experimental study was conducted on real patient and population data. The results show that traditional re-identification risk measures underestimate the actual disclosure risk for individuals in same-disease microdata, and demonstrate that the proposed approach is very effective in reducing that risk. This study suggests that privacy protection policy and practice for sharing medical and health data should consider not only individuals' identifying attributes but also the health and disease information contained in the data. It is recommended that data-sharing entities employ a statistical approach, instead of HIPAA's Safe Harbor policy, when sharing same-disease microdata.
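For context, the traditional re-identification risk the abstract refers to is typically derived from equivalence-class sizes on quasi-identifiers, as in the sketch below. This is the conventional measure, not the paper's new metric; in same-disease data it underestimates risk because mere membership in the file already discloses the disease.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Per-record re-identification risk as 1 / size of the record's
    equivalence class on the quasi-identifier attributes
    (the classic k-anonymity-style measure)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]
```

Under this measure a record is "safe" when many others share its quasi-identifier values, which is exactly the assumption that breaks down when every record shares the same sensitive disease.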

Challenges in Ontology Evaluation

Ontologies often provide the semantics, as middleware, for a number of Artificial Intelligence tools, and can be used to make logical assertions. Ontologies can define objects and the relationships among them in any domain-specific system. Finding logic errors in complete ontologies proves largely impossible for even the most widely used reasoners, and logic is just one of numerous ways in which an ontology might be assessed. We therefore suggest that evaluation of complete ontologies is of limited value. Instead, we argue that the logical connections within ontologies should be tested during development by tools such as Scenario-based Ontology Evaluation (SCONE). We would change present tools so that domain experts can make changes in an ontology without knowing ontology languages or description logic, and so that ontology-based systems can allow fuzzy matching based on ontologies that might be imperfect.

Ontology-based Data Quality Management for Data Streams

Data Stream Management Systems (DSMS) have proven to provide real-time data processing effectively, but in these systems there is always a trade-off between data quality and performance. We propose an ontology-based data quality framework for data stream management that includes data quality measurement and monitoring in a transparent, modular, and flexible way. We follow a threefold approach that takes the characteristics of relational data stream management into account: (1) Query Metrics reflect changes in data quality due to query operations, (2) Content Metrics allow the semantic evaluation of data in the streams, and (3) Application Metrics allow easy user-defined computation of data quality values to account for application specifics. Additionally, a quality monitor makes it possible to observe data quality values and take counteractions to balance data quality and performance. The framework has been designed along a data quality management methodology suited to data streams, and has been evaluated in the domains of road traffic applications and health monitoring.
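The monitoring idea can be illustrated with a minimal sliding-window completeness check. This is a stand-in sketch, not the framework's monitor; the window size, the threshold, and the completeness definition are all assumptions for the example.

```python
from collections import deque

class CompletenessMonitor:
    """Track the fraction of non-null readings over a sliding window and
    report whether it stays above a threshold, so that counteractions
    can be triggered when quality drops."""

    def __init__(self, window=100, threshold=0.9):
        self.window = deque(maxlen=window)  # True = value present
        self.threshold = threshold

    def observe(self, value):
        """Record one stream element; return True while quality is OK."""
        self.window.append(value is not None)
        return self.completeness() >= self.threshold

    def completeness(self):
        return sum(self.window) / len(self.window)
```

Keeping the monitor a fixed-size window bounds its cost per element, which matters in exactly the quality-versus-performance trade-off the abstract describes.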


Veracity of Big Data: Challenges of Cross-modal Truth Discovery

In this challenge paper, we argue that the next generation of data management and data sharing systems needs to manage not only the volume and variety of Big Data but, most importantly, its veracity. Designing truth discovery systems requires a fundamental paradigm shift in data management and goes beyond adding new layers of data fusion heuristics or developing yet another probabilistic graphical truth discovery model. Actionable and Web-scale truth discovery requires a transdisciplinary approach that incorporates the dynamic and cross-modal dimension of multi-layered networks of contents and sources.

The Challenge of Improving Credibility of User-Generated Content in Online Social Networks

In every environment of information exchange, Information Quality (IQ) is considered one of the most important issues. Studies in Online Social Networks (OSNs) analyze a number of related subjects that span both theoretical and practical aspects, from data quality identification and simple attribute classification to quality assessment models for various social environments. Among the several factors that affect information quality in online social networks is the credibility of user-generated content. To address this challenge, some proposed solutions include community-based evaluation and labeling of user-generated content in terms of accuracy, clarity, and timeliness, along with well-established real-time data mining techniques.
