
ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Unifying Data and Constraint Repairs

Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data by finding minimal or lowest cost changes to the...

Challenges in Ontology Evaluation

Veracity of Big Data

The Challenge of Improving Credibility of User-Generated Content in Online Social Networks

In every environment of information exchange, Information Quality (IQ) is considered one of the most important issues. Studies in Online Social...

EXPERIENCE

Enterprise archives are inevitably affected by data quality problems (also called glitches). This article proposes the application of a new method to analyze the quality of datasets stored in the tables of a database, with no knowledge of the semantics of the data and without the need to define repositories of rules. The proposed...

NEWS

Jan. 2016 -- New book announcement

 

Carlo Batini and Monica Scannapieco have a new book:

Data and Information Quality: Dimensions, Principles and Techniques 

Springer Series: Data-Centric Systems and Applications, soon available from the Springer shop

The Springer flyer is available here


Experience and Challenge papers: JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets, and other experiences in handling poor quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See the Author Guidelines for details.

Forthcoming Articles
Automated Quality Assessment of Metadata across Open Data Portals

The Open Data movement has become a driver for publicly available data on the Web. More and more data, from governments and public institutions as well as from the private sector, is made available online, mainly through so-called Open Data portals. However, with the increasing number of published resources, there are growing concerns about the quality of the data sources and the corresponding metadata, which compromises the searchability, discoverability, and usability of resources. In order to get a more complete picture of the severity of these issues, the present work develops a generic metadata quality assessment framework for various Open Data portals: we treat data portals independently of the underlying portal software by mapping the specific metadata of three widely used portal software frameworks (CKAN, Socrata, OpenDataSoft) to the standardized DCAT metadata schema. We then define several quality metrics that can be evaluated automatically and in a scalable manner. Finally, we report findings based on monitoring a set of over 250 Open Data portals, including a discussion of general quality issues, e.g., the retrievability of data, and an analysis of our specific quality metrics.
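As a hedged illustration of the mapping-plus-metrics idea (not the authors' framework), the Python sketch below maps a few CKAN-style metadata fields to DCAT terms and computes a simple completeness score; the field mapping and the metric are invented for the example.

```python
# Illustrative sketch only: map CKAN-style dataset metadata to a few DCAT terms
# and compute a simple completeness metric over the mapped fields.

CKAN_TO_DCAT = {            # hypothetical, partial field mapping
    "title": "dct:title",
    "notes": "dct:description",
    "license_id": "dct:license",
    "author_email": "dcat:contactPoint",
    "tags": "dcat:keyword",
}

def to_dcat(ckan_record: dict) -> dict:
    """Translate a CKAN metadata record into a flat DCAT-keyed dict."""
    return {dcat_key: ckan_record.get(ckan_key)
            for ckan_key, dcat_key in CKAN_TO_DCAT.items()}

def completeness(dcat_record: dict) -> float:
    """Fraction of mapped DCAT fields that are non-empty."""
    filled = sum(1 for v in dcat_record.values() if v)
    return filled / len(dcat_record) if dcat_record else 0.0

if __name__ == "__main__":
    example = {"title": "Air quality 2015", "notes": "", "license_id": "cc-by",
               "author_email": None, "tags": ["environment"]}
    print(completeness(to_dcat(example)))   # 0.6
```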

Replacing Mechanical Turkers? How to Evaluate Learning Results with Semantic Properties

Some machine learning algorithms offer more than superior predictive power: they also generate additional information about the dataset on which they were trained, providing insight into the underlying data. Examples are topic modeling algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), whose topics are often inspected as part of the analysis many researchers perform on their data. More recently, deep learning approaches such as the word embedding algorithm Word2Vec (Mikolov et al., 2013) have produced models with semantic properties. These algorithms are immensely useful; they tell us something about the environment from which they generate their predictions. One pressing challenge is how to evaluate the quality of the information they produce. This evaluation (if done at all) is usually carried out via user studies: in the context of LDA topics, researchers ask human subjects questions and observe how they understand different aspects of the topics (Chang et al., 2009). While this type of evaluation is sound, it is expensive in both time and money and thus cannot easily be reproduced independently. Such experiments are also hard to scale up and difficult to generalize. We pose the challenge of evaluating the information quality of these semantic properties: could we find automatic methods of evaluating information quality as easily as we evaluate predictive power using accuracy, precision, and recall?
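One direction such automatic evaluation could take, sketched below under assumptions of our own (a toy corpus and a UMass-style coherence score), is to replace human judgments of topic quality with corpus-based co-occurrence statistics.

```python
# Hedged illustration: one family of automatic proxies for human topic judgments
# is a corpus-based coherence score, here a UMass-style pairwise measure over
# document co-occurrence counts. The corpus is made up, and the sketch assumes
# every scored word occurs at least once in the corpus.
from itertools import combinations
from math import log

docs = [{"data", "quality", "cleaning"},
        {"data", "repair", "quality"},
        {"music", "guitar", "quality"}]

def doc_freq(word):
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words, eps=1.0):
    """Higher is better: top topic words should co-occur in documents."""
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        score += log((co_doc_freq(w1, w2) + eps) / doc_freq(w2))
    return score

print(umass_coherence(["data", "quality", "cleaning"]))
print(umass_coherence(["data", "guitar", "cleaning"]))   # lower: incoherent mix
```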

Towards More Accurate Statistical Profiling of Deployed schema.org Microdata

Promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup formats for the Web. However, deployed Microdata is rarely free from errors, which makes it difficult to estimate the data volume and create a data profile. In addition, since global identifiers are rarely used, the real number of entities described in this format on the Web is hard to assess. In this article, we discuss how the successive application of data cleaning steps leads, step by step, to a more realistic view of the data. The cleaning steps include both heuristics for fixing errors and methods for duplicate detection and elimination. Using the Web Data Commons Microdata corpus, we show that applying such quality improvement methods can substantially change the statistics of the dataset and lead to different estimates of both the number of entities and the class distribution within the data.
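To make the two kinds of cleaning steps concrete, here is a small Python sketch; the property fixes and the duplicate key are stand-ins for the kinds of heuristics discussed, not the authors' exact rules.

```python
# Illustrative only: normalise mis-cased / misspelled schema.org property names,
# then remove near-duplicate entities by a crude (type, name) key.

CANONICAL_PROPS = {"Name": "name", "NAME": "name",
                   "telephon": "telephone"}          # hypothetical fixes

def clean_entity(entity: dict) -> dict:
    """Map misspelled or mis-cased property names onto canonical terms."""
    return {CANONICAL_PROPS.get(k, k.lower()): v for k, v in entity.items()}

def deduplicate(entities):
    """Keep one entity per (type, lower-cased name) key."""
    seen, unique = set(), []
    for e in entities:
        key = (e.get("@type"), str(e.get("name", "")).lower())
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

raw = [{"@type": "LocalBusiness", "Name": "Cafe Milano", "telephon": "123"},
       {"@type": "LocalBusiness", "name": "cafe milano", "telephone": "123"}]
cleaned = [clean_entity(e) for e in raw]
print(len(deduplicate(cleaned)))   # 1 entity after cleaning and deduplication
```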

BayesWipe: A Scalable Probabilistic Framework for Improving Data Quality

Recent efforts in data cleaning of structured data have focused exclusively on problems like data de-duplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
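The Bayesian scoring idea can be illustrated with a toy sketch (not the BayesWipe implementation): candidate repairs of an observed value are ranked by prior probability times error-model likelihood. The distributions below are hand-set for illustration; in the described approach both would be learned from the noisy data itself.

```python
# Minimal sketch, assuming a toy prior and a toy error model:
# rank candidate repairs by P(candidate) * P(observed | candidate).

prior = {"Honda": 0.6, "Hyundai": 0.4}   # P(true value), e.g. from a generative model

def error_likelihood(observed: str, candidate: str) -> float:
    """Toy error model: probability of observing `observed` given the true value."""
    if observed == candidate:
        return 0.9                        # value typed correctly
    # crude typo model: small probability, higher if the strings share a prefix
    return 0.1 if candidate.startswith(observed[:3]) else 0.01

def best_repair(observed: str) -> str:
    scores = {c: prior[c] * error_likelihood(observed, c) for c in prior}
    return max(scores, key=scores.get)

print(best_repair("Hond"))     # -> "Honda"
print(best_repair("Hyundai"))  # unchanged: observed value already most probable
```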

Preserving Patient Privacy When Sharing Same-Disease Data

Medical and health data are often collected for studying a specific disease. For such same-disease microdata, a privacy disclosure occurs once an individual is known to be in the microdata. Individuals in same-disease microdata are thus subject to higher disclosure risk than those in microdata covering different diseases. This important problem has been overlooked in data privacy research and practice, and no prior study has addressed it. In this study, we analyze the disclosure risk for individuals in same-disease microdata and propose a new metric appropriate for measuring disclosure risk in this situation. An efficient algorithm is designed and implemented for anonymizing same-disease data to minimize the disclosure risk while keeping data utility as high as possible. An experimental study was conducted on real patient and population data. The results show that traditional re-identification risk measures underestimate the actual disclosure risk for individuals in same-disease microdata and demonstrate that the proposed approach is very effective in reducing that risk. This study suggests that privacy protection policy and practice for sharing medical and health data should consider not only individuals' identifying attributes but also the health and disease information contained in the data. It is recommended that data-sharing entities employ a statistical approach, instead of HIPAA's Safe Harbor policy, when sharing same-disease microdata.
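A hedged illustration of why same-disease microdata carries higher risk, using toy data and a simple 1/class-size measure rather than the authors' metric: if an attacker knows a person is in the released table, the matching group is the quasi-identifier equivalence class inside the microdata, not the larger class in the general population.

```python
# Toy example: re-identification risk as 1 / size of the quasi-identifier class.
from collections import Counter

# quasi-identifiers: (birth year, sex, ZIP)
population = [("1975", "F", "10001")] * 50 + [("1975", "F", "10002")] * 40
microdata  = [("1975", "F", "10001"), ("1975", "F", "10001"), ("1975", "F", "10002")]

def risk(records, qi):
    """Risk for one quasi-identifier combination = 1 / equivalence-class size."""
    counts = Counter(records)
    return 1.0 / counts[qi]

qi = ("1975", "F", "10001")
print(risk(population, qi))   # population-based risk: 1/50 = 0.02
print(risk(microdata, qi))    # risk given known membership: 1/2 = 0.5
```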

Ontology-based Data Quality Management for Data Streams

Data Stream Management Systems (DSMS) have proven effective for real-time data processing, but in these systems there is always a trade-off between data quality and performance. We propose an ontology-based data quality framework for data stream management that supports data quality measurement and monitoring in a transparent, modular, and flexible way. We follow a threefold approach that takes the characteristics of relational data stream management into account for the data quality metrics: (1) Query Metrics reflect changes in data quality due to query operations; (2) Content Metrics allow the semantic evaluation of data in the streams; and (3) Application Metrics allow easy user-defined computation of data quality values to account for application specifics. Additionally, a quality monitor allows observing data quality values and taking counteractions to balance data quality and performance. The framework has been designed along a data quality management methodology suited to data streams and has been evaluated in the domains of road traffic applications and health monitoring.
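As a sketch of what a stream quality monitor might do, the Python fragment below checks a window-based completeness metric of the kind a Content Metric could define; the window size, threshold, and counteraction are invented for illustration and are not taken from the framework.

```python
# Sketch only: sliding-window completeness check with a threshold-triggered alert.
from collections import deque

class CompletenessMonitor:
    def __init__(self, window_size=5, threshold=0.8):
        self.window = deque(maxlen=window_size)   # sliding window of stream tuples
        self.threshold = threshold

    def observe(self, tuple_: dict) -> None:
        self.window.append(tuple_)
        q = self.completeness()
        if q < self.threshold:
            # counteraction placeholder: e.g. raise an alert or shed load to
            # rebalance quality against performance
            print(f"quality alert: window completeness {q:.2f} below threshold")

    def completeness(self) -> float:
        """Fraction of non-null fields across tuples in the current window."""
        cells = [v for t in self.window for v in t.values()]
        return sum(v is not None for v in cells) / len(cells) if cells else 1.0

monitor = CompletenessMonitor()
for speed in (50, 52, None, None, None):          # simulated road-traffic readings
    monitor.observe({"sensor": "A1", "speed": speed})
```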

Luzzu - A Methodology and Framework for Linked Data Quality Assessment

The increasing variety of Linked Data on the Web makes it challenging to determine the quality of this data and, subsequently, to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks for assessing Linked Data quality, their output is not suitable for machine consumption, so consumers can hardly compare and rank datasets by fitness for use. This paper describes a conceptual methodology for assessing Linked Datasets and Luzzu, a framework for Linked Data Quality Assessment. Luzzu is based on four major components: (1) an extensible interface for defining new quality metrics; (2) an interoperable, ontology-driven back-end for representing quality metadata and quality problems that can be reused within different semantic frameworks; (3) a scalable stream processor for data dumps and SPARQL endpoints; and (4) a customisable ranking algorithm that takes user-defined weights into account. We show that Luzzu scales linearly with the number of triples in a dataset. We also demonstrate the applicability of the Luzzu framework by evaluating and analysing a number of statistical datasets against a variety of metrics. This article contributes towards the definition of a holistic data quality lifecycle, in terms of the co-evolution of linked datasets, with the final aim of improving their quality.
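Luzzu's metric plug-ins are written in Java; purely as a hypothetical analogue of component (1), the Python sketch below shows the general shape of a streaming metric: an object that consumes triples one at a time and reports a value afterwards. The metric itself and its names are invented.

```python
# Hypothetical analogue of a pluggable, stream-oriented quality metric.

class DereferenceableSubjectsMetric:
    """Toy metric: fraction of subjects that are HTTP(S) IRIs."""
    def __init__(self):
        self.total = 0
        self.http_subjects = 0

    def compute(self, triple):
        """Consume one (subject, predicate, object) triple from the stream."""
        subject, _predicate, _object = triple
        self.total += 1
        if subject.startswith(("http://", "https://")):
            self.http_subjects += 1

    def metric_value(self):
        """Report the aggregated metric after the stream has been processed."""
        return self.http_subjects / self.total if self.total else 0.0

metric = DereferenceableSubjectsMetric()
for t in [("http://example.org/a", "p", "o"), ("_:blank1", "p", "o")]:
    metric.compute(t)
print(metric.metric_value())   # 0.5
```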

From Content to Context: The Evolution and Growth of Data Quality Research

Research in data and information quality has made significant strides over the last twenty years. It has become a unified body of knowledge incorporating techniques, methods, and applications from a variety of disciplines, including information systems, computer science, operations management, organizational behavior, psychology, and statistics. With organizations viewing Big Data, social media data, data-driven decision-making, and analytics as critical, data quality has never been more important. We believe that data quality research is reaching the threshold of significant growth and a metamorphosis from a focus on measuring and assessing data quality (content) towards a focus on usage and context. At this stage, it is vital to understand the identity of this research area in order to recognize its current state and to effectively identify the growing number of research opportunities within it. Using Latent Semantic Analysis (LSA) to analyze the abstracts of 972 peer-reviewed journal and conference articles published over the past 20 years, this paper contributes by identifying the core topics and themes that define the identity of data quality research. It further explores their trends over time, pointing to the data quality dimensions that have, and have not, been well studied, and offering insights into topics that may provide significant opportunities in this area.
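As a rough sketch of an LSA pipeline of this kind (not the authors' code, and using a stand-in corpus of a few fake abstracts), scikit-learn's TF-IDF vectorizer and truncated SVD can extract latent topics and their top terms:

```python
# Illustrative LSA sketch: TF-IDF over abstracts, truncated SVD for latent topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

abstracts = [
    "data quality assessment of open government data portals",
    "record linkage and duplicate detection for dirty databases",
    "measuring completeness and accuracy dimensions of data quality",
    "probabilistic cleaning of attribute values in noisy databases",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

svd = TruncatedSVD(n_components=2, random_state=0)   # 2 latent "topics"
svd.fit(X)

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]              # 4 highest-loading terms
    print(f"topic {i}:", [terms[j] for j in top])
```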
