ACM Journal of Data and Information Quality (JDIQ)

Latest Articles

Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook

Evaluation in empirical computer science is essential to show progress and to assess the technologies developed. Several research domains such as information retrieval have long relied on systematic...

Anserini: Reproducible Ranking Baselines Using Lucene

This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing systems, they are primarily designed to facilitate the publication of research papers. As such, these systems...

Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment

The evolution of web pages from static HTML pages toward dynamic pieces of software has rendered archiving them increasingly difficult. Nevertheless, an accurate, reproducible web archive is a necessity to ensure the reproducibility of web-based research. Archiving web pages reproducibly, however, is currently not part of best practices for web...

To Clean or Not to Clean: Document Preprocessing and Reproducibility

Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information...
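
As a purely illustrative sketch of this concern (not code from the article), the snippet below shows how two common ways of stripping markup from the same HTML fragment yield different token streams, so an unstated cleaning step can change what a retrieval system indexes:

import re
from html.parser import HTMLParser

# Collect the text nodes emitted by Python's standard HTML parser.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

snippet = '<p>Jaguar <a href="/wiki/Car">cars</a> &amp; jaguar cats</p>'

# Method 1: naive regex tag stripping (leaves the HTML entity untouched).
regex_text = re.sub(r"<[^>]+>", " ", snippet)

# Method 2: a real HTML parser (decodes "&amp;" to "&" by default).
parser = TextExtractor()
parser.feed(snippet)
parser_text = " ".join(parser.parts)

print(regex_text.split())   # ['Jaguar', 'cars', '&amp;', 'jaguar', 'cats']
print(parser_text.split())  # ['Jaguar', 'cars', '&', 'jaguar', 'cats']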

NEWS

October 2018 - Call for papers:

Special Issue on Quality Assessment of Knowledge Graphs
Initial submission deadline: 3 March 2019

Other news:

Special Issue on Combating Digital Misinformation and Disinformation
Status: Review in progress

Special Issue on Reproducibility in Information Retrieval
Two-part special issue:
- Evaluation Campaigns, Collections and Analyses (Vol. 10, Issue 3, Oct. 2018)
- Tools and Infrastructures (Vol. 10, Issue 4, Oct. 2018)

"On the Horizon" challenge papers

From 2019, JDIQ will accept a new type of contribution called "On the Horizon". These manuscripts, which can be submitted by invitation only, will be written by top researchers in the field of Data Quality. Their aim is to introduce rising topics in the field of Data Quality, discussing why they are emerging, their challenging aspects, and the envisioned solutions.

Improving Classification Quality in Uncertain Graphs

In many real applications that use and analyze networked data, the links in the network graph may be erroneous or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty present in the graph structure. We study the novel problem of node classification in uncertain graphs, treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that incorporating uncertainty into the classification process in this way is beneficial. We experimentally evaluate the proposed approach using different real data sets and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
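
As a minimal, hypothetical sketch of the general idea (not the authors' Bayes-based algorithm), the label-propagation routine below treats each edge's existence probability as the weight of its neighbour's vote, so link uncertainty directly influences the predicted class:

from collections import defaultdict

def classify_uncertain_graph(edges, seed_labels, iterations=10):
    """edges: list of (u, v, prob); seed_labels: dict node -> known label."""
    # Build a weighted adjacency list from the probabilistic edges.
    adj = defaultdict(list)
    for u, v, p in edges:
        adj[u].append((v, p))
        adj[v].append((u, p))

    current = dict(seed_labels)
    for _ in range(iterations):
        updated = dict(current)
        for node, neighbours in adj.items():
            if node in seed_labels:            # keep labelled seeds fixed
                continue
            votes = defaultdict(float)
            for nbr, p in neighbours:
                if nbr in current:
                    votes[current[nbr]] += p   # weight each vote by edge probability
            if votes:
                updated[node] = max(votes, key=votes.get)
        current = updated
    return current

# Toy example: node "c" is connected to both seeds by uncertain edges.
edges = [("a", "b", 0.9), ("b", "c", 0.4), ("a", "c", 0.7)]
print(classify_uncertain_graph(edges, {"a": "spam", "b": "ham"}))
# {'a': 'spam', 'b': 'ham', 'c': 'spam'}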

Financial Regulatory and Risk Management Challenges Stemming From Firm-Specific Digital Misinformation

Challenge paper (no abstract). Excerpt: Financial markets respond to information. That information can be accurate or inaccurate (misinformation), but investors make rapid buy and sell decisions and often act before verifying authenticity. The challenge for data and information quality researchers is to develop tools to detect fraud early, and to develop strategies or decision rules that regulators can use to determine whether to suspend trading.

Discovering Patterns for Fact Checking in Knowledge Graphs

This paper studies a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. (1) We propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. (2) We show that it is feasible to discover GFCs in large graphs with optimality guarantees. (a) We develop an algorithm that performs localized search to generate a stream of graph patterns and dynamically assembles the best GFCs from multiple GFC sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a (1/2 - µ)-approximation when it (early) terminates. (b) We also develop a space-efficient alternative that dynamically spawns prioritized patterns with the best marginal gains to verified GFCs. It guarantees a (1 - 1/e)-approximation. Both strategies guarantee a bounded time cost independent of the size of the underlying graph. (3) To support fact checking, we develop two classifiers, which use top-ranked GFCs as predictive rules or instance-level features of the pattern matches induced by GFCs, respectively. Using real-world data, we experimentally verify the efficiency and effectiveness of GFC-based techniques for fact checking in knowledge graphs, and verify their applicability in knowledge exploration and news prediction.
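
As a hypothetical sketch of the selection principle behind such (1 - 1/e)-style guarantees (not the paper's GFC mining algorithm), the greedy routine below repeatedly picks the candidate rule with the largest marginal gain in newly covered facts:

def greedy_select(candidates, k):
    """candidates: dict rule_id -> set of fact ids the rule covers."""
    selected, covered = [], set()
    for _ in range(min(k, len(candidates))):
        # Marginal gain = facts covered beyond what is already covered.
        best = max(candidates, key=lambda r: len(candidates[r] - covered))
        gain = len(candidates[best] - covered)
        if gain == 0:
            break
        selected.append(best)
        covered |= candidates[best]
        del candidates[best]
    return selected, covered

rules = {
    "r1": {1, 2, 3},
    "r2": {3, 4},
    "r3": {5},
}
print(greedy_select(dict(rules), k=2))   # picks r1 first, then a rule adding one new fact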

Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News

Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine-learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees with, disagrees with, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional GRUs together with neural attention for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined/compared, complemented with similarity features inspired by other FNC-1 approaches, and passed to a final layer that predicts the stance of the article towards the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods, in order to pre-train specific components of the neural network architecture (e.g., the GRUs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations with similarity features, slightly outperforms the previous state of the art.
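
The following PyTorch sketch, with made-up dimensions and mean-pooling standing in for the attention layer, only illustrates the general architecture described above (bi-directional GRU encoders whose outputs are concatenated with similarity features before a four-way stance classifier); it is not the authors' model:

import torch
import torch.nn as nn

class StanceClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=64, n_sim_feats=5, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.headline_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.article_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        # 2 encoders * 2 directions * hidden units, plus the similarity features.
        self.classifier = nn.Linear(4 * hidden + n_sim_feats, n_classes)

    def forward(self, headline_ids, article_ids, sim_feats):
        h_out, _ = self.headline_enc(self.embed(headline_ids))
        a_out, _ = self.article_enc(self.embed(article_ids))
        # Mean-pool over time as a simple stand-in for neural attention.
        h_vec = h_out.mean(dim=1)
        a_vec = a_out.mean(dim=1)
        combined = torch.cat([h_vec, a_vec, sim_feats], dim=1)
        return self.classifier(combined)   # logits over agree/disagree/discuss/unrelated

# Toy usage: batch of 2, headline length 6, article length 40, 5 similarity features.
model = StanceClassifier(vocab_size=10000)
logits = model(torch.randint(1, 10000, (2, 6)),
               torch.randint(1, 10000, (2, 40)),
               torch.rand(2, 5))
print(logits.shape)   # torch.Size([2, 4])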

Crowd-sourced Targeted Feedback Collection for Multi-Criteria Data Source Selection

A multi-criteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users' needs. These needs are expressed as several criteria, which are used to evaluate the candidate data sources. An MCSS problem can be solved using multi-dimensional optimisation techniques that trade off the different objectives. Sometimes one may have uncertain knowledge regarding how well the candidate data sources meet the criteria. To overcome this uncertainty, one may rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. In this paper, a Targeted Feedback Collection (TFC) approach is introduced that aims to identify the data items on which feedback should be collected, thereby providing evidence on how well the sources satisfy the required criteria. TFC targets feedback by considering the confidence intervals around the estimated criteria values, with a view to increasing the confidence in the estimates that are most relevant to the multi-dimensional optimisation. Variants of the TFC approach have been developed for use where feedback is expected to be reliable (e.g., where it is provided by trusted experts) and where feedback is expected to be unreliable (e.g., from crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments that involve real-world data sets and crowdsourcing.
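
As a hypothetical sketch of the confidence-interval-driven targeting idea (not the paper's full TFC algorithm), the routine below estimates a proportion per source from the feedback collected so far and requests the next annotations where the interval is widest, i.e., where extra feedback reduces uncertainty the most:

import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    if n == 0:
        return (0.0, 1.0)                       # no feedback yet: maximal uncertainty
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - half), min(1.0, p + half))

def next_target(feedback):
    """feedback: dict source -> (positive_annotations, total_annotations)."""
    widths = {}
    for source, (pos, n) in feedback.items():
        lo, hi = wald_interval(pos, n)
        widths[source] = hi - lo
    return max(widths, key=widths.get)          # source with the widest interval

feedback = {"src_A": (8, 10), "src_B": (3, 4), "src_C": (0, 0)}
print(next_target(feedback))                    # "src_C": no evidence collected yet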

Ensuring High-Quality Private Data for Responsible Data Science: Vision and Challenges

High-quality data is critical for effective data science. With the increasing use of data science in all modern endeavors, concern about irresponsible data use has led to a push for responsible data science, to ensure that the insights gained do not come at the steep cost of violating the privacy of individuals. This has led to the development of data protection regulations around the globe and to the use of sophisticated anonymization techniques to protect privacy. Such measures make it harder for the data scientist to understand the data, exacerbating the issue of ensuring high data quality. In this paper, we pose the high-level problem: how can a data scientist develop the needed trust that private data has high quality? We then identify a series of challenges for various data-centric communities and outline research questions for data quality and privacy researchers that would need to be addressed to effectively answer this problem.

