This paper describes an approach for automated ingestion of biomedical data dictionaries. Automated ingestion, or reading, is the process of extracting the details of each data element from a data dictionary in a document format (such as PDF) into a fully structured format. The structured format is essential if the data dictionary metadata is to be used in applications such as data integration, and also in evaluating the quality of the associated data. We present a machine-learning classification solution to this problem that uses conditional random field (CRF) classifiers and leverages multiple text- and character-based features of the text rows in the document. An evaluation on several real data dictionary documents demonstrates the effectiveness of our approach.
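To make the kind of row-level sequence labelling described above more concrete, the sketch below shows a minimal CRF over text rows using the sklearn-crfsuite library; the label set, feature functions, and training rows are invented for illustration and are not the authors' actual implementation.

```python
# Minimal sketch of CRF-based row classification for data dictionary documents.
# Assumes sklearn-crfsuite is installed; labels and features are illustrative only.
import sklearn_crfsuite

def row_features(rows, i):
    """Simple text/character features for one text row of a document."""
    row = rows[i]
    feats = {
        "lower_prefix": row.lower()[:20],
        "n_tokens": len(row.split()),
        "has_digits": any(c.isdigit() for c in row),
        "upper_ratio": sum(c.isupper() for c in row) / max(len(row), 1),
    }
    if i > 0:  # context from the previous row, as a CRF sequence feature
        feats["prev_lower_prefix"] = rows[i - 1].lower()[:20]
    return feats

def doc_to_features(rows):
    return [row_features(rows, i) for i in range(len(rows))]

# Tiny hypothetical training set: each document is a list of rows plus one label per row.
train_docs = [
    (["ELEMENT: AGE", "Description: patient age in years", "Page 3"],
     ["ELEMENT_NAME", "DESCRIPTION", "OTHER"]),
    (["ELEMENT: SEX", "Description: sex recorded at enrollment", "Page 4"],
     ["ELEMENT_NAME", "DESCRIPTION", "OTHER"]),
]
X = [doc_to_features(rows) for rows, _ in train_docs]
y = [labels for _, labels in train_docs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
# crf.predict([doc_to_features(new_rows)]) yields one predicted label per row.
```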
Queue mining is a novel research area of data mining that learns queueing models from data logs. These models are then used for performance prediction in queueing-oriented systems. Queue mining combines techniques from process mining, queueing theory, statistics, and optimization. This paper reviews challenges that stem from data quality issues in queue mining, as well as some existing solutions to these challenges.
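As a toy illustration of learning a queueing model from a log, the sketch below estimates arrival and service rates from timestamped records and predicts the expected wait under a strong M/M/1 assumption; the log fields are hypothetical, and real queue mining uses far richer models and must cope with the data quality issues the paper reviews.

```python
# Toy queue mining sketch: fit an M/M/1 model from a timestamped log and
# predict the expected waiting time. Log fields and values are hypothetical.
log = [
    # (arrival_time, service_start, service_end) in minutes
    (0.0, 0.0, 0.8),
    (1.5, 1.5, 2.3),
    (2.0, 2.3, 3.0),
    (4.2, 4.2, 5.1),
]

n = len(log)
observation_span = log[-1][0] - log[0][0]          # time covered by arrivals
lam = (n - 1) / observation_span                   # arrival rate (per minute)
mean_service = sum(end - start for _, start, end in log) / n
mu = 1.0 / mean_service                            # service rate (per minute)

rho = lam / mu                                     # utilization
if rho < 1:
    wq = rho / (mu - lam)                          # M/M/1 expected wait in queue
    print(f"lambda={lam:.2f}/min, mu={mu:.2f}/min, expected wait={wq:.2f} min")
else:
    print("Estimated utilization >= 1: unstable under the M/M/1 assumption")
```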
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: in database curation, where detected duplicates are removed to improve curation efficiency; and in database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given the high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular the use of simple similarity thresholds. This heuristic introduces a trade-off between efficiency and accuracy that we explore in this paper: if the similarity threshold is very high, the methods are accurate but slow; if the similarity threshold is too low, the methods are fast but inaccurate. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; the application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, demonstrating the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations that help users apply the sequence clustering tools more effectively for de-duplication.
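To illustrate the threshold-based heuristic these tools share, the sketch below implements a greedy incremental clustering over sequences: each sequence joins the first existing cluster whose representative is at least `threshold` similar, otherwise it founds a new cluster. The similarity function here is a crude positional identity assumed purely for illustration; CD-HIT and UCLUST use much faster word-based filters, but the efficiency/accuracy trade-off as the threshold drops is the same in spirit.

```python
# Greedy incremental clustering with a similarity threshold (illustrative only;
# tools such as CD-HIT and UCLUST use optimized word/k-mer filters instead).

def identity(a: str, b: str) -> float:
    """Crude similarity: fraction of matching positions over the shorter length."""
    m = min(len(a), len(b))
    if m == 0:
        return 0.0
    return sum(a[i] == b[i] for i in range(m)) / m

def greedy_cluster(sequences, threshold=0.9):
    clusters = []                           # each cluster: (representative, members)
    # Longest-first ordering, as in the usual incremental strategy.
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)         # treated as redundant with rep
                break
        else:
            clusters.append((seq, [seq]))   # seq founds a new cluster
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAYIGKQR", "GGSSLVHQWK"]
for rep, members in greedy_cluster(seqs, threshold=0.8):
    print(rep, "->", members)
```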
Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current LOD (Linked Open Data) cloud is. Measurements (and indexes) that involve more than two datasets are not available, although they are important: (a) for obtaining complete information about a set of entities (with provenance, to aid data cleaning and checking), (b) for aiding dataset discovery and selection, (c) for assessing the connectivity between any set of datasets, for checking quality, and for monitoring their evolution over time, among various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naïve way, in this paper we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and finally (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of the URIs of any set of datasets. To enhance scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion-triple-sized) LOD cloud that have not been carried out before.
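As a small illustration of the closure computation behind a sameAs catalog, the sketch below groups URIs into equivalence classes with union-find, so that any two URIs in the same class are treated as referring to the same entity; the URIs are invented examples, and this is not the paper's actual catalog construction algorithm.

```python
# Minimal sketch: compute the symmetric/transitive closure of owl:sameAs pairs
# with union-find, grouping URIs into equivalence classes. Example URIs only.
parent = {}

def find(u):
    parent.setdefault(u, u)
    while parent[u] != u:
        parent[u] = parent[parent[u]]   # path halving keeps trees shallow
        u = parent[u]
    return u

def union(u, v):
    ru, rv = find(u), find(v)
    if ru != rv:
        parent[ru] = rv

same_as_pairs = [
    ("http://example.org/datasetA/Crete", "http://example.org/datasetB/island42"),
    ("http://example.org/datasetB/island42", "http://example.org/datasetC/Kriti"),
]
for u, v in same_as_pairs:
    union(u, v)

# Collect the equivalence classes (the entries of a sameAs catalog).
classes = {}
for u in parent:
    classes.setdefault(find(u), set()).add(u)
print(classes)
```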
Among the different aspects of Knowledge Bases, data quality is one of the most relevant for realizing the benefit of such information. Knowledge Base quality assessment poses a number of big data challenges, such as high volume, variety, velocity, and veracity. In this paper we focus on questions related to assessing the veracity of facts through Deep Fact Validation (DeFacto), a fact-checking framework designed to validate facts in RDF Knowledge Bases. Despite current developments in this research area, the underlying framework for such a task still faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of the DeFacto pipeline, aiming to reduce error propagation through its components. As a result of this exploratory analysis, we give insights toward an enhanced architecture that is better able to execute the complex task of fact checking, moving towards a better engineering of DeFacto.
Editorial: Special Issue on Improving the Veracity and Value of Big Data
Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, and dimensional rules, and to define predicates for quality data specification. Query answering relies upon, and triggers, navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality: it allows for a logic-based and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionality.
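To give a flavour of how navigation through a dimension hierarchy can drive quality data extraction, the sketch below models a single hypothetical hierarchy and a quality predicate that keeps only records whose dimension member rolls up to a required category member; the hierarchy, member names, and records are invented for illustration and do not reproduce the paper's Datalog+/- formalization.

```python
# Hypothetical sketch: one dimension hierarchy and a quality predicate that
# keeps only records whose dimension member rolls up to a required member.
# Members, the roll-up relation, and the records are invented for illustration.

# child member -> parent member (e.g. a unit/ward/hospital-style hierarchy)
roll_up = {
    "unit-A": "ward-1",
    "unit-B": "ward-1",
    "unit-C": "ward-2",
    "ward-1": "hospital-H1",
    "ward-2": "hospital-H1",
}

def rolls_up_to(member, ancestor):
    """Navigate the hierarchy upwards; True if `ancestor` is reachable."""
    while member is not None:
        if member == ancestor:
            return True
        member = roll_up.get(member)
    return False

# Records under assessment, already mapped into the dimensional context.
records = [
    {"id": "r1", "taken_at": "unit-A"},
    {"id": "r2", "taken_at": "unit-C"},
]

# Quality predicate: keep records originating anywhere inside ward-1.
quality_data = [r for r in records if rolls_up_to(r["taken_at"], "ward-1")]
print(quality_data)   # only r1 qualifies
```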
Healthcare organizations increasingly rely on electronic information to optimize their operations. Highly diverse information from various sources accentuates the relevance and importance of information quality (IQ). The quality of information needs to be improved to support more efficient and reliable utilization of healthcare information systems (IS). This can only be achieved through the implementation of initiatives followed by most users across an organization. The purpose of this study is to examine how IS users' awareness of IQ issues affects their actual practices toward IQ initiatives. Shaped by awareness of the beneficial and problematic situations generated by IQ practices, users' motivation is found to influence their IQ-related behavior. In addition, social influences and facilitating conditions moderate the relationship between user intention and actual practice. The theoretical and practical implications of the findings are discussed, especially IQ best practices in healthcare settings.