We discuss challenges for enabling data quality across multiple data analytics contexts.
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given the high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular the use of simple similarity thresholds. This heuristic introduces a trade-off between efficiency and accuracy that we explore in this paper: if the similarity threshold is very high, the methods are accurate but slow; if the similarity threshold is too low, the methods are fast but inaccurate. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users on more effective use of the sequence clustering tools for de-duplication.
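To make the threshold heuristic concrete, the following minimal Python sketch mimics the greedy incremental clustering strategy on which tools such as CD-HIT and UCLUST are based; the similarity function and the example sequences are simplified assumptions, not the tools' actual k-mer-filtered alignment.

    from difflib import SequenceMatcher

    def similarity(a, b):
        """Crude sequence-identity proxy; real tools use k-mer filters and alignment."""
        return SequenceMatcher(None, a, b).ratio()

    def greedy_cluster(sequences, threshold=0.9):
        """Greedy incremental clustering: each sequence joins the first cluster
        whose representative it matches at or above the threshold; otherwise it
        founds a new cluster and becomes its representative."""
        clusters = []  # list of (representative, members)
        # Longest-first ordering, as in CD-HIT/UCLUST, so representatives cover members.
        for seq in sorted(sequences, key=len, reverse=True):
            for rep, members in clusters:
                if similarity(seq, rep) >= threshold:
                    members.append(seq)
                    break
            else:
                clusters.append((seq, [seq]))
        return clusters

    seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGSSLVPRG", "MKTAYIGKQR"]
    for rep, members in greedy_cluster(seqs, threshold=0.8):
        print(rep, members)

Under this greedy scheme, lowering the threshold lets sequences join looser clusters after fewer representative comparisons, which is precisely the efficiency/accuracy trade-off examined above.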
Data quality and especially the assessment of data quality have been intensively discussed in research and practice alike. To adequately support an economically oriented management of data quality and decision making under uncertainty, it is essential to assess the data quality level by means of well-founded metrics. However, if not adequately defined, these metrics can lead to wrong decisions and economic losses. Therefore, based on a decision-oriented framework, we present a set of five requirements for data quality metrics. If these requirements are met, the respective metric and its values are capable of supporting an economically oriented management of data quality and decision making under uncertainty. We further demonstrate the applicability and efficacy of these requirements by evaluating two well-known data quality metrics.
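As a hedged illustration of the kind of metric such requirements target, the sketch below computes a simple completeness metric as the share of non-missing values, normalized to [0, 1]; it is a textbook-style example, not one of the two metrics evaluated in the paper.

    def completeness(records, fields):
        """Share of non-missing values across the given fields, in [0, 1]."""
        total = len(records) * len(fields)
        if total == 0:
            return 1.0  # convention: an empty dataset is vacuously complete
        present = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
        return present / total

    customers = [
        {"name": "Ada", "email": "ada@example.org"},
        {"name": "Bob", "email": None},
    ]
    print(completeness(customers, ["name", "email"]))  # 0.75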
Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, and dimensional rules, and to define predicates for quality data specification. Query answering relies upon and triggers navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality. It allows for a logic-based and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.
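A minimal sketch of the dimensional-navigation idea, assuming a hypothetical Hospital dimension (Ward, Unit, Hospital) and a made-up certification predicate; the OMD model itself expresses such rules in Datalog+/- over an ontology rather than in plain Python.

    # Child -> parent edges of a hypothetical Hospital dimension: Ward -> Unit -> Hospital.
    ROLL_UP = {
        "W1": "Standard", "W2": "Standard", "W3": "Intensive",
        "Standard": "H1", "Intensive": "H1",
    }

    def ancestors(member):
        """Navigate the dimension hierarchy upward from a member."""
        while member in ROLL_UP:
            member = ROLL_UP[member]
            yield member

    # Dimensional rule (informally): a measurement counts as quality data
    # if its ward rolls up to a certified unit.
    CERTIFIED_UNITS = {"Intensive"}

    measurements = [("patient1", "W1", 37.2), ("patient2", "W3", 38.1)]
    quality_data = [m for m in measurements
                    if any(a in CERTIFIED_UNITS for a in ancestors(m[1]))]
    print(quality_data)  # only the measurement taken in ward W3 qualifies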
Data quality is gaining momentum among organizations as they realize that poor data quality may cause failures and/or inefficiencies, thus compromising business processes and application results. However, enterprises often adopt data quality assessment and improvement methods based on practical and empirical approaches, without conducting a rigorous analysis of the data quality issues and of the outcome of the enacted data quality improvement practices. In particular, data quality management, and especially the identification of the data quality dimensions to be monitored and improved, is left to knowledge workers on the basis of their skills and experience. Control methods are therefore designed around expected and evident quality problems, and thus may not be effective in dealing with unknown and/or unexpected problems. This paper aims to provide a methodology, based on fault injection, for validating the data quality actions used by organizations. We show how it is possible to check whether the adopted techniques properly monitor the real issues that may damage business processes. At this stage we focus on scoring processes, i.e., processes whose output represents the evaluation or ranking of a specific object. We show the effectiveness of our proposal by means of a case study in the financial risk management area.
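The following sketch illustrates the fault-injection idea under stated assumptions: the scoring function, the quality control, and the injected fault are all hypothetical stand-ins, not the processes studied in the case study.

    def score(record):
        """Hypothetical scoring process: a toy risk score."""
        return 0.6 * record["income"] / 1000 + 0.4 * (1 - record["debt_ratio"])

    def quality_check(record):
        """Hypothetical data quality control: only value ranges are monitored."""
        return record["income"] >= 0 and 0 <= record["debt_ratio"] <= 1

    def inject_fault(record):
        """Inject a subtle fault: drop a digit from income (still within range)."""
        faulty = dict(record)
        faulty["income"] //= 10
        return faulty

    clean = {"income": 42000, "debt_ratio": 0.3}
    faulty = inject_fault(clean)
    detected = not quality_check(faulty)
    impact = abs(score(faulty) - score(clean))
    print(f"fault detected: {detected}, score impact: {impact:.2f}")
    # Here the range check misses the injected fault even though the score
    # shifts markedly, signalling that the control should be revised.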
We present lessons learned related to data collection and analysis from over four years of experience with the eTextbook system OpenDSA. The use of such cyberlearning systems is expanding rapidly in both formal and informal educational settings. While the precise issues related to any such project are idiosyncratic, depending on the data collection technology and the goals of the project, certain types of data collection problems will be common. We first describe several problems that we encountered with syntactic-level data collection. We then discuss fundamental issues with relating events to users and tracking users over time, both of which are prerequisites to converting syntactic-level interaction streams into the semantic-level behavior needed for higher-order analysis of the data. We then present examples of such behavior-level analysis, which in turn led to changes in the OpenDSA system needed to replace undesirable learning behavior with more productive behavior.
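A minimal sketch of the syntactic-to-semantic step, assuming hypothetical event fields and a made-up behavior heuristic rather than OpenDSA's actual logging schema.

    from collections import defaultdict

    # Hypothetical syntactic-level events: (user_id, exercise_id, timestamp_s, action).
    events = [
        ("u1", "quicksort-quiz", 100, "attempt"),
        ("u1", "quicksort-quiz", 103, "attempt"),
        ("u1", "quicksort-quiz", 105, "attempt"),
        ("u2", "quicksort-quiz", 200, "attempt"),
        ("u2", "quicksort-quiz", 260, "attempt"),
    ]

    def to_behaviors(events, rapid_gap_s=10):
        """Lift an event stream to a per-user, per-exercise behavior label."""
        attempts = defaultdict(list)
        for user, exercise, ts, action in events:
            if action == "attempt":  # assumes events are already attributed to users
                attempts[(user, exercise)].append(ts)
        behaviors = {}
        for key, times in attempts.items():
            times.sort()
            gaps = [b - a for a, b in zip(times, times[1:])]
            rapid = bool(gaps) and all(g <= rapid_gap_s for g in gaps)
            behaviors[key] = "rapid guessing" if rapid else "deliberate practice"
        return behaviors

    print(to_behaviors(events))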