Healthcare is evolving towards patient-centered care, and Shared Decision Making (SDM) holds great promise to improve health, reduce costs, and better align care with patients' values. Information quality is key to empowering patients to make informed decisions. However, progress on shared decision making is impeded by several unresolved information quality challenges. In this paper we identify three key challenges that we believe need to be addressed to better facilitate SDM: consistency and reconciliation, optimizing the timeliness-accuracy tradeoff, and integrating decision aids. We call on the information quality community to begin addressing these challenges to support the ongoing transition of healthcare to SDM.
Wireless sensor networks are widely applied in data collection applications, where energy efficiency is one of the most important design goals. In this paper, we propose QAAC, Quality-Assured Adaptive data Compression, which reduces the amount of data communication in order to save energy. QAAC first builds clusters from the dataset using an adaptive clustering algorithm; a code for each cluster is then generated and stored in a Huffman encoding tree, which is used by an encoding algorithm, together with an improvement approach, to encode the original dataset. Once the encoded data, the Huffman encoding tree, and the parameters used by the improvement algorithm have been received at the sink, a decompression algorithm recovers an approximation of the original dataset. The performance evaluation shows that QAAC is efficient, achieving a much higher compression ratio than the lossy and lossless compression algorithms it is compared against, and much less information loss than the compared lossy compression algorithms.
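As a concrete illustration of the cluster-then-encode idea (a minimal sketch, not the authors' QAAC implementation), the code below groups 1-D sensor readings within an error bound, Huffman-encodes the sequence of cluster indices, and reconstructs each reading as its cluster centroid at the sink; the clustering rule, error bound, and sample data are assumptions made for illustration.

```python
# Illustrative sketch only: cluster readings within an error bound, Huffman-encode
# the cluster indices, and decode each reading back to its cluster centroid.
import heapq
from collections import Counter
from itertools import count

def adaptive_clusters(readings, max_error):
    """Greedily group consecutive readings whose spread stays within max_error,
    so every reading lies within max_error of its cluster centroid (the mean)."""
    centroids, labels, current = [], [], [readings[0]]
    for x in readings[1:]:
        if max(current + [x]) - min(current + [x]) <= max_error:
            current.append(x)
        else:
            centroids.append(sum(current) / len(current))
            labels.extend([len(centroids) - 1] * len(current))
            current = [x]
    centroids.append(sum(current) / len(current))
    labels.extend([len(centroids) - 1] * len(current))
    return centroids, labels

def huffman_codes(symbols):
    """Build a prefix code from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    tie = count()                           # tie-breaker keeps heap tuples comparable
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

readings = [20.1, 20.2, 20.15, 25.0, 25.1, 20.2, 20.1, 30.4]
centroids, labels = adaptive_clusters(readings, max_error=0.5)
codes = huffman_codes(labels)
bitstream = "".join(codes[l] for l in labels)      # what would be transmitted
decode = {v: k for k, v in codes.items()}          # sink rebuilds the code table
approx, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in decode:
        approx.append(centroids[decode[buf]])
        buf = ""
print(approx)   # lossy reconstruction, each value within max_error of the original
```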
Spam on Online Social Networks (OSNs) has attracted booming interest in the last few years. Following the rise of these platforms and their establishment as a ubiquitous part of online life, spammers have found in them an opportunity to run a lucrative business. A major part of the literature that aims to detect spammers on OSNs uses the supervised learning model as the foundation of its contributions. This model assumes that entities can be classified based on their statistical characteristics. A vital condition for the successful implementation of this model is to ensure that data is collected and labeled in a clean, accurate, and unbiased way, resulting in high-quality datasets. In this paper, we discuss the different steps of the supervised classification methodology applied to social spam detection: data collection, labeling, transformation, and sharing. From this, various issues arise relating to collection bias, inaccurate and irreproducible labeling, obscure provenance of adjunct datasets (such as blacklists and spam dictionaries), imprecise description of feature extraction and data transformation, and, finally, complete or partial unavailability of the raw and final datasets used to build statistical decision models.
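For readers unfamiliar with the workflow, the following minimal sketch (with hypothetical account features, labels, and data, not the datasets discussed in the paper) shows the collection, labeling, transformation, and model-building steps at which these quality issues can enter.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 1. Collection and labeling: accounts described by statistical features and a
#    manually assigned label (1 = spammer, 0 = legitimate). The feature names
#    are assumptions made for this sketch.
X = [
    [120, 0.90, 15],   # [posts_per_day, url_ratio, followers]
    [3,   0.05, 300],
    [200, 0.95, 4],
    [8,   0.10, 250],
    [150, 0.80, 10],
    [5,   0.00, 180],
]
y = [1, 0, 1, 0, 1, 0]

# 2. Transformation and model building: any sampling bias or labeling noise
#    above is baked into the classifier here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 3. Evaluation: the reported scores are only as trustworthy as the labels and
#    the collection process behind them.
print(classification_report(y_test, clf.predict(X_test)))
```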
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular issue is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: in database curation, where detected duplicates are removed to improve curation efficiency; and in database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given the high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular the use of simple similarity thresholds. This heuristic introduces a trade-off between efficiency and accuracy that we explore in this paper: if the similarity threshold is very high, the methods are accurate but slow; if it is too low, the methods are fast but inaccurate. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, demonstrating the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users on more effective use of the sequence clustering tools for de-duplication.
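To make the threshold heuristic concrete, the toy sketch below mimics the greedy incremental, threshold-based clustering strategy used by tools such as CD-HIT and UCLUST; the position-wise identity measure and the example sequences are simplifications introduced here for illustration, not the tools' actual algorithms.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment;
    a crude stand-in for the identity measures used by real tools)."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def greedy_cluster(sequences, threshold):
    """Longest-first pass: join the first representative above the threshold,
    otherwise start a new cluster. Higher thresholds give more, tighter clusters
    (accurate but slower de-duplication); lower thresholds give fewer, looser ones."""
    clusters = []                                    # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAAAA", "GGGSSLVPRG"]
for rep, members in greedy_cluster(seqs, threshold=0.9):
    print(rep, "->", members)
```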
In the last five years there has been a flurry of work on information extraction from clinical documents, i.e., on algorithms capable of extracting, from the informal and unstructured texts generated during everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature concerns methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate more and more accurate information extractors, no work has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data often derives from the fact that the person who annotated the data is different from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems as applied to the clinical domain. We do this by comparing the accuracy obtained from training data annotated by the authoritative coder (i.e., the one who has also annotated the test data, and by whose judgment we must abide) with the accuracy obtained from training data annotated by a different coder. The results indicate that, although the disagreement between the two coders (as measured on the training set) is substantial, the difference in accuracy is, surprisingly enough, not always statistically significant.
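As an illustration of how coder disagreement of the kind discussed above can be quantified, the sketch below computes Cohen's kappa over token-level annotations from two coders; the label set and annotations are hypothetical, not the paper's data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n           # observed agreement
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)      # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Token-level annotations by the authoritative coder (a) and a second coder (b).
a = ["O", "DRUG", "DRUG", "O", "DOSAGE", "O", "O",    "DRUG"]
b = ["O", "DRUG", "O",    "O", "DOSAGE", "O", "DRUG", "DRUG"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```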
Life Cycle Assessment (LCA) is a modeling approach that addresses the environmental aspects and potential environmental impacts (e.g., use of resources and the environmental consequences of releases) throughout a product's life cycle, from raw material acquisition through production, use, end-of-life treatment, recycling, and final disposal (i.e., cradle-to-grave). The LCA community faces a major challenge in its capacity to produce sufficient documentation and metadata to determine what LCA models represent and to reuse them correctly. This challenge is driven by two factors: the nascent state of standardization in LCA modeling, and the strong focus on research and publishing results for funded LCA work. The USDA's National Agricultural Library (NAL) is dedicated to data management, access, and preservation. Its mission enables it to focus on informatics challenges that others may not have the expertise, capacity, or funding to address. The NAL is contributing solutions to LCA's documentation challenge by synthesizing the most complete LCA formats into a balanced metadata structure. The NAL also publishes a repository of LCA research data at www.lcacommons.gov. Building the capacity to develop high-quality data, supported by comprehensive metadata and documentation, requires a community of LCA researchers and practitioners who are dedicated to following best practices and who appreciate the value realized through well-described datasets. As a government organization with a mission dedicated to providing access to quality data, the NAL will continue to develop and support this community of practice.
As the number of open data initiatives continues to increase, there is a growing recognition within the open data community of the need to shift from focusing solely on data publication to also considering issues such as data coverage, openness, and quality. Here we outline challenges related to the quality of open data, including: assisting data publishers with understanding and utilising quality dimensions and assessment methods, as well as with using the results of quality assessment; and exploring the sharing and reuse of quality metrics across datasets, tools, and publishers.
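As a sketch of what shareable, machine-readable quality metrics might look like (the dimensions, field names, and records below are assumptions made for illustration, not a standard), a publisher could compute simple measures such as completeness and staleness and publish them alongside the dataset.

```python
import json
from datetime import date

# Hypothetical open dataset of sensor stations with a missing value and
# varying update dates.
records = [
    {"station": "A1", "reading": 13.2, "updated": "2015-06-01"},
    {"station": "A2", "reading": None, "updated": "2014-01-10"},
    {"station": "A3", "reading": 12.7, "updated": "2015-05-20"},
]

# Completeness: share of records with a non-missing reading.
completeness = sum(r["reading"] is not None for r in records) / len(records)

# Staleness: age of the newest record relative to an assumed assessment date.
newest = max(date.fromisoformat(r["updated"]) for r in records)
staleness_days = (date(2015, 7, 1) - newest).days

# Publishing the metrics alongside the dataset makes them reusable by other
# tools and publishers.
print(json.dumps({"completeness": completeness, "staleness_days": staleness_days}, indent=2))
```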