In the last five years there has been a flurry of work on information extraction from clinical documents, i.e., on algorithms that extract, from the informal and unstructured texts generated during everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature concerns methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate increasingly accurate information extractors, no work has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data often derives from the fact that the person who annotated the data is different from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems as applied to the clinical domain. We do this by comparing the accuracy obtained from training data annotated by the authoritative coder (i.e., the one who has also annotated the test data, and by whose judgment we must abide) with the accuracy obtained from training data annotated by a different coder. The results indicate that, although the disagreement between the two coders (as measured on the training set) is substantial, the difference in accuracy is, surprisingly enough, not always statistically significant.
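The inter-coder disagreement mentioned in this abstract is conventionally quantified with a chance-corrected agreement coefficient such as Cohen's kappa. A minimal sketch (the label set and the two annotation sequences are illustrative, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders' label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the coders label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical token-level annotations of the same six tokens by two coders.
coder1 = ["Drug", "O", "Dosage", "O", "Drug", "O"]
coder2 = ["Drug", "O", "O",      "O", "Drug", "Dosage"]
print(round(cohens_kappa(coder1, coder2), 3))
```

A kappa well below 1.0 on the training set, as in this toy run, is what the abstract means by "substantial" disagreement between coders.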
Life Cycle Assessment (LCA) is a modeling approach to address the environmental aspects and potential environmental impacts (e.g., use of resources and the environmental consequences of releases) throughout a product's life cycle, from raw material acquisition through production, use, end-of-life treatment, recycling, and final disposal (i.e., cradle-to-grave). The LCA community is faced with a major challenge in its capacity to produce sufficient documentation and metadata to determine the representativeness of LCA models and to reuse them correctly. This challenge in capacity is driven by two factors: the nascent state of standardization in LCA modeling and the strong focus on research and publishing results for funded LCA work. The USDA's National Agricultural Library (NAL) is dedicated to data management, access, and preservation. Its mission enables it to focus on informatics-related challenges that others may not have the expertise, capacity, or funding to address. The NAL is contributing solutions to LCA's documentation challenge by implementing a synthesis of the most complete LCA formats into a balanced metadata structure. The NAL also publishes a repository of LCA research data at www.lcacommons.gov. Building capacity to develop high-quality data, supported by comprehensive metadata and documentation, requires a community of LCA researchers and practitioners that is dedicated to following best practices and that appreciates the value realized through well-described datasets. As a government organization with a mission dedicated to providing access to quality data, the NAL will continue to develop and support this community of practice.
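To make the documentation challenge concrete, the sketch below shows a minimal metadata record for an LCA dataset together with a completeness check. The field names are hypothetical illustrations, not the NAL's actual metadata structure:

```python
# Hypothetical required fields for publishing an LCA dataset; these are
# illustrative and do not reflect the NAL's actual schema.
REQUIRED_FIELDS = [
    "title", "process_name", "geography", "time_period",
    "functional_unit", "data_sources", "license",
]

def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

# An example record for a fictitious dataset, with one field still missing.
record = {
    "title": "Corn grain production, US Midwest",
    "process_name": "corn grain, at farm",
    "geography": "US-Midwest",
    "time_period": "2010-2015",
    "functional_unit": "1 kg corn grain",
    "data_sources": ["farm survey data"],
    # "license" intentionally omitted
}
print(missing_metadata(record))
```

A check of this kind is one way a repository can flag datasets whose documentation is too thin to judge representativeness or support correct reuse.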
As the number of open data initiatives continues to increase, there is a growing recognition within the open data community of a need to shift from focusing on data publication to also considering issues such as data coverage, openness, and quality. Here we outline challenges related to the quality of open data, including: assisting data publishers with understanding and utilising quality dimensions and assessment methods, as well as the results of quality assessment; and exploring the sharing and reuse of quality metrics across datasets, tools, and publishers.
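One quality dimension the community discusses, completeness, can be computed and shared as a machine-readable metric alongside a dataset. A minimal sketch, where the CSV sample and the metric's JSON shape are illustrative assumptions:

```python
import csv
import io
import json

def completeness(rows):
    """Fraction of non-empty cells: one simple, shareable quality dimension."""
    cells = [cell for row in rows for cell in row]
    return sum(1 for cell in cells if cell.strip() != "") / len(cells)

# A toy open dataset with one missing value.
sample = "city,population\nBerlin,3644826\nHamburg,\n"
rows = list(csv.reader(io.StringIO(sample)))[1:]  # skip the header row

# Publish the score in a machine-readable form so tools and other
# publishers can reuse it (the JSON shape here is hypothetical).
metric = {"dimension": "completeness", "score": completeness(rows)}
print(json.dumps(metric))
```

Emitting metrics in a shared format like this is one route to the cross-dataset, cross-tool reuse of quality assessments that the abstract calls for.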