This paper describes an approach for the automated ingestion of biomedical data dictionaries. Automated ingestion, or reading, is the process of extracting the details of each data element from a data dictionary in a document format (such as PDF) into a fully structured format. The structured format is essential if the data dictionary metadata is to be used in applications such as data integration, as well as in evaluating the quality of the associated data. We present a machine-learning classification solution to the problem using conditional random field (CRF) classifiers and leveraging multiple text- and character-based features of text rows in the document. We present an evaluation on several real data dictionary documents demonstrating the effectiveness of our approach.
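To make the row-classification idea concrete, below is a minimal sketch using the sklearn-crfsuite package; the feature names, labels, and example rows are illustrative assumptions, not the paper's actual feature set.

```python
# Minimal sketch of CRF-based row classification for data dictionary pages,
# assuming each document is a sequence of text rows labeled e.g.
# ELEMENT_NAME, DESCRIPTION. Features and labels are invented.
import sklearn_crfsuite

def row_features(rows, i):
    """Text- and character-based features for one row; names are illustrative."""
    row = rows[i]
    feats = {
        "lower": row.lower(),
        "n_tokens": len(row.split()),
        "digit_ratio": sum(c.isdigit() for c in row) / max(len(row), 1),
        "upper_ratio": sum(c.isupper() for c in row) / max(len(row), 1),
        "ends_with_colon": row.rstrip().endswith(":"),
    }
    if i > 0:  # context from the previous row helps the CRF model transitions
        feats["prev_n_tokens"] = len(rows[i - 1].split())
    return feats

def doc_to_features(rows):
    return [row_features(rows, i) for i in range(len(rows))]

# X_train: list of documents, each a list of per-row feature dicts;
# labels: matching lists of per-row labels.
docs = [["Element: patient_id", "Unique identifier for the patient."]]
labels = [["ELEMENT_NAME", "DESCRIPTION"]]
X_train = [doc_to_features(d) for d in docs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, labels)
print(crf.predict(X_train))
```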
Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across attribute values: attributes such as a patient's country or city of residence may be less sensitive than the patient's prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering these differing levels of information sensitivity. In this work, we take the first steps towards a data cleaning model that integrates privacy into the data cleaning process. We present a privacy-aware, constraint-based data cleaning framework that differentiates the information content among attribute values during cleaning to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that consider the information content of the underlying attribute values while maximizing data utility. Our evaluation on real datasets shows that our algorithm scales well, and achieves improved performance and repair accuracy over existing differentially private data cleaning solutions.
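The following toy sketch illustrates the general idea of sensitivity-weighted repair; the sensitivity weights, functional dependency, and repair rule are invented for illustration and are not the paper's algorithm.

```python
# Toy sketch: to resolve a functional-dependency violation, change the
# attribute whose disclosure cost, weighted by per-attribute sensitivity,
# is smallest. Weights and the FD city -> medication are invented.
SENSITIVITY = {"city": 1, "medication": 5}  # higher = more sensitive

records = [
    {"city": "Berlin", "medication": "drugA"},
    {"city": "Berlin", "medication": "drugB"},  # violates city -> medication
]

def violations(recs, lhs, rhs):
    seen = {}
    for i, r in enumerate(recs):
        key = r[lhs]
        if key in seen and recs[seen[key]][rhs] != r[rhs]:
            yield seen[key], i
        else:
            seen.setdefault(key, i)

def repair(recs, lhs, rhs):
    # For each violating pair, alter the less sensitive side, i.e.
    # disclose or modify as little sensitive information as possible.
    for i, j in list(violations(recs, lhs, rhs)):
        if SENSITIVITY[rhs] <= SENSITIVITY[lhs]:
            recs[j][rhs] = recs[i][rhs]
        else:
            recs[j][lhs] = recs[j][lhs] + "*"  # generalize the cheaper value

repair(records, "city", "medication")
print(records)
```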
The paper discusses challenges related to the design of a framework for real-time, adaptive, cost-effective collection of high-quality data for critical infrastructure and emergency management. A key objective of the framework is the ability to adaptively collect data based on the capabilities of available data collection technologies, communication capabilities, temporal deadlines, required classification/prediction accuracy, and relevant data quality requirements.
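As a rough illustration of what such adaptive source selection could look like, here is a hypothetical sketch; the technologies, latencies, accuracies, and costs are all invented for illustration.

```python
# Hypothetical adaptive selection of data collection technologies under
# a temporal deadline and an accuracy requirement; values are invented.
sources = [
    {"name": "sensor_net", "latency_s": 5, "accuracy": 0.90, "cost": 3.0},
    {"name": "satellite", "latency_s": 600, "accuracy": 0.97, "cost": 9.0},
    {"name": "crowd_reports", "latency_s": 60, "accuracy": 0.75, "cost": 0.5},
]

def select_sources(deadline_s, min_accuracy):
    # Keep only sources that can deliver before the deadline and meet the
    # required accuracy, then prefer the cheapest (most cost-effective).
    feasible = [s for s in sources
                if s["latency_s"] <= deadline_s and s["accuracy"] >= min_accuracy]
    return sorted(feasible, key=lambda s: s["cost"])

print(select_sources(deadline_s=120, min_accuracy=0.8))  # -> sensor_net only
```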
Queue mining is a novel research area of data mining that learns queueing models from data logs. These models are then used for performance prediction in queueing-oriented systems. Queue mining combines techniques from process mining, queueing theory, statistics, and optimization. This paper reviews challenges that stem from data quality issues in queue mining, as well as some existing solutions to these challenges.
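As a minimal illustration of the log-to-model step, the sketch below estimates arrival and service rates from a fabricated event log and predicts the mean sojourn time with the standard M/M/1 formula 1/(mu - lambda); real queue mining learns considerably richer models.

```python
# Estimate lambda (arrival rate) and mu (service rate) from a made-up log,
# then predict the mean sojourn time for a stable M/M/1 queue.
arrivals = [0.0, 1.5, 2.0, 4.0, 5.5]       # arrival timestamps (minutes)
service_times = [0.8, 0.7, 1.0, 0.6, 0.9]  # observed service durations

lam = (len(arrivals) - 1) / (arrivals[-1] - arrivals[0])  # arrivals per minute
mu = len(service_times) / sum(service_times)              # services per minute

if lam < mu:  # the queue is stable only when service outpaces arrivals
    mean_sojourn = 1.0 / (mu - lam)
    print(f"lambda={lam:.2f}, mu={mu:.2f}, predicted sojourn={mean_sojourn:.2f} min")
else:
    print("Unstable queue: lambda >= mu")
```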
Data from Twitter have been increasingly employed to study the impact of events. Conventionally, researchers have relied on keywords to create a panel of Twitter users who mention event-related keywords during and after an event. There are limitations to the keyword-based approach. First, the technique suffers from selection bias: users who discuss an event are already more interested in event-related topics beforehand, so it is unclear whether observed impacts are merely driven by a set of users who are intrinsically more interested in the event. Second, there are no viable comparison groups for a keyword-based sample of Twitter users. We propose an alternative sampling approach for studying responses to events on Twitter: geolocated panels, defined by users' geolocation. Geolocated panels are exogenous to the keywords in users' tweets, resulting in less selection bias than the keyword-based approach. Geolocated panels also allow us to follow within-person changes over time and enable the creation of comparison groups. We evaluate our panel selection approach in two real-world settings: response to mass shootings and response to TV advertising. We first show empirically that keyword-based panels are subject to selection biases, while geolocated panels reduce them. We then show how geolocated panels can provide qualitatively different results. We believe we are the first to provide a clear empirical example of how a better panel-selection design, based on an exogenous variable such as geography, both reduces selection bias compared to the current state of the art and increases the value of Twitter research for studying events.
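The contrast between the two panel definitions can be sketched as follows; the tweets, bounding box, and keyword are fabricated for illustration.

```python
# Schematic contrast of keyword-based vs. geolocated panel construction.
tweets = [
    {"user": "u1", "lat": 36.08, "lon": -79.44, "text": "thoughts and prayers"},
    {"user": "u2", "lat": 36.10, "lon": -79.40, "text": "great game tonight"},
    {"user": "u3", "lat": 40.71, "lon": -74.00, "text": "thoughts and prayers"},
]

def keyword_panel(tweets, keyword):
    # Users selected because they mention the event: prone to selection bias.
    return {t["user"] for t in tweets if keyword in t["text"]}

def geolocated_panel(tweets, lat_range, lon_range):
    # Users selected by location, exogenous to what they tweet about.
    return {t["user"] for t in tweets
            if lat_range[0] <= t["lat"] <= lat_range[1]
            and lon_range[0] <= t["lon"] <= lon_range[1]}

print(keyword_panel(tweets, "prayers"))                        # {'u1', 'u3'}
print(geolocated_panel(tweets, (36.0, 36.2), (-79.5, -79.3)))  # {'u1', 'u2'}
```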
Public administrations are increasingly publishing Open data to make their governance more transparent. These publicly available Open data include fiscal data, e.g., budget and spending data. The publication of Open fiscal datasets is an important part of transparent and accountable governance. Another critical part of governance transparency and accountability is that published datasets should meet open data publication guidelines; when the requirements in data guidelines are not met, effective data analysis over the published datasets is not possible. In this paper, we present an extensive assessment of published real-world Open fiscal data, the data quality issues that commonly arise in it, and guidelines for publishing Open fiscal data. This work was done by studying prior work on Open fiscal data publication and by collecting the important factors that should be present in Open fiscal datasets; the collected factors were then scored according to the results of a survey. As a result, we have developed an Open Fiscal Data Publication (OFDP) framework, also described in the paper, for assessing the quality of Open fiscal datasets. We gather and comprehensively analyze a representative set of more than 75 fiscal datasets from several public administrations across different regions and at different levels (e.g., supranational, national, municipal), and we characterize the quality issues commonly arising in these datasets. Our evaluation shows that many aspects of fiscal data publication still need attention before the data can be analyzed effectively. Finally, we provide a set of specific guidelines for publishing Open fiscal data.
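A hedged sketch of the kind of survey-weighted scoring such a framework might apply is shown below; the factor names and weights are hypothetical and are not the actual OFDP factors.

```python
# Hypothetical weighted quality score over publication factors, where the
# weights stand in for importance scores derived from a survey.
FACTOR_WEIGHTS = {
    "machine_readable_format": 0.30,
    "open_license": 0.25,
    "schema_documented": 0.25,
    "update_frequency_stated": 0.20,
}

def ofdp_score(dataset_checks):
    """dataset_checks maps each factor to True/False for one dataset."""
    return sum(w for f, w in FACTOR_WEIGHTS.items() if dataset_checks.get(f))

budget_csv = {"machine_readable_format": True, "open_license": True,
              "schema_documented": False, "update_frequency_stated": True}
print(f"quality score: {ofdp_score(budget_csv):.2f} / 1.00")  # 0.75
```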
During data pre-processing, analysts spend a significant part of their time and effort profiling the quality of the data along with cleansing and transforming the data for further analysis. While quality metrics, ranging from general to domain-specific measures, facilitate assessment of the quality of a dataset, there are hardly any approaches that visually support the analyst in customizing and applying such metrics. Yet, visual approaches could facilitate user involvement in data quality assessment. We present MetricDoc, an interactive environment for assessing data quality that provides customizable, reusable quality metrics in combination with immediate visual feedback. Moreover, we provide an overview visualization of these quality metrics along with error visualizations that facilitate interactive navigation of the data to determine the causes of quality issues present in the data. In this paper, we describe the architecture, design, and evaluation of MetricDoc, which underwent several design cycles, including heuristic evaluation and expert reviews as well as a focus group with data quality, human-computer interaction, and visual analytics experts.
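In the spirit of customizable, reusable quality metrics, here is a small illustrative sketch; the metric names and API are assumptions, not MetricDoc's actual interface.

```python
# Two reusable column-level quality metrics: completeness and format
# validity. The names and interface are illustrative only.
import re

def completeness(values):
    """Fraction of non-missing values in a column."""
    return sum(v not in (None, "") for v in values) / len(values)

def validity(values, pattern):
    """Fraction of values matching a domain-specific format, e.g. a date."""
    rx = re.compile(pattern)
    return sum(bool(rx.fullmatch(v)) for v in values if v) / len(values)

column = ["2021-03-01", "2021-03-02", "", "03/04/2021"]
date_pattern = r"\d{4}-\d{2}-\d{2}"
print(f"completeness: {completeness(column):.2f}")            # 0.75
print(f"validity: {validity(column, date_pattern):.2f}")      # 0.50
```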
Healthcare organizations increasingly rely on electronic information to optimize their operations. Highly diverse information from various sources accentuates the relevance and importance of information quality (IQ). The quality of information needs to be improved to support more efficient and reliable utilization of healthcare information systems (IS). This can only be achieved through the implementation of initiatives followed by most users across an organization. The purpose of this study is to examine how IS users' awareness of IQ issues affects their actual practices toward IQ initiatives. We find that users' motivation, shaped by their awareness of the beneficial and problematic situations generated by IQ practices, influences their IQ-related behavior. In addition, social influences and facilitating conditions moderate the relationship between user intention and actual practice. The theoretical and practical implications of the findings are discussed, especially IQ best practices in healthcare settings.
Software metrics are increasingly accepted measures for software quality assessment. However, there is no standard form for representing metric definitions, which would be useful for metric exchange and customization. In this paper, we propose the Software Product Metrics Definition Language (SPMDL), an XML-based description language for defining software metrics in a precise and reusable form. Metric definitions in SPMDL are based on meta-models extracted from either source code or design artifacts, such as the Dagstuhl Middle Meta-model, with support for various abstraction levels. The language defines several flexible computation mechanisms, such as extended OCL queries and predefined graph operations on the meta-model. SPMDL provides an unambiguous description of a metric definition while remaining easy to use and extensible.
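To suggest the semantics of graph-operation-based metrics, the sketch below computes simple coupling metrics over a toy call graph in Python; SPMDL itself would express such a metric declaratively in XML over the meta-model, and the entities here are invented.

```python
# Fan-in / fan-out coupling over a toy dependency graph, standing in for
# the predefined graph operations a metric definition might invoke.
calls = {  # adjacency list: entity -> entities it depends on
    "OrderService": ["Billing", "Inventory"],
    "Billing": ["Inventory"],
    "Inventory": [],
}

def fan_out(entity):
    """Number of entities this entity depends on (efferent coupling)."""
    return len(calls.get(entity, []))

def fan_in(entity):
    """Number of entities that depend on this entity (afferent coupling)."""
    return sum(entity in targets for targets in calls.values())

for e in calls:
    print(f"{e}: fan-in={fan_in(e)}, fan-out={fan_out(e)}")
```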