Ethics-aware data processing is a pressing need, considering that data are often used within critical decision processes (e.g., staff evaluation, college admission, criminal sentencing) as well as in everyday life. This poses new challenges across the whole information extraction process, since in general there is no guarantee as to the character of the input data, the algorithms are written by human beings, and the models obtained from the data are often opaque and difficult to interpret. We investigate the introduction of ethical principles, including fairness and transparency, as first-class citizens among the dimensions of data quality.
Governance of misinformation is a serious concern on social media platforms. Based on three case studies, we offer policy makers insights on managing misinformation in social media. Such studies are essential because misinformation is prevalent on existing social media platforms, and managing it (for instance, fake news) is a challenge for both policy makers and the platforms. The paper explores the factors behind the rapid propagation of misinformation. An average of about 15 lakh (1.5 million) tweets was analysed for each of the three cases surrounding misinformation. The study focuses specifically on cognitive factors that may emerge as drivers of misinformation virality. We highlight findings from three separate case studies focusing on the intrinsic properties of the content, the personality attributes of misinformation propagators, and the network-related attributes that facilitate the rapid propagation of misinformation. Findings indicate that these three aspects collectively catalyze misinformation diffusion and subsequent virality. Policy makers can utilize the findings of this experience study for the governance of misinformation. Tracking and disrupting any one of the identified drivers could act as a control mechanism to manage misinformation propagation.
Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed so far. However, data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to improve the tedious process of data preparation because data errors are detectable through metadata. This paper investigates the intrinsic connection between metadata and data errors. In this work, we establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a new metadata classification and taxonomy based on a closed grammar that can also maintain very complex metadata.
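As a rough illustration of how extractable metadata can point to data errors, the sketch below profiles a single column and flags values that contradict its dominant type. It is only an assumed example, not the mapping or taxonomy established in the paper; the function names, the metadata fields, and the 0.8 threshold are all hypothetical.

```python
def is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def column_metadata(values):
    """Extract simple, illustrative metadata for one column."""
    non_null = [v for v in values if v not in (None, "")]
    numeric = sum(is_number(v) for v in non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "numeric_ratio": numeric / len(non_null) if non_null else 0.0,
    }


def suspicious_values(values, threshold=0.8):
    """Flag non-numeric values in a column that is mostly numeric.

    The threshold is an arbitrary assumption for this sketch.
    """
    meta = column_metadata(values)
    if meta["numeric_ratio"] < threshold or meta["numeric_ratio"] == 1.0:
        return []
    return [v for v in values if v not in (None, "") and not is_number(v)]


if __name__ == "__main__":
    ages = ["34", "41", "n/a", "29", "", "52", "38"]
    print(column_metadata(ages))
    print(suspicious_values(ages))
```

Here the metadata (a high but not perfect numeric ratio) is what signals that the remaining non-numeric entries deserve inspection.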
Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm and many other clustering algorithms focus on the edge weights instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm Extended Maximum Clique Clustering (EMCC) and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
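The inconsistency of transitive clusters can be made concrete with a minimal sketch. It is not one of the evaluated algorithms (and not EMCC): it merely builds clusters as connected components of the duplicate-pair graph with a union-find structure and then lists the record pairs inside a cluster that were never classified as duplicates. All function names and record identifiers are hypothetical.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Cluster records by the transitive closure of pairwise matches."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())


def inconsistent_pairs(cluster, duplicate_pairs):
    """Return pairs within a cluster that were not matched directly."""
    matched = {frozenset(p) for p in duplicate_pairs}
    return [
        (a, b)
        for a, b in combinations(cluster, 2)
        if frozenset((a, b)) not in matched
    ]


if __name__ == "__main__":
    records = ["r1", "r2", "r3", "r4"]
    pairs = [("r1", "r2"), ("r2", "r3")]
    for cluster in transitive_clusters(records, pairs):
        print(cluster, inconsistent_pairs(cluster, pairs))
```

In the example, r1 and r3 end up in the same cluster only through r2, which is exactly the situation a subsequent clustering step would need to resolve.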
Researchers rely on systematic literature reviews to synthesize existing evidence regarding a research topic. While such reviews are an important means of condensing knowledge, conducting one requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are to select primary studies and to assess their quality. In this article, we report a systematic literature review in which we identify, discuss, and synthesize existing techniques from the software engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identify 8 primary studies that report unique techniques and were published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real systematic literature reviews is missing for most of them. Moreover, our results indicate the necessity of developing more reliable techniques and of extending their scope to further activities to facilitate the selection and quality assessment of primary studies.
Opinions in reviews about the quality of products or services can be important information for readers. Unfortunately, such opinions may include deceptive ones posted for business reasons. To keep opinions a valuable and trusted source of information, we propose an approach to distinguishing deceptive from truthful opinions. Specifically, we explore the use of combinations of word and character n-grams, function words, and word syntactic n-grams (word sn-grams) as features for classifiers. We also consider applying word correction to our dataset. Our experiments show that classifiers using the combined word and character n-gram features can perform better than those using the other features. Although the experiments indicate that the effect of word correction might be insignificant, we note that deceptive opinions tend to contain fewer erroneous words than truthful ones. To examine the robustness of our features, we then perform cross-classification tests. The results of these latter experiments suggest that the combined word and character n-gram features could work well in detecting deceptive and truthful opinions. Interestingly, they also indicate that using word sn-grams as combination features could give good performance.
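A combined word and character n-gram representation can be sketched as follows. This is only an assumed illustration of the feature type, not the authors' pipeline: the tokenization (lowercasing and whitespace splitting), the chosen n values, and the feature-name prefixes are hypothetical choices, and no classifier is included.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams (spaces included) from lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def combined_features(text, word_n=(1, 2), char_n=(3,)):
    """Count word and character n-grams in one shared feature space.

    Prefixes such as "w2:" and "c3:" keep the two feature families apart.
    """
    feats = Counter()
    for n in word_n:
        for gram in word_ngrams(text, n):
            feats[f"w{n}:{gram}"] += 1
    for n in char_n:
        for gram in char_ngrams(text, n):
            feats[f"c{n}:{gram}"] += 1
    return feats


if __name__ == "__main__":
    review = "Great hotel great"
    for name, count in sorted(combined_features(review).items()):
        print(name, count)
```

The resulting counts could then be passed to any classifier that accepts sparse feature dictionaries.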
We describe RuDiK, a system for the discovery of declarative rules over knowledge bases (KBs). RuDiK discovers rules expressing positive relationships between KB elements, such as "if two persons have the same parent, they are likely to be siblings", and negative rules, i.e., patterns that identify contradictions in the data, such as "if two persons are married, one cannot be the child of the other" or "the birth year for a person cannot be greater than her graduation year". While the former class infers new facts in the KB, the latter class is crucial for other tasks, such as detecting erroneous triples in data cleaning. Our approach satisfies two main requirements. First, it increases the expressive power of the supported rule language with respect to existing systems. RuDiK discovers rules containing (i) comparisons among literal values and (ii) selection conditions with constants. Richer rules increase the accuracy and the coverage over the facts in the KB. This is achieved with aggressive pruning of the search space and with disk-based algorithms, effectively enabling rule mining on commodity machines. Second, RuDiK is robust to errors and incompleteness in the input KB. It discovers approximate rules based on a measure of support that is aware of the quality issues. We model the mining process as an incremental graph exploration problem and prove that our search strategy has optimality guarantees. Extensive experiments using real-world KBs show that RuDiK outperforms previous proposals in terms of efficiency and that it discovers more effective rules for the application at hand.
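To show how negative rules of this kind can flag erroneous triples, the sketch below applies two already-given rules to a small set of (subject, predicate, object) triples. It does not reproduce RuDiK's rule discovery or its search strategy; the predicate names (`spouse`, `childOf`, `birthYear`, `graduationYear`), the function names, and the sample entities are hypothetical.

```python
def married_child_violations(triples):
    """Find pairs that are married while one is recorded as the other's child."""
    spouse = {(s, o) for s, p, o in triples if p == "spouse"}
    child_of = {(s, o) for s, p, o in triples if p == "childOf"}
    violations = set()
    for a, b in spouse:
        if (a, b) in child_of or (b, a) in child_of:
            # frozenset makes the pair order-independent.
            violations.add(frozenset((a, b)))
    return violations


def birth_after_graduation(triples):
    """Find persons whose birth year is greater than their graduation year.

    This is a rule with a comparison among literal values.
    """
    birth = {s: o for s, p, o in triples if p == "birthYear"}
    grad = {s: o for s, p, o in triples if p == "graduationYear"}
    return sorted(s for s in birth if s in grad and birth[s] > grad[s])


if __name__ == "__main__":
    kb = [
        ("ann", "spouse", "bob"),
        ("bob", "childOf", "ann"),
        ("carl", "spouse", "dana"),
        ("ann", "birthYear", 1990),
        ("ann", "graduationYear", 1985),
        ("bob", "birthYear", 1980),
        ("bob", "graduationYear", 2002),
    ]
    print(married_child_violations(kb))
    print(birth_after_graduation(kb))
```

Each reported entity or pair marks triples that contradict a rule and are therefore candidates for cleaning.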