Ethics-aware data processing is a pressing need, considering that data are often used within critical decision processes (e.g., staff evaluation, college admission, criminal sentencing) as well as in everyday life. This poses new challenges across the whole information extraction process, since in general there is no guarantee as to the character of the input data, the algorithms are written by human beings, and the models obtained from the data are often opaque and difficult to interpret. We investigate the introduction of such ethical principles, including fairness and transparency, as first-class citizens among the dimensions of data quality.
Governance of misinformation is a serious concern on social media platforms. Based on three case studies, we offer policy makers insights on managing misinformation in social media. These studies are essential given the prevalence of misinformation on existing social media platforms. Managing misinformation, such as fake news, is a challenge for policy makers and for the platforms themselves. The paper explores the factors behind the rapid propagation of misinformation. On average, approximately 1.5 million (15 lakh) tweets were analysed for each of the three cases surrounding misinformation. This study provides insights for managing misinformation in social media, with a specific focus on cognitive factors that may emerge as drivers of misinformation virality. We highlight findings from three separate case studies focusing on the intrinsic properties of the content, the personality attributes of misinformation propagators, and the network-related attributes that facilitate the rapid propagation of misinformation. The findings indicate that these three aspects collectively catalyze misinformation diffusion and its subsequent virality. Policy makers can use the findings of this experience study for the governance of misinformation, as tracking and disrupting any one of the identified drivers could act as a control mechanism to manage misinformation propagation.
Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicates. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, like many other clustering algorithms, instead focuses on the edge weights. For the evaluation, in contrast to related work, we experiment on true real-world datasets and, in addition, examine in great detail the various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify the best approaches for different situations. In scenarios with larger clusters, our proposed algorithm Extended Maximum Clique Clustering (EMCC) and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results, and it has the additional advantage that it can also be used in scenarios where edge weights are not available.
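To make the consistency problem concrete, the following minimal Python sketch (using networkx; it is not the paper's EMCC algorithm, and the record pairs are made-up examples) shows how the transitive closure of pairwise duplicate decisions can place records in one cluster even though some pairs were never classified as duplicates, and how a simple clique-based refinement restores consistency:

# Minimal sketch (not the paper's EMCC algorithm): transitive closure of
# pairwise duplicate decisions can yield inconsistent clusters; a clique-based
# refinement ensures every pair inside a cluster was classified as a duplicate.
# The record pairs below are hypothetical examples.
import networkx as nx

pairwise_duplicates = [          # edges produced by a pairwise classifier
    ("r1", "r2"), ("r2", "r3"),  # note: r1-r3 was NOT classified as a duplicate
    ("r4", "r5"),
]

g = nx.Graph(pairwise_duplicates)

# Naive transitive clustering: connected components.
transitive_clusters = [set(c) for c in nx.connected_components(g)]
print(transitive_clusters)       # [{'r1', 'r2', 'r3'}, {'r4', 'r5'}] -- r1/r3 inconsistent

# Clique-based refinement: greedily pick maximal cliques, so every pair
# inside a resulting cluster is supported by a positive pairwise decision.
refined, used = [], set()
for clique in sorted(nx.find_cliques(g), key=len, reverse=True):
    members = [n for n in clique if n not in used]
    if len(members) > 1:
        refined.append(set(members))
        used.update(members)
refined.extend({n} for n in g.nodes if n not in used)
print(refined)                   # e.g. [{'r1', 'r2'}, {'r4', 'r5'}, {'r3'}]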
Researchers rely on systematic literature reviews to synthesize existing evidence regarding a research topic. While they are an important means to condense knowledge, conducting a systematic literature review requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are selecting primary studies and assessing their quality. In this article, we report a systematic literature review in which we identify, discuss, and synthesize existing techniques from the software engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identify eight primary studies that report unique techniques and were published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real systematic literature reviews is missing for most of them. Moreover, our results indicate the necessity of developing more reliable techniques and of extending their scope to further activities to facilitate the selection and quality assessment of primary studies.
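As a rough illustration of the kind of text-mining support described above (not a specific technique from the reviewed studies), the following Python sketch uses scikit-learn to train a classifier on already-screened abstracts and rank unscreened ones by their estimated probability of inclusion; the abstracts, labels, and model choice are placeholders:

# Hedged illustration: a common text-mining setup for semi-automating study
# selection. A classifier trained on manually screened abstracts scores the
# remaining abstracts so researchers can screen the most promising ones first.
# The abstracts and labels below are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

screened_abstracts = [
    "systematic literature review of software testing techniques ...",
    "a cooking recipe blog post ...",
]
labels = [1, 0]  # 1 = include as primary study, 0 = exclude

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(screened_abstracts, labels)

unscreened = ["an empirical study of code review effectiveness ..."]
# Estimated probability of inclusion for each unscreened abstract.
print(model.predict_proba(unscreened)[:, 1])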
Opinions in reviews about the quality of products or services can be important information for readers. Unfortunately, such opinions may include deceptive ones posted for business reasons. To keep the opinions a valuable and trusted source of information, we propose an approach to detecting deceptive and truthful opinions. Specifically, we explore the use of combined word and character n-grams, function words, and word syntactic n-grams (word sn-grams) as features for classifiers tackling this task. We also consider applying word correction to the dataset we use. Our experiments show that classifiers using the combined word and character n-gram features can perform better than those employing the other features. Although the experiments indicate that the effect of word correction may be insignificant, we note that deceptive opinions tend to contain fewer misspelled words than truthful ones. To examine the robustness of our features, we then perform cross-classification tests. The results of these experiments suggest that the combined word and character n-gram features work well in detecting deceptive and truthful opinions. Interestingly, they also indicate that using word sn-grams as combination features can give good performance.
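The following Python sketch illustrates the combined word and character n-gram feature setup mentioned above; the tiny example reviews, labels, n-gram ranges, and choice of classifier are illustrative assumptions rather than the paper's exact configuration:

# Minimal sketch of combining word and character n-gram features for a
# deception classifier. Reviews, labels, and parameter choices are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

reviews = [
    "The room was spotless and the staff were wonderful!",    # truthful (example)
    "This hotel is the best hotel I have ever ever stayed.",  # deceptive (example)
]
labels = ["truthful", "deceptive"]

features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

clf = make_pipeline(features, LinearSVC())
clf.fit(reviews, labels)
print(clf.predict(["Amazing experience, truly the best best hotel ever."]))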
We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated, and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and by 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing, and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of the data, and the constantly evolving needs. While multiple data sketching, summarization, and deletion techniques have been developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges with ad hoc solutions that are application specific and rarely sharable. In this vision paper we discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for constraint enforcement, and for the development of applications over the retained information. In particular, we overview relevant related work, highlighting new research challenges and the potential reuse of existing techniques.
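As one example of the sketching and summarization techniques alluded to above (an illustration only, not a technique proposed in the paper), the following Python sketch implements a small Count-Min sketch: raw events can be disposed of while a fixed-size, approximate summary of their frequencies is retained:

# A minimal Count-Min sketch, shown only to illustrate how a compact summary
# can stand in for disposed raw data. Width, depth, and the sample event
# stream below are arbitrary choices.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Over-estimates only: take the minimum across rows.
        return min(self.table[row][col] for row, col in self._buckets(item))

sketch = CountMinSketch()
for event in ["login", "login", "purchase", "login"]:
    sketch.add(event)               # raw events can now be disposed of
print(sketch.estimate("login"))     # ~3
print(sketch.estimate("purchase"))  # ~1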
Curated, labelled, high-quality data is a valuable commodity for tasks such as business analytics or machine learning. Open data is a common source of such data; for example, retail analytics draws on open demographic data, and weather forecast systems draw on open atmospheric and ocean data. This data is released openly by governments to achieve various objectives, such as transparency, informing citizen engagement, or supporting private enterprise, and it is generally trusted. Critical examination of ongoing social changes, including the post-truth phenomenon, suggests that the quality, integrity, and authenticity of open data may be at risk. We describe these risks, with examples, and identify mechanisms to mitigate them. As an initial assessment of awareness of these risks, we compare our analysis to perspectives captured during open data stakeholder consultations in Canada.