In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that explicitly incorporating uncertainty in the classification process is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
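As a rough illustration of how link uncertainty can enter a Bayes-style classifier, the sketch below weights each labeled neighbor's evidence by the probability that the connecting edge exists. It is not the method proposed in the paper: the function name `classify_node`, the `homophily` parameter, and the add-one smoothed priors are all assumptions made for this example, and it assumes at least two classes.

```python
import math

def classify_node(node, edges, labels, classes, homophily=0.8):
    """Return the class with the highest weighted naive-Bayes score.

    edges:   dict mapping a node to a list of (neighbor, edge_probability)
    labels:  dict mapping already-labeled nodes to their class
    classes: list of all class names (assumed to contain at least two)
    homophily: assumed probability that a neighbor shares the node's class
    """
    k = len(classes)
    scores = {}
    for c in classes:
        # Class prior from the labeled nodes, with add-one smoothing.
        count_c = sum(1 for label in labels.values() if label == c)
        score = math.log((count_c + 1.0) / (len(labels) + k))
        for neighbor, p_edge in edges.get(node, []):
            if neighbor not in labels:
                continue  # unlabeled neighbors carry no evidence here
            if labels[neighbor] == c:
                likelihood = homophily
            else:
                likelihood = (1.0 - homophily) / (k - 1)
            # An unreliable edge contributes proportionally less evidence.
            score += p_edge * math.log(likelihood)
        scores[c] = score
    return max(scores, key=scores.get)

labels = {"a": "A", "b": "B", "c": "B"}
uncertain = {"x": [("a", 0.9), ("b", 0.2), ("c", 0.3)]}
certain = {"x": [("a", 1.0), ("b", 1.0), ("c", 1.0)]}
print(classify_node("x", uncertain, labels, ["A", "B"]))
print(classify_node("x", certain, labels, ["A", "B"]))
```

In this toy graph, the one reliable edge to an "A" neighbor outweighs two unreliable edges to "B" neighbors, whereas treating all edges as certain flips the decision to "B".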
Challenge Paper (no abstract) Excerpt: Financial markets respond to information. That information can be accurate or inaccurate (misinformation), but investors make rapid buy and sell decisions and often act before verifying authenticity. The challenge for data and information quality researchers is to develop tools to detect fraud early and develop strategies or decision rules regulators can use to determine whether to suspend trading.
This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing software toolkits, they are primarily designed to facilitate the publication of research papers. As such, these toolkits are often incomplete, inflexible, poorly documented, difficult to use, and slow, particularly in the context of modern web-scale collections. Furthermore, the growing complexity of modern software ecosystems and the resource constraints most academic research groups operate under make maintaining open-source toolkits a constant struggle. On the other hand, except for a small number of companies (mostly commercial web search engines) that deploy custom infrastructure, Lucene has become the de facto platform in industry for building search applications. Lucene has an active developer base, a large audience of users, and diverse capabilities to work with heterogeneous web collections at scale. However, it lacks systematic support for ad hoc experimentation using standard test collections. We describe Anserini, an information retrieval toolkit built on Lucene that fills this gap. Our goal is to simplify ad hoc experimentation and allow researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections. With Anserini, we demonstrate that Lucene provides a suitable framework for supporting information retrieval research. Experiments show that our toolkit can efficiently index large web collections, provides modern ranking models that are on par with research implementations in terms of effectiveness, and supports low-latency query evaluation to facilitate rapid experimentation.
Introduction to the Special Issue on Reproducibility in Information Retrieval: Tools and Infrastructures
A multi-criteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users' needs. These needs are expressed using several criteria, which are used to evaluate the candidate data sources. An MCSS problem can be solved using multi-dimensional optimisation techniques that trade off the different objectives. Sometimes one may have uncertain knowledge regarding how well the candidate data sources meet the criteria. In order to overcome this uncertainty, one may rely on end users or crowds to annotate the data items produced by the sources in relation to the selection criteria. In this paper, a Targeted Feedback Collection (TFC) approach is introduced that aims to identify those data items on which feedback should be collected, thereby providing evidence on how the sources satisfy the required criteria. The proposed TFC targets feedback by considering the confidence intervals around the estimated criteria values, with a view to increasing the confidence in the estimates that are most relevant to the multi-dimensional optimisation. Variants of the proposed TFC approach have been developed, for use where feedback is expected to be reliable (e.g. where it is provided by trusted experts) and where feedback is expected to be unreliable (e.g. from crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments that involve real-world data sets and crowdsourcing.
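The idea of steering feedback by confidence intervals can be sketched as follows. This is a simplified illustration, not the TFC algorithm itself: it assumes each criterion value is a proportion of positively annotated items, uses the Agresti-Coull interval as the (assumed) interval estimator, and simply picks the source with the widest interval, without weighing relevance to the optimisation. The names `half_width` and `next_source_for_feedback` are hypothetical.

```python
import math

def half_width(positives, n, z=1.96):
    """Agresti-Coull half-width for a proportion estimated from n annotations.

    With n == 0 this evaluates to 0.5, i.e. the interval covers all of [0, 1].
    """
    n_adj = n + z * z
    p_adj = (positives + z * z / 2.0) / n_adj
    return z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)

def next_source_for_feedback(feedback):
    """Pick the source whose criterion estimate is currently least certain.

    feedback: dict mapping source name -> (positive_annotations, total_annotations)
    """
    return max(feedback, key=lambda s: half_width(*feedback[s]))

feedback = {"s1": (45, 50), "s2": (3, 5), "s3": (400, 800)}
print(next_source_for_feedback(feedback))
```

Here `s2` is selected because only five of its items have been annotated, so its estimate has by far the widest interval.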
The evolution of web pages from static HTML pages toward dynamic pieces of software has rendered archiving them increasingly difficult. Nevertheless, an accurate, reproducible web archive is a necessity to ensure the reproducibility of web-based research. Archiving web pages reproducibly, however, is currently not part of best practices for web corpus construction. As a result, and despite the ongoing efforts of other stakeholders to archive the web, tools for the construction of reproducible web corpora are insufficient or ill-fitted. This paper presents a new tool tailored to this purpose. It relies on emulating user interactions with a web page while recording all network traffic. The customizable user interactions can be replayed on demand, while requests sent by the archived page are served with the recorded responses. The tool facilitates reproducible user studies, user simulations, and evaluations of algorithms that rely on extracting data from web pages. To evaluate our tool, we conduct the first systematic assessment of reproduction quality for rendered web pages. Using our tool, we create a corpus of 10,000 web pages carefully sampled from the CommonCrawl and manually annotated with regard to reproduction quality via crowdsourcing. Based on this data we test three approaches to automatic reproduction quality assessment. An off-the-shelf neural network, trained on visual differences between the web page during archiving and reproduction, matches the manual assessments best. This automatic assessment of reproduction quality allows for immediate bugfixing during archiving and continuous development of our tool as the web continues to evolve.
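The record-and-replay principle described above can be illustrated with a toy in-memory archive. This is only a conceptual sketch and not the tool's design: the class `ReplayArchive`, its methods, and the `fetch` callback are hypothetical names, and a real archiver records traffic at the browser or proxy level rather than through a Python callback.

```python
class ReplayArchive:
    """Toy archive: record each response once, then serve it from memory."""

    def __init__(self):
        self._responses = {}

    def record(self, method, url, fetch):
        """Call `fetch(method, url)` live and store its result."""
        response = fetch(method, url)
        self._responses[(method, url)] = response
        return response

    def replay(self, method, url):
        """Serve the stored response without touching the network."""
        try:
            return self._responses[(method, url)]
        except KeyError:
            raise LookupError(f"no recorded response for {method} {url}") from None

calls = []

def fake_fetch(method, url):
    # Stand-in for a real network request; counts how often it is invoked.
    calls.append((method, url))
    return "<html>v1</html>"

archive = ReplayArchive()
archive.record("GET", "https://example.org/", fake_fetch)
print(archive.replay("GET", "https://example.org/"))
```

Requests that were never recorded raise an error instead of falling through to the live web, which is what keeps a replay reproducible.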
Web document collections such as WT10G, GOV2 and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, etc. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly hurt performance if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as test collections grow in size and become noisier. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.
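For readers unfamiliar with the two smoothing methods named above, the sketch below gives their standard textbook forms and shows where document length, which indexed markup inflates, enters each one. It is not the experimental setup of the paper: the parameter defaults (`lam=0.5`, `mu=2000.0`), the toy collection probabilities, and the helper `query_log_likelihood` are assumptions for illustration, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def jelinek_mercer(tf, doc_len, p_coll, lam=0.5):
    """P(w|d) = (1 - lam) * tf/|d| + lam * P(w|C)."""
    return (1.0 - lam) * (tf / doc_len) + lam * p_coll

def dirichlet(tf, doc_len, p_coll, mu=2000.0):
    """P(w|d) = (tf + mu * P(w|C)) / (|d| + mu)."""
    return (tf + mu * p_coll) / (doc_len + mu)

def query_log_likelihood(query, doc_tokens, coll_prob, smooth):
    """Sum of log smoothed term probabilities over the query terms."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return sum(math.log(smooth(counts[t], n, coll_prob[t])) for t in query)

coll_prob = {"retrieval": 0.01, "model": 0.02}   # toy collection model
clean = ["retrieval", "model"] * 5               # 10 content tokens
with_markup = clean + ["div"] * 10               # same text plus 10 markup tokens
query = ["retrieval"]

for name, smooth in [("JM", jelinek_mercer), ("Dirichlet", dirichlet)]:
    print(name,
          query_log_likelihood(query, clean, coll_prob, smooth),
          query_log_likelihood(query, with_markup, coll_prob, smooth))
```

In this toy case the extra markup tokens lower the score under both methods, but the size of the change differs because the two formulas use document length differently. The example does not by itself account for the collection-level effects reported above.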
Evaluation in empirical computer science is essential to show progress and assess the technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Also, crowdsourcing has changed the way that industry approaches problem-solving, with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. The objectives of this paper are to summarize and compare the current approaches and consolidate the experiences of these approaches to outline the next steps of Evaluation-as-a-Service (EaaS), particularly towards sustainable research infrastructures.