In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that the incorporation of uncertainty in the classification process as a first-class citizen is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
Challenge Paper (no abstract) Excerpt: Financial markets respond to information. That information can be accurate or inaccurate (misinformation) but investors make rapid buy and sell decisions and often act before verifying authenticity. The challenge for data and information quality researchers is to develop tools to detect fraud early and develop strategies or decision rules regulators can use to determine whether to suspend trading.
This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing software toolkits, they are primarily designed to facilitate the publication of research papers. As such, these toolkits are often incomplete, inflexible, poorly documented, difficult to use, and slow, particularly in the context of modern web-scale collections. Furthermore, the growing complexity of modern software ecosystems and the resource constraints most academic research groups operate under make maintaining open-source toolkits a constant struggle. On the other hand, except for a small number of companies (mostly commercial web search engines) that deploy custom infrastructure, Lucene has become the de facto platform in industry for building search applications. Lucene has an active developer base, a large audience of users, and diverse capabilities to work with heterogeneous web collections at scale. However, it lacks systematic support for ad hoc experimentation using standard test collections. We describe Anserini, an information retrieval toolkit built on Lucene that fills this gap. Our goal is to simplify ad hoc experimentation and allow researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections. With Anserini, we demonstrate that Lucene provides a suitable framework for supporting information retrieval research. Experiments show that our toolkit can efficiently index large web collections, provides modern ranking models that are on par with research implementations in terms of effectiveness, and supports low-latency query evaluation to facilitate rapid experimentation.
One typical way of building test collections for offline measurement of information retrieval systems is to pool the ranked outputs of different systems down to some chosen depth d, and then form relevance judgments for those documents only. Non-pooled documents -- ones that did not appear in the top-d sets of any of the contributing systems -- are then deemed to be non-relevant for the purposes of evaluating the relative behavior of the systems. In this paper we use rbp-derived residuals to re-examine the reliability of that process. By fitting the rbp parameter p to maximize similarity between AP- and NDCG-induced system rankings on the one hand, and RBP- induced rankings on the other, an estimate can be made as to the potential score uncertainty associated with those two recall-based metrics. We then consider the effect that residual size -- as an indicator of possible measurement uncertainty in utility-based metrics -- has in connection with recall-based metrics, by computing the effect of increasing pool sizes, and examining the trends that arise in terms of both metric score and system separability using standard statistical tests. The experimental results show that the confidence levels expressed via the p-values generated by statistical tests are unrelated to both the size of the residual, and to the degree of measurement uncertainty caused by the presence of unjudged documents, and demonstrate an important limitation of typical test collection-based information retrieval effectiveness evaluation. We therefore recommend that all such experimental results should report, in addition to the outcomes of statistical significance tests, the residual measurements generated by a suitably-matched weighted-precision metric, to give a clear indication of measurement uncertainty that arises due to the presence of unjudged documents in a test collection with finite judgments.
Introduction to the Special Issue on Reproducibility in Information Retrieval: Evaluation Campaigns, Collections, and Analyses
Introduction to the Special Issue on Reproducibility in Information Retrieval: Tools and Infrastructures
We report on our experience with TREC OpenSearch, an online evaluation campaign that enabled researchers to evaluate their experimental retrieval methods using real users of a live website. Specifically, we focus on the task of ad-hoc document retrieval within the academic search domain, and work with two search engines, CiteSeerX and SSOAR, that provide us with traffic. We describe our experimental platform, which is based on the living labs methodology, and report on the experimental results obtained. We also share our experiences, challenges and lessons learned from running this track in 2016 and 2017.
The evolution of web pages from static HTML pages toward dynamic pieces of software has rendered archiving them increasingly difficult. Nevertheless, an accurate, reproducible web archive is a necessity to ensure the reproducibility of web-based research. Archiving web pages reproducibly, however, is currently not part of best practices for web corpus construction. As a result, and despite the ongoing efforts of other stakeholders to archive the web, tools for the construction of reproducible web corpora are insufficient or ill-fitted. This paper presents a new tool tailored to this purpose. It relies on emulating user interactions with a web page while recording all network traffic. The customizable user interactions can be replayed on demand, while requests sent by the archived page are served with the recorded responses. The tool facilitates reproducible user studies, user simulations, and evaluations of algorithms that rely on extracting data from web pages. To evaluate our tool, we conduct the first systematic assessment of reproduction quality for rendered web pages. Using our tool, we create a corpus of 10,000 web pages carefully sampled from the CommonCrawl and manually annotated with regard to reproduction quality via crowdsourcing. Based on this data we test three approaches to automatic reproduction quality assessment. An off-the-shelf neural network, trained on visual differences between the web page during archiving and reproduction, matches the manual assessments best. This automatic assessment of reproduction quality allows for immediate bugfixing during archiving and continuous development of our tool as the web continues to evolve.
Web document collections such as WT10G, GOV2 and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, etc. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly hurt performance if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as test collections grow in size, and become more noisy. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.
The evaluation of retrieval effectiveness by means of test collections is a commonly used methodology in the Information Retrieval field. Some researchers have addressed the quite fascinating research question of whether it is possible to evaluate effectiveness completely automatically, without human relevance assessments. Since human relevance assessment is one of the main costs of building a test collection, both in human time and money resources, this rather ambitious goal would have a practical impact. In this paper we reproduce the main results on evaluating information retrieval systems without relevance judgments; furthermore, we generalize such previous work to analyze the effect of test collections and evaluation metrics. We also expand the idea to the estimation of query difficulty and, finally, we propose a semi-automatic evaluation. Our results show that: (i) we are able to reproduce previous work, (ii) the collection and the metric used impact in the semi automatic evaluation of systems, (iii) the automatic evaluation can (to some extents) be used to predict query difficulty, and (iv) results lead to an effective semi-automatic evaluation of retrieval systems.
Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive, in terms of resources, time, and money; therefore, many researchers have proposed methods for a cheaper evaluation. One particular approach, on which we focus in this paper, is to use fewer topics: in TREC-like initiatives, usually system effectiveness is evaluated as the average effectiveness on a set of n topics (usually, n=50, but more than 1,000 have been also adopted); instead of using the full set, it has been proposed to find the best subsets of a few good topics that evaluate the systems in the most similar way to the full set. The computational complexity of the task has so far limited the analysis that have been performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our new implementation allows us to reproduce some notable results on topic set reduction, as well as perform new experiments to generalize and improve such results. We show that our approach is able to both reproduce the main state of the art results and to allow us to analyze the effect of the collection, metric, and pool depth used for the evaluation. Finally, differently from previous studies which have been mainly theoretical, we are also able to discuss some practical topic selection strategies, integrating results of automatic evaluation approaches.
Evaluation in empirical computer science is essential to show progress and assess technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Also, crowdsourcing has changed the way that industry approaches problem-solving with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. The objectives of this paper are to summarize and compare the current approaches and consolidate the experiences of these approaches to outline the next steps of EaaS, particularly towards sustainable research infrastructures.