An Exploration of Approaches to Integrating Neural Reranking Models in Multi-Stage Ranking Architectures

Zhucheng Tu, Matt Crane, Royal Sequiera, Junchen Zhang, and Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada
{michael.tu,matt.crane,rdsequie,j345zhan,jimmylin}@uwaterloo.ca

SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17), August 7-11, 2017, Shinjuku, Tokyo, Japan. © 2017 Copyright held by the owner/author(s). arXiv:1707.08275v1 [cs.IR] 26 Jul 2017.

ABSTRACT
We explore different approaches to integrating a simple convolutional neural network (CNN) with the Lucene search engine in a multi-stage ranking architecture. Our models are trained using the PyTorch deep learning toolkit, which is implemented in C/C++ with a Python frontend. One obvious integration strategy is to expose the neural network directly as a service. For this, we use Apache Thrift, a software framework for building scalable cross-language services. In exploring alternative architectures, we observe that once trained, the feedforward evaluation of neural networks is quite straightforward. Therefore, we can extract the parameters of a trained CNN from PyTorch and import the model into Java, taking advantage of the Java Deeplearning4J library for feedforward evaluation. This has the advantage that the entire end-to-end system can be implemented in Java. As a third approach, we can extract the neural network from PyTorch and "compile" it into a C++ program that exposes a Thrift service. We evaluate these alternatives in terms of performance (latency and throughput) as well as ease of integration. Experiments show that feedforward evaluation of the convolutional neural network is significantly slower in Java, while the performance of the compiled C++ network does not consistently beat the PyTorch implementation.

1 INTRODUCTION
As an empirical discipline, information retrieval research requires substantial software infrastructure to index and search large collections.
To address this challenge, many academic research groups have built and shared open-source search engines with the broader community—prominent examples include Lemur/Indri [9, 10] and Terrier [7, 13]. These systems, however, are relatively unknown beyond academic circles. With the exception of a small number of companies (e.g., commercial web search engines), the open-source Lucene search engine (and its derivatives such as Solr and Elasticsearch) has become the de facto platform for deploying search applications in industry.

There is a recent push in the information retrieval community to adopt Lucene as the research toolkit of choice. This would lead to a better alignment of information retrieval research with the practice of building search applications, hopefully leading to richer academic-industrial collaborations, more efficient knowledge transfer of research innovations, and greater reproducibility of research results. Recent efforts include the Lucene4IR workshop (https://sites.google.com/site/lucene4ir/home) organized by Azzopardi et al. [3], the Lucene for Information Access and Retrieval Research (LIARR) Workshop [2] at SIGIR 2017, and the Anserini IR toolkit built on top of Lucene [24].

Given the already substantial and growing interest in applying deep learning to information retrieval [11], it would make sense to integrate existing deep learning toolkits with open-source search engines. In this paper, we explore different approaches to integrating a simple convolutional neural network (CNN) with the Lucene search engine, evaluating alternatives in terms of performance (latency and throughput) as well as ease of integration.

Fundamentally, neural network models (and even more broadly, learning-to-rank approaches) for a variety of information retrieval tasks behave as rerankers in a multi-stage ranking architecture [1, 5, 8, 11, 14, 18, 22]. Given a user query, systems typically begin with a document retrieval stage, where a standard ranking function such as BM25 is used to generate a set of candidates. These are then passed to one or more rerankers to generate the final results. In this paper, we focus on the question answering task, although the architecture is exactly the same: we consider a standard pipeline architecture [17] where the natural language question is first used as a query to retrieve a set of candidate documents, which are then segmented into sentences and rescored with a convolutional neural network (CNN). Viewed in isolation, this final stage is commonly referred to as answer selection.

We explored three different approaches to integrating a CNN for answer selection with Lucene in the context of an end-to-end question answering system. The fundamental challenge we tackle is that Lucene is implemented in Java, whereas many deep learning toolkits (including PyTorch, the one we use) are written in C/C++ with a Python frontend. We desire a seamless yet high-performance way to interface between components in different programming languages; in this respect, our work shares similar goals as Pyndri [19] and Luandri [12].

Overall, our integration efforts proceeded along the following line of reasoning:

- Since we are using the PyTorch deep learning toolkit to train our convolutional neural networks, it makes sense to simply expose the network as a service. For this, we use Apache Thrift, a software framework for building scalable cross-language services (https://thrift.apache.org/).

- The complexities of deep learning toolkits lie mostly in training neural networks. Once trained, the feedforward evaluation of neural networks is quite straightforward and can be completely decoupled from backpropagation training. Therefore, we explored an approach where we extract the parameters of the trained CNN from PyTorch and import the model into Java, using the Deeplearning4J library for feedforward evaluation. This has the advantage of language uniformity—the entire end-to-end system can be implemented in Java.

- The ability to decouple neural network training from deployment introduces additional options for integration. We explored an approach where we directly "compile" our convolutional neural network into a C++ program that also exposes a Thrift service.

Experiments show that feedforward evaluation of the convolutional neural network is significantly slower in Java, which suggests that Deeplearning4J is not yet as mature and optimized as other toolkits, at least when used directly "out-of-the-box". Note that PyTorch presents a Python frontend, but the backend takes advantage of highly-optimized C libraries. The relative performance of our C++ implementation and the PyTorch implementation is not consistent across different processors and operating systems, and thus it remains to be seen if our "network compilation" approach can yield performance gains.

[Figure 1: diagram showing the question ("Where was the cat?") and document ("The cat sat on the mat") sentence matrices passing through convolution feature maps and pooling to representations xq and xd, which are joined with additional features xfeat, followed by a hidden layer and a softmax layer.]

Figure 1: The overall architecture of our convolutional neural network for answer selection.

    namespace py qa

    service QuestionAnswering {
      double getScore(1:string question, 2:string answer)
    }

Figure 2: Thrift IDL for a service that reranks question-answer pairs.

2 BACKGROUND
In this paper, we explore a very simple multi-stage architecture for question answering that consists of a document retrieval and an answer selection stage. An input natural language question is used as a bag-of-words query to retrieve h documents from the collection. These documents are segmented into sentences, which are treated as candidates that feed into the answer selection module.

Given a question q and a candidate set of sentences {c1, c2, ..., cn}, the answer selection task is to identify sentences that contain the answer. For this task, we implemented the convolutional neural network (CNN) shown in Figure 1, which is a slightly simplified version of the model proposed by Severyn and Moschitti [16].

We chose to work with this particular CNN for several reasons. It is a simple model that delivers reproducible results with multiple implementations [15]. It is quick to train (even on CPUs), which supports fast experimental iteration. Although its effectiveness in answer selection is no longer the state of the art, the model still provides a reasonable baseline.

Our model adopts a general "Siamese" structure [4] with two subnetworks processing the question and candidate answers in parallel. The input to each "arm" in the neural network is a sequence of words [w1, w2, ..., w|S|], each of which is translated into its corresponding distributional vector (i.e., from a word embedding), yielding a sentence matrix. Convolutional feature maps are applied to this sentence matrix, followed by ReLU activation and simple max-pooling, to arrive at a representation vector xq for the question and xd for the "document" (i.e., the candidate answer sentence).

At the join layer (see Figure 1), all intermediate representations are concatenated into a single vector:

    x_join = [x_q^T; x_d^T; x_feat^T]    (1)

The final component of the input vector at the join layer consists of "extra features" x_feat derived from four word overlap measures between the question and the candidate sentence: word overlap and idf-weighted word overlap, computed between all words and between non-stopwords only.
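As a concrete illustration, the following is a minimal sketch of these four overlap features, assuming a simple set-based normalization; the exact formulation follows Severyn and Moschitti [16] and may differ in detail:

    def overlap_features(q_tokens, a_tokens, idf, stopwords):
        # Word overlap between two token sequences (set-based, illustrative).
        def overlap(q, a):
            q, a = set(q), set(a)
            return len(q & a) / (len(q | a) or 1)

        # Same overlap, but with each word weighted by its idf.
        def idf_overlap(q, a):
            q, a = set(q), set(a)
            shared = sum(idf.get(w, 0.0) for w in q & a)
            total = sum(idf.get(w, 0.0) for w in q | a)
            return shared / total if total else 0.0

        q_content = [w for w in q_tokens if w not in stopwords]
        a_content = [w for w in a_tokens if w not in stopwords]
        return [overlap(q_tokens, a_tokens),       # all words
                idf_overlap(q_tokens, a_tokens),   # all words, idf-weighted
                overlap(q_content, a_content),     # non-stopwords
                idf_overlap(q_content, a_content)] # non-stopwords, idf-weighted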
Our model is implemented using the PyTorch deep learning toolkit, based on a reproducibility study of Severyn and Moschitti's model [16] by Rao et al. [15] using the Torch deep learning toolkit (in Lua). Our network configuration uses the best setting, as determined by Rao et al. via ablation analyses. Specifically, they found that the bilinear similarity component actually decreases effectiveness, and therefore it is not included in our model.

For our experiments, the CNN was trained using the popular TrecQA dataset, first introduced by Wang et al. [23] and further elaborated by Yao et al. [25]. Ultimately, the data derive from the Question Answering Tracks from TREC 8-13 [20, 21].

3 METHODOLOGY
3.1 PyTorch Thrift
Given that our convolutional neural network for answer selection (as described in the previous section) is implemented and trained in PyTorch, the most obvious architecture for integrating disparate components is to expose the neural network model as a service. This answer selection service can then be called, for example, from a Java client that integrates directly with Lucene. For building this service, we take advantage of Apache Thrift.

Apache Thrift is a widely-deployed framework in industry for building cross-language services. From an interface definition language (IDL), Thrift can automatically generate server and client stubs for remote procedure calls (RPC). In our case, the server stub is in Python (to connect to PyTorch) while the client stub is in Java (to connect to Lucene). These stubs handle marshalling and unmarshalling parameters in a language-agnostic manner, message transport, server protocols, connection management, load balancing, etc.

Our Thrift IDL for answer selection is shown in Figure 2. The namespace specifies the module that the definition is declared in, as well as the language. The syntax for the function declaration closely resembles a function declaration in C/C++ or Java, containing the return type of the function as well as the name and type of each parameter. The main difference is the addition of an integer parameter id, which is used for identifying fields for schema evolution.

Thrift automatically generates "service boilerplate", and the programmer is left to implement the service and client handler. The service handler is essentially a wrapper that feeds the input received by the server stub into the neural network model already implemented in PyTorch, as shown in Figure 3. The client directly calls the corresponding method of the client stub generated by Thrift like an ordinary function invocation—in our case, from Java.

    class QuestionAnsweringHandler:
        def getScore(self, question, answer):
            xq = self.__make_input_matrix(question)
            xa = self.__make_input_matrix(answer)
            pred = self.model(xq, xa, self.default_ext_feats)
            pred = torch.exp(pred)
            return pred.data[0, 1]

    class QAModel(nn.Module):
        def __init__(self, input_n_dim, filter_width,
                     conv_filters=100, ...):
            ...
            self.conv_q = nn.Sequential(
                nn.Conv1d(input_n_dim, conv_channels, filter_width,
                          padding=filter_width-1),
                nn.Tanh()
            )
            self.conv_a = ...
            self.combined_feature_vector = \
                nn.Linear(2*conv_channels, n_hidden)
            self.combined_features_activation = nn.Tanh()

        def forward(self, question, answer, ext_feats):
            q = self.conv_q.forward(question)
            q = F.max_pool1d(q, q.size()[2])
            q = q.view(-1, self.conv_channels)
            a = ...
            x = torch.cat([q, a, ext_feats], 1)
            x = self.combined_feature_vector.forward(x)
            x = self.combined_features_activation.forward(x)
            ...
            return x
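To complete the picture, here is a minimal sketch of how such a handler might be served, assuming stubs generated with `thrift --gen py qa.thrift` from the IDL in Figure 2; the host, port, and file names are illustrative:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer

    from qa import QuestionAnswering  # stub generated from the IDL in Figure 2

    handler = QuestionAnsweringHandler()  # the PyTorch-backed handler (Figure 3)
    processor = QuestionAnswering.Processor(handler)
    transport = TSocket.TServerSocket(host='127.0.0.1', port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    # TSimpleServer runs a single thread and accepts one connection at a
    # time (the same server class used in the experiments of Section 4.2).
    server = TServer.TSimpleServer(processor, transport, tfactory, pfactory)
    server.serve()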
Figure 3: Snippet of the PyTorch implementation of answer selection with the Thrift service handler. Note the getScore method implements the Thrift interface and implicitly invokes the forward method of QAModel, the CNN in Figure 1.

    MultiLayerConfiguration conf =
        new NeuralNetConfiguration.Builder()
            .list()
            .layer(0, new ConvolutionLayer.Builder(50, kernelW)
                .nIn(inChannels)
                .stride(stride, stride)
                .nOut(outChannels)
                .padding(0, padding)
                .activation(Activation.TANH)
                .build())
            .build();

    model = new MultiLayerNetwork(conf);
    model.output(batchedSentenceEmbedding, false);

Figure 4: Snippet of the Deeplearning4J approach, showing how the convolution operation is invoked.

    {
      "type": "record",
      "name": "weights",
      "namespace": "CNN",
      "fields": [
        {
          "name": "dimension",
          "type": {
            "type": "array",
            "items": "int"
          }
        },
        {
          "name": "weights",
          "type": {
            "type": "array",
            "items": "double"
          }
        }
      ]
    }

Figure 5: Avro schema defining model weights.

3.2 Deeplearning4J
In our end-to-end question answering pipeline, only the answer selection stage involves deep learning and thus depends on PyTorch. However, the training of neural networks can be easily decoupled from their deployment. Since the feedforward pass of neural networks is relatively straightforward, we can use an existing deep learning toolkit for training, and once the model is learned, we can extract the model parameters and import the model into a different toolkit (in a different programming language), or reimplement the feedforward evaluation directly in a language of our choice.

In our case, if we implement the feedforward evaluation in Java, then our entire question answering pipeline can be captured in a single programming language. We take advantage of a Java deep learning toolkit called Deeplearning4J (https://deeplearning4j.org/), built on top of a Java numeric computation library called ND4J, to explore this idea (see the example code snippet in Figure 4). Note that although Deeplearning4J is a complete deep learning toolkit with support for training neural networks, we only use it for forward evaluation. Since deep learning today remains dominated by Python-based toolkits such as PyTorch, TensorFlow, and Keras, this captures the workflow of most researchers and practitioners.

To export model parameters, we need an interoperable serialization format for exchanging data between Python and Java. For this, we use Apache Avro (https://avro.apache.org/), a widely-adopted data serialization framework. Although Thrift can also be used for data serialization, Avro's main advantage is its dynamic typing feature that avoids code generation. A script written in Python loads the model trained by PyTorch, extracts the parameters, parses the schema, and then writes the model parameters to an Avro file. The Avro schema we use, expressed in JSON, is shown in Figure 5.

Note that weights for different types of parameters can have different dimensions. For example, since multiple convolution filters are used to extract different features from the sentence matrices, and each filter is a 2-dimensional tensor (matrix), the convolution filters together form a 3-dimensional tensor. However, the edge weights for the fully-connected layer form only a 1-dimensional tensor (vector). All weights are therefore reshaped to one dimension in the Avro file, stored alongside their original dimensions. On the Java side, we read the weights from the Avro file, parse the schema, read the dimensions, reshape each weight matrix, and finally convert them into the correct datatype for Deeplearning4J.
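As a sketch of what this export script might look like (file names are illustrative, and in the avro-python3 package the parse call is spelled Parse):

    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    def export_weights(model, schema_path='weights.avsc',
                       out_path='weights.avro'):
        # Each parameter tensor becomes one record conforming to the schema
        # in Figure 5: its original shape plus the flattened weights.
        schema = avro.schema.parse(open(schema_path).read())
        writer = DataFileWriter(open(out_path, 'wb'), DatumWriter(), schema)
        for tensor in model.state_dict().values():
            writer.append({
                'dimension': list(tensor.size()),
                'weights': tensor.contiguous().view(-1).tolist(),
            })
        writer.close()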
All weights are reshaped to one dimension in the for (int i=0; iq_conv_filters.size(); i++) { for (int k=0; kquestion_input, 0, k + COLUMN_PADDING, the Avro €le, parse the schema, read the dimensions, reshape each this ->q_conv_filters[i].rows(), weight matrix, and €nally convert them into the correct datatype this ->q_conv_filters[i].columns()); for Deeplearning4J. auto cc = sub % this ->q_conv_filters[i]; Œe advantage of the Deeplearning4J approach for reranking is float sum = 0.0; language uniformity, which allows the entire question-answering for (int j=0; jconv_result[k] = sum; } coupled architectures, but a broader discussion is beyond the scope this ->question_conv_map[i] = of our work. max(this ->conv_result); } this ->question_conv_map = tanh( 3.3 C++ ‡ri this ->question_conv_map + Once we separate the training of a neural network from its deploy- this ->q_conv_biases); ... ment, we can in principle reimplement the feedforward pass in any return score ; language we choose. A Java implementation (as described in the } previous section) yields language uniformity, which is a worthy }; goal from an integration perspective. As an alternative, we explored Figure 6: Snippet of the C++ generated program deployed the potential for increased performance by “compiling” the neural behind the ‡ri service, showing the convolution compu- network itself into a standalone binary that exposes a Œri‰ service. tation for the question. In more detail: We have implemented a C++ conversion program that reads the same serialized model from Avro (as described in the previous section). However, rather than construct the network a‰er reading the weights, the program instead generates another program with the network already encoded—see representative and widely used as other deep learning toolkits such as Torch and snippet of code in Figure 6. Œis gives the maximum potential to TensorFlow. Second, we did not have an implementation of our bene€t from compiler optimizations as a number of the matrices are CNN in Deeplearning4J and lacked the resources to port the model. constants. Since the conversion program is under our control, this We concede, however, that a thorough evaluation should include technique can adapt to arbitrary network architectures (although this experimental condition. this is not the case in our current proof-of-concept implementation). Œere are alternative approaches to integrating Python and Java Additional, our compilation approach means that any potential code that we did not explore: PyLucene is a Python extension for deployment requires only a single binary. accessing Java Lucene, and Py4J is a bridge that allows a Python pro- Œe generated program uses the Blaze math library5 [6] to per- gram to dynamically access Java objects in a . form all vector/matrix manipulations, which makes extensive use In such an architecture, the entire question-answering pipeline of BLAS/LAPACK functions and SIMD instructions for maximum would be implemented in Python, calling Java for functionalities performance. Like the PyTorch implementation, the generated pro- that involve Lucene. Again, we did not have sucient resources to gram exposes a Œri‰ service. Œe generation program and a simple explore this approach to integration, but agree that such a condition Œri‰ client are publicly available on GitHub.6 should be evaluated in future work.

3.4 Alternative Approaches
For completeness, it is worthwhile to discuss other obvious integration approaches that we did not examine in this study.

In the quest for language uniformity, it would also be possible to implement our entire CNN (including training) in Java using the Deeplearning4J toolkit. We decided not to pursue this route for two reasons: First, Deeplearning4J does not appear to be as mature and widely used as other deep learning toolkits such as Torch and TensorFlow. Second, we did not have an implementation of our CNN in Deeplearning4J and lacked the resources to port the model. We concede, however, that a thorough evaluation should include this experimental condition.

There are alternative approaches to integrating Python and Java code that we did not explore: PyLucene is a Python extension for accessing Java Lucene, and Py4J is a bridge that allows a Python program to dynamically access Java objects in a Java virtual machine (JVM), as sketched below. In such an architecture, the entire question-answering pipeline would be implemented in Python, calling Java for functionalities that involve Lucene. Again, we did not have sufficient resources to explore this approach to integration, but agree that such a condition should be evaluated in future work.
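For illustration only, a Py4J-based design might look like the following sketch; we did not build this, and the entry-point class and its methods are hypothetical:

    from py4j.java_gateway import JavaGateway

    # Connect to a GatewayServer assumed to be running inside a JVM that
    # hosts Lucene; getSearcher() and search() are hypothetical wrappers.
    gateway = JavaGateway()
    searcher = gateway.entry_point.getSearcher()
    candidates = searcher.search("Where was the cat?", 20)  # top-h retrieval
    for sentence in candidates:
        ...  # rescore each candidate sentence with the PyTorch CNN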
4 RESULTS
We compared the performance of our three proposed approaches in terms of latency and throughput. All the experiments in this paper were performed on two machines:

- A desktop with an Intel Core i7-6800K CPU (6 physical cores, 12 logical cores with hyperthreading) running Ubuntu 16.04.

- An Apple MacBook Air laptop with an Intel Core i5-5250U CPU (2 physical cores, 4 logical cores with hyperthreading) running macOS Sierra 10.12.6.

For the Python implementation, we used Python 3.6 and PyTorch 0.1.12. For the Java implementation, we used Java 8 and Deeplearning4J 0.8.0 on Oracle JVM 1.8.0. Deeplearning4J (which uses the ND4J matrix library) uses OpenBLAS by default. Our C++ implementation uses Blaze 3.1 and was compiled with -O3 -DNDEBUG options using gcc 5.4 on the desktop and Apple LLVM version 8.1.0 (clang-802.0.42) on the laptop.

Both PyTorch and Deeplearning4J use the OMP_NUM_THREADS environment variable to decide how many threads to use. We set OMP_NUM_THREADS=1 for all experiments. The C++ implementation is also single-threaded. In the relevant configurations, Avro 1.8.2 and Thrift 0.10.0 are used. All implementations ran on the CPU only. Performance measurements do not include deserialization time of the model parameters and other startup costs.

4.1 Feedforward Evaluation
We first examined the performance of the three approaches without Thrift. For each implementation, after we loaded the model parameters into memory, we iterated through the question-answer pairs from the TrecQA raw-dev and raw-test splits and invoked the appropriate function that computes the similarity score for each pair. We divided the total number of pairs scored by the total elapsed time to arrive at performance in terms of queries per second (QPS). Results are summarized in Table 1. The calling program ran on a single thread.

Table 1: Feedforward evaluation performance of the CNN (without the Thrift service wrapper).

    Machine   Approach          Throughput (QPS)
    Desktop   PyTorch                    1226.49
              Deeplearning4J              530.40
              C++ Generator              1235.50
    Laptop    PyTorch                     828.57
              Deeplearning4J              165.75
              C++ Generator              1025.91

We observe that the performance of the Java Deeplearning4J implementation is approximately 2-6x slower than the PyTorch and C++ implementations, depending on the environment. In our initial Java implementation, we used the ND4J matrix library directly to implement convolutions in the simplest way possible—looping over filters and convolving each filter with the sentence embedding separately. The performance of this naive implementation was two orders of magnitude worse than that of Deeplearning4J, because internally Deeplearning4J implements convolutions using the im2col approach, which turns the problem into a matrix multiplication and takes advantage of highly-optimized functions such as GEMM (see https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/). We spent a considerable amount of time investigating the performance of our Java implementations (both using ND4J directly and using Deeplearning4J). We were unable to further improve the performance of the Deeplearning4J implementation; it seems that the toolkit already exploits all the "obvious" optimization tricks.

Interestingly, the relative performance of the C++ and PyTorch implementations is not consistent across our different test environments, which is likely attributable to some combination of hardware, operating system, and software toolchain (e.g., gcc vs. LLVM). We also note that the performance gap between the Java implementation and the other two varies across the different environments. Although performance differences are to be expected, we currently lack a clear understanding of their sources.

Overall, these results suggest that Deeplearning4J and ND4J are less mature than their counterparts in Python/C++ (i.e., PyTorch and numpy) and that "out-of-the-box" configurations may need further fine-tuning to compete with the other approaches. The lack of a consistent performance advantage in our C++ implementation suggests that the Python frontend to PyTorch adds minimal performance overhead (since the PyTorch backend also uses optimized C libraries). It remains to be seen if further optimizations in our C++ implementation can translate into consistently better performance.

4.2 Thrift Service
Finally, we compared the performance of the PyTorch and C++ implementations behind Thrift. We did not examine wrapping Deeplearning4J in a Thrift service since the advantage of the Java implementation is the ability to integrate directly with Lucene. We expect Thrift to introduce some overhead due to data serialization/deserialization and network protocols, but such a design enables components in an end-to-end question answering pipeline to be built in different languages.

Typically, in a microservices architecture, the Thrift server and clients would run on separate machines, but for the purposes of this experiment, we ran them both on the same machine. In this setup, we used a Python Thrift client to make requests to both the Python Thrift server and the C++ Thrift server. The Python client uses a single thread to send requests, keeping the configuration as close to the non-Thrift setting as possible. Both Thrift servers use TSimpleServer, which runs a single thread, accepts one connection at a time, and repeatedly processes requests from the connection.

Results are summarized in Table 2, with p50 and p99 denoting median and 99th percentile latency, respectively.

Table 2: End-to-end performance of the Thrift service.

                                Throughput    Latency (ms)
    Machine   Approach               (QPS)     p50     p99
    Desktop   PyTorch Thrift       1150.86    0.83    1.72
              C++ Thrift           1000.40    1.00    1.59
    Laptop    PyTorch Thrift        774.28    1.20    2.51
              C++ Thrift            924.39    1.08    1.71

Comparing the results in Table 1 with those in Table 2, we are able to quantify the overhead introduced by the Thrift service. On both machines, wrapping PyTorch with Thrift added about 6-7% overhead. Wrapping C++ with Thrift added around 10% and 24% overhead on the laptop and on the desktop, respectively. In this particular case, it seems that the Python Thrift implementation is more efficient than the C++ one.
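As a sketch of the measurement procedure (the exact harness is not shown in the paper, and names such as `pairs` are illustrative), a single-threaded Thrift client can collect both throughput and latency percentiles as follows:

    import time
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from qa import QuestionAnswering  # generated stub (see Figure 2)

    transport = TTransport.TBufferedTransport(
        TSocket.TSocket('127.0.0.1', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = QuestionAnswering.Client(protocol)
    transport.open()

    latencies = []
    start = time.perf_counter()
    for question, answer in pairs:  # e.g., TrecQA raw-dev and raw-test pairs
        t0 = time.perf_counter()
        client.getScore(question, answer)
        latencies.append(time.perf_counter() - t0)
    qps = len(latencies) / (time.perf_counter() - start)
    transport.close()

    latencies.sort()
    p50 = latencies[len(latencies) // 2] * 1000         # median, in ms
    p99 = latencies[int(len(latencies) * 0.99)] * 1000  # 99th percentile, in ms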
5 CONCLUSIONS
In light of the recent push in the information retrieval community to adopt Lucene as the research toolkit of choice and the growing interest in applying deep learning to information retrieval problems, this paper explores three ways to integrate a simple CNN with Lucene. We can directly expose the PyTorch model with a Thrift service, extract the model parameters and import the model into the Java Deeplearning4J toolkit for direct Lucene integration, or "compile" the network into a standalone C++ program that exposes a Thrift service.

All considered, the simplest approach of wrapping PyTorch in Thrift seems at present to be the best option, combining performance and ease of integration. The other two approaches have potential advantages that remain currently unrealized. For Java, the performance gap might shrink over time as Deeplearning4J becomes more mature. For our compiled C++ approach, additional optimizations might yield consistent performance gains that justify the additional complexity. Finally, there are other approaches, discussed in Section 3.4, that we have not yet explored. The best approach to integrating neural networks with the Lucene search engine for information retrieval applications remains an open question, but hopefully our preliminary explorations start to shed some light on the relevant issues.

6 ACKNOWLEDGMENTS
We'd like to thank the Deeplearning4J team for help with their toolkit. This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, with additional contributions from the U.S. National Science Foundation under CNS-1405688. Any findings, conclusions, or recommendations expressed do not necessarily reflect the views of the sponsors.

REFERENCES
[1] Nima Asadi and Jimmy Lin. 2013. Effectiveness/Efficiency Tradeoffs for Candidate Generation in Multi-Stage Retrieval Architectures. In SIGIR. 997-1000.
[2] Leif Azzopardi, Matt Crane, Hui Fang, Grant Ingersoll, Jimmy Lin, Yashar Moshfeghi, Harrisen Scells, Peilin Yang, and Guido Zuccon. 2017. The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017. In SIGIR.
[3] Leif Azzopardi, Yashar Moshfeghi, Martin Halvey, Rami S. Alkhawaldeh, Krisztian Balog, Emanuele Di Buccio, Diego Ceccarelli, Juan M. Fernández-Luna, Charlie Hull, Jake Mannix, and Sauparna Palchowdhury. 2016. Lucene4IR: Developing Information Retrieval Evaluation Resources using Lucene. SIGIR Forum 50, 2 (2016), 58-75.
[4] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature Verification Using a "Siamese" Time Delay Neural Network. In NIPS. 737-744.
[5] J. Shane Culpepper, Charles L. A. Clarke, and Jimmy Lin. 2016. Dynamic Cutoff Prediction in Multi-Stage Retrieval Systems. In ADCS. 17-24.
[6] Klaus Iglberger, Georg Hager, Jan Treibig, and Ulrich Rüde. 2012. High Performance Smart Expression Template Math Libraries. In HPCS. 367-373.
[7] Craig Macdonald, Richard McCreadie, Rodrygo L. T. Santos, and Iadh Ounis. 2012. From Puppy to Maturity: Experiences in Developing Terrier. In SIGIR 2012 Workshop on Open Source IR.
[8] Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High Accuracy Retrieval with Multiple Nested Ranker. In SIGIR. 437-444.
[9] Donald Metzler and W. Bruce Croft. 2004. Combining the Language Model and Inference Network Approaches to Retrieval. IP&M 40, 5 (2004), 735-750.
[10] Donald Metzler, Trevor Strohman, Howard Turtle, and W. Bruce Croft. 2004. Indri at TREC 2004: Terabyte Track. In TREC.
[11] Bhaskar Mitra and Nick Craswell. 2017. Neural Models for Information Retrieval. arXiv:1705.01509v1.
[12] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Luandri: A Clean Lua Interface to the Indri Search Engine. arXiv:1702.05042.
[13] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina Lioma. 2006. Terrier: A High Performance and Scalable Information Retrieval Platform. In SIGIR 2006 Workshop on Open Source IR.
[14] Jan Pedersen. 2010. Query Understanding at Bing. Invited talk at SIGIR.
[15] Jinfeng Rao, Hua He, and Jimmy Lin. 2017. Experiments with Convolutional Neural Network Models for Answer Selection. In SIGIR.
[16] Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In SIGIR. 373-382.
[17] Stefanie Tellex, Boris Katz, Jimmy Lin, Gregory Marton, and Aaron Fernandes. 2003. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. In SIGIR. 41-47.
[18] Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2013. Efficient and Effective Retrieval Using Selective Pruning. In WSDM. 63-72.
[19] Christophe Van Gysel, Evangelos Kanoulas, and Maarten de Rijke. 2017. Pyndri: A Python Interface to the Indri Search Engine. arXiv:1701.00749.
[20] Ellen M. Voorhees and Hoa Trang Dang. 2005. Overview of the TREC 2005 Question Answering Track. In TREC.
[21] Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 Question Answering Track Evaluation. In TREC.
[22] Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A Cascade Ranking Model for Efficient Ranked Retrieval. In SIGIR. 105-114.
[23] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In EMNLP-CoNLL. 22-32.
[24] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In SIGIR.
[25] Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer Extraction as Sequence Tagging with Tree Edit Distance. In HLT-NAACL. 858-867.