The Hydra Project

Hydra, currently the strongest chess program in the world, is a cutting-edge application that combines cluster computing with the fine-grain parallel FPGA world.

by Chrilly Donninger, Programmer, Nimzo Werkstatt OEG
and Ulf Lorenz, Researcher, Paderborn University

The International Computer Games Association (ICGA) regularly organizes computer chess world championships. For quite a long time, large mainframe computers won these championships. Since 1992, however, only PC programs have been world chess champions. They have dominated the field, increasing their playing strength by about 30 ELO per year. (ELO is a statistical rating system in which a difference of 100 points corresponds to a 64% winning chance for the stronger player; the expected score against an opponent rated D points lower is 1/(1 + 10^(-D/400)). Rating levels mark expertise: Beginner ≈ 1,000 ELO, International Master ≈ 2,400, Grandmaster ≈ 2,500, World Champion ≈ 2,830.)

Today, the computer chess community is highly developed, with special machine rooms using virtual reality and closed and open tournament rooms. Anybody can play against grandmasters or strong machines through the Internet.

The Hydra Project is internationally driven, financed by PAL Computer Systems in Abu Dhabi, United Arab Emirates. The core team comprises programmer Chrilly Donninger in Austria; researcher Ulf Lorenz from the University of Paderborn in Germany; chess grandmaster Christopher Lutz in Germany; and project manager Muhammad Nasir Ali in Abu Dhabi. FPGAs from Xilinx® are provided on PCI cards from AlphaData in the United Kingdom. The compute cluster is built by Megware in Germany, supported by the Paderborn Center for Parallel Computing. Taiwan is involved as well.

The goal of the Hydra Project is literally one of world dominance in the computer chess world: a final, widely accepted victory over human players. Indeed, we are convinced that in 2005, a computer will be the strongest chess entity, a world first.

Four programs stand out as serious contenders for the crown:

• Shredder, by Stefan Meyer-Kahlen, the dominating program over the last decade
• Fritz, by Frans Morsch, the most well-known program
• Junior, by Amir Ban and Shay Bushinsky, the current Computer Chess World Champion
• Our program, Hydra, in our opinion the strongest program at the moment

These four programs scored more than 95% of the points against their opponents in the 2003 World Championship.

Computational speed and sophisticated chess knowledge are the two most important features of chess programs, and FPGAs play an important role in Hydra by meeting the demands of both. Additionally, FPGAs provide these benefits:

• Implementing more knowledge requires additional space, but nearly no additional time.
• FPGA code can be debugged and changed like software, without long ASIC development cycles. This is an important feature, because the evolution of a chess program never ends, and the dynamic progress in computer chess enforces short development cycles. Flexibility is therefore at least as important as speed.
• We can use a lot of fine-grain parallelism.

Technical Description
The key feature that enables computer chess programs to play as strong as or stronger than the best human players is their search algorithm. The programs perform a forecast: given a certain position, what can I do, what can my opponent do next, and what can I do thereafter? Modern programs use some variant of the Alphabeta algorithm to examine the resulting game tree. This algorithm is optimal in the sense that in most cases it examines only O(b^(t/2)) leaves instead of b^t, assuming a game tree of depth t and a uniform branching factor of b. With the help of upper and lower bounds, the algorithm uses information that it collects during the search to keep the remaining search tree small. This makes it a sequential procedure that is difficult to parallelize, and naïve approaches waste resources.

Although the Alphabeta algorithm is efficient, we cannot compute true values for all positions in games like chess; the game tree is simply far too large. Therefore, we use tree search as an approximation procedure. First, we select a partial tree rooted near the top of the complete game tree for examination; usually we select it with the help of a maximum depth parameter. We then assign heuristic values (such as "one side has a queen more, so that side will probably win") to the artificial leaves of the pre-selected partial tree, and we propagate these values up to the root of the tree as if they were true ones (Figure 1).

Figure 1 – Game tree search (in the blue part) leads to an approximation procedure.

The key observation from the last 40 years of computer chess is that the game tree acts as an error filter. The larger the tree that we can examine and the more sophisticated its shape, the better its error-filter property. Therefore, what we need is speed.
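To make the mechanics concrete, here is a minimal sketch of a depth-limited Alphabeta search in C, in the common negamax formulation (an illustration, not Hydra's code). The Position type and the generate_moves, make_move, unmake_move, and evaluate hooks are hypothetical stand-ins for a real engine, stubbed so that the sketch compiles. The two essential points are the heuristic evaluation at the artificial leaves and the [alpha, beta] window that prunes subtrees which cannot influence the result.

#include <stdio.h>

#define MAX_MOVES 256

/* Hypothetical engine interface, stubbed so the sketch compiles. */
typedef struct { int dummy; } Position;
static int  generate_moves(Position *p, int m[MAX_MOVES]) { (void)p; (void)m; return 0; }
static void make_move(Position *p, int m)                 { (void)p; (void)m; }
static void unmake_move(Position *p, int m)               { (void)p; (void)m; }
static int  evaluate(const Position *p)                   { (void)p; return 0; }

/* Depth-limited Alphabeta in negamax form. [alpha, beta] is the window
 * of values that can still matter to the caller; values outside it only
 * need to be bounded, not computed exactly. */
static int alphabeta(Position *pos, int depth, int alpha, int beta)
{
    int moves[MAX_MOVES];
    int n, i;

    /* Artificial leaf: the heuristic evaluation (material and so on)
     * stands in for the unknowable true game value. */
    if (depth == 0)
        return evaluate(pos);

    n = generate_moves(pos, moves);
    if (n == 0)
        return evaluate(pos);   /* mate/stalemate handling elided */

    for (i = 0; i < n; i++) {
        make_move(pos, moves[i]);
        int value = -alphabeta(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, moves[i]);

        if (value >= beta)
            return beta;        /* cutoff: this line is already too good,
                                   so the opponent will avoid it */
        if (value > alpha)
            alpha = value;      /* raise the lower bound */
    }
    return alpha;
}

int main(void)
{
    Position root = {0};
    printf("value: %d\n", alphabeta(&root, 6, -100000, 100000));
    return 0;
}

The cutoffs are the source of the O(b^(t/2)) figure: with good move ordering, a single refutation move suffices to dismiss most subtrees. They are also why the search parallelizes so poorly; the bounds that enable a cutoff are produced sequentially by previously searched siblings.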
The Hardware Architecture
Hydra uses the ChessBase/Fritz GUI running on a Windows XP PC. It connects through the Internet, using ssh, to our Linux cluster, which comprises eight dual-PC server nodes, each able to handle two PCI buses simultaneously. Each PCI bus contains one FPGA accelerator card. One message passing interface (MPI) process is mapped onto each of the processors, and one of the FPGAs is associated with it as well. A Myrinet network interconnects the server nodes (Figure 2).

Figure 2 – A cluster of dual PCs supplied with two FPGA cards; each is connected to a GUI via the Internet. (The figure shows four dual PCs on a Myrinet interconnection network; each node holds two CPUs, a memory hub controller, RAM, and two FPGAs on its PCI buses.)

The Software Architecture
The software is partitioned into two parts: the distributed search algorithm, running on the Pentium nodes of the cluster, and the soft coprocessor on the Xilinx FPGAs.

The basic idea behind our parallelization is to decompose the search tree in order to search parts of it in parallel, and to balance the load dynamically with the help of the work-stealing concept.

First, a special processor, P0, gets the search problem and starts performing the forecast algorithm as if it were acting sequentially. At the same time, the other processors send requests for work to randomly chosen processors. When Pi (a processor that is already supplied with work) catches such a request, it checks whether there are unexplored parts of its search tree ready for evaluation; these unexplored parts are all rooted at the right siblings of the nodes on Pi's search stack. Pi either sends back a message that it cannot provide work, or it sends a work packet (a chess position with bounds) to the requesting processor Pj. Thus, Pi becomes a master itself, and Pj starts a sequential search of its own. A processor can be master and worker at the same time.

These relationships change dynamically during the computation. When Pj has finished its work (possibly with the help of other processors), it sends an answer message to Pi. The master/worker relationship between Pi and Pj is then released, and Pj becomes idle; it again starts sending requests for work into the network. When processor Pi finds out that it has sent a wrong αβ-window to one of its workers, it informs that worker, so that the worker continues with correct bounds.
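The skeleton below illustrates how such a work-stealing message flow can look in MPI (a sketch under assumptions, not Hydra's actual protocol). The tags, the WorkPacket layout, and the have_unexplored_siblings and split_off_right_sibling hooks are invented for illustration, and the stubs make the sketch compile rather than play chess.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical message tags and packet layout. */
enum { TAG_REQUEST = 1, TAG_NOWORK = 2, TAG_WORK = 3, TAG_ANSWER = 4 };

typedef struct {
    char position[128];   /* encoded chess position */
    int  alpha, beta;     /* the alphabeta window shipped with the work */
    int  depth;
} WorkPacket;

/* Engine hooks, stubbed so the sketch compiles. */
static int have_unexplored_siblings(void) { return 0; }
static WorkPacket split_off_right_sibling(void) { WorkPacket w = {{0}, 0, 0, 0}; return w; }
static int search_sequentially(const WorkPacket *w) { (void)w; return 0; }

int main(int argc, char **argv)
{
    int rank, size, busy;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    busy = (rank == 0);   /* P0 starts with the root problem */

    for (;;) {
        if (!busy) {
            /* Idle: ask a randomly chosen peer for work, then wait. */
            int peer, dummy = 0;
            MPI_Status st;
            do { peer = rand() % size; } while (peer == rank);
            MPI_Send(&dummy, 1, MPI_INT, peer, TAG_REQUEST, MPI_COMM_WORLD);

            MPI_Probe(peer, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_WORK) {
                WorkPacket w;
                int value;
                MPI_Recv(&w, (int)sizeof w, MPI_BYTE, peer, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                value = search_sequentially(&w);   /* now a worker of peer */
                MPI_Send(&value, 1, MPI_INT, peer, TAG_ANSWER,
                         MPI_COMM_WORLD);          /* releases the relation */
            } else {
                MPI_Recv(&dummy, 1, MPI_INT, peer, TAG_NOWORK,
                         MPI_COMM_WORLD, &st);
            }
        } else {
            /* Busy: between search steps, poll for steal requests. */
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD,
                       &flag, &st);
            if (flag) {
                int dummy;
                MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &st);
                if (have_unexplored_siblings()) {
                    /* Become a master: ship a right sibling with bounds. */
                    WorkPacket w = split_off_right_sibling();
                    MPI_Send(&w, (int)sizeof w, MPI_BYTE, st.MPI_SOURCE,
                             TAG_WORK, MPI_COMM_WORLD);
                    /* Later, TAG_ANSWER merges the result; a corrected
                     * window would also be forwarded to the worker. */
                } else {
                    MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE,
                             TAG_NOWORK, MPI_COMM_WORLD);
                }
            }
            /* ... advance the local search by one step here ... */
        }
    }
    /* unreachable in this sketch: MPI_Finalize(); */
}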
The most time-critical computations we solve with the help of a Configware coprocessor, benefiting from the fine-grain parallelism inside the application. We have a complete chess program on-chip, consisting of modules for the search, the evaluation, generating moves, and executing or taking back moves. At present, we use 67 block RAMs, 9,879 slices, 5,308 TBUFs, 534 flip-flops, and 18,403 LUTs. An upper bound for the number of cycles per search node is nine; we estimate that a pure software solution would require at least 6,000 Pentium cycles per node. The longest path consists of 51 logic levels, and the design runs at 30 MHz on a Virtex™-I 1000 (at 30 MHz and at most nine cycles per node, one FPGA searches more than 3 million nodes per second). We have just ported the design to a Virtex XC2VP70-5, so we can now run the program at 50 MHz.

In software, a move generator is usually implemented as a quad-loop: one loop over all piece types; an inner loop over the pieces of that type; a loop over each piece's movement directions; and an innermost loop over the squares along each direction (a code sketch follows Figure 3). The FPGA move generator instead consists of two 8 x 8 chess boards, as shown in Figure 3. The GenAggressor and GenVictim modules instantiate 64 square instances each; both determine to which neighbor square incoming signals must be forwarded. The square instances send piece signals (if there is a piece on that square) and forward the signals of far-reaching pieces to neighboring square instances. Additionally, each square can output the signal "victim found," which tells us that this square is a "victim" (a to-square of a legal move). The collection of all "victim found" signals is input to an arbiter (a comparator tree) that selects the most attractive not-yet-examined victim. The GenAggressor module then takes the arbiter's output as input and sends the signal of a super-piece (a combination of all possible pieces).

Figure 3 – The gen modules form the move generator. Each of the two comparator trees takes 64 x 16 input bits.
GenVictim (to-square): 1. Occupied squares send a signal. 2. Free squares forward these signals. 3. All squares receiving a signal are potential to-squares. 4. A comparator tree selects the most attractive to-square.
GenAggressor (from-square): 1. The winner square generates the signal of a super-piece. 2. Free squares forward the signals. 3. Squares occupied by own pieces are potential from-squares. 4. A comparator tree selects the most attractive from-square (capturing moves).
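For comparison, the sketch below shows the conventional software quad-loop for sliding pieces on a 0x88 board (an illustration, not Hydra's code; the board setup and piece lists are reduced to a bare minimum). The four nested loops emit moves one at a time, serializing exactly the work that the FPGA's 64 square instances perform simultaneously.

#include <stdio.h>

#define EMPTY  0
#define OWN    1      /* occupied by own piece   */
#define ENEMY  2      /* occupied by enemy piece */

enum { BISHOP, ROOK, QUEEN, NUM_TYPES };

static const int n_dirs[NUM_TYPES]  = { 4, 4, 8 };
static const int dirs[NUM_TYPES][8] = {
    {  17,  15, -17, -15 },                    /* bishop */
    {  16, -16,   1,  -1 },                    /* rook   */
    {  17,  15, -17, -15, 16, -16, 1, -1 },    /* queen  */
};

int board[128];                /* 0x88 board: square valid iff !(sq & 0x88) */
int piece_sq[NUM_TYPES][10];   /* squares of own pieces, by type */
int piece_cnt[NUM_TYPES];

/* The quad-loop: piece types -> pieces -> directions -> squares. */
void generate_sliding_moves(void)
{
    for (int type = 0; type < NUM_TYPES; type++)                 /* loop 1 */
        for (int i = 0; i < piece_cnt[type]; i++) {              /* loop 2 */
            int from = piece_sq[type][i];
            for (int d = 0; d < n_dirs[type]; d++)               /* loop 3 */
                for (int to = from + dirs[type][d];              /* loop 4 */
                     !(to & 0x88) && board[to] != OWN;
                     to += dirs[type][d]) {
                    printf("move %d -> %d\n", from, to);
                    if (board[to] == ENEMY)   /* a capture ends the ray */
                        break;
                }
        }
}

int main(void)
{
    /* Tiny demo: own rook on a1 (0x88 index 0), enemy piece on a4 (48). */
    piece_cnt[ROOK]   = 1;
    piece_sq[ROOK][0] = 0;
    board[0]  = OWN;
    board[48] = ENEMY;
    generate_sliding_moves();
    return 0;
}

Running the demo lists the rook's northward moves a2 and a3, the capture on a4, and then the moves along the first rank (printed as 0x88 indices).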
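Steps 1 through 3 of GenVictim can also be mimicked in software with a bitboard flood fill (an analogy only; the FPGA uses 64 communicating square instances, not bitboards, and attacks_north is a name invented here). Each loop iteration plays the role of one parallel forwarding step by the free squares, and the final shift past the flooded squares reaches the first blocker, the potential capture victim.

#include <stdint.h>
#include <stdio.h>

/* Northward targets of rooks/queens: seven forwarding steps carry the
 * piece signal across the board, exactly as the square instances do. */
static uint64_t attacks_north(uint64_t sliders, uint64_t empty)
{
    uint64_t gen = sliders, flood = sliders;
    for (int i = 0; i < 7; i++) {
        gen = (gen << 8) & empty;   /* free squares forward the signal */
        flood |= gen;
    }
    return flood << 8;   /* one step further hits the first blocker */
}

int main(void)
{
    uint64_t own   = 1ULL;        /* rook on a1 (bit 0) */
    uint64_t enemy = 1ULL << 24;  /* enemy piece on a4  */
    uint64_t empty = ~(own | enemy);

    /* Squares receiving a signal, minus our own pieces, are the
     * potential to-squares ("victims") of Figure 3, step 3. */
    uint64_t victims = attacks_north(own, empty) & ~own;
    printf("to-squares: 0x%016llx\n", (unsigned long long)victims);
    return 0;
}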