Massively Parallel MIMD Architecture Achieves High Performance in a Spam Filter

John von Neumann Institute for Computing Massively Parallel MIMD Architecture Achieves High Performance in a Spam Filter O.R. Birkeland, M. Nedland, O. Snøve Jr. published in Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005, G.R. Joubert, W.E. Nagel, F.J. Peters, O. Plata, P. Tirado, E. Zapata ( Editors), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 33, ISBN 3-00-017352-8, pp. 691-698, 2006. c 2006 by John von Neumann Institute for Computing Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above. http://www.fz-juelich.de/nic-series/volume33 6911 Massively parallel MIMD architecture achieves high performance in a spam filter O.R. Birkelanda, M. Nedlanda, O. Snøve Jr.a aInteragon AS, Medisinsk teknisk senter, NO-7489 Trondheim, Norway Supercomputer manufacturing is usually a race for floating point operations, and most therefore opt for a design that allows for the highest possible clock frequency. Many modern applications are, however, limited not only by the number of operations that can be performed on the data, but by the available memory bandwidth. We review the main features of a MISD architecture that we have introduced earlier, and show how a system based on these chips is able to scale with respect to query and data volume in an email spam filtering application. We compare the results of a minimal solution using our technology with the performance of two popular implementations of deterministic and nondeterministic automatons, and show how both fail to scale well with neither query nor data volumes. 1. Introduction Mainstream parallel computer systems have diverse design targets. Projects like BlueGene/L aim for the highest performance on numerical calculations [6], whereas designs like Green Destiny target efficiency, reliability and availability instead [2]. A common characteristic of most designs is the dominating supercomputer requirement for floating point operations, but several applications, for instance within bioinformatics and network traffic monitoring, are limited by the available memory bandwidth rather than the raw floating point compute power [8,1]. Brute force data mining, that is linear searching through raw data, is one such application. Fine grain parallelism have been proposed by others, both as minute compute nodes [4] or inte- gration of memory and processing circuitry [8,5]. In our work, we propose an architecture that takes advantage of one of the main assets of synchronous memory technology, which is high sustainable bandwidth during linear access. We review a MIMD architecture for pattern matching with regular expression-type queries, with high performance, high density and low infrastructure requirements. By tailoring the compute logic for the given problem, the system’s size and complexity can be re- duced to a minimum. Focusing on non-computational applications, we are able to build systems with higher density and more operations per second than traditional clusters and supercomputers. One application that requires scalability with respect to both query and data volume is email spam filtering. Spam is prevented with a large number of approaches, but most involve (among others) using regular expressions to find common spam patterns. The number of required spam patterns are increasing, as well as the volume of email, resulting in an increasing demand for compute power. Some of the common regular expression evaluators like DFA scanners have severe scalability issues. We propose a scalable architecture for evaluation queries such as regular expressions, that achieves high performance in a compact system by use of fine grain parallel processing. 2. System architecture We concentrate on the design considerations for this architecture, including general purpose versus special purpose hardware; fine grain versus coarse grain parallelism; compute density; cost of ownership; and requirements for development resources. 6922 Our project started with the observation that an index-based algorithm requires that the keywords are known beforehand, and the realization that an extremely rapid brute force linear search eliminates the need for an indexing step. A continuous linear search also implies that searching becomes independent of the queries’s complexity. Since the execution time for such a search is O(n), it becomes important to use the peak memory bandwidth, and have as many independent memory channels as possible. We developed a fine-grain parallel architecture called the pattern matching chip (PMC) [3], where each node can be as small as a single ASIC and a single memory device. Each PMC contains 1,024 individual processing elements operating on a shared data stream. We constructed a MIMD system with PCI cards, each containing 16 PMC chips, accelerating a standard PC to perform 1013 symbol comparisons per second. Each chip has a dedicated memory bank; thus, the memory volume and bandwidth scale at the same rate as the number of processing elements (PEs). 16 or more PEs are used for each query, depending on query size. This allows for up to 64 simultaneous queries per chip. Additional queries can be stored in the local memory, requiring 16 kB for each configuration for all 1024 PEs. Thus several thousand queries can be resident for each PMC, and be used with minimal system involvement, only to specify which queries to use. In spam filtering, the email data can be uploaded to local memory once, and then interactively processed by several sets of queries. If the data rate is lower than the scan rate, the difference can be reclaimed in the form of an increased number of queries. For example, one chip could scan 64 queries at 100 MB/s data rate, or 6400 queries at 1 MB/s data rate. Query configurations, as well as data, can be loaded to local SDRAM, even during searches, without any performance penalties. The performance is linear with respect to the number of queries, the length of the queries, and the data rate. It is also linearly scalable with more PMC chips. The processing speed is limited by the memory bandwidth in each node. Consequently, the PMC is tailored to run at the memory data rate. As we did not aim for the highest processing clock frequency, a low power design was possible, and that proves essential when small building blocks are used to build dense systems. As all memory accesses take place as linear scans, the required power for searching though the data is minimized as well. Any non-predictable access pattern involves protocol overhead, for example in switching pages within the memory chip, during which a lot of time and energy are consumed. The same methodology is applied to the configuration of the chip. The entire configuration is read once into the chip and stored in configuration registers. Thus, there are no instruction memory accesses during searches. As a side effect, the full memory bandwidth is available for data transfer. The large number of small PEs on each chip implies no local hot spot. The same applies when several chips are used on the PCI card, without a requirement for active cooling. The density that can be achieved in the system is thus governed by the transistor density on silicon (PMC and SDRAM), rather than the power consumption density. Another feat of this design is that smaller cores also imply less engineering. As we target applications where data can easily be distributed for processing, no elaborate interconnection scheme is required. The only custom part in the design is the PMC, and the remaining components are sourced from standard technologies like SDRAM and PCI. Correspondingly, the use of a small, repetitive, orthogonal core element also makes the compiler design simpler. Such compilation also involves minimal optimization steps, and executes very fast. DFA scanners like lex could achieve high scan rates, but with an unpredictable compile time. The architectural concepts have been proven in a five PC cluster, achieving 5 · 1013 operations per second with near-linear scalability. Performance is achieved by massively distributed compute 6933 resources, which translates to 500,000 processing elements in the aforementioned cluster. Further- more, our evaluation system has a total of 60 GB of memory, with an aggregated bandwidth of 48 GB per second. We will compare the performance of the PMC with two popular software algorithms in a email spam filtering application. Our reason for choosing this application as our example, is that we are able to show how we can tailor the performance to yield the appropriate ratio of search speed to query throughput. 3. Problem definition Spam—unsolicited commercial emails or unsolicited bulk emails—has become a tremendous problem for email users, and is increasing. Users around the world receive billions of spam messages each day and are indirectly forced to bear the cost of delivery and storage of these unwanted emails. A number of ways exist to deal with the problem—they range from simple blacklists of bad senders, words, or IP addresses to complex automated filtering approaches. One of these popular filters is SpamAssassin, a program that uses hundreds of weighted rules to obtain a score for each email header (see http://spamassassin.apache.org/ for additional details). Patterns are constructed to match popular words used by spammers and a genetic algorithm optimizes the weights so that the combined predictor achieves optimal performance on a manually curated training set. Users must determine the actions that should be taken when an email receives a score that puts a spam label on the message according to the user’s predefined thresholds.

Load more