Massively Parallel and Asynchronous Tsetlin Machine Architecture Supporting Almost Constant-Time Scaling
Kuruge Darshana Abeyrathna *1, Bimal Bhattarai *1, Morten Goodwin *1, Saeed Rahimi Gorji *1, Ole-Christoffer Granmo *1, Lei Jiao *1, Rupsa Saha *1, Rohan Kumar Yadav *1

Abstract

Using logical clauses to represent patterns, Tsetlin machines (TMs) have recently obtained competitive performance in terms of accuracy, memory footprint, energy, and learning speed on several benchmarks. Each TM clause votes for or against a particular class, with classification resolved using a majority vote. While the evaluation of clauses is fast, being based on binary operators, the voting makes it necessary to synchronize the clause evaluation, impeding parallelization. In this paper, we propose a novel scheme for desynchronizing the evaluation of clauses, eliminating the voting bottleneck. In brief, every clause runs in its own thread for massive native parallelism. For each training example, we keep track of the class votes obtained from the clauses in local voting tallies. The local voting tallies allow us to detach the processing of each clause from the rest of the clauses, supporting decentralized learning. This means that the TM most of the time will operate on outdated voting tallies. We evaluated the proposed parallelization across diverse learning tasks and it turns out that our decentralized TM learning algorithm copes well with working on outdated data, resulting in no significant loss in learning accuracy. Furthermore, we show that the proposed approach provides up to 50 times faster learning. Finally, learning time is almost constant for reasonable clause amounts (employing from 20 to 7,000 clauses on a Tesla V100 GPU). For sufficiently large clause numbers, computation time increases approximately proportionally. Our parallel and asynchronous architecture thus allows processing of massive datasets and operating with more clauses for higher accuracy.

*Equal contribution (the authors are ordered alphabetically by last name). 1Department of Information and Communication Technology, University of Agder, Grimstad, Norway. Correspondence to: Ole-Christoffer Granmo <[email protected]>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Tsetlin machines (TMs) (Granmo, 2018) have recently demonstrated competitive results in terms of accuracy, memory footprint, energy, and learning speed on diverse benchmarks (image classification, regression, natural language understanding, and speech processing) (Berge et al., 2019; Yadav et al., 2021a; Abeyrathna et al., 2020; Granmo et al., 2019; Wheeldon et al., 2020; Abeyrathna et al., 2021; Lei et al., 2021). They use frequent pattern mining and resource allocation principles to extract common patterns in the data, rather than relying on minimizing output error, which is prone to overfitting. Unlike the intertwined nature of pattern representation in neural networks, a TM decomposes problems into self-contained patterns, expressed as conjunctive clauses in propositional logic (i.e., in the form: if input X satisfies condition A and not condition B, then output y = 1). The clause outputs, in turn, are combined into a classification decision through summation and thresholding, akin to a logistic regression function, however, with binary weights and a unit step output function. Being based on the human-interpretable disjunctive normal form (Valiant, 1984), like Karnaugh maps (Karnaugh, 1953), a TM can map an exponential number of input feature value combinations to an appropriate output (Granmo, 2018).
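To make the clause-and-vote mechanism concrete, here is a minimal Python sketch of inference only. It is our own illustration, not the paper's implementation: all function and variable names are ours, and the learning of clauses is omitted. Each clause is a conjunction over included, possibly negated, features; the clause votes are summed according to their polarity and thresholded with a unit step:

```python
def evaluate_clause(x, include, negate):
    """Conjunctive clause: output 1 only if every included literal holds.
    include[k] -> feature k takes part in the clause;
    negate[k]  -> the negated literal NOT x_k is used instead of x_k."""
    for x_k, inc, neg in zip(x, include, negate):
        if inc and bool(x_k) == neg:   # required literal is falsified
            return 0
    return 1


def classify(x, clauses):
    """Sum the polarity-weighted clause votes and apply a unit step."""
    votes = sum(pol * evaluate_clause(x, inc, neg) for pol, inc, neg in clauses)
    return 1 if votes >= 0 else 0


# Three hand-written clauses over three Boolean features:
# (+1) x_0 AND NOT x_1,  (+1) x_2,  (-1) x_1
clauses = [
    (+1, [True, True, False], [False, True, False]),
    (+1, [False, False, True], [False, False, False]),
    (-1, [False, True, False], [False, False, False]),
]
print(classify([1, 0, 1], clauses))  # votes = 1 + 1 - 0 = 2  ->  y = 1
```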
Recent progress on TMs. Recent research reports several distinct TM properties. The TM can be used in convolution, providing competitive performance on MNIST, Fashion-MNIST, and Kuzushiji-MNIST, in comparison with CNNs, K-Nearest Neighbor, Support Vector Machines, Random Forests, Gradient Boosting, BinaryConnect, Logistic Circuits and ResNet (Granmo et al., 2019). The TM has also achieved promising results in text classification (Berge et al., 2019), word sense disambiguation (Yadav et al., 2021b), novelty detection (Bhattarai et al., 2021c; 2021b), fake news detection (Bhattarai et al., 2021a), semantic relation analysis (Saha et al., 2020), and aspect-based sentiment analysis (Yadav et al., 2021a), using the conjunctive clauses to capture textual patterns. Recently, regression TMs compared favorably with Regression Trees, Random Forest Regression, and Support Vector Regression (Abeyrathna et al., 2020). The above TM approaches have further been enhanced by various techniques. By introducing real-valued clause weights, it turns out that the number of clauses can be reduced by up to 50× without loss of accuracy (Phoulady et al., 2020). Also, the logical inference structure of TMs makes it possible to index the clauses on the features that falsify them, increasing inference and learning speed by up to an order of magnitude (Gorji et al., 2020). Multi-granular clauses simplify the hyper-parameter search by eliminating the pattern specificity parameter (Gorji et al., 2019). In (Abeyrathna et al., 2021), stochastic searching on the line automata (Oommen, 1997) learn integer clause weights, performing on par with or better than Random Forest, Gradient Boosting, Neural Additive Models, StructureBoost and Explainable Boosting Machines. Closed-form formulas for both local and global TM interpretation, akin to SHAP, were proposed by Blakely & Granmo (2020). From a hardware perspective, energy usage can be traded off against accuracy by making inference deterministic (Abeyrathna et al., 2020). Additionally, Shafik et al. (2020) show that TMs can be fault-tolerant, completely masking stuck-at faults. Recent theoretical work proves convergence to the correct operator for "identity" and "not". It is further shown that arbitrarily rare patterns can be recognized, using a quasi-stationary Markov chain-based analysis. The work finally proves that when two patterns are incompatible, the most accurate pattern is selected (Zhang et al., 5555). Convergence for the "XOR" operator has also recently been proven by Jiao et al. (2021).

Paper Contributions. In all of the above mentioned TM schemes, the clauses are learnt using Tsetlin automaton (TA) teams (Tsetlin, 1961) that interact to build and integrate conjunctive clauses for decision-making. While producing accurate learning, this interaction creates a bottleneck that hinders parallelization. That is, the clauses must be evaluated and compared before feedback can be provided to the TAs.

In this paper, we first cover the basics of TMs in Section 2. Then, we propose a novel parallel and asynchronous architecture in Section 3, where every clause runs in its own thread for massive parallelism. We eliminate the above interaction bottleneck by introducing local voting tallies that keep track of the clause outputs, per training example. The local voting tallies detach the processing of each clause from the rest of the clauses, supporting decentralized learning. Thus, rather than processing training examples one-by-one as in the original TM, the clauses access the training examples simultaneously, updating themselves and the local voting tallies in parallel (a minimal code sketch of this idea is given below). In Section 4, we investigate the properties of the new architecture empirically on regression, novelty detection, semantic relation analysis, and word sense disambiguation. We further investigate how processing time scales with the number of clauses, uncovering almost constant-time processing over reasonable clause amounts. Finally, in Section 5, we conclude with pointers to future work, including architectures for grid-computing and heterogeneous systems spanning the cloud and the edge.

The main contributions of the proposed architecture can be summarized as follows:

• Learning time is made almost constant for reasonable clause amounts (employing from 20 to 7,000 clauses on a Tesla V100 GPU).

• For sufficiently large clause numbers, computation time increases approximately proportionally to the increase in number of clauses.

• The architecture copes remarkably well with working on outdated data, resulting in no significant loss in learning accuracy across diverse learning tasks (regression, novelty detection, semantic relation analysis, and word sense disambiguation).

Our parallel and asynchronous architecture thus allows processing of more massive data sets and operating with more clauses for higher accuracy, significantly increasing the impact of logic-based machine learning.
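To convey how the local voting tallies remove the synchronization barrier, the following is a deliberately simplified, CPU-only Python sketch. It is our own construction: threads and locks stand in for the paper's GPU threads and atomic operations, and the clause evaluation plus Tsetlin automata feedback are reduced to a placeholder. Each clause thread reads a possibly outdated tally for the current example and publishes only the change in its own vote:

```python
import random
import threading

# Toy setup -- all names are ours; the update rule is a stand-in.
targets = [1, 0, 1, 0]                         # class label per training example
tallies = [0] * len(targets)                   # local voting tally per example
locks = [threading.Lock() for _ in targets]    # stands in for atomic adds
T = 2                                          # voting margin (feedback target)


def clause_thread(polarity, epochs=20):
    """One clause = one thread. It reads a possibly outdated tally, uses it
    to decide whether an update is still needed, then publishes only the
    change in its own vote -- no barrier across clauses."""
    prev_vote = [0] * len(targets)
    for _ in range(epochs):
        for e, y in enumerate(targets):
            stale = tallies[e]                 # may lag behind other clauses
            if (y == 1 and stale >= T) or (y == 0 and stale <= -T):
                continue                       # margin reached; skip this example
            output = random.randint(0, 1)      # placeholder for clause evaluation
            vote = polarity * output           # and Tsetlin automata feedback
            with locks[e]:                     # atomic tally update
                tallies[e] += vote - prev_vote[e]
            prev_vote[e] = vote


threads = [threading.Thread(target=clause_thread, args=(p,))
           for p in (+1, +1, -1, -1)]          # two clauses of each polarity
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final per-example tallies:", tallies)
```

The design point is that no clause ever waits for the others to finish voting; learning only requires the tallies to be approximately up to date, which the empirical results in Section 4 indicate does not harm accuracy.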
2. Tsetlin Machine Basics

2.1. Classification

A TM takes a vector $X = [x_1, \ldots, x_o]$ of $o$ Boolean features as input, to be classified into one of two classes, $y = 0$ or $y = 1$. These features are then converted into a set of literals that consists of the features themselves as well as their negated counterparts: $L = \{x_1, \ldots, x_o, \lnot x_1, \ldots, \lnot x_o\}$.

If there are $m$ classes and $n$ sub-patterns per class, a TM employs $m \times n$ conjunctive clauses to represent the sub-patterns. For a given class, we index its clauses by $j$, $1 \le j \le n$, each clause being a conjunction of literals:

$$C_j(X) = \bigwedge_{l_k \in L_j} l_k. \qquad (1)$$

Here, $l_k$, $1 \le k \le 2o$, is a feature or its negation. Further, $L_j$ is a subset of the literal set $L$. For example, the particular clause $C_j(X) = x_1 \land \lnot x_2$ consists of the literals $L_j = \{x_1, \lnot x_2\}$ and outputs 1 if $x_1 = 1$ and $x_2 = 0$.

The number of clauses $n$ assigned to each class is user-configurable. The clauses with odd indexes are assigned pos-