Compressed Matrix Multiplication
Recommended publications
-
Fast Matrix Multiplication
Fast matrix multiplication Yuval Filmus May 3, 2017 1 Problem statement How fast can we multiply two square matrices? To make this problem precise, we need to fix a computational model. The model that we use is the unit cost arithmetic model. In this model, a program is a sequence of instructions, each of which is of one of the following forms: 1. Load input: $t_n \leftarrow x_i$, where $x_i$ is one of the inputs. 2. Load constant: $t_n \leftarrow c$, where $c \in \mathbb{R}$. 3. Arithmetic operation: $t_n \leftarrow t_i \circ t_j$, where $i, j < n$, and $\circ \in \{+, -, \cdot, /\}$. In all cases, $n$ is the number of the instruction. We will say that an arithmetic program computes the product of two $n \times n$ matrices if: 1. Its inputs are $x_{ij}, y_{ij}$ for $1 \le i, j \le n$. 2. For each $1 \le i, k \le n$ there is a variable $t_m$ whose value is always equal to $\sum_{j=1}^{n} x_{ij} y_{jk}$. 3. The program never divides by zero. The cost of multiplying two $n \times n$ matrices is the minimum length of an arithmetic program that computes the product of two $n \times n$ matrices. Here is an example. The following program computes the product of two $1 \times 1$ matrices: $t_1 \leftarrow x_{11}$; $t_2 \leftarrow y_{11}$; $t_3 \leftarrow t_1 \cdot t_2$. The variable $t_3$ contains $x_{11} y_{11}$, and so satisfies the second requirement above for $(i, k) = (1, 1)$. This shows that the cost of multiplying two $1 \times 1$ matrices is at most 3. Usually we will not spell out all steps of an arithmetic program, and use simple shortcuts, such as the ones indicated by the following program for multiplying two $2 \times 2$ matrices: $z_{11} \leftarrow x_{11} y_{11} + x_{12} y_{21}$; $z_{12} \leftarrow x_{11} y_{12} + x_{12} y_{22}$; $z_{21} \leftarrow x_{21} y_{11} + x_{22} y_{21}$; $z_{22} \leftarrow x_{21} y_{12} + x_{22} y_{22}$. When spelled out, this program uses 20 instructions: 8 load instructions, 8 product instructions, and 4 addition instructions. -
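The unit cost model in the excerpt above can be made concrete with a small interpreter. The following Python sketch is illustrative only: the tuple encoding of instructions and the names `run_program` and `matmul_2x2_program` are assumptions introduced here, not part of the lecture notes. It spells out the 2 × 2 program and counts its instructions.

```python
# Hypothetical encoding (not from the notes): each instruction is a tuple.
#   ("input", name)   load an input variable
#   ("const", value)  load a constant
#   (op, i, j)        arithmetic on earlier results t_i and t_j (1-based)

def run_program(program, inputs):
    """Evaluate a straight-line arithmetic program and return [t_1, ..., t_N]."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,  # raises ZeroDivisionError on division by zero
    }
    values = []
    for instr in program:
        if instr[0] == "input":
            values.append(inputs[instr[1]])
        elif instr[0] == "const":
            values.append(instr[1])
        else:
            op, i, j = instr
            # Operands must refer to earlier instructions (i, j < n).
            assert 1 <= i <= len(values) and 1 <= j <= len(values)
            values.append(ops[op](values[i - 1], values[j - 1]))
    return values


def matmul_2x2_program():
    """Spell out the 2 x 2 product; return the program and the output positions."""
    program, index = [], {}
    for name in ["x11", "x12", "x21", "x22", "y11", "y12", "y21", "y22"]:
        program.append(("input", name))
        index[name] = len(program)
    outputs = {}
    for i in "12":
        for k in "12":
            products = []
            for j in "12":
                program.append(("*", index["x" + i + j], index["y" + j + k]))
                products.append(len(program))
            program.append(("+", products[0], products[1]))
            outputs["z" + i + k] = len(program)
    return program, outputs


program, outputs = matmul_2x2_program()
print(len(program))  # 8 loads + 8 products + 4 additions = 20
```

The length of the list is the cost of this particular program, so it gives an upper bound of 20 on the cost of 2 × 2 multiplication in this model; it says nothing about the minimum.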
Efficient Matrix Multiplication on SIMD Computers
SIAM J. MATRIX ANAL. APPL. © 1992 Society for Industrial and Applied Mathematics Vol. 13, No. 1, pp. 386-401, January 1992 025 EFFICIENT MATRIX MULTIPLICATION ON SIMD COMPUTERS P. BJØRSTAD, F. MANNE, T. SØREVIK, AND M. VAJTERŠIC Dedicated to Gene H. Golub on the occasion of his 60th birthday Abstract. Efficient algorithms are described for matrix multiplication on SIMD computers. SIMD implementations of Winograd's algorithm are considered in the case where additions are faster than multiplications. Classical kernels and the use of Strassen's algorithm are also considered. Actual performance figures using the MasPar family of SIMD computers are presented and discussed. Key words. matrix multiplication, Winograd's algorithm, Strassen's algorithm, SIMD computer, parallel computing AMS(MOS) subject classifications. 15-04, 65-04, 65F05, 65F30, 65F35, 68-04 1. Introduction. One of the basic computational kernels in many linear algebra codes is the multiplication of two matrices. It has been realized that most problems in computational linear algebra can be expressed in block algorithms and that matrix multiplication is the most important kernel in such a framework. This approach is essential in order to achieve good performance on computer systems having a hierarchical memory organization. Currently, computer technology is strongly directed towards this design, due to the imbalance between the rapid advancement in processor speed relative to the much slower progress towards large, inexpensive, fast memories. This algorithmic development is highlighted by Golub and Van Loan [13], where the formulation of block algorithms is an integrated part of the text. The rapid development within the field of parallel computing and the increased need for cache and multilevel memory systems are indeed well reflected in the changes from the first to the second edition of this excellent text and reference book. -
Matrix Multiplication: a Study Kristi Stefa Introduction
Matrix Multiplication: A Study. Kristi Stefa. Introduction: The task is to handle matrix multiplication efficiently, especially for large sizes. We want to measure how parallelism can be used to improve performance while keeping accuracy. There is also a need to improve on the "normal" multiplication algorithm, which is $O(n^3)$ for conventional square matrices of size $n \times n$. Strassen Algorithm: Introduces a different algorithm, one where the work is done on submatrices. Requires the matrix to be square with size $n = 2^k$. Uses the A and B matrices to construct 7 factors from which the C matrix is computed. Complexity goes from 8 multiplications to 7, plus additions: $O(n^{\log_2 7})$, or about $n^{2.80}$. Room for parallelism: TBB's parallel_for is the perfect candidate for both the conventional algorithm and the Strassen algorithm. It can be used with 1D and 3D ranges to sweep a vector or a matrix. TBB's parallel_reduce would be possible but would increase the complexity of usage; it would be used in tandem with a 2D range for the task of multiplication. Test Setup and Use Cases: The input images were converted to binary by Matlab, and the test filter ("B" matrix) was generated in Matlab and on the board. Cases were 128x128, 256x256, 512x512, and 1024x1024. The Matlab filter was designed for element-wise multiplication, and board code was generated for a proper matrix-wise filter. Results were verified by Matlab and an import of the .bof file. Test setup (contd): Pictured to the right are the results for the small (128x128) case and a printout for the 256x256 case. Filter degrades the image horizontally and successful implementation produces -
The Strassen Algorithm and Its Computational Complexity
Utrecht University BSc Mathematics & Applications The Strassen Algorithm and its Computational Complexity Rein Bressers Supervised by: Dr. T. van Leeuwen June 5, 2020 Abstract Due to the many computational applications of matrix multiplication, research into efficient algorithms for multiplying matrices can lead to widespread improvements of performance. In this thesis, we will first make the reader familiar with a universal measure of the efficiency of an algorithm, its computational complexity. We will then examine the Strassen algorithm, an algorithm that improves on the computational complexity of the conventional method for matrix multiplication. To illustrate the impact of this difference in complexity, we implement and test both algorithms, and compare their runtimes. Our results show that while Strassen's method improves on the theoretical complexity of matrix multiplication, there are a number of practical considerations that need to be addressed for this to actually result in improvements on runtime. Contents: 1 Introduction; 1.1 Computational complexity; 1.2 Structure; 2 Computational Complexity; 2.1 Time complexity; 2.2 Determining an algorithm's complexity -
Accelerating Matching Pursuit for Multiple Time-Frequency Dictionaries
Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-20), Vienna, Austria, September 8–12, 2020 ACCELERATING MATCHING PURSUIT FOR MULTIPLE TIME-FREQUENCY DICTIONARIES Zdeněk Průša, Nicki Holighaus and Peter Balazs, Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria ABSTRACT Matching pursuit (MP) algorithms are widely used greedy methods to find K-sparse signal approximations in redundant dictionaries. We present an acceleration technique and an implementation of the matching pursuit algorithm acting on a multi-Gabor dictionary, i.e., a concatenation of several Gabor-type time-frequency dictionaries, consisting of translations and modulations of possibly different windows, time- and frequency-shift parameters. The proposed acceleration is based on pre-computing and thresholding inner products between atoms and on updating the residual directly in the coefficient domain, i.e., without the round-trip to the signal domain. [Introduction, excerpt] An overview of greedy algorithms, a class of algorithms MP falls under, can be found in [10, 11] and in the context of audio and music processing in [12, 13, 14]. Notable applications of MP algorithms in the audio domain include analysis [15], [16], coding [17, 18, 19], time scaling/pitch shifting [20], [21], source separation [22], denoising [23], partial and harmonic detection and tracking [24]. We present a method for accelerating MP-based algorithms acting on a single Gabor-type time-frequency dictionary or on a concatenation of several Gabor dictionaries with possibly different windows and parameters. The main idea of the present acceleration … -
Using Machine Learning to Improve Dense and Sparse Matrix Multiplication Kernels
Iowa State University, Graduate Theses and Dissertations (Iowa State University Capstones, Theses and Dissertations), 2019. Using machine learning to improve dense and sparse matrix multiplication kernels. Brandon Groth, Iowa State University. Part of the Applied Mathematics Commons, and the Computer Sciences Commons. Recommended Citation: Groth, Brandon, "Using machine learning to improve dense and sparse matrix multiplication kernels" (2019). Graduate Theses and Dissertations. 17688. https://lib.dr.iastate.edu/etd/17688 This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. Using machine learning to improve dense and sparse matrix multiplication kernels by Brandon Micheal Groth. A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY. Major: Applied Mathematics. Program of Study Committee: Glenn R. Luecke, Major Professor; James Rossmanith; Zhijun Wu; Jin Tian; Kris De Brabanter. The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred. Iowa State University, Ames, Iowa, 2019. Copyright © Brandon Micheal Groth, 2019. All rights reserved. DEDICATION I would like to dedicate this thesis to my wife Maria and our puppy Tiger. -
A Configurable 12–237 kS/s 12.8 mW Sparse-Approximation Engine for Mobile Data Aggregation of Compressively Sampled Physiological Signals
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 51, NO. 1, JANUARY 2016, p. 68. A Configurable 12–237 kS/s 12.8 mW Sparse-Approximation Engine for Mobile Data Aggregation of Compressively Sampled Physiological Signals. Fengbo Ren, Member, IEEE, and Dejan Marković, Member, IEEE. Abstract: Compressive sensing (CS) is a promising technology for realizing low-power and cost-effective wireless sensor nodes (WSNs) in pervasive health systems for 24/7 health monitoring. Due to the high computational complexity (CC) of the reconstruction algorithms, software solutions cannot fulfill the energy efficiency needs for real-time processing. In this paper, we present a 12–237 kS/s 12.8 mW sparse-approximation (SA) engine chip that enables the energy-efficient data aggregation of compressively sampled physiological signals on mobile platforms. The SA engine chip integrated in 40 nm CMOS can support the simultaneous reconstruction of over 200 channels of physiological signals while consuming <1% of a smartphone's power budget. Such energy-efficient reconstruction enables two-to-three times energy saving … [Introduction, excerpt] … framework, the CS framework has several intrinsic advantages. First, random encoding is a universal compression method that can effectively apply to all compressible signals regardless of what their sparse domain is. This is a desirable merit for the data fusion across multiple signal sources. Second, sampling and compression can be performed at the same stage in CS, allowing for a sampling rate that is significantly lower than the Nyquist rate. Therefore, CS has a potential to greatly impact the data acquisition devices that are sensitive to cost, energy consumption, and portability, such as wireless sensor nodes (WSNs) in mobile and wearable applications [5]. -
Divide-And-Conquer Algorithms
Chapter 2 Divide-and-conquer algorithms The divide-and-conquer strategy solves a problem by: 1. Breaking it into subproblems that are themselves smaller instances of the same type of problem 2. Recursively solving these subproblems 3. Appropriately combining their answers The real work is done piecemeal, in three different places: in the partitioning of problems into subproblems; at the very tail end of the recursion, when the subproblems are so small that they are solved outright; and in the gluing together of partial answers. These are held together and coordinated by the algorithm's core recursive structure. As an introductory example, we'll see how this technique yields a new algorithm for multiplying numbers, one that is much more efficient than the method we all learned in elementary school! 2.1 Multiplication The mathematician Carl Friedrich Gauss (1777–1855) once noticed that although the product of two complex numbers $(a + bi)(c + di) = ac - bd + (bc + ad)i$ seems to involve four real-number multiplications, it can in fact be done with just three: $ac$, $bd$, and $(a + b)(c + d)$, since $bc + ad = (a + b)(c + d) - ac - bd$. In our big-O way of thinking, reducing the number of multiplications from four to three seems wasted ingenuity. But this modest improvement becomes very significant when applied recursively. Let's move away from complex numbers and see how this helps with regular multiplication. Suppose $x$ and $y$ are two $n$-bit integers, and assume for convenience that $n$ is a power of 2 (the more general case is hardly any different). -
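Gauss's observation in the excerpt above is easy to check numerically. The short Python sketch below is an illustration added here, and the function name `gauss_complex_product` is made up for the purpose. It computes the real and imaginary parts of (a + bi)(c + di) with exactly three real multiplications.

```python
def gauss_complex_product(a, b, c, d):
    """Return (real, imag) of (a + bi)(c + di) using three multiplications."""
    ac = a * c                   # multiplication 1
    bd = b * d                   # multiplication 2
    cross = (a + b) * (c + d)    # multiplication 3: equals ac + ad + bc + bd
    real = ac - bd
    imag = cross - ac - bd       # leaves bc + ad
    return real, imag


# Compare against Python's built-in complex arithmetic.
print(gauss_complex_product(1, 2, 3, 4))  # (-5, 10)
print(complex(1, 2) * complex(3, 4))      # (-5+10j)
```

The trade is one multiplication for three extra additions and subtractions, which only pays off when multiplication is the more expensive operation.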
Algebraic Elimination of Epsilon-Transitions Gérard Duchamp, Hatem Hadj Kacem, Éric Laugerotte
Algebraic Elimination of epsilon-transitions. Gérard Duchamp, Hatem Hadj Kacem, Éric Laugerotte. To cite this version: Gérard Duchamp, Hatem Hadj Kacem, Éric Laugerotte. Algebraic Elimination of epsilon-transitions. 2004. hal-00001038v1. HAL Id: hal-00001038, https://hal.archives-ouvertes.fr/hal-00001038v1. Preprint submitted on 15 Jan 2004 (v1), last revised 13 Dec 2004 (v2). Algebraic elimination of ε-transitions. Gérard Duchamp, Hatem Hadj Kacem, Éric Laugerotte, LIFAR, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, France. Abstract We present here algebraic formulas associating a k-automaton to a k-ε-automaton. The existence depends on the definition of the star of matrices and of elements in the semiring k. For this reason, we present the theorem which allows the transformation of k-ε-automata into k-automata. The two automata have the same behaviour. Keywords: Automata with multiplicities, ε-transitions, behaviour, star of matrices. 1 Introduction Automata with multiplicities (or weighted automata) are a versatile class of transition systems which can modelize as well classical (boolean), stochastic, transducer automata and be applied to various purposes such as image compression, speech recognition, formal linguistic (and automatic treatment of natural languages too) and probabilistic modelling. -
Improved Greedy Algorithms for Sparse Approximation of a Matrix in Terms of Another Matrix
Improved Greedy Algorithms for Sparse Approximation of a Matrix in terms of Another Matrix. Crystal Maung and Haim Schweitzer, Department of Computer Science, The University of Texas at Dallas. Abstract We consider simultaneously approximating all the columns of a data matrix in terms of few selected columns of another matrix that is sometimes called "the dictionary". The challenge is to determine a small subset of the dictionary columns that can be used to obtain an accurate prediction of the entire data matrix. Previously proposed greedy algorithms for this task compare each data column with all dictionary columns, resulting in algorithms that may be too slow when both the data matrix and the dictionary matrix are large. A previously proposed approach for accelerating the run time requires large amounts of memory to keep temporary values during the run of the algorithm. We propose two new algorithms that can be used even when both the data matrix and the dictionary matrix are large. The first algorithm is exact, with output identical to some previously proposed greedy algorithms. It takes significantly less memory when compared to the current state-of-the-art, and runs much faster when the dictionary matrix is sparse. The second algorithm uses a low rank approximation to the data matrix to further improve the run time. The algorithms are based on new recursive formulas for computing the greedy selection criterion. The formulas enable decoupling most of the computations related to the data matrix from the computations related to the dictionary matrix. -
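To make the selection problem in the excerpt above concrete, here is a minimal Python sketch of a naive greedy baseline of the kind the abstract describes as slow, where every data column is compared with every dictionary column at each step. It is not the authors' algorithm and does not use their recursive formulas. The function name `greedy_select`, the list-of-columns matrix layout, and the tolerance `1e-12` are assumptions made for illustration.

```python
def greedy_select(dictionary, data, k):
    """Greedily pick up to k dictionary columns to approximate all data columns.

    Both matrices are given as lists of columns (each column a list of floats).
    Returns the selected column indices and the Frobenius norm of the residual.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    residual = [col[:] for col in data]     # part of the data not yet explained
    atoms = [col[:] for col in dictionary]  # dictionary columns, deflated as we go
    selected = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for j, d in enumerate(atoms):
            norm2 = dot(d, d)
            if j in selected or norm2 < 1e-12:
                continue  # already chosen, or inside the span of chosen columns
            # Drop in squared Frobenius error if column j were added next.
            gain = sum(dot(d, r) ** 2 for r in residual) / norm2
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break  # no remaining column reduces the error
        norm = dot(atoms[best], atoms[best]) ** 0.5
        q = [x / norm for x in atoms[best]]
        # Project the chosen direction out of the residual and of every atom.
        residual = [[x - dot(q, r) * y for x, y in zip(r, q)] for r in residual]
        atoms = [[x - dot(q, d) * y for x, y in zip(d, q)] for d in atoms]
        selected.append(best)
    error = sum(dot(r, r) for r in residual) ** 0.5
    return selected, error


dictionary = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
data = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
print(greedy_select(dictionary, data, 2))  # ([2, 0], 0.0)
```

Each step costs one inner product per pair of data column and dictionary column, which is the all-pairs comparison the abstract identifies as the bottleneck when both matrices are large.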
Privacy Preserving Identification Using Sparse Approximation with Ambiguization
Privacy Preserving Identification Using Sparse Approximation with Ambiguization. Behrooz Razeghi, Slava Voloshynovskiy, Dimche Kostadinov and Olga Taran, Stochastic Information Processing Group, Department of Computer Science, University of Geneva, Switzerland, {behrooz.razeghi, svolos, dimche.kostadinov, olga.taran}@unige.ch. Abstract: In this paper, we consider a privacy preserving encoding framework for identification applications covering biometrics, physical object security and the Internet of Things (IoT). The proposed framework is based on a sparsifying transform, which consists of a trained linear map, an element-wise nonlinearity, and privacy amplification. The sparsifying transform and privacy amplification are not symmetric for the data owner and data user. We demonstrate that the proposed approach is closely related to sparse ternary codes (STC), a recent information-theoretic concept proposed for fast approximate nearest neighbor (ANN) search in high dimensional feature spaces that being machine learning in nature also offers significant benefits in comparison to sparse approximation and binary embedding approaches. We demonstrate that the privacy of the database outsourced to a server as well as the privacy of the data user are preserved at a … [Fig. 1: Block diagram of the proposed model (blocks: Owner, Encoder, Public Storage, Probe, Data User; private and public decoding).] [Introduction, excerpt] … for example biometrics, which being disclosed once, do not represent any more a value for the related security applications. -
Column Subset Selection Via Sparse Approximation of SVD
Column Subset Selection via Sparse Approximation of SVD. A. Çivril (a), M. Magdon-Ismail (b); (a) Meliksah University, Computer Engineering Department, Talas, Kayseri 38280 Turkey; (b) Rensselaer Polytechnic Institute, Computer Science Department, 110 8th Street Troy, NY 12180-3590 USA. Abstract Given a real matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$, and an integer $k < r$, the sum of the outer products of top $k$ singular vectors scaled by the corresponding singular values provide the best rank-$k$ approximation $A_k$ to $A$. When the columns of $A$ have specific meaning, it might be desirable to find good approximations to $A_k$ which use a small number of columns of $A$. This paper provides a simple greedy algorithm for this problem in Frobenius norm, with guarantees on the performance and the number of columns chosen. The algorithm selects $c$ columns from $A$ with $c = \tilde{O}\left(\frac{k \log k}{\epsilon^2}\, \eta^2(A)\right)$ such that $\|A - \Pi_C A\|_F \le (1 + \epsilon)\, \|A - A_k\|_F$, where $C$ is the matrix composed of the $c$ columns, $\Pi_C$ is the matrix projecting the columns of $A$ onto the space spanned by $C$ and $\eta(A)$ is a measure related to the coherence in the normalized columns of $A$. The algorithm is quite intuitive and is obtained by combining a greedy solution to the generalization of the well known sparse approximation problem and an existence result on the possibility of sparse approximation. We provide empirical results on various specially constructed matrices comparing our algorithm with the previous deterministic approaches based on QR factorizations and a recently proposed randomized algorithm.