Distributed Computation of Linear Inverse Problems with Application to Computed Tomography Reconstruction

Yushan Gao, Thomas Blumensath

Abstract—The inversion of linear systems is a fundamental step in many inverse problems. Computational challenges exist when trying to invert large linear systems, where limited computing resources mean that only part of the system can be kept in computer memory at any one time. We are here motivated by tomographic inversion problems that often lead to linear inverse problems. In state of the art x-ray systems, even a standard scan can produce 4 million individual measurements, and the reconstruction of x-ray attenuation profiles typically requires the estimation of a million attenuation coefficients. To deal with the large data sets encountered in real applications and to utilise modern graphics processing unit (GPU) based computing architectures, combinations of iterative reconstruction algorithms and parallel computing schemes are increasingly applied. Although both row and column action methods have been proposed to utilise parallel computing architectures, individual computations in current methods need to know either the entire set of observations or the entire set of estimated x-ray absorptions, which can be prohibitive in many realistic big data applications. We present a fully parallelizable computed tomography (CT) image reconstruction algorithm that works with arbitrary partial subsets of the data and the reconstructed volume. We further develop a non-homogeneously randomised selection criterion which guarantees that sub-matrices of the system matrix are selected more frequently if they are dense, thus maximising information flow through the algorithm. A grouped version of the algorithm is also proposed to further improve convergence speed and performance. Algorithm performance is verified experimentally.

Index Terms—CT image reconstruction, parallel computing, gradient descent, coordinate descent, linear inverse systems.

Manuscript received September 01, 2017. This work was supported by EPSRC grant EP/K029150/1, a University of Southampton PGR scholarship, a Faculty of Engineering and the Environment Lancaster Studentship and the China Scholarship Council. Y. Gao and T. Blumensath are with the Faculty of Engineering and Environment, University of Southampton, Southampton, SO17 1BJ, UK (email: [email protected]; [email protected]).

I. INTRODUCTION

In transmission computed tomography (CT), standard scan trajectories, such as rotation based or helical trajectories, allow the use of efficient analytical reconstruction techniques such as the filtered backprojection algorithm (FBP) [1], [2] and the Feldkamp Davis Kress (FDK) [3], [4] method. However, in low signal to noise settings, if scan angles are under-sampled or if nonstandard trajectories are used, then less efficient, iterative reconstruction methods can provide significantly better reconstructions [5]–[8]. These methods model the x-ray system as a linear system of equations:

y = Ax + e,    (1)

where y, A, x and e are the projection data, the system matrix, the reconstructed image vector and the measurement noise respectively. However, the relatively low computational efficiency of these methods limits their use, especially in many x-ray tomography problems, where y and x can have millions of entries each [9] and where A, even though it is a sparse matrix, can have billions of entries.

For realistic dataset sizes, these methods thus require significant computational resources, especially as the matrix A is seldom kept in memory but is instead re-computed on the fly, which can be done relatively efficiently using the latest Graphical Processor Unit (GPU) based parallel computing platforms. However, as GPUs have limited internal memory, this typically requires the data y and/or the reconstruction volume x to be broken into smaller subsets on which individual computations are performed. The development of efficient algorithms that only work on subsets of the data at any one time is thus becoming of increasing interest [10], [11].

Currently, most of these methods can be divided into two categories: row action methods, which operate on subsets of the observations y at a time, and column action methods, which operate on subsets of the voxels x at a time [12]–[14].

Row action methods: A classical method is the Kaczmarz algorithm (also known as the Algebraic Reconstruction Technique (ART)), which together with its block based variants is widely used in tomographic image reconstruction [7], [15]–[17]. Increased convergence rates are obtained with block Kaczmarz methods, which are also known as the Simultaneous Algebraic Reconstruction Technique (SART) [11], [18]. By simultaneously selecting several rows of projection data, SART can also be converted into efficient parallel strategies [19]. Apart from the Kaczmarz method, another classical row action method is the simultaneous iterative reconstruction technique (SIRT). The SIRT method updates each reconstructed image voxel by combining all projection values whose corresponding x-rays pass through this voxel. In this way, the convergence rate increases significantly, but so does the computational load [20], [21]. To lower the computational load within one iteration, the block form of SIRT is widely used in distributed computing systems [11], [22], [23], yielding results of superior quality at the cost of an increased reconstruction time. Since in CT systems the system matrix is often large and sparse, component averaging (CAV) and its block form (BICAV) have been developed to utilise this sparsity property [24]–[26]. Furthermore, BICAV uses an additional coefficient matrix based on the number of non-zero elements in each column within a row block of the system matrix, which significantly increases the convergence rate compared to the original CAV method.

Column action methods: Column action methods are also called iterative coordinate descent (ICD) methods. They reduce the N-dimensional optimization problem to a sequence of one-dimensional problems, which is shown to have a faster convergence rate in the margins of the reconstructed image [27]. A non-homogeneous (NH-ICD) update strategy is proposed in [28], [29] to increase the convergence rate. The strategy first generates a pixel/voxel selection criterion and then, based on this criterion, voxels which are furthest from convergence are selected to be updated. To increase scalability and parallelism, a block form of ICD (B-ICD) is proposed in [30]. The volume object is sliced along the helical rotation axis and within one slice several neighbouring voxels are grouped together as a block. After all blocks within one slice have been updated, the slice index steps along the axial direction to the next slice and the same iteration is repeated. Experiments show that a bigger block size helps to increase the convergence rate, but at the cost of a heavier computational load. To reduce the computational complexity of B-ICD, the ABCD algorithm, a derivative of the B-ICD method, combines the pixel line feature [28], [29] and the block form [30] to update a pixel line simultaneously, leading to a faster convergence rate. Inspired by NH-ICD, the NH idea has also been applied to the ABCD method by more frequently updating the axial blocks which change by the largest amount during an iteration [31]. However, this block method is only suitable for standard CT scanners with circular or helical scanning trajectories and may not be appropriate for arbitrary scanning trajectories.

For linear systems of equations y = Ax, [32] presents a concise summary of both row action and column action algorithms, which are shown in Algo. 1 and Algo. 2.

Algorithm 1 Generic row action iteration
  Initialization: x^0 ∈ R^n is arbitrary. Both the matrix A and the projection data y are divided into p row blocks. A_i and y_i are the corresponding i-th row blocks. x^k is the estimate of x in the k-th iteration. ω_i and M_i are relaxation parameters and coefficient matrices respectively.
  for k = 0, 1, 2, ... (epochs or outer iterations) do
    z^0 = x^k
    for i = 1, 2, ..., p (inner iterations) do
      z^i = z^{i-1} + ω_i A_i^T M_i (y_i − A_i z^{i−1})
    end for
    x^{k+1} = z^p
  end for

Algorithm 2 Generic column action iteration
  Initialization: x^0 ∈ R^n is arbitrary. Both the matrix A and the vector x are divided into q column blocks. A^j and x_j are the corresponding j-th column blocks. r^{0,1} = y − Ax^0. ω_j and M_j are relaxation parameters and coefficient matrices respectively.
  for k = 0, 1, 2, ... (epochs or outer iterations) do
    for j = 1, 2, ..., q (inner iterations) do
      x_j^{k+1} = x_j^k + ω_j M_j (A^j)^T r^{k,j}
      r^{k,j+1} = r^{k,j} − A^j (x_j^{k+1} − x_j^k)
    end for
    r^{k+1,1} = r^{k,q+1}
  end for

These two generic algorithms can be applied in parallel by parallelising the inner loop and summing or averaging the updates of z^i or r^{i,j} respectively.
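To make the two generic iterations concrete, the following is a minimal NumPy sketch of one epoch of each, under the simplifying assumption that the coefficient matrices M_i and M_j are identities; the function and variable names are ours, not from [32].

import numpy as np

def row_action_epoch(A_blocks, y_blocks, x, omega=1.0):
    # One epoch of Algo. 1 with M_i = I: sweep over the row blocks,
    # updating the full estimate z after each block.
    z = x.copy()
    for A_i, y_i in zip(A_blocks, y_blocks):
        z = z + omega * A_i.T @ (y_i - A_i @ z)
    return z

def column_action_epoch(col_blocks, A_cols, x, r, omega=1.0):
    # One epoch of Algo. 2 with M_j = I: sweep over the column blocks,
    # keeping the residual r = y - Ax consistent after each update.
    for J, A_j in zip(col_blocks, A_cols):
        dx = omega * A_j.T @ r
        x[J] += dx
        r -= A_j @ dx
    return x, r

Note that the row action sweep touches the whole of x (and z) in every inner iteration, whereas the column action sweep touches the whole of r; this is precisely the memory bottleneck discussed next.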
However, if the algorithm is applied in a parallel together as a block. After all blocks within one slice are form, then each processor (or node) needs to store the whole updated, the slice index steps along with the axial direction to reconstructed image vector x as each update is performed on the next slice and repeats the same iteration. Experiments show the entire image, which can be computationally challenging in that a bigger block size is of help to increase the convergence large data reconstructions, especially in terms of the required T rate, but at the cost of heavier computation amount. To reduce forward projection Aix and back projection Ai ri [24]. the computation complexity brought by B-ICD, the ABCD On the other hand, using column action algorithms in a algorithm, which is a derivative of B-ICD method, combines parallel computing scheme does not require each processor to the pixel line feature [28], [29] and the block form [30] store the whole reconstructed image but only to store a small to update the pixel line simultaneously, leading to a faster part of it. However, they instead require access to all of y or convergence rate. Inspired by NH-ICD, the NH idea is also r, which again can be prohibitive in large scale situations. applied to the ABCD method by more frequently updating Thus, row action and column action methods require access the axial blocks which change by the largest amount during to the entire vectors y (or r) or x in each sub-iteration. There an iteration [31]. However, this block method is only suitable are few exceptions to this. One exception is the work in for standard CT scanners with circular or helical scanning [33], where a row action method is discussed in which each trajectories and may not be appropriate for arbitrary scanning node only requires parts of the reconstructed image vector x. trajectories. By splitting the image volume along planes perpendicular to To optimize systems of linear equation y = Ax, [32] the rotation axis, this variant of SIRT calculates the area of presents a concise summary of both row action and column overlap on the detector between two adjacent sub-volumes. action algorithm, which are show in Algo.1 and Algo.2. Only the data within the overlapping areas on the detector are exchanged between computation nodes. However, this method Algorithm 1 Generic row action iteration only works for circular scan trajectories and its scalability is Initialization: x0 ∈ Rn is arbitrary. Both matrix A and limited due to the requirement that the overlap area should be projection data y are divided into p row blocks. Ai and yi small to minimise data communication overheads. are corresponding ith row blocks. xk is the estimate or x Our goal here is to develop an algorithm working with th in the k iteration. ωi and Mi are relaxation parameters generic x-ray tomographic scanning trajectories and with large, and coefficient matrices respectively. realistic data sizes. In this paper, we thus develop a novel for k = 0, 1, 2, ...(epochs or outer iterations) do algorithm called coordinate-reduced steepest gradient descent z0 = xk (CSGD) that combines aspects of row action and column for i = 1, 2, ..., p (inner iterations) do action methods. Unlike traditional algorithms, our paralleliz- i i−1 T i−1 z = z + ωiAi Mi(yi − Aiz ) able algorithm operates on arbitrary subsets of both x and end for y at any one time. There are thus no longer any restrictions xk+1 = zp on scan trajectories or on the way in which the volume is end for decomposed. 
The algorithm is thus applicable to arbitrary scan trajectories and is scalable so that it can be run on a range These two generic algorithms can be applied in parallel by of computing platforms, including low memory GPU clusters parallelising the inner loop and summing or averaging the and high performance CPU based clusters. In our algorithm, updates of zi of ri,j respectively. Each algorithm has their the reconstruction volume is divided into several sub-volumes, own advantages and drawbacks. For row action method, each which can be updated separately at different computing nodes. sub-iteration does not require the calculation or storage of the This update is based on a subset of the observations, as well entire matrix A in advance but only needs to calculate a row as the reconstructed volume, so that only a small sub-matrix block Ai at a time. This can be an advantage in large 3D of the system matrix A is used in each step. reconstruction problems where the storage of the whole matrix The rest of this paper is organised as follows. The first 3 section gives an overview over our new block iterative method Ideally, we would like to compute the step length µ such and the second section describes the proposed algorithm in that f(xn+1) is minimised: more details. Some simulation results are illustrated in section n+1 T III and conclusions and further discussions are presented in ∇f(x ) g = 0, (10) the last section. n+1 T n+1 where ∇f(x ) = (AI ) (AI x − yI ). Using the fact J that AI g = AI gJ , the steepest step length is thus calculated II.DESCRIPTIONOF CSGD as: g T (AJ )T r g T g A. Notation µ = J I I = J J . (11) g T (AJ )T AJ g g T (AJ )T AJ g In this paper we assume a linear, i.e. monochromatic x-ray J I I J J I I J model [34]–[36]: With this step length, each iteration only requires the informa- J y = Ax + e. (2) tion of AI , which significantly reduces the calculation amount when the matrix is generated on the fly with Siddon’s method. For now, we use a simple least square cost function to solve Although the calculation of the step size µ still requires access this : to the error rI = yI − AI x, the calculation of rI can be f(x) = (y − Ax)T (y − Ax), (3) replaced by an update process [38], [39]. Since the update within one iteration only changes xJ , we have: where A is the system matrix derived using, for example, n+1 n J n J n+1 j AI x = AI x − A x + A x , (12) Siddon’s method [37]. The element Ai is the intersection I J I J length of the ith x-ray beam with the jth reconstruction voxel. where xn is the result of the nth iteration. So the update on In this paper, I will be an index set that indexes rows in A (or J rI can be written as: the subset of x-ray measurements y) I = i1, i2, .... Similarly, n+1 n J n n+1 n J J will be a set of column indexes of A (or the index set of a rI = rI + AI (xJ − xJ ) = rI − µAI (gJ ). (13) Jj subset of voxels x) J = j1, j2, .... Thus the matrix AI will i We can see that this update only requires computation and be a sub-matrix of A with row indexes Ii and column indexes storage of AJ xn+1. Jj. Thus we can divide the linear system into several blocks: I J An important issue is that the step size derived in (11)  y   AJ1 AJ2 ... AJn  x   A  I1 I1 I1 I1 J1 I1 minimizes krI k instead of krk and that the step length  ...  ≈  ...   ...  ≡  ...  x. is always positive. However, we are not interested in the J1 J2 Jn yIm A A ... A xJn AI Im Im Im m reduction of rI but in the reduction of r. 
B. Derivation of the algorithm

The key idea in CSGD is to minimize the partial residual r_I using only the partial coordinates x_J. In one iteration, after selecting row and column index sets I and J, the objective function becomes:

f(x) = r_I^T r_I = (y_I − A_I x)^T (y_I − A_I x).    (5)

The steepest descent direction g of f(x) is then:

g = −∇f(x) = A_I^T (y_I − A_I x) = A_I^T r_I.    (6)

Assume that only those voxels whose indices are in the set J are updated; the new descent direction then becomes:

$$g = \begin{bmatrix} g_J \\ 0 \end{bmatrix} = \begin{bmatrix} (A_I^J)^T r_I \\ 0 \end{bmatrix} \quad (7)$$

and the update on the selected voxels becomes:

x_J^{n+1} = x_J^n + µ g_J,    (8)
x_Ĵ^{n+1} = x_Ĵ^n,    (9)

where µ is the gradient step length and Ĵ is the complement of the set J.

Ideally, we would like to compute the step length µ such that f(x^{n+1}) is minimised:

∇f(x^{n+1})^T g = 0,    (10)

where ∇f(x^{n+1}) = A_I^T (A_I x^{n+1} − y_I). Using the fact that A_I g = A_I^J g_J, the steepest step length is thus calculated as:

$$\mu = \frac{g_J^T (A_I^J)^T r_I}{g_J^T (A_I^J)^T A_I^J g_J} = \frac{g_J^T g_J}{g_J^T (A_I^J)^T A_I^J g_J}. \quad (11)$$

With this step length, each iteration only requires the information in A_I^J, which significantly reduces the calculation amount when the matrix is generated on the fly with Siddon's method. Although the calculation of the step size µ still requires access to the error r_I = y_I − A_I x, the calculation of r_I can be replaced by an update process [38], [39]. Since the update within one iteration only changes x_J, we have:

A_I x^{n+1} = A_I x^n − A_I^J x_J^n + A_I^J x_J^{n+1},    (12)

where x^n is the result of the n-th iteration. The update on r_I can thus be written as:

r_I^{n+1} = r_I^n + A_I^J (x_J^n − x_J^{n+1}) = r_I^n − µ A_I^J g_J.    (13)

We can see that this update only requires the computation and storage of A_I^J x_J^{n+1}.

An important issue is that the step size derived in (11) minimizes ‖r_I‖ instead of ‖r‖ and that the step length is always positive. However, we are not interested in the reduction of r_I but in the reduction of r. Our choice of µ can thus potentially be too large to reduce r. Furthermore, it is not guaranteed that our update direction g is always a descent direction. To stabilise our algorithm, we thus introduce an additional relaxation parameter β into the calculation of µ. This helps us to avoid overshooting the minimum if g is a descent direction, whilst in cases in which g is not a descent direction, the increase in r remains small. The pseudo-code of the basic computation block is shown in Algo. 3.

Algorithm 3 The algorithm for a basic iteration
  Initialization: select the system matrix row index set I and column index set J
  g_J = (A_I^J)^T r_I
  µ = β (g_J^T g_J) / (g_J^T (A_I^J)^T A_I^J g_J)
  x_J = x_J + µ g_J
  z_I^J = A_I^J x_J
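The basic block in Algo. 3 amounts to a few matrix-vector products with the sub-matrix A_I^J. A minimal NumPy sketch, with names of our own choosing:

import numpy as np

def csgd_block_update(A_IJ, r_I, x_J, beta):
    # Algo. 3: one relaxed steepest descent step on the voxel subset J,
    # using only the observation subset I and the sub-matrix A_I^J.
    g_J = A_IJ.T @ r_I                    # eq. (7)
    Ag = A_IJ @ g_J
    denom = Ag @ Ag                       # g_J^T (A_I^J)^T A_I^J g_J
    if denom > 0:
        x_J = x_J + beta * (g_J @ g_J) / denom * g_J   # eqs. (8), (11)
    return x_J, A_IJ @ x_J                # updated x_J and z_I^J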
It is straightforward to see that these computations can be performed in parallel over J, since the parameters in the update on each J are independent of each other. Further analysis and improvements show that the parallel computation can even be performed over subsets of both J and I.

The main trick in parallelizing the above code is to estimate the error r. Ideally, after the parallel computations have updated subsections of the volume x, we would need to compute a new error vector r, which is required in the computation of subsequent gradient directions. However, this would require the computation of r_I^n + A_I^J (x_J^n − x_J^{n+1}). We instead use a scheme that approximates r. This is done by calculating the vectors z_I^{J,n+1} = A_I^J x_J^{n+1}.

Algo. 4 shows a fully parallel computing scheme over both row and column blocks. Different sub-matrix blocks of the matrix A and sub-volume slices x_J are assigned to computing nodes simultaneously, and the master node communicates with all computing nodes by averaging over the individual updates of the subsets x_J. It is worth noting at this point that these algorithms do not require us to use all subsets I and J in each iteration; they also work if we randomly choose new subsets of these sets in each iteration.

Algorithm 4 CSGD method
  Initialization: Partition the row and column indices into sets {I_i} (i ∈ [1, m]) and {J_j} (j ∈ [1, n]), x^0 = 0, z_I^j = 0 for all blocks I_i, J_j and r = y.
  for epoch = 1, 2, ... do
    x̂ = 0
    for k = 1, 2, ..., n do
      select J_k as index J
      for i = 1, 2, ..., m do
        select I_i as index I
        g_J = (A_I^J)^T r_I
        µ = β (g_J^T g_J) / (g_J^T (A_I^J)^T A_I^J g_J)
        x̂_J = x̂_J + x_J + µ g_J
        z_I^j = A_I^J (x_J + µ g_J)
      end for
      r = y − Σ_j z^j
    end for
    for all blocks J that have been updated, x_J = x̂_J / (number of times block J has been updated)
  end for

Algorithm 5 CSGD method with importance sampling
  Initialization: Partition the row and column indices into sets {I_i} (i ∈ [1, m]) and {J_j} (j ∈ [1, n]), x^0 = 0, z_I^j = 0 for all row blocks I_i and column blocks J_j, and r = y. γ is the percentage of selected volume blocks among the total volume blocks. α is the percentage of selected row blocks among the total row blocks. The relaxation parameter β is defined as b ∗ P_S/P_T, where P_S is the projection area of the selected sub-volume on the selected detector sub-area in the current iteration and P_T is the total projection area of the selected volume block on the whole detector plane over the whole scanning trajectory. For each pair of indices from {I_i} and {J_j}, compute the probability P(I, J) = P_S/P_T between the projection of the volume x_J and the subset of all detector pixels indexed by I.
  for epoch = 1, 2, ... do
    x̂ = 0
    for k = 1, 2, ..., γn do
      select a column block J from {J_j} randomly
      for l = 1, 2, ..., αm do
        select a row block I from {I_i} without replacement with probability P(I, J)
        g_J = (A_I^J)^T r_I
        µ = β (g_J^T g_J) / (g_J^T (A_I^J)^T A_I^J g_J)
        x̂_J = x̂_J + x_J + µ g_J
        z_I^j = A_I^J (x_J + µ g_J)
      end for
      r = y − Σ_j z^j
    end for
    for all blocks J that have been updated, x_J = x̂_J / (number of times block J has been updated)
  end for
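The following is a serial NumPy sketch of one Algo. 4 epoch; the inner loops would run on separate nodes in a distributed implementation, and the helper names are ours. The bookkeeping of z and the final averaging follow the pseudo-code.

import numpy as np

def csgd_epoch(A, y, x, row_blocks, col_blocks, beta):
    # One epoch of Algo. 4, emulated serially. z[j] holds the current
    # projection A^J x_J of column block j, so the residual can be
    # re-assembled as r = y - sum_j z^j without a full re-projection.
    z = np.stack([A[:, J] @ x[J] for J in col_blocks])
    r = y - z.sum(axis=0)
    x_hat = [np.zeros(len(J)) for J in col_blocks]
    counts = [0] * len(col_blocks)
    for j, J in enumerate(col_blocks):
        for I in row_blocks:
            A_IJ = A[np.ix_(I, J)]
            g = A_IJ.T @ r[I]
            Ag = A_IJ @ g
            mu = beta * (g @ g) / (Ag @ Ag) if (Ag @ Ag) > 0 else 0.0
            x_hat[j] += x[J] + mu * g
            counts[j] += 1
            z[j, I] = A_IJ @ (x[J] + mu * g)
        r = y - z.sum(axis=0)
    for j, J in enumerate(col_blocks):
        if counts[j]:
            x[J] = x_hat[j] / counts[j]   # master-node averaging
    return x

In Algo. 4 every (I, J) pair is visited; Algo. 5 is then obtained by drawing only γn column blocks and, per block, αm row blocks according to the probabilities P(I, J).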

C. The importance sampling strategy

Based on the proposed algorithm, it is straightforward to develop a random sampling strategy that goes through all projection views and all detector sub-areas arbitrarily. Looking at Algo. 4, we could randomly select subsets of {I_i} and {J_j}, with each set being chosen with equal probability. Considering the sparsity of A, and taking inspiration from the randomized Kaczmarz method in [40], we instead develop an importance sampling strategy. Sets in {I_i} and {J_j} are selected with a probability that is proportional to the density of the sub-matrix A_I^J. To estimate this density without the need to construct the full matrix A, we instead compute the overlap between the detector area and the projection of the voxels labelled by J. Using the importance sampling strategy for I and iterating over partial J, Algo. 4 becomes Algo. 5. It should be noted that although in Algo. 4 and 5 the update on r is performed as a whole, in effect only elements which have been chosen in the latest iteration have changed. As a result, we can also take the union of all selected sets I and call this set I_t. The update of r then becomes r_{I_t} = y_{I_t} − Σ_j z_{I_t}^j.

D. Domain decompositions in tomography

The above algorithms have been designed so that each computation is carried out on a generic subset I (i.e. a subset of the observations) and a generic subset J (i.e. a subset of the voxels) at any one time. For tomographic reconstruction, the question thus arises how to partition the observations and the reconstruction volume. Whilst generic partitions are possible, given the need to compute P(I, J), it makes sense to partition the reconstruction volume and the detector areas into blocks. We use a 3D cone beam CT geometry for the demonstration; similar arguments can be made for a parallel beam setup. The reconstruction volume and one pair of source/detector locations are shown in Fig. 1a. For simplicity, the detector plane is always perpendicular to the line connecting the point source and the geometric centre of the detector plane. We label each location of the point source S and of the detector with a parameter θ; the trajectories do not have to be circular or helical. For each source/detector location θ, we partition the detector into blocks. We also partition the reconstruction volume into rectangular cuboids, as shown in Fig. 1b.

[Fig. 1: (a) the geometry of the 3D scanning model; (b) a partition method for both the reconstruction volume and the detector area.]

For such a partition, given any one block of the reconstruction volume J, for each source location θ there might only be a part of the detector area that is involved in the update in CSGD. Thus, for a given J, the sub-matrices A_I^J can have different levels of sparsity for different I. An illustration of this is shown in Fig. 2: in this fixed view, sub-area 4 on the detector does not receive rays passing through the selected sub-volume, so there is no need to select this area to update x_J. What is more, the bulk of the volume block projection lies mainly in sub-areas 2 and 3 of the detector, so that the corresponding A_{I_sub2} and A_{I_sub3} are much denser than A_{I_sub1}. This suggests that the algorithm should select sub-areas 2 and 3 more frequently than sub-area 1, and this is here achieved with our importance sampling strategy.

[Fig. 2: (a) a 3D model of the cone beam setup with one block of the volume being projected onto the detector plane; (b) the projection area of the volume block, which is unevenly distributed over the detector sub-areas.]
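Under this decomposition, the selection probabilities can be precomputed from the projection areas alone. A small sketch, assuming the areas P_S of block J on each detector sub-area have already been computed; the Fig. 2 situation appears as a zero entry, which is then never drawn:

import numpy as np

def sample_row_blocks(proj_areas, alpha, rng):
    # Importance sampling of Algo. 5: draw ceil(alpha*m) of the m row
    # blocks without replacement, with probability P(I, J) = P_S / P_T.
    p = np.asarray(proj_areas, dtype=float)
    p = p / p.sum()
    n_draw = max(1, int(np.ceil(alpha * len(p))))
    n_draw = min(n_draw, np.count_nonzero(p))  # zero-area blocks are never drawn
    return rng.choice(len(p), size=n_draw, replace=False, p=p)

rng = np.random.default_rng(0)
# e.g. sub-areas 1-4 of Fig. 2: sub-area 4 receives no rays from block J
picked = sample_row_blocks([2.0, 12.0, 9.0, 0.0], alpha=0.5, rng=rng)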

E. Group CSGD

The advantage of having dense sub-matrices leads to the following group version of our algorithm. The idea here is to dynamically build larger, dense sub-matrices A_I^J out of a larger selection of smaller sub-matrices. In the previous CSGD method, one sub-block of the image is combined with only one sub-detector area from one projection view. In GCSGD, the sub-block is combined with a group of sub-areas from several projection views. It is straightforward to see that the GCSGD method uses more row information than the CSGD method. Let us demonstrate the idea by modifying Algo. 5 into Algo. 6.

Algorithm 6 GCSGD method with importance sampling
  Initialization: The group size is set as s, while the other initialization terms are the same as in Algo. 5.
  for epoch = 1, 2, ... do
    x̂ = 0
    for k = 1, 2, ..., γn do
      select a column block J from {J_j} randomly
      l = 0
      while l < αm do
        I_g = ∅
        p_in = 0
        while p_in < s do
          p_in = p_in + 1
          l = l + 1
          select a row block I from {I_i} without replacement with probability proportional to P(I, J)
          I_g = I_g ∪ I
        end while
        g_J = (A_{I_g}^J)^T r_{I_g}
        µ = β (g_J^T g_J) / (g_J^T (A_{I_g}^J)^T A_{I_g}^J g_J)
        x̂_J = x̂_J + x_J + µ g_J
        z_{I_g}^j = A_{I_g}^J (x_J + µ g_J)
      end while
      r = y − Σ_j z^j
    end for
    for all blocks J that have been updated, x_J = x̂_J / (number of times block J has been updated)
  end for

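The only new ingredient compared with Algo. 5 is the construction of the grouped index set I_g; a sketch of the grouped inner step (names again ours):

import numpy as np

def gcsgd_group_step(A, r, x, J, row_blocks, probs, s, beta, rng):
    # Algo. 6 inner step: draw s row blocks with probability
    # proportional to P(I, J), stack them into one grouped index set
    # I_g, and perform a single Algo. 3 update with the denser A_{I_g}^J.
    p = np.asarray(probs, dtype=float) / np.sum(probs)
    n_draw = min(s, np.count_nonzero(p))
    group = rng.choice(len(p), size=n_draw, replace=False, p=p)
    I_g = np.concatenate([row_blocks[i] for i in group])
    A_gJ = A[np.ix_(I_g, J)]
    g = A_gJ.T @ r[I_g]
    Ag = A_gJ @ g
    mu = beta * (g @ g) / (Ag @ Ag) if (Ag @ Ag) > 0 else 0.0
    x_J_new = x[J] + mu * g
    return x_J_new, I_g, A_gJ @ x_J_new   # x_J + µ g_J, I_g and z_{I_g}^j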
F. Computational complexity

There are several important aspects to consider when comparing the computational efficiency of the methods. The methods are designed to allow parallel computation. We envisage this to be performed in a distributed network of computing nodes¹. Most of these nodes will be used to perform the parallel computations. They produce two outputs,
1) x̂_J(i) = x_J^n + µ g_J
2) z_I^j = A_I^J x_J.
These are then either sent to larger, but slower, storage or directly to other nodes, where they are eventually used to compute
1) x_J = mean_i(x̂_J(i))
2) r = y − Σ_j z^j or r_{I_t} = y_{I_t} − Σ_j z_{I_t}^j,
which can be performed efficiently using message passing interface reduction methods.

The three main points that affect the performance of the method are:
1) Computational complexity in terms of multiply-add operations.
2) Data transfer requirements between data storage and a processing unit, as well as between different processing units.
3) Data storage requirements, both in terms of fast access RAM and slower access (e.g. disk based) data storage.

Each of these costs is dominated by different aspects:
1) Computational complexity is dominated by the computation of matrix vector products involving A_I^J and its transpose, especially as A is not generally stored but might have to be re-computed every time it is needed. The computational complexity is thus O(|I| ∗ |J|), though computations performed on highly parallel architectures, such as modern GPUs, are able to perform millions of these computations in parallel.
2) Data transfer requirements are dominated by the need for each of the parallel computing nodes to receive r_I and x_J as input and to return x̂_J(i) and z_I^j as output. Note that the sizes of the required input and output vectors are the same; the data transfer requirement is thus O(|I| + |J|).
3) Central data storage requirements are dominated by the need to store the original data and the current estimate of x. We also need to compute and store averages over x̂_J(i) and z_I^j. These computations can be performed efficiently using parallel data reduction techniques. Our approach would mean that each node would thus require O(|I| + |J|) local memory.

¹A serial version running on a single computing node, where each computation is done independently but one after the other, is also possible; this is how many of the simulations reported here were computed.

III. SIMULATIONS

Before introducing the simulation results, we first introduce three important parameters:
α: the percentage of selected row blocks (detector areas) for one column block (volume block).
β: the relaxation parameter tuning the update on x_J. It is expressed as β = b ∗ P_S/P_T, as shown in Algo. 5. In the following simulations we mainly change the value of b to tune the relaxation parameter β.
γ: the percentage of selected column blocks (volume blocks) during one iteration. In our simulations, γ is usually set to 1, i.e. all column blocks are selected during one iteration.

In the simulations, we used two evaluation criteria to assess the reconstruction quality: the signal to noise ratio (SNR), 20 log10(‖x_true‖/‖x_true − x_est‖), and the observation gap, 20 log10(‖y‖/‖y − A x_est‖), where y is the projection data and x_true and x_est are the true phantom image vector and the reconstructed image vector respectively.
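Both criteria are simple norm ratios; in NumPy:

import numpy as np

def snr_db(x_true, x_est):
    # SNR = 20 log10(||x_true|| / ||x_true - x_est||) in dB
    return 20 * np.log10(np.linalg.norm(x_true) / np.linalg.norm(x_true - x_est))

def observation_gap_db(y, A, x_est):
    # Observation gap = 20 log10(||y|| / ||y - A x_est||) in dB
    return 20 * np.log10(np.linalg.norm(y) / np.linalg.norm(y - A @ x_est))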
We explored the performance of Algo. 6 on a range of tomographic reconstruction problems. We started with a simulated 2D phantom with 64∗64 pixels. Pixel sizes were normalised to 1. The point source followed a circular trajectory with a radius of 115 and the rotation centre was located at the centre of the object. The linear detector had 187 pixels with a spacing of 1 and was always perpendicular to the line connecting the point source and the geometric centre of the linear detector. The detector centre also followed a circular trajectory with the same radius as the source. The number of projections was 360, with angular intervals of 1°. The object was partitioned into 4 parts (2 parts in both the vertical and horizontal directions) and the detector area was partitioned into 2 parts by default. The scan geometry, including the partition method, as well as the original phantom, are shown in Fig. 3. We used the parallel computing toolbox (version 6.8) in Matlab R2016a to perform CSGD as described in Algo. 5 and GCSGD as in Algo. 6. In our simulations, the term epoch refers to the outer iteration. The number of sub-matrices A_I^J that the algorithm used per epoch was proportional to the parameter α, where α = 1 means that we used all sub-matrices of A, whilst α = 0.5 means that only half of the sub-matrices were used. As the computational load is dominated by matrix vector products, when we compared the difference in convergence rate for methods using different values of α (for example, α = 1 and α = 0.5), we took account of the reduction in computational effort when using a smaller α by scaling the epoch count, multiplying it by α; we call this measure the effective epoch.

[Fig. 3: Basic simulation settings. (a) the scanning geometry and partition model in the 2D simulation; the object is partitioned into blocks J1 to J4 and the detector into two parts I_sub1 and I_sub2, with the 187 detector pixels assigned to these two sub-areas without overlap. (b) the original image to be scanned and reconstructed.]

As mentioned before, the CSGD and GCSGD methods can be applied with both the importance sampling and the random sampling strategy. In our simulations, unless particularly mentioned, the sampling strategy is always the importance sampling one.

We ran the CSGD and GCSGD algorithms for over one thousand iterations to show their average convergence and stability. However, in realistic applications we are typically only interested in the initial iterations, for example the convergence performance within 50 effective epochs. This is because in real applications, due to the large data sets that have to be processed and due to the influence of noise and model mismatch, performing more iterations is typically not feasible and is often not helpful in obtaining better reconstruction results.

A. Influence of β

We started with an evaluation of the optimal choice of β, which is crucial for the performance of the method. A simulation was conducted to show the difference in convergence rate when changing b. Results are shown in Fig. 4. It can be seen that from b = 70 to b = 160, a larger β increases the convergence rate. However, further increasing the value of b to 190 leads to divergence of the algorithm, which suggests that there is a range for β within which convergence is guaranteed. Currently, however, the justification for our choice of β remains empirical and more analysis of the algorithm is required to fully understand the optimal parameter.

B. Choice of α

When we set α = 1, the importance sampling strategy cannot show any advantage since we always use all data.

[Fig. 4: Different β lead to different convergence rates when reconstructing the 2D image (SNR and observation gap over 2·10^4 epochs for b = 70, 100, 130, 160 and 190). α and the group size are both set to 1, and all 4 volume sub-blocks are selected in each epoch.]

To demonstrate the difference, we set the group size to 1, b = 100, and set α to 0.8, 0.5 and 0.2 respectively. Fig. 5 shows that setting α smaller than 1 increases the convergence rate of CSGD. Furthermore, different values of α always lead to the same precision.

[Fig. 5: When b = 100, the group size is 1 and all 4 volume blocks are selected in each epoch, setting α smaller than 1 still achieves a high precision level and helps to increase the convergence rate.]

Considering that the smaller α is, the more actual epochs are required to achieve the same number of effective epochs, and that MATLAB is not efficient in performing loops, in the following simulations we chose the moderate parameter α = 0.5 to test the other qualities of CSGD and GCSGD. For example, we tested the influence of the group size of GCSGD. Simulation results are shown in Fig. 6.

It can be seen that GCSGD with the importance sampling strategy converges to a high precision level regardless of the group size. Generally speaking, setting the group size as large as possible helps to increase the initial convergence rate. This is reasonable, since more row information is used in each epoch when the group size is enlarged, making the iteration more likely to move in the global downhill direction. Furthermore, we empirically observed that a larger group size requires a smaller parameter b to maintain convergence. This is because when the group size increases, the ratio P_S/P_T also increases and thus a smaller b is required; otherwise it is easy for the step to overshoot, which makes the algorithm fail to converge. In practice, however, for parallel computing architectures, we suggest that the group size should be chosen depending on the storage capability of each computing node. To show the differences when using different group sizes and different values of α, some reconstruction results are presented in Fig. 7.

[Fig. 6: Different group sizes lead to different precision levels and different convergence rates. The parameter b is tuned to ensure that GCSGD converges for each group size: for group sizes 1, 5 and 100 the corresponding b is 100, 25 and 2 respectively. Some flat areas appear for group size 100; the reason will be explained later. The observation gap shows a similar trend and is not presented here due to paper length limitations.]

C. Different partitions

The above simulations show the effectiveness of the CSGD and GCSGD methods with importance sampling. In the following simulations, we explored the influence of different partition methods. The reconstructed image was still divided into 4 square blocks while the detector area was partitioned into 2, 4, 8 and 16 sets respectively. We compared the difference between group sizes of 1 and 100 to further verify that a larger group helps to increase the convergence rate. The simulation results are shown in Fig. 8.

[Fig. 7: Simulation results after 20 effective epochs. (a) group size=1, α=1, b=100, SNR=3.44dB; (b) group size=5, α=1, b=25, SNR=6.03dB; (c) group size=100, α=1, b=2, SNR=7.75dB; (d) group size=1, α=0.5, b=100, SNR=5.43dB; (e) group size=5, α=0.5, b=25, SNR=11.42dB; (f) group size=100, α=0.5, b=2, SNR=23.76dB. When α is 0.5, the reconstructed images are blurred by artefacts; the reasons and the solutions are discussed later.]

When the group size is only one, we can see that dividing the detector into 8 or 16 areas leads to rather slow convergence rates. This suggests that when the group size is small, the detector cannot be divided into many sub-areas. For a group size of 100, however, the algorithm allows for a division of the detector into more areas and the final precision level is not affected. In the initial iterations, we can see that different detector partitions have nearly the same convergence rate with the GCSGD method. When the precision level is high, the flat areas appear again; the reason will be explained later. In fact, the flat area is of less interest to us, since it only appears when the reconstructed image already has a very high precision level and never appears in the initial iteration stage. As a result, this simulation suggests that, within a range, reducing the row information (partitioning the detector area into 4 or even more parts instead of 2 when the group size is fixed) does not influence the initial convergence rate when using GCSGD, especially when the group size is large. It is straightforward to see that the calculation amount for A_I^J for a selected sub-volume and sub-detector area decreases when the number of partitions on the detector increases. As a result, when addressing the partition strategy for large scale tomography problems, a more flexible partition method is possible in the large group size situation. If big group sizes pose a heavy storage burden for worker nodes, a finer partition method helps to reduce the computation amount for A_I^J. In other words, although dividing the detector area into more sub-areas cannot provide a higher precision level, it helps to reduce the computation load when only a few iterations are performed.

[Fig. 8: Convergence results for different detector partition numbers. (a) group size 1, with b set to 100, 200, 400 and 800 respectively for detector areas 2 to 16; (b) group size 100, with b set to 2, 4, 8 and 16 respectively for detector areas 2 to 16. A larger group size helps to increase the convergence rate regardless of the appearance of flat areas.]

Another variable is the number of volume blocks. To investigate this, we partitioned the 2D image into 4, 16, 64 and 256 areas respectively (dividing each dimension into 2, 4, 8 and 16 parts evenly). The detector area was always partitioned into 2 sub-areas, as shown in Fig. 3. The group size was set to 100 and the corresponding b was 2, 1, 0.5 and 0.25 respectively to ensure convergence. The simulation results are shown in Fig. 9. It can be seen that all partition methods show a similar final precision level, and in the initial iterations the differences between different volume block numbers are also negligible, demonstrating the good scalability of this algorithm. Since a finer partition allows more computation nodes to be used and the computation load of each node is reduced, the similar convergence in the initial iterations is useful when addressing large scale data where only a few iterations are performed.

We also explored the situation when not all of the blocks J are selected. The group size was set to 100 and the number of detector sub-areas was set to 2 by default. We set γ to different values when the image was partitioned into 16 and 64 sub-areas; simulation results are shown in Fig. 10.

It can be seen that the GCSGD method converges for every γ, and the final precision levels are similar. Also, as long as γ is not too small (γ = 0.2), the initial convergence rates for different γ are almost identical. This suggests that randomly leaving a small portion of the image blocks out of the update does not influence the initial convergence rate or the final precision level. This property will be useful when we run the algorithm on parallel computing architectures where the algorithm may be run asynchronously.

[Fig. 9: When the image is divided into different numbers of volume blocks and all volume blocks are selected in each epoch, the initial convergence rates and the final precision levels are similar to each other regardless of the flat areas.]

[Fig. 10: Randomly selecting γ∗100% of the image blocks to be involved in the iteration in each epoch, (a) with the image partitioned into 16 square blocks and b = 1 for all simulations, and (b) with the image partitioned into 64 square blocks and b = 0.5 for all simulations. Despite the flat areas, the final precision levels for different γ are the same. The initial convergence rates are similar when γ is larger than 0.5; when γ is 0.25, the convergence rate slows down.]

D. Discussions on the importance sampling strategy

In this part we compare the importance sampling and random sampling strategies based on the previous simulation geometry. Since the convergence property has been verified in the previous simulations, here we are only interested in the initial convergence rate and thus we only run the algorithm (with the importance sampling or random sampling strategy) for 50 effective epochs, but repeat it 10 times for the same data and then calculate the average SNR and observation gap.

The simulations not only show the drawbacks and advantages of the importance sampling strategy, but also explain why the flat area appears when the precision level is high and the group size is large, and why using importance sampling can lead to image artefacts.

We first verified that the importance sampling strategy indeed helps to increase the convergence rate compared with random sampling. However, this only happens when the group size is small. In the simulation, we applied the two sampling strategies to sample the sub-detector areas and calculated the averaged SNR for each iteration. Two datasets, SNR_Imp and SNR_Ran, were obtained, where the former is the SNR obtained through importance sampling and the latter through random sampling. The difference SNR_Imp − SNR_Ran is shown in Fig. 11.

The reason why importance sampling cannot outperform random sampling when the group size is 100 is the non-uniform update of the z_I^j. For each volume slice, the selection criterion for choosing row blocks (or sub-detector areas from different projection views) is based on the projection area, which reflects the sub-matrix density. The difference between projection areas can sometimes be huge. For example, as shown in Fig. 12, for volume slice J_1 the projection area on sub-detector 2 is far larger than that on sub-detector 1. That means that when updating J_1 and selecting sub-detector areas, or row blocks, it is more likely that sub-detector 2 is chosen. This property makes the update of z_{I_2}^{J_1} = A_{I_2}^{J_1} x_{J_1} very frequent and the update of z_{I_1}^{J_1} = A_{I_1}^{J_1} x_{J_1} rather infrequent. For the residual terms r_{I_1} and r_{I_2}, the contributions from z_{I_1}^{J_1} and z_{I_2}^{J_1} are different, especially in the initial iterations. Since the density of A_{I_2}^{J_1} is higher than that of A_{I_1}^{J_1}, the influence of z_{I_2}^{J_1} on r_{I_2} is more significant than that of z_{I_1}^{J_1} on r_{I_1}.

In the initial iterations, the frequent updates of z_{I_2}^{J_1} make r_{I_2} decrease efficiently, whereas the infrequently updated z_{I_1}^{J_1} has little influence on the update of r_{I_1}. (It should be remembered that at the same time r_{I_1} is also updated by computations with the other volume slices.) This property helps the importance sampling strategy to obtain a fast initial convergence rate, but when the precision level is already high, the update on r_{I_1} needs the contribution from z_{I_1}^{J_1}, and unless this term is updated frequently, r_{I_1} cannot be further reduced. This explains why at the initial stage the importance sampling strategy is faster than random sampling and why, as the iterations go on, there are flat areas in Fig. 8, Fig. 9 and Fig. 10. The reason for the artefacts in the reconstructed images is similar: the shaded area in Fig. 12 is rarely used to update J_1, thus leaving the reconstruction error of the inner margin of J_1 (the same holds in the other volume slices) a little higher than in the other parts. The reason why the importance sampling strategy works and the flat areas disappear when the group size is small (for example, a group size of 1) is that in this situation the total convergence rate is slow, so by the time the precision reaches a high level, the updates of z_{I_1}^{J_1} have already been frequent enough to allow an effective change in r_{I_1}.

[Fig. 11: The importance sampling strategy always outperforms random sampling when the group size is 1, 2 or 5. When the group size is 100, the importance sampling strategy only helps at the early stage, after which its advantage over random sampling disappears. The observation gap shows a similar trend and is not presented here due to paper length limitations.]

[Fig. 12: The difference in projection area for one projection view can be huge for some volume block locations. Here I_1 denotes the row block for sub-detector 1 in the current projection view and I_2 is defined similarly for sub-detector 2.]

We propose some methods to overcome these disadvantages of importance sampling. One method would be to allow the algorithm to make a choice before sampling the row blocks, deciding whether to use importance or random sampling in the current iteration. This choice can be made with a specific probability. Another method would define a threshold on the probability distribution, enforcing it to be above the threshold. However, simulation results show that these two methods are not very effective in terms of improving the performance of the importance sampling strategy. The third method, which we call a mixed sampling strategy, is to gradually change the probability distribution as the iterations proceed. By gradually reducing the gap between different probability values, we experimentally show that this method maintains the fast initial convergence due to importance sampling while eliminating the artefacts in the reconstructed image and the flat area in the convergence curve. Specifically, in Fig. 12, we define the projection area of J_1 on sub-detector area 1 as P_1 and that on sub-detector area 2 as P_2. In the previous importance sampling strategy, the probability of choosing sub-detector area 1 is P_1/P_T, where P_T is the total projection area of sub-volume J_1; similarly, the probability of choosing sub-detector 2 is P_2/P_T. The mixed sampling method gradually closes the gap between P_1/P_T and P_2/P_T. The probabilities of choosing sub-detectors 1 and 2 are set as [P_1 + θ(P_max − P_1)]/P_T and [P_2 + θ(P_max − P_2)]/P_T respectively, where P_max is the larger of P_1 and P_2. The parameter θ is used to reduce the difference between the two probabilities by gradually increasing it from 0 to 1. We compare importance sampling, random sampling and the mixed sampling method for a group size of 100 and α = 0.5. The convergence rates are shown in Fig. 13.

We also present the reconstructed images after 20 effective epochs for group sizes of 1, 5 and 100 respectively. The simulation conditions are the same as in Fig. 7. For simplicity, we only simulated the α = 0.5 situations. The simulation results are shown in Fig. 14.

The mixed sampling method can also eliminate the flat area in the convergence trend. As an example, we present the simulation results for the different detector partition methods in Fig. 15. Compared with Fig. 8, the convergence trend is smoother and fewer iterations are required to achieve the precision limit.

[Fig. 13: Comparison of the different sampling strategies. The θ in the mixed sampling method starts from 0 with an increment step length of 1/40, meaning that after 40 epochs the probability distributions over the different sub-detector areas within one projection become the same. The mixed sampling method maintains the same initial convergence rate as the importance sampling strategy, and this trend is maintained as the iterations go on.]

[Fig. 14: Simulation results after 20 effective epochs when using the mixed sampling strategy. (a) group size=1, α=0.5, b=100, SNR=4.90dB; (b) group size=5, α=0.5, b=25, SNR=10.12dB; (c) group size=100, α=0.5, b=2, SNR=26.44dB. Compared with Fig. 7, the reconstructed images for group sizes 5 and 100 are free from artefacts while keeping similar SNRs.]

[Fig. 15: Using the mixed sampling strategy eliminates the flat area, thus shortening the number of iterations needed to achieve the precision limit.]
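A sketch of the mixed sampling schedule; we renormalise so the probabilities sum to one, which the text leaves implicit:

import numpy as np

def mixed_sampling_probs(proj_areas, theta):
    # Mixed sampling: lift each projection area P_i towards P_max,
    # [P_i + theta (P_max - P_i)] / P_T; theta = 0 gives importance
    # sampling and theta = 1 gives uniform (random) sampling.
    P = np.asarray(proj_areas, dtype=float)
    mixed = P + theta * (P.max() - P)
    return mixed / mixed.sum()

# theta grows from 0 to 1 in steps of 1/40, as in Fig. 13
for epoch in range(50):
    theta = min(1.0, epoch / 40)
    probs = mixed_sampling_probs([2.0, 12.0], theta)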

E. 3D simulation

To demonstrate the performance in a 3D setting, we defined a 3D phantom of 128∗128∗128 voxels describing a 32∗32∗32 volume. The CCD detector plane was a square with side length 101 and a detector spacing of 0.5. To demonstrate how our method can be used for nonstandard trajectories, we chose a randomised scanning path with 720 projection views, i.e. the point source location was randomly changed with a fixed radius r of 66 but with a random rotation. The rotation centre was placed at the centre of the 3D object. The detector was always perpendicular to the line connecting the geometric centre of the detector and the point source, and the vertical distance between the point source and the detector plane was 132. An illustration of the scanning geometry and a slice of the 3D volume are shown in Fig. 16. This simulation was performed on a computer with 48 Intel Xeon CPU cores and 256GB of RAM.

In this simulation, we did not divide the CCD area into sub-areas but simply treated each projection as a whole. We divided the volume into 2 sections in each dimension, giving 8 blocks in total. We set the group size to 90, b in β to 1, α = 0.5 and γ = 1. Reconstruction results after different numbers of epochs are shown in Fig. 17.

[Fig. 16: Basic 3D simulation settings. (a) the location of the point source is x = r sin a cos b, y = r sin a sin b, z = r cos a, where a and b are randomly changed; the line connecting the point source and the centre of the detector plane is always perpendicular to the detector plane. (b) a slice of the 3D volume.]

[Fig. 17: The selected slice of the reconstructed image after (a) 5, (b) 50, (c) 100 and (d) 200 effective epochs, with SNRs of 3.40dB, 14.74dB, 19.87dB and 26.49dB respectively.]
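The random trajectory can be generated directly from the parametrisation in Fig. 16; a sketch, under the assumption that the angles a and b are drawn uniformly:

import numpy as np

def random_source_positions(n_views, radius, rng):
    # Source locations x = r sin(a)cos(b), y = r sin(a)sin(b),
    # z = r cos(a), with a and b drawn at random for each view.
    a = rng.uniform(0.0, np.pi, n_views)
    b = rng.uniform(0.0, 2.0 * np.pi, n_views)
    return radius * np.stack([np.sin(a) * np.cos(b),
                              np.sin(a) * np.sin(b),
                              np.cos(a)], axis=1)

rng = np.random.default_rng(0)
sources = random_source_positions(720, radius=66, rng=rng)  # 720 views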

IV. CONCLUSION

Here we propose a parallel algorithm, CSGD, that is suitable for linear inverse problems y = Ax + e and enables additional flexibility in the way we can partition data sets in algebraic CT image reconstruction. As far as we know, CSGD is the first algorithm that combines row action and column action methods. Unlike other algorithms that require the slicing of the reconstructed volume to be perpendicular to the rotation axis of the object in order to reduce data sharing between different computing nodes, as well as a circular or helical scanning strategy, CSGD does not pose any restrictions on how to partition the reconstruction volume. It is thus applicable to generic scan trajectories. Furthermore, the CSGD method can be implemented using a parameter server parallel computing architecture. It thus has the potential to solve very large image reconstruction problems by using several computing nodes. The GCSGD method better utilises the row information of the system matrix A and thus has faster convergence rates than CSGD. We have furthermore developed an importance sampling strategy. By non-uniformly selecting row blocks of A based on the projection area of the associated sub-volume, the sub-matrices of A with relatively higher density get chosen more often than those with higher sparsity, which increases the initial convergence rates. A mixed sampling strategy can be used to overcome the artefacts introduced by importance sampling while keeping the fast convergence rate. We have presented 2D simulation results which demonstrate the convergence properties of the proposed algorithms, and 3D simulation results which further verify the effectiveness of GCSGD in terms of producing visually acceptable reconstruction results.

V. OPEN AREAS

Simulation results have shown that, when the group size is fixed, there is a range for the parameter b (or β) that guarantees the convergence of GCSGD. However, currently the choice of b is based on experience and trials; we thus want to develop a more systematic method, or an adaptive self-correction scheme, for the determination of the parameter b or β. Besides, in more realistic tomographic imaging settings, with noisy observations and an inaccurate system matrix, it is often difficult to obtain high quality reconstructions using the current method alone. As a result, incorporating regularisation terms into GCSGD is another important direction for our future work. We are also interested in applying the proposed algorithm to real datasets on a distributed network and in exploring the properties and differences of synchronous and asynchronous communication strategies. In our paper we have tested the convergence property when γ < 1, which can be viewed as a form of asynchronous communication in a parallel computing architecture. More systematic investigations of this will be included in future work.

ACKNOWLEDGMENT

The authors would like to thank Professor Paul White, who gave us positive feedback on the project. We also thank Dr Jordan Cheer, who suggested improvements to our description of the simulation results.

REFERENCES

[1] Y. Sagara, A. K. Hara, and W. Pavlicek, "Abdominal CT: comparison of low-dose CT with adaptive statistical iterative reconstruction and routine-dose CT with filtered back projection in 53 patients," Am. J. Roentgenol., vol. 195, no. 3, pp. 713–719, Sep. 2010.
[2] E. J. Hoffman, S. C. Huang, and M. E. Phelps, "Quantitation in positron emission computed tomography: 1. Effect of object size," J. Comput. Assist. Tomo., vol. 3, no. 3, pp. 299–308, Jun. 1979.
[3] T. Rodet, F. Noo, and M. Defrise, "The cone-beam algorithm of Feldkamp, Davis, and Kress preserves oblique line integrals," Med. Phys., vol. 31, no. 7, pp. 1972–1975, Jun. 2004.
[4] L. Feldkamp, L. Davis, and J. Kress, "Practical cone-beam algorithm," JOSA A, vol. 1, no. 6, pp. 612–619, Jun. 1984.
[5] A. Gervaise, B. Osemont, and S. Lecocq, "CT image quality improvement using adaptive iterative dose reduction with wide-volume acquisition on 320-detector CT," Euro. Radio., vol. 22, no. 2, pp. 295–301, Feb. 2012.
[6] G. Wang, H. Yu, and B. De Man, "An outlook on x-ray CT research and development," Med. Phys., vol. 35, no. 3, pp. 1051–1064, Feb. 2008.
[7] J. Deng, H. Yu, and J. Ni, "Parallelism of iterative CT reconstruction based on local reconstruction algorithm," J. Supercomput., vol. 48, no. 1, pp. 1–14, Apr. 2009.
[8] M. J. Willemink, P. A. Jong, and T. Leiner, "Iterative reconstruction techniques for computed tomography Part 1: technical principles," Eur. Radiol., vol. 23, no. 6, pp. 1623–1631, Jun. 2013.
[9] J. Ni, X. Li, T. He, and G. Wang, "Review of parallel computing techniques for computed tomography image reconstruction," Curr. Med. Imaging Rev., vol. 2, no. 4, pp. 405–414, Nov. 2006.
[10] J. Hsieh, B. Nett, and Z. Yu, "Recent advances in CT image reconstruction," Cur. Radio. Repor., vol. 1, no. 1, pp. 39–51, Mar. 2013.
[11] J. Bilbao-Castro, J. Carazo, J. Fernandez, and I. Garcia, "Performance of parallel 3D iterative reconstruction algorithms," WSEAS Trans. on Bio. and Biomedi., vol. 1, no. 1, pp. 112–119, Jan. 2004.
[12] Y. Censor, "Row-action methods for huge and sparse systems and their applications," SIAM Rev., vol. 23, no. 4, pp. 444–466, Jun. 1981.
[13] D. W. Watt, "Column-relaxed algebraic reconstruction algorithm for tomography with noisy data," Appl. Opt., vol. 33, no. 20, pp. 4420–4427, Sep. 1994.
[14] T. Elfving, "Block-iterative methods for consistent and inconsistent linear equations," Numer. Math., vol. 35, no. 1, pp. 1–12, Mar. 1980.
[15] J. Kole and F. Beekman, "Parallel statistical image reconstruction for cone-beam x-ray CT on a shared memory computation platform," Phys. Med. Biol., vol. 50, no. 6, pp. 1265–1272, Mar. 2005.
[16] T. Li, T. J. Kao, and Isaacson, "Adaptive Kaczmarz method for image reconstruction in electrical impedance tomography," Physiol. Meas., vol. 34, no. 6, p. 595, May 2013.
[17] Y. Censor, P. P. Eggermont, and D. Gordon, "Strong underrelaxation in Kaczmarz's method for inconsistent systems," Numer. Math., vol. 41, no. 1, pp. 83–92, Feb. 1983.
[18] A. H. Andersen and A. C. Kak, "Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm," Ultra. Imag., vol. 6, no. 1, pp. 81–94, Jan. 1984.
[19] Y. Censor, "Parallel application of block-iterative methods in medical imaging and radiation therapy," Math. Program., vol. 42, no. 1, pp. 307–325, Apr. 1988.
[20] J. Gregor and T. Benson, "Computational analysis and improvement of SIRT," IEEE Trans. on Med. Ima., vol. 27, no. 7, pp. 918–924, Jun. 2008.
[21] N. D. Tang, N. De Ruiter, J. Mohr, and A. P. Butler, "Using algebraic reconstruction in computed tomography," in 27th Conf. IVCNZ, Dunedin, New Zealand, pp. 216–221, ACM, Nov. 2012.
[22] T. M. Benson and J. Gregor, "Framework for iterative cone-beam micro-CT reconstruction," IEEE Trans. Nucl. Sci., vol. 52, no. 5, pp. 1335–1340, Oct. 2005.
[23] H. M. Hudson and R. S. Larkin, "Accelerated image reconstruction using ordered subsets of projection data," IEEE Trans. Med. Imag., vol. 13, no. 4, pp. 601–609, Dec. 1994.
[24] Y. Censor, D. Gordon, and R. Gordon, "Component averaging: An efficient iterative parallel algorithm for large and sparse unstructured problems," Parallel Comput., vol. 27, no. 5, pp. 777–808, May 2001.
[25] Y. Censor and T. Elfving, "Block-iterative algorithms with diagonally scaled oblique projections for the linear feasibility problem," SIAM J. Matrix Anal. and Appl., vol. 24, no. 1, pp. 40–58, Jul. 2002.
[26] Y. Censor, D. Gordon, and R. Gordon, "BICAV: A block-iterative parallel algorithm for sparse systems with pixel-related weighting," IEEE Trans. Med. Imag., vol. 20, no. 10, pp. 1050–1060, Oct. 2001.
[27] C. A. Bouman and K. Sauer, "A unified approach to statistical tomography using coordinate descent optimization," IEEE Trans. Image Process., vol. 5, no. 3, pp. 480–492, Mar. 1996.
[28] Z. Yu, J. B. Thibault, C. A. Bouman, K. D. Sauer, and J. Hsieh, "Non-homogeneous updates for the iterative coordinate descent algorithm," in Electronic Imaging 2007, San Jose, CA, United States, pp. 64981B1–64981B12, International Society for Optics and Photonics, Feb. 2007.
[29] Z. Yu, J.-B. Thibault, C. A. Bouman, and K. D. Sauer, "Fast model-based x-ray CT reconstruction using spatially nonhomogeneous ICD optimization," IEEE Trans. Image Process., vol. 20, no. 1, pp. 161–175, Jan. 2011.
[30] T. M. Benson, B. K. De Man, L. Fu, and J.-B. Thibault, "Block-based iterative coordinate descent," in NSS/MIC, 2010 IEEE, pp. 2856–2859, IEEE, 2010.
[31] D. Kim and J. A. Fessler, "Parallelizable algorithms for x-ray CT image reconstruction with spatially non-uniform updates," Proc. 2nd Intl. Mtg. on image formation in X-ray CT, pp. 33–36, 2012.
[32] T. Elfving, P. C. Hansen, and T. Nikazad, "Convergence analysis for column-action methods in image reconstruction," Numer. Algorithms, vol. 74, no. 3, pp. 905–924, Mar. 2017.
[33] W. J. Palenstijn, J. Bedorf, and K. J. Batenburg, "A distributed SIRT implementation for the ASTRA toolbox," in Proc. Fully Three-Dimensional Image Reconstruct. Radiol. Nucl. Med., pp. 166–169, Jun. 2015.
[34] M. Soleimani and T. Pengpen, "Introduction: a brief overview of iterative algorithms in x-ray computed tomography," May 2015.
[35] X. Guo, "Convergence studies on block iterative algorithms for image reconstruction," Appl. Math. Comput., vol. 273, pp. 525–534, Jan. 2016.
[36] M. Beister, D. Kolditz, and W. A. Kalender, "Iterative reconstruction methods in x-ray CT," Phys. Medica, vol. 28, no. 2, pp. 94–108, Apr. 2012.
[37] F. Jacobs, E. Sundermann, B. De Sutter, and M. Christiaens, "A fast algorithm to calculate the exact radiological path through a pixel or voxel space," J. CIT., vol. 6, no. 1, pp. 89–94, Mar. 1998.
[38] Z. Shen, H. Qian, C. Zhang, and T. Zhou, "Accelerated variance reduced block coordinate descent," arXiv preprint arXiv:1611.04149, Nov. 2016.
[39] Z. Qu and P. Richtarik, "Coordinate descent with arbitrary sampling I: Algorithms and complexity," Optim. Method. Softw., vol. 31, no. 5, pp. 829–857, May 2016.
[40] T. Strohmer and R. Vershynin, "A randomized Kaczmarz algorithm with exponential convergence," J. Fourier Anal. Appl., vol. 15, no. 2, pp. 262–278, Apr. 2009.

Yushan Gao received the B.E. degree in optical engineering from Beijing Institute of Technology, Beijing, China, in 2010. He is currently pursuing the Ph.D. degree with the Faculty of Engineering and Environment, University of Southampton, Southampton, UK. His research interests include image reconstruction and parallel computation.

Thomas Blumensath (S'02–M'06) received the B.Sc. (Hons.) degree in music technology from Derby University, Derby, U.K., in 2002 and the Ph.D. degree in electronic engineering from Queen Mary, University of London, London, U.K., in 2006. He specializes in statistical and mathematical methods for signal processing with a particular focus on sparse signal representations and their application to signal-processing problems.