<<

electronics

Article Efficient Systolic-Array Redundancy Architecture for Offline/Online Repair

Keewon Cho 1, Ingeol Lee 2, Hyeonchan Lim 1 and Sungho Kang 1,*

1 Department of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, Korea; [email protected] (K.C.); [email protected] (H.L.) 2 Design Technology Team, Samsung Electronics CO., Ltd., Hwaseong-Si 445-330, Korea; [email protected] * Correspondence: [email protected]; Tel.: +82-2-2123-2775

 Received: 22 January 2020; Accepted: 13 February 2020; Published: 15 February 2020 

Abstract: Neural-network computing has revolutionized the field of machine learning. The systolic- array architecture is a widely used architecture for neural-network computing acceleration that was adopted by Google in its (TPU). To ensure the correct operation of the neural network, the reliability of the systolic-array architecture should be guaranteed. This paper proposes an efficient systolic-array redundancy architecture that is based on systolic-array partitioning and rearranging connections of the systolic-array elements. The proposed architecture allows both offline and online repair with an extended redundancy architecture and programmable fuses and can ensure reliability even in an online situation, for which the previous fault-tolerant schemes have not been considered.

Keywords: offline repair; online repair; parallel processing; redundancy; systolic-arrays

1. Introduction In the past few years, convolutional neural networks (CNNs) have outperformed traditional machine-learning . Thus, CNNs are considered state-of-the-art for a wide range of applications, such as image and video recognition [1], text classification [2], and language translation [3]. These neural networks are traditionally operated on central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs) [4], which can be easily adapted to new network architectures. However, to train and execute a CNN, millions of parameters must be taken into consideration, making the construction of neural networks expensive. Several hardware techniques have been developed for accelerating such machine-learning algorithms. In recent years, neural-network accelerator design has been a topic of enormous interest in the industry. A systolic-array is a homogeneous grid of processing elements (PEs), which are small processors, with each element connected only to its neighbors [5]. During its operation, a PE reads data from its neighbors, computes a simple multiplication function, and stores the result in its local memory. Recently, Google developed a Tensor Processing Unit (TPU) that uses a 256 256 systolic-array of × multiply and accumulate (MAC) units at its core and provides 30–80 times higher performance than CPU- or GPU-based servers [6]. However, if faults occur in some PEs, such as manufacturing problems, this computing architecture in the form of a systolic-array can erroneous results. Although neural-network application has a fault-tolerance capability, reliability issues still exist, because a single hardware fault affects all operations of the neural-network application. Besides, post-manufacturing failures may cause the additional erroneous results during the normal operation. Therefore, there is a need for an architecture that can guarantee the reliability of the systolic-array architecture. In this point of view, the proposed

Electronics 2020, 9, 338; doi:10.3390/electronics9020338 www.mdpi.com/journal/electronics Electronics 2020, 9, 338 2 of 14 idea is the best fit for the CNN applications for high reliability. For example, machine learning solutions in the automotive vehicle industry needs high reliability in sensor fusion, mapping, and path planning [7–9]. In general, if a single core has faults, it can be replaced by a redundant core. However, in the conventional multicore CPU architecture, designing a redundancy core is considered inefficient, because each redundant CPU has a very large area overhead. However, application-specific accelerator processors, such as neural-network accelerators, are composed of at least several hundreds to tens of thousands of cores. Owing to this structural characteristic, it can be expected that design of the redundancy of the systolic MAC array incurs a reasonable area overhead compared with the design of the redundancy of the multicore CPU architecture. The redundancy architecture of the systolic MAC array is configured differently from general redundancy structures, such as through-silicon via (TSV) redundancy and input/output (IO) redundancy structures. In the case of TSV and IO redundancy, rerouting to the redundancy is relatively easy because the signal is connected between only one input and one output. However, in the systolic-array architecture, as there are a large number of connections from one MAC core to other adjacent MAC cores, various routing possibilities should be considered for obtaining the correct calculation results after repair. In this research, an efficient redundancy architecture for a systolic MAC array is proposed. The proposed method involves offline repair, which is conducted during a manufacturing test by using the additional redundant MACs. Online repair is also performed by using the remaining redundant MACs that are not used in the offline repair. In order to utilize the repair information in the offline repair , programmable fuses, which are widely used in the memory repair, are adopted in the proposed method. Unlike the previous works, the offline repair process focuses not only the reparability, but also the fault tolerance for online repair. By doing so, the proposed method improves the reliability in the offline and online statuses and requires a small area overhead since the additional redundancies can be used for both repairs. The proposed method provides not only the improved yield and reliability but also the cost-effective design while requiring no drastic change in the IC manufacturing stage. The remainder of this paper is organized as follows. In Section2, we review the previous work related to the systolic MAC array and its redundancy architecture. Section3 presents the proposed redundancy architecture and the basic concept of the proposed repair mechanism. Section4 presents the experimental results indicating the performance of the proposed architecture with regard to the repair rate and hardware overhead. Finally, Section5 concludes the paper.

2. Previous Works In this section, the systolic-array architecture and its fault-tolerance schemes, which were previously proposed, are described.

2.1. Systolic-Array Architecture Because matrix multiplication and convolution operations are independent for each piece of data, the calculations are optimized using parallel processing with as many processors as possible. However, if the number of processors exceeds several thousand, interconnection problems that transfer data from the memory increase rapidly. To solve these problems, a systolic-array structure was introduced in [5], as shown in Figure1a. In the systolic-array architecture, a MAC unit, which enables MAC functions, is used [10]. The MAC unit performs multiplication and accumulation processes repeatedly to perform continuous and complex operations in digital signal processing. A basic MAC architecture is illustrated in Figure1b. In neural-network calculations, the convolution-layer operation accounts for most of the computation time. Therefore, by adopting this systolic-array structure, matrix multiplication and convolution calculations can be accelerated. ElectronicsElectronics2020 20,209,, 3382, x FOR PEER REVIEW 3 of3 14 of 14 Electronics 2020, 2, x FOR PEER REVIEW 3 of 14

(a) (b)

(a) (b)

Figure 1. Neural computing unit structure for calculating N N matrix multiplication: (a) Systolic-array × neuralFigure computing 1. Neural unit; computing ( b) Conventional unit structure MAC for architecture.calculating N × N matrix multiplication : (a) Systolic- array neural computing unit; (b) Conventional MAC architecture. 2.2. PreviousFigure Fault1. Neural Repair computing Systolic-Array unit structure Designs for calculating N × N matrix multiplication: (a) Systolic- 2.2Fault-tolerant. arrayPrevious neural Fault computing systolic-array Repair Systolic unit; designs(-bArray) Conventional Designs were proposed MAC architecture. by Kung et al. [11] and were later improved [12]. The basicFault concept-tolerant of these systolic schemes-array designs is considering were proposed the array by as Kung a small et al. systolic-array [11] and were when later faulty improved MACs 2.2. Previous Fault Repair Systolic-Array Designs are[12 detected]. The basic in the concept array. of these schemes is considering the array as a small systolic-array when faulty MACsFigureFault are-2tolerant adetected shows systolic in the the redundant array.-array designs MAC were architecture proposed by for Kung repairing et al. [ a11 faulty] and were MAC later processor. improved MAC[12]. processorsTheFig urebasic 2a concept shows are arranged the of theseredundant in schemes a 4 MAC4 systolic-array, is consideringarchitecture and forthe redundantrepairingarray as a a small MACfaulty systolic processorsMAC processor.-array are when placed MAC faulty at × theMACsprocessors end ofare each detected are row. arranged Thesein the in array. redundant a 4 × 4 systolic MAC-array, processors and redundant can serve MAC as substitutes processors when are placed a fault at occurs the in eachendFig of row. ureeach Furthermore,2 arow. shows These the redundant toredundant replace MAC the MAC faulty processors architecture MAC can with servefor the repairing redundantas substitu a faultytes MAC, when MAC the a signalfault processor. occurs should MACin be reroutedprocessorseach row. so that areFurthermore, severalarranged MUXs toin replacea 4 are × 4 used. thesystolic faulty-array, MAC and with redundant the redundant MAC MAC, processors the signal are shouldplaced beat the endrerouted of each so row.that severalThese redundantMUXs are used. MAC processors can serve as substitutes when a fault occurs in each row. Furthermore, to replace the faulty MAC with the redundant MAC, the signal should be rerouted so that several MUXs are used.

(a) (b)

Figure 2. Previous systolic-array redundant architecture [12]: (a) Overview; (b) Repair example.

An example of a repair(a) process in this redundancy architecture is presented(b) in Figure2b. According to the characteristicsFigure 2. Previous of the systolic systolic-array,-array redundant if MAC architecture unit C11 [ has12]: a(a defect,) Overview; for normal(b) Repair operation, example. the role of the faulty core is performed by C12, and the role of C12 is performed by C13. Finally, a redundant An example of a repair process in this redundancy architecture is presented in Figure 2b. core R1 performs the role of C13. Through this process, the repair process is completed in such a According to the characteristics of the systolic-array, if MAC unit C11 has a defect, for normal Figure 2. Previous systolic-array redundant architecture [12]: (a) Overview; (b) Repair example. manneroperation, that the role signals of the to faulty be transmitted core is performed to the faulty by C12, MAC and the are role shifted of C12 to theis performed neighboring by C13. MAC andFinally, transmitted. a redundant Thus, core the repairR1 performs process the is role completed. of C13. Through Another this repair process, process the isrepair proposed process in [is13 ]. In thisAn study, example the role of of a repair the faulty process MAC in is this replaced redundancy by the neighboring architecture core is presented which is in placed Figure on 2b. a diagonal.According Basically, to the switches characteristics are located of the at systolicthe intersection-array, if point MAC among unit everyC11 has four a neighboring defect, for normal cores. operation, the role of the faulty core is performed by C12, and the role of C12 is performed by C13. Finally, a redundant core R1 performs the role of C13. Through this process, the repair process is

Electronics 2020, 2, x FOR PEER REVIEW 4 of 14

completed in such a manner that the signals to be transmitted to the faulty MAC are shifted to the Electronicsneighboring2020, 9MAC, 338 and transmitted. Thus, the repair process is completed. Another repair process4 of 14is proposed in [13]. In this study, the role of the faulty MAC is replaced by the neighboring core which is placed on a diagonal. Basically, switches are located at the intersection point among every four By utilizing switches, the faulty MAC can be bypassed in either horizontal or vertical ways and shifted neighboring cores. By utilizing switches, the faulty MAC can be bypassed in either horizontal or by its neighboring core. vertical ways and shifted by its neighboring core. Recently, Zhang et al. proposed a fault-tolerant systolic-array design considering the impact of Recently, Zhang et al. proposed a fault-tolerant systolic-array design considering the impact of hardware faults in the application domain [14]. The method reduced the effect of the faulty MAC by hardware faults in the application domain [14]. The method reduced the effect of the faulty MAC by calculating the weight in the CNN corresponding to the faulty MAC as zero. Although this method calculating the weight in the CNN corresponding to the faulty MAC as zero. Although this method reduces the effect of hardware faults without additional redundant MACs, reliability issues still exist, reduces the effect of hardware faults without additional redundant MACs, reliability issues still exist, because the systolic-array is comprised of the faulty MACs. because the systolic-array is comprised of the faulty MACs. 3. Proposed Redundancy Architecture 3. Proposed Redundancy Architecture In this section, the proposed systolic-array redundancy architecture is described. The proposed In this section, the proposed systolic-array redundancy architecture is described. The proposed method has two main features: the “redundancy architecture” and “online error repair.” method has two main features: the “redundancy architecture” and “online error repair.” 3.1. Systolic-Array Redundancy Architecture 3.1. Systolic-Array Redundancy Architecture The previous redundancy architecture can repair all faulty MACs when only one faulty MAC occursThe in previous each row. redundancy This means architecture that faulty can MACs repair cannot all faulty be repairedMACs when when only two one or morefaulty faultyMAC MACsoccurs in are each exist row. in aThis row. means Therefore, that faulty an extended MACs cannot redundancy be repaired architecture when two is or considered more faulty in whichMACs configuringare exist in a redundanciesrow. Therefore, at an the extended end of redundancy the row and architecture column of is the considered array as in shown which in configuring Figure3a. Thisredundancies redundancy at the architecture end of the canrow repair and column faults forof the various array patterns as shown when in Fig aure large 3a. number This redundancy of MACs fail.architecture The proposed can repair method faults repairs for various faulty patterns MACs bywhen using a large the shift-basednumber of repairMACs mechanismfail. The proposed which ismethod commonly repairs used faulty for itsMACs simple by using implementation. the shift-based Figure repair3b showsmechanism an example which inis commonly which three used faulty for MACsits simple are implementation concentrated in an. Fig area.ure 3b In shows this example, an example there in are which two faultythree faulty MACs MACs in the are 2nd concentrated row and the 3rdin an column. area. In Nevertheless, this example, the there redundant are two MAC faulty architecture MACs in structure the 2nd can row repair and this the faulty 3rd column. pattern withNevertheless, more additional the redundant hardware MAC overhead. architecture structure can repair this faulty pattern with more additional hardware overhead.

(a) (b)

Figure 3. Redundancies in both rows and columns: (a) Overall architecture; (b) Repair example.

3.2. Redundancy Configuration by Partitioning the Entire Array Figure 3. Redundancies in both rows and columns: (a) Overall architecture; (b) Repair example. In the event of MAC failure, to repair the fault, the signals that should be assigned to the faulty MAC are rerouted to the adjacent MAC, and the signals that should be conveyed to the last MAC in the 3.2. Redundancy Configuration by Partitioning the Entire Array rows or columns are assigned to the redundant MAC. However, when the number of faulty MACs is high,I then the redundancy event of MAC architecture failure, to cannot repair repair the fault, the specific the signals faulty that MAC should patterns. be assigned For example, to the iffaulty two orMAC more are faults rerouted occur to in the a systolic-array adjacent MAC, row, and one the redundant signals that MAC should at the be end conveyed of the row to the can last only MAC repair in one faulty MAC. In this case, the remaining unrepaired faulty MACs can be repaired by the redundant

MAC located at the end of the column, but if there are other faults in this column, the fault pattern Electronics 2020, 9, 338 5 of 14 means that the faults cannot be repaired. Therefore, to reduce the probability of these fault patterns occurring, an architecture of configuring redundancies by partitioning a large size systolic-array can be considered. This partitioning architecture, which is called “array partitioning,” is more efficient in terms of repair rate because it can repair more faulty MAC patterns. Increasing the partitioning level means partitioning the entire array into multiple small arrays. For example, if the entire systolic-array size is 128 128, setting the partitioning level to 4 will partition the entire array into four 64 64 arrays. × × Similarly, when the partitioning level is set to 16, an entire 128 128 array will be partitioned into × 16 arrays with a size of 32 32. In this paper, the parameter PL (partitioning level) which indicates the × number of parts the entire array will be partitioned into is used. Let NROW and NCOL are the number of rows and columns of the systolic-array, respectively. Then, if the PL is set to 4, it means that the entire N N size array is divided into four N /2 N /2 size arrays. ROW × COL ROW × COL An advantage of the “array partitioning” feature of the proposed method is that it increases the repair rate by increasing PL; however, excessive partitioning of the entire array can increase the redundancies and possibly cause a timing problem due to the redundancy configuration. Therefore, it is important to determine the appropriate PL.

3.3. One Redundancy per Multiple Rows and Columns A straightforward redundancy architecture for repairing faults in systolic-array partitions involves placing redundancies at the end of every row and column. However, this architecture is inefficient in terms of area overhead unless the probability of a fault occurring in the systolic-array is high. Therefore, to design a more efficient redundancy architecture in terms of area overhead, an architecture whereby a redundancy is placed per two or more rows and columns can be considered. We set the parameters “One redundancy per multiple Rows Level (ORL)” and “One redundancy per multiple Columns Level (OCL)”. If the ORL is set to 2 and the OCL is set to 4, it means one redundancy is placed in two rows and four columns. In addition, when applying the same value to ORL and OCL, use the parameter OL. For example, if OL is set to 4, ORL and OCL have the same value 4. Figure4a shows the proposed architecture with OL 2. Redundancy MAC RC0 can repair any one faulty MAC occurring in the first row and the second row. That is, if a faulty core is detected in the first systolic-array row, RC0 should act as the last core of the first row, and if a faulty core is detected in the second systolic row, RC0 should act as the last core in the second row. This extension architecture can reduce the total additional hardware overhead because it can reduce the number of redundant

MACs.Electronics Similarly, 2020, 2, xFigure FOR PEER4b REVIEW shows the case where OL is 4. 6 of 14

(a) (b)

Figure 4. Proposed redundancy architecture: (a) OL 2; (b) OL 4.

Figure 4. Proposed redundancy architecture: (a) OL 2; (b) OL 4.

Based on this extension, it is also possible to extend the architecture of one redundant core per three or more rows and columns. However, if three or more rows and columns have only one redundant MAC, this redundancy architecture can be configured with less hardware overhead, but the repair rate will decrease; furthermore, a timing problem may occur because the distance between the redundancy and MAC will increase.

3.4. Repair Strategy for Offline/Online Repair Typically, redundancies that are not used in offline repair are wasted. However, in the proposed method, the remaining redundancies are used to repair faults during normal operation. If at least one redundant MAC remains unused, the online repair process can be performed. Typically, programmable fuses are used to repair faulty memory cells by programming the address fuses of the spare decoder [15]. In order to utilize the unused redundant MACs, programmable fuses are adopted to enable the online repair process to repair additional erroneous MACs during the normal operation. Basically, the online repair process is progressed with unused redundant MACs after the offline repair process. It means that if an error occurs in the MAC which cannot be covered by the remaining redundancies, it is impossible to repair the erroneous MAC with the proposed method. Therefore, it is best to leave unused redundant MACs as efficiently as possible during the offline repair process. This does not mean that the number of unused redundant MACs should be maximized because the area of repairable MACs in the online repair process is not fully determined by the quantity of unused redundant MACs. Besides, more unused redundant MACs means redundancies are over-assigned in the first place. Unlike the previous redundancy architecture, the proposed method assigns redundancies to both the end of the row and column of the array. As mentioned in Section 2.2, each redundant MAC is specialized to repair the faulty MAC at the same line. So, several MACs share the same redundant resources depending on the redundancy architecture. Let’s say that those several MACs belong to the same group. Then, each group has the same number of available redundant MACs to use before the offline repair process. As the repair process is progressed, the number of available redundancies of each group is decreased. The basic concept of the proposed repair strategy is spending redundancies first in the group which has the higher number of available redundant MACs. This prevents the specific group from having no available redundancy even though there is another option to utilize resources of other groups.

Electronics 2020, 9, 338 6 of 14

Based on this extension, it is also possible to extend the architecture of one redundant core per three or more rows and columns. However, if three or more rows and columns have only one redundant MAC, this redundancy architecture can be configured with less hardware overhead, but the repair rate will decrease; furthermore, a timing problem may occur because the distance between the redundancy and MAC will increase.

3.4. Repair Strategy for Offline/Online Repair Typically, redundancies that are not used in offline repair are wasted. However, in the proposed method, the remaining redundancies are used to repair faults during normal operation. If at least one redundant MAC remains unused, the online repair process can be performed. Typically, programmable fuses are used to repair faulty memory cells by programming the address fuses of the spare decoder [15]. In order to utilize the unused redundant MACs, programmable fuses are adopted to enable the online repair process to repair additional erroneous MACs during the normal operation. Basically, the online repair process is progressed with unused redundant MACs after the offline repair process. It means that if an error occurs in the MAC which cannot be covered by the remaining redundancies, it is impossible to repair the erroneous MAC with the proposed method. Therefore, it is best to leave unused redundant MACs as efficiently as possible during the offline repair process. This does not mean that the number of unused redundant MACs should be maximized because the area of repairable MACs in the online repair process is not fully determined by the quantity of unused redundant MACs. Besides, more unused redundant MACs means redundancies are over-assigned in the first place. Unlike the previous redundancy architecture, the proposed method assigns redundancies to both the end of the row and column of the array. As mentioned in Section 2.2, each redundant MAC is specialized to repair the faulty MAC at the same line. So, several MACs share the same redundant resources depending on the redundancy architecture. Let’s say that those several MACs belong to the same group. Then, each group has the same number of available redundant MACs to use before the offline repair process. As the repair process is progressed, the number of available redundancies of each group is decreased. The basic concept of the proposed repair strategy is spending redundancies first in the group which has the higher number of available redundant MACs. This prevents the specific group from having no available redundancy even though there is another option to utilize resources of other groups. The example of the proposed repair strategy is depicted in Figure5. In this example, there are two faulty MACs, C13 and C33, into a 4 4 MAC array as shown in Figure5a. There are four groups × in the MAC array and each group has two available redundant MACs. For example, Group 1 consists of MAC C00, C01, C10, and C11. If a fault occurs in Group 1, redundant MACs, RC0 and RR0, can be utilized. Therefore Group 1 has one available row redundant MAC and one column redundant MAC. Figure5b depicts the number of available redundant MACs of each group. In this example, C13 and C33 can be repaired by utilizing {RC0, RC2} or {RC0, RR2}. However, choosing {RC0, RR2} as a solution makes Group 2 has no available redundancy in the online repair process. On the other hand, choosing {RC0, RC2} as a solution makes all groups have at least one available redundant MAC in the online repair process as shown in Figure5d. This repair strategy secures more flexibility in the online repair process. Electronics 2020, 2, x FOR PEER REVIEW 7 of 14

The example of the proposed repair strategy is depicted in Figure 5. In this example, there are two faulty MACs, C13 and C33, into a 4 × 4 MAC array as shown in Figure 5a. There are four groups in the MAC array and each group has two available redundant MACs. For example, Group 1 consists of MAC C00, C01, C10, and C11. If a fault occurs in Group 1, redundant MACs, RC0 and RR0, can be utilized. Therefore Group 1 has one available row redundant MAC and one column redundant MAC. Figure 5b depicts the number of available redundant MACs of each group. In this example, C13 and C33 can be repaired by utilizing {RC0, RC2} or {RC0, RR2}. However, choosing {RC0, RR2} as a solution makes Group 2 has no available redundancy in the online repair process. On the other hand, choosing {RC0, RC2} as a solution makes all groups have at least one available redundant MAC in Electronicsthe online2020 ,repair9, 338 process as shown in Figure 5d. This repair strategy secures more flexibility7 in of 14the online repair process.

(a) (b)

(c) (d)

Figure 5. Example of the proposed repair strategy: (a) Faulty systolic-array example; (b) Visualization

of the number of available redundant MACs of each group; (c) Repair decision of C13; (d) Repair decision of C33. Figure 5. Example of the proposed repair strategy: (a) Faulty systolic-array example; (b) Visualization Theof the overall number flowchart of available of the redundant proposed MACs method of each is group; shown (c in) R Figureepair decision6. First, of redundant C13; (d) Repair MACs shoulddecision be tested of beforeC33. testing normal MACs because only healthy redundant MACs can be utilized in the repair process. And then, the offline test for normal MACs is progressed. When faults are detected The overall flowchart of the proposed method is shown in Figure 6. First, redundant MACs in the offline test process, faulty information is collected in the fault list. As get the proper repair should be tested before testing normal MACs because only healthy redundant MACs can be utilized solution, the repair decision of the offline repair starts after all the faulty information is collected. Then, in the repair process. And then, the offline test for normal MACs is progressed. When faults are the proposed method investigates the number of available redundancies of each group and sorts the detected in the offline test process, faulty information is collected in the fault list. As get the proper numbers in descending order. If there are available redundancies to use, the offline repair process is performed based on the repair strategy. The redundancy list is updated whenever the repair decision is making. If all faulty MACs are repaired, the offline test phase ends. Once the online test phase starts, the first process is to check the available resources through programmable fuses. In the online test phase, repairable MACs are determined by the information of remaining redundant MACs in the redundancy list. Therefore, if an error is detected during the normal operation, the proposed method can instantly find out whether the erroneous MAC is repairable or not. If the group which has the erroneous MAC can utilize two or more redundant MACs, the proposed method seeks the way to avoid the number of available redundant MACs of each group being 0. This decision making process Electronics 2020, 2, x FOR PEER REVIEW 8 of 14 repair solution, the repair decision of the offline repair starts after all the faulty information is collected. Then, the proposed method investigates the number of available redundancies of each group and sorts the numbers in descending order. If there are available redundancies to use, the offline repair process is performed based on the repair strategy. The redundancy list is updated whenever the repair decision is making. If all faulty MACs are repaired, the offline test phase ends. Once the online test phase starts, the first process is to check the available resources through programmable fuses. In the online test phase, repairable MACs are determined by the information of remaining redundant MACs in the redundancy list. Therefore, if an error is detected during the normal operation, the proposed method can instantly find out whether the erroneous MAC is Electronicsrepairable2020 or, 9 ,not. 338 If the group which has the erroneous MAC can utilize two or more redundant8 of 14 MACs, the proposed method seeks the way to avoid the number of available redundant MACs of caneach be group easily being performed 0. This withdecision the simplemaking binary process search can be tree. easily Then, performed online repair with stepthe simple is repeatedly binary performedsearch tree. until Then, there online is no repair remaining step is redundancyrepeatedly performed left. until there is no remaining redundancy left.

Figure 6. Flowchart of the proposed method.

3.5. Efficient Efficient Redundancy ConfigurationConfiguration ToTo eefficientlyfficiently perform perform the the convolution convolution operation, operation, it isit eisff ectiveeffective to setto theset the systolic-array systolic-array to 128 to 128128 × × or128 more. or more. Therefore, Therefore, a strategy a strategy for for a systolic-arraya systolic-array with with a sizea size of of 128 128 ×128 128 oror moremore shouldshould alsoalso be × configuredconfigured to construct redundancy. WhileWhile it is possiblepossible to simply place redundancies all the way to the right and to the bottom side of the entire array, thisthis isis notnot anan eefficientfficient solution in terms of repair rate and hardware overhead.overhead. As mentioned mentioned in in Sections Section 3.2s 3and.2 and 3.3, the3.3, proposed the proposed architecture architecture can reduce can thereduce hardware the hardwar overheade andoverhead can increase and can the increase repair ratethe whenrepair multiplerate when faulty multiple MACs faulty are detected. MACs are Therefore, detected. the Therefore, redundancy the architectureredundancythat architecture can be applied that can as be effi appliedciently as as possible efficiently in termsas possible of hardware in terms overhead of hardware and repair overhead rate withand repair a systolic-array rate with a size, systolic can- bearray effectively size, can applied be effec intively real neuralapplied network in real neural operations. network operations.

4. Simulation Results

4.1. Simulation Environment To evaluate the performance of the proposed architecture, hardware overheads, and repair rates induced by the redundancy configuration were calculated and compared with those obtained by changing the parameter of the proposed architecture. For example, Figure7 shows the redundancy architecture with PL 4 and OL 2. We experimented with large size of systolic-arrays such as 128 128, × 256 256, and 512 512. Then, experiments were performed while varying the parameters, PL and × × OL, for each array size. These experiments were performed in a 3 GHz Linux environment with 256 GB memory. Electronics 2020, 2, x FOR PEER REVIEW 9 of 14

4. Simulation Results

4.1. Simulation Environment To evaluate the performance of the proposed architecture, hardware overheads, and repair rates induced by the redundancy configuration were calculated and compared with those obtained by changing the parameter of the proposed architecture. For example, Figure 7 shows the redundancy architecture with PL 4 and OL 2. We experimented with large size of systolic-arrays such as 128 × 128, 256 × 256, and 512 × 512. Then, experiments were performed while varying the parameters, PL and

OL, for eachElectronics array2020 size., 9, 338 These experiments were performed in a 3 GHz Linux environment9 of 14 with 256 GB memory.

Figure 7.Figure Proposed 7. Proposed redundancy redundancy architecture with withPL: 4 andPL: OL4 and: 2. OL: 2. 4.2. Repair Rates 4.2. Repair Rates To evaluate the repair performance of the redundancy architectures, numerous random fault patterns were introduced into the N N systolic-array MAC architectures. These faulty MACs To evaluate the repair performanceROW of× theCOL redundancy architectures, numerous random fault can be repaired using the redundant MACs that are located at the right and bottom sides. If all faulty patterns wereMACs introduced are repaired into successfully the N afterROW × the N repairCOL systolic process,- thisarray fault MAC pattern architectures. is classified as a “repairable These faulty MACs can be repairedcase.” using On the the other redundant hand, if one orMACs more faulty that MACsare located cannot be at repaired, the right this and case isbottom classified sides. as an If all faulty MACs are repaired“irreparable successfully case.” The repair after rates the are repair represented process, as follows: this fault pattern is classified as a “repairable case.” On the other hand, if one or more faulty MACsRepairable cannot case be repaired, this case is classified as an Repair rate(%) = “irreparable case.” The repair rates are representedRepairable as case follows:+ Irreparable case To investigate the effect of the PL and OL on repair rate, repair rates were measured while 푅푒푝푎푖푟푎푏푙푒 푐푎푠푒 increasing PL andRepairOL. Inrate this(% experiment,) = the number of fault patterns was randomly generated 100,000 times. 푅푒푝푎푖푟푎푏푙푒 푐푎푠푒 + 퐼푟푟푒푝푎푟푎푏푙푒 푐푎푠푒 First, the repair rate was measured while PL was increased when OL was fixed to 1. In this case, increasing PL while maintaining OL increases the number of redundancies and also the number of To investigaterepairable faults. the effect Therefore, of the as the PLPL andincreases, OL on a higher repair repair rate, rate repair can be achieved rates were as shown measured in while increasing PLFigure and8a. InOL contrast,. In this when experiment, the PL was fixed, the the number repair rate of was fault measured patterns while was increasing randomly the OL. generated 100,000 times.The result is shown in Figure8b. Increasing OL means decrease of the number of redundancies and the First, therepair repair rate drops. rate was As shown measured in Figure while8, the increase PL was of increasedPL means the when increase OL in redundancy,was fixed whichto 1. In this case, means an improvement in the repair rate. Similarly, the increase of OL causes the decrease of the repair increasing PLrate while because maintaining it reduces the number OL increases of redundancies. the number Changing ofPL andredundanciesOL is related toand changing also the number of repairable faults.number Therefore, of redundancies as the and PL thus increases, a change in a the higher repair repair rate can rate be expected. can be Therefore,achieved to as proceed shown in Figure 8a. In contrast,with when the experiments the PL was with fixed, the same the number repair of redundantrate was MACs,measured experiments while were increasing conducted the under OL. The result is shown in Figure 8b. Increasing OL means decrease of the number of redundancies and the repair rate drops. As shown in Figure 8, the increase of PL means the increase in redundancy, which means an improvement in the repair rate. Similarly, the increase of OL causes the decrease of the repair rate

Electronics 2020, 2, x FOR PEER REVIEW 10 of 14 Electronics 2020, 9, 338 10 of 14 because it reduces the number of redundancies. Changing PL and OL is related to changing the number of redundancies and thus a change in the repair rate can be expected. Therefore, to proceed with the aexperiments given number with of thePL andsameOL numberconditions of redundant to equalize MACs, the experiments number of redundant were conducted MACs. under For example,a given assuming a systolic-array of N N, the number of redundancies when the proposed architecture is not number of PL and OL conditions× to equalize the number of redundant MACs. For example, assuming applieda systolic is -2Narray, where of N ×N Nis, the placed number on theof redundancies right side and whenN is the placed proposed on the architecture bottom side. is not Under applied these is conditions,2N, wherea N fourfold is placed increase on the inright the sidePL willand doubleN is placed the number on the bottom of redundancies, side. Under and these a doubling conditions, of thea OLfourfoldreduces increase the number in theof PL redundancies will double the by number half. Therefore, of redundancies, to have theand same a doubling number of ofthe redundancies, OL reduces whichthe number is 2N, of when redundancies an N N bysystolic-array half. Therefore, is set,to have the repairthe same rate number was measured of redundancies, with the whichPL and is 2NOL, × aswhen (1,1), an (4,2), N × (16,4),N systolic (64,8),-array and is (256,16), set, the repair respectively. rate was The measured results with are shownthe PL inand Figure OL as9. (1,1), As can (4,2), be seen(16,4), from (64,8), the resultsand (256,16) of this, respectively. experiment, The the results repair are rate shown is the in highest Figure in 9.OL As2 can condition. be seen from When theOL isresults increased of this to experiment, 2 or more, the the erepairffect of rate reducing is the highest repair in rate OL due2 condition. to the increase When OL of isOL increased, becomes to more2 or dominantmore, the than effect the of e ffreducingect of increasing repair rate repair due rate to the due increase to the increase of OL, become of PL. Theres more is onedominant more importantthan the thingeffect in of these increasing results. repair As mentioned rate due to in the Section increase3, the of numberPL. There of is used one redundantmore important MACs thing in the in otheseffline repairresults. process As mentioned and the number in Section of the3, the faulty number MACs of areused the redundant same if the MACs faulty in systolic-arraythe offline repair is repairable. process Beforeand the the number offline repairof the process,faulty MACs there are the 256 same redundant if the MACsfaulty systolic in Figure-array9a and is repairable. 512 redundant Before MACs the inoffline Figure repair9b. As process, can be there seen fromare 256 the redundant graphs, the MACs number in Figure of unused 9a and redundant 512 redundant MACs MACs is much in Figure larger than9b. As the can number be seen of from used the redundant graphs, MACsthe number even of in unused the situations redundant which MACs have is lowmuch repair larger rate. than These the resultsnumber prove of used the redundant need for the MACs online even repair in the process situations with which unused have redundant low repair MACs.rate. These results prove the need for the online repair process with unused redundant MACs.

(a) (b) Electronics 2020, 2, x FOR PEER REVIEW 11 of 14 Figure 8. Repair ratecomparison with various PLs and OLs: (a) Fix OL, change PL;(b) Fix PL, change OL. Figure 8. Repair rate comparison with various PLs and OLs: (a) Fix OL, change PL; (b) Fix PL, change OL.

(a) (b)

Figure 9. Repair rate comparison with 2N redundancies: (a) 128 128; (b) 256 256. × ×

Figure 9. Repair rate comparison with 2N redundancies: (a) 128 × 128; (b) 256 × 256.

For comparison with the previous architecture [12], the experiments with the number of redundancies is set to N are conducted. Since the previous architecture places redundancy only on its right side, the number of redundancies in this architecture of N × N is determined as N. As can be seen in Table 1, even with the same number of redundancies, the redundancy architecture where redundancies on the right and bottom side has a much higher repair rate than that of the previous architecture under the condition that N is 128, 256, and 512. In addition, it can be seen that, even in this case, as PL and OL increase, the repair rate decreases slightly even if the number of redundancies is the same.

Table 1. Repair rates comparison with the previous architecture (unit: %).

Array size 128 × 128 256 × 256 512 × 512 # of Prev. PL: 4 PL: 16 Prev. PL: 4 PL: 16 Prev. PL: 4 PL: 16 faults [12] OL: 4 OL: 8 [12] OL: 4 OL: 8 [12] OL: 4 OL: 8 10 69.81 99.61 99.18 83.82 99.96 99.90 91.62 99.99 99.99 20 21.28 94.27 90.96 46.83 99.18 98.67 68.40 99.89 99.82 30 2.50 76.15 68.32 15.08 96.11 94.29 42.28 99.44 99.18 40 0.13 45.32 36.87 4.02 88.65 85.33 21.12 98.30 97.76 50 0.00 16.51 11.92 0.58 75.24 70.65 8.40 96.00 95.13 60 0.00 3.12 2.02 0.06 57.29 51.70 2.74 92.14 90.22 70 0.00 0.25 0.14 0.01 37.34 31.96 0.75 86.26 83.88 80 0.00 0.01 0.00 0.00 19.91 16.36 0.16 78.10 75.39 90 0.00 0.00 0.00 0.00 8.44 6.52 0.00 68.07 64.62 100 0.00 0.00 0.00 0.00 2.61 1.95 0.01 56.74 53.06 110 0.00 0.00 0.00 0.00 0.58 0.44 0.00 44.06 40.72 120 0.00 0.00 0.00 0.00 0.09 0.07 0.00 32.18 29.30

Finally, repair rates of the online repair process with various redundancy architectures are shown in Figure 10. In this experiment, the data from Figure 9b is utilized. It is assumed that 40 faulty MACs are already repaired in the offline repair process. In order to evaluate the performance of the online repair process, the concept of the error rate is defined as follows. An error rate of a single MAC is the probability that an error occurs in a MAC during the normal operation. Similar to the results of

Electronics 2020, 9, 338 11 of 14

For comparison with the previous architecture [12], the experiments with the number of redundancies is set to N are conducted. Since the previous architecture places redundancy only on its right side, the number of redundancies in this architecture of N N is determined as N. As can × be seen in Table1, even with the same number of redundancies, the redundancy architecture where redundancies on the right and bottom side has a much higher repair rate than that of the previous architecture under the condition that N is 128, 256, and 512. In addition, it can be seen that, even in this case, as PL and OL increase, the repair rate decreases slightly even if the number of redundancies is the same.

Table 1. Repair rates comparison with the previous architecture (unit: %).

Array Size 128 128 256 256 512 512 × × × # of Prev. PL: 4 PL: 16 Prev. PL: 4 PL: 16 Prev. PL: 4 PL: 16 Faults [12] OL: 4 OL: 8 [12] OL: 4 OL: 8 [12] OL: 4 OL: 8 10 69.81 99.61 99.18 83.82 99.96 99.90 91.62 99.99 99.99 20 21.28 94.27 90.96 46.83 99.18 98.67 68.40 99.89 99.82 30 2.50 76.15 68.32 15.08 96.11 94.29 42.28 99.44 99.18 40 0.13 45.32 36.87 4.02 88.65 85.33 21.12 98.30 97.76 50 0.00 16.51 11.92 0.58 75.24 70.65 8.40 96.00 95.13 60 0.00 3.12 2.02 0.06 57.29 51.70 2.74 92.14 90.22 70 0.00 0.25 0.14 0.01 37.34 31.96 0.75 86.26 83.88 80 0.00 0.01 0.00 0.00 19.91 16.36 0.16 78.10 75.39 90 0.00 0.00 0.00 0.00 8.44 6.52 0.00 68.07 64.62 100 0.00 0.00 0.00 0.00 2.61 1.95 0.01 56.74 53.06 110 0.00 0.00 0.00 0.00 0.58 0.44 0.00 44.06 40.72 120 0.00 0.00 0.00 0.00 0.09 0.07 0.00 32.18 29.30

Finally, repair rates of the online repair process with various redundancy architectures are shown in Figure 10. In this experiment, the data from Figure9b is utilized. It is assumed that 40 faulty MACs are already repaired in the offline repair process. In order to evaluate the performance of the online

Electronicsrepair process, 2020, 2, x theFOR concept PEER REVIEW of the error rate is defined as follows. An error rate of a single MAC12 of 14 is the probability that an error occurs in a MAC during the normal operation. Similar to the results of thethe offline offline repair repair process, process, redundancy redundancy architectures, architectures, which which have have high high repair repair rates rates in in the the offline offline repair repair process,process, also also show show good results in the online repair process. In In conclusion, the the proposed proposed method method showsshows much much more more improved improved repair repair rates rates compared compared to to the the previous previous work. work. However, However, it is it necessary is necessary to assignto assign the the proper proper value value of ofPLPL andand OLOL consideringconsidering the the number number of of faults, faults, the the error error rate, rate, and and overall overall hardwarehardware overhead.

FigureFigure 10. 10. OnlineOnline repair repair rate rate comparison comparison with with various various PLPLss and and OLOLs.s.

4.3. Hardware Overhead To calculate the hardware overhead of the redundancy architectures, we employed the synthesis tool Synopsys Design Vision [16] and the Synopsys Armenia Educational Department (SAED) 32/28 nm Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that allow signals to be sent to adjacent MACs to reroute the signal and redundant MACs. In this experiment, the additional area overheads were measured by changing PL and OL in a systolic-array having a large size of 128 × 128 or more. Figure 11 shows the ratio of the additional hardware size caused by the redundant configuration for the systolic-array compared to the structure without the redundancy. Increasing OL to 1, 2, 4, and 8 in the same PL condition will inevitably lead to hardware reduction because the number of redundant MACs is reduced. Conversely, when OL is fixed and PL is increased, as shown in Figure 12, the area overhead becomes larger because the redundancies are further allocated for each divided systolic-array partition. In addition, we measured the area overhead while adjusting OL and PL for the area comparison under the conditions of the same number of redundancies. As shown in Figure 13, when the number of redundancies is the same, it can be seen that even if PL and OL change, the overall area overhead is similar. However, owing to an increase in OL, the overall area overhead slightly increased because of the addition of a large size MUX for routing in multiple rows and columns. Although 8.9%, 5.8%, and 3.0% hardware overheads are consumed in 8-bit, 16-bit, and 32- bit, respectively, compared to the previous architecture, high repair rates can be achieved through the proposed architecture.

Electronics 2020, 2, x FOR PEER REVIEW 12 of 14 the offline repair process, redundancy architectures, which have high repair rates in the offline repair process, also show good results in the online repair process. In conclusion, the proposed method shows much more improved repair rates compared to the previous work. However, it is necessary to assign the proper value of PL and OL considering the number of faults, the error rate, and overall hardware overhead.

Electronics 2020, 9, 338 12 of 14 Figure 10. Online repair rate comparison with various PLs and OLs. 4.3. Hardware Overhead 4.3. Hardware Overhead To calculate the hardware overhead of the redundancy architectures, we employed the synthesis tool SynopsysTo calculate Design the hardware Vision [16 overhead] and the Synopsys of the redundancy Armenia Educationalarchitectures, Department we employed (SAED) the synthesis 32/28 nm tool Synopsys Design Vision [16] and the Synopsys Armenia Educational Department (SAED) 32/28 Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that allow nm Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that signals to be sent to adjacent MACs to reroute the signal and redundant MACs. In this experiment, allow signals to be sent to adjacent MACs to reroute the signal and redundant MACs. In this the additional area overheads were measured by changing PL and OL in a systolic-array having a large experiment, the additional area overheads were measured by changing PL and OL in a systolic-array size of 128 128 or more. having a large× size of 128 × 128 or more. Figure 11 shows the ratio of the additional hardware size caused by the redundant configuration Figure 11 shows the ratio of the additional hardware size caused by the redundant configuration for the systolic-array compared to the structure without the redundancy. Increasing OL to 1, 2, 4, for the systolic-array compared to the structure without the redundancy. Increasing OL to 1, 2, 4, and and 8 in the same PL condition will inevitably lead to hardware reduction because the number of 8 in the same PL condition will inevitably lead to hardware reduction because the number of redundant MACs is reduced. Conversely, when OL is fixed and PL is increased, as shown in Figure 12, redundant MACs is reduced. Conversely, when OL is fixed and PL is increased, as shown in Figure the area overhead becomes larger because the redundancies are further allocated for each divided 12, the area overhead becomes larger because the redundancies are further allocated for each divided systolic-array partition. In addition, we measured the area overhead while adjusting OL and PL systolic-array partition. In addition, we measured the area overhead while adjusting OL and PL for for the area comparison under the conditions of the same number of redundancies. As shown in the area comparison under the conditions of the same number of redundancies. As shown in Figure Figure 13, when the number of redundancies is the same, it can be seen that even if PL and OL change, 13, when the number of redundancies is the same, it can be seen that even if PL and OL change, the the overall area overhead is similar. However, owing to an increase in OL, the overall area overhead overall area overhead is similar. However, owing to an increase in OL, the overall area overhead slightly increased because of the addition of a large size MUX for routing in multiple rows and slightly increased because of the addition of a large size MUX for routing in multiple rows and columns. Although 8.9%, 5.8%, and 3.0% hardware overheads are consumed in 8-bit, 16-bit, and 32-bit, columns. Although 8.9%, 5.8%, and 3.0% hardware overheads are consumed in 8-bit, 16-bit, and 32- respectively, compared to the previous architecture, high repair rates can be achieved through the bit, respectively, compared to the previous architecture, high repair rates can be achieved through proposed architecture. the proposed architecture.

Electronics 2020, 2, x FOR PEER REVIEW 13 of 14

Figure 11. HardwareHardware overhead comparison for various OLs.

Figure 12. HardwareHardware overhead comparison for various PLs.

Figure 13. Hardware overhead comparison under the same number of redundancies condition.

5. Conclusion To ensure the reliability of the systolic-array architecture, which is widely used in neural- network computing, a new efficient redundancy architecture is proposed. The proposed architecture can repair a larger number of faults than the previous redundancy architecture, with reasonable hardware overhead. Moreover, unused redundant MACs are utilized in the online repair process through programmable fuses. For maximizing the efficiency of the online repair process, group-based repair strategy is adopted in the entire repair process. This provides maximum redundancies utilization by minimizing hardware overhead waste.

Author Contributions: Conceptualization, K.C. and I.L.; methodology, K.C. and I.L.; software, K.C. and I.L.; validation, K.C., I.L., and H.L.; formal analysis, I.L.; investigation, K.C. and H.L.; resources, K.C.; data curation, K.C.; writing—original draft preparation, K.C. and I.L.; writing—review and editing, K.C. and S.K.; visualization, K.C. and I.L.; supervision, S.K.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by Samsung Electronics Company, Ltd., Hwaseong, Korea.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Stateline, NV, USA, 3– 8 December 2012; pp. 1097–1105. 2. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei, L.F. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1725–1732. 3. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.

Electronics 2020, 2, x FOR PEER REVIEW 13 of 14

Figure 11. Hardware overhead comparison for various OLs.

Electronics 2020, 9, 338 13 of 14 Figure 12. Hardware overhead comparison for various PLs.

Figure 13. Hardware overhead comparison under the same number of redundancies condition.

5. Conclusion Conclusions ToTo ensure ensure the the reliability reliability of theof the systolic-array systolic-array architecture, architecture, which which is widely is widely used in used neural-network in neural- computing,network computing, a new e ffia newcient efficient redundancy redundancy architecture architecture is proposed. is proposed. The proposedThe proposed architecture architecture can repaircan repair a larger a larger number number of faults of than faults the than previous the previous redundancy redundancy architecture, architecture, with reasonable with reasonable hardware overhead.hardware overhead. Moreover, Moreover, unused redundant unused redundant MACs are MACs utilized are inutilized the online in the repair online process repair throughprocess programmablethrough programmable fuses. For fuses maximizing. For maximizing the efficiency the efficiency of the online of the repaironline process,repair process, group-based group- repairbased strategyrepair strategy is adopted is ad inopted the entire in the repair entire process. repair This process. provides This maximum provides redundancies maximum redundanc utilizationies by minimizingutilization by hardware minimizing overhead hardware waste. overhead waste.

Author Contributions:Contributions: Conceptualization,Conceptualization, K K.C..C. and I I.L.;.L.; methodology methodology,, K K.C..C. and I.L.;I.L.; software software,, K.C. and I.L. I.L.;; validation,validation, K.C.,K.C., I.L.,I.L., and H.L.;H.L.; formal analysis,analysis, I.L.;I.L.; investigation,investigation, K K.C..C. and H.L. H.L.;; resources resources,, K K.C.;.C.; data curation curation,, K.C.;K.C.; writing—originalwriting—original draftdraft preparation,preparation, K.C.K.C. andand I.L.;I.L.; writing—reviewwriting—review andand editing,editing, K.C.K.C. andand S.K.;S.K.; visualization,visualization, K.C.K.C. and I.L.;I.L.; supervision,supervision, S.K.;S.K.; project administration,administration, S.K.;S.K.; funding acquisition,acquisition, S.K.S.K. All authors have read and agreed to the published version of the manuscript. agreed to the published version of the manuscript. Funding: This work was supported by Samsung Electronics Company, Ltd., Hwaseong, Korea. Funding: This work was supported by Samsung Electronics Company, Ltd., Hwaseong, Korea. Conflicts of Interest: The authors declare no conflict of interest. Conflicts of Interest: The authors declare no conflict of interest. References References 1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. 1. InKrizhevsky Proceedings, A.; Sutskever, of the Advances I.; Hinton, in Neural G.E. Imagenet Information classification Processing with Systems deep convolutional (NIPS), Stateline, neural NV,networks USA,. 3–8In Proceedings December 2012; of the pp. Advances 1097–1105. in Neural Information Processing Systems (NIPS), Stateline, NV, USA, 3– 2. Karpathy,8 December A.; 2012 Toderici,; pp. 1097 G.; Shetty,–1105. S.; Leung, T.; Sukthankar, R.; Fei, L.F. Large-scale video classification with 2. convolutionalKarpathy, A.; Toderici, neural networks. G.; Shetty, In S.; Proceedings Leung, T.;of Sukthankar, the IEEE Conference R.; Fei, L.F. on Large Computer-scale video Vision classification and Pattern Recognition,with convolutional Columbus, neural OH, networks USA, 24–27. In JuneProceedings 2014; pp. of 1725–1732. the IEEE Conference on Computer Vision and 3. Sutskever,Pattern Recogniti I.; Vinyals,on, Columbus, O.; Le, Q.V. OH, Sequence USA, 24 to– sequence27 June 2014 learning; pp. 1725 with–1732. neural networks. In Proceedings of 3. theSutskever Advances, I.; Vinyals, in Neural O.; Information Le, Q.V. Sequence Processing to sequence Systems learning (NIPS), with Montreal, neural QC, networks Canada,. In 8–13Proceedings December of 2014;the Advances pp. 3104–3112. in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 4. Ma,2014 Y.;; pp. Suda, 3104 N.;–3112. Cao, Y.; Seo, J.S.; Vrudhula, S. Scalable and modularized rtl compilation of convolutional neural networks onto fpga. In Proceedings of the Field Programmable Logic and Applications (FPL) 26th International Conference, Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–8. 5. Kung, H.T. Why systolic architectures? IEEE Comput. 1982, 15, 37–46. [CrossRef] 6. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, Toronto, ON, Canada, 26 June 2017; pp. 1–12. 7. Paine, S.W.; Fienup, J.R. Machine learning for improved image-based wavefront sensing. Opt. Lett. 2018, 43, 1235–1238. [CrossRef][PubMed] 8. Wu, C.; Ko, J.; Davis, C.C. Lossy wavefront sensing and correction of distorted laser beams. Appl. Opt. 2020, 59, 817–824. [CrossRef] Electronics 2020, 9, 338 14 of 14

9. Liu, S.; Tang, J.; Wang, C.; Wang, Q.; Gaudiot, J.L. A unified cloud platform for autonomous driving. Computer 2017, 50, 42–49. [CrossRef] 10. Hoang, T.T.; Själander, M.; Edefors, P.L. A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit. Circuits Syst. I Regul. Pap. IEEE Trans. 2010, 57, 3073–3081. [CrossRef] 11. Kung, H.T.; Lam, M.S. Fault-tolerance and two-level pipelining in vlsi systolic arrays. Tech. Rep. Carnegie Mellon UNIV 1983.[CrossRef] 12. Kim, J.H.; Reddy, S.M. On the design of fault-tolerant two-dimensional systolic arrays for yield enhancement. IEEE Trans. Comput. 1989, 38, 515–525. [CrossRef] 13. Takanami, I.; Horita, T.; Akiba, M.; Terauchi, M.; Kanno, T. A built-in self-repair circuit for restructuring mesh-connected processor arrays by direct spare replacement. In Transactions on Computational Science XXVII, LNCS 9570; Springer: Berlin/Heidelberg, Germany, 2016; pp. 97–119. 14. Zhang, J.; Gu, T.; Basu, K.; Garg, S. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. In Proceedings of the IEEE VLSI Test Symposium, San Francisco, CA, USA, 22–25 April 2018; pp. 1–6. 15. Lee, C.; Kang, W.; Cho, D.; Kang, S. A new fuse architecture and a new post-share redundancy scheme for yield enhancement in 3-d-stacked memories. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2014, 33, 786–797. 16. Synopsys Design Vision, Synopsys, Mountain View, CA, USA. Available online: https://www.synopsys. com/content/dam/synopsys/implementation&signoff/datasheets/design-compiler-nxt-ds.pdf (accessed on 14 February 2020). 17. The Synopsys Armenia Educational Department (SAED) 32/28nm Open Cell Library, Synopsys, Mountain View, CA, USA. Available online: https://www.synopsys.com/community/university-program/teaching- resources.html (accessed on 14 February 2020).

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).