
A Neural Network Approach to Phonocardiography

Ian Cathers

Master of Biomedical Engineering
The University of New South Wales
1991

Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of a university or other institute of higher learning, except where due acknowledgement is made in the text.

Signed: Dated:

Acknowledgements

I wish to express my thanks to the following people for their assistance in a variety of ways:-

Dr Albert Avolio (Centre for Biomedical Engineering, University of New South Wales) for his supervision of the project and his encouragement in the broader educational and professional objectives that such a task can catalyse.

Mr John Telec and Mr Robert Mannell (School of Linguistics, Macquarie University) for their time spent in helping with the learning curve associated with the Kay Sonograph.

Dr Phillip Harris and Mr David Hardy (Department of Cardiology, Royal Prince Alfred Hospital, Sydney), for their time spent in resurrecting a moth-balled M-mode echocardiograph for phonocardiographic use.

Dr Michael Feneley, Sue and Meli (Cardiology Department, St Vincent's Hospital, Sydney) for their willingness to adjust schedules to allow access to equipment and subjects.

Dr Walter Ivanstoff (Macquarie University) for his English translation of a Russian paper.

Anne, the only person to have had an overview of the whole project, and who was encouraging in every aspect.

Miriam and Timothy, who put up with weeks of RFI in the AM broadcast band while sluggish neural networks struggled with seemingly simple tasks.

Abstract

This project is a pilot study of the feasibility of automated heart sound recognition using neural network computing techniques.

In the past, human heart sounds have proven to be an important diagnostic tool, for valvular disease in particular. More recently, their importance has been eclipsed by direct visualisation techniques. Despite these current emphases in diagnostic methodologies, a cheap tool for the automated recognition and classification of heart sounds may prove a useful primary screening device, considering the high capital investment required for direct visualisation hardware.

Heart sounds from a variety of cardiovascular pathologies were digitised, pre-processed, and characterised before being used as input data for software-implemented multi-layer neural networks. The neural networks were trained by backpropagation under a variety of input data, network topology and learning rate parameters. Trained networks were also used to classify heart sounds not previously encountered in the training data, in order to quantify their ability to generalise.

The rate of the networks' approach to correct classification of the training data was found to be highly dependent on the gain term, and less strongly dependent on the size of the momentum term. On the other hand, the likelihood of reaching the correct classification of the training data was dependent on the nature of the input data.

Of particular significance was the implementation of a small scale normal/abnormal heart sound classifier, which showed excellent accuracy in classifying a small range of untested heart sounds.

While the task of training networks with even the limited output requirements studied here proved to be highly computationally expensive, the classifications from an implemented network were fast and accurate. A viable diagnostic tool would require a far wider range of input data, and significant computing facilities for training. It may prove to be more effective as a front-end pre-processor for a rules-based system, considering the complexities and limitations of differential cardiac diagnosis from heart sounds alone.

Preface

This project report is structured in the following way:-

Chapter 1 Neural Networks - A Review. Introduction to neural networks and a technical review of the literature.

Chapter 2 Heart Sounds and Phonocardiography - A Review. General background and review of auscultation and phonocardiography.

Chapter 3 The Heart Sound Signal - Pre-Processing and Characterisation. Materials, methods and computational strategies followed in obtaining and pre-processing the heart sounds which were used as input data for the neural networks.

Chapter 4 Network Training. Investigations into some of the parameters affecting the training of neural networks in heart sound recognition.

Chapter 5 Heart Sound Recognition. Investigations into some heart sound recognition tasks using neural networks.

Chapter 6 Conclusions. Conclusions from this study and future directions of work in this area.

To provide a more intuitive structure to Chapters 4 and 5, materials and methods of general applicability are grouped together, such as in Chapter 3 and Sections 4.2 and 5.3. Experimental results are organised by topic, rather than under separate Materials and Methods, Results, Discussion and Conclusions headings. Methodology specific to a particular investigation is discussed together with its results under these topic headings.

Contents

1. Neural Networks - A Review
   1.1. Introduction
      1.1.1. Classifiers
         1.1.1.1. Neuron Element
         1.1.1.2. Training Procedure
         1.1.1.3. Classification Process for New Data
   1.2. Neural Network Taxa
      1.2.1. Nodal Types
         1.2.1.1. Method of combination of weights and inputs
         1.2.1.2. Linearity
         1.2.1.3. Number Types
         1.2.1.4. Determinism
      1.2.2. Topologies
         1.2.2.1. Network Size and Capacity
         1.2.2.2. Direction of Information Flow
      1.2.3. Heuristics
         1.2.3.1. Supervision
         1.2.3.2. Delta Rule
         1.2.3.3. Generalised Delta Rule - Backpropagation
            1.2.3.3.1. Convergence
               1.2.3.3.1.1. Visualisation
         1.2.3.4. Other Algorithms
         1.2.3.5. Energy
         1.2.3.6. Coding in the Hidden Layers
         1.2.3.7. Comparison with Traditional Classifiers
   1.3. Advantages and Disadvantages
      1.3.1. Generalisation
      1.3.2. Graceful Degradation
      1.3.3. Speed, Scaling and System Requirements
      1.3.4. Hardware Implementation
   1.4. Applications
      1.4.1. Speech Recognition & Synthesis
      1.4.2. ECG
      1.4.3. EEG
      1.4.4. Image Recognition
      1.4.5. AI and Expert Systems
      1.4.6. Filtering
      1.4.7. Other

2. Heart Sounds and Phonocardiography - A Review
   2.1. Synopsis
   2.2. Introduction
   2.3. Cardiac Anatomy
   2.4. The Cardiac Cycle
   2.5. Heart Sounds and Murmurs
      2.5.1. Production Mechanisms
      2.5.2. Factors Affecting Auscultated Sounds
         2.5.2.1. Regions for Auscultation
         2.5.2.2. Sound Transmission
      2.5.3. The First Heart Sound
      2.5.4. The Second Heart Sound
      2.5.5. The Third Heart Sound
      2.5.6. The Fourth Heart Sound
      2.5.7. Summation Gallop
      2.5.8. Opening Snaps
      2.5.9. Systolic Clicks
      2.5.10. Murmurs
         2.5.10.1. Systolic Murmurs
         2.5.10.2. Diastolic Murmurs
         2.5.10.3. Continuous Murmurs
      2.5.11. Friction Rubs
      2.5.12. Summary
      2.5.13. Differential Diagnosis
   2.6. Auscultation and Phonocardiography
      2.6.1. Auscultation
      2.6.2. Phonocardiography
         2.6.2.1. Spectrocardiography
         2.6.2.2. Sonvelography
         2.6.2.3. Recognition
   2.7. Conclusions

3. The Heart Sound Signal - Pre-Processing and Characterisation
   3.1. Synopsis
   3.2. Introduction
   3.3. Sources
      3.3.1. Pre-Recorded
      3.3.2. Subject Recordings
      3.3.3. Summary of Recordings Used
   3.4. Digitisation
   3.5. Pre-Processing
      3.5.1. Rationale
      3.5.2. Methods
      3.5.3. Program Cardiac
   3.6. Characterisation
      3.6.1. Dimensional Overlap
      3.6.2. Average Inter-Class Correlations
      3.6.3. Average Inter-Class Distance
      3.6.4. Program Hyper.pas
   3.7. Conclusions

4. Network Training
   4.1. Synopsis
   4.2. General Materials and Methods
   4.3. Factors Affecting Learning
      4.3.1. Gain
      4.3.2. Order of Pattern Presentation
      4.3.3. Momentum
      4.3.4. Topology
      4.3.5. Nature of the Training Data
   4.4. Conclusions

5. Neural Network Phonocardiograph Classifiers
   5.1. Synopsis
   5.2. Introduction
   5.3. General Materials and Methods
   5.4. Factors Affecting Classification Success
      5.4.1. Variable Topology
      5.4.2. Training Data
   5.5. A Normal/Abnormal Classifier
   5.6. Conclusions

6. Conclusions
   6.1. Future Directions and Applications

Appendix A - Program CARDIAC.PAS

Appendix B - Program HYPER.PAS

Appendix C - Program MULTI.PAS (Turbo Pascal Version)

Appendix D - Program MV.PAS (VAX Version)

Appendix E - Parameter Identification

References

1. Neural Networks - A Review

1.1. Introduction

Higher animals perform visual and auditory perception tasks with an efficiency which far outstrips even the most sophisticated machines. This higher performance is revealed in terms of raw processing speed and the "intelligence" with which generalisations are made and new data is integrated with old.

These performance advantages over current computers are directly attributable to the computational and physical topology of the brain's "wetware" and associated peripheral sensory organs.

With approximately 10^11 neurons each having up to 10,000 connections [Feldman, 1985], the brain is a highly parallel system.

The brain's parallelism does confer speed advantages since a greater data width is being processed. However, the parallel nature of the brain is qualitatively different from that of a conventional vector processor, say, which also deals with a broad data width. The brain's computational strategy is intrinsic to the connectivity of its neurons. The topology, number and strength of the connections not only determine how something is learned or processed, but also what has been learnt.

The brain's parallelism also allows pieces of data to be compared across large chunks of data simultaneously. In effect, the brain has the ability to consider the whole (or large pieces of the whole) at one time - rather than a stepwise comparison of parts. Complex information is often encoded in the relationship between many parts of a signal - and it is only by a comparison of the many interrelationships that the informational content is elucidated.

Neural network computing (or connectionism or parallel distributed processing) is a technique for mimicking some of the features of brain-like computation. The original motivation for its development was the desire to obtain a greater understanding of brain functioning [see Hebb, 1949, for instance]. As models became more complex, and some of the computational advantages were realised, potential applications became another driving force.

It is the brain's special type of parallelism which allows people to distinguish different faces easily - a task of extreme difficulty for a traditional computer. The process of associating and classifying the image of a face involves the simultaneous comparison of data from all over the face. Of course, the task could be achieved by a complex set of rules applied to different parts of the face in a linear fashion; however, this would make the classification process exceedingly slow. It would also require enormous storage capacity, as different rules or templates would be needed for different facial orientations and expressions, for instance.

The classification problem will be considered as a means of introducing neural networks.

1.1.1. Classifiers

Consider characteristics of a person which may be denoted numerically, such as:

height (x1)

mass (x2)

Each person, using the parameters above, is denoted by a point in 2-D space ie. (x1,x2).

It may be thought that these parameters have good predictive value in classifying people into a number of groups - male and female, for example. When data from a number of people is plotted on the Cartesian plane, they may show up these natural groupings, as shown in Figure 1.1.

Fig. 1.1: The separation of two classes using height and mass data.

Once a boundary line between these two groupings can be determined, it becomes a simple matter of determining upon which side of the line some new point falls to decide whether the corresponding person is male or female. Thus the line acts as a decision surface allowing us to classify into male or female depending on a person's height and weight.

Although most of the following argument will be using the two dimensional case for the purpose of clarity, the arguments can be extended to many dimensions. As well as a person's height and weight, we may add blood sugar concentration and mean heart rate for instance. Each person would then be represented by a vector in 4-space.

1.1.1.1. Neuron Element

It is possible to develop a computational device which implements this classification scheme. Devices such as the neuron element in an ADALINE [Widrow and Hoff, 1960], or a single layer Perceptron [Block, 1962], similar to that shown in Figure 1.2, will achieve this in the following way.


Fig 1.2: The inputs and outputs for a Perceptron computational element.

The input data (x1, x2) is multiplied by corresponding weights w1 and w2 and then summed together with a threshold value θ, giving Σi xiwi - θ. The perceptron takes the value thus obtained and operates a hard-limiting function, f, on it. This is shown graphically in Figure 1.3.

Fig. 1.3: The hard limiting function of a single layer Perceptron.

Thus, the final output y is:

y = -1 for Σi xiwi - θ < 0
y = +1 for Σi xiwi - θ ≥ 0.

For non-numerical classifications (eg. male/female), the output is coded. So, in this example, female may be coded by +1 and male by -1.

For linearly separable classes, there exists a set of weights w1 and w2, and a threshold value θ, such that a possible boundary line between the 2 classes is given by:-

w_1 x_1 + w_2 x_2 - \theta = 0    ... (1.1)

This of course raises the question of how the weights and threshold are determined.

1.1.1.2. Training Procedure

One method for training such a perceptron is as outlined below:-

1. Set the weights and threshold to random values.
2. Present an exemplar, with known classification, as input.
3. Calculate the actual output of the perceptron.
4. Compare the actual output with that which is desired. If these differ, adjust the weights and threshold by an amount proportional to the difference between desired and actual output.
5. Repeat with the next input, until actual outputs always match desired outputs.

Assuming the classes of inputs are linearly separable, this procedure will eventually converge on values for weights and threshold which allow the classes to be distinguished.
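A minimal sketch of this procedure for a two-input perceptron is given below. It is written in Pascal, the language of the programs in the appendices, but it is illustrative only and not part of the thesis software; the exemplar data, gain value and epoch limit are hypothetical.

program PerceptronSketch;
{ Illustrative sketch of the perceptron training procedure of Section 1.1.1.2. }
{ Not part of the thesis programs; exemplars, gain and epoch limit are made up. }
const
  NumInputs   = 2;
  NumPatterns = 4;
  Gain        = 0.1;
  MaxEpochs   = 1000;
type
  TInput = array[1..NumInputs] of Real;
var
  x      : array[1..NumPatterns] of TInput;
  d      : array[1..NumPatterns] of Integer;  { desired outputs, +1 or -1 }
  w      : TInput;                            { weights }
  theta  : Real;                              { threshold }
  s      : Real;
  epoch, p, i, y, errors : Integer;

function HardLimit(net: Real): Integer;
begin
  if net >= 0.0 then HardLimit := 1 else HardLimit := -1
end;

begin
  { hypothetical exemplars: (height, mass) pairs }
  x[1][1] := 1.85; x[1][2] := 95.0; d[1] := -1;   { "male" coded -1 }
  x[2][1] := 1.78; x[2][2] := 88.0; d[2] := -1;
  x[3][1] := 1.60; x[3][2] := 55.0; d[3] := 1;    { "female" coded +1 }
  x[4][1] := 1.65; x[4][2] := 60.0; d[4] := 1;

  { step 1: random initial weights and threshold }
  Randomize;
  for i := 1 to NumInputs do w[i] := Random - 0.5;
  theta := Random - 0.5;

  { steps 2-5: present exemplars and adjust until all are classified correctly }
  epoch := 0;
  repeat
    errors := 0;
    for p := 1 to NumPatterns do
    begin
      s := -theta;
      for i := 1 to NumInputs do s := s + w[i] * x[p][i];
      y := HardLimit(s);
      if y <> d[p] then
      begin
        errors := errors + 1;
        for i := 1 to NumInputs do
          w[i] := w[i] + Gain * (d[p] - y) * x[p][i];
        theta := theta - Gain * (d[p] - y)
      end
    end;
    epoch := epoch + 1
  until (errors = 0) or (epoch = MaxEpochs);
  Writeln('Epochs used: ', epoch, '   weights: ', w[1]:8:4, w[2]:8:4,
          '   threshold: ', theta:8:4)
end.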

1.1.1.3. Classification Process for New Data

Having determined the values for weights and threshold using a training procedure such as that above, an unknown input can then be presented to the perceptron and its output obtained. So long as this new input data is representative of one of the exemplar classes for which the perceptron was trained, the output will probably provide the correct classification.

The conditional nature of the above statement is necessary because it is possible for the new input to clearly be part of one of the classes, but still be on the "wrong" side of the line which was achieved with the training data - such points will be outliers. This will not happen when the new data is "embedded" within the training data. An example of this type of misclassification is shown in Figure 1.4.

Fig 1.4: Misclassification of an outlying data point.

For data in which different classes overlap each other, or indeed, may be embedded within each other, more complex boundary regions are required. The classes are no longer linearly separable, but require multiple lines (hyperplanes in multi-dimensional space) to partition their decision regions. The complexity of the required partitioning determines the necessary network configuration, as discussed in Section 1.2.2 on Topologies.

1.2. Neural Network Taxa

There are 3 main components to a neural network system - the network topology (including the nodal characteristics), the actual weights between nodes, and the training algorithm. The interdependency between these is shown in Figure 1.5.

Fig. 1.5: The relationship between various aspects of a neural network system.

It is the network topology which determines the training algorithms which may be used. This will be dealt with more fully in Section 1.2.2 on Topologies. Of course, once training has been completed, the particular algorithm used is of no consequence to the operation of the network.

Topology is also the determinant of the range of tractable classification tasks, and this is covered in Section 1.2.3 on Heuristics.

It is the weights between nodes which define what has actually been learnt. Thus two networks may have identical topologies and have been trained using identical algorithms, but because different sets of training patterns have been used, different sets of weights result. The networks have learnt different information.

1.2.1. Nodal Types

The nodes in a neural network have three main features:-

1.2.1.1. Method of combination of weights and inputs

Most commonly, nodes operate as summation units, where an intermediate output y is formed from weight vector W and input vector X as below:-

y = f(W \cdot X^{T})    ... (1.2)

ie.

y = f\left( \sum_i w_i x_i \right)    ... (1.3)

The nature of the function f is described in Section 1.2.1.2 on Linearity.

It is also possible to have units which use the products of inputs - as the exclusive type, or in combination with summation units. The output of a "sigma-pi" unit is given by:-

y = f\left( \sum_i w_{ij} \prod_{k \in S_i} x_k \right)    ... (1.4)

where each product is over all the inputs x_k which are elements of the subset S_i, ie. S_i denotes a conjunction of inputs to the sigma-pi unit. Such a structure allows one input to "gate" another [Rumelhart, Hinton & McClelland, 1986].

Another alternative is the product unit, as proposed by Durbin and Rumelhart [1989]. The output of this type of unit is given by:-

y = \prod_i x_i^{p_i}    ... (1.5)

where p_i is the "weight"; a power to which x_i is raised.

Such units in combination with summation units can represent any general type of polynomial term in the input - allowing relationships between inputs which are 2nd, 3rd order etc.

1.2.1.2. Linearity

The function f above determines the linearity (or otherwise) of the node. Multi-layer networks confer no advantages over single layer ones if the function f is linear. It is easy to show that successive linear operations on a set of inputs can be replaced by a single linear operation [see Jordan, 1986, for instance]. Multi-layer topologies confer rich possibilities for decision boundaries when non-linear functions are used in the nodes. The most commonly used non-linear functions are shown in Figure 1.6:-


Hard Limiting Threshold Logic Sigmoidal or Logistic

Fig. 1.6: Non-linearities commonly used in neural network nodes.

The sigmoidal or logistic function in particular has received a great deal of attention because it allows a particularly powerful training algorithm to be implemented - backpropagation, which is described in Section 1.2.3.3. Beckman and Freeman [1986] have shown that a sigmoidal function does have a biological basis in the rat olfactory bulb, at least.
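For concreteness, the three non-linearities of Figure 1.6 may be written as simple Pascal functions, as in the following sketch. This is illustrative only (not thesis code), and the unit slope and clipping limits chosen for the threshold-logic function are arbitrary assumptions.

program Nonlinearities;
{ Illustrative sketch of the node non-linearities of Figure 1.6 (not thesis code). }
var
  t : Real;

function HardLimiting(t: Real): Real;
begin
  if t >= 0.0 then HardLimiting := 1.0 else HardLimiting := -1.0
end;

function ThresholdLogic(t: Real): Real;
{ linear between 0 and 1, clipped outside; the limits chosen here are arbitrary }
begin
  if t <= 0.0 then ThresholdLogic := 0.0
  else if t >= 1.0 then ThresholdLogic := 1.0
  else ThresholdLogic := t
end;

function Sigmoid(t: Real): Real;
{ the sigmoidal (logistic) function favoured for backpropagation }
begin
  Sigmoid := 1.0 / (1.0 + Exp(-t))
end;

begin
  t := -2.0;
  while t <= 2.0 do
  begin
    Writeln(t:6:2, HardLimiting(t):8:2, ThresholdLogic(t):8:2, Sigmoid(t):8:4);
    t := t + 0.5
  end
end.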

1.2.1.3. Number Types

The outputs of some nodes are real numbers and therefore lie on a continuum, while others may take only binary values (1 and 0, or -1 and +1, for instance). This distinction can be in some senses artificial, and it is possible to have a blend of the two - using integers from a fixed range, for instance. The binary approach has the twin advantages of reduced storage requirements and the higher processing speed of integer arithmetic on conventional computers.

The type of node implemented is largely determined by the problem to be solved. Some quantities are more "naturally binary" such as pixel states in a black and white image, while others, such as speech parameters, tend to be continuous-valued.

The chosen nodal type is a determinant of the set of possible training algorithms.

1.2.1.4. Determinism

The output of a node may be either stochastic or deterministic. If stochastic, the node must be binary in nature and its output is based on a probability which is a function of its inputs and weights, such as the sigmoidal or logistic function described in Section 1.2.1.2 on Linearity, above.

Deterministic nodes may be binary or continuous-valued, their outputs being directly set by a function of inputs and weights.

The determinism of the nodal type is also a deciding factor for the set of possible training algorithms.

Examples of networks of varying number type and determinism are tabulated below:-

                                     DETERMINISM
                          stochastic                        deterministic
NUMBER      binary        Boltzmann Machine                 Hopfield, 1982
TYPE                      (eg. Ackley et al, 1985)
            continuous                                      Hopfield, 1984

1.2.2. Topologies

Classification tasks which involve decision boundaries more complex than single hyperplanes, require multi-layer network topologies with non-linear nodal characteristics.

Each node in the first layer divides the decision space into two by a hyperplane. Thus when data are used to train a number of parallel nodes, a set of decision hyperplanes results. These hyperplanes may be intersecting or not. Once trained, the output of each node determines on which side of its corresponding hyperplane a data point lies. An example of a multi-nodal single layer network, together with its decision surface, is shown in Figure 1.7.

Fig. 1.7: A multi-nodal single layer network and its associated half plane decision regions.

If the outputs from these nodes are used as the input for a node in a second layer, it may be set so that its output is high only if all its inputs are high. This is achieved by setting the threshold of the node at the number of inputs into it (assuming each output has a maximum value of 1) - thus it performs a logical AND operation. Since the resulting decision region is the intersection of a number of half-planes, it is of necessity convex. An example is shown in Figure 1.8.


Fig. 1.8: The decision surface resulting from a second node performing an AND operation on the nodes shown in Figure 1.7.

The reasoning can be extended to a further layer. A region of any topology can be considered to be made up of small convex regions. Thus, taking the outputs from second layer nodes (which define convex regions) and using these for inputs to a third layer node which then performs a logical OR operation, concave and discontinuous regions can be defined. The OR operation can easily be achieved if the weights from second to third layer nodes are 1 and the third layer node has a threshold of 0.5. An example of a 3-layer network and possible decision boundaries is shown in Figure 1.9.

Fig. 1.9: Possible decision regions produced by a 3-layer network.
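The AND/OR construction described above can be illustrated by hand-setting weights, as in the following Pascal sketch (not taken from the thesis programs; the regions, here two disjoint squares, are hypothetical). First-layer hard-limiting nodes test half-planes, two second-layer nodes AND four half-planes each into convex squares, and a third-layer node ORs the two squares into a disjoint, non-convex decision region; as in the text, the AND threshold equals the number of its inputs and the OR threshold is 0.5.

program DecisionRegions;
{ Illustrative sketch (not thesis code): hard-limiting nodes wired by hand so    }
{ that layer 2 ANDs half-planes into convex squares and layer 3 ORs two squares  }
{ into a disjoint decision region, as described in Section 1.2.2.                }
const
  NumHalfPlanes = 8;
type
  TPoint = array[1..2] of Real;
var
  hpW     : array[1..NumHalfPlanes] of TPoint;  { first layer weights }
  hpTheta : array[1..NumHalfPlanes] of Real;    { first layer thresholds }
  pt      : TPoint;

function Hard(net: Real): Integer;
begin
  if net >= 0.0 then Hard := 1 else Hard := 0
end;

function InRegion(p: TPoint): Integer;
var
  h : array[1..NumHalfPlanes] of Integer;
  i, sumA, sumB, andA, andB : Integer;
begin
  { layer 1: one node per half-plane }
  for i := 1 to NumHalfPlanes do
    h[i] := Hard(hpW[i][1] * p[1] + hpW[i][2] * p[2] - hpTheta[i]);
  { layer 2: AND nodes, threshold = number of inputs (4) }
  sumA := h[1] + h[2] + h[3] + h[4];
  sumB := h[5] + h[6] + h[7] + h[8];
  andA := Hard(sumA - 4);
  andB := Hard(sumB - 4);
  { layer 3: OR node, unit weights and threshold 0.5 }
  InRegion := Hard(andA + andB - 0.5)
end;

procedure SetSquare(first: Integer; xlo, xhi, ylo, yhi: Real);
{ four half-planes enclosing xlo <= x <= xhi, ylo <= y <= yhi }
begin
  hpW[first][1]   := 1;  hpW[first][2]   := 0;  hpTheta[first]   := xlo;
  hpW[first+1][1] := -1; hpW[first+1][2] := 0;  hpTheta[first+1] := -xhi;
  hpW[first+2][1] := 0;  hpW[first+2][2] := 1;  hpTheta[first+2] := ylo;
  hpW[first+3][1] := 0;  hpW[first+3][2] := -1; hpTheta[first+3] := -yhi
end;

begin
  SetSquare(1, 0.0, 1.0, 0.0, 1.0);   { convex region A }
  SetSquare(5, 2.0, 3.0, 2.0, 3.0);   { convex region B }
  pt[1] := 0.5; pt[2] := 0.5; Writeln('(0.5,0.5): ', InRegion(pt));
  pt[1] := 2.5; pt[2] := 2.9; Writeln('(2.5,2.9): ', InRegion(pt));
  pt[1] := 1.5; pt[2] := 1.5; Writeln('(1.5,1.5): ', InRegion(pt))
end.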

Huang and Lippmann [1987(b)] first demonstrated that two-layer networks can form non-convex and disjoint decision regions, but showed that there can be highly sensitive interactions between weights making such a network difficult to train. Although some non-convex regions can be classified with only 2 layers of nodes, counterexamples such as an "hourglass" shape do exist [Moore & Poggio, 1988].

While two hidden layers are sufficient to form decision boundaries of arbitrary complexity, it has been suggested that this may not necessarily be an efficient topology for every problem, further layers allowing an overall reduction in the number of nodes and weights required [Lapedes and Farber, 1987].

A summary of network topologies and their associated decision boundaries [after Lippmann, 1987] is shown in Figure 1.10. However, it should be noted at this point that functions other than the hard-limiting non-linearity, such as the sigmoidal function, will not produce "straight" edges, but decision surfaces delimited by smooth curves.

Fig 1.10: Types of decision regions which can be implemented with different neural network topologies (1 layer, 2 layer and 3 layer).

Some multi-layer networks possess connections which skip a layer, feeding forward beyond the subsequent layer. An example is shown in Figure 1.11.

Other connectivities are also possible, such as networks which do not have a layered structure. One example of this type of network is the Kohonen self-organizing network [Kohonen, 1982, 1988(a), 1988(b)], shown in Figure 1.12.

Fig 1.11: A neural network possessing feed-forward connections which "skip" layers.

Fig 1.12: The topology of a "Kohonen Self-Organising" network.

Although superficially appearing to be in a single layer, the complex interconnectivity within this layer results in the capability of forming complex decision boundaries.

1.2.2.1. Network Size and Capacity

For the case of a 3-layer network, there must be sufficient nodes in the first layer so that each convex sub-region has enough boundary hyperplanes. Generally, three or more boundary hyperplanes would be required to delineate each convex area. Noting that the second layer nodes form conjunctions of convex sub-regions, no more than one third of the number of first layer nodes is required in the second layer. The maximum number of second layer nodes is bounded by the number of disconnected regions in the input data. The required number of third layer nodes is determined by the number of classification categories.
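As a purely hypothetical illustration of these rules of thumb (the figures below are not drawn from this study), suppose the input data fall into two classes occupying three disconnected convex clusters, each adequately bounded by four hyperplanes:

   first layer:  3 regions x 4 bounding hyperplanes = 12 nodes
   second layer: at most 12/3 = 4 nodes, but no more than the 3 disconnected regions, ie. 3 nodes
   third layer:  2 nodes (one per classification category)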

Despite these theoretical considerations, studies on a wide range of continuous-value problems by Huang and Lippmann [1987(a)] have indicated that learning performance is enhanced with double the theoretical number of hidden nodes, and 3 layers rather than 2, for cases with large numbers of nodes in the first layer.

19 Neural networks can also be used as "Associative Memories" (also called "Content Addressable Memories") in which the presentation of a partial pattern allows the reconstruction of the full image or data. Rather than learning generalities, the network is trained to remember specific cases [Tank and Hopfield, 1987; Levy, 1988]. The principles involved are very similar to the classification tasks which have been described so far. The capacity of such an associative memory is finite, and general rules of thumb have shown that the capacity of such a memory is less than 0.15N, where N is the number of nodes in the net [Lippmann, 1987]. The capacity is more limited when exemplar patterns are quite similar - the network tends to confuse the patterns.

The size of the network has important implications for the amount of training data required to give a valid generalisation. This is discussed more fully in Section 1.3.1 on Generalisation.

1.2.2.2. Direction of Information Flow

Network topologies discussed so far, and those implemented in this study, involve only the forward flow of information through the network (apart from during the training procedure). However, it is possible for whole networks to be bidirectional, or for some elements within the network to provide a feedback path. A feedback element in a multilayer network may connect adjacent layers or span the whole network. An example of a network with feedback paths is shown in Figure 1.13.

Fig 1.13: A network having feedback paths.

The formal analysis of this type of configuration becomes much more difficult than for uni-directional ones.

1.2.3. Heuristics

A neural network cannot be implemented to solve some recognition/classification task unless it has somehow been trained and the weights adjusted to suit the particular problem. The method for training the network is topology-dependent, and the complexity increases significantly for multi-layer networks.

An active field of neural network research is in finding computationally efficient training algorithms. Efficiency is significant because of the enormous computational demands training makes for all but trivial problems. Having been trained, the classification of an unknown input by the network is computationally simple and correspondingly quick.

The determination of an effective training procedure can be a complex problem; so much so that neural networks themselves ("masters") have been used to train other networks ("slaves") for specific tasks [Lapedes and Farber, 1986]. In effect, the master acts as a neural compiler.

1.2.3.1. Supervision

Training algorithms may be divided into two broad classes - supervised and unsupervised.

Supervised techniques involve presenting training patterns to the input of the network, together with some expert classification of the pattern, which in effect is the desired output for the network. The weights are updated, so that (at least in the long term, for a network which satisfies theoretical constraints), the actual output closely reflects the desired output. The details of how weights are changed vary between supervised training algorithms, some of which are detailed in Sections 1.2.3.2 to 1.2.3.4 below.

On the other hand, unsupervised training algorithms are presented with input data which has not been classified beforehand, although the number of classification categories may need to be specified in some cases. The network then proceeds through a process of clustering the data into like groups. When a new input is presented, it is compared to existing clusters of data - if it is very different from those, it forms the basis for a new class, otherwise it is incorporated into the one which represents it most closely.

1.2.3.2. Delta Rule

In general, the delta rule (or Widrow-Hoff Rule, or Least Mean Square Rule) is a supervised learning algorithm in which the network's classification of input data is compared to the desired classification. The weights in the network are then varied in proportion to the difference between actual and desired output.

ie.

\Delta_p w_{ij} = \eta\,(d_{pj} - y_{pj})\,x_{pi} = \eta\,\delta_{pj}\,x_{pi}    ... (1.6)

where:

Δ_p w_ij is the change in the weight connecting the ith unit to the jth.

η is the "gain" term - providing a measure of how much of the error term is to be applied to the adaptation of weights.

d_pj is the desired output for unit j for the pth pattern.

y_pj is the actual output of unit j on presentation of the pth pattern.

x_pi is the ith input value of pattern p.

δ_pj is the difference between desired and actual outputs of unit j on pattern p.

Following the argument of Rumelhart, Hinton & Williams [1986(a) & (b)], we may introduce an error term for the network on presentation of pattern p:-

E_p = \tfrac{1}{2} \sum_j (d_{pj} - y_{pj})^2, \qquad E = \sum_p E_p    ... (1.7)

Thus the total error between actual and target outputs is summed over all units, j, and over all patterns, p.

If we wish to understand how the error varies with changes in weights, then we require ∂E/∂w_ij. For simplicity, consider only the gradient of the component E_p of the total error, ie. ∂E_p/∂w_ij.

Now,

\frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial y_{pj}} \frac{\partial y_{pj}}{\partial w_{ij}}    ... (1.8)

From (1.7) above, we have

\frac{\partial E_p}{\partial y_{pj}} = -(d_{pj} - y_{pj}) = -\delta_{pj}    ... (1.9)

And if we are dealing with simple linear units, then

y_{pj} = \sum_i w_{ij} x_{pi}    ... (1.10)

where x_pi is the ith input of pattern p, and so:-

\frac{\partial y_{pj}}{\partial w_{ij}} = x_{pi}    ... (1.11)

Substituting equations (1.9) and (1.11) into (1.8) we have:-

\frac{\partial E_p}{\partial w_{ij}} = -\delta_{pj}\,x_{pi}    ... (1.12)

and since

\frac{\partial E}{\partial w_{ij}} = \sum_p \frac{\partial E_p}{\partial w_{ij}}    ... (1.13)

it can be stated that

-\frac{\partial E}{\partial w_{ij}} \propto \sum_p \delta_{pj}\,x_{pi}    ... (1.14)

Thus, starting with the condition of learning by the delta rule, equation (1.6), we have shown that a gradient descent in the error, E, is performed.

With a single layer network composed of linear nodes, the error function is bowl-shaped, and so location of the global minimum is guaranteed.

It should be noted that a number of significant assumptions have been made in this derivation, including:-

1. The network is a single layer.

This assumption can be relaxed with the application of the generalised delta rule, as described in Section 1.2.3.3 below.

2. The nodal units are linear.

This assumption can also be relaxed and the above proved for non-decreasing, differentiable functions, such as the logistic function - as commonly used in multi-layered networks using the generalised delta rule [Rumelhart, Hinton and Williams, 1986].

23 3. The weights are not changed during a complete cycle of pattern presentation.

Some implementations of this learning algorithm only update the weights at the end of a complete cycle of pattern presentation (an "epoch"), and so satisfy this condition [Rumelhart and McClelland, 1988]. However, weight adjustment after each pattern presentation will not lead to a gross departure from this assumption, particularly if the learning rate is small [Rumelhart, Hinton and Williams, 1986(a)].
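A minimal sketch of the delta rule of equation (1.6) for a single linear unit is given below. It is illustrative only (not one of the thesis programs); the training pairs and gain are hypothetical, and weights are updated after each pattern presentation rather than by epoch.

program DeltaRuleSketch;
{ Illustrative sketch of the delta rule (equation 1.6) for one linear unit.     }
{ The training pairs and gain below are hypothetical; not part of the thesis.   }
const
  NumInputs   = 2;
  NumPatterns = 3;
  Gain        = 0.05;
  NumEpochs   = 200;
var
  x     : array[1..NumPatterns, 1..NumInputs] of Real;
  d     : array[1..NumPatterns] of Real;        { desired (target) outputs }
  w     : array[1..NumInputs] of Real;
  y, delta, err : Real;
  epoch, p, i   : Integer;
begin
  { hypothetical patterns: the unit should learn d = x1 + 2*x2 }
  x[1,1] := 1.0; x[1,2] := 0.0; d[1] := 1.0;
  x[2,1] := 0.0; x[2,2] := 1.0; d[2] := 2.0;
  x[3,1] := 1.0; x[3,2] := 1.0; d[3] := 3.0;

  for i := 1 to NumInputs do w[i] := 0.0;

  for epoch := 1 to NumEpochs do
  begin
    err := 0.0;
    for p := 1 to NumPatterns do
    begin
      y := 0.0;
      for i := 1 to NumInputs do y := y + w[i] * x[p,i];   { equation (1.10) }
      delta := d[p] - y;                                   { delta_pj }
      for i := 1 to NumInputs do
        w[i] := w[i] + Gain * delta * x[p,i];              { equation (1.6) }
      err := err + 0.5 * delta * delta                     { equation (1.7) }
    end
  end;
  Writeln('learned weights: ', w[1]:8:4, w[2]:8:4, '   final error: ', err:10:6)
end.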

1.2.3.3. Generalised Delta Rule - Backpropagation

As has already been discussed, single layer, linear networks are of limited application in recognition tasks. They are only able to distinguish linearly separable input classes. Complex decision boundaries are achievable with multi-layer, non-linear networks, but the output is no longer a linear function of input, and so the computation of derivatives is more problematic.

The delta rule can be generalised to more complex networks, but those which are multi-layer, feed-forward with sigmoid non-linearities are the most commonly employed. Rumelhart, Hinton and Williams [1986] have adapted the delta rule to such cases, and also provided a formal foundation for the method which is summarised below.

1. An input pattern vector X = (x1, x2, ...) is presented to the network, together with the desired output vector D = (d1, d2, ...).

2. The actual output vector Y = (y1, y2, ...) is calculated.

3. The errors are backpropagated through the network. The weights are adjusted by the familiar delta rule:-

\Delta_p w_{ij} = \eta\,x_{pi}\,\delta_{pj}

However, δ_pj is calculated in different ways, depending on whether the node under consideration is an internal ("hidden" or first layer) node or an output (final layer) one.

If x_j is an output node, then

\delta_{pj} = y_{pj}\,(1 - y_{pj})\,(d_{pj} - y_{pj}).

If x_j is a hidden or first layer node, then:-

\delta_{pj} = x_j'\,(1 - x_j')\sum_k \delta_{pk}\,w_{jk}

where k is over all the nodes in the layers above node j and x_j' indicates the output of such a node.

The threshold value, θ, for each node is treated as if it were a weight from a node of constant unity output.

Thus, the method involves two passes through the network during training. A forward pass to calculate the output of the network, followed by a backward pass in order to adapt weights. Of course, once training is completed, only the forward pass is used since weights then remain static.
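The following Pascal sketch illustrates these forward and backward passes for a small two-input, two-hidden-unit, one-output network with logistic nodes. It is not the thesis program MULTI.PAS; the XOR task, gain and epoch count are hypothetical assumptions, and, as Section 1.2.3.3.1 notes, convergence to a correct solution is not guaranteed.

program BackpropSketch;
{ Illustrative sketch of the generalised delta rule for a 2-2-1 network with    }
{ logistic nodes, trained on the XOR patterns. Not the thesis program; the gain }
{ and epoch count are arbitrary and convergence is not guaranteed.              }
const
  NumIn  = 2;
  NumHid = 2;
  Gain   = 0.5;
  Epochs = 5000;
var
  x      : array[1..4, 1..NumIn] of Real;
  d      : array[1..4] of Real;
  wh     : array[1..NumHid, 1..NumIn] of Real;  { input -> hidden weights  }
  thh    : array[1..NumHid] of Real;            { hidden thresholds        }
  wo     : array[1..NumHid] of Real;            { hidden -> output weights }
  tho    : Real;                                { output threshold         }
  h      : array[1..NumHid] of Real;            { hidden outputs           }
  deltaH : array[1..NumHid] of Real;
  y, deltaO : Real;
  e, p, i, j : Integer;

function Sigmoid(t: Real): Real;
begin
  Sigmoid := 1.0 / (1.0 + Exp(-t))
end;

begin
  { XOR training set (hypothetical task) }
  x[1,1] := 0; x[1,2] := 0; d[1] := 0;
  x[2,1] := 0; x[2,2] := 1; d[2] := 1;
  x[3,1] := 1; x[3,2] := 0; d[3] := 1;
  x[4,1] := 1; x[4,2] := 1; d[4] := 0;

  Randomize;
  for j := 1 to NumHid do
  begin
    for i := 1 to NumIn do wh[j,i] := Random - 0.5;
    thh[j] := Random - 0.5;
    wo[j]  := Random - 0.5
  end;
  tho := Random - 0.5;

  for e := 1 to Epochs do
    for p := 1 to 4 do
    begin
      { forward pass }
      for j := 1 to NumHid do
      begin
        h[j] := -thh[j];
        for i := 1 to NumIn do h[j] := h[j] + wh[j,i] * x[p,i];
        h[j] := Sigmoid(h[j])
      end;
      y := -tho;
      for j := 1 to NumHid do y := y + wo[j] * h[j];
      y := Sigmoid(y);

      { backward pass: delta for the output node, then for the hidden nodes }
      deltaO := y * (1.0 - y) * (d[p] - y);
      for j := 1 to NumHid do
        deltaH[j] := h[j] * (1.0 - h[j]) * deltaO * wo[j];

      { weight adjustment; thresholds treated as weights (note opposite sign) }
      for j := 1 to NumHid do
      begin
        wo[j] := wo[j] + Gain * deltaO * h[j];
        for i := 1 to NumIn do
          wh[j,i] := wh[j,i] + Gain * deltaH[j] * x[p,i];
        thh[j] := thh[j] - Gain * deltaH[j]
      end;
      tho := tho - Gain * deltaO
    end;

  { forward pass only, once training is complete }
  for p := 1 to 4 do
  begin
    for j := 1 to NumHid do
    begin
      h[j] := -thh[j];
      for i := 1 to NumIn do h[j] := h[j] + wh[j,i] * x[p,i];
      h[j] := Sigmoid(h[j])
    end;
    y := -tho;
    for j := 1 to NumHid do y := y + wo[j] * h[j];
    Writeln(x[p,1]:3:0, x[p,2]:3:0, '  ->  ', Sigmoid(y):6:3)
  end
end.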

1.2.3.3.1. Convergence

Gradient descent does not guarantee finding a global minimum for these multilayer networks since the error surface is not necessarily concave up everywhere. In pursuing the path of steepest descent on the error surface, the network may find a local minimum from which it cannot escape. Rumelhart and McClelland [1988] have suggested that local minima tend not to be a problem in practical situations with high numbers of hidden units. Baldi and Hornik [1989] proposed that one possible reason that local minima tend not to be a common problem is that the gradient of E is determined after only one or a few pattern presentations, and is therefore not a true gradient. A noisy gradient estimate will introduce a random element to the convergence path down the error surface [Widrow and Stearns, 1985] but the possibility that this will allow escape from local minima depends on the step size, amount of gradient noise and local error topology.

Training by epoch, then, may lead to a greater entrapment rate in local minima, since the slight miscalculation of true gradient does not occur. This problem may be worthy of further study.

Theoretically, steps in weight space should be infinitesimally small, but in practice they are determined by the gain term η. Another problem with gradient descent techniques is the long learning times. This is particularly true when the error surface has long, gently-sloping ravines with steep sides. In such cases, unless the gradient is measured accurately, the path may oscillate from side to side while only making slow progress down the length of the ravine [Hinton and Sejnowski, 1986; Rumelhart and McClelland, 1988]. Increasing the gain term η may speed up the rate of learning in some cases. The effect of changing the step size in weight space is shown for a hypothetical case in Figure 1.14.

Fig. 1.14: Two paths down an hypothetical ravine in weight space showing the effect of step size on the rate of convergence.

A momentum term added to the weight changes allows the learning procedure to adapt to the current position on the error surface. The use of a momentum term changes equation (1.6) to:-

\Delta w_{ij}(t+1) = \eta\,\delta_{pj}\,x_{pi} + \alpha\,\Delta w_{ij}(t)    ... (1.15)

where t represents a time or pass through the training data or a single pattern presentation, and α is a constant reflecting the emphasis on this momentum term.

Previous changes to weights form one component of current weight changes - a type of computational inertia, and hence the term momentum. If large changes in weight had been made before, then they will tend to be made at the next step also. The net effect of this is to filter out high spatial frequencies in the error surface and place more emphasis on its overall shape (the low spatial frequencies). This allows the learning algorithm to track down the major valleys in error space more quickly.
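The effect can be illustrated on a hypothetical elongated quadratic error surface (a simple stand-in for the ravine of Figure 1.14); this sketch is not thesis code and the gain, momentum and curvature values are arbitrary assumptions. With the same gain and number of steps, the run using the momentum term of equation (1.15) reaches a much lower error.

program MomentumSketch;
{ Illustrative sketch only: gradient descent on a hypothetical elongated        }
{ quadratic error surface E = 0.5*(a*w1^2 + b*w2^2), with and without the       }
{ momentum term of equation (1.15). Gain and momentum values are arbitrary.     }
const
  a     = 50.0;    { steep direction }
  b     = 1.0;     { gently sloping direction (the "ravine") }
  Gain  = 0.018;
  Alpha = 0.9;
  Steps = 100;
var
  w1, w2, dw1, dw2, g1, g2 : Real;
  k : Integer;
  useMomentum : Boolean;
begin
  for useMomentum := False to True do
  begin
    w1 := 1.0; w2 := 1.0; dw1 := 0.0; dw2 := 0.0;
    for k := 1 to Steps do
    begin
      g1 := a * w1;  g2 := b * w2;          { gradient of E }
      if useMomentum then
      begin
        dw1 := -Gain * g1 + Alpha * dw1;    { equation (1.15) }
        dw2 := -Gain * g2 + Alpha * dw2
      end
      else
      begin
        dw1 := -Gain * g1;
        dw2 := -Gain * g2
      end;
      w1 := w1 + dw1;  w2 := w2 + dw2
    end;
    Writeln('momentum used: ', useMomentum, '   final error: ',
            0.5 * (a * w1 * w1 + b * w2 * w2):12:8)
  end
end.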

To allow for the fact that the nature of the error surface has different features along different axes, Jacobs [1988] has suggested that each parameter have its own learning rate which can change with time - a finely tuned momentum application of backpropagation. He also recommends an adaptive learning rate scheme whereby, if the current gradient is in the same direction as the previous one, the learning rate increases linearly with time, while a change in gradient direction should lead to an exponential decrease in learning rate.

Further modifications have been made to standard backpropagation to enhance its convergence speed. For example, a margin variable can be set such that an error smaller than it is not backpropagated, so that system resources are not tied up on small intermediate refinements. Tesauro et al. [1989] have suggested that adaptive learning rate schemes provide significant performance advantages, while momentum and margins do not. It should be noted that, while providing a simplified theoretical base for this assertion, they only applied it to net topologies of one hidden layer having linear nodes.

It is also worth noting that even simple networks of three nodes and one training pattern have been shown to exhibit chaotic behaviour using backpropagation as the training algorithm, and therefore do not always achieve a predictable equilibrium [van der Maas et al, 1990].

1.2.3.3.1.1. Visualisation

Visualising a network's convergence progress for small dimensions is relatively simple. If only two weights are involved, for instance, then these can form Cartesian axes and the change in weights is shown as a trajectory. For higher dimensions, small dimensional slices are required for direct visualisation. Keeler [1986] has used random 2-D slices to aid in the conceptualisation of basins of attraction in Hopfield nets.

Machtynger and Sitte [1990] have suggested the use of a full Karnaugh map of the space to overcome some of the limitations of Keeler's method.

Another approach to the visualisation problem for higher dimensional spaces is to examine more indirect measures, such as the sum of squared error for a training set. If this is asymptotic to a non-zero value, it can be inferred that the network is in a local minimum (assuming that the network topology allows a theoretical classification of the input data).

The changes in trajectory angle can be inferred by the correlation of successive weight error derivatives. This provides information related to the topology local to the trajectory.

1.2.3.4. Other Algorithms

While backpropagation is currently the most commonly used training algorithm, other types have been introduced in order to overcome some of its limitations. It should be reiterated, however, that the range of possible training algorithms is topologically determined, and not all are simply interchangeable.

Baum [1989] has argued the case for more flexible learning algorithms which are able to adapt not only weights, but also the numbers of nodes and synapses operating within the network. Such an approach would lead to a class of "universal learners", capable of learning any learnable concept.

Speed of training is considered one of the biggest drawbacks of backpropagation, and this will be covered in more detail in Section 1.3.3. Alternative algorithms/topologies which provide speed advantages have been proposed, such as having locally tuned nodes responding to only a limited range of data [Moody and Darken, 1989], genetic algorithms which effectively reduce a problem's dimensionality [Bartlett and Downs, 1990] and using a network which divides the whole decision region into static hypercubes to begin with, then groups these [Huang and Lippmann, 1987].

Since backpropagation requires a finite step size, the convergence to a global minimum cannot even be theoretically guaranteed. Quing-zeng [1989] has proposed the "blindman going down the hill" method, in which steps of slowly decreasing size in weight space are tested in random directions - only those which produce a better fit are actually taken. A large enough step size will allow escape from local minima under some conditions.

Random optimisation methods are more likely to find global minima than backpropagation, but have only been investigated for small classes of problems [Baba, 1989]. A more commonly employed technique for binary/stochastic nets related to this is "simulated annealing", discussed in Section 1.2.3.5 below.

Backpropagation is not biologically plausible since neurons are informationally unidirectional. Noting this, modellers who wish to elucidate brain learning have proposed anatomically possible algorithms, such as deterministic Boltzmann learning [Hinton, 1989].

1.2.3.5. Energy

Hopfield [1982] suggested another way of considering the learning in a particular type of network (binary and symmetric) by introducing a measure of its "energy", U, as defined by:-

U = -\tfrac{1}{2} \sum_{i \neq j} w_{ij}\,x_i\,x_j    ... (1.16)

28 and pointed out the similarity with some thermodynamic systems. Although energy U and error E are descriptors for quite different network types, there is a relationship between them [Linsker, 1988].
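The behaviour of U under updating can be illustrated with a small sketch (not thesis code; the weights and initial states below are hypothetical). For a symmetric network with no self-connections, updating one binary node at a time never increases the energy of equation (1.16).

program HopfieldEnergySketch;
{ Illustrative sketch only: the energy U of equation (1.16) for a small         }
{ hypothetical binary network with symmetric weights, showing that U does not   }
{ increase as nodes are updated one at a time.                                   }
const
  N = 4;
var
  w : array[1..N, 1..N] of Real;
  x : array[1..N] of Integer;          { node states, +1 or -1 }
  i, j, pass : Integer;
  net : Real;

function Energy: Real;
var
  i, j : Integer;
  u : Real;
begin
  u := 0.0;
  for i := 1 to N do
    for j := 1 to N do
      if i <> j then u := u - 0.5 * w[i,j] * x[i] * x[j];   { equation (1.16) }
  Energy := u
end;

begin
  { hypothetical symmetric weights with zero diagonal }
  for i := 1 to N do
    for j := 1 to N do w[i,j] := 0.0;
  w[1,2] := 1.0;  w[2,1] := 1.0;
  w[3,4] := 1.0;  w[4,3] := 1.0;
  w[1,3] := -0.5; w[3,1] := -0.5;

  x[1] := 1; x[2] := -1; x[3] := 1; x[4] := -1;
  Writeln('initial energy: ', Energy:8:3);

  for pass := 1 to 2 do
    for i := 1 to N do
    begin
      net := 0.0;
      for j := 1 to N do
        if j <> i then net := net + w[i,j] * x[j];
      if net >= 0.0 then x[i] := 1 else x[i] := -1;
      Writeln('after updating node ', i, ': energy = ', Energy:8:3)
    end
end.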

This was later extended to networks exhibiting graded responses [Hopfield, 1984]. In each case, the learning algorithm was such that a gradient descent in energy-space occurred. As with backpropogation, however, this system can become trapped in a local minimum.

The similarity with thermodynamic systems, spin glasses in particular, suggested one method for avoiding local minima. By introducing a "temperature" term, which starts at a high value in the initial stages of learning, the network is allowed to explore other possibilities beyond the immediate local contours. In a sense, high temperatures introduce a randomness into the network, allowing it to "bounce out" of local minima. As learning proceeds and the network approaches lower energy states, the global minimum in particular, the temperature is reduced. This is analogous to the slow reduction in temperature which is used in the annealing process, and hence the term simulated annealing is applied to it [Kirkpatrick et al, 1983].

1.2.3.6. Coding in the Hidden Layers

Of fundamental importance to the operation of multi-layer networks is the re-coding of the input data in the hidden layer(s). Such re-coding allows a type of abstraction of the input information.

A commonly used application and illustration of this is the encoder problem [see Ackley et al, 1985; Rumelhart et al, 1986(a) and 1986(b), for instance]. A simple example of this task is for a 4-2-4 network (4 units in the first and third layers and 2 in the second layer) to reflect at the output what is presented at the input. An illustration of this network, together with an example of input and output, is shown in Figure 1.15.

Fig 1.15: A 4-2-4 network acting as an encoder. A sample input and output is given.

A solution by the network requires it to represent the input data as a binary number in the middle layer. This is the elegant solution achieved by such a network. The network has achieved a re-coding of the input information. Such data compression capabilities of neural networks may have applications in information transmission.

Analysis of the hidden layers in networks trained to analyse more complex input signals, such as sonar returns and video camera input, has revealed some interesting strategies [Gorman and Sejnowski, 1988; Touretzky and Pomerleau, 1989]. Some networks use the hidden units to model one or more features of importance to the classification task, while other weights may be involved in "memorising" exceptional patterns. It is interesting to note that trained human observers and the neural network hidden layers picked out similar features of the signal as being shibboleths for the task.

1.2.3.7. Comparison with Traditional Classifiers

Most topology/learning algorithm combinations are related to traditional classifiers and statistical techniques to some degree. Network performance often exceeds that of the traditional classifiers, but the performance differential is dependent on input distribution.

Networks have been shown to perform principal component analysis, optimum classification, leader clustering, Gaussian classification, k-nearest neighbour and k-means clustering [Linsker, 1988; Lippmann, 1987].

With traditional classifiers, however, the nature of the metric must be known so that a calculation based on it can be performed. Neural networks can be applied for classification without an understanding or knowledge of the basis of separation of different classes. De Silva et al. [1991] have suggested that at least some research into neural networks has been a method of avoiding the learning and use of the statistics required for traditional pattern classification.

1.3. Advantages and Disadvantages

Neural network computation is complementary to the traditional digital/sequential type. Neural networks have weaknesses in areas such as numerical computation, particularly involving high precision numerical representations, but strengths in pattern recognition related tasks. The relative strengths and weaknesses are related to the fact that neural networks do not "compute" answers, so much as memorise them, or find patterns in the answers. In this sense, neural networks can be thought of as statistical computing devices in which computation, and therefore mistakes, are distributed over all the connections, leaving no easily discernible "audit trail" [Anderson, 1988].

Sections 1.3.1 to 1.3.4 below highlight some of the facets which are of particular importance to the performance of neural network computing.

1.3.1. Generalisation

Generalisation is the ability to distil features in the training set and to apply them to novel patterns. This is one of the features of neural networks which make them "brain-like". Human cognitive ability is able to extract the salient features of a number of cases of the object car, for example, and to apply them to other objects in order to form a classification of car or not-car. Overtraining on a very narrow range of car types (say small red ones with black roofs) may lead a person to a false concept of car - there would be a lack of generalisation.

In the same way, neural networks can be overtrained on small, non-representative samples and so not learn the quintessential features required.

Another way of stating this is that the network is underconstrained. This can be corrected by constraining the network (say by pruning), or by providing more training samples [Touretzky and Pomerleau, 1989].

While acknowledging the fact that the ability of a network to generalise is very task specific, Sietsma [1990] concluded from studies on frequency determination of sine waves that:-

a) Training with noisy inputs improves the network's ability to generalise.

b) Pruning redundant units from a trained network improves classification performance on noise-corrupted data, but this effect is less important if the network is trained with large amounts of noise in the first place.

c) Training with noise reduces the number of redundant nodes, and the more complex solutions found by these larger networks are more general than those of pruned (and therefore smaller) networks trained without noise.

Baum and Haussler [1989] have theoretically determined the number of training samples required for a multi-layer, feed-forward network having one hidden layer and linear nodes, to classify new patterns with an error rate of ε, to be bounded about the value (number of weights in network)/ε. The analysis assumes that both training and test samples are drawn from the same distributions. The case of networks with two hidden layers is unsolved, however.
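As a hypothetical numerical illustration of this bound (the figures are not taken from the studies cited), a network containing 200 weights, trained to a target error rate of ε = 0.1, would require on the order of:

   (number of weights in network)/ε = 200/0.1 = 2000 training samples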

Ahmad and Tesauro [1988] studied the simpler case of a single layer perceptron trained on linearly separable patterns and found:-

a) The failure rate falls off as the exponential of the number of training patterns where these are random, but much more rapidly when the patterns are chosen to be borderline cases. Thus training is more effective when the patterns chosen from a class are most like another class.

b) To achieve a given classification performance, the number of input patterns scales linearly with the number of input units. Since this study was performed on single layer networks, this result is consistent with that of Baum and Haussler above.

1.3.2. Graceful Degradation

By its very nature, a neural network's computational ability is distributed over the whole system. The destruction of a single computational element in a sequential system may cause its complete collapse. There is no single critical element in a neural network, and the loss of computational elements - nodes or connections - results in a slow degradation. The more elements which are lost, the greater the degradation. (For poorly designed, underconstrained systems, element loss may actually improve the system's performance, as has already been noted in Section 1.3.1.) In this sense, they are similar to holographic images in which any single part of the image is distributed over the whole hologram.

1.3.3. Speed, Scaling and System Requirements

The solution of any problem on any type of computational system has certain minimal requirements. These requirements come under the headings of time, space and Kolmogorov complexities [Abu-Mostafa, 1986]. The sources of these for sequential and neural network computers are set out in the table below:-

                             SEQUENTIAL COMPUTER      NEURAL NETWORK
TIME COMPLEXITY              number of steps          number of iterations
SPACE COMPLEXITY             scratch memory           number of neurons
KOLMOGOROV COMPLEXITY        algorithm length         information capacity of
                                                      synaptic connections

It can be seen from the above table that space and Kolmogorov complexity are intrinsically linked for a neural computer, since the number of nodes and the number of degrees of freedom (information capacity) are related. Thus high space complexity requirements demand large networks with consequent large Kolmogorov complexity - whether this is required or not.

If N is the number of nodes in a network, the capacity of the network is of the order of N^3 bits - thus the Kolmogorov complexity grows much faster than the number of neurons. This indicates the types of problem for which neural networks are best suited - those requiring a long algorithm on a sequential machine. Examples include pattern recognition and multiple constraint-satisfaction problems, such as the "travelling salesman problem". Problems which do make good use of a network's resources represent immense computational power with only a small hardware commitment. On the other hand, problems requiring short algorithms and high time complexity are very inefficient on a neural network because of the unused network capacity. Sequential computers are ideally suited to these small input vector, multi-iterated problems [Abu-Mostafa, 1986; Hopfield and Tank, 1986; Churchland and Churchland, 1990].

On sequential machines, some problems scale exponentially in the time required for a solution with increasing size of input. Neural networks can solve the same problem in polynomial time at the expense of an exponential increase in network size [Abu-Mostafa, 1986; Baum, 1986].

The actual training of the network is another scaling problem, and this is related to the training algorithm used. Morse [1989] states that some research indicates that backpropagation training time grows exponentially with nodal number but only polynomially with the number of training patterns used. Alkon [1989] has developed a network training system (DYSTAL) which he claims has more of a biological basis than other algorithms and does not result in the normal increase in the number of iterations per node as the number of nodes increases, thus proving less resource-hungry in its scaling.

Memory requirements are a significant consideration in the scale of a network, and this depends on the network size - including the number of nodes and interconnections; as well as the algorithm. For instance, the use of momentum requires the storage of the old weight changes as well as the current weights.

1.3.4. Hardware Implementation

Nearly all neural computers are currently simulated in software on sequential digital machines. Such an approach, while the only cost-effective method in most cases with current technology, places severe speed limitations on the network. Even though neural networks are parallel structures, the computations are performed serially on sequential machines. On a true hardware neural network, the total computation would be performed in approximately one network clock pulse per layer. In a serial machine, many pulses are required for the determination of a single weight.

This tends not to be much of a problem for the implementation of a trained network; however training can be very slow because many thousands of passes through the network are required. Thus one viable intermediate approach would be to train a network on a very fast serial device, to be later implemented on slower, lower cost ones.

VLSI implementations of neural networks have been gaining an enormous momentum in a short period of time. The first application of major significance was for use in real-time visual processing, particularly motion detection [Sivilotti et al, 1987]. The device was based on an analog CMOS array of 22 amplifiers and 462 connections reduced to less than one square centimetre in size.

The combined requirements of parallelism, low precision matrix multiplication, reasonable small signal linearity and a high degree of interconnectivity lend neural computing to optical implementations [Anderson, 1988]. Farhat et al [1985], for instance, have developed an effective optical implementation of a Hopfield network, and Brady and Psaltis [1989] have harnessed photorefractive crystals in holographic techniques to act as high density reconfigurable interconnections for optical neural computers. Such optical approaches harness enormous high speed computational power.

34 1.4. Applications

It is the flexibility of neural network systems which has allowed them to be applied in such a diverse range of areas. Their success has been most marked in areas which have proven difficult (or indeed, intractable) for traditional sequential computing - the areas in which human cognition still far exceeds its digital counterpart.

Biomedical applications will in general benefit greatly from this area because of the "messy" signals used and the computationally intensive information processing and image analysis required [Reggia and Sutton, 1988; Kohonen, 1988(a)].

Some medical applications will be less direct, such as a speech recognition system to allow medical personnel to access medical records in a multi-media environment [Meredith, 1990].

A brief listing of applications follows, with particular emphasis on areas which have significance to the application of neural networks to human heart sound recognition.

1.4.1. Speech Recognition & Synthesis

No current machine is capable of speaker-independent, large vocabulary, continuous-speech recognition. Such a task, achieved seemingly easily by humans, remains out of reach of current signal analysis techniques. The massive parallelism afforded by neural networks does offer promise in this area. Quite apart from the computational strategies which they bring to bear, biologically inspired front end pre-processors (based on cochlear function) integrate naturally with neural networks [Mueller and Lazzaro, 1986; Kohonen, 1988(b); Burr, 1988; Waibel and Hampshire, 1989; Lippmann, 1989; Lang and Waibel, 1990; Reilly and Boashash, 1990].

In the area of speech synthesis, outstanding performance with very modest development overheads has been achieved by a text-to-speech transcription network called NETtalk [Sejnowski and Rosenberg, 1986].

1.4.2. ECG

Neural networks have been applied in three distinct areas of ECG signal processing and classification.

1. As an adaptive filter in fetal electrocardiography to eliminate the maternal signal [Widrow and Winter, 1988].

2. Time location of QRS complex in acute disease [Atlas et al, 1988]. The results from this study were poor, however.

3. Classification of ECG signals according to pathology [Chi and Jabri, 1990, 1991; Nickolls, 1990].

1.4.3. EEG

EEG signals are notoriously complex, involving close similarities between signal and noise. Neural network techniques have been applied to this problem with some success, even allowing determination of the drug type administered from the subject's EEG signal [Gevins and Morgan, 1988].

1.4.4. Image Recognition

Image recognition applications have included optical character recognition [Fukushima, 1988; King, 1989; Wang et al, 1989], automated cytological examinations [Rennie, 1990], 3-D structure recognition [Feldman et al, 1988], and an autonomous vehicle navigation system [Touretzky and Pomerleau, 1989].

One advantage of their application in this field is the ability to design networks which are scale and perspective invariant [Widrow et al, 1988].

1.4.5. AI and Expert Systems

It is possible for a neural network to be implemented as an expert system, with the added speed advantage of considering competing hypotheses in parallel. Such a network has been applied to a chest pain diagnosis system [Ramamoorthy and Ho, 1988].

Natural language processing takes automated speech recognition to the next stage, and neural networks have been applied in this area also [Feldman, 1985], even including the sub-problem of humour recognition [Shuo and Fay, 1989].

1.4.6. Filtering

The filtering of ECG signals has been mentioned in Section 1.4.2 above. Neural networks have powerful noise rejection capabilities which can be employed in signal detection problems [Lippmann and Beckman, unpublished personal communication; Klimasauskas, 1989].

1.4.7. Other

The areas of application are not limited to those mentioned above, and cover areas ranging from stock market price prediction systems [Computerworld, August 10, 1990] to protein tertiary structure analysis from NMR data [Kinoshita, 1990].

2. Heart Sounds and Phonocardiography - A Review

I have been able to hear very plainly the beating of a Man's Heart .... Who knows, I say, but that it may be possible to discover the Motions of the Internal Parts of Bodies ... by the sounds they make, that one may discover the Works performed in the several Offices and Shops of a Man's Body, and thereby discover what Instrument or Engine is out of order. Robert Hooke

The Method of Improving Natural Philosophy in The Posthumous Works of Robert Hooke, Containing his Cutlerian Lectures and Other Discourses Read at the Meeting of the Illustrious Royal Society, etc., 1705.

2.1. Synopsis

The mechanisms of sound production having a cardiac origin are discussed. Heart sounds provide a window into cardiac function, both normal and abnormal. Their use and limitations in the diagnosis of various cardiac pathologies are considered. The extensions to the diagnostic capabilities of auscultation achieved through phonocardiography and related techniques are also discussed.

2.2. Introduction

Sound energy produced by the heart and the flow of blood through it had provided the most direct and detailed evidence of its mechanical action until the advent of direct visualisation made possible by techniques such as ultrasound.

While auscultation is still routinely employed for the purposes of initial screening, a more detailed study and sophisticated analysis of heart sounds through phonocardiographic techniques has lost ground as these direct visualization techniques have become more accessible.

This chapter provides a brief introduction to cardiac function, auscultation and phonocardiography. Compared to neural network computing, knowledge in this area is relatively mature, and the intention is not to review recent research in the abstruse detail of human heart sounds, but to provide a broad background to the subject. Since this chapter outlines largely well-accepted concepts, only points of contention between authors, and less well known studies, have been referenced. The following general references are particularly helpful: Leatham [1958], McKusick [1958], Luisada [1959], Friedberg [1966], Wartak [1972], Luisada [1973], Scher [1974], Fleming and Baimbridge [1974], Hurst [1974], Tavel et al [year unidentified], Tavel [1976], Tavel [1978], Holzner and Mathes [1983], Opie [1984], Dalen and Alpert [1987], Julian [1988].

2.3. Cardiac Anatomy

The heart consists of four chambers - the left and right atria and ventricles. It is the right heart which drives deoxygenated blood through the lungs, and it returns to the left heart which pumps the oxygenated blood throughout the body.

The two thin-walled atria are separated from each other by an inter-atrial septum and the thicker-walled ventricles by an inter-ventricular septum. Corresponding atria and ventricles communicate via atrial-ventricular valves (AV valves). The left side AV valve is called the mitral and consists of two cusps while the one on the right has three. However, it should be noted that this difference is not always very significant since the mitral valve will occasionally have a small third cusp and the intermediate cusp of the tricuspid is often so small that functionally it is bicuspid. Each cusp of the AV valves is attached by chordae tendineae and papillary muscles, which in normal cardiac function, help prevent their eversion. Ventricles are connected to the aorta and pulmonary artery on the left and right side respectively via the aortic and pulmonic semi-lunar valves, which are composed of three cusps (although bicuspid and quadricuspid examples are found rarely).

The left side of the heart operates at higher pressures because it is the pump for a more complex vasculature and this is achieved by a thicker musculature.

Figure 2.1 is a schematic diagram of cardiac structure, showing gross anatomy and the direction of blood flow.


Fig. 2.1: Schematic diagram of the heart showing chambers, valves and blood flow.

It is the various pressure differences within the heart's chambers and connecting vessels, together with heart valves, which control the flow of blood.

2.4. The Cardiac Cycle

In the following description, attention will be focussed on the left heart except where special attention needs to be drawn to the action of the right heart. It should be kept in mind, however, that while slight differences in timing and intensity do shape cardiac sounds, similar overall mechanisms are occurring in both sides at approximately the same time.

A version of Wiggers' diagram is shown in Figure 2.2 and is widely used as a convenient method of showing the temporal relationships between pressures, volumes, heart sounds and ECG tracings during a cardiac cycle.

[Figure 2.2 comprises aligned traces of aortic pressure, left ventricular pressure, the heart sounds (S1-S4), valve positions (mitral and aortic, open/closed) and an ECG over one cardiac cycle; the traces themselves are not reproduced here.]

Fig 2.2: A Wiggers' diagram showing some of the mechanical events of the heart, with an ECG tracing as reference.

Pressure in the left ventricle builds up to the point where it exceeds that in the atrium and the mitral valve snaps shut, creating one component of the first heart sound.

The ventricle is isolated from atrium and aorta by the mitral and aortic valves and so is at a fixed volume. A short period of isovolumic contraction begins, the ventricular pressure rising until it is greater than that in the aorta, at which time the aortic semi­ lunar valve opens and blood is ejected rapidly, adding an aortic component to the first heart sound.

The myocardium then enters a relaxation phase and the ejection of blood slows down until the aortic pressure is once again greater than the ventricular and the aortic valve closes, providing the first component of the second heart sound. The pulmonic valve normally closes slightly after the first and this causes the second component of the second heart sound.

The ventricle then enters a phase of isovolumic relaxation, with both mitral and aortic valves closed, until its pressure drops below the atrial pressure and the mitral valve opens, normally silently.

The period between closures of the aortic valve and mitral valve is termed diastole, and it should be noted that this does not correspond exactly with physiological diastole as indicated by myocardium relaxation. The complementary period of the cardiac cycle is termed systole.

2.5. Heart Sounds and Murmurs

The generation of sound in the heart is a complex phenomenon dependent on movement and tensing of cardiac tissue as well as blood within. Equally complex is the transmission of that sound through the inhomogeneous surrounding tissues to a point where it can be heard and/or recorded.

The nomenclature of phonocardiology and cardiac auscultation normally draws the distinction between heart sounds and murmurs. The former are transients while the latter are of longer duration.

2.5.1. Production Mechanisms

The transient heart sounds, such as the first and second sounds, are valve closure sounds. It is currently held that the tensing of the valve curtains, together with hydrostatic pressure gradients, is the source of such sounds, rather than a collision of the actual valve flaps. Pressure transients result from the interruption to local flow of blood by a valve closure. For example, at the time of mitral closure, a local backflow of fluid occurs, the momentum of which results in the slight billowing of the valve cusps and surrounding tissues. As the limit of extensibility is reached, a pressure transient in the fluid and tissue is generated. This type of local back flow and transient generation is shown in Figure 2.3.

Fig. 2.3: Mechanism for the tensing of valve curtains due to local back flow of blood at the time of valve closure.

42 "Opening snaps" of AV valves are heard in the cases of some pathologies, most notably mitral stenosis. Other transient sounds may be produced by the vibrations of the myocardium by rapid ventricular filling, for instance.

Murmurs may be generated by a number of mechanisms outlined below [after McKusick, 1958].

1. High flow rates with large Reynolds numbers - the resulting turbulent flow causing vibrations. It should be noted that the haematocrit significantly alters blood viscosity, and so anaemia can result in the development of new murmurs.

2. Widening of a conduit beyond a constriction leading to an interplay of turbulence, eddy formation and consequent wall vibrations. Pressure differences described by Bernoulli's principle at this point may directly drive wall vibrations also. Stenotic valves and aortic coarctation may be examples of this basic mechanism.

3. Turbulence when a rapidly moving stream meets a slower one, aortic aneurism being a case in point.

4. Impact of a high velocity fluid jet with an opposite wall, as in mitral regurgitation or ventricular septal defects. These impacts can produce lesions and tend to have their source in the higher pressure left ventricle.

5. Frictional rubbing, such as the heart against an inflamed pericardium (this is discussed more fully in Section 2.5.11 below).

6. "Trumpeting" vibrations such as that caused by the flow of blood through a stenotic valvular diaphragm.

7. Vibrations of chordae tendineae as blood flows past, because of their aberrant position or because of an unusual blood flow such as in the case of some ventricular septal defects.

8. Fluttering of a thin flap of tissue, such as a retroverted valve cusp.

Examples 6, 7 and 8 above are musical in nature, and example 5 is rarely so. It is these musical murmurs which tend to be much louder and can sometimes be heard unaided some distance from the subject.

The mechanisms for these murmurs, together with examples and diagrams are summarised in the table below.

  MECHANISM                                      EXAMPLE
  1  Turbulent flow in an even tube              Genesis of murmur in anaemia
  2  Sudden widening beyond a constriction       Stenotic valve
  3  Rapidly moving stream meets a slower one    Aortic aneurism
  4  Jet impact                                  Mitral regurgitation
  5  Frictional rubbing                          Pericardial friction rub
  6  Trumpeting                                  Stenotic valvular diaphragm
  7  Chord vibrations                            Aberrant chordae tendineae
  8  Fluttering                                  Retroverted valve cusp

(The diagrams accompanying each mechanism are not reproduced here.)

2.5.2. Factors Affecting Auscultated Sounds

2.5.2.1. Regions for Auscultation

There are four traditional regions for cardiac auscultation, each having distinct advantages for different sound components. These regions are shown in Figure 2.4, together with the valve positions.


Fig. 2.4: Relative positions of heart valves and the four traditional regions for auscultation.

2.5.2.2. Sound Transmission

Sound quality and intensity as heard or recorded at the surface of the body is affected by many factors related to its transmission through tissue. Higher frequencies tend to experience a greater attenuation as they travel through tissue, for instance. Various body cavities and structures also have their own resonant frequencies which alter the overall gain of the transmitted sound. Quite apart from the intensity of the sound generated by the heart, the surface intensity is also related to chest thickness, particularly the amount of adipose tissue present.

2.5.3. The First Heart Sound

The audible components of the first heart sound have been identified with mitral and tricuspid valve closure, normally in that order [McKusick, 1958], although some researchers have suggested left ventricular events are the exclusive causative factors of the normal first sound.

The first heart sound is best heard in the apical and tricuspid areas. In normal cardiac function, the first sound has its higher intensity components at the lower end of the frequency spectrum, between 30 and 150 Hz, and has a duration of 0.10 to 0.16 seconds.

The amplitude of the first sound may be reduced in pathologies such as:-

a) Dilation of the left ventricle associated with mitral insufficiency, causing a slower rise in ventricular pressure. A close association between first sound intensity and rate of change of pressure has been found [Luisada, 1973].

b) Myocardial infarction causing a slower pressure rise.

On the other hand, increases in first sound intensity can occur in cases such as mitral stenosis, where mitral valve closure occurs during the rapid rise of ventricular pressure, rather than before, leading to a louder, "snappier" quality.

While normal physiological splitting of the first sound occurs, some pathological conditions are causative, such as right bundle branch block.

2.5.4. The Second Heart Sound

The second heart sound is associated with the closure of the semi-lunar valves - pulmonic and aortic, each contributing a definite component. The aortic element tends to have a higher frequency and amplitude with a wider radiation pattern. The overall frequency range is approximately 220 to 400 Hz and the sound has a duration of 0.08 to 0.14 seconds. The components of the sound are best heard at the left and right sternal borders of the second [McKusick, 1958; Julian, 1988] or third [Luisada, 1973] intercostal spaces, in the aortic and pulmonic areas; however, the pulmonic component is not as perceptible in the aortic region.

In normal cardiopulmonary function, the aortic component precedes but fuses with the pulmonic during expiration and splits by up to 30ms [McKusick, 1958] or 80ms [Friedberg, 1966; Tavel, 1976] during inspiration. This splitting results from a longer right side systole caused by a larger stroke volume, itself the result of greater right side venous return occurring with the change in thoracic pressure.

The amplitude of the second heart sound may be increased by factors such as hypertension. Pulmonary hypertension tends to increase the second component and systemic hypertension the first component. On the other hand, stenosis of either semi-lunar valve or the adjacent artery may cause a decrease in amplitude of the corresponding second heart sound component.

Some second sound timing abnormalities which normally indicate an underlying pathology are listed below:-

No splitting
  * Physiologic - particularly in older individuals
  * Systemic hypertension

Fixed splitting
  * Right bundle branch block
  * Left ventricular ectopic beats
  * Dilation of pulmonary artery

Paradoxical splitting (increases during expiration)
  * Left bundle branch block
  * Aortic stenosis
  * Patent ductus arteriosus

2.5.5. The Third Heart Sound

Physiologic third heart sounds are found in a majority of children and teenagers, but become very rare by the age of 40. They are more easily detected by phonocardiography than auscultation due to their low amplitude (less than half that of the other two sounds) and frequency (10 to 100 Hz). The onset of the third sound is between 0.12 and 0.18 seconds after the onset of the second, and it has a duration of between 40 and 80 ms.

The sound results from vibrations during early diastolic ventricular filling of a vigorous heart and so is most easily detected in the apical region after brief exercise.

The pathologic third heart sound, referred to as the ventricular gallop, third heart sound gallop or protodiastolic gallop, gives a distinctive 3 beat cadence. It is thought that the mechanism of its production is the same as for the normal third heart sound, and other clinical findings are necessary to distinguish between them. For instance, a third sound in an otherwise normal teenager is likely to be physiological; while in the elderly, it is indicative of cardiac dilation and failure, increased venous return to either ventricle (fever, mitral insufficiency, ventricular septal defect) or slow heart rates (complete heart block).

2.5.6. The Fourth Heart Sound

The fourth heart sound, when present, occurs at the start of the cardiac cycle, about 0.12 seconds before the first sound, lasts 40 to 60 ms, and is associated with atrial contraction and the impact of blood on a perhaps less distensible ventricle.

The sound tends to have a low frequency composition; and the higher the frequency, the more likely is a pathological causation.

Abnormal fourth sounds have been found in such conditions as systolic overload of one ventricle (aortic stenosis, systemic hypertension, pulmonary stenosis, pulmonary hypertension) and diffuse (myocarditis) or localised (ischemic heart disease) myocardial damage.

It is most easily detected over the lower left sternal border and the apex, and tends to be more easily recorded than auscultated.

2.5.7. Summation Gallop

It is possible for the pathological third and fourth sounds to be fused into a single "summation gallop". This is most likely during tachycardia. Slowing the pulse by carotid sinus pressure will allow these two sounds to separate, producing a distinct four-beat cadence, and confirm such a diagnosis.

The summation gallop tends to be longer than an individual third or fourth sound, and is often louder than the first or second sounds.

2.5.8. Opening Snaps

Although the AV valves normally open silently, stenosis may lead to valves which are forcefully opened and become taut in their restricted, fully opened positions. The opening snap of mitral stenosis is the most common example of this, however it also occurs with severe stenosis of the tricuspid valve.

The snap has high frequency components, ranging from 50 to 600 Hz, lasting between 20 and 40ms and occurring some 30 to 120ms after the aortic component of the second sound.

2.5.9. Systolic Clicks

Extrasystolic clicks may be associated with the ejection of blood (ejection clicks) - high frequency transients following soon after the first heart sound, which may be caused by the sudden arrest of the opening semi-lunar valves or the tensing of the aortic and pulmonic arteries due to the sudden pressure rise.

Clicks may also be of the non-ejection variety. In such cases, the sound probably arises from mitral valve prolapse, and may be associated with systolic murmurs.

2.5.10. Murmurs

Heart murmurs can be conveniently classified according to their location within the cardiac cycle, as well as their duration, intensity profile and frequency. It has been suggested that most people have innocent heart murmurs, normally of low intensity.

2.5.10.1. Systolic Murmurs

Murmurs during systole may be associated with ejection of blood through the pulmonary or aortic outflow tracts, in which case they do not commence until a short time after the first heart sound. Since murmur energy is a function of blood velocity, the intensity normally follows a crescendo-decrescendo pattern, finishing when blood velocity is very slow or stopped - before the second heart sound. While the gap between the first sound and the murmur may be small, that between the murmur and the second sound allows them to be distinguished as distinct entities.

Normal ejection murmurs tend to encompass between one half and two thirds of systole and be at the lower limits of audibility. In children, they often have a musical quality, becoming more rasping in adults.

It should be noted that intensity of the murmur does not correlate to the severity of the obstruction, since it is also a function of blood velocity and therefore pressure difference. Thus congestive heart failure may actually lead to a reduction in murmur intensity due to reduced stroke volume.

Regurgitant murmurs tend to be pansystolic (or holosystolic), having more of a "blowing" quality. The fact that the pressure gradient which leads to the regurgitation of blood across an incompetent AV valve or septal defect exists from the time of AV valve closure up to or beyond the second sound results in the longer duration of this murmur.

2.5.10.2. Diastolic Murmurs

While the presence of a systolic murmur does not necessarily imply an underlying pathology, a murmur in diastole does. As with the systolic varieties, they may be either forward-flow or regurgitant.

Diastolic filling murmurs are caused by a narrowing of the aperture made by the AV valves. In cases of mitral stenosis, the murmur is often immediately preceded by the opening snap, and is characterised by a low-pitched, rumbling quality, continuing until the first heart sound. Since the mitral valve opens some time after the closure of the semilunar valves, there is a distinct time interval between the second sound and the murmur.

Tricuspid stenosis produces murmurs similar to those of mitral stenosis; however the sound tends to be maximal at the left sternal border, and also to increase with inspiration due to the increase in flow across the valve. The quality of the sound tends to be "scratchier" and shallower.

Regurgitant diastolic flow results from insufficiency of the semi-lunar valves. Aortic insufficiency is typically indicated by a diastolic murmur, often starting just before the second sound in severe cases, and often having a diamond-shaped amplitude envelope due to the time course of the pressure gradient. The sound has a high pitched ("blowing") quality. In cases of Austin Flint murmur (a diastolic murmur detected at the apex), the regurgitant aortic stream meets that from the atrium, leading to low frequency vibrations.

The murmur of pulmonic insufficiency is normally of a lower frequency and shorter duration than that of aortic insufficiency. These characteristics result from the lower pressure gradient between pulmonary artery and right ventricle.

2.5.10.3. Continuous Murmurs

A murmur which begins in systole and persists for part or all of diastole is termed continuous. Such murmurs are normally the result of a path between a high pressure and low pressure region, the pressure difference being maintained for most of the cardiac cycle.

The most common example is patent ductus arteriosus, in which there is a persistence of the fetal connection between the aorta and pulmonary artery. The peak difference in pressure between the aorta and pulmonary artery is at the time of the second sound, and this is when the murmur is normally loudest. Pulmonary hypertension may negate this pressure difference in diastole, and therefore the murmur may be absent for this part of the cardiac cycle.

Continuous murmurs may also be produced by conditions such as the severe narrowing of an artery, greatly increased blood flow and a number of other situations involving fistulas and ruptures.

2.5.11. Friction Rubs

Inflammation of the pericardium may cause a frictional rubbing, which is pronounced during periods of maximum ventricular movement. A typical frictional rub has three components - a presystolic component produced by atrial contraction, a systolic component produced by ventricular contraction and a diastolic component resulting from rapid ventricular filling. The sounds so produced tend to be "scratchy" in quality, having a predominance of high frequency components.

2.5.12. Summary

The timing of the various main heart sounds and murmurs is summarised in the table below, showing stylised time-intensity graphs.

First Sound (S1); Second Sound (S2), with Aortic (A) and Pulmonic (P) components; Third Sound (S3); Fourth Sound (S4); Opening Snap (OS); Diastolic Murmur (DM); Systolic Murmur (SM); Continuous Murmur (CM).

(The stylised time-intensity graphs accompanying each entry are not reproduced here.)

2.5.13. Differential Diagnosis

The role which heart sounds have in differential diagnosis of cardiac pathologies is of fundamental importance to the possibility of their use in computerised diagnosis, or computer-assisted systems. It is not simply the raw heart sound signal in isolation from all other information and techniques which allows an unequivocal diagnosis. The actual sound signal must be combined with other information such as differences in the sound signal at different chest locations and with different pick-up devices, the effect of respiration, movement and drugs, in order to provide a more complete picture.

Examples of where associated information and techniques are useful for differential diagnosis are discussed below.

Example 1: The differentiation of physiologic and pathologic third heart sounds is not possible on the basis of sound alone, since they can be perceived as exactly alike. Clinical history is required in addition to distinguish the two.

Example 2: Opening snaps and third heart sounds are often confused in cases of mitral insufficiency, which tends to bring the third sound earlier in the cycle than it would otherwise occur. One method of differentiating is by the thudding quality of third sounds compared to the higher-pitched snaps of mitral stenosis. Also, third sounds are best heard at the apex, while the mitral opening snap is most distinct at the left sternal border.

Example 3: One can often distinguish between a split first sound and an ejection click of pulmonic stenosis by the diminution of the latter during inspiration.

Example 4: Amyl nitrite is a drug which induces a fall in systemic pressure, tachycardia and increased cardiac output. Its inhalation tends to enhance some murmurs while decreasing the amplitude of others, as summarised in the table below:-

Increase
  * most systolic ejection murmurs
  * diastolic mitral murmurs
  * tricuspid flow murmurs

Decrease
  * regurgitant systolic murmurs of mitral insufficiency and ventricular septal defect
  * diastolic murmurs of aortic insufficiency

As a result of these differential effects, the drug is particularly effective in distinguishing the following pairs of pathologies.

Increased Murmur        Decreased Murmur
aortic stenosis         mitral insufficiency
pulmonic stenosis       small ventricular septal defects
mitral stenosis         Austin Flint murmur of aortic insufficiency

Example 5: The Valsalva manoeuvre involves straining against a closed glottis. This reduces venous return to the right heart and so reduces ventricular stroke volume and ejection time. Left heart blood flow is lowered after a few seconds.

In normal subjects, the aortic and pulmonic components of the second heart sound move closer together and transiently increase upon release. With atrial septal defect, however, the manoeuvre has little or no effect.

The manoeuvre is also useful in distinguishing the side of origin of murmurs. During straining, all murmurs are diminished because all blood flow is reduced. After the release of respiratory pressure, flow of blood into the right heart rises very quickly, while left heart blood flow takes 4 to 10 beats to return to normal (the time taken to complete the pulmonary circuit). The few seconds of disparity in blood flow between the two sides of the heart allows murmurs to be more easily localised.

Example 6: Mitral and tricuspid stenosis can be distinguished by the fact that the sound quality is somewhat different, the latter having a "scratchier" quality. The amplitude of the murmur from tricuspid stenosis also tends to increase with inspiration and the localisation differs. Elevated venous pressure as evidenced by the jugular veins is a further indicator of this tricuspid defect.

Example 7: Frictional rubs can be confused with some types of systolic murmur. The occurrence of (sometimes transient) diastolic components allows differentiation. Of importance to this discussion, however, is that this may only occur in a small percentage of cardiac cycles.

The examples above highlight that more than a simple pattern recognition approach is needed if computer assisted cardiac diagnosis based on heart sounds is to be a useful standalone diagnostic tool. Clinical implementation would require multiple samples from a number of positions and under various conditions in order to distinguish some pathologies. A more detailed discussion of possible clinical implementations of a neural network-based cardiac auscultation system is to be found in Chapter 6.

Of course, evidence provided by such a tool would be additional to that obtained from other clinical techniques and investigations such as blood pressure measurements, clinical features, clinical history, echocardiography, X-ray, angiocardiography, ECG and apexcardiogram etc.

2.6. Auscultation and Phonocardiography

2.6.1. Auscultation

Heart sounds, as detected at the surface of the body, tend to have intensities and frequencies at the limit of human audibility. The overlap between heart sound production and human audibility is shown in Figure 2.5 [based on McKusick, 1958].

[Figure 2.5 plots sound pressure against frequency (Hz), showing the region of audible heart sounds and murmurs relative to the threshold of human hearing; the plot itself is not reproduced here.]

Fig 2.5: The relationship between the acuity of human hearing and heart sounds and murmurs.

Of particular importance in the consideration of human detection of heart sounds is the fact that the low frequency components of the sound tend to mask the higher frequency sounds [Zemlin, 1988], quite apart from the fact that the lower frequencies tend to have a larger amplitude.

Stethoscopes behave as mechanical filters and acoustic couplers in the audition of heart sounds. They occlude extraneous noise, reducing masking. The chest pieces are either of a diaphragm or bell type. The diaphragm is more efficient at higher frequencies and hence is used to detect heart sounds with high frequency components and for close splitting of sounds. The bell has better coupling characteristics at lower frequencies and is necessary for the auscultation of third heart sounds, for instance. The frequency pick-up characteristics of the bell are altered by the pressure with which it is applied to the skin. With very light pressure, the bell is most sensitive to lower frequencies. In effect, a diaphragm is created when the bell is applied with higher pressures.

Subject positioning also has a significant effect on the intensity of various heart sounds. A semi-recumbent posture brings the apex into direct contact with the chest wall, maximising the transmission of the diastolic murmur of mitral stenosis, for instance. On the other hand, sitting brings the base of the heart closer, allowing a more likely detection of diastolic murmurs resulting from pulmonary and aortic insufficiency. Other standard subject positions are the left lateral semi-recumbent and lying prone.

2.6.2. Phonocardiography

Traditionally, phonocardiography has involved the recording of heart sounds on magnetic tape and/or their display by cathode ray tubes or on paper. The rationale for this has included:-

a) Recording allows consistent and reproducible data for the purposes of teaching auscultation.

b) Electronic detection and amplification overcomes some of the limitations of the human auditory system. Electronic systems having a greater sensitivity and frequency response are able to elucidate signal components otherwise inaccessible and of potential diagnostic significance.

c) Graphic display of the heart sound signal allows a quantification of signal characteristics, such as the time between split sound components. Some quantifiable signal parameters have been correlated to pathological features. An example of this is the relationship between the time interval from the second sound to the opening snap in mitral stenosis and the pressure gradient across the valve.

d) Permanent records allow some aspects of a disease's progress to be charted over time.

e) Graphic displays bring into play a physician's visual pattern recognition skills as well as the auditory ones.

High and band pass filters are used in order to provide a measure of the energy spread in different frequency bands, as well as to allow the display of the small amplitude high frequency components which would otherwise be swamped by the enormous energy in the low frequency bands of a linear device.

Filtration is also used to simulate the sounds perceived by a physician, in order to teach auscultation. Such recordings, however, impose limitations on some of the benefits which are otherwise conferred by phonocardiography.

While the recording equipment can be easily calibrated in terms of intensity, this is of limited usefulness since heart sound intensity is also determined by less quantifiable factors such as microphone pressure and air-sealing and subject obesity, for instance.

The synchronization of phonocardiographic records with cardiac events is of importance, and this is normally achieved by a simultaneous ECG using standard limb lead II. Other parameters such as the jugular pulse and apexcardiogram have also been used. Respiratory patterns are also important for the interpretation of some sounds - such as split second sounds - and this is accomplished on some phonocardiographs by a separate recording channel.

While the standard ausculatory areas are those normally used in phonocardiography, other sites and methodologies have been employed. Of particular note is esophageal and intracardiac recording.

2.6.2.1. Spectrocardiography

Spectral analysis of cardiac sounds adds a frequency dimension to phonocardiography. Spectral analysers, similar to those used for speech analysis, have been used in order to provide more extensive frequency-content information than could be obtained by manually switching in a series of filters.

A particular application of phonocardiography was undertaken by Goncharova and Romanova [1988], who studied the changes in the spectral content of the first heart sound following a myocardial infarction, and suggested these resulted from changes in the acoustical properties of the myocardial tissue.

Leatham [ 1975] contends that limiting the recording to a single site, even if multiple low pass filters are employed, is of little practical value. He argues that a reasonable picture of cardiac function is obtained only by multiple, preferably simultaneous, recordings from a number of ausculatory regions.

2.6.2.2. Sonvelography

Another variant of traditional phonocardiography is sonvelography, in which the envelope of the logarithm of sound intensity is displayed [Rushmer, Bark and Ellis, 1952; Rushmer, Sparkman et al, 1952]. The rationale for this development was that the tracings so produced were of a simpler form, without the complication of high frequency oscillations, and thus were easier to interpret. It was acknowledged that frequency information was lost [Rushmer, Bark and Ellis, 1952(a)], but that "the exact frequency of heart sounds and murmurs is rarely considered in interpreting these sounds".*

Using such a record, Rushmer et al [1954] found it impossible to distinguish 3rd heart sounds and early diastolic murmurs in children. Frequency content is important in distinguishing these.

A similar system of using the sound envelope as the only input signal is applied in this study of heart sound recognition by neural networks.

2.6.2.3. Computer Recognition

Little work has been carried out on the application of computational methodologies to phonocardiography, let alone on systems of automated heart sound diagnosis. Wartak [1974] briefly describes a method for computerised determination of the duration and

* It should be emphasized that no frequency information of heart sound components is conveyed by such diagrams unless a series of filters is used, as in traditional phonocardiography.

frequency content of first and second sounds and other sounds and murmurs, using ECG data as a supplemental timing signal. The system involves traditional signal processing techniques of identifying the maximum slope of the ECG signal, setting thresholds for the phonocardiogram data, etc. While only a cursory outline is provided, the following points should be made:-

a) The sampling rates of 500Hz and 1000Hz are inadequate to include heart sounds up to the upper audible value of around 500Hz, let alone any extension beyond that enabled by the greater sensitivity of an electronic system at higher frequencies.

b) The methodology for determination of the relative frequency of a component by counting sampling points per second is fallacious for a system which samples at a constant rate.

Wartak concludes that further work is needed, but that the human eye and ear would probably be unsurpassed for their signal recognition capabilities.

Friedberg [1966] also alludes to the potential for a heart sound, fed into a computer, to be classified as normal or not "within a minute".

A great deal of effort has been expended on computerised analysis of ECG and heart rate signals [see Searle et al, 1988, for instance], driven by the requirements of pacemakers and remote patient monitoring. It would seem the dearth of work on computerised analysis of phonocardiographic records has resulted from the difficulty of their analysis, coupled with the fact that ultrasonography has largely overtaken their diagnostic value. Alternative diagnostic tools, coupled with the decline in valvular heart disease in recent decades, mean that phonocardiography is no longer performed at the three largest hospitals in Sydney, for example.

2.7. Conclusions

The effective use of auscultation, phonocardiography or spectrocardiography for cardiac diagnosis involves a complex interplay of:-

patient history
patient morphology
patient positioning
respiration
area(s) of auscultation
manoeuvres
administration of pharmaceuticals
comparison with other signals such as ECG
frequency response of equipment and filters used
frequency analysis (if any) of signal

It is only in some well-defined "text book" cases that the heart sound signal alone could be used to provide a fairly unequivocal diagnosis in the absence of a broader context provided by the information above.

3. The Heart Sound Signal - Pre-Processing and Characterisation

3.1. Synopsis

The materials and methods used to obtain heart sound recordings are reported. The computational methodologies used for their pre-processing are also described, as well as some techniques for the characterisation of heart sounds from different classes (ie. pathologies).

3.2. Introduction

The training of a neural network for effective heart sound recognition and identification is dependent on the quality of the training data used. "Quality" implies not only such factors as reasonable signal to noise ratios, but also samples which are representative exemplars of their class. A signal which excels in these parameters may still be useless for training a neural network if discriminatory features are not extracted when the overall information content has to be reduced in order to develop a network of trainable size.

3.3. Sources

The sources of heart sound recordings used in this study were clinical training tapes and records, and sounds obtained from subjects for this specific purpose using three different sets of equipment at three different locations. The advantage of using such a wide range of sources is that a neural network trained on these exemplars is more likely to learn to be device-independent and to reject specific machine artefacts and idiosyncrasies. However, limitations on numbers of suitable subjects with valvular disease meant that pre-recorded heart sounds formed almost all of the sound samples.

3.3.1. Pre-Recorded

Two sources of pre-recorded sounds were used, being those recorded by Tavel et al. and Barlow. These recordings are of excellent quality, having relatively high signal to noise ratios. A disadvantage is that most of the recordings tend to be fairly extreme examples of the pathologies represented.

3.3.2. Subject Recordings

Some preliminary investigations were made on 25 subjects using an adapted traditional stethoscope. One of the tubes connecting the diaphragm to an earpiece was disconnected and a small electret microphone mounted in a small piece of PVC tubing fitted to the neck of the diaphragm. This piece of equipment is shown in Figure 3.1.

Fig. 3.1: An adapted stethoscope used to obtain heart sound signals.

Not only was this arrangement simple, cheap and effective, it also allowed traditional (but monaural) auscultation to take place simultaneously with phonocardiography. The electret microphone was connected directly to a MacLab analogue to digital converter, which in turn was connected to a Macintosh Plus computer running MacScope version 3.0 data acquisition software. Data was sampled at a 2kHz rate. This was the fastest rate available under the software which still allowed a full cardiac cycle to be sampled at a time (1.2 seconds of data). An example of the data acquired with this method is shown in Figure 3.2.

[Figure 3.2 plots amplitude against time (ms) for roughly one second of recorded heart sound; the trace itself is not reproduced here.]

Fig. 3.2: Sample of heart sound data obtained with the Macintosh/MacLab system with a 2kHz sampling rate.

The system does have potential for future use in the acquisition of heart sound signals, since clinically based phonocardiography equipment is becoming very rare. Further advantages include the fact that the signal is digitised immediately, by-passing a tape intermediary with its attendant introduction of noise and limitation on response. Other data acquisition channels are simultaneously available, one of which could be used for an ECG timing signal, eliminating the need for the hand-marking of signals described in Section 3.4 below. Apart from a 50Hz low pass filter, no other filtering is available, and additional hardware would be required for this facility.

Since this data was not obtained in a clinical setting, and could not therefore be tagged with an expert diagnosis, only sounds which were clearly normal were used.

A few recordings were also made at the Cardiac Departments of St Vincent's and Royal Prince Alfred Hospitals (RPAH), Sydney, although only those from one subject were used in the training of neural networks in the work reported here.

At RPAH, an Irex System II M-mode Echocardiogram with an installed Heart Sound/Pulse module was used. Recordings were made with a low pass filter setting at 30Hz (6dB/octave roll off) and a high pass of 2kHz (24db/octave roll off). The Irex heart sound microphone used, while of dedicated design, was of unspecified characteristics. The "stethophone" output was used to record to cassette tape.

This unit was relatively old (approximately 15 years), had been decommissioned for a number of years and was prone to introducing electrical noise intermittently into the signal. It did have a comparatively high overall gain and was used in very quiet surroundings.

At St Vincent's Hospital, a state-of-the-art Hewlett Packard Sonos 1000 echocardiogram with a phonocardiograph unit was used. Recordings were made to video tape, or the output taken to a cassette tape recorder. The microphone did not rely on an air seal in the manner of traditional heart sound microphones, but consisted of a pressure transducer. The characteristics for this microphone were unspecified. The low pass filter was set to a 50Hz cut-off.

The physical conditions at St Vincent's Hospital were not optimal for phonocardiographic recordings. The echocardiogram itself generates a considerable amount of noise and the surroundings were also relatively noisy. These factors would have been more significant had a contact transducer not been used. Recordings made in this setting were only used in preliminary investigations not reported here.

3.3.3. Summary of Recordings Used

Table 3.1 provides a summary of subject information for the recordings used in this study.

I.D.  AGE  SEX  PATHOLOGY/SYMPTOM                                    RECORDING POSITION    SOURCE   No. CYCLES USED
A     16   M    normal                                               apex                  Tavel    5
B     16   M    normal, split S2                                     pulmonic              Tavel    10
C     21   M    normal, S3                                           apex                  Tavel    5
D     22   F    atrial septal defect, fixed split S2, systolic murmur  pulmonic            Tavel    5
E     10   F    ventricular septal defect, pan-systolic murmur       LLSB                  Tavel    5
F     47   F    congestive heart failure, pathologic S3              apex                  Tavel    5
G     32   M    normal                                               apex                  RPAH     7
H     28   M    normal                                               apex                  MacLab   1
I     36   F    severe mitral stenosis                               LSB                   Tavel    5
J     26   M    aortic insufficiency, diastolic murmur               left 3rd intercostal  Tavel    5
K     69   F    moderate mitral insufficiency, atrial fibrillation   apex                  Tavel    5
L     52   M    mitral prolapse                                      just medial to apex   Tavel    5
M     32   F    moderately severe mitral stenosis                    apex                  Tavel    5
N     17   F    mitral incompetence                                  apex                  Barlow   5
O     52   M    aortic insufficiency, some stenosis                  left 4th intercostal  Barlow   5

Table 3.1: Summary of heart sound data used to train and test neural networks in this study. Subject I.D. is used for reference in later chapters.

3.4. Digitisation

Heart sound recordings of between 4 and 8 seconds in length were fed into a Kay DSP speech computer at the Centre for Linguistics at Macquarie University, or an identical machine in the School of Communication Disorders at Cumberland College of Health Sciences. These machines are dedicated speech sound analysers.

The analog recordings are sampled by the Kay Sonograph at 10.04kHz. In the case of the machine at Macquarie University, a low pass filter set to 2.5kHz was interposed between the tape recorder and the sonograph.

Approximately 6 cycles of heart sound data, in the form of an amplitude-time graph, were displayed at a time. In the absence of an independent timing signal, such as an ECG, the cycles were individually hand marked, beginning from the onset of the first heart sound; or the fourth, where one existed. An example of a series of cardiac sounds, showing a marked cycle is shown in Figure 3.3.


Fig. 3.3: A series of consecutive heart sounds as displayed on the Kay DSP. Note the cursors delineating a single cardiac cycle beginning at the onset of the first sound.

Each cycle was then in turn down-loaded to an IBM AT compatible computer via a SCSI port link. Data was stored in a file in MSL format, which is a standard defined by Kay. The amplitude data was stored with a 12 bit resolution.

In the case of data obtained from the Macintosh-MacLab system, the above processes were by-passed altogether, except that it was still necessary to hand mark the cardiac cycles. The data were saved as text files and translated to MS-DOS format using Apple File Exchange software on a Macintosh IIci computer.

3.5. Pre-Processing

3.5.1. Rationale

A neural network is capable of extracting features important for classification, given enough training data and weights. In this case, where a single cardiac cycle will contain over 10,000 data points, it becomes necessary to perform some sort of information reduction. A process of feature extraction must be employed.

Since the main diagnostic criterion used by experts interpreting heart sounds is the timing of events, it was decided to use the amplitude of the heart sound envelope as the input signal for the neural network. This ignores the significant contribution of frequency, particularly in the differentiation of murmurs. Adding this parameter would have made the problem exceedingly computationally expensive, and it was not reasonably achievable given the accessible computer resources.

Using this approach, the logarithm of the amplitude envelope could also have been employed, as this is proportional to the intensity level of the sound - a parameter more closely associated with perceived loudness. This is the approach taken in sonvelography, described in Section 2.6.2.2.

3.5.2. Methods

Time-amplitude data for each cardiac cycle was contained in a file which was processed in the following way:-

1. A mean amplitude was calculated and all amplitude values were then expressed as the absolute value of their difference from the mean. ie. The "DC" component of the amplitude-time data was removed and the signal then "rectified". The amplitude values could then be squared to provide a measure proportional to sound energy, but this had little effect on the gross shape of the signal. A typical waveform is shown in Figure 3.4, and the result after "rectification" in Figure 3.5.

Fig. 3.4: A typical heart sound signal of a single cardiac cycle before "rectification". The horizontal bar represents 100ms.

Fig. 3.5: The heart sound cycle from Fig. 3.4 which has been "rectified".

2. A signal envelope was calculated by replacing all points in a 6.0ms window with the maximum value in that window. The window was moved along in jumps of 6.0ms, so that there was no overlap between consecutive windowed data. In order to smooth this rather quantised envelope, a moving window average was then calculated. In this case, the window was chosen to be 8.0ms, and moved in increments of 0.1ms (one data point for the Kay DSP processed sounds). The averaging procedure treated the signal as circular. ie. The end of the cycle was averaged with the beginning. This prevented large discontinuities between the beginning and end which may have arisen.

These values of window size were chosen since they (subjectively) preserved the main features of the heart sound intensity signal over a wide range of heart sound types. The resulting amplitude envelope for the signal above is shown in Figure 3.6. It should be noted that only every fifteenth or so data point can be displayed, and this may lead to apparent discontinuities between the waveform and the computed envelope where the latter has a larger value.

Fig. 3.6: The computed sound amplitude envelope for the cardiac sound signal of Fig. 3.4.

3. In order to reduce the information content of the signal, typically 10,000 data points, the envelope signal was sampled at 60 or 80 equally spaced points. These numbers were chosen to allow the identification of close splitting. These Nyquist sampling rates correspond to the identification of splits of 33ms and 25ms respectively. Since most auscultation literature suggests that trained physicians cannot identify splitting narrower than 20ms, these figures seemed reasonable. Visual inspection of the sampled envelopes confirmed that splits of significance were identified, and the 60 sample points were typically used as this allowed a greater reduction in network complexity when used as input.

The extracted envelope is shown in Figure 3.7 using both 60 and 80 samples. Note that the 80 samples have preserved little extra information of significance beyond the 60 samples and that the split first sound is clearly indicated in both cases.

[Two panels show the extracted envelope sampled at 80 points and at 60 points respectively.]

Fig 3.7: The heart sound amplitude envelope sampled 60 and 80 times.
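As a rough check on the split-resolution figures quoted in step 3, and assuming a representative cycle length of about 1 s (a value not stated explicitly above):

$$\Delta t_{60} = \frac{1.0\ \mathrm{s}}{60} \approx 16.7\ \mathrm{ms},\quad 2\Delta t_{60} \approx 33\ \mathrm{ms}; \qquad \Delta t_{80} = \frac{1.0\ \mathrm{s}}{80} = 12.5\ \mathrm{ms},\quad 2\Delta t_{80} = 25\ \mathrm{ms}$$

ie. under the Nyquist criterion the narrowest resolvable split is about twice the spacing between sample points, in agreement with the 33ms and 25ms figures.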

The fact that an equal number of sampling points was used over signals of differing duration was effectively a time-normalisation of the data.

4. The sampled cardiac envelope, or vector, was also normalised with respect to magnitude. This was important, since only the intra-signal amplitude variations are of significance. As has been discussed in Section 2.5.2, heart sound amplitude is dependent on many factors, most of which cannot be readily quantified from the final recorded amplitude. A code sketch of the complete pre-processing pipeline is given below.
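The following is a minimal sketch, in the style of the Turbo Pascal tools used in this project, of steps 1 to 4 above. It is not the thesis program CARDIAC.PAS (Appendix A); the constants, the synthetic input signal and all identifiers are illustrative assumptions only, chosen to correspond to a 10kHz sampling rate, 6ms and 8ms windows, and a 60-point output vector.

program EnvelopeSketch;
{ A minimal sketch of the pre-processing pipeline of Section 3.5.2,
  assuming a 10 kHz sampling rate.  The raw signal here is synthetic. }
const
  NRaw   = 10000;   { samples in one cardiac cycle (about 1 s at 10 kHz) }
  MaxWin = 60;      { 6.0 ms maximum-hold window }
  AvgWin = 80;      { 8.0 ms smoothing window }
  NOut   = 60;      { sampled envelope points fed to the network }
type
  RawVec = array[1..NRaw] of real;
  OutVec = array[1..NOut] of real;
var
  raw, env : RawVec;
  samp     : OutVec;
  mean, sum, mx, mag : real;
  i, j, k, start : longint;
begin
  { synthetic input signal - stands in for digitised heart sound data }
  for i := 1 to NRaw do
    raw[i] := sin(i / 20.0) * exp(-sqr((i - 1500) / 400.0));

  { 1. remove the DC component and "rectify" }
  sum := 0.0;
  for i := 1 to NRaw do sum := sum + raw[i];
  mean := sum / NRaw;
  for i := 1 to NRaw do raw[i] := abs(raw[i] - mean);

  { 2a. maximum-hold envelope over non-overlapping 6 ms windows }
  start := 1;
  while start <= NRaw do
  begin
    mx := 0.0;
    for i := start to start + MaxWin - 1 do
      if (i <= NRaw) and (raw[i] > mx) then mx := raw[i];
    for i := start to start + MaxWin - 1 do
      if i <= NRaw then env[i] := mx;
    start := start + MaxWin;
  end;

  { 2b. smooth with an 8 ms moving average, treating the cycle as circular }
  for i := 1 to NRaw do
  begin
    sum := 0.0;
    for j := 0 to AvgWin - 1 do
    begin
      k := ((i - 1 + j) mod NRaw) + 1;   { wrap around the end of the cycle }
      sum := sum + env[k];
    end;
    raw[i] := sum / AvgWin;              { reuse raw[] for the smoothed envelope }
  end;

  { 3. sample the envelope at NOut equally spaced points (time normalisation) }
  for i := 1 to NOut do
    samp[i] := raw[1 + ((i - 1) * (NRaw - 1)) div (NOut - 1)];

  { 4. normalise the sampled vector to unit magnitude }
  mag := 0.0;
  for i := 1 to NOut do mag := mag + sqr(samp[i]);
  mag := sqrt(mag);
  if mag > 0.0 then
    for i := 1 to NOut do samp[i] := samp[i] / mag;

  for i := 1 to NOut do writeln(i:3, samp[i]:10:5);
end.

The window lengths translate to 60 and 80 raw samples only under the assumed 10kHz rate; the actual values and file handling are those of Program CARDIAC.PAS.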

3.5.3. Program CARDIAC

The procedures above were performed using Program CARDIAC.PAS written in Turbo Pascal and run on IBM XT and 80386 compatible computers. The listing for this program is given in Appendix A.

3.6. Characterisation

Measures of the inter-class similarity of the envelopes obtained using the methods above were necessary in order to:- a) allow estimates of the neural network size required to learn to distinguish between different classes.

b) provide a measure of the difficulty in separating sounds from different classes.

While visualisation of sets of vectors in 60-space becomes somewhat problematic, three broad measures providing some indication of vector distribution were used, and these are discussed below.

3.6.1. Dimensional Overlap

For the purposes of illustration, imagine examples of vectors in 2-space from two different classes. If the vectors in the two classes overlap in only 2 - 1 = 1 dimensions, then the classes of vectors are linearly separable. However, if they overlap in both dimensions, the situation is ambiguous. Figure 3.8 a) shows two classes of points plotted in two dimensions, but overlapping in only one dimension, and therefore linearly separable. Figures 3.8 b) and c) are examples of classes overlapping in both dimensions, one being linearly separable (b) and the other not (c).


Fig. 3.8: a) linearly separable classes overlapping in only one dimension; b) linearly separable classes overlapping in 2 dimensions, c) classes which overlap in 2 dimensions which are not linearly separable.

In three dimensions, overlap between the different classes in less than the three dimensions means that the vectors are linearly separable (separable by a plane in this case).

This principle can be generalised to N-space. Two classes are separable by a hyperplane if the number of dimensions in which the components of all the vectors of one class falls within the range of the values of the components of all the vectors of the other class is less than N. This condition is sufficient but not necessary.

Let the set of vectors $V^1$ belong to the class $C_1$, i.e.

$$C_1 = \{V_1^1, V_2^1, \ldots, V_p^1\}$$

Similarly, let the second class, $C_2$, be the set of vectors $V^2$,

$$C_2 = \{V_1^2, V_2^2, \ldots, V_q^2\}$$

where vector $V_j^i = (x_1^{ij}, x_2^{ij}, \ldots, x_N^{ij})$ is the $j$th vector from the $i$th class.

If $x_{k\min}^i$ is the smallest value of the $k$th component of all the $p$ vectors $V^i$, and $x_{k\max}^i$ is the largest, then the following pseudo-code illustrates how the number of dimensions in which an overlap between classes $i$ and $j$ occurs can be counted:

    count <-- 0
    for k <-- 1 to N
        for s <-- 1 to q
            if $x_{k\min}^i \le x_k^{js} < x_{k\max}^i$
                then count <-- count + 1 and skip to next k
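Program HYPER.PAS (Appendix B) contains the working implementation of this count; purely as an isolated restatement, a sketch in the same Turbo Pascal style is given below. The type names, array bounds and the assumption that the per-component extrema of class i have already been found are illustrative.

{Sketch only: count the dimensions in which any vector of class j falls
 inside the component range spanned by class i. classMin/classMax hold the
 per-component extrema of class i; pat holds the q vectors of class j.}
const
  N = 60;
  maxVecs = 40;
type
  RealVec  = array[1..N] of real;
  PatArray = array[1..maxVecs] of RealVec;

function OverlapCount(var classMin, classMax: RealVec;
                      var pat: PatArray; q: integer): integer;
var
  k, s, count: integer;
  hit: boolean;
begin
  count := 0;
  for k := 1 to N do
  begin
    hit := false;
    for s := 1 to q do
      if (not hit) and (pat[s][k] >= classMin[k]) and
         (pat[s][k] < classMax[k]) then
        hit := true;                 {one overlapping vector is enough}
    if hit then
      count := count + 1;
  end;
  OverlapCount := count;
end;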

The above procedure will only indicate linear separability in some cases, and so is a very weak test. More sophisticated measures of clustering can be achieved with techniques such as principal component analysis [Linsker, 1988].

Theoretically, if classes are linearly separable, a single layer network will always be able to find a separating hyperplane as discussed in Section 1.2.2.

3.6.2. Average Inter-Class Correlations

The correlation between two normalised vectors can be calculated as their dot product, which is the same as the cosine of the angle between them. Thus a measure of inter-class similarity can be obtained from the average of the cosines of the angles between all the vectors in one class and all those in the second class [Jordan, 1986]. Similarly, intra-class similarity can be obtained by calculating the mean of this cosine for vectors within the one class. More formally, if there are two classes of vectors:-

$$C_A = \{V_1^A, V_2^A, \ldots, V_m^A\} \quad \text{and} \quad C_B = \{V_1^B, V_2^B, \ldots, V_n^B\}$$

then the average inter-class correlation can be calculated as:-

$$\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{V_i^A \cdot V_j^B}{|V_i^A||V_j^B|} \qquad \ldots (3.1)$$

It should be noted that $|V_i^A||V_j^B| = 1$ for normalised vectors.

3.6.3. Average Inter-Class Distance

The average Euclidean distance between vectors of different classes provides another measure of class separability [Kittler, 1986].

This average inter-class distance between classes $i$ and $j$, $D_{i/j}$, can be calculated by

$$D_{i/j} = \frac{1}{pq} \sum_{h=1}^{p} \sum_{k=1}^{q} \sqrt{\sum_{l=1}^{N} \left[ x_l^{ih} - x_l^{jk} \right]^2} \qquad \ldots (3.2)$$
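Both of these measures are computed by Program HYPER.PAS (Appendix B). As a compact restatement of Equations 3.1 and 3.2 for two classes of already unit-normalised vectors, the following sketch may be helpful; the procedure name, bounds and types are illustrative and it is not an extract from the actual program.

{Sketch only: average inter-class correlation (Eq. 3.1) and average
 inter-class Euclidean distance (Eq. 3.2) for unit-length vectors.}
const
  N = 60;                      {components per pattern vector}
  maxVecs = 40;
type
  RealVec  = array[1..N] of real;
  PatArray = array[1..maxVecs] of RealVec;

procedure ClassSimilarity(var a, b: PatArray; m, n: integer;
                          var avgCorr, avgDist: real);
var
  i, j, k: integer;
  dot, d: real;
begin
  avgCorr := 0;
  avgDist := 0;
  for i := 1 to m do
    for j := 1 to n do
    begin
      dot := 0;
      d := 0;
      for k := 1 to N do
      begin
        dot := dot + a[i][k] * b[j][k];     {cosine, since |a| = |b| = 1}
        d := d + Sqr(a[i][k] - b[j][k]);
      end;
      avgCorr := avgCorr + dot;
      avgDist := avgDist + Sqrt(d);
    end;
  avgCorr := avgCorr / (m * n);
  avgDist := avgDist / (m * n);
end;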

3.6.4. Program Hyper.pas

Program HYPER.PAS was written using Turbo Pascal and run on an 80386 based computer in order to calculate the above measures of inter-class similarity. A listing for this program is found in Appendix B.

The program reports on the number of dimensions in which classes overlap, the average inter-class correlations, average inter-class distances and provides the option of graphically displaying the range of values taken by vector components from two classes at a time. An example of the type of graphical display is shown in Figure 3.9 below.

[Figure 3.9: bar chart showing the range of values taken by all vectors in Class 1 and Class 2, plotted against the components of the vectors (1-60).]

Fig. 3.9: Graphical display produced by Program Hyper.pas showing the amount of overlap between sets of vectors from two different classes of heart sounds.

3.7. Conclusions

A wide range of facilities and techniques were employed to obtain suitable heart sound data for use in the training of neural networks. In particular, the simple adaptation of a standard stethoscope and electret microphone showed particularly good potential in this area.

A number of computational tools have been developed to effectively pre-process the signals to obtain sampled amplitude envelopes and to provide some measures of inter-class similarity.

4. Network Training

4.1. Synopsis

Some of the parameters affecting the learning of a limited range of heart sounds by neural networks were investigated. The parameters were the gain, the order of pattern presentation to the network, the size of the momentum term, the network topology and the nature of the training data. The results obtained point to some important considerations for the development of more extensive networks.

4.2. General Materials and Methods

Neural networks for heart sound recognition were simulated in software and run on a variety of hardware platforms, as availability allowed. Specifications of the computing systems used are outlined below:-

COMPUTER                                 MEMORY   OPERATING SYSTEM   COMPILER
10 MHz XT compatible, no co-processor    640kb    MS-DOS 3.3         Turbo Pascal 5.5
33 MHz 80386, no co-processor            4Mb      MS-DOS 4.01       Turbo Pascal 5.5
MicroVax II                              16Mb     VMS 5.2            Pascal 3.9

The networks were all 3-layer and completely connected (each node obtaining inputs from all nodes in the layer below and passing its output to all nodes in the subsequent layer). Training was by backpropagation, and the networks were implemented with and without momentum terms.
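The complete training code is listed in Appendices C and D. Purely for orientation, the fragment below sketches the forward pass through one such fully connected layer; the sigmoid activation and the treatment of the node threshold as a weight on a fixed input of 1.0 are standard assumptions and may not match the actual listings in detail.

{Sketch only: forward pass through one fully connected layer with a sigmoid
 activation; index 0 of each activation vector is held at 1.0 so that
 w[j][0] acts as the node threshold.}
const
  maxNodes = 64;
type
  ActVec    = array[0..maxNodes] of real;
  WeightRow = array[0..maxNodes] of real;
  WeightMat = array[1..maxNodes] of WeightRow;

procedure ForwardLayer(var inAct: ActVec; nIn: integer;
                       var w: WeightMat; nOut: integer;
                       var outAct: ActVec);
var
  i, j: integer;
  net: real;
begin
  inAct[0] := 1.0;
  outAct[0] := 1.0;
  for j := 1 to nOut do
  begin
    net := 0;
    for i := 0 to nIn do
      net := net + w[j][i] * inAct[i];
    outAct[j] := 1.0 / (1.0 + Exp(-net));    {sigmoid activation}
  end;
end;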

The versions run on the IBM XT compatible and 80386 computers allowed the option of graphic visualisation of the changes in network weights. A disadvantage, however, was that Turbo Pascal version 5.5 imposes a 64k limit on variable storage. In practice, this limited the maximum network size to 60:50:15:3 (60 input values in each heart sound, 50 first layer nodes, 15 second layer nodes and 3 third layer, output, nodes) with no momentum terms, and to somewhat smaller networks when trained with momentum.

During training, heart sound patterns were presented to the network in a fixed order, or chosen at random. Each pattern class had an associated desired output. Since three classes were tested at a time, and the output layer consisted of 3 nodes, the desired outputs were assigned to be {0,0,1}, {0,1,0} and {1,0,0}. The differences between desired and actual outputs were backpropagated after each pattern (as opposed to after each epoch).

After every 500 pattern presentations the network reported on progress on the training set. The sum of the squared deviations of the network's output from the desired output over all training patterns proved to be a useful measure of progress, though the average squared deviation could also have been used. The number of patterns in the training set correctly classified also provided a useful test of learning success.
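Both progress measures can be sketched as a single routine applied to stored desired/actual output pairs. The names and fixed bounds below are illustrative, and the routine is not an extract from the training programs; a pattern is taken as correct when the position of its largest actual output matches the position of the 1 in the desired output.

{Sketch only: sum of squared deviations (SSD) over the training set, plus a
 count of correctly classified patterns.}
const
  nOut = 3;
  maxPats = 15;
type
  OutVec = array[1..nOut] of real;
  OutSet = array[1..maxPats] of OutVec;

procedure TrainingProgress(var target, actual: OutSet; nPats: integer;
                           var ssd: real; var nCorrect: integer);
var
  p, k, bestT, bestA: integer;
begin
  ssd := 0;
  nCorrect := 0;
  for p := 1 to nPats do
  begin
    bestT := 1;
    bestA := 1;
    for k := 1 to nOut do
    begin
      ssd := ssd + Sqr(target[p][k] - actual[p][k]);
      if target[p][k] > target[p][bestT] then bestT := k;
      if actual[p][k] > actual[p][bestA] then bestA := k;
    end;
    if bestT = bestA then
      nCorrect := nCorrect + 1;
  end;
end;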

In all the investigations in this chapter, 3 different classes of heart sounds were used, each class consisting of 5 repeats from a single subject.

Listings of the Turbo Pascal (MULTI.PAS) and VAX Pascal (MY.PAS) versions of the programs are provided in Appendices C and D respectively. The additions required to implement a momentum term in the learning are shown in italics in the VAX version. A chart showing network structure and parameter identification is provided in Appendix E.

4.3. Factors Affecting Learning

4.3.1. Gain

Gain is the fraction of the error of the network output which is backpropagated to adjust the network weights. The effect of gain on the rate of learning was investigated for a neural network with no momentum term. The network, having a 60:40:20:3 topology, was presented with the same patterns chosen in random order. The heart sound classes were from subjects A, B and C. The network was trained at a fixed gain for 30 replicates. Since the network began with weights assigned at random, each replicate provided a different approach to learning. The results of the variation of sum of squared deviations (SSD) with number of iterations (pattern presentations) are shown in Figure 4.1.
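To make the role of the gain concrete, the fragment below shows the standard delta-rule adjustment for one output node of a sigmoid network: each weight is moved by the gain times the node's error signal times its input activation. It is a generic sketch under the usual backpropagation assumptions, not an extract from the training programs, and the names are illustrative.

{Sketch only: gain-scaled weight update for one sigmoid output node.
 delta = (target - outVal) * outVal * (1 - outVal); w[0] acts as the node
 threshold, driven by a fixed input of 1.0.}
const
  maxIn = 64;
type
  InVec     = array[0..maxIn] of real;
  WeightVec = array[0..maxIn] of real;

procedure UpdateOutputNode(var w: WeightVec; var inAct: InVec; nIn: integer;
                           outVal, target, gain: real);
var
  i: integer;
  delta: real;
begin
  delta := (target - outVal) * outVal * (1.0 - outVal);
  for i := 0 to nIn do
    w[i] := w[i] + gain * delta * inAct[i];
end;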

[Figure 4.1: ten panels, gain = 0.1 to 1.0; each panel plots SSD (0-20) against the number of iterations (0-6000).]

Fig. 4.1: Sum of squared deviations from the target output for a neural network being trained on 3 classes of heart sounds using different values of gains. Each training protocol was repeated 30 times and the heart sound patterns were presented in random order.

The learning rate of the network is clearly gain-dependent. Gains in the range 0.6 to 0.9 provide the fastest convergence rate, most trials having learnt to correctly classify the 15 training patterns with "zero" error after 4000 iterations. The difference between the actual and the target outputs is never zero, but it does become vanishingly small. The optimal range of gains represents a tuning of the step size of the network in error space to the average topology of the error surface*. Steps which are too large may lead the network to move uphill in error space, whereas a step of small enough size will track the true gradient more accurately. Very small steps, as shown for gains in the range of 0.1 to 0.2 in particular, reveal inefficient learning. In these cases, a typical series of steps follows the same gradient, and so could have been replaced by a single larger step down the error surface. At a gain of 0.1, none of the 30 trials reached an SSD of less than 0.1 even after 6000 pattern presentations.

In each of the 10 different gains tested, cases occur in which the network seems to have become stuck in local minima. Indeed, this is true of every different set of parameters tested in this chapter. These local minima may correspond to the network not being able to distinguish between the three classes of pattern at all, giving a relatively high residual error (an SSD of around 10), or to the misclassification of a single pattern, as is the case for the trial with a gain of 0.4 which showed a significant SSD after 6000 iterations. Such a pattern would represent an outlier for that class of patterns.

4.3.2. Order of Pattern Presentation

The same input patterns as used in Section 4.3.1 were presented again to the same neural network under the same range of gains, but with a fixed order of input. A pattern from each class was presented in turn, and this order remained constant. The variation of SSD with number of iterations is shown in Figure 4.2.

* It should be noted that in speaking of error surface topologies, it is really an average topology, since the actual error function changes with each pattern presented.

[Figure 4.2: ten panels, gain = 0.1 to 1.0; each panel plots SSD (0-20) against the number of iterations (0-6000), for ordered pattern presentation.]

Fig. 4.2: The effect of varying gain on the training of a neural network with 3 heart sound classes using an ordered presentation of patterns.

While the same general results as in the case of random pattern presentation were obtained, a comparison of Figures 4.1 and 4.2 shows that there is very little tendency for the network to increase in SSD. A random order of pattern presentation allows certain cases to occur with greater short-term frequency and so to have their error reduced at the expense of the total error for the complete set of training patterns. An orderly presentation of patterns represents a more efficient training strategy in this case.

4.3.3. Momentum

In order to study the effect of momentum on the rate of learning of a network, the patterns used in Sections 4.3.1 and 4.3.2 were presented again, and the multiplier for the momentum term, alpha (as in Equation 1.15), was varied from 0 (no momentum) to 0.9, while the gain was kept at a fixed value of 0.5 throughout. The topology of the network remained 60:40:20:3. In this study, patterns were presented in random order and 30 replicate training passes were used at each value of alpha. The results of SSD as a function of number of iterations are presented in Figure 4.3.
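Assuming Equation 1.15 has the usual form, the momentum modification can be sketched as follows: the previous change to each weight is remembered, and a fraction alpha of it is added to the new gain-scaled step. The extra array of previous changes illustrates the additional storage that momentum requires; the names are illustrative and the exact form used in MY.PAS may differ.

{Sketch only: weight update with a momentum term.
 change = gain*delta*input + alpha*(previous change).}
const
  maxIn = 64;
type
  RVec = array[0..maxIn] of real;

procedure UpdateWithMomentum(var w, prevDelta: RVec; var inAct: RVec;
                             nIn: integer; delta, gain, alpha: real);
var
  i: integer;
  change: real;
begin
  for i := 0 to nIn do
  begin
    change := gain * delta * inAct[i] + alpha * prevDelta[i];
    w[i] := w[i] + change;
    prevDelta[i] := change;        {remembered for the next presentation}
  end;
end;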

[Figure 4.3: ten panels, alpha = 0 to 0.9; each panel plots SSD (0-20) against the number of iterations (0-6000).]

Fig. 4.3: The effect of varying the size of the momentum term on the rate of learning for a network being trained on 3 heart sound classes with random order of pattern presentation.

Learning becomes unstable for higher values of alpha (0.8 to 0.9). There is a tendency towards faster rates of convergence with momentum when the value of alpha is in the range 0.2 to 0.6, though the benefits are not as marked as those obtained by an effective tuning of the gain term.

4.3.4. Topology

The relationship between network topology and the rate of network learning was investigated by training networks having various numbers of nodes in the first and second layers with the same set of patterns.

The first set of training patterns was chosen so that there was a low degree of inter-class correlation. The heart sounds were obviously quite different, being as set out below, together with the subject I.D.:-

systolic murmur (D)  class 1
normal, no or narrow splitting of S1 (A)  class 2
pansystolic murmur of ventricular septal defect (E)  class 3

Each class consisted of 5 sounds from the same patient, so there was a high degree of intra-class correlation. The amplitude patterns from each class are shown in Figure 4.4.


Fig. 4.4: Overlay of amplitudes for three heart sound classes, each with five repeats.

Using Program Hyper.pas, as described in Section 3.6.4, the measures of similarity between (and within) classes are tabulated below.

          Class 1   Class 2   Class 3
Class 1      60        50        56
Class 2                60        42
Class 3                          60

Table 4.1: Number of dimensions in which overlap occurs for the heart sound classes illustrated in Figure 4.4.

          Class 1   Class 2   Class 3
Class 1     0.80      0.69      0.64
Class 2               0.96      0.48
Class 3                         0.60

Table 4.2: Average inter- and intra-class correlations of heart sound classes illustrated in Figure 4.4.

          Class 1   Class 2   Class 3
Class 1     0.57      0.89      0.61
Class 2               0.57      0.93
Class 3                         0.29

Table 4.3: Average inter- and intra-class Euclidean distances for the heart sound classes illustrated in Figure 4.4.

A pairwise comparison, for heart sounds from these classes, of the range of values taken in each of the 60 input dimensions is shown in Figure 4.5. This is the type of graphical representation described in Section 3.6.4 and shown in Figure 3.9.


Fig.4.5: A pairwise comparison of the range of amplitude values taken by the heart sounds in the three classes under consideration.

Thus the input patterns may be summarised as fairly compact, widely separated clusters of points. Such a training set may be expected to be more easily learned than, say, classes with high intra-class variability and high inter-class correlations, corresponding to diffuse points with large amounts of overlap.

The training data was presented to networks having 60 input dimensions and 3 output nodes, but whose number of first layer nodes was 10, 30 or 50 and whose number of second layer nodes was 5, 10 or 15. Thus nine different topologies were tested.

The resulting number of weights in these networks varies by a factor of 5.7 between smallest and largest. The number of weights (including the nodal thresholds) in each topology is reported in Table 4.4 below.

                              No. 2nd layer nodes
                              5        10       15
No. 1st layer nodes   10     683      753      823
                      30    2003     2173     2343
                      50    3323     3593     3863

Table 4.4: Numbers of weight terms in neural networks having 60 inputs, 3 third layer nodes and varying numbers of first and second layer nodes.
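The entries in Table 4.4 follow directly from the layer sizes: with one threshold per node, the 60:10:5:3 network has (60+1)x10 + (10+1)x5 + (5+1)x3 = 683 weights. A small function reproducing the calculation is sketched below; the function name is illustrative only.

{Sketch only: number of weights, including one threshold per node, in a
 fully connected 3-layer network with nIn inputs and n1, n2, n3 nodes.}
function WeightCount(nIn, n1, n2, n3: integer): longint;
begin
  WeightCount := longint(nIn + 1) * n1 + longint(n1 + 1) * n2
                 + longint(n2 + 1) * n3;
end;

For example, WeightCount(60, 10, 5, 3) gives 683 and WeightCount(60, 50, 15, 3) gives 3863, matching the smallest and largest entries in the table.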

The networks had no momentum terms and the gain was set to a constant 0.5 throughout. The training patterns were presented in random order. Each topology was trained a total of 30 times and the results of SSD as a function of number of pattern presentations are shown in Figure 4.6. This figure is set out in a similar way to the table above, so that the number of first layer nodes increases from top to bottom, and the number of second layer nodes increases from left to right. The variable part of the topology label in each graph is underlined.

[Figure 4.6: nine panels, topologies 60:10:5:3 through 60:50:15:3; each panel plots SSD (0-20) against the number of iterations (x1000, 0-6).]

Fig. 4.6: Comparison of the effectiveness of training networks of different topologies. The numbers of nodes in the first and second layers were varied (these are underlined).

From Figure 4.6 it can be clearly seen that the most minimal network trialled (60:10:5:3) is able to converge to a solution, though the chances of this occurring are less than for the more extensive networks. This is not surprising, since these heart sound data are clearly linearly separable, and a single layer network is theoretically able to form a solution to a problem such as this.

The results indicate that the chances of the network forming viable decision boundaries do not vary significantly with greater numbers of second layer nodes within the range tested. This can be seen by comparing graphs in horizontal rows, noting that the number of instances in which the error term reduces to near zero is relatively constant.

On the other hand, greater numbers of first layer nodes measurably alter the chances of the network converging to a solution. Comparing graphs in vertical columns clearly shows this.

The above may well be explained by the fact that second layer nodes form unions of hyperplanes defined by first layer nodes, and that all the cases presented are essentially first layer-limited. There are always more than enough second layer nodes to form useful unions of hyperplanes, the ratio of first to second layer nodes never being much more than the 3 which is often used as a rule of thumb (See Section 1.2.2.1 on Network Size and Capacity).

It is only the likelihood of convergence which was topology-dependent. The rate of convergence for training trials which did have near zero residual errors was not significantly different for the different networks.

While convergence of the simpler topologies is clearly possible, they may not be the most computationally efficient to train due to their lower probability of achieving the desired results.

The trend towards an increased likelihood of convergence does not seem to have reached a plateau with 50 first layer nodes, and is worthy of further investigation, though this would be very time consuming.

4.3.5. Nature of the Training Data

In order to test the effect of the nature of the input data on the ability of different network topologies to form viable decision boundaries, the networks described in Section 4.3.4 above were retrained in an identical way, but using heart sound data which had a higher inter-class similarity. The working hypothesis was that heart sound classes which were more alike would require more complex topologies to separate them from each other, and would thus be more easily separated by more extensive topologies.

A second set of input data was used which consisted of the following classes of heart sounds, together with subject I.D.:-

normal, no or narrow splitting of S1 (A)  class 2 (same as in Section 4.3.4)
normal - no split sounds (B)  class 4
normal - physiologic splitting of S2 (B)  class 5

Classes 4 and 5 were sounds obtained from the same subject, differing only in position within the respiratory cycle. Class 2 consisted of sounds from another subject. Each class was composed of 5 samples. The time-amplitude traces for the normalised data are shown in Figure 4.7.

Fig.4.7: Amplitude traces for the second group of heart sound classes.

Measures of inter- and intra-class similarity for these classes of sounds are tabulated below.

          Class 2   Class 4   Class 5
Class 2      60        58        58
Class 4                60        52
Class 5                          60

Table 4.5: Number of dimensions in which overlap occurs for the heart sound classes illustrated in Figure 4.7.

          Class 2   Class 4   Class 5
Class 2     0.96      0.71      0.83
Class 4               0.83      0.60
Class 5                         0.87

Table 4.6: Average inter- and intra-class correlations of heart sound classes illustrated in Figure 4.7.

          Class 2   Class 4   Class 5
Class 2     0.57      0.85      0.71
Class 4               0.47      0.89
Class 5                         0.43

Table 4.7: Average inter- and intra-class Euclidean distances for the heart sound classes illustrated in Figure 4.7.

A pairwise comparison of the range of amplitude values taken by sounds in the different heart sound classes is shown in Figure 4.8.


Fig. 4.8: Pairwise comparison of the range of amplitudes taken by the heart sounds illustrated in Fig. 4.7.

The intuitive perception that these heart sound classes are more alike than the first set is borne out by the much larger correlation coefficients and higher degree of dimensional overlap. The inter-class distance measures do not provide any conclusive evidence for difference between the two groups of heart sound classes, however.

When the networks of different topology were trained under the same protocol as described in Section 4.3.4, the results of SSD as a function of the number of pattern presentations shown in Figure 4.9 were obtained.

[Figure 4.9: nine panels, topologies 60:10:5:3 through 60:50:15:3; each panel plots SSD (0-20) against the number of iterations (x1000, 0-6).]

Fig. 4.9: Comparison of training effectiveness for different network topologies using the highly similar training data shown in Fig. 4.7.


A comparison of Figures 4.9 and 4.6 does show that the second set of classes proved to be more difficult for the networks to separate. This is indicated by a number of features of the results obtained:-

1. The probability of convergence for the more homogeneous heart sound classes was reduced for all network topologies. In fact, no convergence was achieved for the three simplest topologies within the test limit of 6000 iterations.

2. The numbers of local minima would seem to be greater, as indicated by the increased likelihood of a network becoming trapped at one of a variety of residual error levels.

4.4. Conclusions

In all the investigations outlined, a very limited range of heart sound data was used. The random nature of the learning process requires that numerous repeats are made in order for patterns to emerge. This places increased computational burdens on a process which is already computationally intensive. The results of neural network training presented in this chapter alone represent approximately 460 hours of VAX II and 80386 CPU time. It should be noted that while the general principles elucidated here have wider applicability, the actual results obtained are highly specific to the heart sound data which has been used to train the networks. This is shown, for instance, by the differences in network topology required to achieve comparable results for heart sound classes of varying inter-class correlations.

The order of heart sound pattern presentation to the network is significant in as much as random presentations tend to display more erratic wanderings towards convergence, as a result of short-term inequalities in the frequency of presentation.

The benefits obtained by the use of a momentum term in any particular application would need to be weighed against the increased storage and computational overheads. In the cases presented here, when a network is trained on a small computer, the implementation of momentum may restrict network size and hence lead to a net reduction in learning efficiency. Despite a reduction in the average number of iterations required for a network to converge to a solution, it may find such solutions less often. It should also be kept in mind that each iteration in a network with momentum does require a greater number of operations.

Careful design and even limited investigations of network topology show that big improvements in the likelihood of convergence can be effected. Even though it may be possible for a simpler network to separate heart sounds of different classes, the likelihood of this being achieved is greater for networks with more degrees of freedom. This likelihood is also greater for heart sound classes showing lower inter-class correlation than for more homogeneous classes.

It is highly possible that there exist cross-couplings between topology, momentum and gain, but adequate investigation of these would require enormous computational resources. The results here show that the order of presentation, momentum and gain affect the rate of convergence of the network to a solution, while network topology affects the likelihood of achieving such convergence.

5. Neural Network Phonocardiograph Classifiers

5.1. Synopsis

A number of networks are trained to classify heart sound signals not previously encountered during training. Small amounts of data did not allow reliable classification in all cases; however, a simple normal/abnormal classifier is found to make use of sparser data sets to reliably classify unknown sounds.

5.2. Introduction

The study of neural network heuristics applied to the task of phonocardiography is useful in that it adds to the body of examples of problems with a poorly understood theoretical basis. However, it is not an end in itself, and the potential of a useful clinical tool can only be realised if a trained neural network can correctly classify heart sounds not before encountered.

5.3. General Materials and Methods

The sources of heart sounds and the pre-processing procedures described in Chapter 3 were used in the following studies. The neural networks, however, were run only on the 80386 based computer, as described in Section 4.2.

5.4. Factors Affecting Classification Success

5.4.1. Variable Topology

In order to test the possibility that network topology affects the ability of a network to correctly classify a previously unseen signal, networks of the different topologies described in Section 4.3.4 were trained on data, then presented with heart sound envelopes not included in the training set.

Networks were trained with and without momentum, with orderly and random pattern presentations and with a variety of gains. The test patterns were presented to the network only when the mean SSD on the training set was less than 0.04. This always corresponded to a perfect classification score on the training data.

The training data consisted of a total of 15 heart sounds from three different pathologies - one subject for each pathology. The training protocol was similar to the "train 2/3, test 1/3" principle, except that the networks were trained on 4 sounds from each of the 3 pathologies and then tested on the remaining 3 sounds (one from each pathology). This process was repeated a total of 3 times, with different training and test sounds.
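The rotation of training and test sounds can be expressed as a simple fold index over the 5 sounds of each class; the sketch below is illustrative only (the thesis used 3 of the 5 possible rotations), and the names are not taken from the actual programs.

{Sketch only: build the indices of a "train 4, test 1 per class" fold. fold
 selects which of the 5 sounds in each class is held out for testing.}
const
  soundsPerClass = 5;
  nClasses = 3;
  totalSounds = 15;          {soundsPerClass * nClasses}
type
  IndexList = array[1..totalSounds] of integer;

procedure SplitFold(fold: integer;
                    var trainIdx: IndexList; var nTrain: integer;
                    var testIdx: IndexList; var nTest: integer);
var
  c, s, idx: integer;
begin
  nTrain := 0;
  nTest := 0;
  for c := 1 to nClasses do
    for s := 1 to soundsPerClass do
    begin
      idx := (c - 1) * soundsPerClass + s;   {position in the full data set}
      if s = fold then
      begin
        nTest := nTest + 1;
        testIdx[nTest] := idx;               {one held-out sound per class}
      end
      else
      begin
        nTrain := nTrain + 1;
        trainIdx[nTrain] := idx;
      end;
    end;
end;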

It is important to note that both training and test data come from the same subjects, and so this is a very weak trial of the network's ability to form a general solution.

The training data used consisted of the following 3 classes:-

systolic murmur (D)  class 1 (as in Section 4.3.4)
normal, no or narrow splitting of S1 (A)  class 2 (as in Section 4.3.4)
pathological S3 (F)  class 7

Using Program Hyper.pas, as described in Section 3.6.4, the measures of similarity between (and within) classes are tabulated below. Classes tested in this section (classes 1, 2 and 7) are shown in bold, while class 3 is referred to in Section 5.4.2 in another test.

          Class 1   Class 2   Class 7   Class 3
Class 1      60        50        41        56
Class 2                60        43        42
Class 7                          60
Class 3                                    60

Table 5.1: Number of dimensions in which overlap occurs for the heart sound classes used in Sections 5.4.1 and 5.4.2.

          Class 1   Class 2   Class 7   Class 3
Class 1     0.80      0.69      0.58      0.64
Class 2               0.96      0.63      0.48
Class 7                         0.87
Class 3                                   0.60

Table 5.2: Average inter- and intra-class correlations of heart sound classes used in Sections 5.4.1 and 5.4.2.

          Class 1   Class 2   Class 7   Class 3
Class 1     0.57      0.89      0.91      0.61
Class 2               0.57      0.95      0.93
Class 7                         0.44
Class 3                                   0.29

Table 5.3: Average inter- and intra-class Euclidean distances for the heart sound classes used in Sections 5.4.1 and 5.4.2.

The number of successful classifications of test patterns for this input and test data over the total presented for each topology is summarized in Table 5.4.

                              No. 2nd layer nodes
                               5        10       15
No. 1st layer nodes   10      6/6      9/9      9/9
                      30     14/15     8/9      9/9
                      50      9/9      9/9      9/9

Table 5.4: Number of test sounds correctly classified / number presented for neural networks of various topologies using heart sound classes 1, 2 and 7.

It can be seen from these results that all networks, irrespective of topology, were highly successful in classifying the heart sounds not previously encountered. Within the range of topologies tested, there does not seem to be a loss of ability to correctly generalise with larger numbers of hidden units, as found by Chi and Jabri [1990] in their tests of ECG data classification.

5.4.2. Training Data

In order to test how the network's ability to generalise depended on the training data, the procedure described in Section 5.4.1 above was repeated using the data in classes 1, 2 and 3. A comparison of the inter-class similarity measures in Tables 5.1, 5.2 and 5.3 shows no clear-cut difference between classes 1, 2 and 3 as compared to classes 1, 2 and 7. Class 3 does, however, represent a somewhat more dense clustering of vectors, as indicated by the Euclidean distance measures.

The results of successful classification of unknown signals for the different topologies are summarised in Table 5.5.

                              No. 2nd layer nodes
                               5        10       15
No. 1st layer nodes   10      1/3      2/3      0/3
                      30      3/3      1/3      1/3
                      50      1/3      0/3      3/3

Table 5.5: Number of test sounds correctly classified/ number presented for neural networks of various topologies using heart sound classes 1, 2 and 3.

While only consisting of a few trials, these results do show that the network has not been able to generalise. It should be emphasized that the unknown sound signals are used as test input only after the network can correctly classify all the training samples.

Despite the fact that the only difference between the investigation in Section 5.4.1 and this section is the interchange of classes 7 and 3, in these cases the networks often became "overtrained". The decision boundaries will follow tightly around the training data, perhaps even isolating individual points which under other circumstances would be grouped together to form more general regions. As a result, the network has learnt the training data to the exclusion of other, similarly classified data in the test set. Since the actual positions of the decision surfaces vary somewhat, a variable response to the test data is obtained. It is perhaps the denser clustering of points in class 3 which allows a "tighter" fit of the decision surface to the training points.

5.5. A Normal/Abnormal Classifier

A useful intermediary step to a phonocardiograph classifier providing a complete pathological/symptomatic categorisation would be a device which simply flags an abnormal state, as suggested by Friedberg [1966].

A neural network was trained with this ideal in mind. The training and testing data, while still of fairly limited extent, did come from a wider range of sources and subjects than that used in the other networks described thus far. A total of 48 different heart sounds, coming from 9 different subjects, were used to train the network, and 13 different sounds from 3 different subjects were used as the testing set.

In each case, the sets of sounds were subdivided into normal and abnormal, though the latter case consisted of a mixture of pathologies such as mitral incompetence, mitral stenosis and aortic incompetence.

A graphic summary of the sound envelopes used to train the network is shown in Figure 5.1, and the test data is shown in Figure 5.2.

[Figure 5.1 panel labels: Normal / Abnormal; mitral stenosis, fixed split S2, systolic murmur, aortic insufficiency, mitral insufficiency, mitral insufficiency with prolapse, mitral stenosis.]

Fig. 5.1: Training data used for the normal/abnormal classifier. Signals are grouped together by subject, the number of signals in each case is indicated, as is the subject I.D. The distinguishing abnormality for subjects is indicated where appropriate.

[Figure 5.2 panel labels: Normal (3 signals); Abnormal: mitral insufficiency (5 signals), aortic insufficiency (5 signals).]

Fig.5.2: Test data for the normal/abnormal classifier. Signals are grouped together by subject, the number of signals in each case is indicated, as is the subject I.D. The distinguishing abnormality for subjects is indicated where appropriate.

Since only a partitioning of the input data into two classes was required, networks only having two output nodes were used. Based on the results of Chapter 4, a network having 50 first layer and 5 second layer nodes was used as a likely optimal training configuration.

The training data was presented in random order, because of the large number of examples used, until a 90% correct classification on the training set was achieved. The network achieved this within a total of 6000 iterations on the second training attempt. Under these conditions, the network correctly classified 12 out of the 13 test heart sounds.

When the requirement of 95% accuracy on the training data was set, the network achieved this on the first attempt and correctly classified all 13 heart sounds in the test set.
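Two small illustrative fragments summarise the stopping rule and the final decision: training continues until the fraction of correctly classified training patterns reaches the chosen criterion (0.90 or 0.95 here), and a test sound is then flagged abnormal when the "abnormal" output node exceeds the "normal" one. The node ordering and function names are assumptions, not extracts from the actual programs.

{Sketch only: training stopping criterion and two-node normal/abnormal decision.}
function CriterionReached(nCorrect, nPats: integer; criterion: real): boolean;
begin
  CriterionReached := (nPats > 0) and (nCorrect / nPats >= criterion);
end;

function IsAbnormal(normalOut, abnormalOut: real): boolean;
begin
  IsAbnormal := abnormalOut > normalOut;
end;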

The network was more reliably trained on this data than on some of the more homogeneous data used in previous studies. The fact that the training set was more than three times larger meant that individual patterns were only presented at one third the frequency for the same number of iterations.

5.6. Conclusions

In general, this study illustrates the importance of training neural networks on large, heterogeneous data sets. Reliable classification is not guaranteed when small, homogeneous data sets are used to train large networks.

A normal/abnormal heart sound classifier does not make the same demands on the amount of training data as would a more comprehensive classification system. Sparse data from a wide range of pathologies are useful in allowing the network to generalise effectively.

6. Conclusions

This study has shown that the use of neural networks in the recognition of heart sounds is an approach with significant potential.

The use of a modified stethoscope proved to be an effective and cheap method of obtaining digitised heart sound signals, while the sound envelope showed itself to be a feature useful in distinguishing the pathologies tested.

Training of a network with a large amount of heterogeneous data was found to be particularly fruitful, allowing the network to adequately generalise and so correctly classify unknown test signals as normal or abnormal.

With any up-scaling of the networks used here in terms of size or scope of training data, it would be worthwhile making preliminary investigations into training parameters. The network topology and gain term were of particular importance in efficient training, while the use of a momentum term allowed small improvements in training speed, but at the expense of maximum attainable network size under memory-limited conditions.

Previous reports of greater learning efficiencies being obtained by using more than the theoretical number of hidden units were confirmed. The topology having 50 first and only 5 second layer nodes proved to be the most likely to learn the training data used in this study. Greater numbers of second layer nodes did not improve the network's likelihood of convergence and in fact degraded its speed due to larger numbers of weights.

A fixed order of pattern presentation was found to be more efficient for small training sets. The effect of order of pattern presentation for larger training sets was not investigated, but it would seem reasonable to presume that it is less critical with data originating from numerous sources. Random selection of inputs from an orderly list of pattern classes may be an effective method of presenting patterns in this situation.

6.1. Future Directions and Applications

The large networks required to provide sufficient temporal resolution of the heart sound signal impose enormous computational demands on training. While technically possible, training of such large networks on a clinically significant range of heart sounds is impractical on the small computer systems used in this study.

An associated limitation is that large networks with large numbers of weights would require large numbers of samples from many sources in each class to be able to generalise sufficiently well to be of practical use. A general diagnostic neural network is of no clinical use if it learns individual cases to the exclusion of others with a similar classification.

One way to reduce the size of the network may be to perform a greater amount of feature extraction on the heart sound data before presenting them to the network. In a sense, this is doing some of the "network's work" for it, but perhaps more efficiently. For example a peak detection pre-processor which locates temporal location, height and a single measure of peak spread, coupled with the main frequency component in each peak may provide significant data reduction without a significant loss of diagnostic discrimination. Currently Fourier pre-processing of the signal is being employed in order to see whether frequency feature extraction speeds up the training process, or leads to more accurate recognition because of better clustering of exemplars in a class.
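As a purely hypothetical illustration of the kind of peak-detection pre-processor suggested here, the sketch below reduces a sampled envelope to (position, height, width) triples for its local maxima above a threshold. None of this exists in the thesis software; the feature set, threshold and names are assumptions, and the frequency component mentioned above is omitted.

{Hypothetical sketch only: reduce a sampled envelope to simple peak features
 (sample index, height, and width above half the peak height).}
const
  maxSamples = 100;
  maxPeaks = 10;
type
  SampleVec = array[1..maxSamples] of real;
  Peak = record
    position: integer;
    height: real;
    width: integer;
  end;
  PeakList = array[1..maxPeaks] of Peak;

procedure ExtractPeaks(var env: SampleVec; n: integer; threshold: real;
                       var peaks: PeakList; var nPeaks: integer);
var
  i, j: integer;
begin
  nPeaks := 0;
  for i := 2 to n - 1 do
    if (nPeaks < maxPeaks) and (env[i] >= threshold) and
       (env[i] > env[i - 1]) and (env[i] >= env[i + 1]) then
    begin
      nPeaks := nPeaks + 1;
      peaks[nPeaks].position := i;
      peaks[nPeaks].height := env[i];
      {width: contiguous samples around the peak above half its height}
      j := i;
      while (j > 1) and (env[j - 1] >= env[i] / 2) do
        j := j - 1;
      peaks[nPeaks].width := i - j + 1;
      j := i;
      while (j < n) and (env[j + 1] >= env[i] / 2) do
        j := j + 1;
      peaks[nPeaks].width := peaks[nPeaks].width + (j - i);
    end;
end;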

A more fruitful line of attack may be the use of wavelet theory, a recently applied technique which is proving more compact for transient signal representation than Fourier analysis [Wallich, 1991, for a brief introduction].

Having stated that limitations on computing power necessitate a data reduction for ease of training, it should be kept in mind that this is true only if network heuristics are of interest. If a diagnostic tool is the aim, then given an extensive and representative sample set, training would only have to be performed once, on a super-computer or purpose-built neural processing board, the weights then transferred to very modest machines having greater accessibility, for diagnostic use.

The use of simultaneous recording from multiple auscultatory sites is obviously advantageous in that it not only provides more information, but would also eliminate an element of pre-judging of the diagnosis by an operator. As has been discussed, particular pathologies are best auscultated at particular regions, and this requires expert knowledge. Multiple channel recording would mean that this aspect would be incorporated within the network. Of course, four auscultatory sites implies a fourfold increase in data, and an extremely large neural network, with the necessity of a very much larger database of multiple channel heart sound recordings.

A compromise would be to retain a neural network of relatively modest size, adding an input for auscultatory site, and train it on data from multiple recording positions. If implemented as a diagnostic tool, the heart sound data from each site could be consecutively used as network input, searching for classifications in each case, and the results of each being presented.

A real-time diagnostic tool would also require the identification of heart sound cycles. An ECG signal could provide this. It would not be necessary to use data from this as network input, but simply to allow the delineation of cycles as part of pre-processing.

Real heart sounds do not necessarily exhibit one-to-one correspondences with simple pathologies. Subjects may in fact exhibit multiple pathologies, which may also result in masking or enhancement. Unless a network has been trained with similar exemplars, it may flag the subject as having one or other (or possibly neither) of the component symptoms. In this sense, heart sound recognition is a task more akin to natural language processing, where an "intelligent" parsing is required, rather than the more basic task of phoneme/word recognition. These so-called syntactic methods have been applied to the analysis of EEG and carotid pulse waveforms, for instance [Shiavi and Bourne, 1986].

Neural networks have a demonstrated ability in word recognition, but more traditional artificial intelligence systems have tended to hold sway in natural language processing. This work in speech recognition may indicate that a combination of a neural network pre-processor with a more traditional artificial intelligence engine may be a fertile approach for heart sound recognition. Such an approach would allow other diagnostic indicators, such as those discussed in Section 2.5.13, to be incorporated.

The extraordinary power of neural networks only becomes fully realised with hardware implementations, so that the speed advantages of their parallelism are not lost. Purpose-built neural network chips are becoming more common, and such a device with the trained weights already pre-loaded is not an unrealistic possibility in the near future. Single channel signals of the resolution used in this study could already be analysed on simple personal computers in real time (or close to it). Multi-channel analysis would, however, require the added speed conferred by neural chips for real-time analysis.

Appendix A - Program CARDIAC.PAS

program cardiac;

{This program pre-processes files made up on the Kay Sonograph - in MSL format. Having read the file, the program performs a rectification of the data, ie. the absolute value of all data points - mean is taken.

The max value in a window then replaces all data points within the window - effectively a low pass filter, then a moving window average is performed to smooth the data.

The values at the beginning and the end of the file are spliced (averaged) together.

This resultant waveform is then sampled at some determined interval and the data stored in simple text format.

The processes are shown graphically as they occur.} {------}

{$M 64000,0,655360} {make room!}

Uses Crt, Dos, Graph;

CONST maxpoints = 15000; headlength = 48; maxenv = 100;

TYPE vector= ARRAY[1 .. maxpoints] of word;

VAR filelist : text; {file containing file names to be processed} filename, numstr: string; graphdriver, graphmode : integer; {graphics system parameters} xmax, ymax: integer; {max no. pixels on screen} header: ARRAY[1 .. headlength] of byte; {MSL format header} points, env : vector; {storage vectors for data} samplerate, num, max, min, mean, slices, step : integer; {sampling rate for data, number data points, max, min and mean values, number of samples for final waveform, step size to move window}

{------}
procedure filer(filename : string);

{reads in files to be processed, then reads data from file, types out various header info} VAR i nfile : file; outfile : text; numread : word; buf: byte; filedate, filetime, samplestring, databits, numpoints : string; filelD : string; i, code : integer; total : longint;

BEGIN filename := Concat(filename,'.spl'); Assign(infile,filename); Reset(infile, 1 ); BlockRead(i nfile, header, SizeOf (header), nu mread);

filelD := "; FOR i := 33 to 48 DO BEGIN filelD := filelD + Char(header[i]); END; writeln(filelD);

filedate := "; FOR i := 1 to 1 0 DO BEGIN filedate := filedate + Char(header[i]); END; writeln('Date: ',filedate);

filetime := "; FOR i := 11 to 18 DO BEGIN filetime := filetime + Char(header[i]); END; Writeln('Time: ', filetime);

samplestring := "; FOR i := 19 to 24 DO BEGIN samplestring := samplestring + Char(header[i]); END; Val(samplestring,samplerate,code); writeln('Sampling rate: ',samplerate);

105 databits := "; FOR i := 25 to 26 DO BEGIN databits := databits + Char(header[i]); END; writeln('Databits: ',databits);

numpoints := "; FOR i:= 27 to 32 DO BEGIN IF header[i] > 47 THEN numpoints := numpoints + Char(header[i]) ELSE END; Val(numpoints,num,code); writeln('Number of data points: ',num);

BlockRead(infile,points,num*2,numread); Close(infile);

max := points[1]; min := points[1]; total := O; FOR i := 1 to num DO BEGIN total :=total+ points[i]; IF points[i] > max then max := points[i]; IF points[i] < min then min := points[i]; END;

mean := Round(total/num); Writeln('max. value: ',max,' min. value: ',min); Writeln('Mean value: ',mean);

END; {------} Procedure init; {initializes graphics system}

BEGIN DetectG raph(graphdriver, graph mode); lnitgraph(graphdriver,graphmode,'c:\pascal'); xmax := GetMaxX; ymax := GetMAxY; END; {------} Procedure displayname(filename : string); {displays name of file} BEGIN SetT extStyle(DefaultFont, HorizDir,2); OutTextXY(100,20,filename); SetTextStyle(DefaultFont,HorizDir, 1);

106 Str(num,numstr); OutTextXY(100,50,numstr +' points'); END; {------} Procedure plot(points : vector; plotcolor: word); {plots the graph}

CONST yscale = 0.8; {scaling factor} VAR i, k : integer; yval, meanline : integer; {meanline is the mean screen position}

BEGIN SetColor(plotcolor); yval := Round((points[1 ]-min)/(max-min)*yscale*ymax); MoveTo(1,ymax-yval); step:= Trunc(num/xmax); meanline := Round((mean-min)/(max-min)*yscale*ymax);

k := 1; i := k*step; WHILE i < num DO BEGIN yval := Round((points[i]-min)/(max-min)*yscale*ymax); LineTo(k,ymax-yval); {Line(k,ymax-meanline,k,ymax-yval); Alternate graphics display} k := k + 1; i := k*step; END;

SetColor(12); Line(0,ymax-meanline, xmax,ymax-meanline); END; {------} Procedure envelope(window : integer); {peak detection - performs crude low pass filtering} VAR i, j, k : integer; halfwindow: integer; big : integer;

BEGIN step := window; halfwindow := Trunc(window/2);

107 FOR i := 1 TO num DO {rectify} BEGIN points[i] := Abs(points[i]-mean)+ mean; END; plot(points, 14); {show result}

{search for biggest value in a window and replace all values in the window with it} k := O; REPEAT big:= points[k*window+1]; FOR i := 2 TO window DO BEGIN IF big< points[k*window+i] THEN big := points[k*window+i); END;

FOR i := 1 TO window DO BEGIN env[k*window+i] := big; END; k := k + 1; UNTIL k = Trunc(num/window);

{deal with beginning and end of data} IF (k-1 )*window+ 1 < num THEN BEGIN big := points[(k-1 )*window+ 1]; FOR i := (k-1 )*window+2 TO num DO BEGIN IF big< points[i] THEN big := points[i]; END;

FOR i := (k-1 )*window+2 TO num DO BEGIN env[i] := big; END; END; plot(env, 15);

END; {------} procedure smoother(window : integer);

{performs moving window average on data}

VAR i, j : integer; halfwindow : integer; sum : longint; negpoints, ovrpoints : integer;

108 BEGIN halfwindow := Trunc(window/2);

{treat ends of data differently} FOR i := 1 TO halfwindow DO BEGIN negpoints := halfwindow + 1 - i; sum:= O; FOR j := 0 TO negpoints-1 DO BEGIN sum :=sum+ env[num-j]; END;

FOR j := 1 TO i+halfwindow DO BEGIN sum :=sum+ env[j]; END; points[i] := Round(sum/window); END;

ovrpoints := O; FOR i := num-halfwindow+ 1 TO num DO BEGIN ovrpoi nts := ovrpoi nts + 1 ; sum:= O; FOR j := 1 TO ovrpoints DO BEGIN sum :=sum+ env[j]; END; FOR j:= i-halfwindow TO num DO BEGIN sum :=sum+ env[j]; END; points[i] := Round(sum/window); END;

{main chunk of data} FOR i := halfwindow+ 1 TO num-halfwindow DO BEGIN sum:= O; FOR j := i-halfwindow TO i+halfwindow DO BEGIN sum :=sum+ env[j]; END; points[i] := Round(sum/window); END; plot(points, 15); {show progress} END; {------} 109 Procedure store_env(filename : string; slices : integer);

{store results of processing}

CONST yscale = 0.8; VAR outfile1 : text; nameoutfile, txtoutfile, binoutfile : string; i, j, slicestep, marker, yval : integer; normenv : ARRAY[1 .. maxenv] of integer; norm : ARRAY[1 .. maxenv] of real; minenv, maxenv, meanenv : integer; sum : longint; sumnorm : real;

BEGIN {ChDir('insert suitable directory name here');} Delete(filename,Pos('SPL',filename),3); {strip of SPL extension} txtoutfile := filename + 'ENV'; {add ENV extension} Assign(outfile1 ,txtoutfile); Rewrite(outfile1 );

slicestep := Round(num/slices); {number of points per final sampled point} MoveTo(0,ymax-Round((points[slicestep]-min)/(max-min)*yscale*ymax));

{show positions on graph of final sampled points - show as little boxes} FOR i := 1 to slices DO BEGIN j := i*slicestep; normenv[i] := pointsu]; marker:= RoundU/step); yval := Round((pointsu]-min)/(max-min)*yscale*ymax); Bar(marker-2,ymax-yval-1,marker+2,ymax-yval+ 1 ); LineTo(marker,ymax-yval-50); END;

sum:= 0; minenv := points[1 ]; maxenv := points[1 ];

FOR i := 1 TO slices DO BEGIN sum := sum + normenv[i]; IF minenv > normenv[i] THEN minenv := normenv[i]; IF maxenv < normenv[i] THEN maxenv := normenv[i]; END;

110 meanenv := Round(sum/slices);

{normalise the vector} sumnorm := 0; FOR i := 1 TO slices DO BEGIN norm[i] := (normenv[i] - minenv)/max; sumnorm := sumnorm + norm[i]*norm[i]; END;

sumnorm := Sqrt(sumnorm);

FOR i := 1 TO slices DO BEGIN norm[i] := norm[i]/sumnorm; writeln( outfile 1 , norm[i]); END; Close(outfile1 ); {ChDir('return to your other directory');} END; {------} BEGIN {ChDir('what directory?');} Assign(filelist, 'envfile.lst'); Reset(filelist); WHILE not Eof(filelist) DO BEGIN Readln(filelist,filename); filer(filename); init; ClearViewPort; displayname(filename); plot(points,3); envelope(61 ); {61 represents a 6ms window for a 10kHz sampling rate} smoother(81 ); {81 represents a 8ms window for a 10kHz sampling rate} delay(2000); {take time to view graph} slices := 60; store_env(filename, slices); Closegraph; {close down graphics system} END; readln; END.

Appendix B - Program HYPER.PAS

Program hyper; {------}

{Provides an indication of the similarity of different classes of input files.

This is achieved with the following measures:-

1. The number of dimensions in which there is overlap between all the data in one class compared to all the data in another class.

2. An average inter-class dot product is computed and displayed in a table. Such a dot product is the correlation between two sets of numbers (or the cosine of the angle between the two vectors). See Jordan, M.I. "An Introduction to Linear Algebra in Parallel Distributed Processing", chapter 9 of Parallel Distributed Processing, vol. 1, by Rumelhart and McClelland.

3. The average Euclidean distance between input patterns in different classes.

The program then allows the graphical display of the range of values - min to max for any pairs of chosen classes.

Input file of file names is in form filename fileclass filename fileclass .... etc

At startup, the number of different file classes is requested.}

{------}

Uses DOS, CRT, GRAPH;

CONST dimension= 60; {number of terms in each pattern vector} VAR x, tempdist: real; i, j, classcount, numclasses : integer; hyperfile : text; {hyperfile contains file names of pattern vector, envfile contains actual file data} classname, In : string; numfiles : integer; {total number of pattern files} filename : string; fileclass : integer; pattern : ARRAY[1 .. 40, 1 .. dimension] OF real; {storage for all patterns} 112 classpattern : ARRA Y[1 .. 40] OF integer; {pattern classification for files} filecorr : ARRAY[1..40, 1. .40] OF real; {inter-file correlations} classcorr : ARRAY[1 .. 10, 1 .. 1O] OF real; {inter-class correlations} dist: ARRAY[1 . .40, 1. .40) OF real; {inter-file distances} classdist : ARRAY[1 .. 10, 1 .. 10) OF real; {inter-class distances} extrema : ARRAY[1 .. 10) OF RECORD classname : string; {name for pattern type} min : ARRAY [1 .. dimension] OF real; {min value for class} max : ARRAY [1 .. dimension] OF real; {max value for class} mean: ARRAY [1 .. dimension] OF real; {mean value for class} meandist : real; {average vector length} maxdist : real; {max vector length} dim : integer; flag : boolean; {keep track of 1st file in class} filesinclass : integer; {no. files in class} END; xmax, ymax, graphdriver, graphmode : integer; {graphics parameters} pic1 , pic2 : integer; exit : boolean; eh: char;

{------} Procedure init; {initializes graphics system} BEGIN DetectGraph(graphdriver, graphmode); lnitgraph(graphdriver,graphmode,'c:\pascal'); xmax := GetMaxX; ymax := GetMAxY; END; {------} Procedure min_max;

{Reads list of pattern file names 'hyper.lst'. Opens each pattern file in turn - 'envfile', reads it, stores data, determines minimum, max and mean values.}

VAR code : integer; s : string;

BEGIN writeln('Enter number of pattern classes .. .'); readln(numclasses);

FOR i := 1 TO numclasses DO {initialise} BEGIN WITH extrema[i] DO BEGIN flag := true; filesinclass := O;

113 meandist := O; maxdist := O; END; END;

ChDir('c:\thesis\env'); ASSIGN(hyperfile,'\thesis\envfile.lst'); RESET(hyperfile);

{store file names and file classifications} i := O; WHILE NOT eof(hyperfile) DO BEGIN i := i + 1; read In ( hype rf ile, In); filename := Copy(ln, 1,Pos(' ',ln)-1 ); Delete(ln, 1,Pos(' ',In)); Val(ln,fileclass,code); {determine file class} writeln(i,' ',filename,' ',fileclass); classpattern[i] := fileclass; {store file classification} Assign(envfile,filename); Reset(envfile);

{read in pattern data, determining means, mins and max} WITH extrema[fileclass] DO BEGIN IF flag = true THEN {for 1st file in each class} BEGIN filesi nclass := 1 ; FOR j := 1 TO dimension DO BEGIN readln( envfile,x); pattern[i,j] := x; {store pattern vector} meanLi] := x; min[j] := x; max[j] := x; maxdist := maxdist + x*x; END; flag := false; END ELSE {for subsequent files in classes} BEGIN tempdist := O; filesinclass := filesinclass + 1 ; FOR j := 1 TO dimension DO BEGIN readln(envfile,x); pattern[i,j) := x; tempdist := tempdist + x*x; mean[j] := mean[j] + x;

114 IF x > max[j] THEN maxU] := x; IF x < min[j] THEN minU] := x; END; {for j} IF tempdist > maxdist THEN maxdist := tempdist; END; {if flag} END; {with} Close( envfile); numfiles := i; END; {while not eof} Close(hyperfile);

{report on mean and maximum vector lengths for different classes} T extColor(yellow); Writeln('CLASS DISTANCE OF MEAN MAX DISTANCE'); TextColor(LightBlue); write In('------'); TextColor(white);

FOR i := 1 TO numclasses DO BEGIN WITH extrema[i] DO BEGIN - FOR j := 1 TO dimension DO BEGIN mean[j] := mean[j]/filesinclass; meandist := meandist + meanU]*meanU]; END; meandist := Sqrt(meandist); maxdist := Sqrt(maxdist); writeln(' ',i,' ',meandist:9,' ',maxdist:9); END; {with extrema} END; {for i} readln; END; {min_max} {------} Procedure overlap;

{Reports on the number of dimensions of overlap between files in different classe - gives an indication of how separable different classes of patterns may be.} VAR class1, class2, score : integer; over : ARRAY [1 .. 10, 1 .. 1O] OF integer;

BEGIN writeln; TextColor(white); Writeln('FILE CLASSES'); T extColor(yellow);

115 Writeln('NUMBER OF DIMENSIONS IN WHICH HAVE OVERLAP'); TextColor(LightRed); Writeln('HIGH CORRESPONDENCE'); Norm Video; Writeln('SUPPRESSED REPETIONS'); TextColor(white); writeln; write(' '); FOR i := 1 TO numclasses DO BEGIN write(i :3); END; write(Chr(1 0),Chr(13));

  {count up the number of dimensions of overlap and print out}
  FOR class1 := 1 TO numclasses DO
  BEGIN
    TextColor(white);
    Write(' ', class1:3);

    FOR class2 := 1 TO numclasses DO
    BEGIN
      score := 0;
      FOR i := 1 TO dimension DO   {test overlap}
      BEGIN
        IF extrema[class1].min[i] > extrema[class2].max[i] THEN
        ELSE IF extrema[class2].min[i] > extrema[class1].max[i] THEN
        ELSE score := score + 1;
      END;   {for i}
      over[class2,class1] := score;
      IF (over[class2,class1] >= dimension-2) AND (class2 > class1) THEN
        TextColor(LightRed)
      ELSE IF (over[class2,class1] < dimension-2) AND (class2 > class1) THEN
        TextColor(yellow)
      ELSE IF class2 <= class1 THEN
        NormVideo;
      write(over[class2,class1]:3);
    END;   {for class2}
    write(Chr(10), Chr(13));
  END;   {for class1}
  NormVideo;
END;   {overlap}
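In the notation of the listing, the figure printed for classes a and b is the count of overlapping dimensions,

\[ \mathrm{over}(a,b) \;=\; \#\left\{\, i : \min_a(i) \le \max_b(i) \;\;\mathrm{and}\;\; \min_b(i) \le \max_a(i) \,\right\}, \]

so a value close to the pattern dimension (here 60) flags a pair of classes whose per-dimension ranges are difficult to separate.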

{------}
Procedure over_picture;
{Graphical representation of the range of values taken by different classes of input patterns.}
CONST
  numpics = 2;
  yscale = 400;
  yoffset = 380;
  jump = 6;
  width = 4;
  xoffset = 150;
  Gray50 : FillPatternType = ($00,$55,$00,$55,$00,$55,$00,$55);

VAR xpos : integer; s :string;

BEGIN
  init;
  SetBkColor(White);
  SetTextStyle(defaultfont, HorizDir, 2);
  SetColor(LightBlue);
  Str(pic1, s);
  OutTextXY(xmax-100, 50, s);
  SetColor(LightRed);
  Str(pic2, s);
  OutTextXY(xmax-100, 80, s);
  SetColor(green);
  OutTextXY(xmax-250, 65, 'Classes');
  OutTextXY(50, ymax-50, 'x = exit');
  OutTextXY(50, ymax-30, 'Any key to continue');
  FOR i := 1 TO dimension DO
  BEGIN
    WITH extrema[pic1] DO
    BEGIN
      SetFillPattern(Gray50, LightRed);
      xpos := xoffset + i*jump;
      Bar(xpos, yoffset-Round(min[i]*yscale),
          xpos+width, yoffset-Round(max[i]*yscale));
    END;
    WITH extrema[pic2] DO
    BEGIN
      SetFillStyle(SolidFill, LightBlue);
      xpos := xoffset + i*jump + width;
      Bar(xpos, yoffset-Round(min[i]*yscale),
          xpos+width, yoffset-Round(max[i]*yscale));
    END;
  END;
END;

{------}
Procedure correlation;

{Determines the inner product (or dot product) of all patterns with each other, then determines an average value for the inter-class correlation.}

VAR k: integer; brief, totcorr : real;

BEGIN
  FOR i := 1 TO numfiles DO
  BEGIN
    FOR j := 1 TO numfiles DO
    BEGIN
      filecorr[i,j] := 0;   {initialise}
      FOR k := 1 TO dimension DO
      BEGIN   {build up dot product}
        filecorr[i,j] := filecorr[i,j] + pattern[i,k]*pattern[j,k];
      END;   {for k}
    END;   {for j}
  END;   {for i}

  FOR i := 1 TO numclasses DO   {initialise}
  BEGIN
    FOR j := 1 TO numclasses DO
    BEGIN
      classcorr[i,j] := 0;
    END;
  END;

  FOR i := 1 TO numfiles DO   {add up interclass correlations}
  BEGIN
    FOR j := 1 TO numfiles DO
    BEGIN
      brief := classcorr[classpattern[i], classpattern[j]];
      brief := brief + filecorr[i,j];
      classcorr[classpattern[i], classpattern[j]] := brief;
    END;
  END;

  FOR i := 1 TO numclasses DO   {average out interclass correlations}
  BEGIN                         { over the number of files used}
    FOR j := i TO numclasses DO
    BEGIN
      totcorr := (classcorr[i,j] + classcorr[j,i]);
      {the factor of 2 below corrects for the double-counting in the way
       the determination is set up in the loops above}
      classcorr[i,j] :=
        totcorr/(2*extrema[i].filesinclass*extrema[j].filesinclass);
      classcorr[j,i] := classcorr[i,j];
    END;
  END;

  writeln;
  TextColor(white);
  Writeln('FILE CLASSES');
  TextColor(yellow);
  Writeln('INTERCLASS CORRELATIONS');
  TextColor(LightRed);
  Writeln('HIGH CORRESPONDENCE');
  NormVideo;
  Writeln('SUPPRESSED REPETITIONS');
  TextColor(white);
  writeln;
  write(' ');
  FOR i := 1 TO numclasses DO
  BEGIN
    write(i:7);
  END;
  write(Chr(10), Chr(13));

  FOR i := 1 TO numclasses DO
  BEGIN
    TextColor(white);
    Write(' ', i:3);
    FOR j := 1 TO numclasses DO
    BEGIN
      IF j > i THEN TextColor(yellow);
      IF classcorr[i,j] > 0.8 THEN TextColor(LightRed);
      IF j <= i THEN NormVideo;
      write(' ', classcorr[i,j]:6:3);
    END;   {for j}
    write(Chr(10), Chr(13));
  END;   {for i}
  NormVideo;
END;   {correlation}
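In the notation of the listing (writing N_a for extrema[a].filesinclass and x_p for the pattern vector read from file p), the value reported for classes a and b is the mean inner product

\[ \bar{C}_{ab} \;=\; \frac{1}{N_a N_b}\sum_{p \in a}\;\sum_{q \in b}\; \mathbf{x}_p \cdot \mathbf{x}_q \; , \]

the factor of 2 in the code compensating for each pair of classes being accumulated twice in the preceding loops.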

{------}

Procedure metric;

{Determines the average Euclidean distance between vectors of all patterns with vectors of other patterns. See J.Kittler "Feature Selection and Extraction" in T.Y.Young & K.Fu (eds) "Handbook of Pattern Recognition and Image Processing" 1986, Academic Press, London.}

VAR k: integer; brief, totdist : real;

BEGIN
  FOR i := 1 TO numfiles DO
  BEGIN
    FOR j := 1 TO numfiles DO
    BEGIN
      dist[i,j] := 0;   {initialise}
      FOR k := 1 TO dimension DO
      BEGIN
        dist[i,j] := dist[i,j] + Sqr(pattern[i,k]-pattern[j,k]);
      END;   {for k}
      dist[i,j] := Sqrt(dist[i,j]);
    END;   {for j}
  END;   {for i}

  FOR i := 1 TO numclasses DO   {initialise}
  BEGIN
    FOR j := 1 TO numclasses DO
    BEGIN
      classdist[i,j] := 0;
    END;
  END;

  FOR i := 1 TO numfiles DO   {add up interclass distances}
  BEGIN
    FOR j := 1 TO numfiles DO
    BEGIN
      brief := classdist[classpattern[i], classpattern[j]];
      brief := brief + dist[i,j];
      classdist[classpattern[i], classpattern[j]] := brief;
    END;
  END;

  FOR i := 1 TO numclasses DO   {average out interclass distances}
  BEGIN                         { over the number of files used}
    FOR j := i TO numclasses DO
    BEGIN
      totdist := (classdist[i,j] + classdist[j,i]);
      {the factor of 2 below corrects for the double-counting in the way
       the determination is set up in the loops above}
      classdist[i,j] :=
        totdist/(2*extrema[i].filesinclass*extrema[j].filesinclass);
      classdist[j,i] := classdist[i,j];
    END;
  END;

  writeln;
  TextColor(white);
  Writeln('FILE CLASSES');
  TextColor(yellow);
  Writeln('INTERCLASS DISTANCES');
  TextColor(LightRed);
  Writeln('HIGH CORRESPONDENCE');
  NormVideo;
  Writeln('SUPPRESSED REPETITIONS');
  TextColor(white);
  writeln;
  write(' ');
  FOR i := 1 TO numclasses DO
  BEGIN
    write(i:7);
  END;
  write(Chr(10), Chr(13));

  FOR i := 1 TO numclasses DO
  BEGIN
    TextColor(white);
    Write(' ', i:3);
    FOR j := 1 TO numclasses DO
    BEGIN
      IF j > i THEN TextColor(yellow);
      IF classdist[i,j] < 0.5 THEN TextColor(LightRed);
      IF j <= i THEN NormVideo;
      write(' ', classdist[i,j]:6:3);
    END;   {for j}
    write(Chr(10), Chr(13));
  END;   {for i}
  NormVideo;
END;   {metric}
{------}
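Using the same notation as above, the figure reported for classes a and b is the mean pairwise Euclidean distance

\[ \bar{D}_{ab} \;=\; \frac{1}{N_a N_b}\sum_{p \in a}\;\sum_{q \in b}\; \sqrt{\sum_{k=1}^{60}\bigl(x_{pk}-x_{qk}\bigr)^{2}} \; , \]

which, together with the correlation and overlap measures, gives a rough indication of how separable the pattern classes are before any network training is attempted.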

BEGIN
  min_max;
  REPEAT
    overlap;
    correlation;
    metric;
    writeln('Enter desired class numbers for graphical comparison');
    readln(pic1);
    readln(pic2);
    over_picture;
    ch := ReadKey;
    IF ch = 'x' THEN exit := true ELSE exit := false;
    Closegraph;   {close down graphics system}
  UNTIL exit;
END.   {hyper}

Appendix C - Program MULTI.PAS (Turbo Pascal Version)

program multi;
{------}

{ This program simulates a 3-layer perceptron with the back-propagation learning algorithm - semilinear, logistic threshold function, no momentum.

Input data consists of a list of file names containing heart sound data and file classifications in text format in the file 'envfile.lst':-

  filename1 fileclassification1
  filename2 fileclassification2
  .... etc

NB: filenames are text and fileclassifications are integers

Each file with name filename? is in text format containing real data (one per line) of heart sound envelope.

These are expected in the same directory, being D:\, preferably a RAM drive to speed up access.}
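As an illustration only (these file names are hypothetical, not the ones used in the study), 'envfile.lst' might therefore contain entries of the form

  norm01.env 1
  norm02.env 1
  sten01.env 2
  regu01.env 3

where the integer selects which of the maxnode3 output nodes is set high in the target vector for that file.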

Uses Dos, Crt, Graph;

CONST
  dimension = 60;      {# of input data from signal}
  maxnode1 = 30;       {# nodes in 1st layer}   {these may be made}
  maxnode2 = 15;       {# nodes in 2nd layer}   {VAR for changing}
  maxnode3 = 3;        {# nodes in 3rd layer}   {topologies}
  maxnumfiles = 50;    {max # input patterns}
  gain = 0.7;          {gain for error signal - may be made VAR for changing gain}
  rand = 0.4;          {initial max random value}
  report = 500;        {reporting frequency - # iterations}

LABEL finish;

TYPE
  vector = ARRAY[1..maxnode3] of REAL;
  bigvector = ARRAY[1..maxnode2] of REAL;
  hugevector = ARRAY[1..maxnode1] of REAL;

  matrix = ARRAY[1..maxnode2] of vector;            {maxnode2 x maxnode3}
  medmatrix = ARRAY[1..maxnumfiles] of vector;      {maxnumfiles x maxnode3}
  bigmatrix = ARRAY[1..maxnode1] of bigvector;      {maxnode1 x maxnode2}
  hugematrix = ARRAY[1..dimension] of hugevector;   {dimension x maxnode1}
  megamatrix = ARRAY[1..maxnumfiles] of hugevector; {maxnumfiles x maxnode1}

VAR
  x0 : ARRAY[1..dimension] of REAL;            {input pattern}
  x1, delta0, theta1, sumdel1 : hugevector;    {output 1st layer, error terms}
  x2, theta2, delta1, sumdel2 : bigvector;     {output 2nd layer}
  x3, theta3, delta2 : vector;                 {output final layer}
  node1, node2 : integer;                      {spare - used if layer sizes are made variable}
  w0 : hugematrix;                             {weights from inputs to layer 1}
  w1 : bigmatrix;                              {weights from layer 1 to layer 2}
  w2 : matrix;                                 {weights from layer 2 to layer 3}
  d : medmatrix;                               {row vectors of ideal output}

  {currentfile : pointer to input data being used;
   mainloop    : counter for iterations;
   numfiles    : actual # input patterns;
   inp_file    : names of files storing input data}

  r, c, i, j, currentfile, mainloop : integer;
  numfiles : integer;                          {number of files}
  inp_file : ARRAY[1..maxnumfiles] OF string;  {names of input files}
  xmax, ymax, graphdriver, graphmode : integer; {graphics parameters}
  ans, oldans : char;                          {storage for key strokes}

  outf4 : text;
  sd, ssd : real;
  bigloop, score : integer;                    {# iterations, number of correct classifications}
  ln : string;
  fileclass, code : integer;                   {classification for file; code - used by Val}
  reps : integer;
  again : boolean;

{------}
Procedure init;   {initializes graphics system}
BEGIN
  DetectGraph(graphdriver, graphmode);
  InitGraph(graphdriver, graphmode, 'c:\pascal');
  xmax := GetMaxX;
  ymax := GetMaxY;
END;
{------}

procedure getfiles(filename : string);
VAR
  filelist : text;      {file containing file names of input data}
  file_cat : string;    {id from file name indicating signal classification}

BEGIN {Set up matrix of ideal outputs - each output node has a high (1) corresponding to target pattern - all other nodes will be low (0)} FOR r := 1 TO maxnumfiles DO BEGIN FOR c := 1 TO maxnode3 DO BEGIN d[r,c] := 0; END; END;

Assign(filelist, filename); Reset(filelist);

  i := 0;
  {Read in list of files for input patterns}
  WHILE NOT eof(filelist) DO
  BEGIN
    i := i + 1;
    readln(filelist, ln);
    inp_file[i] := Copy(ln, 1, Pos(' ', ln) - 1);
    Delete(ln, 1, Pos(' ', ln));
    Val(ln, fileclass, code);   {determine file class}
    d[i, fileclass] := 1;
    writeln(i, ' ', inp_file[i], ' ', fileclass);
  END;
  numfiles := i;   {# input patterns}
  Close(filelist);
END;
{------}
procedure filer;

{Reads in data from file specified by pointer currentfile} VAR envfile : text;

BEGIN
  Assign(envfile, inp_file[currentfile]);
  Reset(envfile);
  FOR i := 1 TO dimension DO
  BEGIN
    readln(envfile, x0[i]);
  END;
  Close(envfile);
END;
{------}
procedure init_net;
{Set up initial neural net parameters - random values for weights and offsets}

BEGIN
  Randomize;
  FOR r := 1 TO dimension DO
  BEGIN
    FOR c := 1 TO maxnode1 DO
    BEGIN
      w0[r,c] := rand*Random;
    END;
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    FOR r := 1 TO maxnode1 DO
    BEGIN
      w1[r,c] := rand*Random;
    END;
  END;

  FOR c := 1 TO maxnode3 DO
  BEGIN
    FOR r := 1 TO maxnode2 DO
    BEGIN
      w2[r,c] := rand*Random;
    END;
  END;

  FOR i := 1 TO maxnode1 DO
  BEGIN
    theta1[i] := rand*Random;
  END;

  FOR i := 1 TO maxnode2 DO
  BEGIN
    theta2[i] := rand*Random;
  END;

  FOR i := 1 TO maxnode3 DO
  BEGIN
    theta3[i] := rand*Random;
  END;
END;

{------}
procedure get_output;

{Present input pattern and determine output of the neural network. The Exp function appears due to the semilinear logistic function.}
VAR
  sum : real;

BEGIN
  {layer 1}
  FOR c := 1 TO maxnode1 DO
  BEGIN
    sum := 0;   {reset accumulator for each node}
    FOR r := 1 TO dimension DO
    BEGIN
      sum := sum + w0[r,c]*x0[r];
    END;
    sum := -1*(theta1[c] + sum);
    x1[c] := 1/(1 + Exp(sum));
  END;

  {Layer 2}
  FOR c := 1 TO maxnode2 DO
  BEGIN
    sum := 0;
    FOR r := 1 TO maxnode1 DO
    BEGIN
      sum := sum + w1[r,c]*x1[r];
    END;
    sum := -1*(theta2[c] + sum);
    x2[c] := 1/(1 + Exp(sum));
  END;

  {Layer 3}
  FOR c := 1 TO maxnode3 DO
  BEGIN
    sum := 0;
    FOR r := 1 TO maxnode2 DO
    BEGIN
      sum := sum + w2[r,c]*x2[r];
    END;
    sum := -1*(theta3[c] + sum);
    x3[c] := 1/(1 + Exp(sum));
  END;
END;

{------}
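In the notation of the listing, each node c of a layer computes the logistic (semilinear) output

\[ x_c \;=\; \frac{1}{1+\exp\!\left[-\left(\theta_c + \sum_r w_{rc}\,x_r^{\mathrm{prev}}\right)\right]} \; , \]

where the sum runs over the nodes (or inputs) of the previous layer and theta is the node offset.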

procedure adapt;

{Back-propagation of errors to adapt the weights, dependent on the distance of the actual output from the target output}

BEGIN
  FOR i := 1 TO maxnode2 DO sumdel2[i] := 0;
  FOR i := 1 TO maxnode1 DO sumdel1[i] := 0;

  FOR c := 1 TO maxnode3 DO
  BEGIN
    {sumdel2 is used in determining the weight change for errors one layer
     down. Similarly, sumdel1 for the first layer. delta2 is the error term
     for each node - the actual weight change is obtained by multiplying
     with the output and the gain.}
    delta2[c] := x3[c]*(1 - x3[c])*(d[currentfile,c] - x3[c]);
    FOR r := 1 TO maxnode2 DO
    BEGIN
      w2[r,c] := w2[r,c] + gain*delta2[c]*x2[r];
      sumdel2[r] := sumdel2[r] + w2[r,c]*delta2[c];
    END;
    {Treat the offset, theta3, as if it were the weight from a node of
     constant unity output}
    theta3[c] := theta3[c] + gain*delta2[c];
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    delta1[c] := sumdel2[c]*x2[c]*(1-x2[c]);
    FOR r := 1 TO maxnode1 DO
    BEGIN
      w1[r,c] := w1[r,c] + gain*delta1[c]*x1[r];
      sumdel1[r] := sumdel1[r] + w1[r,c]*delta1[c];
    END;
    theta2[c] := theta2[c] + gain*delta1[c];
  END;

  FOR c := 1 TO maxnode1 DO
  BEGIN
    delta0[c] := sumdel1[c]*x1[c]*(1-x1[c]);
    FOR r := 1 TO dimension DO
    BEGIN
      w0[r,c] := w0[r,c] + gain*delta0[c]*x0[r];
    END;
    theta1[c] := theta1[c] + gain*delta0[c];
  END;
END;
{------}
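Summarising the update rule implemented above (writing η for the constant gain), the output-layer error term and weight change for node c are

\[ \delta^{(3)}_c = x3_c\,(1-x3_c)\,(d_c - x3_c), \qquad \Delta w2_{rc} = \eta\,\delta^{(3)}_c\,x2_r, \qquad \Delta\theta3_c = \eta\,\delta^{(3)}_c \; , \]

while the hidden-layer terms are formed by back-propagating the weighted errors, e.g.

\[ \delta^{(2)}_c = x2_c\,(1-x2_c)\sum_{k} w2_{ck}\,\delta^{(3)}_k \; . \]

Note that, as listed, the sums sumdel1 and sumdel2 are accumulated using the already-updated weights rather than the weights in force before the pass, which departs slightly from the usual statement of the delta rule.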

procedure show_wgt;

{Diagrammatic representation of the weights as learning takes place. The upper, middle and lower layers can be shown (one at a time), as well as a representation of the actual output compared to the target output. These procedures slow down the learning procedure - they can be commented out altogether, otherwise the "fast" display option - accessed by typing "f" - can be used}

procedure show_iter;   {Displays iteration # in top LH corner}
VAR
  mainstr : string;

BEGIN
  SetColor(9);
  Bar(xmax-100, 5, xmax-140, 15);
  SetTextStyle(DefaultFont, HorizDir, 1);
  Str(mainloop, mainstr);
  OutTextXY(xmax-135, 7, mainstr);
  SetColor(4);
END;

procedure outputs;   {show value of network outputs and target values}

CONST
  ymag = 200;
VAR
  xstep, yval : integer;

BEGIN
  ClearViewPort;
  xstep := Trunc(xmax/(maxnode3-1)-100);
  SetFillStyle(1, 15);
  FOR i := 1 TO maxnode3 DO
  BEGIN
    yval := Round(ymax-ymag*d[currentfile,i]);
    Bar((i-1)*xstep+50, Round(ymax-ymag*x3[i]), (i-1)*xstep+30, ymax);
    Line((i-1)*xstep, yval, (i-1)*xstep+80, yval);
  END;
  show_iter;
END;

procedure upper;   {show weights between the second and third layers}

CONST
  ymag = 30;   {Magnification of plots in y-direction}
VAR
  {ydisplaylength : room for display of one node
   halfdisplay    : half the above
   xstep          : step in x-direction - to space out the display}
  ydisplaylength, halfdisplay, xstep : integer;

BEGIN
  xstep := Trunc(xmax/(maxnode3-1));
  ydisplaylength := Trunc(ymax/maxnode2);
  halfdisplay := Trunc(ydisplaylength/2);

  FOR r := 1 TO maxnode2 DO
  BEGIN
    MoveTo(1, (r-1)*ydisplaylength-halfdisplay-Round(ymag*w2[r,1]));
    FOR c := 2 TO maxnode3 DO
    BEGIN
      LineTo((c-1)*xstep, r*ydisplaylength-halfdisplay-Round(ymag*w2[r,c]));
    END;
  END;
  show_iter;
END;

procedure middle;   {show weights between the first and second layers}

CONST
  ymag = 30;

VAR
  ydisplaylength, halfdisplay, xstep : integer;

BEGIN
  xstep := Trunc(xmax/(maxnode2-1));
  ydisplaylength := Trunc(ymax/maxnode1);
  halfdisplay := Trunc(ydisplaylength/2);
  FOR r := 1 TO maxnode1 DO
  BEGIN
    MoveTo(1, (r-1)*ydisplaylength-halfdisplay-Round(ymag*w1[r,1]));
    FOR c := 2 TO maxnode2 DO
    BEGIN
      LineTo((c-1)*xstep, r*ydisplaylength-halfdisplay-Round(ymag*w1[r,c]));
    END;
  END;
  show_iter;
END;

procedure lower;   {show weights between the inputs and the first layer}

CONST
  ymag = 30;

VAR
  ydisplaylength, halfdisplay, xstep : integer;

BEGIN
  xstep := Trunc(xmax/(maxnode1-1));
  ydisplaylength := Trunc(ymax/dimension);
  halfdisplay := Trunc(ydisplaylength/2);
  FOR r := 1 TO dimension DO
  BEGIN
    MoveTo(1, (r-1)*ydisplaylength-halfdisplay-Round(ymag*w0[r,1]));
    FOR c := 2 TO maxnode1 DO
    BEGIN
      LineTo((c-1)*xstep, r*ydisplaylength-halfdisplay-Round(ymag*w0[r,c]));
    END;
  END;
  show_iter;
END;

BEGIN   {decide on which display}
  CASE ans OF
    'u', 'U' : upper;
    'm', 'M' : middle;
    'l', 'L' : lower;
    'o', 'O' : outputs;
    'f', 'F' : show_iter;
    'c', 'C' : ClearViewPort;
  ELSE
    ans := oldans;
    beep(440, 100);
  END;   {case}
END;
{------}
procedure key;   {look at keyboard}

BEGIN
  IF keypressed THEN
  BEGIN
    ClearViewPort;   {clear screen}
    oldans := ans;
    ans := readkey;
  END;
END;
{------}
procedure storewgt;

{Store the final learned weights in a binary file}
VAR
  outf1 : file;

BEGIN
  Assign(outf1, 'weights.bi');
  Rewrite(outf1, 1);
  Blockwrite(outf1, w0, maxnode1*dimension*6);
  Blockwrite(outf1, w1, maxnode1*maxnode2*6);
  Blockwrite(outf1, w2, maxnode2*maxnode3*6);
  Close(outf1);
END;
{------}
procedure testout;

{The current state of the network is tested by applying each signal from the training set in turn. A score of hits is determined - 1 point each time the network gives the correct classification.}
VAR
  counter : integer;
  max : real;
  maxx3, maxd : integer;

BEGIN
  score := 0;
  ssd := 0;

  FOR currentfile := 1 TO numfiles DO
  BEGIN
    filer;
    get_output;
    max := d[currentfile,1];
    maxd := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF d[currentfile,i] > max THEN
      BEGIN
        max := d[currentfile,i];
        maxd := i;
      END;
    END;

    max := x3[1];
    maxx3 := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF x3[i] > max THEN
      BEGIN
        max := x3[i];
        maxx3 := i;
      END;
    END;

    IF maxd = maxx3 THEN score := score + 1;

    FOR i := 1 TO maxnode3 DO
    BEGIN
      sd := Sqr(d[currentfile,i]-x3[i]);
      ssd := sd + ssd;
    END;
  END;

  writeln('mainloop: ', mainloop, ' reps', reps, ' hits: ', score,
          ' misses: ', numfiles-score, ' ssd:', ssd:10);
  writeln(outf4, reps, ' ', mainloop:5, ' ', score:3, ' ', ssd:10);
  Flush(outf4);
END;

{------}
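In the notation of the listing, the reported figures are the hit count (the number of training patterns for which the largest network output coincides with the target node) and the total squared error over the training set,

\[ \mathrm{ssd} \;=\; \sum_{p=1}^{\mathrm{numfiles}}\;\sum_{c=1}^{\mathrm{maxnode3}} \bigl(d_{pc}-x3_c(p)\bigr)^{2} \; , \]

which is the quantity tested against the learning criterion (ssd < 0.5) in the main program below.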

procedure unknowns;

{The current state of the network is tested by applying each signal from the TEST set in turn. A score of hits is determined - 1 point each time the network gives the correct classification.}
VAR
  counter : integer;
  max : real;
  maxx3, maxd : integer;

BEGIN
  score := 0;
  ssd := 0;
  FOR currentfile := 1 TO numfiles DO
  BEGIN
    filer;
    get_output;
    max := d[currentfile,1];
    maxd := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF d[currentfile,i] > max THEN
      BEGIN
        max := d[currentfile,i];
        maxd := i;
      END;
    END;

    max := x3[1];
    maxx3 := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF x3[i] > max THEN
      BEGIN
        max := x3[i];
        maxx3 := i;
      END;
    END;

    IF maxd = maxx3 THEN score := score + 1;

    FOR i := 1 TO maxnode3 DO
    BEGIN
      sd := Sqr(d[currentfile,i]-x3[i]);
      ssd := sd + ssd;
    END;
  END;

  writeln('TESTING UNKNOWNS', ' hits: ', score, ' misses: ', numfiles-score,
          ' ssd:', ssd:10);
  writeln;
  writeln(outf4, 'UNKNOWNS ', 'hits: ', score, ' misses: ', numfiles-score,
          ' ssd:', ssd:10);
  writeln(outf4);
  Flush(outf4);
END;

{------} procedure runmarker; {provides flags in output file} BEGIN writeln(outf4,dimension,' ',maxnode1 ,' ',maxnode2,' ', maxnode3); writeln(dimension,' ',maxnode1 ,' ',maxnode2,' ', maxnode3); END; {------}

BEGIN
  Assign(outf4, 'run.rep');
  Rewrite(outf4);
  ChDir('D:\');   {D: is set up as a RAM drive to speed up access to files}
                  {all files of input data are copied here before starting the}
                  {program}
  ans := 'o';
  Randomize;
  init;
  SetColor(4);
  runmarker;

  getfiles('envfile.lst');
  init_net;
  currentfile := 1;
  FOR mainloop := 1 TO 6000 DO
  BEGIN
    key;
    currentfile := (currentfile) mod (numfiles) + 1;   {for orderly file presentation}
    filer;   {get data for the chosen file}
    get_output;
    show_wgt;

    IF Frac(mainloop/report) = 0.0 THEN   {is the number of iterations an integer
                                           multiple of the reporting frequency?}
    BEGIN
      testout;
      IF ssd < 0.5 THEN   {set some criterion for having learned the training set}
      BEGIN
        writeln(outf4, ssd);
        getfiles('unknowns.lst');   {test the network with the unseen test set}
        unknowns;
        GOTO finish;
      END;   {if ssd}
    END;   {if frac}
    adapt;
    finish:
  END;   {mainloop}

Closegraph; storewgt; Close(outf4); readln; END. {multi}

Appendix D - Program MV.PAS (VAX Version)

program mv (input, output);
{------}

{ This program simulates a 3-layer perceptron with the back-propagation learning algorithm - semilinear, logistic threshold function.

It has a momentum term in the learning algorithm - the code specific to this is shown in italics in the original printed listing

This has most of the same variable names as the MS-DOS version of MULTI.PAS}
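In the notation of the listing (with η = gain and α = alpha), each weight is updated according to

\[ w(t+1) \;=\; w(t) \;+\; \eta\,\delta\,x \;+\; \alpha\,\bigl[w(t)-w(t-1)\bigr] \; , \]

the arrays w0old, w1old and w2old holding the previous-iteration weights w(t-1), and w0temp, w1temp and w2temp preserving w(t) before it is overwritten.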

CONST
  maxnumfiles = 100;   {max # input patterns}
  report = 500;        {reporting frequency - # iterations}
  dimension = 60;
  maxnode1 = 60;
  maxnode2 = 40;
  maxnode3 = 3;
  gain = 0.5;
  rand = 1.0;
  alpha = 0.5;

TYPE
  vector = ARRAY[1..maxnode3] of REAL;
  bigvector = ARRAY[1..maxnode2] of REAL;
  hugevector = ARRAY[1..dimension] of REAL;

  matrix = ARRAY[1..maxnode2] of vector;            {maxnode2 x maxnode3}
  medmatrix = ARRAY[1..maxnumfiles] of vector;      {maxnumfiles x maxnode3}
  bigmatrix = ARRAY[1..maxnode1] of bigvector;      {maxnode1 x maxnode2}
  hugematrix = ARRAY[1..dimension] of hugevector;   {dimension x maxnode1}
  megamatrix = ARRAY[1..maxnumfiles] of hugevector; {maxnumfiles x maxnode1}

VAR
  x1, delta0, theta1, sumdel1 : hugevector;   {output 1st layer, error terms}
  x2, theta2, delta1, sumdel2 : bigvector;    {output 2nd layer}
  x3, theta3, delta2 : vector;                {output final layer}

  x0 : megamatrix;                            {inputs from all files}

  w0, w0old, w0temp : hugematrix;             {weights from inputs to layer 1}
  w1, w1old, w1temp : bigmatrix;              {weights from layer 1 to layer 2}
  w2, w2old, w2temp : matrix;                 {weights from layer 2 to layer 3}
  d : medmatrix;                              {row vectors of ideal output}

  {currentfile : pointer to input data being used
   mainloop    : counter for iterations
   numfiles    : actual # input patterns
   inp_file    : names of files storing input data}

  r, c, i, j, currentfile, mainloop : integer;

  numfiles : integer;
  inp_file : ARRAY[1..maxnumfiles] OF PACKED ARRAY[1..20] OF char;

  outf1, outf2, outf3 : text;
  sd, ssd : real;
  bigloop, hugeloop : integer;
  time : integer;
  seed_value : unsigned;    {for random number generator}
  random_result : real;

{------} [ASYNCHRONOUS] Function MTH$RANDOM (VAR seed: [VOLATILE] unsigned) : SINGLE; EXTERNAL;

{the VAX random number generator is an external library} {------} procedure getfiles;

VAR
  filelist : text;        {file containing file names of input data}
  file_cat : string(10);  {id from file name indicating signal classification}

BEGIN
  Open(FILE_VARIABLE := filelist, FILE_NAME := 'envfile.lst',
       HISTORY := OLD, DISPOSITION := SAVE);
  Reset(filelist);

  i := 0;
  {Read in list of files for input patterns}
  WHILE NOT eof(filelist) DO
  BEGIN
    i := i + 1;
    readln(filelist, inp_file[i]);
    writeln(i, ' ', inp_file[i]);
  END;
  numfiles := i;   {# input patterns}
  Close(filelist);

  {Set up matrix of ideal outputs - each output node has a high (1)
   corresponding to the target pattern - all other nodes will be low (0)}
  FOR r := 1 TO numfiles DO
  BEGIN
    FOR c := 1 TO maxnode3 DO
    BEGIN
      d[r,c] := 0;
    END;
  END;

  {obtain file categories from the file names - first 3 characters}
  FOR r := 1 TO numfiles DO
  BEGIN
    file_cat := Substr(inp_file[r], 1, 3);
    IF file_cat = '{condition1}' THEN d[r,1] := 1.0
    ELSE IF file_cat = '{condition2}' THEN d[r,2] := 1.0
    ELSE d[r,3] := 1.0;
    {these can be extended to incorporate other conditions}
  END;
END;
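As an illustration only (the three-character prefixes below are hypothetical, standing in for the '{condition1}' and '{condition2}' place-holders above), the classification branch might read:

  IF file_cat = 'nor' THEN d[r,1] := 1.0        {normal heart sounds}
  ELSE IF file_cat = 'aos' THEN d[r,2] := 1.0   {e.g. aortic stenosis}
  ELSE d[r,3] := 1.0;                           {remaining condition}

so that the first three characters of each pattern file name select which output node is set high in the target vector; adding further conditions would also require maxnode3 and the ELSE chain to be extended.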

{------} procedure filer;

VAR envfile : text;

BEGIN
  FOR j := 1 TO numfiles DO
  BEGIN
    Open(FILE_VARIABLE := envfile, FILE_NAME := inp_file[j],
         HISTORY := OLD, DISPOSITION := SAVE);
    Reset(envfile);
    FOR i := 1 TO dimension DO
    BEGIN
      readln(envfile, x0[j,i]);
    END;
    Close(envfile);
  END;
END;
{------}

procedure init_net;

{Set up initial neural net parameters - random values for weights and offsets}

BEGIN
  FOR r := 1 TO dimension DO
  BEGIN
    FOR c := 1 TO maxnode1 DO
    BEGIN
      random_result := MTH$RANDOM(seed_value);
      w0[r,c] := rand*random_result;
      w0old[r,c] := w0[r,c];
    END;
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    FOR r := 1 TO maxnode1 DO
    BEGIN
      random_result := MTH$RANDOM(seed_value);
      w1[r,c] := rand*random_result;
      w1old[r,c] := w1[r,c];
    END;
  END;

  FOR c := 1 TO maxnode3 DO
  BEGIN
    FOR r := 1 TO maxnode2 DO
    BEGIN
      random_result := MTH$RANDOM(seed_value);
      w2[r,c] := rand*random_result;
      w2old[r,c] := w2[r,c];
    END;
  END;

  FOR i := 1 TO maxnode1 DO
  BEGIN
    random_result := MTH$RANDOM(seed_value);
    theta1[i] := rand*random_result;
  END;

  FOR i := 1 TO maxnode2 DO
  BEGIN
    random_result := MTH$RANDOM(seed_value);
    theta2[i] := rand*random_result;
  END;

  FOR i := 1 TO maxnode3 DO
  BEGIN
    random_result := MTH$RANDOM(seed_value);
    theta3[i] := rand*random_result;
  END;
END;
{------}
procedure get_output;

{Present input pattern and determine output}
VAR
  sum : real;

BEGIN
  {layer 1}
  FOR c := 1 TO maxnode1 DO
  BEGIN
    sum := 0;   {reset accumulator for each node}
    FOR r := 1 TO dimension DO
    BEGIN
      sum := sum + w0[r,c]*x0[currentfile,r];
    END;
    sum := -1*(theta1[c] + sum);
    x1[c] := 1/(1 + Exp(sum));
  END;

  {Layer 2}
  FOR c := 1 TO maxnode2 DO
  BEGIN
    sum := 0;
    FOR r := 1 TO maxnode1 DO
    BEGIN
      sum := sum + w1[r,c]*x1[r];
    END;
    sum := -1*(theta2[c] + sum);
    x2[c] := 1/(1 + Exp(sum));
  END;

  {Layer 3}
  FOR c := 1 TO maxnode3 DO
  BEGIN
    sum := 0;
    FOR r := 1 TO maxnode2 DO
    BEGIN
      sum := sum + w2[r,c]*x2[r];
    END;
    sum := -1*(theta3[c] + sum);
    x3[c] := 1/(1 + Exp(sum));
  END;
END;

{------}
procedure adapt;

{Back-propagation of errors to adapt the weights, dependent on the distance of the actual output from the target output.

Momentum requires the storage of the old weights from the previous loop.}

BEGIN
  FOR c := 1 TO maxnode3 DO
  BEGIN
    FOR r := 1 TO maxnode2 DO
    BEGIN
      w2temp[r,c] := w2[r,c];
    END;
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    FOR r := 1 TO maxnode1 DO
    BEGIN
      w1temp[r,c] := w1[r,c];
    END;
  END;

  FOR c := 1 TO maxnode1 DO
  BEGIN
    FOR r := 1 TO dimension DO
    BEGIN
      w0temp[r,c] := w0[r,c];
    END;
  END;

  FOR i := 1 TO maxnode2 DO sumdel2[i] := 0;
  FOR i := 1 TO maxnode1 DO sumdel1[i] := 0;

  FOR c := 1 TO maxnode3 DO
  BEGIN
    {sumdel2 is used in determining the weight change for errors one layer
     down. Similarly, sumdel1 for the first layer. delta2 is the error term
     for each node - the actual weight change is obtained by multiplying
     with the output and the gain.}
    delta2[c] := x3[c]*(1 - x3[c])*(d[currentfile,c] - x3[c]);
    FOR r := 1 TO maxnode2 DO
    BEGIN
      w2[r,c] := w2[r,c] + gain*delta2[c]*x2[r] + alpha*(w2[r,c]-w2old[r,c]);
      sumdel2[r] := sumdel2[r] + w2[r,c]*delta2[c];
    END;
    {Treat the offset, theta3, as if it were the weight from a node of
     constant unity output}
    theta3[c] := theta3[c] + gain*delta2[c];
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    delta1[c] := sumdel2[c]*x2[c]*(1-x2[c]);
    FOR r := 1 TO maxnode1 DO
    BEGIN
      w1[r,c] := w1[r,c] + gain*delta1[c]*x1[r] + alpha*(w1[r,c]-w1old[r,c]);
      sumdel1[r] := sumdel1[r] + w1[r,c]*delta1[c];
    END;
    theta2[c] := theta2[c] + gain*delta1[c];
  END;

  FOR c := 1 TO maxnode1 DO
  BEGIN
    delta0[c] := sumdel1[c]*x1[c]*(1-x1[c]);
    FOR r := 1 TO dimension DO
    BEGIN
      w0[r,c] := w0[r,c] + gain*delta0[c]*x0[currentfile,r]
                 + alpha*(w0[r,c]-w0old[r,c]);
    END;
    theta1[c] := theta1[c] + gain*delta0[c];
  END;

  FOR c := 1 TO maxnode3 DO
  BEGIN
    FOR r := 1 TO maxnode2 DO
    BEGIN
      w2old[r,c] := w2temp[r,c];
    END;
  END;

  FOR c := 1 TO maxnode2 DO
  BEGIN
    FOR r := 1 TO maxnode1 DO
    BEGIN
      w1old[r,c] := w1temp[r,c];
    END;
  END;

  FOR c := 1 TO maxnode1 DO
  BEGIN
    FOR r := 1 TO dimension DO
    BEGIN
      w0old[r,c] := w0temp[r,c];
    END;
  END;
END;
{------}
procedure testout;

{The current state of the network is tested by applying each signal from the training set in turn. A score of hits is determined - 1 point each time the network gives the correct classification.} VAR score, counter : integer; max: real; maxx3, maxd : integer;

BEGIN
  IF mainloop = report THEN
  BEGIN
    Writeln;
    Writeln;
    Writeln('Alpha= ', alpha:3:3, ' ', 'Gain= ', gain:3:3, ' net : ',
            dimension:3, maxnode1:3, maxnode2:3, maxnode3:3);
  END;
  score := 0;
  ssd := 0;

  FOR currentfile := 1 TO numfiles DO
  BEGIN
    get_output;
    max := d[currentfile,1];
    maxd := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF d[currentfile,i] > max THEN
      BEGIN
        max := d[currentfile,i];
        maxd := i;
      END;
    END;

    max := x3[1];
    maxx3 := 1;
    FOR i := 2 TO maxnode3 DO
    BEGIN
      IF x3[i] > max THEN
      BEGIN
        max := x3[i];
        maxx3 := i;
      END;
    END;

    IF maxd = maxx3 THEN score := score + 1;

    FOR i := 1 TO maxnode3 DO
    BEGIN
      sd := (d[currentfile,i] - x3[i])*(d[currentfile,i] - x3[i]);
      ssd := sd + ssd;
    END;
  END;
  writeln(' ', ssd:12:7);
  Close(outf1);
END;
{------}
BEGIN
  getfiles;
  filer;
  FOR hugeloop := 1 TO 30 DO   {do 30 repeats of training}
  BEGIN
    init_net;
    FOR mainloop := 1 TO 6000 DO
    BEGIN
      random_result := MTH$RANDOM(seed_value);   {random file order}
      currentfile := Trunc(random_result*numfiles + 1);
      get_output;
      IF mainloop/report - Round(mainloop/report) = 0.0 THEN testout;
      adapt;
    END;
  END;
  time := clock;
  writeln('Execution time : ', time, ' ms');
END.

Appendix E - Parameter Identification

[Figure: schematic of the three-layer network identifying the parameter names used in the programs - the inputs x0[1] .. x0[dimension], the node outputs of LAYER 1, LAYER 2 and LAYER 3 (up to x3[maxnode3]), and the weights, of which w0[1,1] is labelled as an example.]

References

Abu-Mostafa,Y.S.,1986, "Neural Networks for Computing?", in Denker,J.S.(ed), Neural Networks for Computing: AIP Conference Proceedings, 1-6, Snowbird, AIP, New York.

Ackley,D.H., Hinton,G.E., Sejnowski,T.J., 1985, "A Learning Algorithm for Boltzmann Machines", Cognitive Science, 9, 147-169.

Ahmad,S., Tesauro,G., 1988, "A Study of Scaling and Generalization in Neural Networks", Neural Networks, 1, 3.

Alkon,D.L., 1989, "Memory Storage and Neural Systems", Sci.Am., July, 26-34.

Anderson,J.A., 1988, What Neural Nets Can Do, LEA, Hillsdale,N.J.

Atlas,L.E., Marks,R.J., Taylor,J.W., 1988, "Network Learning Modifications for Multi-Modal Classification Problems: Applications to EKG Patterns", Neural Networks, 1, 4.

Baba,N., 1989, "A New Approach for Finding the Global Minimum of Error Function of Neural Networks", Neural Networks, 2, 367-373.

Baldi,P., Hornik,K., 1989, "Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima", Neural Networks, 2, 53-58.

Barlow,J.B., LKA 4421, Auscultation of the Heart, (recording), Decca/EMI, Sydney.

Bartlett,P., Downs,T., 1990, "Training a Neural Network with a Genetic Algorithm", Poster presentation, First Australian Conference on Neural Networks.

Baum,E., 1989, "A Proposal for More Powerful Learning Algorithms", Neural Computation, 1, 201-207.

Baum,E.B., Haussler,D., 1989, "What Size Net Gives Valid Generalization", Neural Computation, 1, 151-160.

Baum,E.B., 1986, "Generalizing Back Propagation to Computation", in Denker,J.S.(ed), Neural Networks for Computing: AIP Conference Proceedings, 47-52, Snowbird, AIP, New York.

Block,H.D., 1962, "The Perceptron: A Model for Brain Functioning. I", Reviews of Modern Physics, 34, 123-135, in Anderson,J.A., Rosenfeld,E., Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA.

Brady,D., Psaltis,D., 1989, "Perceptron Learning in Optical Neural Computers", 251-264 in Wherrett,B.S. and Tooley,F.A.P. (eds), Optical Computing, Edinburgh Uni. Press, Edinburgh.

Burr,D.J., 1988, "Experiments on Neural Net Recognition of Spoken and Written Text", IEEE Trans. Acoustics, Speech and Signal Processing, 36, 1162-1168.

Chi,Z., Jabri,M., 1990, "Classification of QRS Complexes using a Layered Feed-Forward Network", 59-63 in Jabri,M.(ed.), Proceedings of the First Australian Conference on Neural Networks, Syd.Uni.Elec.Eng.

Chi,Z., Jabri,M., 1991, "Comparison of MLP and ID3-derived Approaches for ECG Classification", 263-266 in Jabri,M.(ed.), Proceedings of the Second Australian Conference on Neural Networks, Syd.Uni.Elec.Eng.

Churchland,P.M., Churchland,P.S., 1990, "Could a Machine Think?", Sci.Am., January, 26-31.

Dalen,J.E., Alpert,J.S.(eds), 1981, Valvular Heart Disease, Little, Brown & Co., Boston.

de Silva,C.J.S., Alder,M.D., Attikiouzel,J.A., 1991, "Paradigms for Pattern Classification", 81-89 in Jabri,M.(ed.), Proceedings of the Second Australian Conference on Neural Networks, Syd.Uni.Elec.Eng.

Durbin,R., Rumelhart,D.E., 1989, "Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks", Neural Computation, 1, 133-142.

Eeckman,F.H., Freeman,W.J., 1986, "The Sigmoid Nonlinearity in Neural Computation. An Experimental Approach.", 135-145 in Denker,J.S.(ed), Neural Networks for Computing (AIP Conference Proceedings 151), AIP, New York.

Farhat,N.H., Psaltis,D., Prata,A., Paek,E., 1985, "Optical Implementation of the Hopfield Model", Applied Optics, 24, 1469-1475.

Feldman,J.A., 1985, "Connections", Byte, April, 277-284.

Feldman,J.A., Fanty,M.A., Goddard,N.H., 1988, "Computing with Structured Neural Networks", Computer, 21, 91-103.

Fleming,J.S., Braimbridge,M.V., 1974, Lecture Notes on Cardiology, Blackwell, Oxford.

Friedberg,C.K., 1966, Diseases of the Heart, W.B.Saunders, Philadelphia.

Fukushima,K., 1988, "A Neural Network for Visual Pattern Recognition", Computer, 21, 65-75.

Gevins,A.S., Morgan,N.H., 1988, "Applications of Neural-Network (NN) Signal Processing in Brain Research", IEEE Trans. Acoustics, Speech & Signal Processing, 36, 1152-1161.

Goncharova,L.N., Romanova,O.V., 1988, "Changes in the Spectrum of Heart Sounds in Myocardial Infarction", Sovetskaya Meditsina, 80-83.

Gorman,R.P., Sejnowski,T.J., 1988, "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets", Neural Networks, 1, 75-89.

Hebb,D.O., 1949, The Organization of Behavior, in Anderson,J.A. and Rosenfeld,E. (eds), Neurocomputing: Foundations of Research, 45-58, MIT Press, Cambridge.

Hinton,G.E., 1989, "Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space", Neural Computation, 1, 143-150.

Hinton,G.E., Sejnowski,T.J., 1986, "Learning and Relearning in Boltzmann Machines", in McClelland,J.L. and Rumelhart,D.E. (eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol.1, 283-317, MIT Press, Cambridge, MA.

Holzner,J.H., Mathes,P., 1983, Atlas of Heart Disease, Butterworth, Munich.

Hopfield,J.J., 1984, "Neurons with Graded Response Have Collective Computational Properties like those of Two-State Neurons", Proceedings of the National Academy of Sciences, 81, 3088-3092.

Hopfield,J.J., Tank,D.W., 1986, "Computing with Neural Circuits: A Model", Science, 233, 625-633.

Huang,W.Y., Lippmann,R.P., 1987(a), "Comparisons Between Neural Net and Conventional Classifiers", Presented at ICNN, San Diego, CA, 21-24 June.

Huang,W.Y., Lippmann,R.P., 1987(b), "Neural Net and Traditional Classifiers", Presented at Conf. on Neural Inf. Proc. Syst., Denver, Colorado.

Hurst,J.W.(ed), 1974, The Heart, Arteries and Veins, McGraw Hill, Tokyo.

Jacobs,R.A., 1988, "Increased Rates of Convergence Through Learning Rate Adaptation", Neural Networks, 1, 295-307.

Jordan,M.I., 1986, "An Introduction to Linear Algebra in Parallel Distributed Processing", in Rumelhart,D.E. and McClelland,J.L. (eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol.1: Foundations, 365-422, MIT Press.

Julian,D.G., 1988, Cardiology, Ballieres, London.

Keeler,J.D., 1986, "Basins of Attraction of Neural Network Models", in Denker,J.S(ed), Neural Networks for Computing: AIP Conference Proceedings, 259-264, Snowbird, AIP, New York.

King,T., 1989, "Using Neural Networks for Pattern Recognition", Dr Dobb's Journal, January, 14-28.

Kinoshita,J., 1990, "Net Result: Folded Protein", Sci.Am., April, 12-13.

Kirkpatrick,S., Gelatt,C.D., Vecchi,M.P., 1983, "Optimization by Simulated Annealing", Science, 220, 671-680.

Kittler,J., 1986, "Feature Selection and Extraction", in Young,T.Y. and Fu,K.(eds), Handbook of Pattern Recognition and Image Processing, 59-83, Academic Press, London.

Klimasauskas,C., 1989, "Neural Nets and Noise Filtering", Dr Dobb's Journal, January, 32-48.

Kohonen,T., 1982, "Self-Organized Formation of Topologically Correct Feature Maps", Biological Cybernetics, 43, 59-69.

Kohonen,T., 1988(a), "An Introduction to Neural Computing", Neural Networks, 1, 3-16.

Kohonen,T., 1988(b), "The Neural Phonetic Typewriter", Computer (IEEE), 21, 11-22.

Lang, K.J., Waibel, A.H., 1990, "A Time-Delay Neural Network Architecture for Isolated Word Recognition", Neural Networks, 3, 23-43.

Lapedes,A., Farber,R., 1987, "How Neural Nets Work", Proceedings of IEEE, Denver Conference on Neural Nets.

Lapedes,A., Farber,R., 1986, "Programming a Massively Parallel, Computation Universal System: Static Behaviour", in Denker,J.S.(ed.), Neural Networks for Computing, 283-298, Snowbird, UT, AIP, New York.

Leatham,A., 1958, "Auscultation of the Heart", The Lancet, 2, 703-708.

Leatham,A., 1975, Auscultation of the Heart and Phonocardiography, Churchill, Edinburgh.

Levy,J., 1988, "Computers that Learn to Forget", New Scientist, 119, 36-40.

Linsker,R., 1988, "Self-Organization in a Perceptual Network", Computer, 21, 105-117.

Lippmann,R.P., 1987, "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, April, 4-22.

Lippmann,R.P., 1989, "Review of Neural Networks for Speech Recognition", Neural Computation, 1, 1-38.

Luisada,A.A.(ed), 1959, Cardiology Vol 1 : Normal Heart and Vessels, McGraw Hill, New York.

Luisada,A.A., 1973, The Sounds of the Diseased Heart, Warren H. Green, St Louis.

Machtynger,J., Sitte,J., 1990, "Visualizing Basins of Attraction in Binary Neural Nets", in Jabri,M.(ed), Proceedings of the First Australian Conference on Neural Networks, 20, Syd.Uni.Elec.Eng.

McKusick,V.A., 1958, Cardiovascular Sound in Health and Disease, Williams & Wilkins, Baltimore.

Meredith,H., 1990, "DIDDLY: A Multi-Media Performer in the Theatre", The Australian, 21st Aug, Nationwide News, Canberra.

Moody,J., Darken,C.J., 1989, "Fast Learning in Networks of Locally-Tuned Processing Units", Neural Computation, 1, 281-294.

Moore,B., Poggio,T., 1988, "Representation Properties of Multilayer Feedforward Networks", Neural Networks, 1, 203.

Morse,K., 1989, "In an Upscale World", Aust.Pers.Comp., August, 108-109.

Mueller,P., Lazzaro,J., 1986, "A Machine for Neural Computation of Acoustical Patterns with Applications to Real Time Speech Recognition", in Denker,J.S.(ed.), Neural Networks for Computing: AIP Conference Proceedings, Snowbird, AIP, New York.

Nickolls,P., 1990, "Neural Networks in Medical Devices for Cardiology", in Jabri,M.(ed.), Proceedings of the First Australian Conference on Neural Networks, 107, Syd.Uni.Elec.Eng.

Opie,L., 1986, The Heart: Physiology, Metabolism, Pharmacology and Therapy, Grune & Stratton, Orlando.

Qing-zeng,F., 1989, "Neural Networks and Chaotic Time Series Predictions", in Zhao,K.H., Zhang,C.F. and Zhu,Z.X.(eds), Learning and Recognition: A Modern Approach, 124-131, World Scientific, Singapore.

Ramamoorthy,P.A., Ho,S., 1988, "A Neural Network Approach for Implementing Expert Systems", Neural Networks, 1, 44.

Reggia,J.A., Sutton,G.G., 1988, "Self-Processing Networks and Their Biomedical Implications", Proc. IEEE, 76, 680-692.

Reilly,A.P., Boashash,B., 1990, "A Time-Frequency Signal Processing Approach to Speech Recognition using Neural Networks", in Jabri,M.(ed.), Proceedings of the First Australian Conference on Neural Networks, 17-19, Syd.Uni.Elec.Eng.

Rennie,J., 1990, "Cancer Catcher", Sci.Am., May, 55.

Rumelhart,D.E., Hinton,G.E., McClelland,J.L., 1986, "A General Framework for Parallel Distributed Processing", in McClelland,J.L. and Rumelhart,D.E.(eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol.1, 45-76, MIT Press, Cambridge, MA.

Rumelhart,D.E., Hinton,G.E., Williams,R.J., 1986(a), "Learning Internal Representations by Error Propagation", in Rumelhart,D.E. and McClelland,J.L., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol.1: Foundations, 318-362, MIT Press, Cambridge, MA.

Rumelhart,D.E., Hinton,G.E., Williams,R.J., 1986(b), "Learning Representations by Back-Propagating Errors", Nature, 323, 533-536.

Rumelhart,D.E., McClelland,J.L., 1988, Explorations in Parallel Distributed Processing: A Handbook of Models, Programs and Exercises, MIT Press, Cambridge, MA.

Rumelhart,D.E., Zipser,D., 1986, "Feature Discovery by Competitive Learning", in Rumelhart,D.E. and McClelland,J.L. (eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol.1: Foundations, 151-193, MIT Press, Cambridge, MA.

Rushmer,R.F., Bark,R.S., Ellis,R.M., 1952, "Direct-Writing Heart Sound Recorder", American Journal of Diseases in Children, 83, 733-739.

Rushmer,R.F., Tidwell,R.A., Ellis,R.M., 1954, "Sonvelographic Recording of Murmurs During Acute Myocarditis", American Heart Journal, 48, 835-846.

Rushmer,R.F., Sparkman,D.R., Polley,R.F.L., Bryan,E.E., Bruce,R.R., Welch,G., Bridges,W., 1952, "Variability in Detection and Interpretation of Heart Murmurs: A Comparison of Auscultation and Stethography", American Journal of Diseases in Children, 83, 740-754.

Scher,A.M., 1974, "Mechanical Events of the Cardiac Cycle", in Howell and Fulton (eds), Physiology and Biophysics, Saunders, Philadelphia.

Sejnowski,T.J., Rosenberg,C.R., 1986, "NETtalk: A Parallel Network that Learns to Read Aloud", The Johns Hopkins Univ. Elec. Eng. & Comp. Sci. Tech. Report, 86, 33, in Anderson,J.A. and Rosenfeld,E. (eds), Neurocomputing: Foundations of Research, 663, MIT Press, Cambridge.

Shiavi,R.G., Bourne,J.R., 1986, "Methods of Biological Signal Processing", in Young,T.Y. and Fu,K.S. (eds), Handbook of Pattern Recognition and Signal Processing, 545-568, Academic Press, San Diego.

Shuo,B., Fay,Q., 1989, "A Neural Network Model of Humor Recognition", in Zhao,K.H., Zhang,C.F. and Zhu,Z.X.(eds), Learning and Recognition: A Modern Approach, 217-222, World Scientific, Singapore.

Sietsma,J., 1990, "The Effect of Pruning a Back-Propagation Network", in Jabri,M.(ed.), Proceedings of the First Australian Conference on Neural Networks, 12, Syd.Uni.Elec.Eng.

Sivilotti,M.A., Mahowald,M.A., Mead,C.A., 1987, "Real-Time Visual Computations Using Analog CMOS Processing Arrays", in Losleben,P. (ed), Proceedings of the 1987 Stanford Conference, 295-312, MIT Press, Cambridge, MA.

Tank,D.W., Hopfield,J.J., 1988, "Collective Computation in Neuronlike Circuits", Sci.Am., December, 62-70.

Tavel,M.E., 1976, "Phonocardiography", in Chung,E.K.(ed), Non-Invasive Cardiac Diagnosis, 171-191, Henry Kimpton, London.

Tavel,M.E., 1978, Clinical Phonocardiography and External Pulse Recording, Year Book Med. Pub., Chicago.

Tavel,M.E., Campbell,R.W., Gibson,M.E., (year unidentified), Heart Sounds and Murmurs: An Audiovisual Presentation, Year Book Medical, Chicago.

Tesauro,G., He,Y., Ahmad,S., 1989, "Asymptotic Convergence of Backpropagation", Neural Computation, 1, 382-391.

Touretzky,D., Pomerleau,D., 1989, "What's in the Hidden Layers?", Aust. Pers. Comp., August, 113-122.

Van Der Maas,H.L.J., Verschure,P.F.M.J., Molenaar,P.C.M., 1990, "A Note on Chaotic Behaviour in Simple Neural Networks", Neural Networks, 3, 119-122.

Waibel,A., Hampshire,J., 1989, "Building Blocks for Speech", Australian Personal Computer, August, 125-135.

Wallich,P., 1991, "Wavelet Theory", Sci.Am., January, 14-15.

Wang,X.Q., Zhang,C.F., Zhao,K.H., 1989, "A Nonlinear Network Which Recognizes Chinese Characters", in Zhao,K.H., Zhang,C.F. and Zhu,Z.X.(eds), Learning and Recognition: A Modern Approach, 209-216, World Scientific, Singapore.

Wartak,J., 1972, Phonocardiology: Integrated Study of Heart Sounds and Murmurs, Harper & Row, Maryland.

Widrow,B., Hoff,M.E., 1960, "Adaptive Switching Circuits", 1960 IRE WESCON Convention Record, 96-104, in Anderson,J.A. and Rosenfeld,E. (eds), Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA., 1986.

Widrow,B., Stearns,S.D., 1985, Adaptive Signal Processing, Prentice Hall, Englewood Cliffs.

Widrow,B., Winter,R., 1988, "Neural Nets for Adaptive Filtering and Adaptive Pattern Recognition", Computer, 21, 25-39.

Widrow,B., Winter,R.G., Baxter,R.A., 1988, "Layered Neural Nets for Pattern Recognition", IEEE Trans. Acoustics, Speech and Signal Processing, 36, 1109-1117.

Zemlin,W.R., 1988, Speech and Hearing Science: Anatomy and Physiology, Prentice Hall, Englewood Cliffs, NJ.
