Target Curricula for Multi-Target Classification: The Role of Internal Meta-Features in Machine Teaching

by

Shannon Kayde Fenn B Eng (Comp) / B Comp Sc

Thesis submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy

Supervisor: Prof Pablo Moscato
Co-Supervisor: Dr Alexandre Mendes
Co-Supervisor: Dr Nasimul Noman

This research was supported by an Australian Government Research Training Program (RTP) Scholarship

The University of Newcastle
School of Electrical Engineering and Computing

June, 2019

© Copyright

by

Shannon Kayde Fenn

2019

Target Curricula for Multi-Target Classification: The Role of Internal Meta-Features in Machine Teaching

Statement of Originality

I hereby certify that the work embodied in the thesis is my own work, conducted under normal supervision. The thesis contains no material which has been accepted, or is being examined, for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, contains no material previously published or written by another person, except where due reference has been made. I give consent to the final version of my thesis being made available worldwide when deposited in the University’s Digital Repository, subject to the provisions of the Copyright Act 1968 and any approved embargo.

Shannon Kayde Fenn 3rd June 2019

To Mum and Mar Mar, the brightest lights may shine the shortest but their light reaches farthest.

Acknowledgments

The first and most important acknowledgement must go to my wife Coralie. You have been with me through every hard moment of my adult life, and the cause of most of the good ones. This thesis would not exist if not for you, nor would I be the person I am today. Thank you.

My heartfelt gratitude to my advisers, Pablo, Nasimul, and Alex. Your guidance and wisdom sheltered me from many a poor choice over the years. Your support in professional and personal matters went above and beyond. I am thankful to count you friends.

To my father, Richard, you have always been my hero. The way I approach the world and treat the people in it is completely due to you. The day you met mum, and the day you married her, I’m sure they were her best.

To my little sisters, Jess and Deeds, never change. You’ve always made me feel loved and special. If ever two people made me feel like I could accomplish something, it was you two.

To Bo, Grizzy, Terry, Archie, Baxter, Jill, Brian, Jack, Amos, and everyone else in the Fenn clan and beyond, I can’t remember a single time I didn’t feel part of the family and for that I can never thank you enough.

The best result of choosing to do a PhD has been the friends I made. To my friends from CIBM: Amer, Amir, Claudio, Francia, Heloisa, Jake, Inna, Leila, Luke, Łukasz, Marta, Nader, and Natalie, thank you for all the coffee breaks, lunch conversations that were too good to want to leave, and other countless helping hands along the way. To Pat, Matt, Greg, Amy, Chelsea, Ben, Bec, Jason, and far too many more good people to list, thank you for all your support throughout these years.

Shannon Kayde Fenn
The University of Newcastle
June 2019

List of Publications

• S. Fenn and P. Moscato, “Target Curricula via Selection of Minimum Feature Sets: a Case Study in Boolean Networks”. Journal of Machine Learning Research, vol. 18, no. 114, pp. 1–26, 2017.

Contents

Acknowledgments
List of Publications
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Curriculum Learning
  1.2 Machine Learning for Logic Synthesis
  1.3 Research Aims
  1.4 Thesis Overview

2 Background
  2.1 Learning Multiple Targets
    2.1.1 Formalism and Terminology
    2.1.2 The Label/Target distinction
    2.1.3 Measuring Prediction Performance
    2.1.4 Methods for MTC
  2.2 Curriculum Learning
    2.2.1 Curricula in Human and Animal Learning
    2.2.2 Example Curricula
    2.2.3 Target Curricula
    2.2.4 Measuring and comparing curricula
    2.2.5 Summary
  2.3 Intrinsic Dimension and Feature Selection
    2.3.1 Feature Selection
    2.3.2 The Minimum Feature Set Problem
    2.3.3 Summary
  2.4 Logic Synthesis
  2.5 Boolean Networks
    2.5.1 Training Boolean Networks
    2.5.2 Late-Acceptance Hill Climbing
    2.5.3 Varying sample size
  2.6 Conclusion

3 Target Curricula and Hierarchical Loss Functions
  3.1 Can Guiding Functions Enforce a Curriculum?
    3.1.1 Hierarchical Loss Functions
    3.1.2 Cases with Suspected Curricula
    3.1.3 Training
    3.1.4 Experimental Results
  3.2 Appraising “Easy-to-Hard”: Ablation Studies
  3.3 Discovering Curricula
    3.3.1 Target Complexity and Minimum Feature Sets
    3.3.2 Experiments and Results
  3.4 Real-World Problems
    3.4.1 ALU and Biological Models
    3.4.2 Inferring Regulatory Network Dynamics From Time-Series
    3.4.3 Results
  3.5 Discussion and Conclusion
    3.5.1 Issues and Limitations
    3.5.2 Future Work and Direction

4 Application of ID-curricula to Classifier Chains
  4.1 Introduction
    4.1.1 Classifier Chains
    4.1.2 When and how to order chains
  4.2 Target-aware ID-curricula
  4.3 FBN Classifier Chains
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Training and Evaluation
    4.4.3 Qualifying Curricula
  4.5 Results
    4.5.1 Prior benchmarks
    4.5.2 Random Cascaded Circuits
    4.5.3 LGSynth91
  4.6 Discussion and Conclusion
    4.6.1 Future Work and Direction

5 Adaptive Learning Via Iterated Selection and Scheduling
  5.1 Self-paced Curricula and Internal Meta-Features
    5.1.1 Motivation: stepping stone state discovery
    5.1.2 The Internal Meta-Feature Selection principle
    5.1.3 Tractability
  5.2 Baseline Comparison
    5.2.1 Baselines
    5.2.2 Measures
    5.2.3 Results and Discussion
  5.3 Ablative Studies
    5.3.1 Aims and Experimental Design
    5.3.2 Results and Discussion
  5.4 Scaling to Larger Problems
  5.5 Conclusion
    5.5.1 Future Work

6 Application of ALVISS to Deep Neural Nets
  6.1 Feedforward Neural Networks
  6.2 Meta-features and Target Curricula in NNs
  6.3 Experiments
    6.3.1 Architectures
    6.3.2 Hyper-parameter Selection
    6.3.3 Metrics
  6.4 Results
  6.5 Discussion and Conclusions

7 Conclusion and future work
  7.1 Conclusions and Contributions
  7.2 Suggestions for Future Work
    7.2.1 Feature Selection
    7.2.2 Domain Extension
    7.2.3 Benchmarks

List of Tables

3.1 Example error matrices and the associated values for L1, Lw, Llh, and Lgh (ignoring a normalisation constant for readability). Note that the purpose is not to directly compare the different losses on a particular matrix, but instead to compare the pairwise disparities in the same loss on different error matrices. For example, the first and second matrices are equivalent under L1 but the first is preferred by all other losses. Rows of E are examples and columns are targets (ordered by the curriculum being enforced). Also included are equivalents of E under the two hierarchical losses: the recurrences defining these losses can be thought of as defining a transformation on E under which the respective loss is equivalent to L1.

3.2 Instance sizes of initial test-bed problems.

3.3 Instance sizes of the real-world test-beds.

3.4 Results for the yeast dataset. See Figure 3.10 for the estimated hierarchies. Note that these results are mean difference in test set accuracy, reported as a percentage, and not MCC. The use of curricula on the SK→{Ste9, Rum1} hierarchy yielded negligible improvement, however we see more promise in the PP→{Cdc2/Cdc13, Cdc2/Cdc13*} hierarchy, particularly for Lgh.

3.5 Results for the E. coli dataset. Note that these results are mean difference in test set accuracy, reported as a percentage, and not MCC. See Figure 3.10 for the estimated hierarchies. The use of curricula on the {G2, G8}→G6 hierarchy has given some improvement, however for Llh we see a drop in performance on one of the base targets in the hierarchy. For Lgh the results remained positive overall.

4.1 Test-bed instance sizes used in this chapter.

5.1 Parameters for Meta-heuristic for Randomised Priority Search (Meta-RaPS) [203] along with optimal values found using random parameter search [206].

5.2 Instance and training set sizes for the large adders. The number of targets and the example pool size are not given as they are $n_i/2$ and $2^{n_i}$ respectively for all problems.

6.1 Architecture and hyper-parameter search space and values found using random parameter search [206].

List of Figures

2.1 A 56-node Feedforward Boolean Network (FBN) which correctly implements the 6-bit addition function. Each node takes 2 inputs and computes the NAND function as its output. Inputs (far left) have been coloured red and outputs (far right) green. Note the ripple-carry style flow of information between sub-networks.

3.1 Generalisation performance of each loss on the 7-bit cascaded parity problem. The top plot shows the mean macro-averaged test-set MCC, averaged across all targets, for each loss including L1. The remaining plots show the mean difference in test MCC, on each target, between L1 (dotted baseline) and each of Lw, Llh and Lgh. The first two targets are not shown as they show only a negligible improvement. A leftward shift in the transition (top) from poor to near-perfect generalisation can be seen. This became more pronounced as the loss more strictly enforced the learning order (Lw → Llh → Lgh) with Lgh achieving a 16% reduction in sample-complexity for 0.8 MCC. The differences in test MCC for individual targets (bottom) show an increase in improvement with target difficulty, as expected; reaching a large peak improvement—from 0.17 to 0.70 MCC—for the sixth target. On this particular problem the difference between Llh and Lw was not statistically significant. For all loss functions the out-of-sample performance was significantly better than obtained with L1 (dotted line).

3.2 Mean δ-MCC for each of Lw, Llh and Lgh versus training set size for addition problems of increasing output dimension. The 95% confidence interval of the mean is shown with transparent bands; they are wider for the final adder due to the tenfold reduction in repetitions on that instance. In all cases there was a statistically significant net increase in generalisation performance and, importantly, this improvement increased with output dimension (note that the y-axes are not shared). We can also see that the improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

3.3 Mean δ-MCC for each of Lw, Llh and Lgh versus training set size for the cascaded majority, multiplexer and parity problems. The 95% confidence interval of the mean is shown with transparent bands. On all problems we can see a statistically significant net increase in generalisation performance—though the magnitude varied widely between problems. We can also see that this improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

3.4 Per-target generalisation performance on addition problems of increasing size. Averaged δ-MCC for Lw, Llh and Lgh are plotted against target index, ordered by the easy-to-hard curriculum for the 4- to 8-bit adders. The transparent bands are bootstrapped 95% CIs. Several interesting trends are apparent. Firstly, the effect of applying the curriculum increased with problem instance in a consistently positive manner. Secondly, each loss behaved differently across targets: Lw plateaued then dipped, Llh followed but continued growing after the dip in Lw, and Lgh outpaced the others by a significant margin but plummeted at the final targets. Note that the turning point in Lgh occurred later and later as problem dimension increased.

3.5 Per-target generalisation performance on the cascaded benchmark problems. Averaged δ-MCC for Lw, Llh and Lgh are plotted against target index, ordered by the easy-to-hard curriculum for the 9-bit cascaded majority, 15-bit cascaded multiplexer and 7-bit cascaded parity problems. The transparent bands are bootstrapped 95% CIs. Similar overall trends exist as in Figure 3.4, however the exact profile of the effect of each guiding function was problem dependent.

3.6 Mean difference in test-set macro-MCC between Lgh and L1 over various training set sizes for the 5-bit addition problem for randomised target orderings. Each line represents the mean change in MCC for each possible level of correlation between the randomly given order and the known order. For an n-element permutation there are ⌈n × (n − 1)/2⌉ possible values of τ, hence 11 lines. The central permutations offered little benefit, and since random permutations are skewed heavily toward τ = 0 we can infer that a randomly generated ordering would provide scant improvement on average, for this problem. The exact inverse (τ = −1) of the suggested curricula impeded the learning performance by as much as the correct order improved it, and the exact easy-to-hard ordering (τ = 1) conferred the largest benefit.

3.7 Mean MCC plotted against τ for all test-beds with known easy-to-hard curricula. The mean MCC obtained with no curriculum using L1 is shown with a dashed line. In all cases there is a clear positive correlation between τ and MCC. Notably the best performance occurred at τ = 1 and this performance was consistently better than the baseline. For all addition problems (bottom two rows), curricula at τ = 0 exceeded the baseline, meaning that even an arbitrary curriculum might be slightly beneficial. Additionally, the proportion of τ for which performance exceeded L1 increased as the output dimension increased. Some of these trends are problem specific. In the case of cascaded majority, multiplexer and parity (top row), the τ ≈ 0 region lies below baseline performance, unlike with addition. Nonetheless the overall positive correlation between τ and MCC was consistent.

3.8 Performance of ID-derived curricula on addition problems of increasing output dimension. The top row shows the curriculum quality measured by the rank correlation, τ, between the detected curriculum (πid) and π∗. The bottom row shows the mean macro-averaged δ-MCC for FBNs trained using Lgh with π∗ and πid. In all cases τ exceeded 0.50 before the respective peaks in test error difference, yet it fell away as output dimension increased. This was accompanied by an increase in the disparity of the δ-MCC performance between πid and π∗. Despite this trend, the bottom row still shows a net improvement in generalisation using πid which increased with output dimension.

3.9 Performance of ID-derived curricula on the cascaded majority, multiplexer and parity problems. The top row shows the curriculum quality measured by the rank correlation, τ, between the detected curriculum and π∗. The bottom row shows the mean macro-averaged δ-MCC for FBNs trained using Lgh with the optimal curriculum (π∗) and the automatically detected curriculum (πid). For all problems τ exceeded 0.70 before the respective peaks in test error difference. For these same problems the detected target curriculum provided nearly the same benefit as the optimal, which is no surprise given the strong correlation between the two. The problem with the largest deviation, CMUX, is also the one for which τ was weakest. Note that the x-axes are only shared within columns, and the y-axis for the bottom row is not shared, since the improvement range varied significantly between problems.

3.10 Possible target hierarchies found using Minimum-Feature-Set (minFS) in the yeast and E. coli datasets. In the E. coli data (a) we see the target G6 (with feature set {G4, G7}) may benefit from being learned after G2 or G8 (both have the feature set {G7}). In the yeast data (b) we see two potential hierarchies: either of Ste9 or Rum1 (both with feature set {start, PP}) may benefit from being learned after SK, and either of Cdc2/Cdc13 or Cdc2/Cdc13* (with feature sets {Ste9, Slp1} and {Slp1, Wee1/Mik1} respectively) may benefit from first learning PP.

3.11 Mean difference in test-set macro-MCC between L1 (dotted baseline) and each of Lw, Llh and Lgh on each training sample for the 74181, mammalian cell-cycle and FA/BRCA pathway models. The 95% confidence interval of the mean is shown with transparent bands. On all problems we see a net increase in MCC—again varying widely between problems. Except in the case of the mammalian cell-cycle model for which the losses are not clearly distinguishable, this improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

4.1 Structures used for generating synthetic Multi-Target Classification (MTC) benchmarks with deep inter-target relationships.

4.2 Mean δ-MCC (w.r.t. random chains) for each chain ordering method versus training set size, for addition problems of increasing output dimension (95% CI shown with transparent bands). The two comparison methods, CEbCC and label-effects, did not improve on a random ordering. Both the target-agnostic and target-aware ID-curricula methods improved over a random chain in all cases with the latter dominating for larger problem sizes. Considering the peaks, the target-aware ordering improved with target dimension, however the target-agnostic ordering plateaued for the higher dimensions.

4.3 Mean δ-MCC (w.r.t. random orderings) for each chain ordering method versus training set size for the cascaded majority, multiplexer and parity problems (95% CI shown with transparent bands). On the parity benchmark the comparison methods, CEbCC and label-effects, were not distinguishable from random chain orders. On the multiplexer CEbCC slightly outperformed random chain orders while label-effects slightly worsened performance. Both the target-agnostic and target-aware ID-curricula methods improved over a random chain ordering, in all cases, with the latter dominating for the multiplexer. There was no improvement seen by introducing target-awareness to the ID estimation (red and green curves overlap) for the majority and parity problems, which is unsurprising given the minimal room for improvement (see Figure 3.9).

4.4 Per-target generalisation performance on addition problems of increasing size. Averaged δ-MCC (w.r.t. random curricula) for each method are plotted against target index, ordered by π∗ (95% CI shown with transparent bands). The equity of performance of CEbCC and label effects with random curricula (Figure 4.2) was consistent across all targets. For the ID-curricula, performance increased after the first target for both methods, but for the target-agnostic approach this dropped again (echoing Figure 3.4). Target-awareness dramatically improved on this, remaining dominant across all targets.

4.5 Per-target generalisation performance on the cascaded benchmark problems. Averaged δ-MCC (w.r.t. random curricula) for each method are plotted against target index, ordered by π∗ for the 9-bit cascaded majority, 15-bit cascaded multiplexer and 7-bit cascaded parity problems (95% CI shown with transparent bands). Similar overall trends exist as in Figure 4.4, however the exact profile of the effect of each guiding function was problem dependent. Target-agnostic and target-aware curricula were equally performant on the majority and parity problems (red and green curves overlap). On the multiplexer problem the target-agnostic method suffered a drop-off which was drastically improved upon using target-awareness. The baselines performed no differently to random curricula on the parity problem. On the majority problem label effects began to improve after the third target, approaching the ID methods; CEbCC displayed inverse performance. On the multiplexer, CEbCC improved for the first few targets but worsened on the last. Label effects was again qualitatively inverse to this, initially under-performing random chains but then improving on the final targets, outperforming the target-agnostic method on the final two targets.

4.6 Correlation of generated curricula with π∗, measured by τπ∗, for each chain ordering method versus training set size on addition problems of increasing output dimension (95% CI shown with transparent bands). The two baselines, CEbCC and label-effects, produced curricula not at all correlated with π∗. Both the target-agnostic and target-aware ID-curricula methods converged on π∗ as training set size increased; however, the target-agnostic method required larger and larger training sets to reach that peak. Looking at Figure 4.2 this convergence may eventually occur after the region in which peak improvement is gained from a curriculum. Target-awareness appears to rectify this, as the point at which near perfect correlation was achieved was relatively stable over target dimension.

4.7 Correlation of curricula with π∗, measured by τπ∗, for each chain ordering method versus training set size for the cascaded majority, multiplexer and parity problems (95% CI in transparent bands). The two baselines, CEbCC and label-effects, produced curricula not at all correlated with π∗ on the multiplexer and parity instances and weakly correlated and anti-correlated respectively on the majority problem. The target-agnostic and target-aware ID-curricula methods converged toward π∗ as training set size increased. On the majority problem they converged almost identically (the red and green curves overlap), on the parity problem there was a small improvement obtained with target awareness, while on the multiplexer problem the target-aware curricula converged much more quickly.

4.8 Concordance of curricula, measured by Kendall’s W, for each chain ordering method as it varied with training set size for the cascaded majority problem (there are no CIs; W is a single value for the full set of curricula at each point). CEbCC and label-effects produced curricula with some level of increasing concordance. Looking at Figure 4.7 it is clear that this common point is not π∗ and thus CEbCC and label effects may eventually settle on their own centroid curriculum (which may or may not be optimal). The target-aware and target-agnostic curricula converged at a near-identical rate (red and green curves overlap).

4.9 Generalisation trends for the structured synthetic problems. Mean training-set-averaged δ-MCC (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

4.10 Curriculum quality measured by correlation with π∗ for the structured synthetic problems. Mean training-set-averaged τπ∗ (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

4.11 Agreement among chain orders measured by concordance (Kendall’s W) for the structured synthetic problems. Mean training-set-averaged W (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

4.12 Generalisation trends for the LGSynth91 benchmark suite. Mean training-set-averaged δ-MCC (with bootstrapped 95% CI) is shown for each method grouped by instance. There was a general trend in improvement by the ID-based methods, with the target-aware curricula achieving better generalisation. The corresponding Critical Difference diagram is given in Figure 4.13.

4.13 Critical Difference diagram based on the macro-averaged test-MCC results (Figure 4.12) for the LGSynth91 benchmark suite. The methods are placed according to their average rank; those closer than the critical difference (CD)—determined by the Nemenyi post-hoc test—are joined by bars. We see that the target-aware ID curricula are separated from all baselines, but not from target-agnostic. Target-agnostic curricula could not be separated from CEbCC or random, while label effects ranked lowest, but not separated from CEbCC.

5.1 A simple example demonstrating the intuition behind Adaptive Learning Via Iterated Selection and Scheduling (ALVISS). The original input features are {x1, x2, x3, x4, x5, x6} and the target features are {y1, y2, y3}. At the outset (b) y1, y2 and y3 have minimum-cardinality feature sets (shown by coloured dots) of size 2, 4, and 4 respectively. Thus y1 is selected as the “easiest” and solved first. After training, a sub-network (a) for y1 is found with the internal node z1. Initially y2 and y3 were indistinguishable but the inclusion of z1 in (c) permits a smaller feature-set of size 3 for y2 while the solution for y3 remains unchanged. The procedure can now easily distinguish between the two remaining targets since solving for y2 is a simpler task in light of z1.

5.2 δ-MCC and training time for ALVISS and baselines, on binary addition problems ranging between 4 and 8 targets. In (a) we see that ALVISS generalised significantly better than all baselines on all sizes, but most importantly this improvement increased with the target-dimension of the instance. In (b) we see convergence time (on a log scale) showing ALVISS converged faster than the baselines. Additionally ALVISS follows a markedly different growth trend to the rest. This extreme reduction in time with ALVISS is explained as most of the training time in the baselines is spent optimising the higher-order targets. ALVISS however typically has significantly reduced problem instances for those targets due to the useful meta-features found within the lower-order sub-networks. The significant difference in scaling trends in (b) demonstrates the impracticality of running the baseline methods on the larger adders presented at the end of this chapter.

5.3 Per-target MCC on binary addition problems ranging between 4 and 8 targets, for ALVISS and baselines. This shows that the bulk of the benefit due to ALVISS was in the higher order targets.

5.4 Curriculum quality (τπ∗), on binary addition problems, for the three ID-based curriculum generation procedures: target-agnostic, target-aware and ALVISS. Adding extra meta-features, in the form of targets or internal features, significantly improved the probability of discovering the correct underlying curricula. Interestingly, ALVISS initially performed better than target-aware on low sample sizes, but the latter converged to τ = 1.0 slightly earlier.

5.5 Mean δ-MCC and training time (log scale) for ALVISS and baselines on the cascaded majority, multiplexer, and parity problems. In (a) we see that ALVISS generalised better than all baselines on the multiplexer and parity problems. On the majority problem, the curriculum-dependent methods (ALVISS, SN and CC) generalised equally but better than IC. The training time results (b) show ALVISS converged faster than all baselines on the majority and multiplexer problems but CC trained faster on the parity problem (on which IC and SN were markedly slower).

5.6 Per-target MCC on the cascaded majority, multiplexer, and parity problems, for ALVISS and baselines. The trends are problem dependent but qualitatively similar to those in Figure 5.3, except on the majority problem, where ALVISS performed similarly to the other curriculum-dependent baselines.

5.7 Curriculum quality for the three ID-based curriculum generation procedures: target-agnostic, target-aware and ALVISS, on the cascaded majority, multiplexer and parity problems. As before, there was little variation on the parity and majority instances as the target-agnostic approach already rapidly converged on τπ∗. On the multiplexer problem, curricula found using internal meta-features (ALVISS) slightly under-performed compared to those found using target features (target-aware).

5.8 Macro-MCC for the randomised cascaded benchmarks (Section 4.4.1). ALVISS generalised better by a margin of significance in all cases. SN and CC outperformed IC in almost all cases, though there was some variation between the two across different structures and sizes. Within each structure-group the generalisation improvement increased with problem dimension for ALVISS. This suggests that ALVISS, and target-curricula in general, is suited for deeper problems as initially assumed.

5.9 Training time for the randomised cascaded benchmarks (Section 4.4.1). ALVISS consistently converged fastest on the first three (a-c) structures, by orders of magnitude in some cases. On the Pyramid Directed Acyclic Graph (DAG) structures (d) classifier chains converged faster, and for smaller sizes so did Independent Classifiers (IC), likely due to the greater number of targets. For the larger size ALVISS trended downward in relation to the others suggesting that the suspected exponential trends w.r.t. target dimension (Figure 5.2) overtake the quadratic trends noted in Section 5.1.3.

5.10 Curriculum quality results on the random cascaded benchmarks. ALVISS matched or slightly underperformed target-aware curricula on the direct chain structures (a) while out-performing on the indirect chains (b). There was little difference between any method on the binary tree structure (c) while on the pyramidal DAG (d) ALVISS produced curricula less correlated with π∗ than target-aware.

5.11 Cumulative macro-MCC and training time for the subset of circuits taken from the LGSynth91 synthesis benchmark (error bars represent bootstrapped 95% CI). In (a) we see that ALVISS outperformed IC on all problems, Classifier Chains (CCs) with target-aware ID-curricula on 8 of 10, and a single network with target-agnostic ID-curricula on 7 of 10. This was better than expected as the circuits were selected without regard for function or composition and the initial expectation was that ALVISS may hurt performance in some cases. In (b) we see that the improved training time offered by ALVISS persisted except on a single dataset, alu2, where it was not significantly faster than CC. For several problems this difference was an order of magnitude or more.

5.12 Ablation experiment results for the addition benchmarks. For both generalisation and training efficiency the ablated variants under-performed ALVISS. Additionally filtering-only initially outperformed the curriculum-only variant, but scaled worse for both metrics.

5.13 Ablation experiment results for the remaining cascaded benchmarks. The ablated variants again under-performed ALVISS on both metrics. For all three problems the curriculum resulted in greater generalisation than meta-feature filtering but the training efficiency relationship between the two varied.

5.14 Ablation experiment results for the random cascaded benchmarks. ALVISS consistently outperformed both ablated variants. This is particularly interesting given that on the pyramidal DAG structure (d) curriculum-only actually worsened generalisation. This is suggestive of a synergistic relationship between the imposition of a curriculum and the filtering of meta-features. The relative importance of curricula and meta-feature filtering varied between structures.

5.15 Ablation experiment results for convergence time on the random cascaded benchmarks. In most cases ALVISS improved training speed compared to either ablated variant. In one case each of the direct and indirect chain structures, and on all pyramidal DAG instances, the filtering-only variant produced lower training times.

5.16 Ablation experiment results for the LGSynth91 circuits. Both ablated variants resulted in lower generalisation than ALVISS on all instances. Generally, the use of meta-feature filtering alone also outperformed IC. As in Figure 5.14, even when the curriculum imposition alone reduced performance (compared to IC) the unablated method outperformed filtering-only. In terms of training efficiency the full approach was fastest except on pm1 and sct. In most cases one or both ablated variants significantly improved training performance over IC.

5.17 Test set δ-MCC for the addition problems. This is included to show that each ablated variant improved generalisation over a different range of training set sizes. Additionally the improvement from the unablated variant appears greater than the sum of its parts, suggesting that curriculum and meta-feature filtering reinforce each other. This helps to explain why ALVISS generalised better than either ablated variant, even in cases where one variant, used alone, hurt performance.

5.18 Test-set macro-MCC (a) and curriculum quality (b) for large-scale addition problems (16 up to 128 outputs). The number of runs per data-point was much lower for some of these curves so the shaded region shows the standard deviation, not a confidence interval. With fewer than 1000 training examples full generalisation was achieved for the largest adder considered (128-bit). Likewise the corresponding curricula still converged to π∗ over the relevant range in training-set size. This is the first case of a feedforward structure generalising successfully on binary addition problems of this size.

6.1 The dovetailed (a) architecture and the final architecture (b) that results from applying ALVISS. Note that both use skip connections from all hidden layers to the output layers for each target. Thus the order of targets affects information flow through the network. The structure in (b) also contains skip connections from the input to the filter layers, which implement the meta-feature filtering component of ALVISS.

6.2 Macro-MCC for adders and LGSynth91 datasets for the neural network proof-of-concept experiments. For all addition instances (a) ALVISS outperformed the baseline architectures and this performance gap increased with instance size. Above 16 targets, the baselines are not visible as they generalised no better than random (MCC = 0). These results are the first case of feedforward neural networks consistently generalising well on binary addition problems of this size. For the LGSynth91 benchmarks (b) there was greater variation in performance than seen in Chapter 5. In 7 of the 13 datasets ALVISS outperformed significantly, in three (mammalian, alu2, and pm1) it performed equal best and in the remaining three (cu, f51m, and x2) it was outperformed by one or more baselines. Note that I conservatively consider f51m a loss despite the overlapping error bars.

6.3 Macro-MCC for heterogeneous datasets for the neural network proof-of-concept experiments. For the direct chain (a), indirect chain (b), and binary tree (c) structures ALVISS generalised better by a margin of significance and this margin increased with output dimension. For the pyramidal DAG structure (d) the IC architecture generalised better than ALVISS on both sizes, while the dovetailed architecture with curriculum applied was competitive for two of the smaller instances (7-A and 7-C). In every case, the use of a curriculum within the dovetailed architecture improved generalisation.

6.4 Correlation (τπ∗) of curricula with π∗ for datasets with known optimal curricula. Curricula generated by ALVISS for both FBNs and neural networks were in high agreement. Both also maintained strong recovery of π∗ with increasing target dimension (a), long after the a priori methods failed. As with FBNs the difference in the structure of inter-target relationships affects curricula recovery. Using neural networks degraded the curricula slightly on both chained structures (b, c), and improved slightly on the pyramidal DAG structure (e). The procedures did not vary significantly on any instance of the binary tree structure (d).

ABSTRACT

In machine learning, methods inspired by human teaching practices, such as the use of curricula, have been fruitful. The bulk of such work has focused on applying a curriculum to the training examples presented to the learning algorithm. However, when there are multiple targets in the learning problem it is also possible to conceive of a curriculum being applied with respect to them. This type of curriculum has received significantly less attention, and the key unresolved challenges lie in determining the relative difficulty of learning each target.

Logic synthesis is an important aspect of Electronic Design Automation. With the increasing complexity and breadth of designs, however, comes a significant cost in expert human effort. The most common optimisation-based approaches to automating logic design unfortunately face exponential growth in representation size. Machine learning, which builds models from limited example data, offers a potential remedy to this. The majority of circuit synthesis problems are inherently multi-target, making them a good test-bed for target curriculum methods.

In this thesis I explore possibilities for target curriculum methods in small-sample machine learning problems using logic synthesis as a test-bed. Using a principle of probable common utility and using intrinsic dimension as a proxy for complexity, I detail two a priori methods, and one self-paced method, for generating target curricula. I also explore a number of ways of applying target curricula within Boolean and neural networks. The results of these studies suggest that target curricula are highly effective in the right circumstances and that using them to directly train digital circuit designs is a compelling direction for future design automation.

Chapter 1

Introduction

This chapter briefly introduces the core concepts addressed in this thesis beginning with an informal definition of a target curriculum in the context of machine learning. I then describe the problem of logic synthesis and why it is a suitable test-bed for target curriculum methods. Finally, I state the research aims and outline the structure of the remainder of the thesis.

1.1 Curriculum Learning

Machine learning is likely the most fruitful and promising field of artificial intelligence at present. Even after adjusting for the significant fanfare and notoriety, it is easy to see that we stand amidst a technological revolution due, in large part, to machine learning. Numerous fields are transforming as machine learning applications mature to the point where solutions perform near to, or better than, human experts [1]. Examples typifying the future potential of this are: competing favourably with radiologists at predicting pneumonia from X-ray images [2], assisting in the search for exotic subatomic particles [3], and providing serviceable translation between multiple languages using a single system [4].

Supervised learning is arguably the most recognisable form of machine learning. The goal, broadly speaking, is to induce a function given some example data (typically input/output pairs) representing whatever phenomenon we wish the function to model. The key challenge is to have the function generalise well, i.e. to be as accurate as possible at predicting the phenomenon for inputs not in the example data. The amount of data needed to consistently generalise well is called the sample complexity. While extremely large datasets are the norm for most modern success stories, there remain no small number of domains where collecting and labelling abundant examples is infeasible or expensive. In such cases, methods with low sample complexity are clearly desirable.


The output value we attempt to predict is called a target and when this is multivariate we call the problem one of multi-target prediction [5]. Many interesting problems possess multiple targets, and they pose additional challenges beyond those already present for single-target problems. It is widely regarded as true that optimum predictive performance in this case requires taking advantage of any relationships that exist between the targets [6, 7], and the bulk of multi-target approaches explicitly seek and model these relationships in diverse ways [5]. One option is to exploit disparities in the difficulty of targets by applying a curriculum to them; exploring that option is the main thrust of this thesis.

In a very general sense a curriculum is some strategy for providing signals to the learner in a helpful order. The most natural, and effective, approach is to provide information such that the sub-problems the learner has to solve progress in difficulty from easy to hard. In both animals [8] and humans [9, 10], effective learning and teaching strategies are known to often involve the use of curricula; when we teach a young student to add numbers we do not start with 4096.173 + 6e7 + π. It turns out that such an approach is also useful when teaching machines [11–15].

By far the most common type of curriculum in machine learning is one that affects the way examples are provided to the learner [15]. This sort of machine teaching can be done directly—actually holding back difficult examples until a later stage [12, 16]—or indirectly—by changing the way that the learning algorithm uses the examples to construct a model [17–19]. This kind of curriculum has had a significant impact on machine learning in recent years and in several cases authors have reported the necessity of using some form of curriculum [20, 21].

While the presence of multiple targets does pose challenges to learning, it also provides the opportunity for using a different kind of curriculum. Applying a curriculum to the targets rather than the examples has received much less attention in the literature and was the basis of this research.
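To make the direct, example-level form of curriculum described above concrete, the following minimal sketch orders training examples by a difficulty score and reveals them to the learner in stages, easiest first. The difficulty scores, pacing schedule, and function names are assumptions made purely for illustration; this is not a method used later in the thesis.

```python
import numpy as np

def paced_batches(X, y, difficulty, n_stages=4, epochs_per_stage=5):
    """Direct example-level curriculum: reveal examples easiest-first.

    `difficulty` is any per-example score (assumed to be supplied by the
    teacher); the learner only ever sees the easiest fraction of the data,
    and that fraction grows stage by stage until the full set is visible.
    """
    order = np.argsort(difficulty)          # easiest examples first
    n = len(order)
    for stage in range(1, n_stages + 1):
        visible = order[: int(np.ceil(stage * n / n_stages))]
        for _ in range(epochs_per_stage):
            # the learner trains only on the currently revealed subset
            yield X[visible], y[visible]

# Usage: for X_s, y_s in paced_batches(X, y, difficulty): update the learner on (X_s, y_s)
```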

1.2 Machine Learning for Logic Synthesis

Digital circuits form the bedrock of modern computation. They have grown in size and complexity and continue to do so. With this complexity come significant design challenges that form a major sink for human effort [22]. The emergence of novel computational environments (DNA, quantum, and heterotic computing for example) with divergent building blocks and constraints also calls for flexible automatic design methodologies [23, 24]. It is no surprise then that breakthroughs in Electronic Design Automation (EDA) are sought [23], and specifically from machine learning [22].

Current digital system design practice involves design flows incorporating a pipeline of many EDA tools, a key stage in which is logic synthesis. Logic synthesis is the transformation of an abstract description of desired system behaviour into a concrete logic circuit. The task of automating this is incredibly challenging and many sub-problems are known to be intractable [25–27]. Most practical circuits have multiple outputs, and the complexity of those outputs can vary greatly. Most approaches treat logic synthesis as an optimisation problem relying on a full description of the expected circuit behaviour [28–32]. This is a significant roadblock as any such description grows exponentially with the number of inputs, and determining whether two full circuit descriptions are equivalent is intractable.

The current applications of machine learning are predominantly indirect approaches [33–37]. In such an approach a surrogate model [36] is trained and used to evaluate parameter choices or predict risks in the design process [22]. If a learning algorithm could instead produce actual circuits with sufficient generalisation performance from relatively small sets of examples, it could accelerate a typical optimisation loop by directly producing candidate solutions. The potential utility, the challenging nature, and the fact that the majority of circuit designs are multi-output make logic synthesis an ideal proving ground for target curriculum methods.

The confluence of logic synthesis and machine learning may also offer bidirectional benefits. There is rising interest in the machine learning literature in hardware-mappable design [38, 39], and binary learning models are inherently advantageous in this regard. Interest is similarly growing in the opposite direction: applying ML in the synthesis domain to accelerate design and help with scaling issues [35, 40–42]. These benefits may be even more applicable in cases where approximate synthesis is acceptable [43].

It is clear from Section 1.1 that target curricula warrant further study and logic synthesis appears to be an ideal test-bed. The lessons learned will also hopefully inform related areas of machine learning such as multi-task learning [44], lifelong learning [45], and reinforcement learning [46]. Thus this thesis seeks to examine target curriculum methods with a focus on applying them to the problem of logic synthesis.
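As a concrete illustration of the two points above—that practical circuits are multi-output, and that a complete behavioural description explodes with input count—the snippet below enumerates the truth table of a one-bit full adder, a tiny circuit with two targets (sum and carry-out). The choice of circuit is an assumption for illustration only; it is not one of the benchmarks used later.

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: two targets (sum, carry-out) from three inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

# A complete behavioural description is the full truth table: 2**n_i rows.
rows = [(a, b, cin, *full_adder(a, b, cin)) for a, b, cin in product((0, 1), repeat=3)]
for r in rows:
    print(r)   # (a, b, cin, sum, carry) -- 8 rows for 3 inputs

# Each additional input doubles the number of rows: a 64-input circuit would
# need 2**64 rows, which is why learning a circuit from a small sample of
# rows, rather than from its full description, is an attractive prospect.
```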

1.3 Research Aims

The primary goal of this research was to broaden the understanding of target curricula. This involved developing methods for their construction, use in training, and utility evaluation. The sub-goals specific to target curricula were to:

• verify if easy-to-hard target curricula help in learning digital circuits,

• explore the sensitivity of learning performance to variations in curricula,

• confirm or deny the existence of a relationship between intrinsic dimension and target difficulty,

• design model-agnostic methods for inferring near-optimal curricula from data, and

• construct and compare different methods of imposing a curriculum of targets on a learner.

The sub-goals specific to machine learning in logic synthesis were to:

• further examine the generalisation potential of directly learning multi-output circuit models,

• quantify the sample complexity, and trends therein, particularly as they vary with the use of target curricula, and

• assess the feasibility of learning circuits of practical size with high generalisation from small datasets.

A final tangential goal was to:

• explore the use of the resulting techniques in other models and frameworks to assess their broader relevance.

1.4 Thesis Overview

In Chapter 2 I formally define the foundational concepts necessary for the thesis and discuss the relevant state of the art. The chapter covers multi-target classification, curriculum learning and target-curricula, the intrinsic dimension of a function and its purpose, the state of machine learning in logic synthesis, and a definition of the primary learning model: the FBN. Important experimental methodology notes common to all chapters are contained in the relevant sections.

In Chapter 3 I present and test the first method for enforcing target curricula by using hierarchical loss functions. This method is then used to examine the space of curricula and the assumed easy-to-hard curriculum is shown to be optimal. Then I describe a novel technique for building curricula using only the training examples and show that this method recovers the optimal curriculum well. The hierarchical losses are tested again using the data-derived curricula, and shown to outperform an equivalent curriculum-lacking learner. Finally, this combined approach is shown to improve generalisation on several real world problems for which no curriculum is immediately obvious.

Chapter 4 applies the curriculum generation procedure from Chapter 3 to the open problem of chain ordering in Classifier Chains: a state-of-the-art multi-target problem transformation technique. In it I also resolve questions raised by the previous chapter, and arrive at an extended version of the curriculum generation procedure from Chapter 3. I demonstrate this procedure to be superior to its precursor in most cases for training Classifier Chains, both in terms of its ability to recover the optimal curriculum, and the generalisation performance imparted by its use. In this chapter I also present a synthetic data generation procedure that produces scalable problems with known target relationship structures.

Next, Chapter 5 describes a method for explicit representation learning in FBNs which combines curriculum generation methods with embedded feature selection. This approach is designed to, and shown to be capable of, curating and uncovering useful meta-features in the learning process. These meta-features correspond directly to useful sub-modules in the partially trained models. I show that this technique simultaneously improves the quality of discovered target curricula, generalisation, and training speed when used with FBNs on logic synthesis benchmarks. This chapter presents the concept underpinning all the curriculum methods herein: the assumption of probable common utility. Finally I scale up to achieve consistent perfect generalisation on a large, modern logic synthesis benchmark problem, a first for machine learning in logic synthesis.

In the final experimental chapter, Chapter 6, I detail proof-of-concept experiments applying the techniques developed in Chapter 5 to feedforward neural networks. Successful results demonstrate the wider utility of the target curriculum methods from prior chapters and produce another first: consistent near-perfect generalisation of a feedforward neural network on large binary addition problems. Importantly, this is achieved with remarkably few examples.

Finally, in Chapter 7 I summarise the findings, note the final contributions and suggest suitable directions for future work.

Chapter 2

Background

This chapter provides the context and foundation for the work in the remainder of the thesis. In the first half I give background on machine learning with multiple targets, the current state of curriculum learning with more detail on target-curricula, and a definition of the Intrinsic Dimension of a function (while differentiating it from similar concepts). In the remaining half of the chapter I frame logic synthesis as a machine learning problem, and define the Feedforward Boolean Network (FBN) model including important distinctions in training. Elements of experimental methodology common to all chapters are found toward the end of this chapter.

2.1 Learning Multiple Targets

The archetypal form of machine learning is single-target supervised learning. It is the go-to pedagogical “hello world” example and is by far the most well understood. Informally, it involves inferring a model of some phenomenon given input/output examples. The purpose is to use the output value in the examples to train the model so that it can produce “useful” outputs later, when the output value is unknown. We call the set of examples training data, and the properties for which we have input and output values we call features (or sometimes attributes). When the target feature is categorical we have a classification problem, and when continuous we have a regression problem.

It is quite natural to extend prediction to involve more than one target. In the process of classifying an email as spam (or not) for example, we might well be able to classify it as any combination of ‘high-priority’, ‘travel-related’, or ‘appointment-related’. In the most general case, extending a single-target problem to multiple target features is called multi-target prediction [5]. Such problems may be purely classification, purely regression, or a mix. The possible variations and terminology explode at this point [5], even more so when we consider orthogonal extensions such as on-line or active learning, structured prediction, or ranking. As such I focus the next subsection on clarifying terminology, definitions, semantics, and scope. I then discuss how to measure generalisation performance on multiple targets, and finally I present popular methods in the literature including relevant state-of-the-art.

2.1.1 Formalism and Terminology

This thesis will focus on Boolean Multi-Target Classification, a specialisation of multi-target prediction. The taxonomy by Waegeman et al. [5] gives the more common name Multi-Label Classification, however there are important reasons why I deviate from their terminology. Instead I add another level—Multi-Target Classification (MTC)—in between, such that multi-label classification is a special case of Multi-Target Classification which in turn is a special case of multi-target prediction. The justification for this deviation and for adjusting the taxonomy in this way comes in Section 2.1.2.

An MTC problem has purely Boolean target features, i.e. they take values in $\mathbb{B} = \{0, 1\}$. A Boolean MTC problem is one where the input features are also Boolean. A specific problem instance is a set of examples represented by a matrix¹, $D \in \mathbb{B}^{n_e \times (n_i + n_o)}$, where each row is one of $n_e$ examples and each column is a feature ($n_i$ input features and $n_o$ target features). This matrix can be split into the input matrix, $X \in \mathbb{B}^{n_e \times n_i}$, and the target matrix, $Y \in \mathbb{B}^{n_e \times n_o}$, such that $D = [X\,Y]$. The goal is to use this data to find a mapping:

$$f : \mathbb{B}^{n_i} \to \mathbb{B}^{n_o},$$

commonly called a hypothesis, which correctly models whatever phenomenon generated the examples. This not only means that the output matrix, $\hat{Y} = f(X)$, from the model should closely match $Y$, but that the model should accurately predict the phenomenon for examples not given in $D$. We call this property generalisation, and we use the term memorisation to refer to how well the model fits the data it has been presented. The process that produces $f$ is called training.

Clearly it is possible to have a model that memorises very well but generalises poorly (such a model is said to have over-fit the data). As such the data $D$ should be split prior to training into a training set and a test set. Only the training set is given to the training process; the predictive performance of $f$ on the test set is used to estimate generalisation. Completely shielding the test set from the training procedure is of utmost importance for ensuring that the generalisation estimate is valid [47].
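As a minimal sketch of this formalism, the snippet below builds a toy Boolean MTC instance $D = [X\,Y]$, splits it into training and test sets, and evaluates a deliberately over-fitting look-up-table hypothesis whose memorisation is perfect while its generalisation estimate is poor. The toy “phenomenon”, the look-up-table hypothesis, and all names are assumptions chosen purely for illustration; this is not the FBN model used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_o, n_e = 8, 3, 200           # input features, target features, examples

# A toy "phenomenon" generating the examples (assumed for illustration only).
X = rng.integers(0, 2, size=(n_e, n_i))
Y = np.column_stack([X[:, 0] & X[:, 1],          # target 1: AND of two inputs
                     X[:, 2] ^ X[:, 3],          # target 2: XOR of two inputs
                     X[:, :4].sum(axis=1) % 2])  # target 3: parity of four inputs
D = np.hstack([X, Y])                            # D = [X Y] in B^{n_e x (n_i + n_o)}

# Split D before training: only the training set is shown to the learner.
idx = rng.permutation(n_e)
tr, te = idx[:150], idx[150:]

# A pure look-up-table hypothesis f: B^{n_i} -> B^{n_o}: it memorises the
# training rows exactly and guesses all-zeros for any unseen input pattern.
table = {tuple(x): y for x, y in zip(X[tr], Y[tr])}
def f(X_in):
    return np.array([table.get(tuple(x), np.zeros(n_o, dtype=int)) for x in X_in])

memorisation   = (f(X[tr]) == Y[tr]).mean()   # 1.0: fits the presented data
generalisation = (f(X[te]) == Y[te]).mean()   # much lower: the model has over-fit
print(memorisation, generalisation)
```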

¹ Representation as a set of pairs is also common.

2.1.2 The Label/Target distinction

The definition in Section 2.1.1 matches that commonly given in the Multi-Label Classification (MLC) literature. There are, however, important connotations and assumptions [48, 49] that come with the term label. While these are valid in the usual contexts, they are not always so. This clarification may seem like semantic pedantry, but these assumptions greatly affect choices of performance metric (Section 2.1.3) as well as algorithm design. It is therefore important to address them here.

The multi-label case holds an inherent assumption of ‘relevance’ and particularly the existence of an ‘irrelevant’ class for each label. This is valid in many cases, like text categorisation or medical diagnosis. In problems where the output is representative of a binary category rather than a label, however, this view may be ill-advised. Read et al. [50] describe the difference well²:

In typical MLC problems, the binary classes indicate relevance (e.g., the label beach is relevant (or not) to a particular image). Hence, in practice only slightly more than $1/n_t$ labels are typically relevant to each example on average…This means that a relatively small part of the target-space is used. In MTC³, classes (including binary classes) are used differently – e.g., a class gender (M/F) – with a less-skewed distribution of classes…In summary, in MTC the practical target-space is much greater than in MLC, making probabilistic inference more challenging.

Due to the connotations that come with the term label, I instead refer to the general case as Multi-Target Classification. I use “Label” (and MLC) only when discussing relevant work where the aforementioned assumptions are met or made. Much discussion has been framed by the multi-label viewpoint but relatively little by the multi-target. The reduced assumptions in the latter may be quite helpful in the case of extending methods into multi-class multi-target domains⁴. It is also important for logic synthesis, where there is no semantic or functional difference between either logic level. Measures and methods which prioritise one output value are liable to perform inconsistently if they are presented with a negated circuit, for instance (output values switched). This will be particularly important when performance metrics are discussed in Section 2.1.3.

Due to the common conflation of terms, and the ubiquity of synonymous terms, I use the following in specific ways throughout the text, which may differ slightly from the reader’s experience:

2 I have altered some terminology and notation for consistency with the rest of the thesis.
3 Read et al. [50] use the term Multi-Dimensional Classification (MDC) but for clarity I have replaced it with MTC.
4 Not to be confused with multi-task, as it occasionally is.

Single-Target Classification (STC) Standard supervised learning with a single categorical target; there may or may not be a negative class.

Multi-Target Classification (MTC) As for STC but with multiple categorical targets, again there may or may not be negative classes.

Multi-Label Classification (MLC) MTC with a specific “negative” or “irrelevance” class for each target.

Multi-Task Learning (MTL) A general set of approaches to handling separate, but overlapping, learning problems. Each problem is typically single-output. The difference from MTC is that, in MTL, the examples and features available for each target are distinct (but may overlap) [5].

In the next section I discuss performance measures commonly used in MTC/MLC contexts.

2.1.3 Measuring Prediction Performance

In this section I define metrics used for evaluating the prediction performance of a classifier in single-target and multi-target contexts. Functionally there is no difference between training and test data for this purpose; however, the function(s) chosen for directing the training process—guiding functions—need not be the same as those used for evaluating generalisation. In fact, the key requirements of a guiding function and a final performance metric may differ markedly. This section is devoted to metrics used for evaluating generalisation performance (Chapter 3 is devoted to the experimental analysis of guiding functions).

Single-Target Metrics

Assume we have a binary target vector y and a corresponding prediction vector ŷ. If all examples are interchangeable then we can completely describe the disparity in the prediction of y by ŷ using the confusion matrix:

              ŷ
            1     0
    y   1   TP    FN
        0   FP    TN

The confusion matrix condenses the performance of a predictor down to four quantities: the number (or sometimes fraction) of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). For a binary class these are the only options, thus TP + TN + FP + FN = n_e. The number of samples in each class is given by n_N = TN + FP and n_P = TP + FN, and the imbalance in the target y can be quantified by n_N/n_e. Most single-target, and many multi-target, scores can be defined using only the confusion matrix. By far the most common and easily understood measure is accuracy5. It is simply the rate of correct prediction given by

accuracy = (TP + TN) / n_e

and takes values in [0, 1]. A perfect prediction is assigned 1, a completely anti-correlated prediction takes 0, and a uniformly random prediction has an expected value of 0.5. Despite its simplicity and clarity, accuracy is highly susceptible to class imbalance and is not suitable as an evaluation measure in such cases. In many applications the relative importance of controlling false positives and false negatives is known. It is easy to see in the case of a medical diagnosis that a false negative can be far more costly in human terms than a false positive while, in our email spam classification example, the reverse is arguably true. Common measures that are typically used to quantify these rates are

sensitivity = TP / n_P    and    specificity = TN / n_N,

while another related measure,

precision = TP / (TP + FP),

may be used. When it is possible to trade these off in the learner with some tune-able parameter then graphical representations of sensitivity vs. specificity or sensitivity vs. precision may be advised [52]. It is also common for such a curve to be additionally condensed by taking the area under the curve (AUC) [53]. However, if the learned model cannot be tuned in this way such metrics are infeasible. Another typical approach is for a pair of measures to be blended, as in the ubiquitous F1-score (or just F-score or F-measure). The F1-score is the harmonic mean of sensitivity and precision and has a weighted generalisation, the Fβ-score [54]. While it handles class imbalance better than accuracy, it neglects the false-negative rate which, although desirable in certain applications, is disadvantageous for generalised MTC and particularly so for logic synthesis. The F-score, like accuracy, is increasingly advised against as a measure of classifier performance [52, 54].

5 Confusingly, an entirely different metric, Jaccard Similarity, is also sometimes called “accuracy” in MLC literature [51].

One metric which does take the entire confusion matrix into account is the Matthews Correlation Coefficient (MCC) [55]:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).    (2.1)

As the name implies, the range for the Matthews Correlation Coefficient (MCC) is [−1, +1]: −1 indicates the prediction is completely opposite to the target, +1 indicates perfect prediction, and 0 indicates that the prediction and target are uncorrelated and is the expected value for a random prediction. Additionally, on balanced data, MCC equates exactly to accuracy. These factors make MCC easier to interpret than some measures such as Cohen’s Kappa [56]. MCC is robust to class imbalance [52, 57] and is often suggested as a default measure for STC problems [52, 57–60]. Brzezinski et al. [54] defined ten desiderata for a single-target measure in the context of imbalanced data. They then evaluated 22 popular measures (including those mentioned above) according to those ten properties. They concluded that there is no single metric suitable for all cases, noting that the choice of measure depends on various requirements which should be explicitly considered. When Brzezinski et al. [54] examined MCC, they found it to possess 7 of the 10 desired properties. One of the three criticisms—not meeting the Asymmetric Class Evaluation (ACE) criterion—relates to treating negative and positive labels equivalently. As explained in Section 2.1.2, this distinction between labels is not universal and that concern does not apply in the context of this work. The ACE criterion is in fact an inversion of one of the selection criteria outlined at the end of this section.

The other two unmet desiderata (FN min and FP min) are two sides of the same coin. The combined requirement is that, when one class is entirely incorrectly predicted (i.e. TP = 0 or TN = 0), the measure must evaluate to its minimum possible value. The authors’ reasoning for isolating this situation is compelling, but I find the resulting requirement to be overly strong. It seems counter-intuitive to treat a learner which fails to predict either class as equivalent to a learner which fails on one class, but correctly predicts the other. I prefer a weaker requirement: in the case of one completely misclassified class, a measure should return a score no greater than that awarded on average to a uniformly random predictor. When TP = 0 ∨ TN = 0, MCC returns a negative value indicating poor performance, but this value varies depending on the performance on the other class. Thus MCC does not satisfy the stronger requirements of Brzezinski et al. [54] but does satisfy the weaker variant.
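As a concrete illustration, the following sketch computes MCC directly from the confusion-matrix counts. It is not the implementation used in this thesis; the handling of degenerate denominators (returning 0 when any marginal is zero, matching the convention of several common libraries) is an assumption made for the example.

```python
import numpy as np

def mcc(y, y_hat):
    """Matthews Correlation Coefficient for two binary vectors.

    Returns 0.0 when the denominator is zero (e.g. the prediction is
    constant); a common convention, though other treatments exist.
    """
    y = np.asarray(y, dtype=bool)
    y_hat = np.asarray(y_hat, dtype=bool)
    tp = np.sum(y & y_hat)
    tn = np.sum(~y & ~y_hat)
    fp = np.sum(~y & y_hat)
    fn = np.sum(y & ~y_hat)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0.0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# A perfect prediction scores 1, an inverted one -1:
y = [1, 1, 0, 0, 1, 0]
print(mcc(y, y))                   # 1.0
print(mcc(y, [0, 0, 1, 1, 0, 1]))  # -1.0
```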

Multi-Target Metrics

In the multi-label literature there is a litany of multi-label-specific performance metrics. Zhang and Zhou [61] categorise these using two dichotomies: example-based vs. target-based6; and classification vs. ranking. As ranking metrics are defined only for learners that output a certainty level rather than a pure classification, I will not cover them. To understand the first dichotomy, consider the target matrix Y. Example-based metrics calculate an error for each example (row) based on the target predictions for that example and then return a central measure of these row-wise measures, typically the mean [61]. Target-based metrics consider the performance for each target (column) across all examples and summarise these, typically as either micro- or macro-averages. The most popular example-based metrics are the Hamming loss, exact-match loss (also called subset accuracy or subset 0/1 loss), and Jaccard similarity [51]. The Hamming loss is equivalent to the mean error rate (1 − accuracy) across targets. The exact-match loss is often reported but known to be overly strict [51, 61] and a poor metric when the average number of “positive” target values per example is high. The Jaccard similarity is contingent on the assumptions of a positive/negative dichotomy as it ignores the true negative rate. All of the target-based metrics given by Zhang and Zhou [61], and in another review by Madjarov et al. [62], are micro- or macro-averages of the single-target metrics mentioned previously. These averages are different approaches to combining the confusion matrices for each target into a single metric. The micro-averaged version of a single-target metric first accumulates the confusion matrices for all targets into a single matrix (true positives are summed across targets, as are false negatives, and so on) and then the single-target metric is calculated on that summary matrix. The macro-average approach first finds the single-target metric value for each individual confusion matrix and reports the mean of those values.
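The example-based metrics just described are straightforward to compute from the target and prediction matrices. The following is a minimal numpy sketch, written purely for illustration; the variable names and the zero-division convention for the Jaccard term are my own assumptions, not taken from the thesis implementation.

```python
import numpy as np

def hamming_loss(Y, Y_hat):
    # Mean per-element error rate across all examples and targets.
    return np.mean(Y != Y_hat)

def exact_match_loss(Y, Y_hat):
    # Fraction of examples for which at least one target is wrong.
    return np.mean(np.any(Y != Y_hat, axis=1))

def jaccard_similarity(Y, Y_hat):
    # Mean per-example |intersection| / |union| of predicted and true
    # "positive" targets; note that true negatives never contribute.
    inter = np.sum(Y & Y_hat, axis=1)
    union = np.sum(Y | Y_hat, axis=1)
    return np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1)))

Y     = np.array([[1, 0, 1], [0, 0, 1]], dtype=bool)
Y_hat = np.array([[1, 1, 1], [0, 0, 1]], dtype=bool)
print(hamming_loss(Y, Y_hat), exact_match_loss(Y, Y_hat), jaccard_similarity(Y, Y_hat))
```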

Why Macro-MCC

While it is common in the literature to present several scores, I do not do so in this thesis for three reasons. Firstly, the practice has been criticised [48, 63]. Secondly, all example-based, and many target-based, scores defined in this section are tailored for sparse instances where there is a meaningful distinction between classes. The subset loss, Jaccard similarity, and F-score have built into them a notion of ignoring the true negative rate, since they assume uni-directional imbalance and the presence of a negative class (the assumptions discussed in Section 2.1.2). Finally, the representation of results in graphical form as done throughout this thesis, particularly with respect to individual targets, becomes untenable when additional metrics are considered. I used the following four criteria for selecting a multi-target metric.

6 The authors use “label-based”.

1. Class imbalance should be accounted for.

2. Class-imbalance in either direction should be treated equivalently.

3. No target should have preference in, or undue influence over, the overall generalisation performance.

4. The value should be interpretable and provide clear delineation between perfect predictors, anti-predictors and random predictors.

The first three criteria are objective, while the fourth is more subjective. The first criterion rules out using accuracy, and the second rules out all of the example-based metrics discussed in this section. Given the third criterion, Gibaja and Ventura [51] suggest macro- rather than micro-averaging, and the fourth suggests that a measure based on accuracy or MCC is preferable to, for instance, Cohen’s Kappa [56]. The candidate deemed best at the outset of this work, under these conditions, was macro-averaged MCC.
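For completeness, a macro-averaged MCC can be sketched as follows: a per-target MCC is computed from each column of the target and prediction matrices and the results are averaged. This is an illustrative sketch only (the per-target zero-denominator convention is my assumption), not the evaluation code used in this thesis.

```python
import numpy as np

def mcc_from_counts(tp, tn, fp, fn):
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0.0 else (tp * tn - fp * fn) / denom

def macro_mcc(Y, Y_hat):
    """Mean of per-target (per-column) MCC values."""
    Y, Y_hat = np.asarray(Y, dtype=bool), np.asarray(Y_hat, dtype=bool)
    scores = []
    for t in range(Y.shape[1]):
        y, p = Y[:, t], Y_hat[:, t]
        scores.append(mcc_from_counts(
            np.sum(y & p), np.sum(~y & ~p),
            np.sum(~y & p), np.sum(y & ~p)))
    return float(np.mean(scores))
```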

2.1.4 Methods for MTC

Here I outline the prevailing methods for multi-output (most commonly multi-label) classification. Approaches to handling multiple targets (classification [61] or regression [7]) are typically placed in one of two7 categories: algorithm adaptation and problem transformation. Both approaches solve MTC problems by applying or extending STC methods. I define both with relevant examples below; a much more comprehensive taxonomy is given by Gibaja and Ventura [51]. In each case I note the applicability of the method to the problem of logic synthesis.

The key philosophy of problem transformation methods is to fit data to algorithm, while the key philosophy of algorithm adaptation methods is to fit algorithm to data. — Zhang and Zhou [61]

7 Some might argue that there are also inherently multi-output methods but, as most of these can be considered a trivial adaptation, I retain the two-category model.

Problem Transformation

Problem transformation approaches take a multi-target problem and produce one or more single-target problems. These can then be solved using an existing single-target classifier, called the base classifier, to jointly solve the parent problem. The choice of base classifier is affected by the nature of the transformation; for example, a large family of methods induce multi-class sub-problems.

The first idea that may spring to mind is to take each target and treat it independently. This is often called the Binary Relevance method in MLC literature [64], although I use the more self-descriptive term IC suggested by Read et al. [50]. The IC approach produces one STC sub-problem per target which in turn can be treated using any base classifier. It has the advantages of simplicity and low computational cost [65], and is embarrassingly parallel. The primary drawback of the IC transformation is that it cannot model, or take advantage of, any inter-target relationships [66]. A smaller flaw is that IC cannot use any form of model compression that may be possible if the targets are learned jointly (e.g. from model overlap). The basic method has been extended with both feature selection [67] and feature generation [68] applied per-target.

The Label Powerset8 (LP) approach transforms the parent problem into one or more multi-class instances. In its basic form the method produces a class for each combination of labels in the training data (using the assumption of a negative target value). Hence the transformed problem is multi-class and the class domain is the power set of the set of labels. One obvious issue is that the number of potential classes is exponential in the number of outputs, thus many LP derivations designed to reduce or prune the label combinations exist. The most widely used LP variants are RAkEL [69] and HOMER [70]. A much bigger problem for LP, in the context of more general MTC, is that many label sets may not be present in training, a problem that is particularly pernicious for small training sets. Unless the base multi-class classifier can predict novel classes, these combinations can never be predicted at inference, severely hampering generalisation. This may be manageable under the low-cardinality assumptions mentioned in Section 2.1.2 but, in the converse case, it is extremely unlikely that any reasonable fraction of possible label combinations will be seen. The inability of LP methods to generalise to unseen patterns [71] is, therefore, a significant drawback. This, and the use of a multi-class base classifier, render LP and related methods unsuitable for logic synthesis applications.

8 Also called Label Combination (LC).

Classifier Chains (CCs) form a problem-transformation method that produces STCs with Boolean targets, much like IC but retaining the potential for transfer between the sub-problems [72]. CCs have proven a useful transformation and remain a popular benchmark. The method itself is a relatively straightforward extension of the IC formulation and introduces inter-target dependencies by using targets as extra input features during training. The final model is constructed by connecting base classifiers, so a simple CC is viable for logic synthesis so long as the base classifier is. Some extensions, such as Probabilistic CCs [73] and Ensembles of CCs [71], however, are not. Canonical CCs are the main topic of Chapter 4 and are defined in much greater detail there. One exciting factor is that the ordering challenges [74] inherent to CCs present an opportunity for curriculum-based analysis.

Ranking transformations, including the family of pairwise methods [51], use single-target methods in a larger ranking problem [61]. The resulting combined classifier, much like an ensemble of classifiers, cannot be transformed into a representation suitable for logic synthesis. Stacking procedures, such as Meta-BR [75], can be used but have been shown to under-perform CCs [49, 71] while requiring approximately twice the training time [71].
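The following model-agnostic sketch contrasts the two transformations most relevant here: IC trains one independent base classifier per target, while a classifier chain appends the values of earlier targets to the inputs of later classifiers. Any object with scikit-learn-style fit/predict methods could serve as the base classifier; the factory-function interface and the use of true target values (rather than predictions) during training are assumptions of this illustration, not a description of the implementations benchmarked later.

```python
import numpy as np

def train_ic(X, Y, make_base):
    # One independent single-target classifier per target column.
    return [make_base().fit(X, Y[:, t]) for t in range(Y.shape[1])]

def predict_ic(models, X):
    return np.column_stack([m.predict(X) for m in models])

def train_chain(X, Y, make_base, order=None):
    # Classifier chain: target t sees the original features plus the
    # true values of all earlier targets in the chosen order.
    order = list(range(Y.shape[1])) if order is None else list(order)
    models = []
    Z = X
    for t in order:
        models.append(make_base().fit(Z, Y[:, t]))
        Z = np.column_stack([Z, Y[:, t]])
    return order, models

def predict_chain(order, models, X):
    # At inference the chain must feed its own predictions forward.
    preds = np.zeros((X.shape[0], len(order)), dtype=int)
    Z = X
    for m, t in zip(models, order):
        preds[:, t] = m.predict(Z)
        Z = np.column_stack([Z, preds[:, t]])
    return preds
```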

Algorithm Adaptation

As the name suggests, algorithm adaptation approaches extend single-target learning methods to handle multiple targets. For the task of logic synthesis, no popular algorithm adaptations I have seen are applicable. This is because if the original classifier does not generate a model which can be transformed into a logic circuit, then neither will the adaptation. I present a brief overview here since some of the learners examined in the following chapters can be considered algorithm adaptations, and parallels can be drawn.

As an off-the-shelf classifier, Decision Trees are popular, so unsurprisingly there are several multi-label extensions. For example, the adaptation by Clare and King [76] of classic C4.5 [77], which allows leaves to contain a set of labels rather than a single label; similarly the random-forest ensemble [62]. While a single-label decision tree can be converted into a circuit, it is unclear how this can be done when disparate subsets of labels exist at any leaf [78]. Random forests only further complicate the issue. None of the adaptation methods listed by Madjarov et al. [62] and later Gibaja and Ventura [51] are applicable. In all cases this is because the method is a modification of an existing learner which itself is unsuitable for generating a single-output circuit. RankSVM [79], for example, adapts the Support Vector Machine which, in its base form, is non-trivial to extract interpretable rules from [80], much less a complete circuit. Others, such as MLkNN [81], are based on lazy learners which classify a sample at inference by vote-counting among the training samples judged to be similar. Similar reasoning rules out Boolean Matrix Decomposition [82], probabilistic models, ensemble methods, and evolved populations of association rules [51]. Additionally, binarising and rule mining from neural networks are unsolved problems forming active areas of research [39, 83–86].

Problem transformations have also been used as a starting point for developing algorithm adaptations such as BRkNN [87]. In that case, the nature of the base classifier allowed significant improvements in efficiency over naively using kNN within an IC framework. The gist of this technique is similar to the formulation of the method in Chapter 5 based on problem transformation results in Chapter 4 (although this realisation was retrospective). The Boolean Network model defined in Section 2.5 fits the adaptation model. As with neural networks, the structural and algorithmic differences are trivial and the key lies in defining an appropriate guiding/loss function. As such, they are as near as possible to an inherently multi-target method.

Summary

Although debated [65], it is widely held that dependencies between targets must be modelled for best performance in MTC [49, 71, 88–91]. All of the approaches mentioned in this section, besides ICs, are born of this need. Leveraging these dependencies is typically viewed as attempting to capture correlations between targets [89]; however, Read and Hollmén [92] suggest that viewing the problem as one of building effective meta-features may be more fruitful. Local (per-sample) variation in target correlations [89] is also suggestive of the value of higher-order relationship models than co-occurrence patterns [88, 93]. There is also discussion on the interplay between base classifier complexity and the need to capture dependencies [65, 94]. Given that many implementations are based on linear base classifiers, Dembczyński et al. [93] suggest that improvements from chaining and stacking could be partly explained by these methods introducing non-linearities which expand the hypothesis space. Read and Hollmén [92] observed that the improvement of CCs over ICs on some problems was much smaller when a random forest base classifier was used compared to logistic regression (a popular base classifier in the literature). It would thus be informative to examine relevant problem transformation methods with highly non-linear base classifiers.

2.2 Curriculum Learning

The extension of human styles of teaching into the machine learning field has shown success in many areas, most notably the practice of teaching from the easy to the hard [9]. Scheduling examples in a curriculum based on some measure of difficulty has been considered in detail, as has, to a lesser extent, scheduling the outputs of a multi-output problem. By no means a recent idea [95], it has risen again to prominence [12, 13, 96], likely due to the increased complexity of problems now being tackled.

2.2.1 Curricula in Human and Animal Learning

When students are first taught the concept of addition, they are not simply handed an eclectic set of many-digit number pairs and their summations. Rather, in addition to other modes of explanation, we provide them a learning curve of carefully chosen examples which increase in difficulty (i.e. number of digits). It has been known for many decades that gradually tailoring a task from simple to complex helps when training animals [8, 97]. This approach and phenomenon, both referred to as shaping, has helped in understanding the human, as well as animal, learning and teaching processes [11]. By studying human infants Smith et al. [10] found that the staging of their own capability generates a natural curriculum due to heavy tailed distributions of examples and staged access to different environments. Khan et al. [9] found that humans naturally follow an easy-to-hard example-based curriculum when teaching a single-target concept to an unknown learner. Given the powerful generalisation capabilities of human and animal learners, and the ubiquity of curriculum-like strategies for learning and teaching, it is unsurprising that similar ideas have repeatedly found their way into different areas of machine learning over the last few decades [12, 98–103]. From standard classroom practices we can glean two key sources of inspiration regarding effective learning and teaching processes of humans: the curriculum of examples, and the curriculum of targets. I discuss the use of both strategies in machine learning below.

2.2.2 Example Curricula

Most uses of a curriculum involve scheduling examples. This involves subjecting the training examples to a ranking that affects the manner in which they are presented to the learner. In the simplest case the training data available to the learner is gradually increased by adding examples according to the ranking. There are of course many extensions and generalisations of this [14, 16, 18, 45]. The idea has existed for some time [95] but most of the current interest can be traced back to the work by Bengio et al. [12]. Kumar et al. [104] divide example scheduling into two main camps: externally-paced and self-paced9. In the former the ranking is determined a priori, while in the latter it is optimised jointly with the model; in a sense, the learner determines its own curriculum.

9 They use the terms Curriculum Learning and Self-Paced Learning.

Jiang et al. [17] demonstrated a combined approach, where constraints on the example ranking are imposed by a fixed curriculum and the ranking is learned jointly with the model but subject to those constraints. Purely externally- or self-paced modes emerge as special cases of this framework. They show that the combined approach outperformed either pure approach on matrix factorization and multimedia event detection problems.

Spitkovsky et al. [16] demonstrated the success of a curriculum of samples of increasing complexity on the problem of determining grammar from free-form sentences. Their proxy for complexity in this case is the length of the example sentence. They also noted the appearance of a “sweet-spot” in sentence length, the inclusion of samples above which reduced performance. The combination of both their curriculum and complexity limitation improved on the state-of-the-art.

The success of difficulty-based example curricula can be linked with results in PAC learning [105]. Certain problem classes that are not PAC-learnable under arbitrary example distributions become so under a distribution which provides simpler10 examples with higher probability [107]. The typical explanation for the success of example curricula in non-convex scenarios is that it imposes a form of transfer learning where the simpler concept is discovered first and subsequently informs the more complex concept; akin to the use of continuation methods in optimisation [12]. This leads neatly into a consideration of the second inspiration from the classroom scenario: target curricula.

2.2.3 Target Curricula

Learning multiple targets concurrently with a shared representation has been shown, theoretically and experimentally, to improve generalisation when there are relationships between the targets. The publication list discussing multi-target and multi-task learning is growing rapidly and, while less common, using a curriculum of targets is also beginning to garner attention. Much of the work presented in this section is for MTL [44, 108, 109], but is still informative in an MTC context. The importance of sharing knowledge between targets increases when there is a disparity in target complexity. Gülçehre and Bengio [20] showed a visual task on which many popular methods—including deep neural nets, Support Vector Machines (SVMs) and decision trees—failed. This task became solvable when a simpler hinting target was provided (in much the same way as introduced by Abu-Mostafa [98]). However, unless relative target difficulties are known (or suspected), we also face the issue of determining an appropriate order during learning.

10 “Simpler” here is defined as having a lower Kolmogorov complexity [106].

Pentina et al. [96] presented a self-paced method using adaptive linear SVMs. The implementation was model-specific, providing information transfer by using the optimal weight vector for the most recently learned task to regularise the next task’s weight vector. They determine the curricula by solving all remaining tasks and selecting the best fitting, resulting in a quadratic increase in the number of learned models over solving tasks individually. While this may be acceptable for a linear model, it would be prohibitive for an FBN or Neural Network (NN).

Li et al. [110] implemented task curricula with linear models by weighting the contribution of each task-example pair to the loss function. They used a regulariser in which task-example pairs with low training error receive larger values for these weights. However, this approach is unnecessary in MTC domains as the training examples are the same for all targets.

Lad et al. [111] present the only model-agnostic approach to determining an optimal order of tasks that I am aware of at the time of writing. They use conditional entropy to produce a pairwise order matrix and solve the resulting NP-hard Linear Ordering Problem (see Ceberio et al. [112] for a definition). Their approach is not applicable to Boolean MLCs since the conditional entropy difference calculation they use reduces to the difference of the entropies of the individual targets; which, for balanced binary targets, rapidly approaches zero as more training samples are given.

The impact of target order is discussed, in the context of CCs, by Read et al. [71]. As mentioned before, CCs use one single-target classifier for each target, trained in an arbitrary order, with the output of all prior classifiers included as input to the next. The authors note that the ordering of targets has a discernible effect on classifier performance, which they address using ensembles of chains, each with a different random order. They observed consistent superiority of the ensemble approach over a single random ordering, which suggests non-trivial variability in the impact of different orderings. Given the large space of permutations, a principled approach to determine appropriate orderings would be valuable.

2.2.4 Measuring and comparing curricula

Within the literature discussing target or task curricula there has been no analysis of the curriculum itself, either in the consistency of curricula discovery or the sensitivity of methods which apply curricula. To do so we require a similarity measure for curricula and, as curricula are total or partial orders, we can borrow from the statistics of permutations and the theory of rankings. For target curricula in an off-line setting we can assume that we are comparing rankings on a fixed set of n_o objects and so, for the most part, I will use the term curriculum to mean a permutation on {1, . . . , n_o}.

Total Orders

Starting with the most basic case: comparing two total orders. In this case the most popular similarity measure is Kendall’s τ given by

τ = (n_c − n_d) / (n_c + n_d),    (2.2)

where n_c is the number of concordant pairs: object pairs which are placed in the same relative order under both orderings, and n_d is the number of discordant pairs: those which are placed in the opposite order. Note that, for two total orders, any pair must be either concordant or discordant and Equation 2.2 can be re-framed as

τ = 1 − 4n_d / (n(n − 1)),

since, in the absence of ties, n_c + n_d = n(n − 1)/2.

Being a correlation, τ takes values in [−1, 1], with 1 indicating the two orderings are identical, 0 indicating they are uncorrelated and −1 that they are inverses. The curriculum generation methods considered in this thesis all produce total orders.
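A direct O(n²) computation of τ for two total orders (given as permutations of the same objects) might look like the following; this is an illustrative sketch, not the comparison code used later in the thesis.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two total orders over the same objects.

    Each order is a sequence of object identifiers from first to last.
    """
    pos_a = {obj: i for i, obj in enumerate(order_a)}
    pos_b = {obj: i for i, obj in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]):
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  #  1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```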

Weak and Partial Orders

An obvious extension is the consideration of rankings with tied elements (weak orders) and those with unranked elements (partial orders). Kendall generalised the definition of τ to allow for ties by ignoring them in nc and nd and renormalising the denominator.

The resulting τb metric has been widely used but has been criticised for its handling of ties [113, 114].

Emond and Mason [114] in particular noted that τb violates the triangle inequality, and results in inconsistencies when used to generate a consensus from several rankings.

They proposed an extended rank correlation τ_x which resolves these issues and demonstrated its equivalence to other measures such as the Kemeny–Snell metric [115]. The explicit assumption underlying τ_x, however, is that two elements being tied under a ranking “is a positive statement of agreement, not a declaration of indifference” [114]. This assumption comes from the domain of human rankings and is not necessarily valid in the case of comparing curricula, where a tie may denote only the inability to distinguish two targets. In analysing curricula, we have two primary purposes for a similarity measure, depending on whether one of the curricula is considered special. The first is determining how closely a curriculum matches some standard reference we may have (such as from domain knowledge) and the second is comparing a set of curricula to determine their level of agreement. Each case differs in how ties are treated.

In the former case, a reference curriculum which marks two targets as tied is not violated by a candidate which does order those targets; such a candidate can be considered concordant. The reverse, a tie present in a candidate curriculum but not the reference, is not in the scope of this thesis, as all considered approaches result in total orderings. Thus the unmodified τ is a suitable measure. Note that only the definition in Equation 2.2 is valid, as the number of concordant and discordant pairs no longer sums to n(n − 1)/2 in cases with ties. In the case of determining agreement between m curricula, the occurrence of ties in one curriculum and not another marks a point of difference and should be measured as such. However, the only case for measuring agreement in this thesis involves groups of total orders. In such a case, Kendall's coefficient of concordance (W) is a suitable measure of the general agreement amongst rankings [116].

Given a set of m rankings of {1, . . . , n_o}, arrange them into an m × n_o ranking matrix

R. The rank sums, r_j, of R give the total rank assigned to each object, with mean r̄ and sum of squared deviations s = Σ_{j=1}^{n_o} (r_j − r̄)². Kendall's W is then given by

W = 12s / (m²(n_o³ − n_o)).    (2.3)
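Equation 2.3 translates directly into code. The sketch below assumes each curriculum is supplied as a rank vector and that all curricula are total orders, matching the setting described above; it is illustrative only.

```python
import numpy as np

def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance for m total rankings.

    rank_matrix: shape (m, n_o); entry [i, j] is the rank that
    ranking i assigns to object j (ranks 1..n_o, no ties).
    """
    R = np.asarray(rank_matrix, dtype=float)
    m, n_o = R.shape
    rank_sums = R.sum(axis=0)               # r_j, one per object
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return 12.0 * s / (m ** 2 * (n_o ** 3 - n_o))

# Identical rankings agree perfectly (W = 1):
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```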

2.2.5 Summary

While there has been some criticism of over-anthropomorphism in ML, there is strong evidence that learning is aided by shaping and curriculum practices. The presence of independent discoveries of curriculum-like tendencies in many biological agents, as well as the successes (and occasional necessity [20, 21]) of curriculum/hinting strategies in synthetic learning, suggest this to be a general phenomenon in learning. It is important to note that the terms Curriculum Learning (CL) and Self-Paced Learning (SPL) used in much of the work referred to in this section relate to specific cases of example curricula in a single-target setting or of tasks in a multi-task context. It is possible to also consider a learner—in an MLC framework—to be following a curriculum of targets (or to be self-paced in the same respect). Throughout the remaining chapters, “curriculum” used alone will refer to a curriculum of targets in an MTC context. There are important gaps in the literature, particularly when it comes to curricula of targets, that this thesis aims to fill. I have not seen work using hierarchical loss-functions to impose a target curriculum. Nor is there any in-depth examination of variation in target curricula, either for the purpose of comparing generated curricula, or assessing the effect of curricula on learning.

2.3 Intrinsic Dimension and Feature Selection

The ideas presented in this thesis are heavily motivated by the concept of Intrinsic Dimension (ID). It is important at the outset to differentiate the notion of the ID of a function from the ID of a set of data-points. This is because the contributions herein stem from considering the former, while contemporary academic discussion centres almost entirely on the latter [117–119]. A simple definition of the Intrinsic Dimension of a dataset (d-ID) is the size of the smallest subspace in which all the data lies [120]. For example, a set of points lying on a plane in 4D space is intrinsically 2-dimensional. The Intrinsic Dimension of a function (f-ID) is the minimum number of variables required to compute that function. For example, f(x_1, x_2, x_3) = x_1² + x_2² is intrinsically 2-dimensional, regardless of the number of input features available. Note: I will use d-ID and f-ID in this section only; throughout the remainder of this thesis “ID” is used to refer to f-ID exclusively.

Estimating the inherent variability of data with d-ID estimation and projection methods can aid in analysing and reducing arbitrary datasets [117]. This in turn can help in single- or multi-target classification by inspiring principled dimensionality reduction techniques [121] or pre-training complexity estimation [118]. In the multi-target domain, though, the possibility for targets to possess different f-IDs opens up a range of options for local feature selection [67, 68, 122] and target curriculum discovery, although there has been no work on the latter.

Any meaningful definition for f-ID must be carefully aware of the domain [123]. It can be tempting to note that x_1 + x_2 is one-dimensional under a simple linear transformation while x_1² + x_2² is not, and to deduce that the former has an invariant ID of 1 and the latter of 2. However, if arbitrary transformations are allowed then any function is intrinsically 1D, simply by using itself as a transformation. Also, if straightforward procedures existed for generating arbitrary transformations under which f-ID could be guaranteed to reduce, then the single-target learning problem would be trivial. Chapter 5 presents one approach for achieving something akin to this (under certain assumptions and without guarantees) in some MTC problems.

The concept of f-ID is closely linked to feature selection; as such, I give a very brief background on the field in the next section. I then define a combinatorial feature-selection problem, the minimum-Feature-Set problem, which features prominently in this thesis.

2.3.1 Feature Selection

Feature selection can be loosely defined as the problem of either ranking or selecting input features considered useful for some purpose. It is a large field of research with applications in machine learning [124] and data science [125]. Feature selection is related to, but should not be confused with, the problem of feature engineering11 [126]. Feature engineering involves manually or automatically computing useful meta-features: novel features produced by transforming and combining raw input features. Feature selection, on the other hand, is concerned with judging the relative utility of features which already exist. Methods for applying feature selection to machine learning can be broadly categorised as filter, wrapper, or embedded methods [127]. Filter methods use a feature selector independently of the learning model prior to training. Wrapper methods tie the two together in a loop; classifier performance is used as a quality measure for feature subsets and the feature selection is treated as a search problem. Embedded methods describe approaches where feature selection is incorporated into the learning algorithm. The uses of feature selection within this thesis fall into either the filter (Chapters 3 and 4) or embedded (Chapters 5 and 6) categories.

2.3.2 The Minimum Feature Set Problem

There are myriad methods for multi-target feature selection [128]. However, the purpose of this work is not feature selection for MTC but instead to examine target-level ID for the purpose of generating target curricula. Thus the domain of interest is single-target feature selection. One problem formulation in particular is useful as the size of its optimal solution forms a lower bound for the ID of a target. This formulation is the Minimum-Feature-Set (minFS) problem, first defined by Davies and Russell [129]. Consider an STC problem with input matrix X and target vector y. Specifically, X is a Boolean n_e × n_i matrix representing n_e examples (rows) each of n_i input features

(columns); y is a Boolean n_e-element target vector. Since the rows of X are not restricted to be distinct, it is possible to construct an instance where no single function f(X) → ŷ exists such that ŷ = y for all given examples12. It is easy to see such a function exists iff the relation X → y is functional (univalent); that is, there are no two identical examples in X for which there are differing values in y.

Now consider a problem instance and a subset of the input features S ⊆ {1, . . . , n_i}. Even if it is possible to construct a non-contradictory function from X to y, it may or may not be possible to do so using only the features indexed by S. In the event it is possible to do so we call S a permissible feature set (or feature set for short). The Minimum-Feature-Set problem then simply asks: what is the smallest permissible feature set? This problem is not at all new to the machine learning community [129] but has received relatively scant attention. The equivalent decision problem is known as the k-Feature-Set (kFS) problem, formally defined below.

11 feature extraction, representation learning, feature learning
12 I point this out as it simplifies the following explanation, but note that none of the learning problems considered display such a pathology in their raw form.

k-Feature-Set

Given: A Boolean n_e × n_i matrix, X, where the rows describe n_e examples and the columns describe n_i input features, as well as a Boolean n_e-element target vector, y, and a positive integer k.

Question: ∃S ⊆ {1, . . . , n_i}, |S| ≤ k, such that ∀i, j ∈ {1, . . . , n_e} where y_i ≠ y_j, ∃l ∈ S such that X_{i,l} ≠ X_{j,l}?

That is: is there a subset, S, of at most k input features, such that no pair of examples which have identical values across all features in S have a different value for the target? The problem is NP-complete [129] and assumed to not be fixed-parameter tractable, when parametrised by k, under current complexity assumptions [130]. Nonetheless, the heuristic solver used throughout this thesis solved the majority of instances in significantly less time than was required for the subsequent training. Note that a solution to this problem gives a lower bound on the intrinsic dimension of any Boolean function which is fully compatible with the given examples. We can use this as a proxy for the relative difficulty of targets on the same training data for two reasons. Firstly, for a given training set, a target with a smaller intrinsic dimension has a larger coverage of the possible sample space. Secondly, the intrinsic dimension limits the size of the smallest circuit that implements a given function.

There are other frameworks for combinatorial feature selection. Charikar et al. [131] present a number of target-agnostic problem formulations that are very similar to minFS. Oddly, they seem to have been unaware of the earlier work of Davies and Russell [129] and suggest something very similar to minFS as future work. Brunato and Battiti [124] propose an exact method for selecting the set of features maximising the mutual information with respect to a target. Variations such as these are beyond the scope of this thesis but worth keeping in mind for future work.
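To make the definitions concrete, the following brute-force sketch checks whether a candidate feature set is permissible (i.e. whether X restricted to those columns still maps functionally onto y) and finds a minimum feature set by enumerating subsets in order of size. This is purely illustrative: it is exponential in n_i and is not the heuristic solver used in the experiments.

```python
from itertools import combinations
import numpy as np

def is_permissible(X, y, S):
    """True iff no two examples agree on all features in S but differ in y."""
    seen = {}
    for row, target in zip(X[:, sorted(S)], y):
        key = tuple(row)
        if key in seen and seen[key] != target:
            return False
        seen[key] = target
    return True

def min_feature_set(X, y):
    """Smallest permissible feature set, by exhaustive enumeration."""
    n_i = X.shape[1]
    for k in range(n_i + 1):
        for S in combinations(range(n_i), k):
            if is_permissible(X, y, S):
                return set(S)
    return None  # unreachable if the relation X -> y is functional

# Tiny example: the full truth table on 3 inputs with target x0 XOR x2,
# so {0, 2} is the unique minimum feature set.
X = np.array([[i >> 2 & 1, i >> 1 & 1, i & 1] for i in range(8)], dtype=bool)
y = X[:, 0] ^ X[:, 2]
print(min_feature_set(X, y))  # {0, 2}
```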

2.3.3 Summary

Functional ID has received comparatively little attention in the machine learning literature. Li et al. [132] used a similar notion, approximating the ID of neural network objective functions in random linear subspaces for model compression and architecture evaluation. No work has yet used ID estimation for curriculum estimation, or used the minFS problem (or feature selection in general) as a framework for estimating ID.

2.4 Logic Synthesis

Logic Synthesis is the problem of automatically producing a digital circuit matching a specification of the desired behaviour [25]. It is a sub-domain of general circuit synthesis which, in turn, is a sub-domain of Electronic Design Automation. In addition to correctness, there are many design objectives for optimisation including area, depth, latency, and power consumption. It is obvious, though, that before focusing on such objectives it is important that any synthesis method be capable of realising a design that is functionally correct. For the remainder of this section I focus on synthesis of combinational logic: logic circuits that lack cycles and are therefore free of state.

Beyond the value added by automated logic synthesis to conventional digital systems, the recent proliferation of novel computing domains also demands fully automatic synthesis methods [23]. Such domains include: quantum computing [133, 134], cell-based computing [135–137], DNA computing [138], new advances in multi-valued logic [139] and reversible computation [140], and even combinations of domains [141]. This is particularly true for domains where traditional design principles may be a poor fit [24, 142, 143].

For the most part, fully automated synthesis methods focus on optimisation rather than learning [28–32]. That is, they use a complete description of the expected behaviour, which suffers from poor scaling, as equivalence checking is coNP-complete13. Some work in this vein nonetheless comes very close to a learning approach, including: synthesis of approximate circuits [43, 144], oracle-guided methods [145], and partitioning the behavioural description (in this case a truth table) to produce sub-modules which are combined into a final solution [146–148], such as via the Boole-Shannon decomposition [149, 150]. The bias toward full simulation in synthesis is understandable: when an entire specification is available, why limit the data and risk incorrect solutions? However, simulation costs increase combinatorially with the number of inputs [29], and there is evidence that the sample-complexity for learning some circuit families may be quite low [151–153]. If a learning algorithm can cheaply produce a solution that is close to correct, it may be used as a seeding mechanism to significantly accelerate existing optimisers. There are also beneficial use cases in the reverse-engineering of both electronic devices [154] and biological systems [155], both of which could obviously be assisted by methods that infer Boolean models that generalise well given a small, incomplete set of training examples [40, 156]. It is unsurprising then that there has been recent interest in using machine learning techniques to improve the EDA process [22, 37, 40].

13 By reduction to tautology.

The most common approach, surrogate modelling, trains indirect models to predict the effect of changes to script settings on desired properties like area, latency, and power consumption [33–35]. The value added by this is the reduction in overhead due to expensive simulation stages [35, 157]. Once trained, the surrogate model should allow fast approximation of the (otherwise costly) simulation with enough accuracy to identify promising areas of a large search space [36]. A surrogate does not, however, produce a candidate circuit. Examples of a machine learning approach for directly learning a circuit as the model do not appear to exist. A recent review by Kahng [22] of applications of machine learning in EDA does not mention the direct learning of circuit models from limited example sets at all. Instead, the case studies therein use surrogate modelling or learn ancillary models for tasks such as early simulation-failure prediction or production risk modelling. A similar survey on reverse-engineering by Rematska and Bourbakis [41] lists a single example [40] using machine learning, which uses decision trees [158] as surrogates within an optimisation loop.

Logic synthesis from small example sets fits a number of niches that have the potential for fruitful research. It is a perfect example of an MTC problem not fitting typical multi-label assumptions (Section 2.1.2). Problem instances are often nested, hierarchical, and complex, motivating the use of target curriculum strategies. There are strong similarities to the use of surrogate models—exact correctness is traded for increased speed—however the difference with a direct approach is that the resulting model can be fed immediately into an optimisation module which refines the design (similar to hybridised techniques for circuit optimisation [159]). Structural learning of Boolean networks, one possible approach, is described in the next section.

2.5 Boolean Networks

Arguably the most direct and unrestricted model for combinational logic is the Feedforward Boolean Network (FBN). An FBN with n_i inputs and n_o outputs computes a function of the form f : B^{n_i} → B^{n_o} by chaining computations from n_g internal nodes, commonly called gates. Internally, an FBN is a DAG where each node has an associated value in B. Nodes take the values of their predecessors as input to a Boolean function which they provide as output to any successors. Input nodes are those with no predecessors and their value is provided from outside the network. The output of the network is simply the values at a particular subset of externally visible output nodes. An FBN can thus compute a function from many inputs to many outputs by combining simpler functional elements. Figure 2.1 shows an example FBN which correctly implements a 6-bit adder.

Figure 2.1: A 56-node FBN which correctly implements the 6-bit addition function. Each node takes 2 inputs and computes the NAND function as its output. Inputs (far left) have been coloured red and outputs (far right) green. Note the ripple-carry style flow of information between sub-networks.

One advantage to working with this representation is that two singleton functionally-complete operators exist: the negated-and (NAND) and negated-or (NOR) functions. Every Boolean mapping can be computed by some degree-2 DAG consisting purely of NAND nodes. The training procedure is then entirely structural, involving searching a space of DAGs. The first suggestion of a model like this for machine learning was in 1948 by Alan Turing [160], and it has been revisited several times since [152, 153, 161, 162].

FBNs have a number of other advantages as learning models. They are efficient in time and memory to simulate as well as being trivial for hardware mapping and logical rule conversion. Logical relationships are simple for humans to extract information from [83], which is beneficial if the goal of the learning procedure is knowledge-generation. Purely Boolean models can use packed integer representations and bitwise arithmetic to improve model evaluation speed during training and deployment over models with continuous internal parameters [86]. There has been recent interest in binarising the internal layers of deep neural nets precisely for these reasons [84, 163–165]. Hubara et al. [164] suggest that large reductions in power consumption can also be expected from the use of purely binary models.

FBNs are trivially multi-output. However, as I show throughout this thesis, the approach to solving the optimisation problem, and to modelling inter-target relationships, can have significant impacts on learning performance. So far no work has considered differences between learning single- and multi-output FBNs. Results in multi-task settings suggest that, with multiple related targets, some networks may be favoured due to target correlation, increasing generalisation performance [44, 109]. In the following section I describe the process used throughout this thesis for training FBNs.

2.5.1 Training Boolean Networks

For a learning application, FBNs contain one input node for each input feature. In general, any node could be considered an output; however, any node that is not a predecessor to an output node is computationally useless from an external viewpoint. Thus a natural choice for output nodes are the final n_o gates. Additionally, any DAG has a topological order of nodes; to ensure a lack of cycles such a topological ordering can be pre-imposed. Parametrisation of the circuit is entirely reflected in the DAG structure and the function computed at each node. There are no weights or biases; all parameters are discrete, as are all values propagated through the network. For training, FBNs can be represented using a 2 × n_g integer matrix, where the i-th column contains the indices of the two sources for the i-th internal node [153, 161]. Input nodes take no source and are not represented in the connection matrix. Networks are constrained to be feedforward by imposing a topological ordering of nodes; each node can only source from nodes before it in the ordering. Since any DAG has at least one topological ordering, no structure is excluded by this constraint.

Since the late publication of Turing's first description of a machine which learns by changing the connections of a NAND-gate circuit [160], the majority of the work on learning with a Boolean Network model has been done in the field of physics. Some time later, Patarnello and Carnevali [161], and subsequently others [153, 162], demonstrated that FBNs could generalise well on some problems, despite training set sizes significantly smaller than the space of all possible patterns. As the name suggests, FBNs compute with purely Boolean logic, rather than using weighted sums and continuous activations. This brings challenges, the most prevalent being the lack of gradient-based optimisation procedures or closed-form solutions. Training an FBN is thus a combinatorial optimisation problem, typically solved using meta-heuristics such as Simulated Annealing (SA) [162] or genetic algorithms [153]. Patarnello and Carnevali [161] optimised networks by SA in the space of feedforward structures. Each move in the search procedure consisted of a random change in a single connection (subject to the feedforward constraint). Moves which also change the node activation have been considered [153, 162]; however, in light of the NAND function's functional completeness, this becomes unnecessary. The optimisation method used for training all FBNs in this thesis is presented in the following section.
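The connection-matrix representation makes evaluation a single forward pass over the gates in topological order. The sketch below evaluates a NAND-only FBN of this form; it is an illustrative re-implementation based on the description above (the array layout and naming are my own assumptions), not the thesis code.

```python
import numpy as np

def evaluate_fbn(connections, X, n_outputs):
    """Evaluate a NAND-only feedforward Boolean network.

    connections: (2, n_g) integer array; column g holds the indices of the
                 two source nodes for gate g. Node indices 0..n_i-1 are the
                 inputs; gate g is node n_i + g. Sources must precede the
                 gate (topological order), so the network is feedforward.
    X:           (n_e, n_i) Boolean input matrix.
    Returns the (n_e, n_outputs) output matrix taken from the final gates.
    """
    n_e, n_i = X.shape
    n_g = connections.shape[1]
    values = np.empty((n_e, n_i + n_g), dtype=bool)
    values[:, :n_i] = X
    for g in range(n_g):
        a, b = connections[:, g]
        values[:, n_i + g] = ~(values[:, a] & values[:, b])   # NAND
    return values[:, -n_outputs:]

# A 4-gate network computing XOR of its two inputs:
conn = np.array([[0, 0, 1, 3],
                 [1, 2, 2, 4]])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=bool)
print(evaluate_fbn(conn, X, 1).astype(int).ravel())  # [0 1 1 0]
```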

2.5.2 Late-Acceptance Hill Climbing

Throughout this thesis, FBNs were trained by stochastic local search in the space of valid feedforward structures, using the Late-Acceptance Hill Climbing (LAHC) meta-heuristic [166] with random restarts. This method assumes a black-box view of the learner and is similar in implementation to SA, while being less sensitive to scaling in the guiding function and requiring only a single meta-parameter in place of a cooling schedule. While the particulars of the optimiser are not relevant to this work, I briefly describe LAHC below and provide pseudo-code for the variant used in Algorithm 2.1.

Algorithm 2.1: Late-Acceptance Hill Climbing (LAHC) [166] with restarts.

Require: a scalar cost function C(), an initialisation method initialise(), and a neighbour generation method neighbour().
Input: cost history length n_history > 0, iteration limit n_iters > 0, restart limit n_restarts ≥ 0.

    r ← 0
    repeat
        s ← initialise()
        for all k ∈ {0, . . . , n_history − 1} do
            Ĉ_k ← C(s)
        i ← 0
        repeat
            s* ← neighbour(s)
            v ← i mod n_history
            if C(s*) < Ĉ_v or C(s*) ≤ C(s) then
                s ← s*
            Ĉ_v ← C(s)
            i ← i + 1
        until i = n_iters or C(s) = 0
        r ← r + 1
    until r > n_restarts or C(s) = 0

LAHC works similarly to typical stochastic hill climbing: a modification is made to the current solution and accepted based on some criterion. Here, this modification involves a random change to the source of one connection in the DAG. A modification is accepted if the resulting network's loss is lower than the cost recorded n_history iterations beforehand, or no worse than that of the current solution. LAHC has a single meta-parameter, the history length n_history, defining the length of a history of costs which provides a dynamic error bound. In the next section I discuss some variations in the typical training and testing process which become important in light of the relatively high variance of training FBNs in this way.
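A compact, model-agnostic rendering of Algorithm 2.1 is given below for readers who prefer code; the restart loop is omitted for brevity and the toy cost function is a placeholder, so this is an illustrative sketch rather than the exact training harness used in later chapters.

```python
import random

def lahc(initialise, neighbour, cost, n_history, n_iters):
    """Late-Acceptance Hill Climbing (single run, no restarts)."""
    s = initialise()
    c = cost(s)
    history = [c] * n_history          # the cost history C-hat
    for i in range(n_iters):
        candidate = neighbour(s)
        c_candidate = cost(candidate)
        v = i % n_history
        # Accept if better than the cost recorded n_history steps ago,
        # or no worse than the current solution.
        if c_candidate < history[v] or c_candidate <= c:
            s, c = candidate, c_candidate
        history[v] = c
        if c == 0:
            break
    return s, c

# Toy usage: minimise the number of 1-bits in a 16-bit string.
def flip_random_bit(bits):
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

best, best_cost = lahc(lambda: [random.randint(0, 1) for _ in range(16)],
                       flip_random_bit, sum, n_history=5, n_iters=2000)
print(best_cost)  # very likely 0
```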

2.5.3 Varying sample size

It is typical in the competition style of publication to evaluate an approach or modification using only a single training set size. This is because, most commonly, public datasets which provide a given training/test split or, lacking that, cross-validation measures are used. Both approaches serve important purposes; however, they do tend to result in studies where no variation in training set size is explored. An important aspect of this work is small-sample-size learning and, as such, variation with sample size is important. I conducted all experiments in this work with a spread of sample sizes. Doing so does not, however, mean that the simplicity of single measures must be sacrificed. Indeed, scalar metrics can be quite necessary in scenarios involving multiple datasets, or where we may be interested in exploring the errors in each target. Plots of a performance metric against training set size can show interesting trends, but a large number of them becomes prohibitive. It is thus valuable in some cases to have a summary statistic for these curves.

The simplest approach is to take the mean of the metric of interest across all samples, ignoring training set size. This, however, has a downside: it is only representative if the training set sizes are evenly spaced. If we have a skewed sampling distribution, then a region which is oversampled is effectively "weighted" higher than another equivalently sized region. A much better approach is to take the average value of the curve using a normalised integral,

f̄ = (1 / (b − a)) ∫_a^b f(x) dx,    (2.4)

calculated with the trapezoidal rule. This is similar to the cumulative measures suggested by [153] but, since the normalised value represents the true mean of the function over some domain, I forgo their terminology. I used a range of training set sizes in all experiments and so I use this summarisation approach whenever any metric is not explicitly presented with respect to training set size.

When many learners have been trained for each sample size we can bootstrap sample curves for estimation. This process involves repeatedly sampling one representative per training-set size to give a sample curve of the metric of interest. From this sample curve the summary statistic is taken via Equation 2.4. These bootstrap samples can also be used to produce confidence intervals. Throughout the thesis this has been done using the Seaborn plotting library for Python [167].
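The normalised-integral summary of Equation 2.4 amounts to a trapezoidal integration divided by the width of the sampled range. A small numpy sketch follows; the bootstrap loop (one randomly chosen run per training-set size) reflects the procedure described above, with the number of resamples and the toy data chosen arbitrarily for the example.

```python
import numpy as np

def curve_summary(sizes, values):
    """Normalised trapezoidal mean of a metric over training-set sizes."""
    sizes = np.asarray(sizes, dtype=float)
    values = np.asarray(values, dtype=float)
    return np.trapz(values, sizes) / (sizes[-1] - sizes[0])

def bootstrap_summaries(sizes, runs_per_size, n_boot=1000, rng=None):
    """Resample one run per size to build a distribution of curve summaries.

    runs_per_size: list (aligned with sizes) of arrays of metric values,
    one entry per trained learner at that training-set size.
    """
    rng = np.random.default_rng(rng)
    out = np.empty(n_boot)
    for b in range(n_boot):
        curve = [rng.choice(runs) for runs in runs_per_size]
        out[b] = curve_summary(sizes, curve)
    return out

sizes = [16, 32, 64, 128]
runs = [np.array([0.60, 0.62]), np.array([0.70, 0.71]),
        np.array([0.80, 0.82]), np.array([0.90, 0.88])]
boots = bootstrap_summaries(sizes, runs, n_boot=200)
print(boots.mean(), np.percentile(boots, [2.5, 97.5]))
```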

2.6 Conclusion

This chapter presented the foundational definitions and concepts for the work in the remaining chapters. First I situated this work in the wider field of Multi-Target Classification. The existing successes of curriculum methods in general within machine learning (Section 2.2) motivated further study of target curricula, which have been comparatively understudied. In Section 2.3.3 I noted that ID, estimated by combinatorial feature selection, is a promising, unexamined possibility for target curricula. This hypothesis forms the backbone of the thesis.

I identified several relevant gaps in the literature throughout the chapter. In Section 2.4 I noted the lack of direct machine learning implementations for logic synthesis. This, along with the biases noted in Section 2.1.2, motivated its use as an evaluation domain. Additionally, the infrequency with which curricula themselves are compared (Section 2.2.4) motivated several experimental design choices. The next chapter explores the use of hierarchical loss functions for imposing a curriculum of targets and presents the first method for building such a curriculum.

Chapter 3

Target Curricula and Hierarchical Loss Functions

Guiding function design is a key aspect of any optimisation endeavour [168]. Whether combining multiple disjoint objectives into a single quality measure, or tuning the fitness landscape to achieve some form of regularisation, practitioners often use carefully crafted guiding functions. By obvious extension, the design and selection of appropriate loss functions is critical for successful learning.

The first half of this chapter examines the use of hierarchical loss functions for enforcing a curriculum of targets in Boolean MTCs. I present problems for which a difficulty-based curriculum (easy-to-hard) is suspected and define three loss functions which impose a curriculum on the learner. I then present studies showing that these losses, under the suspected easy-to-hard curricula, produce better generalisation than the mean L1 loss, and that the easy-to-hard curriculum is optimal in that respect.

The latter half of the chapter presents a novel approach to discovering an easy-to-hard curriculum from the training examples themselves. Where available, domain knowledge is likely the ideal basis for constructing curricula, but in cases where it is unavailable or difficult to obtain, an estimation procedure is desirable. The presented approach makes use of the minFS problem defined in Section 2.3.2 and is intended to enable the use of curriculum-enforcing loss functions when domain knowledge is absent. Additional experiments are then described in which this approach generated curricula that were highly correlated with the known easy-to-hard curricula, and in which the bulk of the generalisation improvement remained when the known curricula were replaced with discovered curricula. Finally, this combined approach is shown to improve generalisation on several real-world problems for which no curriculum is immediately obvious.

3.1 Can Guiding Functions Enforce a Curriculum?

To simplify the following loss function definitions I define the ne × no error matrix, E, which fully characterises the mistakes made by a candidate network on a set of ne example patterns with no targets. E is the element-wise absolute difference between the network output matrix Y′ and the target matrix Y, with elements given by

$$ E_{i,j} = \left| Y'_{i,j} - Y_{i,j} \right| . $$

Thus, Ei,j ∈ B is the error for the j-th target on the i-th example. Even for problems with multiple outputs, the only loss functions used for training FBNs to date [153, 161, 169] have been analogues of the L1-loss:¹

$$ L_1 = \frac{1}{n_o\,|I|} \sum_{j=1}^{n_o} \sum_{i \in I} E_{i,j} , \qquad (3.1) $$

where I is the set of example indices shown to the network. This loss treats all examples and targets equally and is the natural first choice for a guiding function, but for multi-target problems there are other options, which I explore in Section 3.1.1.

Here I examine the initial potential for target curricula. I present three loss functions which are designed to take a total order of targets, each applying a different method of guiding the learner to follow that curriculum during training. I then define test-bed problems with such an ordering and show that these losses variably improve the generalisation performance of FBNs on these problems.

3.1.1 Hierarchical Loss Functions

This section defines three loss functions, used throughout the remainder of the chapter. Each loss imposes a given curriculum on the training process in a slightly different manner.

Assume there exists a two-target Boolean MTC for which we have three candidate learners. Further assume that the first target (y1) is "easier" under some definition than the second (y2), and that the two targets are related by some common substructure. The first learner labels y1 correctly and y2 incorrectly for all examples. The second learner does the opposite, and the third learner labels 50% of the examples incorrectly for both targets. All three networks will have an equal L1 loss (Equation 3.1). The rationale behind the loss functions defined in this section is that the first of the above learners should be preferred as a stepping-stone state. By focusing on easier targets earlier in the training process, the network is more likely to find any substructures useful to more difficult targets. The expectation is faster convergence, better generalisation, or both.

¹ With purely binary predictors and targets the L1- and L2-loss are identical.

The L1-loss (Equation 3.1) is simply the element-wise mean of the error matrix E. I examined three additional losses, which hierarchically aggregate the error matrix elements according to a given curriculum of targets, in progressively more strict ways. The candidates were:

• a linearly weighted mean: Lw,

• a locally hierarchical mean: Llh, and

• a globally hierarchical mean: Lgh.

The definitions given below assume the targets are pre-ordered by the given curriculum, with the target to be learned first at the lowest index.

Weighted loss: Lw

The first loss, Lw, encourages the curriculum by weighting the easier targets more highly. Thus a learner which improves on an earlier target by some error, ε, is chosen over a learner which achieves the same improvement on a higher-order target. The weights grow linearly and the resulting summation is normalised to the unit interval:

$$ L_w = \frac{2}{n_e\, n_o (n_o + 1)} \sum_{i=1}^{n_e} \sum_{j=1}^{n_o} (n_o - j + 1)\, E_{i,j} . $$

Locally hierarchical loss: Llh

The second function, Llh, is named the locally hierarchical loss, as it enforces a hierarchy of targets on a per-example basis. The learner is rewarded for correctly labelling target j of example i if and only if it has also correctly labelled all prior targets {1, 2, ..., j − 1} for that example. Given the following recurrence, defined for each example i,

$$ a_{i,1} = E_{i,1}, \qquad a_{i,k+1} = \begin{cases} E_{i,k+1} & \text{if } a_{i,k} = 0 \\ 1 & \text{if } a_{i,k} = 1, \end{cases} $$

we can define Llh as

$$ L_{lh} = \frac{1}{n_e n_o} \sum_{i=1}^{n_e} \sum_{k=1}^{n_o} a_{i,k} . $$

Globally hierarchical loss: Lgh

The final loss, Lgh, follows the same principle as the former but across all examples. For this reason it is named the globally hierarchical loss. In this case the learner is rewarded for correctly labelling target j (of any example) if and only if it has also correctly labelled all prior targets {1, 2, . . . , j − 1} for all examples. This function is effectively a "soft" equivalent to learning the targets rigidly in order. We can define Lgh using the mean per-target errors (the column means of E), given by

$$ \delta_j = \frac{1}{n_e} \sum_{i=1}^{n_e} E_{i,j} . $$

These can be used to define another recurrence,

$$ b_1 = \delta_1, \qquad b_{k+1} = \begin{cases} \delta_{k+1} & \text{if } b_k = 0 \\ 1 & \text{if } b_k > 0, \end{cases} $$

giving

$$ L_{gh} = \frac{1}{n_o} \sum_{j=1}^{n_o} b_j . $$
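For concreteness, the four guiding functions can be written in a few lines of NumPy. The sketch below assumes E is a binary error matrix of shape (ne, no) whose columns are already ordered by the curriculum; it is an illustrative re-implementation of the definitions above rather than the experimental code.

```python
import numpy as np

def l1(E):
    # Element-wise mean of the error matrix (Equation 3.1).
    return E.mean()

def l_weighted(E):
    # Linearly decreasing weights over curriculum-ordered targets, normalised to [0, 1].
    n_e, n_o = E.shape
    w = np.arange(n_o, 0, -1)                      # n_o, n_o - 1, ..., 1
    return 2.0 * (E * w).sum() / (n_e * n_o * (n_o + 1))

def l_local_hier(E):
    # Per-example hierarchy: an error on target k forces errors on all later targets.
    A = np.maximum.accumulate(E, axis=1)           # the a_{i,k} recurrence, vectorised
    return A.mean()

def l_global_hier(E):
    # Global hierarchy over mean per-target errors (column means of E).
    delta = E.mean(axis=0)
    b = delta.copy()
    b[1:][np.maximum.accumulate(delta[:-1]) > 0] = 1.0
    return b.mean()
```

The cumulative-maximum trick in l_local_hier works because E is binary: once an error occurs on an example, all later targets of that example are counted as errors, exactly as the recurrence specifies.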

Examples

To give better intuition of how these losses impose a curriculum, seven small examples are given in Table 3.1. All examples are specifically chosen to have the same L1 loss, and for each I show the error matrix and the associated value for each loss. The purpose is not to contrast the loss function values on any particular row but instead to give some intuition into the different scenarios the various losses prefer or treat equally. Consider the first row in Table 3.1, where the learner has mastered a single target on all examples but fails otherwise, and the third row, where the learner has conversely succeeded on all targets for a single example. L1 and Llh treat these cases as equivalent whereas Lw and Lgh prefer the former case.

If some difficulty-based (or otherwise defined) curriculum is known, or suspected, from domain-specific knowledge then these losses can be used in conjunction with a derivative-free optimiser (such as those described in Section 2.5.1). In other cases, a method for discovering a difficulty-based curriculum from the training data is also required. However, before designing such a method (Section 3.3), it was prudent to first examine the effectiveness of these losses on problems for which there already existed a candidate ordering (Section 3.1). It was also prudent to test the assumption that easy-to-hard is indeed a useful ordering (Section 3.2). The following section defines the problem instances used for these studies.

[Table 3.1 data omitted: the example 3 × 3 error matrices, their equivalents under the two hierarchical losses, and the corresponding (unnormalised) L1, Lw, Llh and Lgh values.]

Table 3.1: Example error matrices and the associated values for L1, Lw, Llh, and Lgh (ignoring a normalisation constant for readability). Note that the purpose is not to directly compare the different losses on a particular matrix, but instead to compare the pairwise disparities in the same loss on different error matrices. For example, the first and second matrices are equivalent under L1 but the first is preferred by all other losses. Rows of E are examples and columns are targets (ordered by the curriculum being enforced). Also included are equivalents of E under the two hierarchical losses: the recurrences defining these losses can be thought of as defining a transformation on E under which the respective loss is equivalent to L1.

3.1.2 Cases with Suspected Curricula

A foundational hypothesis for this thesis was that, for problems with an "arithmetic-like" sequence of targets, there should be some benefit to training in order of significance. To that end, I first examined a set of MTC problems with natural arithmetic target sequences. For this first test-bed I chose four circuit-inference problems known to be challenging. These consisted of binary addition (add) and cascaded variants of three benchmark problems: multiplexer (cmux), majority (cmaj) and parity (cpar). These were chosen as they all possess a natural hierarchy of targets and the base problems possess a number of attributes rendering them difficult to learn [170–172]. Binary addition is a well-known multi-target problem. For the three remaining problems, multi-target versions can be defined by cascading single-target sub-problems, each successive target being the result of applying the original function over a larger set of inputs. By this method I constructed problems for which there is a known order of difficulty.

Binary Addition

Binary addition takes two n-bit inputs, x and w, and computes an n-bit output y.² The i-th output can be expressed as yi = xi ⊕ wi ⊕ ci−1, where ci represents an intermediate carry value given by ci = (xi ∧ wi) ∨ (ci−1 ∧ (xi ⊕ wi)). For simplicity, I do not model a "carry-in", which is equivalent to the initial condition c0 = 0. Thus, an n-bit addition results in a problem with ni = 2n and no = n.

The strength of the binary addition problem is the ability to produce increasingly large instances with a known increase in complexity. The minimum depth of any accurate realisation increases with output dimension. These properties make the adder a useful test case for comparing scaling trends [161], as we will see through this and subsequent chapters.

² Ignoring the initial carry-in and final carry-out.
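As an aside, the full example pool for the n-bit adder (all operand pairs, matching the pool sizes in Table 3.2) can be enumerated directly; the helper below is hypothetical and intended only to illustrate how the problem family is constructed.

```python
import numpy as np

def binary_addition_dataset(n_bits):
    """All 2^(2n) operand pairs for the n-bit adder MTC: the inputs are the
    2n operand bits and the targets are the n sum bits (no carry-in, final
    carry-out ignored). Targets are ordered least-significant bit first."""
    a = np.arange(2 ** n_bits)
    x, w = np.meshgrid(a, a, indexing="ij")
    x, w = x.ravel(), w.ravel()
    s = (x + w) % (2 ** n_bits)                        # drop the final carry-out
    to_bits = lambda v: ((v[:, None] >> np.arange(n_bits)) & 1).astype(np.uint8)
    X = np.hstack([to_bits(x), to_bits(w)])            # ni = 2n input features
    Y = to_bits(s)                                     # no = n targets
    return X, Y
```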

Cascaded Multiplexer

The multiplexer is a generalised switch which takes n data inputs, d, and ⌈log2(n)⌉ select inputs, s, with output equal to the value of the i-th data input, where i is the integer value given by the select inputs. The output, y, of the cascaded variant is defined by

y1 = (d1 ∧ ¬s1) ∨ (d2 ∧ s1)

and

yi = (yi−1 ∧ ¬si) ∨ (di+1 ∧ si) .

For any given n the resulting problem has dimensions ni = 2n − 1 and no = n − 1. It should not be confused with the layered multiplexer problem commonly mentioned in the learning classifier system literature [170].
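A possible enumeration of the cascaded multiplexer problem is sketched below. It assumes the recurrence above, in particular that yi selects between the previous output and the next unused data line via si, so it should be read as an illustration of the construction rather than the exact benchmark generator.

```python
import numpy as np

def cascaded_multiplexer_dataset(n_data):
    """Cascaded multiplexer MTC with n data lines and n-1 select lines
    (ni = 2n - 1 inputs, no = n - 1 targets)."""
    n_sel = n_data - 1
    n_in = n_data + n_sel
    patterns = np.arange(2 ** n_in)[:, None]
    X = ((patterns >> np.arange(n_in)) & 1).astype(np.uint8)
    d, s = X[:, :n_data], X[:, n_data:]
    y = (d[:, 0] & (1 - s[:, 0])) | (d[:, 1] & s[:, 0])    # y1
    targets = [y]
    for i in range(1, n_sel):
        y = (y & (1 - s[:, i])) | (d[:, i + 1] & s[:, i])  # yi: previous output or d(i+1)
        targets.append(y)
    return X, np.stack(targets, axis=1)
```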

Cascaded Majority

The majority function (sometimes called count-ones [170]) takes an n-bit input, x, and gives a single output which takes the value of the majority of the input bits. The output, y, of the cascaded variant is given by

$$ y_i = \begin{cases} 0, & \text{if } \sum_{j=1}^{2i-1} x_j < i \\ 1, & \text{otherwise,} \end{cases} $$

resulting in a problem with dimensions ni = n and no = ⌈n/2⌉.

Cascaded Parity

The final problem, parity, is likely the most well known. It takes the value 1 when the number of 1s in the input is odd and 0 otherwise. The cascaded version presents ni outputs given by

$$ y_i = \left( \sum_{j=1}^{i} x_j \right) \bmod 2 . $$

Parity and parity-like functions have been a staple challenge for logic synthesis and machine learning alike for decades.

The cascaded parity and multiplexer possess the strongest dependencies between successive targets, as each subsequent output can be computed directly from the previous output and an additional subset of inputs not already used. The binary addition and random cascaded problems have weaker dependencies, requiring previously used inputs in addition to the previous output and new inputs to compute the next output. This selection of problems allowed me to examine target curricula with varying levels of inter-target dependency over a range of problem sizes.
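The remaining cascaded benchmarks can be enumerated in the same fashion. The sketch below covers cascaded parity and cascaded majority, with the majority threshold taken as in the reconstruction above (the first 2i − 1 bits, requiring at least i ones); it is illustrative rather than the thesis code.

```python
import numpy as np

def _all_patterns(n_bits):
    """All 2^n input patterns as a (2^n, n) binary matrix."""
    return ((np.arange(2 ** n_bits)[:, None] >> np.arange(n_bits)) & 1).astype(np.uint8)

def cascaded_parity_dataset(n_bits):
    """Target i is the parity of the first i input bits."""
    X = _all_patterns(n_bits)
    Y = np.cumsum(X, axis=1) % 2            # y_i = (x_1 + ... + x_i) mod 2
    return X, Y

def cascaded_majority_dataset(n_bits):
    """Target i is the majority of the first 2i - 1 input bits (odd prefixes only)."""
    X = _all_patterns(n_bits)
    prefixes = np.arange(1, n_bits + 1, 2)                 # 1, 3, 5, ...
    sums = np.cumsum(X, axis=1)[:, prefixes - 1]
    Y = (sums >= (prefixes + 1) // 2).astype(np.uint8)     # at least i ones among 2i - 1 bits
    return X, Y
```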

Instance Sizes

In pilot experiments FBNs proved to be high-variance learners. As such, I chose problem sizes such that a meaningful effect size could be seen, but also so that 100–1000 networks per loss/sample-size combination converged in a reasonable time. The dimensions selected, and the ranges in training set and sample pool size, are given in Table 3.2.

We know a possible curriculum of targets for the above problems since they are constructed by cascading well-known circuits. I eventually assessed this ground truth with ablation studies in Section 3.2, but first I tested whether the guiding losses affect generalisation performance at all (Section 3.1.4).

3.1.3 Training

FBNs were trained as described in Section 2.5.1 to zero training error with the respective loss function. The underlying optimiser, LAHC (Algorithm 2.1), takes a single hyperparameter, nhistory. In preliminary tests, test error was minimally affected over a wide range of this value, so I selected the value which provided the best average convergence speed when using the L1-loss. The resulting value was 250 for all problems, except parity, for which it was 1000.

The gate budget, ng, also had little effect on generalisation performance in preliminary tests. This is a known phenomenon: Patarnello and Carnevali [161] showed, theoretically and empirically, that past a threshold size ng has little effect on generalisation capacity; a fact which has been repeatedly confirmed [153, 162]. The effect on the number of iterations required for convergence was similar but with a different threshold. Since larger networks do not diminish performance (they simply increase training time linearly and result in many unconnected gates), I set ng to 50 × no for all problems in this chapter.

3.1.4 Experimental Results

Several problems with obvious candidate easy-to-hard curricula have been described. Three losses, designed to prioritise the optimisation of targets along some given curriculum, have also been presented. This section presents experiments examining the effect of applying these guiding functions, with this naive curriculum, on training speed and generalisation. As mentioned previously, this combination of model and training process results in a high-variance learner. For each of a series of training set sizes (ranges in Table 3.2), and for each loss, I trained FBNs on 1000 random training sets (except for the 8-bit adder, for which only 200 repeats were feasible). As such, each point shown in Figure 3.1, and in similar plots, represents the mean performance over a large collection of different training sets of that size. Also displayed are 95% confidence intervals as transparent bands where relevant, found by bootstrap sampling as described in Section 2.5.3. For Figure 3.1, the CI was too small for bars or bands to be visible.

Problem                 Inputs  Targets  Training set size  Example pool size
Binary Addition              8        4           [8, 250]                256
Binary Addition             10        5           [8, 256]              1,024
Binary Addition             12        6           [8, 256]              4,096
Binary Addition             14        7           [8, 256]             16,384
Binary Addition             16        8           [8, 256]             65,536
Cascaded Parity              7        7           [8, 112]                128
Cascaded Majority            9        5           [8, 384]                512
Cascaded Multiplexer        15        7           [8, 180]             32,768

Table 3.2: Instance sizes of initial test-bed problems.


Figure 3.1: Generalisation performance of each loss on the 7-bit cascaded parity problem. The top plot shows the mean macro-averaged test-set MCC, averaged across all targets, for each loss including L1. The remaining plots show the mean difference in test MCC, on each target, between L1 (dotted baseline) and each of Lw, Llh and Lgh. The first two targets are not shown as they show only a negligible improvement. A leftward shift in the transition (top) from poor to near-perfect generalisation can be seen. This became more pronounced as the loss more strictly enforced the learning order (Lw → Llh → Lgh), with Lgh achieving a 16% reduction in sample-complexity for 0.8 MCC. The differences in test MCC for individual targets (bottom) show an increase in improvement with target difficulty, as expected, reaching a large peak improvement, from 0.17 to 0.70 MCC, for the sixth target. On this particular problem the difference between Llh and Lw was not statistically significant. For all loss functions the out-of-sample performance was significantly better than that obtained with L1 (dotted line).

The out-of-sample macro-MCC, as it varies with sample size for the 7-bit cascaded parity problem, is shown in the top of Figure 3.1. A phase transition between poor and near-perfect generalisation was expected under L1 [161]. What the top plot in Figure 3.1 shows is a leftward shift in this transition as we impose an increasingly strict order on the targets (L1 → Lw → Llh → Lgh). That is, fewer examples were required to achieve the same generalisation when the curriculum was imposed.

Even for a single problem instance, I did not expect effects to be consistent across all targets. In fact, the earlier reasoning would suggest that any improvement should increase for successive targets. In the bottom of Figure 3.1 I have also shown the MCC improvement given for each target, by each loss. I omitted the first two targets: there was little to no improvement on them as they do not require earlier "helper" sub-circuits (implemented within the computation of prior outputs) to operate. Beyond this, we see the improvement increased with each successive target until the last. This peaked at the second-last target with an increase in MCC from 0.17 to 0.70 using Lgh (equivalent to a bump in test accuracy from 58% to 85%).


Figure 3.2: Mean δ-MCC for each of Lw, Llh and Lgh versus training set size for addition problems of increasing output dimension. The 95% confidence interval of the mean is shown with transparent bands; they are wider for the final adder due to the tenfold reduction in repetitions on that instance. In all cases there was a statistically significant net increase in generalisation performance and, importantly, this improvement increased with output dimension (note that the y-axes are not shared). We can also see that the improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

For the remaining instances I compared overall generalisation effects using the difference in macro-averaged MCC w.r.t. L1 (hereinafter δ-MCC). This is plotted against training set size, along with 95% CIs, in Figures 3.2 and 3.3. We can see that the improvement differed significantly between problems but was consistently positive and statistically significant across a range of training set sizes for all problem instances.
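For reference, the macro-averaged MCC and the δ-MCC comparison used throughout this section can be computed as in the sketch below; returning 0 when the denominator vanishes is a convention assumed for the example.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for one binary target."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else float(tp * tn - fp * fn) / denom

def macro_mcc(Y_true, Y_pred):
    """MCC computed per target (column) and then averaged."""
    return np.mean([mcc(Y_true[:, j], Y_pred[:, j]) for j in range(Y_true.shape[1])])

# delta-MCC of a curriculum-trained network relative to the L1 baseline:
# delta = macro_mcc(Y_test, preds_curriculum) - macro_mcc(Y_test, preds_l1)
```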

More importantly, the loss which most strictly enforces the learning order, Lgh, also conferred the largest improvement, and vice versa for the loss which least strictly enforces it, Lw. This phenomenon was consistent across all problems and supports the hypothesis that guiding functions can adequately impose a curriculum, and that difficulty-based curricula are beneficial.


Figure 3.3: Mean δ-MCC for each of Lw, Llh and Lgh versus training set size for the cascaded majority, multiplexer and parity problems. The 95% confidence interval of the mean is shown with transparent bands. On all problems we can see a statistically significant net increase in generalisation performance—though the magnitude varied widely between problems. We can also see that this improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

To explore the variation in performance across targets without going into the same depth as Figure 3.1, I employed the function averaging approach mentioned in Section 2.5.3 to produce MCC values averaged across a range of sample sizes. The distribution of these average MCC or δ-MCC values can then be viewed more concisely across multiple targets. Figures 3.4 and 3.5 show the per-target generalisation improvement (δ-MCC) grouped by guiding function. The targets are ordered by the assumed easy-to-hard curriculum and bootstrapped confidence intervals are shown.

Figure 3.4 highlights several interesting behaviours. Firstly, as problem dimension increased, the effect of applying the curriculum increased in a predominantly positive manner, as we would expect from Figure 3.2. Each loss also behaved quite differently. Lw grew, plateaued and then dipped, with the size of the "hump" increasing with problem dimension. Llh tracked with Lw initially but continued its growth after the other dipped.

Lgh outpaced the others significantly but plummeted at the final target.

Figure 3.5 shows the variation in improvement versus target index on the three cascaded benchmarks: majority, multiplexer and parity. The general trends were similar, but there was variation between datasets in the more fine-grained behaviour. For example, the multiplexer problem displayed similar drop-off behaviour for Lw and Lgh, but the shape of this behaviour was different for each loss on the remaining problems.


Figure 3.4: Per-target generalisation performance on addition problems of increasing size. Averaged δ-MCC for Lw, Llh and Lgh are plotted against target index, ordered by the easy-to-hard curriculum, for the 4- to 8-bit adders. The transparent bands are bootstrapped 95% CIs. Several interesting trends are apparent. Firstly, the effect of applying the curriculum increased with problem size in a consistently positive manner. Secondly, each loss behaved differently across targets: Lw plateaued then dipped, Llh followed but continued growing after the dip in Lw, and Lgh outpaced the others by a significant margin but plummeted at the final targets. Note that the turning point in Lgh occurred later and later as problem dimension increased.


Figure 3.5: Per-target generalisation performance on the cascaded benchmark problems. Averaged δ-MCC for Lw, Llh and Lgh are plotted against target index, ordered by the easy-to-hard curriculum, for the 9-bit cascaded majority, 15-bit cascaded multiplexer and 7-bit cascaded parity problems. The transparent bands are bootstrapped 95% CIs. Similar overall trends exist as in Figure 3.4; however, the exact profile of the effect of each guiding function was problem dependent.

It was clear at this stage that, in terms of overall generalisation, a difficulty-based curriculum of targets offered a significant advantage on some Boolean MTCs. With this in mind I used the most effective loss, Lgh, in ablative studies to determine how it behaves with different curricula and whether some other curriculum is more suitable than easy-to-hard. This study is presented in the following section.

3.2 Appraising “Easy-to-Hard”: Ablation Studies

This section addresses the question “Is the easy-to-hard curriculum optimal?”. Ablation studies are presented, where the curricula used were randomised, and organised by correlation with the easy-to-hard curriculum used in Section 3.1. The results presented herein show that, as the curriculum used diverged from the easy-to-hard curriculum, the generalisation improvement diminished. This performance drop continued to the extent that the inverse curriculum (hard-to-easy) actually worsened generalisation, in some cases by as much as the easy-to-hard curriculum improved it.

For each problem in Section 3.1, I trained networks using Lgh with several random curricula and using L1. The first purpose was to test whether the easy-to-hard curriculum (least-to-most-significant digit) is outperformed by some other ordering. The second purpose was to see how the performance of Lgh varies with the correlation between the given curriculum and the easy-to-hard curriculum, measured by the rank correlation, τ, between the two. Since these are permutations there are no ties to consider and the simplest version of Kendall's τ (Equation 2.2) is suitable.

I did not sample curricula uniformly, since the rank correlation, τ, is heavily skewed over the space of permutations. The values ±1, for instance, are only found for 2 of the combinatorially many possibilities. For a length-n permutation there are n(n − 1)/2 + 1 unique values of τ, so I instead selected curricula to be uniformly distributed over τ. I sampled 25 curricula with replacement for each value of τ and trained networks on 10 random training sets for each curriculum and training set size. The baseline L1 takes no curriculum, so the corresponding data were taken from the experiments in Section 3.1.
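For clarity, the tie-free form of Kendall's τ between two target orderings can be computed as below. The final lines merely illustrate how random permutations cluster around τ = 0; they are not the stratified sampling procedure itself.

```python
import itertools
import numpy as np

def kendall_tau(order_a, order_b):
    """Kendall's tau between two permutations of the same targets (no ties)."""
    pos_a = {t: i for i, t in enumerate(order_a)}
    pos_b = {t: i for i, t in enumerate(order_b)}
    pairs = list(itertools.combinations(order_a, 2))
    signed = sum(
        1 if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) > 0 else -1
        for u, v in pairs
    )
    return signed / len(pairs)              # (concordant - discordant) / total pairs

rng = np.random.default_rng(0)
easy_to_hard = list(range(5))
taus = [kendall_tau(easy_to_hard, list(rng.permutation(5))) for _ in range(1000)]
print(sorted(set(round(t, 3) for t in taus)))   # at most 11 distinct values for n = 5
```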


Figure 3.6: Mean difference in test-set macro-MCC between Lgh and L1 over various training set sizes for the 5-bit addition problem with randomised target orderings. Each line represents the mean change in MCC for each possible level of correlation between the randomly given order and the known order. For an n-element permutation there are n(n − 1)/2 + 1 possible values of τ, hence 11 lines. The central permutations offered little benefit, and since random permutations are skewed heavily toward τ = 0 we can infer that a randomly generated ordering would provide scant improvement on average, for this problem. The exact inverse (τ = −1) of the suggested curriculum impeded learning performance by as much as the correct order improved it, and the exact easy-to-hard ordering (τ = 1) conferred the largest benefit.

For each value of τ and training set size, I took the mean test-set macro-averaged δ-MCC across the 250 networks trained. As an example, I have shown this value against training set size in Figure 3.6 for the 5-bit addition problem. Each line shows the mean δ-MCC as it varies with sample size for networks trained with Lgh, for all curricula with that correlation value. One clear observation is that permutations which positively correlate (blue) with the easy-to-hard ordering improved generalisation, while permutations which negatively correlate (red) actually hurt performance. The second key observation is that this improvement varies with the strength of the correlation; and, most importantly, the largest improvement was given by the exact easy-to-hard curriculum (τ = +1).

Figure 3.6 is primarily illustrative. In order to succinctly show the relationship among all benchmarks I again used the function averaging approach of Section 2.5.3. This gave 250 values of mean MCC which I plotted against τ in Figure 3.7, along with the mean value obtained with no curriculum using L1 (dashed lines). Thus each coloured line in Figure 3.6 corresponds to a single point in the upper-centre plot of Figure 3.7.

The results in Figure 3.7 were very promising. In all initial benchmarks there was a clear positive correlation between how closely a given curriculum matches easy-to-hard and how well an FBN trained using that curriculum generalised. Notably, the best performance was obtained for easy-to-hard, and this performance was consistently better than when no curriculum was used (which stands to reason given the results of the previous section).

The addition problems in Figure 3.7 (bottom two rows) again show positive trends with output dimension. In all cases curricula at τ = 0 outperformed the baseline, meaning that even an arbitrary curriculum might be slightly beneficial. This is intriguing and in agreement with positive results obtained using Classifier Chains even with arbitrary orderings [71]. This would indicate that the beneficial transfer resulting from some targets being in the correct order outweighed the negative transfer resulting from a roughly equal number (on average) in the wrong order. Furthermore, the proportion of τ for which performance exceeded L1 increased as the output dimension increased, indicating positive scaling effects.

These results confirm the intuition that an easy-to-hard curriculum is optimal, at least for this method of enforcing it. Thus, for the remainder of this chapter, I refer to the easy-to-hard curriculum as π∗. With this cemented, the next step was to design an algorithm capable of inferring π∗ from the training data. Section 3.3 defines such an approach and evaluates it in terms of the agreement of the resulting curricula with π∗, as well as the generalisation gains seen when using those curricula with Lgh.

3.3 Discovering Curricula

Now that we trust the easy-to-hard curriculum, π∗, on those problems, does there exist a principled method for discovering it, or well-correlated substitutes, from the training data alone? The ability to do so would greatly improve the utility of target curriculum methods by enabling them beyond where domain knowledge is available. In this section I describe an approach stemming from a priori estimation of the ID of a Boolean function. I then present experimental results showing that this approach yields curricula which closely resemble π∗, and which preserve the bulk of the generalisation improvements seen in Section 3.1. Finally, I demonstrate the application of this approach on real-world problems (for which there are no ground-truth curricula) to positive effect.


Figure 3.7: Mean MCC plotted against τ for all test-beds with known easy-to-hard curricula. The mean MCC obtained with no curriculum using L1 is shown with a dashed line. In all cases there is a clear positive correlation between τ and MCC. Notably, the best performance occurred at τ = 1 and this performance was consistently better than the baseline. For all addition problems (bottom two rows), curricula at τ = 0 exceeded the baseline, meaning that even an arbitrary curriculum might be slightly beneficial. Additionally, the proportion of τ for which performance exceeds L1 increased as the output dimension increases. Some of these trends are problem specific. In the case of cascaded majority, multiplexer and parity (top row), the τ ≈ 0 region lies below baseline performance, unlike with addition. Nonetheless the overall positive correlation between τ and MCC was consistent.

3.3.1 Target Complexity and Minimum Feature Sets

Building an easy-to-hard curriculum of targets depends on estimating their relative difficulty. This chapter examines one maxim: order by increasing ID. I present an intuitive approach to estimate the ID by reduction to a combinatorial problem which, importantly, did not significantly increase the training time for the instances considered. In Section 3.3.2 I show that the curricula estimated by this method give generalisation improvements close to those obtained using the known optimal ordering (Section 3.1).

There are two root assumptions behind ID-based curricula. The first is that a target can be expected to be simpler than another if its ID (the number of input variables on which it truly depends) is smaller. This assumption is strong but intuitive, and there is a solid basis for it as a rule of thumb in the Boolean domain, since the ID of a function places an upper bound on the size of the smallest FBN which implements it. Furthermore, related measures are promising for estimating the complexity of arbitrary data [117] and as such I expect the validity of this idea to extend beyond the Boolean domain. The second assumption is that there are some correlations between targets which can be exploited. As noted in Chapter 2, this is a central assumption of almost all MTC methods.

Under the above maxim and assumptions, producing a curriculum reduces to estimating ID. The lower bound for the ID of a target, given training data, is the size of the minimum set of input features on which the target remains valid (where no pattern is mapped to multiple values). This is exactly the Minimum-Feature-Set (minFS) problem defined in Section 2.3.2. Given a procedure for solving minFS instances, my proposed method for ordering targets reduces to solving the instance induced by each target. Each instance is solved to optimality, resulting in a set of features Sj for each target. An estimate of the ID of any target yj is then |Sj|; this is also a lower bound (assuming Sj is optimal). The targets are then sorted by this estimate, and the resulting order is used as a curriculum by the hierarchical losses from Section 3.1.1. In all experiments in this chapter, any ties³ were broken randomly with uniform probability. Pseudocode is given in Algorithm 3.1 for completeness; note that it corresponds exactly to a simple argument sort of the indices using the feature set sizes as a key.

³ There are often pairs of targets with equal ID. The losses could be redefined to take a partial order and treat a subset of targets equally to avoid tie-breaking; however, I did not explore this possibility.

Algorithm 3.1: The a priori ID-based curriculum generation process. Given a training dataset (X, Y) it produces a list of target indices, tj, defining the order in which targets should be learned. Note that line 5 selects from tied elements with uniform likelihood.
Input: X, Y
Output: t1, . . . , tno

1  remaining := {1, . . . , no}
2  for j ∈ 1, . . . , no do
3      Fj := minfs(X, yj)
4  for j ∈ 1, . . . , no do
5      tj := argmin_{l ∈ remaining} |Fl|
6      remaining := remaining − {tj}
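A compact Python rendering of Algorithm 3.1 is given below. The minfs argument stands in for whichever exact (or, for larger instances, heuristic) minimum-feature-set solver is available, and ties in feature-set size are broken uniformly at random as noted above; this is a sketch, not the thesis implementation.

```python
import random

def id_curriculum(X, Y, minfs):
    """Order target indices by estimated ID (|minimum feature set|), easiest first."""
    n_o = Y.shape[1]
    fs_sizes = [len(minfs(X, Y[:, j])) for j in range(n_o)]
    # sort by feature-set size; the random secondary key breaks ties uniformly
    return sorted(range(n_o), key=lambda j: (fs_sizes[j], random.random()))
```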

3.3.2 Experiments and Results

We know the difficulty curricula for the problems in Section 3.1.2, and we have seen in Section 3.2 that this curriculum is optimal. Thus we can quantify the effectiveness of the proposed target ordering method by using the rank correlation (Equation 2.2) between the known and estimated orderings. I used the same experimental configuration as in previous sections save that, for each training instance, I shuffled the targets using a permutation chosen uniformly at random, to reduce the effect of any deterministic biases. The easy-to-hard curriculum, π∗, is then the inverse of this permutation and the rank correlations I report are with respect to it. Note that this is essentially the same as the rank correlations reported in Section 3.2.

For each resulting configuration I derived a curriculum, πid, with the minFS-based method, computed the rank correlation between πid and π∗, and then trained an FBN using Lgh under πid. I considered only Lgh as it produced the largest overall improvement in Section 3.1. For adders of increasing size, Figure 3.8 shows τ between the known and automatically discovered target orders (first row), as well as the generalisation improvement imparted by Lgh when given each ordering (second row), as they vary with training set size. The same results for cascaded parity, multiplexer, and majority are given in Figure 3.9.

We see that π∗ is not as easily detected in all problems. On parity and majority, τ quickly reaches 1.0, indicating that the proposed method is successful in discovering the correct curriculum. For the multiplexer there is an imperfect, but strong, correlation (τ > 0.7) between the expected and known target orders. Even with a weakly correlated target order, there was still some improvement gained from enforcing that order, reflecting the results we saw in Section 3.2.

With the addition problems we observe conflicting scaling trends. On one hand, the quality of the curricula obtained appears to diminish with output dimension, accompanied by an increase in the disparity between the performance under πid and π∗. On the other hand, the net improvement under πid still increases overall with output dimension.

Consistently we see that using πid still provides a significant net improvement over training with no curriculum. With zero prior knowledge of the respective target difficulties, the minFS-based method, combined with a slight change in the loss function, yields significant improvements in generalisation. This is promising given that we are considering a learner which was already leveraging the expected benefits associated with a shared internal representation [44].

3.4 Real-World Problems

This section consolidates the work in this chapter by evaluating the use of hierarchical losses and generated curricula on real-world models. These include a commercial integrated circuit, two biological network models, and two small-sample biological time-series datasets.

3.4.1 ALU and Biological Models

Here I describe three models derived from real-world domains: a commercial integrated circuit, the 74181 series 4-bit Arithmetic Logic Unit (ALU); a Boolean model of the Fanconi Anemia/Breast Cancer (FA/BRCA) pathway; and a Boolean model of the mammalian cell cycle.

The 74181 ALU has 14 inputs and 8 outputs. It performs 32 different arithmetic and logical operations on an 8-bit input with carry-in, generating a 4-bit output with carry-out as well as 2 other carry-related outputs useful for faster calculations when chaining multiple ALUs. The 74181 represents a reverse engineering problem of a realistic size that is less idealised than the previous problems, as there is no expected absolute ordering to the targets.

The biological models I considered were Boolean models of the Fanconi Anemia/Breast Cancer (FA/BRCA) pathway and the mammalian cell cycle [173]. Boolean regulatory network models like these describe the time-evolution of gene activity, levels of regulatory products and protein activity with a single Boolean variable per element. The system dynamics are then described by Boolean update functions paired with an update method (synchronous, partly asynchronous or fully asynchronous). I chose synchronous models as this enabled the inference of a single deterministic update equation by treating successive time steps as input-output pairs. It is this update rule that I aimed to model. The full mammalian cell-cycle and FA/BRCA pathway models are given by Poret and Boissel [173]. The mammalian model consists of 10 nodes, and thus 10 inputs and 10 targets; however, in the published model the "Cyclin D" node does not update and is thus a constant input, so I removed it as a target. The FA/BRCA model is much larger, consisting of 28 nodes, and thus gives 28 input and 28 target features.

For these three problems I ran experiments similar to those in Section 3.1, with the exception of the ALU, for which I only sampled 100 training sets per configuration as the training times were substantially larger. Similarly, due to the large memory requirement of generating all 2^28 patterns for the FA/BRCA model, I used a reduced example pool of 2^10 patterns sampled uniformly at random. The problem dimensions, training set sizes and example pool sizes are given in Table 3.3. There is no suspected optimal target curriculum for these problems, so the ordering used for all ordered losses was found using the minFS-based ID estimation described in Section 3.3.1.

Figure 3.8: Performance of ID-derived curricula on addition problems of increasing output dimension. The top row shows the curriculum quality measured by the rank correlation, τ, between the detected curriculum (πid) and π∗. The bottom row shows the mean macro-averaged δ-MCC for FBNs trained using Lgh with the ID-derived (πid) and easy-to-hard (π∗) curricula. In all cases τ exceeded 0.50 before the respective peaks in test error difference, yet it fell away as output dimension increased. This was accompanied by an increase in the disparity of δ-MCC performance between πid and π∗. Despite this trend, the bottom row still shows a net improvement in generalisation using πid, which increased with output dimension.


Figure 3.9: Performance of ID-derived curricula on the cascaded majority, multiplexer and parity problems. The top row shows the curriculum quality measured by the rank correlation, τ, between the detected curriculum and π∗. The bottom row shows the mean macro-averaged δ-MCC for FBNs trained using Lgh with the optimal curriculum (π∗) and the automatically detected curriculum (πid). For all problems τ exceeded 0.70 before the respective peaks in test error difference. For these same problems the detected target curriculum provided nearly the same benefit as the optimal, which is no surprise given the strong correlation between the two. The problem with the largest deviation, CMUX, is also the one for which τ was weakest. Note that the x-axes are only shared within columns, and the y-axes for the bottom row are not shared, since the improvement range varied significantly between problems.

3.4.2 Inferring Regulatory Network Dynamics From Time-Series

In many of the experiments in this chapter the primary performance gain was seen at smaller sample sizes. For this reason I selected an additional real-world test-bed where a key issue is the extremely limited amount of available data: inferring Boolean Gene Regulatory Network (GRN) update models from time-series gene expression data. Barman and Kwon [174] provide binarised time-series data for an E. coli GRN and a cell-cycle regulatory network of fission yeast, each consisting of 10 nodes and only a handful of examples. The desired result is a synchronous update model such as those described in Section 3.4.1. One approach to building an update model is to treat each sequential pair of states as an input-output pair, thus deriving an MTC in which the goal is to predict the next state from the current state.

The E. coli dataset contains binarised time-step values for 10 nodes, labelled G1 to G10. The yeast dataset also contains binarised time-step values for 10 nodes: start, SK, Cdc2/Cdc13, Ste9, Rum1, Slp1, Cdc2/Cdc13*, Wee1/Mik1, Cdc25, PP, and Phase.

Some preprocessing of the data was required for feasibility of training. First, I removed repeat patterns, as their inclusion violates the requirements for a valid minFS instance (these repeats occur due to time-delay disparities which synchronous updates cannot accurately model). Second, I removed constant-valued targets, as they represent nodes for which there is no data suggesting a relationship to any other node. One such target ("start") was removed from the yeast dataset, and two ("G7" and "G10") from the E. coli dataset. Finally, as the training sets were extremely small, exploring variation in training set size as elsewhere in this chapter was infeasible. Instead, I used leave-one-out cross-validation over all examples whose removal did not introduce additional constant targets (the first was the only such example in both cases). Each fold involved training networks on 1000 randomly sampled training sets for each method, to account for the significant variance typical in such heavily under-determined problems.

Pilot tests, which included all targets of the yeast and E. coli datasets as a single problem, showed negligible generalisation difference between losses. Preliminary feature-set analysis suggested that there were distinct, shallow hierarchies among the targets (see Figure 3.10), rather than the deeper hierarchies seen in prior problems. A lack of improvement was unsurprising in this case, since imposing an order upon unrelated targets is not expected to offer any benefit and may even hinder performance. Due to the poor initial performance, and indications that multiple shallow hierarchies were present rather than a single deep hierarchy, I opted to train networks only for the nodes involved in those hierarchies (2 for yeast and 1 for E. coli). For fair comparison these were compared with the baseline L1 network on both target sets: only those in the suspected hierarchy, and all targets. Results are only shown for the former, however, since the results of learning a single overall network using L1 were notably worse (possibly due to negative transfer). Again, ID-derived curricula were used for all ordered losses (Section 3.3.1).


Figure 3.10: Possible target hierarchies found using minFS in the yeast and E. coli datasets. In the E. coli data (a) we see the target G6 (with feature set {G4, G7}) may benefit from being learned after G2 or G8 (both have the feature set {G7}). In the yeast data (b) we see two potential hierarchies: either of Ste9 or Rum1 (both with feature set {start, PP}) may benefit from being learned after SK, and either of Cdc2/Cdc13 or Cdc2/Cdc13* (with feature sets {Ste9, Slp1} and {Slp1, Wee1/Mik1} respectively) may benefit from first learning PP.
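A sketch of the time-series-to-MTC conversion and preprocessing described above is given below. The policy of keeping only the first occurrence of a repeated input pattern is an assumption made for the example; the thesis does not specify exactly which repeats were removed.

```python
import numpy as np

def timeseries_to_mtc(states, node_names):
    """Turn a binarised time series (one row per time step) into a next-state
    prediction MTC: inputs are states[t], targets are states[t + 1].
    Repeated input patterns and constant-valued targets are dropped."""
    states = np.asarray(states, dtype=np.uint8)
    X, Y = states[:-1], states[1:]
    # drop repeated input patterns (keep the first occurrence of each)
    _, first = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first)
    X, Y = X[keep], Y[keep]
    # drop constant-valued targets (no evidence of any relationship)
    varying = [j for j in range(Y.shape[1]) if np.unique(Y[:, j]).size > 1]
    return X, Y[:, varying], [node_names[j] for j in varying]
```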

Problem                 Inputs  Targets  Training set size  Example pool size
74181 ALU                   14        8          [64, 384]             16,384
FA/BRCA pathway             28       28            [8, 96]              1,024
Mammalian cell cycle        10        9           [8, 180]              1,024
Yeast                       10        9                  8                  9
E. Coli                     10        8                  4                  5

Table 3.3: Instance sizes of the real-world test-beds.

3.4.3 Results

This section contains results for the experiments described above. There are again no suspected curricula, so the results for all ordered losses were obtained using ID-derived curricula (Section 3.3.1), and evaluating curriculum quality as in Section 3.3.2 was not possible. For the ALU, mammalian cell-cycle, and FA/BRCA pathway models I used a similar regime to all prior experiments and so I present the results in the same manner.


Figure 3.11: Mean difference in test-set macro-MCC between L1 (dotted baseline) and each of Lw, Llh and Lgh on each training sample for the 74181, mammalian cell-cycle and FA/BRCA pathway models. The 95% confidence interval of the mean is shown with transparent bands. On all problems we see a net increase in MCC—again varying widely between problems. Except in the case of the mammalian cell-cycle model for which the losses are not clearly distinguishable, this improvement increased as the loss more strictly enforced the ordering (Lw → Llh → Lgh).

Figure 3.11 shows the mean macro-δ-MCC for each hierarchical loss function as sample size is varied. Of the three, the ALU model showed by far the largest improvement for the most aggressive loss, Lgh, yet the other losses cannot fairly be said to have helped, due to the wide confidence intervals. In the mammalian cell-cycle model there was much less differentiation between the hierarchical losses, and Lw may have given the most benefit. Overall, in all three cases, imposing the ID-curricula clearly improved generalisation.

For the time-series datasets, the leave-one-out testing regime did not involve varying the sample size, so I have instead presented the results in Tables 3.4 and 3.5. I report accuracy rather than MCC as there was no possibility of imbalance in the test regime; MCC in this case is equivalent to renormalised accuracy, and accuracy is more widely interpretable. Again we see that imposing a target curriculum improved the generalisation performance of some targets later in the suspected order, even for shallow hierarchies with extremely limited sample sizes. In Table 3.5 we see an improvement in the target G6, and in Table 3.4 we see improvement for both subsequent targets (Cdc2/Cdc13 and Cdc2/Cdc13*) in one of the two estimated hierarchies (see Figure 3.10).

Loss   SK    Ste9   Rum1   PP    Cdc2/Cdc13   Cdc2/Cdc13*
Lw     0.4   0.5    0.4    1.6   1.2          0.6
Llh    0.6   0.5    0.4    0.8   1.0          0.1
Lgh    0.9   0.5    0.3    0.5   1.3          1.4

Table 3.4: Results for the yeast dataset. See Figure 3.10 for the estimated hierarchies. Note that these results are mean difference in test set accuracy, reported as a percentage, and not MCC. The use of curricula on the SK→ {Ste9, Rum1} hierarchy yielded negligible improvement, however we see more promise in the PP→ {Cdc2/Cdc13, Cdc2/Cdc13*} hierarchy, particularly for Lgh.

Loss   G2    G8     G6
Lw     0.1   0.2    0.8
Llh    0.7   -0.9   1.3
Lgh    0.1   0.1    1.4

Table 3.5: Results for the E. coli dataset. Note that these results are mean difference in test set accuracy, reported as a percentage, and not MCC. See Figure 3.10 for the estimated hierarchies. The use of curricula on the {G2, G8}→G6 hierarchy has given some improvement, however for Llh we see a drop in performance on one of the base targets in the hierarchy. For Lgh the results remained positive overall.

3.5 Discussion and Conclusion

In this chapter I have shown that hierarchical losses which impose an easy-to-hard target curriculum improved generalisation performance of FBNs on a series of cascading multi-target circuit-inference problems. In particular, this improvement increased as the curriculum was more strictly enforced and as the output dimension of the problems increased. Note that the losses considered do not alter the set of global optima nor do they restrict the hypothesis space, H, (the space of networks). Instead, their use modifies the effective sampling distribution over H. This alteration of the probability of discovery is achieved in a natural way using concepts inspired by human teaching and learning practices. The set of circuit-inference test-beds described in Section 3.1.2 all possess a common element: intermediate values which are useful for computing the m + 1th output are computed for the mth output. A human designer with access to a complete m-bit circuit would have a much easier time building an (m + 1)-bit circuit. It is natural to assume that this ease should extend—in terms of convergence speed, generalisation, or both—to a optimiser which builds such circuits in the correct order. The results above displayed the latter but not the former. The performance improvements observed, occurred primarily around the existent generalisation phase transition noted by Carnevali and Patarnello [151] (see the top CHAPTER 3. TARGET CURRICULA AND HIERARCHICAL LOSS FUNCTIONS 57 of Figure 3.1). This is intuitive: there are fundamental limits with respect to training- set size that cannot be alleviated, and there are also sizes for which there is enough information present that a target curriculum confers no generalisation advantage. What these results show is that—in the interim region between too little and plenty of training information—teaching targets in order of difficulty can make better use of what data is present. This improvement is expected in light of the positive effects of hierarchical cost functions on optimiser performance [175]. Next we saw the importance of the curricula itself in Section 3.2. In all cases, the more correlated a random curriculum was to π∗, the greater the generalisation benefit ∗ imparted by imposing it with Lgh. Those curricula which were anti-correlated with π hurt generalisation in many cases, and curricula which were uncorrelated (essentially random) generated mixed results. The relationship was also continuous, lacking any phase transition behaviour, a good sign for methods which may approximate the ideal curricula. For majority, multiplexer and parity, benefit was only achieved by very highly corre- lated curricula. For the bulk of the correlation range using curricula actually degraded performance. Given that the peak in the distribution of permutations over τ occurs at zero, most randomly generated curricula would be unhelpful for these problems. This highlights the importance of principled methods for determining target curricula. In the case of the addition test-beds I was also able to examine how these effects varied with output dimension. It appears that addition is more naturally suited to curriculum training as even curricula poorly correlated with π∗ still provided some benefit. Further still, as the number of outputs was increased, the location of the point of equivalence with L1 moved leftward and, for the largest case, all curricula outperformed

Nonetheless, generalisation still tracked positively with τ in all cases, suggesting value in the search for useful curricula.

In Section 3.3.1 I described an approach to automatically discover appropriate target curricula for Boolean MTCs based on the intuitive notion of ID as a proxy for complexity. For the test-beds already examined, much of the generalisation improvement conferred by the curriculum remained, even when no information on the true target order was provided to the learner. The curricula returned also closely resembled π∗ in general.

I further examined the same loss functions with ID-curricula on three models of real-world phenomena, as well as on two regulatory network time-series data sets, in Section 3.4. On all three models I observed test MCC increases similar to those seen on the circuit-inference test-beds. In the case of the yeast and E. coli data sets I discovered shallow target hierarchies, indicating lower expected improvements (Figure 3.1 shows generalisation gains are focused primarily on later targets). Nonetheless the results, while weaker, were still in line with those observed for the previous problems. This is promising, as I have not explicitly accounted for noise in either the curriculum detection method or the design of the loss functions.

3.5.1 Issues and Limitations

Not discussed here is the effect that the use of these guiding functions has on the average convergence time. I observed an inconsistent effect on training speed and, for some problems, there was a cost when using target curricula. Another possible impact on training speed is the requirement for the target ordering method to solve an NP-hard optimisation problem. However, for all problems considered, the target order detection process had almost no impact on overall training time. For significantly larger problem instances, though, heuristic solvers would likely be required.

Noise has not been considered as a factor. It is uncertain whether the minFS-based methods for estimating target overlap and target schedules (or FBNs themselves) are robust to incorrect target labels. Future work should address this shortfall, perhaps by allowing the per-target error constraints in Lgh to be non-zero. The drop in performance on the final target should also be addressed, possibly by using stratified networks, or annealed depth constraints, to prevent later gates being consumed by earlier targets.

I noted in Section 3.1.4 that the improvement on the final target from Lgh actually decreased on some problems (Figures 3.4 and 3.5 and the bottom of Figure 3.1). On some problems the improvement merely fell away from its initial growth relative to the other losses, while on others the final target actually performed worse than without a curriculum. Initially, this appeared to indicate a point of diminishing returns or a flaw in the reasoning behind the validity of the easy-to-hard curriculum. However, the consistent improvement of results seen when increasing output dimension, and the optimal performance of π∗ in the ablative studies, belie this. Alternatively, I noticed that deviations in πid from π∗ typically occurred at the latter targets; given the more aggressive nature of Lgh, perhaps these contributed. This, however, seems unlikely given the consistent improvement of Llh. It is possible that this is a target-curricula analogue to the observation by Zaremba and Sutskever [176] that stochastic, rather than deterministic, example-curricula are advised in certain circumstances. Another, mundane but more likely, explanation is the presence of a single gate budget coupled with a lack of constraints on which gates each target can source from.

The loss function in question, Lgh, essentially forces optimisation of each target in succession. As the process continues, it becomes increasingly likely that any given target will have a majority of its computational nodes closer to the end of the ordering of gates. Subsequent sub-networks will thus require later and later nodes to be available in order to correctly discover the hierarchical structure of the problem, which becomes an issue toward the end when the gate budget is exhausted. This saturation might be alleviated by reserving nodes for each target, or by periodically shuffling unused gates to the end of the allotment. The uncertainty in the cause of this phenomenon warranted further investigation.

3.5.2 Future Work and Direction

While these results are only for the case of Boolean MTCs learned with FBNs, I expect they may extend to other non-convex learners, such as deep neural networks, for problems with categorical or Boolean domains. Future work could also examine functional ID estimation methods for real-valued domains, to determine if the approach presented here is more generally applicable; or could consider differentiable analogues of Llh and Lgh.

Consideration of other feature selection approaches may also be fruitful. At this stage the ideal direction appeared to lie in addressing the issues in the final target with Lgh. This, however, would require steps not involving the loss function or curriculum design, such as 'percolating' unused gates or annealing cost histories. An alternative direction presented itself when considering CCs: ID-curricula may be applicable to them, and there would be no issue of gate saturation therein. This approach allowed me to study the gate saturation hypothesis in a more optimiser-agnostic sense while also examining the efficacy of easy-to-hard curricula in another setting.

Chapter 4

Application of ID-curricula to Classifier Chains

This chapter focuses on a recent, popular approach in MLC: Classifier Chains (CCs). In the last chapter, I partially attributed anomalous behaviour in the final target to a clash between a monolithic model with a fixed gate budget and a semi-staged training process. The CC approach (mentioned in Section 2.1.4) similarly involves a staged training process and a target ordering, but not a monolithic model. As such, I used it as a vehicle for testing this prior claim and also for examining the use of ID-curricula in a popular MTC framework. The successes in Chapter 3 motivated the extension of a priori curriculum generation to address the chain ordering problem in CCs. Rather than relying on the popular mitigation techniques, such as ensembling, a difficulty curriculum can be employed as the chain order. In this chapter, I define CCs using FBNs and detail experiments evaluating the effect of ordering by an easy-to-hard curriculum, as well as a CC-specific extension to the a priori curriculum generation procedure from Section 3.3.

4.1 Introduction

A Classifier Chain (CC) is a problem-transformation method for MTC problems. It allows the use of single-target learners on several modified STC problems, from which it constructs a solution to the parent problem. A CC, however, requires an ordering of the targets, which is known to impact the method's success; hence the potential for applying curriculum methods. In the following sections I define CCs, discuss the existing work on chain ordering, and present an extension to the ID-based a priori curriculum discovery method from Section 3.3 inspired by, and tailored to, CCs.


4.1.1 Classifier Chains

A Classifier Chain (CC) is a problem-transformation method for reducing an MTC into several binary STCs, while retaining some level of potential transfer between the sub-problems (see Section 2.1.4 for introductory detail). It works by treating targets as extra inputs during training. This allows targets to inform each other during the training phase, which in turn enables a combined hypothesis in the final model without requiring modification of the base (single-target) learning algorithm. The immediate issue is that the target values are unavailable during inference. The CC approach addresses this by using the predicted targets in place of these extra inputs at inference, resulting in the titular chain of classifiers. This means that outputs cannot be mutually dependent, since the predicted value of one is needed to predict the other (chicken and egg). Thus the approach assumes a partial or total ordering of the targets. The original algorithm by Read et al. [71] takes a total order of the targets (discussion of how such an order might be generated follows in Section 4.1.2) and builds n_o single-target binary classifiers c_1, ..., c_{n_o}, each using an augmented input space. Each classifier's augmented input matrix, X'_j, consists of the original input features and the target features of all prior classifiers in the chain. That is:

X'_1 = X
X'_j = [ X  y_1 ⋯ y_{j−1} ]   ∀ 1 < j ≤ n_o,

where the block matrix [ A b ] denotes the horizontal concatenation of matrix A and column vector b. Each classifier, c_j, is trained on (X'_j, y_j). At inference, where Y does not exist, the classifiers are instead provided the predicted values, ŷ_1, ..., ŷ_{j−1}, resulting in an inference chain. The training and inference procedures as originally proposed by Read et al. [72] are given in Algorithms 4.1 and 4.2.

Algorithm 4.1: The original CC training process of Read et al. [72]. Note: columns of Y are assumed to already be sorted by the desired chain order.
Input: X, Y
Output: c_1, ..., c_{n_o}
1  X'_1 := X
2  train c_1 : X'_1 → y_1
3  for j ∈ 2, ..., n_o do
4      X'_j := [ X  y_1 ⋯ y_{j−1} ]
5      train c_j : X'_j → y_j

Algorithm 4.2: The original CC inference process of Read et al. [72]. Note: Ŷ is produced according to the ordering used in Algorithm 4.1.
Input: X, c_1, ..., c_{n_o}
Output: Ŷ
1  ŷ_1 := c_1(X)
2  for j ∈ 2, ..., n_o do
3      X'_j := [ X  ŷ_1 ⋯ ŷ_{j−1} ]
4      ŷ_j := c_j(X'_j)
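To make Algorithms 4.1 and 4.2 concrete, the following is a minimal Python sketch of chain training and inference using NumPy and a generic single-target base learner exposing fit/predict. The `base_learner_factory` callable is an illustrative assumption (not part of the original method), and, as in Algorithm 4.1, the columns of Y are assumed to already be in the desired chain order; this is not the thesis implementation.

```python
import numpy as np

def train_cc(X, Y, base_learner_factory):
    """Train a classifier chain (Algorithm 4.1 style).

    X : (n_examples, n_features) array.
    Y : (n_examples, n_targets) array, columns already in chain order.
    base_learner_factory : zero-argument callable returning an untrained
                           single-target classifier with fit/predict.
    """
    chain = []
    for j in range(Y.shape[1]):
        # Augment the inputs with all earlier targets in the chain.
        X_aug = np.hstack([X, Y[:, :j]])
        clf = base_learner_factory()
        clf.fit(X_aug, Y[:, j])
        chain.append(clf)
    return chain

def predict_cc(X, chain):
    """Inference (Algorithm 4.2 style): predicted targets stand in
    for the true target values of earlier links in the chain."""
    preds = []
    for clf in chain:
        X_aug = np.hstack([X] + [p.reshape(-1, 1) for p in preds])
        preds.append(clf.predict(X_aug))
    return np.column_stack(preds)
```

As a hypothetical usage, `train_cc(X, Y[:, order], DecisionTreeClassifier)` with a scikit-learn tree (imported separately) would train one chain under a given `order`, and `predict_cc(X_test, chain)` reproduces the cascaded inference of Algorithm 4.2.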

The CC approach has become a popular method and benchmark since its first description: one in twenty Scopus records since 2012 with the keyword "multi-label classification" also bear the keyword "classifier chain". It has been used as the basis for parametrising a family of algorithms [177], and re-analysed in a deep-learning feature-engineering framework [92]. Many variants have been proposed; broadly speaking these generalise the chain structure, alter the inference process [50, 73], or treat the target ordering issue. It is widely used and recommended as a state-of-the-art benchmark method for MLC [62].

Structural generalisations have been proposed, replacing the chain structure with a collection of sub-chains [178], a chain of blocks [177], and DAGs [179, 180]. These all effectively restrict inter-target dependencies by shortening the chain in different ways. Chen et al. [178] cluster the target vectors by their Mutual Information (MI) and then apply CC only within clusters. Burkhardt and Kramer [177] apply almost the inverse approach, whereby targets are grouped into blocks in an overall chain and can only use targets in earlier blocks as augmented inputs. There is no particular method to this grouping, so the difference from a single chain is that links between some targets within a certain distance of each other are removed. Read et al. [180] later proposed a classifier trellis: a DAG structure with pre-imposed regularities, into which targets are greedily placed so as to maximise the MI between targets and their parents. Lee et al. [179] similarly build a DAG structure, this time minimising the Conditional Entropy (CE) between targets and their parents. Unlike the trellis, they do not fix a super-structure in advance and instead co-minimise the complexity of the DAG structure, measured by description length. Some fixation of pairwise order is gained by using CE which, unlike MI, is not symmetric. However, since CE(a|b) = H(a) − MI(a, b), ordering by CE only differs from ordering by MI in separating ties using the individual entropy of the targets, which is equivalent to pre-ordering by class imbalance and then ordering by MI.

The value of the CC approach herein is that it allows generation of a final multi-output FBN. This is not afforded, for instance, by the Label Powerset and related methods, such as RAkEL [69], which transform the problem into one or more multi-class problems.

It is also impossible with lazy-inference multi-target learners such as ML-KNN [81]. The probabilistic CC variants mentioned before are unsuitable as they assume a base learner which predicts label probabilities; CC variants which produce per-example orderings are likewise unsuitable. There is some criticism of the predominant interpretation of the mechanisms underpinning CCs: that correlations between labels exist, and that CCs offer improvement by exploiting them [94]. As many problem transformation implementations rely on linear base classifiers, Dembczyński et al. [48] point out that the performance of CCs could be partly explained by the introduction of non-linearities (by the chain) causing an expansion of the hypothesis space. This echoes results by Read and Hollmén [94] wherein CC offered far more improvement over IC when the base classifier was logistic regression than when it was random forest. It would therefore be informative to evaluate CCs with an FBN base classifier, given their high non-linearity.

4.1.2 When and how to order chains

Much discussion exists on the importance of the chain order. The consensus is that chain order can have a significant impact on the performance of a CC, but that the extent of this depends on the base classifier. Da Silva et al. [181] studied the issue, finding chain order to be highly influential, and similarly found different orders for each example (at inference) helpful. However, the latter option is clearly not possible for learners with a static structure at inference, like FBNs. Additionally, Huang et al. [182] studied the more general case of reusing hypotheses in MLC, noting that relationships between targets in real-world problems are often asymmetric, lending support to the importance of an appropriate chain order. There are a variety of approaches to addressing the chain ordering issue, which I discuss below.

The Ensemble of Classifier Chains (ECC) approach seeks to ameliorate the effect of chain ordering by producing an ensemble of chains with different random orders [72]. Read et al. [71] used a bootstrap aggregating approach to model averaging [183] whereby each chain is trained on a modified training set (sampled from the original with replacement) and then, at inference, the resulting ensemble votes to decide predicted values. They found the ensemble to significantly outperform random orders on average. Li and Zhou [184] extended ECC using a weighted voting scheme and pruning, reaching performance comparable to large ensembles with much smaller ones. Trajdos and Kurzynski (2018) suggest building ensembles with a diversity of chain orders; however, that approach requires meta-optimising over many ensembles, resulting in many thousands of underlying CCs being trained. The primary downside of an ensemble, in general, is that it requires multiple learners to be trained (often 10 to 50 times as many). Specifically for logic synthesis, it does not permit the generation of a single final network as output.

An alternative is to find an effective chain order by meta-optimising. Goncalves et al. [185] used a Genetic Algorithm (GA) to find a single chain order with high performance on the training set; however, they observed over-fitting with extended iterations. They later improved on this approach by allowing incomplete chains and treated the over-fitting issue using internal validation [186]. Read et al. [50] proposed a Monte Carlo approach to produce an optimal order (or a population of orderings for use in an ensemble), and Kumar et al. [187] applied beam search. Liu and Tsang [188] showed that an optimal ordering can be found for large-margin base classifiers (such as SVMs) by using dynamic programming; they also presented a greedy approximation which empirically performed similarly well. The primary issue with search-based approaches is that they require a large number of repeated training stages, and in some cases [187–189] make assumptions on the type of base single-output learner.

At the time of writing, only one method exists to produce chain orderings a priori. Jun et al. [190] suggest a number of heuristics based on the pairwise conditional entropy between target features. This first generates a conditional entropy matrix H where

H_{i,j} = CE(y_i | y_j). They propose four greedy heuristics which build a chain approximately minimising the sum of CE between each target and the targets earlier in the chain. It is interesting to note that the authors' stated intent is to place simpler targets toward the beginning of the chain, the same maxim that motivated the design of ID-curricula in the previous chapter. The primary difference is the entropic metric used to quantify difficulty.

Another candidate can be taken from part of the method by Wang et al. [191], which uses partially a priori information, in combination with training set performance, to produce a chain order. The a priori component, which they call label effects, corresponds loosely to a one-sided version of conditional entropy, where the component conditioning on a target being negative is ignored (a natural result of assumptions common in MLC; see Section 2.1.2). Since these label effects are asymmetric, they can in principle be used as an a priori method for ordering.

Chen et al. [178] produce a complete graph, G, with a vertex for each target and edge weights equal to the mutual information between the two corresponding vertices. "Tree-structure label constraints" are taken from a maximum spanning tree found on G. This is used to cluster targets for producing sub-chains, but it is also used for producing a label ordering. This would ostensibly make it a candidate baseline; however, when used to produce total orders, the method produced orderings which are effectively random. This is due to the use of a symmetric inter-target relationship metric: Mutual Information. Any node in the spanning tree is as likely to be the root as any other; thus the pairwise orderings of the labels, and the set of nodes considered parents of another, are completely determined by initial conditions.

In the literature, CCs are most commonly used with logistic regression [91, 92], SVMs [50, 184], or decision trees (Trajdos and Kurzynski, 2018) as the base classifier. To date, FBNs have never been used as a base classifier with CC, nor has any curriculum method, such as that presented in Chapter 3, been used to generate a priori chain orderings. In the remainder of this chapter I examine the efficacy of ID-curricula in FBN-based CCs and also present an extended ID-estimation technique, specific to CCs, which further improved performance.
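As a rough illustration of this family of a priori baselines, the sketch below computes the pairwise conditional-entropy matrix H_{i,j} = CE(y_i | y_j) for binary (0/1) targets and builds a chain with one simple greedy rule in the spirit of CEbCC: append the target with the smallest summed CE to the targets already placed, starting from the lowest-entropy target. The four specific heuristics of Jun et al. [190], and the label-effects measure of Wang et al. [191], are not reproduced here; this is only an assumed, simplified variant.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(a, b):
    """CE(a|b) = H(a, b) - H(b) for two binary (0/1) target columns."""
    a, b = np.asarray(a, dtype=int), np.asarray(b, dtype=int)
    joint = np.zeros((2, 2))
    for va, vb in zip(a, b):
        joint[va, vb] += 1
    joint /= len(a)
    return entropy(joint.ravel()) - entropy(joint.sum(axis=0))

def ce_matrix(Y):
    """Pairwise matrix with H[i, j] = CE(y_i | y_j)."""
    n = Y.shape[1]
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                H[i, j] = conditional_entropy(Y[:, i], Y[:, j])
    return H

def greedy_ce_order(Y):
    """Greedy chain order: each step takes the remaining target with the
    smallest summed CE given the targets already placed."""
    H = ce_matrix(Y)
    remaining = list(range(Y.shape[1]))
    first = min(remaining,
                key=lambda i: entropy(np.bincount(Y[:, i].astype(int),
                                                  minlength=2) / len(Y)))
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(remaining, key=lambda i: sum(H[i, j] for j in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```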

4.2 Target-aware ID-curricula

One factor that becomes apparent when considering ID-estimation for CCs is that it does not have to be estimated solely within the input-feature space (as in Section 3.3.1). This is because the output features are treated as potential inputs, altering the domain of each target. They can thus also be considered for ID estimation and, since doing so alters a target's domain, it may also alter its intrinsic dimension. There are a few ways the inclusion of target features in ID may help.

A tie (in the Section 3.3.1 method) between two targets, y_i and y_j, may be broken by the inclusion of a different target feature, y_k, that reduces the minimum feature set for y_i. Alternatively, one of y_i or y_j may reduce the feature set of the other and thus improve the curriculum by providing the additional constraint that one precedes the other. In light of this, the ID-estimation method of Section 3.3.1 can be considered "first-order" in some sense; I will refer to it as target-agnostic from here on.

The inclusion of target features in this way does complicate the discovery method. The final chain order affects which targets are available as input features for each other. We cannot simply create a matrix X'_i = X ∪ Y − {y_i} for each target and then order by increasing estimated ID, as the ID for a target becomes invalid if one of its intrinsic features is a target placed after it in the resulting curriculum.

There is the obvious possibility of a search-based solution. The space of permutations could be searched, with ID estimated for targets based on each ordering, with the goal of finding one minimising some measure of collective ID, or satisfying a validity criterion (e.g. sorted by ID). There are even hints of repair operators: after the ID estimates are found for some arbitrary chain, two adjacent targets can be swapped provided the one which will be placed earlier is not present in the other's feature-set solution. However, this has the potential to drastically increase the cost of the ID-estimation process, in much the same way that meta-optimising the chain order does for the training phase.

I instead examined a greedy approach. The idea is simple: take first the target with the smallest ID under X. Then iteratively take the target with the smallest ID given X and any prior targets in the chain. This has an advantage over search-based approaches in that it reduces the number of minFS instances that need to be solved. Moreover, after the first round of solutions, upper bounds are known for future instances: if a minFS instance has a solution of size k, then that instance with additional input features still contains that size-k solution, so the optimal solution is of size k or less. The bounds found in the initial round can be used to aid solvers in subsequent rounds, and the prior solutions themselves can even be used as seeds. This greedy method is detailed in Algorithm 4.3; herein I refer to it using the term target-aware, and to the method from Section 3.3.1 as target-agnostic. The expected benefit of using target curricula in this way is greater generalisation than achieved by a random chain order, without needing to train the several FBNs required for a voting ensemble or for meta-optimising the ordering.

Algorithm 4.3: The greedy target-aware ID-based curriculum generation process. Given a training dataset (X, Y) it produces a list of target indices, t_i, defining the order in which targets should be learned.
Input: X, Y
Output: t_1, ..., t_{n_o}
1  X' := X
2  remaining := {1, ..., n_o}
3  for j ∈ 1, ..., n_o do
4      t_j := argmin_{l ∈ remaining} |minfs(X', y_l)|
5      X' := [ X'  y_{t_j} ]
6      remaining := remaining − {t_j}
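A minimal sketch of Algorithm 4.3 is given below, assuming a `minfs(X, y)` callable that returns the indices of a minimum feature set for target y (in the thesis this is posed as an NP-hard optimisation problem; any exact or heuristic solver could stand in). The bound-reuse and solution-seeding refinements mentioned above are omitted for brevity.

```python
import numpy as np

def target_aware_curriculum(X, Y, minfs):
    """Greedy target-aware ID curriculum (Algorithm 4.3 sketch).

    X, Y  : input and target matrices (examples in rows).
    minfs : callable returning a minimum feature set (list of column
            indices) of its first argument for the given target column.
    Returns the chain order as a list of target indices.
    """
    X_aug = X.copy()
    remaining = set(range(Y.shape[1]))
    order = []
    while remaining:
        # Estimate ID of each remaining target given the inputs plus
        # all targets already placed in the curriculum.
        ids = {t: len(minfs(X_aug, Y[:, t])) for t in remaining}
        best = min(remaining, key=lambda t: ids[t])
        order.append(best)
        remaining.remove(best)
        # The chosen target's values become available as an input feature.
        X_aug = np.hstack([X_aug, Y[:, [best]]])
    return order
```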

4.3 FBN Classifier Chains

The use of FBNs within a classifier chain framework follows from the description of the classifier chains approach in Section 4.1.1. There is one slight alteration due to the FBN structure conventions in Section 2.5.1, where the final n_o gates form the output. This does not fit with the CC inference stage, which requires that each learner have available, as input, the output of all learners preceding it in the chain. In order to build a CC of FBNs where the final result is a single network subject to the conventions in Section 2.5.1, we need some way to reference any internal node as an output. I did this trivially by appending an extra set of output nodes after the CC training procedure. Each such "dummy" output simply needs to compute any function which acts as an identity function when given identical inputs (in this case the OR function).

Both inputs of each dummy gate are connected to the internal gate whose value we want at the output. By this method deeply internal nodes can act as outputs while the overall circuit remains consistent with the conventions in Section 2.5.1. The use of ID-curricula is now straightforward: a curriculum is generated as in Section 3.3.1 and used as the chain order. Each learner in the chain is an FBN with a budget of n_g/n_o gates and a single target. In principle they can be trained in parallel, as they are independent during the training phase. After training, each sub-network has its output connected, in place of the additional inputs, to all subsequent sub-networks, resulting in a single overall network. To this I then append n_o dummy output gates as described above. Note that these dummy outputs are not counted in the gate budget n_g for any experiments, since they are not involved in training or any true computation.
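The dummy-output construction can be illustrated with a toy gate-list representation (a list of two-input gates indexed after the circuit inputs). This representation is an assumption for illustration only and does not reflect the thesis FBN implementation.

```python
def append_dummy_outputs(gates, n_inputs, output_nodes):
    """Append one OR "dummy" gate per desired output so that arbitrary
    internal nodes can act as circuit outputs.

    gates        : list of (src_a, src_b, fn_name) tuples; node index k
                   refers to input k if k < n_inputs, otherwise to gate
                   k - n_inputs.
    output_nodes : node indices whose values should be exposed.
    Returns an extended gate list whose final len(output_nodes) gates
    are the circuit outputs.
    """
    assert all(0 <= k < n_inputs + len(gates) for k in output_nodes)
    extended = list(gates)
    for node in output_nodes:
        # OR(x, x) == x, so this gate simply forwards the node's value.
        extended.append((node, node, "OR"))
    return extended
```

Because OR(x, x) = x, each appended gate forwards the chosen internal node's value, so the final n_o gates of the extended list satisfy the output convention of Section 2.5.1.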

4.4 Experiments

I conducted several experiments to examine the effect of ID-curricula on CCs of FBNs. I compared the two approaches, target-agnostic and target-aware, to random chains and to the two a priori methods for chain ordering identified in the literature: CEbCC and label effects (Section 4.1.2). In the following sections I describe the problem test-beds used, the training and evaluation conditions, and the metrics for evaluating the produced curricula.

4.4.1 Datasets

In Chapter 3 I showed that target curricula can help in problems with an information cascade. What the experiments therein were not able to probe was how target curricula behave when there are structural differences in this cascade. There are few publicly available multi-target datasets, and none with accompanying domain knowledge regarding the interplay of targets. As such, I produced several synthetic problems with different underlying structures, which are presented in this section. Analysis of the dependencies in the time-series datasets in Chapter 3 revealed only shallow, local hierarchies. As such there is little justification for the use of classifier chains (and minimal possible variation in ordering). I did not use these datasets in this chapter and instead sourced several additional benchmarks from the domain of logic synthesis.

Random Cascaded Circuits

In approaching a problem which is suspected to possess a target-relationship structure similar to the cascading problems in Section 3.1.2, one might suggest the use of sequence-to-sequence recurrent networks [176, 192]. Applying them, however, requires a problem transformation into a sequential domain, which is infeasible if the order of targets is unknown, or if the underlying functions are heterogeneous. The first condition can be simulated with the previous benchmarks, by permuting the targets prior to training, but examining a method's sensitivity to the latter requires different benchmarks, several candidates for which I present below.

I created several synthetic benchmarks intended to help reveal the effect of heterogeneity in target-level functions, as well as the impact of relationship structure in hierarchical MTC problems. I produced problems with four general structures: a direct chain, an indirect chain, a binary tree, and a pyramid DAG (shown in Figure 4.1). These each impose different hierarchical relationships between targets, and have different possibilities for optimal target curricula. Using these, we can produce benchmarks that can be scaled in depth and which possess known regularities in the target relationships that can be explored.

The first structure directly links targets using the side-heavy tree structure shown in Figure 4.1a. This direct chain structure is an extreme case of transfer, and the related easy-to-hard curriculum is a total order. Each target y_i takes the value of a randomly selected Boolean function f_i calculated from an additional input, x_{i+1}, and the previous target, y_{i−1}. The target values are thus given by

y_1 = f_1(x_1, x_2)    (4.1)
y_i = f_i(y_{i−1}, x_{i+1})   ∀ i > 1.    (4.2)

The second structure, shown in Figure 4.1b, imposes an indirect "carry" effect, like binary addition but with heterogeneity among the sub-functions. These indirect chains represent a wider range of realistic scenarios, wherein transfer occurs through subcomponents (c_1, c_2, ...). The equations governing node activity are

c_1 = f_1^c(x_1, x_2)    (4.3)
y_1 = f_1^y(x_1, c_1)    (4.4)
c_i = f_i^c(c_{i−1}, x_{i+1})   ∀ i > 1    (4.5)
y_i = f_i^y(c_{i−1}, c_i)   ∀ i > 1,    (4.6)

where c_i is the value of the i-th indirection function, f_i^c (a subcomponent shared with subsequent targets), and y_i is the value of the i-th target function, f_i^y.

The third structure is a binary tree, shown in Figure 4.1c. It breaks from the linear chain structures, providing depth of computation by coalescing targets to produce higher-order targets which depend on a larger combination of inputs. Target values are given by

y_i = f_i^y(x_{2i−1}, x_{2i})   ∀ i ∈ [1, n/2]    (4.7)
y_i = f_i^y(y_{2i−n−1}, y_{2i−n})   ∀ i ∈ [n/2 + 1, n − 1]    (4.8)

Figure 4.1: Structures used for generating synthetic MTC benchmarks with deep inter-target relationships: (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, (d) Pyramidal DAG.

Note that the first n/2 targets source directly from the input features, while the remainder source only from prior targets.

The final structure is a pyramid DAG (or mesh), shown in Figure 4.1d. It is similar to the binary tree but with overlapping dependency structures and differing granularity in dimension control. While the binary tree can only be produced for output dimensions in powers of two, the pyramid DAG can be increased at a quadratic rate. Additionally, the binary tree has effectively equal input and output dimension, while the pyramid DAG has output dimension quadratic in the number of inputs. Target values are given by

y_i = f_i^y(x_i, x_{i+1})   ∀ i < n    (4.9)
y_i = f_i^y(y_{i−w}, y_{i−w+1})   ∀ i ≥ n,    (4.10)

where w is the width of the layer that y_i is in.

In all cases there is a clear transfer of information between targets, and clear candidate curricula. There is one caveat: depending on the choice of functions there can be a break in transfer, an obvious example being the constant-valued functions. To ensure increasing complexity, I permitted only true binary functions by excluding the two constant functions and the four functions which merely pass through or negate a single input. By this approach it is possible to generate combinatorially many problem instances which, like binary addition, are challenging for feedforward structures but which, unlike binary addition, do not involve repeating identical sub-units.

In this work I created 3 distinct problems, for each of 2 sizes, for each structure. Within these problems every node-level function was selected uniformly at random. The chosen sizes were 10 and 20 inputs for the direct and indirect chain structures, 8 and 16 inputs for the binary tree structure, and 7 and 8 inputs for the pyramid structure. This gave a variety of problem instances with sizes amenable to training tens of thousands of FBNs on each.
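The following sketch shows how a direct-chain instance (Equations 4.1 and 4.2) can be sampled under these constraints: node-level functions are drawn uniformly from the ten two-input Boolean functions that depend on both inputs. The function and variable names are illustrative only; the other three structures can be generated analogously by changing how each target's two arguments are wired.

```python
import itertools
import random

def truly_binary_functions():
    """All 2-input Boolean functions that depend on both inputs
    (excludes the 2 constants and the 4 pass-through/negation functions)."""
    fns = []
    for bits in itertools.product([0, 1], repeat=4):
        f = lambda a, b, bits=bits: bits[(a << 1) | b]
        depends_a = any(f(0, b) != f(1, b) for b in (0, 1))
        depends_b = any(f(a, 0) != f(a, 1) for a in (0, 1))
        if depends_a and depends_b:
            fns.append(f)
    return fns  # 10 functions

def random_direct_chain(n_inputs, rng=random):
    """Sample a direct-chain instance: target i is a random truly-binary
    function of the previous target and the additional input x_{i+1}."""
    fns = truly_binary_functions()
    chosen = [rng.choice(fns) for _ in range(n_inputs - 1)]

    def targets(x):  # x is a sequence of n_inputs bits
        y = [chosen[0](x[0], x[1])]                 # y_1 = f_1(x_1, x_2)
        for i in range(1, n_inputs - 1):
            y.append(chosen[i](y[-1], x[i + 1]))    # y_i = f_i(y_{i-1}, x_{i+1})
        return y

    return targets

# Example: enumerate the full truth table of one random 10-input instance.
f = random_direct_chain(10)
table = [f(x) for x in itertools.product([0, 1], repeat=10)]
```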

Logic Synthesis Benchmarks

I also used a subset of circuits from a popular logic synthesis benchmark suite. The LGSynth91 benchmark [193] contains a variety of problem instances for logic synthesis, including a large number from industry. I selected combinational multi-level instances with input dimension in the interval [8 .. 32] and output dimension in [5 .. 15]. The upper limits were chosen to allow a large number of runs to complete in a reasonable time, while the lower limits ensured the instances were of sufficient complexity for curriculum methods to be warranted. This resulted in a set of 10 real-world evaluation circuits detailed in the last rows of Table 4.1.

Summary

In addition to the problems mentioned above, I included several sizes of binary addition as well as the other cascaded problems from Section 3.1.2. The complete set of benchmarks was:

• 4-, 5-, 6-, 7-, and 8-bit addition,

• one instance each of the cascaded majority, multiplexer, and parity problems,

• a total of 24 instances created from the structures in Section 4.4.1, and

• ten instances selected from the LGSynth91 benchmark suite. CHAPTER 4. APPLICATION OF ID-CURRICULA TO CLASSIFIER CHAINS 71

Problem              Inputs   Targets   Training set size   Example pool size
Direct Chain         10       9         [8, 256]            1,024
Direct Chain         20       19        [8, 256]            1,048,576
Indirect Chain       10       9         [8, 256]            1,024
Indirect Chain       20       19        [8, 256]            1,048,576
Binary Tree          8        7         [8, 256]            1,024
Binary Tree          16       15        [8, 256]            1,048,576
Pyramid DAG          7        21        [8, 120]            128
Pyramid DAG          8        28        [8, 250]            256
LGSynth91 - alu2     10       6         [12, 512]           1,024
LGSynth91 - alu4     14       8         [12, 512]           16,384
LGSynth91 - cm162a   14       5         [12, 512]           16,384
LGSynth91 - cm163a   16       6         [12, 512]           65,536
LGSynth91 - cu       14       11        [12, 512]           16,384
LGSynth91 - f51m     8        8         [12, 128]           256
LGSynth91 - pcle     19       9         [12, 256]           524,288
LGSynth91 - pm1      16       13        [12, 256]           65,536
LGSynth91 - sct      19       15        [16, 196]           524,288
LGSynth91 - x2       10       7         [16, 256]           1,024

Table 4.1: Test-bed instance sizes used in this chapter.

In all cases I obfuscated the target order by permuting the targets randomly prior to training. For the benchmarks with a known difficulty-order we can still use this order to evaluate the quality of discovered curricula as I describe later in Section 4.4.3.

4.4.2 Training and Evaluation

The final set of chain ordering methods comprised:

• permutations chosen uniformly at random [72],

• the label effects measure of Wang et al. [191],

• the conditional entropy approach of Jun et al. [190],

• target-agnostic ID-curricula using only input features (Section 3.3.1), and

• the target-aware ID-curricula method for classifier chains (Section 4.2).

For the conditional entropy approach, Jun et al. [190] proposed four variants but noted that the first outperformed the rest. In all experiments herein I used all four variants but, like the authors, found the first (CEbCC1) to consistently dominate; for clarity I present only it in the results.

I adopted a similar experimental method as in Chapter 3. A large number of FBNs were trained for each treatment, over a range of training set sizes chosen to roughly span the generalisation transition region; Table 4.1 provides the values for the problems introduced in this chapter, while for those reused from Chapter 3 the values are given in Table 3.2. FBN-CCs were trained as per the approach in Section 4.3 until complete memorisation of the training set. As mentioned in Chapter 3, this helps decouple the effects on training efficiency and generalisation capacity.

4.4.3 Qualifying Curricula

The approach used in Sections 3.2 and 3.3.2 for quantifying the quality of curricula is helpful when we have a desired reference curriculum like π∗. It is not helpful, however, if we lack a reference; nor does it distinguish clearly between a method which consistently returns a different curriculum (bias) and one which returns arbitrary uncorrelated curricula (variance). Thus we are left with two key questions. Firstly, "does a given method converge on some ideal curriculum for a given problem?", and secondly, "are these generated curricula useful in the context of classifier chains of FBNs?". The usefulness can be quantified by the effect those curricula have on generalisation. Answering the former question requires measuring the similarity of a set of permutations, something that was covered in more depth in Section 2.2.4.

The similarity of a set of rankings is called the concordance. All the ordering methods discussed in this chapter return a total order, which can be considered a linear ranking of the targets. Kendall's coefficient of concordance, W, is thus an appropriate measure [116] and is given in Equation 2.3. A set of curricula with W ≈ 1 are near-identical, while a set with W ≈ 0 can be considered essentially random. There are of course further caveats and considerations. For instance, this approach may not strongly differentiate random curricula from cases where curricula cluster around a few possible centroids. The histogram of pairwise distances is a potential tool, as are rank clustering methods. However, I did not pursue these for reasons of scope, and because manual examination of low-concordance results did not indicate such clustering occurred in any experiments herein.

For the problems with a known difficulty order it is simple enough to measure the correlation, τπ∗, between the produced curricula and π∗ with Equation 2.2. This is true of several of the problems taken from Chapter 3 and is also true of the problem structures defined in Section 4.4.1. For the direct and indirect chains, π∗ is trivially a total order. Within the other two structures, however, there are target subsets which are not distinguishable, e.g. those on an equal level.

In general, if a DAG exists which describes the inter-target relationships, then any targets which are not transitively linked cannot be said to have an order preference in π∗ and thus should not be penalised in τπ∗. When calculating τπ∗ for those structures, I therefore reduce the set of pairs over which concordance/discordance is measured to those pairs with an equivalent edge in the transitive closure of the underlying DAG. Take Figure 4.1d for example: (y_1, y_{2n−1}) is considered in Equation 2.2, but not (y_1, y_2) or (y_1, y_{n+1}). The argument for using this approach, rather than considering such pairs as ties, is explained in Section 2.2.4. This calculation method is equivalent to treating π∗ as a family of total orders, all in agreement on the relevant pairs, and defining τπ∗ as the maximum value of Equation 2.2 obtained for any member of that set.
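For concreteness, the sketch below computes the standard (tie-free) form of Kendall's W over a set of total orders and the DAG-restricted correlation described above, in which only pairs linked in the transitive closure of the inter-target DAG are counted. A curriculum given as a permutation π converts to a ranking via ranking[π[j]] = j. This is an illustrative re-implementation under those assumptions, not the thesis code.

```python
import numpy as np

def kendalls_w(rankings):
    """Kendall's coefficient of concordance for a set of total orders.

    rankings[k, i] is the rank (0..n-1) that curriculum k assigns to
    target i; no ties are assumed.
    """
    rankings = np.asarray(rankings, dtype=float)
    m, n = rankings.shape
    R = rankings.sum(axis=0)                 # summed rank of each target
    S = np.sum((R - R.mean()) ** 2)          # spread of the rank sums
    return 12.0 * S / (m ** 2 * (n ** 3 - n))

def transitive_closure(edges, n_targets):
    """All (i, j) pairs such that target i must precede target j."""
    adj = {i: set() for i in range(n_targets)}
    for a, b in edges:
        adj[a].add(b)
    pairs = set()
    for start in range(n_targets):
        stack, seen = list(adj[start]), set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                pairs.add((start, v))
                stack.extend(adj[v])
    return pairs

def dag_restricted_tau(curriculum, precedence_pairs):
    """Kendall-tau-style score counting only the DAG-ordered pairs."""
    position = {t: k for k, t in enumerate(curriculum)}
    concordant = sum(position[i] < position[j] for i, j in precedence_pairs)
    discordant = len(precedence_pairs) - concordant
    return (concordant - discordant) / len(precedence_pairs)

# Example: a 3-target direct chain y0 -> y1 -> y2.
pairs = transitive_closure([(0, 1), (1, 2)], 3)
print(dag_restricted_tau([0, 1, 2], pairs))   #  1.0 (matches pi*)
print(dag_restricted_tau([2, 1, 0], pairs))   # -1.0 (fully reversed)
print(kendalls_w([[0, 1, 2], [0, 2, 1], [1, 0, 2]]))  # ~0.44, partial agreement
```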

4.5 Results

The experimental work in this chapter follows a very similar vein to the previous chapter. The primary concern was analysing the produced curricula and their effect on the resulting FBN performance. The variables of interest in this experiment were thus:

• out-of-sample MCC,

• correlation, τπ∗, with the easy-to-hard curriculum, π∗ (where π∗ exists), and

• coherence of curricula (Kendall's W).

Measures presented in reference to a baseline (i.e. δ-MCC) are with respect to random curricula. In the following sections I present results for the above on the previous test-bed problems, the new synthetic test-beds, and the problems taken from the LGSynth91 suite.

4.5.1 Prior benchmarks

I show the generalisation difference, δ-MCC, as it varies with training set size for varying sizes of addition in Figure 4.2, and for the cascaded parity, majority and multiplexer problems in Figure 4.3. The corresponding function-averaged δ-MCC for each target is given in Figures 4.4 and 4.5. Likewise, I give the correlation between produced curricula and π∗ in Figures 4.6 and 4.7.

Figure 4.2 shows that the CC-FBNs trained using the baseline curricula methods performed no differently than when a random curriculum was given. The ID-based curricula, on the other hand, produced networks with significantly better generalisation performance. The target-aware approach generalised best and, most notably, the separation increased with instance size. The target-agnostic approach, however, plateaued as target dimension increased.


Figure 4.2: Mean δ-MCC (w.r.t. random chains) for each chain ordering method versus training set size, for addition problems of increasing output dimension (95% CI shown with transparent bands). The two comparison methods, CEbCC and label-effects, did not improve on a random ordering. Both the target-agnostic and target-aware ID-curricula methods improved over a random chain in all cases, with the latter dominating for larger problem sizes. Considering the peaks, the target-aware ordering improved with target dimension; the target-agnostic ordering, however, plateaued for the higher dimensions.


Figure 4.3: Mean δ-MCC (w.r.t. random orderings) for each chain ordering method versus training set size for the cascaded majority, multiplexer and parity problems (95% CI shown with transparent bands). On the parity benchmark the comparison methods, CEbCC and label-effects, were not distinguishable from random chain orders. On the multiplexer, CEbCC slightly outperformed random chain orders while label-effects slightly worsened performance. Both the target-agnostic and target-aware ID-curricula methods improved over a random chain ordering in all cases, with the latter dominating for the multiplexer. There was no improvement seen by introducing target-awareness to the ID estimation (red and green curves overlap) for the majority and parity problems, which is unsurprising given the minimal room for improvement (see Figure 3.9).

Figure 4.3 shows the problem-dependent nature of the baseline procedures. On the majority problem the label effects curricula were partially effective in comparison with the ID curricula, on parity they had no significant effect, and on the multiplexer they performed slightly worse than random chain orderings. CEbCC produced orderings which significantly hurt generalisation on the majority problem, were indistinguishable from random on parity, and slightly improved generalisation (compared to a random order) on the multiplexer problem. ID-based curricula, of both forms, offered improvements similar to those seen on the adders and in Chapter 3. On the multiplexer, providing target information using the target-aware method more than tripled the peak improvement over random chain orders. On the other two problems, however, the target-agnostic and target-aware procedures performed equally.


Figure 4.4: Per-target generalisation performance on addition problems of increasing size. Averaged δ-MCC (w.r.t. random curricula) for each method is plotted against target index, ordered by π∗ (95% CI shown with transparent bands). The equivalence of CEbCC and label effects with random curricula (Figure 4.2) was consistent across all targets. For the ID-curricula, performance increased after the first target for both methods, but for the target-agnostic approach this dropped again (echoing Figure 3.4). Target-awareness dramatically improved on this, remaining dominant across all targets.

Figure 4.4 shows the generalisation improvement for individual targets, ordered by π∗, on the addition problems. In this respect, we see that little improvement occurred in the initial targets and that improvement initially increased for both ID-curricula methods with target depth. Toward the latter targets there was a corresponding drop-off in improvement (though the target-aware approach maintained positive δ-MCC for all targets). CEbCC and label effects did not differ from random curricula.


Figure 4.5: Per-target generalisation performance on the cascaded benchmark problems. Averaged δ-MCC (w.r.t. random curricula) for each method is plotted against target index, ordered by π∗, for the 9-bit cascaded majority, 15-bit cascaded multiplexer and 7-bit cascaded parity problems (95% CI shown with transparent bands). Similar overall trends exist as in Figure 4.4; however, the exact profile of the effect of each ordering method was problem-dependent. Target-agnostic and target-aware curricula were equally performant on the majority and parity problems (red and green curves overlap). On the multiplexer problem the target-agnostic method suffered a drop-off which was drastically improved upon using target-awareness. The baselines performed no differently to random curricula on the parity problem. On the majority problem, label effects began to improve after the third target, approaching the ID methods, while CEbCC displayed the inverse. On the multiplexer, CEbCC improved for the first few targets but worsened on the last. Label effects was again qualitatively inverse to this, initially under-performing random chains but then improving on the final targets, outperforming the target-agnostic method on the final two targets.

Figure 4.5 shows the same for the remaining cascaded problems. For the majority and parity problems there was no difference between the two ID-curricula methods. On the multiplexer problem, target-awareness drastically increased performance beyond the second target, significantly counteracting the dip in performance of the target-agnostic method. On the majority and parity problems there was also more interesting variation among the non-random baselines. On the majority problem, CEbCC dramatically worsened generalisation on the higher-order targets whereas label effects began to catch up with the ID methods. On the multiplexer problem, CEbCC offered some improvement on earlier targets but worsened for higher-order targets, whereas label effects did almost the inverse, hurting performance on lower-order targets but then increasing to outperform target-agnostic curricula on the final two targets.


Figure 4.6: Correlation of generated curricula with π∗, measured by τπ∗, for each chain ordering method versus training set size on addition problems of increasing output dimension (95% CI shown with transparent bands). The two baselines, CEbCC and label-effects, produced curricula not at all correlated with π∗. Both the target-agnostic and target-aware ID-curricula methods converged on π∗ as training set size increased; however, the target-agnostic method required larger and larger training sets to reach that peak. Looking at Figure 4.2, this convergence may eventually occur after the region in which peak improvement is gained from a curriculum. Target-awareness appears to rectify this, as the point at which near-perfect correlation was achieved was relatively stable over target dimension.

Curriculum quality, measured by rank correlation with π∗, is presented in Figures 4.6 and 4.7.


Figure 4.7: Correlation of curricula with π∗, measured by τπ∗, for each chain ordering method versus training set size for the cascaded majority, multiplexer and parity problems (95% CI in transparent bands). The two baselines, CEbCC and label-effects, produced curricula not at all correlated with π∗ on the multiplexer and parity instances, and weakly correlated and anti-correlated respectively on the majority problem. The target-agnostic and target-aware ID-curricula methods converged toward π∗ as training set size increased. On the majority problem they converged almost identically (the red and green curves overlap), on the parity problem there was a small improvement obtained with target awareness, while on the multiplexer problem the target-aware curricula converged much more quickly.

The target-agnostic ID-estimation (green) followed the same trends seen in Figures 3.8 and 3.9 (top row of curves), since it is the same procedure. For the majority and parity benchmarks there was little difference in trends between target-agnostic and target-aware ID-estimation, while on the remaining problems the target-aware approach converged on π∗ much more quickly. In the addition problems we can see different scaling trends as well: the target-agnostic approach required more training examples to effectively uncover π∗ as output dimension increased, while the target-aware approach maintained a consistent profile.

The curriculum quality results for the baselines were markedly different. The baseline procedures produced curricula that were uncorrelated with π∗ on all problems except the majority problem. In this case the label effects procedure settled on curricula that were weakly anti-correlated, while CEbCC produced curricula that were more weakly, but positively, correlated.

Concordance was also measured but essentially mirrored Figures 4.6 and 4.7 for all except the cascaded majority problem. As such, the returned curricula for the comparison methods on the other problems were uncorrelated with each other (W ≈ 0), confirming that those methods produced essentially random orderings.


Figure 4.8: Concordance of curricula, measured by Kendall's W, for each chain ordering method as it varied with training set size for the cascaded majority problem (there are no CIs; W is a single value for the full set of curricula at each point). CEbCC and label-effects produced curricula with some level of increasing concordance. Looking at Figure 4.7, it is clear that this common point is not π∗, and thus CEbCC and label effects may eventually settle on their own centroid curriculum (which may or may not be optimal). The target-aware and target-agnostic curricula converged at a near-identical rate (red and green curves overlap).

The concordance results for the cascaded majority problem are given in Figure 4.8 and interestingly show a trend toward partial agreement amongst the curricula returned by the label effects method on that problem and, to a lesser extent, by CEbCC.

4.5.2 Random Cascaded Circuits

There is a larger number of datasets in this suite, and the shape of the curves over training set size does not vary substantially from the general "hump" seen in the previous section. Given this, I chose to simplify the results in this section by representing the metrics of interest with their averages (see Section 2.5.3). The result of this is a single value for each treatment, enabling simpler comparison via a bar plot. The obtained results for δ-MCC, τ and W are given in Figures 4.9, 4.10, and 4.11 respectively.

In Figure 4.9 we see that both ID-estimation methods produced curricula which improved generalisation performance over a random curriculum on all problems. On the direct chain instances, target-awareness increased this improvement, and the disparity between the two was larger with higher target dimension. On the indirect chain instances both ID-curricula methods performed equally on the 3 smaller problems, but target awareness improved on 2 of the larger instances. On both the binary tree and pyramid DAG structures the two ID-estimation methods generalised equally, and their improvement over a random ordering increased with output dimension.


Figure 4.9: Generalisation trends for the structured synthetic problems. Mean training-set-averaged δ-MCC (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

Label effects produced chain orderings that almost always significantly hurt performance on the direct chain, binary tree and pyramid DAG structures. On the indirect chain, however, it improved performance on all smaller instances, and in one larger instance (20-C). In all cases the produced curricula resulted in weaker generalisation than obtained with either ID-estimation method.

CEbCC produced relatively performant orderings in many cases. For the direct chain structure it was no better than random on one instance but better than random on the remainder. For the larger size of this structure, CEbCC produced curricula that were more effective than the target-agnostic ID-curricula in 2 of 3 cases. In the indirect chain, however, CEbCC underperformed random chain orders on the smaller instances and was no better than random for 2 of 3 larger instances; it did, though, marginally outperform target-agnostic curricula in the remaining case (20-B). On the binary tree and pyramid DAG structures the results were mixed but tended to be more positive than negative; however, it never outperformed either ID-curriculum method, being comparable in one case (Binary Tree 8-C).


Figure 4.10: Curriculum quality measured by correlation with π∗ for the structured synthetic problems. Mean training-set-averaged τπ∗ (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

In terms of correlation with π∗, the ID-based methods outperformed the baselines on almost all instances. For the direct chain structure the target-aware curricula most closely matched π∗, followed by the target-agnostic curricula. Label effects and CEbCC returned curricula either weakly correlated or anti-correlated, with no clear pattern between the two. Similar tendencies were seen in the indirect chain; however, the target-agnostic curricula closed the gap with their target-aware cousin.

On the binary tree and pyramid DAG structures the curriculum quality results are qualitatively different. For one, the target-agnostic approach consistently found orderings more in agreement with the underlying structure than the target-aware approach (note the different definition for π∗, and thus τπ∗, in this case). In one case, binary tree 8-C, CEbCC found better curricula overall, and in another (pyramid DAG 6-C) it matched π∗ more closely than the target-aware curricula, but not the target-agnostic curricula. In all cases label effects produced curricula anti-correlated with π∗ to varying degrees.

In terms of collective agreement (Figure 4.11), the two ID-based methods displayed very similar patterns as in Figure 4.10. This is unsurprising, as Figure 4.10 shows those methods converged on a particular curriculum. What Figure 4.11 demonstrates is that the disparities in τπ∗ are caused by variance, and not bias, in the curricula produced. For the direct and indirect chains, the concordance results indicate that label effects and CEbCC produced a wide range of curricula which were predominantly dissimilar. On the binary tree and DAG structures, however, the two baseline methods produced curricula which were nearly as self-similar as the ID strategies.


Figure 4.11: Agreement among chain orders measured by concordance (Kendall’s W ) for the structured synthetic problems. Mean training-set-averaged W (with bootstrapped 95% CI) is shown for each method grouped by structure and then by instance.

For the direct and indirect chains, the concordance results indicate that label effects and CEbCC produced a wide range of curricula which were predominantly dissimilar. On the binary tree and DAG structures, however, the two baseline methods produced curricula which were nearly as self-similar as those of the ID strategies.
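For reference, concordance here is Kendall's coefficient W over the set of chain orders produced across runs. The sketch below is a straightforward implementation of the standard (tie-free) formula; the example orders are placeholders rather than experimental output.

    # Sketch: Kendall's coefficient of concordance (W) over m produced chain orders.
    # The example orders are placeholders, not results from the experiments.
    import numpy as np

    def kendalls_w(orders):
        """orders: (m, n) int array; each row lists the n targets in chain order."""
        m, n = orders.shape
        ranks = np.argsort(orders, axis=1) + 1   # rank (position) of each target per order
        rank_sums = ranks.sum(axis=0)
        s = ((rank_sums - rank_sums.mean()) ** 2).sum()
        return 12.0 * s / (m ** 2 * (n ** 3 - n))

    orders = np.array([[0, 1, 2, 3],
                       [0, 2, 1, 3],
                       [1, 0, 2, 3]])
    print(f"W = {kendalls_w(orders):.3f}")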

4.5.3 LGSynth91

As in the previous section, I present the averaged δ-MCC in Figure 4.12 rather than presenting it varying with training set size. The results are much less clear-cut than those seen up to this point. As such I followed the advice of Demšar [194] on comparing multiple classifiers over multiple datasets. Demšar [194] suggests first using the non-parametric (rank-based) Friedman test [195] to determine whether a collection of methods can reliably be assumed to differ in performance based on their results on multiple datasets. If this passes, the post-hoc Nemenyi test [196] should then be applied to determine the pairwise differences. This is handily represented in a Critical Difference (CD) diagram, which shows the mean rank of each method and uses horizontal bars to link methods which the Nemenyi test has failed to differentiate. Figure 4.13 shows the CD diagram for the LGSynth91 benchmarks.

In Figure 4.12, we can see that the target-aware curricula improved on random curricula for all circuits. Similarly the target-agnostic curricula helped for all except cm162a, cm163a, and pcle. CEbCC helped for exactly half the datasets.

[Figure 4.12: δ-MCC per dataset (cu, sct, x2, alu2, alu4, f51m, pcle, pm1, cm162a, cm163a) for target-aware ID, target-agnostic ID, CEbCC, and label effects.]

Figure 4.12: Generalisation trends for the LGSynth91 benchmark suite. Mean training-set-averaged δ-MCC (with bootstrapped 95% CI) is shown for each method grouped by instance. There was a general trend in improvement by the ID-based methods, with the target-aware curricula achieving better generalisation. The corresponding Critical Difference diagram is given in Figure 4.13.

[Figure 4.13: Critical Difference diagram over mean ranks 1-5; methods: target-aware ID, target-agnostic ID, CEbCC, random, label effects.]

Figure 4.13: Critical Difference diagram based on the macro-averaged test-MCC results (Figure 4.12) for the LGSynth91 benchmark suite. The methods are placed according to their average rank; those closer than the critical difference (CD), determined by the Nemenyi post-hoc test, are joined by bars. We see that the target-aware ID curricula are separated from all baselines but not from target-agnostic. Target-agnostic curricula could not be separated from CEbCC or random, while label effects ranked lowest but was not separated from CEbCC.

Label effects improved generalisation by a significant margin on no dataset (possibly cu with more data), and hurt performance on 4 datasets. Target-aware curricula outperformed CEbCC on alu4, cm162a, cm163a, f51m, and x2, while target-agnostic curricula outperformed CEbCC only on alu4 and x2. On no dataset did the reverse occur; however, there is a notable difference between the means on cu and pm1, and with more data it is possible CEbCC may be clearly better.

The Friedman test (with α = 0.05) rejected the null hypothesis that all curricula performed equally. Figure 4.13 shows the CD diagram generated using the subsequent Nemenyi test. It shows that target-aware ID-curricula can be reliably differentiated from all baselines, target-agnostic from label effects but not from CEbCC or random chain orders, and that CEbCC cannot be reliably separated from any method bar target-aware ID-curricula (on this suite). Considering the magnitude of differences in Figure 4.12 (which the Nemenyi test does not) would suggest that the ID-curricula methods and, to a lesser extent, CEbCC were overall helpful in comparison to random chain orderings.
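As a minimal sketch of this procedure (assuming the scikit-posthocs package is available), the Friedman test and Nemenyi post-hoc can be computed as below; the scores array is a placeholder standing in for per-dataset macro test-MCC, not the experimental data.

    # Sketch of the Friedman/Nemenyi procedure, on placeholder scores.
    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata
    import scikit_posthocs as sp

    methods = ["target-aware ID", "target-agnostic ID", "CEbCC", "random", "label effects"]
    rng = np.random.default_rng(0)
    scores = rng.uniform(0.5, 1.0, size=(10, len(methods)))   # (datasets x methods), placeholder

    # Friedman test: can the methods be distinguished at all?
    stat, p = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))
    print(f"Friedman chi-squared = {stat:.2f}, p = {p:.4f}")

    if p < 0.05:
        # Nemenyi post-hoc: pairwise p-values underpinning the CD diagram.
        pairwise = sp.posthoc_nemenyi_friedman(scores)
        pairwise.index = pairwise.columns = methods
        print(pairwise.round(3))

    # Mean ranks (1 = best) place each method along the CD diagram axis.
    mean_ranks = rankdata(-scores, axis=1).mean(axis=0)
    print(dict(zip(methods, mean_ranks.round(2))))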

4.6 Discussion and Conclusion

Across the range of problems considered, ID-curricula consistently improved the generalisation of Classifier Chains of FBNs over using a random chain order. They also compared favourably with the comparison methods. In this chapter I also presented a target-aware extension to the ID estimation procedure, specifically for CCs. This method further improved generalisation in the majority of cases. I compared with the only existing, completely a priori, chain ordering approach, CEbCC, and with another metric, label effects, which has been used as a component of a different state-of-the-art approach. Label effects performed poorly on these datasets. It may be the case that it would be effective if paired with prior predictability scores obtained from ICs (as originally proposed [191]) but, if so, I would expect it to deliver at least partial improvements in generalisation, given the continuity of generalisation trends seen with respect to τπ∗ in Section 3.2.

The results on the structured synthetic benchmarks suggest that structure has a larger impact on the efficacy of ID-curricula than the class of node-level function, which is unsurprising given the nature of the minFS approach. This does not appear true of CEbCC and label effects, whose generalisation performance varied much more within structure groups (the 'A/B/C' of Figure 4.9), in several cases alternating between improving upon, and under-performing, random chains. It is interesting to note that, on these datasets, there is an almost inverse relationship between label effects and CEbCC, suggesting that the classes of Boolean function for which each is suited may be at least partially complementary. Experimentation with similar structures, but with node-level functions specifically selected by class, may elucidate this further.

The disparity between ID-curricula and CEbCC was less authoritative on the LGSynth91 benchmark suite; this is not surprising, as I did not select for structure within that suite. The hypothesis of this chapter is not that curricula are important in all cases, but rather that they may help in cases where inter-target relationships exist in addition to a disparity in difficulty. Nonetheless the experiments did effectively separate ID-curricula from random chain orderings and label effects by the critical difference (Figure 4.13). It is worth noting that the Friedman-Nemenyi approach does not take repetitions or effect size into account: the critical difference would be the same if only a single learner was obtained for each treatment/dataset pair and if the improvements in Figure 4.12 were much smaller. Demšar [194] notes that there does not appear to be a similar test which could incorporate multiple runs, as they are not independent observations due to training set overlap.

In addition to effects on generalisation performance, I was able to examine the curricula returned for those problems for which π∗ is known. The target-agnostic approach reproduced the results seen in Chapter 3. The target-aware ID-curricula matched or exceeded the rate of convergence (to τ = 1) of the target-agnostic approach. In particular, the inclusion of target features in curriculum estimation rectified the decay in curriculum quality seen with increasing target dimension (Figure 4.6). Those problems for which target-awareness did not deliver greater generalisation than the target-agnostic approach are explained by the lack of improvement in τπ∗ (compare Figures 4.3 and 4.7).
This echoes the findings in the last chapter and hints at the potential for more widespread suitability of target curricula. On the binary tree and pyramid DAG structures, target-awareness reduced curriculum quality and did not improve generalisation over the target-agnostic method. Analysis of randomly selected instances indicated that target-awareness occasionally introduced ties between parent and child targets, which Algorithm 4.3 breaks arbitrarily. It may be possible to break ties more effectively by determining whether the introduction of each node reduces the other's ID estimate. This problem did not, however, translate to poorer generalisation, and I would expect it to diminish in structures with a higher branching factor.

For the addition, multiplexer, and parity problems, the baseline metrics produced curricula that were uncorrelated with π∗ on average, and which had a complete lack of concordance (W ≈ 0). For the majority problem the baselines showed more decisive behaviour, exhibiting an increase in concordance and non-zero τπ∗. These trends in curriculum quality and self-agreement were matched by poorer generalisation, and suggest that entropic metrics of target relationship are inappropriate for this class of problem.

For the direct and indirect chains, the concordance results indicate that label effects and CEbCC produced a wide range of curricula which were predominantly dissimilar. This shows that the low correlation with π∗ was due not to the methods converging on some different family of optimal curricula, but instead to their producing mostly random orderings. On the binary tree and DAG structures this was not the case and, combined with the greater consistency of τπ∗, suggests that these methods settled on some particular curriculum which differed from π∗. The generalisation results in Figure 4.9, however, indicate that this curriculum is sub-optimal compared to π∗. Indeed, the generalisation performance of the two baselines can almost be predicted from their τπ∗ value.

The structured synthetic datasets show the relative effect of structure, size, and variation in the class of target function on the various ordering methods. Structure and size affected all methods, but there was much more variation within the same structure for the baselines than for either ID-based method. This was true of both generalisation performance and curriculum quality (in terms of both τπ∗ and concordance). Given the poorer performance of CEbCC and label effects on the addition and parity benchmarks, it is plausible that this is because entropic measures are effective for capturing relationships of co-incidence or even negation, but not for discovering higher-order non-linear relationships, like exclusive disjunction with other variables.

There is another subtle note of caution regarding the use and description of entropic measures as metrics for target relationships: the impact of the domain (i.e. the input features). The conditional entropy between two targets, for instance, is changed by the inclusion of duplicate examples if this is not explicitly accounted for. In contrast, this does not alter the feature set solutions, making ID estimation robust in this respect.

When concluding Chapter 3, I noted a performance drop-off phenomenon for the final target (Figures 3.4 and 3.5) when using Lgh to impose a target-agnostic curriculum. I proposed an explanation for this based on the impact of gate budget constraints. Part of the motivation behind examining target curricula in a Classifier Chain setting was to isolate this by studying a learner with different constraints, and thus confirm or deny that explanation. I observed a similar, but less pronounced, drop-off for the target-agnostic method (Figures 4.4 and 4.5) as seen in Chapter 3; this reduction partially supports the gate saturation hypothesis, but not as the lone culprit. Target-awareness appeared to partially rectify the issue, though, suggesting that part of the issue may be mistakes in the target-agnostic curricula, which are more frequent among higher order targets. It is possible that an optimal target-aware method would further improve upon the greedy approach examined here, although I suspect this is a point of diminishing returns.

In summary, curricula generated by using feature selection to estimate ID proved effective for FBN-based CCs across several problems of varying structures. This also adds to the evidence of the importance of chain order in CCs, even with a non-linear base classifier. Making the ID-based curriculum generation process target-aware, by including target features as additional input features, proved beneficial in most cases, particularly for problems with a deep hierarchy.

4.6.1 Future Work and Direction

There are several interesting extensions of this work which have already been alluded to in the discussion. One possibility is to further examine the interplay between the structure of target relationships and the efficacy of target curricula by using additional synthetic datasets of varying structures and branching factors. This could also be paired with an examination of different classes of Boolean function at each node. This may be interesting as much has already been written on the learnability of different classes of Boolean function [lee_dnf_2006, 197–199]. Knowing how these affect the ability to infer curricula, and the effectiveness of those curricula, may be especially useful in domains (e.g. GRN inference) where certain function classes are believed to dominate [200].

I did not explore the structure-generalisations of the Classifier Chain approach itself (Classifier Trellises, sub-chains, etc.). The minFS solutions used for ID estimation lend themselves to measuring overlap among the intrinsic features of targets. This could be used to inform the structure generation processes sometimes employed in those CC extensions. It may also be informative to examine the target-aware approach using the hierarchical loss functions from Chapter 3, where targets are not available as surrogate inputs.

Rather than pursuing these avenues I decided to expand on an idea inspired by the use of target features as additional inputs in CC. Just as a target is a useful meta-feature, so too might be the internal nodes of an existing FBN. In the next chapter I explore this idea further.

Chapter 5

Adaptive Learning Via Iterated Selection and Scheduling

In this chapter I extend the ideas presented in the previous two by noting that all internal nodes in a composite model (like an FBN) are themselves meta-features which have been specifically optimised for the target of interest. This is a notion that will be familiar to anyone in the field of deep learning or feature engineering. I used this fact to move the feature selection procedure into the training loop and further leverage target overlap. The resulting principle can be expected to aid learning under one key assumption: if outputs are related, then meta-features useful in computing one are, with reasonable probability, useful for computing another. I call this the assumption of probable common utility.

Assume we have a target function, y1 : X → B, which can be partially composed as y1 = f1(g(X′), X″). Further assume we have a second target function, y2 : X → B, which can be similarly composed: y2 = f2(g(X′), X‴); where X′, X″, and X‴ are feature (column) subsets of X. Successful transfer from y1 to y2 is more likely if g can be:

• represented implicitly in the model solving y1, and

• utilised in the training of y2 without introducing other confounds.

The work in Chapter 3 approached this implicitly by focusing on a soft temporal curriculum for training. The work in Chapter 4 relied on this, but only by considering the targets directly (in our example, g must be reverse-engineered from y1 by the training process). Assuming that the above reasoning correctly captures the mechanisms that resulted in those prior successes, it stands to reason that improvements can be made by explicitly detecting sub-computations like g; a toy illustration of such a shared sub-computation is sketched at the end of this introduction. I detail an approach designed to explicitly identify useful sub-modules, which I call ALVISS, in Section 5.1 and discuss tractability and tie handling. I then compare ALVISS

with relevant baselines in Section 5.2, and present ablative studies in Section 5.3. Finally, in Section 5.4 I scale up to much larger problem instances, demonstrating successful generalisation on a 128-bit addition problem with training sets containing fewer than 1000 examples.
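The promised toy illustration follows. The functions below are invented purely for demonstration of the composition above; they are not one of the benchmark problems used later.

    # Illustrative sketch only: two related Boolean targets sharing a sub-function g.
    def g(x1, x2):              # shared sub-computation, g(X') in the notation above
        return x1 ^ x2

    def y1(x1, x2, x3):         # y1 = f1(g(X'), X'')
        return g(x1, x2) & x3

    def y2(x1, x2, x4):         # y2 = f2(g(X'), X''')
        return g(x1, x2) | x4

    # A model that has learned y1 plausibly contains an internal node computing g;
    # under the assumption of probable common utility, that node is a useful
    # meta-feature when y2 is learned later.
    for bits in range(16):
        x1, x2, x3, x4 = ((bits >> i) & 1 for i in range(4))
        print(x1, x2, x3, x4, "->", y1(x1, x2, x3), y2(x1, x2, x4))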

5.1 Self-paced Curricula and Internal Meta-Features

This section describes a method I developed for combining target curricula with feature selection in order to generate internal "stepping stone" meta-features with FBNs. I call this method Adaptive Learning Via Iterated Selection and Scheduling (ALVISS). I first outline the motivation, then describe the method, and finally I discuss issues of tractability and methods for dealing with them.

5.1.1 Motivation: stepping stone state discovery

Figure 5.1 shows a small example MTC problem with 3 targets. It represents a situation where a priori ID-estimation is only partially effective. One target, y1, is readily identified as having lower intrinsic dimension, but the two remaining tasks are not separated. Figure 5.1a shows a network found by training a single-output FBN on y1 (with superfluous nodes removed). The circuit is by no means minimal, but it does have, buried within it, a node that computes a meta-feature, z1, useful to one of the two remaining tasks.

The key insight is that the values computed by the internal nodes in an existing FBN are as valid a feature as any input node, in much the same way as I treated target features in Chapter 4. Given a partial network solution and a set of examples, we now have a new minFS instance with the same target but many additional features. Of particular importance for our purposes, the solution size for some targets may decrease with the addition of these extra features (as in the case of Figure 5.1c). Including the meta-features in the minFS instances for the two remaining targets

(Figure 5.1c) gives a reduced solution for y2 due to the meta-feature z1. This has two immediate benefits: it aids curriculum discovery, and it allows a reduction in the sub-problem for y2. Discovering "stepping stones" like this within prior solutions has the potential to improve efficiency and generalisation in subsequent learning tasks. This has many parallels to the concept of Automatically Defined Functions in Genetic Programming [201].

Networks optimised as in Chapter 3 benefited from this effect in a statistical sense. By forcing the optimisation of the first task before the second, stepping stones are more likely to be generated before a viable solution to the second task is produced.

[Figure 5.1: (a) network trained on y1, containing internal node z1 (with inputs including x1 and x4); (b) truth table with minimum feature sets prior to learning; (c) minimum feature sets after learning y1, with z1 available as an extra feature.]

Figure 5.1: A simple example demonstrating the intuition behind ALVISS. The original input features are {x1, x2, x3, x4, x5, x6} and the target features are {y1, y2, y3}. At the outset (b), y1, y2 and y3 have minimum-cardinality feature sets (shown by coloured dots) of size 2, 4, and 4 respectively. Thus y1 is selected as the "easiest" and solved first. After training, a sub-network (a) for y1 is found with the internal node z1. Initially y2 and y3 were indistinguishable, but the inclusion of z1 in (c) permits a smaller feature set of size 3 for y2 while the solution for y3 remains unchanged. The procedure can now easily distinguish between the two remaining targets since solving for y2 is a simpler task in light of z1.

The delayed focus on the second task allows the local search procedure to stumble upon this node, and biases it towards such inter-operative solutions if they exist. One pitfall was the sporadic, and occasionally detrimental, effect on training speed. Additionally, curriculum uncertainties like that in Figure 5.1b undo some of the benefit. I present an alternative below.
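To make the feature-set reduction in Figure 5.1 concrete, the following sketch brute-forces minimum feature sets on a toy truth table before and after an internal meta-feature is added as a column. The dataset and the helper minfs_bruteforce are invented for illustration; the actual experiments use a set-cover solver, not brute force.

    # Illustrative sketch: how an internal meta-feature can shrink a minimum feature set.
    from itertools import combinations
    import numpy as np

    def minfs_bruteforce(X, y):
        """Return indices of a smallest column subset of X that still determines y."""
        n = X.shape[1]
        for size in range(1, n + 1):
            for subset in combinations(range(n), size):
                # A subset determines y if no two rows agree on the subset but disagree on y.
                _, inverse = np.unique(X[:, subset], axis=0, return_inverse=True)
                if all(len(set(y[inverse == g])) == 1 for g in range(inverse.max() + 1)):
                    return list(subset)
        return list(range(n))

    X = np.array([[(i >> b) & 1 for b in range(4)] for i in range(16)])  # all 4-bit inputs
    z1 = X[:, 0] ^ X[:, 1]          # hypothetical internal node of a trained sub-network
    y2 = z1 & X[:, 2]               # a later target that reuses z1

    print("before:", minfs_bruteforce(X, y2))                        # needs x1, x2, x3 -> [0, 1, 2]
    print("after: ", minfs_bruteforce(np.column_stack([X, z1]), y2))  # {x3, z1} suffices -> [2, 4]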

5.1.2 The Internal Meta-Feature Selection principle

Returning to Figure 5.1, one option for the next step is to simply train for y2 (in a similar manner as with Lgh) after allotting a new batch of gates. However, in this situation we have a feature set and a single target. Unlike in the multi-output case, it is straightforward to apply filter-style feature selection to the sub-problem induced by {x2, x5, z1} → y2. I therefore restricted the sub-network at each step to take only the minFS solutions as input. This follows the guiding principle of Occam's Razor by explicitly preferring that the network learn a function of fewer inputs and, as we shall see, it can lead to large improvements in generalisation.

In previous chapters I showed that imposing a curriculum of targets can noticeably improve the generalisation performance of FBNs on problems with hierarchically interdependent targets. Similarly, difficulty-based target curricula constructed using ID-bounds converged to an optimal difficulty order as the training set was expanded. When I enforced the curriculum using hierarchical loss functions in Chapter 3, the more strictly the loss enforced the curriculum, the more pronounced the improvement; the most performant loss almost rigidly constrained the learner to optimise the targets sequentially.

While successful, the previous curriculum construction methods ordered all targets by an intrinsic dimension bound discovered prior to the training phase. The primary failing in these methods was that the quality of curricula diminished with increasing target dimension. This was partially addressed in Chapter 4 with the introduction of target-awareness, but for some structures this did not help. In subsequent analysis I observed that curriculum mistakes occurred most commonly among the higher order targets (for either a priori method). By far the most common error was an inversion of the final two targets. This is expected since a larger feature set is less specified than a smaller one for the same number of examples. Additionally, mistakes occurred less often for target-aware curricula than for target-agnostic, so additional features can evidently be helpful. The idea presented here is a natural extension of target-awareness given two related insights:

1. Internal nodes are themselves meta-features.

2. A meta-feature can be considered useful if it functionally replaces more than one input feature.

Considering the use of feature selection within, rather than prior to, the training loop leads us to the ALVISS technique. The resulting principle is: train one target at a time, then use the internal nodes of the resulting model as meta-features for subsequent targets. This is done by adding them as extra input features in augmented minFS instances. The training procedure is modified from Independent Classifiers in a manner that strongly resembles Classifier Chains. Rather than deciding on a fixed curriculum at the outset, ALVISS operates iteratively.

First it selects a single target, yt, indicated as the easiest, and trains a sub-model, net_t, for that target. It then repeats the feature selection step for any remaining targets using the internal nodes of net_t as additional input features. At this point, the target with the smallest feature set is again considered easiest, in light of the newly available meta-features. Ideally, a smaller feature set is found for at least one remaining target. This occurrence indicates that some node in the newly trained sub-circuit net_t has produced a meta-feature useful for that target.¹ Figure 5.1 is an example of such behaviour.

As well as using the feature set dimension to select the next target, ALVISS also uses the solutions in typical filter-style feature selection. That is, the next sub-model is trained only on those features in the minFS solution, presented as a modified training set. After the sub-model is trained, the nodes which use these filtered features as input are connected to the corresponding nodes of the previous sub-models. The result is a single multi-output FBN with a semi-stratified structure. This avoids prior sub-networks being pointlessly re-evaluated and restricts the learner to only consider pertinent features. Algorithm 5.1 specifies the overall process.

Algorithm 5.1: The ALVISS approach to target curricula and internal feature filtering.

    Input:  X, Y
    Output: net
     1   X′ ← X
     2   remaining ← {1, ..., n_o}
     3   while remaining ≠ ∅ do
     4       for j ∈ remaining do
     5           X_j ← prefilter(X′)
     6           F_j ← minfs(X_j, y_j)
     7       t ← argmin_{j ∈ remaining} |F_j|
     8       X″ ← X′[F_t]
     9       train net_t : X″ → y_t
    10       X′ ← X′ ∪ internal_nodes(net_t)
    11       remaining ← remaining − {t}
    12   net ← join_networks(net_1, ..., net_{n_o})
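A minimal Python sketch of the loop in Algorithm 5.1 follows. The helpers prefilter, minfs, train_subnet, internal_nodes, and join_networks are hypothetical stand-ins for the thesis components (pre-filtering, the minFS solver, LAHC-optimised FBN training, and network stitching); their names and signatures are assumptions made for illustration only.

    # Sketch of Algorithm 5.1 (ALVISS); all helpers are hypothetical stand-ins.
    import numpy as np

    def alviss(X, Y, prefilter, minfs, train_subnet, internal_nodes, join_networks):
        """X: (examples, features) bool array; Y: (examples, targets) bool array.
        minfs is assumed to return a list of column indices."""
        X_aug = X.copy()                      # X': grows as meta-features are appended
        remaining = set(range(Y.shape[1]))
        subnets = {}

        while remaining:
            # Feature selection for every unsolved target, over the augmented features.
            feature_sets = {j: minfs(prefilter(X_aug), Y[:, j]) for j in remaining}
            # The target with the smallest feature set is treated as the easiest.
            t = min(remaining, key=lambda j: len(feature_sets[j]))

            # Filter-style selection: train only on the chosen target's minFS columns.
            X_filtered = X_aug[:, feature_sets[t]]
            subnets[t] = train_subnet(X_filtered, Y[:, t])

            # Internal nodes of the new sub-network become candidate meta-features.
            X_aug = np.column_stack([X_aug, internal_nodes(subnets[t])])
            remaining.remove(t)

        return join_networks(subnets)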

There are two reasons to expect generalisation improvements from shifting ID-estimation (and thus feature selection) into the loop:

• better curricula for high order targets, and

• constrained problem specifications for subsequent problems (compared with IC).

Better curricula can be expected since, for all targets except the first, more information is available. A target can be simpler than another in light of new meta-features rather than just the input data. Something akin to this can be seen in the improvement in curricula discovery with target-awareness in Chapter 4. Smaller feature sets can be expected where the assumption of probable common utility holds.

¹ This may even occur by random chance for unrelated targets in large networks, in a manner akin to the mechanism by which extreme learning machines operate [202].

The purpose of this chapter is not to present a new learning algorithm. Rather, the intent is to examine the principle of explicitly identifying and utilising internal meta-features. As such, for all experiments herein, I intentionally excluded the output node of each sub-network from consideration as a meta-feature. While their inclusion may be prudent for maximising performance, it also represents an overlap with CCs using a target-aware ID-curriculum. By excluding the outputs, the two approaches are separated, and the experiments described below can thus better attribute any benefits or pitfalls to the leveraging of specifically internal meta-features.

5.1.3 Tractability

The approach as presented is quite simple, but there is an obvious scalability issue: with each new target and sub-network, the input dimension of the corresponding minFS instance increases. For the problems considered so far, the a priori feature selection time has been negligible compared to the subsequent training time. For ALVISS, however, as the number of targets increases this relationship reverses if the inclusion of new meta-features is not constrained somehow in line 5 of Algorithm 5.1. Due to the intractable nature of the minFS problem, this growth in instance size must be addressed.

Bounding Instance Growth

Fortunately there is a simple heuristic which bounds the meta-feature space. As ALVISS repeatedly solves for all but the first target, in later iterations there already exists an optimal solution for a subset of the feature space (all but the new meta-features). If we replace that subset with the prior solution, then the instance size is bounded by a factor that does not depend on output dimension. This is obviously not a safe² reduction, else I would have just sketched the beginnings of a P-time solution to an NP-complete problem. In practice, however, the quality of solutions (and thus curricula) remained competitive with the unconstrained optimum for all cases herein. I refer to this practice as pre-filtering.

The proof of the bound is by induction. If the number of input features is l then the initial solution size is trivially bounded by l. If this set is augmented by n new features, this gives a secondary problem with dimension n + l. Since the initial solution is still present, the optimal solution in the augmented feature space cannot be larger than l. Thus, inductively, every instance is bounded by l + n for any exact solver or any heuristic which permits a seeded solution. For ALVISS, l is the problem input dimension, n_i, and n is the sub-network size, which is kept constant. If some fixed number, k, of precursor sub-networks is considered at each step, then the bound is kn + l, still independent of the target dimension.

² A safe reduction is one which preserves the optimal solution.

There are several pre-filtering variations, which all maintain a similar bound.

1. The prior sub-network meta-features alone; bound: n.

2. Input features and prior sub-network meta-features; bound: n + l.

3. Previous minFS and prior sub-network meta-features; bound: n + l.

4. Input features, previous minFS and prior sub-network meta-features; bound: n + 2l.

The method described earlier is option 3. The bounds above are all either trivial or follow the same reasoning as option 3. Pilot experiments found little difference between options 2-4, with a very occasional feature-set reduction by option 4. Option 1 occasionally results in instances with no solution even when one existed in the original input features. As such, the variant used in all experiments was option 4, as it maximises the chance of finding useful feature combinations between inputs and new meta-features while maintaining a constant bound. If the initial instances given by (X, Y) can be solved in reasonable time, then so too should the subsequent instances be.

Note that the above bound controls the number of features in each instance, but obviously does nothing for the number of instances. This is n_o at the start, decreasing by 1 at each stage, resulting in O(n_o²) instances overall. The independent nature of each instance does, however, enable per-stage parallelism.
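A sketch of how the candidate feature pool for each pre-filtering variant might be assembled is given below; the variable names (input_cols, prev_minfs, prev_meta_cols) follow the list above, but the function itself is an illustrative assumption rather than the thesis implementation.

    # Illustrative sketch of the four pre-filtering variants; not the thesis code.
    def prefilter_pool(variant, input_cols, prev_minfs, prev_meta_cols):
        """Return the candidate column indices handed to the minFS solver.

        input_cols:     the l original input features
        prev_minfs:     the previous minFS solution for this target (|prev_minfs| <= l)
        prev_meta_cols: the n internal meta-features of the prior sub-network(s)
        """
        if variant == 1:                      # bound: n
            return list(prev_meta_cols)
        if variant == 2:                      # bound: n + l
            return list(input_cols) + list(prev_meta_cols)
        if variant == 3:                      # bound: n + l
            return list(prev_minfs) + list(prev_meta_cols)
        if variant == 4:                      # bound: n + 2l
            return list(input_cols) + list(prev_minfs) + list(prev_meta_cols)
        raise ValueError(f"unknown variant {variant}")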

Heuristic Solvers

In light of the increasing dimension of minFS instances, it was prudent to examine heuristic solvers. As the problem is reducible to the uni-cost set cover problem, I sought an appropriate solver for that problem. Lan et al. [203] describe an implementation of the Meta-RaPS technique with state-of-the-art performance on large benchmarks and a simple implementation. Meta-RaPS is a general meta-heuristic construction strategy which modifies an existing greedy algorithm by iterated perturbation and improvement [204]. I used the algorithm as described by Lan et al. [203] without randomised priority rules, which the authors state are inapplicable to uni-cost instances. I also evaluated a greedy heuristic, and LocalSolver [205], a proprietary heuristic MIP solver. Meta-RaPS consistently returned smaller solutions on representative instances, or comparable solutions in less time, and was selected as the sole solver going forward.

The Meta-RaPS implementation has several tunable parameters. While Lan et al. [203] propose a suitable set, I opted to conduct a randomised parameter search as suggested by Bergstra and Bengio [206]. This is similar to a grid search but with less regularity, and thus better coverage of the marginal distribution of each parameter. The algorithm has six tunable parameters: priority, restriction, improvement, search magnitude, improvement iterations, and max iterations. On a set of candidate minFS instances the search showed that optimisation time and solution quality were invariant to most parameters except priority. The chosen parameter values are given in Table 5.1.

    Parameter                 Valid Range    Chosen
    priority                  [0.0, 1.0]     0.9
    restriction               [0.0, 1.0]     0.15
    improvement               [0.0, 1.0]     0.15
    search magnitude          [0.0, 1.0]     0.1
    improvement iterations    [1, ∞)         20
    max iterations            [1, ∞)         20

Table 5.1: Parameters for Meta-RaPS [203] along with optimal values found using random parameter search [206].
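For context, the plain greedy construction for uni-cost set cover, which construction meta-heuristics such as Meta-RaPS perturb and improve, can be sketched as below. The casting of a minFS instance as set cover (one element per example pair disagreeing on the target, one set per feature covering the pairs it distinguishes) is shown as an assumption for illustration, not as the thesis implementation.

    # Sketch: minFS cast as uni-cost set cover, solved with the plain greedy heuristic.
    from itertools import combinations
    import numpy as np

    def minfs_as_set_cover(X, y):
        """One universe element per example pair with differing y; a feature 'covers'
        the pairs it distinguishes. A cover is then a feature set determining y."""
        pairs = [(a, b) for a, b in combinations(range(len(y)), 2) if y[a] != y[b]]
        sets = {f: {i for i, (a, b) in enumerate(pairs) if X[a, f] != X[b, f]}
                for f in range(X.shape[1])}
        return set(range(len(pairs))), sets

    def greedy_cover(universe, sets):
        uncovered, chosen = set(universe), []
        while uncovered:
            best = max(sets, key=lambda f: len(sets[f] & uncovered))
            if not sets[best] & uncovered:
                raise ValueError("instance has no solution")
            chosen.append(best)
            uncovered -= sets[best]
        return chosen

    X = np.array([[(i >> b) & 1 for b in range(3)] for i in range(8)])
    y = X[:, 0] ^ X[:, 1]                            # target ignores the third feature
    print(greedy_cover(*minfs_as_set_cover(X, y)))   # e.g. [0, 1]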

Combining the selected pre-filtering and solver options gave 4 combinations for consideration. These were:

• optimal solution (cplex) with no pre-filtering,

• optimal solution (cplex) with pre-filtering variant 3,

• heuristic solution (Meta-RaPS) with no pre-filtering, and

• heuristic solution (Meta-RaPS) with pre-filtering variant 3.

ALVISS was used with these four combinations on all benchmarks to determine the relative impact of both aspects.

5.2 Baseline Comparison

In this section I present a comparison of the ALVISS principle with baselines from the MLC literature. The selection criteria and resulting baselines are given in Section 5.2.1, performance metrics in Section 5.2.2 and discussion of the results in Section 5.2.3. The general methodology and training parameters are the same as in prior chapters. The training set ranges are given in Tables 3.2 and 4.1.

5.2.1 Baselines

As baselines I considered state-of-the-art MLC approaches capable of generating a single FBN as output. Unlike in Chapters 3 and 4, ALVISS is not a modification of some existing method requiring testing in isolation. In Section 2.1.4 I presented several MLC methods. Of these, all popular algorithm adaptations such as MLkNN, multi-label SVMs (e.g. RankSVM), and decision trees (e.g. MLC4.5, RT-PCT) can be trivially ruled out. Several problem transformations are also unsuitable: Label Powerset derivatives like HOMER and RAkEL, and pairwise methods. Of the methods commonly considered as benchmarks in the literature, the viable options were IC (commonly Binary Relevance), CC (but not PCC or ECC) and monolithic networks such as in Chapter 3. This resulted in three baselines:

SN A single multi-output FBN, trained using Lgh (as in Chapter 3),

IC The Independent Classifiers transformation: a single-output FBN for each target, and

CC A Classifier Chain, trained using target-aware ID-curricula (as described in Chapter 4).

Due to the independent nature of IC, I considered two variations: one with the same gate budget as the other methods and one with a significantly increased gate budget. Herein I report the variant which achieved the best generalisation performance, which was the former, although the variation was minor (see Section 2.5.1). The variants of SN and CC which I used were those demonstrated as best in prior chapters.

Again, it was important to further differentiate the classifier chain method from ALVISS. To do this I intentionally excluded the output node of each sub-network from consideration in line 5 of Algorithm 5.1. I did this as the purpose of this experiment was not to identify a winning algorithm but rather to specifically evaluate the potential of internal meta-features. The value of target features in curricula discovery has already been examined in Chapter 4.

5.2.2 Measures

The factors I considered in this experiment were generalisation performance, convergence time, and curricula quality. In Chapters 3 and 4 the methods under consideration were homogeneous in structure and as such, convergence time varied minimally and with little consistency. The baselines in this chapter were diverse in structure and there were significant, systematic differences in convergence time, warranting discussion in further detail.

Generalisation As previously, I measured generalisation performance with the per-target, and macro-averaged, MCC. Space permitting, and where the trends are interesting, this is presented against training set size; otherwise I report the function-averaged value (Section 2.4). In most cases the difference with respect to an obvious "control" baseline is reported as δ-MCC. In this section that control is IC as, unlike the others, it provides no mechanism for transfer between targets.

Training Efficiency There are two obvious candidates for measuring the rate of convergence with LAHC-optimised FBNs: iterations and time. Since it is important to account for the feature selection process, and since the underlying implementation and overall network sizes were kept consistent, I opted for total training time.

Curriculum Quality The same measures used previously to compare the resulting curricula are valid in this case. I used the same datasets as in Chapter 4 and thus the reference curriculum, π∗, and the measure of correlation with it, τπ∗, are the same. Unlike in Chapter 4, all curricula methods herein tended toward approximating π∗. Due to this, concordance as measured by Kendall's W was strongly correlated with τπ∗ and thus not additionally informative.
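The sketch below illustrates how these measures might be computed for a single run; the prediction arrays and the orderings are placeholders, and the library calls (sklearn's matthews_corrcoef, scipy's kendalltau) are standard implementations rather than the thesis code.

    # Illustrative computation of the reported measures; inputs are placeholders.
    import numpy as np
    from sklearn.metrics import matthews_corrcoef
    from scipy.stats import kendalltau

    def macro_mcc(Y_true, Y_pred):
        """Macro-averaged MCC over targets (columns)."""
        return np.mean([matthews_corrcoef(Y_true[:, j], Y_pred[:, j])
                        for j in range(Y_true.shape[1])])

    rng = np.random.default_rng(0)
    Y_true = rng.integers(0, 2, size=(200, 4))
    Y_method = np.where(rng.random(Y_true.shape) < 0.9, Y_true, 1 - Y_true)   # placeholder predictions
    Y_control = np.where(rng.random(Y_true.shape) < 0.8, Y_true, 1 - Y_true)  # IC-style control

    delta_mcc = macro_mcc(Y_true, Y_method) - macro_mcc(Y_true, Y_control)

    # Curriculum quality: rank correlation between a produced order and the reference pi*.
    pi_star = [0, 1, 2, 3]                 # reference easy-to-hard order (placeholder)
    produced = [0, 2, 1, 3]                # order returned by a curriculum method (placeholder)
    rank_star = np.argsort(pi_star)        # position of each target in the reference order
    rank_prod = np.argsort(produced)       # position of each target in the produced order
    tau, _ = kendalltau(rank_star, rank_prod)

    print(f"delta-MCC = {delta_mcc:.3f}, tau_pi* = {tau:.2f}")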

5.2.3 Results and Discussion

The results presented here are in a format consistent with the previous chapters. I start with the homogeneous cascaded problems from Section 3.1.2, then the heterogeneous structured test-beds and the LGSynth91 benchmark subset from Section 4.4.1. Within each group I first present the relative generalisation scores and convergence times, followed by curriculum quality measured by τπ∗ (for those groups with reference curricula). As mentioned in Section 5.1.3, four combinations of pre-filter and solver were examined. The differences in their generalisation performance were consistently within confidence interval bounds. As such, only the results for one case (Meta-RaPS with pre-filtering) are presented below.

Homogeneous Cascaded Circuits

On the binary addition problems, Figure 5.2a shows significantly superior generalisation with ALVISS over all baselines. Importantly, this improvement increased with the complexity (output dimension) of the instance, with ALVISS showing a different growth trend to CC and SN (keeping in mind that IC forms the zero reference point). Additionally, CC with target-aware curricula displayed better generalisation growth than SN (from Chapter 3). This provides some evidence for the generality of claims made in Chapter 4, as these are two different underlying methods. SN fails to outperform IC for the largest size; looking back at the decay in curriculum quality for target-agnostic curricula (Figure 3.8) we can see why: failure of the target-agnostic curricula to approximate π∗ within the viable phase-transition region of training-set size.

[Figure 5.2, two panels: (a) Generalisation (δ-MCC) and (b) Training Time (s, log scale) on 4- to 8-bit addition; methods: IC, ALVISS, CC, SN.]

Figure 5.2: δ-MCC and training time for ALVISS and baselines, on binary addition problems ranging between 4 and 8 targets. In (a) we see that ALVISS generalised significantly better than all baselines on all sizes, but most importantly this improvement increased with the target-dimension of the instance. In (b) we see convergence time (on a log scale), showing ALVISS converged faster than the baselines. Additionally, ALVISS follows a markedly different growth trend to the rest. This extreme reduction in time with ALVISS is explained as most of the training time in the baselines is spent optimising the higher-order targets; ALVISS typically has significantly reduced problem instances for those targets due to the useful meta-features found within the lower-order sub-networks. The significant difference in scaling trends in (b) demonstrates the impracticality of running the baseline methods on the larger adders presented at the end of this chapter.

Figure 5.2b shows the function-averaged training time, on a log scale. ALVISS consistently converged faster than all baselines, by a margin that increased with output dimension. For IC and CC, which also train each target independently, the bulk of this time was spent on the higher order targets. It was this difference in scaling trends which prevented consideration of the baselines for larger addition instances (Section 5.4).

Figure 5.3 shows the per-target generalisation. In this case I show absolute MCC rather than δ-MCC as it shows the trends more concretely and the differences are still clear. CC with target-aware ID-curricula and SN with target-agnostic ID-curricula (enforced via Lgh) displayed similar trends against IC as they did against their respective baselines in Chapters 3 and 4. This indicates that the trends observed in those cases, and here, are reflective of more general trends with target curricula (for FBNs at least). ALVISS displayed a growth in generalisation improvement which plateaued but did not decrease. This is an important distinction as both a priori methods displayed peak-and-decline behaviour for these problem instances. Also of interest is the presence of apparent asymptotic behaviour for all methods bar SN, at varying levels of generalisation.

[Figure 5.3, five panels: per-target MCC vs target index for 4-, 5-, 6-, 7- and 8-bit addition; methods: IC, ALVISS, CC, SN.]

Figure 5.3: Per-target MCC on binary addition problems ranging between 4- and 8-targets, for ALVISS and baselines. This shows that the bulk of the benefit due to ALVISS was in the higher order targets.

Curriculum quality results are presented in Figure 5.4. Correlation with the easy-to-hard curriculum (π∗) is plotted against training set size for the three ID-based curriculum generation methods (ALVISS, target-agnostic, and target-aware). Both methods which included additional meta-features (either target features or internal features) significantly outperformed the input-feature-only method. ALVISS and target-aware curricula both converged on π∗ quite quickly. This suggests that much of the improvement of ALVISS over CC was due to internal feature selection rather than improved curricula, something that will be explored further in Section 5.3.

Analogous results for the other cascaded problems are given in Figure 5.5. They show that differences in generalisation performance (Figure 5.5a) were, unsurprisingly, problem-dependent, as were the convergence times (Figure 5.5b). However, all curricula-dependent baselines outperformed IC.

[Figure 5.4, five panels: Rank Correlation (τ) vs Training Set Size for 4- to 8-bit addition; methods: ALVISS, Target-aware, Target-agnostic.]

Figure 5.4: Curriculum quality (τπ∗ ), on binary addition problems, for the three ID-based curricu- lum generation procedures: target-agnostic, target-aware and ALVISS. Adding extra meta-features, in the form of targets or internal features, significantly improved the probability of discovering the correct underlying curricula. Interestingly, ALVISS initially performed better than target-aware on low sample sizes, but the latter converged to τ = 1.0 slightly earlier.

For the multiplexer and parity problems, ALVISS dominated. For the majority benchmark, the three curricula-dependent approaches were comparable in generalisation performance. Convergence time displayed similar trends between baselines to Figure 5.2b; however, CC converged faster than ALVISS for the parity benchmark.

The per-target generalisation for the non-adder cascaded problems (Figure 5.6) shows that the asymptotic behaviours observed for the adders are problem dependent. On the majority problem we see that all methods degrade in performance at similar rates, though curricula offer an improvement that does increase with target depth. On the multiplexer we see similar trends as with addition, while for the parity problem each baseline displayed diverging trends, with ALVISS maintaining test-MCC above 0.85 (indicating strong correlation with the test data).

[Figure 5.5, two panels: (a) Generalisation (δ-MCC) and (b) Training Time (s, log scale) on CMaj9, CMux15, and CPar7; methods: IC, ALVISS, CC, SN.]

Figure 5.5: Mean δ-MCC and training time (log scale) for ALVISS and baselines on the cascaded majority, multiplexer, and parity problems. In (a) we see that ALVISS generalised better than all baselines on the multiplexer and parity problems. On the majority problem, the curriculum- dependent methods (ALVISS, SN and CC) generalised equally but better than IC. The training time results (b) show ALVISS converged faster than all baselines on the majority and multiplexer problems but CC trained faster on the parity problem (on which IC and SN were markedly slower).

[Figure 5.6, three panels: per-target MCC vs target index for 9-bit Majority, 15-bit Multiplexer, and 7-bit Parity; methods: IC, ALVISS, CC, SN.]

Figure 5.6: Per-target MCC on the cascaded majority, multiplexer, and parity problems, for ALVISS and baselines. The trends are problem dependent but qualitatively similar to those in Figure 5.3, except on the majority problem, where ALVISS performed similarly to the other curriculum-dependent baselines.

At the other end of the spectrum, IC drops in performance drastically, nearing essentially random behaviour on the test set for the highest order targets.

[Figure 5.7, three panels: Rank Correlation (τ) vs Training Set Size for 9-bit Majority, 15-bit Multiplexer, and 7-bit Parity; methods: ALVISS, Target-aware, Target-agnostic.]

Figure 5.7: Curriculum quality for the three ID-based curriculum generation procedures: target- agnostic, target-aware and ALVISS, on the cascaded majority, multiplexer and parity problems. As before, there was little variation on the parity and majority instances as the target-agnostic approach already rapidly converged on τπ∗ . On the multiplexer problem, curricula found using internal meta-features (ALVISS) slightly under-performed compared to those found using target features (target-aware).

Figure 5.7 shows the trends in curriculum quality. There are two points of interest. Firstly, curriculum improvements from meta-feature inclusion are highly problem-dependent. Secondly, ALVISS closely tracked, but slightly underperformed, target-aware curricula. The latter point indicates the value of internal meta-features for curricula generation but also highlights that they do not entirely replace target values (for these problems at least).

Across the set of homogeneous cascaded problems, the use of internal meta-features to generate self-paced curricula and regularise sub-problems resulted in consistent improvements in generalisation without compromising training efficiency. For the most part, it improved training efficiency in a qualitatively similar manner. Presented next are the results for random cascaded circuits.

Random Cascaded Circuits

The results are grouped by structure, in a similar manner as in Section 4.5.2. As before, due to the large number of datasets, averages over the span of training set sizes are presented. I expected similar improvements on the random cascaded benchmarks as observed with CCs in Chapter 4. Whether heterogeneity within each target function would impact performance was uncertain, as were the resulting variations in class imbalance (which were intentionally not controlled for in construction).

[Figure 5.8, four panels: δ-MCC per instance for (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, (d) Pyramid DAG; methods: ALVISS, CC, SN.]

Figure 5.8: Macro-MCC for the randomised cascaded benchmarks (Section 4.4.1). ALVISS generalised better by a margin of significance in all cases. SN and CC outperformed IC in almost all cases, though there was some variation between the two across different structures and sizes. Within each structure group the generalisation improvement increased with problem dimension for ALVISS. This suggests that ALVISS, and target curricula in general, are suited to deeper problems, as initially assumed.

Looking at Figure 5.8 we can see that ALVISS generalised better than the baselines on all four structures. The generalisation improvement was, however, more modest than seen for the larger adders and cascaded parity. Figure 5.9 shows comparable differences in training time. Interestingly, for the three smaller pyramidal DAG instances, IC converged faster than ALVISS, which did not occur for any other problem examined.

I was able to successfully train the baselines on these comparatively larger problems (19 targets, where the 16-target adder was infeasible for all baselines). Additionally, better generalisation was achieved than on the smaller 8-bit addition.

[Figure 5.9, four panels: Training Time (s, log scale) per instance for (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, (d) Pyramid DAG; methods: IC, ALVISS, CC, SN.]

Figure 5.9: Training time for the randomised cascaded benchmarks (Section 4.4.1). ALVISS consistently converged fastest on the first three structures (a-c), by orders of magnitude in some cases. On the Pyramid DAG structures (d) classifier chains converged faster, and for smaller sizes so did IC, likely due to the greater number of targets. For the larger size, ALVISS trended downward in relation to the others, suggesting that the suspected exponential trends w.r.t. target dimension (Figure 5.2) overtake the quadratic trends noted in Section 5.1.3.

These observations suggest that perhaps addition, like its related cousin parity, represents a particularly challenging problem. Nonetheless, in Figure 5.9 we still observe cumulative training times for ALVISS several times lower than for the next best baseline, and an order of magnitude lower than for the slowest.

Finally, Figure 5.10 shows the results for curriculum quality. For the direct chain (a) and binary tree (c) structures, ALVISS yielded curricula little different from target-aware curricula. ALVISS produced better curricula in five of six indirect chain (b) instances. On the pyramidal DAG structure (d), ALVISS appears to continue the drop seen from target-agnostic to target-aware curricula in Section 4.5.2, which is interesting given the significant improvements in generalisation in both cases (Figures 5.8 and 4.9). As noted in Chapter 4, I suspect that ties introduced by overlapping input dependencies, and the randomly chosen node-level functions, create situations where learning a latter node first is actually simpler. Again, experiments involving a similar structure but a branching factor greater than 2 may settle the matter.

[Figure 5.10, four panels: Rank Correlation (τ) per instance for (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, (d) Pyramid DAG; methods: ALVISS, Target-aware, Target-agnostic.]

Figure 5.10: Curriculum quality results on the random cascaded benchmarks. ALVISS matched or slightly underperformed target-aware curricula on the direct chain structures (a) while outperforming them on the indirect chains (b). There was little difference between any method on the binary tree structure (c), while on the pyramidal DAG (d) ALVISS produced curricula less correlated with π∗ than target-aware.

Real-world benchmarks

The circuits in the LGSynth91 benchmark serve a wide array of purposes; some are hierarchical and others are predominantly parallel. The actual circuit purpose and domain, if known, were intentionally not sourced, to avoid unintentional selection bias. Instead, the only selection criterion was size, as stated in Section 4.4.1. The purpose was to examine performance on a representative array of circuits, including some that may be unsuited to curriculum-based approaches. As such, my expectation was of mixed results, with ALVISS outperforming on some problems and under-performing on others.

Figure 5.11a shows, however, that ALVISS did not generalise worse than any baseline (outside the margin of error) on any problem. For seven circuits ALVISS generalised best by a significant margin. Interestingly, on this suite the next contender was SN, not CC, likely due to the increased prevalence of parallel computational pathways, against which CC is biased. It is possible that there is also a greater occurrence of weak coupling via smaller shared components, which ALVISS is better suited to take advantage of than CC. With respect to convergence time (Figure 5.11b), ALVISS was again a clear winner, except on alu2 where it matched CC.

[Figure 5.11, two panels: (a) Generalisation (δ-MCC) and (b) Training Time (s, log scale) per dataset (cu, sct, x2, alu2, alu4, f51m, pcle, pm1, cm162a, cm163a); methods: IC, ALVISS, CC, SN.]

Figure 5.11: Cumulative macro-MCC and training time for the subset of circuits taken from the LGSynth91 synthesis benchmark (error bars represent bootstrapped 95% CI). In (a) we see that ALVISS outperformed IC on all problems, CCs with target-aware ID-curricula on 8 of 10, and a single network with target-agnostic ID-curricula on 7 of 10. This was better than expected, as the circuits were selected without regard for function or composition and the initial expectation was that ALVISS might hurt performance in some cases. In (b) we see that the improved training time offered by ALVISS persisted, except on a single dataset, alu2, where it was not significantly faster than CC. For several problems this difference was an order of magnitude or more.

The relationship between the other baselines was more varied from problem to problem. This variability further motivates the ablative studies in Section 5.3 to determine the impact of feature selection versus curricula in different cases.

5.3 Ablative Studies

Section 5.2 showed that applying the ALVISS principle consistently improved generalisation and training efficiency. To probe the contributions of the main components of ALVISS to each of these phenomena, and to potentially discard unnecessary steps, I performed an ablative study. In the following sections I discuss the findings.

5.3.1 Aims and Experimental Design

In this section I report ablative studies designed to ascertain whether all aspects of Algorithm 5.1 are required, and to what degree. There are two coarse-grained aspects in question: the curriculum and the feature selection. From these we get two ablative variants, which I describe below (including the specific lines of Algorithm 5.1 that are affected).

Filtering Only uses a random curriculum, but is still layered and uses feature selection. This corresponds to replacing line 7 with a step that selects an element from remaining with uniform probability.

Curriculum Only retains the curriculum, but does not restrict the selected features at each stage. This corresponds to removing line 8 and using X′ in the training step directly (line 9).
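For clarity, the two ablations can be expressed as small modifications to the ALVISS loop; the sketch below reuses the hypothetical helpers from the earlier sketch of Algorithm 5.1 and is illustrative only.

    # Illustrative sketch of the two ablated variants, relative to the ALVISS loop.
    import random

    def pick_next_target(remaining, feature_sets, variant):
        if variant == "filtering-only":
            return random.choice(sorted(remaining))        # line 7 replaced: random curriculum
        return min(remaining, key=lambda j: len(feature_sets[j]))  # full / curriculum-only

    def training_inputs(X_aug, feature_set, variant):
        if variant == "curriculum-only":
            return X_aug                                   # line 8 removed: no feature filtering
        return X_aug[:, feature_set]                       # full / filtering-only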

I followed the same experimental method as in Section 5.2. In fact, the results reported below were collected in tandem with those above; I simply report them separately for clarity.

5.3.2 Results and Discussion

Here I present results on generalisation and training efficiency for the ablated variants of ALVISS. As in the previous round of results, IC was used as the reference baseline. While curriculum quality reduced slightly for the curriculum-only variant, the differences between the ablative variants were negligible and did not warrant further plots, so curriculum quality is not presented.

For the original cascaded problems (Figures 5.12 and 5.13), we can see that the combination of internal feature selection and curriculum imposition outperforms either individually. This is true both of generalisation performance and convergence time.

[Figure 5.12 plot area: panels (a) Generalisation and (b) Training Time (s) for the 4-bit to 8-bit adders; series IC, Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.12: Ablation experiment results for the addition benchmarks. For both generalisation and training efficiency the ablated variants under-performed ALVISS. Additionally, filtering-only initially outperformed the curriculum-only variant but scaled worse on both metrics.

[Figure 5.13 plot area: panels (a) Generalisation and (b) Training Time (s) for CMaj9, CMux15, and CPar7; series IC, Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.13: Ablation experiment results for the remaining cascaded benchmarks. The ablated variants again under-performed ALVISS on both metrics. For all three problems the curriculum resulted in greater generalisation than meta-feature filtering, but the training efficiency relationship between the two varied.

Curriculum quality plots are not presented (for brevity), but in the case of the addition problems the curricula were somewhat worsened by the lack of internal feature selection. This is likely a flow-on effect from a reduction in generalisation. Interestingly, there was no clear delineation between the effectiveness of curricula and internal feature selection. Variation can be seen both for different problems and for different output dimensions of the same problem. Feature selection appears more important (for both generalisation and efficiency) for smaller addition instances, but it was overtaken by curricula as output dimension increased. In terms of generalisation, curricula delivered a larger gain than feature selection; the training efficiency results, however, were less clear-cut.

[Figure 5.14 plot area: generalisation panels for (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, and (d) Pyramid DAG instances; series Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.14: Ablation experiment results for the random cascaded benchmarks. ALVISS consistently outperformed both ablated variants. This is particularly interesting given that on the pyramidal DAG structure (d) curriculum-only actually worsened generalisation. This is suggestive of a synergistic relationship between the imposition of a curriculum and the filtering of meta-features. The relative importance of curricula and meta-feature filtering varied between structures.

Figures 5.14 and 5.15 give the same results for the random cascaded problems. As before, in terms of generalisation performance, the full approach outperforms both ablated variants consistently. More interestingly, the different structures showed clear disparities in the importance of meta-feature selection and curricula.

[Figure 5.15 plot area: training time (s) panels for (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, and (d) Pyramid DAG instances; series IC, Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.15: Ablation experiment results for convergence time on the random cascaded benchmarks. In most cases ALVISS improved training speed compared to either ablated variant. In one case each of the direct and indirect chain structures, and on all pyramidal DAG instances, the filtering-only variant produced lower training times.

Curricula seemed of greater value for the direct chain; for the indirect chain there was no clear delineation; and for the two tree structures, meta-feature selection accounted for the vast majority of the performance gain. For the pyramidal DAG instances the ablative variant with no internal selection actually underperformed the baseline (IC) consistently. This did not, however, mean that the full approach underperformed. Indeed, it appears that a hard layer-based curriculum is ill-advised for such a structure, but that the curriculum does significantly aid in generating and detecting useful meta-features.

The dominance of the full method over its ablated counterparts was not universal when it came to training efficiency. In one instance each of the direct and indirect chain structures (Figures 5.15a and b), the filtering-only variant converged faster. For the pyramidal DAG (d) this was the rule, though less prevalent with greater output dimension. This is likely due to the training steps taking a similar time while the curriculum generation requires many more steps; with feature selection and no curriculum, only a single minFS instance is solved at each layer.

The results on the LGSynth91 subset were interesting. Consistently, the filtering-only variant generalised better than the curriculum-only variant.

[Figure 5.16 plot area: panels (a) Generalisation and (b) Training Time (s), per LGSynth91 circuit; series IC, Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.16: Ablation experiment results for the LGSynth91 circuits. Both ablated variants resulted in lower generalisation than ALVISS on all instances. Generally, the use of meta-feature filtering alone also outperformed IC. As in Figure 5.14, even when the curriculum imposition alone reduced performance (compared to IC) the unablated method outperformed filtering-only. In terms of training efficiency the full approach was fastest except on pm1 and sct. In most cases one or both ablated variants significantly improved training performance over IC.

In several cases, the curriculum-only variant even underperformed the baseline (IC). This suggests the presence of parallel substructures and target subsets which do not possess any difficulty inter-relationships, for which a hard layering approach is ill-suited. Even in such a case, the use of a curriculum does improve upon the successes of internal meta-feature selection. This echoes the phenomenon seen for the pyramidal DAG structures in Figure 5.14d.

To shed more light on the relationship between the two components of ALVISS it is helpful to return to the behaviour as training-set size varies. Figure 5.17 shows the test-set δ-MCC (relative to the baseline IC) for the addition problems. Importantly, the performance improvements from the curriculum and from meta-feature filtering each occur over different regions of training-set size. Additionally, the behaviour of the unablated method does not appear as a sum of the ablated performance differences; nor does it hug the two as would be expected from an either-or relationship. Instead the behaviour appears to be synergistic.

This apparent co-reinforcing behaviour is intuitive. An improved curriculum increases the likelihood of successful generalisation of earlier targets and thus the likelihood of producing meta-features useful to later targets. In turn, successful generation of these meta-features in training aids in determining the optimal next target in the curriculum. Thus the presence of each component increases the improvement provided by the other.

It is also worth noting in Figure 5.17 that the peak improvement from feature selection does not change, while the peak from curricula increases with problem complexity. This suggests that, with increasing depth, the effects due to the imposition of a curriculum may begin to dominate. This is supported by the fact that the quality of curricula was slightly worse in general for the curriculum-only ablated variant. The behaviours discussed are representative of the remaining problems, except for the pyramidal DAG and some LGSynth91 circuits where the curriculum-only variant underperformed IC. The synergistic behaviour was still apparent in these cases, as a sum-of-parts-or-less relationship would predict ALVISS under-performing the filtering-only variant, which was not observed.

Overall the two central pillars of ALVISS complement each other. Even when the curriculum-only variant reduced performance in isolation, the combined approach still improved upon the other variant. This suggests that a curriculum can aid in the generation of useful "stepping stone" meta-features even in situations where a curriculum alone does not appear warranted.

5.4 Scaling to Larger Problems

While the large improvements in generalisation seen in Section 5.2 are very promising, arguably more important is the large decrease in convergence time. Until this point I was limited, by the training time of all methods under consideration, to relatively small test-bed problems. The ALU-74181, for instance, while an interesting case and a commercial circuit, has long been obsolete. This section presents the results of testing ALVISS, alone, on increasingly large binary addition instances.

I trained FBNs using ALVISS for binary addition problems with up to 128 targets. This upper bound represents a much more modern range, intended to demonstrate the potential of ML approaches like this for use in direct logic synthesis.

[Figure 5.17 plot area: δ-MCC versus training set size for 4-bit to 8-bit addition; series IC, Full, Curriculum Only, Filtering Only; see caption below.]

Figure 5.17: Test set δ-MCC for the addition problems. This is included to show that each ablated variant improved generalisation over a different range of training set sizes. Additionally, the improvement from the unablated variant appears greater than the sum of its parts, suggesting that curriculum and meta-feature filtering reinforce each other. This helps to explain why ALVISS generalised better than either ablated variant, even in cases where one variant, used alone, hurt performance.

The 128-bit adder is, for instance, one of the circuits in the EPFL benchmark suite [207], released in 2015 as a challenging set of large combinational synthesis problems. Due to the poor scaling in training efficiency of the baseline methods, they were not included in this comparison. Baselines were trained for a similar amount of time up to the 48-bit instances but were unable to achieve even moderate accuracy on the training data; the resulting generalisation was, unsurprisingly, indistinguishable from random for all but the first few targets, and so the attempt was abandoned. There were occurrences of incomplete memorisation with ALVISS, which have been included in the results. The training set size ranges are given in Table 5.2.

Inputs    Training set size
16        [16, 256]
32        [16, 256]
48        [16, 256]
64        [64, 512]
96        [64, 512]
128       [256, 1024]

Table 5.2: Instance and training set sizes for the large adders. The number of targets and the example pool size are not given, as they are 2n_i and 2^{n_i} respectively for all problems.

[Figure 5.18 plot area: panels (a) Generalisation (MCC) and (b) Curriculum Quality (rank correlation) versus training set size for the 16-bit to 128-bit adders; see caption below.]

Figure 5.18: Test-set macro-MCC (a) and curriculum quality (b) for large-scale addition problems (16 up to 128 outputs). The number of runs per data-point was much lower for some of these curves, so the shaded region shows the standard deviation rather than a confidence interval. With fewer than 1000 training examples, full generalisation was achieved for the largest adder considered (128-bit). Likewise the corresponding curricula still converged to π∗ over the relevant range of training-set size. This is the first case of a feedforward structure generalising successfully on binary addition problems of this size.

Figure 5.18 shows the test-set MCC and curriculum quality (τπ∗) with respect to training set size for the larger addition instances. For the largest problem instance (128-bit addition), complete generalisation was achieved with under 10^3 training examples.

The discovered curricula converged on π∗ over the same range. Importantly, Figure 5.4a suggests linear growth in sample complexity with respect to target dimension, which was noticeably different for the IC baseline. This is likely due to the significantly different trends observed in Figure 5.3.

5.5 Conclusion

In Chapter 3 I examined methods for discovering and enforcing curricula a priori. In a sense, this is most akin to filter-style feature selection. A key drawback is that this approach is unable to benefit from insights gained during the learning process which might otherwise alter the optimal curricula, especially with respect to the later targets. The same can be said of the target-aware curricula in Chapter 4.

In this chapter I presented an approach called ALVISS, inspired by the successes in Chapters 3 and 4, which learns targets iteratively and updates its curriculum on the fly. ALVISS also applies feature filtering in-line, resulting in significant improvements in both convergence speed and generalisation. This is presented as a generalised principle which could potentially be applied to a variety of composite learners to improve performance on MTC problems.

I compared ALVISS with those state-of-the-art MLC methods from the literature from which an FBN could be produced as a final model. On all benchmark datasets, applying the ALVISS principles resulted in FBNs with significantly better generalisation than any of the considered baselines. In most cases the training efficiency was also improved, by several orders of magnitude in some cases, despite the requirement to solve more (and larger) minFS instances.

I also conducted an ablation study by removing the feature selection and curriculum application sub-steps of ALVISS. This identified that both aspects are necessary for the best performance and that the relationship between the two is synergistic in nature.

The dual generalisation and training efficiency improvements allowed me to scale the training of FBNs using ALVISS up to much larger instances. This resulted in consistent generalisation of a 128-bit adder from less than 1000 training examples. To the best of my knowledge this is the first time an instance of this size has been successfully learned with a feedforward model. This success hints at the possibility of directly synthesising useful combinational logic using very small training sets with an ALVISS-based approach.

5.5.1 Future Work

Sub-network outputs were excluded when ALVISS was used in experiments. I did this in order to isolate the impact of internal meta-features and target features. Nonetheless it is likely, given the performance of CC with target-aware curricula, that the inclusion of output features would further aid ALVISS.

In a similar vein, there are a range of options for considering different forms of feature selection within ALVISS. The key requirement is that there should be some measure (of the resulting feature subsets) that delineates relative target "difficulty" (herein cardinality). Given such a measure, an entirely different feature selection framework could be used. Additionally, there are sometimes multiple optimal feature sets. If the solver can generate several of these then other options present themselves. First is the distinction between required features (those in every solution) and interchangeable features (those in only some solutions). Required features can often be found by reduction rules without needing to enumerate all solutions [208, 209]. This distinction allows other modes of filtering meta-features, for example: hard-filtering only the required features, and soft-filtering interchangeable features by probabilistically weighting them for selection in the move generation stage of the optimisation loop (Section 2.5.2).

In Section 5.1.3 I noted tractability issues. As ALVISS has proven useful, these warrant addressing. The first, the NP-complete nature of minFS, has been partially addressed herein by the use of a heuristic solver; considering other feature selection methods would likely also be fruitful. In that same section I mentioned a second tractability concern: the number of feature selection stages grows quadratically with the target dimension. This is partially solvable through parallelism, as the instances at each stage are independent. If the feature selection procedure is capable of producing bounds then there is an additional option. Parallel cooperative feature selection routines could communicate bounds and partial solutions, thus allowing some solvers to terminate early upon determining that the solution for their target cannot be superior to another's.

The next step was to explore whether the lessons learned with FBNs are transferable. In the next chapter I present a proof of concept application of the ALVISS principles to deep neural networks.

Chapter 6

Application of ALVISS to Deep Neural Nets

This chapter presents proof of concept experiments applying ALVISS to feedforward neural nets. A feedforward neural network fits the requirements for ALVISS: deterministic, semi-discretisable, and composite. As such, I present experimental results in this chapter which demonstrate that ID-estimated curricula and the filtering of internal meta-features both enhanced the learning performance of deep neural networks trained on several of the benchmarks used in the previous chapters. Additionally this resulted in the first instance of near-perfect generalisation of a feedforward neural network on a large binary addition problem.

While problems like binary addition can be learned by a neural net, this has so far only been achieved at reasonable sizes using a recurrent network [176, 192]. Such an approach presupposes homogeneity in the target-level functions, as well as the target order. There is, however, no clear way to do this when the relationship between the targets is unknown or is more complex than a single chain, when the underlying functions are heterogeneous, or when the order of targets is unknown.

6.1 Feedforward Neural Networks

Feedforward neural networks are likely the most widely recognised learning models. They are a function approximation model which computes a, potentially multivariate, function of the inputs by stacking layers in an acyclic structure¹. The layers are themselves a simpler form of (typically non-linear) multivariate approximator, which are in turn made up of multiple single units.

¹ Recurrent architectures are beyond the scope of this chapter.

Thus a multilayer network works by learning a composition of layers of units. Most importantly for the purpose of this chapter: the units compute meta-features.

The computation for a neural network layer balances the simplicity of a linear transform with the reality that useful models are usually non-linear. This is achieved in a typical layer by applying an affine transformation followed by an element-wise non-linear activation function. The function the layer computes, given an input matrix X, is then:

H = g(XW + b),

where the weights W and bias b are tunable parameters defining the affine part of the transformation, and g is the activation function. As I mentioned before, this computation can be conceptually broken into single units, where each unit computes a single column of H. A unit corresponds to one column of W and the corresponding scalar bias from b. We can consider this column a meta-feature precisely because it takes a scalar value for each example (row) in X and is computed using multiple features (columns) of that example.

A network is constructed by passing the output of layers in as input to other layers, each with their own set of parameters. The possible structures are countless, but a simple first example is a chain of m layers. The first layer computes H^1 = g^1(XW^1 + b^1), the second layer computes H^2 = g^2(H^1 W^2 + b^2), and so on until the last layer, which computes Ŷ = g^m(H^{m−1} W^m + b^m). This last layer is called the output layer while all the others are typically called hidden layers. For feedforward models this structure, called the network architecture, can be abstracted as a DAG of layers, often called a computation graph. The choice of activation function for the output layer is constrained by the problem (since it must produce something resembling Y) but the hidden activations are, in principle, not. In practice, however, they are typically chosen from a small set of well-studied options and are often the same across many or all hidden layers [210]. Additionally, most optimisation methods rely on gradient descent, in which case the activations must be at least piecewise differentiable.

The activation functions are almost always fixed in advance and it is the affine parameters of each layer which are trained [1]. This is done, as it was for FBNs in Chapter 3, by tuning these parameters to minimise some loss function, L(Y, Ŷ), between the target and output matrices. Unlike in Chapter 3, this function needs to be differentiable, and the parameters need to be continuous. This is because one of the hallmarks of the neural network model is that the partial derivatives of the training loss with respect to each parameter are relatively easy to compute using reverse-mode Automatic Differentiation, most commonly in the form of the backpropagation algorithm [211]. The particulars are not important for applying ALVISS, and I treat the training process as a black box herein.

Classification requires binary outputs. A simple way to achieve this is to use a hard threshold as the final activation, but this is non-differentiable. A better approach is to use a smooth approximation in the form of a function with a sigmoidal shape. As ALVISS needs the internal units to be approximately binary to use the associated meta-features in minFS, I also used the same function for hidden activations. For various reasons, activation values with a non-zero mean pose a serious problem for effective training [212]. Thus, the Boolean domain is represented in this chapter by {−1, 1} instead of {0, 1} as in previous chapters. There is an increasing number of methods for training neural nets with binary weights and activations, or for converting continuous models to have binary internals [86, 164, 165]. I felt it important, however, to test the application of ALVISS in as general and unconstrained a setting as possible. If binarised internals were used, then any observed trends may only be valid in the context of that binarisation.
For this reason I did not employ these methods and instead used more standard architecture options.
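As a concrete, purely illustrative rendering of the stacked-layer computation described above, the following NumPy sketch evaluates a chain of affine-plus-activation layers; the function name, shapes, and random parameters are assumptions for the example, not code from the thesis.

```python
import numpy as np

def forward(X, params, g=np.tanh):
    """Evaluate a chain of layers: each applies H = g(HW + b) to the previous
    activations, with the final (W, b) pair acting as the output layer."""
    H = X
    for W, b in params:
        H = g(H @ W + b)
    return H

# Usage: a toy net with 8 inputs, two hidden layers of 16 tanh units, 3 outputs.
rng = np.random.default_rng(0)
shapes = [(8, 16), (16, 16), (16, 3)]
params = [(0.1 * rng.standard_normal(s), np.zeros(s[1])) for s in shapes]
Y_hat = forward(rng.standard_normal((32, 8)), params)   # 32 examples in, 32x3 out
```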

6.2 Meta-features and Target Curricula in NNs

ALVISS can be applied to a neural net model as follows. The sub-network used at each step consists of a single fully-connected hidden layer with a pseudo-binarising activation. Feature selection and curriculum generation are conducted as per Section 5.1.2 with no changes. When hidden nodes are used as minFS inputs they are first binarised by rounding. Each sub-network is trained in isolation from the remaining network on only the selected features; this freezes prior layers and improves efficiency.

The activities of hidden nodes were discretised before provision to the minFS solver, but not during training or testing. I briefly tested bipolar regularisation with a γ∑|1 − x||−1 − x| penalty term, but this helped neither ALVISS nor any baseline. I also considered hard-limiting the activities (by replacing the activation function), but this was not explored as it introduces the need for a two-phase training procedure and did not prove necessary for the successful application of ALVISS.

When training is complete the resulting sub-networks are connected together in the structure shown in Figure 6.1b. The layers labelled Hi and yi are simply the hidden and output layers, respectively, trained on each feature-reduced sub-instance.

Layers labelled Fi represent filtering layers which connect the sub-network inputs to their respective nodes in the overall input X and the previous hidden layer Hi−1. These filtering layers were implemented using a linear layer with the weight matrix given by the columns of the identity matrix indexed by the feature set solution for that particular target.
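To make this construction concrete, the sketch below builds such a non-trainable filter layer in Keras (the library used later in this chapter). The function name and exact API usage are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np
from keras.layers import Dense

def make_filter_layer(n_inputs, selected_features):
    """Non-trainable linear layer whose kernel is the columns of the identity
    matrix indexed by the chosen feature set, so it simply passes through the
    selected inputs/meta-features."""
    kernel = np.eye(n_inputs)[:, selected_features]           # shape (n_inputs, k)
    layer = Dense(len(selected_features), use_bias=False, trainable=False)
    layer.build((None, n_inputs))                              # instantiate weights
    layer.set_weights([kernel])                                # fix them to identity columns
    return layer
```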

[Figure 6.1 diagram: (a) Dovetail and (b) ALVISS architectures, composed of fully connected, concatenation, and filter layers (F1..Fn), cascading shared hidden layers (H1..Hn), and isolated output layers from input X to outputs Y; see caption below.]

Figure 6.1: The dovetailed architecture (a) and the final architecture (b) that results from applying ALVISS. Note that both use skip connections from all hidden layers to the output layers for each target; thus the order of targets affects information flow through the network. The structure in (b) also contains skip connections from the input to the filter layers, which implement the meta-feature filtering component of ALVISS.

6.3 Experiments

There is no state-of-the-art for training neural networks on multi-output digital circuits (Section 2.4). This is despite the recent interest in binarising various parameters [86, 164, 165], as well as the application of deep neural nets as surrogate models in EDA [37]. There are constructive methods which use a full behavioural description and thus are inapplicable, as they involve no training and are more akin to the traditional automated synthesis methods in Section 2.4 [213–215]. There are also methods which replicate basic logic gates using neurons which can implement partial truth tables [216]; however, it is unclear how to use the construction algorithm for multiple outputs.

There are some relevant problem-specific suggestions for architectural traits. Analytically derived bounds on sample complexity for linear threshold networks on single-target parity functions suggest increased modularity² in the architecture, but this result is not directly applicable [218]. Wilamowski et al. [219] similarly showed that, as in many problems, increasing depth reduced the minimum number of units required to correctly implement parity, but did not discuss the effect on the training process or generalisation.

² Modularity is a measure of the connectedness of a network [217]. High modularity means that units are connected to relatively few other units.

Due to the lack of existing approaches, and the differing suggestions above, I compared ALVISS to a range of architectures, given in Section 6.3.1. Section 6.3.2 then describes how relevant hyper-parameters were chosen, and finally Section 6.3.3 defines the performance metrics to be reported.

6.3.1 Architectures

I compared ALVISS with four different architectures chosen to span some of the space of possible approaches to cascading multi-target problems. The selected options present differing levels of depth and modularity and, in one case, the option to apply an a priori curriculum. The four structures are:

Wide a large single hidden layer,

IC an isolated sub-network for each target,

Deep a more typical fully connected network with several layers in depth (equal to the number of labels), and

Dovetailed similar to deep, but where the ith output is connected to the ith layer rather than the final layer.

The Dovetailed architecture (Figure 6.1a) was designed to emulate the ALVISS ethos as closely as possible while lacking the iteratively constructed curriculum and filtering stages. It uses connections analogous to skip connections [220] in a way that allows earlier targets to act as regularisers for the hidden layers shared with later targets. The order of the targets used in construction affects the depth and level of available transfer for each target, making this structure a candidate for examining the use of a priori curricula in addition to ALVISS. For this reason I used the dovetailed architecture with both random curricula and curricula generated using the method defined in Section 3.3.1. A sketch of one plausible realisation of this architecture is given below.
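The following Keras sketch shows one plausible reading of the dovetailed wiring, in which the i-th output head is attached at the i-th hidden layer via a concatenation of all hidden layers up to that depth. The exact skip-connection wiring, layer sizes, and function name are assumptions for illustration, not the thesis implementation.

```python
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

def build_dovetailed(n_inputs, n_targets, units_per_layer, activation="tanh"):
    """Hypothetical dovetailed net: one hidden layer per target, with the i-th
    output head fed by the hidden layers built so far (earlier, presumably
    easier, targets thereby regularise the shared lower layers)."""
    x = Input(shape=(n_inputs,))
    hidden, outputs = [], []
    h_in = x
    for i in range(n_targets):
        h = Dense(units_per_layer, activation=activation)(h_in)
        hidden.append(h)
        head_in = hidden[0] if i == 0 else Concatenate()(hidden[: i + 1])
        outputs.append(Dense(1, activation=activation, name="y%d" % i)(head_in))
        h_in = h                      # the next hidden layer stacks on this one
    return Model(inputs=x, outputs=outputs)
```

Applying an a priori curriculum to this structure amounts to permuting the targets (and hence the output heads) before construction.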

6.3.2 Hyper-parameter Selection

All networks or sub-networks were trained using the Adam optimiser with Nesterov momentum (Nadam) [221, 222] with squared hinge loss [223]. All nets were trained for 5000 epochs using a batch ratio of 0.5, using the Keras deep-learning library for Python [224]. All layers used the hyperbolic tangent (tanh) as their non-linearity and, excepting feature selection layers, all hidden layers were fully connected. Weights were initialised using uniform Glorot initialisation, as suggested for tanh non-linearities [225].

I allocated 40 × n_o hidden units, distributing them equally among all hidden layers.

Parameter           Considered                                        Chosen
optimiser           {SGD, RMSprop, Adadelta, Nadam}                   Nadam
batch ratio         [0.0 .. 1.0]                                      0.5
loss                {MAE, MSE, hinge, sq. hinge, bin. cross-ent.}     squared hinge
epochs              [1e3 .. 5e5]                                      5e3
num hidden units    [10n_o .. 120n_o]                                 40n_o

Table 6.1: Architecture and hyper-parameter search space, and the values found using random parameter search [206].

These choices and meta-parameters were decided using randomised search (as suggested by Bergstra and Bengio [206]) on 5-bit addition for the IC baseline, to achieve the best generalisation with the wide architecture. The options considered, and the chosen values, are given in Table 6.1.

Once again, I used a range of training set sizes (Tables 4.1 and 5.2), with 100 repetitions per training set size (except the alu2 and alu4 problems from the LGSynth91 suite, for which I only used 10 repetitions). Since I trained to a fixed epoch limit, rather than complete memorisation, I was able to complete the comparison with the baselines for the larger adders (16-bit and above). In this case I raised the epoch limit to 1e6 and reduced the number of repeats per training-set size to 10. Since these experiments examine a non-FBN model, and shallower architectures are considered, I opted to re-include the real-world models from Section 3.4.1. The problem set thus consisted of the full range of adders, the four structures with heterogeneous target functions from Section 4.4.1, the real-world models from Section 3.4.1, and the LGSynth91 benchmark subset from Section 4.4.1.
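Taken together, these settings can be expressed as a small Keras model builder; the sketch below assumes the "Wide" baseline and is illustrative only (the layer sizes and function name are not from the thesis code).

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Nadam

def build_wide(n_inputs, n_targets, units):
    """Single wide hidden layer with tanh units and Glorot-uniform weights,
    compiled with Nadam and squared hinge loss (targets encoded in {-1, 1})."""
    model = Sequential([
        Dense(units, activation="tanh", kernel_initializer="glorot_uniform",
              input_shape=(n_inputs,)),
        Dense(n_targets, activation="tanh", kernel_initializer="glorot_uniform"),
    ])
    model.compile(optimizer=Nadam(), loss="squared_hinge")
    return model
```

Training under the chosen settings would then use, for example, model.fit(X_train, Y_train, epochs=5000, batch_size=len(X_train) // 2) to reflect the 0.5 batch ratio.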

6.3.3 Metrics

Generalisation and Efficiency I report generalisation performance using test-set MCC, as in all prior chapters. There is no clear reference baseline, so I do not report it in differential form (δ-MCC). Since training was held to an iteration limit, rather than a convergence criterion, I did not examine training efficiency.

Curriculum Quality The quality of curricula resulting from the ALVISS process depends on the internal feature representation. It is therefore possible that curriculum discovery is affected by the choice of model. As such I report τπ∗ in the same manner as prior chapters.
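Assuming, as in earlier chapters, that τπ∗ is the Kendall rank correlation between the discovered ordering and π∗, it can be computed as in the following sketch (the function name is illustrative).

```python
from scipy.stats import kendalltau

def curriculum_quality(curriculum, optimal):
    """Kendall tau between a discovered target ordering and the known optimal
    curriculum: +1 for perfect agreement, -1 for a fully reversed ordering."""
    position = {t: i for i, t in enumerate(curriculum)}
    tau, _ = kendalltau([position[t] for t in optimal], range(len(optimal)))
    return tau
```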

Since the a priori curricula are model-agnostic, I do not report τπ∗ for the dovetailed architecture.

6.4 Results

In this section I present results obtained by using the ALVISS approach with a neural network rather than a Boolean network. For these tests the performance measure of concern was generalisation. As mentioned above, I do not present training time, since I used a fixed epoch limit. The results herein thus comprise only the cumulative macro-MCC for all benchmark datasets. I did not divide the addition benchmarks as in Chapter 5, since the epoch limit allowed for completion of the baselines. However, it should be noted that even when allotted substantial time, the baselines failed to fit the training or test data for any addition instance with more than 16 targets.

[Figure 6.2 plot area: macro-MCC panels (a) Adders and (b) Real models + LGSynth91; series ALVISS, IC, Wide, Deep, Dovetailed (curricula), Dovetailed (random); see caption below.]

Figure 6.2: Macro-MCC for adders and LGSynth91 datasets for the neural network proof-of-concept experiments. For all addition instances (a) ALVISS outperformed the baseline architectures and this performance gap increased with instance size. Above 16 targets, the baselines are not visible as they generalised no better than random (MCC = 0). These results are the first case of feedforward neural networks consistently generalising well on binary addition problems of this size. For the LGSynth91 benchmarks (b) there was greater variation in performance than seen in Chapter 5. In 7 of the 13 datasets ALVISS outperformed significantly, in three (mammalian, alu2, and pm1) it performed equal best, and in the remaining three (cu, f51m, and x2) it was outperformed by one or more baselines. Note that I conservatively consider f51m a loss despite the overlapping error bars.

For all addition instances (Figure 6.2a) the ALVISS approach outperformed all baseline architectures. As seen with FBNs in Chapter 5, this performance gap increased as the number of targets increased. The baseline architectures failed to fit the training data at all for instances with greater than 16 targets, and as such their generalisation performance was equivalent to random. For the LGSynth91 benchmarks (Figure 6.2b) we see more variation in performance, in the manner originally expected from the Boolean network experiments in Chapter 5. In 5 of the 10 datasets ALVISS generalised significantly better, in two (alu2 and pm1) it performed equal best, and in the remaining three it was outperformed by one or more baselines. Note that I conservatively consider f51m a loss despite the overlapping confidence intervals. Furthermore, ALVISS generalised better than the baselines on two of the three problems from Section 3.4.1 and equal best on the third (mammalian).

[Figure 6.3 plot area: macro-MCC panels (a) Direct Chain, (b) Indirect Chain, (c) Binary Tree, (d) Pyramid DAG; series ALVISS, IC, Wide, Deep, Dovetailed (curricula), Dovetailed (random); see caption below.]

Figure 6.3: Macro-MCC for heterogeneous datasets for the neural network proof-of-concept experiments. For the direct chain (a), indirect chain (b), and binary tree (c) structures ALVISS generalised better by a margin of significance and this margin increased with output dimension. For the pyramidal DAG structure (d) the IC architecture generalised better than ALVISS on both sizes, while the dovetailed architecture with curriculum applied was competitive for two of the smaller instances (7-A and 7-C). In every case, the use of a curriculum within the dovetailed architecture improved generalisation.

Figure 6.3 presents the results for the randomised heterogeneous datasets. ALVISS generalised better by a margin of significance on all instances from three of the four structures. Additionally, the magnitude of improvement increased with output dimension. On the pyramidal DAG structure ALVISS generalised comparably or marginally worse than the IC architecture.

While the dovetailed architecture was outperformed by others, the use of an a priori curriculum for ordering the hidden layers did improve performance. This was the case for all addition problems and all structured synthetic problems.

In 10 of the 13 cases in Figure 6.2b the dovetailed architecture with πID equalled or exceeded the generalisation of the same architecture with a random curriculum.

[Figure 6.4 plot area: rank correlation panels (a) Adders, (b) Direct Chain, (c) Indirect Chain, (d) Binary Tree, (e) Pyramid DAG; series ALVISS (NN), ALVISS (FBN), a priori; see caption below.]

Figure 6.4: Correlation (τπ∗) of curricula with π∗ for datasets with known optimal curricula. Curricula generated by ALVISS for both FBNs and neural networks were in high agreement. Both also maintained strong recovery of π∗ with increasing target dimension (a), long after the a priori methods failed. As with FBNs, the difference in the structure of inter-target relationships affects curriculum recovery. Using neural networks degraded the curricula slightly on both chained structures (b, c), and improved them slightly on the pyramidal DAG structure (e). The procedures did not vary significantly on any instance of the binary tree structure (d).

ALVISS generated curricula of comparable quality when using a neural net base learner to those it generated with FBNs in Chapter 5. Figure 6.4 shows the correlation (τπ∗) of generated curricula with π∗ for the adders and synthetic benchmarks (including the ALVISS results from Chapter 5 for comparison). Despite the use of a learner with continuous parameters, and a very different training approach, the curricula generated by ALVISS were of similar quality to the results in Chapter 5. Additionally, the method maintains strong recovery of π∗ as target dimension is increased, long after the a priori methods fail. As with FBNs, the relative quality of the a priori and ALVISS curricula differs with the structure of inter-target relationships. Looking at the synthetic benchmarks (Figures 6.4b-6.4e) we can see that the use of a neural network degraded the curricula slightly on both chained structures, and improved them slightly on the pyramidal DAG structure.

6.5 Discussion and Conclusions

Across the majority of the considered problems ALVISS, when applied to a neural net base-learner, improved generalisation over a range of architectures and maintained strong recovery of known optimal curricula. In cases where the target dimension was increased, ALVISS showed significantly better retention of generalisation, reflecting the results with FBNs in Chapter 5. The results were less stark than seen there, however, with one of the four synthetic structures appearing equally suited by a shallow, modular architecture using no curriculum, and with ALVISS performing worse than baselines on a minority of LGSynth91 circuits. Whether the improvement would be larger in the context of binarised neural networks (as it was for the purely binary FBN model) is an interesting question for future research.

Additionally, the results with the dovetailed architecture demonstrate a second successful implementation of target curricula. While this architecture performed worse in general than flatter ones, it performed notably better with a curriculum than without. Additionally, this architecture recovered some of the performance lost in blindly applying a deep architecture, even without a curriculum. A plausible explanation is that, even for random curricula, the skip connections used mean that, on average, some simpler targets will still regularise some of the earlier layers for more complex targets, albeit in a suboptimal fashion.

The most important results were obtained with the larger adders. For the first time, large scale cascading logic synthesis problems have been trained with high (and sometimes complete) generalisation. Crucially, this was achieved with relatively small sample sizes. This is promising, as all other approaches, using either Boolean or neural nets, saw performance diminish rapidly beyond a handful of targets.

It is worth discussing one other approach from the literature for training neural networks on binary addition problems, in order to clarify some important differences, as the two may seem superficially similar. The approach in question involves teaching recurrent architectures to perform addition, and similar algorithmic procedures, as a sequence mapping task [176, 192, 226]. While effective, this problem formulation sidesteps an important aspect of the problem as it stands in this work, which is the determination of the order of the targets. By unrolling the problem into a sequential one, the crucial target curriculum information is given explicitly in the problem formulation. Additionally, it is unclear how recurrent architectures fare when the function differs from target to target (as with the problems from Section 4.4.1). In fairness, these are not issues sequence-to-sequence modelling attempts to address. The focus of that work is on the deliberate teaching of sequential algorithms, and it has been remarkable in its success. The kind of generalisation the above-mentioned approach is aiming for is different too. Instead of training on an incomplete set of examples and hoping the network 'fills in the gaps', the sequential approach presents all examples up to a certain length and evaluates the generalisation of the network on longer patterns. As such, to date no approach has successfully inferred a large binary adder without providing the curriculum implicit in the problem formulation.
To my knowledge, the results herein are the first case of feedforward architectures consistently generalising well on large addition problems with up to 128-bit operands. This was achieved with remarkably few examples, which is important given the usual assumptions regarding the amount of data required for training neural nets, as it shows that appropriate teaching techniques can radically alter sample complexity.

I observed that breaking the training up into fixed stages, as ALVISS does, slightly increased the computational cost of training. This is because it negates some of the optimisations, within the library framework, that are tailored toward evaluating complete computational graphs on large amounts of data. There are different requirements in embedded domains, where memory, compute, and power limitations, combined with a lack of specialised hardware such as GPUs, mean that piecewise procedures may be more efficient [39, 86]. This, combined with the fact that the application of ALVISS significantly reduced sample complexity in some cases, means that ALVISS may be ideally placed to aid in embedded and mobile learning applications.

Overall the proof-of-concept was successful. The use of curricula combined with explicit filtering of internal meta-features shows promise for neural network learning. This invites questions both of the potential for wider applicability, and of whether more specialised neural network models (binarised neural nets, for instance) may be even better suited.

Chapter 7

Conclusion and future work

Here I summarise the key findings presented in this thesis, note the final contributions and suggest suitable directions for future work.

7.1 Conclusions and Contributions

The initial steps taken in Chapter 3 were to evaluate the potential of target curricula. This was done by considering several problems for which a possible difficulty order existed. I showed in Section 3.1 that using hierarchical losses to enforce this candidate curriculum resulted in FBNs with better generalisation. The effect was more pronounced for losses which more strictly enforced that curriculum and for problem instances with greater output dimension. Following this, I suggested that the generalisation gains were the result of a transformation of the distribution of returned hypotheses, as the losses used do not alter the set of feasible hypotheses. Analysis with varying sample size showed that target curricula are ideal in a sample-limited regime.

The next step was to examine the space of curricula in Section 3.2. The association between the rank correlation (with the easy-to-hard curriculum) of a random curriculum and generalisation performance was strong and positive (Figure 3.7). Those curricula which were anti-correlated with the easy-to-hard permutation typically performed worse than using no curriculum, and curricula which were uncorrelated (essentially random) presented mixed results. Importantly, the effect size increased with increasing target dimension. The relationship was near linear, lacking any sharp transitions, which is fortunate as practical methods may only approximate the ideal curricula. These results highlighted the importance of finding an effective method for building target curricula.

The last contribution of Chapter 3 was a simple basis for constructing efficacious target curricula for Boolean MTCs (Section 3.3.1). The intuition was to use ID as a complexity measure, and a minFS solver to produce estimates of the ID for each target. I showed that this method generated curricula in close agreement with the optimal curricula, and that these improved generalisation performance by almost as much. Finally, I tested the combination of ID-estimated curricula with hierarchical losses on real-world models and time-series datasets, with positive results.

Chapter 4 is a case study in the use of target curricula for treating the ordering problem that arises from a Classifier Chain transformation. Using the ID-curricula from Chapter 3 as a chain order increased the generalisation of FBN-based CCs over using a random chain order, and over comparable methods selected from the literature. I also showed these chain orderings to be more stable than prior methods for circuit synthesis problems. I then described a target-aware extension to the ID estimation procedure, specifically for CCs. This extension works by including target features as additional input features in the minFS instances and produces a curriculum using a greedy heuristic. I showed that target awareness further improved generalisation and curriculum quality in the majority of cases. It also partially rectified a weakness of the target-agnostic method on higher-order targets, which had been observed in Chapter 3.

The final contribution of Chapter 4 is the set of synthetic benchmark construction techniques defined in Section 4.4.1. These produce randomised MTC problems with differing target hierarchies that can easily be scaled in depth, making them helpful for future study of target curricula. On these problems both ID-based curriculum methods outperformed the reference chain-ordering algorithms; similar successes were seen on a subset of circuits from the LGSynth91 logic synthesis benchmark suite. Combined, the results of Chapter 4 add to the evidence in support of the much debated importance of chain order in CCs, using a previously unstudied, highly non-linear base classifier.

The final chapter focusing on FBNs, Chapter 5, extends the previous curriculum methods. In Section 5.1.2 I define a method, called Adaptive Learning Via Iterated Selection and Scheduling (ALVISS), which trains targets iteratively and updates its curricula during training by augmenting the minFS instances with meta-features constructed inside the models themselves. This allows the curriculum to be sensitive to information generated by the training process and allows filtering these meta-features in-line. In the experimental results, ALVISS significantly improved both the convergence speed and generalisation of FBNs over applicable MTC methods from the literature. Subsequent ablative studies (Section 5.3) showed the combination of the curriculum and meta-feature filtering to be necessary for the best performance.

Due to the large gains in both generalisation and training efficiency, scaling to much larger instances became possible with ALVISS. In Section 5.4, I presented consistent generalisation of FBNs on a 128-bit addition problem from under 1000 training examples (less than 1e-74 of the full truth table size); the first case of an instance this size being successfully learned with a feedforward model. This success opens the possibility of directly synthesising useful combinational logic from very small training sets by using an ALVISS-based approach, one of the primary research goals of the thesis.

Finally, in Chapter 6 I presented proof of concept experiments applying ALVISS to feedforward neural nets.
The process of adapting ALVISS to a model with continuous parameters proved straightforward and was feasible in the most general case, requiring no permanent binarisation of network parameters. In comparison to several architectures, ALVISS yielded the best generalisation on most problems considered. ALVISS also maintained strong recovery of known optimal curricula, comparable to that seen with FBNs in Chapter 5. Most importantly, ALVISS showed significantly better retention of generalisation as target dimension was increased, in line with previous results for FBNs. As in Chapter 5, I showed the first instance of near-perfect generalisation of a feedforward neural network on large binary addition problems. This was achieved with training sets of the same size as in Chapter 5, an important result given the current emphasis on ever more data for training neural nets. Chapter 6 thus shows that appropriate teaching techniques can radically alter the sample complexity of certain learning tasks for deep neural nets. This reduction in sample complexity, combined with its step-wise nature, makes ALVISS ideally suited to help expand the use of neural networks in embedded and mobile environments.

ALVISS is not a specific algorithm but rather a general principle which can be applied to a variety of composite models for MTC, in much the same way as ordering targets by ID. The details may vary, but the core principles of using internal meta-features to build a self-paced curriculum, and of reducing sub-instances by filtering those meta-features, are translatable; the evidence of this is in the results of Chapter 6.

Two overall questions summarise this work well: "is target ordering important?" and "is explicitly finding meta-features valuable?". The answer to both is: "in some cases, very much so". The degree to which this is true is clearly domain-, problem-, and even implementation-dependent, but I have shown instances where even reasonable training-set accuracy was only obtainable after the application of both a curriculum of targets and meta-feature filtering. Positive outcomes using both FBNs and DNNs suggest potential generalisability warranting further study.

In addition to providing several contributions, the work conducted has identified a number of interesting questions for future research, which I outline next.

7.2 Suggestions for Future Work

In mapping the direction decisions made during this research I have already outlined some potential future work in the chapter conclusions. These I reiterate here, along with additional directions that arise from considering the results in their entirety.

Possibly the most important thrust lies in exploring the applications to representation learning. A complex interplay appears to exist between curriculum building and the generation/identification of useful meta-features. At present, representation learning has sought to replace expert feature engineering through largely indirect approaches. The stark difference in efficacy of ALVISS over comparable baselines and over the a priori curriculum methods (Chapters 5 and 6) suggests that a more explicit treatment of meta-features during training may be worth considering. The remaining suggestions presented below are broken into three categories: those related to feature selection, domain extension, and benchmark development.

7.2.1 Feature Selection

Some options exist for reducing the computational cost of the feature selection phase. In all ID estimation regimes herein there is unexploited potential for parallelism, as there are typically several instances to solve at a time. Moreover, the minFS instances in a given problem are not at all independent, and tandem solvers can inform each other. One immediate option for branch-and-bound techniques is to communicate bounds between instances in iterative cases where only one target is to be selected (such as the target-aware and ALVISS methods). Once the lower bound for a target instance grows above another's incumbent solution, the first search can be terminated early, as that target will not be selected anyway. The most likely benefit will be improved efficiency but, in the case of inexact solvers, solution quality may also be improved.

Cardinality alone as a choice metric is not necessarily best. Perhaps entropic measures could be used to boost performance by providing tie-breakers. Cases may also exist where the target most likely to generalise well is not the one with the smallest ID. While the results herein suggest this is unlikely for logic synthesis, the likelihood of this occurrence in other domains would dictate the importance of examining other metrics. The results of Chapter 4, however, suggest that purely entropic metrics may be unsuitable for MTC problems that do not meet the normal MLC assumptions.

Another current limitation is scalability with respect to training set size. While exact feature selection has shown merit in small-sample logic synthesis, scaling to large data will face issues of quadratic time and space complexity w.r.t. the number of examples, and intractability w.r.t. input dimension. Intuitively though, a learner which successfully produces a feature-sparse solution is implicitly estimating ID, and the issue of internal feature selection may not be entirely ignorable. Thus an important future question is whether the existing scalable feature selection procedures suffice for ID-estimation or, failing that, whether there are natural approximations to the feature selection problem itself under relevant conditions.

If so, ID-estimation and the principles underpinning ALVISS may offer the tools to address sparsity concerns and improve generalisation in multi-target domains.

In its current form ALVISS induces a chain structure; there are thus structural extensions to ALVISS, inspired by similar extensions to CCs [177, 178], that are worth considering. One is to cluster targets by some similarity metric, then only apply ALVISS within clusters. The converse option is to learn groups of related targets which are equivalent in difficulty within a single combined learner, while using multi-label feature selection methods for filtering. Such methods may improve the robustness of ALVISS to noise and curriculum errors, and also improve the treatment of unrelated target groups within the same dataset.

Noise has not been considered as a factor in this thesis. It will be important to model how noise affects minFS solutions as well as the curricula derived from them. Future work should address this and perhaps define a noise-tolerant variant of minFS. Possible considerations include allowing l examples to flip value, or l pairs to not be separated. The important question is how these options translate into coverage/constraints, and how much ID resolving power is lost. Alternatively, a robustness-oriented generalisation of the k-Feature-Set (kFS) problem exists, called the (α-β)-k-Feature Set problem [227]. Variants of this generalisation may provide noise tolerance at the cost of increasing feature set sizes.

In addition to building resilience to noise, there are other reasons to consider different combinatorial feature-selection criteria. While minFS offers an exact lower bound to the ID of a function, in some domains other measures of target "difficulty" may be more relevant, and may warrant different feature set objectives and constraints [228]. Charikar et al. [131] outline a number of combinatorial feature selection frameworks, and there are additional options such as exact mutual information [124]. Careful consideration of the differences in computational complexity of various feature selection formulations under different constraints will likely be essential [229].

7.2.2 Domain Extension

The key to expanding the valid domains for ID-based target curricula will lie in finding a relaxation of the minimum-feature-set definition which allows continuous input variables. The work in Chapter 3, for example, may be extended to gradient-based learners by constructing differentiable analogues of L_lh and L_gh. Additionally, Chapter 6 showed the potential of ALVISS even when the model internals are continuous. Thus applying a curriculum in the presence of non-binary features is simple in comparison to building one. A heuristic approach is possible, as used in Chapter 6, but a more rigorous consideration may be fruitful.

The main problem with defining a continuous relaxation of the minFS problem is specifying how any given feature differentiates, or explains, an example pair. For example, when working with neural nets in Chapter 6 a possible extension to the minFS definition presented itself: allowing “undefined” input feature values. When considering an example pair with different target values, a feature would not be considered to separate the pair if its value for either example was undefined. When using a soft binarisation such as tanh, this undefined value can be assigned to activation values in the transition region (with pre-activations close to zero); a minimal illustration is sketched below. In principle such an extension may provide a starting point for expanding the applicability of feature-selection based curriculum learning.

I consider the most promising direction to be investigating Many Valued Logic (MVL) for multi-class targets. The minFS definition as it stands is theoretically capable of handling multiple discrete levels for target and input features, but practical use requires further study. Lessons learned in applying the methods in this thesis to MVL would help in two ways: firstly, they would further the contributions to the field of logic synthesis and, secondly, they would serve as a jumping-off point for continuous-valued extensions.
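The following is a minimal sketch of the undefined-value rule described above; the transition band of 0.25 and the helper names are illustrative assumptions, not definitions used elsewhere in this thesis.

```python
# Illustrative sketch: treating near-zero tanh activations as "undefined"
# when deciding whether a feature separates an example pair.
def soft_binarise(a, band=0.25):
    """Map a tanh activation to +1 or -1, or None ("undefined") when it lies
    inside the transition band around zero."""
    if abs(a) < band:
        return None
    return 1 if a > 0 else -1


def feature_separates(a_i, a_j, band=0.25):
    """A feature separates a differing example pair only when both of its
    activation values are defined and take opposite binarised signs."""
    b_i, b_j = soft_binarise(a_i, band), soft_binarise(a_j, band)
    return b_i is not None and b_j is not None and b_i != b_j


def coverage_sets(activations, targets, band=0.25):
    """For each feature, the set of differing-target pairs it separates under
    the undefined-value rule; these sets would define the relaxed minFS
    instance."""
    n_examples, n_features = len(activations), len(activations[0])
    cover = {f: set() for f in range(n_features)}
    for i in range(n_examples):
        for j in range(i + 1, n_examples):
            if targets[i] == targets[j]:
                continue
            for f in range(n_features):
                if feature_separates(activations[i][f], activations[j][f], band):
                    cover[f].add((i, j))
    return cover
```

Under this rule an ambiguous activation simply contributes no coverage, so a feature that is only “sometimes defined” must justify its inclusion through the pairs it separates cleanly.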

7.2.3 Benchmarks

In Chapter 4, I presented synthetic dataset generation methods which I believe will be useful for future studies of target curricula. While these already address calls for further MTC benchmarks [65], there are additional, natural extensions to explore. The most apparent is to increase the branching factor, as all cases herein used a branching factor of 2. Other interesting modifications include disjoint substructures and restrictions on the class of node-level functions. Knowing how each of these aspects affects the ability to infer curricula, and the effectiveness of those curricula, may be especially useful in domains possessing certain constraints. In GRN inference, for example, certain function classes are believed to dominate [200]; knowing whether target curriculum methods are effective on synthetic structures built from these classes would therefore assist the decision-making of future researchers and practitioners.

The overarching goal of this thesis was to ascertain the utility of target curricula in the context of a machine learning approach to logic synthesis. The sub-goals laid out in Chapter 1 were achieved; throughout this process, additional important questions were raised and several compelling avenues for further work have been identified above. The work presented here will serve as a strong foundation for that future work.

Bibliography

[1] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. MIT Press, 2016, http://www. deeplearningbook.org. [2] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M. P. Lungren and A. Y. Ng, ‘CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning’, arXiv preprint, 14th Nov. 2017. arXiv: 1711.05225. [3] P. Baldi, P. Sadowski and D. Whiteson, ‘Searching for Exotic Particles in High-Energy Physics with Deep Learning’, Nature Communications, vol. 5, no. 1, p. 4308, Dec. 2014, ISSN: 2041-1723. DOI: 10.1038/ncomms5308. [4] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes and J. Dean, ‘Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation’, Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 1st Dec. 2017. DOI: 10.1162/tacl_a_ 00065. [5] W. Waegeman, K. Dembczyński and E. Hüllermeier, ‘Multi-target prediction: A unifying view on problems and methods’, Data Mining and Knowledge Discovery, vol. 33, no. 2, pp. 293–324, 1st Mar. 2019, ISSN: 1573-756X. DOI: 10.1007/s10618-018-0595-5. [6] J. Read, C. Bielza and P. Larrañaga, ‘Multi-Dimensional Classification with Super-Classes’, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1720–1733, Jul. 2014, ISSN: 1041-4347. DOI: 10.1109/TKDE.2013.167. [7] H. Borchani, G. Varando, C. Bielza and P. Larrañaga, ‘A survey on multi-output regres- sion’, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015, ISSN: 1942-4795. DOI: 10.1002/widm.1157. [8] B. F. Skinner, ‘Reinforcement today’, American Psychologist, vol. 13, no. 3, pp. 94–99, 1958, ISSN: 1935-990X(Electronic),0003-066X(Print). DOI: 10.1037/h0049039. [9] F. Khan, B. Mutlu and X. Zhu, ‘How Do Humans Teach: On Curriculum Learning and Teaching Dimension’, in Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 1449–1457.

135 [10] L. B. Smith, S. Jayaraman, E. Clerkin and C. Yu, ‘The Developing Infant Creates a Cur- riculum for Statistical Learning’, Trends in Cognitive Sciences, vol. 22, no. 4, pp. 325–336, 1st Apr. 2018, ISSN: 1364-6613. DOI: 10.1016/j.tics.2018.02.004. [11] K. A. Krueger and P.Dayan, ‘Flexible shaping: How learning in small steps helps’, Cognition, vol. 110, no. 3, pp. 380–394, 1st Mar. 2009, ISSN: 0010-0277. DOI: 10.1016/j.cognition. 2008.11.014. [12] Y. Bengio, J. Louradour, R. Collobert and J. Weston, ‘Curriculum learning’, in Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 41–48, ISBN: 978-1-60558-516-1. DOI: 10.1145/1553374.1553380. [13] M. Sachan and E. P. Xing, ‘Easy questions first? a case study on curriculum learning for question answering’, in Proceedings of the Annual Meeting of the Association for Compu- tational Linguistics, 2016. [14] A. Graves, M. G. Bellemare, J. Menick, R. Munos and K. Kavukcuoglu, ‘Automated Curricu- lum Learning for Neural Networks’, in Proceedings of the 34th International Conference on Machine Learning, (Sydney, NSW, Australia), vol. 70, JMLR.org, 2017, pp. 1311–1320. [15] D. Weinshall and D. Amir, ‘Theory of Curriculum Learning, with Convex Loss Functions’, arXiv preprint, 9th Dec. 2018. arXiv: 1812.03472. [16] V. I. Spitkovsky, H. Alshawi and D. Jurafsky, ‘Baby Steps: How “Less is More” in unsuper- vised dependency parsing’, in NIPS: Grammar Induction, Representation of Language and Language Learning, 2009. [17] L. Jiang, D. Meng, Q. Zhao, S. Shan and A. G. Hauptmann, ‘Self-paced Curriculum Learn- ing’, in Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, Texas: AAAI Press, 2015, pp. 2694–2700, ISBN: 978-0-262-51129-2. [18] Y. Fan, R. He, J. Liang and B.-G. Hu, ‘Self-Paced Learning: An Implicit Regularization Perspective’, in Thirty-First AAAI Conference on Artificial Intelligence, 1st Jun. 2016. arXiv: 1606.00128. [19] M. Zięba, J. M. Tomczak and J. Świątek, ‘Self-paced Learning for Imbalanced Data’, in Intelligent Information and Database Systems, ser. Lecture Notes in Computer Science 9621, N. T. Nguyen, B. Trawiński, H. Fujita and T.-P. Hong, Eds., Springer Berlin Heidelberg, 14th Mar. 2016, pp. 564–573, ISBN: 978-3-662-49380-9 978-3-662-49381-6. DOI: 10.1007/ 978-3-662-49381-6_54. [20] Ç. Gülçehre and Y. Bengio, ‘Knowledge matters: Importance of prior information for optimization’, Journal of Machine Learning Research, vol. 17, no. 8, pp. 1–32, 2016. [21] W. Zaremba and I. Sutskever, ‘Reinforcement Learning Neural Turing Machines - Revised’, arXiv preprint, 4th May 2015. arXiv: 1505.00521. [22] A. B. Kahng, ‘New Directions for Learning-based IC Design Tools and Methodologies’, in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, (Jeju, Republic of Korea), Piscataway, NJ, USA: IEEE Press, 2018, pp. 405–410.

136 [23] E. Testa, M. Soeken, L. G. Amar and G. D. Micheli, ‘Logic Synthesis for Established and Emerging Computing’, Proceedings of the IEEE, vol. 107, no. 1, pp. 165–184, Jan. 2019, ISSN: 0018-9219. DOI: 10.1109/JPROC.2018.2869760. [24] A. Goñi-Moreno, ‘On genetic logic circuits: Forcing digital electronics standards?’, Memetic Computing, vol. 6, no. 3, pp. 149–155, 1st Sep. 2014, ISSN: 1865-9292. DOI: 10.1007/ s12293-014-0136-8. [25] J.-H. ( Jiang and S. Devadas, ‘Logic synthesis in a nutshell’, in Electronic Design Automa- tion, L.-T. Wang, Y.-W. Chang and K.-T. ( Cheng, Eds., Boston: Morgan Kaufmann, 2009, pp. 299–404, ISBN: 978-0-12-374364-0. DOI: 10.1016/B978-0-12-374364-0.50013- 8. [26] O. Coudert, ‘On solving covering problems [logic synthesis]’, in 33rd Design Automation Conference Proceedings, Jun. 1996, pp. 197–202. DOI: 10.1109/DAC.1996.545572. [27] C. Umans, T. Villa and A. L. Sangiovanni-Vincentelli, ‘Complexity of two-level logic mini- mization’, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 7, pp. 1230–1246, Jul. 2006, ISSN: 0278-0070. DOI: 10.1109/TCAD.2005. 855944. [28] S. M. Cheang, K. H. Lee and K. S. Leung, ‘Applying Genetic Parallel Programming to Synthesize Combinational Logic Circuits’, IEEE Transactions on , vol. 11, no. 4, pp. 503–520, Aug. 2007, ISSN: 1089-778X. DOI: 10.1109/TEVC.2006. 884044. [29] J. Cortadella, M. Galceran-Oms, M. Kishinevsky and S. S. Sapatnekar, ‘RTL Synthesis: From Logic Synthesis to Automatic Pipelining’, Proceedings of the IEEE, vol. 103, no. 11, pp. 2061–2075, Nov. 2015, ISSN: 0018-9219. DOI: 10.1109/JPROC.2015.2456189. [30] C. Vijayakumari, P. Mythili, R. K. James and C. A. Kumar, ‘Genetic Algorithm Based De- sign of Combinational Logic Circuits Using Universal Logic Modules’, in Proceedings of the International Conference on Information and Communication Technologies, vol. 46, Bolgatty Palace & Island Resort, Kochi, India, 3rd–5th Dec. 2014, pp. 1246–1253. DOI: 10.1016/j.procs.2015.01.041. [31] L. Amarù, E. Testa, M. Couceiro, O. Zografos, G. D. Micheli and M. Soeken, ‘Majority Logic Synthesis’, in IEEE/ACM International Conference on Computer-Aided Design, Nov. 2018, pp. 1–6. DOI: 10.1145/3240765.3267501. [32] M. Soeken, H. Riener, W. Haaswijk and G. De Micheli, ‘The EPFL Logic Synthesis Libraries’, in International Workshop on Logic Synthesis, 14th May 2018. arXiv: 1805.05121. [33] B. C. Schafer and K. Wakabayashi, ‘Machine learning predictive modelling high-level syn- thesis design space exploration’, IET Computers Digital Techniques, vol. 6, no. 3, pp. 153– 159, May 2012, ISSN: 1751-8601. DOI: 10.1049/iet-cdt.2011.0115.

137 [34] H.-Y. Liu and L. P. Carloni, ‘On learning-based methods for design-space exploration with High-Level Synthesis’, in 50th ACM/EDAC/IEEE Design Automation Conference, May 2013, pp. 1–7. [35] S. Nath, ‘New Applications of Learning-Based Modeling in Nanoscale Integrated-Circuit Design’, Ph.D. University of California, San Diego, United States – California, 2016, 294 pp. [36] W. Qi, ‘IC Design Analysis, Optimization and Reuse via Machine Learning’, Ph.D. North Carolina State University, United States – North Carolina, 2017, 179 pp. [37] W. Haaswijk, E. Collins, B. Seguin, M. Soeken, F. Kaplan, S. Süsstrunk and G. D. Micheli, ‘Deep Learning for Logic Optimization Algorithms’, in IEEE International Symposium on Circuits and Systems, May 2018, pp. 1–4. DOI: 10.1109/ISCAS.2018.8351885. [38] M. Rusci, L. Cavigelli and L. Benini, ‘Design Automation for Binarized Neural Networks: A Quantum Leap Opportunity?’, in IEEE International Symposium on Circuits and Systems, May 2018, pp. 1–5. DOI: 10.1109/ISCAS.2018.8351807. [39] C.-C. Chi and J.-H. R. Jiang, ‘Logic Synthesis of Binarized Neural Networks for Efficient Circuit Implementation’, in Proceedings of the International Conference on Computer- Aided Design, New York, NY, USA: ACM, 2018, 84:1–84:7, ISBN: 978-1-4503-5950-4. DOI: 10.1145/3240765.3240822. [40] L. Rokach, M. Kalech, G. Provan and A. Feldman, ‘Machine-learning-based Circuit Syn- thesis’, in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China: AAAI Press, 2013, pp. 1635–1641, ISBN: 978-1-57735-633-2. [41] G. Rematska and N. G. Bourbakis, ‘A survey on reverse engineering of technical diagrams’, in 7th International Conference on Information, Intelligence, Systems Applications, Jul. 2016, pp. 1–8. DOI: 10.1109/IISA.2016.7785372. [42] B. Moradi and A. Mirzaei, ‘A New Automated Design Method Based on Machine Learning for CMOS Analog Circuits’, International Journal of Electronics, vol. 103, no. 11, pp. 1868– 1881, 1st Nov. 2016, ISSN: 0020-7217. DOI: 10.1080/00207217.2016.1138538. [43] S. Hashemi, H. Tann and S. Reda, ‘BLASYS: Approximate Logic Synthesis Using Boolean Matrix Factorization’, in 55th ACM/ESDA/IEEE Design Automation Conference, Jun. 2018, pp. 1–6. DOI: 10.1109/DAC.2018.8465702. [44] R. Caruana, ‘Multitask learning’, Machine learning, vol. 28, no. 1, pp. 41–75, 1997. [45] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan and S. Wermter, ‘Continual lifelong learning with neural networks: A review’, Neural Networks, vol. 113, pp. 54–71, 1st May 2019, ISSN: 0893-6080. DOI: 10.1016/j.neunet.2019.01.012. [46] G. Kumar, G. Foster, C. Cherry and M. Krikun, ‘Reinforcement Learning based Curriculum Optimization for Neural Machine Translation’, arXiv preprint, 28th Feb. 2019. arXiv: 1903. 00041.

138 [47] Y. S. Abu-Mostafa, M. Magdon-Ismail and H.-T. Lin, Learning from Data. New York, NY: AMLBook, 2012, vol. 4, 213 pp., ISBN: 978-1-60049-006-4. [48] K. Dembczyński, W. Waegeman, W. Cheng and E. Hüllermeier, ‘On label dependence and loss minimization in multi-label classification’, Machine Learning, vol. 88, no. 1, pp. 5–45, 1st Jul. 2012, ISSN: 1573-0565. DOI: 10.1007/s10994-012-5285-8. [49] E. Montañes, R. Senge, J. Barranquero, J. Ramón Quevedo, J. José del Coz and E. Hüller- meier, ‘Dependent binary relevance models for multi-label classification’, Pattern Recog- nition, Handwriting Recognition and Other PR Applications, vol. 47, no. 3, pp. 1494–1508, 1st Mar. 2014, ISSN: 0031-3203. DOI: 10.1016/j.patcog.2013.09.029. [50] J. Read, L. Martino and D. Luengo, ‘Efficient Monte Carlo Methods for Multi-Dimensional Learning with Classifier Chains’, Pattern Recognition, vol. 47, no. 3, pp. 1535–1546, Mar. 2014, ISSN: 00313203. DOI: 10.1016/j.patcog.2013.10.006. arXiv: 1211.2190. [51] E. Gibaja and S. Ventura, ‘A Tutorial on Multilabel Learning’, ACM Computing Surveys, vol. 47, no. 3, 52:1–52:38, Apr. 2015, ISSN: 0360-0300. DOI: 10.1145/2716262. [52] D. Chicco, ‘Ten quick tips for machine learning in computational biology’, BioData Mining, vol. 10, 8th Dec. 2017, ISSN: 1756-0381. DOI: 10.1186/s13040-017-0155-3. [53] T. Fawcett, ‘An introduction to ROC analysis’, Pattern Recognition Letters, ROC Analysis in Pattern Recognition, vol. 27, no. 8, pp. 861–874, 1st Jun. 2006, ISSN: 0167-8655. DOI: 10.1016/j.patrec.2005.10.010. [54] D. Brzezinski, J. Stefanowski, R. Susmaga and I. Szczch, ‘Visual-based analysis of clas- sification measures and their properties for class imbalanced problems’, Information Sciences, vol. 462, pp. 242–261, 1st Sep. 2018, ISSN: 0020-0255. DOI: 10.1016/j.ins. 2018.06.020. [55] B. W. Matthews, ‘Comparison of the predicted and observed secondary structure of T4 phage lysozyme’, Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 20th Oct. 1975, ISSN: 0005-2795. DOI: 10.1016/0005-2795(75)90109-9. [56] R. G. Pontius Jr. and M. Millones, ‘Death to Kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment’, International Journal of Remote Sens- ing, vol. 32, no. 15, pp. 4407–4429, 10th Aug. 2011, ISSN: 0143-1161. DOI: 10.1080/ 01431161.2011.552923. [57] S. Boughorbel, F. Jarray and M. El-Anbari, ‘Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric’, PLoS ONE, vol. 12, no. 6, e0177678, 2nd Jun. 2017, ISSN: 1932-6203. DOI: 10.1371/journal.pone.0177678. [58] G. Jurman, S. Riccadonna and C. Furlanello, ‘A Comparison of MCC and CEN Error Mea- sures in Multi-Class Prediction’, PLoS ONE, vol. 7, no. 8, e41882, 8th Aug. 2012, ISSN: 1932-6203. DOI: 10.1371/journal.pone.0041882.

139 [59] M. Bekkar, H. K. Djemaa and T. A. Alitouche, ‘Evaluation Measures for Models Assessment over Imbalanced Data Sets’, Journal of Information Engineering and Applications, vol. 3, no. 10, p. 13, 2013. [60] G. Canbek, S. Sagiroglu, T. T. Temizel and N. Baykal, ‘Binary classification performance measures/metrics: A comprehensive visualized roadmap to gain new insights’, in Inter- national Conference on Computer Science and Engineering, Oct. 2017, pp. 821–826. DOI: 10.1109/UBMK.2017.8093539. [61] M. L. Zhang and Z. H. Zhou, ‘A Review on Multi-Label Learning Algorithms’, IEEE Transac- tions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819–1837, Aug. 2014, ISSN: 1041-4347. DOI: 10.1109/TKDE.2013.39. [62] G. Madjarov, D. Kocev, D. Gjorgjevikj and S. Džeroski, ‘An extensive experimental compar- ison of methods for multi-label learning’, Pattern Recognition, Best Papers of Iberian Con- ference on Pattern Recognition and Image Analysis (IbPRIA’2011), vol. 45, no. 9, pp. 3084– 3104, 1st Sep. 2012, ISSN: 0031-3203. DOI: 10.1016/j.patcog.2012.03.004. [63] L. A. F. Park and J. Read, ‘A Blended Metric for Multi-label Optimisation and Evaluation’, in Machine Learning and Knowledge Discovery in Databases, M. Berlingerio, F. Bonchi, T. Gärtner, N. Hurley and G. Ifrim, Eds., ser. Lecture Notes in Computer Science, Springer International Publishing, 2019, pp. 719–734, ISBN: 978-3-030-10925-7. [64] K. Brinker, J. Fürnkranz and E. Hüllermeier, ‘A Unified Model for Multilabel Classifica- tion and Ranking’, in Proceedings of the 2006 Conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 – September 1, 2006, Riva Del Garda, Italy, Amsterdam, The Netherlands, The Netherlands: IOS Press, 29th Aug. 2006, pp. 489–493, ISBN: 978-1-58603-642-3. [65] O. Luaces, J. Díez, J. Barranquero, J. J. del Coz and A. Bahamonde, ‘Binary relevance efficacy for multilabel classification’, Progress in Artificial Intelligence, vol. 1, no. 4, pp. 303– 313, 1st Dec. 2012, ISSN: 2192-6360. DOI: 10.1007/s13748-012-0030-x. [66] J. M. Moyano, E. L. Gibaja, K. J. Cios and S. Ventura, ‘Review of ensembles of multi-label classifiers: Models, experimental study and prospects’, Information Fusion, vol. 44, pp. 33– 45, 1st Nov. 2018, ISSN: 1566-2535. DOI: 10.1016/j.inffus.2017.12.001. [67] A. Melo and H. Paulheim, ‘Local and global feature selection for multilabel classification with binary relevance’, Artificial Intelligence Review, 2nd May 2017, ISSN: 1573-7462. DOI: 10.1007/s10462-017-9556-4. [68] M.-L. Zhang and L. Wu, ‘Lift: Multi-label learning with label-specific features’, IEEE Trans- actions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 107–120, 2015. DOI: 10.1109/TPAMI.2014.2339815.

140 [69] G. Tsoumakas and I. Vlahavas, ‘Random k-Labelsets: An Ensemble Method for Multilabel Classification’, in European Conference on Machine Learning, J. N. Kok, J. Koronacki, R. L. de Mantaras, S. Matwin, D. Mladenič and A. Skowron, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 406–417, ISBN: 978-3-540-74958-5. [70] G. Tsoumakas, I. Katakis and I. Vlahavas, ‘Effective and efficient multilabel classification in domains with large number of labels’,in Proceedings of the ECML/PKDD 2008 Workshop on Mining Multidimensional Data, sn, vol. 21, 2008, pp. 53–59. [71] J. Read, B. Pfahringer, G. Holmes and E. Frank, ‘Classifier chains for multi-label classifi- cation’, Machine Learning, vol. 85, no. 3, pp. 333–359, 30th Jun. 2011, ISSN: 0885-6125, 1573-0565. DOI: 10.1007/s10994-011-5256-5. [72] J. Read, B. Pfahringer, G. Holmes and E. Frank, ‘Classifier Chains for Multi-label Clas- sification’, in Machine Learning and Knowledge Discovery in Databases, W. Buntine, M. Grobelnik, D. Mladenić and J. Shawe-Taylor, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 254–269, ISBN: 978-3-642-04174-7. [73] K. Dembczyñski, W. Cheng and E. Hüllermeier, ‘Bayes optimal multilabel classification via probabilistic classifier chains’, in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 279–286. [74] R. Senge, J. J. del Coz and E. Hüllermeier, ‘On the Problem of Error Propagation in Classifier Chains for Multi-label Classification’, in Data Analysis, Machine Learning and Knowledge Discovery, M. Spiliopoulou, L. Schmidt-Thieme and R. Janning, Eds., ser. Stud- ies in Classification, Data Analysis, and Knowledge Organization, Springer International Publishing, 2014, pp. 163–170, ISBN: 978-3-319-01595-8. [75] S. Godbole and S. Sarawagi, ‘Discriminative Methods for Multi-labeled Classification’, in Advances in Knowledge Discovery and Data Mining, H. Dai, R. Srikant and C. Zhang, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2004, pp. 22–30, ISBN: 978-3-540-24775-3. [76] A. Clare and R. D. King, ‘Knowledge Discovery in Multi-label Phenotype Data’,in Principles of Data Mining and Knowledge Discovery, L. De Raedt and A. Siebes, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2001, pp. 42–53, ISBN: 978-3- 540-44794-8. [77] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993, ISBN: 978-1-55860-238-0. [78] D. Kocev, C. Vens, J. Struyf and S. Džeroski, ‘Ensembles of Multi-Objective Decision Trees’, in European Conference on Machine Learning, J. N. Kok, J. Koronacki, R. L. de Mantaras, S. Matwin, D. Mladenič and A. Skowron, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 624–631, ISBN: 978-3-540-74958-5.

141 [79] A. Elisseeff and J. Weston, ‘A Kernel Method for Multi-labelled Classification’, in Advances in Neural Information Processing Systems 14, (Vancouver, British Columbia, Canada), Cambridge, MA, USA: MIT Press, 2001, pp. 681–687. [80] G. Rätsch, S. Sonnenburg and C. Schäfer, ‘Learning Interpretable SVMs for Biological Sequence Classification’, BMC Bioinformatics, vol. 7, no. 1, S9, 20th Mar. 2006, ISSN: 1471- 2105. DOI: 10.1186/1471-2105-7-S1-S9. [81] M.-L. Zhang and Z.-H. Zhou, ‘ML-KNN: A lazy learning approach to multi-label learning’, Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 1st Jul. 2007, ISSN: 0031-3203. DOI: 10.1016/j.patcog.2006.12.019. [82] J. Wicker, B. Pfahringer and S. Kramer, ‘Multi-label Classification Using Boolean Matrix Decomposition’,in Proceedings of the 27th Annual ACM Symposium on Applied Computing, New York, NY, USA: ACM, 2012, pp. 179–186, ISBN: 978-1-4503-0857-1. DOI: 10.1145/ 2245276.2245311. [83] M. Gethsiyal Augasta and T. Kathirvalavakumar, ‘Rule extraction from neural networks - A comparative study’, in International Conference on Pattern Recognition, Informatics and Medical Engineering, Mar. 2012, pp. 404–408. DOI: 10.1109/ICPRIME.2012.6208380. [84] M. Courbariaux, Y.Bengio and J.-P.David, ‘BinaryConnect: Training Deep Neural Networks with binary weights during propagations’, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett, Eds., Curran Associates, Inc., 2015, pp. 3123–3131. [85] K. N. Vougas, T.Jackson, A. Polyzos, M. Liontos, E. O. Johnson, V. Georgoulias, P.Townsend, J. Bartek and V. G. Gorgoulis, ‘Deep Learning and Association Rule Mining for Predicting Drug Response in Cancer’, biorxiv;070490v1, 19th Aug. 2016. [86] M. Kim and P. Smaragdis, ‘Bitwise Neural Networks’, arXiv preprint, 22nd Jan. 2016. arXiv: 1601.06071. [87] E. Spyromitros, G. Tsoumakas and I. Vlahavas, ‘An Empirical Study of Lazy Multilabel Classification Algorithms’, in Artificial Intelligence: Theories, Models and Applications, J. Darzentas, G. A. Vouros, S. Vosinakis and A. Arnellos, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2008, pp. 401–406, ISBN: 978-3-540-87881-0. [88] M.-L. Zhang and K. Zhang, ‘Multi-label Learning by Exploiting Label Dependency’, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Washington, DC, USA), New York, NY, USA: ACM, 2010, pp. 999–1008, ISBN: 978-1-4503-0055-1. DOI: 10.1145/1835804.1835930. [89] S.-j. Huang and Z.-h. Zhou, ‘Multilabel learning by exploiting label correlations locally’, in 26th AAAI Conference on Artificial Intelligence, 2012. [90] C. A. Tawiah and V. S. Sheng, ‘A Study on Multi-label Classification’, in Advances in Data Mining.ApplicationsandTheoreticalAspects, P.Perner, Ed., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 137–150, ISBN: 978-3-642-39736-3.

142 [91] A. Fallah Tehrani and D. Ahrens, ‘Modeling label dependence for multi-label classification using the Choquistic regression’, Pattern Recognition Letters, vol. 92, pp. 75–80, 1st Jun. 2017, ISSN: 0167-8655. DOI: 10.1016/j.patrec.2017.04.018. [92] J. Read and J. Hollmén, ‘A Deep Interpretation of Classifier Chains’, in Advances in Intelli- gent Data Analysis XIII, ser. Lecture Notes in Computer Science, Springer, Cham, 30th Oct. 2014, pp. 251–262, ISBN: 978-3-319-12570-1 978-3-319-12571-8. DOI: 10.1007/978-3- 319-12571-8_22. [93] K. Dembczyński, W. Waegeman and E. Hüllermeier, ‘An Analysis of Chaining in Multi-label Classification’, in Proceedings of the 20th European Conference on Artificial Intelligence, Amsterdam, The Netherlands, The Netherlands: IOS Press, 2012, pp. 294–299, ISBN: 978-1-61499-097-0. DOI: 10.3233/978-1-61499-098-7-294. [94] J. Read and J. Hollmén, ‘Multi-label Classification using Labels as Hidden Nodes’, arXiv preprint, 31st Mar. 2015. arXiv: 1503.09022. [95] Z. Cataltepe, ‘The scheduling problem in learning from hints’, California Institute of Technology, Pasadena, CA, Technical Report 94-09, 1994. [96] A. Pentina, V. Sharmanska and C. H. Lampert, ‘Curriculum learning of multiple tasks’, in IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp. 5492–5500. DOI: 10.1109/CVPR.2015.7299188. [97] G. B. Peterson, ‘A day of great illumination: B. F. Skinner’s discovery of shaping.’, Journal of the Experimental Analysis of Behavior, vol. 82, no. 3, pp. 317–328, Nov. 2004, ISSN: 0022-5002. DOI: 10.1901/jeab.2004.82-317. [98] Y. S. Abu-Mostafa, ‘Learning from hints in neural networks’, Journal of Complexity, vol. 6, no. 2, pp. 192–198, 1st Jun. 1990, ISSN: 0885-064X. DOI: 10.1016/0885-064X(90) 90006-Y. [99] J. L. Elman, ‘Learning and development in neural networks: The importance of starting small’, Cognition, vol. 48, no. 1, pp. 71–99, 1st Jul. 1993, ISSN: 0010-0277. DOI: 10.1016/ 0010-0277(93)90058-4. [100] D. L. T. Rohde and D. C. Plaut, ‘Language acquisition in the absence of explicit negative evidence: How important is starting small?’, Cognition, vol. 72, no. 1, pp. 67–109, 25th Aug. 1999, ISSN: 0010-0277. DOI: 10.1016/S0010-0277(99)00031-1. [101] A. Roli, M. Villani, R. Serra, S. Benedettini, C. Pinciroli and M. Birattari, ‘Dynamical Proper- ties of Artificially Evolved Boolean Network Robots’, in AI*IA 2015 Advances in Artificial Intelligence, M. Gavanelli, E. Lamma and F. Riguzzi, Eds., ser. Lecture Notes in Computer Science, vol. 9336, Springer International Publishing, 23rd Sep. 2015, pp. 45–57, ISBN: 978-3-319-24308-5 978-3-319-24309-2. DOI: 10.1007/978-3-319-24309-2_4. [102] T. Karras, T. Aila, S. Laine and J. Lehtinen, ‘Progressive Growing of GANs for Improved Quality, Stability, and Variation’, in International Conference on Learning Representations, 2018, p. 26.

143 [103] S. Narvekar and P. Stone, ‘Learning Curriculum Policies for Reinforcement Learning’, arXiv preprint, 1st Dec. 2018. arXiv: 1812.00285. [104] M. P. Kumar, B. Packer and D. Koller, ‘Self-paced Learning for Latent Variable Models’, in Advances in Neural Information Processing Systems 23, USA: Curran Associates Inc., 2010, pp. 1189–1197. [105] L. G. Valiant, ‘A Theory of the Learnable’, Communications of the ACM, vol. 27, no. 11, pp. 1134–1142, Nov. 1984, ISSN: 0001-0782. DOI: 10.1145/1968.1972. [106] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 4th ed., ser. Graduate Texts in Computer Science. New York, NY: Springer, 2013, 834 pp., ISBN: 978-1-4757-2606-0. DOI: 10.1007/978-1-4757-2606-0. [107] M. Li and P. Vitányi, ‘Learning Simple Concepts under Simple Distributions’, SIAM Journal on Computing, vol. 20, no. 5, pp. 911–935, 1st Oct. 1991, ISSN: 0097-5397. DOI: 10.1137/ 0220056. [108] K.-H. Thung and C.-Y. Wee, ‘A brief review on multi-task learning’, Multimedia Tools and Applications, vol. 77, no. 22, pp. 29 705–29 725, 1st Nov. 2018, ISSN: 1573-7721. DOI: 10. 1007/s11042-018-6463-x. [109] Y. Zhang and Q. Yang, ‘An overview of multi-task learning’, National Science Review, vol. 5, no. 1, pp. 30–43, 1st Jan. 2018, ISSN: 2095-5138. DOI: 10.1093/nsr/nwx105. [110] C. Li, J. Yan, F. Wei, W. Dong, Q. Liu and H. Zha, ‘Self-Paced Multi-Task Learning.’, in 31st AAAI Conference on Artificial Intelligence, 2017, pp. 2175–2181. [111] A. Lad, R. Ghani, Y. Yang and B. Kisiel, ‘Toward Optimal Ordering of Prediction Tasks’, in Proceedings of the SIAM International Conference on Data Mining, SIAM, 30th Apr. 2009, pp. 884–893, ISBN: 978-0-89871-682-5. [112] J. Ceberio, A. Mendiburu and J. A. Lozano, ‘The linear ordering problem revisited’, Euro- pean Journal of Operational Research, vol. 241, no. 3, pp. 686–696, 16th Mar. 2015, ISSN: 0377-2217. DOI: 10.1016/j.ejor.2014.09.041. [113] C. S. Signorino and J. M. Ritter, ‘Tau-b or Not Tau-b: Measuring the Similarity of Foreign Policy Positions’, International Studies Quarterly, vol. 43, no. 1, pp. 115–144, 1st Mar. 1999, ISSN: 0020-8833. DOI: 10.1111/0020-8833.00113. [114] E. J. Emond and D. W. Mason, ‘A new rank correlation coefficient with application to the consensus ranking problem’, Journal of Multi-Criteria Decision Analysis, vol. 11, no. 1, pp. 17–28, 2002, ISSN: 1099-1360. DOI: 10.1002/mcda.313. [115] J. Kemeny and J. Snell, ‘Preference Rankings: An Axiomatic Approach’, in Mathematical Models in the Social Sciences, K. Kemeny and J. Snell, Eds., MIT Press, Cambridge, 1962, pp. 9–23. [116] M. G. Kendall and B. B. Smith, ‘The Problem of m Rankings’, The Annals of Mathematical Statistics, vol. 10, no. 3, pp. 275–287, Sep. 1939, ISSN: 0003-4851, 2168-8990. DOI: 10. 1214/aoms/1177732186.

144 [117] D. Granata and V. Carnevale, ‘Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets’, Scientific Reports, vol. 6, p. 31 377, 11th Aug. 2016, ISSN: 2045-2322. DOI: 10.1038/srep31377. [118] G. Navarro, R. Paredes, N. Reyes and C. Bustos, ‘An empirical evaluation of intrinsic dimension estimators’, Information Systems, vol. 64, pp. 206–218, 1st Mar. 2017, ISSN: 0306-4379. DOI: 10.1016/j.is.2016.06.004. [119] E. Facco, M. d’Errico, A. Rodriguez and A. Laio, ‘Estimating the intrinsic dimension of datasets by a minimal neighborhood information’, Scientific Reports, vol. 7, no. 1, Dec. 2017, ISSN: 2045-2322. DOI: 10.1038/s41598-017-11873-y. arXiv: 1803.06992. [120] F. Camastra and A. Vinciarelli, ‘Estimating the intrinsic dimension of data with a fractal- based method’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1404–1407, Oct. 2002, ISSN: 0162-8828. DOI: 10.1109/TPAMI.2002.1039212. [121] J. B. Tenenbaum, V. de Silva and J. C. Langford, ‘A Global Geometric Framework for Non- linear Dimensionality Reduction’, Science, vol. 290, no. 5500, pp. 2319–2323, 22nd Dec. 2000, ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.290.5500.2319. [122] N. Armanfard, J. P. Reilly and M. Komeili, ‘Local Feature Selection for Data Classification’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1217– 1227, Jun. 2016, ISSN: 0162-8828. DOI: 10.1109/TPAMI.2015.2478471. [123] V. Pestov, ‘An axiomatic approach to intrinsic dimension of a dataset’, Neural Networks, Advances in Neural Networks Research: IJCNN ’07, vol. 21, no. 2, pp. 204–213, 1st Mar. 2008, ISSN: 0893-6080. DOI: 10.1016/j.neunet.2007.12.030. [124] M. Brunato and R. Battiti, ‘X-MIFS: Exact Mutual Information for feature selection’, in International Joint Conference on Neural Networks, Jul. 2016, pp. 3469–3476. DOI: 10. 1109/IJCNN.2016.7727644. [125] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang and H. Liu, ‘Feature Selection: A Data Perspective’, ACM Computing Surveys, vol. 50, no. 6, 94:1–94:45, Dec. 2017, ISSN: 0360-0300. DOI: 10.1145/3136625. [126] Y. Bengio, A. Courville and P. Vincent, ‘Representation Learning: A Review and New Perspectives’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug. 2013, ISSN: 0162-8828, 2160-9292. DOI: 10.1109/TPAMI.2013.50. [127] G. Chandrashekar and F. Sahin, ‘A survey on feature selection methods’, Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, Jan. 2014, ISSN: 00457906. DOI: 10.1016/ j.compeleceng.2013.11.024. [128] S. Kashef, H. Nezamabadi-pour and B. Nikpour, ‘Multilabel feature selection: A com- prehensive review and guiding experiments’, Wiley Interdisciplinary Reviews: Data Min- ing and Knowledge Discovery, vol. 8, no. 2, e1240, 1st Mar. 2018, ISSN: 1942-4787. DOI: 10.1002/widm.1240.

145 [129] S. Davies and S. Russell, ‘NP-Completeness of Searches for Smallest Possible Feature Sets’, in AAAI Symposium on Intelligent Relevance, AAAI Press, 1994, pp. 37–39. [130] C. Cotta and P. Moscato, ‘The k-feature set problem is W[2]-complete’, Journal of Com- puter and System Sciences, vol. 67, no. 4, pp. 686–690, 1st Dec. 2003, ISSN: 0022-0000. DOI: 10.1016/S0022-0000(03)00081-3. [131] M. Charikar, V. Guruswami, R. Kumar, S. Rajagopalan and A. Sahai, ‘Combinatorial feature selection problems’, in Proceedings 41st Annual Symposium on Foundations of Computer Science, 2000, pp. 631–640. DOI: 10.1109/SFCS.2000.892331. [132] C. Li, H. Farkhoor, R. Liu and J. Yosinski, ‘Measuring the Intrinsic Dimension of Objective Landscapes’, arXiv preprint, 24th Apr. 2018. arXiv: 1804.08838. [133] V. V. Shende, S. S. Bullock and I. L. Markov, ‘Synthesis of quantum-logic circuits’, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 6, pp. 1000–1010, Jun. 2006, ISSN: 0278-0070. DOI: 10.1109/TCAD.2005.855930. [134] J. Preskill, ‘Quantum Computing in the NISQ era and beyond’, Quantum, vol. 2, p. 79, 6th Aug. 2018. DOI: 10.22331/q-2018-08-06-79. [135] J. J. Lohmueller, T. Z. Armel and P. A. Silver, ‘A tunable zinc finger-based framework for Boolean logic computation in mammalian cells’, Nucleic Acids Research, vol. 40, no. 11, pp. 5180–5187, Jun. 2012, ISSN: 0305-1048. DOI: 10.1093/nar/gks142. [136] F. Lienert, J. P. Torella, J.-H. Chen, M. Norsworthy, R. R. Richardson and P. A. Silver, ‘Two- and three-input TALE-based AND logic computation in embryonic stem cells’, Nucleic Acids Research, vol. 41, no. 21, pp. 9967–9975, Nov. 2013, ISSN: 0305-1048. DOI: 10.1093/nar/gkt758. [137] J. Macia and R. Sole, ‘How to Make a Synthetic Multicellular Computer’, PLoS ONE, vol. 9, no. 2, 19th Feb. 2014, ISSN: 1932-6203. DOI: 10.1371/journal.pone.0081248. [138] Y. Amir, E. Ben-Ishay, D. Levner, S. Ittah, A. Abu-Horowitz and I. Bachelet, ‘Universal computing by DNA origami robots in a living animal’, Nature Nanotechnology, vol. 9, no. 5, pp. 353–357, May 2014, ISSN: 1748-3395. DOI: 10.1038/nnano.2014.58. [139] V. Gaudet, ‘A survey and tutorial on contemporary aspects of multiple-valued logic and its application to microelectronic circuits’, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 1, pp. 5–12, Mar. 2016, ISSN: 2156-3357. DOI: 10.1109/JETCAS.2016.2528041. [140] S. M. Saeed, X. Cui, R. Wille, A. Zulehner, K. Wu, R. Drechsler and R. Karri, ‘Towards Reverse Engineering Reversible Logic’, arXiv preprint, 26th Apr. 2017. arXiv: 1704.08397. [141] V. Kendon, A. Sebald and S. Stepney, ‘Heterotic computing: Exploiting hybrid computa- tional devices’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 373, no. 2046, 2015. DOI: 10.1098/rsta.2015.0091.

146 [142] J. Jones, J. G. Whiting and A. Adamatzky, ‘Quantitative transformation for implemen- tation of adder circuits in physical systems’, Biosystems, vol. 134, pp. 16–23, Aug. 2015, ISSN: 03032647. DOI: 10.1016/j.biosystems.2015.05.005. [143] D. Horsman, V. Kendon and S. Stepney, ‘The Natural Science of Computing’, Communi- cations of the ACM, vol. 60, no. 8, pp. 31–34, Jul. 2017, ISSN: 0001-0782. DOI: 10.1145/ 3107924. [144] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy and A. Raghunathan, ‘SALSA: System- atic logic synthesis of approximate circuits’, in Design Automation Conference, Jun. 2012, pp. 796–801. DOI: 10.1145/2228360.2228504. [145] D. Liu, C. Yu, X. Zhang and D. Holcomb, ‘Oracle-guided Incremental SAT Solving to Reverse Engineer Camouflaged Logic Circuits’, in Proceedings of the 2016 Conference on Design, Automation & Test in Europe, San Jose, CA, USA: EDA Consortium, 2016, pp. 433–438, ISBN: 978-3-9815370-6-2. [146] J. Torresen, ‘Evolving Multiplier Circuits by Training Set and Training Vector Partitioning’, in Evolvable Systems: From Biology to Hardware, ser. Lecture Notes in Computer Science, A. M. Tyrrell, P. C. Haddow and J. Torresen, Eds., Springer Berlin Heidelberg, 17th Mar. 2003, pp. 228–237, ISBN: 978-3-540-00730-2 978-3-540-36553-2. DOI: 10.1007/3-540- 36553-2_21. [147] F. Manfrini, H. J. C. Barbosa and H. S. Bernardino, ‘Optimization of combinational logic circuits through decomposition of truth table and evolution of sub-circuits’, in IEEE Congress on Evolutionary Computation, Jul. 2014, pp. 945–950. DOI: 10.1109/CEC. 2014.6900565. [148] C. K. Vijayakumari, P. Mythili, R. K. James and S. A. Kumar, ‘Optimal design of combina- tional logic circuits using genetic algorithm and Reed-Muller Universal Logic Modules’, in International Conference on Embedded Systems, Jul. 2014, pp. 1–6. DOI: 10.1109/ EmbeddedSys.2014.6953039. [149] G. Boole, An Investigation of the Laws of Thought: On Which Are Founded the Mathematical Theories of Logic and Probabilities. Dover Publications, 1854. [150] C. E. Shannon, ‘The synthesis of two-terminal switching circuits’, The Bell System Techni- cal Journal, vol. 28, no. 1, pp. 59–98, Jan. 1949, ISSN: 0005-8580. DOI: 10.1002/j.1538- 7305.1949.tb03624.x. [151] P.Carnevali and S. Patarnello, ‘Exhaustive thermodynamical analysis of Boolean learning networks’, Europhysics Letters, vol. 4, no. 10, p. 1199, 1987. [152] P. Shukla and T. K. Sinha, ‘Learning from examples in feedforward Boolean networks’, Physical Review E, vol. 47, no. 4, pp. 2962–2965, 1st Apr. 1993. DOI: 10.1103/PhysRevE. 47.2962.

147 [153] A. Goudarzi, C. Teuscher, N. Gulbahce and T. Rohlf, ‘Learning, generalisation, and func- tional entropy in random automata networks’, International Journal of Autonomous and Adaptive Communications Systems, vol. 7, no. 3, pp. 295–314, 2014. [154] S. E. Quadir, J. Chen, D. Forte, N. Asadizanjani, S. Shahbazmohamadi, L. Wang, J. Chandy and M. Tehranipoor, ‘A Survey on Chip to System Reverse Engineering’, Journal of Emerg- ing Technologies in Computing Systems, vol. 13, no. 1, 6:1–6:34, Apr. 2016, ISSN: 1550-4832. DOI: 10.1145/2755563. [155] O. E. Akman, S. Watterson, A. Parton, N. Binns, A. J. Millar and P. Ghazal, ‘Digital clocks: Simple Boolean models can quantitatively describe circadian systems’, Journal of The Royal Society Interface, vol. 9, no. 74, pp. 2365–2382, 7th Sep. 2012, ISSN: 1742-5689, 1742-5662. DOI: 10.1098/rsif.2012.0080. [156] B. Shao, X. Liu, D. Zhang, J. Wu and Q. Ouyang, ‘From Boolean Network Model to Contin- uous Model Helps in Design of Functional Circuits’, PLoS ONE, vol. 10, no. 6, e0128630, 10th Jun. 2015. DOI: 10.1371/journal.pone.0128630. [157] T. Guo, D. Herber and J. Allison, ‘Reducing evaluation cost for circuit synthesis using active learning’, presented at the Proceedings of the ASME Design Engineering Technical Conference, vol. 2A-2018, 2018. DOI: 10.1115/DETC201885654. [158] S. R. Safavian and D. Landgrebe, ‘A survey of decision tree classifier methodology’, IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660–674, May 1991, ISSN: 0018-9472. DOI: 10.1109/21.97458. [159] W. S. Lau, K. H. Lee and K. S. Leung, ‘A Hybridized Genetic Parallel Programming Based Logic Circuit Synthesizer’, in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, (Seattle, Washington, USA), New York, NY, USA: ACM, 2006, pp. 839–846, ISBN: 978-1-59593-186-3. DOI: 10.1145/1143997.1144145. [160] A. M. Turing, Intelligent Machinery. Report for National Physical Laboratory. Reprinted in Ince, Dc (Editor). 1992. Mechanical Intelligence: Collected Works of Am Turing. Amsterdam: North Holland, 1948. [161] S. Patarnello and P. Carnevali, ‘Learning networks of neurons with Boolean logic’, Euro- physics Letters, vol. 4, no. 4, p. 503, 1987. [162] C. Van den Broeck and R. Kawai, ‘Learning in feedforward Boolean networks’, Physical Review A, vol. 42, no. 10, p. 6210, 1990. [163] Z. Lin, M. Courbariaux, R. Memisevic and Y. Bengio, ‘Neural Networks with Few Multipli- cations’, arXiv preprint, 11th Oct. 2015. arXiv: 1510.03009. [164] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio, ‘Binarized Neural Net- works’, in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett, Eds., Curran Associates, Inc., 2016, pp. 4107–4115.

148 [165] B. McDanel, S. Teerapittayanon and H. Kung, ‘Embedded Binarized Neural Networks’, in Proceedings of the 2017 International Conference on Embedded Wireless Systems and Networks, (Uppsala, Sweden), USA: Junction Publishing, 2017, pp. 168–173, ISBN: 978-0- 9949886-1-4. [166] E. K. Burke and Y. Bykov, ‘A late acceptance strategy in hill-climbing for examination timetabling problems’, in International Conference on the Practice and Theory of Auto- mated Timetabling, 2008. [167] M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, J. Ostblom, S. Lukauskas, D. C. Gem- perline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Brunner, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian and A. Qalieh, Seaborn: V0.9.0, http://seaborn.pydata.org/, Zenodo, 16th Jul. 2018. DOI: 10.5281/zenodo.1313201. [168] P. Moscato, ‘On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms’, California Institute of Technology, Pasadena, California, USA, Technical Report 826, 1989. [169] C. Teuscher, N. Gulbahce and T. Rohlf, ‘Learning and generalization in random Boolean networks’, in Dynamic Days 2007: International Conference on Chaos and Nonlinear Dy- namics, Boston, MA, Jan. 2007. [170] M. V. Butz, ‘Boolean Function Problems’, in Rule-Based Evolutionary Online Learning Systems: A Principled Approach to LCS Analysis and Design, ser. Studies in Fuzziness and Soft Computing, M. V. Butz, Ed., Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 243–248, ISBN: 978-3-540-31231-4. DOI: 10.1007/3-540-31231-5_8. [171] M. Iqbal, W. N. Browne and M. Zhang, ‘Reusing Building Blocks of Extracted Knowledge to Solve Complex, Large-Scale Boolean Problems’, IEEE Transactions on Evolutionary Computation, vol. 18, no. 4, pp. 465–480, Aug. 2014, ISSN: 1089-778X. DOI: 10.1109/ TEVC.2013.2281537. [172] R. J. Urbanowicz and J. H. Moore, ‘ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System’, Evolutionary intelligence, vol. 8, no. 2, pp. 89–116, Sep. 2015, ISSN: 1864-5909. DOI: 10.1007/s12065-015-0128-8. [173] A. Poret and J.-P. Boissel, ‘An in silico target identification using Boolean network at- tractors: Avoiding pathological phenotypes’, Comptes Rendus Biologies, vol. 337, no. 12, pp. 661–678, Dec. 2014, ISSN: 1631-0691. DOI: 10.1016/j.crvi.2014.10.002. [174] S. Barman and Y.-K. Kwon, ‘A novel mutual information-based Boolean network inference method from time-series gene expression data’, PLoS ONE, vol. 12, no. 2, e0171097, 8th Feb. 2017, ISSN: 1932-6203. DOI: 10.1371/journal.pone.0171097.

149 [175] P. Moscato, ‘An introduction to population approaches for optimization and hierarchi- cal objective functions: A discussion on the role of tabu search’, Annals of Operations Research, vol. 41, no. 2, pp. 85–121, 1st Jun. 1993, ISSN: 0254-5330, 1572-9338. DOI: 10.1007/BF02022564. [176] W. Zaremba and I. Sutskever, ‘Learning to Execute’, arXiv preprint, 16th Oct. 2014. arXiv: 1410.4615. [177] S. Burkhardt and S. Kramer, ‘On the Spectrum Between Binary Relevance and Classifier Chains in Multi-label Classification’,in Proceedings of the 30th Annual ACM Symposium on Applied Computing, New York, NY, USA: ACM, 2015, pp. 885–892, ISBN: 978-1-4503-3196-8. DOI: 10.1145/2695664.2695854. [178] B. Chen, W. Li, Y. Zhang and J. Hu, ‘Enhancing multi-label classification based on local label constraints and classifier chains’, in International Joint Conference on Neural Net- works, Vancouver, BC, Canada: IEEE, Jul. 2016, pp. 1458–1463, ISBN: 978-1-5090-0620-5. DOI: 10.1109/IJCNN.2016.7727370. [179] J. Lee, H. Kim, N.-r. Kim and J.-H. Lee, ‘An approach for multi-label classification by directed acyclic graph with label correlation maximization’, Information Sciences, vol. 351, pp. 101–114, 10th Jul. 2016, ISSN: 0020-0255. DOI: 10.1016/j.ins.2016.02.037. [180] J. Read, L. Martino, P. M. Olmos and D. Luengo, ‘Scalable multi-output label prediction: From classifier chains to classifier trellises’, Pattern Recognition, vol. 48, no. 6, pp. 2096– 2109, 1st Jun. 2015, ISSN: 0031-3203. DOI: 10.1016/j.patcog.2015.01.004. [181] P. N. da Silva, E. C. Gonçalves, A. Plastino and A. A. Freitas, ‘Distinct Chains for Different Instances: An Effective Strategy for Multi-label Classifier Chains’, in Machine Learning and Knowledge Discovery in Databases, T. Calders, F. Esposito, E. Hüllermeier and R. Meo, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2014, pp. 453–468, ISBN: 978-3-662-44851-9. [182] S.-J. Huang, Y. Yu and Z.-H. Zhou, ‘Multi-label Hypothesis Reuse’, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Beijing, China), New York, NY, USA: ACM, 2012, pp. 525–533, ISBN: 978-1-4503-1462-6. DOI: 10.1145/2339530.2339615. [183] L. Breiman, ‘Bagging predictors’, Machine Learning, vol. 24, no. 2, pp. 123–140, 1st Aug. 1996, ISSN: 1573-0565. DOI: 10.1007/BF00058655. [184] N. Li and Z.-H. Zhou, ‘Selective Ensemble of Classifier Chains’, in Multiple Classifier Sys- tems, Z.-H. Zhou, F. Roli and J. Kittler, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 146–156, ISBN: 978-3-642-38067-9. [185] E. C. Goncalves, A. Plastino and A. A. Freitas, ‘A Genetic Algorithm for Optimizing the Label Ordering in Multi-label Classifier Chains’, in IEEE 25th International Conference on Tools with Artificial Intelligence, Nov. 2013, pp. 469–476. DOI: 10.1109/ICTAI.2013.76.

150 [186] E. C. Gonçalves, A. Plastino and A. A. Freitas, ‘Simpler is Better: A Novel Genetic Algorithm to Induce Compact Multi-label Chain Classifiers’, in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, New York, NY, USA: ACM, 2015, pp. 559–566, ISBN: 978-1-4503-3472-3. DOI: 10.1145/2739480.2754650. [187] A. Kumar, S. Vembu, A. K. Menon and C. Elkan, ‘Beam search algorithms for multilabel learning’, Machine Learning, vol. 92, no. 1, pp. 65–89, 1st Jul. 2013, ISSN: 0885-6125, 1573-0565. DOI: 10.1007/s10994-013-5371-6. [188] W. Liu and I. W. Tsang, ‘On the Optimality of Classifier Chain for Multi-label Classification’, in Advances in Neural Information Processing Systems 28, Cambridge, MA, USA: MIT Press, 2015, pp. 712–720. [189] L. Sun and M. Kudo, ‘Optimization of classifier chains via conditional likelihood maxi- mization’, Pattern Recognition, vol. 74, pp. 503–517, 1st Feb. 2018, ISSN: 0031-3203. DOI: 10.1016/j.patcog.2017.09.034. [190] X. Jun, Y. Lu, Z. Lei and D. Guolun, ‘Conditional entropy based classifier chains for multi- label classification’, Neurocomputing, 22nd Jan. 2019, ISSN: 0925-2312. DOI: 10.1016/j. neucom.2019.01.039. [191] D. Wang, L. Li, J. Wang, F. Hu and X. Zhang, ‘Extracting Label Importance Information for Multi-label Classification’, in Database Systems for Advanced Applications, J. Pei, Y. Manolopoulos, S. Sadiq and J. Li, Eds., ser. Lecture Notes in Computer Science, Springer International Publishing, 2018, pp. 424–439, ISBN: 978-3-319-91458-9. [192] Ł. Kaiser and I. Sutskever, ‘Neural GPUs Learn Algorithms’, arXiv preprint, 25th Nov. 2015. arXiv: 1511.08228. [193] S. Yang, ‘Logic Synthesis and Optimization Benchmarks User Guide: Version 3.0’, Micro- electronics Center of North Carolina (MCNC), Technical report, 1991. [194] J. Demšar, ‘Statistical Comparisons of Classifiers over Multiple Data Sets’, Journal of Machine Learning Research, vol. 7, pp. 1–30, Jan 2006, ISSN: ISSN 1533-7928. [195] M. Friedman, ‘A Comparison of Alternative Tests of Significance for the Problem of m Rankings’, The Annals of Mathematical Statistics, vol. 11, no. 1, pp. 86–92, 1940, ISSN: 00034851. JSTOR: 2235971. [196] P. Nemenyi, ‘Distribution-free multiple comparisons’, Princeton University, 1963. [197] M. Anthony, G. Brightwell and J. Shawe-Taylor, ‘On specifying Boolean functions by labelled examples’, Discrete Applied Mathematics, vol. 61, no. 1, pp. 1–25, 7th Jul. 1995, ISSN: 0166-218X. DOI: 10.1016/0166-218X(94)00007-Z. [198] E. Boros, ‘Error-free and best-fit extensions of partially defined Boolean functions’,Osaka University, 1997. [199] M. Kearns and L. Valiant, ‘Cryptographic limitations on learning Boolean formulae and finite automata’, Journal of the ACM, vol. 41, no. 1, pp. 67–95, 1994.

151 [200] J. Grefenstette, S. Kim and S. Kauffman, ‘An analysis of the class of gene regulatory functions implied by a biochemical model’, Biosystems, vol. 84, no. 2, pp. 81–90, May 2006, ISSN: 03032647. DOI: 10.1016/j.biosystems.2005.09.009. [201] G. Gerules and C. Janikow, ‘A survey of modularity in genetic programming’, in IEEE Congress on Evolutionary Computation, Jul. 2016, pp. 5034–5043. DOI: 10.1109/CEC. 2016.7748328. [202] G. Huang, G.-B. Huang, S. Song and K. You, ‘Trends in extreme learning machines: A review’, Neural Networks, vol. 61, pp. 32–48, Jan. 2015, ISSN: 0893-6080. DOI: 10.1016/j. neunet.2014.10.001. [203] G. Lan, G. W. DePuy and G. E. Whitehouse, ‘An effective and simple heuristic for the set covering problem’, European Journal of Operational Research, vol. 176, no. 3, pp. 1387– 1403, 1st Feb. 2007, ISSN: 0377-2217. DOI: 10.1016/j.ejor.2005.09.028. [204] G. W. DePuy, R. J. Moraga and G. E. Whitehouse, ‘Meta-RaPS: A simple and effective approach for solving the traveling salesman problem’, Transportation Research Part E: Logistics and Transportation Review, vol. 41, no. 2, pp. 115–130, Mar. 2005, ISSN: 13665545. DOI: 10.1016/j.tre.2004.02.001. [205] T. Benoist, B. Estellon, F. Gardi, R. Megel and K. Nouioua, ‘LocalSolver 1.x: A black-box local-search solver for 0-1 programming’, 4OR-Quarterly Journal of Operations Research, vol. 9, no. 3, p. 299, 25th Mar. 2011, ISSN: 1614-2411. DOI: 10.1007/s10288-011-0165- 9. [206] J. Bergstra and Y. Bengio, ‘Random Search for Hyper-parameter Optimization’, Journal of Machine Learning Research, vol. 13, pp. 281–305, Feb. 2012. [207] L. Amarù, P.-E. Gaillardon and G. De Micheli, ‘The EPFL Combinational Benchmark Suite’, in Proceedings of the 24th International Workshop on Logic & Synthesis, 2015. [208] L. Mathieson, A. Mendes, J. Marsden, J. Pond and P. Moscato, ‘Computer-aided Breast Cancer Diagnosis with Optimal Feature Sets: Application of Safe Reduction Rules and Optimization Techniques’, 2004. [209] P. Moscato, L. Mathieson, A. Mendes and R. Berretta, ‘The Electronic Primaries: Predicting the U.S. Presidency Using Feature Selection with Safe Data Reduction’, in Proceedings of the 28th Australasian Conference on Computer Science, vol. 38, Darlinghurst, Australia, Australia: Australian Computer Society, Inc., 2005, pp. 371–379, ISBN: 1-920682-20-1. [210] Q. Zhang, L. T. Yang, Z. Chen and P. Li, ‘A survey on deep learning for big data’, Information Fusion, vol. 42, pp. 146–157, 1st Jul. 2018, ISSN: 1566-2535. DOI: 10.1016/j.inffus. 2017.10.006. [211] A. G. Baydin, B. A. Pearlmutter, A. A. Radul and J. M. Siskind, ‘Automatic Differentiation in Machine Learning: A Survey’, Journal of Machine Learning Research, vol. 18, no. 153, pp. 1–43, 2018, ISSN: 1533-7928.

152 [212] Y. A. LeCun, L. Bottou, G. B. Orr and K.-R. Müller, ‘Efficient BackProp’, in Neural Networks: Tricks of the Trade: Second Edition, ser. Lecture Notes in Computer Science, G. Montavon, G. B. Orr and K.-R. Müller, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 9–48, ISBN: 978-3-642-35289-8. DOI: 10.1007/978-3-642-35289-8_3. [213] Fangyue Chen, Guanrong Chen, Qinbin He, Guolong He and Xiubin Xu, ‘Universal Per- ceptron and DNA-Like Learning Algorithm for Binary Neural Networks: Non-LSBF Im- plementation’, IEEE Transactions on Neural Networks, vol. 20, no. 8, pp. 1293–1301, Aug. 2009, ISSN: 1045-9227, 1941-0093. DOI: 10.1109/TNN.2009.2023122. [214] F. Chen, G. Wang, G. Chen and Q. He, ‘A novel neural network parallel adder’, in Interna- tional Work-Conference on Artificial Neural Networks, ser. Advances in Computational Intelligence, Springer, 2013, pp. 538–546. [215] H. Huang, F. Chen and L. Xu, ‘Neural Network Binary Multiplier’, in Proceedings of the International Conference on Scientific Computing, The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (World- Comp), 2013, p. 169. [216] Chao Zhang, Jie Yang and Wei Wu, ‘Binary Higher Order Neural Networks for Realizing Boolean Functions’, IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 701–713, May 2011, ISSN: 1045-9227, 1941-0093. DOI: 10.1109/TNN.2011.2114367. [217] M. Amer and T. Maul, ‘A review of modularization techniques in artificial neural networks’, Artificial Intelligence Review, 4th Apr. 2019, ISSN: 1573-7462. DOI: 10.1007/s10462-019- 09706-7. [218] L. Franco and S. A. Cannas, ‘Generalization properties of modular networks: Implement- ing the parity function’, IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1306– 1313, 2001. [219] B. M. Wilamowski, D. Hunter and A. Malinowski, ‘Solving parity-N problems with feedfor- ward neural networks’, in Proceedings of the International Joint Conference on Neural Networks, vol. 4, IEEE, 2003, pp. 2546–2551. [220] E. Orhan and X. Pitkow, ‘Skip Connections Eliminate Singularities’, arXiv preprint, 15th Feb. 2018. arXiv: 1701.09175. [221] I. Sutskever, J. Martens, G. Dahl and G. Hinton, ‘On the Importance of Initialization and Momentum in Deep Learning’, in Proceedings of the 30th International Conference on International Conference on Machine Learning, (Atlanta, GA, USA), vol. 28, JMLR.org, 2013, pp. III-1139–III-1147. [222] T. Dozat, ‘Incorporating Nesterov Momentum into Adam’, Technical report, 2015. [223] K. Janocha and W. M. Czarnecki, ‘On Loss Functions for Deep Neural Networks in Classi- fication’, arXiv preprint, 18th Feb. 2017. arXiv: 1702.05659. [224] F. Chollet et al., Keras, https://keras.io, 2015.

153 [225] X. Glorot and Y. Bengio, ‘Understanding the difficulty of training deep feedforward neural networks’, in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256. [226] A. Joulin and T. Mikolov, ‘Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets’, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett, Eds., Curran Associates, Inc., 2015, pp. 190–198. [227] C. Cotta, C. Sloper and P. Moscato, ‘Evolutionary search of thresholds for robust feature set selection: Application to the analysis of microarray data’, in Applications of Evolution- ary Computing, Springer, 2004, pp. 21–30. [228] H. Peng, F. Long and C. Ding, ‘Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, Aug. 2005, ISSN: 0162-8828. DOI: 10.1109/TPAMI.2005.159. [229] V. Froese, R. van Bevern, R. Niedermeier and M. Sorge, ‘A Parameterized Complexity Analysis of Combinatorial Feature Selection Problems’, in Mathematical Foundations of Computer Science, K. Chatterjee and J. Sgall, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 445–456, ISBN: 978-3-642-40313-2.
