RULE DRIVEN JOB-SHOP SCHEDULING DERIVED FROM

NEURAL NETWORKS THROUGH EXTRACTION

A thesis presented to

the faculty of

the Fritz J. and Dolores H. Russ

College of Engineering and Technology of Ohio University

In partial fulfillment

of the requirements for the degree

Master of Science

Chandrasekhar V. Ganduri

August 2004

This thesis entitled

RULE DRIVEN JOB-SHOP SCHEDULING DERIVED FROM NEURAL

NETWORKS THROUGH EXTRACTION

BY

CHANDRASEKHAR V. GANDURI

has been approved for

the Department of Industrial and Manufacturing Systems Engineering

and the Russ College of Engineering and Technology by

Gary R. Weckman Associate Professor of Industrial & Manufacturing Systems Engineering

R. Dennis Irwin Dean, Fritz J. and Dolores H. Russ College of Engineering and Technology

GANDURI, CHANDRASEKHAR V. M.S. August 2004. Industrial and Manufacturing

Systems Engineering

Rule Driven Job-Shop Scheduling Derived from Neural Networks through Extraction

(122 pp.)

Director of Thesis: Gary Weckman

This thesis focuses on the development of a rule-based scheduler, based on production rules derived from an artificial neural network performing job shop scheduling. This study constructs a hybrid intelligent model utilizing genetic algorithms for optimization and neural networks as learning tools. Genetic algorithms are used for obtaining optimal schedules and the neural network is trained on these schedules.

Knowledge is extracted from the trained network as production rules using two rule extraction procedures: Validity Interval Analysis and Decision Tree Induction. The performance of this extracted rule set is compared to the performance of genetic algorithm, attribute-oriented induction method, ID3 algorithm and simple dispatching rules in scheduling a test set of 6x6 scheduling instances. The capability of the rule-based scheduler in providing near optimal solutions is discussed.

Approved:

Gary Weckman

Associate Professor of Industrial and Manufacturing Systems Engineering

TABLE OF CONTENTS

LIST OF TABLES...... 8

LIST OF FIGURES ...... 9

CHAPTER 1. INTRODUCTION ...... 10

1.1 Manufacturing Scheduling...... 10

1.2 Job Shop Scheduling Problem ...... 11

1.3 Previous Research...... 13

1.4 Current Research...... 14

1.5 Thesis Structure...... 15

CHAPTER 2. SOFT COMPUTING METHODOLOGIES...... 17

2.1 What is Soft Computing?...... 17

2.2 Genetic Algorithms...... 19

2.2.1 Methodology of Genetic Algorithms...... 20

2.2.2 Components of a Genetic Algorithm ...... 21

2.2.3 Simple Genetic Algorithm Outline ...... 23

2.3 Machine Learning ...... 24

2.3.1 Decision Tree Induction...... 25

2.3.2 Attribute-Oriented Induction...... 30

2.4 Artificial Neural Networks ...... 32

2.4.1 Neural Computation...... 32

2.4.2 The Multi-Layer Perceptron Classifier ...... 36

2.4.3 Neural-Network Training...... 39

2.4.4 Generalization Considerations...... 40

2.5 Rule Extraction in Neural Networks...... 42

2.5.1 The Rule-Extraction Task...... 42

2.5.2 Approaches to Rule Extraction ...... 44

2.5.3 Validity Interval Analysis...... 47

2.5.4 Extraction of Decision Tree Representations ...... 49

CHAPTER 3. APPROACHES TO THE JOB-SHOP SCHEDULING PROBLEM ... 52

3.1 The Classical Job Shop Scheduling Problem (JSSP)...... 52

3.1.1 Problem Formulation ...... 52

3.1.2 Types of Schedules ...... 56

3.2 Review of Approaches to solve JSSP ...... 58

3.2.1 Heuristics-based Approaches...... 59

3.2.2 Local Search Methods and Meta-Heuristics...... 61

3.2.3 Artificial Intelligence Approaches...... 64

3.2.4 Machine Learning Applications...... 68

CHAPTER 4. METHODOLOGY ...... 71

4.1 The Learning Task ...... 71

4.1.1 Genetic Algorithm (GA) Solutions...... 71

4.1.2 Setting up the Classification Problem...... 73

4.1.3 Development of a Neural Network Model...... 76

4.2 Knowledge Extraction from the Neural Network Model ...... 79

4.2.1 Decision Tree Induction...... 80

4.2.2 Propositional Rules by Validity Interval Analysis...... 87

CHAPTER 5. RESULTS AND DISCUSSION...... 91

5.1 Performance of the 12-12-10-6 MLP Classifier ...... 91

5.2 Efficacy of the Rule Extraction Task...... 93

5.3 Schedule Generation and Comparison...... 95

5.3.1 Statistical Analysis...... 98

CHAPTER 6. CONCLUSIONS AND FUTURE RESEARCH ...... 102

6.1 Conclusions...... 102

6.2 Future Research...... 104

REFERENCES ...... 106

APPENDIX A EVALUATION OF NN CLASSIFIERS ...... 113

APPENDIX B NETWORK PARAMETERS ...... 114

APPENDIX C DECISION TREE INDUCTION DATASETS...... 116

APPENDIX D NN DECISION TREE EXTRACTION...... 118

APPENDIX E ID3 DECISION TREE INDUCTION ...... 119

APPENDIX F TEST SCHEDULING SCENARIOS ...... 122


LIST OF TABLES

Table 2.1 Binary representations of chromosomes...... 21

Table 2.2 Training set for the PlayTennis concept ...... 26

Table 3.1 A 3 x 3 job-shop problem ...... 53

Table 4.1 The ft06 instance devised by Fisher and Thomson [88]...... 72

Table 4.2 ProcessTime and RemainingTime feature classes...... 74

Table 4.3 MachineLoad feature classification...... 75

Table 4.4 Assignment of class labels to target feature...... 76

Table 4.5 Sample data for the classification task...... 77

Table 4.6 Training parameters for the 12-12-10-6 MLP classifier...... 79

Table 4.7 The rule set containing 48 rules (NN-Rule set) ...... 86

Table 5.1 Confusion matrix of the 12-12-10-6 MLP classifier ...... 91

Table 5.2 Comparison of classifiers...... 92

Table 5.3 Number of features in the rule antecedent for the NN-Rule set ...... 94

Table 5.4 Makespans of schedules for ft06 ...... 96

Table 5.5 Makespans obtained by various schedulers on the test set ...... 97

Table 5.6 The Randomized Complete Block Design table...... 98

Table 5.7 Analysis of variance table...... 99

Table 5.8 Grouping of schedulers based on Duncan’s multiple range test...... 100

LIST OF FIGURES

Figure 2.1 Decision tree representation of PlayTennis concept ...... 27

Figure 2.2 Scheme of attribute-oriented induction ...... 31

Figure 2.3 Computation at a node...... 33

Figure 2.4 Information flow for training phase ...... 35

Figure 2.5 A Multi-layer perceptron...... 36

Figure 2.6 Logistic and hyperbolic tangent transfer functions ...... 38

Figure 2.7 Cross validation for termination...... 41

Figure 2.8 Schematic representation of the ANN-DT algorithm...... 50

Figure 3.1 The disjunctive graph representation of the 3 x 3 problem...... 54

Figure 3.2 Gantt chart representation of a schedule ...... 55

Figure 3.3 Venn diagram illustrating relationships between different sets of schedules.. 57

Figure 3.4 Architecture of a Hopfield net...... 66

Figure 4.1 Representation of a GA solution ...... 72

Figure 4.2 The decision tree induction algorithm...... 81

Figure 4.3 A partially expanded view of the induced decision tree...... 84

Figure 4.4 Forward step between two layers P and S in the network...... 89


CHAPTER 1. INTRODUCTION

1.1 Manufacturing Scheduling

Scheduling involves the sequencing of activities under time and resource

constraints to meet a given objective. It is a complex decision making activity because of

conflicting goals, limited resources and the difficulty in accurately modeling real world

scenarios. Common examples of scheduling problems encountered in the real world are:

the scheduling of aircraft on runways, the sequencing of tasks in a central processing unit and the assignment of

jobs to machines in a factory. In a manufacturing context, scheduling activities are

represented by operations, and resources by machines. The purpose of a scheduler is to

determine the starting time for each operation to achieve the desired performance

measures, while satisfying capacity and technological constraints.

In today’s highly competitive manufacturing environment, there is a definite need

for an integrated global approach to production planning and control. The planning

functions include demand forecasting, capacity and materials planning, process planning and operation scheduling. Scheduling interfaces with these other planning functions. For effective global planning and control, a robust and flexible approach capable of generating good solutions within an acceptable time is needed. There exists a considerable deviation between elegant theoretical formulations of the scheduling problem and practical approaches utilized in real-world scenarios. This is due to the

dynamic nature of the shop floor environment, which is marked by a high degree of randomness and uncertainty attributable to machine breakdowns, the addition of new jobs, and changes in job priorities and due dates. These considerations need to be incorporated in an effective manufacturing scheduling system.

1.2 Job Shop Scheduling Problem

Scheduling theory is concerned with the mathematical formulation and study of

various scheduling models and development of associated solution methodologies. Some

widely researched models are: the single machine model and its variants, parallel

machine models, flow shop and the job shop scheduling models. Of these, the

deterministic job shop scheduling model has attracted the most attention for two key

reasons. First, the generic formulation of the model makes it applicable to non-

manufacturing scheduling domains. Second, the problem’s sheer intractability has

inspired researchers to develop a broad spectrum of strategies, ranging from simple

heuristics to adaptive search strategies based on conceptual frameworks borrowed from

biology, genetics and evolution.

The deterministic job shop scheduling problem (JSSP) consists of a finite set of

jobs to be processed on a finite set of machines. The basic entity in the scheduling

process is an operation, which refers to the processing of a particular job step on a specified machine. Various performance measures are used to evaluate the optimality of schedules ranging from minimization of makespan, tardiness and process cost to

maximization of throughput and resource utilization. The JSSP is a constrained optimization problem (COP): the precedence constraints on the problem are given by a predetermined order of operations for each job, and capacity or disjunctive constraints require that each machine process no more than one operation at any given time. A schedule is the feasible resolution of the precedence and capacity constraints in the COP

[1].

The difficulty in the JSSP lies in the number of possible schedules. In theory, for an n x m JSSP, the cardinality of the set of possible schedules is (n!)^m. Though the set of feasible schedules is a subset of this set, it is still large enough to discourage complete enumeration for even moderately sized problems. The computation time of algorithms that search the solution space for an optimal schedule increases exponentially with problem size. Hence, the JSSP belongs to the class of NP-hard (nondeterministic polynomial-time hard) problems. French [2] notes that such problems cannot be expected to be solved by any algorithm in polynomial time.
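As a rough illustration of this bound, the following short calculation evaluates (n!)^m for a 6 x 6 instance, the problem size considered later in this thesis:

    from math import factorial

    # Upper bound on the number of possible schedules for an n x m job shop: (n!)^m
    n = m = 6
    print(f"(6!)^6 = {factorial(n) ** m:,}")  # 139,314,069,504,000,000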

Different kinds of approaches have emerged over the years for tackling this combinatorially exploding problem. Mathematical formulations based on mixed integer linear programming and Lagrangian relaxation techniques have been developed to provide optimal solutions to the JSSP. These techniques, however, are computationally infeasible and can only handle simplified problem instances [3]. The search-based approach to the problem explores the feasible solution space to identify the optimal solution. Adaptive search algorithms like Genetic Algorithms (GAs),

Tabu Search and Simulated Annealing have been applied to JSSP with success and are capable of providing optimal or near optimal solutions.

Heuristics offer a knowledge-based alternative to the problem. Many dispatching rules are insights and abstractions formulated from experts’ knowledge of the problem. Elementary priority dispatch rules such as Shortest Processing Time (SPT),

First Come First Served (FCFS), Earliest Due Date (EDD), etc. have proven useful in simulation studies of the job shop environment [4]. Ease of application, rapidity of computation and flexibility to changing shop floor conditions are the key reasons why heuristic-based approaches are widespread in industry. Their main weakness, however, is that different heuristics cater to different problems and no single heuristic dominates the rest across all scenarios.

1.3 Previous Research

A key shortcoming of optimization methods for scheduling problems is their lack

of insight into the scheduling process. An interesting line of investigation would be to

cast the scheduling problem as a learning task. In such a formulation, the optimal

schedules generated by efficient optimizers provide the desired learning material. These

schedules contain valuable information such as the relationship between an operation’s

attributes and its position in the sequence. An exploration of these schedules by machine

learning techniques would capture predictive knowledge regarding the assignment of

an operation’s position in a sequence based on its attributes. This extracted knowledge

allows for the development of a rule-based scheduler, which is computationally less intensive than search methods, but still provides good solutions. A further benefit of such a system is the provision of knowledge in the form of human comprehensible rules.

The work of Koonce and Tsai [5] demonstrates the success of the outlined approach. On the optimal schedules generated by a GA, an attribute-oriented induction methodology was used to induce simple propositional rules describing the scheduling process. The authors report that the developed rule-based scheduler consistently provided superior solutions to the SPT heuristic on the chosen benchmark problem and could duplicate the genetic algorithm’s performance on identical problems.

1.4 Current Research

Key questions and desired improvements identified in the previous research are:

• The generated rule set, consisting of 24 distinct rules, was insufficient to describe

randomly generated scenarios of the same problem size. Is there a more suitable

learning paradigm, possessing the desired generalization capabilities?

• The presence of considerable noise in the learning data contributed significantly

towards the deviation in performance between the GA scheduler and the rule-

based scheduler. Would a more robust machine-learning scheme having higher

tolerance to noise improve the predictive accuracy of the rule set?

The current research uses Artificial Neural Networks (ANNs) as the machine learning tool of choice to study the scheduling process. ANNs are being recognized as a

powerful and general technique for machine learning because of their non-linear modeling abilities. Further, their distributed architecture is more robust in handling noisy data. The hypothesis or model learned by a neural network is not explicitly stated, but is implicitly encoded in the network architecture and parameters. However, ANNs can be made to yield comprehensible models by using rule extraction procedures.

This thesis has three major objectives:

• To train an ANN on the schedules generated by a GA, to predict the

position/priority of an operation in a schedule based on the job attributes.

• To capture the embedded knowledge by extracting symbolic rules and decision

trees by using appropriate ANN rule extraction algorithms.

• To comparatively evaluate the predictive accuracy of the extracted rule set, the

trained ANN, the GA scheduler and other machine learning algorithms.

1.5 Thesis Structure

The thesis has been organized into six chapters as follows. Chapter 1 introduces the

job shop scheduling problem and explains the motivation of the current research. Chapter

2 provides background material for the various soft computing methods utilized in this

thesis. Chapter 3 describes the job shop scheduling problem and undertakes a survey of the current approaches to solve it. Chapter 4 presents the methods and tools used in this work to achieve the research objectives and focuses on the development of the neural network model and implementation of rule extraction methods. Chapter 5 discusses the

application of the extracted rule set in scheduling test problems and analyzes the results using statistical analysis. Chapter 6 provides conclusions and suggestions for future research.

CHAPTER 2. SOFT COMPUTING METHODOLOGIES

2.1 What is Soft Computing?

Soft Computing (SC) refers to the evolving collection of methodologies to build

intelligent systems exhibiting human-like reasoning and capable of tackling uncertainty.

The adoption of this approach has led to the development of systems that have high MIQ

(Machine Intelligence Quotient) [6]. SC-methodologies have proven successful over

classical modeling, reasoning and search techniques in a wide variety of problem domains. The characteristics of problems for which traditional analytical approaches have proven deficient are:

1. Modeling difficulties: Generally, real world problems are poorly defined and

information is empirically available as input-output patterns representing

instances of the problem’s behavior. Precise and accurate mathematical models

for such problems are either unavailable or restrictively expensive to build.

Further, such systems exhibit non-linear behavior for which traditional

mathematical modeling tools are of limited utility.

2. Large-scale solution spaces: Problems with large-scale solution spaces are usually

intractable with deterministic search techniques. The computational time and

effort are huge, and deterministic search does not employ mechanisms for

successfully navigating through local optima.

3. Knowledge Acquisition: Expert knowledge in a problem domain is often fuzzy,

consisting of imprecise declarations, partial truths and approximations. Hence,

crisp classifications and unambiguous definitions are not always possible. Also, in

some cases, there is a need to directly acquire knowledge from problem data

without human intervention.

Soft computing is not a single methodology, but consists of a suite of approaches capable of addressing the problem characteristics described above to yield tractable and robust intelligent systems at low solution cost. According to Zadeh [7]: “… in contrast to traditional, hard computing, soft computing is tolerant of imprecision, uncertainty, partial truth.” The discipline of SC encompasses several paradigms like fuzzy set theory, neural networks, approximate reasoning, stochastic optimization methods like genetic algorithms and simulated annealing, and machine learning techniques. SC unites these complementary approaches into a cohesive structure, providing a scaffold for the construction of innovative, hybrid intelligent systems. The key strengths of the constituent approaches are as follows:

1. Fuzzy Set Theory allows for imprecise knowledge representation in the form of

fuzzy if-then rules.

2. Neural Networks exhibit learning and adaptive behavior with non-linear modeling

capabilities.

3. Genetic Algorithms provide systematic global search of solution space and are

capable of evolving better candidate solutions starting with random initial

solutions.

4. Machine Learning methods are important for automated knowledge acquisition.

The above capabilities have allowed SC-approaches to successfully confront several real world problems in robotics, space flight, process control, production and aerospace applications [8]. The remaining sections of this chapter provide the necessary background material for the SC-methodologies utilized in the current research effort.

2.2 Genetic Algorithms

Genetic Algorithms (GAs), first proposed by John Holland [9], are stochastic

search techniques applicable to a wide range of optimization problems. Their

methodology is based on the principles and mechanisms of natural genetics and

evolutionary processes. GAs, as general-purpose optimization tools offer several unique

advantages over conventional optimization techniques. GAs combine elements of

directed and stochastic search methods providing a good balance between exploration

and exploitation of the solution space [10]. They are applicable to both continuous and

combinatorial optimization problems. Functional derivative or gradient information is not

required for determining the search direction in these algorithms. This characteristic

makes them a flexible tool for optimizing a large number of objective functions, which

are either not differentiable or whose gradient calculation is computationally expensive.

GAs are stochastic as they incorporate randomness in determining search directions.

Hence, they are less likely to be trapped in local minima, having a good chance of finding the global optimum, given enough computation time. GAs work with solution populations rather than with single members, making them capable of yielding multiple solutions of high quality. These characteristics have made GAs popular agents for optimization, finding many significant applications in both academic research and industry.

2.2.1 Methodology of Genetic Algorithms

The solution or search space contains all feasible solutions. Each point in this

space, called a chromosome, has an associated fitness value that usually equals the

objective function evaluated at that point. GA maintains a population of chromosomes,

which is repeatedly evolved over generations towards better fitness. The next generation

is created from the current population by using genetic operators like crossover and

mutation. Analogous to the evolutionary principle of survival of the fittest, the

chromosomes with a higher fitness value are more likely to survive and participate in the

creation of new populations. This principle ensures that successful chromosomes pass

their good genes to the next generation. The population continuously evolves toward

better fitness, and the algorithm converges to the best chromosome after several

generations. Empirical studies indicate that GAs provide optimal or near optimal

solutions to many optimization problems.

2.2.2 Components of a Genetic Algorithm

This section explains the major components of GAs like encoding schemes, fitness

function evaluations, population selection strategies, genetic operators needed for evolving better solutions in successive generations and termination criteria for determining convergence.

Encoding Scheme

Encoding schemes translate possible solutions into chromosomes, which are

usually binary string representations. Chromosomes (2, 5) and (2, 6) in a two dimensional

search space have possible binary string representations as shown in Table 2.1.

Table 2.1 Binary representations of chromosomes

Chromosome (2, 5) 010101

Chromosome (2, 6) 010110

Alternate non-string encoding schemes are available for integer, floating-point, or

discrete-valued numbers. The design of genetic operators, like crossover and mutation,

depends on the chosen encoding scheme. Hence, the choice of a good coding scheme is

essential for a GA to operate effectively.

Fitness Evaluation

The fitness value of a chromosome is the objective function evaluation of the

decoded chromosome. The objective function plays the role of the environment in

ranking the members of the chromosome population. Fitness evaluation is important in the context of chromosome selection for reproduction.

Selection

This operation is analogous to the evolutionary principle of survival of the fittest: the chromosomes with better fitness values are chosen as parents to produce offspring for the next generation. The roulette wheel selection method chooses individuals according to their objective function value, where the probability of selecting an individual from a population is the ratio of its objective function value to the total objective function value of the whole population [11]. Other available chromosome selection methods are: Boltzmann, tournament, rank and steady-state selection.

Genetic Operators

GAs generate new populations through crossover and mutation. A crossover operator exploits the potential of the current gene pool and retains useful solution features from previous generations. Crossover generates new chromosomes by exchanging segments of two parent chromosomes at a chosen point. Many crossover operators have been proposed, such as partially mapped crossover (PMX), order-based crossover (OBC) and cycle crossover (CX). The mutation operator generates new chromosomes by spontaneous random variation of existing chromosomes. It provides diversity to the

chromosome population and prevents the population from becoming trapped in local minima. Inversion, insertion, reciprocal exchange and scramble are examples of mutation operators.

Termination Criteria

The criteria for termination of GAs are similar to the ones used in general search methods. The algorithm can be terminated after running it for a fixed number of

generations. A predetermined tolerance threshold for error can also be utilized for

determining convergence.

2.2.3 Simple Genetic Algorithm Outline

The functionality of a simple genetic algorithm can be described procedurally as

follows:

1. [Initialize] Generate a random population of chromosomes of predetermined size.

2. [Evaluate] Evaluate the fitness of each individual in the population

3. [Repopulation] This step has the following sub-steps:

a) [Select] Select two members of the population with probabilities

proportional to their fitness values.

b) [Crossover] Apply crossover with a probability equal to the crossover rate.

c) [Mutation] Apply mutation with a probability equal to the mutation rate.

d) [Loop] Repeat (a) to (c) until the new population is generated.

4. [Loop] Repeat steps 2 and 3 until the termination criteria are met.

Many parameters and settings need to be decided in building a practical GA. These are decided based on problem characteristics and heuristics like De Jong’s guidelines

[12].
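The following minimal Python sketch illustrates the outlined procedure. The binary chromosomes, the toy fitness function (count of one-bits) and all parameter values are purely illustrative choices, not those used elsewhere in this thesis.

    import random

    # Minimal sketch of the simple GA outlined above (illustrative only).
    CHROM_LEN, POP_SIZE, GENERATIONS = 12, 20, 50
    CROSSOVER_RATE, MUTATION_RATE = 0.8, 0.05

    def fitness(chrom):
        # Toy objective: number of 1-bits (stands in for the problem's objective).
        return sum(chrom)

    def roulette_select(pop, fits):
        # Probability of selection proportional to fitness.
        pick = random.uniform(0, sum(fits))
        running = 0.0
        for chrom, fit in zip(pop, fits):
            running += fit
            if running >= pick:
                return chrom
        return pop[-1]

    def crossover(p1, p2):
        # Single-point crossover applied with probability CROSSOVER_RATE.
        if random.random() < CROSSOVER_RATE:
            point = random.randint(1, CHROM_LEN - 1)
            return p1[:point] + p2[point:], p2[:point] + p1[point:]
        return p1[:], p2[:]

    def mutate(chrom):
        # Bit-flip mutation applied gene by gene with probability MUTATION_RATE.
        return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

    # 1. Initialize a random population of chromosomes.
    population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
    for gen in range(GENERATIONS):
        # 2. Evaluate the fitness of each individual.
        fits = [fitness(c) for c in population]
        # 3. Repopulate via selection, crossover and mutation.
        next_pop = []
        while len(next_pop) < POP_SIZE:
            a, b = roulette_select(population, fits), roulette_select(population, fits)
            c1, c2 = crossover(a, b)
            next_pop.extend([mutate(c1), mutate(c2)])
        population = next_pop[:POP_SIZE]

    best = max(population, key=fitness)
    print("Best chromosome:", best, "fitness:", fitness(best))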

2.3 Machine Learning

The ability to learn from examples and form a model of the world is the foundation

of biological intelligence. This model of the world, often implicit, allows us to adapt to a

dynamically changing environment and is necessary for our survival. Artificial

Intelligence (AI) aims at constructing artifacts (machines, programs) that have the

capability to learn, adapt and exhibit human-like intelligence. Hence, learning is

important for practical applications of AI. The field of machine learning is the study of

methods for programming computers to learn [13]. Many important algorithms have been

developed and successfully applied to diverse learning tasks such as speech recognition,

game playing, medical diagnosis, financial forecasting and industrial control [14].

A learning system is given a set of examples encoded in machine-readable format, referred to as a training set, as input. The system then generates a model or hypothesis to

perform a task of interest (pattern recognition, classification, prediction, etc). The model

is evaluated based on its ability to generalize correctly on examples not used for training

purposes. Hence, predictive accuracy or generalization is an important criterion in

evaluating alternate machine learning schemes. An accurate model also allows us to gain insight into the problem domain, and this places a special emphasis on the criterion of

comprehensibility. It refers to the ease of understanding the model by a human user and serves the purpose of validation, knowledge discovery and refinement. Fayyad et al. [15] assert that inductive learning with a focus on comprehensibility is a central activity in the developing area of knowledge discovery in databases and data mining.

Although a wide choice of machine learning schemes is available, they differ significantly in terms of predictive accuracy, comprehensibility and ease of implementation across different problem domains. The selection of a suitable learning algorithm for a specific problem is based on considerable experimentation with different learning algorithms and evaluation of the induced models in terms of predictive accuracy, comprehensibility and other possible criteria.

The following sub-sections present two important symbolic machine learning methods, decision tree induction and attribute oriented induction. Neural networks, the main class of non-symbolic machine learning tools used in this research are covered in a later section in this chapter.

2.3.1 Decision Tree Induction

Decision trees are among the most popular symbolic machine learning algorithms.

They express the learned hypothesis or target function using a unique representation format known as a decision tree. Decision trees can easily be compiled into simple if-then

rules for improving human comprehensibility. They have been successfully applied to a

variety of learning tasks from diagnosis of medical cases to learning to assess the credit risk of loan applicants [16].

Table 2.2 depicts the training data for the target concept, PlayTennis. The corresponding decision tree (Figure 2.1) classifies Saturday mornings as suitable or unsuitable for playing a game of tennis.

Table 2.2 Training set for the PlayTennis concept

Outlook    Temperature  Humidity  Wind    PlayTennis
Sunny      Hot          High      Weak    No
Sunny      Hot          High      Strong  No
Overcast   Hot          High      Weak    Yes
Rain       Mild         High      Weak    Yes
Rain       Cool         Normal    Weak    Yes
Rain       Cool         Normal    Strong  No
Overcast   Cool         Normal    Strong  Yes
Sunny      Mild         High      Weak    No
Sunny      Cool         Normal    Weak    Yes
Rain       Mild         Normal    Weak    Yes
Sunny      Mild         Normal    Strong  Yes
Overcast   Mild         High      Strong  Yes
Overcast   Hot          Normal    Weak    Yes
Rain       Mild         High      Strong  No

Each node in the tree, indicated by an oval, specifies a logical test based on some

attribute or feature in the problem. The tree has a root node, outlook, having three

possible attribute values: sunny, overcast and rain. Each of the outgoing branches from a node corresponds to one of the possible values of the attribute. Hence, the root node has three branches. A tree also has a set of leaf nodes, which represent the outcome of the classifier (i.e., the decision to play tennis or not). The classification of an instance involves traversal through the tree starting at the root node until a leaf node is encountered. The example instance (outlook = sunny, humidity = high) would follow the leftmost branch of the depicted tree in Figure 2.1.

[Figure: decision tree with root node Outlook; the Sunny branch leads to a Humidity test (High → No, Normal → Yes), the Overcast branch leads to Yes, and the Rain branch leads to a Wind test (Strong → No, Weak → Yes).]

Figure 2.1 Decision tree representation of PlayTennis concept

Adapted from Quinlan [16]

The tree predicts the target concept PlayTennis = No, indicating weather conditions unsuitable for playing tennis. Also, the attribute temperature was not utilized in constructing the tree, indicating its insignificance in the decision-making process.

Many decision tree induction methods have been developed in the last two decades with different capabilities and requirements. The ID3 algorithm is the core algorithm on which many variants have been developed. The algorithm constructs a decision tree in a top-down fashion by recursively partitioning the instances at each node. The determination of an attribute for partitioning the instance space is an important aspect in decision tree induction. ID3 uses a statistical property called information gain to select among candidate attributes in constructing a tree. Information gain provides a criterion that measures the effectiveness of an attribute in classifying the training instances.

Let S denote the set of training instances. The information gain InfoGain(T) obtained by choosing an attribute T for splitting the set S is given in [17] as:

InfoGain(T) = info(S) − info_T(S)   (Eq. 2.1)

In the above equation, info(S) is the amount of information needed to classify an instance in S, and info_T(S) is the corresponding measure after partitioning the set S based

on attribute T. If the set S contains examples from k classes, the information content is:


info(S) = -\sum_{j=1}^{k} \frac{freq(C_j, S)}{|S|} \log_2\left(\frac{freq(C_j, S)}{|S|}\right)   (Eq. 2.2)

where j ranges over the k classes and freq(C_j, S) denotes the number of examples in S belonging to class C_j. Given a partition based on attribute T, the expected value of the information over the induced n subsets is given by:

info_T(S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, info(S_i)   (Eq. 2.3)

In the above expression, S_i is the subset of examples in S having the ith outcome of attribute T, and i ranges over the n subsets.

The procedure of selecting a splitting attribute and partitioning the training instance set is recursively done for each internal node. Only the examples, which reach that node

(i.e., the examples that satisfy logical tests on the path), are used in attribute selection.

This process continues until either of these two criteria is met:

1. Every available attribute has been included in the tree path

2. The training instances at a given node belong to the same class. If so, the node is

labeled as a leaf node.

The C4.5 algorithm [18] is similar to the ID3 algorithm, but employs a variation of the information gain criterion called the gain ratio. Further, it can handle continuous attribute value ranges, pruning of decision trees and rule derivation.
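The following short Python sketch applies Eq. 2.1 through Eq. 2.3 to the PlayTennis data of Table 2.2. It illustrates the information gain computation only, not the full recursive ID3 algorithm.

    from collections import Counter
    from math import log2

    # Each example is (Outlook, Temperature, Humidity, Wind, PlayTennis) from Table 2.2.
    data = [
        ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
    ]
    ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

    def info(examples):
        # Eq. 2.2: entropy of the class distribution in the example set.
        counts = Counter(ex[-1] for ex in examples)
        total = len(examples)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    def info_gain(examples, attr):
        # Eq. 2.3 (expected information after the split) followed by Eq. 2.1.
        idx, total = ATTRS[attr], len(examples)
        subsets = {}
        for ex in examples:
            subsets.setdefault(ex[idx], []).append(ex)
        info_t = sum(len(s) / total * info(s) for s in subsets.values())
        return info(examples) - info_t

    for attr in ATTRS:
        print(f"InfoGain({attr}) = {info_gain(data, attr):.3f}")
    # Outlook has the highest gain (about 0.247), so ID3 places it at the root.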

2.3.2 Attribute-Oriented Induction

When the training set for learning is provided as a database, the task of inducing hypotheses describing the data is called data mining. In real-world applications, databases are predominantly used for representing and maintaining information. Often, the information is enormous, noisy, uncertain, and can involve missing values. A growing need for knowledge discovery in databases led to the rapid development and adaptation of special-purpose machine learning techniques suited for databases.

Attribute-oriented induction (AOI) is a technique for mining knowledge from relational databases by inducing characteristic and classification rules describing the hypothesis. It is a set-oriented method that generalizes the task-relevant subset of data, attribute-by-attribute, into a general relation [19]. AOI is a data-driven induction process capable of generalization to a desired level of abstraction. The method integrates machine learning concepts like induction, generalization, concept hierarchies and database operations to discover rules. Figure 2.2 shows the inputs and output of the AOI method.


[Figure: a database, queries, a list of attributes and a concept hierarchy are input to the attribute-oriented induction process, which outputs a generalized relation.]

Figure 2.2 Scheme of attribute-oriented induction

This technique requires the provision of domain knowledge in the form of concept hierarchies for obtaining generalized relations. Concept hierarchies can be explicitly given by experts or automatically generated by data analysis [20]. AOI is capable of utilizing these concept hierarchies to generate logical rules. The following are some of the key steps in the induction algorithm:

1. Concept Tree Ascension: Generalize the relation by eliminating identical

tuples, using a predetermined threshold to control the generalization process.

2. Vote Aggregation: The number of identical tuples being merged during the tree

ascension is important for the learning task. A counter is maintained indicating

the number of tuples in the initial relation that are generalized to the current

relation.

3. Simplification: The generalized relation is simplified by merging nearly

identical tuples (i.e., tuples differing in the value of one attribute).

4. Rule Transformation: The final relation obtained is transformed into a logical rule in the format

desired by the user.

The method is capable of generating two types of induction rules: learning characteristic rules (LCHR) and learning classification rules (LCLR) [21]. Their induction procedures are similar, differing in the attribute generalization process.

A detailed description of the AOI methodology can be found in [19] and [21].
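The concept tree ascension and vote aggregation steps can be sketched as follows. The relation, attributes and concept hierarchy in this fragment are invented for illustration and are not the datasets used later in this thesis.

    from collections import Counter

    # Hypothetical task-relevant relation: tuples of (ProcessTime, Machine).
    tuples = [(3, "M1"), (5, "M1"), (4, "M2"), (9, "M2"), (11, "M3"), (12, "M3")]

    def generalize_time(t):
        # Concept hierarchy for ProcessTime: numeric value -> {Short, Medium, Long}.
        return "Short" if t <= 4 else "Medium" if t <= 9 else "Long"

    # Concept tree ascension: replace attribute values by higher-level concepts,
    # then merge identical tuples while keeping a vote count (vote aggregation).
    generalized = Counter((generalize_time(t), m) for t, m in tuples)
    for (time_class, machine), votes in generalized.items():
        print(f"IF ProcessTime = {time_class} AND Machine = {machine}  [votes: {votes}]")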

2.4 Artificial Neural Networks

2.4.1 Neural Computation

The motivation for the early development of neural networks stemmed from the desire to mimic the functionality of the human brain. A neural network is an intelligent data-driven modeling tool that is able to capture and represent complex and non-linear input/output relationships. Neural networks are used in many important applications, such as function approximation, pattern recognition and classification, memory recall, prediction, optimization and noise-filtering. They are used in many commercial products such as modems, image-processing and recognition systems, speech recognition software, data mining, knowledge acquisition systems and medical instrumentation, etc [22].

A neural network is composed of several layers of processing elements or nodes.

These nodes are linked by connections, with each connection having an associated weight, Wi. The weight of a connection is a measure of its strength and its sign is

indicative of the excitation or inhibition potential. Figure 2.3 shows a simple perceptron having n inputs, {X1, X2… Xi… Xn}.

[Figure: a single node with inputs X1, …, Xn, connection weights W1, …, Wn and threshold θ; the node output is the transfer function f applied to the weighted sum of the inputs.]

Figure 2.3 Computation at a node

The perceptron has a threshold or bias, θ, which is the net input value required to produce a non-zero activation. The net input to a perceptron, net_i, is given by:

net_i = \sum_i W_i X_i + \theta   (Eq. 2.1)


A transfer function f maps the net input to an output value O, which is the activation or output of the perceptron. It is given by:

Output, O = f(net_i)   (Eq. 2.2)

Neural networks have two distinct phases of operation: training and production.

Some design parameters need to be chosen before training the network. These include:

1. System architecture or topology: The number of nodes in each layer and

corresponding transfer functions.

2. Training Algorithm: The training algorithm and the performance measure or the

cost function.

3. Generalization Considerations: The number of epochs or cycles needed to ensure

good generalization and criteria for termination of training phase.

Parameters like weights and biases are modified during the training phase. The network uses problem data to assign values to these parameters. The distinguishing characteristic of neural networks is their adaptivity, which requires a unique information flow design depicted in Figure 2.4. The performance feedback loop utilizes a cost function to provide a measure of deviation between the calculated output and the desired output. This performance feedback is utilized directly to adapt the parameters, W and θ, so that the system output improves with respect to the desired goal.


[Figure: the input is fed to the neural network (parameters W, θ) to produce an output; a cost function compares the output with the desired response, and the resulting error drives the training algorithm, which adjusts the network parameters.]

Figure 2.4 Information flow for training phase

Adapted from Principe et al. [23]

Once a network is trained, it is ready for the production phase. The task of the network in the production phase is to produce an output, given an input, based on the model or hypothesis learned during training. It is important to note that unlike in the training phase, the network parameters remain unchanged during the production phase. In this thesis, neural networks are used for classification. The subsequent sections explain the architecture of the multi-layer perceptron classifier, the learning algorithms employed for training and the generalization considerations.

2.4.2 The Multi-Layer Perceptron Classifier

A multi-layer perceptron (MLP) consists of a cascade of perceptrons arranged in layers. A single hidden layer network is illustrated in Figure 2.5. The input layer contains nodes that represent the features of the given problem. A real-valued feature is represented by a single node, whereas a discrete feature with n distinct values is represented by n input nodes. The hidden layer maps the input to another space, which forms the input region for the output layer.

[Figure: a multi-layer perceptron with input nodes X1–X5, a single hidden layer and one output node Y.]

Figure 2.5 A Multi-layer perceptron


The output layer represents the decision of the classifier. A single output node, as shown in Figure 2.5, allows us to determine the class membership of a given input vector

(i.e., whether a given input belongs to a predetermined class or not). For n distinct classes, each output node represents a possible class and hence, n output nodes are needed. A winner-take-all heuristic is used to determine the class membership in such cases. The class of the output node with maximum activation is the class computed by the network.

MLPs have been proven to be universal approximators [24], capable of approximating any continuous function to arbitrary accuracy. This is only possible with the choice of non-linear transfer functions. Two of the most commonly used functions are the logistic function and the hyperbolic tangent function. The main difference between these two functions is the range of their output values, as illustrated in Figure 2.6 for net input in the range [-4, 4].


Figure 2.6 Logistic and hyperbolic tangent transfer functions

The logistic function has an output range [0, 1], and the activation of a node, ai is given by:

a_i = \frac{1}{1 + e^{-net_i}}   (Eq. 2.3)

The hyperbolic tangent function compresses a unit’s net input into an activation value in the range [-1, 1]:

a_i = \frac{e^{net_i} - e^{-net_i}}{e^{net_i} + e^{-net_i}}   (Eq. 2.4)
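The computation at a single node can be illustrated with the short sketch below. The weights, inputs and bias are arbitrary illustrative values, not parameters of any network used in this thesis.

    import math

    # Net input and activation at a single node (Eq. 2.1 through Eq. 2.4).
    weights = [0.4, -0.7, 0.2]   # hypothetical connection weights W_i
    inputs  = [1.0, 0.5, -1.5]   # hypothetical inputs X_i
    bias    = 0.1                # threshold / bias theta

    net_input = sum(w * x for w, x in zip(weights, inputs)) + bias

    logistic = 1.0 / (1.0 + math.exp(-net_input))   # output in [0, 1]
    tanh_act = math.tanh(net_input)                 # output in [-1, 1]
    print(f"net = {net_input:.3f}, logistic = {logistic:.3f}, tanh = {tanh_act:.3f}")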

2.4.3 Neural-Network Training

The training phase in neural networks provides the answer to the following questions: Is there a set of network parameters (weights and biases) that allow a network to map a given a set of input patterns to desired outputs? If so, how are the parameters determined? The most commonly used training algorithm is the backpropagation algorithm, first discussed by Rumelhart et al. [25]. The term back-propagation refers to the direction of propagation of error. The goal of the training regimen is to adjust the weights and biases of the network to minimize the cost function. Though several cost functions are available, the function appropriate for classification problems is the cross- entropy function [26]:

E = -\sum_{p}\sum_{i} \left[ t_{pi} \ln(y_{pi}) + (1 - t_{pi}) \ln(1 - y_{pi}) \right]   (Eq. 2.5)

In the above equation, E is the cross-entropy cost function, p indexes the training patterns and i indexes the classes. The term y_pi is the estimated probability that input pattern p belongs to class i, and t_pi is the corresponding target in the range [0, 1]. The network output is interpreted as the probability that the given input pattern belongs to a certain class.

The cost function E needs to be minimized, and its derivative with respect to each weight is calculated and denoted by ∂E/∂w. Having obtained the derivative, the problem of

adjusting the weights is an optimization problem. Back-propagation uses a form of gradient descent to update weights according to the formula:

\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}   (Eq. 2.6)

The term w_ij denotes the weight from node i to node j, η > 0 is the learning rate and ∂E/∂w_ij is the derivative of the error E with respect to the weight w_ij. The network is initialized with random weights, and the training algorithm modifies the weights according to the above-discussed procedure. Many alternative optimization techniques have been utilized; variations of the basic method include the conjugate-gradient method, momentum learning, etc. Stochastic search algorithms, like simulated annealing and genetic algorithms, have also been applied in [27] and [28] to avoid the problem of convergence to local minima. However, these methods are costly to implement, as they are global optimization procedures and hence require longer training times.
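As an illustration, the sketch below performs one gradient-descent update (Eq. 2.6) for a single output node with a logistic activation and the cross-entropy cost of Eq. 2.5, for which the error term at the output reduces to (y − t). The weights, inputs, target and learning rate are arbitrary illustrative values.

    import math

    def logistic(net):
        return 1.0 / (1.0 + math.exp(-net))

    eta = 0.1            # learning rate (hypothetical)
    w   = [0.2, -0.4]    # weights into the output node (hypothetical)
    x   = [1.0, 0.5]     # activations feeding the node
    t   = 1.0            # target class probability

    y = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    delta = y - t                                   # dE/dnet for cross-entropy + logistic
    grads = [delta * xi for xi in x]                # dE/dw_ij = delta * x_i
    w = [wi - eta * g for wi, g in zip(w, grads)]   # delta_w_ij = -eta * dE/dw_ij
    print("updated weights:", w)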

2.4.4 Generalization Considerations

The collection of input pattern-desired response pairs used to train the learning system is called the training set. The testing set contains examples not used for the training purpose and is used to evaluate the generalization capabilities of the network.

Vapnik [29] indicates that the performance of a network trained with back-propagation on the training set

always improves with the number of training cycles. However, the error on the testing set initially decreases with the number of cycles, and then increases as shown in Figure 2.7.

[Figure: prediction error versus number of training cycles for the training and testing sets; the stopping point is where the testing-set error begins to rise.]

Figure 2.7 Cross validation for termination

This phenomenon is called overtraining and is indicative of poor generalization capabilities. One solution to this problem is to split the training set into two sets – the training set and validation set. After every fixed number of iterations, the error on the validation set is calculated. Training is terminated when this error starts to increase. This method is called early stopping or stopping with cross-validation.
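A minimal sketch of this early-stopping procedure is given below. The routines train_one_epoch and validation_error are placeholders standing in for the actual network training and evaluation steps; the patience parameter is an illustrative choice.

    def early_stopping_train(train_one_epoch, validation_error, max_epochs=1000, patience=10):
        # Train until the validation-set error stops improving for `patience` epochs.
        best_err, best_epoch, epochs_since_best = float("inf"), 0, 0
        for epoch in range(max_epochs):
            train_one_epoch()            # one pass of back-propagation on the training set
            err = validation_error()     # error on the held-out validation set
            if err < best_err:
                best_err, best_epoch, epochs_since_best = err, epoch, 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:   # validation error has started to rise
                    break
        return best_epoch, best_err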


2.5 Rule Extraction in Neural Networks

2.5.1 The Rule-Extraction Task

A neural network captures task-relevant knowledge as part of its training regimen.

This embedded knowledge is encoded in the network as:

• The architecture or topology of the network.

• The transfer functions used for non-linear mapping.

• A set of network parameters (weights and biases).

The knowledge represents the hypothesis or model learned by the network. Usually these models are difficult to understand because the processing in a neural network occurs at the sub-symbolic level as numerical estimation and manipulation of network parameters. It may not always be possible to directly translate these large sets of real valued parameters into symbols or concepts that have semantic significance. The nonlinear mapping between the input features and target concept is represented by the hidden units in the network. Thus, hidden units represent higher-level derived features, which may not correspond to known features in the problem domain.

Inducing comprehensible models from neural networks is vital for a variety of reasons. Comprehensibility is important for model validation and knowledge discovery. It is often mandatory in safety-critical applications like medical diagnostics, guided missile systems, and air traffic control that the system’s behavior be completely transparent. The system’s output should be validated under all possible input conditions. Lack of

transparency and inability to provide an end user explanation capability have generally been recognized as the most significant obstacles to the more widespread application of neural networks. Thus, a clear need exists for the development of user friendly shells for neural network based learning systems.

The goal of rule extraction approaches is to translate the hypothesis learnt by the network into symbolic inference rules. Craven [17] defines the task of rule extraction as follows: “Given a trained neural network and the data on which it was trained, produce a description of the network’s hypothesis that is comprehensible yet closely approximates the network’s predictive behavior.” The proliferation of rule extraction techniques has prompted researchers [30] and [31] to develop criteria to evaluate the proposed algorithms and their extracted knowledge representations, as summarized below:

1. Comprehensibility: The extent to which the extracted representations are humanly

comprehensible.

2. Expressive power: The structure of the output presented to the end-user. Various

representation formats like simple propositional rules, M-of-N rules, fuzzy

inference rules, decision trees, etc. can be used based on the problem domain.

3. Fidelity: The ability of the extracted representations to mimic the behavior of the

original network.

4. Predictive Accuracy: The generalization capabilities of the extracted

representations.

5. Scalability: The ability of the rule extraction method in adapting to different

problem sizes (dimensionality of the input space, number of processing elements

etc).

6. Generality: The degree to which a rule extraction method imposes special

requirements like tailored training regimens or restrictions on network

architecture.

2.5.2 Approaches to Rule Extraction

The first attempt at rule extraction from neural networks can be traced to a paper by

Gallant [32] on connectionist expert systems. Classification rules describing the network’s behavior were obtained by analyzing the role of attribute ordering in correctly classifying a problem. A variety of rule extraction methods have been developed since then for addressing the problem of comprehensibility in neural networks. Andrews et al.

[33] classify rule extraction approaches into the following three categories, based on the view taken by the algorithms of the underlying network topology: decompositional, pedagogical and eclectic.

Decompositional methods extract rules at the level of each individual hidden and output unit within the trained neural network. These rules are then combined to describe the behavior of the overall network. As this approach is based on the analysis of the architecture of the network, it can be considered as a local approach to rule extraction.

Most approaches within this category employ a search procedure for finding subsets of

incoming weights that exceed the bias or threshold on a node. The identified subsets of such activations are translated into propositional rules. The subset method by Fu [34] and the M-of-N algorithm developed by Towell and Shavlik [35] are generic representatives of this category. The subset method extracts simple propositional rules. The M-of-N algorithm, as the name suggests, is capable of extracting m-of-n rules. An m-of-n expression is satisfied when m of the possible n antecedents are satisfied. Setiono [36] extracts rules by first clustering the activation values of hidden units. Then the network is repeatedly split into sub-networks for ease of analysis. The RULEX technique developed by Andrews and Geva [37] directly interprets the weight vectors as rules. This technique can be used only for a particular type of network called the Constrained

Error Back-propagation (CEBP) perceptron. Though simple in conception, the decompositional approach to rule extraction has various limitations. The algorithmic complexity increases exponentially with network complexity. Various restrictions are imposed on the network architecture and the training procedures, which adversely affect the generalization capabilities of the neural network.

Pedagogical techniques extract rules that map network inputs to outputs directly, effectively treating the neural network as a black box. In Saito and Nakano’s approach

[38], useful rules are selected from a candidate rule set that is generated by examining input activation values of the network which activate a given output unit. Craven and

Shavlik’s Rule-extraction-as-learning [39] is a pedagogical approach, which exploits the property that networks can be queried. Instead of using a search procedure, the rule

extraction process is driven by sampling and queries to extract conjunctive rules from a trained network. This approach is less computationally intensive than search-based methods. Validity Interval Analysis (VI-Analysis), proposed by Thrun [40], extracts rules by a generate-and-test procedure, propagating validity intervals through the network.

Linear programming is used to determine if the set of proposed validity intervals are consistent with the network’s activation values on all nodes. The RULENEG approach developed by Pop et al. [41] focuses on extracting conjunctive rules from a neural network. The algorithm is based on the observation that changing the truth value of one of the antecedents in a conjunctive rule changes the consequent of the rule.

Several pedagogical approaches have also been developed for extracting decision tree representations of the neural network. Craven and Shavlik [42] extract decision trees from trained neural networks using a novel algorithm named TREPAN. This algorithm employs a greedy gain ratio criterion for evaluating attribute splits. Binary and M-of-N decision trees can be derived by this method. The ANN-DT (Artificial Neural Network -

Decision Tree) algorithm proposed by Schmitz et al. [43] is capable of growing binary decision trees from neural networks by using attribute selection criteria based on significance analysis for continuous-valued features. The DecText (Decision Tree

Extractor) algorithm [44] is effective in extracting high-fidelity trees from trained networks. The paper also proposes different criteria for selecting an attribute to partition the training data.

The third category of rule extraction techniques, labeled eclectic approaches, combines elements of the basic categories discussed above. The BRAINNE system proposed by Sestito and Dillon [45] extracts simple if-then rules. The method uses a unique approach to handle continuous data without discretization. In the genetic algorithm based rule extraction approach developed by Keedwell et al. [46], genes contain the weights between two adjacent layers. Chromosomes are then constructed to represent a path from the input layer to the output layer. The fitness function is calculated as the product of the weights from the input to the output layer. The algorithm identifies the fittest chromosomes, which are then mapped into if-then rules. A major limitation of this method is that only single-antecedent rules can be extracted.

2.5.3 Validity Interval Analysis

One of the more popular rule extraction and refinement techniques is the Validity

Interval Analysis (VI-Analysis) [40]. The underlying idea of the method is similar to sensitivity analysis. A systematic variation of the inputs is undertaken to examine the changes in network classification. This process helps to characterize the neural network in terms of symbolic if-then rules. A key advantage is that the ANN does not need a specialized training regimen or restricted architecture as required by other rule extraction algorithms. 48

VI-Analysis provides a general procedure to check the consistency of proposed rules with a neural network. The basic procedure of the VI-Analysis algorithm adapted from [33] and [40] is as follows:

1. Generation of candidate rule set: For rule extraction, the first step is the

generation of a feasible rule set. For small discrete domains, a simple enumeration

of all possible rules can be undertaken. Alternately, for larger domains, a

procedure based on the properties of directed-acyclic graphs can be employed.

For continuous domains, VI-Analysis generates rules by iteratively growing the

antecedent of any arbitrary rule.

2. Validity interval assignment: A candidate rule is translated into a set of validity

intervals to be specified on the input and output nodes of the network. A validity

interval on a node is the range of its activation values. Based on the rule to be

verified, validity intervals are assigned to all input and output nodes in a network.

3. Interval refinement: These proposed validity intervals are refined by propagating

them through the network in two phases: forward and backward. Validity

intervals represent constraints on the activation values of all nodes in the network.

The novel idea of VI-Analysis is the employment of linear programming

techniques to generate and refine validity intervals in the subsequent layers in the

direction of interval propagation.

4. Rule validation based on convergence: There are two possible outcomes to the

previous step. VI-Analysis converges, validating the proposed rule. Otherwise, a

contradiction is found, proving that the constraints imposed by the proposed

validity intervals are inconsistent with the behavior of the network. The rule is

rejected and steps 2-4 are repeated with another candidate rule.

VI-Analysis is designed as a general-purpose rule verification procedure and has been successfully applied to both discrete and continuous classification problems. A drawback of this method is its computational intensity, requiring many calls to an optimization module. In addition, the activation levels of the nodes are assumed to be independent of one another. This assumption is not always valid and the algorithm may not find maximally general rules. Maire [47] shows that VI-Analysis always converges in one run

(forward and backward phase) for single layer networks and has an exponential rate of convergence for multilayer networks.
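As a much-simplified illustration of the forward phase of interval refinement, the sketch below propagates validity intervals through one layer of a network with a monotone logistic transfer function using interval arithmetic. The full method described above refines intervals in both directions with linear programming; the weights, biases and intervals used here are arbitrary illustrative values.

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward_intervals(input_intervals, weights, biases):
        # Propagate [lo, hi] intervals on the inputs to intervals on the next layer.
        out = []
        for w_row, b in zip(weights, biases):
            lo = hi = b
            for (x_lo, x_hi), w in zip(input_intervals, w_row):
                lo += w * (x_lo if w >= 0 else x_hi)   # smallest possible contribution
                hi += w * (x_hi if w >= 0 else x_lo)   # largest possible contribution
            out.append((logistic(lo), logistic(hi)))   # logistic is monotone increasing
        return out

    # Hypothetical layer with two inputs (constrained to the given intervals) and two nodes.
    print(forward_intervals([(0.0, 0.3), (0.7, 1.0)],
                            [[1.5, -2.0], [0.5, 1.0]], [0.1, -0.2]))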

2.5.4 Extraction of Decision Tree Representations

Many of the approaches for extracting decision tree representations of trained neural networks are based on the idea of sampling the neural network model of training data and inducing decision trees. The ANN-DT algorithm generates univariate decision trees and a schematic representation of the algorithm adapted from [43] is shown in

Figure 2.8.


[Figure 2.8 block diagram: train the neural network on the original data set; use interpolated data to sample the neural network; selection of attribute and selection of split point; estimate neighborhood areas for S1; extraction of a binary decision tree.]

Figure 2.8 Schematic representation of the ANN-DT algorithm.

Adapted from Schultz et al. [43]

As illustrated in the above figure, the sampled data set S is split into two data sets

S1 and S2, based on the selected attribute. The main steps in the ANN-DT algorithm are as follows:

1. Interpolation of Correlated Data: An artificial data set is prepared by random

sampling of the feature space. For these exemplars, the class label is obtained by

querying the neural network modeling the training data.

2. Selection of Attribute: For discrete output classes, a normalized measure of

information gain referred to as gain ratio is used for selecting the attribute. An

alternate method based on analysis of attribute significance can also be used.

3. Stopping Criteria: The selected attribute splits the current set of data into two

subsets. By recursive splitting of data, a decision tree is generated. For discrete

classes, the process is terminated when an internal node contains data with one

output class. For continuous outputs, termination occurs when the variance of the output values in the node's data is zero.

For a number of classification tasks, the ANN-DT algorithm outperformed the standard decision tree induction algorithms ID3 and C4.5 in predictive accuracy. In addition, the extracted trees were faithful representations of the original neural network.
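The core sampling-and-induction idea behind such methods can be sketched as follows. This illustrative Python fragment uses scikit-learn's DecisionTreeClassifier as a stand-in for the tree learner and a toy oracle function in place of a trained network; it is not the ANN-DT algorithm itself, which uses gain-ratio or significance-based attribute selection and its own stopping criteria.

# Illustrative sketch of sampling a trained model and inducing a tree.
# The 'oracle' stands in for a trained neural network.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def sample_and_induce(oracle, feature_ranges, n_samples=5000, seed=0):
    """Randomly sample the feature space, label the samples with the model
    oracle, and fit a univariate decision tree to the labelled sample."""
    rng = np.random.default_rng(seed)
    lows = np.array([lo for lo, hi in feature_ranges])
    highs = np.array([hi for lo, hi in feature_ranges])
    X = rng.uniform(lows, highs, size=(n_samples, len(feature_ranges)))
    y = oracle(X)                              # class labels from the model
    return DecisionTreeClassifier().fit(X, y)

def toy_oracle(X):                             # placeholder for a trained ANN
    return (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

tree = sample_and_induce(toy_oracle, [(0.0, 2.0), (0.0, 2.0)])
print(tree.get_depth(), "levels in the extracted tree")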

CHAPTER 3. APPROACHES TO THE JOB-SHOP

SCHEDULING PROBLEM

3.1 The Classical Job Shop Scheduling Problem (JSSP)

3.1.1 Problem Formulation

The deterministic job shop scheduling problem (JSSP) is one of the classical problems in the scheduling literature. JSSP consists of a finite set of n jobs to be processed on a finite set of m machines and is denoted as an n x m problem. The routing of a job is a predetermined sequence of operations. Each operation is processed on a specified machine and has a fixed processing time. The job routings and the associated processing times are given by a process plan. JSSP is a constrained optimization problem (COP) where the precedence constraints on the problem are given by the job routings. Capacity or disjunctive constraints require that each machine process only one operation at any given time. Other assumptions include the following:

• Machine repetitions by a job are not allowed.

• Machine absences are not allowed (i.e., each job is processed on every machine).

• Uninterrupted processing of operations without preemption.

• No machine breakdowns throughout the scheduling process.

• Transportation time between machines is zero.

• The job shop is static and deterministic in nature i.e., there is no randomness

involved in determining all the necessary parameters for defining the job shop

problem.

Table 3.1 is an example of a 3x3 JSSP. The data in the table is in the format:

{Machine, Processing Time} and shows the routing of each job as operations to be performed on a specified machine and the processing time required for that operation.

Table 3.1 A 3 x 3 job-shop problem

            Operation
Job       1        2        3
1         1,3      2,3      3,3
2         1,2      3,3      2,4
3         2,3      1,2      3,4

The order and the processing times of the operations for job 2 are interpreted from the above table as follows: first, job 2 is processed on machine 1 for 2 time units, then on machine 3 for 3 time units and finally on machine 2 for 4 time units. The disjunctive graph representation, G = {O, A, E}, proposed by Roy and Sussman [48] is one of the most popular models used for describing job shop scheduling instances. It is a node-weighted graph, where the vertices in O represent the operations of the different jobs.

Each vertex has an assigned weight denoting the processing time for that operation. A is a

set of conjunctive arcs representing the precedence constraints for each job. The disjunctive edges in E connect pairs of operations to be processed on the same machine. The disjunctive graph representation of the 3 x 3 scheduling instance shown in Table 3.1 is illustrated in Figure 3.1.

[Figure 3.1: operation nodes O11-O13, O21-O23 and O31-O33 for the three jobs, connected between fictitious Source and Sink nodes.]

Figure 3.1 The disjunctive graph representation of the 3 x 3 problem.

Adapted from Yamada and Nakano [49]

In the above figure, the conjunctive constraints are given by solid arrows and the dashed arrows indicate the disjunctive constraints. Two fictitious nodes, the source and sink nodes, are added to the graph to represent the starting and ending operations. A

schedule is the feasible solution to the specified constraints. In the disjunctive graph model, a schedule is obtained by transforming the disjunctive edges into conjunctive constraints by selection. A feasible schedule to the above problem is depicted as a Gantt chart in Figure 3.2.


Figure 3.2 Gantt chart representation of a schedule
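A minimal sketch of the disjunctive graph data structures for the 3 x 3 instance of Table 3.1 is given below. The Python structures and names are illustrative only and are not taken from the thesis.

# Illustrative construction of G = {O, A, E} for the 3 x 3 instance of Table 3.1.
from itertools import combinations

# routing[job] = [(machine, processing_time), ...] as in Table 3.1
routing = {
    1: [(1, 3), (2, 3), (3, 3)],
    2: [(1, 2), (3, 3), (2, 4)],
    3: [(2, 3), (1, 2), (3, 4)],
}

O = {}                # operation (job, position) -> processing time (node weight)
A = []                # conjunctive arcs (precedence within a job, plus source/sink)
machine_ops = {}      # machine -> operations processed on it

for job, ops in routing.items():
    for pos, (machine, p_time) in enumerate(ops, start=1):
        O[(job, pos)] = p_time
        machine_ops.setdefault(machine, []).append((job, pos))
        if pos > 1:
            A.append(((job, pos - 1), (job, pos)))
    A.append((("source", 0), (job, 1)))           # fictitious source node
    A.append(((job, len(ops)), ("sink", 0)))      # fictitious sink node

# disjunctive edges: every pair of operations sharing a machine
E = [pair for ops in machine_ops.values() for pair in combinations(ops, 2)]

print(len(O), "operations,", len(A), "conjunctive arcs,", len(E), "disjunctive edges")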

Using the four-field notation of Conway et al. [50], JSSP can be represented as

(n/m/G/Cmax). In this tuple, n denotes the number of jobs to be scheduled on m machines.

G refers to the generalized job shop problem and Cmax indicates that minimization of

makespan is the performance criterion. Makespan is defined as the completion time of the final job to leave the system [51] and is often used as the performance criterion. Various other performance measures are used to evaluate schedules, ranging from minimization of tardiness and process cost to maximization of throughput and resource utilization. In JSSP, the objective of the scheduler is to determine the starting time of each operation such that the precedence and capacity constraints are satisfied and the desired performance measure is achieved.

3.1.2 Types of Schedules

Semi-Active Schedules

A feasible schedule to the JSSP can be obtained by resolving the constraints specified in the disjunctive graph model in a consistent manner. A modification of the above schedule, allowing all operations to start at the earliest possible time results in a semi-active schedule. This type of modification, which results in a compacted schedule on a machine without altering the sequence of operations, is known as local left-shift. All schedules for which no local left-shift is possible constitute the set of semi-active schedules.

Active Schedules

In a semi-active schedule, an operation may exist that can be left-shifted to begin at an earlier time, ahead of other operations on that machine. This type of shift, known as a global left-shift, preserves the feasibility of the schedule while improving its makespan.

The set of all schedules in which no global left shift can be made is called the set of active schedules and is a subset of the set of semi-active schedules.

Non delay Schedules

A schedule in which no machine is kept idle at a time when it could begin processing some operation is called a non-delay schedule. All non-delay schedules are active schedules, but not all active schedules may be non-delay schedules.

Figure 3.3 illustrates the relationships between different sets of schedules using a

Venn diagram. The optimal schedules are indicated by the solid circle in the diagram.


Figure 3.3 Venn diagram illustrating relationships between different sets of schedules

Adapted from Baker [1]

3.2 Review of Approaches to solve JSSP

The JSSP is a good representative of the general domain of scheduling problems and has earned a reputation for its sheer intractability. Two broad approaches exist in formulating a solution to the problem: exact and approximate. The exact or optimization methods include efficient algorithms, mathematical formulations frequently based on either Lagrangian relaxation or decomposition, and branch and bound procedures. The most common mathematical formulation of JSSP is the mixed integer linear programming format [52]. The exact methods yield optimal solutions, but cannot guarantee them in polynomial time. Glover and Greenberg [53] suggest that exact methods are unsatisfactory for large combinatorially difficult problems.

An important issue is not only the optimality of the solution provided by the exact methods, but also the time and cost incurred. For large problems, it is often expedient to trade optimality for speed. The motivation for the development of approximate methods is to deliver a good solution in acceptable time. The following subsections review current research on approximation techniques for JSSP. For this purpose, these techniques have been categorized as follows: heuristics-based approaches, local search methods and meta-heuristics, artificial intelligence approaches comprising both symbolic and connectionist systems, and machine learning applications.

3.2.1 Heuristics-based Approaches

The Giffler-Thomson algorithm [54] is a classic enumeration scheme forming the common basis for all priority dispatching rules. In this procedure, an operation is chosen from among the available set of operations based on its priority index. Priority dispatch rules are used to assign a priority index for each operation. The selected operation is then added to the schedule. The procedure continues until all operations have been scheduled.

The importance of this procedure lies in its capability to generate one or all members of the set of active schedules.
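As a concrete illustration of priority dispatching, the sketch below simulates a simple non-delay dispatching loop with the SPT rule on the 3 x 3 instance of Table 3.1. It is an illustrative simplification only; the full Giffler-Thomson procedure builds an explicit conflict set of operations on the critical machine instead of the shortcut used here.

# Simplified non-delay dispatching with the SPT rule (illustrative only).
# routing[job] = [(machine, processing_time), ...] for the 3 x 3 instance.
routing = {
    1: [(1, 3), (2, 3), (3, 3)],
    2: [(1, 2), (3, 3), (2, 4)],
    3: [(2, 3), (1, 2), (3, 4)],
}

def spt_dispatch(routing):
    next_op = {j: 0 for j in routing}        # index of each job's next operation
    job_ready = {j: 0 for j in routing}      # time the job becomes available
    machine_ready = {}                       # time each machine becomes free
    makespan = 0
    while any(next_op[j] < len(routing[j]) for j in routing):
        candidates = []
        for j in routing:
            if next_op[j] < len(routing[j]):
                m, p = routing[j][next_op[j]]
                est = max(job_ready[j], machine_ready.get(m, 0))
                candidates.append((est, p, j, m))
        t = min(c[0] for c in candidates)                        # earliest possible start
        est, p, j, m = min(c for c in candidates if c[0] == t)   # SPT among startable ops
        job_ready[j] = machine_ready[m] = t + p
        next_op[j] += 1
        makespan = max(makespan, t + p)
    return makespan

print("SPT makespan:", spt_dispatch(routing))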

Some commonly used elementary priority dispatch rules are: Shortest Processing

Time (SPT), Minimum Slack Time (MST), First Come First Served (FCFS), Least Work Remaining (LWKR) and Shortest Remaining Processing Time (SRMPT). Panwalker and

Iskander [55] provide a comprehensive survey of scheduling heuristics. A total of 113 priority dispatching rules are presented, reviewed and classified in this work. The main drawback in using these elementary rules is that different rules perform best in different scenarios and no single rule dominates the rest across all scenarios. To improve performance, probabilistic combinations of the elementary priority rules are often employed for determining priority. Blackstone et al. [56] provide a detailed comparison of several elementary dispatching rules and their combinations. Lawrence [57] compares the performance of ten individual priority dispatch rules with a randomized combination of these rules. Superior results were delivered by the combination method, but it required substantially more computing time. Kaschel et al. [58] provide empirical results of the

performance of priority rules in scheduling several benchmark instances. The authors compare single priority rules, simple combinations of them, and combinations of priority rules generated by the Analytic Hierarchy Process (AHP) method. AHP is a multi-criteria decision-making tool capable of generating weighted combinations of priority rules, and it provided the best results in this study.

The shifting bottleneck procedure (SBP) developed by Adams et al. [59] is one of the most powerful heuristics-based approaches for tackling JSSP. The procedure is divided into subproblem identification, bottleneck selection, subproblem solution, and schedule reoptimization. The strategy is to divide the original scheduling problem into m single-machine problems and to solve each subproblem iteratively. This is based on the conjecture of overlap between solutions to the single-machine problems and the job shop problem. Each subproblem solution is compared with all the others and the machines are ranked based on the comparisons. The bottleneck machine is the unsequenced machine with the largest solution value. Without considering the other unsequenced machines, the bottleneck machine is scheduled based on the sequenced machines. The last step involves local reoptimization of previously scheduled machines as single-machine problems.

Ramudhin and Marier [60] generalize the shifting bottleneck procedure for many types of scheduling problems including flow shops, assembly shops, partially ordered shops, etc. Applegate and Cook [61] design and implement a variant of SBP having an initial solution procedure called “Bottle-k” and an algorithm called “Shuffle” with Edge-finder at its core. For the last k unscheduled machines, Bottle-k branches by selecting

each of the remaining machines in turn. For the initial schedule constructed by Bottle-k,

Shuffle fixes the processing order of one or a small number of heuristically selected machines. The remaining machines are optimally scheduled by Edge-finder.

3.2.2 Local Search Methods and Meta-Heuristics

The search space of all possible solutions for even moderately sized JSSP instances is extremely large and discourages complete enumeration or random sampling. Ensuring the optimality of solutions in such cases is difficult, if not impossible. The advantage of local search methods is their ability to find near optimal solutions at low cost. Local search is based on the concept of neighborhood structures. The basic principle behind local search is to improve a given solution by a search in the neighborhood of the solution. Two solutions belong to the same neighborhood structure, if one can be obtained by a well defined modification of the other [51]. However, these methods may be trapped in local optima. If a meta-strategy is employed for navigating through these local optimal solutions to find the global optimum, then these methods are known as iterated local search methods or meta-heuristics [3]. Well known approximation methods in this category include: simulated annealing, genetic algorithms, tabu search etc.

Simulated Annealing (SA) is a stochastic local search method having its origins in a simulated model for growing crystals, known as the Metropolis algorithm [62]. SA accepts solutions with higher values of the cost function than the current solution with a decreasing probability over time, in order to get away from the local optima and explore

the feasible solution region to reach a global optimum. Steinhofel et al. [63] present two simulated annealing-based algorithms for the JSSP having a makespan minimization objective. These algorithms differ in the cooling schedule and could obtain near optimum solutions within a relatively short time on a set of benchmark scheduling instances. Because the speed of convergence of SA can be an issue, Szu and Hartley [64] devised a method called fast simulated annealing, which permits occasional long steps to speed the convergence rate.

Tabu search [65] is a local search meta-heuristic, capable of remembering the features of the solution landscape previously visited by relying on specialized memory functions. The method guides the search process away from solutions that resemble previously achieved solutions. A predetermined number of recently visited solutions are maintained in the short term memory as a tabu-list. Medium-term memory remembers solution areas where good solutions have previously been achieved, allowing the search process to return to these areas at a later stage. Long term memory allows diversification into unexplored regions. As a wide exploration of the search space is ensured by these memory functions, the search is less likely to be trapped in local optima.

Dell’Amico and Trubian [66] present a generic tabu-search procedure for the job shop scheduling problem. Hao and Pannier [67] undertake a comparative study of simulated annealing and tabu search methods. They assert that tabu search is a more efficient local search method with a lower likelihood of getting trapped in local optima.

The earliest application of Genetic Algorithms (GAs) to JSSP is by Davis [68], where the GA constructs a preferred order of operations for each machine. Problem representation schemes and construction of genetic operators are important issues in building a genetic algorithm. Cheng et al. [69], in their tutorial survey of genetic algorithms in JSSP, deal with the issues concerning representation schemes and encoding approaches. They classify the representation schemes used in JSSP into nine categories, translating into two encoding approaches, direct and indirect. The direct approach encodes a schedule into a chromosome. In the GA of Nakano and Yamada [70], the chromosome is an ordered list of completion times of operations. A crossover operator, called GA/GT, was also developed in this work based on the Giffler-Thomson algorithm

[54]. In the indirect approaches, a sequence of decision preferences is encoded into a chromosome, and genetic operators are used to improve the ordering of preferences over generations. The preference list based representation proposed by Kobayashi et al. [71] is an example of the indirect encoding approach. The chromosome is composed of several substrings, each corresponding to an operation sequence for a machine.

A detailed discussion of the various adaptive genetic operators is given in Gen and

Cheng [10]. A schedule builder is usually built into a genetic algorithm for handling precedence constraints. Then, the JSSP can be treated as a permutation problem, where the task of the GA is to evolve better permutations (solutions) over generations. The role of the schedule builder is to generate feasible solutions (i.e. semi-active schedules). The number of semi-active schedules is large and can slow the convergence of GAs. To

overcome this problem, dispatching heuristics such as First-In-First-Out (FIFO) and Left-Shift are often embedded in the schedule builder. The FIFO heuristic generates semi-active schedules, while the Left-Shift rule generates active schedules.

3.2.3 Artificial Intelligence Approaches

Most symbolic Artificial Intelligence (AI) approaches cast the JSSP as a constraint satisfaction problem (CSP). A CSP specifies a set of decisions to be made and a set of constraints to determine the validity of such decisions. The general procedure to solve a

CSP is to reduce the search space by utilizing a constructive search strategy. Such a strategy incrementally builds a solution by assigning values to variables and checking for constraint violations. If any violations are found, a backtracking strategy is employed to undo previous variable assignments. The procedure is repeated with a fresh set of variable assignments. The Intelligent Scheduling and Information System constructed by

Fox [72] is a good example of an AI-based scheduling system. There are many variations of the generic constraint satisfaction procedure. Fox and Sadeh [73] provide a comparative summary of a variety of constraint satisfaction approaches applied to a set of benchmark scheduling instances.

Neural network scheduling systems offer an alternate AI-based scheduling paradigm. Cheung [74] provides a comprehensive survey of the main neural network architectures used in scheduling. These are: searching network (Hopfield net), probabilistic network (Boltzmann machine), error-correcting network (multilayer

perceptron), competing network and self-organizing network. Jain and Meeran [75] also provide an investigation and review of the application of neural networks in JSSP.

A Hopfield network, shown in Figure 3.4, is a fully connected, unlayered network with binary input and output data. Each processing element is connected to every other processing element in the network. For application in JSSP, it is necessary to map the scheduling problem onto an energy function. The makespan, precedence and resource constraints are translated to an energy function suited to the network structure. Hopfield nets possess inherent dynamics to minimize the system energy function. Constraint violations are penalized with an increase in the value of the energy function. The

Hopfield net tries to stabilize to a low-energy configuration (i.e., the optimum) by keeping the constraint violations to a minimum.


[Figure 3.4: a set of fully interconnected processing elements (PE – Processing Element).]

Figure 3.4 Architecture of a Hopfield net.

Foo and Takefuji [76] formulate the scheduling problem as an integer linear programming problem. The energy function for the Hopfield net is the sum of the starting times of the jobs. The authors demonstrate that the model attains a near optimal solution on a benchmark instance. However, this method is capable of handling only small problem sizes, the number of control parameters is large, and the model often gravitates to a local minimum. Zhou et al. [77] propose a Linear Programming Neural Network (LPNN) to overcome some of the shortcomings of the previous approach. Instead of a quadratic energy function, a linear function is utilized. The number of control variables, neurons and interconnections is greatly reduced in the LPNN-Hopfield net formulation.

Sabuncuoglu and Gurgun [78] propose a modified Hopfield net model, incorporating an external processor to monitor and control the progress of the network.

This method differs from other Hopfield net formulations as the feasibility constraints are dropped from the energy function of the Hopfield net. Instead, feasibility and cost computations occur in the external processor. The authors report that the method could optimally schedule a number of tough benchmark scheduling instances.

The Hopfield net approach to JSSP has a number of limitations: difficulties in mapping the objective function of the scheduling problem to an appropriate system energy function, slow convergence to optima, unreliable termination criteria and convergence of the model to local optima.

The other prominent neural network (NN) systems used in scheduling are the error-correcting networks. These systems are multilayer perceptron (MLP) networks, where learning takes place by the back-propagation algorithm. Jain and Meeran [79] propose a modified MLP model, where the neural network performs the task of optimization and outputs the desired sequence. A novel input-output representation scheme is used which greatly reduces the number of processing elements needed to encode the JSSP. Although the method has been able to handle large problem sizes (30 x 10) compared to other approaches, the generalization capability of the model is limited to approximately 20% deviation from the training sample.

In contrast to the above approach, many applications of the error-correcting networks to JSSP utilize the neural network as a component of a hybrid scheduling

system. Rabelo and Alptekin [80] use the neural network to rank and determine coefficients of priority rules. An expert system utilizes these coefficients to generate schedules. Dagli and Sittasathanchai [81] use a genetic algorithm for optimization and the neural network performs multiobjective schedule evaluation. The network maps a set of scheduling criteria to appropriate values provided by experienced schedulers. Yu and

Liang [82] present a hybrid approach for JSSP in which genetic algorithms are used for optimization of job sequences and a neural network performs optimization of operation start times. This approach has been successfully tested on a large number of simulation cases and practical applications.

Yih et al. [83] propose a hybrid semi-Markov neural network method to schedule crane operations in a chemical plant. The reported results show that the trained network performs better than a human scheduler. Kim et al. [84] trained an NN using the results achieved from the Apparent Tardiness Cost rule. The main drawback of the error-correcting network approaches to JSSP is that the NN is often not used for the optimization itself; instead, non-optimal data acquired from an expert, the shop floor or priority dispatch rules is used to train the network.

3.2.4 Machine Learning Applications

The above subsections gave a survey of the optimization-based approaches to

JSSP. Though generally successful, these approaches to JSSP suffer from the following limitations:

• Knowledge Discovery: Optimization procedures often identify good solutions, but

lack the capability to explain the process by which they arrive at these solutions.

Hence, little insight is gained into the scheduling process.

• Computational Intensity: High computational intensity of optimization procedures

often limits development of practical applications in real world scheduling

problems.

• Adaptability: These procedures are unwieldy in a dynamic shop environment as

they do not possess the ability to quickly adapt to changing problem scenarios.

Machine learning approaches can be used in conjunction with optimization methods to build more robust hybrid systems. Such hybrid systems allow for the development of a scheduler that is computationally less intensive than optimization procedures, but still provides good solutions in acceptable time. A second benefit of such systems is the provision of knowledge in the form of comprehensible rules effectively aiding a human worker in the scheduling task.

Koonce and Tsai [5] use genetic algorithms combined with attribute-oriented induction (AOI) methodology to develop a rule-based scheduler for JSSP. Optimal or near-optimal schedules are obtained by a genetic algorithm. These schedules constitute the knowledge base for the learning task. AOI is used to mine the knowledge base for extracting relationships between an operation’s priority and its attributes. The authors report that the developed rule set consistently provided superior solutions compared to the Shortest Processing Time heuristic. Also, the rule set could duplicate the GA’s

performance on an identical problem. Koonce and Kantak [85] extend the above work to develop supplementary rule sets by increasing the knowledge base for the learning task.

This research resolved inaccuracies in operation ranking identified in the previous work.

Also, it is shown that performance of the rule-based scheduler is enhanced by expanding the knowledge base. However, performance was not affected by rules learned from additional data sets, beyond a threshold number.

Kwak and Yih [86] use a data-mining-based control approach for a testing and rework cell in a computer-integrated manufacturing environment. A decision tree was extracted from large-scale training data generated by simulation. The system makes decisions on job preemption and dispatching rules based on this knowledge in real time, and it compared favorably to other control heuristics with respect to the number of tardy jobs. Following a similar approach, Yoshida and Hideyuki [87] apply data mining to extract association rules between performance measures and dispatching rules.

A guidance scheme for the selection of dispatching rules satisfying multiple performance measures was also developed.

CHAPTER 4. METHODOLOGY

The goal of the current research was to develop a rule-based scheduler for the job shop scheduling problem. This chapter describes the methods and tools utilized in this work to achieve the research objectives. The first section deals with the learning task and explains the development of a neural network model from the optimal solutions to a benchmark job shop problem obtained by a genetic algorithm. The second section focuses on the implementation of the rule extraction procedures for capturing the embedded knowledge in the neural network model.

4.1 The Learning Task

4.1.1 Genetic Algorithm (GA) Solutions

The knowledge base for the learning task was provided by the genetic algorithm’s solution to the job shop problem. For this purpose, a well-known 6x6 problem instance, ft06 devised by Fisher and Thomson [88] has been chosen as the benchmark problem.

This test instance has six jobs, each with six operations to be scheduled on six machines and has a known optimum makespan of 55 units. The data for the instance is shown in

Table 4.1 using the following structure: machine, processing time.


Table 4.1 The ft06 instance devised by Fisher and Thomson [88]

            Operation
Job       1        2        3        4        5        6
1         3,1      1,3      2,6      4,7      6,3      5,6
2         2,8      3,5      5,10     6,10     1,10     4,4
3         3,5      4,4      6,8      1,9      2,1      5,7
4         2,5      1,5      3,5      4,3      5,8      6,9
5         3,9      2,3      5,5      6,4      1,3      4,1
6         2,3      4,3      6,9      1,10     5,4      3,1

A distributed genetic algorithm (GA) developed by Shah and Koonce [89] was utilized in this research for obtaining solutions to the benchmark problem. A solution generated by the GA is a sequence, like the following: {1, 3, 2, 4, 6, 2, 3, 4, 3, 6, 6, 2, 5,

5, 3, 5, 1, 1, 6, 4, 4, 4, 1, 2, 5, 3, 2, 3, 6, 1, 5, 2, 1, 6, 4, 5}. Each number in the sequence is representative of the job number and the operation it is undergoing. The repetition of job numbers in the sequence indicates the next available operation for that job. The representation of a GA solution is shown in Figure 4.1.

[Figure 4.1: the chromosome above shown as a Job row with, beneath each gene, the Operation number that the gene represents.]

Figure 4.1 Representation of a GA solution
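To make the representation concrete, the sketch below decodes the example chromosome into a schedule on the ft06 data of Table 4.1 and computes its makespan. The decoder is a simple semi-active schedule builder assumed here for illustration; the distributed GA of Shah and Koonce may use a different schedule builder (e.g., one that applies left-shifts), so the value printed need not equal the GA's makespan for this chromosome.

# Illustrative decoder for the operation-based chromosome of Figure 4.1.
# ft06 routing: ft06[job] = [(machine, processing_time), ...] from Table 4.1.
ft06 = {
    1: [(3, 1), (1, 3), (2, 6), (4, 7), (6, 3), (5, 6)],
    2: [(2, 8), (3, 5), (5, 10), (6, 10), (1, 10), (4, 4)],
    3: [(3, 5), (4, 4), (6, 8), (1, 9), (2, 1), (5, 7)],
    4: [(2, 5), (1, 5), (3, 5), (4, 3), (5, 8), (6, 9)],
    5: [(3, 9), (2, 3), (5, 5), (6, 4), (1, 3), (4, 1)],
    6: [(2, 3), (4, 3), (6, 9), (1, 10), (5, 4), (3, 1)],
}

chromosome = [1, 3, 2, 4, 6, 2, 3, 4, 3, 6, 6, 2, 5, 5, 3, 5, 1, 1,
              6, 4, 4, 4, 1, 2, 5, 3, 2, 3, 6, 1, 5, 2, 1, 6, 4, 5]

def decode(chromosome, routing):
    """Semi-active decoding: each gene schedules that job's next operation at
    the earliest time allowed by job and machine availability."""
    next_op = {j: 0 for j in routing}
    job_ready = {j: 0 for j in routing}
    machine_ready = {}
    makespan = 0
    for job in chromosome:
        machine, p_time = routing[job][next_op[job]]
        start = max(job_ready[job], machine_ready.get(machine, 0))
        finish = start + p_time
        job_ready[job] = machine_ready[machine] = finish
        next_op[job] += 1
        makespan = max(makespan, finish)
    return makespan

print("Makespan of the decoded schedule:", decode(chromosome, ft06))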

On the benchmark instance shown in Table 4.1, the GA was run 2000 times. The optimal makespan of 55 units was achieved 1147 times. The next step was to transform these 1147 schedules (chromosome sequences) to a data structure suitable for the classification task.

4.1.2 Setting up the Classification Problem

The schedules obtained by the GA contain valuable information relevant to the scheduling process. The learning task was to predict the position of an operation in the sequence, based on its features or attributes. Based on a study of operation attributes commonly used in priority dispatch rules, the following attributes have been identified as input features: operation, process time, remaining time and machine load. These input features have been clustered into different classes using the concept hierarchy for 6 x 6 job shop problems developed by Koonce and Tsai [5].

Operation

Each job has six operations that must be processed in a given sequence. The

Operation feature identifies the sequence number of the operation ranging between 1 and

6. This feature has been clustered into four classes as: {1} First, {2, 3} Middle, {4, 5}

Later, and {6} Last.

ProcessTime and RemainingTime

The ProcessTime feature represents the processing time for the operation. The

RemainingTime feature denotes the sum of processing times for the remaining operations

of that job and provides a measure of the work remaining to be done for completion of the job. For the benchmark ft06 instance, the processing times ranged from 1 to 10 units, while the remaining times ranged from 0 to 39 units. Based on the data, three classes

(clusters) for these features were identified as follows. The ranges were split into three equal intervals. The first interval was classified as Short, second as Medium, and last interval was labeled as Long. Table 4.2 shows the classification of the ProcessTime and

RemainingTime for these intervals.

Table 4.2 ProcessTime and RemainingTime feature classes

Attribute          Short         Medium          Long
ProcessTime        [1, 3.33]     (3.33, 6.67]    (6.67, 10]
RemainingTime      [0, 13]       (13, 26]        (26, 39]

Machine Load

The Machine Load feature determines the machine loading and was clustered into two classes: Light and Heavy. This feature represents the capacity or utilization of machines in units of time and Table 4.3 shows the classification of machine loading for the ft06 instance.


Table 4.3 MachineLoad feature classification

Machine     Processing Time     MachineLoad
M1          40                  Heavy
M2          26                  Light
M3          26                  Light
M4          22                  Light
M5          40                  Heavy
M6          43                  Heavy

The machine processing times range between 22 and 43 units, with an average of

32.5 units. All the machines having processing times less than 32.5 units were classified as Light, while those having processing times greater than 32.5 units were labeled as

Heavy.

Priority

The target concept to be learned was the priority or position in the sequence. Since an operation can be positioned in any one of the 36 locations available in the GA sequence, it may be difficult to discover an exact relationship between the input features and the position. However, if the problem was modified to predict a range of locations for the operation, the learning task becomes easier. The target feature priority, thus determines the range of positions in the sequence where the operation can be inserted.

The possible range of positions has been split into six classes and assigned class labels as shown in Table 4.4.

Table 4.4 Assignment of class labels to target feature

Range of positions     Priority
1 – 6                  Zero
7 – 12                 One
13 – 18                Two
19 – 24                Three
25 – 30                Four
31 – 36                Five
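A small sketch of this classification scheme, applying the interval boundaries of Tables 4.2-4.4 to raw attribute values, is given below. The function names and the shape of the mapping are illustrative only.

# Illustrative mapping of raw operation attributes to the concept-hierarchy
# class labels of Tables 4.2 - 4.4 (function names are illustrative).
def operation_class(seq):                 # sequence number 1..6 within the job
    return {1: "First", 2: "Middle", 3: "Middle",
            4: "Later", 5: "Later", 6: "Last"}[seq]

def process_time_class(p):                # processing time, 1..10 time units
    return "Short" if p <= 3.33 else ("Medium" if p <= 6.67 else "Long")

def remaining_time_class(r):              # remaining work, 0..39 time units
    return "Short" if r <= 13 else ("Medium" if r <= 26 else "Long")

def machine_load_class(load, mean_load=32.5):
    return "Light" if load < mean_load else "Heavy"

def priority_class(position):             # position 1..36 in the GA sequence
    return ["Zero", "One", "Two", "Three", "Four", "Five"][(position - 1) // 6]

# e.g. the first operation of job 2 in ft06: process time 8, remaining time 39,
# processed on machine M2 (load 26)
print(operation_class(1), process_time_class(8),
      remaining_time_class(39), machine_load_class(26))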

4.1.3 Development of a Neural Network Model

There are three aspects related to development of a neural network model. The first is the choice of the training, cross-validation (CV) and testing data sets and their sizes, the second is the selection of suitable architecture, training algorithm and learning constants, and the third is the determination of the termination criteria. Unfortunately, there are no definitive heuristics or formulae to determine these parameters. Considerable experimentation was necessary to achieve a good network model of the data. The software NeuroSolutions developed by NeuroDimensions Incorporated was used for development and testing of the neural network model.

Training, Cross-validation and Test Datasets

The 1,147 optimal schedules obtained by the GA represent a total number of

41,292 operations (1,147 schedules x 36 operations/schedule). Assignment of input

features and target classes was done for each operation according to the classification scheme described in the previous subsection. Sample data for the classification task is shown in Table 4.5.

Table 4.5 Sample data for the classification task

Pattern_ID    Operation    ProcessTime    RemainingTime    MachineLoad    Priority
1             First        Short          Long             Light          0
2             Middle       Medium         Long             Light          1
…             …            …              …                …              …
41,291        Later        Medium         Short            Heavy          4
41,292        Last         Short          Short            Light          5

The entire data set included 24 distinct input patterns with different target feature values (priority), constituting a total of 41,292 patterns (exemplars). This classification data set was split into training, cross validation and testing data sets with 70%, 15% and

15% memberships.

Network Architecture and Learning Parameters

Different network architectures were experimented with to determine the best classifier. A detailed evaluation of these classifiers is tabulated in Appendix A. Based on this comparison, a two hidden layered MLP (12-12-10-6) with hyperbolic tangent transfer functions at the hidden layers was chosen, as it had the best classification accuracy for the

testing data set. Gradient descent was used as the training algorithm. The key learning parameters for this algorithm are the step size and momentum terms. These parameters control the rate of learning and the speed of convergence respectively. Together with the termination criteria, these constitute the training parameters for the neural network.

Termination Criteria

The training algorithm determines the weight vector, which maps the network input to output. A weight vector is randomly initialized and then adapted during the training regimen. The randomness of the initial weight vector is important for learning, but the inherent non-linear dynamics of the training process implies different convergence properties with different initial weight vectors. Therefore, a number of runs are required to increase the probability of a good initial solution (weight vector). Within each run, a number of training cycles (epochs) are needed to ensure good generalization. The following four termination criteria have been employed to determine convergence of the training algorithm:

1. Number of runs before termination

2. The maximum number of epochs/run

3. Non-improvement of cross-validation error with training.

4. Increase in the cross-validation error with training

The training parameters (i.e., learning parameters and the termination criteria) for the 12-12-10-6 MLP classifier are given in Table 4.6. The result of the training regimen

is a neural network model of the underlying data distribution. The final set of network parameters for the 12-12-10-6 MLP classifier is given in Appendix B.

Table 4.6 Training parameters for the 12-12-10-6 MLP classifier

Network Parameter                                        Value
Step size                                                0.01
Momentum factor                                          0.7
Number of runs                                           10
Number of epochs/run                                     10,000
Number of epochs without improvement in CV error         500
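The thesis model was built in NeuroSolutions; the fragment below sketches a roughly equivalent configuration in scikit-learn purely for illustration. The mapping of the Table 4.6 parameters onto MLPClassifier arguments is an assumption, and X_train / y_train are placeholders for the one-hot-encoded features and priority labels.

# Hedged scikit-learn approximation of the 12-12-10-6 MLP classifier
# (the thesis itself used NeuroSolutions; this mapping is illustrative).
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(12, 10),    # two hidden layers; the 12 inputs and 6
                                    # outputs are implied by the encoded data
    activation="tanh",              # hyperbolic tangent transfer functions
    solver="sgd",                   # gradient descent training
    learning_rate_init=0.01,        # step size
    momentum=0.7,                   # momentum factor
    max_iter=10_000,                # epochs per run
    early_stopping=True,            # stop when validation error stops improving
    validation_fraction=0.15,       # cross-validation split
    n_iter_no_change=500,           # epochs without improvement in CV error
)
# mlp.fit(X_train, y_train)  # X_train / y_train: placeholders for the 41,292 exemplars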

4.2 Knowledge Extraction from the Neural Network Model

The neural network can be considered an implicit model of the training data. The goal of the rule extraction algorithms is to translate this implicit model into explicit symbolic form. In this work, the extracted knowledge is captured in two symbolic representations: decision trees and propositional rules. The possible rule space for these procedures is derived in the following way. The number of classes in the input features of the classification problem (Operation, ProcessingTime, RemainingTime and the

MachineLoad features) are four, three, three and two respectively. Hence, the number of possible rule antecedent combinations (patterns) in the rule space is 4 x 3 x 3 x 2 = 72.

Rule extraction procedures were employed to find a faithful representation of the neural network within this entire rule space, as described in the following subsections.
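Enumerating this rule space is straightforward; a short illustrative sketch follows (the label names mirror the feature classes defined in section 4.1.2).

# Enumerating the 72-pattern rule space (4 x 3 x 3 x 2 antecedent combinations).
from itertools import product

operations = ["First", "Middle", "Later", "Last"]
process_times = ["Short", "Medium", "Long"]
remaining_times = ["Short", "Medium", "Long"]
machine_loads = ["Light", "Heavy"]

rule_space = list(product(operations, process_times, remaining_times, machine_loads))
assert len(rule_space) == 72
# Each pattern is then presented to the trained network to obtain its class label.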

4.2.1 Decision Tree Induction

Datasets

The 72 patterns in the rule space were presented to the trained neural network for classification. After obtaining the class labels, these patterns were separated into training and sampled sets. The 24 distinct input patterns in the original benchmark problem comprised the training set for decision tree induction. The sampled set was created with the remaining 48 patterns (72 total patterns – 24 training patterns). These two data sets are provided in Appendix C for reference.

Algorithm

The algorithm constructs the decision tree in a recursive fashion. First, an attribute is selected to be placed at the root node of the decision tree. A branch is then added to this node of the tree for each possible value of this attribute. The branching process splits the data set into a number of subsets. The process is recursively repeated at every branch, using only those data patterns that actually reach the branch. The branching continues until all the patterns that reach a leaf node belong to the same class. No further expansion of this leaf node is necessary and the node is designated with the appropriate class label.

The expansion then proceeds to other branches of the tree, until all possible leaf nodes have been produced. Figure 4.2 provides a sketch of the algorithm.

Decision Tree Induction Algorithm

Input: training set, sampled set, minimum_sample, termination criteria

1. initialize the tree as a leaf node
2. while termination criteria not met
3.     pick a node in a depth-first manner to expand
4.     test the size of the node's training set
           case 1: the training set is larger than the minimum_sample size; proceed to step 5
           case 2: otherwise, make an augmented training set larger than the minimum_sample
                   size by combining the training and sampled sets
5.     select an attribute for splitting the node based on the normalized information
       gain (gain ratio) criterion
6.     for each possible attribute value, make a new leaf node

Return: extracted decision tree

Figure 4.2 The decision tree induction algorithm

A normalized measure of information gain was employed for selecting an attribute to split the node. This procedure differs from the standard decision tree algorithms like

ID3 in the data set utilized for tree induction. Some features of the above algorithm are based on the TREPAN decision tree induction algorithm [17]. Like TREPAN, it ensures the availability of a minimum number of instances at a node, before giving a class label to the node or choosing a splitting test for it. The data set at each node is determined by the minimum_sample parameter, which is specified by the user. This important parameter controls the size and depth of the induced decision tree, which in turn affect the classification accuracy of the decision tree. If the size of the training set, t is less than the minimum_sample parameter, the training set is augmented with data from the sampled set by drawing (minimum_sample – t) patterns. Since the neural network is modeled on the data in the training set, the primary reliance on this set for expansion of the decision tree increases the fidelity of the induction process.
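The augmentation step can be sketched as follows (illustrative Python; the thesis implementation is in Java with WEKA, and the function name here is hypothetical).

# Illustrative sketch of the minimum_sample augmentation step: a node whose
# training set is too small is topped up from the network-labelled sampled set.
import random

def node_data(training_at_node, sampled_set, minimum_sample, seed=0):
    """Return the data used at a node, assuming the sampled set is large
    enough to cover any shortfall below minimum_sample."""
    shortfall = minimum_sample - len(training_at_node)
    if shortfall <= 0:
        return list(training_at_node)
    random.seed(seed)
    return list(training_at_node) + random.sample(list(sampled_set), shortfall)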

Application

The algorithm was implemented in the Java programming language utilizing the

JBuilder X development environment. The Waikato Environment for Knowledge

Analysis (WEKA) Java software package [81] was also employed in developing the code.

WEKA provides a host of well-documented data structures, classes and tools for development of machine learning schemes.

The minimum_sample parameter was varied between two and seven, producing decision trees of differing sizes and accuracies. The minimum_sample value of five provided the decision tree with the best classification accuracy. The confusion matrix and the associated performance measures are provided in Appendix D. This tree has 48

distinct pathways (i.e., paths from the root node to a leaf node), and a partially expanded view of the decision tree is shown in Figure 4.3.


[Figure 4.3: the root node splits on Operation (First, Middle, Later, Last); internal nodes test ProcessTime, RemainingTime and MachineLoad; leaf nodes (gray ovals) carry priorities such as Zero, One, Three and Four, and several branches are shown as unexpanded nodes.]

Figure 4.3 A partially expanded view of the induced decision tree

In this decision tree, the gray ovals represent the leaf nodes and indicate the class label (priority) of the operation represented by the path. The decision tree can be easily decomposed into a propositional rule set yielding 48 distinct rules as shown in Table 4.7.


Table 4.7 The rule set containing 48 rules (NN-Rule set)


In this rule set, referred to as the NN-Rule set, the keyword “Any” denotes all possible values of an attribute. The first rule in the NN-Rule set implies that the First operation of a job with a Short processing time and a Short remaining time, processed on a machine of any load class (i.e., Light or Heavy), has an associated priority of Zero (the highest priority) for scheduling in the sequence.

4.2.2 Propositional Rules by Validity Interval Analysis

The general procedure of Validity Interval Analysis (VI-Analysis) has been described in section 2.5.3 of chapter two in this document. This subsection describes the implementation of the key aspects of this procedure. MATLAB was chosen as the development tool for coding the procedure. MATLAB is a high-level technical computing language and offers a rich development environment providing a wide choice of in-built functions and add-on toolboxes. The main processing tasks were coded in script files (a block of MATLAB commands). The MATLAB toolboxes used in the program included the neural network toolbox, the optimization toolbox and the statistics toolbox. The main steps in the implementation are as follows:

Instantiation of the neural network object

The 12-12-10-6 MLP classifier was instantiated as a neural network object. This object is structured to contain the information necessary for describing the architecture of the neural network (number of layers and nodes, weights, and biases). The script files interface with the neural network object by means of tailored object access functions.

Development of candidate rule set and validity interval assignment

VI-Analysis, as a rule extraction procedure, yields propositional rules by verification and refinement of candidate rules. The 48 rules in the NN-Rule set constituted the candidate rule set for the verification process. These rules were translated into a set of validity intervals. The nodes of the input and output layers were assigned these validity intervals. The nodes in the hidden layers were assigned a validity interval of [-1, 1], which corresponds to the range of the hyperbolic tangent transfer function used in the 12-12-10-6 MLP classifier.

Forward and backward passes

These validity intervals are propagated in forward and backward directions through the network. A forward pass consists of a series of forward steps through successive layers of the network, from the input layer to the output layer. The backward pass is similar to the forward pass. In this pass, the validity intervals are propagated from the output layer to the input layer. The outcome of these passes is a set of refined validity intervals specifying the activation ranges on all nodes in the network. A forward or backward step between two successive layers in the network involves a call to the linear programming solver. The MATLAB function, Linprog was used to solve the linear programming problem. Linprog is an active set method and is a variation of the well-known simplex method for linear programming [91]. Figure 4.4 outlines the key actions involved in a forward step between two successive layers, P and S.

Forward Step between Layers P and S

Inputs: validity intervals of all nodes in layer P, (Plb, Pub)
        validity intervals of all nodes in layer S, (Slb, Sub)
        weight matrix between layers P and S, WPS
        biases of all nodes in layer S, bS
        transfer function on the nodes in layer S, tfS

1. set up the optimization problem for each node i in S:
   objective function: maximize (and minimize) Ci(x) = Σ_{k ∈ P} WPS(k, i) * x(k) + bS(i)
   constraints: (a) Plb ≤ x ≤ Pub
                (b) tfS^(-1)(Slb) ≤ Σ_{k ∈ P} WPS * x + bS ≤ tfS^(-1)(Sub)

2. use linear programming to determine new bounds on all nodes in layer S:
   for all i ∈ S, make a call to the linear programming solver to derive new bounds on node i
   outcome of this step: (Slbnew, Subnew)

3. refine the validity intervals on all nodes in layer S:
   for all i ∈ S, Slb = maximum(Slb, Slbnew) and Sub = minimum(Sub, Subnew)
   outcome of this step: (Slb, Sub)

Return: refined validity interval for layer S, (Slb, Sub)

Figure 4.4 Forward step between two layers P and S in the network
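A compact illustration of one such forward step is sketched below, with scipy.optimize.linprog standing in for the MATLAB linprog call. The weights, bias and incoming intervals are placeholder values rather than network parameters, and constraint (b) of Figure 4.4 is omitted for brevity (with box constraints alone the bounds could also be obtained in closed form).

# Illustrative forward step between layers P and S (cf. Figure 4.4).
# W, b and the incoming intervals are placeholders, not the trained network.
import numpy as np
from scipy.optimize import linprog

def forward_step(P_lb, P_ub, W, b):
    """Refine activation bounds on layer S given bounds on layer P.
    W has shape (|P|, |S|); nodes in S use a tanh transfer function."""
    bounds = list(zip(P_lb, P_ub))             # constraint (a): Plb <= x <= Pub
    S_lb, S_ub = [], []
    for j in range(W.shape[1]):
        c = W[:, j]
        lo = linprog(c, bounds=bounds, method="highs").fun + b[j]      # min net input
        hi = -linprog(-c, bounds=bounds, method="highs").fun + b[j]    # max net input
        S_lb.append(np.tanh(lo))               # tanh is monotone, so bounds carry over
        S_ub.append(np.tanh(hi))
    return np.array(S_lb), np.array(S_ub)

W = np.array([[0.5, -1.0], [1.2, 0.3], [-0.7, 0.8]])   # 3 nodes in P, 2 in S
b = np.array([0.1, -0.2])
print(forward_step(np.zeros(3), np.ones(3), W, b))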


Termination criteria for convergence

A candidate rule is verified based on the convergence of the forward and backward passes. A run constitutes one forward and one backward pass. The procedure terminates when the difference between validity intervals between two consecutive runs becomes less than a predefined tolerance (a small threshold). Generally, the number of runs required for convergence can be considered a function of the specified validity intervals and the parameters of the neural network (weights and biases). A contradiction in the above procedure implies that the candidate rule incorrectly describes the behavior of the neural network. Such a rule is expunged from the candidate rule set and the procedure is repeated with the other candidate rules. This process continues until the candidate rule set is exhausted. A tolerance of 0.0001 was chosen for termination. All the 48 rules in the

NN-Rule set were verified by this procedure.

CHAPTER 5. RESULTS AND DISCUSSION

5.1 Performance of the 12-12-10-6 MLP Classifier

The confusion matrix was used to evaluate the performance of the 12-12-10-6 MLP classifier. The confusion matrix is a table where the desired classification (GA solution) and the output of the classifier are compared on the testing data set. The confusion matrix for the 12-12-10-6 MLP classifier is shown in Table 5.1.

Table 5.1 Confusion matrix of the 12-12-10-6 MLP classifier

Output \ Desired        Priority   Priority   Priority   Priority   Priority   Priority
                        (Zero)     (One)      (Two)      (Three)    (Four)     (Five)
Priority (Zero)         811        81         3          0          0          0
Priority (One)          346        846        244        5          0          0
Priority (Two)          200        428        789        176        7          0
Priority (Three)        0          27         412        1101       506        39
Priority (Four)         0          0          0          56         314        106
Priority (Five)         0          0          1          46         514        1221
Classification
Accuracy (%)            59.76      61.22      54.45      79.55      23.41      89.38
Total Accuracy (%)      61.38


In the above table, the entries with the same row and column class labels represent the number of testing set instances classified correctly by the 12-12-10-6 MLP classifier. The classification accuracy for each class was calculated by dividing the number of correct classifications by the total number of instances in that class and is shown in the last row of Table 5.1.
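These accuracies can be reproduced directly from the matrix, as the short sketch below shows (rows are classifier outputs, columns are desired GA classes).

# Recomputing the accuracies of Table 5.1 from the confusion matrix.
import numpy as np

cm = np.array([
    [811,  81,   3,    0,   0,    0],
    [346, 846, 244,    5,   0,    0],
    [200, 428, 789,  176,   7,    0],
    [  0,  27, 412, 1101, 506,   39],
    [  0,   0,   0,   56, 314,  106],
    [  0,   0,   1,   46, 514, 1221],
])

per_class = np.diag(cm) / cm.sum(axis=0)    # accuracy per desired class
total = np.trace(cm) / cm.sum()             # class-size-weighted overall accuracy
print(np.round(per_class * 100, 2), round(total * 100, 2))   # total is approx. 61.38 %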

The ID3 decision tree algorithm was also employed to induce a decision tree for comparing its performance with the 12-12-10-6 MLP classifier. The classification data set developed from genetic algorithm’s (GA) solution to the ft06 benchmark instance provided the training data for this algorithm. The induced decision tree (described as a propositional rule set) and related performance statistics for the ID3 classifier are tabulated in Appendix E. Table 5.2 provides a comparison of the performance of these two classifiers.

Table 5.2 Comparison of classifiers

Classifier          Accuracy (%)    Mean Square Error (MSE)
12-12-10-6 MLP      61.38           0.08
ID3                 60.64           0.28


The overall accuracy of the classifiers was calculated from the individual accuracies for different classes as a weighted average. An indirect measure of the classification performance is also provided by the Mean Square Error (MSE) term. The lower mean square error of the 12-12-10-6 MLP classifier implied a lower average difference between its output and the desired output.

The deviation in the performance of these classifiers from optimal classification is mainly attributable to the presence of considerable noise in the classification data set.

Two sources of this noise were:

• GA Assignments: The GA assigned different priorities to the same operation in

different schedules. This led to ambiguity in the classification data set with the

same training patterns having different target features. This considerably

increased the complexity of the learning task.

• Encoding of the classification problem: The chromosome sequences were mapped

according to the classification scheme presented in chapter 4 into training

patterns. This compilation reduced the dimensionality of the data. Though this

was desirable for comprehensibility, it still represented a loss of information and a

source of noise for the classification task.

5.2 Efficacy of the Rule Extraction Task

The efficacy of the rule extraction task was evaluated along the following dimensions:

• Comprehensibility and Expressive Power: The propositional rules in the NN-Rule

set developed by the rule extraction procedures used easily understood feature and

class labels for describing the extracted knowledge. The antecedent of each rule in

the rule set was a simple conjunction of the input features. The number of input

features in the antecedent of a rule provides an indirect measure of the

comprehensibility and expressive power of the rule set. Table 5.3 shows the rules

in the NN-Rule set (shown on the right) segregated according to the number of

features in their respective antecedents (shown on the left). Approximately a third of the rules had fewer than the maximum of four antecedents, increasing the comprehensibility of the developed rule set.

Table 5.3 Number of features in the rule antecedent for the NN-Rule set

Number of features in      Number of rules
the rule antecedent        (Total: 48)
1                          None
2                          2
3                          13
4                          33


• Accuracy and Fidelity: The rule set accurately mimicked the behavior of the

trained neural network in classifying all the patterns in the rule space. Hence, the

fidelity of the extraction process was maximum.

5.3 Schedule Generation and Comparison

To schedule the 6 x 6 benchmark instance (ft06), a priority index was assigned to each of the 36 operations. This was accomplished by matching the features of the operation with the antecedent of the induced rules in the NN-Rule set. The consequent of the matched rule represented the priority index of that operation. The operations, based on the assigned priorities, were sequenced on each machine and the schedule was then developed manually according to the Giffler-Thomson algorithm [54]. This algorithm chooses from among the available operations based on the priority index. Also, the operations were locally left-shifted to improve the makespan of the generated schedule.

The Gantt chart was used as a tool for visualizing the developed schedules. A similar procedure was utilized for scheduling the problem with the ID3-Rule set and the Shortest

Processing Time (SPT) heuristic.
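The matching step can be sketched as follows. The first rule corresponds to the first NN-Rule described in section 4.2.1, while the second rule and all names are illustrative placeholders rather than the actual 48-rule set.

# Illustrative rule matching: an operation's feature classes are matched
# against rule antecedents (with the "Any" wildcard) to obtain its priority.
RULES = [
    # (Operation, ProcessTime, RemainingTime, MachineLoad) -> Priority
    (("First", "Short", "Short", "Any"), "Zero"),      # first NN-Rule (section 4.2.1)
    (("Last",  "Any",   "Short", "Heavy"), "One"),     # hypothetical example rule
]

def match(features, rules):
    for antecedent, priority in rules:
        if all(a == "Any" or a == f for a, f in zip(antecedent, features)):
            return priority
    return None                                        # no matching rule found

print(match(("First", "Short", "Short", "Light"), RULES))   # -> "Zero"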

Table 5.4 gives a comparative summary of the makespans of schedules generated by the GA, the NN-Rule set, the ID3-Rule set, SPT and other priority dispatching rules for the ft06 instance. The performance of the Attribute-Oriented Induction (AOI) rule set is reported from the work of Koonce and Tsai [5]. Makespans for the schedules built by

using dispatching rules (other than SPT) are from an empirical study conducted by

Kaschel et al. [58].

Table 5.4 Makespans of schedules for ft06

Scheduler                                                        Makespan
Genetic Algorithms, GA                                           55
Neural Network rule set, NN-Rule                                 59
Decision Tree Classifier, ID3-Rule                               66
Attribute-Oriented Induction, AOI                                67
Shortest Processing Time, SPT                                    83
Most Work Remaining, MWKR                                        67
Shortest Remaining Processing Time, SRMPT                        84
Smallest ratio of Processing Time to Total Work, SPT-TWORK       71

As can be seen from the table, only the GA was able to achieve an optimal makespan of 55 units. The NN-Rule set, developed in the current work, could achieve a makespan of 59 units, a deviation of 4 time units (7.27 %) from the optimum. The deviations of other methods ranged from 6 to 29 units (10.9% - 52.7%). The performance

of the neural network based rule set (NN-Rule) is considerably better than that of the other rule sets and dispatching heuristics in scheduling the benchmark 6x6 problem.

A test problem set consisting of 10 randomly generated 6 x 6 problem scenarios was also used to compare the performance of different schedulers. The test set was developed by Koonce and Tsai [5] and is tabulated in Appendix F. Table 5.5 shows the performance of various schedulers (GA, NN-Rule, AOI-Mean, AOI-Mode, ID3-Rule and

SPT) on the test cases.

Table 5.5 Makespans obtained by various schedulers on the test set

Scenario_Name    GA     NN-Rule    AOI-Mean    AOI-Mode    ID3-Rule    SPT
ft06-R1          46     50         53          49          53          54
ft06-R2          53     56         58          56          58          64
ft06-R3          60     61         67          62          64          71
ft06-R4          48     56         55          60          58          63
ft06-R5          55     61         63          61          62          66
ft06-R6          54     58         59          61          58          67
ft06-R7          51     55         53          53          59          60
ft06-R8          67     74         75          76          75          71
ft06-R9          54     57         68          59          56          59
ft06-R10         59     65         70          70          68          86
Average          54.7   59.3       62.1        60.7        61.1        66.1


From the average makespan values in the above table, it is evident that the rule set derived from the neural network approach comes closest to the GA in scheduling the test problems.

5.3.1 Statistical Analysis

Analysis of Variance (ANOVA) was used for comparing the performance of alternate schedulers. The experiment was designed as a Randomized Complete Block

Design to account for the variability arising from different job shop problem scenarios in the test set. The different schedulers constituted the treatments and the scenarios were the blocks as shown in Table 5.6.

Table 5.6 The Randomized Complete Block Design table

RANDOMIZED COMPLETE BLOCK DESIGN

Treatments         Blocks (Scenarios)                                          Treatment    Treatment
(Schedulers)        1     2     3     4     5     6     7     8     9    10    Totals       Averages
GA                 46    53    60    48    55    54    51    67    54    59    547          54.70
NN-Rule            50    56    61    56    61    58    55    74    57    65    593          59.30
AOI-Mean           53    58    67    55    63    59    53    75    68    70    621          62.10
AOI-Mode           49    56    62    60    61    61    53    76    59    70    607          60.70
ID3-Rule           53    58    64    58    62    58    59    75    56    68    611          61.10
SPT                54    64    71    63    66    67    60    71    59    86    661          66.10
Block Totals      305   345   385   340   368   357   331   438   353   418    3640         60.67
Block Averages     51    58    64    57    61    60    55    73    59    70


The assumptions made for this experiment are that the observations are independent and normally distributed with the same variance for each treatment

(scheduler). The Anderson-Darling test verified the normality assumption. Bartlett’s test was used to validate the assumption of homogeneity of variances. The null hypothesis H0 for this experiment (stating that all treatment means are equal) was tested at the 5% significance level. The obtained ANOVA table is shown in Table 5.7.

Table 5.7 Analysis of variance table

ANOVA Table

Source of variation        Sum of squares    Degrees of freedom    Mean Square    F          P-value    F Critical
Treatments (Schedulers)    692.3333          5                     138.4667       13.9709    3E-08      2.4221
Blocks (Scenarios)         2421              9                     269
Error                      446               45                    9.9111
Total                      3559.3333         59
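The entries of Table 5.7 can be reproduced from the makespan data of Table 5.6, as the sketch below illustrates (scipy is used only for the p-value).

# Recomputing the randomized-complete-block ANOVA of Table 5.7 from Table 5.6
# (rows = schedulers/treatments, columns = scenarios/blocks).
import numpy as np
from scipy.stats import f as f_dist

data = np.array([
    [46, 53, 60, 48, 55, 54, 51, 67, 54, 59],   # GA
    [50, 56, 61, 56, 61, 58, 55, 74, 57, 65],   # NN-Rule
    [53, 58, 67, 55, 63, 59, 53, 75, 68, 70],   # AOI-Mean
    [49, 56, 62, 60, 61, 61, 53, 76, 59, 70],   # AOI-Mode
    [53, 58, 64, 58, 62, 58, 59, 75, 56, 68],   # ID3-Rule
    [54, 64, 71, 63, 66, 67, 60, 71, 59, 86],   # SPT
])
a, b = data.shape                                # 6 treatments, 10 blocks
grand = data.mean()
ss_total = ((data - grand) ** 2).sum()
ss_treat = b * ((data.mean(axis=1) - grand) ** 2).sum()
ss_block = a * ((data.mean(axis=0) - grand) ** 2).sum()
ss_error = ss_total - ss_treat - ss_block
f_value = (ss_treat / (a - 1)) / (ss_error / ((a - 1) * (b - 1)))
p_value = 1 - f_dist.cdf(f_value, a - 1, (a - 1) * (b - 1))
print(round(f_value, 2), p_value)                # approx. 13.97, p well below 0.05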

The computed F-value (13.97) in the above table was found to be greater than the critical F-value (2.42). Hence, the null hypothesis was rejected, leading to the conclusion that there exists a significant difference in the treatment means. Duncan’s multiple range test was used to identify the pairs of treatments that had a significant difference in means. The treatment means were sorted in ascending order. The least significant studentized

range, rp (subscript p denotes the number of treatment means) depends on the number of means and the degrees of freedom. For p values between 2 and 6, the values of rp were obtained from the least significant studentized range table. Duncan’s critical value, Rp for these means was computed. The difference between a pair of means drawn from the set of ordered treatment means was compared with the critical value, Rp. If this difference was greater than the critical value, it was concluded that there existed a significant difference between the treatments. This comparison was carried out for all 15 possible combinations of treatment pairs. The determination of significant difference in means allowed the treatments (schedulers) to be combined into groups. The six treatments have been divided into three groups (A, B, C) as shown in Table 5.8.
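The pairwise comparisons behind this grouping can be organized as in the following Python sketch. The least significant studentized range values rp must be read from a Duncan table at the 5% level for 45 error degrees of freedom; the r_p entries below are approximate values included only to make the example runnable, not the tabulated values used in the thesis.

```python
import numpy as np

# Treatment averages from Table 5.6, listed in ascending order of mean makespan.
means = {"GA": 54.70, "NN-Rule": 59.30, "AOI-Mode": 60.70,
         "ID3-Rule": 61.10, "AOI-Mean": 62.10, "SPT": 66.10}

ms_error, n_blocks = 9.9111, 10
se = np.sqrt(ms_error / n_blocks)            # standard error of a treatment mean

# Least significant studentized ranges r_p for p = 2..6 (5% level, 45 error df);
# approximate values for illustration only, to be replaced by the table entries.
r_p = {2: 2.85, 3: 3.00, 4: 3.09, 5: 3.16, 6: 3.21}
R_p = {p: r * se for p, r in r_p.items()}    # Duncan's critical values

names = list(means)                          # insertion order = ascending means
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        p = j - i + 1                        # number of ordered means spanned
        diff = means[names[j]] - means[names[i]]
        verdict = "significant" if diff > R_p[p] else "not significant"
        print(f"{names[i]:8s} vs {names[j]:8s}: diff = {diff:4.1f}, R_{p} = {R_p[p]:.2f} -> {verdict}")
```

With the tabulated rp values, these 15 comparisons yield the three groups reported in Table 5.8.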

Table 5.8 Grouping of schedulers based on Duncan’s multiple range test

Scheduler   Group
GA          A
NN-Rule     B
AOI-Mean    B
AOI-Mode    B
ID3-Rule    B
SPT         C


The three groups identified by Duncan's test correspond to different scheduling approaches for the job shop problem. The optimization method (GA) provided the best makespans and forms group A. The machine learning methods constituted the second group (B); within this group, the neural network-based approach provided the best results, with the lowest average makespan on the test problem set. All members of the second group performed better than the shortest processing time (SPT) heuristic, which forms group C.

CHAPTER 6. CONCLUSIONS AND FUTURE RESEARCH

6.1 Conclusions

This thesis presents a novel knowledge-based approach for the job shop scheduling problem by utilizing the various constituents of the soft computing paradigm. The ability of a genetic algorithm (GA) to provide multiple optimal solutions was exploited to generate a knowledge base of good solutions. A neural network was successfully trained on this knowledge base. Then, rule extraction algorithms were employed to induce decision tree and propositional rule representations describing the behavior of the trained neural network. The rule extraction task was successful in generating a rule set which completely and accurately mimicked the behavior of the trained neural network. The scheduler developed from this rule set can be utilized to schedule any 6 x 6 job shop scenario. Also, the developed system provides knowledge in the form of comprehensible rules which can effectively aid a human in the scheduling task.
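To illustrate the mechanism of such a rule-driven scheduler, the sketch below treats a propositional rule set as a lookup table that maps discretized operation attributes to a priority class. The two example rules are taken from the ID3 rule table in Appendix E; the rule set extracted from the neural network is applied in the same way, but with its own entries.

```python
# Each rule maps discretized operation attributes to a priority class.
# Key order: (Operation, ProcessTime, MachineLoad, RemainingTime) -> Priority.
# The two entries below are copied from the ID3 rule table in Appendix E.
rules = {
    ("First", "Short", "Light", "Medium"): 0,
    ("Later", "Medium", "Heavy", "Long"): 4,
}

def priority(operation, process_time, machine_load, remaining_time):
    """Return the priority class for an operation, or None if no rule matches."""
    return rules.get((operation, process_time, machine_load, remaining_time))

# The resulting priority classes are then used to order the competing operations
# at each scheduling decision point.
print(priority("Later", "Medium", "Heavy", "Long"))   # -> 4
```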

A test problem set consisting of 10 randomly generated 6 x 6 scenarios was used to evaluate the performance of the developed rule-based scheduler. The makespans produced by the GA were considered to be the known optimal solutions for these scenarios. The rule-based scheduler had an average deviation of 4.6 time units (8.4%) from the optimum (i.e., the average makespan of the GA) on the test problem set, and it produced a shorter makespan than the Shortest Processing Time (SPT) heuristic in nine of the ten cases. Though the rule-based scheduler could not match the performance of the genetic algorithm, it is computationally less intensive than the GA and offers a more comprehensible scheduling approach. It also provides an attractive alternative to simple heuristics such as SPT for scheduling 6 x 6 job shop problems.

A comparative evaluation of the rule-based scheduler with other schedulers developed from different machine learning methodologies was also undertaken. Two schedulers developed by other researchers using the Attribute-Oriented Induction (AOI) data mining methodology and another scheduler based on the ID3 decision tree induction algorithm were used for comparison. Among these schedulers, the rule-based scheduler developed in the current work had the closest average makespan to that of the genetic algorithm. However, statistical analysis revealed no significant differences in the performance of these schedulers on the test problem set.

The comparable performance of the current approach and the AOI data mining methodology demonstrates the feasibility of neural network based data mining. Unlike the rule set derived in this work, the rule sets induced by the AOI and ID3 methods were insufficient to describe every randomly generated 6 x 6 scenario. Also, the decision tree induction algorithm utilized for knowledge extraction from the neural network is similar to the ID3 algorithm; the difference in performance between the rule-based scheduler and the ID3-based scheduler is mainly attributable to the robustness of neural networks in handling noisy data sets.

In summary, this research successfully developed a rule-based scheduler that provides a close approximation to the performance of a GA scheduler for 6 x 6 job shop scheduling problems.

6.2 Future Research

This research focused primarily on deriving production rules from a neural network performing job shop scheduling. There is definite scope for improving the current work along the following directions.

Use of multiple data sets

The knowledge base for training the neural network in the current approach was derived from solutions to a single benchmark 6 x 6 problem. This knowledge base can be augmented with near-optimal solutions to randomly generated 6 x 6 scenarios provided by a GA. This can lead to an improvement in the generalization capabilities of the trained neural network.

Knowledge-based neurocomputing

The time and cost of training the neural network are important considerations in many practical applications. The training regimen can be considerably enhanced by knowledge-primed or hint-based training strategies. Such techniques map the available domain knowledge into the architecture of the neural network to substantially reduce the training effort. The predictive accuracy of the neural network is also generally improved by these techniques.

Neural networks for regression

The classification problem formulated in this work uses a pre-defined concept hierarchy to cluster and label the operations in a GA sequence. This labeling step introduces significant noise into the learning data set. An alternative approach could utilize neural networks for regression, where the task of the neural network is to develop a relationship between several predictor variables (operation attributes) and a dependent variable (priority). In this approach, the predictor variables are numbers representing the various attributes of the operation. An appropriate rule extraction method could then be employed to derive rules describing the behavior of the neural network. A numerical encoding of the problem is believed to constitute a more suitable learning representation for the neural network, leading to a more accurate model of the underlying data distribution.
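As a sketch of this direction, the fragment below uses scikit-learn's MLPRegressor to map numeric operation attributes to a continuous priority. The feature encoding and all numbers are hypothetical illustrations, not taken from the thesis data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical numeric encoding of operation attributes, one row per operation:
# [position of the operation in its job, processing time, remaining processing
#  time, current machine load]. The target is a numeric priority taken from the
# GA sequence instead of a discrete class label.
X = np.array([
    [1, 5, 30, 12],
    [3, 9, 21,  7],
    [6, 4,  4, 15],
], dtype=float)
y = np.array([0.05, 0.40, 0.92])   # illustrative priorities, not thesis data

# A small multilayer perceptron trained as a regressor on the numeric encoding.
model = MLPRegressor(hidden_layer_sizes=(12, 10), activation="tanh",
                     max_iter=5000, random_state=0)
model.fit(X, y)
print(model.predict([[2, 7, 18, 10]]))   # predicted priority for a new operation
```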

Investigation of different job shop scenarios

The scalability of the current approach needs to be explored with larger job shop problems such as the 10x10, 15x10, and 20x10 instances. Also, adaptation of the current approach to stochastic and dynamic scheduling environments could be attempted.


REFERENCES

[1] Baker, K. (1974). Introduction to sequencing and scheduling. New York, NY: John Wiley & Sons, Inc.
[2] French, S. (1982). Sequencing and scheduling: An introduction to the mathematics of the job-shop. New York, NY: John Wiley & Sons, Inc.
[3] Jain, A. S., & Meeran, S. (1998). A state-of-the-art review of job-shop scheduling techniques. Technical report. Department of Applied Physics, Electronic and Mechanical Engineering, University of Dundee, Dundee, Scotland.
[4] Blackstone Jr., J. H., Phillips, D. T., & Hogg, G. L. (1982). A state-of-the-art survey of dispatching rules for manufacturing job shop operations. International Journal of Production Research, 20, 27-45.
[5] Koonce, D. A., & Tsai, S. C. (2000). Using data mining to find patterns in genetic algorithm solutions to a job shop schedule. Computers & Industrial Engineering, 38(3), 361-374.
[6] Bonissone, P. P. (1997). Soft computing: the convergence of emerging reasoning technologies. Soft Computing, 1(1), 6-18.
[7] Zadeh, L. A. (1994). Fuzzy logic: issues, contentions and perspectives. IEEE International Conference on Acoustics, Speech, and Signal Processing, 4, 19-22.
[8] Dote, Y., & Ovaska, S. J. (2001). Industrial applications of soft computing: A review. Proceedings of the IEEE, 89(9), 1243-1265.
[9] Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.
[10] Gen, M., & Cheng, R. (2000). Genetic algorithms and engineering optimization. New York, NY: John Wiley & Sons, Inc.
[11] Jang, J. S. R., Sun, C. T., & Mizutani, E. Neuro-fuzzy and soft computing. Upper Saddle River, NJ: Prentice Hall.
[12] Mitchell, M. (1996). An introduction to genetic algorithms. Cambridge, MA: MIT Press.
[13] Dietterich, T. G. (1990). Machine learning. Annual Review of Computer Science, 4, 255-306.
[14] Mitchell, T. (1997). Machine learning. 1st edition. Computer Science Series. Boston, MA: WCB McGraw-Hill.
[15] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3).
[16] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
[17] Craven, M. (1996). Extracting comprehensible models from trained neural networks. Ph.D. dissertation. University of Wisconsin, Madison, WI.

[18] Quinlan, J. R. (1993). C4.5: Programs in machine learning. San Mateo, CA: Morgan Kaufmann.
[19] Han, J., & Fu, Y. (1996). Exploration of the power of attribute-oriented induction in data mining. In Advances in knowledge discovery and data mining. Cambridge, MA: AAAI/MIT Press, 399-421.
[20] Han, J., Cai, Y., & Cercone, N. (1992). Knowledge discovery in databases: An attribute-oriented approach. Proceedings of 1992 International Conference on Very Large Data Bases (VLDB'92). Vancouver, Canada, 547-559.
[21] Han, J., Cai, Y., Cercone, N., & Huang, Y. (1994). Discovery of data evolution regularities in large databases. Journal of Computer and Software Engineering, 1-29.
[22] Efraim, T., Jay, E. A., Liang, T. P., & McCarthy, R. V. (2001). Decision support systems and intelligent systems. Upper Saddle River, NJ: Prentice Hall.
[23] Principe, J. C., Euliano, E. R., & Lefebvre, W. C. (1999). Neural and adaptive systems: Fundamentals through simulations with cd-rom. New York, NY: John Wiley & Sons, Inc.
[24] Reed, R. D., & Marks, R. J. (1998). Neural smithing: Supervised learning in feedforward artificial neural networks. Cambridge, MA: MIT Press.
[25] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
[26] Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1-3), 185-234.
[27] Whitley, D., Starkweather, T., & Bogart, C. (1990). Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, 14(3), 347-361.
[28] Engel, J. (1988). Teaching feed-forward neural networks by simulated annealing. Complex Systems, 2(6), 641-648.
[29] Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
[30] Tickle, A., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057-1068.
[31] Craven, M., & Shavlik, J. (1999). Rule extraction: where do we go from here? University of Wisconsin Machine Learning Research Group working paper, 99-1.
[32] Gallant, S. I. (1988). Connectionist expert systems. Communications of the ACM, 31, 152-169.
[33] Andrews, R., Diederich, J., & Tickle, A. B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8, 373-389.

[34] Fu, L. M. (1991). Rule learning by searching on adapted nets. Proceedings of the Ninth National Conference on Artificial Intelligence. Anaheim, CA: AAAI Press, 590-595.
[35] Towell, G. G., & Shavlik, J. W. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13, 71-101.
[36] Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting. Neural Computation, 9, 205-225.
[37] Andrews, R., & Geva, S. (1994). Rule extraction from a constrained error back propagation MLP. Proceedings of Fifth Australian Conference on Neural Networks. Brisbane, Queensland, 9-12.
[38] Saito, K., & Nakano, R. (1990). Rule extraction from facts and neural networks. Proceedings of the International Neural Network Conference. San Diego, CA, 379-382.
[39] Craven, M. W., & Shavlik, J. W. (1994). Using sampling and queries to extract rules from trained neural networks. Proceedings of the Eleventh International Conference on Machine Learning. New Brunswick, NJ: Morgan Kaufmann, 37-45.
[40] Thurn, S. B. (1993). Extracting provably correct rules from artificial neural networks. Technical Report IAI-TR-93-5. University of Bonn, Bonn, Germany.
[41] Pop, E., & Ruleneg, J. D. (1994). Rule-extraction from neural networks by step-wise negation. Technical report. Queensland University of Technology, Neurocomputing Research Center.
[42] Craven, M. W., & Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. Advances in Neural Information Processing, 8, 24-30.
[43] Schmitz, G. P. J., Aldrich, C., & Gouws, F. S. (1999). ANN-DT: An algorithm for extraction of decision trees from artificial neural networks. IEEE Transactions on Neural Networks, 10(6), 1392-1401.
[44] Boz, O. (2002). Converting a trained neural network to a decision tree. Proceedings of the 2002 International Conference on Machine Learning and Applications (ICMLA). Las Vegas, NE: CSREA Press, 110-116.
[45] Sestito, S., & Dillon, T. (1992). Automated knowledge acquisition of rules with continuously valued attributes. Proceedings of the Twelfth International Conference on Expert Systems and their Application. Avignon, France, 645-656.
[46] Keedwell, E., Narayanan, A., & Savic, D. A. (1999). Using genetic algorithms to extract rules from trained neural networks. Proceedings of the Genetic and Evolutionary Computing Conference. Orlando, FL: Morgan Kaufmann, 793.
[47] Maire, F. (2000). On the convergence of validity interval analysis. IEEE Transactions on Neural Networks, 11(3), 802-807.

[48] Roy, B., & Sussmann, B. (1964). Les Problèmes d'Ordonnancement avec Contraintes Disjonctives. Note D.S. no. 9 bis, SEMA. Paris, France.
[49] Yamada, T., & Nakano, R. (1995). Job shop scheduling by simulated annealing combined with deterministic local search. Metaheuristics International Conference. Hilton, Breckenridge, Colorado, USA, 344-349.
[50] Conway, R. W., Maxwell, W. L., & Miller, L. W. (1967). Theory of scheduling. Reading, MA: Addison Wesley.
[51] Pinedo, M. (1995). Scheduling theory, algorithms and systems. Englewood Cliffs, NJ: Prentice-Hall.
[52] Manne, A. S. (1960). On the job-shop scheduling problem. Operations Research, 8, 219-223.
[53] Glover, F., & Greenberg, H. J. (1989). New approaches for heuristic search: A bilateral linkage with artificial intelligence. European Journal of Operational Research, 39, 119-130.
[54] Giffler, B., & Thompson, G. L. (1960). Algorithms for solving production scheduling problems. Operations Research, 8(4), 487-503.
[55] Panwalker, S. S., & Iskander, W. (1977). A survey of scheduling rules. Operations Research, 25, 45-61.
[56] Blackstone Jr., J. H., Phillips, D. T., & Hogg, G. L. (1982). A state-of-the-art survey of dispatching rules for manufacturing job-shop operations. International Journal of Production Research, 20, 27-45.
[57] Lawrence, S. (1984). Supplement to resource constrained project scheduling: An experimental investigation of heuristic scheduling techniques. Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh, USA.
[58] Käschel, J., Teich, T., Köbernik, G., & Meier, B. (1999). Algorithms for the job shop scheduling problem: A comparison of different methods. European Symposium on Intelligent Techniques. Greece, June 3-4.
[59] Adams, J., Balas, E., & Zawack, D. (1987). The shifting bottleneck procedure for job shop scheduling. International Journal of Flexible Manufacturing Systems, 34, 391-401.
[60] Ramudhin, A., & Marier, P. (1996). The generalized shifting bottleneck procedure. European Journal of Operational Research, 93(1), 34-38.
[61] Applegate, D., & Cook, W. (1991). A computational study of the job-shop scheduling problem. ORSA Journal on Computing, 3(2), 149-156.
[62] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087-1092.
[63] Steinhofel, K., Albrecht, A., & Wong, C. K. (1999). Two simulated annealing-based heuristics for the job-shop scheduling problem. European Journal of Operational Research, 118, 524-548.

[64] Szu, H., & Hartley, R. (1987). Fast simulated annealing. Physics Letters A, 122, 157-162.
[65] Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13(5), 533-549.
[66] Dell'Amico, M., & Trubian, M. (1993). Applying tabu search to the job-shop scheduling problem. Annals of Operations Research, 41, 231-252.
[67] Hao, J., & Pannier, J. (1998). Simulated annealing and tabu search for constraint solving. Fifth International Symposium of Artificial Intelligence and Mathematics.
[68] Davis, L. (1985). Job-shop scheduling with genetic algorithms. In Grefenstette, J. J. (ed.), Proceedings of the First International Conference on Genetic Algorithms and their Applications. Pittsburg, PA: Lawrence Erlbaum, 136-140.
[69] Cheng, R., Gen, M., & Tsujimura, Y. (1996). A tutorial survey of job-shop scheduling problems using genetic algorithms - I. Representation. Computers & Industrial Engineering, 30(4), 983-997.
[70] Nakano, R., & Yamada, T. (1991). Conventional genetic algorithm for job-shop problems. In Kenneth, M. K., & Booker, L. B. (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms and their Applications. San Diego, USA, 474-479.
[71] Kobayashi, S., Ono, I., & Yamamura, M. (1995). An efficient genetic algorithm for job shop scheduling problems. Proceedings of the Sixth International Conference on Genetic Algorithms. San Francisco, CA: Morgan Kaufmann Publishers, 506-511.
[72] Fox, M. S. (1987). Constraint-directed search: A case study of job-shop scheduling. Research Notes in Artificial Intelligence. London: Pitman Publishing.
[73] Fox, M. S., & Sadeh, N. (1990). Why is scheduling difficult? A CSP perspective. In Aiello, L. (ed.), ECAI-90 Proceedings of the 9th European Conference on Artificial Intelligence. Stockholm, Sweden, August 6-10, 754-767.
[74] Cheung, J. Y. (1994). Scheduling. In Dagli, C. H. (ed.), Artificial Neural Networks for Intelligent Manufacturing. London: Chapman and Hall, Chapter 8, 159-193.
[75] Jain, A. S., & Meeran, S. (1998). Job-shop scheduling using neural networks. International Journal of Production Research, 36(5), 1249-1272.
[76] Foo, S. Y., & Takefuji, Y. (1988). Integer linear programming neural networks for job shop scheduling. In Kosko, B. (ed.), Proceedings of the 1988 IEEE International Conference on Neural Networks. San Diego, California, 24-27 July, 2, 341-348.

[77] Zhou, D. N., Cherkassky, V., Baldwin, T. R., & Olson, D. E. (1991). A neural network approach to job-shop scheduling. IEEE Transactions on Neural Networks, 2(1), 175-179.
[78] Sabuncuoglu, I., & Gurgun, B. (1996). A neural network model for scheduling problems. European Journal of Operational Research, 93(2), 288-299.
[79] Jain, A. S., & Meeran, S. (1996). Scheduling a job-shop using a modified back error propagation neural network. Proceedings of the IMS'96 First Turkish Symposium on Intelligent Manufacturing Systems. Adapazari, Turkey, 30-31 May, 462-474.
[80] Rabelo, L. C., & Alptekin, S. (1989). Using hybrid neural networks/expert systems for intelligent scheduling in flexible manufacturing systems. IJCNN International Joint Conference on Neural Networks. Washington, June 18-22, 2, 608.
[81] Dagli, C. H., & Sittasathanchai, S. (1995). Genetic neuro-scheduler: A new approach for job shop scheduling. International Journal of Production Economics, 41, 135-145.
[82] Yu, H., & Liang, W. (2001). Neural network and genetic algorithm-based hybrid approach to expanded job-shop scheduling.
[83] Yih, Y., Liang, T. P., & Moskowitz, H. (1991). A hybrid approach for crane scheduling problems. In Dagli, C. H., Kumara, S. R. T., & Shin, Y. C. (eds.), Intelligent Engineering Systems Through Artificial Neural Networks. New York: ASME, 867-872.
[84] Kim, S. Y., Lee, Y. H., & Agnihotri, D. (1995). A hybrid approach for sequencing jobs using heuristic rules and neural networks. Production Planning and Control, 6(5), 445-454.
[85] Kantak, S. A., & Koonce, D. (2002). Improving the data mining exploration technique for job-shop schedules by using multiple data sets. Proceedings of the Sixth International Conference on Engineering Design and Automation. Maui, Hawaii, 31-36.
[86] Kwak, C., & Yih, Y. (2004). Data-mining approach to production control in the computer-integrated testing cell. IEEE Transactions on Robotics and Automation, 20(1), 107-116.
[87] Yoshida, T., & Hideyuki, T. (1999). A study on association among dispatching rules in manufacturing scheduling problems. Proceedings of Seventh IEEE International Conference on Emerging Technologies and Factory Automation. Barcelona, Spain, 1355-1360.
[88] Muth, J., & Thompson, G. (1963). Industrial scheduling. Englewood Cliffs, NJ: Prentice Hall.
[89] Shah, N., & Koonce, D. (2004). Using distributed genetic algorithms for solving job shop scheduling problems. Proceedings of the IIE 2004 Annual Conference. Houston, TX.

[90] Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools with Java implementations. San Francisco, CA: Morgan Kaufmann.
[91] Dantzig, G. B., Orden, A., & Wolfe, P. (1955). Generalized simplex method for minimizing a linear form under linear inequality constraints. Pacific Journal of Mathematics, 5, 183-195.

APPENDIX A EVALUATION OF NN CLASSIFIERS

Classifier                                        Hidden Layers   Dimensionality   Transfer Functions             Training Algorithm   Classification Accuracy (%)
Multilayer Perceptron                             1               12-10-06         Hyperbolic Tangent             Gradient Descent     60.23
Multilayer Perceptron                             1               12-10-06         Logistic                       Gradient Descent     59.24
Multilayer Perceptron                             2               12-12-10-06      Hyperbolic Tangent             Gradient Descent     61.38
Multilayer Perceptron                             2               12-12-10-06      Logistic                       Gradient Descent     59.73
Multilayer Perceptron with Genetic Optimization   2               12-12-12-06      Hyperbolic Tangent             Gradient Descent     60.55
Radial Basis Function                             2               12-50-12-06      Gaussian, Hyperbolic Tangent   Gradient Descent     57.23


APPENDIX B NETWORK PARAMETERS

Weights: Input layer-Hidden Layer 1 0.9123 -0.2768 1.5594 0.9058 0.1776 0.5796 0.6231 -0.8628 -0.0715 -0.6052 1.6923 0.0673 -0.7666 -0.9619 -2.6060 -1.0251 -0.0792 0.2438 0.8190 -0.7447 0.7624 -0.7305 0.5555 -0.0667 0.4753 -0.4593 0.4850 -0.4069 1.0759 -0.1278 0.1953 2.0040 0.2256 0.7618 -0.5887 -0.1589 0.0338 -0.5676 0.3055 0.3345 0.3832 -0.4294 -2.2908 0.4461 0.6360 1.3488 0.4987 -1.5545 -0.7944 -0.7964 -1.1415 -0.4488 -0.1498 -0.7014 2.0213 0.6554 -0.5699 0.1607 -0.7210 0.1802 0.1683 -0.6480 0.5660 1.1736 -0.2328 1.0258 -1.9235 -0.6634 0.7291 -0.0527 -0.0075 0.5918 1.0552 -0.4915 0.3404 -0.9357 1.3692 -0.7756 -0.4042 -0.0375 0.0734 -0.1587 0.6149 -1.3970 -0.3631 -0.3550 -1.7826 -0.6189 -0.8022 1.2232 0.5065 0.6702 -0.0056 -1.5114 0.7556 -0.6216 1.6824 -0.2856 -0.0387 0.0056 0.9305 -0.1129 -1.4812 0.0138 0.2289 0.8556 -0.0625 -1.2512 -0.9648 -0.2342 0.7400 0.5697 -0.4759 -0.8091 1.1389 0.5616 0.3184 0.3839 -0.1995 0.6829 0.8476 0.6979 -0.4644 -0.5511 0.6744 -0.0334 0.1738 0.0637 -0.3708 0.2394 0.4069 0.4139 -0.8359 -0.1874 0.3126 0.7538 -0.6722 -0.6482 -0.4647 -0.7395 0.4543 -0.4968 -0.7484 -0.3485 Bias (Hidden Layer 1) -0.0905 0.6094 0.1131 0.5621 -0.5367 -0.4181 0.5933 -0.0133 -0.8458 -0.5765 -1.0965 0.6397

Weights: Input layer-Hidden Layer 1 0.5571 -0.3602 0.1414 -0.7832 -0.3804 -0.2448 0.0862 -0.9249 0.8619 -0.5449 0.5571 -0.3602 0.5547 -0.1614 -0.1229 -0.0608 0.3725 -0.6337 -0.1804 -0.0749 0.1003 -0.2343 0.5547 -0.1614 0.4505 -0.6033 -0.8880 -0.7018 0.0201 0.2417 0.0813 -0.5456 0.4569 0.3470 0.4505 -0.6033 -0.3162 0.3347 -0.8263 -0.3991 -0.6570 -0.1452 -0.4113 -0.3403 -0.2938 -0.3506 -0.3162 0.3347 0.1542 -0.1296 -0.0562 -0.3185 0.0587 0.6082 0.5793 -0.4112 0.8912 0.2988 0.1542 -0.1296 -0.1275 0.6447 -0.8568 0.6127 -0.5409 0.4739 0.1634 0.5450 0.0178 0.4723 -0.1275 0.6447 -0.5006 0.5957 -0.9413 -1.0095 0.1426 -0.3619 0.2224 0.1996 1.0647 -0.0976 -0.5006 0.5957 -0.4090 -0.7953 0.2317 -0.2136 -0.4740 0.1982 -1.0695 0.1254 0.6185 0.1406 -0.4090 -0.7953 -0.8876 0.5305 0.4433 0.2208 -0.0301 0.1347 -0.1048 -0.2871 0.0420 0.3858 -0.8876 0.5305 -0.2402 -0.6580 -0.3210 0.3898 0.7971 0.0907 -0.0162 -0.7024 0.7870 -0.6499 -0.2402 -0.6580 0.4093 -0.2612 -0.2724 0.3230 -0.7930 0.0719 0.5929 0.0214 -0.9684 0.5082 0.4093 -0.2612 0.1379 0.5880 0.2016 0.1407 0.5487 -0.0780 0.1584 0.5418 0.1562 0.1604 0.1379 0.5880 Bias (Hidden Layer 1) 0.1228 0.2039 0.2257 -0.2635 0.4563 -0.4577 -0.0247 -0.2997 -0.1031 -0.1119 0.1228 0.2039


Weights: Hidden Layer 2-Output Layer -0.1975 -0.1735 -0.4378 -0.6495 -0.4428 0.2214 -0.0597 0.0387 0.3034 -0.3584 0.1847 -0.5140 -0.4220 0.0271 0.8882 0.4397 -0.3841 0.0998 -0.0914 0.5762 0.2284 -0.3222 0.0218 0.2648 -0.7335 0.0343 -0.4633 -0.2331 0.2156 0.0338 -0.0731 -0.0304 0.4673 0.0976 0.3871 0.3752 0.4362 0.3789 -0.0465 -0.3752 0.2069 -0.2413 -0.1369 0.0175 0.1814 0.3336 -0.2838 -0.6639 -0.0561 0.2071 0.1676 0.1965 0.7364 -0.6345 0.5902 0.3686 -0.4502 0.6067 -0.0411 -0.2815 Bias -0.2438 -0.5107 0.0277 -0.2319 -0.3288 -0.6118


APPENDIX C DECISION TREE INDUCTION DATASETS

Training Dataset

ID   Operation   ProcessTime   RemainingTime   MachineLoad   Priority
1    First       Short         Medium          Light         Zero
2    First       Short         Long            Light         One
3    First       Medium        Long            Light         Zero
4    First       Long          Medium          Light         One
5    First       Long          Long            Light         Zero
6    Middle      Short         Short           Light         Two
7    Middle      Short         Medium          Light         One
8    Middle      Short         Medium          Heavy         One
9    Middle      Medium        Short           Heavy         Three
10   Middle      Medium        Medium          Light         Two
11   Middle      Medium        Medium          Heavy         One
12   Middle      Medium        Long            Light         One
13   Middle      Long          Medium          Heavy         Two
14   Later       Short         Short           Light         Three
15   Later       Short         Short           Heavy         Four
16   Later       Short         Medium          Light         Three
17   Later       Medium        Short           Heavy         Four
18   Later       Long          Short           Light         Three
19   Later       Long          Short           Heavy         Three
20   Later       Long          Medium          Heavy         Three
21   Last        Short         Short           Light         Five
22   Last        Medium        Short           Light         Five
23   Last        Medium        Short           Heavy         Five
24   Last        Long          Short           Heavy         Five


Sampled Data Set

ID No.   Operation   ProcessTime   RemainingTime   MachineLoad   Priority
1        First       Short         Short           Light         Zero
2        First       Short         Short           Heavy         Zero
3        First       Short         Medium          Heavy         Zero
4        First       Short         Long            Heavy         Zero
5        First       Medium        Short           Light         Four
6        First       Medium        Short           Heavy         Three
7        First       Medium        Medium          Light         Zero
8        First       Medium        Medium          Heavy         Zero
9        First       Medium        Long            Heavy         One
10       First       Long          Short           Light         Three
11       First       Long          Short           Heavy         Three
12       First       Long          Medium          Heavy         One
13       First       Long          Long            Heavy         Zero
14       Middle      Short         Short           Heavy         Three
15       Middle      Short         Long            Light         Zero
16       Middle      Short         Long            Heavy         Zero
17       Middle      Medium        Short           Light         Five
18       Middle      Medium        Long            Heavy         One
19       Middle      Long          Short           Light         Three
20       Middle      Long          Short           Heavy         Three
21       Middle      Long          Medium          Light         One
22       Middle      Long          Long            Light         One
23       Middle      Long          Long            Heavy         Zero
24       Later       Short         Medium          Heavy         Three
25       Later       Short         Long            Light         Zero
26       Later       Short         Long            Heavy         Three
27       Later       Medium        Short           Light         Three
28       Later       Medium        Medium          Light         Two
29       Later       Medium        Medium          Heavy         Four
30       Later       Medium        Long            Light         Three
31       Later       Medium        Long            Heavy         Three
32       Later       Long          Medium          Light         Three
33       Later       Long          Long            Light         Three
34       Later       Long          Long            Heavy         Three
35       Last        Short         Short           Heavy         Five
36       Last        Short         Medium          Light         Five
37       Last        Short         Medium          Heavy         One
38       Last        Short         Long            Light         Zero
39       Last        Short         Long            Heavy         Five
40       Last        Medium        Medium          Light         Five
41       Last        Medium        Medium          Heavy         Five
42       Last        Medium        Long            Light         Zero
43       Last        Medium        Long            Heavy         Five
44       Last        Long          Short           Light         Five
45       Last        Long          Medium          Light         One
46       Last        Long          Medium          Heavy         One
47       Last        Long          Long            Light         One
48       Last        Long          Long            Heavy         One

APPENDIX D NN DECISION TREE EXTRACTION

Confusion Matrix

Output / Desired              Priority (Zero)   Priority (One)   Priority (Two)   Priority (Three)   Priority (Four)   Priority (Five)
Priority (Zero)                     16                 0                0                 0                  0                 0
Priority (One)                       0                16                0                 0                  0                 0
Priority (Two)                       0                 0                4                 0                  0                 0
Priority (Three)                     0                 0                0                20                  0                 0
Priority (Four)                      0                 0                0                 0                  4                 0
Priority (Five)                      0                 0                0                 0                  0                12
Classification Accuracy (%)        100               100              100               100                100               100
Total Accuracy (%): 100

Performance Statistics

Parameters                           Values
Time taken to build model            0.23 seconds
Total Number of Instances            72
Correctly Classified Instances       72
Incorrectly Classified Instances     0


APPENDIX E ID3 DECISION TREE INDUCTION

Confusion Matrix

Output / Desired              Priority (Zero)   Priority (One)   Priority (Two)   Priority (Three)   Priority (Four)   Priority (Five)
Priority (Zero)                    719               332              199                 0                  0                 0
Priority (One)                      80               757              448                21                  0                 0
Priority (Two)                       0               189              739               381                  0                 0
Priority (Three)                     0                 5              188              1039                 87                16
Priority (Four)                      0                 0                2               495                526               299
Priority (Five)                      0                 0                0                25                301               947
Classification Accuracy (%)      89.99             59.00            46.89             52.98              57.55             75.04
Total Accuracy (%): 60.64

Performance Statistics

Parameters                              Values
Time taken to build model               0.7 seconds
Total Number of Instances (Test set)    7795
Correctly Classified Instances          4727
Incorrectly Classified Instances        3068
Root Mean Square Error                  0.2835


ID3 Rule set

Rule ID   Operation   ProcessTime   MachineLoad   RemainingTime   Priority
1         First       Short         Light         Medium          0
2         First       Short         Heavy         Medium          0
3         First       Short         Light         Long            1
4         First       Short         Heavy         Long            0
5         First       Short         Light         Short           null
6         First       Short         Heavy         Short           null
7         First       Medium        Light         Medium          0
8         First       Medium        Heavy         Medium          0
9         First       Medium        Light         Long            0
10        First       Medium        Heavy         Long            0
11        First       Medium        Light         Short           0
12        First       Medium        Heavy         Short           0
13        First       Long          Light         Medium          1
14        First       Long          Heavy         Medium          1
15        First       Long          Light         Long            0
16        First       Long          Heavy         Long            0
17        First       Long          Light         Short           null
18        First       Long          Heavy         Short           null
19        Middle      Short         Light         Short           2
20        Middle      Short         Heavy         Short           2
21        Middle      Medium        Light         Short           3
22        Middle      Medium        Heavy         Short           3
23        Middle      Long          Light         Short           null
24        Middle      Long          Heavy         Short           null
25        Middle      Short         Light         Medium          1
26        Middle      Short         Heavy         Medium          1
27        Middle      Medium        Light         Medium          2
28        Middle      Medium        Heavy         Medium          1
29        Middle      Long          Light         Medium          2
30        Middle      Long          Heavy         Medium          2
31        Middle      Short         Light         Long            1
32        Middle      Short         Heavy         Long            1
33        Middle      Medium        Light         Long            1
34        Middle      Medium        Heavy         Long            1
35        Middle      Long          Light         Long            1
36        Middle      Long          Heavy         Long            1

ID3 Rule set (continued)

Rule ID   Operation   ProcessTime   MachineLoad   RemainingTime   Priority
37        Later       Short         Light         Medium          3
38        Later       Short         Heavy         Medium          4
39        Later       Short         Light         Long            null
40        Later       Short         Heavy         Long            4
41        Later       Short         Light         Short           3
42        Later       Short         Heavy         Short           4
43        Later       Medium        Light         Medium          4
44        Later       Medium        Heavy         Medium          4
45        Later       Medium        Light         Long            4
46        Later       Medium        Heavy         Long            4
47        Later       Medium        Light         Short           3
48        Later       Medium        Heavy         Short           3
49        Later       Long          Light         Medium          3
50        Later       Long          Heavy         Medium          3
51        Later       Long          Light         Long            null
52        Later       Long          Heavy         Long            null
53        Later       Long          Light         Short           3
54        Later       Long          Heavy         Short           3
55        Last        Short         Light         Medium          5
56        Last        Short         Heavy         Medium          5
57        Last        Short         Light         Long            5
58        Last        Short         Heavy         Long            5
59        Last        Short         Light         Short           5
60        Last        Short         Heavy         Short           5
61        Last        Medium        Light         Medium          5
62        Last        Medium        Heavy         Medium          5
63        Last        Medium        Light         Long            5
64        Last        Medium        Heavy         Long            5
65        Last        Medium        Light         Short           5
66        Last        Medium        Heavy         Short           5
67        Last        Long          Light         Medium          5
68        Last        Long          Heavy         Medium          5
69        Last        Long          Light         Long            5
70        Last        Long          Heavy         Long            5
71        Last        Long          Light         Short           5
72        Last        Long          Heavy         Short           5

APPENDIX F TEST SCHEDULING SCENARIOS ft06-R1 ft06-R2 ft06-R3 ft06-R4 ft06-R5 ft06-R6 ft06-R7 ft06-R8 ft06-R9 ft06-R10 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 2,9 2,9 5,4 4,6 6,5 6,7 6,2 4,9 4,1 2,8 3,7 5,10 4,1 5,7 2,5 4,1 5,2 1,9 3,9 5,7 4,1 1,3 6,8 6,4 3,7 3,5 2,3 2,9 5,10 1,3 6,8 6,1 3,1 3,8 4,7 2,10 3,4 5,10 1,9 4,2 1,4 3,8 2,10 1,1 5,3 5,3 1,3 6,2 2,9 6,6 5,2 4,1 1,8 2,6 1,2 1,5 4,10 3,7 6,8 3,6 4,9 4,4 2,10 5,8 1,10 6,10 5,3 2,7 6,1 3,4 2,7 6,5 5,10 4,5 5,5 5,8 2,8 5,3 1,10 2,5 6,7 5,7 3,6 2,5 4,10 1,8 4,6 6,4 4,9 5,2 3,1 1,7 6,5 1,4 3,10 3,2 3,10 1,3 5,8 1,1 5,4 3,1 4,1 6,1 6,8 2,2 1,4 3,9 2,3 4,5 1,4 2,5 1,10 3,3 2,2 4,10 6,2 4,7 3,1 6,9 5,1 5,8 1,4 4,2 6,2 4,10 6,3 4,1 3,6 4,7 1,6 1,3 4,9 6,1 3,4 2,8 4,10 3,9 1,6 1,1 2,3 3,10 2,5 5,8 5,5 1,8 3,3 6,3 2,7 3,6 3,6 2,1 3,9 2,3 4,7 6,7 1,2 1,9 5,5 5,8 4,4 6,6 5,9 1,4 1,10 5,5 5,6 2,9 4,3 6,2 6,5 4,10 6,4 3,3 2,10 3,3 2,3 5,1 6,1 2,6 3,6 6,10 3,2 5,4 1,3 2,9 5,10 2,7 1,5 1,6 6,4 2,1 2,5 1,1 5,2 6,8 6,1 4,1 5,1 3,3 1,5 1,9 1,9 6,10 2,6 5,7 2,10 6,9 3,7 5,9 5,6 3,10 4,8 4,2 3,7 3,2 1,10 1,10 2,7 4,7 4,7 4,5 5,6 2,8 4,5 1,5 3,1 3,8 4,5 2,7 2,4 5,6 6,10 3,7 6,5 4,3 4,6 5,4 6,4 6,6 2,2 4,7 2,7 2,8 3,7 1,2 3,10 4,6 2,6 5,7 4,4 3,4 3,4 1,4 6,10 2,7 2,5 2,4 6,6 3,9 1,7 6,9 4,7 6,5 2,1 3,9 1,5 5,10 1,5 6,4 5,2 1,3 6,1 5,2 5,4 5,1 6,1 3,7 5,5 2,7 3,6 5,2 1,7 3,2 1,4 6,1 5,7 1,8 4,8 1,8 6,5 2,7 5,10 4,4 4,3 4,2 4,1 6,9 3,7 4,10 5,7 1,10 1,10 2,3 3,8 2,9 4,9 3,1 6,9 3,9 1,1 2,4 5,1 4,1 5,10 4,5 6,3 2,9 4,1 1,8 2,9 4,4 6,3 5,3 4,4 3,9 3,3 6,10 2,4 6,10 4,4 3,9 4,1 6,6 2,10 1,4 5,4 1,9 1,3 4,1 6,4 5,5 3,10 3,8 6,1 5,1 1,10 4,4 5,1 5,5 3,8 6,10 2,4 1,3 1,4 6,10 2,1 5,9 3,5 2,10