MO2TOS: Multi-fidelity Optimization with Ordinal Transformation and Optimal Sampling

Jie Xu, Si Zhang, Edward Huang, Chun-Hung Chen
Department of Systems Engineering and Operations Research
George Mason University, Fairfax, VA 22030, USA

Loo Hay Lee
Department of Industrial and Systems Engineering
National University of Singapore, Kent Ridge 119260, Singapore

Nurcin Celik
Department of Industrial Engineering
The University of Miami, Coral Gables, FL 33146, USA

Abstract

Simulation optimization can be used to solve many complex optimization problems in automation applications such as job scheduling and inventory control. We propose a new framework to perform efficient simulation optimization when simulation models with different fidelity levels are available. The framework consists of two novel methodologies: ordinal transformation (OT) and optimal sampling (OS). The OT methodology uses the low-fidelity simulations to transform the original solution space into an ordinal space that encapsulates useful information from the low-fidelity model. The OS methodology efficiently uses high-fidelity simulations to sample the transformed space in search of the optimal solution. Through theoretical analysis and numerical experiments, we demonstrate the promising performance of the Multi-fidelity Optimization with Ordinal Transformation and Optimal Sampling (MO2TOS) framework.

1 Introduction

Scientists and engineers have long used analytical and computational models to approximately describe and predict the behaviors of physical processes and systems, and have used these models to guide the design of engineering systems. Optimization methods can often be applied to these models to help identify designs that are not only viable but potentially achieve the best performance possible. Prototypes are then built to evaluate the actual performance of the few most promising designs (according to the model) and to select the best one.

In the past decades, the explosive growth of computing technology has revolutionized these long-standing engineering design practices. While still an indispensable process in engineering design, physical prototyping is giving way to high-fidelity computer simulation models (in this paper, we use the term "simulation" to refer to any computational model that can be used to directly evaluate the objective value of a candidate design, possibly with bias and/or noise), which can quite accurately evaluate the actual performance of a candidate design using a small fraction of the time and resources that physical prototyping would require.

Because of the accuracy of high-fidelity simulation models and their flexibility in modeling and analyzing complex systems that are intractable to traditional analytical methods, it is desirable to perform optimization directly with these simulation models. Simulation-based optimization, or simply simulation optimization (Chen and Lee 2011, Lee et al. 2010, Xu et al. 2010, 2014), refers to a class of methods that search for optimal or near optimal solutions by using simulation models to directly evaluate the objective values of candidate solutions.

However, high-fidelity simulation models often have high computation cost and can be very time-consuming to run. As a result, when the time and computing resources available for optimization are limited, only a small number of candidate solutions can be evaluated via high-fidelity simulations in the simulation optimization process. Finding optimal or near-optimal solutions using high-fidelity simulations alone may not be practical in many applications (Celik 2010). In contrast, models of lower fidelity (e.g., analytical approximations, coarser-grained simulations) are much faster and can evaluate a large number of candidate solutions in a short amount of time.

As a result, current industrial practice uses low-fidelity models to screen a large number of candidate solutions and select a few "top" solutions for evaluation by high-fidelity models. Such a methodology has intuitive appeal, is easy to use, and can sometimes achieve quite successful outcomes. However, it is heuristic in nature and may not be robust because low-fidelity models can be fraught with significant bias and variability. This paper offers a systematic, efficient, and robust procedure to exploit the benefits of both high- and low-fidelity models in optimization. We propose a Multi-fidelity Optimization with Ordinal Transformation & Optimal Sampling (MO2TOS) framework and demonstrate its benefits through rigorous analysis and numerical experiments.

While low-fidelity models may suffer from significant biases and variabilities, they often provide useful information on candidate solutions, such as their relative quality. This is because low-fidelity models are often built on reasonable abstractions and simplifications of the underlying physical processes. The MO2TOS framework exploits this property and proposes an Ordinal Transformation (OT) approach to transform the original decision space into a new one-dimensional space where all solutions are positioned according to their ordinal ranks under the low-fidelity model. To see the benefit of transforming the solution space into a one-dimensional ordinal space, it is important to remember that the original solution space can be high-dimensional, have multiple locally optimal solutions spread far apart, and include a mix of integer-valued and categorical variables. After OT, the new solution space is one-dimensional, likely to be well-behaved and display some trend, and greatly simplifies further optimization using just a small number of high-fidelity simulations.

We then propose an optimal sampling (OS) method to search the transformed space for a (near) optimal solution. OS adaptively samples the transformed space taking into account the quality of the low-fidelity model and thus achieves both efficiency and robustness.

There has been related work on the optimization of complex systems with multi-fidelity simulations. The Multi-Fidelity Sequential Kriging Optimization (MFSKO) procedure constructs kriging models to approximate the difference in simulation output between models of consecutive fidelity levels (Huang et al. 2006). MFSKO then sequentially determines the next solution to simulate, and the level of fidelity for that simulation, with the objective of maximizing the expected improvement in the quality of the best solution found so far. The Value-based Global Optimization (VGO) procedure (Moore 2012) aims to maximize the value of a sampling decision and also uses a kriging model to predict the performance of a new solution. Compared to MFSKO, there are significant technical differences in how the kriging models are used to determine the next solution to simulate and the fidelity level to use. In March and Willcox (2012), the authors used a radial basis function (RBF) interpolation to create a surrogate model of the high-fidelity simulation model in the neighborhood of a trust region. The RBF surrogate model is then used to predict the high-fidelity simulation results for new solutions and to select the next solution for high-fidelity simulation.

All these earlier methods essentially create a surrogate model using an interpolation method (kriging or RBF) to correct the bias of the low-fidelity model and perform the optimization using the "corrected" low-fidelity model. While they have been shown to work reasonably well on a number of engineering design problems, the performance of these methods depends critically on the quality and applicability of the interpolation method. It is well known that interpolation methods such as kriging and RBF require a large number of design points to perform well when the underlying response surface is highly nonlinear and multi-modal, and/or the dimension of the solution space is high. Many complex system design problems also involve a mix of discrete and categorical decision variables, which presents additional challenges to these interpolation methods.

In comparison, MO2TOS is a very general and flexible framework with the following important advantages: 1) MO2TOS handles a mix of discrete and categorical decision variables in a high-dimensional solution space; 2) MO2TOS is a general framework and is not restricted to any specific interpolation technique such as kriging; and 3) MO2TOS has the potential to provide both efficiency and robustness through the new OT and OS methodologies, and can perform optimization effectively even when the low-fidelity model is not good.

The rest of the paper is organized as follows. In Section 2, we illustrate the principles and benefits of MO2TOS using a machine allocation example in the context of flexible manufacturing systems. In Section 3, we present a mathematical model and analysis to show the benefits of OT, propose an OS strategy, and present a simulation optimization algorithm under the general MO2TOS framework. Section 4 presents numerical results. We conclude the paper and point out future research directions in Section 5.

2 A MOTIVATIONAL CASE STUDY

In this paper, we assume that there is one high-fidelity model that generates accurate predictions of the system and represents the "ground truth". We also have a low-fidelity model whose results provide approximations to the high-fidelity model with unknown bias. We now use a flexible manufacturing system as an example to illustrate the basic principles and potential benefits of MO2TOS.

While this illustrative example is relatively simple, the ideas are equally applicable to more general problems. There are two types of products and five work stations. Each type of product needs to go through a sequence of processing steps. Some steps are processed by the same work station, so a job needs to re-enter some work stations multiple times. Each station has multiple machines. Inter-arrival and service times are all independent, identically distributed, and normally distributed. Figure 1 shows the flow of jobs through this manufacturing system. When more than one type of product is waiting for the same machine, product 1 has higher priority over product 2. Each machine can process serial batches of at most two jobs of the same type to save setup time. The re-entrant process flow and the non-exponential inter-arrival and service times make simulation necessary. The manufacturing system can have up to 37 machines, but the number of machines in each work station must be between 5 and 10. The optimization problem is to determine the number of machines in each work station such that the average cycle time (considering both types of products) in the system is minimized. In this optimization problem, there are five integer decision variables and a total of 780 feasible solutions.

Figure 1: A re-entrant flexible manufacturing system with two product types

We construct a discrete-event simulation model that fully captures the re-entrant flow, priority-based queueing, batch processing, and non-exponential inter-arrival and service times of the flexible manufacturing system. This discrete-event simulation model serves as the high-fidelity simulation model. One possible low-fidelity model can be obtained by treating the network as an open Jackson network. Specifically, we assume a single product type and exponentially distributed service and inter-arrival times. All machines are assumed to have infinite buffer capacity, never break down, and do not perform batch processing. We can then use closed-form M/M/c equations to calculate the average cycle time for each step. Evaluating closed-form equations is obviously much faster than running a discrete-event simulation model, but this approximation removes many important features of the system and thus inevitably suffers from significant biases.
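To make the low-fidelity evaluation concrete, the following minimal Python sketch scores a machine allocation with the standard M/M/c (Erlang C) formulas. The routing, arrival rate, and service rates in the example call are illustrative placeholders, not the paper's calibrated parameters.

```python
import math

def mmc_wait(lam, mu, c):
    """Expected queueing delay at an M/M/c station (Erlang C formula).

    lam: arrival rate; mu: per-machine service rate; c: number of machines.
    Returns infinity if the station is unstable (lam >= c * mu)."""
    rho = lam / (c * mu)
    if rho >= 1.0:
        return float("inf")
    a = lam / mu  # offered load
    tail = a**c / (math.factorial(c) * (1 - rho))
    p_wait = tail / (sum(a**j / math.factorial(j) for j in range(c)) + tail)
    return p_wait / (c * mu - lam)

def low_fidelity_cycle_time(machines, visits, lam, mus):
    """Jackson-network approximation of the average cycle time: each station
    is treated as an independent M/M/c queue, and each visit to a station
    contributes its waiting time plus its service time."""
    total = 0.0
    for c, v, mu in zip(machines, visits, mus):
        total += v * (mmc_wait(lam * v, mu, c) + 1.0 / mu)
    return total

# Score one allocation of machines to the five stations (placeholder rates).
print(low_fidelity_cycle_time(
    machines=[8, 7, 6, 9, 7], visits=[2, 1, 2, 1, 1],
    lam=1.0, mus=[0.5, 0.4, 0.6, 0.3, 0.4]))
```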

We run a large number of simulations to obtain very reliable estimates of the performance of each solution according to the high-fidelity model. In Figure 2, we plot the high-fidelity model results. Because this is a 5-dimensional problem, we cannot plot the results in the original 5-dimensional space. Instead, we index solutions based on their positions in the original space and then place them on the horizontal axis using these indices. This represents one possible way to partition the original solution space. Clearly, groups of solutions in this space, such as the five groups illustrated in Figure 2, would contain solutions of very different performance levels, and it is difficult to tell which group is better than the others. Therefore, a sampling strategy would have to keep sampling from all groups in search of the optimal solution.

We use the M/M/c queueing equations for Jackson networks (Gross et al. 2008) as the low-fidelity model to estimate the cycle time for all 780 alternative solutions. We describe the OT procedure below:

1. Evaluate and rank all feasible solutions using the low-fidelity model.

2. Transform all feasible solutions into a one-dimensional ordinal space where solutions are positioned according to their ranks determined in Step 1.
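In code, the OT step is just a sort by the low-fidelity value followed by an equal-quantile split into groups. A minimal sketch, assuming the low-fidelity model is available as a callable and the objective is minimized:

```python
def ordinal_transform(solutions, low_fidelity):
    """OT step: return all feasible solutions ordered by their low-fidelity
    rank, best (smallest) first, so position i holds the solution of rank i+1."""
    return sorted(solutions, key=low_fidelity)

def partition_into_groups(ordered, k):
    """Split the ordinally ordered solutions into k equal-sized groups of
    consecutive ranks (assumes len(ordered) is divisible by k)."""
    n = len(ordered) // k
    return [ordered[j * n:(j + 1) * n] for j in range(k)]
```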

We show in Figure 3 both the low-fidelity (the blue curve above) and the high-fidelity (the red curve below) simulation estimates of all 780 solutions after OT. The horizontal axis gives the rank of a solution as determined by the low-fidelity model; the left side represents solutions that are ranked better.

Figure 2: High-fidelity simulation results with solutions indexed by their positions in the original solution space

What is remarkable in Figure 3 is that despite the big bias in the low-fidelity results, the relative order among solutions is actually quite accurate. This is seen in the roughly monotonic trend of the high-fidelity result curve. In Figure 3, we also see that we can naturally partition solutions into three groups. Solutions within the left and the middle groups have quite similar performance, and thus these two groups have small group variances. While the right group shows substantial variability, this is less of a concern because the right group's distances to the left and middle groups are large, and we can safely concentrate sampling within the left and middle groups when searching for the optimal solution. Comparing the three groups in Figure 3 to the groups in Figure 2, it is clear that it is much easier to search for an optimal solution with a sampling approach after OT.

For this flexible manufacturing system example, the low-fidelity model has significant biases. However, the ordinal rankings of all solutions are largely correct. As a result, partitions based on these ordinal rankings in the transformed space after OT greatly facilitate subsequent sampling efforts in search of the optimal solution. However, in general, we would not know a priori whether a partition based on low-fidelity model results alone is good or not. Therefore, it is important to design an OS strategy that focuses on more promising groups while also sampling other groups to avoid being misled by the unknown bias in the low-fidelity model. We summarize the key observations from this illustrative example below:

Figure 3: Low-fidelity (blue curve above) and high-fidelity (red curve below) simulation results after OT

• OT allows us to partition solutions into groups with small group variances and large group distances, which is very difficult to achieve with any partitioning scheme in the original solution space;

• The small group variances and large group distances allow a search/sampling strategy to efficiently search for the optimal solution.

Notice that if the low-fidelity model were able to correctly rank all solutions, the red curve in Figure 3 would become the ordered performance curve, a concept well-known in the ordinal optimization literature (Ho et al. 2008). So here we see that the blue curve generated using the low-fidelity model provides a very good approximation of the ordered performance curve. We thus expect that OT will also enjoy the theoretical advantage established in ordinal optimization: determining ordinal performance is much less challenging than determining cardinal values of solutions. Our fundamental contribution is to show how this theoretical advantage can be exploited using a low-fidelity model, as demonstrated in the rest of the paper.

3 MO2TOS ALGORITHM AND ANALYSIS

In this section, we first describe MO2TOS in detail and propose a practical two-stage algorithm as a specific implementation of MO2TOS. We then use a mathematical model to analyze MO2TOS and demonstrate, in a rigorous manner, its benefits in terms of reduced group variances and enlarged group distances. Without loss of generality, we assume that the problem is a minimization problem; a maximization problem can be converted into a minimization problem by negating the objective value of a solution. We assume that all decision variables are integer-valued or binary. We further assume that we can enumerate all feasible decisions using the low-fidelity model. We will discuss the implications of these assumptions in Section 5.

3.1 The MO2TOS Algorithm

MO2TOS is comprised of two methodologies, OT and OS, that work together to achieve efficiency and robustness in multi-fidelity optimization. OT transforms the original solution space into a one-dimensional ordinal space based on the rankings of solutions under the low-fidelity model. We then partition solutions into groups based on their positions in the ordinal space. Compared to other ways of partitioning the original solution space, our approach can place solutions with similar performance into one group, even when they are far apart from each other in the original solution space. As a result, solutions within a group tend to have similar performance, and thus evaluating just a few solutions with high-fidelity simulations can help us quickly and quite accurately determine the quality of a group of solutions. In the rest of the paper, we measure the variability in quality within a group by the group variance, defined as the sample variance of the performance of all solutions in the group. We will show later that OT can lead to significant reductions in group variances. Furthermore, groups formed in the transformed space are also more separable, i.e., the differences in quality between solutions in different groups tend to be larger than under other partitioning schemes in the original solution space. We measure the quality of a group by the average performance of its solutions and refer to the difference between two groups' averages as the group distance in the rest of the paper.

After OT, we propose an OS approach that uses high-fidelity simulations to search the transformed one-dimensional space for optimal solutions. OS is critical to the actual performance of MO2TOS. This is because the bias in a low-fidelity model is unknown, and the positions of the solutions in the transformed space may not correctly reflect their true relative performance. As a result, a greedy sampling strategy (like the current industrial practice) that only focuses on solutions ranked high by the low-fidelity model may perform very badly when the low-fidelity simulation model is not very good. It is thus important to sample both broadly and efficiently, so that we can exploit the efficiency brought by the low-fidelity model and OT while safeguarding against bad low-fidelity models. In this paper, we propose an OS method that works with the groups of solutions formed after the OT step. As we shall demonstrate later, because OT forms groups that have smaller group variances and larger group distances, the OS approach can work more efficiently than in the original solution space without OT.

We emphasize that MO2TOS represents a general framework for simulation optimization with multiple models. In this paper, we present a specific implementation of the MO2TOS framework in Figure 4. We first evaluate all solutions using the low-fidelity model. We then partition solutions into equal-sized groups and allocate n0 high-fidelity evaluations to every group at the end of the OT step to obtain initial estimates of group variances and group distances, which provide input to the OS step. Because the initial estimates based on a small number of high-fidelity simulations can be unreliable, the process proceeds iteratively. Each iteration allocates a total of ∆ high-fidelity samples based on the OS allocation rule. The algorithm iterates until the entire computing budget has been used.

Figure 4: Flowchart of the MO2TOS algorithm
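The loop in Figure 4 can be written down compactly. The sketch below uses a hypothetical interface: high_fidelity is a deterministic evaluator, and allocate stands in for the OS rule of Section 3.3 (an OCBA-style version is sketched there); neither name comes from the paper.

```python
import random
import statistics

def mo2tos(solutions, low_fidelity, high_fidelity, k, n0, delta, budget, allocate):
    """Two-stage MO2TOS sketch: OT, then iterative OS allocation (n0 >= 2)."""
    # OT step: rank by the low-fidelity model and form k equal-sized groups.
    ordered = sorted(solutions, key=low_fidelity)
    n = len(ordered) // k
    unsampled = [ordered[j * n:(j + 1) * n] for j in range(k)]
    values = [[] for _ in range(k)]  # high-fidelity results per group

    def draw(j, count):
        # Sample without replacement within group j.
        for _ in range(min(count, len(unsampled[j]))):
            x = unsampled[j].pop(random.randrange(len(unsampled[j])))
            values[j].append(high_fidelity(x))

    for j in range(k):            # initial n0 evaluations per group
        draw(j, n0)
    spent = k * n0
    while spent < budget:         # OS iterations: delta new samples each
        means = [statistics.mean(v) for v in values]
        stds = [statistics.stdev(v) for v in values]
        counts = [len(v) for v in values]
        extra = allocate(means, stds, counts, delta)
        if sum(extra) == 0:
            break                 # nothing left worth sampling
        for j, nj in enumerate(extra):
            draw(j, nj)
        spent += sum(extra)
    return min(min(v) for v in values if v)   # best high-fidelity value found
```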

3.2 Analysis of Ordinal Transformation

Before we proceed with our mathematical model and analysis, we first introduce our notation:

• X: an alternative solution;

• m: the total number of feasible solutions;

• g(X)/f(X): the result of the low-fidelity/high-fidelity simulation model evaluated at solution X. In this paper, we restrict our attention to deterministic cases. For stochastic simulations, we assume that we always run enough replications to reduce simulation noise to negligible levels. We will study the allocation of simulation replications in stochastic simulation in our future work;

• δ(X): the bias of the low-fidelity simulation model at solution X.

We thus have the following equation:

f(X) = g(X) + δ(X) (1)

The quality of the low-fidelity model has a major impact on the performance of MO2TOS. We propose to measure the quality of a low-fidelity model by ρ, the correlation between g(X) and f(X). We make the following assumptions on g(X), f(X), and δ(X).

Assumption 1 If we randomly sample X from the solution space, f(X) is an i.i.d. realization of a random variable with finite variance c²; g(X) is an i.i.d. realization of a uniformly distributed random variable; and g(X) and δ(X) are independent.
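Under this additive model, with g(X) and δ(X) independent, ρ takes a simple form that the proofs below use repeatedly:

$$\rho = \frac{\mathrm{Cov}(f(X), g(X))}{\sqrt{\mathrm{Var}(f(X))\,\mathrm{Var}(g(X))}} = \frac{\mathrm{Var}(g(X))}{c\sqrt{\mathrm{Var}(g(X))}} = \frac{\sqrt{\mathrm{Var}(g(X))}}{c},$$

so that Var(g(X)) = ρ²c² and, by independence, Var(δ(X)) = (1 − ρ²)c².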

For simplicity, we consider equal group sizes in this paper and assume we form k groups, each containing n solutions, so the total number of solutions is m = kn. As a benchmark, we form equal-sized groups in the original solution space by random sampling. In comparison, with the OT procedure in the MO2TOS framework, we rank all solutions using the low-fidelity model and partition solutions into equal-sized groups based on their ranks. The solution deemed best by the low-fidelity model receives a rank of 1 and the worst receives a rank of m. We point out that this is an equal-quantile (with respect to g(X)) partitioning strategy.

We use i = 1, 2, . . . , m to index solutions. In the benchmark case, solution index i does not have any specific meaning and just represents the ith randomly sampled solution from the original solution space. In the OT case, i is the rank of the solution in the transformed ordinal space. We may also write X_OTi to make the meaning of i explicit in the rest of the paper. We use j = 1, 2, . . . , k to index groups, and group solutions with adjacent indexes together, so the solutions in group j are X_(j−1)n+1, X_(j−1)n+2, . . . , X_jn. We define the group variance of group j as

$$\sigma_j^2 = \mathrm{E}\left[\frac{1}{n-1}\sum_{i=(j-1)n+1}^{jn}\left(f(X_i)-\bar{f}_j\right)^2\right]. \tag{2}$$

When a group is formed by i.i.d. sampling of n solutions from the original space, its group variance is simply the variance of f(X), which is Var(f(X)) = c² according to Assumption 1. Under Assumption 1, we have Lemma 1 on the upper bound of the group variance defined in (2) after OT.

Lemma 1 Under equal-size grouping, an upper bound for the group variance σ²_OTj for all j = 1, 2, . . . , k after OT is given by

$$\sigma_{OT_j}^2 \le \sigma_{OT}^2 \triangleq \frac{3}{kn+2}c^2\rho^2 + \frac{6}{k^2}c^2\rho^2 + c^2\left(1-\rho^2\right). \tag{3}$$

Proof: After OT, recall that we use X_OTi to denote the ith best solution as judged by the low-fidelity model. Based on (2), we have the group variance after OT for group j as

$$\sigma_{OT_j}^2 = \frac{1}{n(n-1)}\Bigg[(n-1)\sum_{i=(j-1)n+1}^{jn}\mathrm{Var}\left(f(X_{OT_i})\right) - \sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\mathrm{Cov}\left(f(X_{OT_i}), f(X_{OT_t})\right)$$
$$+\,(n-1)\sum_{i=(j-1)n+1}^{jn}\mathrm{E}^2\left(f(X_{OT_i})\right) - \sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\mathrm{E}\left(f(X_{OT_i})\right)\mathrm{E}\left(f(X_{OT_t})\right)\Bigg]. \tag{4}$$

Because g(X) follows a uniform distribution, we can write g(X) = a + (b − a)U, where U ∼ U(0, 1) is a standard uniform random variable with lower bound 0 and upper bound 1. According to equation (1) and Assumption 1, we have Cov(f(X), g(X)) = Var(g(X)) = ρc√Var(g(X)). This implies that b − a = √12 ρc, and the variance of δ(X) is then given by (1 − ρ²)c².

After OT, for i = 1, 2, . . . , m, g(X_OTi) = a + √12 ρc U_OTi, where U_OTi is the ith order statistic of the standard uniform distribution U(0, 1). It is well known that U_OTi follows a Beta distribution, U_OTi ∼ Beta(i, nk + 1 − i) (David and Nagaraja 2003). Therefore, we have

$$\mathrm{E}\left(g(X_{OT_i})\right) = a + \sqrt{12}\,\rho c\,\frac{i}{nk+1},$$

$$\mathrm{Var}\left(g(X_{OT_i})\right) = 12\rho^2c^2\,\frac{i(nk+1-i)}{(nk+1)^2(nk+2)}.$$

We use d to denote the mean of the bias δ(X). We then have

$$\mathrm{E}\left(f(X_{OT_i})\right) = a + \sqrt{12}\,\rho c\,\frac{i}{nk+1} + d,$$

$$\mathrm{Var}\left(f(X_{OT_i})\right) = 12\rho^2c^2\,\frac{i(nk+1-i)}{(nk+1)^2(nk+2)} + \left(1-\rho^2\right)c^2,$$

and

$$\mathrm{Cov}\left(f(X_{OT_i}), f(X_{OT_t})\right) = \mathrm{Cov}\left(g(X_{OT_i}), g(X_{OT_t})\right) = 12\rho^2c^2\,\mathrm{Cov}\left(U_{OT_i}, U_{OT_t}\right), \quad \forall i \neq t,\ i, t = 1, 2, \ldots, m.$$

For notational convenience, we let

$$A = \frac{1}{n(n-1)}\left[(n-1)\sum_{i=(j-1)n+1}^{jn}\mathrm{Var}\left(f(X_{OT_i})\right) - \sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\mathrm{Cov}\left(f(X_{OT_i}), f(X_{OT_t})\right)\right],$$

$$B = \frac{1}{n(n-1)}\left[(n-1)\sum_{i=(j-1)n+1}^{jn}\mathrm{E}^2\left(f(X_{OT_i})\right) - \sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\mathrm{E}\left(f(X_{OT_i})\right)\mathrm{E}\left(f(X_{OT_t})\right)\right].$$

It is well known in order statistics (David and Nagaraja 2003) that the order statistics of the standard uniform distribution, U_OTi and U_OTt, are positively correlated. So we have

$$A < \frac{1}{n}\sum_{i=(j-1)n+1}^{jn}\mathrm{Var}\left(f(X_{OT_i})\right) = \frac{1}{n}\sum_{i=(j-1)n+1}^{jn}12\rho^2c^2\,\frac{i(nk+1-i)}{(nk+1)^2(nk+2)} + \left(1-\rho^2\right)c^2.$$

For any 1 ≤ i ≤ m, we know

$$\frac{i(nk+1-i)}{(nk+1)^2(nk+2)} < \frac{1}{4(nk+2)}.$$

So we have

$$A < \frac{3}{nk+2}\,\rho^2c^2 + \left(1-\rho^2\right)c^2. \tag{5}$$

For B, we have

$$B = \frac{1}{2n(n-1)}\sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\left[\mathrm{E}\left(f(X_{OT_i})\right) - \mathrm{E}\left(f(X_{OT_t})\right)\right]^2 = \frac{6\rho^2c^2}{n(n-1)}\sum_{i=(j-1)n+1}^{jn}\sum_{t\neq i}\left(\frac{i-t}{nk+1}\right)^2.$$

After some simple algebra, we have

$$B < 6\rho^2c^2\,\frac{(n-1)^2}{(nk+1)^2} < \frac{6\rho^2c^2}{k^2}. \tag{6}$$

Equation (3) immediately follows from equations (4), (5), and (6), and thus Lemma 1 holds. □

Note that the upper bound (3) applies to all groups j = 1, . . . , k. We are now ready to present Theorem 1, which quantifies the reduction of group variance via OT.

Theorem 1 Under equal-size grouping, OT can reduce the group variance of each group by at least

$$100\left[1-\left(\frac{3}{m+2}+\frac{6}{k^2}\right)\right]\rho^2\,\% \tag{7}$$

as compared with the group variance of random sampling from the original solution space when k ≥ 4.

Proof: We use σ²_RS to denote the group variance of randomly sampled solutions from the original solution space. Obviously we have σ²_RS = Var(f(X)) = c². Notice that when k ≥ 4,

$$\frac{3}{kn+2} + \frac{6}{k^2} < 1.$$

Therefore, from equation (3), we know σ²_OT < c² = σ²_RS when k ≥ 4. The percentage of group variance reduction after OT is then given by

$$\frac{\sigma_{RS}^2 - \sigma_{OT_j}^2}{\sigma_{RS}^2} \ge \frac{\sigma_{RS}^2 - \sigma_{OT}^2}{\sigma_{RS}^2} = \left[1-\left(\frac{3}{kn+2}+\frac{6}{k^2}\right)\right]\rho^2.$$

Finally, notice that kn is simply the total number of feasible solutions m, and we obtain the percentage of group variance reduction after OT as given in (7). □

We next compare the group distance δ_{j1,j2} between two neighboring groups before and after OT. We first formally define the group distance as the difference between the expected values of the group average performances:

$$\delta_{j_1,j_2} = \mathrm{E}\left[\frac{1}{n}\sum_{i=(j_1-1)n+1}^{j_1 n} f(X_i) - \frac{1}{n}\sum_{i=(j_2-1)n+1}^{j_2 n} f(X_i)\right]. \tag{8}$$

Before OT, if we randomly sample solutions to form groups, it is obvious that δ_{j1,j2} = 0. After OT, δ_{j1,j2} is no longer 0, and thus the absolute value of the group distance is increased by OT. We quantify this increase in the theorem below.

Theorem 2 After OT, the absolute value of the group distance between two neighboring groups increases to

$$\frac{\sqrt{12}\,mc|\rho|}{k(m+1)}, \tag{9}$$

where c is the standard deviation of f(X).

Proof: The analysis in the proof of Lemma 1 allows us to write the expected average performance of group j as

$$\mathrm{E}\left(\bar{f}_{OT_j}\right) = a + d + \frac{\sqrt{12}\,\rho c}{n}\sum_{i=(j-1)n+1}^{jn}\frac{i}{nk+1}. \tag{10}$$

Following equation (10), we immediately have, for any two groups j1 and j2,

$$\delta_{j_1,j_2} = \mathrm{E}\left(\bar{f}_{OT_{j_1}}\right) - \mathrm{E}\left(\bar{f}_{OT_{j_2}}\right) = (j_1 - j_2)\,\frac{n}{kn+1}\,\sqrt{12}\,\rho c. \tag{11}$$

With random sampling in the original solution space, δ_{j1,j2} = 0 for any two groups. Therefore, equation (9) immediately follows from (11) for two neighboring groups (i.e., |j1 − j2| = 1), using the fact that kn = m. □

We plot the percentage of group variance reduction and the increase in group distance after OT as a function of ρ for m = 25 and k = 5 in Figures 5 and 6. In Figure 6, we normalize the group distance by dividing (9) by c, so the vertical axis represents group distance as a percentage of c. Notice that the percentage of group variance reduction plotted in Figure 5 is only a lower bound. Theorems 1 and 2 explicitly show the benefit of OT. When ρ is not too small, OT makes the groups more separated with smaller group variances, which implies an easier search for the optimal solution.

Figure 5: Percentage of group variance reduction through OT

Figure 6: Normalized group distance between two neighboring groups after OT

3.3 Optimal Sampling

When low-fidelity simulations are available, a first thought is to use them to quickly screen feasible solutions and then use high-fidelity models to evaluate only a few "top" solutions, or the "best" group according to the low-fidelity model after OT. However, as discussed previously, such an approach may not work at all in practice because of the unknown and potentially significant bias in the low-fidelity model. Therefore, it is important to balance using high-fidelity models to intensively search groups of solutions that appear to be good according to the low-fidelity model against broadly exploring groups of solutions that appear less attractive.

Under the MO2TOS framework, the OT step allows the partitioning of solutions into groups that have reduced group variance and increased group distances. These two benefits make it possible to design an OS strategy that intelligently maintains such a balance between exploiting promising groups and exploring less known groups and improves the efficiency of the simulation optimization process.

We formulate the problem of designing an OS strategy as determining the allocation of the high-fidelity simulation budget T (i.e., the maximum number of high-fidelity simulations possible for the simulation optimization process) to the groups formed after OT, with the objective of maximizing the probability of correctly selecting the best group as measured by average group performance.

Formally, the CS-BG (correctly selecting the best group) problem is given below:

$$\max_{N_1, N_2, \ldots, N_k} P\{\text{CS-BG}\}$$
$$\text{s.t. } N_1 + N_2 + \cdots + N_k = T.$$

The above CS-BG formulation is similar to the Optimal Computing Budget Allocation (OCBA) problem in Chen et al. (2000), Yan et al. (2012), and Chen et al. (2014). In this paper, we assume that the distribution of f(X) within a group can be approximated by a normal distribution with an unknown but constant group mean and variance. This assumption allows us to derive Theorem 3, whose proof can be easily adapted from the original OCBA theorem in Chen et al. (2014); please refer to Chen et al. (2014) for details.

Theorem 3 Assume the f(X)'s are independent and normally distributed for all X in a group. Let b be the index of the group with the best group mean according to the samples collected thus far, and let N_j be the number of high-fidelity simulations allocated to group j, j = 1, 2, . . . , k. An approximation of the probability of correctly selecting the best group is asymptotically maximized when

$$\frac{N_l}{N_j} = \left(\frac{\delta_{b,j}/\sigma_j}{\delta_{b,l}/\sigma_l}\right)^2, \quad j \neq l \neq b, \tag{12}$$

$$N_b = \sigma_b\sqrt{\sum_{l=1,\,l\neq b}^{k}\frac{N_l^2}{\sigma_l^2}}. \tag{13}$$

From (12), we see that the larger the group distance between group l and the currently observed best group b, the smaller the number of high-fidelity evaluations allocated to group l. This is reasonable, as such a group is unlikely to contain better solutions. However, if the group variance σ²_l is larger, group l should receive more high-fidelity evaluations because there is more uncertainty about the performance of the solutions in this group. Once we compute N_j for j = 1, 2, . . . , k, we randomly sample N_j solutions without replacement from group j.

Equation (12) also illustrates the benefit of the OT step. Compared to using the same OS allocation rule prior to OT, OT typically reduces the group variances σ²_l and increases the group distances δ_b,l. As a result, more computing budget is spent on exploring the more promising groups rather than on reducing uncertainty within each group. It is therefore reasonable to expect that the OS strategy will work more efficiently and lead to even greater savings in computing budget under the MO2TOS framework than in the original OCBA formulation (Chen et al. 2000).
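A minimal sketch of this OS rule follows, assuming deterministic high-fidelity outputs summarized by group sample means and standard deviations. The rounding step at the end is one simple way to turn the asymptotic ratios of (12)-(13) into integer sample counts; it is not the paper's exact procedure.

```python
import math

def os_allocation(means, stds, counts, delta):
    """OCBA-style OS rule, a sketch of equations (12)-(13) for minimization.

    means/stds/counts summarize the high-fidelity samples in each group;
    returns the number of additional samples each group gets this iteration."""
    k = len(means)
    b = min(range(k), key=lambda j: means[j])   # observed best group
    eps = 1e-12                                 # guards against zero distances
    ratio = [0.0] * k
    for j in range(k):
        if j != b:
            # Equation (12): N_j proportional to (sigma_j / delta_{b,j})^2.
            ratio[j] = (stds[j] / (abs(means[j] - means[b]) + eps)) ** 2
    # Equation (13): N_b = sigma_b * sqrt(sum over l != b of N_l^2 / sigma_l^2).
    ratio[b] = stds[b] * math.sqrt(
        sum((ratio[j] / (stds[j] + eps)) ** 2 for j in range(k) if j != b))
    target = sum(counts) + delta                # cumulative budget after this step
    scale = target / (sum(ratio) + eps)
    # Each group receives the shortfall between its target and current count.
    return [max(0, round(r * scale - c)) for r, c in zip(ratio, counts)]
```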

4 NUMERICAL EXPERIMENTS

In this section, we first present in Section 4.1 an example that validates the lower bound on the group variance reduction given in Equation (7). We then focus on testing the efficiency of the MO2TOS algorithm. We compare MO2TOS to two other simulation optimization algorithms. For simplicity, all algorithms form equal-sized groups and sample solutions from these groups. MO2TOS forms groups in the transformed space, while the two other algorithms do so in the original solution space. The first algorithm allocates the high-fidelity simulation budget equally to all groups; we use the legend "Equal" in the figures below to refer to this algorithm, which is essentially a random search simulation optimization algorithm. The second algorithm uses OCBA as the optimal sampling method to allocate the simulation budget to groups, which is equivalent to using OS alone without OT; we use the legend "OCBA" in the figures below to refer to this algorithm. In the following experiments, the averages of 10,000 i.i.d. replications are plotted. In each replication, we follow the same steps but randomly sample solutions from a group for high-fidelity simulation using different random number streams. We set the number of groups to k = 10, which gives both good and robust performance in all test problems. We first present results on the machine allocation problem in Section 4.2 and then show the results on a multimodal test function with different low-fidelity functions in Section 4.3.

4.1 Validation of the Group Variance Reduction Bounds

In Theorem 1, we present a lower bound on the amount of group variance reduction that OT can achieve under the conditions given in Assumption 1. We conduct a simulation experiment to validate this bound. The model is given in Equation (1). We let g(X) follow a uniform distribution U(0, 1) and δ(X) follow a uniform distribution U(0, a) with a > 0, independent of g(X). Therefore, all conditions specified in Assumption 1 are met. Notice that the magnitude of a determines the quality of the low-fidelity model g(·): the larger the value of a, the lower the quality of g(·).

In the experiment, we consider m = 10,000 solutions. For solution Xi, i = 1, 2, . . . , m, we generate one realization from U(0, 1) and use it as g(Xi). We repeat this procedure to generate δ(Xi) and then obtain f(Xi) = g(Xi) + δ(Xi). We group all 10,000 solutions into 10 groups, each with 1,000 solutions. Before OT, the groups are based on the index i; after OT, the groups are based on the ranks determined by g(X). We first show the results when a = 0.5, which leads to a correlation ρ = 0.898 between g(X) and f(X). Figure 7 shows the standard deviation of group performance before and after OT. We also plot the group performance standard deviation computed from the lower bound on the amount of reduction in Equation (7), which forms an upper bound on the actual group performance variance. In addition to the significant reduction in group performance variance, we see that the bound is quite close to the actual values, suggesting that the bound is reasonably tight.
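This experiment is only a few lines of code; a minimal sketch (Python 3.10+ for statistics.correlation):

```python
import math
import random
import statistics

def validate_variance_bound(m=10_000, k=10, a=0.5, seed=1):
    """Monte Carlo check of the Theorem 1 bound under Assumption 1:
    g(X) ~ U(0,1), bias ~ U(0,a) independent of g(X), f = g + bias."""
    rng = random.Random(seed)
    g = [rng.random() for _ in range(m)]
    f = [gi + rng.uniform(0, a) for gi in g]
    n = m // k

    def group_stds(order):
        # Standard deviation of f within each block of n consecutive indices.
        return [statistics.stdev(f[i] for i in order[j*n:(j+1)*n]) for j in range(k)]

    before = group_stds(list(range(m)))                       # index-based grouping
    after = group_stds(sorted(range(m), key=lambda i: g[i]))  # OT (rank-based) grouping
    # Equation (3): sigma_OT^2 <= c^2 * [1 - (1 - 3/(m+2) - 6/k^2) * rho^2].
    c2 = statistics.variance(f)
    rho2 = statistics.correlation(g, f) ** 2
    bound = math.sqrt(c2 * (1 - (1 - 3 / (m + 2) - 6 / k**2) * rho2))
    print(f"max std before OT: {max(before):.4f}, "
          f"after OT: {max(after):.4f}, bound: {bound:.4f}")

validate_variance_bound()
```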

Figure 7: Group performance standard deviations before and after OT (a = 0.5)

We further vary the value of a to test the tightness of the bound when the quality of the low-fidelity model g(X) ranges from low (large values of a) to high (small values of a). In Figure 8, we see that our bound is reasonably tight over a large range of values of a. When the value of a is small, the bound is more conservative, but it is still within a reasonable range.

Figure 8: Group performance standard deviations for different values of a

4.2 Machine Allocation Problem Results

We now present results on the motivating example of the machine allocation problem described in Section 2. This problem requires running stochastic simulations when a high-fidelity evaluation of a solution is requested. Sampling within each group is done by random sampling without replacement. In all experiments, we only examine the performance of the three algorithms under comparison for a relatively small total computing budget. Therefore, we never exhaustively sampled a group with the high-fidelity model in any experiment. This represents the realistic situation where high-fidelity simulations are extremely time-consuming to run, and thus only a small fraction of solutions can be evaluated by high-fidelity simulations. The initial sample size of each group is set to n0 = 3. For the equal allocation and OCBA algorithms, we first partition solutions into equal-sized groups based on the positions of these solutions in the original 5-dimensional space. We then either equally allocate high-fidelity simulations to all groups or use OCBA to determine the allocation.

For equal allocation, how the original solution space is partitioned into groups has no effect on the performance of the algorithm, as it is a random search strategy. While different partitioning schemes would lead to different results for OCBA, we would not know a priori which partitioning would give the best result. We therefore use the indexing scheme of Section 2, which places the 5-dimensional solutions on the horizontal axis in Figure 2, as one way to partition the original solution space and apply OCBA. Finally, for MO2TOS, we form equal-sized groups based on the ranks of solutions after OT.

Figure 9 plots the performance of the best solution found by the three procedures as a function of the total computing budget. We partition the 780 solutions into 10 groups with 78 solutions in each group. We notice that OCBA achieves only marginally significant savings compared to equal allocation; the main reason is that all groups have similar means and large variances. In comparison, MO2TOS achieves significant savings in computing budget.

Figure 9: Results of the machine allocation optimization problem

4.3 A Multimodal Function with Different Low-Fidelity Models

We also test MO2TOS on another problem: maximization of the one-dimensional multimodal function

$$f(x) = \frac{\sin^6(0.09\pi x)}{2^{2((x-10)/80)^2}} + 0.1\cos(0.5\pi x) + 0.5\left(\frac{x-40}{60}\right)^2 + 0.4\sin\left(\frac{\pi(x+10)}{100}\right), \quad x \in \{0, 0.1, 0.2, \ldots, 99.9, 100\}. \tag{14}$$

The low-fidelity function is the first term of the high-fidelity function:

$$g_1(x) = \frac{\sin^6(0.09\pi x)}{2^{2((x-10)/80)^2}}, \quad x \in \{0, 0.1, 0.2, \ldots, 99.9, 100\}. \tag{15}$$

Note that the low-fidelity function is adapted from a test function in the simulation optimization literature (Xu et al. 2013). The maximum value of the high-fidelity function is 1.4277. We discretize the solution space with a grid of 0.01, which gives 10,000 feasible solutions, and form 10 groups, each with 1,000 solutions. We plot the high-fidelity and low-fidelity functions in Figure 10, which shows that while g1(x) is noticeably different from f(x), it describes the trend in f(x) reasonably well, which implies good ordinal rankings of solutions after OT.

Figure 10: The high-fidelity and low-fidelity functions of the multi-modal test problem
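For reference, a direct transcription of (14) and (15) in Python, using the 0.01 grid from the experimental setup; the printed maximum should come out close to the value of about 1.4277 reported above:

```python
import math

def f_high(x):
    """High-fidelity multimodal test function of equation (14)."""
    return (math.sin(0.09 * math.pi * x) ** 6 / 2 ** (2 * ((x - 10) / 80) ** 2)
            + 0.1 * math.cos(0.5 * math.pi * x)
            + 0.5 * ((x - 40) / 60) ** 2
            + 0.4 * math.sin(math.pi * (x + 10) / 100))

def g1_low(x):
    """Low-fidelity function of equation (15): the first term of f."""
    return math.sin(0.09 * math.pi * x) ** 6 / 2 ** (2 * ((x - 10) / 80) ** 2)

xs = [0.01 * i for i in range(10_000)]   # discretized solution space
print(max(f_high(x) for x in xs))
```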

We plot the results in Figure 11. The initial sample size of each group is set to n0 = 2. From Figure 11, we see that OCBA's performance was not much better than equal allocation when the total computing budget is small, and became worse than equal allocation when the total computing budget is large. This is because the high-fidelity function is highly nonlinear and produces high group variances and very small group distances when equal-sized groups are formed in the original solution space. Figure 12 plots the group average performance and group distances together with the group variances in the OCBA experiment. In the figure, the center of each bar is the group mean and the height of the bar is two times the group standard deviation. From the figure, it is clear that OCBA would need to spend many high-fidelity evaluations searching all over the solution space. In contrast, MO2TOS was able to achieve quite substantial savings in computing budget: running high-fidelity simulations on just 1% of all solutions, MO2TOS is able to find the optimal solution.

Figure 13 plots the group average (center of the bar) plus/minus two group standard deviations for the groups formed after OT. Comparing this plot to Figure 12, the benefit of OT is very clear in terms of reduced group variances and increased group distances, both of which contributed to the efficiency of MO2TOS as seen in Figure 11.

Figure 11: Results of the multi-modal test function with g1(x) as the low-fidelity function.

To test the sensitivity of the MO2TOS algorithm to the quality of the low-fidelity function, we also experimented with two other low-fidelity functions:

$$g_2(x) = \frac{\sin^6(0.09\pi(x-1.2))}{2^{2((x-10)/80)^2}}, \tag{16}$$

$$g_3(x) = \frac{\sin^6(0.09\pi(x-5))}{2^{2((x-10)/80)^2}}. \tag{17}$$

Figure 12: Group variance and average performance of the multi-modal test function for OCBA.

Figure 13: Group variance and average performance of the multi-modal test function for OT.

The low-fidelity function g2(x) shifts the locations of the locally optimal solutions. The function g3(x) shifts them even further and actually systematically identifies good solutions as poor ones. Recall from Section 3 that the correlation coefficient ρ has a major effect on the ability of OT to reduce group variance and increase group distance. These three low-fidelity functions lead to very different correlations: 0.95 for g1(x), 0.71 for g2(x), and −0.50 for g3(x). We plot the experiment results using g2(x) and g3(x) in Figures 14 and 15. From these two figures, we see that while lower-quality low-fidelity functions may adversely affect the performance of MO2TOS, the impact is not that significant, and MO2TOS still outperforms the other two algorithms significantly.
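These correlations are easy to recompute by extending the sketch above (reusing f_high, g1_low, and xs from it); the printed values should come out near 0.95, 0.71, and −0.50, up to the discretization used:

```python
import math
import statistics

def g2_low(x):
    """Shifted low-fidelity function of equation (16)."""
    return math.sin(0.09 * math.pi * (x - 1.2)) ** 6 / 2 ** (2 * ((x - 10) / 80) ** 2)

def g3_low(x):
    """Further-shifted low-fidelity function of equation (17)."""
    return math.sin(0.09 * math.pi * (x - 5)) ** 6 / 2 ** (2 * ((x - 10) / 80) ** 2)

fs = [f_high(x) for x in xs]   # f_high, g1_low, xs from the earlier sketch
for g in (g1_low, g2_low, g3_low):
    print(g.__name__, statistics.correlation(fs, [g(x) for x in xs]))
```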

Figure 14: Results of the multi-modal test function with g2(x) as the low-fidelity function.

Figure 15: Results of the multi-modal test function with g3(x) as the low-fidelity function.

5 Conclusion

We present a new framework for efficient multi-fidelity simulation optimization in this paper. The new MO2TOS framework provides a flexible, effective, and easy-to-implement approach to exploit simulation models of different fidelity levels. Instead of directly modeling the bias, as most existing approaches do, the proposed MO2TOS framework takes a fundamentally different approach and uses the lower-fidelity model to extract information on the relative order of feasible solutions. That ordinal information is then used to transform the original solution space, which may be high-dimensional, highly nonlinear, and multimodal, into a one-dimensional and much better-behaved solution space. We show through a rigorous theoretical analysis how groups formed after OT have reduced group variances and increased group distances. These features facilitate the adoption of an OS strategy and enhance its efficiency. Through numerical experiments on a multimodal function and a realistic machine allocation problem in a flexible manufacturing system, we demonstrate how MO2TOS can be applied in practice. We also expect that the benefit of MO2TOS may be even more pronounced in high-dimensional problems. In our future work, we will apply this new framework to realistic, high-dimensional semiconductor manufacturing problems.

The proposed MO2TOS framework is general and flexible and opens up a new research avenue. This paper points to many future research directions that may provide a solid theoretical and algorithmic foundation for successful MO2TOS-based optimization algorithms. We highlight the following key questions that need to be further explored:

1. How should we choose the group size and sample solutions within a group? Both decisions are problem dependent. When the low-fidelity model is quite good, MO2TOS would benefit from having more groups and "greedy" sampling of solutions based on their ranks. However, such a strategy can also perform quite badly when the low-fidelity model is very poor and incorrectly ranks many solutions. This is the focus of our ongoing research.

2. How should we more adaptively partition solutions into groups after OT instead of using an equal-size grouping scheme?

3. As we collect high-fidelity simulation results, would it be possible to update the low-fidelity model and iteratively apply OT and OS? One can analyze the benefit of such a method by examining any potential improvement in the rank correlation or alignment probability.

4. When there are multiple lower-fidelity models, how should we choose which model(s) to use, and how should we combine their predictions when performing OT? One way is to use the correlation between the high-fidelity model and a low-fidelity model to define the quality of the low-fidelity model. Because different low-fidelity models may provide the best approximation to the high-fidelity model in different regions of the search space, it is worthwhile to explore other, more informative ways to define the boundaries between models in the future.

5. When the lower-fidelity models' computing cost is not negligible and/or the decision space is too large to enumerate, how should we optimally allocate the computing budget among different models and solutions?

6. When stochastic simulation noise is present, how should we optimally allocate the computing budget among different models and solutions?

6 Acknowledgments

This work was supported in part by the National Science Foundation under Grants No. CMMI-1233376 and CMMI-1462787.

References

Celik, N. 2010. Integrated decision making for planning and control of distributed manufacturing enterprises using dynamic-data-driven adaptive multi-scale simulations (dddams). Ph.D. thesis, Dept. Systems and Industrial Engineering, University of Arizona, Tucson, AZ.

Chen, C.-H., L. H. Lee. 2011. Stochastic Simulation Optimization: An Optimal Computing Budget Allocation. World Scientific Publishing Co, Singapore.

Chen, C.-H., J. Lin, E. Yucesan, S. E. Chick. 2000. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Journal of Discrete Event Dynamic Systems: Theory and Applications 10 251–270.

Chen, W., S. Gao, C. H. Chen, L. Shi. 2014. An optimal sample allocation strategy for partition-based random search. IEEE Transactions on Automation Science and Engineering 11 177–186.

David, H. A., H. N. Nagaraja. 2003. Order Statistics. 3rd ed. Wiley.

Gross, D., J. F. Shortle, J. M. Thompson, C. M. Harris. 2008. Fundamentals of Queueing Theory. 4th ed. Wiley-Interscience.

Ho, Y.-C., Q.-C. Zhao, Q.-S. Jia. 2008. Ordinal optimization: Soft optimization for hard problems. Springer Science & Business Media.

Huang, D., T. T. Allen, W. I. Notz, R. A. Miller. 2006. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization 32 369–382.

Lee, L. H., C.-H. Chen, E. P. Chew, J. Li, N. A. Pujowidianto, S. Zhang. 2010. A review of optimal computing budget allocation algorithms for simulation optimization problem. International Journal of Operations Research 7 19–31.

March, A., K. Willcox. 2012. Provably convergent multifidelity optimization algorithm not requiring high-fidelity derivatives. AIAA Journal 50 1079–1089.

Moore, R. A. 2012. Value-based global optimization. Ph.D. thesis, Dept. Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA.

Xu, J., L. J. Hong, B. L. Nelson. 2010. Industrial strength compass: A comprehensive algorithm and software for optimization via simulation. ACM Transactions on Modeling and Computer Simulation 20 3:1–3:29.

Xu, J., E. Huang, C.-H. Chen, L. H. Lee. 2014. Simulation optimization: A review and exploration in the new era of cloud computing and big data. Under review for Asia-Pacific Journal of Operational Research.

Xu, J., B. L. Nelson, L. J. Hong. 2013. An adaptive hyperbox algorithm for discrete optimization via simulation. INFORMS Journal on Computing 25 133–146.

Yan, S., E. Zhou, C. H. Chen. 2012. Efficient selection of a set of good enough designs with complexity preference. IEEE Transactions on Automation Science and Engineering 9 596–606.
