Metamodeling for Large-Scale Optimization Tasks Based on Object Networks

Ludmilla Werbos, Robert Kozma, Rodrigo Silva-Lugo, Giovanni E. Pazienza, and Paul Werbos

Abstract—Optimization in large-scale networks – such as large logistical networks and electric power grids involving many thousands of variables – is a very challenging task. In this paper, we present the theoretical basis and the related experiments involving the development and use of visualization tools and improvements in existing best practices in managing optimization software, as preparation for the use of “metamodeling” – the insertion of complex neural networks or other universal nonlinear function approximators into key parts of these complicated and expensive computations; this novel approach has been developed by the new Center for Large-Scale Integrated Optimization and Networks (CLION) at the University of Memphis, TN.

Manuscript submitted February 10, 2011. This work has been supported [...] Memphis. [...] Kozma, R. Silva-Lugo and G.E. Pazienza are with CLION, The University of Memphis. G.E. Pazienza is also with the Pazmany University, Budapest ([email protected]). Paul Werbos is with CLION, The University of Memphis, and the National Science Foundation (NSF).

I. INTRODUCTION

THIS paper presents a combined method of solving and understanding scheduling and optimization tasks using both [...] and Neural Networks. The problem of transportation network configuration and fleet assignment in realistic cases is multidimensional (hundreds of thousands of variables and constraints) and computationally challenging. The best way to approach it at present is Mixed Integer Programming (MIP) [1]. This approach has led to great success stories, such as [2], [3], especially recently with the arrival of the Gurobi software [4] and faster multicore computers.

However, the more we succeed, the more we want. The scope of real-life problems is growing faster than software and hardware capabilities. A few years ago we would have been satisfied to be able to predict a typical day for manufacturing, distribution or logistics optimization purposes; now we want to see an image of a typical weekend or a week. That would give the planners more power to consider all the opportunities and their combinations on a larger scale, instead of limiting themselves to limited-duration chunks of time modeled independently.

Because the goal here is to minimize the amount of time required to get to a desired level of accuracy, the best approach will include some use of metamodeling – the use of fast universal approximators to approximate slower, more expensive models or computations. In this paper, we summarize the theory of inverse metamodeling and apply it to a practical problem, related to large-scale optimization tasks.

The paper is structured as follows: in Sec. II, we introduce the theoretical background of metamodeling; in Sec. III, we present a practical problem and the numerical simulations obtained by using the principles of metamodeling; in Sec. IV, we discuss the importance of using proper tools to represent the knowledge obtained from our simulations; in Sec. V, we draw the conclusions and discuss the future developments of our project.

II. PRINCIPLES OF METAMODELING

A. Forward Metamodeling

In the world of stochastic optimization and operations research, “metamodeling” usually means forward metamodeling. In forward metamodeling, a neural network or some other function approximator is trained to approximate the function we are trying to minimize. We can represent a neural network as a function, f(x,W), where x are the inputs to the network, and W are the weights or parameters of the network. In the case of forward metamodeling, f is just a scalar, so we may write the neural network as f(x,W). The task is to minimize the cost function C(u, α) subject to some constraints, where:
• C is the total cost
• α is the set of all the inputs, including those in the constraints
• u is the set of all the things we are trying to optimize, including the fleet assignments
There are three steps to forward metamodeling here:
1. Pick a sample {u_i} of possible fleet assignments u_i – either all based on the same input vector α, or with different α. Let us assume that we have m sample points; in other words, i = 1 to m.
2. Calculate C(u_i, α) for each of these assignments.
3. Train the neural network f(x,W) to approximate C over this sample; in other words, find the weights W which minimize the error between f and C, for our sample of m cases, where x(i) is set to u_i or to u_i||α_i. Another way to say this is that we train the neural network f over the set of sample pairs {<x(i), C(u_i, α_i)>}.
Forward metamodeling is especially useful for stochastic optimization problems where it is very expensive to calculate the function C, or where the only way to get its value is to do a physical experiment. But in the practical
logistics problems we have been looking at, it is generally very easy to calculate the cost C once u and α are known. Therefore, the usual forward metamodeling is not so useful here.

B. Inverse Metamodeling

The term “inverse metamodeling” is fairly new. The term itself is due to Russell Barton of Penn State [5], but it is possible to generalize the idea, to include the following kind of inverse metamodeling, described in [6]:
1. Pick a sample of possible α, {α_i}, with i = 1 to m.
2. Run the optimizer to calculate the optimal u, u*(α), for each α, subject to the constraints.
3. Train the neural network f(x,W) to approximate the function u*(α) over the training set {<α_i, u*(α_i)>}. In other words, pick W so as to minimize the error between f and u*, when α is used as the input to the neural network.
If the neural network were an exact approximator, and extremely fast, this would be extremely useful in saving computation time; however, it is not likely to be quite so good in the beginning. It will require a very complex neural network to get a decent approximation in this application. This kind of research is the priority area for CLION, which specializes in complex neural networks. If the neural network is designed to match the capabilities of the dedicated hardware implementation, as much as possible, it should be very fast in real time. Even if the approximation is not perfect, this may provide a fast warm start for the optimizer program. The value of warm starts varies a lot from problem to problem, but a warm start generally provides faster convergence than a random or randomized starting point.

C. Variations and Extensions

1) Gradient-assisted learning (GAL): With forward metamodeling, we can usually calculate ∇_u C for each sample point u_i, whenever we have a differentiable algorithm to calculate C itself, at a computational cost about the same as the computation of C itself. This comes from the use of the chain rule for ordered derivatives, see [7].
We can use Gradient Assisted Learning to train the neural network to match not only f and C, but ∇f and ∇C as well. This can be extended to inverse metamodeling as well. If the vector u has, say, 200 components, GAL causes a sample of N cases to be “worth” 200 times as much – as if 200×N examples had been collected. For inverse metamodeling of logistics tasks, it is necessary to get sensitivity information from the existing optimizer to make this possible, and to build up some new code.
2) Metaexploration: This is when we train a neural network to output a “good” new sample point, u_{m+1} or α_{m+1}, for whatever use. In traditional stochastic search optimization, without metamodeling, the choice of a new sample point to explore is central to the power of the methods. In very complex systems, where an inverse metamodel can only provide a warm start, selection of a new point to follow up is also very important. In fact, the key idea in [6] is that a layer of the cerebral cortex of the mammal brain performs metaexploration, and that this is what makes the mammal brain different from and more powerful than the reptile brain. We may or may not be ready to get into this kind of advanced capability, but it certainly is fundamental and important.
3) Robust optimization: Most of the US power grid is now managed by large Independent System Operators (ISOs) and Regional Transmission Organizations (RTOs), who use large optimization packages to make decisions ranging from commitment of generation a day ahead, to actual dispatch of generation 15 minutes ahead, to planning many years ahead. The optimization problems now solved by these systems [8] are not so different from those in logistics. When the utilities shift from day-ahead commitment to actual operations, they always have to make some (costly) adjustments, because of things which cannot be predicted ahead of time. Therefore, they can get better results by reformulating the optimization problem itself, by injecting noise. If one accounts for uncertainties in the vector u, the problem becomes more complicated in theory – but the actual surface of expected error becomes more convex, which would typically increase the value of warm starts and possibly allow better use of new methods.

III. OPTIMIZATION OF SCHEDULING FOR LARGE-SCALE LOGISTICS TASKS

A. Efficient Scheduling Process

Efficient scheduling is a crucial issue in many areas belonging to both defense and civilian domains, including disaster response, power distribution, transportation, manufacturing job scheduling, etc. [9-11]. Throughout the years, there have been countless efforts to tackle this problem; nevertheless, the appearance on the market of manycore platforms and advanced FPGAs offers unexplored possibilities, which require novel and more efficient approaches. The ultimate goal is to develop efficient nonlinear optimizers, like approximate dynamic programming [12], though we are far away from this goal: today's state-of-the-art optimization tools use mixed integer programming techniques.

In the present work, we used the Gurobi software to explore and better understand the problem of large-scale optimization tasks, which may include hundreds of thousands of variables. In particular, we focused on the decision support to optimize the assignment of resources for the domestic market operation of a large distribution company, where in the logistics operation there are about 200 delivery points served by a fleet. Our Efficient Scheduling Process (ESP) is composed of three main steps, each of them formulated as a Mixed Integer Programming (MIP) problem: 1) determine point to point flight and volume; 2) assign dispatch locations and volumes; 3) assign fleet vehicles to move containers between dispatch locations and volumes. The last one is by far the most computationally complex among the three steps, and for this reason our study refers only to the optimization of this particular part of the ESP.

The ESP must accomplish the following tasks: maximize the profit; assign available resources while minimizing costs and respecting operational constraints; find the optimal cost resource assignment to move the forecasted volume of products, adhering to operational constraints; globally reflow all forecasted volume to minimize costs according to capacity and operational constraints.

B. Numerical Experiments

The optimization software used in our simulations is the well-known Gurobi Optimizer. The Gurobi parameters changed through the simulations are: BarOrder, Crossover, CrossoverBasis, Cuts, and MIPFocus (a thorough description of their functionalities can be found in [13]). When considering all possible combinations of the values these five parameters can assume, we obtain 720 cases.

Before performing the simulations, we decomposed the original problem P into three sub-problems: P1, P2, and P3. P1 is the optimization problem for the first stage of the original problem P, solved independently of the next stage. P2 optimizes the second stage of the original problem P under the assumption that the results obtained at the first stage by solving P1 are fixed and treated just like the other constraints. P3 optimizes the first stage of the original problem while at the same time taking into account the processes that need to be optimized at the second stage of the original problem P. It is important to emphasize that P1 is less complex than P2 and P3: for this reason, we ran all 720 possible combinations of the parameters for the sub-problem P1, and then we selected the ‘best’ combinations of the five parameters to perform the simulations of P2 and P3. We used two different criteria to classify the results obtained for P1: first, the shortest time required to reach a 1% MIP gap (STO), and second, the least number of nodes explored during the branch and bound phase (LNE). The result of this process can be found in Table I for (STO) and in Table II for (LNE): we selected the best five sets for each of the two methods.

These five sets were then used for the optimization of P2 and P3. For the sake of conciseness, in this paper we only present the numerical results for P2, but the results for P3 are qualitatively similar. Also, we implemented two different versions, which are based on different forecasts for the future traffic. The results are shown in Table III and Table IV for (STO) and (LNE), respectively.

All our simulations were performed on a Dell PowerEdge server with 24 cores and 48 GB of RAM running a Red Hat Linux operating system.

We observe that the greatest improvement is achieved for version 2 with (STO), with an average time of around 9 hours, much better than the 24 hours usually required for this step. However, the quality of this result is not uniform across the other configurations: for example, for version 1 of (LNE) all simulations ran for 24 hours without reaching the desired MIP gap of 1%. It is also worth mentioning that the custom hardware and software configurations may have had an effect on the performance improvement [14].
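The count of 720 parameter combinations can be checked with a short enumeration. The sketch below is illustrative only: the integer value ranges are assumptions taken from the Gurobi parameter reference rather than from the paper, and `PARAMETER_VALUES` and `parameter_grid` are hypothetical names. No Gurobi call is made; the code only builds the grid.

```python
from itertools import product

# Assumed value ranges (check them against the parameter reference of your
# Gurobi version); the paper only states that the full grid has 720 cases.
PARAMETER_VALUES = {
    "BarOrder": [-1, 0, 1],            # auto, approx. minimum degree, nested dissection
    "Crossover": [-1, 0, 1, 2, 3, 4],  # auto, disabled, four crossover strategies
    "CrossoverBasis": [0, 1],          # quick, slow/stable
    "Cuts": [-1, 0, 1, 2, 3],          # auto, off, conservative, aggressive, very aggressive
    "MIPFocus": [0, 1, 2, 3],          # balanced, feasibility, optimality, best objective bound
}


def parameter_grid(values):
    """Return every combination of parameter values as a list of dicts."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


grid = parameter_grid(PARAMETER_VALUES)
print(len(grid))  # 3 * 6 * 2 * 5 * 4 combinations
```

Each dictionary in `grid` could then be handed to a solver wrapper, one run per entry.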

TABLE I
MIP PARAMETERS FOUND FOR PROBLEM P1 WHEN USING THE FASTER RUNS (STO) CRITERION

        BarOrder        Crossover   Cross. Basis   Cuts          MIPFocus
Set 1   A. M. Degree    Auto        Quick          Very aggr.    Feasibility
Set 2   Auto            Auto        Quick          Very aggr.    Feasibility
Set 3   Auto            Disabled    Quick          Very aggr.    Feasibility
Set 4   A. M. Degree    Disabled    Quick          Very aggr.    Feasibility
Set 5   N. dissection   Primal      Quick          Aggressive    Feasibility

TABLE II
MIP PARAMETERS FOUND FOR PROBLEM P1 WHEN USING THE LEAST NODES EXPLORED (LNE) CRITERION

        BarOrder        Crossover   Cross. Basis   Cuts           MIPFocus
Set 1   Auto            D/P-P       Slow/Stable    Auto           BOB
Set 2   Auto            D/P-P       Slow/Stable    Conservative   BOB
Set 3   A. M. Degree    D/P-P       Slow/Stable    Very aggr.     BOB
Set 4   A. M. Degree    D/P-P       Slow/Stable    Auto           BOB
Set 5   N. dissection   D/P-D       Slow/Stable    Very aggr.     BOB
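The selection behind Tables I and II amounts to ranking the P1 runs under each criterion and keeping the best few. The following is a minimal sketch of that ranking, not the authors' tooling: the `RunResult` record, the `top_sets` helper, and the four sample runs are hypothetical and serve only to show the mechanics.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical record of one P1 run for a given parameter set."""
    set_id: int
    hours_to_gap: float   # time to reach the 1% MIP gap; math.inf if never reached
    nodes_explored: int   # branch-and-bound nodes explored


def top_sets(results, criterion, k=5):
    """Rank runs by STO (shortest time) or LNE (least nodes) and keep the best k."""
    keys = {
        "STO": lambda run: run.hours_to_gap,
        "LNE": lambda run: run.nodes_explored,
    }
    return sorted(results, key=keys[criterion])[:k]


# Made-up runs, for illustration only.
runs = [
    RunResult(1, 2.5, 900),
    RunResult(2, 1.0, 1500),
    RunResult(3, math.inf, 300),
    RunResult(4, 1.8, 700),
]
fastest = [run.set_id for run in top_sets(runs, "STO", k=2)]
fewest_nodes = [run.set_id for run in top_sets(runs, "LNE", k=2)]
print(fastest, fewest_nodes)
```

The two criteria can disagree, as in this toy data, where the run with the fewest nodes never reaches the target gap. Tables I and II likewise select different parameter sets.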

TABLE III
PROBLEM P2 – SIMULATION TIME AND PERFORMANCE WHEN USING THE FASTER RUNS (STO) CRITERION

            Version 1                 Version 2
            Time (h)   MIP Gap (%)    Time (h)   MIP Gap (%)
Set 1       24.0       1.16           5.07       0.99
Set 2       10.3       0.95           5.30       0.99
Set 3       10.6       0.95           7.91       0.98
Set 4       24.0       1.16           7.96       0.98
Set 5       24.0       1.01           18.4       1.00
Avg.        18.58      1.05           8.93       0.99
Std. Dev.   7.42       0.11           5.47       0.01

TABLE IV
PROBLEM P2 – SIMULATION TIME AND PERFORMANCE WHEN USING THE LEAST NODES EXPLORED (LNE) CRITERION

            Version 1                 Version 2
            Time (h)   MIP Gap (%)    Time (h)   MIP Gap (%)
Set 1       24.0       1.05           12.92      0.99
Set 2       24.0       1.05           24.00      1.66
Set 3       24.0       1.14           13.20      0.99
Set 4       24.0       1.14           24.00      1.66
Set 5       24.0       1.90           10.78      1.00
Avg.        24.0       1.26           16.98      1.26
Std. Dev.   0          0.36           6.48       0.37
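The summary rows for the time columns of Tables III and IV can be reproduced from the per-set values. The sketch below uses the sample (n − 1) standard deviation, which matches the printed figures; the names `TIMES_H` and `summarize` are illustrative and not part of the original study.

```python
from statistics import mean, stdev

# Solution times in hours, copied from Tables III (STO) and IV (LNE).
TIMES_H = {
    ("STO", "Version 1"): [24.0, 10.3, 10.6, 24.0, 24.0],
    ("STO", "Version 2"): [5.07, 5.30, 7.91, 7.96, 18.4],
    ("LNE", "Version 1"): [24.0, 24.0, 24.0, 24.0, 24.0],
    ("LNE", "Version 2"): [12.92, 24.00, 13.20, 24.00, 10.78],
}


def summarize(values):
    """Return (average, sample standard deviation), rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


summary = {key: summarize(values) for key, values in TIMES_H.items()}
for (criterion, version), (avg, sd) in summary.items():
    print(f"{criterion} {version}: avg = {avg:.2f} h, std. dev. = {sd:.2f} h")
```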

The results of numerical experiments provide a training set for use in metamodeling, with the various models described in Section II. Our primary intention is to use Recurrent Object Networks trained with Extended Kalman Filtering (EKF) [15] or with more advanced methods under development.
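The inverse metamodeling loop of Section II (sample α, run the optimizer, fit an approximator to u*(α), reuse its prediction as a warm start) can be illustrated on a toy problem. The sketch below makes strong simplifying assumptions: the cost is a one-dimensional quadratic, the "optimizer" is plain gradient descent, and an ordinary least-squares line stands in for the neural network. None of these choices come from the paper, and all function names are hypothetical.

```python
def cost_gradient(u, alpha):
    """Gradient in u of the toy cost C(u, alpha) = (u - (2*alpha + 1))**2."""
    return 2.0 * (u - (2.0 * alpha + 1.0))


def slow_optimizer(alpha, u_start=0.0, step=0.1, tol=1e-6, max_iter=10_000):
    """Stand-in for the expensive optimizer; returns (u, iterations used)."""
    u = u_start
    for iteration in range(max_iter):
        grad = cost_gradient(u, alpha)
        if abs(grad) < tol:
            return u, iteration
        u -= step * grad
    return u, max_iter


def fit_linear_metamodel(alphas, u_stars):
    """Least-squares line u* ~ slope*alpha + intercept (stand-in for a neural network)."""
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_u = sum(u_stars) / n
    sxx = sum((a - mean_a) ** 2 for a in alphas)
    sxy = sum((a - mean_a) * (u - mean_u) for a, u in zip(alphas, u_stars))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_a
    return lambda alpha: slope * alpha + intercept


# Step 1: pick a sample of inputs alpha_i.
alphas = [0.0, 0.5, 1.0, 1.5, 2.0]
# Step 2: run the optimizer to obtain u*(alpha_i) for each sample.
u_stars = [slow_optimizer(a)[0] for a in alphas]
# Step 3: train the approximator on the pairs <alpha_i, u*(alpha_i)>.
metamodel = fit_linear_metamodel(alphas, u_stars)

# Use the metamodel prediction as a warm start for an unseen input.
new_alpha = 3.0
_, cold_iterations = slow_optimizer(new_alpha)
u_warm, warm_iterations = slow_optimizer(new_alpha, u_start=metamodel(new_alpha))
print(cold_iterations, warm_iterations)
```

In this toy setting the warm-started run needs far fewer iterations than the cold start. For a real MIP the benefit depends on the problem, as noted in Section II.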

IV. KNOWLEDGE REPRESENTATION AND VISUALIZATION

When dealing with very complex problems, it is crucial to have an effective way to represent the knowledge and hence visualize information that would otherwise be very hard to retrieve from the raw data. Proper visualization helps in formulating the optimization problems and in assigning initial values and weights in a sequence of optimizers.

For our experiments with data visualization, we used a network composed of 141 sites located in the 48 contiguous US states, chosen on the basis of their importance in the cargo market [16]. We did not consider the sites with very low traffic among the 200 sites used in the experiments, in order to avoid clutter in the figures. We assigned to each of them a realistic volume of traffic, which we calculated from the available data of the last few years. Finally, we created specific software, programmed with [...], which has been used to generate the figures presented in this section. In the following, we will always refer to ‘volume’ as the number of pieces received/delivered, but we would obtain similar results if we used other reference parameters, such as the weight of the freight or the number of vehicles employed. The incoming and the outgoing traffic for the 141 sites are shown in Figs. 1 and 2, respectively.

We can easily identify that a few key sites – e.g., Memphis, Dallas, Chicago, New York, Los Angeles, Atlanta – generate most of the volume. This is in accordance with the model of the freight traffic, in particular in air transport, as a scale-free network [17] in which a few nodes, called hubs, are connected to most of the remaining nodes and contribute a large share of the traffic [18-20].

Fig. 1. US sites classified according to the volume of the incoming freight traffic; the size of the circle is proportional to the number of pieces delivered in our model.

Fig. 2. US sites classified according to the volume of the outgoing freight traffic; the size of the circle is proportional to the number of pieces received in our model.

Fig. 4. Connections among the 22 busiest US sites in our model: the width of the line is proportional to the traffic between the two sites.

Indeed, this phenomenon is represented even better in Fig. 3, in which we compare the cumulative percentage of the freight traffic (i.e., the sum of the traffic generated in the n busiest sites, ordered by total traffic, incoming plus outgoing) with a linear model, in which each site contributes 1/141 of the total volume. Clearly, our model is highly nonlinear: the 7 busiest sites (5% of 141) account for 25% of the total traffic, the 22 busiest sites (16% of 141) account for 50% of the total traffic, and the 48 busiest sites (34% of the total) account for 75% of the total traffic. At n = 48 we also find the maximum discrepancy between the two models.
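The cumulative-share curve and its comparison with the linear model can be computed in a few lines. The sketch below is only an illustration of that calculation: the six site volumes are made up (the real per-site data are not given in the paper), and the function names are hypothetical.

```python
from itertools import accumulate


def cumulative_shares(volumes):
    """Cumulative fraction of total traffic carried by the n busiest sites, n = 1..N."""
    ordered = sorted(volumes, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def sites_needed(shares, fraction):
    """Smallest n such that the n busiest sites carry at least `fraction` of the traffic."""
    for n, share in enumerate(shares, start=1):
        if share >= fraction:
            return n
    return len(shares)


def max_gap_vs_linear(shares):
    """Return (n, gap) where the curve exceeds the linear model n/N by the most."""
    total_sites = len(shares)
    gaps = [share - n / total_sites for n, share in enumerate(shares, start=1)]
    best = max(range(total_sites), key=gaps.__getitem__)
    return best + 1, gaps[best]


# Made-up volumes for six sites, for illustration only.
volumes = [5, 50, 3, 30, 10, 2]
shares = cumulative_shares(volumes)
n_for_75_percent = sites_needed(shares, 0.75)
n_max_gap, max_gap = max_gap_vs_linear(shares)
print(n_for_75_percent, n_max_gap, round(max_gap, 3))
```

Applied to the 141 site volumes, the same functions would give the counts quoted in the text (7, 22 and 48 sites) and the position of the maximum discrepancy.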

Fig. 3. Comparison between the cumulative percentage of the freight traffic (in blue) and a linear model (in green): it is possible to observe that the 7 busiest sites account for 25% of the total traffic, the 22 busiest sites account for 50% of the total traffic, and the 48 busiest sites account for 75% of the total traffic.

These considerations suggest that we can optimize the transportation among just a few sites and still obtain a dramatic impact on the final result. In particular, in Figs. 4 and 5, we show the network generated by the 22 and the 7 busiest sites, respectively: we can observe that the network is now simple enough to notice the finest details of both the site locations and the connections among them, while still representing 50% of the traffic in the first case and 25% of the traffic in the second one.

Fig. 5. Connections among the 7 busiest US sites in our model: the width of the line is proportional to the traffic between the two sites.

We can then conclude that an effective visualization method is a fundamental tool for representing the knowledge coming from the data, which, especially in this kind of problem, can be particularly cumbersome to handle.

V. CONCLUSION AND FURTHER DEVELOPMENT

By selecting the optimal set of control parameters for Gurobi 3.0 out of 720 parameter sets, we were able to determine a set of parameters that significantly increased the performance of the system. Important progress was made in the understanding of the details of the data used for the generation of the MIP models by the distribution company. Future directions of research will include expanding the problem both vertically (up to several days) and horizontally (combining separate solution stages into an integrated model). However, the complexity of such an integrated model might approach the limit of traditional Mixed Integer Programming methods, calling for alternative methods of solution, such as approximate dynamic programming [21], [22], which will provide the further increase in performance necessary for the daily operation of the distribution company.

It is not possible for the neural network to come up with a better solution than the existing optimizer, if the optimization problem is formulated exactly as it is now and the existing optimizer is allowed to run to perfect completion. However, if the existing optimizer is only allowed to run for some time T, the combination of a good warm start and a new run of the existing optimizer for the full time T should lead to a better solution. This new combination could be used to create a new training set for the inverse metamodeling, and hence a better metamodel, iteratively improving the quality of the solution. It seems likely that some neural networks within the brain also serve as metamodels, to approximate and anticipate more complex calculations by other neural networks; for example, simple, fast feedforward networks may be trained to anticipate and then initialize more complicated recurrent networks, or value function networks may be viewed as approximators of a future value calculation.

It is also worth noting the important role played by visualization tools that are able to properly represent the information of our model. In the near future, we plan to improve such tools and make them publicly available.

REFERENCES

[1] G. B. Dantzig and M. N. Thapa, Linear Programming 2: Theory and Extensions. Springer-Verlag, 2003.
[2] C. Barnhart and R. R. Schneur, “Air Network Design for Express Shipment Service,” Operations Research, vol. 44, no. 6, pp. 852-863, 1996.
[3] S.-C. Ting and G.-H. Tzeng, “An optimal containership slot allocation for liner shipping revenue management,” Maritime Policy & Management, vol. 31, pp. 199-211, 2004.
[4] http://www.prnewswire.com/news-releases/gurobi-announces-the-release-of-gurobi-optimizer-40-software-106821458.html
[5] R. Barton and M. Meckesheimer, “Metamodel-based simulation optimization,” Chapter 18 in Handbooks in Operations Research and Management Science: Simulation, S. G. Henderson and B. L. Nelson, eds., New York: Elsevier Science, 2006.
[6] P. J. Werbos, “Brain-Like Stochastic Search: A Research Challenge and Funding Opportunity,” http://arxiv.org/abs/1006.0385, submitted on 1 Jun 2010.
[7] P. J. Werbos, “Backwards Differentiation in AD and Neural Nets: Past Links and New Opportunities,” in M. Bucker, G. Corliss, P. Hovland, U. Naumann, and B. Norris (eds), Automatic Differentiation: Applications, Theory and Implementations, Springer (LNCS), New York, 2005.
[8] http://www.ferc.gov/industries/electric/indus-act/market-planning.asp
[9] A.S. Ko, N.B. Chang, Journal of Environmental Management 88(1), 11 (2008).
[10] L. Contesse, J. Ferrer, S. Maturana, Annals of Operations Research 139, 39 (2005).
[11] S. Kesen, S. Das, Z. Gungor, The International Journal of Advanced Manufacturing Technology 47, 665 (2010).
[12] L. Werbos, P. Werbos, “Self-organization in CNN-Based Object Nets,” 2010 12th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA), 2010.
[13] Gurobi parameters: http://gurobi.com/doc/40/refman/node572.html, retrieved on February 1, 2011.
[14] R. Silva-Lugo, R. Kozma, L. Werbos, “Optimization of Scheduling for Large-Scale Logistics Tasks,” 5th Multidisciplinary International Scheduling Conference 2011 (submitted).
[15] R. Ilin, R. Kozma, P.J. Werbos, “Beyond backpropagation and feedforward models: A practical training tool for more efficient universal approximator,” IEEE Trans. Neur. Netw. 19(3), pp. 929-937, 2008.
[16] 2009 Annual World Airport Traffic Report (WATR), Airport Council International, 2009.
[17] A.L. Barabasi and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, 1999.
[18] S. Conway, “Scale-free networks and commercial air carrier transportation in the United States,” in 24th Congress of the International Council of the Aeronautical Sciences, 2004.
[19] W. Li and X. Cai, “Statistical analysis of airport network of China,” Physical Review E, vol. 69, no. 4, 2004.
[20] D. DeLaurentis, E. Han, and T. Kotegawa, “Network-theoretic approach for analyzing connectivity in air transportation networks,” Journal of Aircraft, vol. 45, no. 4, pp. 1669-1679, 2004.
[21] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons, 2007.
[22] H.P. Simao, J. Day, A.P. George, T. Gifford, J. Nienow, W.B. Powell, Transportation Science 43, 178 (2009).