The PLS Regression Model: Algorithms and Application to Chemometric Data


Università degli Studi di Udine
Dipartimento di Matematica e Informatica
Ph.D. Course in Computer Science, Cycle XXV

Ph.D. Thesis

Candidate: Stefania Del Zotto
Supervisor: Prof. Vito Roberto

Academic Year 2012-2013

Author's e-mail: [email protected]
Author's address: Dipartimento di Matematica e Informatica, Università degli Studi di Udine, Via delle Scienze, 206, 33100 Udine, Italia

To all the kinds of learners.

Abstract

The core topic of this Ph.D. thesis is Partial Least Squares (PLS), a class of techniques for modelling relations between sets of observed quantities through latent variables. Thanks to this device, PLS can extract useful information even from very large datasets, and it keeps the computational complexity of such tall and fat datasets manageable. Indeed, the main strength of PLS is that it performs accurately even with more variables than instances and in the presence of collinearity.

The aim of the thesis is to give an incremental and complete description of PLS, together with its tasks, advantages and drawbacks, starting from an overview of the discipline to which PLS belongs, moving through a critical comparison with alternative techniques, and ending with the application of PLS to a real dataset. For this reason, after a brief introduction to Machine Learning and a corresponding general working procedure, the first chapters explain PLS theory and present competing methods. Then, PLS regression is computed on a measured dataset and concrete results are evaluated. Conclusions are drawn with respect to both theoretical and practical topics, and future perspectives are highlighted.

The first part starts by describing Machine Learning, its topics, approaches and techniques. Then, a general working procedure for dealing with such problems is presented with its three main phases.

Secondly, a systematic overview of the PLS methods is given, together with their mathematical assumptions and algorithmic features. In its general form, PLS, developed by Wold and co-workers, defines new components by maximizing the covariance between two different blocks of original variables. The measured data are then projected onto the more compact latent space in which the model is built. The classical computation of PLS is based on the Nonlinear Iterative PArtial Least Squares (NIPALS) algorithm, but different variants have been proposed. The PLS tasks include regression and classification, as well as dimension reduction. Nonlinear PLS can also be specified. Since PLS works successfully with large numbers of strongly correlated variables and with noticeable computational simplicity, it is a powerful and versatile tool that can be used in many areas of research and industrial applications.

Thirdly, PLS is compared with competing choices. As far as regression is concerned, PLS is connected with Ordinary Least Squares (OLS), Ridge Regression (RR) and Canonical Correlation Analysis (CCA). PLS can also be related to Principal Component Analysis and Regression (PCA and PCR). Moreover, Canonical Ridge Analysis is a single approach that includes all the previous methods as special cases. In addition, all these techniques can be cast under a unifying approach called continuum regression. Some solutions, such as OLS, are the most common estimation procedures and have desirable features, but at the same time they require strict constraints to be applied successfully. Since measured data from real applications do not usually comply with the needed conditions, alternatives should be used. PLS, RR, CCA and PCR try to solve this problem, but the performance of PLS surpasses that of the others. As regards classification and nonlinearity, PLS can be linked with Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM).

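The NIPALS-style computation mentioned above can be illustrated with a minimal sketch for the single-response case (often called PLS1). This is not the implementation used in the thesis: the function names (`centre`, `pls1_nipals`, `pls1_predict`, `make_collinear_data`) are hypothetical, the data are synthetic, and plain Python lists are used only to keep the example dependency-free. The synthetic block has more variables (20) than instances (8) and is exactly collinear, since every variable is a mix of two hidden factors.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def centre(X, y):
    """Subtract the column means from X (list of rows) and the mean from y."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]
    yc = [v - y_mean for v in y]
    return Xc, yc, x_mean, y_mean


def pls1_nipals(Xc, yc, n_components):
    """Fit PLS1 on centred data; returns weights W, X loadings P, y loadings q."""
    X = [row[:] for row in Xc]  # copies: both blocks are deflated below
    y = yc[:]
    W, P, q = [], [], []
    for _ in range(n_components):
        # Weight vector: unit direction in X-space with maximal covariance with y.
        w = [dot(col, y) for col in zip(*X)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left in X that covaries with y
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in X]             # scores of the latent variable
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*X)]  # X loadings
        q_a = dot(y, t) / tt                       # y loading
        # Deflation: remove the part explained by this component.
        X = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(X, t)]
        y = [v - q_a * ti for v, ti in zip(y, t)]
        W.append(w)
        P.append(p)
        q.append(q_a)
    return W, P, q


def pls1_predict(x_centred, W, P, q):
    """Predict the centred response for one centred sample."""
    x = list(x_centred)
    y_hat = 0.0
    for w, p, q_a in zip(W, P, q):
        t = dot(x, w)                              # score of this sample
        y_hat += q_a * t
        x = [v - t * pj for v, pj in zip(x, p)]    # same deflation as in fitting
    return y_hat


def make_collinear_data(n_samples=8, n_vars=20, seed=0):
    """Synthetic data: every variable is a linear mix of two hidden factors."""
    rng = random.Random(seed)
    T = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_samples)]
    L = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(2)]
    X = [[t[0] * L[0][j] + t[1] * L[1][j] for j in range(n_vars)] for t in T]
    y = [3.0 * t[0] - 2.0 * t[1] for t in T]
    return X, y


X, y = make_collinear_data()
Xc, yc, x_mean, y_mean = centre(X, y)
W, P, q = pls1_nipals(Xc, yc, n_components=2)
fitted = [y_mean + pls1_predict(row, W, P, q) for row in Xc]
max_error = max(abs(f - v) for f, v in zip(fitted, y))
print(f"components: {len(W)}, largest fitting error: {max_error:.2e}")
```

In this noise-free setting two latent variables should reproduce the response up to rounding error, even though an OLS fit on the 20 raw variables would not have a unique solution with only 8 instances. With real, noisy spectra the number of components would instead have to be chosen by validation.
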
LDA, like PLS, is a supervised classification and dimensionality reduction technique, but LDA has limited representation ability and relies on assumptions that real data usually do not fulfill, which causes the computation to fail. SVM has some interesting properties that make it a competitor of PLS, but its advantages and drawbacks should be carefully analyzed. SVM also makes it possible to extend the PLS approach to include nonlinearity.

In the second part, PLS is applied to a chemometric problem. The aim is to define a model able to predict the fumonisin content of maize from Near InfraRed (NIR) spectra. The dataset collects NIR spectra of maize samples together with their content of fumonisins (a group of mycotoxins). The NIR wavelengths are almost a thousand strongly collinear variables; the observations number in the hundreds. A model is computed using PLS with full cross-validation, after identifying the outliers among both the instances and the wavelengths. The results are promising as regards the correlation between measured and fitted values in both calibration and validation. In addition, the relation between the measured quantities is represented by a simpler model with about twenty latent variables and a decreasing residual variance. Moreover, the model provides screening ability with respect to the most important European legal limits, fixed as tolerable daily intake for fumonisin consumption from maize-based food. Furthermore, it can also correctly classify other groups of contaminated maize.

In conclusion, maize and maize-based feed and food are a common dietary staple for both cattle and people. However, fungal infection of maize is commonly documented throughout the world. Moreover, fungi may produce mycotoxins, and consumption of contaminated maize causes serious diseases. As a consequence, procedures to measure the fumonisin content of maize batches as early as possible are highly needed in order to safely separate dangerous maize from healthy maize.

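The screening use described above, where predicted fumonisin content is compared with a limit in order to sort batches, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function names `pearson_r` and `screening_table` are hypothetical, the six measured and predicted values are invented, and the threshold `LIMIT` is a placeholder rather than a limit taken from the thesis.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def screening_table(measured, predicted, limit):
    """Count batches by (truly above the limit?, predicted above the limit?)."""
    counts = {"true_pos": 0, "false_neg": 0, "false_pos": 0, "true_neg": 0}
    for m, p in zip(measured, predicted):
        if m > limit:
            counts["true_pos" if p > limit else "false_neg"] += 1
        else:
            counts["false_pos" if p > limit else "true_neg"] += 1
    return counts


# Invented contents in micrograms per kilogram, for illustration only.
measured = [500.0, 1200.0, 3900.0, 4100.0, 6500.0, 8000.0]
predicted = [650.0, 1000.0, 4300.0, 3800.0, 6100.0, 8400.0]
LIMIT = 4000.0  # hypothetical threshold, not a value quoted from the thesis

table = screening_table(measured, predicted, LIMIT)
# Sensitivity: share of truly contaminated batches that the model flags.
sensitivity = table["true_pos"] / (table["true_pos"] + table["false_neg"])
# Specificity: share of acceptable batches that the model lets through.
specificity = table["true_neg"] / (table["true_neg"] + table["false_pos"])
r = pearson_r(measured, predicted)
print(table, round(sensitivity, 2), round(specificity, 2), round(r, 3))
```

The two batches close to the threshold are the ones that end up misclassified here, which is why a high correlation between measured and predicted values does not by itself guarantee good screening near a legal limit.
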
As a matter of fact, traditional techniques for measuring mycotoxins are expensive, slow and destructive, and they require trained personnel, whereas NIR spectroscopy is cheaper, faster and simpler, and it produces no waste materials. Therefore, PLS modelling techniques grounded on NIR spectroscopy data provide a quite promising methodology for fumonisin detection in maize. As usual, further systematic analyses are required in order to confirm the conclusions and improve the results.

Acknowledgments

During these years at the Department of Mathematics and Computer Science of the University of Udine I met people without whom my work would not have been possible. I would like to thank all of them for their insightful guidance and help.

I would like to express my sincere gratitude to my supervisor, Prof. Vito Roberto, for his continuous support, precious advice and welcome suggestions, which allowed me to improve myself and my studies, capabilities and results.

Words cannot describe how lucky I feel for the chance to work with Prof. Giacomo Della Riccia. At the Research Center "Norbert Wiener" we developed the statistical analysis together, and his enthusiasm and profound experience were essential to overcome any obstacle. He is a very nice person who delighted me with his incredible and amazing stories, memories of a lifetime, that made my thoughts travel from Alexandria to Moscow, from Paris to Boston.

I would also like to thank my reviewers, professors Corrado Lagazio, Carlo Lauro and Ely Merzbach, as well as professors Ruggero Bellio, Francesco Fabris, Hans Lenz and Paolo Vidoni, for their invaluable feedback, encouragement and useful discussions.

I am grateful to the director of the Ph.D. Course in Computer Science, Prof. Alberto Policriti, who always answered my questions kindly and carefully followed my student career step by step.

Furthermore, I would like to thank every professor of the courses I followed: not only were they helpful in lecturing, they also gave excellent suggestions for the final reports.

The described case study is part of "Micosafe", a project funded by the Natural, Agricultural and Forestry Resources Office of the Friuli Venezia Giulia Region, proposed and developed by the Department of Agricultural and Environmental Sciences of the University of Udine in cooperation with the International Research Center for the Mountain (CirMont) and the Breeders Association of FVG. I thank professors Bruno Stefanon and Giuseppe Firrao for choosing the Department of Mathematics and Computer Science as their partner for statistical analysis and for making the data available. Brigitta Gaspardo, Emanuela Torelli and Sirio Cividino are the researchers with whom I collaborated. I would like to express my gratitude to them.

The fruitful and enjoyable experience of my studies could not have been accomplished without the daily presence of the people who love me and live with me every day, listening to me and sharing thoughts and feelings. I take this occasion to express my thanks to each of them.

My parents, Laura and Renato, sustain me and point out my shortcomings to make me a better woman. Their life, their behaviour and their choices are the first visible example and model for me. My brothers, Simone and Michele, fill my life with their "disorder": they make me happy when I am sad or angry, they speak aloud when I am too silent, they ignore my recurring useless comments.

Simone shares his life and time with me in all possible ways. He understands me even without words, he sees deep inside me, and he respects me and my moody moments with endless perseverance.

The presence of my extended family is also essential: grandparents, aunts and uncles, cousins and other relatives make me feel glad, and we spend joyful time together.

Claudia and her group of teenagers warm my heart with their smiles, their hugs and their jokes.

The presence of many other people enriches my days. My friend Cristina spent many hours with me, always listening to my questions and trying to help me. My colleague Omar puts up with my grumbles and still asks me to join him for a coffee break. With Elena I have the most restful talks, in which she gives me the best advice for every kind of situation.

Furthermore, there are many other people that I cannot list here but whose friendship is essential to me, with whom I spend pleasant moments of spare time and who can never be forgotten.