Pattern Recognition (For Neuroimaging Data) Fundamentals


OHBM Educational Course, Vancouver, June 25, 2017
C. Phillips, GIGA – Research, ULiège, Belgium
[email protected] – http://www.giga.ulg.ac.be

Overview

• Introduction
  – Uni- vs. multi-variate
  – Pattern recognition framework
• Pattern Recognition
  – Data representation
  – Linear machine & kernel
  – SVM principles
  – Validation & inference
• Conclusion

Introduction

A series of images = a 4D image = a 3D array of feature series = a series of 3D images: N scans, each providing many variable values (one series of measurements per voxel).

Univariate vs. multivariate

Standard univariate approach, a.k.a. Statistical Parametric Mapping (encoding):
Input (the BOLD signal time course at each voxel) → voxel-wise GLM model estimation → independent statistical test at each voxel → correction for multiple comparisons → output: a parametric map.
The aim is to find the mapping from the explanatory variables (the design matrix) to the observed data (one voxel's values across images).

Multivariate approach, a.k.a. "pattern recognition":
• Training phase: samples from condition 1 and samples from condition 2 → a "trained machine" = the link from an image to its condition label {1, 2}.
• Test phase: a new sample → prediction: condition 1 or condition 2.
The aim is to find the mapping f from the observed data X (one whole image) to the explanatory variable y (label/score): f : X → y.

Pattern recognition concept

From data X and labels y, learn the mapping f : X → y; then apply it to a new sample, f : x* → y*.

Pattern recognition concepts

• Classification vs. regression problem
  – Classification → output = one discrete label, e.g. condition A/B, healthy/diseased, etc. → y ∈ {−1, +1}
  – Regression → output = one continuous value, e.g. age, score, level, etc. → y ∈ (−∞, +∞)
• Supervised vs. unsupervised learning: at training time, you know
  – both input & output → supervised
  – only the input → unsupervised

Pattern recognition framework

Input (brain scans X1, X2, X3, …) and output (labels or scores y1, y2, y3, …), with no mathematical model available → machine learning.
Methodology: computer-based procedures that learn a function from a series of examples.
• Learning/training phase: given training examples (X1, y1), …, (Xs, ys), optimise the parameters of a function f such that f(Xi) ≈ yi.
• Testing phase: for a test example X*, predict f(X*) = y*.

Data representation

An image is a 3D matrix of voxels; the whole brain volume is unwrapped into a "feature vector" or "data point".
Data dimensions:
• dimensionality of a sample/"data point" = number of voxels considered
• number of samples/"data points" = number of scans/images considered

Linear classification example

With only 2 voxels, each sample is a point in the (voxel 1, voxel 2) plane: samples 1–4 belong to class A or class B, and a new sample with an unknown label (sample *) must be assigned to one of the two classes.
• Hyper-plane = decision boundary
• Training = fixing the hyper-plane parameters
• When #features (= #voxels) ≫ #samples (= #scans) → "ill-posed problem"

Data representation (continued)

Solutions to the dimensionality problem:
• Region of interest (ROI)
• Searchlight = scan all locally defined ROIs
• Feature selection strategies
• Kernel methods + regularization
  – computational shortcut
  – efficient solution of ill-conditioned problems

Kernel matrix

The kernel matrix is a "similarity measure":
• 2 patterns xi and xj → a real number characterising their similarity
• the simplest similarity measure is a dot product → linear kernel
• kernel matrix size: #samples × #samples
Example with 2-voxel images: image 3 = (4, 1) and image 7 = (−2, 3) give the linear kernel K(7,3) = (4 × −2) + (1 × 3) = −5.
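As a concrete check of this dot-product computation, here is a minimal NumPy sketch (not part of the original slides; the array values simply reuse the 2-voxel example above, and the variable names are illustrative):

```python
import numpy as np

# Two toy "images", each reduced to 2 voxels (rows = samples, columns = voxels),
# using the values from the slide: image 3 = (4, 1), image 7 = (-2, 3).
X = np.array([[4.0, 1.0],    # image 3
              [-2.0, 3.0]])  # image 7

# Linear kernel matrix: K[i, j] = dot product of sample i and sample j,
# so K has size #samples x #samples.
K = X @ X.T
print(K)        # [[17. -5.]
                #  [-5. 13.]]
print(K[0, 1])  # -5.0, i.e. K(7, 3) = (4 * -2) + (1 * 3) = -5
```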
Support Vector Machine (SVM)

• Relies on the kernel representation.
• "Maximum margin" (ρ) classifier: find the hyper-plane (w, b) such that w⊤xi + b > 0 for one class and w⊤xi + b < 0 for the other, with the largest possible margin between the classes.
• Data: (xi, yi), i = 1, …, N, with observations xi ∈ R^d and labels yi ∈ {−1, +1}.
• The weight vector is a linear combination of the training samples, w = Σ_{i=1}^{N} αi xi; the "support vectors" are the samples with αi ≠ 0.

Class prediction

Training on samples of class 1 and class 2 (each described by voxel 1, voxel 2, …) yields the "weight vector" or "discrimination map", e.g. w1 = +5, w2 = −3.
Testing on a new example with v1 = 0.5 and v2 = 0.8:
f(x) = (w1·v1 + w2·v2) + b = (+5 × 0.5 − 3 × 0.8) + 0 = 0.1
A positive value → class 1.

Kernel methods

• SVM: hard binary classification
  – simple & efficient, quick calculation, but
  – no 'grading' in the output, which is restricted to {−1, +1}
• Gaussian Processes: probabilistic model
  – more complicated, slower calculation, but
  – returns a probability in [0, 1]
  – can be multiclass
• Other approaches: deep learning, tree-based methods, etc.

Validation principle

Data samples = {labels, features}: each sample i has a label and features (feat 1, feat 2, feat 3, …, feat m). The samples are split into a training set, used to build the trained classifier, and a test set, on which the predicted labels are compared with the true labels to evaluate prediction accuracy.

M-fold cross-validation

• Split the data into 2 sets, "train" & "test" → evaluation on 1 "fold".
• Rotate the partition and repeat → evaluations on M "folds".
• Applies to scans/events/blocks/subjects/… → leave-"X"-out approach.

Classification validation

Confusion matrix (actual class in rows, predicted class in columns):

            Predicted 1   Predicted 0
Actual 1        A             B
Actual 0        C             D

Accuracy estimation:
• Class 1 accuracy: p1 = A/(A+B)
• Class 0 accuracy: p0 = D/(C+D)
• Total accuracy: p = (A+D)/(A+B+C+D)
Other criteria:
• Sensitivity/specificity
• Positive/Negative Predictive Value (PPV/NPV)
• Balanced accuracy = (p1 + p0)/2

Regression validation

Consider an N-fold CV:
• prediction error in one fold,
• mean across all folds,
→ out-of-sample "mean squared error" (MSE) between the targets yn and the predictions f(xn).
Other measure: correlation between predictions and targets.

Inference by permutation testing

• H0: "no link between features and target".
• Test statistic: e.g. the CV accuracy.
• Estimate the distribution of the test statistic under H0: randomly permute the labels, estimate the CV accuracy, and repeat M times.
• Calculate the p-value as the proportion of permuted accuracies that reach or exceed the accuracy obtained with the true labels.
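To make the cross-validation and permutation-testing steps concrete, here is a minimal, non-authoritative sketch using scikit-learn (a library not mentioned in the slides). The data are random stand-ins for real scans and all variable names are illustrative; `cross_val_score` and `permutation_test_score` provide the M-fold CV accuracy and the label-permutation p-value described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     permutation_test_score)

# Hypothetical stand-in data: 40 "scans" of 500 voxels with binary labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))
y = np.repeat([0, 1], 20)

clf = SVC(kernel="linear")        # linear-kernel SVM classifier
cv = StratifiedKFold(n_splits=5)  # M-fold cross-validation with M = 5

# Cross-validated accuracy: mean accuracy over the 5 folds.
cv_accuracy = cross_val_score(clf, X, y, cv=cv).mean()

# Permutation test: the labels are shuffled n_permutations times, the CV
# accuracy is re-estimated for each shuffle (null distribution under H0),
# and the p-value is the proportion of permuted accuracies that reach or
# exceed the accuracy obtained with the true labels.
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=cv, n_permutations=100, random_state=0)

print(f"CV accuracy = {cv_accuracy:.2f}, permutation p-value = {p_value:.3f}")
```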
Conclusions

Univariate:
• 1 voxel
• Target → data
• Look for a difference or correlation
• General Linear Model
• GLM inversion → parameter & error terms
• Calculate the contrast of interest
• Voxel/cluster activation inference → localisation

Multivariate:
• 1 volume
• Data → target
• Look for a similarity or score
• A specific machine
• Machine training → machine parameters
• Estimate prediction accuracy with CV
• Sample label prediction inference → no localisation

…much more to come

• Pradeep → cross-validation, how & why
• Carsten → permutation & statistical inference
• Jessica → weight maps & their interpretation
• Jo → fMRI, BOLD signal & HRF
• Bertrand → fMRI, stability of your results
• Georg → multi-subject data
• Olivier → multi-modal data & disease prediction
• Janaina → psychiatric applications
• Vince → deep learning approaches in neuro-imaging
• Moritz → MVPA models interpretation

Thank you for your attention! Any question?
Thanks to the PRoNTo Team for the borrowed slides.

References

Reviews:
• Haynes and Rees (2006) Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 7, 523-534.
• Pereira, Mitchell, Botvinick (2009) Machine learning classifiers and fMRI: a tutorial overview. NeuroImage, 45, S199-S209.
Books:
• Hastie, Tibshirani, Friedman (2003) Elements of Statistical Learning. Springer.
• Shawe-Taylor and Cristianini (2004) Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press.
• Bishop (2006) Pattern Recognition and Machine Learning. Springer.
Machines:
• Burges (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.
• Rasmussen and Williams (2006) Gaussian Processes for Machine Learning. The MIT Press.
• Tipping (2001) Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1, 211-244.
• Breiman (1996) Bagging predictors. Machine Learning, 24, 123-140.
• Rakotomamonjy, Bach, Canu, and Grandvalet (2008) SimpleMKL. Journal of Machine Learning Research, 9, 2491-2521.