BiplotGUI: Interactive Biplots in R
Recommended publications
A Generalized Linear Model for Principal Component Analysis of Binary Data
Andrew I. Schein, Lawrence K. Saul, Lyle H. Ungar. Department of Computer and Information Science, University of Pennsylvania, Moore School Building, 200 South 33rd Street, Philadelphia, PA 19104-6389. {ais,lsaul,ungar}@cis.upenn.edu. Abstract: We investigate a generalized linear model for dimensionality reduction of binary data. The model is related to principal component analysis (PCA) in the same way that logistic regression is related to linear regression. Thus we refer to the model as logistic PCA. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of the logistic PCA model. The resulting updates have a simple closed form and are guaranteed at each iteration to improve the model's likelihood. We evaluate the performance of logistic PCA, as … … they are not generally appropriate for other data types. Recently, Collins et al. [5] derived generalized criteria for dimensionality reduction by appealing to properties of distributions in the exponential family. In their framework, the conventional PCA of real-valued data emerges naturally from assuming a Gaussian distribution over a set of observations, while generalized versions of PCA for binary and nonnegative data emerge respectively by substituting the Bernoulli and Poisson distributions for the Gaussian. For binary data, the generalized model's relationship to PCA is analogous to the relationship between logistic and linear regression [12]. In particular, the model exploits the log-odds as the natural parameter of the Bernoulli distribution and the logistic function as its canonical link.
Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis
University of Pennsylvania ScholarlyCommons, Statistics Papers, Wharton Faculty Research, 2009. Lisha Chen (Yale University), Andreas Buja (University of Pennsylvania). Recommended Citation: Chen, L., & Buja, A. (2009). Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis. Journal of the American Statistical Association, 104 (485), 209-219. http://dx.doi.org/10.1198/jasa.2009.0111. This paper is posted at ScholarlyCommons: https://repository.upenn.edu/statistics_papers/515. Abstract: In the past decade there has been a resurgence of interest in nonlinear dimension reduction. Among new proposals are “Local Linear Embedding,” “Isomap,” and Kernel Principal Components Analysis which all construct global low-dimensional embeddings from local affine or metric information. We introduce a competing method called “Local Multidimensional Scaling” (LMDS). Like LLE, Isomap, and KPCA, LMDS constructs its global embedding from local information, but it uses instead a combination of MDS and “force-directed” graph drawing. We apply the force paradigm to create localized versions of MDS stress functions with a tuning parameter to adjust the strength of nonlocal repulsive forces. We solve the problem of tuning parameter selection with a meta-criterion that measures how well the sets of K-nearest neighbors agree between the data and the embedding. Tuned LMDS seems to be able to outperform MDS, PCA, LLE, Isomap, and KPCA, as illustrated with two well-known image datasets.
Exploratory and Confirmatory Factor Analysis Course Outline Part 1
Exploratory and Confirmatory Factor Analysis. Part 1: Overview, PCA and Biplots. Michael Friendly, Psychology 6140. Course Outline: 1. Principal components analysis (FA vs. PCA; least squares fit to a data matrix; biplots). 2. Basic Ideas of Factor Analysis (Parsimony: common variance → small number of factors; linear regression on common factors → partial linear independence; common vs. unique variance). 3. The Common Factor Model (Factoring methods: principal factors, unweighted least squares, maximum likelihood; factor rotation). 4. Confirmatory Factor Analysis (development of CFA models; applications of CFA). Part 1 Outline: 1. PCA and Factor Analysis: Overview & Goals (Why do Factor Analysis?; two modes of Factor Analysis; brief history of Factor Analysis). 2. Principal components analysis (artificial PCA example). 3. PCA: details. 4. PCA: Example. 5. Biplots (low-D views based on PCA). Why do “Factor Analysis”? Data Reduction: Replace a large number of variables with a smaller number which reflect most of the original data [PCA rather than FA]. Example: In a study of the reactions of cancer patients to radiotherapy, measurements were made on 10 different reaction variables. Because it was difficult to interpret all 10 variables together, PCA was used to find simpler measure(s) of patient response to treatment that contained most of the information in data. Test and Scale Construction: Develop tests and scales which are “pure” measures of some construct. Example: In developing a test of English as a Second Language, investigators calculate correlations among the item scores, and use FA to construct subscales.
A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets
Narechania et al. GigaScience (2016) 5:44. DOI 10.1186/s13742-016-0152-3. Technical Note, Open Access. Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets. Apurva Narechania, Richard Baker, Rob DeSalle, Barun Mathema, Sergios-Orestis Kolokotronis, Barry Kreiswirth and Paul J. Planet. Abstract. Background: Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications, including document analysis and DNA microarrays. In a flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize, and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages, including the potential to reduce the number of required comparisons. Findings: In the tool presented here, Clusterflock, we have implemented a flocking algorithm designed to locate groups (flocks) of orthologous gene families (OGFs) that share an evolutionary history. Pairwise distances that measure phylogenetic incongruence between OGFs guide flock formation. We tested this approach on several simulated datasets by varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also verified its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signals, we were able to pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold.
A Resampling Test for Principal Component Analysis of Genotype-By-Environment Interaction
A Resampling Test for Principal Component Analysis of Genotype-by-Environment Interaction. Johannes Forkman. Abstract. In crop science, genotype-by-environment interaction is often explored using the "genotype main effects and genotype-by-environment interaction effects" (GGE) model. Using this model, a singular value decomposition is performed on the matrix of residuals from a fit of a linear model with main effects of environments. Provided that errors are independent, normally distributed and homoscedastic, the significance of the multiplicative terms of the GGE model can be tested using resampling methods. The GGE method is closely related to principal component analysis (PCA). The present paper describes i) the GGE model, ii) the simple parametric bootstrap method for testing multiplicative genotype-by-environment interaction terms, and iii) how this resampling method can also be used for testing principal components in PCA. 1. Introduction. Forkman and Piepho (2014) proposed a resampling method for testing interaction terms in models for analysis of genotype-by-environment data. The "genotype main effects and genotype-by-environment interaction effects" (GGE) analysis (Yan et al. 2000; Yan and Kang, 2002) is closely related to principal component analysis (PCA). For this reason, the method proposed by Forkman and Piepho (2014), which is called the "simple parametric bootstrap method", can be used for testing principal components in PCA as well. The proposed resampling method is parametric in the sense that it assumes homoscedastic and normally distributed observations. The method is "simple", because it only involves repeated sampling of standard normal distributed values. Specifically, no parameters need to be estimated.
GGL Biplot Analysis of Durum Wheat (Triticum Turgidum Spp
Bulgarian Journal of Agricultural Science, 19 (No 4) 2013, 756-765. Agricultural Academy. GGL Biplot Analysis of Durum Wheat (Triticum turgidum spp. durum) Yield in Multi-Environment Trials. N. Sabaghnia (2, *), R. Karimizadeh (1) and M. Mohammadi (1). (1) Department of Agronomy and Plant Breeding, Faculty of Agriculture, University of Maragheh, Maragheh, Iran. (2) Dryland Agricultural Research Institute (DARI), Gachsaran, Iran. Abstract. Sabaghnia, N., R. Karimizadeh and M. Mohammadi, 2013. GGL biplot analysis of durum wheat (Triticum turgidum spp. durum) yield in multi-environment trials. Bulg. J. Agric. Sci., 19: 756-765. Durum wheat (Triticum turgidum spp. durum) breeders have to determine the new genotypes responsive to the environmental changes for grain yield. Matching durum wheat genotype selection with its production environment is challenged by the occurrence of significant genotype by environment (GE) interaction in multi-environment trials (MET). This investigation was conducted to evaluate 20 durum wheat genotypes for their stability grown in five different locations across three years using randomized completely block design with 4 replications. According to combined analysis of variance, the main effects of genotypes, locations and years, were significant as well as the interactions effects. The first two principal components of the site regression model accounted for 60.3 % of the total variation. Polygon view of genotype plus genotype by location (GGL) biplot indicated that there were three winning genotypes in three mega-environments for durum wheat in rain-fed conditions. Genotype G14 was the most favorable genotype for location Gachsaran and the most favorable genotype of mega-environment Kouhdasht and Ilam was G12 while G10 was the most favorable genotypes for mega-environment Gonbad and Moghan.
A Biplot-Based PCA Approach to Study the Relations Between Indoor and Outdoor Air Pollutants Using Case Study Buildings
Buildings, Article. A Biplot-Based PCA Approach to Study the Relations between Indoor and Outdoor Air Pollutants Using Case Study Buildings. He Zhang * and Ravi Srinivasan. UrbSys (Urban Building Energy, Sensing, Controls, Big Data Analysis, and Visualization) Laboratory, M.E. Rinker, Sr. School of Construction Management, University of Florida, Gainesville, FL 32603, USA; sravi@ufl.edu. * Correspondence: rupta00@ufl.edu. Abstract: The 24 h and 14-day relationship between indoor and outdoor PM2.5, PM10, NO2, relative humidity, and temperature were assessed for an elementary school (site 1), a laboratory (site 2), and a residential unit (site 3) in Gainesville city, Florida. The primary aim of this study was to introduce a biplot-based PCA approach to visualize and validate the correlation among indoor and outdoor air quality data. The Spearman coefficients showed a stronger correlation among these target environmental measurements on site 1 and site 2, while it showed a weaker correlation on site 3. The biplot-based PCA regression performed higher dependency for site 1 and site 2 (p < 0.001) when compared to the correlation values and showed a lower dependency for site 3. The results displayed a mismatch between the biplot-based PCA and correlation analysis for site 3. The method utilized in this paper can be implemented in studies and analyzes high volumes of multiple building environmental measurements along with optimized visualization. Keywords: air pollution; indoor air quality; principal component analysis; biplot. Citation: Zhang, H.; Srinivasan, R. A Biplot-Based PCA Approach to Study the Relations between Indoor and Outdoor Air Pollutants Using Case Study Buildings. Buildings 2021, 11, … 1. Introduction. The 2020 Global Health Observatory (GHO) statistics show that indoor and outdoor …
Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD
Mathematics, Article. Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD. Jose Giovany Babativa-Márquez (1, 2, *) and José Luis Vicente-Villardón (1). (1) Department of Statistics, University of Salamanca, 37008 Salamanca, Spain. (2) Facultad de Ciencias de la Salud y del Deporte, Fundación Universitaria del Área Andina, Bogotá 1321, Colombia. Abstract: Multivariate binary data are increasingly frequent in practice. Although some adaptations of principal component analysis are used to reduce dimensionality for this kind of data, none of them provide a simultaneous representation of rows and columns (biplot). Recently, a technique named logistic biplot (LB) has been developed to represent the rows and columns of a binary data matrix simultaneously, even though the algorithm used to fit the parameters is too computationally demanding to be useful in the presence of sparsity or when the matrix is large. We propose the fitting of an LB model using nonlinear conjugate gradient (CG) or majorization–minimization (MM) algorithms, and a cross-validation procedure is introduced to select the hyperparameter that represents the number of dimensions in the model. A Monte Carlo study that considers scenarios with several sparsity levels and different dimensions of the binary data set shows that the procedure based on cross-validation is successful in the selection of the model for all algorithms studied. The comparison of the running times shows that the CG algorithm is more efficient in the presence of sparsity and when the matrix is not very large, while the performance of the MM algorithm is better when the binary matrix is balanced or large.
PCAtools: Everything Principal Components Analysis
Package ‘PCAtools’. October 1, 2021. Type: Package. Title: PCAtools: Everything Principal Components Analysis. Version: 2.5.15. Description: Principal Component Analysis (PCA) is a very powerful technique that has wide applicability in data science, bioinformatics, and further afield. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed. It extracts the fundamental structure of the data without the need to build any model to represent it. This 'summary' of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the 'principal components'), while at the same time being capable of easy interpretation on the original data. PCAtools provides functions for data exploration via PCA, and allows the user to generate publication-ready figures. PCA is performed via BiocSingular; users can also identify optimal number of principal components via different metrics, such as elbow method and Horn's parallel analysis, which has relevance for data reduction in single-cell RNA-seq (scRNA-seq) and high dimensional mass cytometry data. License: GPL-3. Depends: ggplot2, ggrepel. Imports: lattice, grDevices, cowplot, methods, reshape2, stats, Matrix, DelayedMatrixStats, DelayedArray, BiocSingular, BiocParallel, Rcpp, dqrng. Suggests: testthat, scran, BiocGenerics, knitr, Biobase, GEOquery, hgu133a.db, ggplotify, beachmat, RMTstat, ggalt, DESeq2, airway, …
Multidimensional Scaling
UCLA Department of Statistics Papers. Title: Multidimensional Scaling. Permalink: https://escholarship.org/uc/item/83p9h12v. Author: de Leeuw, Jan. Publication Date: 2000. Multidimensional Scaling, J. de Leeuw. The term ‘Multidimensional Scaling’ or MDS is used in two essentially different ways in statistics (de Leeuw & Heiser 1980a). MDS in the wide sense refers to any technique that produces a multidimensional geometric representation of data, where quantitative or qualitative relationships in the data are made to correspond with geometric relationships in the representation. MDS in the narrow sense starts with information about some form of dissimilarity between the elements of a set of objects, and it constructs its geometric representation from this information. Thus the data are dissimilarities, which are distance-like quantities (or similarities, which are inversely related to distances). This chapter concentrates on narrow-sense MDS only, because otherwise the definition of the technique is so diluted as to include almost all of multivariate analysis. MDS is a descriptive technique, in which the notion of statistical inference is almost completely absent. There have been some attempts to introduce statistical models and corresponding estimating and testing methods, but they have been largely unsuccessful. I introduce some quick notation. Dissimilarities are written as $\delta_{ij}$, and distances are $d_{ij}(X)$. Here $i$ and $j$ are the objects we are interested in. The $n \times p$ matrix $X$ is the configuration, with coordinates of the objects in $\mathbb{R}^p$. Often, we also have as data weights $w_{ij}$ reflecting the importance or precision of dissimilarity $\delta_{ij}$.
Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
Journal of Machine Learning Research 11 (2010) 451-490. Submitted 4/09; Revised 12/09; Published 2/10. Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization. Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, Samuel Kaski. Aalto University School of Science and Technology, Department of Information and Computer Science, P.O. Box 15400, FI-00076 Aalto, Finland. Editor: Yoshua Bengio. Abstract: Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user needs to give the relative cost of missing similar points vs. retrieving dissimilar points, after which the total cost can be measured. We then introduce a new method NeRV (neighbor retrieval visualizer) which produces an optimal visualization by minimizing the cost. We further derive a variant for supervised visualization; class information is taken rigorously into account when computing the similarity relationships. We show empirically that the unsupervised version outperforms existing unsupervised dimensionality reduction methods in the visualization task, and the supervised version outperforms existing supervised methods.
Principal Components Analysis (PCA)
Principal Components Analysis (PCA). Steven M. Holland, Department of Geology, University of Georgia, Athens, GA 30602-2501. 3 December 2019. Introduction. Suppose we had measured two variables, length and width, and plotted them as shown below. Both variables have approximately the same variance and they are highly correlated with one another. We could pass a vector through the long axis of the cloud of points and a second vector at right angles to the first, with both vectors passing through the centroid of the data. Once we have made these vectors, we could find the coordinates of all of the data points relative to these two perpendicular vectors and re-plot the data, as shown here (both of these figures are from Swan and Sandilands, 1995). In this new reference frame, note that variance is greater along axis 1 than it is on axis 2. Also note that the spatial relationships of the points are unchanged; this process has merely rotated the data. Finally, note that our new vectors, or axes, are uncorrelated. By performing such a rotation, the new axes might have particular explanations. In this case, axis 1 could be regarded as a size measure, with samples on the left having both small length and width and samples on the right having large length and width. Axis 2 could be regarded as a measure of shape, with samples at any axis 1 position (that is, of a given size) having different length to width ratios. PC axes will generally not coincide exactly with any of the original variables.
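The rotation described in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the excerpted text: the length/width values are made up, the helper names (`pca_rotate_2d`, `variance`, `covariance`) are hypothetical, and the closed-form angle used here applies only to the two-variable case (with more variables one would normally use an eigendecomposition or SVD).

```python
import math


def pca_rotate_2d(points):
    """Rotate 2-D points onto their principal axes.

    Returns (scores, angle). scores holds each point's coordinates on
    axis 1 and axis 2; angle is the rotation of axis 1 from the original
    x-axis, in radians. Hypothetical helper, for illustration only.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Center the data so both new axes pass through the centroid.
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample variances and covariance of the centered data.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form direction of the long axis for a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(angle), math.sin(angle)

    # Coordinates of each point relative to axis 1 and perpendicular axis 2.
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    return scores, angle


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def covariance(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Made-up (length, width) measurements: similar variance, highly correlated.
data = [(2.0, 1.9), (3.1, 3.0), (4.2, 3.8), (5.0, 5.3), (6.1, 5.8), (7.0, 7.2)]
scores, angle = pca_rotate_2d(data)
axis1 = [a for a, _ in scores]
axis2 = [b for _, b in scores]

print("variance on axis 1:", variance(axis1))
print("variance on axis 2:", variance(axis2))
print("covariance of the new axes:", covariance(axis1, axis2))
```

With data like these, the three properties named in the excerpt can be checked directly: the variance on axis 1 is at least as large as on axis 2, the covariance between the new axes is zero up to rounding, and distances between points are the same before and after, since the transformation is a pure rotation about the centroid.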