Towards a Systematics for Protein Subcellular Location
Total Page:16
File Type:pdf, Size:1020Kb
Towards a Systematics for Protein Subcellular Location: Quantitative Description of Protein Localization Patterns and Automated Analysis of Fluorescence Microscope Images Robert F. Murphy, Michael V. Boland and Meel Velliste Department of Biological Sciences, Biomedical Engineering Program, and Center for Light Microscope Imaging and Biotechnology, Carnegie Mellon University 4400 Fifth Avenue Pittsburgh, PA 15213 [email protected] From: ISMB-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved. Abstract protein function. Overlooked in much of this work is the Determination of the functions of all expressed proteins importance of subcellular location for proper protein represents one of the major upcoming challenges in function. computational molecular biology. Since subcellular The organelle or structure where a protein is located location plays a crucial role in protein function, the provides a context for it to carry out its role. Each availability of systems that can predict location from organelle provides a unique biochemical environment that sequence or high-throughput systems that determine may influence the associations that a protein may form and location experimentally will be essential to the full the reactions that it may carry out. For example, the characterization of expressed proteins. The development of concentrations of protons, sodium, calcium, reducing prediction systems is currently hindered by an absence of agents, and oxidizing agents vary dramatically between training data that adequately captures the complexity of protein localization patterns. What is needed is a organelles. Thus we can imagine that knowledge of the systematics for the subcellular locations of proteins. This location(s) in which a previously uncharacterized gene paper describes an approach to the quantitative description product will be found would be of significant value when of protein localization patterns using numerical features and attempting to determine (or predict) its properties. the use of these features to develop classifiers that can Such knowledge can be obtained by two approaches: recognize all major subcellular structures in fluorescence experimental determination or prediction. Experimental microscope images. Such classifiers provide a valuable tool determination of subcellular location is accomplished by for experiments aimed at determining the subcellular three main approaches: cell fractionation, electron distributions of all expressed proteins. The features also microscopy and fluorescence microscopy. As currently have application in automated interpretation of imaging practiced, these approaches are time consuming, experiments, such as the selection of representative images or the rigorous statistical comparison of protein distributions subjective, and highly variable. While each method can under different experimental conditions. A key conclusion yield important information, they do not provide is that, at least in certain cases, these automated approaches unambiguous information on location that can be entered are better able to distinguish similar protein localization into databases. patterns than human observers. There have been pioneering efforts to predict subcellular location from protein sequence (Eisenhaber and Bork 1998; Horton and Nakai 1997; Nakai and Horton 1999; Introduction Nakai and Kanehisa 1992). These efforts have been modestly successful, correctly classifying approximately As the initial sequencing of a number of eukaryotic 60% of proteins whose locations are currently known. A genomes is completed, a major effort to determine the major limitation of the usefulness of these systems is that structure and function of all expressed proteins is only broad categories of subcellular locations were used beginning. Significant current work in this new area (often (see Table 1). This limitation is a reflection of the nature referred to as functional genomics or proteomics) is of the available training data: the subcellular location is devoted to the analysis of gene and protein expression only approximately known for most known proteins. patterns (in different adult tissues and during development) For example, the categories used by YPD (Garrels and the prediction and determination of protein structure. 1996)are overlapping but not explicitly hierarchical (the More limited efforts have been made to predict aspects of brief, standardized but very general description, such as PSORT YPD "integral membrane protein" or "cytoplasmic." For the Bud neck remaining proteins, the field contains unstructured text that Cell wall varies from being very general to quite specific. T h e Centrosome/spindle pole body ambiguities in database descriptions and reports of chloroplast experiments reflect imprecision and investigator-to- cytoplasm Cytoplasmic investigator variation in terminology (especially in the Cytoskeletal endomembrane system), uncertainty about the actual endoplasmic reticulum Endoplasmic reticulum location of many proteins (e.g., whether a Golgi protein is Endosome/Endosomal in cis or medial cisternae), and the fact that many proteins vesicles cycle between different locations. What is needed is a outside Extracellular (excluding cell systematic approach to describing subcellular location that wall) can Golgi body Golgi • incorporate information obtained by diverse Lipid particles methodologies lysosome/vacuole Lysosome/vacuole • address differences in cell morphology and organelle Microsomal fraction structure between cell types mitochondria Mitochondrial • provide sufficient accuracy and resolution that proteins mitochondria inner Mitochondrial inner with similar but not identical subcellular locations can membrane membrane be distinguished mitochondria Mitochondrial intermembrane • reflect the possibility that subcellular localization intermembrane space space patterns can be formed from weighted combinations of mitochondria matrix Mitochondrial matrix simpler patterns (e.g., a protein may be found in both space the endoplasmic reticulum and the Golgi complex). mitochondria outer Mitochondrial outer membrane membrane nucleus Nuclear Developing a Systematics Nuclear matrix Nuclear nucleolus The first requirement for creating a systematics for Nuclear pore subcellular location is a means of obtaining the set of all Nuclear transport factor possible localization patterns (initially in one cell type but Other vesicles of the eventually in others). This can be accomplished by random secretory/endocytic pathways tagging of all expressed genes (Jarvik et al. 1996; Rolls et microbody (peroxisome) Peroxisome al. 1999) and then collecting fluorescence microscope plasma membrane Plasma membrane images showing the distribution of each tagged gene Secretory vesicles product. Criteria are then needed for deciding whether two Unspecified membrane proteins show the same localization pattern or whether each should be assigned to its own class. This could Table 1. Categories of subcellular localization used for previous potentially be decided by imaging both proteins in a single prediction systems. The categories used by PSORT (Nakai and cell (by labeling each with a different fluorescent probe), Kanehisa 1992) for yeast, animal and plant cells are shown. The but the number of pairwise combinations of the estimated categories used in YPD (Garrels 1996) have been used to test 10,000 to 100,000 proteins expressed in a single cell make other prediction approaches. this effectively impossible. microsomal fraction is expected to include much of the As an alternative, we suggest a heuristic procedure for endoplasmic reticulum, endosomes, and the Golgi exploring the space of possible localization patterns. This apparatus). They also do not provide sufficient resolution procedure starts by numerically describing all known to determine if two proteins can be expected to show the protein localization patterns and then attempting to cluster same distribution. For example, a protein predominantly the proteins into essentially non-overlapping groups. Each located in the cis Golgi protein would have a different group is then be examined to determine whether any subcellular location pattern than one in the trans Golgi existing knowledge (i.e., from cell fractionation (and would be expected to show different protein sorting experiments or visual inspection of images) suggests that it motifs). Similarly, endosomes in animal cells can be should be split. Additional numeric image descriptors resolved into (at least) early endosomes, sorting would then be sought to resolve the subpopulations and the endosomes, recycling endosomes, and late endosomes. process repeated until either all existing knowledge has The limitations of current knowledge on subcellular been accounted for or until the limitations of fluorescence location can be further verified by inspection of the microscopy are reached. As will be discussed below, our "subcellular location" field of the Swiss-PROT database. results suggest that automated pattern analysis is more The contents of the field fall into three broad categories. sensitive than human observation. For most proteins, it is empty. For many, it consists of a As alluded to above, one benefit of having a systematic cells. This is illustrated in Figure 2, which shows scheme is that the location of known proteins can be more representative images of the pattern of transferrin receptor accurately described so that systems for predicting location (primarily found in endosomes). Given this