Bioinformatic Analysis of Single Nucleus Transcriptome Data of Huntington's Disease

Item Type dissertation

Authors Malaiya, Sonia

Publication Date 2019

Abstract Huntington’s Disease (HD) is a dominantly inherited neurodegenerative disorder caused by a trinucleotide expansion in exon 1 of the Huntingtin (Htt) . The earliest changes in HD are observed in the striatum, prior to the onset of symptoms. Here w...

Keywords Bioinformatics; Neurosciences; Genetics; eccentric spiny neurons; single cell RNA sequencing; single nucleus RNA sequencing; trajectory analysis; transcriptomics; Huntington Disease

Download date 06/10/2021 08:56:53

Link to Item http://hdl.handle.net/10713/11617

CURRICULUM VITAE

Name: Sonia Malaiya

Contact Information: [email protected]

Degree and Date to be Conferred: M.S., 2019

Collegiate Institutions Attended:

2017-2019 University of Maryland, Baltimore, M.S., December 2019

2012-2016 PES Institute of Technology, Bangalore, B.E., May 2016

Major: M.S. in Human Genetics and Genomic Medicine

B.E. in Biotechnology

Professional Positions Held:

November 2019 Research Fellow, University of Maryland Baltimore, Institute

of Genomic Sciences, 670, W. Lexington Street, Baltimore,

MD-21201

March 2016-October 2016 Research Assistant, PES Institute of Technology, Outer

Ring Road, Banashankari 3rd Stage, Banashankari,

Bengaluru, Karnataka 560085, India

ABSTRACT

Title: Bioinformatic Analysis of Single Nucleus Transcriptome Data of Huntington’s

Disease

Sonia Malaiya, M.S., 2019

Thesis Directed By: Dr. Seth A. Ament, Assistant Professor, Department of Psychiatry

Huntington’s Disease (HD) is a dominantly inherited neurodegenerative disorder caused by a trinucleotide expansion in exon 1 of the Huntingtin (Htt) gene. The earliest changes in HD are observed in the striatum, prior to the onset of symptoms. Here we use a knock in HttQ175/+ mouse model to perform single nucleus RNA sequencing

(snRNAseq) of the striatums of four HttQ175/+ and three wild type 14 and 15 month old mice and obtain their expression profiles. Using available snRNAseq quality control and analysis methods, we identify eleven cell types within our samples, including the newly discovered “Eccentric MSNs”. We compute the differentially expressed between the two genotypes and find significant lowering of cell type specific markers in most cells with the HttQ175/+ mutation. Trajectory analyses reveal stages of HttQ175/+

MSNs that range from identical to extremely distinguished form the wild type MSNs, supporting the length dependent somatic expansion hypothesis.

Bioinformatic Analysis of Single Nucleus Transcriptome Data of Huntington’s Disease

by Sonia Malaiya

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, Baltimore in partial fulfillment of the requirements for the degree of Master of Science 2019

To my parents for the patience and support, To my siblings for the love and motivation, To my aunt and uncle for all the life guidance, And to the extended family and friends For never complaining when I forgot to call!

iii

Acknowledgements

I would like to thank Dr. Seth Ament for being my teacher and mentor, for the endless

patience as I tried to learn and understand new concepts every day. His guidance,

expertise and contribution were invaluable at every step of this project.

This project would not have been possible without the timely generation of the dataset, for which I am very grateful to Ms. Marcia Cortés-Gutiérrez, and the advice and recommendations of Dr. Brian Herb, who provided the first stepping stones that

opened new paths when all hope was lost.

I would also like to thank Dr. Carlo Colantuoni and Dr. Jeff Carroll, for their guidance in shaping this analysis and their inputs, their outlook has been invaluable, and to Dr.

Michelle Giglio and Dr. Mervyn Monteiro for a fresh perspective and providing

direction.

In addition, I wish to acknowledge the input of the members of the Ament lab - ideas that were sparked during conversations have helped shaped a better, scientific way of

thinking.

iv

Table of Contents

INTRODUCTION ...... 1

METHODS ...... 14

RESULTS ...... 32

Specific Aim 1: Molecular identification of cell types in the mouse striatum using

single nucleus transcriptome data ...... 32

Specific Aim 2. Identify cell type-specific molecular changes in a mouse model of

the HD mutation ...... 42

DISCUSSION ...... 64

REFERENCES...... 68

v

List of tables

Table 1. Cell counts after each step of QC……………………………………..……33

Table 2. Distribution of cell types across samples and genotypes………..…………39

vi

List of Figures

Figure 1. Violin plots of the nGene distribution per sample………………………...... 16

Figure 2. Original clustering of the 14 and 15 month old dataset with identifiable cell types but presence of mutually exclusive genes within a cluster. Penk is an MSN gene, Mbp is an oligodendrocyte gene and Gad1 has the highest expression in Pnoc and Pvalb interneurons…………………………………………………………………………..17

Figure 3. Contamination fractions of each sample. The Valley can be seen between 3.125 and 3.75 on the x-axis in all samples except BBY3 and BCR1 where it is shifted……………………………………………………………………………..….19

Figure 4. Possible types of droplets giving being sequenced to give rise to the libraries observed……………………………………………………………………………...20

Figure 5. tSNE of all eight of the 14 and 15 month old samples. Left: colored by the clusters identified by Seurat. Right: Colored by the age of the nuclei. A trail is formed where the younger samples group together and the older samples spread out………...23

Figure 6. Coloring tSNEs by sample for different variations of MNN. Since it was discovered that the results from MNN correction were dependent on the order of samples provided to the function, I tried different orders to cluster the nuclei and color by sample…………………………………………………………………………..…24

Figure 7. tSNE of 4524 high quality single nuclei, colored by cell type, sample and genotype…………………………………………………………………………...…35

Figure 8. 2873 neurons clustered and colored by A: cluster number; B: Sample of origin; C: Htt genotype; D: number of genes expressed by each nucleus; E: Expression of Drd1 and F: Expression of Adora2a………...……………………………………..37

Figure 9. Distribution of markers across cell types…………………………………...40

Figure 10. Distribution of markers across cell types……………………………….…41

Figure 11. Distribution of genes Snap25 and Bcan in astrocytes……………………..42

Figure 12. Distribution for genes Rsrp1 and Rgs9 in D1 MSNs………………………44

Figure 13. Distribution of genes Meg3 and Pde10a in D2 MSNs…………………….45

Figure 14. Distribution of genes Meg3 and Inf2 in Eccentric MSNs………………….46

vii

Figure 15. Pairwise correlation between the log fold ratios of the 391 significantly differentially expressed genes between each combination of the 12 cell types, showing the correlation coefficient and the p-Values of correlation……………………………49

Figure 16. Correlation between D1 MSNs and D2 MSNs on the left and between D1 MSNs and Eccentric MSNs on the right………………………………………………50

Figure 17. Correlation between D2 MSNs and Oligodendrocytes on the left and D2 MSNs and Sst interneurons on the right………………………………………………51

Figure 18. Overlap among down regulated genes on the left and up regulated genes on the right……………………………………………………………………………….52

Figure 19. Rank Rank Hypergeometric overlap between A. D1 and D2 MSNs; B. D1 (x) and Eccentric MSNs (y) and C. D2 (x) and Eccentric MSNs (y)………………….53

Figure 20. Comparison of the interneuron populations, the Chat, Pnoc and Sst interneurons on the x axis and the canonical MSNs on the y axis……………………..54

Figure 21. Overlap between D2 MSNs on the x axis and Oligodendrocytes on the y axis…………………………………………………………………………………...55

Figure 22. Venn diagrams showing the number of common DEGs between different cell types. A: all neuronal and interneuron types; B: all non neuronal subtypes; C: all neuron subtypes; D: all interneuron subtypes………………………………………...56

Figure 23. ROC curves for the enrichment of the Langfelder modules in our data. The module has been denoted at the top of each plot along with information on whether it was up or down regulated in their dataset……………………………………………..59

Figure 24. tSNEs colored by the AUCell thresholds for the Langfelder modules. The pink or red cells are the ones that passed the threshold. The light blue cells were the ones closest to the threshold line to its left…………………………………………….60

Figure 25. Trajectories followed by each of the cell types generated using Monocle2……………………………………………………………………………..62

Figure 26. Violin plots of a component of the canonical MSN trajectory that shows a population of Q175 MSNs that are similar to wild type MSNs and intermediate stages that lead to a completely different stage in Q175 nuclei that cannot be seen in most wild type nuclei…………………………………………………………………………….63

viii

INTRODUCTION

A Brief History

Indications of neurological disorders have been observed as far back in history as the

Vedic scriptures1 of India (circa 1500-500 B.C.) and literatures of the ancient Greece

(9th century B.C.1). Many medical records discussing brain disorders – symptoms and possible ways of management – have been discovered and studied by scientists since then. Texts containing Parkinson’s-like symptoms date back to 1000 B.C. in Indian2 and Chinese passages3.

The first modern medical descriptions of these disorders were published in the late eighteenth and early nineteenth century period, with Parkinson’s disease being described by James Parkinson in 18174. The first article published about Huntington’s disease (HD) was by Charles Waters in 1842, prior to which other physicians had described singular cases of patients displaying the symptoms of HD. In 1860, Johan

Christian Lund, a Norwegian physician, recorded the occurrence of jerky movements and dementia in some of his patients. However, it was only in 1872 that a 22-year old

George Huntington published “On Chorea”5, a paper describing in great detail the disease and its symptoms as observed by him, his father, and his grandfather - all of whom George Huntington were physicians to predecessors of the same families that presented with the disease to him.

He wrote:

1

“The name ‘chorea’ is given to the disease on account of the dancing propensities of those who are affected by it, and it is a very appropriate designation. The disease, as it is commonly seen, is by no means a dangerous or serious affection, however distressing it may be to the one suffering from it, or to his friends. Its most marked and characteristic feature is a clonic spasm affecting the voluntary muscles... The disease commonly begins by slight twitchings in the muscles of the face, which gradually increase in violence and variety. The eyelids are kept winking, the brows are corrugated, and then elevated, the nose is screwed first to the one side and then to the other, and the mouth is drawn in various directions, giving the patient the most ludicrous appearance imaginable. The upper extremities may be the first affected, or both simultaneously… As the disease progresses the mind becomes more or less impaired, in many amounting to insanity, while in others mind and body gradually fail until death relieves them of their suffering. When either or both the parents have shown manifestations of the disease, one or more of the offsprings invariably suffers from the condition... It never skips a generation to again manifest itself in another...”5–7

Although the severity of the disease may not have been well understood, it was due to the clarity and accuracy of the symptoms and inheritance pattern in his publication that earned him the eponym for this disease.

HD Today

Modern medicine describes HD as a neurodegenerative disorder that affects the motor, behavioral and cognitive systems. More specifically, the motor symptoms include involuntary, jerky or twitching movements (chorea), inability to perform voluntary movements caused by muscle rigidity or dystonia, poor coordination, a change in gait and a lag in facial muscular movements such as slow movement of eyes and difficulty swallowing. Psychologically, patients suffer from depression, irritability, excess

2

fatigue, and loss of social skills. In some cases an obsessive compulsive need for repetitive movements and actions can be observed among other cognitive symptoms such as difficulty in learning new information, indecisiveness, inability to think forward and fixation on a single thought, loss of self control, and a general tardiness in most cognitive functions. General onset of symptoms in adult HD is in the 40s while early onset HD is when the symptoms appear before the age of 20 and accounts for nearly 5-

10% of all HD cases. Also known as Juvenile HD, it presents with a drop in academic performance, tremors, seizures and changes in handwriting, along with some or all other symptoms of adult onset HD.

The localization of the Huntingtin gene on 4 was first found in

1983 using linkage studies in families8 and the actual genetic mutation associated with

HD was first described in 1993 in the HTT gene on the 4p16.3 region. The mutation was identified to be an expanded CAG repeat (codes for glutamine (Q)) in exon 1 of the gene, akin to what was previously seen in the Fragile X syndrome9, and has since been grouped with various other polyglutamine disorders, including 6 forms of spinal cerebellar ataxias, dentatorubral pallidoluysian atrophy (DRPLA), and spinal and bulbar muscular atrophy10.

In HD, less than 35 CAG repeats give rise to normal phenotype and no disease, between 36 and 39 repeats have variable penetrance or a very late onset11 and 40 or more repeats cause a definite manifestation of the disease. There is an inverse correlation between the repeat length and onset age for more than 60 repeats in heterozygotes12. The longer of the two elongated CAG repeats in homozygotes causes a fully dominant effect on age at onset of symptoms13, with no difference in the age at onset between homozygotes anf heterozygotes. However, contradictory studies exist on the effect of homozygosity on the progression of disease: earlier studies showed a faster

3

progression in homozygotes14 but a more recent longitudinal study found no significant difference in the progression of disease between homozygotes and heterozygotes15.

Early studies found that patients live between 15 to 20 years after onset of symptoms but the progression of disease is directly correlated to the number of repeats and hence, the age of onset16,17. The quality of life for patients and caregivers progressively declines and often causes financial burden, emotional strain and a sense of isolation in most communities18.

Prevalence of HD varies among populations. Approximately 4-7 in every

100,000 Americans are affected. The incidence amongst East Asians with adequate means of diagnosis is nearly 10-100 times lower, and is often attributed to the difference in the haplotype that is associated with the instability of the CAG repeats. These haplotypes, that have been observed to cause an increased expansion bias, have a higher prevalence in Caucasian and African American populations due to mixed race ancestry19,20.

Huntingtin within the cell

The huntingtin protein has an essential role in the embryonic21–23 and striatal24–27 development. It has been noted that the loss of function caused by the absence of a normal HTT allele may exacerbate the pathology observed due to the gain of function of the mutant HTT28. Interestingly, increasing repeat length has been associated with an intellectual advantage until 40-41 repeats29, indicating the importance of the protein and the polyQ length in brain development.

One of the major observations in HD is the formation of aggregates of the

Huntingtin protein which has been attributed to the pathology of the disease. These aggregates are observed throughout the brain as the disease progresses, starting in the

4

striatum and cerebral cortex. While these aggregates have been extensively studied, the exact effect of the presence of these aggregates is unknown, although some studies suggest that they provide protection by sequestering the mutant protein to a single region30,31, reducing the levels of diffused protein and its interactions with other molecules within the cell.

Transcription of a mutant HTT gene with an elongated CAG track gives rise to an mRNA hairpin structure that gets recognized by Dicer in the cytoplasm and spliced.

The resultant fragments of RNA may be responsible for some of the toxicity due to interactions with other transcripts and silencing. A similar mechanism has been implicated in other trinucleuotidex repeat expansion diseases (TREDs)32.

Extensive research has been conducted on the direct products of the normal and mutant HTT gene and its interactions with molecules of interest and early changes have been well characterized. However, due to its pleiotropic nature across various cell types and the cell autonomous fashion of HD related changes, a global picture of all its interactions and thus, the pathogenesis of the disease is unknown. Current treatment methods target the symptoms and aid in management. Antisense Oligonucleotide

Therapy33,34 and RNA interference are other fields of research for the lowering of the huntingtin transcripts and have shown promising results in preliminary studies.

HD in the Striatum

The basal ganglia is a subcortical structure in the human brain that consists of the striatum, substantia nigra, globus pallidus, and subthalamic nucleus, all of which are involved in voluntary movement, cognition, behavior, learning, eye movement and emotional response. The most severely affected region of the brain in HD is the striatum, which is the largest part of the basal ganglia, with a selective loss of

5

GABAergic medium spiny neurons. The striatum is essential to the learning, motor movements and decision making, explaining the symptoms observed as the neurons start degenerating due to HD. Other major cell types in the striatum include interneurons, astrocytes, microglia, oligodendrocytes, oligodendrocyte progenitor cells

(OPCs) or polydendrocytes, fibroblasts and endothelial cells. Studies have revealed expression of the mutant HTT gene and consecutive changes associated with it in these cells as well.

Cell types in the striatum

Neurons

90% - 95% of the neurons in the striatum are comprised of GABAergic inhibitory projection neurons called the Medium spiny neurons (MSNs). These neurons are morphologically homogeneous, yet can be subdivided on the basis of the regions they project into as well as the dopamine receptors they express. Until a few years ago, there were believed to be only two populations of MSNs, the Drd1-expressing (“D1”) and

Drd2-expressing (“D2”) MSNs35,36(p2), both of which expressed the MSN marker

Ppp1r1b. These MSNs are equally and homogenously distributed throughout the striatum. In HD, the most affected cells are these neurons37,38 and initial theories were based on the premise that all other cells in the brain respond to this loss of neurons as the disease progresses. However, years of mouse studies suggest that cell autonomous effects of mHTT expression across a range of cell types contributes to pathogenesis, and other cell types have been discovered to have their own responses to the mutation in HTT, but the loss of MSNs in the striatum did explain the loss of cognitive and motor symptoms characteristic to HD.

6

D1 MSNs, also known as striatonigral or direct MSNs, project into the entopenduncular nucleus and the substantia nigra pars reticulata and express Drd1a,

Chrm4, Dynorphin (Dyn) and substance P neuropeptide (Tac1). These neurons are responsible for reward based learning and promote movement.

D2 MSNs, also known as thestriatropalladial or indirect MSNs, project into the lateral globus pallidus and express the dopamine D2 receptor, Drd2, the adenosine A2A receptor (Adora2a or A2A) and enkphalin (Penk). These neurons cause aversion based learning and inhibit movement35(p2),39–41.

Early studies showed a selective loss of D2 MSNs in the early and middle stages of HD37, explaining the lack of inhibition of involuntary movements. Meanwhile, the

D1 neurons which promote movement were still assumed active, thereby causing the chorea characteristic of HD. However, recent studies have shown that there is a significant loss in striatal neuron terminals even in premanifest HD, particularly the D1 neurons, which results in slower movements42,43.

A third population of MSNs expressing both D1 and D2 receptors, also known as eccentric neurons44, has been observed in the laboratory setting for the last two decades and the percentage contribution of these neurons to the overall striatal population has been overestimated45,46 but they were fully characterized using single cell RNA sequencing in 201844 by the McCarroll group at Harvard. This population contributes to nearly 4% of the striatal MSN population and 4-5% of the Ppp1r1b

MSNs. It can be hypothesized that these neurons were previously undetected due to the biochemical nature of analyses, camouflaged as either D1 or D2 neurons due the expression of canonical MSN markers. The McCarroll study44 showed that these neurons differ in the expression of at least 110 genes from the canonical MSN types,

7

which differ from each other only on the basis of expression levels of 68 genes.

Eccentric MSNs can further be subdivided into two groups that differ on the basis of markers used to identify D1 and D2 MSNs, making it even harder to distinguish them from D1 or D2 MSNs in a biochemical analysis.

Approximately 5% of the striatal neuron population in mice is comprised of aspiny interneurons which are further subdivided on the basis of their neurotransmitters and electrophysical properties, among other characteristics47. These include large cholinergic interneurons expressing the choline acetyltransferase receptor (Chat), and three GABAergic interneurons, one expressing the somatostatin (Sst) and neuropeptideY receptor, another expressing parvalbumin, and a third expressing calretinin48–51. Cholinergic interneurons have shown a significant decrease in their dendritic territory and reduction in the expression of CHAT in HD models52–54.

Non-neuronal cells

Astrocytes are star shaped microglial cells present in the striatum and throughout the brain, contributing to the structural integrity of the brain and the energy needs of their surroundings. In addition, astrocytes are involved in the injury repair mechanism of neurons through the process of astrogliosis. They form a glial scar at the site of injury that is important for the regeneration of neurons. In HD, there is known to be a surge of astrocytes in the striatum as more neurons are lost. Mutant huntingtin expression in astrocytes is also known to diminish their resistance to glutamate neurotoxicity55 and increase their activity, causing loss of neurons.

Microglia are the glial macrophages of the CNS, responsible for inflammatory reaction, and are sensitive to the slightest changes in the brain. In HD, microglia are activated prior to the clinical manifestation of the disease56 and have been shown to

8

activate the astrocytes, causing neurotoxicity57. Introduction of normal microglia in

HD–infected mice brains has shown to reduce disease phenotype as well as slow progression of disease58, suggesting that microglia contribute causally to the disease phenotype.

Oligodendrocytes are glial cells that provide support and insulation to neurons by the process of myelination of the axons. Dysfunction of these cells has been implicated in other neurodegenerative diseases such as ALS as well. Like microglia, myelin damage and axonal degeneration can be observed before the onset of symptoms in HD subjects59,60. The increase in the number of oligodendrocytes in the caudate nucleus was among the first histological changes observed in HD61. The mutant HTT protein expressed in oligodendrocytes interacts with myelinating proteins like MYRF and affects the overall activity of the cells62.

Model Organisms for HD

Since the identification of the HTT gene in 1993, many animal models have been developed to study the disease including yeast, fruitflies, zebrafish, mice, rats, pigs, minipigs, sheep and monkies24,25,26. The selection of the model depends on the aspect of HD being studied, with smaller organisms like yeast adequate for low cost, high throughput genetic analyses in isolated systems, and non human primates best suited for pharmacoanalysis.

In order to study the genetic changes associated with each cell type, genetically modified mice were chosen for this study to carry an elongated human exon 1 of the

HTT gene.

The choice of this animal model was due to the fully characterized genome, relative similarities to the human nervous system, fully developed cognitive, motor and

9

behavioral systems, short life span, high fecundity, and availability. For HD, a diversity of mouse models is available, including transgenic mice, with random integration of the mHTT gene, alongwith it’s promoter, anywhere in the mouse endogenous genome

(such as the R6/1 and R6/2 models), as well as knock in models, with insertion of the mHTT gene in the mouse homolog of HTT. Knock in models carrying a range of lengths of the CAG elongation on various backgrounds between 20 and 175 polyQ repeats are available. However, the number of repeats does not cause the same phenotypic effects of disease in mice as they do in humans.

It is unclear if the onset of changes seen in human HD patients is in the 40s because it is the time is required for the progression of the disease or if onset is dependent on completing a percentage of the average life cycle. Mice reach adulthood at around two months of age. In mice with greater than 50 repeats (Q50), very modest

HD associated phenotypes can be observed prior to natural death66. In mice that have over 80 repeats (Q80), molecular changes are easily quantifiable. Like in humans, phenotypic changes in mice can be observed at a younger age in cases with longer repeats. For these reasons, mice carrying one endogenous and one mutant HTT gene with 175 repeats (Q175) have been used widely in HD studies in recent years.

At a molecular level, changes in the striatum can be observed in HttQ175/+ mice at 2 months of age with an increasing downregulation of synaptic function genes in neurons, and upregulation of inflammatory genes in the glial cells of the striatum, as age progresses. At 4-5 months, early behavioral changes in the dark phase of the diurnal cycle can be seen. The rotarod test consists of a rotating rod that the mice are put on and the time that they are able to stay on this rod is recorded. 7-8 month old HttQ175/+ mice show significant decline in these tests. By the 12th month, hyperactivity and brain atrophy commence - events that precede cell death even in normal cases. In the four

10

months that follow, HttQ175/+ mice also display a lack in motivation, motor deficits and cognitive impairment67.

A small presence of HTT aggregates in the striatum has been observed as early as 9 to 10 weeks of age68, with a gradual increase over time and dense aggregates are visible at 16 months. Visibly lowered striatal volume is observed around 12 months and continues to decline67.

These observations indicate that ages of 1, 2, 6-7 and 12-16 months are ages at which different levels of disease progression can be visible in mice.

Single cell and Single nucleus RNA sequencing

Standard approaches to determine the cascade of events following the transcription of mHTT have not been successful in being able to determine the mechanisms at play in the heterogenous brain environment. Single cell and single nucleus RNA sequencing

(sc and snRNAseq) are disruptive new technologies that are a step over bulk sequencing methods due to the advantage of being able to discern the transcriptional profiles of individual cells that are known to differ even among similar cells and have been instrumental in understanding expression and cell fate decisions.

Single nucleus RNA sequencing is a modification of the single cell method for tissue samples with compromised membrane integrity that could be a result of freezing the tissues, increasing the sensitivity to proteolytic cleavage. It is also a solution to stressed cells, which may have a higher expression of mitochondrial genes in the cytoplasm, making identification of libraries difficult. This stress could even be caused by the dissociation of tissues for the sequencing and choosing between single cell or single nucleus depends on the sensitivity of the data required.

11

Transcriptional libraries from single cells are produced by creating microenvironments for cell disruption and cDNA tagging to determine the source of origin of each transcript. These libraries are then amplified and sequenced on next generation sequencing platforms. The raw base calls are used to develop FASTQ files that are then used to create large, high dimensional matrices that contain information on the number of transcripts of each gene that were read from each library. These counts can be used to cluster libraries based on similarities of gene expression and identify the type of cell that each cluster represents, based on the relative abundance of known markers for each cell type. The introduction of a mutant HTT gene in mice causes variations in the expression of certain genes, which can be picked up using different statistical analysis tools, some of which are available in R.

In the context of HD, significant gaps still exist between the genotypic characterization and the pathology of the disease. Here, we aim to use single nucleus

RNA sequencing of striatum samples from 14 and 15 month old mice in order to determine the types of cells present and the differential expression in genes associated with the mutant Htt genotype.

HDinHD Database

HDinHD69 or Huntington’s Disease in High Definition is an open source database that aims to collate research and provide a collaborative environment for the

HD research community. It is developed by the CHDI foundation and provides several tools to visualize, share and explore transcriptional and proteomics data for HD and is a platform for researchers to share their tools and computational models.

The HD Molecular Signatures70 (or HDsigDB) tool allows exploration of gene sets discovered in HD studies. The 2367 gene sets for Mus musculus range from cell

12

type specific markers to gene sets associated with other neurological or muscular disorders such as autism or Duchenne muscular dystrophy, available for download as a single dataset.

13

METHODS

Mice

C57BL/6J.HttQ175/+ mice and wildtype littermate controls were bred and aged in the colony of the Carroll lab at Western Washington University. Four 14-15-month-old male mice of each genotype were used to generate the primary dataset described in this thesis. In addition, two females and one male of each genotype at the age of 7 months were used for initial protocol development. The Carroll lab performed brain dissections and provided flash-frozen striatal tissue for molecular studies.

Isolation of nuclei from frozen brain tissue

Nuclei were isolated from flash-frozen striatal tissue by Marcia Cortes, Laboratory

Research Manager in the Ament lab, as described in previous protocols71–73, with slight modifications. A detergent-mechanical cell lysis method was used, involving 3 major steps: lysis, homogenization and density barrier centrifugation. In a laminar hood, a ~50 mg piece of frozen brain tissue was placed into a pre-frozen BioPulverizer (BioSpec) and smashed to a thin frozen layer, then immediately transferred to lysis buffer (250mM

Sucrose, 25mM KCl, 5mM MgCl2, 1uM DTT, 1X RNAse Inhibitor, 0.1% TritonX-

100, 10mM Tricine Buffer, pH 8.0). Tissue was disaggregated, flushing up and down, first using a 1 ml pipette tip, then with a 30G needle in a 3ml syringe. Homogenized tissue was diluted to 10ml with lysis buffer and filtered through a 70 micron filter.

Homogenate was spun at 1,000g for 8 min at 4°C. Pellet was then re-suspended and

500 µl of pellet suspension was diluted with 500 µl 50% Iodixanol Solution (50%

Iodixanol, 250mM Sucrose, 150 mM KCl, 30mM MgCl2, 1X RNAse Inhibitor, 60mM

Tricine Buffer, pH 8.0), filtered for second time with a 70 micron mesh and placed on top of 500 µl layer of 29% Iodixanol Solution (29% Iodixanol, 250mM Sucrose, 150

14

mM KCl, 30mM MgCl2, 1X RNAse Inhibitor, 60mM Tricine Buffer, pH 8.0). Density barrier was centrifuged at 13,500g for 20 min at 4°C. The pellet was collected and washed with 10 ml of PBS, 2% BSA, 1X RNAse Inhibitor, centrifuged at 1,000g for 8 min at 4°C, and finally suspended in 1 ml of PBS, 2% BSA, 1X RNAse Inhibitor and filtered using a 40 micron Flowmi cell strainer. Single nuclei suspensions were counted and evaluated for integrity using Propidium Iodide in a MoxiGo cytometer using 650 nm filter. Nuclei count was adjusted to 5000 nuclei/ml.

Library prep and sequencing

Droplet-based library prep and sequencing were performed by the Genomics Resources

Center at the Institute for Genome Sciences, using 10X Genomics technology74

(Pleasanton, CA). 13,000 nuclei were loaded into each well of a 10X Chromium microfluidics controller using PBS + 2% BSA. Sequencing libraries were generated using the Chromium Single Cell Gene Expression protocol with Version 3 chemistries.

These samples were sequenced across two lanes of an Illumina HiSeq4000 sequencer to obtain 75 paired-end reads.

Data pre-processing

Raw sequencing reads were processed to counts of unique molecular identifiers (UMIs) in each droplet with cellranger v3.0.2 (10X Genomics). De-multiplexed FASTQ files were generated from base calls using cellranger mkfastq. FASTQ files from each sample (n=8) were processed using cellranger count, which aligns UMIs to genes with

STAR75 and MAPQ 255, using the 10X Genomics mm10 transcriptome model. It then detects empty droplets with EmptyDrops76, as well as droplets containing multiple cells with a cellrnager-specific algorithm, both of these sets of droplets are filtered. The

15

outputs are matrices with counts of unique molecular identifiers (UMIs) from each droplet that map to each of 27,998 mouse genes based on the mm10 mouse transcriptome version.

Cell QC and reduction of extracellular RNA contamination

The counts obtained from cellranger for the 14 and 15 month old samples were used to cluster the libraries. The violin plots for the number of genes expressed in each sample

are seen in Figure 1.

# of Genes per library library per Genes of #

Sample Identity Figure 1. Violin plots of the nGene distribution per sample Almost all the samples have a similar pattern with high densities at a low expression value around 900 and at a higher expression value around 1500-1800 except

BBY3, which had fewer genes per cell with non-zero counts.

16

Clustering was conducted on libraries that expressed more than 200 genes and passed a mitochondrial filter of less than one standard deviation from the mean mitochondrial counts for that sample. Initial analysis revealed that the marker genes for some cell types that are expected to have highly specific expression in a single cell type were detected at low levels across multiple clusters. Figure 2 shows clusters colored by such genes.

Figure 2. Original clustering of the 14 and 15 month old dataset with identifiable cell types but presence of mutually exclusive genes within a cluster. Penk is an MSN gene, Mbp is an oligodendrocyte gene and Gad1 has the highest expression in Pnoc and Pvalb interneurons.

17

This pattern suggested contamination of some droplets with extranuclear RNA from the solution containing the isolated nuclei, a common issue in single-cell and single- nucleus RNA-seq data from solid tissues. Some empty droplets containing only extracellular RNA or droplets containing multiple nuclei may be missed by the filters in the cellranger pipeline. Therefore, we used SoupX77 to detect cells with high contamination from extracellular RNA and adjust the UMI counts to reduce its impact across all libraries. SoupX profiles the composition of transcripts in the extranuclear

“soup” based on UMIs in empty droplets, estimates the contamination fraction of each nucleus-containing droplet based on user-provided cell type-specific marker genes, and adjusts the counts matrix based on these estimates. We note that the contamination fraction estimates from SoupX are highly dependent on the manually curated list of cell type-specific genes and frequently overestimate contamination when marker genes are not perfectly cell type-specific77. For this analysis, we used the DropViz78 atlas of single-cell expression in the mouse brain79 to identify marker genes that were >20-fold enriched in a single striatal cell type and had <10 transcripts per 100,000 in all other striatal cell types.

SoupX-derived contamination estimates for each sample can be seen in Figure 3.

18

Figure 3. Contamination fractions of each sample. The Valley can be seen between 3.125 and 3.75 on the x-axis in all samples except BBY3 and BCR1 where it is shifted. We observe a common “valley” structure in each plot that is generally present between nUMI values of 1260 (approximately log(3.125)) to 6300 (approximately log(3.8)) for all samples except BBY3 and BCR1, which had the valley between 560

(approximately log(2.75)) and 1780 (approximately log(3.25)) and 1260

(approximately log(3.125)) and 4470 (approximately log(3.65)). We hypothesize that the libraries with low nUMI counts and high contamination fractions present in the left corner of most of these plots originate from damaged or low quality nuclei with a large proportion of reads from the soup. The valley originates from single-nuclei with some soup contamination and the rightmost peak in these plots must be a product of multiple nuclei being captured in a single droplet, giving rise to high nUMI counts and the high contamination fractions observed. Figure 4 summarizes the possible types of droplets that have been sequenced to give rise to the libraries being observed

19

Figure 4. Possible types of droplets giving being sequenced to give rise to the libraries observed. In order to conserve the subtle differences in expression between the real nuclei in the libraries, we selected a uniform contamination fraction estimate of 10% across all samples, and we used this fraction to subtract UMIs of likely extracellular origin from the counts matrix.

Libraries with very low or very high UMI counts based on SoupX contamination fraction plots were filtered out using sample-specific thresholds, specified by the mean contamination fration of each sample, as follows: 1260-6300

UMIs/cell for BBY2, BCR2, BCR3, BCR4, BCR5 and BCR6; 560-1780 UMIs/cell for

BBY3 and 1260-4470 nUMIs for BCR1. This left us with 8994 usable libraries that fell within the respective range for each sample. In addition, we removed libraries with >5% of read counts from mitochondrial genes, as the presence of these non-nuclear transcripts indicates both incomplete fractionation of nuclei and cellular stress.

Some droplets were found to retain expression of marker genes for multiple cell types even after cleaning with SoupX. These droplets most likely contain RNA from more than one nucleus or retain a large fraction of UMIs from extranuclear RNA after

SoupX correction. In order to remove these droplets from our analysis, we scored libraries on the number of cell type-specific genes found in their counts, using the following marker genes: Cldn10 for astrocytes; Gjb1, Aspa, Klk6 and Galnt6 for

20

oligodendrocytes; Gpr17 and Neu4 for polydendrocytes; Gpr34, Csflr, Fcrls, Siglech,

Mpeg1 and Ncf1 for microglia; Cldn5, Gkn3, Podxl, Clic5 and Sox17 for endothelial cells; Slc6a13, Col1a1 and Lum for fibroblasts; Ccdc153 and 1700016K19Rik for ependymal cells; Chat for cholinergic interneurons, Nos1 for somatostatin-positive interneurons, Kit for Parvalbumin and Prepronociceptin interneurons; and Ppp1r1b for all subtypes of medium spiny neurons. The markers for non neuronal cell types were selected based on their abundant expression in the specified cell types and the lack of expression in other types, as determined by the DropViz reference dataset. Therefore, for these markers, a threshold of 1 UMI/library was used to score droplets as containing

RNA from the respective non-neuronal cell type.

Fully exclusive markers for neuronal cell types were not identified in reference datasets. The markers selected were relatively more abundant in specific neuronal cell types. Therefore, we selected quantitative thresholds by plotting the distribution of normalized UMI counts for the neuronal sub-type markers Nos1, Kit and Ppp1r1b, revealing bimodal distributions. Based on these distributions, thresholds of 30 normalized UMI counts for Kit or Nos1 and 100 normalized UMI counts for Ppp1r1b were used to score droplets as containing RNA from Pvalb+ interneurons, Pnoc+ interneurons, and MSNs, respectively. Droplets were removed if they expressed above- threshold marker gene expression for more than one cell type.

Scoring and mitochondrial filtering in this manner helped eliminate 991 libraries that may have originated from contaminated droplets or multiple nuclei or had

>5% mitochondrial counts.

21

Normalization and batch effect correction

Starting from the adjusted UMI counts of high-quality nuclei, we applied within-sample normalization and across-sample batch effects corrections. In the absence of proper normalization, nuclei typically cluster by batch and sequencing depth rather than by cell type. First, UMI counts from each nucleus were normalized to log

(counts per million)80,81, which corrects for variation in the total number of UMIs sequenced per nucleus, then applies a log-normalization to improve normality. Then, we used MNN82 (Mutual Nearest Neighbors) to reduce batch effects across samples.

MNN determines the Euclidean distance between pairs of nuclei from different batches.

Nuclei are considered neighbors if they are among each other’s k nearest nuclei from the other batch. These pairs are then used to determine correction vectors that return corrected counts for all the nuclei in each sample. The inputs to MNN are centered and scaled normalized counts, typically using a curated set of highly variable genes that are informative across batches. A uniform set of highly-variable genes was selected by identifying the top 5000 most variable genes in each sample separately using the

FindVariableFeatures() function in Seurat, then taking the intersect across samples, resulting in a final set of 937 highly-variable genes. Expression levels for these highly- variable genes were centered and scaled in each dataset and the scaled matrices were used as inputs to MNN using the mnnCorrect() function in the scran R package.

mnnCorrect() requires several parameters, most importantly the number of neighboring nuclei used in the k-nearest neighbors algorithm (k) and the order in which samples are supplied to the algorithm. Values for these parameters were determined empirically by minimizing the number of nuclei assigned to sample biased clusters, or to clusters characterized by technical variables including high mitochondrial reads and

22

low or high UMI counts. I considered values of k ranging from 10 to 60 with increments of 10, and selected k=30.

Removal of sample BBY3

After correcting counts using SoupX and filtering to have only nuclei with less than 5% mitochondrial reads, I created a batch corrected tSNE plot of all the samples in order to obtain clusters and identify them using their markers. The plots colored by age showed that the 14 month old samples grouped separately from the 15 month old samples, which looked similar to a batch effect. This can be seen in Figure 5. On removing one of the two older samples at a time, it was discovered that BBY3 drove these batch effects and may have been an outlier due to lower nUMI counts.

The mnnCorrect() function normalizes samples sequentially, and is thus dependent on the order in which the samples are provided. Different orders of samples for MNN correct were also tried and colored by sample (Figure 6). In each case, BBY3 nuclei formed their own clusters.

Figure 5. tSNE of all eight of the 14 and 15 month old samples. Left: colored by the clusters identified by Seurat. Right: Colored by the age of the nuclei. A trail is formed where the younger samples group together and the older samples spread out.

23

Figure 6. Coloring tSNEs by sample for different variations of MNN. Since it was discovered that the results from MNN correction were dependent on the order of samples provided to the function, I tried different orders to cluster the nuclei and color by sample. Since any more significant batch correction threatened to diminish the cell type differences, removing BBY3 entirely from the analysis was the best possible solution.

This step resulted in 5429 high quality nuclei that could be batch corrected and coerced into cell type specific clusters.

To identify an optimized ordering for the remaining seven samples, 17 MNN iterations were performed using different sequential and randomized sample orders as well as attempts to collate randomized results for optimum clustering. In the final ordering, samples were in the order: BBY2, BCR1, BCR2, BCR3, BCR4, BCR5 and

BCR6.

Cell type clustering

The next step of the analysis was to cluster nuclei of the same cell type together.

Previous work has demonstrated that clustering based on shared nearest neighbors in principal component space is an efficient and robust method to identify clusters of molecularly similar cells in single-cell RNA-seq data83,84. We implemented this strategy using the RunPCA(), FindNeighbors(), and FindClusters() functions in Seurat v3.0 for our single-nucleus dataset.

24

The high dimensional matrices generated with snRNAseq have expression values of thousands of genes for each nucleus, a large portion of which may be correlated. Principal component analysis is a linear dimensional reduction method that has been recommended for the analysis of single-cell and single-nucleus RNA sequencing85,86. I used the SoupX corrected and scoring filtered cells to compute the principle components for this dataset. By definition, the first principal component (PC) accounts for the largest amount of variation in the nuclei, the second PC the second largest and so on. The maximum variation in the nuclei is usually captured by the first few PCs, which can then be used for further downstream analysis. The genes defining the top 10 PCs were analyzed in order to ensure that housekeeping or cell cycle genes did not define those PCs and they were biology driven. The top 10 PCs empirically captured the maximum variation and were used for further clustering and visualization of the nuclei.

Shared nearest neighbors (SNN) were determined using the Jarvis-Patrick algorithm87, implemented in the Seurat FindNeighbors() function. The SNN algorithm starts with a nearest neighbors graph that draws an edge between a point (node) a and its neighbor b if b is the closest node to a. The k closest neighbors to each node are found and edges are drawn along those nodes. This is the k nearest neighbor algorithm that is used by Seurat for clustering cells. The weights of the edges can be computed based on the number of neighbors that a and b share. A standard weight is then used as the threshold and all edges that do not achieve that weight can be eliminated, leaving a few well-connected clusters. It is possible to change the value of k in order to obtain meaningful clusters. In datasets with clusters of different densities, k must be large enough to pick up sparsely distributed points and yet small enough to separate dense, closely placed points that represent very similar (but not the exact same) cell type87,88.

25

Clusters were computed using values of k starting from 10 with increments of 10 in order to create SNN graphs and the k value of 30 produced clusters that represented all the cell types expected in the dataset.

Modularity is a quantification of the strength of the clusters and an assessment of the cluster assignment for each cell by comparing the density of the cluster with random cells from the dataset. Louvain modularity is an optimization method for cluster detection that conducts multiple iterations of reassignments on the outer cells of each cluster until it finds the best fit. I used Seurat to define clusters of our nuclei using the

SNN graph and Louvain modularity. A resolution of 0.4 was selected empirically based on separation of known major cell types.

The transformations that are done on the data and the clusters that are assigned are best interpreted when looking at the visual representations of the cluster assignments. In the recent years, t-distributed Stochastic Neighbor Embedding89 has emerged as an efficient method of visualizing the high dimensional data on a two- dimensional plot and is preferred over visualization techniques such as Isomap, SNE,

CCA, MVU, etc. by separating cells belonging to different clusters on the two dimensions much better while keeping them from stretching to infinity.

For this analysis, the RunTSNE() function from Seurat was used to compute the tSNE parameters using the default Barnes-Hut algorithm90 for the top 10 dimensions of the PCA. This was used as the standard practice for all the data visualization in this analysis.

Assigning cell type labels to molecular clusters based on marker genes

A list of the marker genes significantly enriched in each cluster relative to all other clusters was obtained using the FindAllMarkers() function in Seurat. Wilcoxon

Rank Sum tests were computed for all genes with >0.25 log( fold )-enriched expression

26

in each cell type. These marker genes were then used to assign a cell type label to each molecularly-defined cluster based on known marker genes from the literature and from recent atlases of single-cell gene expression, primarily the Dropviz database79,91

(dropviz.org) and confirmed using sources such as the Brain RNA Seq by the Barres lab92, Online Mendelian Inheritance in Man (OMIM), the Mouse brain atlas developed by the Allen institute93, the Mouse Genome Informatics94,95 and other sources in order to determine the function of the genes and which types of cells they are largely expressed in. Some cell types can be efficiently be labeled based on the expression of highly specific marker genes such as Trem2 in microglia and Drd1 and Drd2 in MSNs.

In addition Celltype Specific Expression Analysis (CSEA) tool92,96 was used to characterize enrichments of lists of marker genes for genes with specific expression in each major cell type based on translatome sequencing studies. Clusters being represented by more than two samples were chosen for further analysis.

Subtypes of medium spiny neurons with more subtle differences were identified through a second round of clustering within major cell types. Clusters that were annotated as medium spiny neurons were merged together to form a smaller dataset of

2837 nuclei. The top 2000 most variable genes were identified and scaled using the

FindVariableFeatures() and ScaleData() functions in Seurat. The clustering that followed was sensitive to minor variations and hence, the number of genes expressed

(nFeature_RNA) and the mitochondrial percentage (percent.mt) were variables that were regressed out of the analysis to prevent clustering to be driven by those. Different resolutions were used for clustering, from 0.2 to 3.0 to obtain D1 and D2 specific clusters. A resolution of 1.0 was chosen for these nuclei and the clusters were labeled by manual annotation of marker genes.

27

Annotating neurons

Drd1 and Adora2a are known to be highly specific markers for the two canonical MSN subtypes. In order to identify the changes that were being caused in these subtypes exclusively, libraries expressing exclusively Drd1 were labelled as the

D1 MSNs and libraries expressing exclusively Adora2a were labelled as the D2 MSNs.

Other neurons were labelled as the unknown group and compared to the known D1 and

D2 subtypes as well as the eccentric neurons.

Differential expression between genotypes

We identified cell-type specific differentially expressed genes (DEGs) in

HttQ175/+ vs. wildtype mice using Wilcoxon Rank Sum tests implemented with the

Seurat FindMarkers() function. The thresholds for fold change, minimum percentage of cells expressing the gene and the minimum number of cells in a group were all set to

0 in order to obtain lists that contained statistics on all the genes with detectable expression in each cell type. The list thus created contained the p-values for each gene, log fold change of expression between the HttQ175/+ vs. wild type cells, the percentage of cells where the feature was detected in each genotype and the adjusted p-value based on Bonferroni correction. These statistics were compared for each of the eleven cell types in the sample to determine the common genes showing differential expression in related and unrelated groups of cells. These were also used to find the pairwise correlation between all the cell types and for the interpretation of the results.

In order to determine the enrichment of gene sets from HDsigDB and the enrichment of (GO) terms for each of the 11 cell types, Dr. Ament used the geneSetTest() function from the R package limma, which ranks genes using the

(negative log10(p_val))*sign(avg_logfc) value and tests whether each gene is enriched

28

in a positive or negative association using the Wilcoxon signed-rank test. The p-value is then adjusted using a Bonferroni adjustment. We use the adjusted p-value or FDR in order to determine the gene sets or GO terms that are significantly enriched in each cell type.

Comparison with Reference Gene Lists

A study conducted by Peter Langfelder97 at UCLA to determine the gene co- expression modules in the striatum, cortex and liver of knock in mice with various CAG lengths using bulk RNA sequencing of tissues revealed 13 striatal and 5 cortical gene co-expression modules that were highly correlated with the number of CAG repeats and mouse age.

Nuclei were scored for the genes in each module using receiver-operator curves

(ROC) and the Area Under the ROC Curve (AUC) method was used to determine the activity of each of these modules within all the 5429 nuclei from our dataset using the

AUCell package in R. The method looks for bimodality in the ROC and a default threshold is set that passes only the nuclei forming the second peak. The nuclei that passed were then colored pink or red on the tSNE, based on the score (red being the highest scored nuclei for that module). The counts used for this analysis were log- normalized, centered and scaled prior to the analysis.

Trajectory analysis

To obtain the trajectory of the nuclei as they progress toward the difference observed in Huntington’s disease, I used the R package Monocle v24, which applies reversed graph embedding to order the nuclei along an unsupervised potentially branching trajectory. Monocle2 begins by representing the nuclei in a high dimensional

29

space and performing dimensional reduction. An initial trajectory is determined using these nuclei and the nuclei not on this trajectory are moved closer to the path using an optimizing method. These nuclei are represented in a high dimensional space again and the steps are reiterated until a satisfactory trajectory is obtained. Monocle2 then determines the root of this trajectory and numbers the branching points based on their distance from this root. Finally, all nuclei are ordered along the pseudotime, an estimation of how far a nucleus has progressed along its specific molecular fate. This method is known as the DDRTree method and does not require prior knowledge of the number of branches being expected in a trajectory. Since the genes to define these paths are unsupervised, the trajectories formed are solely based on the transcriptional changes observed in the cell type.

Monocle begins by creating a CellTypeHierarchy object that contains the gene counts matrix as well as additional metadata about the individual libraries and each gene. Genes that have been differentially expressed as a response to the change in Htt genotype are selected using the differentialGeneTest() function. This step can be unsupervised or semi supervised, meaning that the list can be entirely generated using machine learning approaches or the user can provide a list of known differentially expressed genes and Monocle will compute a larger list of related genes specific to that dataset. For this analysis, we use a completely unsupervised approach to allow Monocle to discover its own DEGs out of genes that are expressed in 10 or more nuclei in the entire dataset. Next, the high dimensional data is projected into two dimensions using the DDRTree method, that will make it possible to visualize the data with the reduceDimension() function. The trajectory is inferred for the nuclei, which are then ordered along the inferred trajectory using the orderCells() function. The root of the tree is determined and branches are numbered as they move along the process of nuclei

30

reaching a more developed molecular stage. The plot_cell_trajectory() function can be used to color the nuclei as they line along their trajectory based on the metadata provided. Here we color our nuclei by their respective Htt genotype to visualize the difference in progression of HttQ175/+ and wild type nuclei belonging to each cell type.

The trajectory of the overall neuron population is plotted, along with the astrocytes, Chat interneurons, D1 and D2 MSNs, endothelial cells, eccentric MSNs, microglia, oligodendrocytes, polydendrocytes, Pnoc and Sst interneurons.

31

RESULTS

Specific Aim 1: Molecular identification of cell types in the mouse striatum using single nucleus transcriptome data

Identification of cell types in the mouse striatum

We sequenced the transcriptomes of 13,819 cellular nuclei from the striatum of 14-15 month-old HttQ175/+ vs. wildtype mice, using droplet-based single-nucleus RNA-seq.

Following extensive QC and batch-effects correction (Methods), we analyzed 5,429 high-quality nuclei, including 3,968 nuclei from four HttQ175/+ mice and 1,461 nuclei from three wildtype mice (Table 1).

32

Table 1. Cell counts after each step of QC

33

Maximum-modularity clustering of these 5,429 nuclei with Seurat revealed 17 clusters of molecularly similar cell types. Clusters were then annotated based on known cell-type markers79,92,96,98. We manually removed three clusters whose marker genes indicate lower-quality, damaged nuclei, as indicated by high counts of mitochondrial genes, mixed markers from multiple cell types, and/or bias toward a single sample, and we merged three clusters of medium spiny markers with highly overlapping marker genes. Each of the final ten clusters formed by 4524 high quality nuclei corresponded to a single cell-type, with all major neuronal and non-neuronal cell-types represented in our dataset (Figure 7).

The top level clustering revealed 2837 nuclei from four clusters corresponding to all canonical MSNs (including both D1 and D2 subtypes) based on markers such as

Gria2, Pde10a, Calm2 and Rgs9*, as well a distinct cluster identified as eccentric

MSNs comprising of 166 nuclei (about 5.5% of all the MSNs in the dataset, similar to the percentage of eccentric MSNs found by Saunders et. al.79) marked by the expression of Otof (FDR = 1.73 e-48), Casz1 (FDR = 2.87 e-68) and Atp2b4 (FDR = 1.34 e-61), which are all eccentric MSN specific markers based on the study by Saunders et al79. Clusters specific to cholinergic (73 nuclei, marker Chat (FDR = 0)), somatostatin (288 nuclei, marker Sst (FDR = 6.73 e-129)) and parvalbumin (120 nuclei, marker Kit (FDR = 0)) interneurons were also identified. The non-neuronal cell types included astrocytes (300 nuclei, marker Slc1a2 (FDR = 0)), microglia (82 nuclei, marker Cx3cr1 (FDR = 4.18 e-

167)), oligodendrocytes (468 nuclei from two clusters, markers Plp1 (FDR = 0 and 6.54 e-292) and Mbp (FDR = 0 and 4.88 e-151)), endothelial cells (112 nuclei, marker Flt1

(FDR = 0)) and polydendrocytes (also known as oligodendrocyte progenitor cells,

OPCs) (78 nuclei, marker Pdgfra (FDR = 0)). Importantly, at this resolution, nuclei in all clusters were grouped by cell type and not by a mouse’s genotype or other technical

34

or biological variables.

* The markers mentioned here are not the only markers that the cell type assignment

was based on, rather the representative markers for each cluster.

Figure 7. tSNE of 4524 high quality single nuclei, colored by cell type, sample and genotype

35

Identification of neuronal sub-types

The neuronal and oligodendrocyte clusters were separately regrouped and subclustered as defined in the methods section to determine if meaningful subtypes could be identified. On analysis of the markers and their p-values, the oligodendrocytes did not subcluster into reproducible clusters and the original cluster annotation could be used as a suitable representation of all the nuclei in that cluster.

On the other hand, subclustering of the neuronal nuclei showed interesting results. Multiple parameters were varied in order to separate D1 and D2 MSNs into individual clusters but in each case, the differences in the Htt genotype was the driving factor for the subclustering. Figure 8 shows the neuronal subclusters colored by the different variables. It is interesting to note the stark separation of the HttQ175/+ and wild type clusters. This extreme genotype effect can be attributed to the late timepoint for mice with extremely large CAG repeats. From this data, we can conclude that the significance of the Htt genotypic changes masked the significance of the MSN sub type differences at this age.

36

Figure 8. 2873 neurons clustered and colored by A: cluster number; B: Sample of origin; C: Htt genotype; D: number of genes expressed by each nucleus; E: Expression of Drd1 and F: Expression of Adora2a Subclusters were also attempted to be annotated by using the same classification as the McCarroll dataset using Metaneighbor. The clusters did not show high similarity to any of the series 10 (D1), 11 (D2) or 13 (eccentric) neurons. This could be attributed to the significant batch differences between the McCarroll, which is a single cell dataset, and our dataset, which is a single nucleus dataset, or due to the distortion of signal because of the presence of HttQ175/+ nuclei. Saunders et al.79 also find that there are relatively smaller markers differentiating between the D1 and D2 MSNs, which could be used to explain the inability to differentiate between these subtypes using algorithmic approaches.

In order to identify real D1 and D2 MSNs, I annotated nuclei from the 2837 canonical MSNs in this dataset based on their expression of Drd1 and Adora2a. In 784

MSNs, I identified transcripts for Drd1 but not Drd2, and labeled these nuclei D1

37

MSNs. In 464 MSNs, I identified transcripts of Adora2a but not Drd1, and labeled these nuclei D2 MSNs. In an additional 1569 nuclei from the MSN cluster, I identified transcripts from neither Drd1 nor Adora2a, most likely reflecting dropout effects. 20

MSNs expressed both markers but were otherwise more similar to canonical MSNs than to eccentric MSNs. The presence of both markers could be attributed to a few residual extra nuclear reads contaminating the sample. These 20 neurons were separated from further analyses and were only part of the overall canonical neuron group.

The final cell types and contribution from each sample and genotype to each cell type has been shown in table 2. We observe that the percentage of Chat

Interneurons, Microglia and the identified D1 and D2 MSNs is significantly different between genotypes. The Chat interneurons are a small population in both genotypes, and this difference is almost surely technical. In case of the microglia and the identified canonical MSNs, these differences may translate to biological changes. In the identified

D1 and D2 MSNs, we see in the following sections that the molecular identity of these cells is lost with the progression of disease. Thus, the decrease in identified MSNs and the increase in the Unknown population may be a biological effect. In the case of microglia, a larger population in the HttQ175/+ genotype may reflect invasion of striatum by microglia.

It is important to note, however, that the number of nuclei shown in Table 2 are not a very accurate representation of the cell type composition in the striatum. This is due to biases in the nucleus isolation criteria and in the informatics QC and filtering criteria.

Figures 9 and 10 show the distribution of the normalized counts of select marker genes for each final cell type in the form of violin plots.

38

Table 2. Distribution of cell types across samples and genotypes.

39

Figure 9. Distribution of markers across cell types

40

Figure 10. Distribution of markers across cell types

41

Specific Aim 2. Identify cell type-specific molecular changes in a mouse model of the HD mutation

Identifying Differentially Expressed Genes

We used Wilcoxon signed-rank tests to evaluate genes expression changes in the

HttQ175/+ vs. WT nuclei of each cell type, followed by gene set enrichment analyses with curated gene lists from the HDsigDB70 Mus musculus dataset and Gene Ontology

(GO) terms.

In astrocytes, we identified 9 down-regulated genes and 2 up-regulated genes at a False Discovery Rate (FDR) < 0.05, as well as 1187 up regulated and 386 down regulated genes at a more lenient unadjusted p-value < 0.05. Genes up-regulated in astrocytes from HttQ175/+ mice were enriched for human HD related down regulated gene sets in the caudate nucleus like “HD markers in human caudate nucleus [down]”99

(adjusted p-value = 4.39 e-54), neuron specific gene sets such as “Top neuron-enriched genes in humans and mice”100 from a cortex dataset (adjusted p-value = 4.38 e-75), as well as synaptic markers such as “Synaptic neuropil transcriptome”101 (adjusted p-value

= 3.52 e-41). Down-regulated genes were enriched for gene sets such as “Top astrocyte- specific genes in humans and mice”100 (adjusted p-value = 2.76 e-81) and “HD markers in human caudate nucleus [up]”102 (adjusted p-value = 1.27 e-13). Figure 11 shows the

Figure 11. Distribution of genes Snap25 and Bcan in astrocytes

42

distribution of Snap25 and Bcan, which are representative genes from the enriched up and down regulated gene sets respectively. Snap25 is also associated with synapse related GO terms and codes for a presynaptic plasma membrame protein that is known to regulate neurotransmitter release and has been found to be differentially expressed in previous HD studies as well. The up-regulated genes in HttQ175/+ astrocytes were enriched for GO terms like “synapse” (FDR = 4.08 e-25), “axon” (FDR = 7.11 e-19),

“neuron projection” (7.22 e-19) and “dendrite” (FDR = 9.82 e-15).

D1 MSNs had 15 up-regulated and 164 down-regulated genes at FDR < 0.05 and 567 up regulated and 1883 down regulated genes at an unadjusted p-value of < 0.05. The up-regulated genes in the HttQ175/+ D1 MSNs were enriched for other up regulated gene sets in HD, derived from studies such as the Langfelder study97, for example, “CAG length-dependent genes in 6 mon mouse striatum” (adjusted p-value = 2.02 e-136) and

“Genes up-regulated in striatum of HdhQ175 mice”103 (adjusted p-value = 9.27 e-79) as well as the Langfelder Striatum RNA M20, M43, M39 and M4 Modules, among the top 30 hits (adjusted p- values = 7.11 e-55, 4.27 e-12, 1.73 e-10 and 3.69 e-09 respectively). The down regulated genes were enriched for most other down regulated gene sets in HD, for example, “CAG length-dependent genes in 10 mon mouse striatum

[down]”97 (adjusted p-value = 1.41 e-210) and the Langfelder Striatal M2 module

(adjusted p-value = 1.55 e-196).

43

Figure 12 shows the distribution of genes Rsrp1 and Rgs9, which are representative genes from the enriched gene sets of up- and down- regulated genes respectively. GO terms enriched for the down-regulated genes were “neuron spine”

(FDR = 0.0153) and “dendritic spine” (FDR = 0.048). We did not find any GO terms significantly enriched for the up-regulated genes.

Figure 12. Distribution for genes Rsrp1 and Rgs9 in D1 MSNs

In D2 MSNs, 10 genes were up-regulated and 64 were down-regulated at FDR < 0.05 and 462 genes were up-regulated and 1177 down-regulated at unadjusted p-value < 0.05 in D2 MSNs. Similar to the D1 MSNs, the up-regulated genes in D2 MSNs from

HttQ175/+ mice were also enriched for HD specific up-regulated gene sets at various ages such as “CAG length-dependent genes in 6 mon mouse striatum [up]”97 (adjusted p- value = 4.03 e-114), “Up-regulated genes in striatum of 10 mon HD Q140 mice v/s

Q20”97 (adjusted p-value = 4.71 e-69) and “Expression in 9 wk R6/2 HD mice [up]”104

(adjusted p-value = 3.18 e-25). These were also enriched in the Striatal M20 and M39 modules from the Langfelder study97 (adjusted p-values = 2.24 e-47 and 6.62 e-15 respectively). Down-regulated genes were enriched for gene sets that were down regulated in other HD studies as well, for example, “Down-regulated genes in striatum of 6 mon HD Q175 mice vs Q20”97 (adjusted p-value = 2.28 e-164) and the Langfelder

44

Striatal M2 module (adjusted p-value = 1.8 e-160). Figure 13 shows the distribution of genes Meg3 and Pde10a, which are representative genes present in the enriched gene sets of the up and down regulated genes respectively. The down regulated genes were also associated with the GO term “proton transmembrane transport” (FDR = 0.03).

Figure 13. Distribution of genes Meg3 and Pde10a in D2 MSNs

Probably the most interesting cell type are the recently discovered Eccentric MSNs which had one up-regulated and 5 down-regulated genes at FDR < 0.05 and 121 up- regulated and 694 down-regulated genes in HttQ175/+ nuclei at a p-value < 0.05. Among the down regulated genes, Ryr1 codes for a protein that forms channels within the cell for the transport of calcium ions, which is necessary for muscular movement and is associated with muscular disorders and depression, Baiap2 is involved in cytoplasmic signal transduction, as an insulin receptor substrate, cytokinesis, regulation and reorganization of the actin cytoskeleton and is associated with diseases such as

Dentatorubral-pallidoluysian atrophy and Attention Deficit Hyperactivity Disorder.

The function of Inf2 is to sever actin filaments and increase the rate of polymerization and depolymerization.

The geneSetTest() results for Eccentric MSNs revealed that the up-regulated genes in the HttQ175/+ nuclei were enriched for 15 gene sets, that were enriched in the

45

D1 or D2 MSNs as well, including “Up-regulated genes in striatum of 6 mon HD Q175 mice vs Q20”97 (adjusted p-value = 1.77 e-18), “Expression in Hsp90 inhibitor treated

R6/2 HD mice vs WT mice [up]” (adjusted p-value = 1.67 e-10) and the Langfelder

Striatum M20 module97 (adjusted p-value = 7.29 e-05). The 49 significantly enriched gene sets in HttQ175/+ eccentric MSNs for the down-regulated genes were also enriched in the canonical MSNs and were sets of down regulated genes in HD in past studies, including the Langfelder M2 module (adjusted p-value = 1.57 e-30). Figure 14 shows the distribution of Meg3 and Inf2, which are representative genes in the enriched gene sets of the up and down regulated genes respectively.

Figure 14. Distribution of genes Meg3 and Inf2 in Eccentric MSNs Endothelial cells have two down-regulated genes at FDR < 0.05 and 203 up-regulated and 523 down-regulated genes with p-value < 0.05. The up-regulated genes in the

HttQ175/+ nuclei were enriched for gene sets such as “Top neuron-enriched genes in humans and mice”100 (adjusted p-value = 1.92 e-36), “Genes up-regulated in control iPSCs after 80 d of differentiation”105 (adjusted p-value = 1.18 e-17), which are described as a “list of up-regulated genes in HD vs control Induced Pluripotent Stem

Cells (iPSCs) after 80 d compared to 0 d of cortical differentiation” and “Top OPC- enriched genes in humans and mice”100 (adjusted p-value = 2.03 e-09). Down-regulated genes were enriched for endothelial gene sets such as “Top endothelial cell-enriched

46

genes in humans and mice”100 (adjusted p-value = 2.1 e-101) and HD gene sets such as

“HD markers in mouse STHdh cells”106 (adjusted p-value = 1.68 e-16).

Up-regulated genes in HttQ175/+ endothelial cells were associated with GO terms such as “trans-synaptic signaling” (FDR = 4.12 e-03), “intrinsic component of synaptic membrane” (FDR = 1.12 e-02), “synaptic signaling” (FDR = 1.17 e-02) and other synapse related terms. Down regulated genes were associated with GO terms such as

“cardiovascular system development” (FDR = 8.87 e-11), “blood vessel morphogenesis” (FDR = 9.01 e-11), “blood vessel endothelial cell migration” (FDR =

3.06 e-03) and other migration and morphogenesis related terms as well as

“angiogenesis” (FDR = 2.04 e-07), the lack of which is known to play a role in neurodegeneration107.

Somatostatin interneurons had two down-regulated genes with FDR < 0.05 and 181-up regulated and 793 down-regulated genes with p-value < 0.05. The up-regulated genes in the HttQ175/+ nuclei were not enriched for any gene sets based on the geneSetTest() adjusted p-values. The down-regulated genes were enriched for gene sets such as

“Genes expressed in embryonic serotonergic neurons”108 (adjusted p-value = 3.56 e-

04) and “Down-regulated genes in striatum of 6 mon HD Q140 mice vs Q20”97

(adjusted p-value = 0.023).

The Microglia did not have any differentially expressed genes that satisfied an FDR threshold of < 0.05 but the up-regulated genes were enriched for other up-regulated gene sets derived from bulk RNA seq HD studies, such as “CAG length-dependent genes in 6 mon mouse striatum [up]”97 (adjusted p-value = 1.76 e-08) and the down- regulated genes were enriched for microglia specific genes such as “Top microglia-

47

specific genes in humans and mice”100 (adjusted p-value = 7.67 e-11) and the down regulated counterparts of the same HD related gene sets, such as “CAG length- dependent genes in 6 mon mouse striatum [down]”97 (adjusted p-value = 1.23 e-07) and the Langfelder Striatal M2 module (adjusted p-value = 2.44 e-05).

Similar to the microglia, the oligodendrocytes also did not have any differentially expressed genes that satisfied FDR < 0.05 but the up-regulated genes were enriched for

HD related gene sets such as “Up-regulated genes in striatum of 10 mon HD Q140 mice vs Q20”97 (adjusted p-value = 2.12 e-52) and the Langfelder Striatal modules M20,

M30 and M46 (adjusted p-values = 3.76 e-28, 1.07 e-08 and 1.17 e-07 respectively).

Down-regulated genes were enriched for oligodendrocyte markers such as “Top oligodendrocyte-enriched genes in humans and mice”100 (adjusted p-value = 1.07 e-86) and the Langfelder Striatal M11 module (adjusted p-value = 7.95 e-25) that was highly enriched with glial genes, along with a few other HD related down-regulated gene sets.

The up-regulated genes in HttQ175/+ oligodendrocytes were associated with the GO term

“nucleoplasm part” (FDR = 4.44 e-04) and the down regulated genes were associated with terms such as “myelin sheath” (FDR = 5.94 e-03) and “cytosolic ribosome” (FDR

= 1.68 e-02).

The up-regulated genes in Parvalbumin interneurons were enriched for HD related gene sets such as “Genes changed in striatum of HdhQ175 mice”103 (adjusted p-value = 1.18 e-04) as well as large oligodendrocyte and microglia specific sets (adjusted p-values =

2.48 e-03 and 5.04 e-03).

48

Down-regulated genes in Polydendrocytes were enriched for OPC marker sets such as

“Top OPC-enriched genes in humans and mice”100 (adjusted p-value = 2.11 e-09).

Overlap Between Differentially Expressed genes between cell types

A list of 391 DEGs that had an adjusted p-value of less than 0.05 in any cell type was created and used to plot the correlation of log fold change between cell types (Figure

15).

Figure 15. Pairwise correlation between the log fold ratios of the 391 significantly differentially expressed genes between each combination of the 12 cell types, showing the correlation coefficient and the p-Values of correlation

49

D1 and D2 MSNs had a significantly high correlation with each other (r = 0.82,

p-value = 2.2e-96) and with the eccentric MSNs (r = 0.64, p-value = 1.9e-54 and r =

0.58, p-value = 4.6e-40 respectively). Figure 16 shows enlarged plots of correlation

Figure 16. Correlation between D1 MSNs and D2 MSNs on the left and between D1 MSNs and Eccentric MSNs on the right

between D1 MSNs and D2 and Eccentric MSNs respectively. Their correlation with

the unknown MSNs was also extremely high but not completely linear, indicating that

the unknown population was a mix of both types of canonical MSNs.

A medium strength correlation was also discovered between the D1 and D2

MSNs and the microglia, oligodendrocytes and somatostatin interneurons. Figure 17

shows enlarged plots of the correlation between D2 MSNs and Oligodendrocytes and

D2 MSNs and Somatostatin interneurons.

While oligodendrocytes are positively correlated with D2 MSNs, we can see

that there is a significant portion of nuclei that are also negatively correlated. We

explore the relation between up and down regulated genes between cell types in the

next section.

50

Figure 17. Correlation between D2 MSNs and Oligodendrocytes on the left and D2 MSNs and Sst interneurons on the right

The somatostatin interneurons have a positive correlation with the D2 MSNs as

well, but the spread of the plot indicates that the overlap is not driven by a single set of

genes. We also see that the interneurons have very little to no correlation with each

other. We found that the strength of nuclei of each cell type and the rigid p-value

threshold, irrespective of the number of nuclei contributing to each population, could

be the cause for the lack of resolution and hence, the correlation between the observed

values.

51

Rank-Rank Hypergeometric Overlap

In order to determine if variable p-value cutoffs for each cell type could be used to determine the overlapping Differentially Expressed up- and down-regulated genes in the HttQ175/+ nuclei between cell types, Dr. Seth Ament conducted a Rank Rank

Hypergeometric Overlap analysis. The heatmaps generated are shown in figure 18.

Figure 18. Overlap among down regulated genes on the left and up regulated genes on the right.

The heatmaps show a higher overlap among the up-regulated genes between most cell types. Overlapping genes are associated with GO terms such as “trans- synaptic signaling” and “synapse part” between astrocytes and endothelial cells. Up regulated genes at non adjusted p-values < 0.05 were found to be enriched for terms like “histone methylation”, “establishment of blood brain barrier”, "gated channel activity", “synaptic membrane”, "chromatin binding”, “glutamatergic synapse” and

“polyubiquitin modification-dependent protein binding" across four or more cell types.

These terms possibly indicate some epigenetic changes in some cell types, along with a change in the synaptic activity, especially across non neuronal cells.

52

Down regulated genes overlap to a large extent between the D1 and D2 MSNs

and to a smaller extent between D1, D2 and the glial cell types including the

oligodendrocytes and microglia as well as the Sst interneurons. The overlap between

the D1 and D2 MSNs and between the canonical MSNs eccentric MSNs can be seen in

figure 19. GO terms associated with down regulated genes at non-adjusted p-value <

0.05 in four or more cell types are "proton transmembrane transporter activity",

"calcium ion export across plasma membrane", "G protein-coupled receptor activity",

"regulation of glutamate receptor signaling pathway", "calmodulin binding",

"cognition", “locomotory behavior”, “learning and memory”, "actin filament bundle

assembly" , “behaviour” and other terms expected to be seen down regulated in

neurons. A B

C

Figure 19. Rank Rank Hypergeometric overlap between A. D1 and D2 MSNs; B. D1 (x) and Eccentric MSNs (y) and C. D2 (x) and Eccentric MSNs (y)

The bottom left corner in both plots shows the overlap in the up regulated genes

and the top right corner shows the overlap in the down regulated genes. We see a higher

overlap between the down regulated genes in these comparisons, which contrasts with

53

the comparison between most other cell types. This can be attributed to the fact that the

MSNs have a larger number of down regulated genes than up regulated genes.

Comparing the MSN populations to the interneuron populations, we find that

the Sst interneurons have the highest overlap of down-regulated genes with the

canonical neurons. The Chat interneurons show some overlap of upregulated genes and

the Pnoc interneurons have almost no overlap. This can be seen in figure 20.

Figure 20. Comparison of the interneuron populations, the Chat, Pnoc and Sst interneurons on the x axis and the canonical MSNs on the y axis The interneuron populations showed very little to no overlap in the DEGs with

each other.

Following the results from correlation analysis described in the previous

section, we examined the plots for overlap between the D2 MSNs and

oligodendrocytes. This RRHO plot is seen in figure 21.

54

Figure 21. Overlap between D2 MSNs on the x axis and Oligodendrocytes on the y axis

It is interesting to note that there is a high overlap in the up regulated genes as well as moderate overlap in other genes between the two cell types. This moderate overlap could be the cause for the “tail” that was seen in the correlation plot in figure 17.

55

Overlap in multiple cell types

Genes that had a raw p-value of less than 0.05, were also compared across multiple cell

types to determine differentially expressed genes that were overlapping between cell

types using Venn diagrams as shown in figure 22. Gene Eif4g1, which codes for a

subunit of the translational machinery of a cell, was found to be differentially expressed

in all cell types except the Pnoc interneurons and the polydendrocytes. This gene has

been found in 180 gene sets form HDsigDB, being implicated in HD related gene sets

such as “HD markers in human caudate nucleus”102, “Expression in striatum from 10

A B

C D

Figure 22. Venn diagrams showing the number of common DEGs between different cell types. A: all neuronal and interneuron types; B: all non neuronal subtypes; C: all neuron subtypes; D: all interneuron subtypes

56

wk HdhQ111 HD mice”109, “Expression in striatal cells of HdhQ111 HD mice

[down]”110 and “Genes down-regulated in HdhQ111 striatal cells vs HdhQ7"111.

Similarly, gene Tmem91, which codes for a transmembrane protein and belongs to the

Dispanin subfamily C, was also differentially expressed in all cell types except the Chat and Sst interneurons. Tmem91 has been associated with Meckel Syndrome Type 10 and is present in 118 of the mouse gene sets from HDsigDB, including “Up-regulated genes in striatum of HD Q175 mice "97 and the Langfelder module M997 which was enriched for mitochondrial and post synaptic density proteins and was positively associated with

CAG length.

The known canonical MSNs and all three interneuron subtypes did not share any common DEGs. However, this may be a result of the low resolution provided by the Chat interneurons. Thus, we looked at the eight DEGs shared by the D1, D2, pnoc and sst neurons. These included Tmem57, Phf14, Abhd17a, Slc25a3, Ewsr1, Snhg20,

Klhl29 and Malat1. All of these genes have been implicated as differentially expressed genes in previous HD studies, except Tmem57, which is a transmembrane protein of unclear significance.

All the glial cell types shared only three significant DEGs, which were Tmem91,

Meg3 and Onecut2. The canonical MSNs and eccentric neurons share 141 DEGs, the most significant ones of which were: Meg3, Inf2, Pde10a, Malat1, Ryr1 and Baiap2.

All of shared glial and canonical MSN DEGs were found to be differentially expressed in HD studies in the past.

We see a similar trend in the interneurons as we did in the previous tests, wherein the interneurons have only one DEG that is significant in all three identified subtypes, Tmem59l which is a transmembrane protein that controls the maturation and

57

transport of APP. The remainder of the significant responses in the interneurons do not have a high overlap between categories either.

Comparison with Reference Gene Lists

Previous gene expression studies of bulk striatal and cortical tissues from HD knock-in mice revealed 13 striatal and 5 cortical gene co-expression modules that were highly correlated with the number of CAG repeats and mouse age84. These modules are referred to as the Langfelder modules for the purposes of this text. A receiver operator curve was created as seen in Figure 23. The cells that passed the default threshold (red vertical lines) for each module are colored in dark blue.

58

Figure 23. ROC curves for the enrichment of the Langfelder modules in our data. The module has been denoted at the top of each plot along with information on whether it was up or down regulated in their dataset. The algorithm scored each nucleus based on the expression of the gene modules and plotted the scores. The scores for these cells for each module were then used to color the tSNE plots. We see these in figure 24. The Langfelder Module M2 is a striatal module and was annotated by the group as having cAMP signaling, post synaptic density proteins, and caudate marker genes. From the Langfelder paper, Module M7 was made of genes involved in stress response and cell death and M11 and M52 were

59

annotated as glia related. The tSNEs can be used to determine the exact cell type these genes were enriched in.

Figure 24. tSNEs colored by the AUCell thresholds for the Langfelder modules. The pink or red cells are the ones that passed the threshold. The light blue cells were the ones closest to the threshold line to its left. From Figure 23 A and Figure 24 A, we saw that Module M2 was enriched in the canonical MSN clusters and to some extent in the eccentric MSNs. This is expected since these are neuronal clusters and Module M2 is a large module consisting of largely down regulated genes. Subfigure B showed that M7 was found to be best represented in one of the oligodendrocyte clusters along with subtle representation in the other glial clusters. M11 could be classified as an overall glial cluster with enrichment in

60

oligodendrocytes, astrocytes and endothelial cells. M52 was highly and almost exclusively enriched in the endothelial cells, implying that it may be a more endothelial cluster than glial.

Trajectory analysis via Monocle

We performed the trajectory analysis of each cell type using Monocle112 and colored it by their Htt genotype. The trajectories are shown in Figure 25.

61

Figure 25. Trajectories followed by each of the cell types generated using Monocle2

In the case of astrocytes, D1 and D2 MSNs, and oligodendrocytes, we found that one of the components (on the x or y axis) used to define the trajectories separated the nuclei on the basis of the genotype with one end of the component being dominated by the

HttQ175/+ nuclei and the other end being dominated by the WT nuclei. This supports the

62

hypothesis that normal cells are present in the HttQ175/+ striatum along with cells that express genes differentially and that this change is gradual, with transitional states present.

Dr. Seth Ament plotted a single component of the trajectory of the pooled canonical MSN population in the form of violin plots for these nuclei, as shown in

Figure 26.

Figure 26. Violin plots of a component of the canonical MSN trajectory that shows a population of Q175 MSNs that are similar to wild type MSNs and intermediate stages that lead to a completely different stage in Q175 nuclei that cannot be seen in most wild type nuclei Here, we see that there is a population of HttQ175/+ MSNs that are indistinguishable from the wild type MSNs, and intermediate stages are present that lead to a different population of HttQ175/+ MSNs that are discreet from the majority of the wild type neuron population. This demonstrates that even at an endpoint study like this, there exists a population of MSNs that is akin to the normal MSNs in the wild type mice.

63

DISCUSSION

In a first of its kind study of single nucleus transcriptomics of the striatums of

Huntington’s Disease mice, using the currently available methods and new approaches, we identified the original cell type of each nucleus present in the striatum samples of four HttQ175/+ and three wild type 14 and 15 month old mice. Identified cell types included the recently described eccentric MSNs by Saunders et al79 from their large- scale single-cell RNA-seq of striatum from juvenile mice. Our dataset confirms that these MSNs differ significantly from the canonical D1 and D2 Medium spiny neurons, while still expressing some of their specific markers such as Drd1 and Adora2a and persist in the aging brain. Due to the clear demarcation of these MSNs, we did not include them with the other MSNs when subclustering. We identified differentially expressed genes in the HttQ175/+ nuclei for nearly all the cell types present in our dataset and determined enriched gene sets for up and down regulated genes. We found the correlation between the DEGs for each cell type as well as the overlapping genes using variable p-value thresholds.

The trajectories for the more abundantly available cell types provide evidence that even at 15 months of age, cells exist in the HttQ175/+ mice that cannot be distinguished from wild type cells, along with intermediate stages leading to completely distinct states of the same cell type in the HttQ175/+ samples. A length dependent somatic expansion hypothesis proposed by Kaplan et al.113 suggests that symptoms appear when enough cells reach a pathological threshold of CAG repeats, which in the case of HD is predicted to be 115 repeats. If we are to assume a larger pathological threshold for mice, the HttQ175/+ MSNs that appeared normal may not have expanded enough CAG tracts to be different from the wild type MSNs, while the intermediate and final stages seen in Figure 26 may have intermediate and pathologically significant number of repeats

64

respectively. This may also explain the inability to observe typical symptoms of HD in mice, since most mice would die by the time a majority of the MSNs have enough somatic expansions to reach the mouse pathological threshold. Unfortunately, due to technical limitations with 10X Sequencing, we cannot determine the number of repeats of CAG in each cell and find the correlation. However, future studies can focus on capturing the CAG length to test this hypothesis.

Differentially expressed genes between HttQ175/+ and wild type cells for the medium spiny neurons have been implicated in previous studies and are known to be associated with other muscular and neurodegenerative disorders. We found that gene expression changes in eccentric MSNs from HttQ175/+ mice had a significant overlap with those in canonical MSNs and the overlap among down regulated genes was stronger than that in up regulated genes. Future analyses will focus on comparing the magnitude of these changes in each MSN type. It may be meaningful to conduct single cell or single nucleus sequencing on physically sorting of D1, D2 and Eccentric MSNs

(for example, FACs sorting) in order to confidently label libraries for a cell type since the D1 and D2 specific genes are down regulated in HttQ175/+ cells and the differences are masked by Htt genotypic differences. Looking at the expression levels of the MSN sub types in younger mice may also reveal interesting results.

Bulk RNA sequencing methods have been ineffective in being able to elucidate the responses to the change in Htt genotype in rare cell types such as the striatal interneurons. Here, we find that the striatal cholinergic and parvalbumin interneurons differ significantly from the MSNs in their response to the change in Htt genotype. The

Sst interneurons share some response in the down regulated genes with the canonical neurons but not as much in case of the up regulated genes.

65

However, there is nearly no significant overlap in the differentially expressed genes between the three populations of interneurons themselves, implying different responses to HD. A possible explanation for this lack of overlap could be the small sample sizes for the Chat and Pnoc interneurons. However, there was no significant overlap in the differentially expressed genes even when a raw p-value threshold of <

0.05 was used, indicating a general lack of overlap in their response.

Gene set enrichment analysis of non-neuronal cell types provides evidence that the cell type specific and enriched genes were down regulated in the HttQ175/+ cells for each cell type. Astrocytes and endothelial cells were also enriched for genes associated with GO terms related to synapse and synaptic signaling. These findings indicate a clear involvement of these cell types in the progression of disease and their investigation in high powered studies may lead to novel discoveries in the field.

One of the largest limitations of this study is the low resolution for rare cell types in our dataset. The canonical neurons dominate the final dataset with nearly 60% of the nuclei representing this population. We are also under powered to accurately determine the responses in cell types such as the Chat interneurons and the microglia due to under representation of the wild genotype (~9% and ~19% of the total of these types of nuclei are wild type.)

Results from various tests indicate that there are some residual ambient RNA transcripts in our data, despite using conservative QC methods, which may be masking subtle responses and changes. The estimation of the contamination fraction of each sample and the removal of ambient transcripts is highly dependent on a manually curated list of mutually exclusive genes. While this may be possible for some cell types, such a list is difficult to compile when attempting to remove ambient transcripts from

66

droplets containing the same cell types, for example, removing ambient neuron transcripts from droplets containing neuron nuclei. Since nearly all single cell and single nucleus datasets contain some level of ambient RNA, tools to accurately determine and remove this contamination are the need of the hour.

We will be conducting further analysis on this dataset of 14 and 15 month old mice in order to determine the subtle differences in the decently powered cell types, as well as continue to construct trajectories using other methods available in order to remove technical variates and find the genes involved in the development of the biological intermediates that lead to the phenotypes seen in HttQ175/+ mice.

We recommend that future studies using single nucleus RNA sequencing for

Huntington’s Disease conduct the analysis on younger samples at different age groups in order to determine early changes that occur in order to reach the genotypic expressions seen in our dataset.

67

REFERENCES

1. Longrigg J. Epilepsy in ancient Greek medicine—the vital step. Seizure. 2000;9(1):12-21. doi:10.1053/seiz.1999.0332

2. Manyam BV. Paralysis agitans and levodopa in “Ayurveda”: ancient Indian medical treatise. Mov Disord. 1990;5(1):47-48. doi:10.1002/mds.870050112

3. Zhang Z-X, Dong Z-H, Román GC. Early descriptions of Parkinson disease in ancient China. Arch Neurol. 2006;63(5):782-784. doi:10.1001/archneur.63.5.782

4. Parkinson J. An Essay on the Shaking Palsy. JNP. 2002;14(2):223-236. doi:10.1176/jnp.14.2.223

5. Huntington G. On chorea. The Medical and Surgical Reporter of Philadelphia. cited in H Skirton (2005),‘Huntington’s Disease: a nursing perspective’, MedSurg Nursing. 1872;14(3):167-174.

6. Lo DC, Hughes RE, Hughes RE. Neurobiology of Huntington’s Disease : Applications to Drug Discovery. CRC Press; 2010. doi:10.1201/EBK0849390005

7. Bhattacharyya KB. The story of George Huntington and his disease. Ann Indian Acad Neurol. 2016;19(1):25-28. doi:10.4103/0972-2327.175425

8. Gusella JF, Wexler NS, Conneally PM, et al. A polymorphic DNA marker genetically linked to Huntington’s disease. Nature. 1983;306(5940):234. doi:10.1038/306234a0

9. MacDonald ME, Ambrose CM, Duyao MP, et al. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease . Cell. 1993;72(6):971-983. doi:10.1016/0092-8674(93)90585-E

10. Fan H-C, Ho L-I, Chi C-S, et al. Polyglutamine (PolyQ) diseases: genetics to treatments. Cell Transplant. 2014;23(4-5):441-458. doi:10.3727/096368914X678454

11. Rubinsztein DC, Leggo J, Coles R, et al. Phenotypic characterization of individuals with 30-40 CAG repeats in the Huntington disease (HD) gene reveals HD cases with 36 repeats and apparently normal elderly individuals with 36-39 repeats. Am J Hum Genet. 1996;59(1):16-22.

12. Squitieri F, Cannella M, Simonelli M. CAG mutation effect on rate of progression in Huntington’s disease. Neurol Sci. 2002;23 Suppl 2:S107-108. doi:10.1007/s100720200092

13. Lee J-M, Ramos EM, Lee J-H, et al. CAG repeat expansion in Huntington disease determines age at onset in a fully dominant fashion. Neurology. 2012;78(10):690-695. doi:10.1212/WNL.0b013e318249f683

14. Squitieri F, Gellera C, Cannella M, et al. Homozygosity for CAG mutation in Huntington disease is associated with a more severe clinical course. Brain. 2003;126(Pt 4):946-955. doi:10.1093/brain/awg077

68

15. Cubo E, Martinez-Horta S-I, Santalo FS, et al. Clinical manifestations of homozygote allele carriers in Huntington disease. Neurology. 2019;92(18):e2101-e2108. doi:10.1212/WNL.0000000000007147

16. Rosenblatt A, Liang K-Y, Zhou H, et al. The association of CAG repeat length with clinical progression in Huntington disease. Neurology. 2006;66(7):1016. doi:10.1212/01.wnl.0000204230.16619.d9

17. Illarioshkin SN, Igarashi S, Onodera O, et al. Trinucleotide repeat length and rate of progression of Huntington’s disease. Annals of Neurology. 1994;36(4):630-635. doi:10.1002/ana.410360412

18. Williams JK, Skirton H, Barnette JJ, Paulsen JS. Family Caregiver Personal Concerns in Huntington Disease. J Adv Nurs. 2012;68(1):137-146. doi:10.1111/j.1365- 2648.2011.05727.x

19. Squitieri F, Andrew SE, Goldberg YP, et al. DNA haplotype analysis of Huntington disease reveals clues to the origins and mechanisms of CAG expansion and reasons for geographic variations of prevalence. Hum Mol Genet. 1994;3(12):2103-2114. doi:10.1093/hmg/3.12.2103

20. Warby SC, Visscher H, Collins JA, et al. HTT haplotypes contribute to differences in Huntington disease prevalence between Europe and East Asia. Eur J Hum Genet. 2011;19(5):561-566. doi:10.1038/ejhg.2010.229

21. Nasir J, Floresco SB, O’Kusky JR, et al. Targeted disruption of the Huntington’s disease gene results in embryonic lethality and behavioral and morphological changes in heterozygotes. Cell. 1995;81(5):811-823. doi:10.1016/0092-8674(95)90542-1

22. Zeitlin S, Liu JP, Chapman DL, Papaioannou VE, Efstratiadis A. Increased apoptosis and early embryonic lethality in mice nullizygous for the Huntington’s disease gene homologue. Nat Genet. 1995;11(2):155-163. doi:10.1038/ng1095-155

23. Duyao MP, Auerbach AB, Ryan A, et al. Inactivation of the mouse Huntington’s disease gene homolog Hdh. Science. 1995;269(5222):407-410. doi:10.1126/science.7618107

24. Clabough EBD, Zeitlin SO. Deletion of the triplet repeat encoding polyglutamine within the mouse Huntington’s disease gene results in subtle behavioral/motor phenotypes in vivo and elevated levels of ATP with cellular senescence in vitro. Hum Mol Genet. 2006;15(4):607-623. doi:10.1093/hmg/ddi477

25. MacDonald ME. Huntingtin: alive and well and working in middle management. Sci STKE. 2003;2003(207):pe48. doi:10.1126/stke.2003.207.pe48

26. McKinstry SU, Karadeniz YB, Worthington AK, et al. Huntingtin is required for normal excitatory synapse development in cortical and striatal circuits. J Neurosci. 2014;34(28):9455-9472. doi:10.1523/JNEUROSCI.4699-13.2014

27. Dragatsis I, Dietrich P, Ren H, et al. Effect of early embryonic deletion of huntingtin from pyramidal neurons on the development and long-term survival of neurons in cerebral cortex and striatum. Neurobiol Dis. 2018;111:102-117. doi:10.1016/j.nbd.2017.12.015

69

28. Cattaneo E, Zuccato C, Tartari M. Normal huntingtin function: an alternative approach to Huntington’s disease. Nature Reviews Neuroscience. 2005;6(12):919. doi:10.1038/nrn1806

29. Lee JK, Conrad A, Epping E, et al. Effect of Trinucleotide Repeats in the Huntington’s Gene on Intelligence. EBioMedicine. 2018;31:47-53. doi:10.1016/j.ebiom.2018.03.031

30. Arrasate M, Mitra S, Schweitzer ES, Segal MR, Finkbeiner S. Inclusion body formation reduces levels of mutant huntingtin and the risk of neuronal death. Nature. 2004;431(7010):805-810. doi:10.1038/nature02998

31. Miller J, Arrasate M, Shaby BA, Mitra S, Masliah E, Finkbeiner S. Quantitative relationships between huntingtin levels, polyglutamine length, inclusion body formation, and neuronal death provide novel insight into huntington’s disease molecular pathogenesis. J Neurosci. 2010;30(31):10541-10550. doi:10.1523/JNEUROSCI.0146-10.2010

32. Krol J, Fiszer A, Mykowska A, Sobczak K, Mezer M de, Krzyzosiak WJ. Ribonuclease Dicer Cleaves Triplet Repeat Hairpins into Shorter Repeats that Silence Specific Targets. Molecular Cell. 2007;25(4):575-586. doi:10.1016/j.molcel.2007.01.031

33. Imbert M, Blandel F, Leumann C, Garcia L, Goyenvalle A. Lowering Mutant Huntingtin Using Tricyclo-DNA Antisense Oligonucleotides As a Therapeutic Approach for Huntington’s Disease. Nucleic Acid Ther. 2019;29(5):256-265. doi:10.1089/nat.2018.0775

34. Tabrizi SJ, Leavitt BR, Landwehrmeyer GB, et al. Targeting Huntingtin Expression in Patients with Huntington’s Disease. N Engl J Med. 2019;380(24):2307-2316. doi:10.1056/NEJMoa1900907

35. Gerfen CR, Engber TM, Mahan LC, et al. D1 and D2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science. 1990;250(4986):1429- 1432. doi:10.1126/science.2147780

36. Le Moine C, Bloch B. D1 and D2 dopamine receptor gene expression in the rat striatum: sensitive cRNA probes demonstrate prominent segregation of D1 and D2 mRNAs in distinct neuronal populations of the dorsal and ventral striatum. J Comp Neurol. 1995;355(3):418-426. doi:10.1002/cne.903550308

37. Reiner A, Albin RL, Anderson KD, D’Amato CJ, Penney JB, Young AB. Differential loss of striatal projection neurons in Huntington disease. Proc Natl Acad Sci USA. 1988;85(15):5733-5737. doi:10.1073/pnas.85.15.5733

38. Deng YP, Albin RL, Penney JB, Young AB, Anderson KD, Reiner A. Differential loss of striatal projection systems in Huntington’s disease: a quantitative immunohistochemical study. J Chem Neuroanat. 2004;27(3):143-164. doi:10.1016/j.jchemneu.2004.02.005

39. Hikida T, Kimura K, Wada N, Funabiki K, Nakanishi S. Distinct roles of synaptic transmission in direct and indirect striatal pathways to reward and aversive behavior. Neuron. 2010;66(6):896-907. doi:10.1016/j.neuron.2010.05.011

70

40. Jiang HK, McGinty JF, Hong JS. Differential modulation of striatonigral dynorphin and enkephalin by dopamine receptor subtypes. Brain Res. 1990;507(1):57-64. doi:10.1016/0006-8993(90)90522-d

41. Macpherson T, Morita M, Hikida T. Striatal direct and indirect pathways control decision-making behavior. Front Psychol. 2014;5. doi:10.3389/fpsyg.2014.01301

42. Deng YP, Wong T, Bricker-Anthony C, Deng B, Reiner A. Loss of corticostriatal and thalamostriatal synaptic terminals precedes striatal projection neuron pathology in heterozygous Q140 Huntington’s disease mice. Neurobiol Dis. 2013;60:89-107. doi:10.1016/j.nbd.2013.08.009

43. Deng Y-P, Wong T, Wan JY, Reiner A. Differential loss of thalamostriatal and corticostriatal input to striatal projection neuron types prior to overt motor symptoms in the Q140 knock-in mouse model of Huntington’s disease. Front Syst Neurosci. 2014;8:198. doi:10.3389/fnsys.2014.00198

44. Saunders A, Macosko E, Wysoker A, et al. A Single-Cell Atlas of Cell Types, States, and Other Transcriptional Patterns from Nine Regions of the Adult Mouse Brain. bioRxiv. April 2018:299081. doi:10.1101/299081

45. Aizman O, Brismar H, Uhlén P, et al. Anatomical and physiological evidence for D1 and D2 dopamine receptor colocalization in neostriatal neurons. Nat Neurosci. 2000;3(3):226-230. doi:10.1038/72929

46. Lester J, Fink S, Aronin N, DiFiglia M. Colocalization of D1 and D2 dopamine receptor mRNAs in striatal neurons. Brain Res. 1993;621(1):106-110. doi:10.1016/0006- 8993(93)90303-5

47. Mayer C, Hafemeister C, Bandler RC, et al. Developmental diversification of cortical inhibitory interneurons. Nature. 2018;555(7697):457-462. doi:10.1038/nature25999

48. Kawaguchi Y, Wilson CJ, Augood SJ, Emson PC. Striatal interneurones: chemical, physiological and morphological characterization. Trends Neurosci. 1995;18(12):527- 535.

49. Bennett BD, Bolam JP. Characterization of calretinin-immunoreactive structures in the striatum of the rat. Brain Res. 1993;609(1-2):137-148. doi:10.1016/0006- 8993(93)90866-l

50. Cicchetti F, Prensa L, Wu Y, Parent A. Chemical anatomy of striatal interneurons in normal individuals and in patients with Huntington’s disease. Brain Res Brain Res Rev. 2000;34(1-2):80-101.

51. Kubota Y, Mikawa S, Kawaguchi Y. Neostriatal GABAergic interneurones contain NOS, calretinin or parvalbumin. Neuroreport. 1993;5(3):205-208. doi:10.1097/00001756- 199312000-00004

52. Aquilonius SM, Eckernås SA, Sundwall A. Regional distribution of choline acetyltransferase in the human brain: changes in Huntington’s chorea. J Neurol Neurosurg Psychiatry. 1975;38(7):669-677. doi:10.1136/jnnp.38.7.669

71

53. Massouh M, Wallman M-J, Pourcher E, Parent A. The fate of the large striatal interneurons expressing calretinin in Huntington’s disease. Neurosci Res. 2008;62(4):216-224. doi:10.1016/j.neures.2008.08.007

54. Deng Y-P, Reiner A. Cholinergic interneurons in the Q140 knock-in mouse model of Huntington’s disease: Reductions in dendritic branching and thalamostriatal input. J Comp Neurol. 2016;524(17):3518-3529. doi:10.1002/cne.24013

55. Shin J-Y, Fang Z-H, Yu Z-X, Wang C-E, Li S-H, Li X-J. Expression of mutant huntingtin in glial cells contributes to neuronal excitotoxicity. J Cell Biol. 2005;171(6):1001-1012. doi:10.1083/jcb.200508072

56. Tai YF, Pavese N, Gerhard A, et al. Microglial activation in presymptomatic Huntington’s disease gene carriers. Brain. 2007;130(Pt 7):1759-1766. doi:10.1093/brain/awm044

57. Liddelow SA, Guttenplan KA, Clarke LE, et al. Neurotoxic reactive astrocytes are induced by activated microglia. Nature. 2017;541(7638):481-487. doi:10.1038/nature21029

58. Benraiss A, Wang S, Herrlinger S, et al. Human glia can both induce and rescue aspects of disease phenotype in Huntington disease. Nat Commun. 2016;7. doi:10.1038/ncomms11758

59. Bartzokis G, Lu PH, Tishler TA, et al. Myelin breakdown and iron changes in Huntington’s disease: pathogenesis and treatment implications. Neurochem Res. 2007;32(10):1655-1664. doi:10.1007/s11064-007-9352-7

60. Phillips O, Squitieri F, Sanchez-Castaneda C, et al. Deep white matter in Huntington’s disease. PLoS ONE. 2014;9(10):e109676. doi:10.1371/journal.pone.0109676

61. Myers RH, Vonsattel JP, Paskevich PA, et al. Decreased Neuronal and Increased Oligodendroglial Densities in Huntington’s Disease Caudate Nucleus. J Neuropathol Exp Neurol. 1991;50(6):729-742. doi:10.1097/00005072-199111000-00005

62. Huang B, Wei W, Wang G, et al. Mutant Huntingtin Downregulates Myelin Regulatory Factor-Mediated Myelin Gene Expression and Affects Mature Oligodendrocytes. Neuron. 2015;85(6):1212-1226. doi:10.1016/j.neuron.2015.02.026

63. Hofer S, Kainz K, Zimmermann A, et al. Studying Huntington’s Disease in Yeast: From Mechanisms to Pharmacological Approaches. Front Mol Neurosci. 2018;11. doi:10.3389/fnmol.2018.00318

64. Morton AJ, Howland DS. Large Genetic Animal Models of Huntington’s Disease. Journal of Huntington’s Disease. 2013;2(1):3-19. doi:10.3233/JHD-130050

65. Pouladi MA, Morton AJ, Hayden MR. Choosing an animal model for the study of Huntington’s disease. Nature Reviews Neuroscience. 2013;14(10):708-721. doi:10.1038/nrn3570

66. Alexandrov V, Brunner D, Menalled LB, et al. Large-scale phenome analysis defines a behavioral signature for Huntington’s disease genotype in mice. Nat Biotechnol. 2016;34(8):838-844. doi:10.1038/nbt.3587

72

67. Smith GA, Rocha EM, McLean JR, et al. Progressive axonal transport and synaptic protein changes correlate with behavioral and neuropathological abnormalities in the heterozygous Q175 KI mouse model of Huntington’s disease. Hum Mol Genet. 2014;23(17):4510-4527. doi:10.1093/hmg/ddu166

68. Ament SA, Pearl JR, Grindeland A, et al. High resolution time-course mapping of early transcriptomic, molecular and cellular phenotypes in Huntington’s disease CAG knock- in mice across multiple genetic backgrounds. Hum Mol Genet. 2017;26(5):913-922. doi:10.1093/hmg/ddx006

69. HDinHD | Huntington’s Disease in High Definition. https://www.hdinhd.org/.

70. HDSigDB Database. https://hdsigdb.hdinhd.org/home.

71. Krishnaswami SR, Grindberg RV, Novotny M, et al. Using single nuclei for RNA-seq to capture the transcriptome of postmortem neurons. Nat Protoc. 2016;11(3):499-524. doi:10.1038/nprot.2016.015

72. Lake BB, Codeluppi S, Yung YC, et al. A comparative strategy for single-nucleus and single-cell transcriptomes confirms accuracy in predicted cell-type expression from nuclear RNA. Sci Rep. 2017;7(1):6031. doi:10.1038/s41598-017-04426-w

73. Matson KJE, Sathyamurthy A, Johnson KR, Kelly MC, Kelley MW, Levine AJ. Isolation of Adult Spinal Cord Nuclei for Massively Parallel Single-nucleus RNA Sequencing. J Vis Exp. 2018;(140). doi:10.3791/58413

74. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. doi:10.1038/ncomms14049

75. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635

76. Lun ATL, Riesenfeld S, Andrews T, et al. Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. bioRxiv. April 2018:234872. doi:10.1101/234872

77. Young MD, Behjati S. SoupX removes ambient RNA contamination from droplet based single cell RNA sequencing data | bioRxiv. https://www.biorxiv.org/content/10.1101/303727v1.

78. DropViz. http://dropviz.org/.

79. Saunders A, Macosko EZ, Wysoker A, et al. Molecular Diversity and Specializations among the Cells of the Adult Mouse Brain. Cell. 2018;174(4):1015-1030.e16. doi:10.1016/j.cell.2018.07.028

80. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26(4):493-500. doi:10.1093/bioinformatics/btp692

81. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621-628. doi:10.1038/nmeth.1226

73

82. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA- sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421-427. doi:10.1038/nbt.4091

83. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015;31(12):1974-1980. doi:10.1093/bioinformatics/btv088

84. Macosko EZ, Basu A, Satija R, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015;161(5):1202-1214. doi:10.1016/j.cell.2015.05.002

85. Integrating single-cell transcriptomic data across different conditions, technologies, and species | Nature Biotechnology. https://www.nature.com/articles/nbt.4096. Accessed July 21, 2019.

86. Stuart T, Butler A, Hoffman P, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177(7):1888-1902.e21. doi:10.1016/j.cell.2019.05.031

87. Jarvis RA, Patrick EA. Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Trans Comput. 1973;C-22(11):1025-1034. doi:10.1109/T- C.1973.223640

88. Ertöz L, Steinbach M, Kumar V. A new shared nearest neighbor clustering algorithm and its applications. January 2002.

89. Maaten L van der, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579-2605.

90. Maaten L van der. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research. 2014;15:3221-3245.

91. Gierahn TM, Wadsworth MH, Hughes TK, et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat Methods. 2017;14(4):395-398. doi:10.1038/nmeth.4179

92. CSEA tool | Dougherty Lab: From Autism to Astrocytes. http://genetics.wustl.edu/jdlab/csea-tool-2/. Accessed July 23, 2019.

93. ISH Data :: Allen Brain Atlas: Mouse Brain. https://mouse.brain-map.org/. Accessed October 14, 2019.

94. Eppig JT, Richardson JE, Kadin JA, Ringwald M, Blake JA, Bult CJ. Mouse Genome Informatics (MGI): reflecting on 25 years. Mamm Genome. 2015;26(7-8):272-284. doi:10.1007/s00335-015-9589-4

95. MGI-Mouse Genome Informatics-The international database resource for the . http://www.informatics.jax.org/. Accessed October 14, 2019.

96. Dougherty JD, Schmidt EF, Nakajima M, Heintz N. Analytical approaches to RNA profiling data for the identification of genes enriched in specific cells. Nucleic Acids Res. 2010;38(13):4218-4230. doi:10.1093/nar/gkq130

74

97. Langfelder P, Cantle JP, Chatzopoulou D, et al. Integrated genomics and proteomics define huntingtin CAG length-dependent networks in mice. Nat Neurosci. 2016;19(4):623-633. doi:10.1038/nn.4256

98. Zhang Y, Chen K, Sloan SA, et al. An RNA-Sequencing Transcriptome and Splicing Database of Glia, Neurons, and Vascular Cells of the Cerebral Cortex. J Neurosci. 2014;34(36):11929-11947. doi:10.1523/JNEUROSCI.1860-14.2014

99. Common mechanisms in neurodegeneration and neuroinflammation: a BrainNet Europe gene expression microarray study. - PubMed - NCBI. https://www.ncbi.nlm.nih.gov/pubmed/25119539. Accessed October 15, 2019.

100. McKenzie AT, Wang M, Hauberg ME, et al. Brain Cell Type Specific Gene Expression and Co-expression Network Architectures. Sci Rep. 2018;8(1):8868. doi:10.1038/s41598- 018-27293-5

101. Cajigas IJ, Tushev G, Will TJ, tom Dieck S, Fuerst N, Schuman EM. The local transcriptome in the synaptic neuropil revealed by deep sequencing and high- resolution imaging. Neuron. 2012;74(3):453-466. doi:10.1016/j.neuron.2012.02.036

102. Durrenberger PF, Fernando FS, Kashefi SN, et al. Common mechanisms in neurodegeneration and neuroinflammation: a BrainNet Europe gene expression microarray study. J Neural Transm (Vienna). 2015;122(7):1055-1068. doi:10.1007/s00702-014-1293-0

103. Beaumont V, Zhong S, Lin H, et al. Phosphodiesterase 10A Inhibition Improves Cortico- Basal Ganglia Function in Huntington’s Disease Models. Neuron. 2016;92(6):1220- 1237. doi:10.1016/j.neuron.2016.10.064

104. Mielcarek M, Landles C, Weiss A, et al. HDAC4 reduction: a novel therapeutic strategy to target cytoplasmic huntingtin and ameliorate neurodegeneration. PLoS Biol. 2013;11(11):e1001717. doi:10.1371/journal.pbio.1001717

105. Mehta SR, Tom CM, Wang Y, et al. Human Huntington’s Disease iPSC-Derived Cortical Neurons Display Altered Transcriptomics, Morphology, and Maturation. Cell Rep. 2018;25(4):1081-1096.e6. doi:10.1016/j.celrep.2018.09.076

106. Sadri-Vakili G, Bouzou B, Benn CL, et al. Histones associated with downregulated genes are hypo-acetylated in Huntington’s disease models. Hum Mol Genet. 2007;16(11):1293-1306. doi:10.1093/hmg/ddm078

107. Zacchigna S, Lambrechts D, Carmeliet P. Neurovascular signalling defects in neurodegeneration. Nat Rev Neurosci. 2008;9(3):169-181. doi:10.1038/nrn2336

108. Wylie CJ, Hendricks TJ, Zhang B, et al. Distinct transcriptomes define rostral and caudal serotonin neurons. J Neurosci. 2010;30(2):670-684. doi:10.1523/JNEUROSCI.4656- 09.2010

109. Lee J-M, Zhang J, Su AI, et al. A novel approach to investigate tissue-specific trinucleotide repeat instability. BMC Syst Biol. 2010;4:29. doi:10.1186/1752-0509-4-29

75

110. Lee J-M, Ivanova EV, Seong IS, et al. Unbiased gene expression analysis implicates the huntingtin polyglutamine tract in extra-mitochondrial energy metabolism. PLoS Genet. 2007;3(8):e135. doi:10.1371/journal.pgen.0030135

111. Pirhaji L, Milani P, Dalin S, et al. Identifying therapeutic targets by combining transcriptional data with ordinal clinical measurements. Nat Commun. 2017;8(1):623. doi:10.1038/s41467-017-00353-6

112. Reversed graph embedding resolves complex single-cell developmental trajectories | bioRxiv. https://www.biorxiv.org/content/10.1101/110668v1. Accessed July 21, 2019.

113. Kaplan S, Itzkovitz S, Shapiro E. A Universal Mechanism Ties Genotype to Phenotype in Trinucleotide Diseases. PLOS Computational Biology. 2007;3(11):e235. doi:10.1371/journal.pcbi.0030235

76