Volume 28 Bioinformatics in Human Health and Heredity

Edited by R. Chakraborty, Center for Computational Genomics, Institute of Applied Genetics and Department of Forensic and Investigative Genetics, University of North Texas Health Science Center, Fort Worth, Texas, USA

C.R. Rao, C.R. Rao AIMSCS, University of Hyderabad Campus, Hyderabad, India

P.K. Sen, Departments of Biostatistics and Statistics and Operational Research, University of North Carolina, Chapel Hill, North Carolina, USA

Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo

North-Holland is an imprint of Elsevier

Table of Contents

Preface to Handbook 28 xiii

Contributors: Vol. 28 xvii

Ch. 1. Introduction: Wither Bioinformatics in Human Health and Heredity 1 Ranajit Chakraborty, C.R. Rao and Pranab K. Sen

1. Introduction 2 2. Sciences dealing with biological information and rationale of their integration 3 3. Goals and major research areas of bioinformatics 4 4. Why bioinformatics is so important and open areas of research 8 References 9

Section A: Theme - Microarray Analysis 11

Ch. 2. Bayesian Methods for Microarray Data 13 Tanzy Love and Alicia Carriquiry

1. Introduction 13 2. Literature review 19 3. Hierarchical models for microarray analysis 23 4. Embryonic maize tissue development 28 5. Conclusion 32 6. Appendix 32 References 37

Ch. 3. Statistical Analysis of Gene Expression Studies with Ordered Experimental Conditions 39 Shyamal D. Peddada, David M. Umbach and Shawn Harris

1. Introduction 39 2. "Short-series" time-course data 42 3. "Long series" time-course data for cyclic and developmental processes 52 4. Concluding remarks 61 References 62

Ch. 4. Meta-Analysis of High Throughput Oncology Data 67 Jeffrey C. Miecznikowski, Dan Wang, David L. Gold and Song Liu

1. Introduction 67 2. Case study 71 3. Discussion 88 4. Conclusions 90 References 91

Section B: Theme - Analytical Methods 97

Ch. 5. A Statistical Appraisal of Biomarker Selection Methods Applicable to HIV/AIDS Research 99 Bosny J. Pierre-Louis, C.M. Suchindran, Pai-Lien Chen, Stephen R. Cole and Charles S. Morrison

1. Introduction 100 2. Biomarker definitions 100 3. HIV infection biomarker review 103 4. Statistical screening methods for biomarker selection 104 5. Causal inference approaches for biomarker selection 105 6. Targeted maximum likelihood estimation 112 7. Classifier performance assessed by ROC curve 114 8. Some impending statistical challenges 116 9. Multiplicity considerations in biomarker research 117 10. An application: hormonal contraception and HIV genital shedding and disease progression (GS study) 119 11. Discussion and conclusion 121 References 122

Ch. 6. The Use of Hamming Distance in Bioinformatics 129 Aluisio Pinheiro, Hildete Prisco Pinheiro and Pranab Kumar Sen

1. Introduction 130 2. Some diversity measures 131 3. U-statistics representation for the Hamming distance based measures in bioinformatics 134 4. Analysis of variance tests based on Hamming distances 136 5. MANOVA: roadblocks for k ≫ n 142 6. Microarray gene expression models: statistical perspectives 145 7. Asymptotics under null and local alternatives 148 8. Applications of Hamming distance measures 152 9. Discussion 158 Appendix 159 References 160

Ch. 7. Asymptotic Expansions of the Distributions of the Least Squares Estimators in Factor Analysis and Structural Equation Modeling 163 Haruhiko Ogasawara

1. Introduction 163 2. Least squares estimators for unstandardized variables 165 3. Asymptotic distributions of the least squares estimators 166 4. Least squares estimators for standardized variables 170 5. Numerical examples 171 6. Discussion 180 Appendix 188 References 197

Ch. 8. Multiple Testing of Hypotheses in Biomedical Research 201 Hansen Bannerman-Thompson, M. Bhaskara Rao and Ranajit Chakraborty

1. Introduction 202 2. What is multiple testing? 202 3. Parametric approach 207 4. Nonparametric procedure 211 5. The enigma of p-values 212 6. Analogues of Type I error rates 221 7. Multiple testing procedures 226 8. Conclusions 236 References 237

Section C: Theme - Genetics and DNA Forensics 239

Ch. 9. Applications of Bayesian Neural Networks in Prostate Cancer Study 241 Sounak Chakraborty and Malay Ghosh

1. Introduction 242 2. Feedforward neural networks: frequentist and Bayesian approach 244 3. Priors and their properties 246 4. Prostate cancer: univariate analysis with clinical covariates 249 5. Multivariate analysis with clinical covariates 253 6. Univariate and multivariate analysis with gene expression data 257 7. Summary and conclusion 260 References 261

Ch. 10. Statistical Methods for Detecting Functional Divergence of Gene Families 263 Xun Gu

1. Introduction 263 2. The two-state model for functional divergence 264 3. Testing type-I functional divergence after gene duplication 265 4. Predicting critical residues for (type-I) functional divergence 268 5. Implementation and case-study 271 References 271

Ch. 11. Sequence Pattern Discovery with Applications to Understanding Gene Regulation and Vaccine Design 273 Mayetri Gupta and Surajit Ray

1. Introduction 273 2. Pattern discovery in studying gene regulation 274 3. Hidden Markov models for sequence analysis 280 4. Using auxiliary data in motif prediction 284 5. Vaccine development using a pattern discovery approach 294 6. Pattern discovery using amino acid properties 297 7. Using HMMs to classify binders and non-binders 301 8. Concluding remarks 305 References 305

Ch. 12. Single-Locus Genetic Association Analysis by Ordinal Tests 309 Ge Zhang, Li Jin and Ranajit Chakraborty

1. Introduction 310 2. Penetrance model for single-locus genetic association 312 3. Indirect association and two-locus model 314 4. Single-locus association tests 317 5. Statistical methods for ordered categorical data analysis 319 6. The equivalence between the CML test and the Bartholomew's Chibar test 322 7. Type I error of different single-locus association tests 326 8. Power of different single-locus association tests 329 9. Simulation study using real HapMap ENCODE data 331 10. Conclusion 336 References 337

Ch. 13. A Molecular Information Method to Estimate Population Admixture 339 Bernardo Bertoni, Tatiana Velazquez, Monica Sans and Ranajit Chakraborty

1. Introduction 339 2. Materials and methods 341 3. Results 343 4. Discussion 347 Acknowledgments 350 References 350

Ch. 14. Effects of Inclusion of Relatives in DNA Databases: Empirical Observations from 13K SNPs in Hap-Map Population Data 353 Saurav Guha, Jianye Ge and Ranajit Chakraborty

1. Introduction 354 2. Material and methods 354 3. Results 355 4. Discussion 361 References 365

Section D: Theme - Epidemiology 367

Ch. 15. Measurement and Analysis of Quality of Life in Epidemiology 369 Mounir Mesbah

1. Introduction 369 2. Measurement models of Health related Quality of Life 371 3. Validation of HrQoL measurement models 378 4. Construction of Quality of Life scores 382 5. Analysis of Quality of Life change between groups 385 6. Simulation results 392 7. Real data examples 392 8. Conclusion 397 References 399

Ch. 16. Quality of Life Perspectives in Chronic Disease and Disorder Studies 401 Gisela Tunes-da-Silva, Antonio Carlos Pedroso-de-Lima and Pranab Kumar Sen

1. Introduction 401 2. Biology of diabetes 403 3. Genetics of Thalassemia minor 405 4. Nondegradation vs. degradation processes 407 5. QAL survival analysis 410 6. QASA in diabetes studies: QOL aspects 422 7. Need for data collection, monitoring, and analysis 424 8. Some simulation studies 426 Acknowledgements 430 References 431

Ch. 17. Bioinformatics of Obesity 433 Bandana M. Chakraborty and Ranajit Chakraborty

1. Introduction 433 2. Epistemology and history of obesity 434 3. Measurements and types of obesity 435 4. Relationships between various measures of obesity and their implications 445 5. Diseases associated with obesity 449 6. Causes of obesity 453 7. Combating obesity epidemics 463 8. Future studies and epilogue 467 References 470

Ch. 18. Exploring Genetic Epidemiology Data with Bayesian Networks 479 Andrei S. Rodin, Grigoriy Gogoshin, Anatoliy Litvinenko and Eric Boerwinkle

1. Introduction 480 2. Bayesian Networks 481 3. Example application in genetic epidemiology 486 4. Software and applications 503 5. Summary and future directions 506 Acknowledgments 507 References 507

Section E: Theme - Database Issues 511

Ch. 19. Perturbation Methods for Protecting Numerical Data: Evolution and Evaluation 513 Rathindra Sarathy and Krish Muralidhar

1. Introduction 513 2. Definition of data utility and disclosure risk for perturbation methods 514 3. The theoretical basis for perturbation methods 517 4. Evolution of perturbation methods for numerical data 518 5. Evaluation of perturbation methods 522 6. Comparison of perturbation with other masking methods 528 7. Conclusions 530 References 530

Ch. 20. Protecting Data Confidentiality in Publicly Released Datasets: Approaches Based on Multiple Imputation 533 Jerome P. Reiter

1. Introduction 533 2. Description of synthetic data methods 535 3. Inferential methods 538 4. Concluding remarks 542 References 543

Subject Index 547

Handbook of Statistics: Contents of Previous Volumes 561