
The Pennsylvania State University

The Graduate School

College of Earth and Mineral Sciences

THE GEOGRAPHICAL ANALOG ENGINE: HYBRID NUMERIC AND

SEMANTIC SIMILARITY MEASURES FOR U.S. CITIES

A in

Geography

by

Tawan Banchuen

© 2008 Tawan Banchuen

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2008

The thesis of Tawan Banchuen was reviewed and approved* by the following:

C. Gregory Knight Professor of Geography Thesis Co-Advisor Co-Chair of Committee

Mark N. Gahegan Professor of Geography Thesis Co-Advisor Co-Chair of Committee

Robert G. Crane Professor of Geography

Dennis W. Thomson Professor of Meteorology

Karl Zimmerer Professor of Geography Head of the Department of Geography

*Signatures are on file in the Graduate School

ABSTRACT

This dissertation began with the goal of developing a methodology for locating climate change analogs, and quickly turned into a quest for computational means of locating geographical analogs in general. Previous work on geographical analogs either computed only on numeric information or considered qualitative information manually. Current and emerging technologies, such as electronic document collections, the Internet, and the Semantic Web, make it possible for people and organizations to store millions of books and articles, share them with the world, or even author some themselves. The amount of electronic and online content is expanding exponentially, such that analysts are increasingly overwhelmed by the sheer volume of accessible information. The dissertation explores techniques from knowledge engineering, , information sciences, linguistics and cognitive science, and proposes a novel, automatic methodology that computes similarity within online/offline textual information, and graphically and statistically combines the results with those of numeric methods. U.S. cities with populations larger than 25,000 people are selected as a test case. Places are evaluated based on their numeric characteristics in the County and City Data Book and qualitative characteristics from Wikipedia entries. The dissertation recommends a way to convert Wikipedia entries into Web Ontology Language (OWL) ontologies, which computer algorithms can read, understand and compute on. The dissertation initially experiments with Mitra and Wiederhold’s semantic measure to quantify similarity between places in the qualitative space. Many shortfalls are identified, and a series of experimental enhancements are explored. The experiments demonstrate that good semantic measures should employ a comprehensive stop-words list and a complete, but succinct, vocabulary. A semantic measure that can recognize synonyms must understand the intended senses of words in a place description.
Furthermore, analysts need to be careful with two styles of description: descriptions of places that are (1) created by following a template or (2) laden with statistical statements can result in falsely high similarity between the places. It is illustrated that scatter plots of numeric similarity scores versus semantic similarity scores can effectively help analysts consider similarity between places in two-space. Analysts can visually observe whether the numeric ranks of places agree with the semantic ranks. The dissertation also shows that Spearman’s rank correlation test and the Kruskal-Wallis test of means can provide statistical confirmation for visual observations. The proposed hybrid methodology enables analysts to automatically discover geographical analogs in ways that strictly numeric methods or manual semantic analysis cannot offer.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Why the Geographical Analog Engine?

1.1 Inadequate Numeric Methods and Overwhelming Qualitative Information
1.2 The Geographical Analog Engine

Chapter 2 Analogs

2.1 Definitions
2.2 Roles of Analogy Making in Human Cognition
2.2.1 Metaphors
2.2.2 Classification
2.2.3 Abduction
2.3 Contemporary Usage of Analogs
2.4 Conclusions

Chapter 3 Similarity: Numeric and Semantic Measures

3.1 Ontology
3.5.1 Web Ontology Language (OWL)
3.5.2 Semantic Networks
3.2 Similarity Measures
3.8.1 Semantic Measure
3.8.1.1 Lexicon Based Measure
3.8.1.1.1 Mitra and Wiederhold’s Algorithm
3.8.1.1.2 Enhanced Version of Mitra and Wiederhold’s Algorithm
3.8.1.2 Feature-Set Based Measure
3.8.1.2.1 Basic Synonym Count (BSC)
3.8.1.2.2 Synonym Count with a Vocabulary (SCV)
3.8.1.3 Corpus Based Measure
3.8.1.3.1 Corpus Based Synonym Count (CBSC)
3.8.1.4 Graph Based Measure
3.8.1.5 Measure Based on Information Content
3.8.1.6 Expert Judgment
3.8.2 Numeric Measure: Euclidean Similarity
3.8.2.1 Principal Component Analysis
3.8.2.2 K-Means Cluster Analysis
3.3 Statistical Tests of Similarity Measures
3.9.1 Kruskal-Wallis Test
3.9.2 Spearman’s Rank Correlation Coefficient Test
3.4 Uncertainty of Similarity Measures
3.11 Conclusions

Chapter 4 Preliminary Experiment: Six University Cities

4.1 Comparison Using Mitra and Wiederhold’s Algorithm
4.2 Comparison Using Enhanced Mitra and Wiederhold’s Algorithm
4.3 Conclusions

Chapter 5 Numeric Analysis

5.1 Principal Component Analysis
5.2 K-Means Cluster Analysis
5.3 Selection of Cities for Subsequent Analyses
5.3.1 Accounting for the Current State of Wikipedia Entries
5.4 Conclusions

Chapter 6 Semantic Analysis

6.1 Synonym Count with a Vocabulary (SCV)
6.1.1 Results and Discussion
6.1.2 Discussion of SCV Performance
6.2 Corpus-Based Synonym Count (CBSC)
6.2.1 Results and Discussion
6.2.2 Discussion of Approaches to Semantic Disambiguation
6.3 Measure Comparison Using Statistical Tests
6.3.1 Discussion of Statistical Tests
6.4 Conclusions

Chapter 7 Summary and Conclusions

7.1 Revisiting the Research Goal
7.2 The Novel Hybrid Methodology
7.3 Future Research

Bibliography

Appendix A List of Variables in the City Tables and Their Descriptions

Appendix B Major Variables and Their Loadings for Selected Components

Appendix C Header Statistics of 490 Wikipedia City Entries

Appendix D Snapshots of Wikipedia Entries

Appendix E Three Wikipedia Sections of Ten Cities

E.1 Ann Arbor
E.1.1 Economy
E.1.2 Climate
E.1.3 Culture
E.2
E.2.1 Economy
E.2.2 Climate
E.2.3 Culture
E.3 Chicago
E.3.1 Economy
E.3.2 Climate
E.3.3 Culture and contemporary life
E.4
E.4.1 Economy
E.4.2 Climate
E.4.3 Culture
E.5 Las Vegas
E.5.1 Economy
E.5.2 Climate
E.5.3 Culture
E.6
E.6.1 Economy
E.6.2 Climate
E.6.3 Culture
E.7
E.7.1 Economy
E.7.2 Climate
E.7.3 Culture
E.8 Reno
E.8.1 Gaming industry
E.8.2 Climate
E.8.3 Culture
E.9
E.9.1 Economy
E.9.2 Climate
E.9.3 Culture
E.10
E.10.1 Economy
E.10.2 Climate
E.10.3 Culture and entertainment

Appendix F Tables of Intersection Terms

Appendix G Correlation Coefficient Matrix

LIST OF FIGURES

Figure 1-1: Concept map of a portion of the SWEET physical phenomena ontology

Figure 2-1: Diagram of the extended example

Figure 3-1: A semantic network of aircraft types. Redrawn after Giarratano and Riley’s (2005) Figure 2.5

Figure 3-2: Concept map of a second-grade student. Redrawn after Novak and Musonda’s (1991) Figure 2A

Figure 3-5: Definitions of the Pacific High and the High from the American Meteorological Society glossary

Figure 4-1: Locations of the university cities

Figure 4-2: Concept map of the Wikipedia entry of State College, PA

Figure 4-3: Portion of the OWL ontology of the Wikipedia entry of State College, PA

Figure 4-4: Plots of similarity scores computed with the original algorithm

Figure 5-1: Improvise visualization of the six clusters: Cluster 1 in yellow, Cluster 2 in blue, Cluster 3 in green, Cluster 4 in purple, Cluster 5 in orange and Cluster 6 in brown. New York City (the only city in Cluster 5) is selected and displayed in red

Figure 5-2: Scatter plots of the six clusters. The members of Cluster 1 are selected and shown in red

Figure 5-3: Map of the six clusters. Cluster 1 in yellow, Cluster 2 in blue, Cluster 3 in green, Cluster 4 in purple, Cluster 5 in orange and Cluster 6 in brown. New York City (the only city in Cluster 5) is selected and displayed in red. Note that the map shows an estimated-by-sight location of the Rocky Mountains

Figure 6-1: Similarity comparison of Los Angeles, CA and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV

Figure 6-2: Similarity comparison of Ann Arbor, MI and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV

Figure 6-3: Similarity comparison of Las Vegas, NV and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV

Figure 6-4: Similarity comparison of San Francisco, CA and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV

Figure 6-5: Similarity comparison of Las Vegas, NV and other cities based on SCV and CBSC

Figure 6-6: Similarity comparison of Las Vegas, NV and other cities based on SCV and CBSC with the modified vocabulary

Figure 7-1: Similarity comparison of Las Vegas, NV and the ten cities based on Euclidean similarity with respect to PC I and CBSC with the augmented vocabulary

Figure D-1: Snapshot of Wikipedia entry of State College, PA

Figure D-2: Snapshot of Wikipedia entry of State College, PA

Figure D-3: Snapshot of Wikipedia entry of Ann Arbor, MI

Figure D-4: Snapshot of Wikipedia entry of Ann Arbor, MI

Figure D-5: Snapshot of Wikipedia entry of Boston, MA

Figure D-6: Snapshot of Wikipedia entry of Boston, MA

Figure D-7: Snapshot of Wikipedia entry of Chicago, IL

Figure D-8: Snapshot of Wikipedia entry of Chicago, IL

Figure D-9: Snapshot of Wikipedia entry of Houston, TX

Figure D-10: Snapshot of Wikipedia entry of Houston, TX

Figure D-11: Snapshot of Wikipedia entry of Las Vegas, NV

Figure D-12: Snapshot of Wikipedia entry of Las Vegas, NV

Figure D-13: Snapshot of Wikipedia entry of Los Angeles, CA

Figure D-14: Snapshot of Wikipedia entry of Los Angeles, CA

Figure D-15: Snapshot of Wikipedia entry of Philadelphia, PA

Figure D-16: Snapshot of Wikipedia entry of Philadelphia, PA

Figure D-17: Snapshot of Wikipedia entry of Reno, NV

Figure D-18: Snapshot of Wikipedia entry of Reno, NV

Figure D-19: Snapshot of Wikipedia entry of San Diego, CA

Figure D-20: Snapshot of Wikipedia entry of San Diego, CA

Figure D-21: Snapshot of Wikipedia entry of San Francisco, CA

Figure D-22: Snapshot of Wikipedia entry of San Francisco, CA

LIST OF TABLES

Table 4-1: The Six University Cities and Their Universities

Table 5-1: Pearson Correlation Coefficient Matrix of Ten Sampled Variables

Table 5-2: Interpretation of The Selected Components

Table 5-3: Percents of Explained Variance by Each PC

Table 5-4: Multi-Way ANOVA of the Clusters

Table 5-5: Numbers of Cities in Each Cluster

Table 6-1: Intersection Terms of Los Angeles, CA vs. Boston, MA

Table 6-2: Intersection Terms of Los Angeles, CA vs. Houston, TX

Table 6-3: The Selected Ten Cities in Descending Order of Their Semantic Similarity to the Specified Cities

Table 6-4: Summary of Correct and Incorrect Synonyms for (a) Top and (b) Bottom Matches

Table 6-5: Summary of High-IC Terms for (a) Top and (b) Bottom Matches

Table 6-6: Intersection Terms of Las Vegas, NV vs. Reno, NV

Table 6-7: Intersection Terms of Las Vegas, NV vs. Reno, NV

Table 6-8: Intersection Terms of Las Vegas, NV vs. San Diego, CA

Table 6-9: Intersection Terms of Las Vegas, NV vs. Houston, TX

Table 6-10: Spearman Correlation Coefficients

Table 6-11: Two-Tailed Significance Values of Kruskal-Wallis Test

Table A-1: Variables in Table C-1 - Area and Population

Table A-2: Variables in Table C-2 - Population by Age, Sex, and Race

Table A-3: Variables in Table C-3 - Group Quarters Population and Households

Table A-4: Variables in Table C-4 - Housing, Crime, and Labor Force

Table A-5: Variables in Table C-5 - and Wholesale Trade

Table A-6: Variables in Table C-6 - Retail Trade and Accommodation and Foodservices

Table A-7: Variables in Table C-7 - Government Finances and Climate

Table B-1: Principal Component I

Table B-2: Principal Component II

Table B-3: Principal Component III

Table B-4: Principal Component IV

Table B-5: Principal Component V

Table B-6: Principal Component VI

Table B-7: Principal Component VII

Table B-8: Principal Component VIII

Table B-9: Principal Component IX

Table B-10: Principal Component X

Table B-11: Principal Component XI

Table B-12: Principal Component XII

Table B-13: Principal Component XIII

Table B-14: Principal Component XIV

Table F-1: Intersection Terms of Ann Arbor, MI vs. San Diego, CA

Table F-2: Intersection Terms of Ann Arbor, MI vs. Reno, NV

Table F-3: Intersection Terms of Las Vegas, NV vs. Reno, NV

Table F-4: Intersection Terms of Las Vegas, NV vs. San Diego, CA

Table F-5: Intersection Terms of Las Vegas, NV vs. Boston, MA

Table F-6: Intersection Terms of San Francisco, CA vs. Chicago, IL

Table F-7: Intersection Terms of San Francisco, CA vs. Philadelphia, PA

Table F-8: Intersection Terms of San Francisco, CA vs. Las Vegas, NV

Table F-9: Intersection Terms of Las Vegas, NV vs. Houston, TX

Table G-1: Spearman Correlation Coefficient Matrix

ACKNOWLEDGEMENTS

Special thanks to the National Science Foundation and the GeoVISTA Center in the Department of Geography at the Pennsylvania State University for providing financial support for this research. The author would also like to thank his co-advisors, Dr. Mark N. Gahegan and Dr. C. Gregory Knight, for their valuable advice and mentoring. He also appreciates the insightful research direction of the other committee members, Dr. Robert G. Crane and Dr. Dennis W. Thomson. Dr. Adam Z. Rose, albeit no longer on the committee, well deserves recognition for always encouraging the author to pursue an innovative frontier and for sharing his first-rate economic geography knowledge. Earnest gratitude is due to his GeoVISTA colleagues, especially Anuj Jaiswal, for their input and state-of-the-art computer code. The author would also like to thank his parents and grandparents for their financial support and for continuing to believe, albeit with much doubt, that he could finish this work. His two sisters, Lalita and Pakinee, earn similar recognition for not trying to cut off his allowances. Special thanks are due to the Bodiratnangkura family for making his graduate study abroad exciting and fun. The author will not forget the many friends at Penn State who did not mind spending time with a blunt, tactless person and who taught him numerous life lessons. Some of these amazing people include Niall Donnelly, Rafael Cancel, Charles Brooks, Jessica Kaplan, members of the Ballroom Dance Club, Master Kaye and his students, and Penn State Geography graduate students. I am forever indebted to you all. Thank you.

Chapter 1

Why the Geographical Analog Engine?

Analogy making plays a central role in human intelligence. French (1995) contended that human recognition of objects or events and construction of knowledge are mostly a process of analogy making. The following hypothetical, yet realistic, example illustrates how analogy making plays a role in human cognition and knowledge construction.

John arrives in London for the first time. At first glance, he recognizes that London is an urbanized, densely populated area, very much like New York, where John was born and has lived for 25 years. John suspects that London, like New York, has a subway system, so he decides to save some money by taking the subway to his hotel instead of a taxi. As he walks around looking for an entrance to the subway system, he feels somebody’s hand in the front pocket of his jeans, but he grabs his pocket and pulls away before the attempt to pull out his wallet succeeds. In fact, John had anticipated this from his 25 years of experience in New York. He moved his wallet from the back pocket of his jeans to the front as soon as he arrived and realized how urbanized London is, suspecting that, as in New York, pickpocketing and robbery are quite common.

This example shows that a person can recognize a new thing, in this case London, by making an analogy to something that he/she already knows about. He/she then constructs further knowledge about the new thing by assuming that it possesses other similar characteristics as well. These additional, yet-to-be-verified characteristics can be true or false. If John had gone to Oxford instead, he might have found that his precautionary action was unnecessary.

Chorley (1964) noted that analogs have long been used in reasoning and in seeking a new perspective on a problem. A good analog is one that shares some features with the study object and is more familiar to the investigator.
Chorley suggested three major types of analogs: mathematical symbolization of real-world processes, laboratory experiments, and natural historical and/or geographical analogs. In this work, the focus is solely on the last type, which refers to places that share a set of similar characteristics and relations today, in the past, or in the future. Chorley stated that no analog is perfect, but none is useless either. Analogs are like torches of different intensities that throw light in many directions. This dissertation develops a novel approach for the identification of geographical analogs.

Geographical analogs have been widely used in research and applications in geography, on both the physical and the human sides. Swearingen (1987) documented how the French protectorate administration, inspired by the climatic resemblance between Morocco and California, amended its agricultural policy in the 1930s to adopt and transfer agricultural practices and technologies from California. The amendment was necessitated by the failing wheat policy, which had been formulated based on the slightly-better-than-hearsay belief that Morocco was the granary of Rome. In fact, the agricultural regions of Morocco receive inadequate rainfall for wheat production. To solve the problem, the protectorate government looked for a more concrete, successful example as a model for the new policy and was intrigued by the success of California’s irrigation and its agroclimatic similarity to Morocco. As a result of the new policy, Moroccan agriculture was transformed from unprofitable wheat production into a profit-making, rapidly growing industry based mainly on citrus. By 1984, Morocco was the world’s second-largest exporter of oranges.

Farrigan and Glasmeier (2002) employed one form of geographical analog technique, called a quasi-experimental control method, to observe the net effects of the U.S. prison development boom of the 1980s and 1990s.
Similar to a true controlled experiment, the method observes changes in a dependent variable from the differences between a control group and a treatment group. Based on some pre-1970 statistics on socioeconomic conditions and distance from nearby metro areas, they created a pool of analogous rural U.S. counties. The main premise is that these analogous counties will undergo similar changes under the same socioeconomic influence. From the pool they selected two groups of counties: one with and one without a state-run prison that was constructed between 1985 and 1995. The former served as the treatment group and the latter as the control group. The net effects could then be observed as the difference in economic changes between the two groups. Such a method provides a cost-effective alternative to a true social experiment, which is normally expensive and yet yields questionable benefits over its quasi-experimental counterpart, since the requirement that the control group be absolutely unaltered by the treatment is hardly ever achieved.

Glantz (1988) pioneered the application of analogs in anthropogenic climate change research. His book, Societal Responses to Regional Climatic Change: Forecasting by Analogy, compiled a myriad of case studies of societal responses to extreme environmental change. These extreme changes were analogs of what could happen as a result of climate change. Each study identified the strengths and weaknesses of a society in coping with extreme events, such as sea/lake water-level rise, prolonged drought, and cold spells. Glantz contended that these case studies did not suggest that societies would react the same way should similar extreme events occur in the future. They only inform analysts about how well societies prepare for future extreme climate-related environmental changes and how societies could improve their preparedness. The lessons learned are not limited to the studied society; they can be applied to other analogous societies as well.
Extending Glantz’s idea, this author suggests that analysts can search for contemporary or historical geographical analogs of the future climate and future socioeconomic circumstances of a place under investigation, and then use information about those analogs to formulate a policy that reduces the vulnerability of the place to climate change. Moreover, the author argues that other societies resembling the case studies could, now or later, experience similar extreme events. Geographical analogs to the case studies can serve as priority targets for policy actions that reinforce their preparedness according to the lessons learned. For example, places that, like New Orleans, are vulnerable to hurricanes should study the infrastructure failures, communication problems, and lack of aid resources that resulted from Hurricane Katrina, evaluate their own vulnerability, and make adjustments accordingly.

1.1 Inadequate Numeric Methods and Overwhelming Qualitative Information

At the heart of geographical analog identification is the notion of similarity measurement between places. Existing methods of similarity measurement range from purely quantitative to entirely qualitative. In the prison impact example above, one can compute similarity as a multivariate Euclidean distance between a control and a treatment based on various numeric variables, e.g., population growth rate, poverty growth rate and earnings. Such methods are well established throughout the social sciences.

However, using numeric distance measures restricts analysts to comparing only those aspects (i) that can be represented numerically and (ii) for which actual values are at hand or can be acquired. Note also that numeric measures of distance, while providing the respectability usually associated with statistical approaches, can be problematic to apply when comparing relationships between places or accounting for the roles that a place plays in a regional economy. Furthermore, in a quasi-experimental control study, the Euclidean measures may suggest a few or many control places as similar to a treatment place. In such a case, and where relationships, roles, and other aspects not readily represented numerically are deemed important, an analyst must resort to additional information about those places, such as Web pages, reports, photographs, audio or video interviews, and personal notes. After considering this information, the analyst can then choose the place that is, in his/her opinion, most similar as the control location.

Each of the aforementioned sources can provide valuable information about a place. To keep the task manageable, this dissertation tackles only textual sources. Objectively evaluating textual information is a formidable undertaking. Authors can write in one of many styles, using the specialized vocabularies of their fields of interest. Readers with various backgrounds and professions will interpret such writing differently.
Not only can a word have multiple meanings, but many words can also share the same meaning. As a result, determining the sense of a word can be a challenge, let alone determining the intended sense of a sentence or a paragraph.

Clearly, analysts routinely need to go beyond numeric methods and analyze qualitative information. Such information has become abundantly available and easily accessible as a result of the Internet. In most cases, analysts no longer need to make a trip to a library since publications can be accessed via the library’s Internet site. An Internet search engine, e.g., Google, Yahoo or MSN, will return tens, hundreds and possibly millions of Web pages for a city name. Analysts can no longer manually consider all the information they can obtain and hope to analyze. They need a way to automate the analysis and combine the results with those of the numeric methods.

1.2 The Geographical Analog Engine

When searching for geographical analogs, past researchers focused on either numeric or qualitative information about places. This dissertation develops a geographical analog engine that uses both kinds of information simultaneously, providing users with the best of both worlds. The engine uses an ontology, in short a formal representation of semantics (in this case detailing the properties and relationships of places) expressed in a human- and machine-comprehensible language, to automate the acquisition and analysis of qualitative information. The automation allows users to overcome the innate time and physical limitations of information processing.

Figure 1-1 demonstrates an ontology visualized as a concept map. The map displays a portion of the physical phenomena ontology developed by the Semantic Web for Earth and Environmental Terminology (SWEET) project (Raskin and Pan 2005). The ontology includes transient events of Earth system science and related concepts, such as volcanic eruption, hurricane, El Nino, earthquake and terrorist events, and their associated concepts of time, space, earth realm, living elements and non-living elements. A circle in the example concept map represents a concept, and each link represents a relation between concepts.

One can represent an ontology in multiple ways. Another representation, N-triples, can help readers who are not familiar with ontologies interpret them. An N-triple is similar to an English sentence: it has a subject, a predicate, and an object. Links in a concept map are predicates. Child nodes are objects and parent nodes subjects. The subclass relationship is frequently translated to the “is-a” predicate in N-triples. From the physical phenomena ontology, we can say that Lighting is-a AtmosphericElectricalPhenomena or is-a .

The dissertation uses textual descriptions of places from Wikipedia.
The Wikipedia description of a place is not inherently machine comprehensible because it does not explicitly specify the meanings of words or the relationships between words and sections: for example, what property of a place a word or a paragraph describes, or whether a section is a main section or a subsection of another section. Hence, a single computer algorithm cannot directly compute the similarity of places based on meanings and relationships in text documents. Such an algorithm depends on another algorithm to extract meanings and relationships from a Wikipedia entry and create the ontology of the entry. For instance, in the same manner as the physical phenomena ontology, the ontology of an entry can specify that a paragraph describes the geography of State College, PA or that is a subsection of Entertainment. Once the ontologies of the entries for the places of interest have been created, a semantic similarity algorithm can compute the similarity between them.
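
The subject-predicate-object representation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Triple` type, the helper functions, and the two State College statements (including the `has-section` and `describes` predicates) are hypothetical names introduced here, not part of SWEET or of the engine itself. Only the Lighting statement is taken from the text.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Only the first statement comes from the text (spelled as it appears there);
# the other two are hypothetical examples of what an entry ontology might say.
triples = [
    Triple("Lighting", "is-a", "AtmosphericElectricalPhenomena"),
    Triple("StateCollegeEntry", "has-section", "Geography"),
    Triple("Geography", "describes", "StateCollegePA"),
]


def objects_of(subject: str, predicate: str, store: list[Triple]) -> list[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


def to_sentence(t: Triple) -> str:
    """Render a triple as the sentence-like form used in the text."""
    return f"{t.subject} {t.predicate} {t.obj}"


for t in triples:
    print(to_sentence(t))
```

Storing statements this way makes simple questions, such as which sections an entry has, a matter of filtering the list by subject and predicate.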

Figure 1-1: Concept map of a portion of the SWEET physical phenomena ontology

U.S. cities with populations larger than 25,000 people are selected as the test case. With the test case, this dissertation explores various semantic measures, which evaluate the similarity of places based on their qualitative information. It begins with a lexical-based measure developed by Mitra and Wiederhold (2002), and gradually improves the measure as its shortcomings are identified. The dissertation later experiments with and enhances a corpus-based measure, also one of Mitra and Wiederhold’s (2002) algorithms, in an attempt to determine the intended senses of words. The two final, best-performing measures are radically different from Mitra and Wiederhold’s original algorithms. The improvements include removing stop words from descriptions, employing domain vocabularies, decreasing computation time and incorporating a feature-set based model.

The geographical analog engine also computes similarity among places in the numeric space, and compares the results with those of qualitative information by means of graphical comparison and statistical tests. The numeric scores are computed as a function of weighted Euclidean distance between places in the statistical attribute space, which is based on the U.S. census statistics in the County and City Data Book (U.S. Census Bureau 2000). Table 1-1 shows the values of a few variables and places from the data book. Note that an endless number of place characteristics exist, and each subset is suitable for a particular purpose. To look for a medium-size, cold-winter town, for instance, one may examine variables such as population size, average January temperature and heating degree days. The development of a comprehensive set of characteristics, either quantitative or qualitative, for a particular purpose is beyond the scope of this investigation. The aim here, given a set of place characteristics, is to develop a useful approach to compare them.

Table 1-1: A Few Variables and Places from the Data Book

City                 Population   # Food Service    Average January     Heating
                                  Establishments    Temperature (˚F)    Degree Days
Houston, TX          1953631      3902              52.2                1371
Las Vegas, NV        478434       872               45.5                2407
State College, PA    38420        139               24.7                6364
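
To make the idea of a numeric score based on weighted Euclidean distance concrete, the following minimal sketch uses the values in Table 1-1. The text does not specify the exact function at this point, so three choices here are assumptions made only for illustration: each variable is standardized to z-scores, all weights are equal, and distance `d` is mapped to a similarity score with `1 / (1 + d)`.

```python
import math

# Values copied from Table 1-1: population, food service establishments,
# average January temperature (deg F), heating degree days.
cities = {
    "Houston, TX": [1953631, 3902, 52.2, 1371],
    "Las Vegas, NV": [478434, 872, 45.5, 2407],
    "State College, PA": [38420, 139, 24.7, 6364],
}


def standardize(table):
    """Convert each variable to z-scores so no single unit dominates (assumed step)."""
    columns = list(zip(*table.values()))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
        for c, m in zip(columns, means)
    ]
    return {
        name: [(x - m) / s for x, m, s in zip(row, means, stds)]
        for name, row in table.items()
    }


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two attribute vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def similarity(a, b, weights):
    """Map distance to a score in (0, 1]; this mapping is an assumption."""
    return 1.0 / (1.0 + weighted_distance(a, b, weights))


z = standardize(cities)
weights = [1.0, 1.0, 1.0, 1.0]  # equal weights, an assumption

for other in ("Las Vegas, NV", "State College, PA"):
    score = similarity(z["Houston, TX"], z[other], weights)
    print(f"Houston, TX vs. {other}: {score:.3f}")
```

Under these assumptions, Houston scores as more similar to Las Vegas than to State College. Changing the weights or the set of variables can change such rankings, which is why the choice of characteristics depends on the purpose of the comparison.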

Each set of numeric similarity scores of places is graphically combined with its semantic counterpart by means of scatter plots. The numeric scores are plotted on one axis, and the semantic scores on the other. Scatter plots can demonstrate whether two measures are correlated, and whether the mean of one measure is greater than, less than, or equal to that of the other. One can also see the ranks of places on a scatter plot. In addition to graphical comparison, statistical tests of correlation and means are performed to confirm visual observations.

The outcome of this work is a novel methodology for the comparison of places that accounts for aspects of places that can and cannot be represented numerically. The methodology enables analysts to discover geographical analogs beyond traditional means of strictly numeric or semantic methodologies. The ability of the methodology to automatically compute similarity speeds up discovery and allows a more extensive search of qualitative textual information. Although it is engineered and tested for one specific, geographically narrow case, analysts can construct ontologies of places for other cases and domains (in either the social or the physical sciences), and use the same methodology to locate relevant geographical analogs.

Chapter 2 discusses analog theory, explaining the theoretical basis for the application of analogs in this research. Chapter 3 elucidates relevant technologies, similarity measures and statistical tests, and describes the rationale for their selection where appropriate. Chapters 4, 5 and 6 report and discuss the experimental results, focusing on preliminary evaluation of the chosen datasets and similarity measures, systematic analysis of the numeric dataset and thorough evaluation of various similarity measures, respectively. The last chapter revisits the research objectives, demonstrates how well they are met, and recommends future research directions.
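
The comparison of numeric and semantic ranks described in this section can be illustrated with a short sketch of Spearman's rank correlation coefficient, computed here as the Pearson correlation of average ranks. The two score lists are hypothetical values invented for the example, not results from this dissertation, and the sketch computes only the coefficient, not its significance.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical similarity scores of five places to a reference place.
numeric_scores = [0.91, 0.75, 0.60, 0.42, 0.30]
semantic_scores = [0.80, 0.85, 0.55, 0.20, 0.35]

print(round(spearman_rho(numeric_scores, semantic_scores), 3))
```

A coefficient near 1 indicates that the two measures rank places in nearly the same order, a coefficient near -1 indicates opposite orders, and a coefficient near 0 indicates little agreement between the ranks.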

Chapter 2

Analogs

Before searching for analogs, it is essential that one understands what they are and why they are useful. This chapter begins by defining the meaning of analogs in the scope of the dissertation, and then explains how analogs play a central role in human cognition. Confusion between analogs and metaphors will be explored. The chapter ends with several examples of how researchers and practitioners widely employ analogs in their work. The relevant technologies and methods for locating analogs will be the topic of the next chapter.

2.1 Definitions

This work uses the meaning of analogs after Chorley (1964) and Hesse (1966). Analogs are “things that possess some common characteristics” (Chorley 1964), which Hesse (1966) referred to as “analogies”. When two things, which can be either physical objects or merely mental concepts, share some characteristics, they are said to be analogous or analogs of one another. The common characteristics can be relations to other things, some specific properties or both. For instance, a dog and a cat, albeit two different species, can be analogous since they have common properties: 4 legs, 2 ears and can both be kept as pets. Bill is a man, and Susie is a woman. They are friends of Steve. Although they have different genders, Bill and Susie are analogous by their friendship with Steve. In other words, they share the same relation ‘is a friend of’ and the same property ‘Steve’. Note that only one analogy is required for two things to be analogous.
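The definition can be made concrete with a minimal sketch, assuming that each thing is described by a set of (relation, value) characteristics. The set-based representation and the function names `shared_analogies` and `are_analogs` are illustrative assumptions, not part of Chorley's or Hesse's formulation.

```python
# Each thing is described by a set of (relation, value) characteristics.
# The characteristics restate the examples given in the text.
descriptions = {
    "dog": {("has legs", 4), ("has ears", 2), ("can be kept as", "pet")},
    "cat": {("has legs", 4), ("has ears", 2), ("can be kept as", "pet")},
    "Bill": {("has gender", "male"), ("is a friend of", "Steve")},
    "Susie": {("has gender", "female"), ("is a friend of", "Steve")},
}


def shared_analogies(a, b):
    """Return the characteristics (analogies) that two things have in common."""
    return descriptions[a] & descriptions[b]


def are_analogs(a, b):
    """Two things are analogs if they share at least one characteristic."""
    return len(shared_analogies(a, b)) > 0


print(shared_analogies("Bill", "Susie"))  # the shared friendship with Steve
print(are_analogs("dog", "Bill"))         # no characteristic in common here
```

Under this encoding, a single shared pair is enough for two things to count as analogs, which mirrors the note that only one analogy is required.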

2.2 Roles of Analogy Making in Human Cognition

“Human analogy making forms one of the key components of intelligence” (French 1995, 1).

Based on the given definition of analogs, this section will illustrate how analogy making plays a vital part in several cognitive processes, namely metaphorical communication, classification and abduction. A full review of mental reasoning is beyond the scope of this dissertation; nevertheless the reader will, hopefully, appreciate the pervasiveness of analogy making in human cognition and reasoning. The reader will see that metaphors are more than just analogs, and how classification can be stated as a process of grouping together analogs. It will be shown how analogs are the sample objects from which one can induce the properties of a population, and how one can generate hypotheses via analogy making.

2.2.1 Metaphors

The terms ‘analog’ and ‘metaphor’ tend to be used interchangeably in everyday language. The author proposes that the distinction should be made clear by following Black (1962) in Models and Metaphors. According to Black, metaphors are analogs with additional characteristics. Similar to analogs as defined herein, Black stated that a metaphor always involves two systems with some similarities. The first is labeled primary and is described purely by observational statements. The other is secondary and is described by observational statements as well as by familiar theories. One additional characteristic is that all concepts used in the descriptions must be understood by all parties involved, or else the audience will not understand the intended meaning of a metaphor, or worse they will not understand it at all. Black (1962) then explained that humans interpret metaphors by transferring culturally accepted theories of the secondary system to the primary system. For example, the metaphorical expression that “this issue is not written in stone” is interpreted by transferring the well-known properties of stone, namely, (1) solid and (2) resistant to change, to ‘this issue’. Another additional characteristic is that a metaphor should sound ridiculous when taken at its face value. Considering the same example, nobody would actually write an issue at hand on a stone tablet. Doing so would not miraculously make an issue more certain either. Therefore, the example expression is quite absurd when taken literally.

Black (1962) also distinguished two types of metaphors: substitution and interaction metaphors. A substitution metaphor can be replaced by an explicit description of the analogous properties. Arguably, the given example is of the substitution type. We can rewrite the statement as “this issue is not solid and resistant to change.” In an interaction metaphor, the extent of analogy is not known. The analogy itself evolves and expands as one contemplates the primary and the secondary systems. Examples of this type are interesting metaphors used in science whose fruitfulness is yet to be determined by additional research, such as life evolution for landform development, a jigsaw puzzle for plate tectonics, and the gravity model for urbanization. Davis (1899) conceptualized a land area as a living thing that grows and is altered by outside factors. The evidence is in how Davis called areas that had been transformed little by various processes ‘young’, and those that had been greatly transformed ‘mature’. It is possible that the metaphor inspired him to further search for the genetics of landform. In fact, he used the term ‘genetics’ for three quantities, namely, structure, process and time, that determine the form of a land area.

2.2.2 Classification

In Posterior Analytics, Aristotle indicated how classification is based on analogy (Barnes 1994, 98a). Aristotle stated that the names of properties that place things in the same classification were coined by analogy and handed down to us. Hesse (1966) further explained Aristotle’s point, suggesting that Aristotle had created the term ‘osseous nature’ to fill the gap in the Greek language, which did not have the word ‘vertebrate’. Aristotle placed the fish’s spine, squid’s pounce and animal’s bone in the same classification because they share this osseous nature. The author is not asserting that the names of all properties were coined by analogy; he merely attempts to reveal how classification is deeply supported by analogy and how long ago this was recognized.

Modern classification systems, such as the Köppen-Geiger climate classification, also involve analogy making. Köppen-Geiger demarcates climate zones by drawing on the world map the boundaries of areas with common ranges of property values (Christopherson 2003). The system takes into account average monthly temperatures, average monthly precipitation, and total annual precipitation. Translated by the given definition of analogs, each climate zone encompasses area analogs with property values falling in the same ranges, with the boundaries selected to delineate major biome regions. For instance, the coolest months of two tropical climate analogs must be warmer than 18 degrees Celsius.

Similarly, we can also view computational classification methods as effectively identifying and grouping analogs. Gerstengarbe et al. (1999) applied a non-hierarchical, minimum-distance clustering method to classify the climate of Europe by monthly and annual means of precipitation and surface air temperature. They obtained the data from 228 meteorological stations over the time period between 1979 and 1992, and determined by experimentation what they believed was the optimum number of climate regions, 11. The method depended on a distance matrix, from which the closest stations were selected to form climate regions. The distance was computed as the Euclidean distance in the chosen-variables space between individual stations and each centroid of the climate regions. The author argues that the distance measure essentially evaluates the similarity between stations based on their properties, that is, the chosen variables. As a result, the stations in a resulting climate region share analogous properties, and therefore are analogs of each other by the given definition.
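The distance computation described above can be sketched as follows. The station names, the two variables and all values are hypothetical placeholders, not data from Gerstengarbe et al. (1999), and only the assignment step is shown (each station joins the region whose centroid is nearest in Euclidean distance), not their full clustering procedure.

```python
import math

# Hypothetical stations: (mean annual temperature in deg C, annual precipitation in 100 mm).
stations = {
    "station_a": (2.0, 5.5),
    "station_b": (3.0, 6.0),
    "station_c": (15.0, 4.0),
    "station_d": (16.5, 3.5),
}

# Hypothetical centroids of two climate regions in the same variable space.
centroids = {
    "cool_region": (2.5, 5.8),
    "warm_region": (16.0, 3.8),
}


def euclidean(p, q):
    """Euclidean distance between two points in the chosen-variables space."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def assign_to_nearest(stations, centroids):
    """Assign every station to the region whose centroid is closest."""
    return {
        name: min(centroids, key=lambda region: euclidean(values, centroids[region]))
        for name, values in stations.items()
    }


regions = assign_to_nearest(stations, centroids)
print(regions)
```

Stations assigned to the same region are close to one another in the variable space, which is the sense in which they are analogs. In practice the variables would typically be standardized first so that no single unit dominates the distance; that step is omitted here.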

2.2.3 Abduction

Abduction is a mode of human reasoning first described in semiotics and logic by Charles Sanders Peirce (1955). Since then a number of scientists have been advocating abduction as one means for hypothesis generation (Overholt and Stallings 1976, Kuipers 1999, Sowa 2004). Sowa (2004) posited that abduction generates a new hypothesis to explain some observation(s) by searching existing knowledge for similar facts that can help explain the new observation(s). The following hypothetical example demonstrates how abduction works.


Surprising Fact        Bladder cancer rates are high in X geology, Y water service, and Z nearby industries
Background Knowledge   Arsenic exposure is high in X geology, Y water service, and Z nearby industries
Hypothesis             Arsenic exposure causes bladder cancer

First, the surprising fact about the bladder cancer rates is noticed: the high rates tend to occur in X geology, Y water service and Z nearby industries. Note that X, Y and Z denote types (e.g., karst geology, groundwater service, textile industry). Searching background knowledge, the analyst finds that arsenic exposure is also high in similar regions. Then by abduction, the analyst hypothesizes that arsenic exposure causes bladder cancer. Sowa (2004) stated that locating relevant background knowledge is a process of analogy making. In this case, the analyst searched for another phenomenon that shares the surprising characteristics of bladder cancer.

Two features of analogy proposed by Hesse (1966) can further illuminate the understanding of abduction. Hesse proposed that two types of relations exist among analogs. One is composed of the similarity/difference relations associating properties of one thing to properties of another thing, namely, the analogies this review has been discussing. The other type is composed of the causal relations among the internal properties of an object. The following example, which extends the above example, will illustrate how the latter type of relations can also constitute the analogies between objects and how an analyst generates hypotheses from them. Figure 2-1 shows the extended example. The similarity/difference and causal relations are part of the background knowledge. Possible examples of causal relations in the arsenic exposure case are “the Y water service withdraws raw water stored in the X geology” and “the Z nearby industries discharge arsenic-containing waste water into the raw water deposit.”


                        Arsenic Exposure         Bladder Cancer
Similarity Relations    X geology                X geology
                        Y water service          Y water service
                        Z nearby industries      Z nearby industries
Difference Relations    can cause a disease      is a disease
Causal Relations        (among the properties within each column)

Figure 2-1: Diagram of the extended example.

Two related hypotheses can be formulated from the background knowledge and tested: (1) the suggested causal relations are valid for the bladder cancer case as well, and therefore the raw water in the bladder cancer case is also contaminated with arsenic; and (2) arsenic exposure is the cause of bladder cancer. The first hypothesis is generated by suspecting that the bladder cancer case may also have the causal relations of the arsenic exposure case. If we think of causal relations as properties of a case, the analyst inferred unknown properties of the bladder cancer case from the known properties of the arsenic exposure case. How the analyst arrived at the second hypothesis seems more complex. It involves more than just transferring a missing property, and is likely a matter of synthesizing all available information. It is worth noting, though, that two properties with a difference relation are the most useful in this particular case. The analyst produced the second hypothesis by combining and reinterpreting two dissimilar properties: “can cause a disease” and “is a disease”.
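As a rough sketch of the first hypothesis, assume that each case is described by a set of context properties and a set of causal relations. This is a deliberately simplified encoding; the dictionary layout and the function names are illustrative assumptions, not a method taken from Sowa (2004) or Hesse (1966).

```python
# Background knowledge and the surprising observation from the hypothetical example.
background = {
    "arsenic exposure": {
        "contexts": {"X geology", "Y water service", "Z nearby industries"},
        "causal_relations": {
            "Y water service withdraws raw water stored in X geology",
            "Z nearby industries discharge arsenic-containing waste water",
        },
    },
}
surprising_fact = {
    "name": "bladder cancer",
    "contexts": {"X geology", "Y water service", "Z nearby industries"},
    "causal_relations": set(),  # not yet known for this case
}


def find_analogs(fact, knowledge):
    """Analogy making: find known phenomena that share contexts with the fact."""
    return [name for name, case in knowledge.items()
            if case["contexts"] & fact["contexts"]]


def transfer_causal_relations(fact, knowledge, analog):
    """Hypothesis (1): causal relations of the analog may hold for the fact too."""
    return knowledge[analog]["causal_relations"] - fact["causal_relations"]


analogs = find_analogs(surprising_fact, background)
candidates = transfer_causal_relations(surprising_fact, background, analogs[0])
print(analogs, candidates)
```

The sketch covers only the transfer of missing properties. The second hypothesis is not produced this way; as noted above, it depends on reinterpreting the two properties that stand in a difference relation, which this encoding does not attempt.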

2.3 Contemporary Usage of Analogs

Several scientists, such as Chorley (1964), Hesse (1966) and Oreskes (2003), have endeavored to raise awareness in the scientific community of the fact that scientific models are analogs of the study subjects. Chorley (1964) classified analogs or models into three categories: mathematical, experimental, and natural models. Mathematical models are systems of abstract mathematical symbols that represent simplified, conceptual models of pieces of the real world. Experimental models are tangible structures of simplified conceptual models. Lastly, natural models are real-world systems that resemble the systems of interest and are more familiar to scientists. Note that Chorley’s model building requires scientists to first develop conceptual models by imagining in their mind how natural systems work, and then to translate their imagination into these three types of models. Any of these models is suitable for the system it represents because it is analogous to that system in some ways, such as behavior, geometry and composition. Therefore, they are the analogs of the systems. Oreskes (2003) recently contended that the word ‘models’ increasingly refers solely to computer models of natural systems, e.g., groundwater, weather, and urban sprawl.

In the field of climate change research, Meyer et al. (1998) suggested the urban heat island as a contemporary analog to climate change. Similar to human-induced climate change, the weather around a metropolitan area warms up a few degrees Celsius. Urban settings also affect precipitation and extreme weather events. Changnon (1992) posited that (1) we can learn to adjust to human-induced climate change by studying how several metropolitan areas are coping with urban heating; (2) urban heat islands emphasize the significance of developing scientifically credible methods to identify changes within the noise of climate variability; and (3) the problem shows the lack of appropriate data and impact methodology, as well as the necessity for multidisciplinary study to understand impacts of climate change on hydrology, regional economy, and human activities.

Comparative case studies exemplify another usage of analogs. Knight (2001) stated that comparative case studies identify or evaluate universality over many cases and employ analogs as one of their organizing strategies.
When using the analog strategy, an analyst selects cases that are similar in some way and investigates further commonality. Knight noted that cases do not need to overlap in time or space. They can merely share a conceptual linkage, such as various societies’ responses to acid rain, climate change or drought. Each response can serve as a case sharing with other cases its associated phenomenon, namely, acid rain, flood or drought. For example, Knox (1995) studied a group of world cities, which were identified to be cores of economic and cultural globalization. He further found that all such cities are sites of leading global markets of commodities, bond, and equities; are locations of international high-order business service centers; and are headquarters of non-governmental organizations and inter-governmental organizations.

Diamond (2005) chose a few prehistoric and modern societies that had and had not collapsed from places he had read about or visited. He explored the chosen places to answer such questions as: “how could a society fail to have seen the dangers that seem so clear to us in retrospect?”; “can we say that their end was the inhabitants’ own fault, or that they were instead tragic victims of insoluble problems?” and “how much past environmental damage was unintentional and imperceptible, and how much was perversely wrought by people acting in full awareness of the consequences?”

Analogs can also be employed to conduct a controlled experiment. Conducting a true controlled experiment in social studies, even if possible, is often prohibitively expensive (Cook and Campbell 1979). Alternatively, social scientists often employ quasi-experimental control methods. Isserman and Rephann (1995) evaluated the effectiveness of the Appalachian Regional Commission (ARC) in promoting the Appalachian Region economy. In order to measure the effectiveness, they needed to know how the region would have developed without ARC and to compare that with how the region had changed because of ARC. Certainly, development without the ARC cannot be studied within the region itself. Instead, Isserman and Rephann looked at counties analogous to Appalachian counties, but at least 60 miles away from any of them. They considered similarity in terms of aspects that led to the establishment of ARC such as spatial isolation, economic structure, poverty, and stagnation. Farrigan and Glasmeier (2002) utilized the same method to study the economic impacts of the prison development boom on persistently poor rural places.

The commonalities of analogs are not the only valuable aspect in science; the differences are as well. Hesse (1966) termed common properties positive analogy, different properties negative analogy, and those not yet known to be either neutral analogy.
One comparative study on serologic profiles of viral hepatitis B (HB) clearly illustrates the value of neutral and negative analogies. Fang et al. (2001) compared HB cases in China and Korea. They found that the HB positive rate was highest among the Korean-Chinese (12%), followed by the Han-Chinese (7.2%) and the Korean (4.1%), respectively. However, the HB infection rate, which included people who had recovered from the disease, was highest among the Korean (78.6%), followed by the Korean-Chinese (77%) and the Han-Chinese (60.7%), respectively. They attributed the difference to the Chinese and Korean vaccination policies, a negative analogy. The Korean government mandated all Koreans to receive vaccination, while the Chinese government only mandated it for newborns. That explains why Korea at the time of the study had fewer HB-positive cases than China even though HB used to be a big problem in Korea. Fang et al. remarked that lifestyle and diet can also be major contributing factors, but the necessary data were not collected in the study. They were not certain whether there were any differences in lifestyle and diet, and therefore these two parameters are neutral analogies. Hesse (1966) posited that it is through demystifying neutral analogies that scientific innovations emerge.
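Hesse's three categories can be illustrated with a small sketch, assuming that each case is a mapping from property names to values and that `None` marks a value that was not observed. The encoding of the hepatitis B comparison is a simplified illustration, and the function name `classify_analogies` is hypothetical.

```python
def classify_analogies(a, b):
    """Split property names into positive, negative and neutral analogies.

    a, b: dicts mapping a property name to its value, or to None when unknown.
    """
    positive, negative, neutral = set(), set(), set()
    for prop in set(a) | set(b):
        va, vb = a.get(prop), b.get(prop)
        if va is None or vb is None:
            neutral.add(prop)      # not known whether the property is shared
        elif va == vb:
            positive.add(prop)     # known to be shared
        else:
            negative.add(prop)     # known to differ
    return positive, negative, neutral


# Simplified, illustrative encoding of the two cases compared in the text.
korea = {"disease studied": "hepatitis B", "vaccination mandate": "all citizens",
         "lifestyle": None, "diet": None}
china = {"disease studied": "hepatitis B", "vaccination mandate": "newborns only",
         "lifestyle": None, "diet": None}

positive, negative, neutral = classify_analogies(korea, china)
print(positive, negative, neutral)
```

The neutral set is the list of properties that still need data, which is where Hesse locates the potential for new findings.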

2.4 Conclusions

The literature review above defines the meaning and usage of analogs in this work. It denotes the possibly confusing usage of the word “analogy”, which means a common characteristic of analogs in this work, and is not synonymous with the word “metaphor”. The review describes many useful roles of analogy making in human cognition, and provides examples of contemporary application of analogs in geography, economics, epidemiology and climate change. Chapter 3 will explain two types of knowledge representation that can encode qualitative place characteristics in formats that machines can understand, and then describe various semantic similarity measures that can compute on them.

Chapter 3

Similarity: Numeric and Semantic Measures

This chapter discusses and, where appropriate, describes technologies and methods relevant to analog searching. It begins with the topic of knowledge engineering, followed by a section on various similarity measures, a section on details of the statistical methods employed in the dissertation and a section on uncertainty of similarity measures. Section 3.1 will introduce ontology, the underlying concept that is utilized here to encode place descriptions in a human- and machine-comprehensible way. Multiple languages exist for writing ontologies. Section 3.5.1 describes one ontology representation language, the Web Ontology Language (OWL). It explains how the majority of today’s Web pages are not machine comprehensible, and hence are not amenable to the various automatic place comparison methods explored by this dissertation. Section 3.5.2 depicts another ontology representation language, Semantic Networks. Semantic Networks, albeit machine comprehensible, were designed with better human comprehension in mind. Therefore, they are suitable for a place analog application, which requires human interpretation of place ontologies.

Section 3.1 covers necessary concepts and technologies in knowledge engineering that will be employed in Chapter 4 to translate Wikipedia pages into a machine-computable format. The author assumes that the reader has some basic understanding of HyperText Markup Language (HTML), Extensible Markup Language (XML) and logic. By the end of the section, the reader should be adequately equipped to comprehend the suggested translation process. Section 3.2 then describes various semantic and numeric similarity measures that work with ontologies. Section 3.3 explains statistical tests of means and correlations, which will be employed in Chapter 6 to compare similarity measures.
Although this dissertation does not quantify uncertainty in similarity measures, Section 3.4 depicts possible errors in the chosen datasets and similarity measures, and suggests a few methods to quantify them.

3.1 Ontology

In philosophy, ontology is the science of the nature of being, or types and structures of things in reality (Smith 2003). The early work can be traced back to Aristotle and Socrates (Lewis 1994). Computer scientists have adopted the name and the principles, and defined ontology according to their own applications (Guarino and Giaretta 1995). Guarino (1997) posited that the most cited definition is “an ontology is an explicit specification of a conceptualization” (Gruber 1995). Conceptualization refers to human abstraction of physical things, abstract concepts, entities and relationships among them that exist in a domain of knowledge and can be represented by a declarative formalism. Ehrig (2007) observed that Gruber’s definition was frequently extended as “an ontology is an explicit, formal specification of a shared conceptualization.” In contrast to Gruber (1995), Ehrig (2007) identified the conceptualization as identification of concepts relevant to a phenomenon and construction of an abstract model with them. He also specified the terms ‘explicit’, ‘formal’ and ‘shared’: explicit denotes that an ontology defines types and usage restrictions of concepts; formal suggests that an ontology is machine comprehensible; and shared implies that an ontology contains consensual knowledge of a group of people. As contended in Section 3.2.7.6, analog searching must involve humans in the foreseeable future. Hence, it is vital for ontologies in this work to be readable by humans as well. Therefore, this dissertation refers to ontology as:

an explicit, human and machine-readable specification of shared conceptualization.

The following two subsections will describe two ontology representation languages that are employed by the dissertation. One of them is textual, and the other is visual. It will be shown that the textual representation is geared towards machine readability, while the visual one better supports human readability.

3.5.1 Web Ontology Language (OWL)

The World Wide Web Consortium (W3C) has endorsed OWL as the ontology representation language for the Semantic Web (McGuinness et al. 2002). As a result, OWL becomes the natural choice for this dissertation, which targets the Web and the upcoming Semantic Web as the main potential source of qualitative information. Understanding OWL requires familiarity with the Resource Description Framework (RDF), XML, frames-based knowledge representation and Description Logics (DL). Each of these technologies requires an in-depth explanation, for which the readers are referred to the given sources. This review will only demonstrate how they work together in OWL. An OWL ontology can be written as a set of RDF triples (Bechhofer et al. 2004). An RDF triple is composed of three elements and typically written in the following manner (Bechhofer et al. 2004):

Subject Predicate Object

Subject is either a resource or a blank node, and object can be a resource, a blank node or a literal. Predicate is a relation between the subject and the object. Resources and literals both refer to identifiable abstract or physical things, but only resources can be identified by Uniform Resource Identifiers (URI). A relation is also a resource, and therefore must be identifiable by a URI. A URI comprises components such as a protocol, a server, a path, a file name and a fragment (Berners-Lee et al. 1998). Together they make up a unique identification of a resource. Below is the URI of a resource in the wine ontology. The URI points to a type of wine called Chenin Blanc.

http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#CheninBlanc

protocol   server       path                              filename   fragment
http       www.w3.org   /TR/2003/CR-owl-guide-20030818/   wine       CheninBlanc
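The decomposition can also be reproduced programmatically. The sketch below uses `urlsplit` from Python's standard library; note that the library calls the protocol the "scheme" and the server the "netloc", and that separating the file name from the rest of the path is done here by hand as an illustrative choice.

```python
from urllib.parse import urlsplit

uri = "http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#CheninBlanc"
parts = urlsplit(uri)

# Split the path into its directory part and the final file name.
directory, _, filename = parts.path.rpartition("/")

components = {
    "protocol": parts.scheme,
    "server": parts.netloc,
    "path": directory + "/",
    "filename": filename,
    "fragment": parts.fragment,
}
for label, value in components.items():
    print(f"{label}: {value}")
```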

Instead of pointing to things somewhere else, literals are strings that directly provide descriptions of things. The following snippet of the wine ontology shows a triple with a literal object:

base:WineDescriptor rdfs:comment “Made WineDescriptor unionType of taste and color”

(The quoted string is the literal object.)

Prefixes:
base: http://www.w3.org/TR/2003/CR-owl-guide-20030818/wine#
rdfs: http://www.w3.org/2000/01/rdf-schema#

The triple uses the prefixes “base:” and “rdfs:” to abbreviate the URIs of the subject and the predicate. The predicate indicates that the literal string is a comment of the subject. This triple can be translated into natural language as “WineDescriptor has a comment that it was made as a union type of taste and color”. Lastly, blank nodes are neither resources nor literals. Blank-node subjects or objects are given unique identification strings, which are generated by a unique string generation algorithm and are usually meaningless (e.g., $$uirk278 or nk87jk2) to avoid duplication with those of other blank nodes within the same ontology or other ontologies.

RDF is merely an abstract data model; it needs another specification for serialization, that is, the representation of an abstract model on electronic media, e.g., computer memory and hard disks (McBride 2001). The W3C recommended XML for RDF serialization (Klyne and Carroll 2004). XML is today’s standard for marking up Web documents in a machine-readable way, which is its main advantage over HTML (Hunter et al. 2007). HTML was designed only to instruct a web browser how to display a document. It does not have a mechanism for encoding what a string on a Web page refers to. For instance, consider a simple HTML page that displays the name “John Doe”:


<html>
  <head><title>example name</title></head>
  <body>
    <p>My name is John Doe</p>
  </body>
</html>

A computer program cannot easily find the name on this HTML page; doing so requires a program that can understand natural language. Below is an XML file encoding this name:

<name>
  <firstname>John</firstname>
  <lastname>Doe</lastname>
</name>

A computer program can effortlessly identify the parts of the file that are enclosed by the “firstname” tag and the “lastname” tag. The meaning of a tag is given by its XML schema, usually stored at a network-accessible location. Note that XML tags can take any names, so an XML programmer can give them meaningful names that describe the contents of the tags (Hunter et al. 2007). Therefore, even without a schema, a programmer can possibly use any lexical database to interpret the meaning of a tag.

A suitable language for the Semantic Web must enable reasoning (Horrocks 2002). OWL does this by implementing a very expressive description logic (DL) in RDF (Horrocks et al. 2003). A DL is a decidable subset of First Order Predicate Logic (FOPL), where decidable means that a method exists for determining validity (Grosof et al. 2003). A description logic ontology comprises two parts: a terminological part (Tbox) and an assertional part (Abox) (Horrocks et al. 2000). Each part contains sets of axioms, which are fundamental definitions consisting of symbols representing objects and classes, and algebraic operators that manipulate them (Giarratano and Riley 2005). Tbox states facts, that is, reliable information (Giarratano and Riley 2005), about concepts (sets of objects) and roles (binary relations). Note that all RDF predicates are binary relations since they have exactly one object (Klyne and Carroll 2004). Abox states facts about individuals (single objects) in the classes of Tbox. Horrocks et al.
(2002) explain that OWL classes and relations correspond to concepts and roles in Tbox, and OWL instances correspond to individuals in the Abox. The expressive power of OWL depends on its supported class and property constructors, and its supported types of axioms. Horrocks et al. (2003) enumerated these constructors (e.g., intersectionOf, unionOf, hasValue and oneOf) and axioms (e.g., subclassOf, sameClassAs, inverseOf and transitiveProperty). The following OWL/XML snippet from the wine ontology demonstrates the intersectionOf and hasValue constructors:

<owl:Class rdf:ID="WhiteWine">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine"/>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasColor"/>
      <owl:hasValue rdf:resource="#White"/>
    </owl:Restriction>
  </owl:intersectionOf>
</owl:Class>

This XML code constructs the class “WhiteWine” by the constructor “owl:intersectionOf”. The tag “owl:Restriction” is the OWL method for imposing a constraint on a property of a class (Powers 2003). It defines an anonymous class with exactly one property. In this case, it uses the constructor “owl:hasValue” to limit the value of the property “hasColor” to only the resource “White”. Basically, the snippet states that objects in the class “WhiteWine” must belong to the intersection of the set of objects in the class “Wine” and the set of objects in the anonymous class. Since everything in the anonymous class must have a white color (the resource “White”), the objects in the class “WhiteWine” are white wines.

In order to fully understand a subclassOf axiom, one needs to recognize that OWL borrows its structure from frames, a knowledge representation paradigm which groups relevant information into frames (Horrocks et al. 2003). Although an OWL ontology is composed of a list of RDF triples, it has a frames-based structure which assembles RDF triples into classes, instances and properties (Powers 2003).
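The relationship between a flat list of triples and a frames view can be sketched by grouping triples by subject. The triples below are a simplified, hypothetical rendering inspired by the wine ontology (the instance "bottle1" is invented for illustration), and the function `to_frames` is an assumption of this sketch rather than part of any OWL toolkit.

```python
from collections import defaultdict

# Simplified, hypothetical triples: (subject, predicate, object).
triples = [
    ("Wine", "rdf:type", "owl:Class"),
    ("WhiteWine", "rdf:type", "owl:Class"),
    ("WhiteWine", "rdfs:subClassOf", "Wine"),
    ("bottle1", "rdf:type", "WhiteWine"),  # invented instance
    ("bottle1", "hasColor", "White"),
]


def to_frames(triples):
    """Group triples by subject: each subject becomes a frame whose slots
    map a predicate to the list of its objects."""
    frames = defaultdict(dict)
    for subject, predicate, obj in triples:
        frames[subject].setdefault(predicate, []).append(obj)
    return dict(frames)


frames = to_frames(triples)
for name, slots in frames.items():
    print(name, slots)
```

Each key of `frames` plays the role of a frame, and each predicate under it plays the role of a slot.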
Each frame has slots, which hold its attributes (Nikolopoulos 1997). OWL classes and instances are frames, and properties are attributes (Horrocks et al. 2003). Taken from the frames paradigm, a class inherits properties of the class which it is a subclass of (Powers 2003). The ensuing OWL snippet illustrates such class inheritance and a transitiveProperty axiom:

<owl:ObjectProperty rdf:ID="locatedIn">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#TransitiveProperty"/>
  <rdfs:domain rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
  <rdfs:range rdf:resource="#Region"/>
</owl:ObjectProperty>
<owl:Class rdf:ID="Sauterne">
  <rdfs:subClassOf rdf:resource="#LateHarvest"/>
  <rdfs:subClassOf rdf:resource="#Bordeaux"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#locatedIn"/>
      <owl:hasValue rdf:resource="#SauterneRegion"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
<Region rdf:ID="SauterneRegion">
  <locatedIn rdf:resource="#BordeauxRegion"/>
</Region>

The tag “owl:ObjectProperty” defines a property in a class (Powers 2003). Here, it defines the property “locatedIn” as a transitive property (TP). Such a property produces axioms with the logic: TP(x,y) and TP(y,z) implies TP(x,z) (Powers 2003). The tags “rdfs:domain” and “rdfs:range” specify the domain and range of the property. The domain of locatedIn is the class “Thing”, which is the superclass of all OWL classes (Bechhofer et al. 2004). Therefore, locatedIn can be a property of any class since all subclasses inherit properties of their superclasses. The range, on the other hand, is limited to only the class “Region”. The snippet constructs the class “Sauterne” as a subclass of three other classes, including one anonymous class, which has one property constraint.
The objects of the anonymous class must have the property “locatedIn” and the value of the property must be the resource “SauterneRegion”. Since objects in the class “Sauterne” inherit all properties of their superclasses, they must, in natural language, also be in the Sauterne Region. Furthermore, the snippet states that the Sauterne Region is located in the Bordeaux Region. By the intrinsic logic of the transitive property, the snippet implies that all objects of the class “Sauterne” are also located in the Bordeaux Region.

Lastly, OWL comes in three species: OWL Lite, OWL DL and OWL Full (Powers 2003). OWL Lite supports only a subset of DL constructs and axiom types. For example, an OWL Lite class is limited to a maximum of one property of the same type. OWL Lite is suitable for representing a taxonomy since each class has at most one relationship to another class. OWL DL, as the name implies, supports all description logic constructs and axioms. While OWL DL ensures decidability of an ontology, it limits an ontology to only logically valid statements. To remedy this disadvantage, OWL Full allows the full freedom of RDF, which is a much more freeform language. Unlike OWL Lite and OWL DL, an OWL Full ontology can specify classes as individuals, and can use the DL constructs on the language itself (Horrocks et al. 2003). For example, the hasValue construct can be applied to the subclassOf relation. However, the additional expressive power comes at the cost of decidability (Horrocks et al. 2003). Users need to pick the species that is appropriate for their applications.

3.5.2 Semantic Networks

The designers of OWL recognized that OWL in RDF/XML or RDF, albeit human readable, has impaired human readability for two reasons (Horrocks et al. 2003): (1) XML is verbose; and (2) RDF and XML were designed to be utilized by a software agent (Chen 2003), an entity that operates on behalf of another (Giarratano and Riley 2005).
For human readability, another representation is needed. In the early 20th century, Charles Sanders Peirce created a graphical representation of logic called the Existential Graph (EG) to facilitate human reasoning (Dau 2006). Computer scientists have also developed a formal graph-based knowledge representation called semantic networks (Giarratano and Riley 2005). Sowa (2001) stated that, although the knowledge representation structure in a human brain was still uncertain, a graph representation served as a more probable candidate than logic formulas. Sowa (1992) observed that various kinds of semantic networks had been developed for multiple purposes, ranging from modeling human cognitive mechanisms to optimizing computational efficiency. He commented that computational motivations had occasionally produced the same network as psychological purposes. Only two of these networks will be covered here. This review will first describe formal semantic networks in artificial intelligence (AI). Then it will describe a type of semantic network called the concept map, which was initially designed to capture and investigate changes in children’s knowledge (Novak and Cañas 2008). This should allow the readers to appreciate the interrelationships and similarity among various semantic networks. Giarratano and Riley (2005) described semantic networks as a classic AI representation of knowledge in the form of propositions, that is, statements that are either true or false, such as “all humans are mammals” and “all squares have four sides”. A semantic network is a labeled, directed graph, and is displayed graphically in terms of nodes and edges as shown in Figure 3-1. Sowa (1984) called it the display form. The nodes represent classes and individuals, and the edges represent attributes. The arrows indicate the direction of attributes, pointing to the attribute values. An attribute and its value together compose a property. 
The AKO (a kind of) attribute signifies generalization, e.g., “balloon is a subclass of aircraft” and “blimp is a subclass of balloon.” A subclass inherits the properties of its superclass. Here, blimp inherits hasShape from balloon, but its value is overridden to ellipsoidal. An instance or individual of a class is indicated by the IS-A attribute, e.g., “Goodyear Blimp is an instance of blimp.” Giarratano and Riley (2005) noted that other semantic networks could use other attribute names to denote classes, instances and other relations.</p><p>[Figure 3-1 diagram not reproduced. Node labels: aircraft, balloon, propeller, jet, blimp, special, DC-3, 747, Airbus, Goodyear Blimp, Spirit of St. Louis, Air Force One, round, ellipsoidal. Edge labels: AKO, IS-A, hasShape.]</p><p>Figure 3-1: A semantic network of aircraft types. Redrawn after Giarratano and Riley’s (2005) Figure 2.5. </p><p>Novak and Musonda (1991), inspired by the usefulness of diagrams and flowcharts in science, developed the concept map as a means to represent knowledge based on psychological theories, such as assimilative learning and constructivist epistemology. In so doing, they hoped to develop a new, rigorous method for evaluating conceptual changes of students before and after a lesson. They described concept maps as “propositional structures in a hierarchical arrangement.” Examples of these propositions are “everything is made of molecules” and “animals are made of cells.” It is worth noting that these propositions are identical to propositions in propositional logic, which is the basis of the formal semantic networks in AI (Giarratano and Riley 2005). Therefore, it is not surprising that concept maps and the AI semantic networks are so similar. Developed as an informal (not machine readable) graphical technique, concept maps have several features that can be beneficial to human readability and realistic representation of human knowledge. 
Figure 3-2 shows a concept map of a second-grade student. The nodes represent concepts, and the edges represent relations. Not every edge is labeled or directed: where a label is missing, the student only knew that two concepts are related, but did not know or was confused about their relationship. The concept map also contains many misconceptions: the student thought that smell is trapped in air, and that the number of molecules in a liquid is greater than in a solid. The ability to record incomplete and incorrect propositions allowed Novak and Musonda (1991) to track changes in students’ understanding at different grade levels. Additionally, they recognized that some concepts are more general and salient than others and subsume them, and recommended that concept maps be drawn in hierarchical order, progressively more specific from top to bottom. </p><p>[Figure 3-2 diagram not reproduced. Recoverable node and link labels include “molecules”, “air”, “smell”, “water”, “ice”, “sugar”, “solids”, “liquid”, “gas”, “make up”, “have” and “such that”.]</p><p>Figure 3-2: Concept map of a second-grade student. Redrawn after Novak and Musonda’s (1991) Figure 2A. </p><p>Other researchers saw the benefits of this less formal semantic network, and have used it in their research. Rebich and Gautier (2005) employed concept maps to identify misconceptions among students about global climate change, which were obstacles to scientific understanding of the problem. Pike (2005) created a Web-based application that can capture theories as concept maps, track their changes and share them among collaborators. Note that his system encoded concept maps in OWL. Novak and his team developed a desktop application called CMapTools for similar purposes (Cañas et al. 2004), and have recently proposed a way to convert their underlying data model to OWL (Hayes et al. 2004). 
Another tool, ConceptVista, was developed as a desktop application that natively serializes concept maps in OWL (Gahegan et al. 2007), as opposed to CMapTools, which can only export its data model to OWL. ConceptVista allows users to draft informal concept maps and progressively formalize them into formal semantic networks. An important message of the last paragraph is that OWL and semantic networks are compatible. For certain semantic networks (e.g., Hayes et al. 2004; Brockmans 2004; Pike 2005; Luo 2007), translation into and from OWL does not lose any expressivity. For others (e.g., Kashyap and Borgida 2003; Martin 2007), loss of expressivity will occur. OWL can be utilized for serialization and machine comprehension, while semantic networks offer better human readability. </p><p>3.2 Similarity Measures </p><p>As suggested in Chapter 1, at the heart of analog searching lies similarity measurement among objects, concepts or situations. Therefore, the topic of this section is the main focus of this dissertation, which seeks to combine semantic and numeric similarity measures. The former type will be explained first, followed by the latter. This review describes only the details of the measures relevant to how they are implemented here to evaluate similarity of places. These measures have variants and fine-tuning features, which are not covered here. Furthermore, similarity measures are vague and imprecise. Uncertainty of similarity measures is a subject of ongoing research, deserving a dissertation of its own. Therefore, uncertainty estimation of the similarity measures employed is beyond the scope of this dissertation. Nevertheless, Section 3.4 reviews some past work on uncertainty of similarity measures for the benefit of the readers and future work. </p><p>3.8.1 Semantic Measure </p><p>The first group of measures quantifies similarity between places in qualitative space, which consists of information from Web pages. 
A Web page can contain textual, image, audio and video information, but this work focuses on the textual parts. The ensuing similarity measures compute on the semantics (meaning) of text in Web pages, and hence they are appropriately called semantic measures. The semantics of text, which is known as lexical semantics in linguistics, can be broken down into three components (Sowa 2001): (1) language structures (i.e., grammar); (2) relations between words (e.g., homonym, synonym, hypernym, hyponym and metonym); and (3) words linked together by their relations and language structures (e.g., sentence and paragraph). All of the chosen semantic measures except the last one, expert judgment, compute on the second component of text semantics, specifically synonymy. Each measure will be explained individually below. </p><p>3.8.1.1 Lexicon Based Measure </p><p>To evaluate similarity between English texts, a semantic measure can use information from a lexical database for English, such as WordNet (Miller 1995). As opposed to dictionaries, which are developed for human readability, WordNet combines the information of traditional dictionaries with modern computing. It contains over 118,000 unique word forms (strings of characters) and over 90,000 word senses (meanings). By and large, WordNet is a network of pairs of word forms and their senses. Each pair consists of one word form and one word sense. The relations between pairs include synonymy, antonymy, hyponymy, meronymy, troponymy and entailment. This dissertation experiments with the WordNet-based semantic measure developed by Mitra and Wiederhold (2002), and proposes a series of modifications to the measure. </p><p>3.8.1.1.1 Mitra and Wiederhold’s Algorithm </p><p>The algorithm takes a textual description $desc_a$ of place $p_a$ and a textual description $desc_b$ of place $p_b$ as inputs. 
Given that the lengths of $desc_a$ and $desc_b$ are $q$ and $p$ words, respectively, the algorithm constructs a $q$-by-$p$ similarity matrix of all possible pairs of words in the descriptions. $s_{kl}^{sem}$ denotes the matrix entry for the $k$th word $w_k$ in $desc_a$ and the $l$th word $w_l$ in $desc_b$. If $p \ge q$, the similarity score $s_{ab}^{sem}$ of $p_a$ and $p_b$ is computed as: </p><p>$$s_{ab}^{sem} = \frac{\sum_{k=1}^{q} \max_{l}\left(s_{kl}^{sem}\right)}{q}, \tag{3-1}$$</p><p>else </p><p>$$s_{ab}^{sem} = \frac{\sum_{l=1}^{p} \max_{k}\left(s_{kl}^{sem}\right)}{p}. \tag{3-2}$$</p><p>To compute the semantic similarity score $s_{kl}^{sem}$ between $w_k$ and $w_l$, the algorithm looks up the definition $def_k$ of $w_k$ and the definition $def_l$ of $w_l$ from WordNet. It then creates an $m$-by-$n$ similarity matrix between words in the definitions, where $m$ is the number of words in $def_k$ and $n$ the number in $def_l$. Assuming $w_{ki}$ is the $i$th word in $def_k$ and $w_{lj}$ the $j$th word in $def_l$, $s_{ij}^{sem}$ is the semantic similarity score between $w_{ki}$ and $w_{lj}$ and is computed in the same manner as $s_{kl}^{sem}$, and therefore the algorithm is recursive. If $m \ge n$, the recursive calculation can be written as: </p><p>$$s_{kl}^{sem} = s^{sem}(w_k, w_l, d) = \frac{\sum_{j=1}^{n} \max_{i}\left(s^{sem}(w_{ki}, w_{lj}, d-1)\right)}{n}, \tag{3-3}$$</p><p>else </p><p>$$s_{kl}^{sem} = s^{sem}(w_k, w_l, d) = \frac{\sum_{i=1}^{m} \max_{j}\left(s^{sem}(w_{ki}, w_{lj}, d-1)\right)}{m}, \tag{3-4}$$</p><p>where $d$ is the recursion depth. When $d = 0$, the computation becomes simple string and synonym matching: $s_{ij}^{sem} = 1$ if $w_{ki}$ is identical or synonymous to $w_{lj}$, and $s_{ij}^{sem} = 0$ otherwise. 
</p><p>3.8.1.1.2 Enhanced Version of Mitra and Wiederhold’s Algorithm </p><p>It is well known in the field of information storage and retrieval that a group of words called stop words do not help distinguish one document from another since they appear frequently in most documents (Korfhage 1997). 
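The recursion of Eqs. 3-1 to 3-4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the dictionaries `DEFINITIONS` and `SYNONYMS` are hypothetical toy stand-ins for WordNet, all function names are invented for the example, and the fallback to depth-0 matching when a word has no definition is an assumption that the equations do not address.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy lexicon standing in for WordNet (illustrative entries only).
DEFINITIONS: Dict[str, List[str]] = {
    "river": ["large", "stream", "water"],
    "creek": ["small", "stream", "water"],
    "college": ["school", "higher", "learning"],
}
SYNONYMS: Dict[str, Set[str]] = {
    "college": {"university"},
    "university": {"college"},
}


def _match(w1: str, w2: str) -> float:
    """Depth-0 case: 1 for identical or synonymous words, 0 otherwise."""
    return 1.0 if w1 == w2 or w2 in SYNONYMS.get(w1, set()) else 0.0


def _row_avg(matrix: List[List[float]]) -> float:
    """Average of the row maxima (best match for each row word)."""
    return sum(max(row) for row in matrix) / len(matrix)


def _col_avg(matrix: List[List[float]]) -> float:
    """Average of the column maxima (best match for each column word)."""
    return sum(max(col) for col in zip(*matrix)) / len(matrix[0])


def word_sim(w_k: str, w_l: str, depth: int) -> float:
    """Eqs. 3-3 and 3-4: compare two words through their definitions."""
    def_k: Optional[List[str]] = DEFINITIONS.get(w_k)
    def_l: Optional[List[str]] = DEFINITIONS.get(w_l)
    # Assumption: fall back to depth-0 matching when a definition is missing.
    if depth == 0 or not def_k or not def_l:
        return _match(w_k, w_l)
    matrix = [[word_sim(a, b, depth - 1) for b in def_l] for a in def_k]
    if len(def_k) >= len(def_l):   # m >= n: Eq. 3-3 averages over def_l
        return _col_avg(matrix)
    return _row_avg(matrix)        # otherwise Eq. 3-4 averages over def_k


def description_sim(desc_a: List[str], desc_b: List[str], depth: int) -> float:
    """Eqs. 3-1 and 3-2: compare two descriptions given as word lists."""
    if not desc_a or not desc_b:
        return 0.0
    matrix = [[word_sim(a, b, depth) for b in desc_b] for a in desc_a]
    if len(desc_b) >= len(desc_a):  # p >= q: Eq. 3-1 averages over desc_a
        return _row_avg(matrix)
    return _col_avg(matrix)         # otherwise Eq. 3-2 averages over desc_b
```

With this toy lexicon, `word_sim("river", "creek", 1)` compares the two three-word definitions and finds two matching words, so the score is 2/3, whereas the same call at depth 0 returns 0 because the words are neither identical nor listed as synonyms.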
The enhanced algorithm developed as part of this research purges stop words from an input text, using the list of stop words provided with JWordNet, a pure Java API library for WordNet (Johar and Simha 2004). Furthermore, humans learn and recognize concepts by associating new observations with known concepts or by classifying them into categories (Lakoff 1990). The enhanced algorithm imitates such a cognitive process by computing similarity scores between each word in an input text and those in a predefined list of categories according to the original algorithm. Any word with a similarity score less than a set limit is removed from the input text, leaving behind only words comparable to predefined categories. After these two removal steps, the computation continues in the same manner as the original algorithm. </p><p>3.8.1.2 Feature-Set Based Measure </p><p>Some semantic similarity measures based on set theory treat objects or, in this case, places as sets of features (Tversky 1977). These measures compute similarity scores as functions of the number of common and distinctive features of places. The features can be nominal or ordinal values. Below are two feature-set based similarity models suggested by Tversky (1977): </p><p>$$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A), \tag{3-5}$$</p><p>and </p><p>$$S(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}. \tag{3-6}$$</p><p>Eq. 3-5 is known as the contrast model and Eq. 3-6 as the ratio model, as their arithmetic forms imply. $S(a,b)$ denotes an interval similarity scale between Place a and Place b, while $A$ and $B$ represent the sets of features of Place a and Place b, respectively. $A \cap B$ is the set of common features, $A - B$ the set of features that belong to Place a but not Place b, and $B - A$ the set of features that belong to Place b but not Place a. $\theta$, $\alpha$ and $\beta$ are model parameters that take different values. 
For example, if $\theta = 1$ and $\alpha = \beta = 0$, Eq. 3-5 reduces to $S(a,b) = f(A \cap B)$ and the similarity score depends solely on the common features. Therefore, Eqs. 3-5 and 3-6 not only represent two measures, but also generalize several similarity measures. $f$ signifies an interval scale function that measures the contribution of any particular feature(s) to the overall similarity score. For instance, $f(A)$ measures the contribution of all the features of Place a. </p><p>3.8.1.2.1 Basic Synonym Count (BSC) </p><p>Basic Synonym Count (BSC) radically diverges from Mitra and Wiederhold’s algorithm by introducing the feature-set theory. Mitra and Wiederhold’s algorithm does not account for the frequencies of a term in both descriptions. For example, suppose the word “university” appears five times in the description of Place A and only once in that of Place B. A university is likely to be an important aspect of Place A, but not of Place B. When the description of Place A is shorter than that of Place B, Mitra and Wiederhold’s algorithm will count all five occurrences of the word “university” towards the similarity score of Place A and Place B, resulting in an artificially high similarity score. An intelligent measure should consider only one occurrence of the word. BSC is such an intelligent measure. It combines lexical information with the feature-set theory. After removing stop words from a place description, BSC considers each remaining term as a feature of the place. The set of features of Place A will contain five “university” features and the set of Place B one “university” feature. Regardless of the description lengths, BSC considers Place A and Place B to have only one “university” feature in common. A feature of one place is common to a feature of another place if the features, namely the terms, are synonyms. Note that a term is a synonym of itself here. Synonymy is determined by Mitra and Wiederhold’s algorithm with the recursion depth $d = 0$. 
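The contrast and ratio models, and the way a set-based count collapses repeated terms, can be illustrated with a small Python sketch. It makes two simplifying assumptions that are not part of the measures described here: $f$ is taken to be a plain feature count, and two features are treated as common only when the strings are identical (no synonym lookup). The function names and the example feature sets are hypothetical.

```python
from typing import Callable, Set


def contrast_model(A: Set[str], B: Set[str], theta: float = 1.0,
                   alpha: float = 1.0, beta: float = 1.0,
                   f: Callable[[Set[str]], float] = len) -> float:
    """Tversky's contrast model: common features minus distinctive ones."""
    return theta * f(A & B) - alpha * f(A - B) - beta * f(B - A)


def ratio_model(A: Set[str], B: Set[str], alpha: float = 1.0,
                beta: float = 1.0,
                f: Callable[[Set[str]], float] = len) -> float:
    """Tversky's ratio model: common features over a weighted total."""
    common = f(A & B)
    denominator = common + alpha * f(A - B) + beta * f(B - A)
    return common / denominator if denominator else 0.0


# Repeated terms collapse to a single feature once a description is a set.
terms_a = ["university"] * 5 + ["river", "museum"]
place_a = set(terms_a)  # {"university", "river", "museum"}
place_b = {"university", "river", "harbor", "airport"}

only_common = contrast_model(place_a, place_b, theta=1.0, alpha=0.0, beta=0.0)
ratio_a_only = ratio_model(place_a, place_b, alpha=1.0, beta=0.0)
```

Here `only_common` is 2 (the shared features “university” and “river”), and `ratio_a_only` is 2/3, because Place A has three features in total and $\beta = 0$ removes the distinctive features of Place B from the denominator.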
BSC scores $s_{ab}^{bsc}$ are computed as Eq. 3-6 when $\alpha = 1$ and $\beta = 0$: </p><p>$$s_{ab}^{bsc} = S(a,b) = \frac{f(A \cap B)}{f(A \cap B) + f(A - B)}. \tag{3-7}$$</p><p>The function $f$ is merely a simple feature count. $f(A \cap B)$ equals the number of common features, and $f(A \cap B) + f(A - B)$ can be simplified to $f(A)$. $A$ represents the smaller of the feature sets of Place a and Place b, which keeps the scores between 0 and 1. </p><p>3.8.1.2.2 Synonym Count with a Vocabulary (SCV) </p><p>Synonym Count with a Vocabulary (SCV) extends BSC to take into account previously known domain concepts as in the enhanced Mitra and Wiederhold algorithm, which is described above in Section 3.8.1.1.2. After SCV removes stop words from a description in the same manner as BSC, it further removes words that are not in a domain vocabulary, which is constructed from a list of categories by purging any stop words on the list. SCV regards the remaining terms as the features of a place, and computes scores using Eq. 3-7. </p><p>3.8.1.3 Corpus Based Measure </p><p>Besides using a lexical database, similarity of terms can be computed based on their context in a document corpus. Mitra and Wiederhold (2002) also provided this alternative, which evaluates the context of a term by the frequencies of the surrounding terms. The context of a term t is defined by the encompassing 1000-character window. For example, the 30-character window of the term “annual” in the sentence “the City of New York holds its annual New Year Eve celebration in Time Square” is “York holds its New Year Eve”. Note that the window ignores partial terms. Part of the term “celebration” is within 15 characters to the right, but the entire term is ignored. The corpus-based measure converts a context window to a term frequency vector. The size of a term vector is the number of unique terms in the corpus. Each element of the vector represents the frequency of a unique term within a context window. 
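The window example above can be reproduced with a short Python sketch. It is an illustration under assumptions rather than Mitra and Wiederhold’s implementation: the window is taken as half of the width on each side of the first occurrence of the term, terms are split on whitespace, and a term cut by either edge of the window is dropped. The function names `context_window` and `term_vector` are hypothetical.

```python
from collections import Counter
from typing import List


def context_window(text: str, term: str, width: int) -> List[str]:
    """Whole terms within width/2 characters on each side of `term`."""
    term_start = text.find(term)  # first occurrence only (assumption)
    if term_start < 0:
        return []
    half = width // 2
    left_start = max(0, term_start - half)
    right_end = min(len(text), term_start + len(term) + half)
    left_terms = text[left_start:term_start].split()
    right_terms = text[term_start + len(term):right_end].split()
    # Drop a term that the left edge of the window cuts in two.
    if (left_terms and left_start > 0
            and not text[left_start - 1].isspace()
            and not text[left_start].isspace()):
        left_terms = left_terms[1:]
    # Drop a term that the right edge of the window cuts in two.
    if (right_terms and right_end < len(text)
            and not text[right_end - 1].isspace()
            and not text[right_end].isspace()):
        right_terms = right_terms[:-1]
    return left_terms + right_terms


def term_vector(window_terms: List[str], vocabulary: List[str]) -> List[int]:
    """Frequency of every vocabulary term inside one context window."""
    counts = Counter(window_terms)
    return [counts[t] for t in vocabulary]


sentence = ("the City of New York holds its annual New Year Eve "
            "celebration in Time Square")
window = context_window(sentence, "annual", 30)
# Assumption: the one sentence is the whole corpus, so it defines the vocabulary.
vocabulary = sorted(set(sentence.split()))
vector = term_vector(window, vocabulary)
```

For the example sentence, `window` contains the six terms York, holds, its, New, Year and Eve; the fragment of “celebration” that falls inside the 15 characters to the right is discarded, as in the text.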
</p><p>The measure calculates the similarity $s_{12}^{cb}$ between the contexts of two terms as the cosine of their context vectors, the commonly chosen method in text mining and information retrieval (Dhillon and Modha 2001). Eq. 3-8 shows the cosine calculation (Li and Jain 1998): </p><p>$$s_{12}^{cb} = S(t_1, t_2) = \frac{T_1^{\mathsf{T}} T_2}{\lVert T_1 \rVert \, \lVert T_2 \rVert}. \tag{3-8}$$</p><p>$T_1$ and $T_2$ signify the term vectors of the terms $t_1$ and $t_2$, respectively. $\lVert \cdot \rVert$ indicates the norm of a vector. Mitra and Wiederhold (2002) used the Euclidean norm, $\lVert T \rVert_2 = \sqrt{\sum_{i=1}^{k} t_i^2}$, where $T$ is a term vector, $k$ is the size of the vector, and $t_i$ is the $i$th entry of the vector. </p><p>3.8.1.3.1 Corpus Based Synonym Count (CBSC) </p><p>Corpus Based Synonym Count (CBSC) is BSC in which Mitra and Wiederhold’s algorithm with the recursion depth $d = 0$ is replaced by the corpus-based measure described immediately above. Two terms are considered synonymous when their similarity score is over a set limit. Except for the way of determining synonyms, CBSC is calculated in the same manner as BSC with Eq. 3-7. </p><p>3.8.1.4 Graph Based Measure </p><p>A semantic measure can compute similarity based on relationships of words in some conceptual networks, e.g., WordNet and concept maps. For instance, the climate of San Francisco is under the influence of the Pacific High and the climate of Portugal under the influence of the Azores High. A feature-set based similarity measurement as described in 3.8.1.2 will regard the features of the two climates as completely different. However, according to the conceptual network of atmospheric phenomena shown in Figure 3-3, both are high pressure cells. Therefore, a more accurate similarity measure should be able to account for their taxonomic relationships. 
</p><p>[Figure 3-3 diagram: a hierarchy rooted at Atmospheric Phenomena, which divides into Precipitation and Pressure Cell. Precipitation divides into Liquid Precipitation (Rain) and Solid Precipitation (Snow, Hail). Pressure Cell divides into Low Pressure Cell (Icelandic Low) and High Pressure Cell (Central Valley High, Azores High, Pacific High).]</p><p>Figure 3-3: Conceptual network of some atmospheric phenomena. </p><p>Measuring the path length between the names of features in the conceptual network or taxonomy to which they belong is one method that can account for meanings of terms (Jiang and Conrath 1997). The path length is simply the number of edges (links) separating the terms. In this case, two edges separate the term Pacific High from the term Azores High, and therefore the path length equals two. Should there be multiple possible paths, one would choose the shortest. Some pairs, however, have the same path length, but may not be as similar. The Central Valley High and the Pacific High, for example, may not be as similar as the Pacific High and the Azores High. Mitra and Wiederhold’s algorithm can be shown to ameliorate such a problem by exploiting a definition network. Instead of navigating from one term up a taxonomic hierarchy and down to the other term like the path length measure, the algorithm navigates down the network of definitions as shown in Figure 3-4. In the simple case when the recursion depth in Eq. 3-3 and Eq. 3-4 equals 0, the similarity of the Pacific High and the Azores High depends on the number of common terms in their definitions shown in Figure 3-5. The blue, bold terms indicate the common terms between the definitions. The similarity score is the number of common terms divided by the number of terms in the shorter description. By computing on descriptions in such a manner, Mitra and Wiederhold’s algorithm can further distinguish terms that are separated by an identical number of edges in a taxonomy. 
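The path length measure can be sketched in Python on the taxonomy of Figure 3-3. The `PARENT` table below is a transcription of that figure and should be read as an assumption about its layout; the function names are hypothetical, and the sketch handles only a tree in which every node has a single parent, so the shortest path always runs through the lowest common ancestor.

```python
from typing import Dict, List, Optional

# Child -> parent links transcribed from Figure 3-3 (assumed layout).
PARENT: Dict[str, Optional[str]] = {
    "Atmospheric Phenomena": None,
    "Precipitation": "Atmospheric Phenomena",
    "Pressure Cell": "Atmospheric Phenomena",
    "Liquid Precipitation": "Precipitation",
    "Solid Precipitation": "Precipitation",
    "Low Pressure Cell": "Pressure Cell",
    "High Pressure Cell": "Pressure Cell",
    "Rain": "Liquid Precipitation",
    "Snow": "Solid Precipitation",
    "Hail": "Solid Precipitation",
    "Icelandic Low": "Low Pressure Cell",
    "Central Valley High": "High Pressure Cell",
    "Azores High": "High Pressure Cell",
    "Pacific High": "High Pressure Cell",
}


def ancestors(node: str) -> List[str]:
    """The node itself followed by its ancestors up to the root."""
    chain: List[str] = []
    current: Optional[str] = node
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def path_length(a: str, b: str) -> int:
    """Number of edges between two terms via their lowest common ancestor."""
    steps_from_b = {n: i for i, n in enumerate(ancestors(b))}
    for steps_from_a, n in enumerate(ancestors(a)):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    raise ValueError("terms do not share a root")
```

`path_length("Pacific High", "Azores High")` returns 2, matching the count in the text, and `path_length("Pacific High", "Central Valley High")` also returns 2, which is exactly the tie that the path length measure cannot break.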
</p><p>3.8.1.5 Measure Based on Information Content </p><p>Considering again the conceptual network of atmospheric phenomena in Figure 3-3, the similarity between rain and hail may not be the same as that between the Pacific High and the Icelandic Low. Nevertheless, the shortest path method will produce the same similarity score because it assumes that each link represents an identical distance. To avoid such assumptions, Resnik (1995) suggested an alternative method based on the information content (IC) of concepts, which can be estimated as the negative log of the probability $p(c)$ of encountering a concept: </p><p>$$IC(c) = -\log p(c). \tag{3-9}$$</p><p>The function $p$ is monotonic: it increases toward 1 as one goes up a conceptual network and decreases toward 0 as one goes down to more specific concepts. This implies two things: (1) the more specific a concept, the higher its IC; and (2) the higher the probability of encountering a concept, the smaller its IC and vice versa. The similarity of two concepts is the degree to which they share information. The amount of information shared is indicated by the IC of the most specific concept that subsumes both concepts. In the case of rain and hail, that would be the IC of precipitation, and for the Pacific High and the Icelandic Low, the IC of the pressure cell. </p><p>[Figure 3-4 diagram not reproduced: the terms Pacific High and Azores High are each linked by “defines” edges to their definitions, whose words are in turn linked by “defines” edges to further definitions.]</p><p>Figure 3-4: Example conceptual network of definitions. </p><p>Pacific High: “(Or Pacific anticyclone.) The nearly permanent subtropical high of the North Pacific Ocean centered, in the mean, at 30°–40°N and 140°–150°W. On mean charts of sea level pressure, this high is a principal center of action.” </p><p>Azores High: “The semipermanent subtropical high over the North Atlantic Ocean, so named especially when it is located over the eastern part of the ocean. The same high, when displaced to the western part of the Atlantic, or when it develops a separate cell there, is known as the Bermuda high. On mean charts of sea level pressure, this high is one of the principal centers of action in northern latitudes.” </p><p>Figure 3-5: Definitions of the Pacific High and the Azores High from the American Meteorological Society glossary <http://amsglossary.allenpress.com/glossary/browse>. </p><p>3.8.1.6 Expert Judgment </p><p>Computers cannot yet replace humans. In Expert Systems, Nikolopoulos (1997) stated that the dream of human-like machines had broken the hearts of AI scientists for the last five decades. Automatic content analysis, part of which is text analysis, is not an exception. Computer software can be used to analyze content and suggest conclusions, but human experts still have to perform additional prior analysis (e.g., data preparation) and post analysis (e.g., output interpretation) and make the final decisions (Popping 1997). Furthermore, much place information on Web pages is likely to be available in formats not amenable to any of the computational methods described in the previous subsections, such as images, audio and video clips, diagrams and graphs. Human experts have to review and extract relevant information out of these resources manually. Then, based on their expertise in the problem domain, they assign similarity scores to places, either in writing or as a mental note, and rank them. </p><p>3.8.2 Numeric Measure: Euclidean Similarity </p><p>For numeric characteristics of places, the most common measure of similarity is based on the locations of places in a Euclidean space (Pal 2004). Let $P = (p_1, p_2, p_3, \ldots, p_m)$ denote a collection of $m$ places. If $u_{ji}$ is the $i$th characteristic of the $j$th place, the $j$th place can be represented as an $n$-dimensional vector $p_j = (u_{j1}, u_{j2}, u_{j3}, \ldots, u_{jn})$. 
The salience of each characteristic with respect to the comparison of places can be incorporated into the measure of similarity by applying a weight $w_i$ ($w_i \in [0,1]$) to each $u_{ji}$. Each characteristic is normalized by its range. Using the notations above, one can compute a weighted Euclidean distance metric $d_{ab}^{(w)}$ between Place a and Place b in $P$ as: </p><p>$$d_{ab}^{(w)} = d^{(w)}(p_a, p_b) = \left[ \sum_{i=1}^{n} w_i^2 \left( p_{ai} - p_{bi} \right)^2 \right]^{\frac{1}{2}}, \tag{3-10}$$</p><p>where $p_{ai}$ denotes the $i$th component of $p_a$, i.e., $u_{ai}$. A smaller distance between two places implies that they are more similar than pairs with larger distances. Many variants of weighted Euclidean distance exist. Each was developed for a special purpose. Based on the Euclidean distance, the weighted Euclidean similarity score $s_{ab}^{ed(w)}$ can be calculated as: </p><p>$$s_{ab}^{ed(w)} = \frac{1}{1 + d_{ab}^{(w)}}. \tag{3-11}$$</p><p>3.8.2.1 Principal Component Analysis </p><p>The selected numeric dataset, the City Tables in the County and City Data Book, has a large number of attributes: 133 variables. It is difficult to understand relationships in such a large set of variables, and how they impact the similarity between places. Principal component analysis (PCA), a type of factor analysis, reduces the number of variables into a small set of orthogonal components that describe the majority of the variance in a dataset (Rogerson 2001). In this work, PCA is performed with SPSS for Windows Version 11 (SPSS Inc. 2001). PCA reduces the dimensions of a dataset by aggregating total variance into a few components, which are linear combinations of the original variables (Davis 1986). PCA actually generates m orthogonal components from a dataset of m variables. The eigenvalue of a component indicates the amount of total variance explained by the component. When an eigenvalue is less than 1, the component explains less of the total variance than any one of the original variables; thus components with eigenvalues less than 1 are discarded. 
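The eigenvalue cutoff can be made concrete with a short worked relation. It assumes that the components are extracted from the correlation matrix $R$ of $m$ standardized variables, so that every original variable has unit variance; the symbol $\lambda_j$ for the eigenvalue of component $j$ is introduced here for illustration, and the numbers in the example are hypothetical.

```latex
\[
\sum_{j=1}^{m} \lambda_j = \operatorname{tr}(R) = m,
\qquad
\text{share of total variance explained by component } j = \frac{\lambda_j}{m},
\qquad
\text{share of one standardized variable} = \frac{1}{m}.
\]
\[
\text{With } m = 133:\quad
\lambda_j = 13.3 \;\Rightarrow\; \frac{13.3}{133} = 10\%,
\qquad
\lambda_j = 0.5 \;\Rightarrow\; \frac{0.5}{133} \approx 0.38\% \;<\; \frac{1}{133} \approx 0.75\%.
\]
```

Under that assumption, a component with an eigenvalue below 1 carries less variance than a single original variable, which is the rationale for discarding it.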
This dissertation standardizes the original variables by computing the principal components based on their correlation matrix (Davis 1986). As shown by Eq. 3-12, the score $F_j$ of a resulting component can be calculated from the component loadings $W_{ji}$, which are the coefficients of the linear combination for the component (SPSS Inc. 2006). The component loading associated with an original variable $X_{ji}$ signifies the correlation between the component and the variable. </p><p>$$F_j = \sum_{i=1}^{m} W_{ji} X_{ji}. \tag{3-12}$$</p><p>For ease of interpretation, the resulting components are rotated by varimax rotation, which is the most popular orthogonal rotation method (SPSS Inc. 2006). It keeps the rotated components uncorrelated, but maximizes the values of large component loadings and minimizes the values of small component loadings. The rotation results in a minimum number of intermediate loadings, making it clear with which original variables a component is most correlated. </p><p>3.8.2.2 K-Means Cluster Analysis </p><p>Besides having a large number of variables, the City Tables also contain over 1000 cities. In order to further aid data interpretation, this dissertation employs cluster analysis to automatically group cities together. Both cluster analysis and PCA are methods of data reduction (Rogerson 2001). While PCA reduces the number of variables, cluster analysis reduces the number of observations, namely, cities. The employed cluster analysis is called k-means cluster analysis (KCA), which is selected for two reasons: (1) KCA has been proven to work well with PCA and Euclidean distance (Yeung and Ruzzo 2001); and (2) KCA performs faster than the commonly employed hierarchical clustering methods (Davis 1986). For n observations and k clusters, KCA only needs to compute an n-by-k similarity matrix, compared to an n-by-n matrix in the hierarchical methods. Davis (1986) described KCA as a non-hierarchical, arbitrary-origin clustering method. 
It does not iteratively join the most similar observations and create a hierarchy of observations as in the most commonly used cluster methods in earth sciences. Instead, in order to generate k clusters from n observations, KCA chooses k arbitrary points and computes the n-by-k matrix of similarities between the observations and the arbitrary points. The observations closest to an arbitrary point are chosen to join the point to form a cluster. Once all observations have joined a cluster, KCA computes the mean of each cluster and uses the means to produce k new clusters in the same manner. The iteration stops when the changes in locations of the k means are less than a specified tolerance. As with PCA, this dissertation conducts KCA with SPSS for Windows Version 11 (SPSS Inc. 2001). The analysis is configured to use the Euclidean distance metric. SPSS chooses the arbitrary starting points by finding k well-separated observations (SPSS Inc. 2006). </p><p>3.3 Statistical Tests of Similarity Measures </p><p>Given all the similarity measures above, which one or which combination should be used? In order to answer such a question, we need to know whether or not these measures give us the same information and whether or not one is more useful than the others. The latter question requires that we know the correct solution to a similarity problem, against which we evaluate the solutions of our measures. Establishing the correct solution to a problem is of course a subjective exercise, and not the objective of this work. This work only attempts to demonstrate that semantic similarity measures provide analysis alternatives to the traditional numeric measures, allowing analysts to utilize the vast wealth of non-numeric data. Statistical methods exist for testing whether one measure yields the same similarity scores as another. This section describes the statistical methods employed here. 
The goals are to illustrate statistically that the proposed similarity measures are either equal or orthogonal. Equality of two measures requires that the measures have the same mean and a correlation coefficient $r_{12}$ of 1 (SPSS Inc. 2006). In other words, one must test the null hypothesis $H_0: \mu_1 = \mu_2$ and calculate the correlation coefficient, where $\mu_1$ is the population mean of one measure and $\mu_2$ that of the other. If $H_0$ is accepted and $r_{12} = 1$, the measures are said to be statistically equal or interchangeable. If $H_0$ is rejected, but $r_{12}$ is close to 1, the measures are not equal, but correlated. Correlated and equal measures do not provide much, if any, additional information. Given a choice between such measures, analysts only need to consider the one that is cheaper to obtain or computationally faster. On the other hand, orthogonal measures give analysts different perspectives of their study subjects. One tests the null hypothesis $H_0: \rho_{12} = 0$ to check for orthogonality, where $\rho_{12}$ signifies the population correlation coefficient between two measures. If $H_0$ is accepted, the measures are orthogonal. A class of statistical tests called nonparametric tests has been devised to deal with cases of unknown population distributions and small datasets as in this work. Normally, the statistical tests suited for the hypotheses above are the t-test for means and Pearson’s test for correlation coefficients (Haggett 1966). However, the t-test requires that the similarity scores of each measure be normally distributed, and Pearson’s test requires that the scores of two measures together have a bivariate normal distribution. One needs a large number of scores to observe a distribution, which is not possible for the semantic measures herein (10 scores each). Moreover, even if there were a sufficiently large number of semantic scores, the scores might not be normally distributed. 
Following the suggestions of Rogerson (2001), the Kruskal-Wallis test and Spearman’s rank correlation coefficient test (see below for their formulation) are employed to test means and correlation coefficients, respectively. These two nonparametric tests only require that the scores be computed independently. All of the proposed measures satisfy this requirement since the computation of one score does not affect the others. One can compute scores in any order across cities and measures and get the same results. </p><p>3.3.1 Kruskal-Wallis Test </p><p>Adapted from Rogerson (2001), one can test the means of scores of two or more measures using the Kruskal-Wallis test as follows. The test begins by pooling together the scores of the measures. Given a total of N scores, the smallest score is assigned a rank of 1 and the biggest a rank of N. Then one computes the sum of ranks for each measure. The intuition underlying the test is that the sum of ranks of one measure should be close to that of the other measures if their means are equal. The Kruskal-Wallis test statistic H can be calculated as: </p><p>$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1), \tag{3-13}$$</p><p>where $R_i$ is the sum of ranks of measure $i$, $n_i$ is the number of scores of measure $i$, and $k$ is the number of measures. Since H approximately follows the chi-square distribution with $k-1$ degrees of freedom, one uses chi-square distribution tables to look up critical values above which the null hypothesis is rejected. When some scores are tied, one assigns the average of the ranks to each tied score. A short example illustrates this best. Given 5 scores: 0.12, 0.30, 0.30, 0.45 and 0.70, the ranks are 1, 2.5, 2.5, 4 and 5. One also divides H by the correction term, defined by: </p><p>$$1 - \frac{\sum_{i} \left(t_i^3 - t_i\right)}{N(N^2 - 1)}, \tag{3-14}$$</p><p>where $t_i$ is the number of scores tied at the $i$-th tied rank. The correction increases the power of H in rejecting a false hypothesis when there are tied scores. 
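</p><p>The ranking and correction steps above can be sketched in a few lines of Python. This is a minimal illustration of Equations 3-13 and 3-14, not the SPSS procedure used in this dissertation; the function names `average_ranks` and `kruskal_wallis_h` are hypothetical and introduced only for this example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H of Equation 3-13 divided by the tie correction of Equation 3-14.

    `groups` holds one list of scores per measure. If every pooled score is
    identical, the correction term is zero and the statistic is undefined.
    """
    pooled = [score for group in groups for score in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])  # R_i
        weighted_sum += rank_sum ** 2 / len(group)       # R_i^2 / n_i
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)

    # t_i is the size of each group of tied scores; untied scores contribute 0.
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total * (n_total ** 2 - 1))
    return h / correction
```

For two measures (k = 2), the returned value would be compared against the chi-square critical value for one degree of freedom, which is 3.84 at the 0.05 level.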
</p><p>3.3.2 Spearman’s Rank Correlation Coefficient Test </p><p>As the name implies, the correlation coefficient is a function of score ranks (Rogerson 2001). Unlike in the Kruskal-Wallis test, the two sets of scores are ranked individually, and the numbers of scores in the two sets must be equal. A rank of 1 is assigned to the lowest value and a rank of n to the highest, where n is the number of scores in each set. Spearman’s correlation coefficient $r_s$ is computed as: </p><p>$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \tag{3-15}$$</p><p>where $d_i$ is the difference between the two ranks of the same city. To test $H_0: \rho_{12} = 0$, one can use a t-distribution since $r_s \sqrt{n-1}$ has a t-distribution with $n-1$ degrees of freedom. </p><p>3.4 Uncertainty of Similarity Measures </p><p>Reality is neither crisp nor determinate, but rather fuzzy and uncertain (Pfoser et al. 2005). Although making the uncertainty of similarity scores explicit is beyond the scope of this dissertation, this section describes known sources of uncertainty in the employed datasets and suggests possible uncertainty estimation methods for the benefit of readers and future work. U.S. census statistics, of which the County and City Data Book (U.S. Census Bureau 2000) is a part, contain sampling and nonsampling errors (U.S. Census Bureau 2002). The sampling errors stem from the fact that the U.S. Census Bureau only surveys subsets of the U.S. population, and hence census data are not U.S. population data. The other set of errors, nonsampling errors, comes from data collection mistakes, dishonest survey answers and data processing blunders. It is difficult to identify nonsampling errors, and therefore one should prevent them in the first place (U.S. Census Bureau 2002). The County and City Data Book comes with neither an estimate of the errors nor the raw survey data from which uncertainty can be estimated. 
For other datasets in which raw data are available, one can sample or resample the raw data and compute the confidence interval of a data point (Cunderlik and Burn 2006). Similarly, the confidence interval of a similarity score can be computed by sampling or resampling raw data (Blanzieri and Portinale 2000). For example, Lele (1995) explained one simple technique to construct a confidence interval, the bootstrap technique. The bootstrap technique is commonly chosen when one does not know the underlying statistical model of a dataset. To construct the bootstrap confidence interval of a similarity score between Place A and Place B, one randomly samples or resamples a data point of Place A and a data point of Place B, and computes the similarity score between the two data points. This process is repeated k times, preferably 300 to 1000 times, producing a set of k similarity scores. One then ranks the scores in ascending order, and removes the first 5 percent and the last 5 percent in order to create the 90 percent confidence interval. The lowest remaining score is the lower bound of the confidence interval, and the highest remaining score the upper bound. A similarity measure can directly take confidence intervals into account. Cunderlik and Burn (2006) incorporated sampling errors into a similarity measure by calculating similarity between two objects as a function of the confidence intervals of the attribute values of the objects. For a two-dimensional attribute space, Cunderlik and Burn conceptualized a confidence interval as the ellipse around a data point. The similarity score between two data points is the minimum confidence level at which the confidence ellipses of the data points intersect. A small minimum confidence level indicates high similarity between two data points. Semantic data have different sources of errors from the U.S. census data. 
Cross (2003) argued that the syntactic representation of ontologies cannot completely capture the semantics of the characteristics and relationships of places, and therefore ontology-based semantic similarity always carries some degree of uncertainty. For example, a string matching similarity measure will not be able to correctly compare homonyms. Homonyms appear identical to such a measure, even though they do not share the same sense. Cross (2003) also noted uncertainty in heuristic-rule-based similarity. For example, the rule that nodes of two ontologies are similar if their parent and child nodes are similar as well may not be a good similarity rule for some ontologies. In addition, Cross (2003) contended that the cut-off score above which two concepts are said to be similar presents another source of uncertainty. Altering such a cut-off score changes the accuracy of a similarity measure. A similarity measure will indicate that two similar concepts are not similar if the cut-off score is too high, and that two dissimilar concepts are similar if it is too low. In cases similar to this work, where the reference similarity between things has not been established, Gal (2006) proposed a way to manage uncertainty by analyzing the stability of the ranks of things in terms of their similarity to a reference thing. According to Gal, when identifying analogs of City A, cities that are consistently ranked high by multiple similarity measures are most likely to be the best analogs of City A. Ding et al. (2006) suggested a means to convert an OWL ontology into a Bayesian network, which allows a semantic similarity measure to account for uncertainty by utilizing evidential reasoning. Stoilos et al. (2005) put forward a related idea, extending OWL with fuzzy set theory in order to capture vagueness and imprecision in information and support further development of fuzzy-set-based similarity measures. </p><p>3.5 Conclusions </p><p>This chapter covers ontologies and how to encode them in depth. 
Ontologies will play a major part in exploration of the stated semantic measures. They allow computer algorithms of the measures to locate a piece of information on a Web page that they are designed to compute. Chapter 4 will describe how one converts a Wikipedia page into an OWL ontology, and experiment with Mitra and Wiederhold’s algorithm. Chapter 5 will employ PCA and KCA to systematically investigate similarity between U.S. cities in the numeric space. Chapter 6 will further experiment with the rest of the suggested semantic measures and compare numeric and semantic measures with the chosen statistical tests. </p><p>Chapter 4 </p><p>Preliminary Experiment: Six University Cities </p><p>This chapter describes the first step toward understanding semantic measures and how they compare with numeric measures. For this purpose, Mitra and Wiederhold’s semantic similarity measure with a calculation depth of 1, which is explained in Section 3.2.1.1.1, is selected because this algorithm has been demonstrated to be as good as or better than other semantic measures (Mitra and Wiederhold 2002) and the computer code of the algorithm is readily available. This initial experiment found a shortfall in Mitra and Wiederhold’s algorithm; a few improvements were developed. The enhanced algorithm is shown to perform better than the original; however, other flaws were identified and will be dealt with in Chapter 6. The Euclidean similarity measure, as explained in Section 3.2.2, is employed for numeric measurements throughout this work. As test cities, six U.S. university cities were selected based on the author and his advisors’ judgment that their economies are highly dependent on a university or universities as shown in Table 4-1. Figure 4-1 displays the locations of the cities, which are scattered from the West to the East Coast. The experiment involves two data spaces: numeric and semantic. The numeric data considered in this experiment are U.S. 
census statistics taken from the County and City Data Book (U.S. Census Bureau 2000). Being produced by the U.S. Census Bureau, these data are developed using a common methodology and are quality controlled. The dataset is chosen for its complete coverage of all U.S. cities and its inclusion of comprehensive city climate data from the U.S. National Oceanic and Atmospheric Administration. The dataset requires minimal data format manipulation and is aggregated at the city level. Everything considered, it was judged a suitable dataset for the dissertation as field data collection was not part of the research topic. Based on the author and his advisors’ familiarity with the dataset, fifteen variables were chosen to represent land area, demography, economy, and climate. These dimensions are commonly chosen in place comparisons (e.g., Green et al. 1967, Isserman and Rephann 1995, Hill et al. 1998 and Black and Henderson 2003). It is not crucial to ensure that, among all variables in the dataset, the selected variables optimally represent the indicated dimensions. Various comparison objectives require different variable sets. An analyst can add or take away any variables at will; doing so does not affect how Euclidean similarity is calculated. As long as the resulting similarity follows common statistical knowledge about the selected cities, the chosen variables serve the purpose of this experiment. </p><p>Table 4-1: The Six University Cities and Their Universities </p><p>City | University </p><p>Boston, MA | Boston University; Northeastern University </p><p>Boulder, CO | The University of Colorado at Boulder </p><p>Madison, WI | The University of Wisconsin </p><p>Palo Alto, CA | Stanford University </p><p>State College, PA | The Pennsylvania State University </p><p>Tuscaloosa, AL | The University of Alabama </p><p>Figure 4-1: Locations of the university cities. </p><p>The test case obtains semantic information from Wikipedia, a widely-known, free encyclopedia authored collaboratively by people around the world. 
Wikipedia is used here as a surrogate for all text-based Internet sources and the burgeoning Semantic Web (Feigenbaum et al. 2007), from which analysts can find information about their places of interest. Currently, Wikipedia has entries for most places imaginable, although the quality and detail of the entries vary. Many entries merely describe places with lists of items, such as businesses, educational institutions and Uniform Resource Locators (URLs). However, given the current popularity and the current rate of growth, most entries should have detailed descriptions in the near future. Moreover, the peer-reviewed counterpart of Wikipedia, Scholarpedia <www.scholarpedia.org>, can provide higher quality information if it achieves the same popularity. Wikipedia entries can be accessed via URLs. For example, the Wikipedia entry of State College, PA is available at </p><p> http://en.wikipedia.org/wiki/State_College%2C_Pennsylvania </p><p>Each entry was accessed on May 10, 2007, parsed and converted into an OWL Lite document (see the description of the parsing and converting method below). Wikipedia archives all versions of all entries. On an entry page, one can follow the history link to retrieve the version used in this experiment. Computers cannot read and understand today’s Web pages as humans can. Although a few projects (e.g., del.icio.us <del.icio.us> and the Friend of a Friend Project <www.foaf-project.org>) have begun to make the Internet more machine-readable, the majority of Web pages are not. Researchers have recognized the benefits of a machine-readable Wikipedia (Buffa et al. 2006, Haller et al. 2006, Hepp et al. 2006), but so far the prototype of such a Wikipedia <ontoworld.org> is still in its infancy and has very few entries. Therefore, we need a way that allows computers to understand the current, non-machine-readable Wikipedia. 
</p><p>Fortunately, Wikipedia pages have a consistent style and structure generated with MediaWiki <www.mediawiki.org>, a free, open source software package that allows users to create, edit, and link Web pages easily, and is often used to create collaborative websites. Exploiting the style and structure, the parsing is guided by the table of contents (TOC) of a Wikipedia entry. A TOC element, that is, an entry section, becomes a resource in an ontology, and its subsections become its descriptions along with its corresponding text. Figure 4-2 illustrates the constructed ontology for State College, PA as a concept map. Figure 4-3 shows the OWL encoding of the ‘Entertainment’ section. The section is encoded as an OWL resource with two descriptions: a literal description (blue) and a resource description (green). Note that the literal description comes from any textual description in the section, and the resource description comes from the ‘Sports’ subsection. Readers are referred to Section 3.1.1 for more information on an OWL document. </p><p>Figure 4-2: Concept map of the Wikipedia entry of State College, PA. </p><p>
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/State_College_borough- _PA#Entertainment">
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string" xml:lang="en">Entertainment</rdfs:label>
<dc:description rdf:datatype="http://www.w3.org/2001/XMLSchema#string" xml:lang="en"> BARS INCLUDE: The Saloon, 101 Heister St. The best live music in town. Home of the Monkey Boy--the most potent drink you will ever encounter. Modeled after an old English pub. Home to Penn State&amp;#39;s favorite 80&amp;#39;s cover band, Velveeta. www.80scheese.com Allen Street Grill, an upscale restaurant featuring a piano bar Bill Pickles Tap Room Chumleys, State College&amp;#39;s only alternative lifestyle (gay) bar Players Nite Club, playing top 40 and dance. The Phyrst, an Irish pub. The University Resident Theatre Company (URTC). The School of Theatre at Penn State. Zenos Pub, voted in Playboy magazine as one of the United States top 50 places to have a beer Cell Block, good music, dancing </dc:description>
<dc:description rdf:resource="http://en.wikipedia.org/wiki/State_College_borough- _PA#Sports"/>
</rdf:Description>
Figure 4-3: Portion of the OWL ontology of the Wikipedia entry of State College, PA. </p><p>Other Wiki sites powered by MediaWiki also have the same style and structure, and therefore one can apply the conversion method directly. Interesting examples of such Web sites are WikiTravel <wikitravel.org>, the Imtrav Wiki <www.wanderwiki.com>, Wikiversity <wikiversity.org> and RTFM-wiki <axljab.homelinux.org>. WikiTravel provides a free, up-to-date and reliable travel guide to over 17,000 destinations. The Imtrav Wiki also contains travel-related entries. However, instead of giving general information about destinations similar to that on WikiTravel, its entries give readers specific tourist information, such as how to prepare for a trip, what to do and where to dine. Wikiversity is a sister project of Wikipedia. Its entries describe various learning projects. The participants of each project complete activities by creating learning resource entries on Wikiversity, Wikipedia and other Wikipedia sister projects. Despite the popularity of MediaWiki for collaborative content authoring, most Internet content is not generated using MediaWiki. Since the conversion method depends on the particular style and structure of MediaWiki content, one cannot directly apply the method to arbitrary Web pages. However, the underlying principle of style and structure analysis can be applied to any Web site with a consistent style and structure throughout the site. The Piggy Bank project <http://simile.mit.edu/wiki/Piggy_Bank> has a repository of algorithms that convert a Web page into an OWL document. 
Each of these algorithms, like the proposed method, works with a particular set of Web pages of a particular site. Because they work with OWL, the proposed semantic measures can be easily extended to work with the Semantic Web. Since computers can read, compute, and reason using OWL (Horrocks et al. 2003), the conversion method enables a plethora of computer automation based on Wikipedia. Participants of the First Semantic Wikis Workshop suggested many ideas for such automation, including generating an overview concept map of related Wikipedia entries (Haller et al. 2006), creating a large domain knowledge corpus reflecting consensus among Wikipedia contributors (Hepp et al. 2006) and providing intelligent content suggestion for Wikipedia entry authoring and editing (Buffa et al. 2006). We will now begin to explore how well the selected semantic measure, Mitra and Wiederhold’s algorithm, can compare places using the qualitative information extracted from Wikipedia pages. The results will be evaluated in comparison with those of the Euclidean similarity measure. </p><p>4.1 Comparison Using Mitra and Wiederhold’s Algorithm </p><p>Similarity among the six university cities is computed with the equal-weight Euclidean similarity measure ($w_i = 1$) in the census statistics space. For the semantic space, the comparison employs Mitra and Wiederhold’s algorithm to compute the semantic similarity score between two Wikipedia entries as follows. Before comparing two entries, one needs to determine which section in one entry corresponds to each section of the other entry. A corresponding section is determined by the semantic similarity between section headers at the same level in the tables of contents. The section of one entry with the highest similarity score to a section of the other entry is set as its corresponding section. 
However, if the similarity score of the best-matching header is less than 0.8, the section is assumed not to have a corresponding section in the other entry, and is assigned a similarity score of 0. The overall semantic score between two entries is the sum of their top-level section similarity scores divided by the number of sections. The similarity score of a section with subsections is determined in the same manner by summing the similarity scores of its top-level subsections and dividing the sum by the number of subsections. Note that the calculation is recursive and ends at a terminal section, which is a section without subsections. Additionally, the measure is not symmetric since one entry is likely to have more sections than the other; both the numerator (sum of section scores) and the denominator (number of sections) can be different. The numeric and semantic similarity scores among the six university cities are computed and plotted as shown in Figure 4-4. The title of each plot specifies the city to which the other cities are compared. The Euclidean scores indicate that these cities are all quite similar when we consider all the selected statistical variables. Note that the variables are normalized by the range of values of the entire dataset of 1132 cities and therefore their similarity is relative to the rest of the cities. It is probable that if we considered only subgroups of these variables, the cities would not be so much alike. Such analysis is deferred to Chapters 5 and 6. The semantic scores, on the other hand, show that these cities are dissimilar; none of the scores is above 0.4. However, a closer look at the scores between sections (not shown in the plots) reveals that some sections are almost identical ($s^{sem}_{ab} \geq 0.8$). Nonetheless, the algorithm has identified false matches. 
For instance, consider the demographics sections of: </p><p>(1) Boston, MA: http://en.wikipedia.org/wiki/Boston%2C_MA#Demographics; and (2) Boulder, CO: http://en.wikipedia.org/wiki/Boulder%2C_CO#Demographics. </p><p>[Six scatter plots, one per reference city: Boston, Boulder, Madison, Palo Alto, State College and Tuscaloosa. x-axis: Euclidean Similarity; y-axis: Semantic Similarity; both axes run from 0 to 1. Legend: the six cities.] </p><p>Figure 4-4: Plots of similarity scores computed with the original algorithm. </p><p>The writers of these sections follow almost identical templates and merely change the numbers. Mitra and Wiederhold’s algorithm considers a number as a string. It does not know whether two numbers are about the same. As long as one digit differs, two numbers are considered to be completely different and receive a similarity score of 0. Even if the algorithm could compare numbers, sections of this kind would still turn out almost identical since most of the words used are identical. An investigation of lesser, but significant, matches suggests a further problem. 
The algorithm determines that the history sections of: </p><p>(1) State College, PA: http://en.wikipedia.org/wiki/State_College%2C_PA#History; and (2) Madison, WI: http://en.wikipedia.org/wiki/Madison%2C_WI#History </p><p>have about 41 percent synonymous words, but the sections are not at all similar. This can be attributed to stop words as explained in Section 3.2.1.1.2. When stop words are removed from the text, synonymous words are reduced to only 14 percent. </p><p>4.2 Comparison Using Enhanced Mitra and Wiederhold’s Algorithm </p><p>To address the flaws unearthed in the previous analysis, Mitra and Wiederhold’s algorithm is enhanced by purging stop words and any words that are not in a predefined list of categories. Section 3.2.1.1.2 explains this process in detail. As a proof of concept, this experiment uses the list of 1950 census categories of economic activities (Nelson 1955). The list is small, containing merely 26 categories. The census list suits the dissertation objective for two reasons: (1) it offers an opportunity to investigate how to use today’s domain knowledge, which is mostly in machine-unreadable formats; and (2) its small size allows thorough investigation of the terms to understand the behaviors of the semantic measures. For example, the purging process transforms the sentence “Morgan Construction, a manufacturer of steel rolling mills, has their headquarters in Worcester” into “construction.” Other words, such as manufacturer, steel rolling mills and headquarters, may help distinguish cities as well and should be considered for inclusion in future work. Note that those words are particularly useful in this case, where the name of the company seems to indicate a building construction business, while the company actually produces steel construction materials. 
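</p><p>The purging step can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code used in the experiment: the function name `filter_terms` is hypothetical, and the two word lists are abbreviated placeholders rather than the full stop-word list or the 26 census categories of Nelson (1955).

```python
import re

# Hypothetical, abbreviated lists for illustration only. The experiment uses
# a full stop-word list and the 26 census categories from Nelson (1955).
STOP_WORDS = {"a", "of", "has", "their", "in"}
CATEGORY_TERMS = {"construction", "mining", "manufacturing", "retail"}


def filter_terms(text, stop_words=STOP_WORDS, category_terms=CATEGORY_TERMS):
    """Keep only the words that are not stop words and appear in the category list."""
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, letters only
    return [t for t in tokens if t not in stop_words and t in category_terms]


sentence = ("Morgan Construction, a manufacturer of steel rolling mills, "
            "has their headquarters in Worcester")
kept = filter_terms(sentence)  # only "construction" survives the two filters
```

In this sketch "manufacturer" is dropped because only the exact term "manufacturing" is in the placeholder category list, which mirrors the loss of potentially useful words noted above.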
</p><p>Furthermore, the experiment considers only the Economy sections of the entries in order to avoid template sections (e.g., the Demographics sections of Boston, MA and Boulder, CO indicated in the previous section). A manual review of all sections of many Wikipedia entries found that Economy sections seem to contain descriptions suitable for semantic measures. They describe a city in words, not in numbers as Demographics sections do. For example, instead of stating the amount of annual funding Boston receives from the National Institutes of Health, the entry states that “Boston receives the highest amount of annual funding from the National Institutes of Health of all cities in the United States.” On the numeric side, the calculation of Euclidean similarity stays the same as in the previous section, and the results will not be discussed here. The enhancements produce noticeably better results. The plots of Euclidean and semantic similarity scores are shown in Figure 4-5. Except for Boulder, CO and Palo Alto, CA, the enhanced algorithm produces much wider ranges of scores, indicating better discriminating power. Boulder, CO does not match any other city according to the Economy sections. A human reader reviewing the best matches will find that they are reasonable. Take for example: </p><p>(1) Madison, WI: http://en.wikipedia.org/wiki/Madison%2C_WI#Economy; and (2) Boston, MA: http://en.wikipedia.org/wiki/Boston%2C_MA#Economy. </p><p>The economies of both cities depend largely on universities and government agencies. The cities are home to many major corporations and financial firms. Note that due to the asymmetric nature of the measure, Boston is the best match for Madison, but not the other way around. The best match for Boston is State College. 
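</p><p>The asymmetry follows from the averaging rule of Section 4.1, in which the score of entry A against entry B is divided by the number of sections in A. The Python sketch below illustrates only that rule. It is not Mitra and Wiederhold’s algorithm: a simple word-overlap (Jaccard) score stands in for their semantic measure, and the names `make_section`, `jaccard`, `section_similarity` and `entry_similarity`, as well as the two toy entries, are hypothetical.

```python
def make_section(header, text="", subsections=None):
    """Build a section as a plain dictionary (an assumed, simplified structure)."""
    return {"header": header, "text": text, "subsections": subsections or []}


def jaccard(text_a, text_b):
    """Placeholder similarity: shared words divided by all distinct words."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def section_similarity(sec_a, sec_b, threshold=0.8):
    """Score a section of A against its counterpart in B, recursing into subsections."""
    if not sec_a["subsections"]:  # terminal section: compare the text directly
        return jaccard(sec_a["text"], sec_b["text"])
    return entry_similarity(sec_a["subsections"], sec_b["subsections"], threshold)


def entry_similarity(sections_a, sections_b, threshold=0.8):
    """Average, over the sections of A, of each section's score against its best header match in B."""
    if not sections_a:
        return 0.0
    total = 0.0
    for sec_a in sections_a:
        best = max(sections_b,
                   key=lambda s: jaccard(sec_a["header"], s["header"]),
                   default=None)
        if best is not None and jaccard(sec_a["header"], best["header"]) >= threshold:
            total += section_similarity(sec_a, best, threshold)
        # Otherwise the section has no counterpart and contributes a score of 0.
    return total / len(sections_a)


# Toy entries: B has one extra section that A lacks.
entry_a = [make_section("Economy", "universities government finance")]
entry_b = [make_section("Economy", "universities government finance"),
           make_section("History", "founded as a colonial port")]

score_a_to_b = entry_similarity(entry_a, entry_b)  # one matched section out of one
score_b_to_a = entry_similarity(entry_b, entry_a)  # one matched section out of two
```

With these toy entries the score from A to B differs from the score from B to A solely because the denominators differ, which is the same mechanism that lets Boston be the best match for Madison without the reverse being true.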
</p><p>[Six scatter plots, one per reference city: Boston, Boulder, Madison, Palo Alto, State College and Tuscaloosa. x-axis: Euclidean Similarity; y-axis: Semantic Similarity; both axes run from 0 to 1. Legend: the six cities.] </p><p>Figure 4-5: Plots of similarity scores computed with the enhanced algorithm. </p><p>4.3 Conclusions </p><p>Previous work on place analogs failed to account for non-numeric information, e.g., Farrigan and Glasmeier (2002) and Isserman and Rephann (1995), discussed in Chapters 1 and 2, respectively. The Internet has become a major source of such information. Using Wikipedia as a surrogate for other Internet sources, this chapter proposes a way to convert the content of Wikipedia pages into a human- and machine-readable format, OWL. The chapter demonstrates that non-numeric information can be integrated automatically into the identification of analogs. 
However, the preliminary experiment identifies many flaws in the selected semantic measure: (1) brute-force synonym counting, as done in the first trial without a predefined list of categories and stop words, does not provide a good similarity measure; (2) an algorithm should have some knowledge of the domain it is computing about in order to yield meaningful similarity scores; (3) analysts also need to manually select the sections of Wikipedia entries to consider in order to avoid template sections and descriptions by numbers/statistics; and (4) a longer list of categories may help improve the quality of semantic similarity measurement. These issues will be dealt with in Chapters 5 and 6.</p><p>Chapter 5 </p><p>Numeric Analysis </p><p>The main purposes of this chapter are to: (1) understand similarity between cities in the numeric space, and (2) select ten cities for further analysis in the next chapter, where a Euclidean similarity measure and various semantic similarity measures are compared. The numeric analysis in the previous chapter arbitrarily selected interesting variables from the City Tables. This chapter aims to make the selection more systematic. The selection employs Principal Components Analysis (PCA) to group all variables in the City Tables and, in effect, create fewer salient dimensions for subsequent analyses. PCA has been commonly used in cases of an overwhelming number of variables (King 1969, Fovell and Fovell 1993, Rogerson 2001). The City Tables have over 100 variables, making it difficult to understand any analysis results. Many of the variables are likely correlated and should be grouped together. It is true that one can exercise insight and expertise about the dataset to select only a few variables of interest, e.g., two economic variables, two demographic variables and two climatic variables. Nevertheless, PCA makes the analysis more generic, allowing it to be applied to other datasets without substantial familiarity. 
After the dimension reduction, K-means Cluster Analysis (KCA) is employed to create clusters of similar cities. The insights gained from KCA together with analysis of the available content of Wikipedia entries provide the rationale behind the selection of the ten cities at the end of the chapter. </p><p>5.1 Principal Component Analysis </p><p>This part of the numeric analysis aims to reduce the number of variables in the City Tables of the County and City Data Book (U.S. Census Bureau 2000) to a limited number of significant orthogonal variables that will be used in the subsequent cluster analysis. The City Tables comprise seven separate tables and contain a total of 126 socioeconomic and seven climatic variables. The complete list of variables and their descriptions can be found in Appendix A. Many records include missing values. In order to maximize the number of complete records, the following variables with frequent missing values are ignored: C2-POP14, C5-WHS03, C5-MAN04, C5-MAN05, C5-MAN06, C5-MAN07, C5-MAN08, C5-MAN09, C6-RTL06, C6-RTL07 and C6-AFS04. Except for C6-RTL07, the excluded variables correlate highly with other variables, and therefore their exclusion should not remove much information from this and subsequent analyses. Table 5-1 displays correlation coefficients of three excluded variables and seven other variables.<sup>1</sup> The excluded variables are highlighted in light gray. C2-POP14 highly correlates with only C2-POP18, which is included in PCA. C5-MAN04 and C6-RTL06 highly correlate with each other and with multiple other variables included in PCA, which, unsurprisingly, end up in the same principal component. Excluding the stated variables, 490 of the 1082 records in the City Tables are complete and are used in PCA as described in Section 3.2.2.1. The analysis is performed with SPSS for Windows Version 11 (SPSS Inc. 2001). 
The analysis produces 14 components with eigenvalues greater than 1, which means that each accounts for more variance than any single standardized original variable (King 1969). From this point on, PC I refers to the first principal component, PC II the second and so forth. The analysis rotates PCs I to XIV using the Varimax rotation, which facilitates interpretation of the component loadings. The rotation maximizes large component loadings and minimizes small component loadings. The interpretation of the generated components is given in Table 5-2. Their associated variables and component loadings can be found in Appendix C. Table 5-3 lists the percentage of variance explained by PCs I to XIV. </p><p><sup>1</sup> The full correlation coefficient matrix is not provided here due to its enormous size (a 122-by-122 matrix). Readers can easily obtain the dataset and generate the matrix with any suitable statistical software package. </p><p>Table 5-1: Pearson Correlation Coefficient Matrix of Ten Sampled Variables </p>
<table>
<tr><th></th><th>C2-POP14</th><th>C2-POP18</th><th>C5-MAN04</th><th>C5-MAN05</th><th>C5-MAN06</th><th>C5-MAN07</th><th>C6-RTL06</th><th>C6-RTL02</th><th>C6-RTL08</th><th>C6-RTL09</th></tr>
<tr><th>C2-POP14</th><td>1</td><td>0.805</td><td>-0.016</td><td>-0.01</td><td>-0.021</td><td>-0.021</td><td>0.069</td><td>-0.03</td><td>-0.034</td><td>-0.028</td></tr>
<tr><th>C2-POP18</th><td></td><td>1</td><td>0.071</td><td>0.072</td><td>0.062</td><td>0.044</td><td>0.109</td><td>0.032</td><td>0.016</td><td>0.037</td></tr>
<tr><th>C5-MAN04</th><td></td><td></td><td>1</td><td>0.968</td><td>0.991</td><td>0.963</td><td>0.832</td><td>0.877</td><td>0.877</td><td>0.875</td></tr>
<tr><th>C5-MAN05</th><td></td><td></td><td></td><td>1</td><td>0.935</td><td>0.965</td><td>0.759</td><td>0.79</td><td>0.786</td><td>0.782</td></tr>
<tr><th>C5-MAN06</th><td></td><td></td><td></td><td></td><td>1</td><td>0.964</td><td>0.832</td><td>0.88</td><td>0.883</td><td>0.879</td></tr>
<tr><th>C5-MAN07</th><td></td><td></td><td></td><td></td><td></td><td>1</td><td>0.762</td><td>0.798</td><td>0.804</td><td>0.792</td></tr>
<tr><th>C6-RTL06</th><td></td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>0.962</td><td>0.96</td><td>0.954</td></tr>
<tr><th>C6-RTL02</th><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>0.994</td><td>0.995</td></tr>
<tr><th>C6-RTL08</th><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>0.994</td></tr>
<tr><th>C6-RTL09</th><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>1</td></tr>
</table>
<p>Table 5-2: Interpretation of the Selected Components </p>
<table>
<tr><th>Component</th><th>Interpretation</th></tr>
<tr><td>I</td><td>absolute socioeconomic size of the cities</td></tr>
<tr><td>II</td><td>percentage characteristics in terms of household, age and race; and average number of persons per household</td></tr>
<tr><td>III</td><td>winter climate</td></tr>
<tr><td>IV</td><td>population dynamics (percentage changes)</td></tr>
<tr><td>V</td><td>percentage difference between retired population and 25-to-44-year-old, male population</td></tr>
<tr><td>VI</td><td>percentage of African Americans and crime rate</td></tr>
<tr><td>VII</td><td>summer climate</td></tr>
<tr><td>VIII</td><td>city government finances</td></tr>
<tr><td>IX</td><td>percentage difference between median age, 45-to-64-year-old home owners and 18-to-24-year-old population</td></tr>
<tr><td>X</td><td>total population</td></tr>
<tr><td>XI</td><td>retail trade</td></tr>
<tr><td>XII</td><td>net population change and land area</td></tr>
<tr><td>XIII</td><td>percentage difference between American Indian/Alaska Native population and manufacturing establishments with 20 or more employees</td></tr>
<tr><td>XIV</td><td>population density (per square mile)</td></tr>
</table>
<p>Selecting components for further analysis is not an exact science. The first few PCs, namely, PC I, PC II, and so forth do not always represent an optimum set for cluster analysis. Yeung and Ruzzo (2001) discovered that the general wisdom, that the first few PCs will produce good clustering results, was not true; a set of PCs usually exists that can generate clusters closer to the inherent clusters in a training dataset than the first PCs can. Considering the aspects that each component seems to represent and the percentage of variance explained, PCs I, II, III, IV, VI and VII are included for further analysis. These components represent themes similar to those of sections of Wikipedia articles, namely, economy, demography, climate and crime. It is important for this work that the themes of numeric information and the themes of semantic information are comparable since comparing similarity measures for the two information spaces is crucial to the main goal. Many of the generated components (PCs I, VIII, XI) are related to economics. Only PC I, which explains most of the variance (53.998 percent) in the dataset, is chosen for simplicity. In addition, the other economic components account for much less of the variance (2.751 and 1.628 percent). Similarly, PCs II and IV are chosen to portray city demography. PC VI is chosen for crime, and PCs III and VII for climate. 
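</p><p>The share of variance retained by this selection can be checked directly from the percentages listed in Table 5-3. The short Python sketch below is only an illustration of that arithmetic; the names <code>explained</code>, <code>selected</code> and <code>total_explained</code> are assumptions introduced here and are not part of the SPSS analysis. </p>

```python
# Percent of variance explained by each component, copied from Table 5-3.
explained = {
    "I": 53.988, "II": 7.529, "III": 4.028, "IV": 3.662, "V": 3.227,
    "VI": 3.034, "VII": 2.756, "VIII": 2.751, "IX": 2.170, "X": 1.978,
    "XI": 1.628, "XII": 1.506, "XIII": 1.255, "XIV": 1.071,
}

# Components kept for the cluster analysis (economy, demography, climate, crime).
selected = ["I", "II", "III", "IV", "VI", "VII"]


def total_explained(components):
    """Sum the percent of variance explained by the named components."""
    return sum(explained[c] for c in components)


retained_pct = total_explained(selected)   # the six selected components
all_pct = total_explained(explained)       # all 14 components with eigenvalue > 1

print(f"selected PCs: {retained_pct:.3f}%   all 14 PCs: {all_pct:.3f}%")
```

<p>Under the Table 5-3 values, the six selected components account for about 75.0 percent of the variance, compared with about 90.6 percent for all 14 retained components. 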
</p><p>Table 5-3: Percentage of Variance Explained by Each PC </p>
<table>
<tr><th>PC</th><th>% Explained Variance</th></tr>
<tr><td>I</td><td>53.988</td></tr>
<tr><td>II</td><td>7.529</td></tr>
<tr><td>III</td><td>4.028</td></tr>
<tr><td>IV</td><td>3.662</td></tr>
<tr><td>V</td><td>3.227</td></tr>
<tr><td>VI</td><td>3.034</td></tr>
<tr><td>VII</td><td>2.756</td></tr>
<tr><td>VIII</td><td>2.751</td></tr>
<tr><td>IX</td><td>2.170</td></tr>
<tr><td>X</td><td>1.978</td></tr>
<tr><td>XI</td><td>1.628</td></tr>
<tr><td>XII</td><td>1.506</td></tr>
<tr><td>XIII</td><td>1.255</td></tr>
<tr><td>XIV</td><td>1.071</td></tr>
</table>
<p>5.2 K-Means Cluster Analysis </p><p>KCA groups together similar cities based on their distance in the space of the six PCs selected in the previous section. Yeung and Ruzzo (2001) found that, compared to other distance measures, KCA based on Euclidean distance works well with PCA; it is therefore employed here as described in Section 3.2.2.2. The analysis is again performed with SPSS for Windows Version 11 (SPSS Inc. 2001) and uses as the input the scores (projections of the original variables onto a component) of the 6 selected components from the previous section. Since one does not know in advance the number of distinct city clusters, the analysis experiments with generating 2 to 8 clusters of cities. For the clusters produced, the analysis computes Multi-Way ANOVA, which is often employed to check separation of resulting clusters (SPSS Inc. 2006). Table 5-4 gives the Multi-Way ANOVA results. Note that ANOVA does not provide a hypothesis test of cluster means, but rather descriptive statistics of the means (SPSS Inc. 2006). Considering the ANOVA results, distinct clusters begin to form at 5 clusters. The significance values are negligible (less than 0.001) for all PCs, indicating a probability of less than 0.1 percent that the means of these clusters are equal. On the other hand, for two, three and four clusters, many significance values are greater than 0.05, indicating a probability of greater than five percent that the means of these clusters on several PCs are equal. 
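</p><p>The clustering step and the separation check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPSS procedure: the farthest-point initialisation, the helper names (<code>sq_dist</code>, <code>kmeans</code>, <code>f_ratio</code>) and the toy scores are all hypothetical, and the F ratio is computed as a one-way ANOVA per component, which is an assumption about how the values in Table 5-4 were obtained. </p>

```python
from statistics import mean


def sq_dist(p, q):
    """Squared Euclidean distance between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point start
    (an assumption, not the SPSS initialisation). Returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(mean(col) for col in zip(*members))
    return labels


def f_ratio(values, labels):
    """One-way ANOVA F ratio of a single component across the clusters."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    return (between / df_between) / (within / df_within)


# Hypothetical component scores for six cities (two components each).
scores = [(0.0, 0.1), (0.2, -0.1), (0.1, 0.0), (5.0, 0.1), (5.2, -0.1), (5.1, 0.0)]
labels = kmeans(scores, k=2)
f_by_component = [f_ratio([s[d] for s in scores], labels) for d in range(2)]
print(labels, f_by_component)
```

<p>In this toy example the first component separates the two clusters and receives a large F ratio, while the second component does not separate them and receives an F ratio of zero. Repeating the call for k = 2 to 8 would mirror the experiment described in this section. 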
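</p><p>Table 5-4: Multi-Way ANOVA of the Clusters </p>
<table>
<tr><th colspan="8">F Ratio</th></tr>
<tr><th>PC</th><th>2 Cluster</th><th>3 Cluster</th><th>4 Cluster</th><th>5 Cluster</th><th>6 Cluster</th><th>7 Cluster</th><th>8 Cluster</th></tr>
<tr><td>I</td><td>896.83</td><td>443.24</td><td>314.50</td><td>749.27</td><td>605.61</td><td>503.37</td><td>434.16</td></tr>
<tr><td>II</td><td>.48</td><td>5.33</td><td>3.88</td><td>20.35</td><td>15.17</td><td>81.82</td><td>17.99</td></tr>
<tr><td>III</td><td>1.08</td><td>.25</td><td>.28</td><td>53.57</td><td>187.59</td><td>153.66</td><td>117.50</td></tr>
<tr><td>IV</td><td>.02</td><td>239.23</td><td>164.22</td><td>125.66</td><td>93.84</td><td>77.85</td><td>108.24</td></tr>
<tr><td>V</td><td>4.47</td><td>2.93</td><td>287.87</td><td>9.74</td><td>26.91</td><td>20.63</td><td>66.31</td></tr>
<tr><td>VI</td><td>1.04</td><td>.23</td><td>3.55</td><td>49.13</td><td>99.22</td><td>113.50</td><td>71.34</td></tr>
<tr><th colspan="8">Significance Level</th></tr>
<tr><th>PC</th><th>2 Cluster</th><th>3 Cluster</th><th>4 Cluster</th><th>5 Cluster</th><th>6 Cluster</th><th>7 Cluster</th><th>8 Cluster</th></tr>
<tr><td>I</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
<tr><td>II</td><td>.485</td><td>.005</td><td>.009</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
<tr><td>III</td><td>.000</td><td>.772</td><td>.838</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
<tr><td>IV</td><td>.485</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
<tr><td>V</td><td>.298</td><td>.054</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
<tr><td>VI</td><td>.877</td><td>.787</td><td>.014</td><td><.001</td><td><.001</td><td><.001</td><td><.001</td></tr>
</table>
<p>5.3 Selection of Cities for Subsequent Analyses </p><p>As indicated in the previous chapter, semantic similarity computation is extremely expensive. It was found that computing semantic similarity scores among six to ten cities required about three to five hours on a 3.0 GHz Pentium IV machine with two gigabytes of RAM. Therefore, comparing all 490 cities exhaustively could take more than a year on the same computer. To keep the computation time within an acceptable limit, this section selects ten cities for further analysis. </p><p>The previous section found distinct clusters in the five- to eight-cluster solutions. Table 5-5 shows the numbers of cities in each cluster. One strategy for selecting ten cities is to choose a pair from each of five clusters. This way one can try to reproduce these similarity pairings using semantic measures. However, looking closely at Table 5-5, one of the five clusters has only one member, New York City.<sup>2</sup> As a result, one will have to select five pairs from six clusters instead, neglecting the New York City cluster. To aid interpretation of the clustering results, they are visualized with Improvise, an open-source, coordinated and interactive visualization tool (Weaver 2006). 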
A snapshot of the visualization is shown in Figure 5-1. Each cluster is given a unique color from a qualitative set of ColorBrewer (Brewer and Harrower 2003). </p><p><sup>2</sup> New York City can be considered an outlier in the data. The PCA and KCA analysis without New York City was also tried. The analysis produces comparable principal components and comparable clusters without the New York City cluster. The decision to include New York City in the final analysis is made in order to show the full range of city socioeconomic size. Note also that identifying correct clusters is not the objective of this work. PCA and KCA solely serve as systematic screening methods for picking a few cities for further detailed analysis. </p><p>Table 5-5: Numbers of Cities in Each Cluster </p>
<table>
<tr><th># clusters</th><th>Cluster 1</th><th>Cluster 2</th><th>Cluster 3</th><th>Cluster 4</th><th>Cluster 5</th><th>Cluster 6</th><th>Cluster 7</th><th>Cluster 8</th></tr>
<tr><td>2</td><td>488</td><td>2</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>3</td><td>1</td><td>464</td><td>25</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>4</td><td>297</td><td>1</td><td>164</td><td>28</td><td></td><td></td><td></td><td></td></tr>
<tr><td>5</td><td>141</td><td>27</td><td>4</td><td>1</td><td>317</td><td></td><td></td><td></td></tr>
<tr><td>6</td><td>4</td><td>188</td><td>18</td><td>142</td><td>1</td><td>137</td><td></td><td></td></tr>
<tr><td>7</td><td>1</td><td>56</td><td>186</td><td>135</td><td>17</td><td>91</td><td>4</td><td></td></tr>
<tr><td>8</td><td>74</td><td>71</td><td>4</td><td>117</td><td>169</td><td>1</td><td>6</td><td>48</td></tr>
</table>
<p>Figure 5-1: Improvise visualization of the six clusters: Cluster 1 in yellow, Cluster 2 in blue, Cluster 3 in green, Cluster 4 in purple, Cluster 5 in orange and Cluster 6 in brown. New York City (the only city in Cluster 5) is selected and displayed in red. </p><p>Let us turn to investigate each cluster individually. In terms of socioeconomic size (PC I), New York City pulls away from the other cities. As suggested, it is the only member of Cluster 5. Although it has the highest values for crime (violent and property), its crime rate (PC VI, number of incidents per 100,000 residents) is among the lowest in the country. The next biggest cities (yellow cluster: Chicago, IL; Houston, TX; Los Angeles, CA; and Philadelphia, PA) do not share the low crime rate with New York City. Despite their huge size, they are nevertheless average regarding percentage values of household and population characteristics by age and race (PC II). Since these mega cities are located in different parts of the country, their climates (PCs III and VII) differ a great deal. Figure 5-2 shows the coordinates of these cities on the scatter plots. Los Angeles has the mildest winter and summer climates, characteristic of the West Coast climate (Christopherson 2003). Houston, the southernmost city, has the hottest summer, and Chicago, the northernmost city, has the harshest winter. New York City and Philadelphia are in the middle of the group; Philadelphia has a hotter summer and a colder winter than New York City. With negative PC IV values, populations of these cities, except New York City, have been steady or declining. Compared to the metropolises above, cities in the other clusters are similar and relatively small in terms of socioeconomic size. They clump together in the lower end of the PC I axis. It is also difficult to separate them out with regard to the percentage values. However, these clusters separate out well in other dimensions and geographically as shown in Figure 5-3. Cluster 2 (blue cluster) contains cities with a colder winter (low PC III values). These cities are all in the northern part of the U.S. The southernmost city is Alamogordo, NM. Cluster 6 (brown cluster) contains cities with a hot summer climate (high PC VII values). Most of these cities are in the South and Southwest. Cluster 3 (green cluster) contains the fastest growing cities (high PC IV values) of the U.S. Most of them are west of the Rocky Mountains. These fast growing cities are located in the border zone of Clusters 2 and 6. Although some of these fast growing cities have a hotter summer than some hot-summer cities and a colder winter than some cold-winter cities, none is located deep within Clusters 2 or 6. 
Therefore, it is arguable that the fast growing cities have a milder winter and summer than their surrounding cities (and the weather may be one of the major reasons for the rapid expansion). Lastly, Cluster 4 (purple) contains cities that have mild summers and winters (PCs III and VII). Unsurprisingly, these cities are all spread along the West Coast. The majority of them are in California. A few are located around Seattle, WA. The above examination of individual clusters illustrates that KCA produces logical results; it separates the cities into clusters that one would expect considering common geographic, socioeconomic and climatic knowledge. Following the strategy suggested earlier, one can now simply pick any two cities from each cluster for further experimentation. However, this work also has to consider semantic information from Wikipedia and, as the following analysis of Wikipedia entries will demonstrate, only certain cities have adequate semantic information. </p><p>Figure 5-2: Scatter plots of the six clusters. The members of Cluster 1 are selected and shown in red. </p><p>Figure 5-3: Map of the six clusters. Cluster 1 in yellow, Cluster 2 in blue, Cluster 3 in green, Cluster 4 in purple, Cluster 5 in orange and Cluster 6 in brown. New York City (the only city in Cluster 5) is selected and displayed in red. Note that the map shows an estimated-by-sight location of the Rocky Mountains. </p><p>5.3.1 Accounting for the Current State of Wikipedia Entries </p><p>Since this dissertation attempts to compare numeric and semantic measures, one would certainly like to measure similarity among places based on the same aspects in both numeric and semantic spaces. On the numeric side, the selected PCs represent socioeconomic, climatic, crime and demographic characteristics. Therefore, one would want to select cities with detailed Wikipedia entries on those characteristics. 
Wikipedia comprises contributions from millions of authors. As a result, the entries are of varying quality. Finding ten cities with detailed descriptions on the same topics, two from each K-Means cluster, proves to be a challenging task. Wikipedia entries of the 490 cities were cached on a hard drive on August 3, 2007, and a frequency analysis of the entry headers was carried out with a simple Java program. Appendix C lists the headers that occur at least twice along with their frequencies. The entry headers indicate the city characteristics described by an entry, and hence the analysis portrays a holistic picture of common semantic information across entries. The analysis shows that the entries have a total of 2417 unique headers. Appendix C only includes 446 headers due to space limitations, but they are sufficient to demonstrate the nature of Wikipedia entries. Out of the 2417 headers, only 7 headers are common among at least half of the entries.<sup>3</sup> Some headers are comparable and can indicate comparable sections, such as “Major streets” and “Major roads”, and “Arts and entertainment” and “Arts and theatre”. In order to obtain a better frequency analysis, one needs a way to resolve and combine these comparable headers. Nevertheless, in this case, the comparable headers appear to have low frequencies (less than 5 occurrences) and should not significantly affect the analysis. The header analysis suggests that only a handful of entries have sections of interest to this study. Merely a quarter of entries contain a “Climate” section. Fewer than 10 percent contain a “Crime” section and approximately one third have an “Economy” section. </p><p><sup>3</sup> A few entries have duplicate headers, and therefore the number of entries with a common header may be less than the total count. The difference, however, should be negligible. 
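</p><p>The header frequency analysis described above amounts to a simple tally. The Python sketch below is a hypothetical illustration and not a reconstruction of the original Java program, which is not shown; the function name <code>header_frequencies</code> and the three toy entries are assumptions. Unlike the count described in footnote 3, it counts a header at most once per entry. </p>

```python
from collections import Counter


def header_frequencies(entries):
    """Count, for each section header, the number of entries that contain it."""
    counts = Counter()
    for headers in entries.values():
        # set(): a header repeated within one entry is counted only once.
        counts.update(set(headers))
    return counts


# Hypothetical cached entries: city name -> list of section headers.
entries = {
    "City A": ["History", "Climate", "Economy"],
    "City B": ["History", "Economy", "Economy"],
    "City C": ["History", "Crime"],
}

counts = header_frequencies(entries)
# Headers that occur in at least two entries (the Appendix C criterion).
at_least_twice = sorted(h for h, n in counts.items() if n >= 2)
# Headers that are common to at least half of the entries.
in_half_of_entries = sorted(h for h, n in counts.items() if n >= len(entries) / 2)
print(len(counts), at_least_twice, in_half_of_entries)
```

<p>Resolving comparable headers such as “Major streets” and “Major roads” would require an additional normalization step before counting, which this sketch does not attempt. 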
</p><p>Although most entries include a “Demographics” section, as discovered in the previous chapter, it is not suitable for the semantic measures proposed in this study. This section, where present, tends to reiterate census information and follow a template. Moreover, many entries are not descriptive; they include only lists of items and URLs.<sup>4</sup> The entry for State College, PA provided in Appendix D gives an example entry with many such lists. The following sections contain lists of items or URLs with minimal descriptions, if any: “Famous people from State College”, “Points of interest”, “Newspapers”, “Magazines”, “Web Media”, “Radio”, “Entertainment”, “Minor League Baseball”, “Economy”, “Retail”, “Public Schools”, “Private Schools”, “Other Colleges Near State College”, “Libraries”, “Hospitals”, “Roads”, “Airport”, “Mass Transportation”, “See also” and “External links”. The lack of detailed descriptions in most Wikipedia entries renders it impossible to pick a pair of cities from each cluster. Instead, the following cities are selected from the indicated clusters<sup>5</sup>: </p><p>Cluster 1: Los Angeles, CA; Chicago, IL; Houston, TX and Philadelphia, PA </p><p>Cluster 2: Ann Arbor, MI; Boston, MA and Reno, NV </p><p>Cluster 3: Las Vegas, NV </p><p>Cluster 4: San Francisco, CA and San Diego, CA </p><p>According to common knowledge, these cities represent three interesting semantic groups: </p><p>1. Metropolitan Cities: Los Angeles, CA; Chicago, IL; Houston, TX and Philadelphia, PA </p><p>2. Urbanized University/Technology Cities: Ann Arbor, MI; Boston, MA; San Francisco, CA and San Diego, CA </p><p>3. Gambling Cities: Las Vegas, NV and Reno, NV </p><p><sup>4</sup> Some lists of names are quite adequate as long as they include descriptive keywords, such as industry types, climate zones, and academic paradigms. </p><p><sup>5</sup> Cluster 5 has only one member, New York City. Cluster 6 does not have any member with a detailed Wikipedia entry. 
</p><p>It is interesting to test whether the semantic measures can identify these groups. Additionally, from the cluster analysis, we know that Cluster 1 cities are similar in terms of their absolute socioeconomic size, and Cluster 2 and Cluster 4 cities are similar in terms of their climates. One would be curious to investigate whether semantic measures can reveal these relationships and perhaps additional relationships as well. </p><p>5.4 Conclusions </p><p>Instead of using expert judgment to choose census variables and U.S. cities as done in Chapter 4, this chapter utilized PCA to reduce the 122 variables in the City Tables to six selected principal components, and employed KCA to group 490 U.S. cities into 6 clusters. Such a process is readily applicable to a new, unfamiliar dataset, and assists analysts in understanding their datasets. The author learned from the PCA analysis that the census dataset has six useful dimensions in the context of the dissertation. Although the first strategy to choose cities for further analysis failed, KCA still proved useful in selecting the final ten cities. The final choices are limited ultimately by the availability of detailed Wikipedia entries. The selected cities will serve as the focus test case for the thorough evaluation of two semantic measures in the next chapter. The reader will need to refer back to the list of cities from each cluster and their semantic grouping, as the categorization plays an important role in the upcoming evaluation of numeric and semantic measures.</p><p>Chapter 6 </p><p>Semantic Analysis </p><p>This chapter takes the lessons learned in Chapter 4 and computes the numeric and semantic similarity of the ten cities chosen in Chapter 5. The resulting scores are again visualized with scatter plots, which can clearly display and help one to compare similarity of the cities according to two measures. 
According to the clustering analysis in Chapter 5, one expects to see cities chosen from one cluster being similar in the census statistics space. The clusters form when one considers the six selected Principal Components (PCs) together. Considering one or a few of the selected PCs separately, as done below, can yield closer similarity between cities from two different clusters. The chosen cities can be categorized into three semantic groups as indicated in Section 5.3.1: metropolitan cities, urbanized university/technology cities and gambling cities. Grouping of the cities with respect to the cluster analysis does show some resemblance to the semantic groups at first glance. This is merely because the cities from Cluster 1 exactly match the metropolitan cities. The cities from the other clusters differ considerably from the other thematic groups. The ensuing semantic analysis demonstrates that the two proposed semantic measures can automatically discover the suggested semantic groups from Wikipedia entries with some errors. The reasons for the errors will be suggested later. The measures can also discover useful relationships between cities that analysts may not anticipate and that might not be captured in readily available statistics. The reader is asked to keep in mind that the proposed semantic measures evaluate similarity described below with respect to information in Wikipedia entries only. There can be characteristics of known cities that could and should increase or decrease similarity between certain cities. That, however, does not indicate errors in the proposed semantic measures. Should those characteristics be incorporated into the Wikipedia entries, or should richer, textual descriptions of cities become available, the measures will likely generate similarity in the manner one expects. 
</p><p>6.1 Synonym Count with a Vocabulary (SCV) </p><p>The preliminary experiment in Chapter 4 firmly suggests the removal of unwanted terms before similarity calculation. It also suggests that a bigger list of categories can help improve the measure by retaining more terms that are semantically meaningful, that is, terms that represent characteristics of a city rather than grammatical elements. Further analysis of Mitra and Wiederhold’s algorithm itself, as explained in Section 3.2.1.2.1, suggests that a similarity metric based on the feature-set theory as described in Section 3.2.1.2 can better capture similarity between descriptions, and will be used hereafter. Except for computing the similarity between two terms with the WordNet lexicon database, the new metric, Basic Synonym Count (BSC), completely departs from Mitra and Wiederhold’s algorithm. The name stems from the fundamental behavior of the metric. It counts the number of common synonyms between two textual descriptions and divides the count by the number of terms in the shorter description. Again according to Chapter 4, one wants to remove unwanted terms from descriptions by means of stop words and categories before computing similarity. The upcoming analysis employs the same stop words as in Chapter 4, but a newer list of census categories. The list comes from the 2005 census industry groups and industries (U.S. Census Bureau 2005). It contains 266 categories of industries, which is considerably larger than the list of 26 categories employed in Chapter 4. The longer list enables the algorithm to recognize more characteristics about cities; nevertheless, many useful terms are still missing from the list, as the results will demonstrate. The version of BSC which uses these categories is called SCV, short for Synonym Count with a Vocabulary. Note that the term ‘vocabulary’ signifies the generic nature of the metric: the metric does not consider the categories per se, but the terms in the categories. 
For example, the category ‘beauty salons’ is composed of two terms: ‘beauty’ and ‘salons’. In other words, the metric can work with any list of terms, whether or not they are categories. Section 3.2.1.2.2 explicates the details of SCV and the construction of a vocabulary from a list of categories. Just as in Section 4.2, in order to avoid template sections in Wikipedia, and sections with descriptions laden with statistical statements, this experiment computes on Economy sections only. The Wikipedia entries were accessed on August 4, 2007 and converted to OWL documents as in Chapter 4 in order to provide a consistent, machine-understandable document structure. Section 3.1.1 expounds the benefits of OWL documents in depth. The algorithm navigates document structures (comparable to tables of contents) of two entries and computes their SCV score on the text under the Economy sections and their subsections including the subsection headers. Section headers, such as Real Estate, can provide important semantic information. In this manner, the SCV scores among the selected cities are computed, and the results are discussed below. </p><p>6.1.1 Results and Discussion </p><p>We will thoroughly examine the similarity scores of four specific cities, namely, Los Angeles, Ann Arbor, Las Vegas and San Francisco in relation to themselves and the other selected cities. These four cities are randomly chosen to represent at least one city from each numeric cluster and each semantic group. Cities in each cluster and each theme should appear closer together in the respective spaces. We will start examining the scores by plotting the SCV scores between Los Angeles and the ten cities (including Los Angeles itself) on the y-axis and the equal weight (<i>w<sub>i</sub></i> = 1) Euclidean similarity scores with respect to PC I and PC II on the x-axis. The plots are shown in Figure 6-1. 
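</p><p>The counting rule from Section 6.1 (common synonyms divided by the number of terms in the shorter description, after term removal) can be sketched as follows. This is a minimal illustration under loud assumptions rather than the implementation of Section 3.2.1.2.2: the function names, the tiny synonym table standing in for WordNet, the toy vocabulary and stop words, and the choice to count matches from the shorter description are all hypothetical. </p>

```python
def filter_terms(text, vocabulary, stop_words):
    """Keep only the terms that are in the vocabulary and are not stop words."""
    return [t for t in text.lower().split() if t in vocabulary and t not in stop_words]


def are_synonyms(a, b, synonyms):
    """A term counts as a synonym of itself; other pairs come from the lookup table."""
    return a == b or b in synonyms.get(a, ()) or a in synonyms.get(b, ())


def scv(text_a, text_b, vocabulary, stop_words, synonyms):
    """Matched terms of the shorter description divided by its number of terms."""
    a = filter_terms(text_a, vocabulary, stop_words)
    b = filter_terms(text_b, vocabulary, stop_words)
    if not a or not b:
        return 0.0
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    matched = sum(1 for t in shorter if any(are_synonyms(t, u, synonyms) for u in longer))
    return matched / len(shorter)


# Hypothetical toy inputs; a real run would use WordNet and the census vocabulary.
vocabulary = {"financial", "companies", "portion", "part", "petroleum",
              "trade", "bank", "tech", "report"}
stop_words = {"a", "and", "of", "the", "including"}
synonyms = {"portion": {"part", "component"}, "study": {"report"}}

score = scv("financial companies including a portion of petroleum trade",
            "bank and tech companies report part of the financial trade",
            vocabulary, stop_words, synonyms)
print(score)
```

<p>In this toy case four of the five retained terms of the shorter description find a match, so the sketch returns 0.8; ‘petroleum’ stays unmatched because the toy synonym table has no entry for it. </p><p>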
The author chooses to compare the semantic scores with these two PCs because they represent the socioeconomic dimensions of the cities; it is only logical to compare numeric and semantic scores of the same aspect. With respect to PC I, Chicago, Houston and Philadelphia are the closest cities to Los Angeles. One would expect to see this result since they are grouped in the same cluster mainly on PC I as found in the last chapter. These cities stay as the top matches (cities with higher similarity scores than the rest of the cities in either a numeric or a semantic comparison) with respect to PC II, demonstrating that they look alike in terms of their percentage socioeconomic size as well. Las Vegas also joins the top cities in this dimension. </p><p>Figure 6-1: Similarity comparison of Los Angeles, CA and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV. </p><p>The semantic measure, in contrast, places Boston at the top of the list, Chicago second and Houston last. According to the suggested semantic themes, it is surprising that Boston is the best match since Boston and Los Angeles are not in the same thematic group. Note, however, that the range of scores is quite small. A critical analyst will question whether the score differences are bigger than the errors. But recall that one knows neither the correct ranking nor the correct scores for this study. Reference scores against which one can calculate the errors do not exist. Section 3.3 discusses this topic fully. Thus, one needs to read the original descriptions and determine the quality (correctness and reasonable ranking) of the matches instead. Appendix E contains the Economy sections of the Wikipedia entries of the ten cities. (The author also includes the Climate sections and the Culture sections in Appendix E as a suggestion for future research; they contain descriptions amenable to semantic measures.) 
The author will investigate the best and the worst matches at length. If SCV can correctly rank places, the best match will be significantly more similar to Los Angeles than the worst match. Let us begin with the best match. Table 6-1 lists the common synonyms between the Economy sections of Los Angeles and Boston. The meanings of highlighted and colored terms in the table will be discussed later because some highlights and text colors signify concepts not yet introduced. They will be explained as we analyze the terms. The table actually consists of two lists: one for the synonyms from the Los Angeles entry and one for the Boston entry. The lists are sorted by frequency and then in reverse alphabetical order, the default sorting order of SortedList (the implemented data structure in the Java programming language). The two lists can contain different terms since one list contains the synonyms of the terms in the other list, rather than the identical terms. Note that this work considers a term to be a synonym of itself. The intersection terms are highlighted yellow with gold text (color of the Boston symbol in the plots) in the appendix so as to assist in locating and interpreting their meanings. The text of Los Angeles also contains the intersection terms between Los Angeles and Houston. Where the two intersection sets overlap, the overlapping terms are highlighted light blue (mix of the colors of the Boston symbol and the Houston symbol) with black text. Reading the descriptions, one should agree that Los Angeles and Boston are indeed similar in many aspects. The cities are home to technology industries, major companies and financial firms. The key terms, namely, ‘financial’, ‘bank’, ‘tech’, ‘private’, ‘insurance’ and ‘health’, appear in the parts of the descriptions that depict those aspects, indicating that SCV correctly identifies descriptive key terms of the cities. Light green text in Table 6-1 signifies these terms. 
As expounded in Section 3.2.1.5, one can state that these key terms have high information content (IC). They specify exact types of industries and companies in Los Angeles and Boston. We will learn, as we examine more cities, that most cities have some companies, businesses, retailers and industries. Without high-IC terms, one cannot clearly distinguish one city from the others. That is not to say that we should ignore all terms except the high-IC ones. Words have a range of IC values. A city with a technology company differs from a city with a technology retailer. One can also state that a technology retailer is a kind of technology company. Therefore, in this case, ‘retailer’ has higher IC than ‘company’. Although Los Angeles and Boston appear to be a good match, one will observe flaws in the stop-words list, the vocabulary and the mechanics of SCV when interpreting the meanings of the terms from the text and contemplating the lists. Some terms are matched via wrong senses. Some terms should be added to the vocabulary, and some should be added to the stop-words list. The term ‘including’ does not describe the city in any meaningful way; it is only a grammatical artifact. Thus it should be added to the stop-words list. The term removal process should retain the term ‘university’, which is the key term for identifying Boston as a university town. SCV removes the term because the vocabulary includes ‘universities’, but not ‘university’. Tourism constitutes a large portion of the economies of Los Angeles, Boston and many other cities; the term removal process does not keep any terms related to tourism. Words on this topic, such as ‘tourism’, ‘tourist’, ‘attraction’ and ‘site’, should be added to the vocabulary. Three synonyms are matched in senses other than the senses in which they are used in the sections: </p><p>national vs. home; stock vs. fund; office vs. authority </p><p>The first term of a pair is from the text of Los Angeles and the second from that of Boston. For instance, ‘national’ and ‘home’ share the meaning “inside the country” (Miller et al. 2005). However, in this context, ‘national’ means “concerned with an entire nation or country”, and ‘home’ means “a place where something began and flourished” (Miller et al. 2005). A critical mind should ask at this point whether the synonym matching algorithm matches any synonyms correctly. From the descriptions, the following synonyms are correct: </p><p>portion vs. part; portion vs. component; study vs. report </p><p>Again, the first term of a pair comes from the text of Los Angeles and the second from that of Boston. In Table 6-1, the correct synonyms have a blue background and the incorrect ones a yellow background. We will compare the correct and incorrect synonym matches after we look at the worst match. Let us turn now to the worst match. Table 6-2 lists the intersection terms, which are given the same highlight and color scheme as in Table 6-1. In Appendix E, the terms are highlighted in yellow with olive green text (color of the Houston symbol in the plots). Again as stated earlier, when the terms overlap with the intersection terms between the descriptions of Los Angeles and Boston, the overlapping terms are highlighted using light blue (mix of the colors of the Boston and Houston symbols). Manually judging similarity between Los Angeles and Houston, one will feel that the cities are quite similar in terms of being a major port town, having multiple companies and having a big financial sector. This does not, however, render the result spurious. One can argue that the measure, SCV, does what it is designed to do: count the number of characteristics in the vocabulary that two cities have in common. Houston indeed shares fewer characteristics in the vocabulary with Los Angeles than Boston does (19 versus 30); it should have a smaller similarity value, but not 0. 
A closer investigation of the intersection terms reveals many incorrect synonyms: </p><p>technology vs. engineering; production vs. product; office vs. authority; manufacturing vs. industry; line vs. channel; home vs. base; data vs. information </p><p>The first term of each pair comes from the description of Los Angeles and the second from that of Houston. The algorithm identifies only one correct synonym match: </p><p>world vs. worldwide </p><p>Additionally, the synonym matching fails to match ‘petroleum’ in the description of Los Angeles with ‘oil’ or ‘gasoline’ in that of Houston. Among the intersection terms, the vocabulary contributes only two high-IC terms, ‘health’ and ‘financial’, which identify two common industries. Both cities are among the most important ports. In order for SCV to detect a port city, terms such as ‘port’, ‘ship’, ‘container’ and ‘channel’ should be added to the vocabulary. </p><p>At this point, it is obvious that the SCV measure has some flaws. Many necessary key terms are missing from the vocabulary; the algorithm cannot recognize several characteristics of cities (e.g., having a major port and depending on universities). The synonym matching process generally produces incorrect results. Lastly, certain terms should be added to the stop-words list. Considering the latter two problems and assuming that we only want to compare cities on aspects represented by terms in the vocabulary, one can evaluate the quality of the current city ranking by considering two questions: </p><p>1. Do top matches have fewer incorrect synonym matches? </p><p>2. Do top matches have more high-IC terms? </p><p>If the top matches consistently have fewer incorrect synonyms and more high-IC terms, the ranking is likely correct. The top matches then share more characteristics with a city of interest and have lower errors. </p><p>Table 6-1: Intersection Terms of Los Angeles, CA vs. Boston, MA</p>
<table>
<tr><th colspan="2">Los Angeles, CA → Boston, MA</th><th colspan="2">Boston, MA → Los Angeles, CA</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>5</td><td>companies</td><td>5</td><td>companies</td></tr>
<tr><td>2</td><td>world</td><td>3</td><td>home</td></tr>
<tr><td>2</td><td>portion</td><td>2</td><td>world</td></tr>
<tr><td>2</td><td>national</td><td>2</td><td>financial</td></tr>
<tr><td>2</td><td>home</td><td>2</td><td>bank</td></tr>
<tr><td>2</td><td>financial</td><td>1</td><td>trade</td></tr>
<tr><td>2</td><td>bank</td><td>1</td><td>tech</td></tr>
<tr><td>1</td><td>trade</td><td>1</td><td>report</td></tr>
<tr><td>1</td><td>tech</td><td>1</td><td>private</td></tr>
<tr><td>1</td><td>study</td><td>1</td><td>part</td></tr>
<tr><td>1</td><td>stock</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>private</td><td>1</td><td>insurance</td></tr>
<tr><td>1</td><td>office</td><td>1</td><td>industries</td></tr>
<tr><td>1</td><td>insurance</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>industries</td><td>1</td><td>health</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>fund</td></tr>
<tr><td>1</td><td>health</td><td>1</td><td>consulting</td></tr>
<tr><td>1</td><td>consulting</td><td>1</td><td>computer</td></tr>
<tr><td>1</td><td>computer</td><td>1</td><td>component</td></tr>
<tr><td>1</td><td>centers</td><td>1</td><td>centers</td></tr>
<tr><td></td><td></td><td>1</td><td>authority</td></tr>
</table>
<p>Table 6-2: Intersection Terms of Los Angeles, CA vs. Houston, TX</p>
<table>
<tr><th colspan="2">Los Angeles, CA → Houston, TX</th><th colspan="2">Houston, TX → Los Angeles, CA</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>2</td><td>world</td><td>2</td><td>trade</td></tr>
<tr><td>2</td><td>trade</td><td>2</td><td>business</td></tr>
<tr><td>2</td><td>business</td><td>1</td><td>worldwide</td></tr>
<tr><td>1</td><td>technology</td><td>1</td><td>world</td></tr>
<tr><td>1</td><td>production</td><td>1</td><td>product</td></tr>
<tr><td>1</td><td>offices</td><td>1</td><td>offices</td></tr>
<tr><td>1</td><td>office</td><td>1</td><td>news</td></tr>
<tr><td>1</td><td>news</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>national</td><td>1</td><td>international</td></tr>
<tr><td>1</td><td>manufacturing</td><td>1</td><td>information</td></tr>
<tr><td>1</td><td>line</td><td>1</td><td>industry</td></tr>
<tr><td>1</td><td>international</td><td>1</td><td>health</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>financial</td></tr>
<tr><td>1</td><td>health</td><td>1</td><td>engineering</td></tr>
<tr><td>1</td><td>financial</td><td>1</td><td>channel</td></tr>
<tr><td>1</td><td>data</td><td>1</td><td>base</td></tr>
<tr><td></td><td></td><td>1</td><td>authority</td></tr>
</table>
<p>In order to answer the above questions, we will examine the matches of the other three chosen cities in the same manner as Boston and Houston. Figure 6-2 shows the similarity scores between Ann Arbor and the ten cities. Likewise, Figure 6-3 and Figure 6-4 display the similarity scores of the ten cities to Las Vegas and San Francisco, respectively. Appendix F contains lists of intersection terms used in the examination. </p><p>Figure 6-2: Similarity comparison of Ann Arbor, MI and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV. </p><p>Figure 6-3: Similarity comparison of Las Vegas, NV and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV. </p><p>Figure 6-4: Similarity comparison of San Francisco, CA and other cities based on (a) Principal Component I (PC I) and SCV, and (b) Principal Component II (PC II) and SCV. </p><p>Figure 6-2 indicates that Reno most resembles Ann Arbor in terms of its absolute socioeconomic size (PC I), followed by Las Vegas and Boston. With respect to their percentage socioeconomic statistics (PC II), San Francisco is closest to Ann Arbor, Boston second and San Diego third. Just as in the case of Los Angeles, the semantic measure portrays a completely different story. San Diego has the highest similarity score, closely trailed by Chicago and then Boston. The semantic order also disagrees with the numeric order for Las Vegas and San Francisco, as is obvious in Figure 6-3 and Figure 6-4. One can safely hypothesize that the proposed SCV semantic measure can provide additional information on similarity between cities. In Section 6.3, statistical tests are performed to test this hypothesis. It is the semantic matches that we are interested in at the moment, and therefore we will leave further discussion of numeric matches until Section 6.3. The rest of this subsection will focus on answering the above two questions. </p><p>Table 6-3 summarizes the ranking of the ten selected cities with regard to their semantic similarity to one of them. The column headers specify the cities to which they are compared. We will identify the correct synonyms, the incorrect synonyms and the high-IC terms of the following five top matches and three bottom matches (cities with lower similarity scores than the rest of the cities in either a numeric or a semantic comparison), in the same manner as Boston and Houston: </p><p>• Top matches: San Diego to Ann Arbor, Reno to Las Vegas, San Diego to Las Vegas, Chicago to San Francisco and Philadelphia to San Francisco; </p><p>• Bottom matches: Reno to Ann Arbor, Boston to Las Vegas and Las Vegas to San Francisco. </p><p>Note that the choices of the matches are not critical in finding the answers to the two earlier questions. One just needs to randomly pick a couple of top and bottom matches, interpret the intersection terms and count. Therefore the choices should suffice. 
Furthermore, the choices represent many interesting cases. Due to its size, one would not expect San Diego to be more similar to Ann Arbor than Boston is. The semantic measure, however, tells us otherwise. San Diego is also the best match for Las Vegas. Being a gambling city, perhaps Reno should be the closest match instead. Considering the reasons behind the highest scores of San Diego will be interesting, and analyzing the intersection terms of Reno and Las Vegas will likely provide further insights. Boston, Ann Arbor and San Diego should resemble San Francisco more than Los Angeles, Chicago and Philadelphia do, since their economies, like that of San Francisco, are based on their universities and technology industries. Instead, the semantic measure groups San Francisco with the other metropolises except for Houston. One would be curious why the measure does so. </p><p>Table 6-3: The Selected Ten Cities in Descending Order of Their Semantic Similarity to the Specified Cities</p>
<table>
<tr><th>Rank</th><th>Los Angeles, CA</th><th>Ann Arbor, MI</th><th>Las Vegas, NV</th><th>San Francisco, CA</th></tr>
<tr><td>1</td><td>Boston, MA</td><td>San Diego, CA</td><td>San Diego, CA</td><td>Los Angeles, CA</td></tr>
<tr><td>2</td><td>Chicago, IL</td><td>Chicago, IL</td><td>Reno, NV</td><td>Chicago, IL</td></tr>
<tr><td>3</td><td>San Francisco, CA</td><td>Boston, MA</td><td>Chicago, IL</td><td>Philadelphia, PA</td></tr>
<tr><td>4</td><td>Ann Arbor, MI</td><td>Philadelphia, PA</td><td>Ann Arbor, MI</td><td>Boston, MA</td></tr>
<tr><td>5</td><td>Reno, NV</td><td>Los Angeles, CA</td><td>Philadelphia, PA</td><td>Ann Arbor, MI</td></tr>
<tr><td>6</td><td>Philadelphia, PA</td><td>Las Vegas, NV</td><td>Los Angeles, CA</td><td>Houston, TX</td></tr>
<tr><td>7</td><td>Las Vegas, NV</td><td>San Francisco, CA</td><td>Houston, TX</td><td>San Diego, CA</td></tr>
<tr><td>8</td><td>San Diego, CA</td><td>Houston, TX</td><td>Boston, MA</td><td>Reno, NV</td></tr>
<tr><td>9</td><td>Houston, TX</td><td>Reno, NV</td><td>San Francisco, CA</td><td>Las Vegas, NV</td></tr>
</table>
<p>Table F-1 to Table F-9 list the intersection terms of the chosen matches. The terms are highlighted yellow in Appendix E, just as with the intersection terms between Los Angeles and Boston, and Los Angeles and Houston. San Diego to Las Vegas and Reno to Las Vegas are two special cases in which the terms are highlighted grey in order to distinguish them from the terms of San Diego to Ann Arbor and Reno to Ann Arbor. 
The text colors of the terms are the colors of their associated symbols in the plots. The terms of San Diego to Ann Arbor are, for instance, displayed in the color of the symbol of San Diego. When terms are in two intersection sets, they are displayed in black text and highlighted in the color resulting from mixing the two colors of the corresponding symbols. For readability, the intersection terms of Boston to Las Vegas and Las Vegas to San Francisco are neither highlighted nor colored. Marking those terms would obscure the intersection terms that are already marked in the descriptions of Las Vegas and San Francisco. The readers are kindly asked to exercise some effort in identifying the unmarked terms. </p><p>Contemplating each intersection set and interpreting the senses of its terms in the descriptions, one would find the synonym and high-IC term statistics of the matches as illustrated in Table 6-4 and Table 6-5. The readers can explore the actual terms in Appendix F. Table F-1 to Table F-9 display synonyms correctly matched in their intended senses with a blue background, synonyms incorrectly matched with a yellow background and high-IC terms in light green text. Note that the lengths of the synonym lists of Las Vegas vs. Reno in Table F-3 are not equal because of the design of WordNet. WordNet sometimes considers a term to be a synonym of another term, but not the other way around. As a result, SCV can find a term in one description to be a synonym of a term in another, but fail to match the latter to the former. The length discrepancies of Table F-6 and Table F-7 can be attributed to the same reason. To work around this problem conservatively, the calculation of a similarity score uses the longer list (a longer list of intersection terms means a higher score), and the calculation of the synonym and high-IC term statistics uses the shorter list. Doing so increases the error of a semantic measure by potentially assigning high scores to bad matches and increasing the percentage of incorrect synonyms. If most of these artificially better matches turn out to be good matches with an acceptable level of incorrect synonym matching (recall that the percentage of incorrect synonyms is increased by the workaround), the semantic measure does well in this worst-case scenario, and would do better if the synonymy relationship in WordNet were symmetric. </p><p>Table 6-4: Summary of Correct and Incorrect Synonyms for (a) Top and (b) Bottom Matches</p><p>(a)</p>
<table>
<tr><th>Top Matches</th><th>Correct<sup>6</sup> Freq.</th><th>Correct<sup>6</sup> %</th><th>Incorrect Freq.</th><th>Incorrect %</th></tr>
<tr><td>Boston, MA to Los Angeles, CA</td><td>3</td><td>10.0</td><td>3</td><td>10.0</td></tr>
<tr><td>San Diego, CA to Ann Arbor, MI</td><td>1</td><td>2.9</td><td>5</td><td>14.3</td></tr>
<tr><td>Reno, NV to Las Vegas, NV<sup>7</sup></td><td>1</td><td>8.3</td><td>1</td><td>8.3</td></tr>
<tr><td>San Diego, CA to Las Vegas, NV</td><td>0</td><td>0.0</td><td>7</td><td>31.8</td></tr>
<tr><td>Chicago, IL to San Francisco, CA<sup>8</sup></td><td>0</td><td>0.0</td><td>4</td><td>15.4</td></tr>
<tr><td>Philadelphia, PA to San Francisco, CA<sup>9</sup></td><td>0</td><td>0.0</td><td>6</td><td>24.0</td></tr>
<tr><td>Average</td><td>&lt;1</td><td>3.5</td><td>4</td><td>17.3</td></tr>
</table>
<p>(b)</p>
<table>
<tr><th>Bottom Matches</th><th>Correct Freq.</th><th>Correct %</th><th>Incorrect Freq.</th><th>Incorrect %</th></tr>
<tr><td>Houston, TX to Los Angeles, CA</td><td>1</td><td>5.3</td><td>7</td><td>36.8</td></tr>
<tr><td>Reno, NV to Ann Arbor, MI</td><td>1</td><td>9.1</td><td>4</td><td>36.4</td></tr>
<tr><td>Boston, MA to Las Vegas, NV</td><td>0</td><td>0.0</td><td>2</td><td>14.3</td></tr>
<tr><td>Las Vegas, NV to San Francisco, CA</td><td>0</td><td>0.0</td><td>4</td><td>33.3</td></tr>
<tr><td>Average</td><td>&lt;1</td><td>3.6</td><td>4</td><td>30.2</td></tr>
</table>
<p><sup>6</sup> Excluded synonyms with the same spellings, i.e., terms that are synonyms of themselves. <sup>7</sup> Calculated the percentage values by the number of intersection terms in the comparison of Las Vegas, NV to Reno, NV. <sup>8</sup> Calculated the percentage values by the number of intersection terms in the comparison of Chicago, IL to San Francisco, CA. <sup>9</sup> Calculated the percentage values by the number of intersection terms in the comparison of Philadelphia, PA to San Francisco, CA. </p><p>Table 6-5: Summary of High-IC Terms for (a) Top and (b) Bottom Matches</p><p>(a)</p>
<table>
<tr><th>Top City Pair</th><th>Term Freq.</th><th>Term %</th></tr>
<tr><td>Boston, MA to Los Angeles, CA</td><td>8</td><td>26.7</td></tr>
<tr><td>San Diego, CA to Ann Arbor, MI</td><td>9</td><td>25.7</td></tr>
<tr><td>Reno, NV to Las Vegas, NV<sup>10</sup></td><td>8</td><td>66.67</td></tr>
<tr><td>San Diego, CA to Las Vegas, NV</td><td>2</td><td>9.09</td></tr>
<tr><td>Chicago, IL to San Francisco, CA<sup>11</sup></td><td>3</td><td>11.5</td></tr>
<tr><td>Philadelphia, PA to San Francisco, CA<sup>12</sup></td><td>6</td><td>24.0</td></tr>
<tr><td>Average<sup>13</sup></td><td>6</td><td>19.4</td></tr>
</table>
<p>(b)</p>
<table>
<tr><th>Bottom City Pair</th><th>Term Freq.</th><th>Term %</th></tr>
<tr><td>Houston, TX to Los Angeles, CA</td><td>2</td><td>10.5</td></tr>
<tr><td>Reno, NV to Ann Arbor, MI</td><td>3</td><td>27.3</td></tr>
<tr><td>Boston, MA to Las Vegas, NV</td><td>1</td><td>7.1</td></tr>
<tr><td>Las Vegas, NV to San Francisco, CA</td><td>0</td><td>0</td></tr>
<tr><td>Average</td><td>2</td><td>11.2</td></tr>
</table>
<p><sup>10</sup> Calculated the percentage values by the number of intersection terms in the comparison of Las Vegas, NV to Reno, NV. <sup>11</sup> Calculated the percentage values by the number of intersection terms in the comparison of Chicago, IL to San Francisco, CA. <sup>12</sup> Calculated the percentage values by the number of intersection terms in the comparison of Philadelphia, PA to San Francisco, CA. <sup>13</sup> Excluding the values of Las Vegas, NV vs. Reno, NV since it is an outlier. The percentage value seems much larger than the values of the other pairs. </p><p>The synonym statistics given in Table 6-4 clearly cast some doubt on the value of the synonym matching process. The process produces on average four times as many incorrect matches as correct ones. The absolute number of incorrect matches seems to be independent of the total number of common terms between two descriptions. Both top and bottom matches have an average of four mismatches. As a result, the synonym matching process in effect introduces a constant error to the semantic measure. This means that the matches will have erroneously higher scores than they would have without synonym matching. The increases are proportionally greater for the bottom matches than for the top matches, resulting in a smaller range of scores and making it harder to distinguish matches. The best match, San Diego to Las Vegas, epitomizes a case in which the best match is spurious. Incorrect matches account for 31.8 percent of the similarity between San Diego and Las Vegas. 
Without synonym matching, Reno would come out the best match, as one expects. </p><p>Turning to the high-IC term statistics in Table 6-5, the average absolute number of high-IC terms of the top matches is three times as large as that of the bottom matches; the average percentage is almost two times as large. The top matches apparently share more characteristics with the cities of interest than the bottom matches do. This means that SCV in general generates a reasonable ranking of the ten chosen cities. Note that two top matches, San Diego to Las Vegas and Chicago to San Francisco, do not belong among the top matches. Their percentages of high-IC terms are as low as the average of the bottom matches. On the other hand, one bottom match, Reno to Ann Arbor, should be among the top matches, as its intersection set contains many high-IC terms. Lastly, Boston to Las Vegas typifies a true bad match, in which only a few of the small set of common characteristics are incorrect synonyms or high-IC terms. This means that the similarity score is not exaggerated by incorrect synonyms, and the common terms do not carry much information about Boston and Las Vegas. </p><p>6.1.2 Discussion of SCV Performance </p><p>The experiment described in this section demonstrates that high-IC terms, which clearly identify characteristics of cities, play an important role in determining the quality of the similarity scores. Matches with more high-IC terms should be ranked higher than those with fewer, even if they have slightly fewer intersection terms. The current synonym matching process incorrectly matches terms in senses other than those intended. By doing so, the process artificially increases similarity scores. Since the average number of incorrect synonyms is constant for top and bottom matches, bottom matches will be affected on average more than top matches. 
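</p><p>The constant-error argument can be checked with a few lines of arithmetic. The sketch below simply re-enters the incorrect-synonym columns of Table 6-4 and averages them; the variable and function names are illustrative only and are not part of the original analysis.</p>

```python
from statistics import mean

# (city pair, incorrect synonym matches, incorrect matches as % of
# intersection terms), copied from Table 6-4.
TOP_MATCHES = [
    ("Boston to Los Angeles", 3, 10.0),
    ("San Diego to Ann Arbor", 5, 14.3),
    ("Reno to Las Vegas", 1, 8.3),
    ("San Diego to Las Vegas", 7, 31.8),
    ("Chicago to San Francisco", 4, 15.4),
    ("Philadelphia to San Francisco", 6, 24.0),
]
BOTTOM_MATCHES = [
    ("Houston to Los Angeles", 7, 36.8),
    ("Reno to Ann Arbor", 4, 36.4),
    ("Boston to Las Vegas", 2, 14.3),
    ("Las Vegas to San Francisco", 4, 33.3),
]


def summarize(rows):
    """Return (mean incorrect count, mean incorrect percentage) for a group."""
    counts = [count for _, count, _ in rows]
    shares = [share for _, _, share in rows]
    return mean(counts), mean(shares)


top_count, top_share = summarize(TOP_MATCHES)
bottom_count, bottom_share = summarize(BOTTOM_MATCHES)
print(f"top:    {top_count:.1f} incorrect matches, {top_share:.1f}% of terms")
print(f"bottom: {bottom_count:.1f} incorrect matches, {bottom_share:.1f}% of terms")
```

<p>Both groups average roughly four incorrect matches, but because the bottom matches have smaller intersection sets, the same absolute error makes up a much larger share of their scores.</p><p>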
The upcoming section proposes a semantic measure that can tackle this problem by determining the sense of a term from its surrounding terms. Despite the two stated shortfalls, the semantic measure proposed in this section, namely SCV, can reasonably rank cities according to their similarity to a city of interest; the top matches of the described experiment have, on average, a smaller percentage of incorrect synonyms and more high-IC terms than the bottom matches. </p><p>6.2 Corpus-Based Synonym Count (CBSC) </p><p>The results of the experiment described in the previous section identify term sense disambiguation as a major problem for synonym matching. This section describes an experiment with CBSC, short for Corpus-Based Synonym Count, which can disambiguate term senses based on their usage in a corpus. The main premise is that two terms have similar senses if many surrounding terms of one term are identical to those of the other. CBSC computes this by calculating the cosine similarity between term vectors, each defined by a window of a certain number of terms preceding and following every occurrence of the term. Section 3.2.1.3.1 explains the measure at length. Additionally, CBSC eliminates the need for a lexicon database (e.g., WordNet), which is expensive to develop and maintain. CBSC counts synonyms and computes similarity scores using the feature set theory explained in Section 3.2.1.3.1 in the same manner as BSC. For this experiment, two terms are considered synonyms when their similarity score is greater than 0.9. This cut-off point was selected after a few trials to eliminate incorrect synonyms found in the previous section while retaining correct ones. As for the corpus, the experiment uses the Economy sections of the Wikipedia entries of the 490 cities selected in the previous section. Words have different meanings in different communities, disciplines or topics. 
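</p><p>The window-based comparison just described can be illustrated with a minimal sketch. It is not the implementation used in the experiment (Section 3.2.1.3.1 gives the actual definition); the window size of two, the whitespace tokenization, the toy corpus and all function names are assumptions made only for illustration.</p>

```python
import math
from collections import Counter


def context_vector(tokens, term, window=2):
    """Count the terms found within `window` positions of each occurrence of `term`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            vec.update(tokens[max(0, i - window):i])    # preceding terms
            vec.update(tokens[i + 1:i + 1 + window])    # following terms
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def are_synonyms(tokens, a, b, window=2, cutoff=0.9):
    """Treat two terms as synonyms when their context vectors exceed the cut-off."""
    return cosine(context_vector(tokens, a, window),
                  context_vector(tokens, b, window)) > cutoff


# Invented toy corpus, not text from any Wikipedia entry.
TOY_CORPUS = ("visitors enjoy gaming in large resorts "
              "visitors enjoy gambling in large resorts "
              "city builds housing near downtown").split()

print(are_synonyms(TOY_CORPUS, "gaming", "gambling"))  # shared contexts
print(are_synonyms(TOY_CORPUS, "gaming", "housing"))   # disjoint contexts
```

<p>In this toy corpus ‘gaming’ and ‘gambling’ occur in identical contexts and pass the 0.9 cut-off, while ‘gaming’ and ‘housing’ share no context terms. With a real corpus, near-identical contexts are rare, which is why a high cut-off admits few pairs other than identical terms.</p><p>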
Using the Economy sections as the reference corpus ensures that CBSC considers the senses of terms as they are used by Wikipedia authors and in economics. The experiment compares the ten cities chosen in Chapter 5. As in the previous section, the comparison is based solely on the Economy section, and the descriptions are processed to remove any stop words and retain only terms in the vocabulary. The results below will show that CBSC effectively avoids the incorrect synonyms found in the previous section and matches some of them with their correct synonyms. As a result, the quality of the scores improves; one can clearly see cities with spuriously high SCV scores when comparing the CBSC scores with the SCV scores. </p><p>6.2.1 Results and Discussion </p><p>We will investigate only the similarity scores of the ten chosen cities to Las Vegas. This one set of similarity scores appears adequate for demonstrating the pros and cons of CBSC. Section 6.3 will employ statistical methods to explore the rest of the results. Figure 6-5 displays the CBSC scores in comparison with the SCV scores. The diagonal line represents equal CBSC and SCV scores, and therefore data points along the line indicate small changes in the scores. Keeping this in mind, one would notice three data points that fall significantly below the line: San Diego, Reno and Houston. From the detailed analysis in the previous section, one would expect San Diego’s similarity to Las Vegas to decrease after removing incorrect synonym matches. Like that of San Diego, the SCV score of Houston likely incorporates many incorrect synonym matches; the analysis below will show this to be the case. Surprisingly, the score of Reno decreases too. The analysis in the last section indicates that Reno is a good match to Las Vegas with only one incorrect synonym match. Removing one incorrect match should not cause a reduction of almost one half of the SCV score. 
To unravel this puzzle, we will look at the intersection terms between Reno and Las Vegas as listed in Table 6-6. The table compares the intersection terms generated by SCV and CBSC. Green text, yellow background and blue background indicate high-IC terms, incorrect synonym matches and correct synonym matches, respectively, just as in the previous section. The terms identified by SCV but not CBSC are grayed out. Analyzing the intersection terms in Table 6-6, one would observe that CBSC omits a few high-IC terms, i.e., ‘gaming’ and ‘slot’, which are crucial in matching Reno and Las Vegas. This is because the vocabulary does not contain these two terms. SCV has a similar problem, but not as severe because SCV expands the terms in the vocabulary with their respective synonyms in WordNet. It will become clear as we explore more intersection-term sets that CBSC cannot identify synonyms very well at the current cut-off score of 0.9. It fails to recognize that ‘gaming’ is a synonym of ‘gambling’ and that ‘slot’ is a synonym of a term in the vocabulary. As a result, the stop-words/vocabulary filtering process fails to retain ‘gaming’ and ‘slot’. To test that this is the case, the vocabulary is enhanced with two terms related to gambling, namely, ‘gaming’ and ‘casino’. </p><p>Before we continue on to the results of the augmented vocabulary, it should be noted that CBSC not only eliminates incorrect synonym matches, but also corrects some of them. The term ‘development’ in the Las Vegas entry is now correctly matched with the term ‘development’ in the Reno entry; SCV previously matched it with ‘growth’ in a wrong sense. </p><p>The new scores are shown in Figure 6-6, and the new intersection terms of Reno and Las Vegas in Table 6-7. The terms identified by CBSC but not SCV are grayed out in the right column. The CBSC score of Reno nearly doubles, indicating that the augmented vocabulary significantly improves the analysis results. Now Reno is one of the best matches, as one would anticipate. 
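</p><p>The effect of the vocabulary on the filtering step can be illustrated with a small sketch. The stop-words list, the vocabulary subset and the two one-sentence descriptions below are invented for illustration and are far smaller than those used in the experiment; the function names are likewise hypothetical.</p>

```python
from collections import Counter

# Illustrative fragments only; the real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "of", "in"}
VOCABULARY = {"gambling", "development", "technology", "companies"}


def filter_terms(text, vocabulary, stop_words=STOP_WORDS):
    """Drop stop words, then keep only the terms that appear in the vocabulary."""
    tokens = [tok.strip(".,;").lower() for tok in text.split()]
    return Counter(t for t in tokens if t and t not in stop_words and t in vocabulary)


def shared_terms(terms_a, terms_b):
    """Identical terms retained in both filtered descriptions."""
    return sorted(terms_a.keys() & terms_b.keys())


# Made-up descriptions, not the Wikipedia Economy sections.
las_vegas = "Gaming and casino development drive the economy; gambling is legal."
reno = "The casino industry and gaming revenue support development; gambling is common."

original = shared_terms(filter_terms(las_vegas, VOCABULARY),
                        filter_terms(reno, VOCABULARY))
augmented_vocabulary = VOCABULARY | {"gaming", "casino"}
augmented = shared_terms(filter_terms(las_vegas, augmented_vocabulary),
                         filter_terms(reno, augmented_vocabulary))

print(original)
print(augmented)
```

<p>With the original vocabulary only ‘development’ and ‘gambling’ survive the filter in both descriptions; adding ‘gaming’ and ‘casino’ doubles the number of shared terms, which mirrors the near doubling of Reno’s CBSC score.</p><p>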
Considering the terms in Table 6-7, one can easily increase the score further by including ‘slot’ in the vocabulary as well. The other terms identified by SCV, but not CBSC, are quite common in the other entries, as they appear many times in the intersection sets we have investigated, and will not likely help separate Reno from the other top matches. Including them will increase the scores of the other top matches too. Note that ‘casino’ is identified by CBSC but not by SCV, confirming that one can effectively control the city characteristics one wants to consider by controlling the terms in a vocabulary. </p><p>Figure 6-5: Similarity comparison of Las Vegas, NV and other cities based on SCV and CBSC. </p><p>Table 6-6: Intersection Terms of Las Vegas, NV vs. Reno, NV</p>
<table>
<tr><th colspan="4">Las Vegas, NV → Reno, NV</th></tr>
<tr><th colspan="2">SCV</th><th colspan="2">CBSC</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>3</td><td>gaming</td><td>1</td><td>including</td></tr>
<tr><td>2</td><td>technology</td><td>1</td><td>gambling</td></tr>
<tr><td>2</td><td>development</td><td>1</td><td>development</td></tr>
<tr><td>1</td><td>slot</td><td></td><td></td></tr>
<tr><td>1</td><td>manufacture</td><td></td><td></td></tr>
<tr><td>1</td><td>including</td><td></td><td></td></tr>
<tr><td>1</td><td>growth</td><td></td><td></td></tr>
<tr><td>1</td><td>gambling</td><td></td><td></td></tr>
</table>
<p>Let us turn now to the scores of San Diego and Houston. We will consider the results of the augmented vocabulary rather than the original. CBSC effectively removes the incorrect synonym matches from the intersection set of Las Vegas and San Diego, as illustrated in Table 6-8. The incorrect synonyms are no longer included in the intersection set except for ‘development’, which is correctly matched by CBSC. However, the CBSC set does not include the high-IC term ‘technology’. Again, this can be attributed to the incomplete vocabulary. SCV includes ‘technology’ because, according to WordNet, it is a synonym of ‘engineering’, which is in the original vocabulary. </p><p>For Houston, Table 6-9 demonstrates that SCV indeed produces many incorrect synonym matches, as we suspected earlier. The detailed identification of incorrect matches can be found in Table F-9. Once more, CBSC completely removes the incorrect matches, along with a few correct ones. 
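None of the ignored terms are high-IC terms, and therefore including them will increase the score, but not its quality. If we assume that the vocabulary contains all city characteristics of interest, Houston most likely does not share any characteristics pertinent to this analysis with Las Vegas. In short, CBSC appropriately gives Houston a low score. </p><p>Figure 6-6: Similarity comparison of Las Vegas, NV and other cities based on SCV and CBSC with the modified vocabulary. </p><p>Table 6-7: Intersection Terms of Las Vegas, NV vs. Reno, NV</p>
<table>
<tr><th colspan="4">Las Vegas, NV → Reno, NV</th></tr>
<tr><th colspan="2">SCV</th><th colspan="2">CBSC</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>3</td><td>gaming</td><td>3</td><td>gaming</td></tr>
<tr><td>2</td><td>technology</td><td>1</td><td>including</td></tr>
<tr><td>2</td><td>development</td><td>1</td><td>gambling</td></tr>
<tr><td>1</td><td>slot</td><td>1</td><td>development</td></tr>
<tr><td>1</td><td>manufacture</td><td>1</td><td>casino</td></tr>
<tr><td>1</td><td>including</td><td></td><td></td></tr>
<tr><td>1</td><td>growth</td><td></td><td></td></tr>
<tr><td>1</td><td>gambling</td><td></td><td></td></tr>
</table>
<p>Table 6-8: Intersection Terms of Las Vegas, NV vs. San Diego, CA</p>
<table>
<tr><th colspan="4">Las Vegas, NV → San Diego, CA</th></tr>
<tr><th colspan="2">SCV</th><th colspan="2">CBSC</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>5</td><td>companies</td><td>5</td><td>companies</td></tr>
<tr><td>3</td><td>technology</td><td>1</td><td>services</td></tr>
<tr><td>2</td><td>development</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>slot</td><td>1</td><td>industry</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>housing</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>home</td></tr>
<tr><td>1</td><td>part</td><td>1</td><td>development</td></tr>
<tr><td>1</td><td>manufacture</td><td>1</td><td>construction</td></tr>
<tr><td>1</td><td>industry</td><td></td><td></td></tr>
<tr><td>1</td><td>index</td><td></td><td></td></tr>
<tr><td>1</td><td>housing</td><td></td><td></td></tr>
<tr><td>1</td><td>home</td><td></td><td></td></tr>
<tr><td>1</td><td>growth</td><td></td><td></td></tr>
<tr><td>1</td><td>building</td><td></td><td></td></tr>
<tr><td>1</td><td>acres</td><td></td><td></td></tr>
</table>
<p>Table 6-9: Intersection Terms of Las Vegas, NV vs. Houston, TX</p>
<table>
<tr><th colspan="4">Las Vegas, NV → Houston, TX</th></tr>
<tr><th colspan="2">SCV</th><th colspan="2">CBSC</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>2</td><td>development</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>technology</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>switch</td><td>1</td><td>information</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>industry</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>building</td></tr>
<tr><td>1</td><td>post</td><td></td><td></td></tr>
<tr><td>1</td><td>part</td><td></td><td></td></tr>
<tr><td>1</td><td>manufacture</td><td></td><td></td></tr>
<tr><td>1</td><td>information</td><td></td><td></td></tr>
<tr><td>1</td><td>industry</td><td></td><td></td></tr>
<tr><td>1</td><td>growth</td><td></td><td></td></tr>
<tr><td>1</td><td>building</td><td></td><td></td></tr>
<tr><td>1</td><td>authority</td><td></td><td></td></tr>
<tr><td>1</td><td>agency</td><td></td><td></td></tr>
<tr><td>1</td><td>acres</td><td></td><td></td></tr>
</table>
<p>6.2.2 Discussion of Approaches to Semantic Disambiguation </p><p>The experiment in this section explores a semantic measure that is developed to disambiguate term senses. When the cut-off score is set to 0.9, the chosen measure, CBSC, can successfully eliminate the incorrect synonym matches found in the previous section. 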
However, at this high cut-off score it loses the power to discover terms that share a meaning, and effectively reduces to a measure that counts identical terms. Lowering the cut-off score will introduce incorrect matches back into the intersection set, repeating the shortfall of SCV. Although CBSC with a high cut-off score reduces to a measure that counts identical terms, as opposed to synonyms, it has as much power to separate cities as SCV does. Figure 6-6 shows that CBSC and SCV produce about equal scores in most cases except for those with significant numbers of incorrect synonym matches. The next section will demonstrate that the scores of CBSC and SCV are statistically correlated at a significance level of 0.01. With the augmented vocabulary, the CBSC results are in fact better than those of SCV; CBSC demotes San Diego and Houston, which are plagued with incorrect synonym matches, while maintaining the ranks of Reno and the rest of the ten cities, which have only a few incorrect synonym matches. </p><p>One can deduce from the results that a measure that cannot match synonyms in their correct senses will perform worse than or, at best, only as well as a measure that can only match identical terms. This is not to say that corpus-based measures do not have any merits; the chosen measure just does not produce more correct synonym matches than incorrect ones. The author still believes that one can use a corpus to train a measure to understand the sense of a term and greatly improve the resulting similarity scores. </p><p>Lastly, CBSC depends heavily on a comprehensive vocabulary to perform well. To speak more broadly, any measure that analyzes only term frequencies and not synonyms needs a comprehensive vocabulary. The vocabulary used must contain at least the most common terms of the particular theme in which an analyst wishes to discover similarity between cities. The experiment in this section demonstrates that adding merely two terms related to gambling almost doubles the score of Reno. 
</p><p>6.3 Measure Comparison Using Statistical Tests </p><p>This section employs the Spearman’s rank correlation coefficient test explicated in Section 3.3.2 to compare all numeric and semantic similarity scores between the ten selected cities generated in the previous two sections. Note that only subsets of the generated scores were analyzed in those sections. The multidimensional Euclidean scores with respect to PC I and PC II together, and with respect to all the chosen PCs, are considered as well for a fuller picture. On the semantic side, the statistical test considers the BSC scores, the SCV scores, the CBSC scores using the original vocabulary (CBSC1) and the CBSC scores using the augmented vocabulary (CBSC2). Note that the BSC scores have been ignored earlier since the findings in Chapter 4 strongly suggest the use of a vocabulary. The BSC scores of the ten cities are included here in order to confirm the soundness of the earlier judgment to neglect them. </p><p>The test is performed with SPSS for Windows Version 11 (SPSS Inc. 2001). Table 6-10 displays the resulting Spearman correlation coefficients between the scores of the four stated Euclidean similarity measures and the four stated semantic measures. The dark blue indicates correlations that are statistically significant at the 0.01 level; the light blue indicates correlations that are statistically significant at the 0.05 level. Being statistically significant at the 0.01 level means that, if the null hypothesis were true, the chance of observing a correlation at least as strong as the one found would be less than 1 percent. In this case, the null hypothesis is ρ=0, i.e., no correlation between two measures. Table G-1 gives the raw significance values and the number of scores included in the calculation. </p><p>One can see that the correlations between the Euclidean measures are statistically significant at the 0.01 level. Some correlations are high, but not close to 1. 
The scores along the PC I axis and along the PC II axis separately are highly correlated with the scores along the two axes combined (PC I & PC II). The PC I scores and the two-axes-combined scores are highly correlated with the all-chosen-PCs scores (PCs). Such results are anticipated since one Euclidean measure is part of another Euclidean measure. For example, all the chosen PCs include PC I and PC II. Furthermore, the fact that the correlation between the PC I scores and the all-chosen-PCs scores is higher than the correlation between the PC II scores and the all-chosen-PCs scores agrees with the principal component analysis, in which PC I accounts for most of the variance in the dataset. </p><p>Table 6-10: Spearman Correlation Coefficients</p>
<table>
<tr><th></th><th>PC I</th><th>PC II</th><th>PC I & II</th><th>PCs<sup>14</sup></th><th>BSC</th><th>SCV</th><th>CBSC1<sup>15</sup></th><th>CBSC2<sup>16</sup></th></tr>
<tr><th>PC I</th><td>1.000</td><td>0.354</td><td>0.783</td><td>0.711</td><td>0.229</td><td>0.037</td><td>-0.002</td><td>0.073</td></tr>
<tr><th>PC II</th><td></td><td>1.000</td><td>0.777</td><td>0.485</td><td>0.126</td><td>0.118</td><td>0.176</td><td>0.171</td></tr>
<tr><th>PC I & II</th><td></td><td></td><td>1.000</td><td>0.754</td><td>0.244</td><td>0.104</td><td>0.063</td><td>0.164</td></tr>
<tr><th>PCs</th><td></td><td></td><td></td><td>1.000</td><td>0.275</td><td>0.043</td><td>0.070</td><td>0.123</td></tr>
<tr><th>BSC</th><td></td><td></td><td></td><td></td><td>1.000</td><td>0.691</td><td>0.607</td><td>0.744</td></tr>
<tr><th>SCV</th><td></td><td></td><td></td><td></td><td></td><td>1.000</td><td>0.788</td><td>0.862</td></tr>
<tr><th>CBSC1</th><td></td><td></td><td></td><td></td><td></td><td></td><td>1.000</td><td>0.917</td></tr>
<tr><th>CBSC2</th><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>1.000</td></tr>
</table>
<p><sup>14</sup> All chosen PCs: PC I, PC II, PC III, PC IV, PC VI and PC VII. <sup>15</sup> CBSC using the original vocabulary. <sup>16</sup> CBSC using the augmented vocabulary. </p><p>Turning to the semantic measures, one can see that the correlations among them are all high (> 0.6) and statistically significant at the 0.01 level. Note that the visual comparison in Section 6.2.1 also shows strong similarity between SCV and CBSC. This can be easily explained by the fact that one measure is a modified version of the other, and therefore they share many underlying mechanics: the feature-set contrast model, the WordNet lexicon database, the vocabulary and the stop words list. The correlations between BSC and SCV or between BSC and CBSC are lower than those between SCV and CBSC, indicating that the ranks generated by BSC are less similar to the SCV ranks than the CBSC ranks are, and vice versa. This means that if SCV is the measure of choice, one would pick CBSC over BSC. One would also pick SCV over BSC if CBSC is the measure of choice. Since Chapter 4 demonstrates that a vocabulary can significantly improve the accuracy of a measure, it is reasonable to assume that the accuracy of the ranks generated by BSC is inferior to the accuracy of those generated by SCV or CBSC. Hence, the decision to drop BSC from the analysis earlier in this chapter seems justified. </p><p>The correlations are particularly high, close to 1, for SCV and CBSC1, SCV and CBSC2, and CBSC1 and CBSC2. This is expected, as Section 6.2.1 shows that the ranking by SCV is almost identical to the ranking by CBSC, except for a few cases with an above-average number of incorrect synonym matches and a few cases with high-IC terms that are not in the original vocabulary. For the same reason, the correlation between SCV and CBSC2 is higher than that between SCV and CBSC1; CBSC2 uses the augmented vocabulary, and therefore reduces the number of discrepancies of the latter type. </p><p>Since the correlations between SCV, CBSC1 and CBSC2 are close to 1, it is worth testing their means for equality with the Kruskal-Wallis analysis in SPSS for Windows Version 11 (SPSS Inc. 2001). Table 6-11 shows the resulting two-tailed significance values of the tests. All the significance values are higher than 0.05, which is the most common significance level for rejecting a null hypothesis. 
Therefore, the differences between the means of these measures are not statistically significant. The results support the conclusion in the previous paragraph: SCV and CBSC produce similar scores except for the noted cases. </p><p>Table 6-11: Two-Tailed Significance Values of Kruskal-Wallis Test </p><p>
Measure | SCV | CBSC1 (17) | CBSC2 (18)
SCV     |     | 0.535      | 0.079
CBSC1   |     |            | 0.311
CBSC2   |     |            |
</p><p>17 CBSC using the original vocabulary. 18 CBSC using the augmented vocabulary. </p><p>Lastly, Spearman’s correlation test demonstrates that the numeric measures are not correlated with the semantic measures. Any statistically significant correlations are low (< 0.3) and are limited to BSC. SCV and CBSC are orthogonal to all numeric measures. For these pairs the test fails to reject the null hypothesis of zero correlation, and therefore the correlations between SCV and CBSC and the numeric measures cannot be distinguished from 0. The test confirms the findings in the previous two sections that SCV and CBSC produce different rankings from the numeric measures in most cases. </p><p>6.3.1 Discussion of Statistical Tests </p><p>Although this work deals with only a small set of ten cities, it is not feasible for one person to manually explore each individual score (100 scores for each measure) and manually compare one set of city ranks generated by one measure with another set by another measure (ten sets for each measure totaling 100 possible comparison pairs). In Section 6.1.1, only four sets of SCV city ranks are explored thoroughly in comparison with the four corresponding sets of city ranks generated by the Euclidean measure with respect to PC I and the four sets with respect to PC II. The statistical tests performed in this section provide a credible, practical way of accounting for all scores in comparing the proposed measures. 
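As an illustration of the rank correlation used in this section, the following minimal Python sketch computes Spearman's coefficient for two sets of similarity scores. It is not the SPSS procedure used in this work; the function names and the example scores are hypothetical, ties are given average ranks, and constant inputs (zero rank variance) are not handled.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores of five cities under two measures (not thesis data).
euclidean_scores = [0.91, 0.40, 0.75, 0.62, 0.10]
semantic_scores = [0.30, 0.35, 0.20, 0.50, 0.45]
rho = spearman(euclidean_scores, semantic_scores)
```

A coefficient near 0 would correspond to the situation reported above for SCV and CBSC against the numeric measures; a significance test on the coefficient is a separate step that this sketch does not perform.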
The statistical results support the findings of the last two sections: (1) the proposed semantic measures give analysts interesting new perspectives that the Euclidean measure cannot provide about similarity between cities of interest; (2) SCV and CBSC produce similar city ranks; and (3) BSC is not a good measure. </p><p>6.4 Conclusions </p><p>The chapter describes the experiments with two semantic similarity measures: SCV and CBSC. The experiments in Sections 6.1 and 6.2 contrast the resulting semantic similarity ranks with those of the Euclidean similarity measure and common geographic knowledge. The results illustrate that the semantic measures rank cities differently from the Euclidean measure, indicating that they can suggest new insights that traditional numeric analysis cannot. The statistical analysis in Section 6.3 confirms this finding. The detailed analysis of the semantic scores demonstrates that the semantic measures adequately capture the essence of the Wikipedia entries and produce a reasonable ranking. Furthermore, given the chosen ten cities’ semantic themes suggested in Chapter 5 as what an analyst knows, the semantic measures suggest valid relationships between cities that the analyst does not anticipate and that can lead to interesting conclusions. For instance, Boston, a medium-size university/technology city, is quite analogous to Los Angeles, a metropolis. Basing judgment only on absolute socioeconomic data (PC I), an analyst would not find Boston and Los Angeles to be analogous. San Diego, an urbanized beach/university/technology city on the West Coast, has more in common with Las Vegas, the American playground, than most people would think. The proposed measures, however, still have many flaws, including narrow score ranges, incorrect synonym matches, and incomplete vocabulary and stop-words lists. The problem of narrow score ranges in fact stems from other problems. 
The measures retain many terms that should be added to the stop-words list, and do not detect some terms related to key characteristics, such as tourism and goods transportation. Moreover, some of the terms are more descriptive than the others, such as ‘technology’, ‘health’, ‘insurance’ and ‘financial’. They signify exactly the type of business, industry and service. These key terms are referred to as ‘high-IC terms’ in this chapter. If the measures weighed high-IC terms more heavily than the others, the range of scores would increase. By contrast, some terms should receive less weight since they appear in most descriptions, such as ‘industry’, ‘company’, ‘business’ and ‘service’. Note that these terms are different from stop words. They do provide important information about cities and do not commonly occur in English text. Their frequencies can indicate the importance of a feature of a place. For example, a city whose description mentions ‘business’ five times has more of a business-centric economy than a city whose description mentions it once. The last problem is incorrect synonym matching, from which SCV suffers greatly. SCV cannot identify the sense of a term and matches two terms on any senses that exist in WordNet. As a result, terms are considered synonyms in senses that differ from the senses in which they are used in the Wikipedia entries. CBSC is tried in Section 6.2 to cope with this problem. However, CBSC identifies as many incorrect synonyms as correct ones at a low cut-off score and reduces to a measure that matches terms with identical spellings at the selected cut-off score of 0.9. Nevertheless, the experiment with CBSC demonstrates that a measure based purely on term frequencies can perform as well as, or better than, a measure based on synonym frequencies obtained in the same manner as SCV. Such a simple measure is also computationally faster. 
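The weighting idea discussed above can be sketched as follows. This is a hypothetical illustration and not the SCV or CBSC implementation: it uses a weighted Jaccard-style ratio over term counts, and the weights, the term lists and the function name `weighted_overlap` are assumptions chosen for the example. A shared high-IC term such as ‘insurance’ raises the score more than a shared generic term such as ‘company’.

```python
from collections import Counter


def weighted_overlap(terms_a, terms_b, weights, default_weight=1.0):
    """Weighted ratio of shared term occurrences to all term occurrences.

    terms_a, terms_b: lists of terms kept after stop-word removal.
    weights: term -> weight (assumed higher for more informative terms).
    """
    count_a, count_b = Counter(terms_a), Counter(terms_b)
    shared = 0.0
    total = 0.0
    for term in set(count_a) | set(count_b):
        w = weights.get(term, default_weight)
        shared += w * min(count_a[term], count_b[term])
        total += w * max(count_a[term], count_b[term])
    return shared / total if total else 0.0


# Hypothetical weights: informative terms up-weighted, generic ones down-weighted.
weights = {"insurance": 3.0, "technology": 3.0, "company": 0.5, "business": 0.5}

city_a = ["insurance", "company", "business", "business"]
city_b = ["insurance", "company", "technology"]
city_c = ["company", "business", "business", "technology"]

score_ab = weighted_overlap(city_a, city_b, weights)  # share 'insurance'
score_ac = weighted_overlap(city_a, city_c, weights)  # share only generic terms

# With equal weights the ordering reverses, because A and C share more
# term occurrences even though those occurrences are generic.
plain_ab = weighted_overlap(city_a, city_b, {})
plain_ac = weighted_overlap(city_a, city_c, {})
```

In this toy case the weighted scores rank city B above city C as a match for city A, while the unweighted scores do the opposite, which is the kind of separation that weighting high-IC terms is meant to produce.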
It is also worth mentioning that cities can be similar in various aspects. City A and City B have X in common, while City A and City C have Y in common. The feature-set based similarity score of City A and City B will be approximately equal to that of City A and City C because each pair shares one common feature. A similar situation is observed in the SCV experiment. Las Vegas and Reno have a gambling industry in common, while Las Vegas and San Diego have a communication technology industry in common. As a result, their similarity scores turn out to be about equal. One way to tease these cities apart is to limit the vocabulary to a small set of terms related to a particular theme, such as education, tourism, trade and petroleum. This will cause a vocabulary-based semantic measure to consider only one theme of interest. Despite the stated shortfalls, it has been shown in this chapter that the proposed semantic measures can produce reasonable ranks of cities in comparison to a selected city. The top matches on average share more characteristics with their associated cities than the bottom matches. Semantic analysis can provide insights that cannot be found with numeric methods. Analysts can begin to employ the proposed measures to automatically investigate online and offline textual documents, and thus save a great deal of valuable time in comparison to manual text analysis, which is known to be labor intensive (Roberts 1997).</p><p>Chapter 7 </p><p>Summary and Conclusions </p><p>7.1 Revisiting the Research Goal </p><p>The dissertation initially set out to find climate analogs and quickly turned into a quest to find analogs for geographical places in general. The review of literature discovered that many scientific fields apply some form of place analogs, all of which are of interest to geographers. 
Natural scientists use an easily accessible study area as a model for another area of interest; economists transfer findings from one case study to another; historians reconstruct past societies from knowledge of familiar ones; public policy planners take lessons learned from other places and apply them to their planning areas. The list can go on and on, demonstrating the importance and power of analogical thinking. Current and emerging technologies, such as electronic document collections, the Internet, and the Semantic Web, make it possible for people and organizations to store millions of books and articles, share them with the world, or even author some themselves. The amount of electronic and online content is expanding at an exponential rate. It is only a matter of time before we can Google and Wiki information on everything. Books, originally available only as hard copies, are being converted to electronic formats. Trips to libraries become less and less necessary. As a result, we can conveniently obtain information about things via the Internet with our personal computers, but we are increasingly overwhelmed by the sheer volume of accessible information. Today’s search engines (e.g., Google, Yahoo and MSN) help us locate electronic documents about places, but do not help us understand them. The goal of this dissertation is to fix this shortfall. Given incomplete ontologies of places, the dissertation explored and developed a novel approach to compare places using both numeric and semantic information sources. Ontologies herein refer to explicit, human- and machine-readable descriptions of places, which have numerous characteristics and relations with each other. Which characteristics are considered in a comparison depends on the objective. Towards this goal, in Chapter 4 the dissertation chose U.S. cities as the test case and obtained their numeric characteristics from the County and City Data Book (U.S. Census Bureau 2005). 
The semantic information was acquired from Wikipedia <www.wikipedia.org>. Wikipedia entries are in HyperText Markup Language (HTML), which was designed for human comprehension as explained in Section 3.1.1. A computer algorithm that compares places cannot use such Wikipedia entries. A way to translate them into Web Ontology Language (OWL) ontologies was developed as part of this research (see Chapter 4). With the semantic information in a machine-readable format, the dissertation experimented with different semantic similarity measures, and evaluated them against a Euclidean numeric measure. The results created a comprehensive methodology for comparing places, accounting for aspects of places that can and cannot be represented numerically. The methodology lets analysts discover geographical analogs beyond the possibility of strictly numeric or semantic methodologies used in isolation. The ensuing section summarizes how the dissertation achieves its goal. </p><p>7.2 The Novel Hybrid Methodology </p><p>Up until now, geographers, scientists and professionals have developed place analogs either by numeric computation methods or by manual content analysis of non-numeric information. The dissertation explores means of automating content analysis and combining the two schools of methods. The problem of place-analog search can be regarded as an application of entity similarity measurement. For non-numeric information, one performs content analysis of place descriptions and judges the similarity between them; the dissertation limits its scope to textual descriptions. The literature review (see Section 3.2.1) unveils several measures that have been developed to automate textual document comparison by computer and artificial intelligence scientists and cognitive psychologists. The dissertation originally experiments with one such measure — Mitra and Wiederhold’s algorithm (see Section 3.2.1.1.1). 
The results suggest many flaws in the algorithm, which later experiments try to overcome. Finally, the dissertation arrives at the two best combinations of multiple measures, i.e., SCV (see Section 3.2.1.2.2) and CBSC (see Section 3.2.1.3.1). The two measures, albeit not perfect, can produce reasonable ranks of cities and are ready for real-world applications. The lessons learned from the experiments can be extended to semantic measures in general. The following are usage considerations for semantic measures: </p><p>1. Statistical Description. Analysts must ensure that place descriptions do not depict places by their statistics unless the semantic measure can understand numbers (see Section 4.1). None of the semantic measures considered understands numbers in text. So if the key characteristics of a place are given by numbers within text (e.g., population density, number of persons per household and average summer temperature), most semantic measures probably cannot detect them. For a semantic measure to work, the descriptions must state these characteristics with concepts (e.g., densely populated, extended households and hot summer climate). 
</p><p>2. Template Description. Faulty high scores can occur when multiple place descriptions follow the same template, since most of the words in such descriptions are identical (see Section 4.1). If the semantic measure computes similarity based on term frequencies, analysts must avoid this kind of description, or identify the template and add the common words in the template to the stop-words list. 
</p><p>3. Complete Stop-Words List. Many words in a sentence are there for grammatical reasons and do not depict characteristics of the place (see Sections 4.2, 6.1.1 and 6.2.1). The more such words are included in the stop-words list, the more accurate the similarity scores become. No single stop-words list is suitable for all applications; analysts need to investigate the terms their semantic measures compute on and decide whether they should be stop words or not. 
</p><p>4. Complete Vocabulary. If the semantic measure considers only words in a vocabulary, analysts must include words related to aspects of interest in the vocabulary; otherwise, the semantic measure will not consider those aspects (see Section 6.2.1). In particular, a measure that cannot automatically extend the vocabulary (e.g., CBSC) needs a larger vocabulary. It is always a good idea to inspect and, if necessary, complete the vocabulary manually. Automatic extension can be problematic in two ways: (1) adding synonyms in wrong senses; and (2) not knowing related words that are not synonyms. 
</p><p>5. Succinct Vocabulary. Sometimes having a large vocabulary that covers numerous topics can create a problem (see Section 6.1.1). Places can have many aspects in common, some of which may not be of interest. Analysts may also be interested in one or two topics in particular. In order to find only the places analogous in aspects of interest, analysts must remove words related to the other aspects from the vocabulary. As an example, all cities have a “city hall.” 
</p><p>6. Synonym Matching. Matching synonyms, instead of identical words, can do more harm than good in cases where the semantic measures cannot identify the correct senses of words (see Section 6.1.1). If possible, analysts should compare the similarity scores with and without synonym matching when using such measures. If they produce different ranks, examining the differences should prove fruitful. </p><p>Of all the considerations, having a complete, succinct vocabulary and a complete stop-words list are the most important. When comparing descriptions from the same source, one can expect comparable word choices; synonym matching is not necessary in this case. 
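Considerations 3 to 5 amount to a filtering step applied before any similarity score is computed. A minimal sketch follows, assuming a simple regular-expression tokenizer and made-up word lists; neither is the tokenizer nor the lists used in the experiments, and the function name `extract_terms` is hypothetical.

```python
import re


def extract_terms(description, stop_words, vocabulary=None):
    """Return the terms of a description that a measure would compute on.

    The tokenizer keeps alphabetic tokens only, so numbers in the text are
    dropped. Stop words are removed next. If a vocabulary is given, only
    terms that belong to it are kept.
    """
    tokens = re.findall(r"[a-z]+", description.lower())
    terms = [t for t in tokens if t not in stop_words]
    if vocabulary is not None:
        terms = [t for t in terms if t in vocabulary]
    return terms


# Hypothetical word lists for illustration only.
stop_words = {"the", "is", "a", "of", "and", "to", "home", "has"}
education_vocabulary = {"university", "college", "research", "campus"}

text = "The city is home to a research university and has a population of 50,000."

all_terms = extract_terms(text, stop_words)
themed_terms = extract_terms(text, stop_words, education_vocabulary)
```

Here `all_terms` still contains generic terms such as ‘city’ and ‘population’, while `themed_terms` keeps only the education-related terms, which mirrors the effect of a succinct, theme-specific vocabulary.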
Good vocabularies and stop-words lists can also alleviate the problems with statistical and template descriptions by ignoring the template words and the numbers. Section 7.3 will discuss directions for future research that takes into account the stated usage considerations. The conducted experiments illustrate that one can construct a vocabulary, namely a simple ontology, from a list of categories and, likely, from any text (see Sections 4.2 and 6.1). Unlike a full-blown ontology, the constructed vocabularies are merely collections of concepts. They do not state relationships between concepts. Such relationships can be exploited by algorithms to infer the information content (IC) of concepts as expounded in Section 3.2.1.5. The experiments in Chapter 6 suggest that semantic measures that account for the IC of concepts can produce more accurate ranks than the semantic measures tested. Nevertheless, the same experiments demonstrate that semantic measures with vocabularies (e.g., the enhanced version of Mitra and Wiederhold’s algorithm, SCV and CBSC) can produce reasonable ranks of cities. The ranks will also be significantly better than those generated by measures without vocabularies (e.g., Mitra and Wiederhold’s algorithm and BSC). Furthermore, the experiments emphasize the importance of an adequate vocabulary. In order to rank cities with respect to an aspect, the vocabulary must contain terms related to that aspect. Especially in the case of semantic measures, such as CBSC, which do not automatically extend vocabularies, an analyst must exercise extreme care with the terms in the vocabularies. Numeric computation methods for place-analog search are well developed, as suggested by the literature review. The dissertation adopts the principal component analysis (PCA)/k-means cluster analysis (KCA) method recommended by Rogerson (2001). PCA assists analysts in objectively reducing a large number of variables into a few orthogonal dimensions. 
The analysis process is quick and does not require familiarity with the dataset. KCA performs significantly faster than other clustering algorithms, and therefore is suitable for searching for place analogs from a set of possibly thousands or millions of places. With only a few dimensions and clusters of places, analysts can clearly see how places are similar and along what dimension(s). By contrasting the numeric and semantic similarity scores with a scatter plot, as done for the experimental results and demonstrated here in Figure 7-1, analysts can easily compare places in both spaces. The diagonal line helps analysts quickly identify equality of the two measures. In this example, most data points fall below the line, indicating that the Euclidean scores are higher than the CBSC scores, and therefore one can conclude that the places are more analogous in numeric space. The opposite can be true in other cases. The plot lets analysts conveniently rank the places based on each measure. The example illustrates that the two sets of ranks are completely different. The two statistical tests employed in Section 6.3 can help analysts confirm their visual observations. The small table within the plot displays the significances of the tests of the example scores. The significance (0.981 >> 0.05) of the Spearman rank correlation test indicates that the Euclidean scores are not significantly correlated with the semantic scores at the significance level of 0.05. The significance (< 0.01) of the Kruskal-Wallis test indicates that the Euclidean mean also differs statistically from the semantic mean at the significance level of 0.01. </p><p>The example analysis in the previous paragraph epitomizes the experimental results of Chapter 6. The results reveal that the chosen semantic measures can provide analysts with valuable information that the traditional Euclidean measure cannot. 
The Euclidean measure can tell analysts the ranks of the cities by their absolute and percentage socioeconomic sizes, but it cannot rank the cities by some thematic aspects. The semantic measures can, for example, rank the cities by the importance of universities and technology companies to the cities’ economies. </p><p>[Scatter plot with inset table of test significances. H0: µ1 = µ2, Sig. < 0.01; H0: ρ = 0, Sig. 0.981] </p><p>Figure 7-1: Similarity comparison of Las Vegas, NV and the ten cities based on Euclidean similarity with respect to PC I and CBSC with the augmented vocabulary. </p><p>Although it is not demonstrated by the experiments, the semantic measures can correlate with the numeric Euclidean measure. In such a case, they support and potentially explain the numeric similarity. A hypothetical case is when the Euclidean similarity scores with respect to PC I are highly correlated with the CBSC scores, which are based on a vocabulary limited to words related to higher education. Analysts can hypothesize that higher education plays a major role in the economies of the cities. They will certainly need to do further research to prove this hypothesis. Nevertheless, CBSC suggests a good starting point for further analysis. Neither the correlation test nor the mean test can tell that one measure is better than the other. The correct ranks of the cities with respect to the Wikipedia entries have not been established, and therefore one cannot test whether the ranks generated by the selected semantic measures are close to the correct ones. Establishing the correct ranks is not a simple task; it involves manual content analysis of the entries. Different people will also interpret the entries differently (Roberts 1997). Here, the author interprets the entries himself and evaluates the correctness of some of the ranks, which leads to identification of flaws in the semantic measures. Manual analysis is very time consuming. 
Future work should try to establish a test set of cities and their ranks based on their economies, climates, cultures and so forth. Such a set will greatly speed up evaluation of other semantic measures. In large-scale analog searching scenarios, namely, most real-world cases, the chosen semantic measures are too computationally expensive. They will not finish computing in a reasonable amount of time. The Euclidean measure can come in handy in such cases since it is computationally cheap. In one experiment, it was able to compare 1000 cities with over 250 variables in a few seconds. Hence, analysts can use the Euclidean measure to narrow down the search before applying a semantic measure. SCV and CBSC take about 30 seconds to compare a pair of cities. A narrowed set of ten cities will take roughly one hour. Other fast numeric methods will work as well. KCA finishes clustering 490 cities in less than a second. Analysts can pick a few cities within a cluster for further semantic analysis. A keyword-based search engine, such as Google or Lucene (Kagathara et al. 2005), can also be used to narrow down the semantic search space. The experimental results show that a few keywords related to certain aspects of interest can identify cities with those aspects well. Analysts can input such keywords to a keyword-based search engine, and create a small set of top-ranked cities, to which they can apply a semantic measure. Lastly, it should be noted that place-analog search is just one possible domain of the developed hybrid methodology. Analysts may apply it to any search, such as analogous books, analogous climates, analogous medical treatments or illnesses and analogous engineering designs. The novel methodology also provides a basis for many more hybrid methodologies. Analysts can improve on the selected semantic measures or substitute them with more accurate, robust ones. 
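The narrowing strategy described in this section can be sketched as a two-stage search. The sketch assumes that each city is represented by a tuple of numeric scores (for example, principal component scores) and that `semantic_similarity` stands for whatever expensive measure the analyst has chosen. The function names, the toy semantic measure and the data are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_stage_search(target, numeric, semantic_similarity, keep=3):
    """Rank candidate cities against `target`.

    Stage 1: keep the `keep` cities closest to the target in numeric space.
    Stage 2: apply the (expensive) semantic measure only to that shortlist.
    Returns (city, semantic score) pairs, most similar first.
    """
    others = [city for city in numeric if city != target]
    shortlist = sorted(
        others,
        key=lambda city: euclidean_distance(numeric[target], numeric[city]),
    )[:keep]
    scored = [(city, semantic_similarity(target, city)) for city in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical numeric scores and term sets for four cities.
numeric = {"A": (0.0, 0.0), "B": (0.1, 0.2), "C": (0.3, 0.1), "D": (5.0, 5.0)}
terms = {
    "A": {"university", "port"},
    "B": {"casino"},
    "C": {"university"},
    "D": {"university", "port"},
}


def toy_semantic(a, b):
    """Stand-in for a semantic measure: Jaccard overlap of term sets."""
    return len(terms[a] & terms[b]) / len(terms[a] | terms[b])


result = two_stage_search("A", numeric, toy_semantic, keep=2)
```

In this toy data, city D would match city A perfectly on the semantic stand-in but never reaches the second stage because it is numerically distant. The prefilter therefore trades completeness for speed, which analysts should keep in mind when choosing how many cities to keep.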
</p><p>7.3 Future Research </p><p>The experimental results show many flaws in the developed hybrid methodology which require additional research. Researchers have worked with numeric methods for a long time. Future research should experiment with other more advanced numeric similarity measures, such as variants of the weighted Euclidean similarity measure (D'Agostino 1998, Blundell 2002, Pal 2004) and the self-organizing map (SOM; Vesanto and Alhoniemi 2000). An unequal-weight Euclidean similarity metric allows analysts to put different emphasis on each variable. The SOM produces an effective visualization of high-dimensional data in a low-dimensional space, offering researchers an overview map for exploring regions of the data points. The proposed semantic measures do not compute or make use of the IC of terms. High-IC terms should contribute more to the similarity between places. Researchers can try to determine the IC of terms from the numbers of their hyponyms and hypernyms in WordNet (Seco and T. Hayes 2004). The smaller the ratio of hypernyms to hyponyms, the lower the IC of a term is. Other methods for determining IC include term-frequency-based methods (Resnik 1995) and hybrid lexicon/term-frequency-based methods (Jiang and Conrath 1997). Another major problem is determination of the intended senses of words. A word can have different meanings in two articles or even two sentences (particularly so in English). Any synonym-count-based measure would benefit from correct identification of word senses. Various methods from natural language processing (NLP) can identify senses of words, e.g., hierarchical decision lists (Yarowsky 2000), connectionist (Palmer-Brown et al. 2002), memory-based (Veenstra et al. 2000) and abstraction-based (Strapparava et al. 2004). Besides word senses, where in an article the words appear is important as well. 
Two words appearing in the same sentence or adjacent to each other convey a different meaning than when the words are far apart. Word proximity can be accounted for by information sciences techniques such as the phrase-based Web document similarity measure developed by Hammouda and Kamel (2004). Moreover, negative assertions can greatly affect the accuracy of similarity measures. The semantic measures explored in this work would falsely identify that two cities are similar if the description of one city states “The city economy depends on universities…”, and the description of the other city states “The city economy does not depend on universities in a nearby town.” Previous work that can potentially lead to solutions for this shortfall includes McCarty and Law (1989), Esch and Levinson (1995), Repp (2006) and Huang and Lowe (2007). It is worth noting that the chosen semantic measures ignore the numbers in text. Automatic extraction of the senses of numbers in text documents seems to lack research interest. The only available method is limited to extracting geographical coordinates (Woodruff and Plaunt 1994). In order to compute on numbers, semantic measures need to recognize the concepts to which they apply. For example, they need to know that a number is the number of city population or the number of food and hospitality establishments. Until number-sense extraction methods are available, researchers should ignore the numbers in documents and use statistical data instead since inclusion would lead to erroneous similarity scores. For instance, an algorithm might mistakenly compare a population number with a number of food establishments. The number of words in a description also contains information about places. Researchers should experiment with the second and third terms in the denominator of Eq. 3-6 to account for differences in description lengths. The equation reduces the similarity score as a function of the difference. 
Therefore, two cities with a big set of intersection terms can still be less similar than two cities with a smaller set of intersection terms if the descriptions of those with a smaller set have about the same length and the others do not. The ontology extraction process described in Chapter 4 produces simple ontologies, which have only one kind of relation: description. Richer ontologies can offer more complete information. For example, a description can state how one city is related to another city of interest, what role the city plays in the regional economy and in what part of the county the city is located. Research should experiment with methods that can extract such relationships (e.g., Artequakt (Alani et al. 2003), OntoLT (Buitelaar et al. 2003) and RelExt (Schutz and Buitelaar 2005)) and measures that can compute on them (e.g., Falkenhainer and Forbus (1989), Maedche and Staab (2002), Gurevych (2005) and Maguitman et al. (2006)). As discussed in Section 3.4, similarity scores contain uncertainty, which should be quantified, but is outside the scope of this work. Future work can explore the possibility of replacing the suggested crisp OWL data model with fuzzy, probabilistic OWL data models such as Stoilos et al. (2005) and Ding et al. (2006). Many existing fuzzy similarity measures (Fankhauser et al. 1991, Chao et al. 1996, Plaza et al. 1998, Cross 2004, Popescu et al. 2006) are also worth further investigation. Additionally, researchers should keep in mind various ways of estimating uncertainty in similarity measures when choosing their datasets since the characteristics (e.g., data collection, sample aggregation and metadata availability) of a dataset can determine the appropriate uncertainty estimation method as illustrated in Section 3.4. Lastly, all of the above suggestions focus on improving the developed novel methodology. A cornucopia of practical applications of the methodology is worth exploring as well. 
For instance, suggestion systems for travel Web sites (e.g., WikiTravel <wikitravel.org> and the Imtrav Wiki <www.wanderwiki.com>) can present the users with a list of destinations analogous to the users’ chosen ones as alternatives in case the chosen ones are expensive or fully booked. Similarly, digital databases for academic curriculums such as Wikiversity <wikiversity.org> can employ the methodology developed here to help the users find related resources. After the users create entries for their own learning projects, they are given a list of similar projects, from which they can potentially discover useful learning resources and exercises that they have not thought of. Since similarity measures are central to classification and clustering analysis as explained in Section 2.2.2 and Section 3.2.2.2, respectively, one can employ the novel methodology in various classification and clustering methods. Unlike previous works that classify or cluster places (e.g., economy (Nelson 1955), market (Green 1967), biome (Prentice 1992), climate (Fovell and Fovell 1993), ecosystem (Dolan and Parker 2005) and land use/land cover (Platt and Rapoza 2008)) strictly based on their numeric characteristics, researchers and practitioners can automatically reanalyze these places and others based on both numeric and qualitative information with the novel methodology. A number of contemporary applications of analogs outlined in Section 2.3 can benefit from the methodology developed. Climate change researchers can use a Euclidean distance to locate all places of which the climates are analogous to a predicted climatic/socioeconomic scenario, and then use a semantic similarity measure to narrow down the set based on some textual descriptions for additional in-depth analysis. An economic geographer looking for twin counties for Appalachian Region counties can use the developed methodology to comprehensively search both statistical and textual databases. 
If someone wants to write the second edition of the book Collapse: How Societies Choose to Fail or Succeed (Diamond 2005), they can write the description of a failed society and the description of a successful society, and employ a semantic similarity measure to identify all failed and successful societies in a textual place database. Then with a Euclidean distance, they can classify these societies based on their populations, climates and geographic locations. A consultant who is writing a recommendation on cities that can potentially benefit from the emergency response study of Hurricane Katrina can use the description of New Orleans in the study report to search Wikipedia and some associated statistics to search the United Nations database for candidate cities. This dissertation is offered, then, as the first leg on a continuing journey into the automation of thinking by analogs. The importance of both ontological and numeric dimensions has been emphasized, and it is anticipated that these dimensions will remain salient in future developments. </p><p>Bibliography </p><p>Alani, Harith, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis and Nigel R. Shadbolt. 2003. Automatic ontology-based knowledge extraction from Web documents. IEEE Intelligent Systems 18 (1):14-21. </p><p>Barnes, Jonathan. 1994. ARISTOTLE: Posterior Analytics. Edited by J. L. Ackrill and L. Judson. New York, NY: Oxford University Press. </p><p>Bechhofer, Sean, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah L. McGuinness, Peter F. Patel-Schneider and Lynn Andrea Stein. 2004. OWL Web Ontology Language Reference. World Wide Web Consortium, accessed on 13 January 2008. http://www.w3.org/TR/2004/REC-owl-ref-20040210/ </p><p>Berners-Lee, Tim, R. Fielding and L. Masinter. 1998. 
RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax. Standards Track. The Internet Society. http://www.isi.edu/in-notes/rfc2396.txt </p><p>Black, Max. 1962. Models and Metaphors: Studies in Language and Philosophy. Ithaca, NY: Cornell University Press. </p><p>Black, Duncan and Vernon Henderson. 2003. Urban evolution in the USA. Journal of Economic Geography 3:343-372. </p><p>Blanzieri, Enrico and Luigi Portinale. 2000. Advances in Case-based Reasoning. 5th ed. New York, NY: Springer. </p><p>Blundell, R. Costa-Dias. 2002. Alternative Approaches to Evaluation in Empirical Microeconomics. London, United Kingdom: University College London and Institute for Fiscal Studies. </p><p>Brewer, Cynthia A. and Mark Harrower. 2003. ColorBrewer.org: an online tool for selecting colour schemes for maps. Cartographic Journal 40 (1):27-37. </p><p>Brockmans, Sara, Raphael Volz, Andreas Eberhart and Peter Löffler. 2004. Visual modeling of OWL DL ontologies using UML. Proceedings of the Third International Semantic Web Conference - ISWC 2004, 7-11 November 2004, Hiroshima, Japan. </p><p>Buffa, Michel, Gaël Crova, Fabien Gandon, Claire Lecompte and Jeremy Passeron. 2006. SweetWiki: Semantic Web Enabled Technologies in Wiki. Proceedings of the First Workshop on Semantic Wikis - From Wiki To Semantics [SemWiki2006] at the ESWC 2006, 12 June 2006, Budva, Montenegro. </p><p>Buitelaar, Paul, Daniel Olejnik and Michael Sintek. 2003. OntoLT: a Protégé plug-in for ontology extraction from text. Proceedings of the Second International Semantic Web Conference - ISWC 2003, 20-23 October 2003, Sanibel Island, FL. </p><p>Cañas, Alberto J., Marco Carvalho, Marco Arguedas, David B. Leake, Ana Maguitman and Thomas Reichherzer. 2004. Mining the Web to suggest concepts during concept map construction. Proceedings of the First International Conference on Concept Mapping, Pamplona, Spain. </p><p>Changnon, Stanley A. 1992. 
Inadvertent weather modification in urban areas: lessons for global climate change. Bulletin of the American Meteorological Society 73 (5):619-627. </p><p>Chao, Chun-Tang, Young-Jeng Chen and Ching-Cheng Teng. 1996. Simplification of fuzzy-neural systems using similarity analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26 (2):344-354. </p><p>Chen, Chaomei. 2003. Information visualization versus the Semantic Web. In Visualizing the Semantic Web, edited by V. Geroimenko and C. Chen. London, United Kingdom: Springer-Verlag. </p><p>Chorley, Richard J. 1964. Geography and analogue theory. Annals of the Association of American Geographers 54 (1):127-137. </p><p>Christopherson, Robert W. 2003. Geosystems: An Introduction to Physical Geography. Upper Saddle River, NJ: Pearson Education, Inc. </p><p>Cook, Thomas D. and Donald T. Campbell. 1979. Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin Company. </p><p>Cross, Valerie. 2003. Uncertainty in the automation of ontology matching. Proceedings of the Fourth International Symposium on Uncertainty Modeling and Analysis (ISUMA'03). </p><p>Cross, Valerie. 2004. Fuzzy semantic distance measures between ontological concepts. Proceedings of IEEE Annual Meeting of the Fuzzy Information 2004 (NAFIP'04). </p><p>Cunderlik, Juraj M. and Donald H. Burn. 2006. Switching the pooling similarity distances: Mahalanobis for Euclidean. Water Resources Research 42, W03490, doi:10.1029/2005. </p><p>D'Agostino, Ralph B. 1998. Tutorial in biostatistics: propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine 17:2265-2281. </p><p>Dau, Frithjof. 2006. The role of existential graphs in Peirce's philosophy. Proceedings of the International Conference on Computational Science, 28-31 May 2006, Reading, United Kingdom. </p><p>Davis, John C. 1986. 
Statistics and Data Analysis in Geology. New York, NY: John Wiley & Sons. </p><p>Davis, William Morris. 1899. The geographical cycle. The Geographical Journal 14 (5):481-504. </p><p>Diamond, Jared. 2005. Collapse: How Societies Choose to Fail or Succeed. New York, NY: Penguin Books. </p><p>Dhillon, Inderjit S. and Dharmendra S. Modha. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning 42 (1-2):143-175. </p><p>Dolan, Benjamin J. and George R. Parker. 2005. Ecosystem classification in a flat, highly fragmented region of Indiana, USA. Forest Ecology and Management 219:109-131. </p><p>Ehrig, Marc. 2007. Ontology Alignment: Bridging the Semantic Gap. New York, NY: Springer. </p><p>Falkenhainer, Brian, Kenneth D. Forbus and Dedre Gentner. 1989. The structure-mapping engine: algorithm and examples. Artificial Intelligence 41:1-63. </p><p>Fang, Jin Nü, Chang Ji Jin, Lian Hua Cui, Zhen Yu Quan, BoYoul Choi, Moran Ki and Hung Bae Park. 2001. A comparative study on serologic profiles of virus hepatitis B. World Journal of Gastroenterology 7 (1):107-110. </p><p>Fankhauser, Peter, Martin Kracker and Erich J. Neuhold. 1991. Semantic vs. structural resemblance of classes. ACM SIGMOD Record 20 (4):59-63. </p><p>Farrigan, Tracey L. and Amy K. Glasmeier. 2002. The economic impacts of the prison development boom on persistently poor rural places. State College, PA: The Earth and Mineral Sciences Environmental Institute, Pennsylvania State University. </p><p>Feigenbaum, Lee, Ivan Herman, Tonya Hongsermeier, Eric Neuman and Susie Stephens. 2007. Semantic Web in action. Scientific American December 2007:90-97. </p><p>Fovell, Robert G. and Mei-Ying Fovell. 1993. Climate zones of the conterminous United States defined using cluster analysis. Journal of Climate 6 (11):2103-2135. </p><p>French, Robert M. 1995. The Subtlety of Sameness: A Theory and Computer Model of Analogy Making. London, England: The MIT Press. 
</p><p>Gahegan, Mark, Ritesh Agrawal, Tawan Banchuen and David DiBiase. 2007. Building rich, semantic descriptions of learning activities to facilitate reuse in digital libraries. International Journal on Digital Libraries 7 (1-2):81-97. </p><p>Gal, Avigdor. 2006. Managing uncertainty in schema matching with top-k schema mappings. Journal on Data Semantics VI LNCS 4090:90-114. </p><p>Gerstengarbe, Friedrich-Wilhelm, Peter C. Werner and Klaus Fraedrich. 1999. Applying non-hierarchical cluster analysis algorithms to climate classification: some problems and their solution. Theoretical and Applied Climatology 64:143-150. </p><p>Giarratano, Joseph C. and Gary D. Riley. 2005. Expert Systems: Principles and Programming. 4th ed. New York, NY: Thomson Course Technology. </p><p>Glantz, Michael H., ed. 1988. Societal Responses to Regional Climatic Change: Forecasting by Analogy. Boulder, CO: Westview Press. </p><p>Green, Paul E., Ronald Frank and Patrick J. Robinson. 1967. Cluster analysis in test market selection. Management Science 13 (8):B387-B400. </p><p>Grosof, Benjamin N., Ian Horrocks, Raphael Volz and Stefan Decker. 2003. Description logic programs: combining logic programs with description logic. Proceedings of WWW2003, Budapest, Hungary. </p><p>Gruber, Thomas R. 1995. Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43:907-928. </p><p>Guarino, Nicola. 1997. Understanding, building and using ontologies. International Journal of Human-Computer Studies 46:293-310. </p><p>Guarino, Nicola and Pierdaniele Giaretta. 1995. Ontologies and knowledge bases: towards a terminological clarification. In Towards Very Large Knowledge Bases, Knowledge Building & Knowledge Sharing, edited by N. J. I. Mars. Amsterdam, Netherlands: IOS Press. </p><p>Gurevych, Iryna. 2005. Using the structure of a conceptual network in computing semantic relatedness. 
Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, Republic of Korea. </p><p>Haggett, Peter. 1966. Locational Analysis in Human Geography. New York, NY: St. Martin's Press. </p><p>Haller, Heiko, Felix Kugel and Max Völkel. 2006. iMapping Wiki - towards a graphical environment for semantic knowledge management. Proceedings of the First Workshop on Semantic Wikis - From Wiki To Semantics [SemWiki2006] at the ESWC 2006, 12 June 2006, Budva, Montenegro. </p><p>Hammouda, Khaled and Mohamed S. Kamel. 2004. Efficient phrase-based document indexing for Web document clustering. IEEE Transactions on Knowledge and Data Engineering 16 (10):1279-1296. </p><p>Hayes, Pat, Tom Eskridge, Thomas Reichherzer and Raul Saavedra. 2004. A Framework for Constructing Web Ontologies using Concept Maps. Proceedings of DAML PI Meeting, New York, NY. http://www.daml.org/meetings/2004/05/pi/pdf/IHMC_DAML_PI_2004.jb.pdf </p><p>Hepp, Martin, Daniel Bachlechner and Katharina Siorpaes. 2006. Harvesting Wiki consensus - using Wikipedia entries as ontology elements. Proceedings of the First Workshop on Semantic Wikis - From Wiki To Semantics [SemWiki2006] at the ESWC 2006, 12 June 2006, Budva, Montenegro. </p><p>Hesse, Mary B. 1966. Models and Analogies in Science. Notre Dame, IN: University of Notre Dame Press. </p><p>Hill, Edward W., John F. Brennan, and Harold L. Wolman. 1998. What is a central city in the United States? Applying a statistical technique for developing taxonomies. Urban Studies 35 (11):1935-1969. </p><p>Horrocks, Ian. 2002. DAML+OIL: a description logic for the Semantic Web. In Bulletin of the Technical Committee on Data Engineering, edited by D. B. Lomet. Washington, D.C.: IEEE Computer Society. </p><p>Horrocks, Ian, Peter F. Patel-Schneider and Frank van Harmelen. 2003. 
From SHIQ and RDF to OWL: the making of a Web Ontology Language. Web Semantics: Science, Services and Agents on the World Wide Web 1 (1):7-26. </p><p>Horrocks, Ian, Ulrike Sattler and Stephan Tobies. 2000. Reasoning with individuals for the description logic. Proceedings of the 17th International Conference on Automated Deduction, 17-20 June 2000, Pittsburgh, PA. </p><p>Hunter, David, Joe Fawcett, Jeff Rafter, Eric van der Vlist and Danny Ayers. 2007. Beginning XML. 4th ed. Indianapolis, IN: Wrox/Wiley Publishing, Inc. </p><p>Isserman, Andrew and Terance Rephann. 1995. The economic effects of the Appalachian Regional Commission. Journal of the American Planning Association 61 (3):345-364. </p><p>Jackson, Donald A. 1993. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74 (8):2204-2214. </p><p>Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference Research on Computational Linguistics (ROCLING X), Taiwan. </p><p>Johar, Kunal and Rahul Simha. 2004. JWordNet. Department of Computer Science, The George Washington University, Washington, D.C. http://www.seas.gwu.edu/~simhaweb/software/jwordnet/index.html </p><p>Kagathara, Satish, Manish Deolalkar and Pushpak Bhattacharyya. 2005. A multi stage fall-back search strategy for cross-lingual information retrieval. Proceedings of the Second Symposium on Indian Morphology, Phonology and Language Engineering - SIMPLE'05, February 2005, Kharagpur, India. </p><p>Kashyap, Vipul and Alex Borgida. 2003. Representing the UMLS® semantic network using OWL (or "what's in a semantic Web link?"). Proceedings of the Second International Semantic Web Conference - ISWC 2003, 20-23 October 2003, Sanibel Island, FL. </p><p>King, Leslie J. 1969. Statistical Analysis in Geography. Englewood Cliffs, NJ: Prentice-Hall, Inc. 
</p><p>Klyne, Graham and Jeremy J. Carroll. 2004. Resource Description Framework (RDF): Concepts and Abstract Syntax. World Wide Web Consortium, accessed on 15 January 2008. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/ </p><p>Knight, C. Gregory. 2001. Human-environment relationship: comparative case studies. In International Encyclopedia of the Social and Behavioral Sciences, edited by N. Smelser and P. Baltes. Oxford, UK: Elsevier. </p><p>Knox, Paul L. 1995. World cities and the organization of global space. In Geographies of Global Change: Remapping the World in the Late Twentieth Century, edited by R. J. Johnston, P. J. Taylor and W. Michael. Oxford, UK: Blackwell. </p><p>Korfhage, Robert R. 1997. Information Storage and Retrieval. New York, NY: John Wiley and Sons, Inc. </p><p>Kuipers, Theo A.F. 1999. Abduction aiming at empirical progress or even truth approximation leading to a challenge for computational modelling. Foundations of Science 4:307-323. </p><p>Lakoff, George. 1990. Women, Fire, and Dangerous Things. Chicago, IL: University of Chicago Press. </p><p>Leake, David B., Ana G. Maguitman, Thomas Reichherzer, Alberto J. Cañas, Marco Carvalho, Marco Arguedas and Tom Eskridge. 2004. "Googling" from a concept map: towards automatic concept-map-based query formation. Proceedings of the First International Conference on Concept Mapping, 14-17 September 2004, Pamplona, Spain. http://cmc.ihmc.us/papers/cmc2004-225.pdf </p><p>Lewis, Frank A. 1994. Aristotle on the relation between a thing and its matter. In Unity, Identity, and Explanation in Aristotle's Metaphysics, edited by T. Scaltsas, D. Charles and M. L. Gill. Oxford, England: Clarendon Press. </p><p>Li, Yong Hong and Anil K Jain. 1998. Classification of Text Documents. The Computer Journal 41 (8):537-545. </p><p>Luo, Junyan. 2007. 
The Semantic Geospatial Problem Solving Environment: an Enabling Technology for Geographical Problem Solving under Open, Heterogeneous Environments, Department of Geography, The Pennsylvania State University, University Park. </p><p>Maedche, Alexander and Steffen Staab. 2002. Measuring similarity between ontologies. Proceedings of the Knowledge Engineering and Knowledge Management, 1-4 October 2002, Siguenza, Spain. </p><p>Maguitman, Ana G., Filippo Menczer, Fulya Erdinc, Heather Roinestad and Alessandro Vespignani. 2006. Algorithmic computation and approximation of semantic similarity. World Wide Web 8 June 2006. </p><p>Martin, Phillippe. 2007. Knowledge Representation/Translation in RDF+OWL, N3, KIF, UML, and the WebKB-2 Languages (For-Links, Frame-CG, Formalized English), accessed on 13 February 2008. http://meganesia.int.gu.edu.au/~phmartin/WebKB/doc/model/comparisons.html </p><p>McBride, Brian. 2001. Jena: implementing the RDF model and syntax specification. Proceedings of the Second International Workshop on the Semantic Web - SemWeb'2001, 1 May 2001, Hong Kong, China. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-40/mcbride.pdf </p><p>McGuinness, Deborah L., Richard Fikes, James Hendler and Lynn Andrea Stein. 2002. DAML+OIL: an ontology language for the Semantic Web. IEEE Intelligent Systems 17 (5):72-80. </p><p>Merriam-Webster, Inc. 2005. Merriam-Webster Online Dictionary, accessed on 13 September 2007. http://www.merriam-webster.com </p><p>Meyer, William B., Karl W. Butzer, E. Thomas Downing, B. L. Turner II, George W. Wenzel and James Wescoat. 1998. Reasoning by analogy. In The Tools for Policy Analysis, edited by S. Rayner and E. Malone. Columbus, Ohio: Battelle Press. </p><p>Miller, George A. 1995. WordNet: a lexical database for English. Communications of the ACM 38 (11):39-41. </p><p>Miller, George A., Christiane Fellbaum, Randee Tengi, Pamela Wakefield, Rajesh Poddar, Helen Langone and Benjamin Haskell. 2005. 
WordNet 2.1 for Windows. Cognitive Science Laboratory, Princeton University. http://wordnet.princeton.edu/2.1/WordNet-2.1.exe </p><p>Mitra, Prasenjit and Gio Wiederhold. 2002. Resolving terminological heterogeneity in ontologies. Proceedings of ECAI-02 Workshop, CEUR-WS 64. </p><p>Nelson, Howard J. 1955. A service classification of American cities. Economic Geography 31 (3):189-210. </p><p>Nikolopoulos, Chris. 1997. Expert Systems: Introduction to First and Second Generation and Hybrid Knowledge Based Systems. New York, NY: M. Dekker. </p><p>Novak, Joseph D. and Alberto J. Cañas. 2008. The Theory Underlying Concept Maps and How to Construct and Use Them, Technical Report IHMC CmapTools 2006-1 Rev 01-2008. Florida Institute for Human and Machine Cognition (IHMC). http://cmap.ihmc.us/Publications/ResearchPapers/TheoryUnderlyingConceptMaps.pdf </p><p>Novak, Joseph D. and Dismas Musonda. 1991. A twelve-year longitudinal study of science concept learning. American Educational Research Journal 28 (1):117-153. </p><p>Oreskes, Naomi. 2003. The role of quantitative models in science. In The Role of Models in Ecosystem Science, edited by C. D. Canham, J. J. Cole and W. K. Laurenroth. Princeton, NJ: Princeton University Press. </p><p>Overholt, George E. and William M. Stallings. 1976. Ethnographic and experimental hypotheses in educational research. Educational Researcher 5 (8):12-14. </p><p>Pal, Sankar K. 2004. Foundations of Soft Case-Based Reasoning. Hoboken, NJ: Wiley-Interscience. </p><p>Palmer-Brown, Dominic, Jonathan A. Tepper and Heather M. Powell. 2002. Connectionist natural language parsing. Trends in Cognitive Sciences 6 (10):437-442. </p><p>Peirce, Charles Sanders. 1955. Philosophical Writings of Peirce, edited by J. Buchler. Mineola, NY: Dover Publications. </p><p>Pfoser, Dieter, Nectaria Tryfona and Christian S. Jensen. 2005. Indeterminacy and spatiotemporal data: basic definitions and case study. 
Geoinformatica 9 (3):211-236. </p><p>Pike, William A. 2005. Augmenting Collaboration Through Situated Representations of Scientific Knowledge, Department of Geography, The Pennsylvania State University, University Park, PA. </p><p>Platt, Rutherford V. and Lauren Rapoza. 2008. An evaluation of an object-oriented paradigm for land use/land cover classification. The Professional Geographer 60 (1):87-100. </p><p>Plaza, Enric, Francesc Esteva, Pere Garcia, Lluíse Godo and Ramon López de Mántaras. 1998. A logical approach to case-based reasoning using fuzzy similarity relations. Information Sciences 106 (1):105-122. </p><p>Popescu, Mihail, James M. Keller and Joyce A. Mitchell. 2006. Fuzzy measures on the gene ontology for gene product similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3 (3):263-274. </p><p>Popping, Roel. 1997. Computer programs for the analysis of texts and transcripts. In Text Analysis for the Social Sciences, edited by C. W. Roberts. Mahwah, NJ: Lawrence Erlbaum Associates. </p><p>Powers, Shelley. 2003. Practical RDF. Sebastopol, CA: O'Reilly Media, Inc. </p><p>Prentice, I. Colin, Wolfgang Cramer, Sandy P. Harrison, Rik Leemans, Robert A. Monserud and Allen M. Solomon. 1992. Special paper: a global biome model based on plant physiology and dominance, soil properties and climate. Journal of Biogeography 19 (2):117-134. </p><p>Raskin, Robert G. and Michael J. Pan. 2005. Knowledge representation in the semantic web for Earth and environmental terminology (SWEET). Computers & Geosciences 31:1119-1125. </p><p>Rebich, Stacy and Catherine Gautier. 2005. Concept mapping to reveal prior knowledge and conceptual change in a mock summit course on global climate change. Journal of Geoscience and Education 53 (4):355-365. </p><p>Resnik, Philip. 1995. Using information content to evaluate semantic similarity in a taxonomy. 
Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence - IJCAI 95, Montréal, Québec, Canada. </p><p>Roberts, Carl W. 1997. Text Analysis for the Social Sciences: Methods for Drawing Statistical Inferences from Texts and Transcripts. Mahwah, NJ: Lawrence Erlbaum Associates. </p><p>Rogerson, Peter A. 2001. Statistical Methods for Geography. Thousand Oaks, CA: SAGE Publications Inc. </p><p>Schutz, Alexander and Paul Buitelaar. 2005. RelExt: a tool for relation extraction from text in ontology extension. Proceedings of the 4th International Semantic Web Conference - ISWC 2005, 6-10 November 2005, Galway, Ireland. </p><p>Schwering, Angela, Roland M. Wagner and Bernd Schneider. 2003. A geographic search ways to structure information on the Web. Proceedings of the 6th AGILE, 24-26 April 2003, Campus de la Doua, INSA, Lyon, France. http://plone.itc.nl/agile_old/Conference/lyon2003/proceedings/66.pdf </p><p>Seco, N. Veale and J. T. Hayes. 2004. An intrinsic information content metric for semantic similarity in WordNet. Proceedings of the 16th European Conference on Artificial Intelligence (2004), 22-27 August 2004, Valencia, Spain. </p><p>Smith, Barry. 2003. Ontology. In Philosophy of Computing and Information, edited by L. Floridi. Oxford, United Kingdom: Blackwell. </p><p>Sowa, John. 2004. The challenge of knowledge soup. Proceedings of Research Trends in Science, Technology and Mathematics Education, International Center, Goa, India. </p><p>Sowa, John F. 1984. Conceptual Structures: Information Processing in Mind and Machine. Boston, MA: Addison-Wesley Longman Publishing Co., Inc. </p><p>Sowa, John F. 1992. Semantic Networks. In Encyclopedia of Artificial Intelligence, edited by S. C. Shapiro. New York, NY: John Wiley & Sons, Inc. </p><p>Sowa, John F. 2001. Signs, processes, and language games - foundations for ontology. 
Proceedings of the International Conference on the Challenge of Pragmatic Process Philosophy, May 1999, Radboud University Nijmegen, Nijmegen, Netherlands. http://www.jfsowa.com/pubs/signproc.htm </p><p>SPSS, Inc. 2001. SPSS for Windows, Rel. 11.0.1. SPSS Inc., Chicago, IL. </p><p>SPSS, Inc. 2006. SPSS 15.0 Statistical Procedures Companion. Upper Saddle River, NJ: Prentice Hall Inc. </p><p>Strapparava, Carlo, Alfio Gliozzo and Claudio Giuliano. 2004. Pattern Abstraction and Term Similarity for Word Sense Disambiguation IRST at Senseval-3. Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, July 2004, Barcelona, Spain. </p><p>Stoilos, Giorgos, Giorgos Stamou, Vassillis Tzouvaras, Jeff Z. Pan and Ian Horrocks. 2005. Fuzzy OWL: uncertainty and the Semantic Web. Proceedings of International Workshop of OWL: Experiences and Directions, Galway. </p><p>Ding, Zhongli, Yun Peng and Rong Pan. 2006. BayesOWL: uncertainty modeling in Semantic Web ontologies. StudFuzz 204:3-29. </p><p>Lele, Subhash and Joan T. Richtsmeier. 1995. Euclidean distance matrix analysis: confidence intervals for form and growth differences. American Journal of Physical Anthropology 98:73-86. </p><p>Swearingen, Will D. 1987. Moroccan Mirages. Princeton, NJ: Princeton University Press. </p><p>Tversky, Amos. 1977. Features of similarity. Psychological Review 84:327-352. </p><p>U.S. Census Bureau. 2000. County and City Data Book. Statistical Compendia Branch, U.S. Census Bureau, Administrative and Customer Services Division. </p><p>U.S. Census Bureau. 2002. Current Population Survey: Design and Methodology, Technical Paper 63RV. U.S. Census Bureau, Economics and Statistics Administration, U.S. Department of Commerce. </p><p>U.S. Census Bureau. 2005. Statistics for Industry Groups and Industries: 2005. Washington, DC: U.S. Census Bureau. </p><p>Veenstra, Jorn, Antal van den Bosch, Sabine Buchholz, Walter Daelemans and Jakub Zavrel. 2000. 
Memory-based word sense disambiguation. Computers and the Humanities 34:171-177. </p><p>Vesanto, Juha and Esa Alhoniemi. 2000. Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks 11 (3):586-600. </p><p>Weaver, Christopher Eric. 2006. IMPROVISE: a User Interface for Interactive Construction of Highly-Coordinated Visualization, Department of Computer Science, University of Wisconsin, Madison, WI. </p><p>Woodruff, Allison Gyle and Christian Plaunt. 1994. GIPSY: automated geographic indexing of text documents. Journal of the American Society for Information Science 45 (9):645-655. </p><p>Yarowsky, David. 2000. Hierarchical decision lists for word sense disambiguation. Computers and the Humanities 34:179-186. </p><p>Yeung, Ka Yee and Walter L. Ruzzo. 2001. Principal component analysis for clustering gene expression data. Bioinformatics 17 (9):763-774. </p><p>Appendix A </p><p>List of Variables in the City Tables and Their Descriptions </p><p>Table A-1: Variables in Table C-1 - Area and Population </p><p>Variable Definition
C1-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code
C1-GEO02 USPS 2-letter state abbreviation
C1-GEO03 FIPS 2-digit state code
C1-GEO04 FIPS 5-digit place code
C1-GEO05 FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not applicable; see Note below]
C1-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below]
C1-GEO07 FIPS 4-digit NECMA code [blank if not applicable]
C1-GEO08 Area name
C1-GEO09 County [see Note below]
C1-LND01F Footnotes for C1-LND01
C1-LND01 Land area, 2000 (square miles) \1
C1-POP01F Footnotes for C1-POP01
C1-POP01 Population, 2000 (April 1): number
C1-POP02F Footnotes for C1-POP02
C1-POP02 Population, 2000 (April 1): rank \2
C1-POP03F Footnotes for C1-POP03
C1-POP03 Population per square mile, 2000
C1-POP04F Footnotes for C1-POP04
C1-POP04 Population, 1990 (April 1): number \3
C1-POP05F Footnotes for C1-POP05
C1-POP05 
Population, 1990 (April 1): rank \3 \4
C1-POP06F Footnotes for C1-POP06
C1-POP06 Population, 1980 (April 1): number
C1-POP07F Footnotes for C1-POP07
C1-POP07 Population: net change, 1990-2000
C1-POP08 Population: net change, 1980-1990
C1-POP09F Footnotes for C1-POP09
C1-POP09 Population: percent change, 1990-2000
C1-POP10 Population: percent change, 1980-1990
C1-POP11F Footnotes for C1-POP11
C1-POP11 Hispanic or Latino population, 2000: number \5
C1-POP12F Footnotes for C1-POP12
C1-POP12 Hispanic or Latino population, 2000: percent of total population \5
Note: If all counties in which a city is located are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county, with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes:
\1 Dry land and land temporarily or partially covered by water.
\2 Based on 1,078 places (1,070 incorporated places and the 8 CDPs in Hawaii). When places share the same rank, the next lower rank is omitted.
\3 Includes count resolution corrections through 1997 and adjustments based on Census 2000 dress rehearsal results and boundary changes reported as legally effective as of January 1, 1998.
\4 Based on 1,071 places (1,070 incorporated places and Honolulu CDP in Hawaii). When places share the same rank, the next lower rank is omitted.
\5 Persons of Hispanic or Latino origin may be of any race.
\6 Not incorporated as of January 1, 1980.
\7 Data are for Athens-Clarke County (balance), GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991.
\8 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. </p><p>Sources:
Land Area -- U.S. 
Census Bureau, Census of Population and Housing, Census 2000 Redistricting Data (Public Law 94-171) Summary Files (related Internet site <http://www.census.gov/dmd/www/2kresult.html>).
2000 Population -- U.S. Census Bureau, 2000 Census of Population and Housing, "Census 2000 Profiles of General Demographic Characteristics" data files, published May 2001 (related Internet site <http://www.census.gov/mp/www/pub/2000cen/mscen01.html>).
1990 Population -- U.S. Census Bureau, "(SU-99-7) Population Estimates for Places (Sorted Alphabetically Within State): Annual Time Series, July 1, 1990 to July 1, 1999 (includes April 1, 1990 Population Estimates Base)," published 20 October 2000, <http://www.census.gov/population/estimates/metro-city/placebyst/SC99T7_US.txt>.
1980 Population -- U.S. Census Bureau, 1990 Census of Population and Housing, "Population and Housing Unit Counts," CPH-2-1 through CPH-2-52. </p><p>Table A-2: Variables in Table C-2 - Population by Age, Sex, and Race </p><p>Variable Definition
C2-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code
C2-GEO02 USPS 2-letter state abbreviation
C2-GEO03 FIPS 2-digit state code
C2-GEO04 FIPS 5-digit place code
C2-GEO05 FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not applicable; see Note below]
C2-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below]
C2-GEO07 FIPS 4-digit NECMA code [blank if not applicable]
C2-GEO08 Area name
C2-GEO09 County [see Note below]
C2-POP01F Footnotes for C2-POP01
C2-POP01 Population, 2000 (April 1)
C2-AGE01F Footnotes for C2-AGE01
C2-AGE01 Population under 5 years, 2000: number
C2-AGE02F Footnotes for C2-AGE02
C2-AGE02 Population under 5 years, 2000: percent of total population
C2-AGE03F Footnotes for C2-AGE03
C2-AGE03 Population 5 to 17 years, 2000: number
C2-AGE04F Footnotes for C2-AGE04
C2-AGE04 Population 5 to 17 years, 2000: percent of total population
C2-AGE05F Footnotes for C2-AGE05
C2-AGE05 
Population 18 to 24 years, 2000: number
C2-AGE06F Footnotes for C2-AGE06
C2-AGE06 Population 18 to 24 years, 2000: percent of total population
C2-AGE07F Footnotes for C2-AGE07
C2-AGE07 Population 25 to 44 years, 2000: number
C2-AGE08F Footnotes for C2-AGE08
C2-AGE08 Population 25 to 44 years, 2000: percent of total population
C2-AGE09F Footnotes for C2-AGE09
C2-AGE09 Population 45 to 64 years, 2000: number
C2-AGE10F Footnotes for C2-AGE10
C2-AGE10 Population 45 to 64 years, 2000: percent of total population
C2-AGE11F Footnotes for C2-AGE11
C2-AGE11 Population 65 to 74 years, 2000: number
C2-AGE12F Footnotes for C2-AGE12
C2-AGE12 Population 65 to 74 years, 2000: percent of total population
C2-AGE13F Footnotes for C2-AGE13
C2-AGE13 Population 75 to 84 years, 2000: number
C2-AGE14F Footnotes for C2-AGE14
C2-AGE14 Population 75 to 84 years, 2000: percent of total population
C2-AGE15F Footnotes for C2-AGE15
C2-AGE15 Population 85 years and over, 2000: number
C2-AGE16F Footnotes for C2-AGE16
C2-AGE16 Population 85 years and over, 2000: percent of total population
C2-AGE17F Footnotes for C2-AGE17
C2-AGE17 Population, 2000: median age (years)
C2-POP02F Footnotes for C2-POP02
C2-POP02 Population by sex, 2000: males
C2-POP03F Footnotes for C2-POP03
C2-POP03 Population by sex, 2000: females
C2-POP04F Footnotes for C2-POP04
C2-POP04 Population by sex: males per 100 females (April 1)
C2-POP05F Footnotes for C2-POP05
C2-POP05 Population, one race, White, 2000: number
C2-POP06F Footnotes for C2-POP06
C2-POP06 Population, one race, White, 2000: percent of total population
C2-POP07F Footnotes for C2-POP07
C2-POP07 Population, one race, Black or African American, 2000: number
C2-POP08F Footnotes for C2-POP08
C2-POP08 Population, one race, Black or African American, 2000: percent of total population
C2-POP09F Footnotes for C2-POP09
C2-POP09 Population, one race, American Indian and Alaska Native, 2000: number
C2-POP10F 
Footnotes for C2-POP10
C2-POP10 Population, one race, American Indian and Alaska Native, 2000: percent of total population
C2-POP11F Footnotes for C2-POP11
C2-POP11 Population, one race, Asian, 2000: number
C2-POP12F Footnotes for C2-POP12
C2-POP12 Population, one race, Asian, 2000: percent of total population
C2-POP13F Footnotes for C2-POP13
C2-POP13 Population, one race, Native Hawaiian or Other Pacific Islander, 2000: number
C2-POP14F Footnotes for C2-POP14
C2-POP14 Population, one race, Native Hawaiian or Other Pacific Islander, 2000: percent of total population
C2-POP15F Footnotes for C2-POP15
C2-POP15 Population, one race, some other race, 2000: number \1
C2-POP16F Footnotes for C2-POP16
C2-POP16 Population, one race, some other race, 2000: percent of total population \1
C2-POP17F Footnotes for C2-POP17
C2-POP17 Population, two or more races, 2000: number \2
C2-POP18F Footnotes for C2-POP18
C2-POP18 Population, two or more races, 2000: percent of total population \2
Note: If all counties in which a city is located are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county, with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes:
\1 Includes all other responses not included in the other five race categories shown. Also includes write-in entries such as multiracial, mixed, interracial, or a Hispanic/Latino group.
\2 Refers to combinations of two or more of the six race categories shown under one race.
\3 Data are for Athens-Clarke County (balance), GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991.
\4 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. </p><p>Sources:
Population by Age, Sex, and Race -- U.S. 
Census Bureau; 2000 Census of Population and Housing, "Census 2000 Profiles of General Demographic Characteristics" data files, published May 2001, related Internet site <http://www.census.gov/mp/www/pub/2000cen/mscen01.html>.</p><p>133 </p><p>Table A-3: Variables in Table C-3 — Group Quarters Population and Households </p><p>Variable Definition C3-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code C3-GEO02 USPS 2-letter state abbreviation C3-GEO03 FIPS 2-digit state code C3-GEO04 FIPS 5-digit place code FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not C3-GEO05 applicable; see Note below] C3-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below] C3-GEO07 FIPS 4-digit NECMA code [blank if not applicable] C3-GEO08 Area name C3-GEO09 County [see note below] C3-POP01F Footnotes for C3-POP01 C3-POP01 Group quarters population, 2000: number \1 C3-POP02F Footnotes for C3-POP02 C3-POP02 Group quarters population, 2000: institutionalized population \1 \2 C3-HSD01F Footnotes for C3-HSD01 C3-HSD01 Households, 2000(April 1): number C3-HSD02F Footnotes for C3-HSD02 C3-HSD02 Households, 1990: number C3-HSD03 Households: percent change, 1990-2000 C3-HSD04F Footnotes for C3-HSD04 C3-HSD04 Persons in households, 2000: number C3-HSD05F Footnotes for C3-HSD05 C3-HSD05 Persons in households, 2000: persons per household C3-HSD06F Footnotes for C3-HSD06 C3-HSD06 Households, one person, 2000: number C3-HSD07F Footnotes for C3-HSD07 </p><p>134 Variable Definition C3-HSD07 Households, one person, 2000: percent C3-HSD08F Footnotes for C3-HSD08 C3-HSD08 Households, one or more persons under 18 years, 2000: number C3-HSD09F Footnotes for C3-HSD09 C3-HSD09 Households, one or more persons under 18 years, 2000: percent C3-HSD10F Footnotes for C3-HSD10 C3-HSD10 Households, one or more persons 65 years or older, 2000: number C3-HSD11F Footnotes for C3-HSD11 C3-HSD11 Households, one or more persons 65 years or older, 2000: percent 
C3-HSD12F Footnotes for C3-HSD12 C3-HSD12 Households, family households, 2000: number C3-HSD13F Footnotes for C3-HSD13 Households, family households, with own children under 18 years, 2000: C3-HSD13 number C3-HSD14F Footnotes for C3-HSD14 Households, family households, with own children under 18 years, 2000: C3-HSD14 percent C3-HSD15F Footnotes for C3-HSD15 C3-HSD15 Households, family households, married couple, 2000: number C3-HSD16F Footnotes for C3-HSD16 Households, family households, married couple, with own children, 2000: C3-HSD16 number \3 C3-HSD17F Footnotes for C3-HSD17 Households, family households, married couple, with own children, 2000: C3-HSD17 percent \3 C3-HSD18F Footnotes for C3-HSD18 C3-HSD18 Households, family households, female householder, 2000: number \4 C3-HSD19F Footnotes for C3-HSD19 </p><p>135 Variable Definition Households, family households, female householder, with own children, 2000: C3-HSD19 number \3 \4 C3-HSD20F Footnotes for C3-HSD20 Households, family households, female householder, with own children, 2000: C3-HSD20 percent \3 \4 C3-HSD21F Footnotes for C3-HSD21 C3-HSD21 Households, nonfamily households, 2000: number C3-HSD22F Footnotes for C3-HSD22 C3-HSD22 Households, nonfamily households, 1990: number C3-HSD23 Households, nonfamily households: percent change, 1990-2000 Note: If all counties, in which cities are located in, are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes: \1 As of April 1. \2 Includes people under formally authorized, supervised care or custody in institutions at the time of enumeration (such as correctional institutions, nursing homes, and juvenile institutions). \3 Under 18 years. \4 No husband present. 
\5 Data are for Athens-Clarke County (balance), GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991. \6 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. </p><p>Sources: Group Quarters Population -- U.S. Census Bureau, 2000 Census of Population and Housing, "Census 2000 Profiles of General Demographic Characteristics" data files, published May 2001 (related Internet site <http://www.census.gov/mp/www/pub/2000cen/mscen01.html>). Households, 2000 -- U.S. Census Bureau, 2000 Census of Population and Housing, "Census 2000 Profiles of General Demographic Characteristics" data files, published May 2001 (related Internet site <http://www.census.gov/mp/www/pub/2000cen/mscen01.html>). Households, 1990 -- U.S. Census Bureau, 1990 Census of Population and Housing, Summary Tape File (STF) 1C on CD-ROM (related Internet site <http://homer.ssd.census.gov/cdrom/lookup>). 
</p><p>136 </p><p>Table A-4: Variables in Table C-4 — Housing, Crime, and Labor Force </p><p>Variable Definition C4-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code C4-GEO02 USPS 2-letter state abbreviation C4-GEO03 FIPS 2-digit state code C4-GEO04 FIPS 5-digit place code FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not C4-GEO05 applicable; see Note below] C4-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below] C4-GEO07 FIPS 4-digit NECMA code [blank if not applicable] C4-GEO08 Area name C4-GEO09 County [see note below] C4-HSG01F Footnotes for C4-HSG01 C4-HSG01 Housing: total units, 2000 C4-HSG02 Housing: total units, 1990 C4-HSG03 Housing: percent change, 1990-2000 C4-HSG04F Footnotes for C4-HSG04 C4-HSG04 Housing: occupied units, number 2000 C4-HSG05F Footnotes for C4-HSG05 C4-HSG05 Housing: occupied units, owner occupied, number 2000 C4-HSG06F Footnotes for C4-HSG06 C4-HSG06 Housing: occupied units, owner occupied, percent 2000 C4-CRM01F Footnotes for C4-CRM01 C4-CRM01 Number of serious crimes known to police, 1999: total \1 C4-CRM02F Footnotes for C4-CRM02 C4-CRM02 Number of serious crimes known to police, 1999: violent \1 \2 C4-CRM03F Footnotes for C4-CRM03 C4-CRM03 Number of serious crimes known to police, 1999: property \1 \3 </p><p>137 Variable Definition C4-CRM04F Footnotes for C4-CRM04 C4-CRM04 Number of serious crimes known to police: 1998 \1 C4-CRM05F Footnotes for C4-CRM05 C4-CRM05 FBI population, 1999 \1 C4-CRM06F Footnotes for C4-CRM06 C4-CRM06 Crime rate of serious crimes known to police, 1999 \1 \4 C4-CRM07F Footnotes for C4-CRM07 C4-CRM07 FBI population 1998 \1 C4-CRM08F Footnotes for C4-CRM08 C4-CRM08 Crime rate of serious crimes known to police, 1998 \1 \4 C4-CLF01F Footnotes for C4-CLF01 C4-CLF01 Civilian labor force, 2000: total C4-CLF02F Footnotes for C4-CLF02 C4-CLF02 Civilian labor force, 1999: total C4-CLF03F Footnotes for C4-CLF03 C4-CLF03 Civilian labor force, 
1999-2000: percent change C4-CLF04F Footnotes for C4-CLF04 C4-CLF04 Civilian labor force, unemployment 2000: total C4-CLF05F Footnotes for C4-CLF05 C4-CLF05 Civilian labor force, unemployment 2000: rate \5 Note: If all counties, in which cities are located in, are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes: \1 Data on serious crimes have not been adjusted for underreporting; this may affect comparability over time or among geographic areas. \2 Includes murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. \3 Includes burglary, larceny-theft, and motor vehicle theft. \4 Per 100,000 resident population provided by the U.S. Federal Bureau of Investigation. \5 Civilian unemployed as a percent of total civilian labor force. </p><p>138 \6 Data are for consolidated city of Milford, CT; data for Milford city (remainder) not available. \7 Excludes rape. \8 Data are for consolidated city of Jacksonville, FL; data for Jacksonville (remainder) not available. \9 Data are for Athens-Clarke County (balance), GA; Athens city (1990 land area - 16.6 square miles; 1990 population – 45,734) merged with Clarke County effective January 14, 1991. \10 Data are for consolidated city of Athens-Clarke County, GA; for information on Athens city, see previous footnote. \11 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. \12 Data are for consolidated city of Augusta-Richmond County, GA; for information on Augusta city, see previous footnote. \13 Data are for consolidated city of Columbus, GA; data for Columbus city (remainder) not available. 
\14 Data for Honolulu CDP are for Honolulu City/County, HI which includes all CDPs listed except Hilo which is in Maui County, HI. \15 Data are for consolidated city of Indianapolis, IN; data for Indianapolis city (remainder) not available. \16 Data are for consolidated city of Butte-Silver Bow, MT; data for Butte-Silver Bow (remainder) not available. \17 Data are for area covered by Las Vegas Metropolitan Police Jurisdiction. \18 Data are for consolidated city of Nashville-Davidson, TN; data for Nashville-Davidson (remainder) not available. </p><p>Sources: Housing, 2000 -- U.S. Census Bureau, 2000 Census of Population and Housing, "Census 2000 Profiles of General Demographic Characteristics" data files, published May 2001 (related Internet site <http://www.census.gov/mp/www/pub/2000cen/mscen01.html>). Housing, 1990 -- U.S. Census Bureau, "1990 Census of Population and Housing, Summary Tape File (STF) 1C" on CD-Rom (related Internet site <http://homer.ssd.census.gov/cdrom/lookup>). Serious Crimes Known to Police -- U.S. Federal Bureau of Investigation, Uniform Crime Reporting Program, "Crime in the United States," annual <http://www.fbi.gov/ucr/Cius_99/99crime/99c2_01.pdf> (accessed: 20 October 2000) and <http://www.fbi.gov/ucr/Cius_98/98crime/98cius05.pdf> (accessed: 9 December 1999). Civilian Labor Force -- U.S. Bureau of Labor Statistics, Local Area Unemployment Statistics, 2000 data published 2 May 2001, 1999 data published 30 May 2001 <ftp://ftp.bls.gov/pub/time.series/la/> (related Internet site <http://www.bls.gov/lauhome.htm>). 
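</p><p>The derived measures in Table A-4 follow simple arithmetic stated in footnotes \4 and \5: the crime rate is expressed per 100,000 resident population, and the unemployment rate is the civilian unemployed as a percent of the civilian labor force. The sketch below illustrates that arithmetic only. The function names and the sample figures are hypothetical, and the pairing of variables (for example, C4-CRM01 with C4-CRM05 to obtain C4-CRM06) is an assumption about how the published rates were derived, not something the data book documentation confirms.

```python
# Hypothetical helpers; names and sample figures are illustrative,
# not taken from the County and City Data Book.


def crime_rate_per_100k(serious_crimes, fbi_population):
    """Serious crimes per 100,000 resident population (Table A-4, footnote 4)."""
    return serious_crimes * 100_000 / fbi_population


def unemployment_rate(unemployed, civilian_labor_force):
    """Civilian unemployed as a percent of the civilian labor force (footnote 5)."""
    return unemployed * 100 / civilian_labor_force


def percent_change(earlier, later):
    """Percent change between two counts, e.g. housing units 1990 to 2000."""
    return (later - earlier) * 100 / earlier


# A made-up city, used only to exercise the functions.
city = {
    "crimes": 2500, "population": 50000,       # assumed C4-CRM01, C4-CRM05
    "unemployed": 1200, "labor_force": 24000,  # assumed C4-CLF04, C4-CLF01
    "units_1990": 18000, "units_2000": 19800,  # assumed C4-HSG02, C4-HSG01
}

print(crime_rate_per_100k(city["crimes"], city["population"]))
print(unemployment_rate(city["unemployed"], city["labor_force"]))
print(percent_change(city["units_1990"], city["units_2000"]))
```

Because footnote \1 warns that crime counts are not adjusted for underreporting, rates computed this way inherit the same comparability caveat.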
</p><p>139 </p><p>Table A-5: Variables in Table C-5 — Manufacturing and Wholesale Trade </p><p>Variable Definition C5-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code C5-GEO02 USPS 2-letter state abbreviation C5-GEO03 FIPS 2-digit state code C5-GEO04 FIPS 5-digit place code FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not C5-GEO05 applicable; see Note below] C5-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below] C5-GEO07 FIPS 4-digit NECMA code [blank if not applicable] C5-GEO08 Area name C5-GEO09 County [see note below] C5-MAN01F Footnotes for C5-MAN01 C5-MAN01 Manufacturing (NAICS 31-33), establishments, 1997: total C5-MAN02F Footnotes for C5-MAN02 Manufacturing (NAICS 31-33), establishments with 20 or more employees, C5-MAN02 1997: number C5-MAN03F Footnotes for C5-MAN03 Manufacturing (NAICS 31-33), establishments with 20 or more employees, C5-MAN03 1997: percent C5-MAN04F Footnotes for C5-MAN04 C5-MAN04 Manufacturing (NAICS 31-33), all employees, 1997: number \1 C5-MAN05F Footnotes for C5-MAN05 C5-MAN05 Manufacturing (NAICS 31-33), all employees, 1997: annual payroll ($1,000) C5-MAN06F Footnotes for C5-MAN06 C5-MAN06 Manufacturing (NAICS 31-33), production workers, 1997: number \2 C5-MAN07F Footnotes for C5-MAN07 C5-MAN07 Manufacturing (NAICS 31-33), production workers, 1997: wages ($1,000) </p><p>140 Variable Definition C5-MAN08F Footnotes for C5-MAN08 C5-MAN08 Manufacturing (NAICS 31-33), value added by manufacture ($1,000) C5-MAN09F Footnotes for C5-MAN09 C5-MAN09 Manufacturing (NAICS 31-33), value of shipments ($1,000) C5-WHS01F Footnotes for C5-WHS01 C5-WHS01 Wholesale trade (NAICS 42), 1997: establishments \3 C5-WHS02F Footnotes for C5-WHS02 C5-WHS02 Wholesale trade (NAICS 42), sales, 1997: total ($1,000) \3 C5-WHS03F Footnotes for C5-WHS03 C5-WHS03 Wholesale trade (NAICS 42), sales, 1997: merchant wholesalers ($1,000) \3 C5-WHS04F Footnotes for C5-WHS04 C5-WHS04 Wholesale trade (NAICS 
42), 1997: paid employees \3 \4 C5-WHS05F Footnotes for C5-WHS05 C5-WHS05 Wholesale trade (NAICS 42), 1997: annual payroll ($1,000) \3 C5-WHS06F Footnotes for C5-WHS06 C5-WHS06 Wholesale trade (NAICS 42), 1997: operating expenses ($1,000) \3 Note: If all counties, in which cities are located in, are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes: \1 Average number of production workers plus the number of other (nonproduction) employees for the pay period including March 12. \2 Average number of production workers for the pay periods including the 12th of March, May, August, and November. \3 Includes only establishments with payroll. \4 For pay period including March 12. \5 Data are for consolidated city of Milford, CT; data for Milford city (remainder) not available. \6 Data are for consolidated city of Athens-Clarke County, GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991. \7 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. </p><p>141 \8 Data are for consolidated city of Columbus, GA; data for Columbus city (remainder) not available. \9 Data are for consolidated city of Butte-Silver Bow, MT; data for Butte-Silver Bow (remainder) not available. (A) 0 to 19 employees. (B) 20 to 99 employees. (C) 100 to 249 employees. (E) 250 to 499 employees. (F) 500 to 999 employees. (G) 1,000 to 2,499 employees. (H) 2,500 to 4,999 employees. (I) 5,000 to 9,999 employees. (J) 10,000 to 24,999 employees. (K) 25,000 to 49,999 employees. (L) 50,000 to 99,999 employees. (M) 100,000 employees or more. </p><p>Sources: Manufacturing -- U.S. 
Census Bureau, 1997 Economic Census -- Manufacturing, generated by Statistical Compendia Branch, using American FactFinder at <http://www.census.gov/>, (June 2000) [related Internet site <http://www.census.gov/epcd/www/97EC31.HTM>]. Wholesale Trade -- U.S. Census Bureau, 1997 Economic Census, individual state .pdf files from <http://www.census.gov/epcd/www/97EC42.HTM> (accessed June 2000) and ECON97 Report Series CD-ROM, CD-EC97-1, Disc 1E, issued February 2001. </p><p>Table A-6: Variables in Table C-6 — Retail Trade and Accommodation and Foodservices </p><p>Variable Definition C6-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code C6-GEO02 USPS 2-letter state abbreviation C6-GEO03 FIPS 2-digit state code C6-GEO04 FIPS 5-digit place code C6-GEO05 FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not applicable; see Note below] C6-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below] C6-GEO07 FIPS 4-digit NECMA code [blank if not applicable] C6-GEO08 Area name C6-GEO09 County [see note below] C6-RTL01F Footnotes for C6-RTL01 C6-RTL01 Retail trade (NAICS 44-45), 1997: establishments \1 C6-RTL02F Footnotes for C6-RTL02 C6-RTL02 Retail trade (NAICS 44-45), sales, 1997: total ($1,000) \1 C6-RTL03F Footnotes for C6-RTL03 C6-RTL03 Population, 1997 (July 1) C6-RTL04F Footnotes for C6-RTL04 C6-RTL04 Retail trade (NAICS 44-45), sales, per capita, 1997: amount (dollars) \1 \2 C6-RTL05F Footnotes for C6-RTL05 C6-RTL05 Retail trade (NAICS 44-45), sales, per capita, 1997: percent of national average \1 \2 C6-RTL06F Footnotes for C6-RTL06 C6-RTL06 Retail trade (NAICS 44-45), sales, from general merchandise stores, 1997: amount ($1,000) \1 C6-RTL07F Footnotes for C6-RTL07 </p><p>C6-RTL07 Retail trade (NAICS 44-45), sales, from general merchandise stores, 1997: percent \1 C6-RTL08F Footnotes for C6-RTL08 C6-RTL08 Retail trade (NAICS 44-45), 1997: paid employees \1 \3 C6-RTL09F Footnotes 
for C6-RTL09 C6-RTL09 Retail trade (NAICS 44-45), annual payroll, 1997: total ($1,000) \1 C6-RTL10F Footnotes for C6-RTL10 C6-RTL10 Retail trade (NAICS 44-45), annual payroll, 1997: paid per employee (dollars) \1 C6-AFS01F Footnotes for C6-AFS01 C6-AFS01 Accommodation and foodservices (NAICS 72), 1997: establishments \1 C6-AFS02F Footnotes for C6-AFS02 C6-AFS02 Accommodation and foodservices (NAICS 72), sales, 1997: total ($1,000) \1 C6-AFS03F Footnotes for C6-AFS03 Accommodation and foodservices (NAICS 72), sales, from food services, 1997: C6-AFS03 total ($1,000) \1 C6-AFS04F Footnotes for C6-AFS04 Accommodation and foodservices (NAICS 72), sales, from food services, 1997: C6-AFS04 percent \1 \4 C6-AFS05F Footnotes for C6-AFS05 C6-AFS05 Accommodation and foodservices (NAICS 72), 1997: paid employees \1 \3 C6-AFS06F Footnotes for C6-AFS06 C6-AFS06 Accommodation and foodservices (NAICS 72), 1997: annual payroll ($1,000) \1 Note: If all counties, in which cities are located in, are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes: \1 Includes only establishments with payroll. \2 Based on resident population estimated as of July 1, 1997. \3 For pay period including March 12. \4 Foodservices and drinking places (NAICS 722) includes full-service restaurants, limited- service eating places, special food services, and drinking places (alcoholic beverages). </p><p>144 \5 Data are for consolidated city of Milford, CT; data for Milford city (remainder) not available. \6 Data are for consolidated city of Athens-Clarke County, GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991. 
\7 Data are for Augusta-Richmond County (balance), GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. \8 Data are for consolidated city of Columbus, GA; data for Columbus city (remainder) not available. \9 Data are for consolidated city of Butte-Silver Bow, MT; data for Butte-Silver Bow (remainder) not available. (A) 0 to 19 employees. (B) 20 to 99 employees. (C) 100 to 249 employees. (E) 250 to 499 employees. (F) 500 to 999 employees. (G) 1,000 to 2,499 employees. (H) 2,500 to 4,999 employees. (I) 5,000 to 9,999 employees. (J) 10,000 to 24,999 employees. (K) 25,000 to 49,999 employees. (L) 50,000 to 99,999 employees. (M) 100,000 employees or more. </p><p>Sources: Retail Trade -- U.S. Census Bureau, 1997 Economic Census, ECON97 Report Series CD- ROM, CD-EC97-1, Disc 1D issued July 2000 and Disc 1E issued February 2001 (related Internet site <http://www.census.gov/epcd/www/97EC44.HTM>). 1997 Population for Retail Trade calculations -- U.S. Census Bureau, Population Estimates for Places: Annual Time Series, July 1, 1990 to July 1, 1999 (SU-99-7), published 20 October 2000, <http://www.census.gov/estimates/metro-city/placesbyst/SC99T7_US.txt>. Now updated to <http://eire.census.gov/popest/archives/place/placbyst/SC99T7_US.txt>. Accommodation and Foodservices -- U.S. Census Bureau, 1997 Economic Census, ECON97 Report Series CD-ROM, Disc 1E issued February 2001 (related Internet site <http://www.census.gov/epcd/www/97EC72.HTM>). 
</p><p>145 </p><p>Table A-7: Variables in Table C-7 — Government Finances and Climate </p><p>Variable Definition C7-GEO01 Federal Information Processing Standards (FIPS) 7-digit state/place code C7-GEO02 USPS 2-letter state abbreviation C7-GEO03 FIPS 2-digit state code C7-GEO04 FIPS 5-digit place code FIPS 4-digit metropolitan area (MSA or CMSA) code(s) [blank if not C7-GEO05 applicable; see Note below] C7-GEO06 FIPS 4-digit PMSA code(s) [blank if not applicable; see Note below] C7-GEO07 FIPS 4-digit NECMA code [blank if not applicable] C7-GEO08 Area name C7-GEO09 County [see note below] C7-GVF01F Footnotes for C7-GVF01 C7-GVF01 Population, July 1997 C7-GVF02F Footnotes for C7-GVF02 C7-GVF02 City government finances, general revenue, 1996-1997: total ($1,000) C7-GVF03F Footnotes for C7-GVF03 City government finances, general revenue, 1996-1997: per capita (dollars) C7-GVF03 \1 C7-GVF04F Footnotes for C7-GVF04 City government finances, general revenue from taxes, 1996-1997: total C7-GVF04 ($1,000) C7-GVF05F Footnotes for C7-GVF05 City government finances, general revenue from taxes, 1996-1997: per capita C7-GVF05 (dollars) \1 C7-GVF06F Footnotes for C7-GVF06 C7-GVF06 City government finances, general expenditure, 1996-1997: total ($1,000) C7-GVF07F Footnotes for C7-GVF07 </p><p>146 Variable Definition City government finances, general expenditure, 1996-1997: per capita C7-GVF07 (dollars) \1 C7-GVF08F Footnotes for C7-GVF08 City government finances, general expenditure, police protection, 1996- C7-GVF08 1997: total ($1,000) C7-GVF09F Footnotes for C7-GVF09 City government finances, general expenditure, police protection, 1996- C7-GVF09 1997: percent C7-GVF10F Footnotes for C7-GVF10 City government finances, general expenditure, sewerage and solid waste C7-GVF10 management, 1996-1997: total ($1,000) C7-GVF11F Footnotes for C7-GVF11 City government finances, general expenditure, sewerage and solid waste C7-GVF11 management, 1996-1997: percent C7-GVF12F Footnotes for 
C7-GVF12 C7-GVF12 City government finances, general expenditure, highways, 1996-1997: total ($1,000) C7-GVF13F Footnotes for C7-GVF13 C7-GVF13 City government finances, general expenditure, highways, 1996-1997: percent C7-CLM01 Climate, average daily temperature (degrees Fahrenheit): January \2 C7-CLM02 Climate, average daily temperature (degrees Fahrenheit): July \2 C7-CLM03 Climate, average daily minimum temperature (degrees Fahrenheit): January \2 \3 C7-CLM04 Climate, average daily maximum temperature (degrees Fahrenheit): July \2 \4 C7-CLM05 Climate: annual precipitation (inches) \2 C7-CLM06 Climate: heating degree days \2 \5 C7-CLM07 Climate: cooling degree days \2 \6 </p><p>Note: If all counties, in which cities are located in, are in one metropolitan area, one MSA or PMSA code is listed, as applicable; otherwise, a code is listed for each corresponding county with (X) meaning that the county is not in a metropolitan area. </p><p>Footnotes: \1 Based on resident population estimated as of July 1, 1997. \2 Represents normal values for the 30-year period 1961-1990. \3 Average daily minimum. \4 Average daily maximum. \5 One heating degree day is accumulated for each whole degree that the mean daily temperature is below 65 degrees Fahrenheit. \6 One cooling degree day is accumulated for each whole degree that the mean daily temperature is above 65 degrees Fahrenheit. \7 Data are for consolidated city of Milford, CT; data for Milford city (remainder) not available. \8 Data are for consolidated city of Jacksonville, FL; data for Jacksonville city (remainder) not available. \9 Data are for consolidated city of Athens-Clarke County, GA; Athens city (1990 land area - 16.6 square miles; 1990 population - 45,734) merged with Clarke County effective January 14, 1991. 
\10 Data are for consolidated city of Augusta-Richmond County, GA; Augusta city (1990 land area - 19.7 square miles; 1990 population - 44,639) merged with Richmond County effective January 1, 1996. \11 Data are for consolidated city of Columbus, GA; data for Columbus city (remainder) not available. \12 Data for Honolulu CDP are for Honolulu City/County, HI which includes all CDPs listed except Hilo which is in Maui County. \13 Data are for consolidated city of Indianapolis, IN; data for Indianapolis city (remainder) not available. \14 Data are for consolidated city of Butte-Silver Bow, MT; data for Butte-Silver Bow (remainder) not available. \15 Data are for consolidated city of Nashville-Davidson, TN; data for Nashville-Davidson (remainder) not available. </p><p>Sources: City Government Finances -- U.S. Census Bureau, 1997 Census of Governments, Volume 4, Government Finances, GC97(4)-4, Finances of Municipal and Township Governments, issued September 2000 (related Internet site <http://www.census.gov/govs/www/cog.html>). 1997 Population for Government Finances calculations -- U.S. Census Bureau, Population Estimates for Places: Annual Time Series, July 1, 1990 to July 1, 1999 (SU-99-7), published 20 October 2000, <http://www.census.gov/estimates/metro- city/placesbyst/SC99T7_US.txt>. Now updated to <http://eire.census.gov/popest/archives/place/placbyst/SC99T7_US.txt>. Climate -- U.S. 
National Oceanic and Atmospheric Administration (NOAA), National Climatic Data Center (NCDC), Climatography of the United States, Number 81, published January 1992 (related Internet site <http://lwf.ncdc.noaa.gov/oa/climate/normals/usnormals.html>).</p><p>Appendix B </p><p>Major Variables and Their Loadings for Selected Components </p><p>Table B-1: Principal Component I </p><p>Variable Component Loading Definition </p><p>C1-POP01 0.997 Population, 2000 (April 1): number </p><p>C2-AGE09 0.997 Population 45 to 64 years, 2000: number </p><p>C2-POP01 0.997 Population, 2000 (April 1): number </p><p>C2-POP03 0.997 Population by sex, 2000: females </p><p>C3-HSD01 0.997 Households, 2000 (April 1): number </p><p>C3-HSD04 0.997 Persons in households, 2000: number </p><p>C3-HSD12 0.997 Households, family households, 2000: number </p><p>C4-HSG01 0.997 Housing: total units, 2000 </p><p>C4-HSG04 0.997 Housing: occupied units, number 2000 </p><p>C2-AGE07 0.996 Population 25 to 44 years, 2000: number </p><p>C2-POP02 0.996 Population by sex, 2000: males </p><p>C6-RTL03 0.996 Population, 1997 (July 1) </p><p>C7-GVF01 0.996 Population, July 1997 </p><p>C1-POP04 0.995 Population, 1990 (April 1): number \3 </p><p>C3-HSD02 0.995 Households, 1990: number </p><p>C4-CRM05 0.995 FBI population, 1999 \1 </p><p>C4-CRM07 0.995 FBI population 1998 \1 </p><p>C4-HSG02 0.995 Housing: total units, 1990 </p><p>C3-HSD08 0.994 Households, one or more persons under 18 years, 2000: number </p><p>C3-HSD13 0.994 Households, family households, with own children under 18 years, 2000: number </p><p>C2-AGE11 0.993 Population 65 to 74 years, 2000: number </p><p>C3-HSD15 0.992 Households, family households, married couple, 2000: number </p><p>C4-CLF01 0.992 Civilian labor force, 2000: total </p><p>C2-AGE03 0.991 Population 5 to 17 years, 2000: number </p><p>C3-HSD06 0.991 Households, one person, 2000: number </p><p>C3-HSD10 0.991 
Households, one or more persons 65 years or older, 2000: number </p><p>C4-CLF02 0.991 Civilian labor force, 1999: total </p><p>C3-HSD21 0.990 Households, nonfamily households, 2000: number </p><p>C6-RTL01 0.990 Retail trade (NAICS 44-45), 1997: establishments \1 </p><p>C1-POP06 0.989 Population, 1980 (April 1): number </p><p>C2-AGE01 0.989 Population under 5 years, 2000: number </p><p>C2-AGE13 0.989 Population 75 to 84 years, 2000: number </p><p>C3-HSD22 0.989 Households, nonfamily households, 1990: number </p><p>C2-AGE05 0.988 Population 18 to 24 years, 2000: number </p><p>C3-HSD16 0.985 Households, family households, married couple, with own children, 2000: number \3 </p><p>C2-AGE15 0.984 Population 85 years and over, 2000: number </p><p>C3-HSD18 0.983 Households, family households, female householder, 2000: number \4 </p><p>C3-HSD19 0.983 Households, family households, female householder, with own children, 2000: number \3 \4 </p><p>C6-AFS01 0.982 Accommodation and foodservices (NAICS 72), 1997: establishments \1 </p><p>C2-POP05 0.981 Population, one race, White, 2000: number </p><p>C4-CLF04 0.981 Civilian labor force, unemployment 2000: total </p><p>C7-GVF08 0.977 City government finances, general expenditure, police protection, 1996-1997: total ($1,000) </p><p>C2-POP17 0.975 Population, two or more races, 2000: number \2 </p><p>C5-WHS01 0.974 Wholesale trade (NAICS 42), 1997: establishments \3 </p><p>C6-AFS03 0.974 Accommodation and foodservices (NAICS 72), sales, from food services, 1997: total ($1,000) \1 </p><p>C5-MAN01 0.967 Manufacturing (NAICS 31-33), establishments, 1997: total </p><p>C6-RTL09 0.962 Retail trade (NAICS 44-45), annual payroll, 1997: total ($1,000) \1 </p><p>C4-HSG05 0.960 Housing: occupied units, owner occupied, number 2000 </p><p>C5-WHS06 0.957 Wholesale trade (NAICS 42), 1997: operating expenses ($1,000) \3 </p><p>C7-GVF10 0.957 City government finances, general expenditure, sewerage and solid waste 
management, 1996-1997: total ($1,000) </p><p>C3-POP02 0.956 Group quarters population, 2000: institutionalized population \1 \2 </p><p>C5-WHS04 0.955 Wholesale trade (NAICS 42), 1997: paid employees \3 \4 </p><p>C6-AFS02 0.954 Accommodation and foodservices (NAICS 72), sales, 1997: total ($1,000) \1 </p><p>C5-WHS05 0.953 Wholesale trade (NAICS 42), 1997: annual payroll ($1,000) \3 </p><p>C6-RTL08 0.952 Retail trade (NAICS 44-45), 1997: paid employees \1 \3 </p><p>C4-CRM02 0.951 Number of serious crimes known to police, 1999: violent \1 \2 </p><p>C7-GVF12 0.947 City government finances, general expenditure, highways, 1996-1997: total ($1,000) </p><p>C3-POP01 0.946 Group quarters population, 2000: number \1 </p><p>C6-AFS06 0.945 Accommodation and foodservices (NAICS 72), 1997: annual payroll ($1,000) \1 </p><p>C6-RTL02 0.944 Retail trade (NAICS 44-45), sales, 1997: total ($1,000) \1 </p><p>C5-MAN02 0.941 Manufacturing (NAICS 31-33), establishments with 20 or more employees, 1997: number </p><p>C1-POP11 0.927 Hispanic or Latino population, 2000: number \5 </p><p>C4-CRM01 0.923 Number of serious crimes known to police, 1999: total \1 </p><p>C4-CRM04 0.923 Number of serious crimes known to police: 1998 \1 </p><p>C6-AFS05 0.923 Accommodation and foodservices (NAICS 72), 1997: paid employees \1 \3 </p><p>C2-POP15 0.922 Population, one race, some other race, 2000: number \1 </p><p>C2-POP07 0.912 Population, one race, Black or African American, 2000: number </p><p>C2-POP11 0.907 Population, one race, Asian, 2000: number </p><p>C5-WHS02 0.903 Wholesale trade (NAICS 42), sales, 1997: total ($1,000) \3 </p><p>C4-CRM03 0.901 Number of serious crimes known to police, 1999: property \1 \3 </p><p>C7-GVF06 0.901 City government finances, general expenditure, 1996-1997: total ($1,000) </p><p>C7-GVF02 0.896 City government finances, general revenue, 1996-1997: total ($1,000) </p><p>City government finances, general revenue from taxes, 1996-1997: 
C7-GVF04 0.890 total ($1,000) </p><p>C1-POP07 0.811 Population: net change, 1990-2000 </p><p>C2-POP09 0.786 Population, one race, American Indian and Alaska Native, 2000: number </p><p>C2-POP13 0.579 Population, one race, Native Hawaiian or Other Pacific Islander, 2000: number </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-2: Principal Component II </p><p>Variable Component Loading Definition </p><p>C3-HSD09 0.938 Households, one or more persons under 18 years, 2000: percent </p><p>C2-AGE04 0.894 Population 5 to 17 years, 2000: percent of total population </p><p>C3-HSD05 0.885 Persons in households, 2000: persons per household </p><p>C2-AGE02 0.884 Population under 5 years, 2000: percent of total population </p><p>C3-HSD14 0.865 Households, family households, with own children under 18 years, 2000: percent </p><p>C3-HSD17 0.850 Households, family households, married couple, with own children, 2000: percent \3 </p><p>C2-POP16 0.787 Population, one race, some other race, 2000: percent of total population \1 </p><p>C3-HSD07 -0.755 Households, one person, 2000: percent </p><p>C1-POP12 0.727 Hispanic or Latino population, 2000: percent of total population \5 </p><p>C4-CLF05 0.513 Civilian labor force, unemployment 2000: rate \5 </p><p>C7-CLM05 -0.448 Climate: annual precipitation (inches) \2 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. 
</p><p>Table B-3: Principal Component III </p><p>Component Variable Definition Loading </p><p>C7-CLM01 0.937 Climate, average daily temperature (degrees Fahrenheit): January \2 </p><p>C7-CLM03 0.937 Climate, average daily minimum temperature (degrees Fahrenheit): January \2 \3 </p><p>C7-CLM06 -0.905 Climate: heating degree days \2 \5 </p><p>C3-HSD20 -0.455 Households, family households, female householder, with own children, 2000: percent \3 \4 </p><p>C4-CLF03 0.450 Civilian labor force, 1999-2000: percent change </p><p>C2-POP12 0.445 Population, one race, Asian, 2000: percent of total population </p><p>C6-RTL10 0.384 Retail trade (NAICS 44-45), annual payroll, 1997: paid per employee (dollars) \1 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-4: Principal Component IV </p><p>Component Variable Definition Loading </p><p>C3-HSD03 0.955 Households: percent change, 1990-2000 </p><p>C4-HSG03 0.952 Housing: percent change, 1990-2000 </p><p>C3-HSD23 0.864 Households, nonfamily households: percent change, 1990-2000 </p><p>C1-POP09 0.850 Population: percent change, 1990-2000 </p><p>C1-POP10 0.659 Population: percent change, 1980-1990 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. 
</p><p>Table B-5: Principal Component V </p><p>Component Variable Definition Loading </p><p>C3-HSD11 0.861 Households, one or more persons 65 years or older, 2000: percent </p><p>C2-AGE14 0.783 Population 75 to 84 years, 2000: percent of total population </p><p>C2-AGE12 0.727 Population 65 to 74 years, 2000: percent of total population </p><p>C2-AGE16 0.727 Population 85 years and over, 2000: percent of total population </p><p>C2-AGE08 -0.678 Population 25 to 44 years, 2000: percent of total population </p><p>C2-POP04 -0.495 Population by sex: males per 100 females (April 1) </p><p>Table B-6: Principal Component VI </p><p>Component Variable Definition Loading </p><p>C4-CRM08 0.869 Crime rate of serious crimes known to police, 1998 \1 \4 </p><p>C4-CRM06 0.863 Crime rate of serious crimes known to police, 1999 \1 \4 </p><p>C2-POP08 0.738 Population, one race, Black or African American, 2000: percent of </p><p>C2-POP06 -0.439 Population, one race, White, 2000: percent of total population </p><p>C7-GVF11 0.316 City government finances, general expenditure, sewerage and solid waste management, 1996-1997: percent </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-7: Principal Component VII </p><p>Component Variable Definition Loading </p><p>C7-CLM02 0.917 Climate, average daily temperature (degrees Fahrenheit): July \2 </p><p>C7-CLM04 0.834 Climate, average daily maximum temperature (degrees Fahrenheit): \4 </p><p>C7-CLM07 0.789 Climate: cooling degree days \2 \6 </p><p>C2-POP18 -0.451 Population, two or more races, 2000: percent of total population \2 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. 
</p><p>Table B-8: Principal Component VIII </p><p>Component Variable Definition Loading </p><p>C7-GVF07 0.876 City government finances, general expenditure, 1996-1997: per capita (dollars) \1 </p><p>C7-GVF03 0.873 City government finances, general revenue, 1996-1997: per capita (dollars) \1 </p><p>C7-GVF05 0.804 City government finances, general revenue from taxes, 1996-1997: per capita (dollars) \1 </p><p>C7-GVF09 -0.571 City government finances, general expenditure, police protection, 1996-1997: percent </p><p>C7-GVF13 -0.459 City government finances, general expenditure, highways, 1996-1997: percent </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-9: Principal Component IX </p><p>Component Variable Definition Loading </p><p>C2-AGE06 -0.924 Population 18 to 24 years, 2000: percent of total population </p><p>C2-AGE10 0.723 Population 45 to 64 years, 2000: percent of total population </p><p>C2-AGE17 0.672 Population, 2000: median age (years) </p><p>C4-HSG06 0.582 Housing: occupied units, owner occupied, percent 2000 </p><p>Table B-10: Principal Component X </p><p>Component Variable Definition Loading </p><p>C1-POP02 -0.820 Population, 2000 (April 1): rank \2 </p><p>C1-POP05 -0.816 Population, 1990 (April 1): rank \3 \4 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-11: Principal Component XI </p><p>Component Variable Definition Loading </p><p>C6-RTL04 0.867 Retail trade (NAICS 44-45), sales, per capita, 1997: amount (dollars) \1 \2 </p><p>C6-RTL05 0.867 Retail trade (NAICS 44-45), sales, per capita, 1997: percent of national average \1 \2 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. 
For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>161 </p><p>Table B-12: Principal Component XII </p><p>Factor Variable Definition Loading </p><p>C1-POP08 0.654 Population: net change, 1980-1990 </p><p>C1-IND01 0.463 Land area, 2000 (square miles) \1 </p><p>Note: See explanation of the numbered remarks in the footnotes of corresponding tables in Appendix A. For example, the remark in the definition of C4-CRM03 is explained in the footnote of Table A-4. </p><p>Table B-13: Principal Component XIII </p><p>Factor Variable Definition Loading Population, one race, American Indian and Alaska Native, 2000: C2-POP10 0.744 percent of total population Manufacturing (NAICS 31-33), establishments with 20 or more C5-MAN03 -0.583 employees, 1997: percent </p><p>Table B-14: Principal Component XIV </p><p>Factor Variable Definition Loading </p><p>C1-POP03 0.342 Population per square mile, 2000 </p><p>Appendix C </p><p>Header Statistics of 490 Wikipedia City Entries </p><p>163 ID Header Count ID Header Count 1 External links 482 31 Law and government 33 2 Demographics 479 32 Attractions 32 3 Geography 403 33 Museums 32 4 History 383 34 Politics 32 5 Education 320 35 Crime 30 6 References 314 36 Further reading 30 7 Transportation 259 37 Private schools 30 8 Media 157 38 Airports 29 9 Economy 150 39 Rail 27 10 See also 142 Primary and secondary 40 26 11 Sister cities 139 schools 12 Sports 129 41 Famous residents 25 13 Climate 113 42 Music 25 14 Government 96 43 Air 24 15 Culture 94 44 Geography and Climate 24 16 Points of interest 73 45 Highways 23 17 Notable residents 71 46 Religion 22 18 Neighborhoods 60 47 Higher education 21 19 Trivia 59 48 Notable natives 21 20 Colleges and universities 50 49 Parks 21 21 Television 42 50 Shopping 21 22 Public schools 39 51 Languages 20 23 Radio 39 52 Events 19 24 Sister Cities 39 53 Government and politics 19 25 Cityscape 38 54 Topography 19 26 Geography and climate 38 55 Tourism 19 27 Notes 
38 56 Architecture 18 28 Infrastructure 36 57 Libraries 18 29 Schools 36 58 Performing arts 18 30 Newspapers 34 59 Recreation 18 </p><p>164 ID Header Count ID Header Count 60 Utilities 18 88 Famous Residents 10 61 Downtown 16 89 Other 10 62 Emergency services 16 90 Bibliography 9 63 Print 15 91 Industry 9 64 Public transportation 15 92 Major highways 9 65 Sites of interest 15 93 People and culture 9 66 Business 14 94 Public libraries 9 67 Colleges and Universities 14 95 Public Schools 9 68 Festivals 14 96 Current estimates 8 69 Gallery 13 97 Elementary schools 8 70 High schools 13 98 Healthcare 8 71 Airport 12 99 Major Highways 8 72 Arts 12 100 Notable people 8 73 Landmarks 12 101 Popular culture 8 74 Mass transit 12 102 Radio stations 8 Notable natives and 103 Railroads 8 75 12 residents 104 Sister City 8 76 Private Schools 12 105 Sister city 8 77 Sources 12 Entertainment and 106 7 78 Higher Education 11 performing arts 79 Notable Residents 11 Geography and 107 7 80 Parks and recreation 11 environment Primary and secondary 108 Miscellaneous 7 81 11 education 109 Nicknames 7 82 Roads 11 Parks and outdoor 110 7 83 Television stations 11 attractions 84 Arts and culture 10 111 Sports teams 7 85 Bus 10 112 Suburbs 7 86 Colleges 10 113 Art 6 87 Early history 10 114 Educational institutions 6 </p><p>165 ID Header Count ID Header Count 115 Footnotes 6 143 Radio Stations 5 116 Geology 6 144 Weather 5 117 Health systems 6 145 19th century 4 118 High Schools 6 146 20th Century 4 119 Literature 6 147 Buses 4 120 Mayor 6 148 City Government 4 121 Metropolitan area 6 149 Community 4 122 Middle schools 6 150 Cuisine 4 123 Nightlife 6 151 Cultural references 4 124 Points of Interest 6 Culture and contemporary 152 4 125 Population 6 life 126 Print media 6 153 Culture and recreation 4 127 Public 6 154 Entertainment 4 128 Public education 6 Famous natives and 155 4 129 Sports and recreation 6 residents 130 Today 6 156 Highway 4 131 20th century 5 157 Hospitals 4 132 City government 
5 158 Hurricane Katrina 4 133 Elementary Schools 5 159 Interesting facts 4 134 Environment 5 160 Light rail 4 Federal, state and county 161 Local attractions 4 135 5 representation 162 Local media 4 136 Founding 5 163 Location 4 137 In popular culture 5 164 Maps 4 138 Library 5 165 Mayors 4 139 Local government 5 Medical centers and 166 4 140 Magazines 5 hospitals Notable residents and 167 Notes and references 4 141 5 natives 168 Other points of interest 4 142 Private education 5 169 Others 4 </p><p>166 ID Header Count ID Header Count 170 Places of interest 4 199 Famous Citizens 3 171 Post‐secondary 4 200 Famous citizens 3 172 Post‐secondary education 4 201 Features 3 173 Private 4 202 Healthcare and utilities 3 174 Pronunciation 4 Historical structures and 203 3 175 Public High Schools 4 museums 176 Public Transit 4 204 History and culture 3 177 Rail transportation 4 205 Households 3 178 Railroad 4 206 Housing 3 179 References and notes 4 207 Incorporation 3 180 Road 4 208 K‐12 3 181 Roads and highways 4 209 Law enforcement 3 182 21st century 3 210 Layout 3 183 Adjacent towns 3 211 Lifestyle 3 184 Air travel 3 212 Local Media 3 Annual cultural events 213 Major Roads 3 185 3 and fairs 214 Major streets 3 186 Annual events 3 215 Mardi Gras 3 187 Arts and entertainment 3 216 Middle Schools 3 188 Arts and theatre 3 217 Military 3 189 Athletes 3 National Register of 218 3 190 Athletics 3 Historic Places 191 Aviation 3 219 Nearby attractions 3 192 Business and industry 3 220 Newspaper 3 193 Charter schools 3 221 Notable citizens 3 194 City Council 3 222 Notable inhabitants 3 195 Community Events 3 223 Notable Natives 3 196 Companies 3 224 Notable People 3 197 Description 3 Notable residents, past 225 3 198 Facts 3 and present </p><p>167 ID Header Count ID Header Count 226 Notes and References 3 253 Universities and colleges 3 On the National Register 254 Visual arts 3 227 3 of Historic Places 255 Water 3 228 Other information 3 256 2000s 2 Parks and outdoor 257 Accent 2 
229 3 recreation 258 Accolades 2 230 Planning 3 259 Actors 2 231 Police 3 260 Adjacent municipalities 2 232 Postal service 3 261 Agriculture 2 Primary and Secondary 262 Air Transport 2 233 3 Education 263 AM 2 234 Professional 3 264 AM radio 2 235 Professional sports 3 265 Amateur 2 236 Professional Sports Teams 3 266 American Revolution 2 237 Public Elementary Schools 3 267 Annual events and fairs 2 238 Public safety 3 268 Area attractions 2 239 Public transport 3 Area colleges and 269 2 240 Public Transportation 3 universities 241 Recent History 3 270 Art and Culture 2 242 Redevelopment 3 271 Arts & Culture 2 243 Revitalization 3 Attractions and points of 272 2 244 Sea 3 interest 245 Seaports 3 273 Awards 2 246 Tallest buildings 3 274 Beginnings 2 247 Television Stations 3 275 Bodies of water 2 248 The arts 3 276 Bridges 2 249 Theater 3 277 Bus service 2 250 Theatre 3 278 Casinos 2 251 Timeline 3 279 Churches 2 252 Twentieth Century 3 280 City council 2 </p><p>168 ID Header Count ID Header Count 281 City Manager 2 308 Downtown revitalization 2 282 Civil War 2 309 Early History 2 Civil War and 310 Early settlers 2 283 2 Reconstruction 311 Economic 2 284 College 2 312 Economic development 2 285 Colleges & Universities 2 313 Economic history 2 286 Colonial era 2 Elementary and 314 2 287 Commerce 2 secondary 288 Communications 2 315 Entertainers 2 289 Community information 2 316 Environmental features 2 290 Controversies and crime 2 Environmental features 317 2 291 Controversy 2 and geography 292 Council members 2 318 Executive 2 293 Courts 2 319 External link 2 294 Crime statistics 2 320 Famous people 2 295 Criminal 2 321 Ferry 2 296 Cultural 2 322 Film and television 2 297 Cultural attractions 2 323 Film locations 2 Culture and 324 Films 2 298 2 entertainment 325 Fire Department 2 299 Culture and Recreation 2 326 FM 2 300 Culture and the arts 2 327 FM radio 2 301 Current teams 2 328 Food 2 302 Cycling 2 329 Football 2 303 Dialect 2 330 For‐profit universities 2 304 
Disasters 2 331 Freeways and highways 2 305 Distances 2 332 Gardens 2 306 Diversity 2 333 Gentrification 2 Downtown 334 Geography and cityscape 2 307 2 redevelopment 335 Geography and geology 2 </p><p>169 ID Header Count ID Header Count 336 Golf Courses 2 365 Mass Transit 2 337 Government and Politics 2 366 Media and entertainment 2 338 Harbors 2 367 Media Outlets 2 339 Health and utilities 2 368 Medicine 2 340 Health Care 2 369 Metropolitan Area 2 341 Healthcare and medicine 2 Metropolitan Statistical 370 2 342 Historic sites 2 Area 343 Historical Timeline 2 Metropolitan statistical 371 2 344 <a href="/tags/Hospital/" rel="tag">Hospital</a> 2 area 345 Hurricanes 2 372 Military bases 2 346 Image 2 373 Motorsports 2 347 Images 2 374 Movies 2 348 In film 2 375 Musicians 2 349 Income 2 376 Native Americans 2 350 Industry and business 2 377 Nearby towns and cities 2 351 Intercity rail 2 378 Notable businesses 2 352 Internet Radio 2 Notable Natives and 379 2 353 Judicial 2 Residents 354 Lakes 2 Notable people, past and 380 2 355 Land 2 present LDS Missionary Training Notable Residents (Past & 356 2 381 2 Center Present) 357 Legislative 2 382 Notables 2 358 Line note references 2 383 Open space 2 359 Local culture 2 384 Other notables 2 360 Major Employers 2 385 Outdoor recreation 2 361 Major employers 2 386 Overview 2 362 Major events 2 387 Parks and gardens 2 363 Major routes 2 388 Passenger transportation 2 364 Maps and aerial photos 2 389 Personal income 2 </p><p>170 ID Header Count ID Header Count 390 Places 2 419 Sights 2 391 Places of Interest 2 420 Special Events 2 392 Political and business 2 421 Special events 2 393 Politicians 2 422 Sports & Recreation 2 394 Politics and Government 2 423 Sports and athletics 2 395 Politics and government 2 424 Sports and event venues 2 396 Pop Culture 2 425 Sports figures 2 Popular culture 426 Sports Teams 2 397 2 references 427 Stadiums 2 398 Ports 2 428 State 2 399 Private Education 2 429 Statistics 2 400 Private High 
Schools 2 Surrounding cities and 430 2 401 Private high schools 2 suburbs 402 Private primary schools 2 Surrounding cities and 431 2 403 Professional sports teams 2 towns 404 Public charter schools 2 432 The Arts 2 405 Public high schools 2 433 Tourism and recreation 2 406 Public Middle Schools 2 434 Tourist attractions 2 407 Public transit 2 435 Township 2 408 Public utilities 2 436 Traditions 2 409 Rail Transportation 2 437 Transit 2 410 Real Estate 2 438 Travel 2 411 Recent Development 2 439 Universities 2 412 Recent developments 2 440 Universities and Colleges 2 413 Recent events 2 441 Venues 2 414 Recent history 2 442 Warm Springs 2 415 Religious institutions 2 443 Waterfront 2 416 Rock 2 444 Waterways 2 417 Scientology 2 445 World War II 2 418 Secondary Education 2 446 Zip codes 2 </p><p>Appendix D </p><p>Snapshots of Wikipedia Entries19 </p><p>19 Obtained on October 29, 2007. 172 </p><p>Figure D-1: Snapshot of Wikipedia entry of State College, PA. 173 </p><p>Figure D-2: Snapshot of Wikipedia entry of State College, PA 174 </p><p>Figure D-3: Snapshot of Wikipedia entry of Ann Arbor, MI. 175 </p><p>Figure D-4: Snapshot of Wikipedia entry of Ann Arbor, MI. 176 </p><p>Figure D-5: Snapshot of Wikipedia entry of Boston, MA. 177 </p><p>Figure D-6: Snapshot of Wikipedia entry of Boston, MA. 178 </p><p>Figure D-7: Snapshot of Wikipedia entry of Chicago, IL. 179 </p><p>Figure D-8: Snapshot of Wikipedia entry of Chicago, IL. 180 </p><p>Figure D-9: Snapshot of the Wikipedia entry Houston, TX. 181 </p><p>Figure D-10: Snapshot of Wikipedia entry of Houston, TX. 182 </p><p>Figure D-11: Snapshot of Wikipedia entry of Las Vegas, NV. 183 </p><p>Figure D-12: Snapshot of Wikipedia entry of Las Vegas, NV. 184 </p><p>Figure D-13: Snapshot of Wikipedia entry of Los Angeles, CA. 185 </p><p>Figure D-14: Snapshot of Wikipedia entry of Los Angeles, CA. 186 </p><p>Figure D-15: Snapshot of Wikipedia entry of Philadelphia, PA. 
 </p><p>Figure D-16: Snapshot of Wikipedia entry of Philadelphia, PA. </p><p>Figure D-17: Snapshot of Wikipedia entry of Reno, NV. </p><p>Figure D-18: Snapshot of Wikipedia entry of Reno, NV. </p><p>Figure D-19: Snapshot of Wikipedia entry of San Diego, CA. </p><p>Figure D-20: Snapshot of Wikipedia entry of San Diego, CA. </p><p>Figure D-21: Snapshot of Wikipedia entry of San Francisco, CA. </p><p>Figure D-22: Snapshot of Wikipedia entry of San Francisco, CA. </p><p>Appendix E </p><p>Three Wikipedia Sections of Ten Cities20 </p><p>20 The entry is generated by the Java Bliki engine <http://en.wikipedia.org/w/api.php>. The engine creates an HTML page from the Wikipedia entry export obtained on August 3, 2007 via the Wikipedia export URL: </p><p> http://en.wikipedia.org/wiki/Special:Export/ </p><p>The special fields (e.g., maps and geographical coordinates) cannot be replicated exactly and are replaced with their alternate texts enclosed in curly brackets ({{an alternate text}}). Some are removed for readability. The engine does not include the pictures themselves, only their placeholders (empty square boxes) and their captions at the bottom. </p><p>E.1 Ann Arbor </p><p>E.1.1 Economy </p><p>The University of Michigan shapes Ann Arbor's economy significantly. It employs approximately 30,000 workers, including about 7,500 in the medical center. Other employers are drawn to the area by the university's research and development money, and by its graduates. High tech, health services and biotechnology are other major components of the city's economy; numerous medical offices, laboratories, and associated companies are located in the city. Automobile manufacturers, such as General Motors and Ford, also employ residents. </p><p>Nickels Arcade interior, looking towards the east </p><p>Many high-tech companies are located in the city. 
During the 1980s, Ann Arbor Terminals manufactured a video-display terminal called the Ann Arbor Ambassador. Other high-tech companies in the area include Arbor Networks (provider of Internet traffic engineering and security systems), Arbortext (provider of XML-based publishing software), JSTOR (the digital scholarly journal archive), MediaSpan Media Software (provider of newspaper publishing software and ASP services), and ProQuest, which includes UMI. </p><p>Websites and online media companies in the city include All Media Guide, Everything2, and the Weather Underground. Ann Arbor is also the site of the Michigan Information Technology Center (MITC), whose offices house Internet2 and the Merit Network, a not-for-profit research and education computer network. On July 11, 2006, Google announced plans to open a 1000-employee Ann Arbor office for its AdWords program later in the year. </p><p>Pfizer, the city's second-largest employer, operates a large pharmaceutical research facility on the northeast side of Ann Arbor. On January 22, 2007, Pfizer announced it would close operations in Ann Arbor by the end of 2008. The facility was previously operated by Warner-Lambert and, before that, Parke-Davis. The city is the home of other research and engineering centers, including those of General Dynamics and the National Oceanic and Atmospheric Administration (NOAA). Other research centers sited in the city are the United States Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory and the Toyota Technical Center. </p><p>Several major companies are headquartered in Ann Arbor. The original Borders Books was opened on Ann Arbor's State Street in 1971 by brothers Tom and Louis Borders, and began operating other outlets around the region in 1985. The Borders chain is still based in the city, as is its flagship store (although not in its original location). 
Dogs are allowed inside the flagship store, and the cashiers have a stock of treats for such visitors. Domino's Pizza's headquarters is near Ann Arbor on Domino's Farms, a 271-acre (109 hectare) Frank Lloyd Wright-inspired complex just northeast of the city. Flint Ink Corp., another Ann Arbor-based company, was until recently the world's largest privately held ink manufacturer (in October 2005, it was acquired by Stuttgart-based XSYS Print Solutions). Another Ann Arbor-based company is Zingerman's Delicatessen, which serves sandwiches and Jewish foods, and has developed businesses under a variety of brand names. Zingerman's has grown into a very large family of companies which offers a variety of products (bake shop, mail order, creamery) and services (business education). </p><p>Many cooperative enterprises were founded in the city during the 1960s and 1970s; among those that remain are the People's Food Co-op and the Inter-Cooperative Council at the University of Michigan, a student-housing cooperative. The North American Students of Cooperation (NASCO) is an association of cooperatives headquartered in Ann Arbor. There are also three cohousing communities (Sunward, Great Oak, and Touchstone) located immediately to the west of the city limits. </p><p>E.1.2 Climate </p><p>Ann Arbor has a typically Midwestern humid continental seasonal climate, which is influenced by the Great Lakes. There are four seasons, with winters being cold with moderate snowfall while summers can be warm and humid. The area experiences lake effect, primarily in the form of increased cloudiness during late fall and early winter. The highest average temperature is in July at 83 °F (28 °C) while the lowest average temperature is in January at 16 °F (-9 °C). However, summer temperatures can top 90 °F (32 °C), and winter temperatures can drop below 0 °F (-17 °C). 
Average monthly precipitation ranges from 2 to 4 inches (44 to 92 mm), with the heaviest occurring during the summer months. Snowfall, which normally occurs from November to April, ranges from 1 to 10 inches (3 to 25 cm) per month. The highest recorded temperature was 105 °F (40.6 °C) on July 24, 1934, while the lowest recorded temperature was -22.0 °F (-30 °C) on January 19, 1994. </p><p>E.1.3 Culture </p><p>Mural depicting author Herman Hesse (and Woody Allen, Edgar Allan Poe, Franz Kafka and Anaïs Nin) on Liberty Street. </p><p>Many Ann Arbor cultural attractions and events are sponsored by the University of Michigan. Several performing arts groups and facilities are on the university's campus, as are museums dedicated to art, archaeology, and natural history and sciences (see Museums at the University of Michigan). Regional and local performing arts groups not associated with the university include the Ann Arbor Civic Theatre; the Arbor Opera Theater; the Ann Arbor Symphony Orchestra; the Ann Arbor Ballet Theater; the Ann Arbor Civic Ballet (established in 1954 as Michigan's first chartered ballet company); and Performance Network, which operates a downtown theater frequently offering new or nontraditional plays. </p><p>The Ann Arbor Hands-On Museum, located in a renovated and expanded historic downtown fire station, contains more than 250 interactive exhibits featuring science and technology. Multiple art galleries exist in the city, notably in the downtown area and around the University of Michigan campus. Aside from a large restaurant scene in the Main Street, South State Street, and South University Avenue areas, Ann Arbor ranks first among U.S. cities in the number of booksellers and books sold per capita. The Ann Arbor District Library maintains four branch outlets in addition to its main downtown building; in 2008 a new branch building is set to replace the branch located in Plymouth Mall. The city is also home to the Gerald R. 
Ford Presidential Library. </p><p>Several annual events - many of them centered on performing and visual arts - draw visitors to Ann Arbor. One such event is the Ann Arbor Art Fairs, a set of four concurrent juried fairs held on downtown streets, which began in 1960. Scheduled on Wednesday through Saturday in the third week of July, the fairs draw upward of half a million visitors. One event that is not related to visual and performing arts is Hash Bash, held on the first Saturday of April in support of the reform of marijuana laws. It has been celebrated since 1971. The Naked Mile, which features students running naked through the streets in late April to celebrate the end of the winter semester, has occurred since 1986. Beginning in 2000, however, a crackdown by university and city police, citing safety concerns, has reduced the size of the run. </p><p>Ann Arbor has a major scene for college sports, notably at the University of Michigan, a member of the Big Ten Conference. Several well-known college sports facilities exist in the city, including Michigan Stadium, the largest American football stadium in the world with a 107,501 seating capacity. The stadium is colloquially known as "The Big House." Crisler Arena and Yost Ice Arena play host to the school's basketball and ice hockey teams, respectively. Concordia University, a member of the NAIA, also fields sports teams. </p><p>A person from Ann Arbor is called an "Ann Arborite," and many long-time residents call themselves "townies." The city itself is often called A² ("A-squared") or A2 ("A two"), and, less commonly, Tree Town. Recently, some youths have taken to calling Ann Arbor 
With tongue-in-cheek reference to the city's liberal political leanings, some occasionally refer to Ann Arbor as The People's Republic of Ann Arbor or 25 square miles surrounded by reality, the latter phrase being adapted from Wisconsin Governor Lee Dreyfus's description of Madison, Wisconsin. </p><p>199 </p><p>E.2 Boston </p><p>E.2.1 Economy </p><p>See also: Major companies in Greater Boston, List of foreign consulates in Boston </p><p>Left </p><p>Boston's colleges and universities have a major impact on the city and region's economy. Not only are they major employers, but they also attract high-tech industries to the city and surrounding region, like Millennium Pharmaceuticals, Merck & Co., Millipore, Genzyme, and <a href="/tags/Biogen/" rel="tag">Biogen</a> Idec. According to a 2003 report by the Boston Redevelopment Authority, students including computer hardware and software companies as well as biotechnology companies enrolled in Boston's colleges and universities contribute $4.8 billion annually to the city's economy. </p><p>Boston also receives the highest amount of annual funding from the National Institutes of Health of all cities in the United States. </p><p>Tourism comprises a large part of Boston's economy. In 2004 tourists spent $7.9 billion and made the city one of the ten most popular tourist locations in the country. Other important industries include financial services, especially mutual funds and insurance. Boston-based Fidelity Investments helped popularize the mutual fund in the 1980s, and has made Boston one of the top financial cities in the United States. The city is also the regional headquarters of major banks such as Bank of America and Sovereign Bank, and a center for venture capital. Boston is also a printing and publishing center - Houghton Mifflin is headquartered within the city, along with Bedford-St. Martin's Press, Beacon Press, and Little, Brown and Company. 
The city is home to four major convention centers: the Hynes Convention Center in the Back Bay, the Bayside Expo Center in Dorchester, and the World Trade Center Boston and Boston Convention and Exhibition Center on the South Boston waterfront. Because of its status as a state capital and the regional home of federal agencies, law and government is another major component of the city's economy. </p><p>Major companies headquartered within the city include Gillette, owned by Procter & Gamble, and Teradyne, one of the world's leading manufacturers of semiconductor and other electronic test equipment. New Balance has its headquarters in the city. Boston is also home to management consulting firms The Boston Consulting Group and Bain & Company, as well as the private equity group Bain Capital. Other major companies are located outside the city, especially along Route 128. The Port of Boston is a major seaport along the United States' east coast, and is also the oldest continuously operated industrial and fishing port in the Western Hemisphere. </p><p>E.2.2 Climate </p><p>Beacon Hill in the winter. </p><p>Boston experiences a continental climate that is very common in New England, but with distinct maritime influences due to its position on the Atlantic Ocean. Summers are typically hot and humid, while winters are cold, windy and snowy. It has been known to snow in May or October but these events are rare. </p><p>The earliest recorded 90 °F (32.2 °C) temperature in a year was in late March 1998, while February in Boston has seen 70 °F (21 °C) only once in recorded history, on February 24, 1985. Spring in Boston can be hot, with temperatures in the 90s when winds are from offshore, though it is just as possible for a day in late May to remain in the 40s due to cool ocean waters. The hottest month is July, with an average high of 81.9 °F (27.7 °C) and a low of 65.1 °F (18.4 °C); conditions are usually humid. 
The coldest month is January, with an average high of 35.8 °F (2.1 °C) and a low of 21.6 °F (-5.6 °C). Periods exceeding 90 °F in summer and below 10 °F in winter are not uncommon, but rarely prolonged. The record high temperature is 104 °F (40 °C), recorded on July 4, 1911. The record low temperature is -18 °F (-28 °C), recorded on February 9, 1934. </p><p>The city averages about 42 in (108 cm) of rainfall a year. It also coincidentally averages about 42 in (108 cm) of snowfall a year, although this increases dramatically as one goes inland away from the city. Massachusetts' geographic location, jutting out into the North Atlantic, also makes the city very prone to Nor'easter weather systems that can produce much snow and rain. Fog is prevalent, particularly in spring and early summer, and the occasional tropical storm or hurricane can threaten the region, especially in early autumn. </p><p>E.2.3 Culture </p><p>Quincy Market designed by Alexander Parris </p><p>Boston shares many cultural roots with greater New England, including a dialect of the non-rhotic Eastern New England accent known as Boston English, and a regional cuisine with a large emphasis on seafood and dairy products. Irish Americans are a major influence on Boston's politics and religious institutions. Boston also has its own collection of neologisms known as Boston slang. </p><p>Many consider Boston to have a strong sense of cultural identity, perhaps as a result of its intellectual reputation; much of Boston's culture originates at its universities. The city has several ornate theatres, including the Cutler Majestic Theatre, Boston Opera House, The Wang Center for the Performing Arts, Shubert Theater, and the Orpheum Theater. 
Renowned performing arts groups include the Boston Ballet, Boston Symphony Orchestra, Boston Pops, Boston Lyric Opera Company, and the Handel and Haydn Society (one of the oldest choral companies in the United States). There are also many major annual events such as First Night, which occurs on New Year's Eve, and several events during the Fourth of July. These events include the weeklong Harborfest festivities and a Boston Pops concert accompanied by fireworks on the banks of the Charles River. </p><p>Symphony Hall designed by McKim, Mead, and White. </p><p>Because of the city's prominent role in the American Revolution, several historic sites relating to that period are preserved as part of the Boston National Historical Park. Many are found along the Freedom Trail, which is marked by a red line or bricks embedded in the ground. The city is also home to several prominent art museums, including the Museum of Fine Arts and the Isabella Stewart Gardner Museum. In December 2006 the Institute of Contemporary Art moved from its Back Bay location to a new contemporary building designed by Diller Scofidio + Renfro located in the Seaport District. The University of Massachusetts campus at Columbia Point houses the John F. Kennedy Library. The Boston Athenaeum (one of the oldest independent libraries in the United States), Boston Children's Museum, Bull & Finch Pub (whose building is known from the television show Cheers), Museum of Science, and the New England Aquarium are within the city. </p><p>Boston is also one of the birthplaces of the hardcore punk genre of music. Boston musicians have contributed greatly to this music scene over the years (see also Boston hardcore). Boston neighborhoods were home to one of the leading local third wave ska and ska punk scenes in the 1990s, led by bands such as The Mighty Mighty Bosstones, The Allstonians, Skavoovie and the Epitones, and the Dropkick Murphys. The 1980s hardcore punk rock compilation This Is Boston, Not L.A. 
highlights some of the bands that built the genre. Several nightclubs, such as The Channel, Bunnratty's in Allston, and The Rathskeller, were renowned for showcasing both local punk rock bands and those from farther afield. All of these clubs are now closed, and many were razed during recent gentrification.</p><p>E.3 Chicago </p><p>E.3.1 Economy </p><p>The Chicago Board of Trade Building at night </p><p>Chicago has the third largest gross metropolitan product in the nation, approximately $442 billion according to 2007 estimates. The city has also been rated as having the most balanced economy in the United States, due to its high level of diversification. Chicago was named the fourth most important business center in the world in the MasterCard Worldwide Centers of Commerce Index.<ref>"London named world's top business center by MasterCard", CNN, June 13, 2007.</ref> Additionally, the Chicago metropolitan area recorded the greatest number of new or expanded corporate facilities in the United States for five of the past six years. <ref>http://www.siteselection.com/issues/2007/mar/topMetros/. Accessed 03/10/2007 from 'Site Selection Online'.</ref> The Boeing Company relocated its corporate headquarters from Seattle to Chicago in 2001. </p><p>Chicago is a major financial center with the second largest central business district in the U.S. The city is the headquarters of the Federal Reserve Bank of Chicago (the Seventh District of the Federal Reserve). The city is also home to four major financial and futures exchanges, including the Chicago Stock Exchange, the Chicago Board of Trade (CBOT), the Chicago Board Options Exchange (CBOE), and the Chicago Mercantile Exchange (the "Merc"). The city and the surrounding suburbs are home to 66 Fortune 500 companies. Chicago and the surrounding areas also house many major brokerage firms and insurance companies, such as Allstate Corporation and Zürich North America. 
In addition, despite Chicago commonly being perceived as a rust-belt city, a study indicated that Chicago has the largest high-technology and information-technology industry employment in the United States. </p><p>Manufacturing (which includes chemicals, metal, machinery, and consumer electronics), printing and publishing, and food processing also play major roles in the city's economy. Nevertheless, much of the manufacturing occurs outside the city limits, especially since World War II.<ref name="hirsch"> Hirsch, Susan E. (2004-2005). Economic Geography. Encyclopedia of Chicago (online edition).</ref> Several medical products and services companies are headquartered in the Chicago area, including Baxter International, Abbott Laboratories, and the Healthcare Financial Services division of General Electric. Moreover, the construction of the Illinois and Michigan Canal, which helped move goods from the Great Lakes south on the Mississippi River, and of the railroads in the 19th century made the city a major transportation center in the United States. In the 1840s, Chicago became a major grain port, and in the 1850s and 1860s Chicago's pork and beef industry expanded. As the major meat companies grew in Chicago, many, such as Armour, created global enterprises. Though the meatpacking industry currently plays a lesser role in the city's economy,<ref name="hirsch"/> Chicago continues to be a major transportation and distribution center. </p><p>The city is also a major convention destination; Chicago is third in the U.S. behind Las Vegas and Orlando in the number of conventions hosted annually. <ref>Chicago falls to 3rd in U.S. convention industry (4/26/2006). Crain's Chicago Business.</ref> In addition, Chicago is home to eleven Fortune 500 companies, while the metropolitan area hosts an additional 21 Fortune 500 companies.<ref>Fortune 500 2006 - Illinois. 
CNNMoney.com.</ref> Chicago also hosts 12 Fortune Global 500 companies and 17 Financial Times 500 companies. The city claims one Dow 30 company, aerospace giant Boeing, which moved its headquarters from Seattle to the Loop in 2001. The city and its surrounding metropolitan area are also home to the second largest labor pool in the United States with approximately 4.25 million workers. In 2006, Chicago placed 10th on the UBS list of the world's richest cities.<ref name="rich city">{{cite web}}</ref> </p><p>E.3.2 Climate </p><p>The city experiences four distinct seasons. In July, the warmest month, high temperatures average 84.9 °F (29.4 °C) and low temperatures 65.8 °F (18.8 °C). In January, the coldest month, high temperatures average 31.5 °F (-0.3 °C) with low temperatures averaging 17.1 °F (-8.3 °C). According to the National Weather Service, Chicago’s highest official temperature reading of 105 °F (41 °C) was recorded on July 24, 1934. The lowest temperature of -27 °F (-33 °C) was recorded on January 20, 1985. </p><p>Chicago’s yearly precipitation averages about 37 inches (965 mm). Summer is the rainiest season, with short-lived rainfall and thunderstorms more common than prolonged rainy periods.<ref>Chicago Seasonal Temperature and Precipitation Rankings (11/25/2005). National Weather Service Weather Forecast Office - Chicago, IL.</ref> Winter is the driest season, with most of the precipitation falling as snow. The snowiest winter ever recorded in Chicago was 1929-30, with 114.2 inches of snow in total. Chicago’s highest one-day rain total was 6.49 inches (164 mm), on August 14, 1987. </p><p>E.3.3 Culture and contemporary life </p><p>The city's waterfront allure and nightlife have attracted residents and tourists alike. Over one-third of the city population is concentrated in the lakefront neighborhoods (from Rogers Park in the north to Hyde Park in the south). The North Side has a large gay and lesbian community. 
Two neighborhoods in particular, Lakeview and Andersonville (in Edgewater), are home to many LGBT businesses and organizations. The area adjacent to the intersection of Halsted and Belmont is a gay neighborhood known to Chicagoans as "Boystown." The city has many upscale dining establishments as well as many ethnic restaurant districts. These include "Greektown" on South Halsted, "Little Italy" on Taylor Street, just west of Halsted, "Chinatown" on the near South Side, and the South Asian district on Devon Avenue. </p><p>Entertainment and performing arts </p><p>A Chicago jazz club </p><p>Chicago’s theater district spawned modern improvisational comedy. Two renowned comedy troupes emerged—The Second City and I.O. (formerly known as ImprovOlympic). Renowned Chicago theater companies include the Steppenwolf Theatre Company (on the city's north side), the Goodman Theatre, and the Victory Gardens Theater. Other theaters, ranging from nearly 100 storefront performance spaces such as the Strawdog Theatre Company, The House Theatre of Chicago, TimeLine Theatre Company and Remy Bumppo Theatre Company in the Lakeview area to landmark downtown houses such as the Chicago Theatre, present a variety of plays and musicals. </p><p>Broadway in Chicago, created in July 2000, hosts touring productions and Broadway musical previews at the LaSalle Bank Theatre, Cadillac Palace Theatre, Ford Center for the Performing Arts (Oriental Theatre), and the Auditorium Theatre of Roosevelt University. </p><p>Lyric Opera of Chicago, the Chicago Symphony Orchestra, the Joffrey Ballet, and several modern and jazz dance troupes perform. The city's classical music scene includes Music of the Baroque, Chicago Opera Theater, the Chicago Chamber Musicians, Chicago a cappella, and others. </p><p>Various forms of music are distinct to Chicago. Among them are Chicago blues, Chicago soul, jazz, and gospel. 
The city is the birthplace of the house style and is the site of an influential hip-hop scene. In the 1980s, the city was a center for industrial, punk and new wave. This influence continued into the alternative music of the 1990s. The city has been an epicenter for rave culture since the 1980s. A flourishing independent rock music culture brought forth Chicago indie. Annual festivals featuring various acts include Lollapalooza, the Intonation Music Festival and Pitchfork Music Festival. </p><p>Tourism </p><p>Navy Pier </p><p>Chicago attracts about 33 million visitors annually.<ref>http://www.suntimes.com/business/109470,CST-FIN-Tourism25.article</ref> Upscale shopping along the Magnificent Mile, thousands of restaurants, as well as Chicago's eminent architecture, continue to draw tourists. The city is the United States' third-largest convention destination. Most conventions are held at McCormick Place, just south of Soldier Field. </p><p>Navy Pier, 3,000 feet (900 m) long, houses retail, restaurants, museums, exhibition halls, and auditoriums. Its 150-foot-tall (45 m) Ferris wheel is north of Grant Park on the lakefront and is one of the most visited landmarks in the Midwest, attracting about 8 million people annually. </p><p>Crown Fountain </p><p>The historic Chicago Cultural Center (1897), originally serving as the Chicago Public Library, now houses the city's Visitor Information Center, galleries, and exhibit halls. The ceiling of Preston Bradley Hall includes a 38-foot (11 m) Tiffany glass dome. </p><p>Millennium Park is a rebuilt section of a former railyard that was planned for unveiling at the turn of the 21st century, though it was delayed for several years. The park includes the Cloud Gate sculpture (known locally as "The Bean"). To a viewer facing Cloud Gate and Lake Michigan, the sculpture reflects a curved image of the skyline. An outdoor restaurant in Millennium Park becomes an ice skating rink in the winter. Two tall glass sculptures make up the Crown Fountain. 
Architects Krueck & Sexton implemented artist Jaume Plensa's design concept. The fountain's two towers display visual effects from LED images of Chicagoans' faces, with water spouting from their lips. Frank Gehry's detailed stainless steel bandshell, Pritzker Pavilion, hosts the classical Grant Park Music Festival concert series. Behind the pavilion's stage is the Harris Theater for Music and Dance, an indoor venue for mid-sized performing arts companies, including Chicago Opera Theater and Music of the Baroque. Gehry's stainless steel BP Bridge connects Millennium Park with Daley Bicentennial Plaza. </p><p>In 1998, the city officially opened the Museum Campus, a 10-acre (4-ha) lakefront park surrounding three of the city's main museums: the Adler Planetarium, the Field Museum of Natural History, and the Shedd Aquarium. The Museum Campus joins the southern section of Grant Park, which includes the renowned Art Institute of Chicago. Buckingham Fountain anchors the downtown park along the lakefront. During the summer of 2007, Grant Park hosts the public art exhibit, Cool Globes: Hot Ideas for a Cooler Planet. </p><p>The Museum of Science and Industry, in Hyde Park, is the only remaining building from the World's Columbian Exposition of 1893. </p><p>The Field Museum </p><p>The Oriental Institute, part of the University of Chicago, has an extensive collection of ancient Egyptian and Near Eastern archaeological artifacts, while the Freedom Museum is dedicated to exploring and explaining the First Amendment to the United States Constitution. Other museums and galleries in Chicago are the Chicago History Museum, DuSable Museum of African-American History, Mexican Fine Arts Center Museum, the Polish Museum of America, Museum of Contemporary Art, the Peggy Notebaert Nature Museum, the Hyde Park Art Center and The Renaissance Society. </p><p>Chicago has some signature foods which reflect the city's ethnic and working-class roots. 
These include the deep-dish pizza and the Chicago hot dog, which is almost always made of Vienna Beef and loaded with mustard, chopped onion, sliced tomato, pickle relish, celery salt, sport peppers, and a dill pickle spear (however, putting ketchup on a Chicago hot dog is often viewed as "sacrilegious"). Chicago is also known for Italian Beef sandwiches and the Maxwell Street Polish (always served topped with grilled onions and mustard). Grant Park celebrates the Taste of Chicago festival in late June and early July (basically the week of the Fourth of July). Every type of food in the city is represented, with free concerts and events daily. </p><p>Sports </p><p>U.S. Cellular Field on Chicago's South Side, the home of the Chicago White Sox </p><p>Chicago was named the best sports city in the United States by The Sporting News in 2006. The city has 17 sports teams. Five of those teams play in the four major North American professional sports leagues. </p><p>The Chicago Bears of the National Football League play at Soldier Field. The Bears are one of two charter NFL teams still in existence, the other being the Arizona Cardinals. </p><p>Chicago is one of three U.S. cities with two Major League Baseball teams (the others being New York City and Los Angeles). Unlike in the other two cities, both teams have remained in Chicago since the formation of the American League in 1900. The Chicago Cubs of the National League play at Wrigley Field, which is the second-oldest MLB stadium and is located in the North Side neighborhood of Lakeview, commonly referred to as "Wrigleyville." The Chicago White Sox of the American League play at U.S. Cellular Field, built in the early 1990s and located in the South Side neighborhood of Bridgeport. 
</p><p>Wrigley Field on the North Side, the home of the Chicago Cubs </p><p>The Chicago Bulls of the National Basketball Association play at the United Center on Chicago's Near West Side. In 2006, the Chicago Sky joined the WNBA. The Sky play at the UIC Pavilion. </p><p>The Chicago Blackhawks, of the National Hockey League, also play in the United Center. The Hawks are an Original Six franchise, founded in 1926. </p><p>The Chicago Wolves of the American Hockey League and Chicago Rush of the Arena Football League both play at the Allstate Arena in nearby Rosemont. </p><p>The Chicago Fire, members of Major League Soccer, moved from Soldier Field to the new Toyota Park in Bridgeview at 71st and Harlem Avenue during the summer of 2006. Toyota Park is also home to the Chicago Machine of the MLL. </p><p>The Chicago Marathon has been held every October since 1977. This event is one of five World Marathon Majors. </p><p>The city was selected on April 14, 2007 to represent the United States internationally for the bid for the 2016 Summer Olympics.<ref>Levine, Jay. "Chicago In The Running To Host 2016 Summer Games." CBS. July 26, 2006. Retrieved on December 1, 2006.</ref><ref>"Official Chicago 2016 Website." Retrieved on December 1, 2006.</ref> Chicago also hosted the 1959 Pan American Games, and Gay Games VII in 2006. Chicago was selected to host the 1904 Olympics, but they were transferred to St. Louis to coincide with the World's Fair.<ref name="1904 Olypics">{{cite web}}</ref> </p><p>Media </p><p>Harpo Studios, home of talk show host Oprah Winfrey </p><p>Chicago is the third-largest media market in the U.S. (after Los Angeles and New York City).<ref>Nielsen Media - DMA Listing (September 24, 2005).</ref> Each of the big four United States television networks directly owns and operates stations in Chicago. 
WGN-TV, which is owned by the Tribune Company, is carried (with some programming differences) as "Superstation WGN" on cable nationwide. The city is also the home of The Oprah Winfrey Show and Jerry Springer, while Chicago Public Radio produces programs such as PRI's This American Life and NPR's Wait Wait... Don't Tell Me!. </p><p>There are two major daily newspapers published in Chicago: the Chicago Tribune and the Chicago Sun-Times, with the former having the larger circulation. There are also several regional and special-interest newspapers such as the Daily Southtown, the Chicago Defender, the Chicago Sports Review, Chicago Free Press, the Newcity News, the Daily Herald, StreetWise, Windy City Times, The Gazette, Pioneer Press Chicago Group, and the Chicago Reader. </p><p>E.4 Houston </p><p>E.4.1 Economy </p><p>Houston's energy industry is recognized worldwide, particularly for oil, and biomedical research, aeronautics, and the ship channel are also large parts of its economic base. The area is the world's leading center for building oilfield equipment. Much of Houston's success as a petrochemical complex is due to its busy man-made ship channel, the Port of Houston.<ref>"Port of Houston Firsts", The Port of Houston Authority, 2007-05-15. Retrieved on 2007-05-27.</ref> The port ranks first in the United States in international commerce, and is the tenth-largest port in the world.<ref name="port ranking"/><ref>"General Information", The Port of Houston Authority, 2007-05-15. Retrieved on 2007-05-27.</ref> Unlike in most places, where high oil and gasoline prices are seen as harmful to the economy, in Houston they are generally seen as beneficial because many residents are employed in the energy industry.<ref>{{cite news}}</ref> </p><p>The Houston-Sugar Land-Baytown MSA's Gross Area Product (GAP) in 2006 was $325.5 billion,<ref>"Houston Area Profile", Greater Houston Partnership. 
Retrieved on 2007-05-27.</ref> slightly larger than Austria’s, Poland’s or Saudi Arabia’s Gross Domestic Product (GDP). When comparing Houston's economy to a national economy, only 21 countries other than the United States have a gross domestic product exceeding Houston's regional gross area product.<ref>"Houston Area Profile", Greater Houston Partnership. Retrieved on 2007-05-27.</ref> Mining, which in Houston is almost entirely exploration and production of oil and gas, accounts for 11% of Houston's GAP; this is down from 21% in 1985. The reduced role of oil and gas in Houston's GAP reflects the rapid growth of other sectors, such as engineering services, health services, and manufacturing.<ref>"Gross Area Product by Industry", Greater Houston Partnership. Retrieved on 2006-12-15.</ref> </p><p>Houston ranks second in employment growth rate and fourth in nominal employment growth among the 10 most populous metro areas in the U.S.<ref>"Employment by Industry", Greater Houston Partnership. Retrieved on 2006-12-15.</ref> In 2006, the Houston metropolitan area ranked first in Texas and third in the U.S. within the category of "Best Places for Business and Careers" by Forbes magazine.<ref>Badenhausen, Kurt. "2006 Best Places for Business and Careers", Forbes, 2006-05-04. Retrieved on 2006-12-15.</ref> Forty foreign governments maintain trade and commercial offices here and the city has 23 active foreign chambers of commerce and trade associations.<ref>"International Representation in Houston", Greater Houston Partnership. Retrieved on 2006-12-15.</ref> Twenty foreign banks representing 10 nations operate in Houston, providing financial assistance to the international community. </p><p>E.4.2 Climate </p><p>Allen's Landing after Tropical Storm Allison, June 2001 </p><p>Houston's climate is classified as humid subtropical (Cfa in the Köppen climate classification system). 
Spring supercell thunderstorms sometimes bring tornadoes to the area. Prevailing winds are from the south and southwest during most of the year, bringing heat across the continent from the deserts of Mexico and moisture from the Gulf of Mexico. </p><p>During the summer months, it is common for the temperature to reach over 90 °F (32 °C), with an average of 99 days per year above 90 °F (32 °C).<ref>"Monthly Averages for Houston, Texas", The Weather Channel. Retrieved on 2006-12-14.</ref><ref>"National Climatic Data Center", National Oceanic and Atmospheric Administration, United States Department of Commerce, 2004-06-23. Retrieved on 2006-12-14.</ref> However, the humidity results in a heat index higher than the actual temperature. Summer mornings average over 90 percent relative humidity and approximately 60 percent in the afternoon.<ref>"Average Relative Humidity", Department of Meteorology at the University of Utah. Retrieved on 2006-12-14.</ref> Winds are often light in the summer and offer little relief, except near the immediate coast.<ref>WIND - AVERAGE SPEED (mph). Department of Meteorology, University of Utah. 1993. Retrieved on 2007-01-10</ref> To cope with the heat, people use air conditioning in nearly every vehicle and building in the city; in fact, in 1980 Houston was described as the "most air-conditioned place on earth".<ref>A MOMENT IN BUILDING. BLUEPRINTS, Volume X, Number 3, Summer 1992. National Building Museum. Retrieved on 2007-01-11.</ref> Scattered afternoon thunderstorms are common in the summer. The hottest temperature ever recorded in Houston was 109 °F (43 °C) on September 4, 2000.<ref>"History for Houston Intercontinental, Texas on Monday, September 4, 2000", Weather Underground, 2000-09-04. Retrieved on 2006-12-14.</ref> </p><p>Winters in Houston are fairly temperate. 
The average high in January, the coldest month, is 63 °F (17 °C), while the average low is 45 °F (7 °C). Snowfall is generally rare. The last snowstorm to hit Houston was on December 24, 2004. The coldest temperature ever recorded in Houston was 5 °F (-15 °C) on January 23, 1940.<ref>Houston Extremes Data and Annual Summaries. National Weather Service, National Oceanic and Atmospheric Administration. Published 2007-01-05. Retrieved on 2007-01-11.</ref> </p><p>Houston has excessive ozone levels and is ranked among the most ozone-polluted cities in the United States.<ref>"State of the Air 2005, National and Regional Analysis", American Lung Association, 2005-03-25. Retrieved on 2006-02-17.</ref> Ground-level ozone, or smog, is Houston’s predominant air pollution problem, with the American Lung Association rating the metropolitan area's ozone level as the 6th worst in the United States in 2006.<ref>"State of the Air 2006, 25 Most Ozone-Polluted Cities", American Lung Association. Retrieved on 2006-04-02.</ref> The industries located along the ship channel are a major cause of the city's air pollution.<ref>"Summary of the Issues", Citizens League for Environmental Action Now, 2004-08-01. Retrieved on 2006-02-17.</ref> </p><p>Annual Average High Temperatures: 94 °F (summer), 63 °F (winter) </p><p>Annual Average Low Temperatures: 75 °F (summer), 45 °F (winter) </p><p>Warmest Month: July </p><p>Coolest Month: January </p><p>Highest Precipitation: June </p><p>Annual Precipitation: 53.96 inches </p><p>E.4.3 Culture </p><p>Houston Art Car Parade </p><p>Houston is a multicultural city with a large and growing international community. The city is home to the nation’s third largest concentration of consular offices representing 86 nations. Houston is designated as a world-class city by the Globalization and World Cities Study Group and Network.<ref>"Inventory of World Cities", Globalization and World Cities Study Group & Network. 
Retrieved on 2006-12-16.</ref> Houston received the official nickname of "Space City" in 1967 because it is home to NASA's Lyndon B. Johnson Space Center. Other nicknames include "H-Town," "Screwston," "The Big Heart," "Bayou City," "Clutch City," "Hustletown," and "Magnolia City." </p><p>Arts and theatre </p><p>Wortham Center in the Theater District of Downtown Houston </p><p>Houston has an active visual and performing arts scene. The Theater District is located downtown and is home to nine major performing arts organizations and six performance halls. It is the second largest concentration of theater seats in a downtown area in the United States.<ref>Ramsey, Cody. "In a state of big, Houston is at the top", Texas Monthly, September 2002. Retrieved December 10, 2002.</ref><ref>"About Houston Theater District", Houston Theater District. Retrieved on 2006-12-16.</ref> Houston is one of only five United States cities with permanent, professional, resident companies in all major performing arts disciplines: opera (Houston Grand Opera), ballet (Houston Ballet), music (Houston Symphony Orchestra), and theater (The Alley Theatre).<ref>"Museums and Cultural Arts", Greater Houston Partnership. Retrieved on 2006-12-16.</ref><ref>"Performing Arts Venues", Houston Theater District. Retrieved on 2006-12-16.</ref> Houston is also home to many local folk artists, art groups and various smaller progressive arts organizations.<ref>"A Brief History of the Art Car Museum", ArtCar Museum of Houston. Retrieved on 2006-12-16.</ref> Houston attracts many touring Broadway acts, concerts, shows, and exhibitions for a variety of interests.<ref>2006 fall edition of International Quilt Festival attracts 53,546 to Houston. Quilts., Inc. Press release published 2006-11-30. 
Retrieved on 2007-01-12.</ref> </p><p>Houston is home to the Bayou City Art Festival, which is considered to be one of the top five art festivals in the United States.<ref name=AmericanStyle2004> {{cite web}}</ref><ref name=AmericanStyle2005>{{cite web}}</ref> </p><p>The Museum District is home to many popular cultural institutions and exhibits, attracting more than 7 million visitors a year.<ref>Houston Museum District. Greater Houston Convention and Visitors Bureau. Retrieved on 2007-02-18.</ref><ref>{{cite news}}</ref> Notable facilities located in the district include The Museum of Fine Arts, Houston Museum of Natural Science, the Contemporary Arts Museum Houston, Holocaust Museum Houston, and the Houston Zoo.<ref>Houston Museum District Day. Texas Monthly. 2006. Retrieved on 2007-01-10.</ref><ref>Museum District. Contemporary Arts Museum Houston. Retrieved on 2007-01-10.</ref><ref>Houston Museum District. Greater Houston Convention and Visitors Bureau. Retrieved on 2007-01-10.</ref> Located in the nearby Montrose area are The Menil Collection and Rothko Chapel. </p><p>Many venues scattered across Houston regularly host local and touring rock, blues, country, hip hop and Tejano musical acts. Unfortunately, there has never been a widely renowned music scene in Houston. Artists seem to relocate to other parts of the United States once attaining some level of success.<ref>{{cite news}}</ref> A notable exception to the rule is Houston hip-hop, which celebrates the unique southern flavor and attitude of its roots. This has given rise to a strong, independent hip-hop music scene, influencing and influenced by the larger Southern hip hop and gangsta rap communities.<ref>{{cite news}}</ref> Many Houstonian hip-hop artists have attained commercial success, including Bun B, Chamillionaire, Mike Jones, Lil' Flip, and Beyoncé. 
</p><p>Events </p><p>Many annual events celebrate the diverse cultures of Houston. The largest and longest running is the annual Houston Livestock Show and Rodeo, held over 20 days from late February to early March. Another large celebration is the annual night-time Houston Pride Parade, held at the end of June. Other annual events include the Greek Festival,<ref>The Original Greek Festival, Houston, Texas. 2006. Retrieved on 2007-01-10. Warning: Automatic sound file.</ref> Art Car Parade, the Houston Auto Show and the Houston International Festival.<ref>The Houston International Festival. 2007. Retrieved on 2007-01-10.</ref> </p><p>Tourism and recreation </p><p>Downtown Aquarium </p><p>Space Center Houston is the official visitors’ center of NASA's Lyndon B. Johnson Space Center. Here one will find many interactive exhibits including moon rocks, a shuttle simulator, and presentations about the history of NASA's manned space flight program. </p><p>The Theater District is a 17-block area in the center of downtown Houston that is home to the Bayou Place entertainment complex, restaurants, movies, plazas, and parks. Bayou Place is a large multilevel building containing full-service restaurants, bars, live music, billiards, and art house films. The Houston Verizon Wireless Theater stages live concerts, stage plays, and stand-up comedy; and the Angelika Film Center presents the latest in art and foreign and independent films.<ref>Angelika Houston. Angelika Film Center. Retrieved on 2007-01-10.</ref> </p><p>Houston is home to many parks including Hermann Park, which houses the Houston Zoo and the Houston Museum of Natural Science, Lake Houston Park, Memorial Park, and Sam Houston Park. The city has 337 city parks and over 200 greenspaces, totaling over 19,600 acres that are managed by the city, including the Houston Arboretum and Nature Center. The Houston Civic Center was replaced by the George R. 
Brown Convention Center, one of the nation's largest, and the Jesse H. Jones Hall for the Performing Arts, home of the Houston Symphony Orchestra and Society for the Performing Arts. The Sam Houston Coliseum and Music Hall have been replaced by the Hobby Center for the Performing Arts. </p><p>Other tourist attractions include the Galleria (Texas's largest shopping mall, located in the Uptown District), Old Market Square, Tranquility Park, the Downtown Aquarium, and Sam Houston Park (which contains restored and reconstructed homes that were originally built between 1823 and 1905).<ref>The Heritage Society: Walk into Houston's Past. The Heritage Society. Retrieved on 2007-01-10.</ref> The San Jacinto Battlefield State Historic Site, where the decisive battle of the Texas Revolution was fought, is located on the Houston Ship Channel east of the city. </p><p>Sports </p><p>Minute Maid Park </p><p>Houston has teams for nearly every major professional sport. The Houston Astros (MLB), Houston Texans (NFL), Houston Rockets (NBA), Houston Comets (WNBA), Houston Aeros (AHL), Houston Wranglers (WTT), Houston Takers (ABA) and Houston Dynamo (MLS) all call Houston home. </p><p>Minute Maid Park (home of the Astros) and Toyota Center (home of the Rockets, Comets, and Aeros) are located in a revived area of downtown. The city has the Reliant Astrodome, the first domed stadium in the world; it also has the NFL's first retractable-roof stadium, Reliant Stadium. 
Other sports facilities in Houston include Hofheinz Pavilion and <a href="/tags/Robertson_Stadium/" rel="tag">Robertson Stadium</a> (both used for <a href="/tags/University_of_Houston/" rel="tag">University of Houston</a> collegiate sports), and Rice Stadium (home of the Rice University Owls football team). The infrequently used Reliant Astrodome hosted World Wrestling Entertainment's WrestleMania X-Seven on April 1, 2001, where an attendance record of 67,925 was set.<ref>"WrestleMania X-Seven Sets Revenue, Attendance Records", World Wrestling Entertainment, Inc., 2001-04-02. Retrieved 2006-12-16.</ref> </p><p>On October 19, 2005, the Houston Astros advanced to the World Series for the first time in the team's history, subsequently losing to the Chicago White Sox. In 2006, the Houston Dynamo won the MLS Cup in their first year, after moving from San Jose, California. The Houston Aeros have won four championships: in the WHA (1973, 1974), in the IHL (1999), and in the AHL (2003). The Houston Rockets won back-to-back NBA titles in 1994 and 1995. Houston has hosted major recent sporting events, including the 2004 Major League Baseball All-Star Game, the 2000 IHL All-Star Game, the <a href="/tags/2005_World_Series/" rel="tag">2005 World Series</a>, the 2005 Big 12 Conference football championship game, the 2006 NBA All-Star Game, the U.S. Men's Clay Court Championships from 2001 to 2006, and the Tennis Masters Cup in 2003 and 2004, as well as the annual Shell Houston Open golf tournament. The city hosts the annual NCAA College Baseball Minute Maid Classic every February and NCAA football's Texas Bowl in December. Houston has hosted the Super Bowl championship game twice. Super Bowl VIII was played at Rice Stadium in 1974 and Super Bowl XXXVIII was played at Reliant Stadium in 2004. In early 2006, the Champ Car auto racing series returned to Houston for a yearly race, held on the streets of the Reliant Park complex.
</p><p>Media </p><p>Houston is served by the Houston Chronicle, its only major daily newspaper with wide distribution. The Hearst Corporation, which owns and operates The Chronicle, bought the assets of the Houston Post - its long-time rival and main competition - when The Post ceased operations in 1995. The Post was owned by the family of former Lieutenant Governor Bill Hobby of Houston. The only other major publication to serve the city is the Houston Press, a free alternative weekly with a weekly readership of more than 300,000.<ref name="About">{{cite web}}</ref> </p><p>Houston Community Newspapers (owned and operated by ASP Westward, L.P.) is a news source for smaller localized communities in and around the city. Houston Community Newspapers publishes 35 suburban newspapers, including 2 daily papers and 33 weekly papers. These "community" papers include, among several others, the 1960 Sun, the Deer Park Progress, the Fort Bend/Southwest Sun, the Humble Observer, the Katy Sun, the Kingwood Observer, the River Oaks Examiner, and the Villager.<ref>"Local Top Stories," Houston Community Newspapers (Townnews.com, 1995 - 2007). http://www.hcnonline.com/site/news.cfm?brd=1574&nav_sec=69981&nr=1&nostat=1</ref></p><p>E.5 Las Vegas </p><p>E.5.1 Economy </p><p>Interior of the Circus Circus casino. A major part of the city economy is based on tourism, including gambling. </p><p>The primary drivers of the Las Vegas economy have been the confluence of tourism, gaming, and conventions, which in turn feed the retail and dining industries. Several companies involved in the manufacture of electronic gaming machines, such as slot machines, are located in the Las Vegas area. In the 2000s, retail and dining have become attractions of their own. </p><p>Tourism marketing and promotion are handled by the Las Vegas Convention and Visitors Authority, a countywide agency.
Its annual Visitors Survey provides detailed information on visitor numbers, spending patterns and resulting revenues http://www.lvcva.com/press/statistics-facts/index.jsp?whichDept=stats. </p><p>The Lloyd D. George Federal District Courthouse in Las Vegas is the first Federal Building built to the post-Oklahoma City blast resistant standards. </p><p>Las Vegas, as the county seat and home to the Lloyd D. George Federal District Courthouse, draws numerous legal service industries providing bail, marriage, divorce, tax, incorporation and other legal services. </p><p>Many technology companies have either relocated to Las Vegas or were created there. For various reasons, Las Vegas has had a high concentration of technology companies in electronic gaming and telecommunications industries. Some current technology companies in southern Nevada include Bigelow Aerospace, CommPartners, Datanamics, eVital Communications, NAHETS, Petroglyph, SkywireMedia, Switch Communications, WorldDoc, and Zappos. Companies that originally were formed in Las Vegas, but have since sold or relocated include Westwood Studios (sold to Electronic Arts), Systems Research & Development (sold to IBM), Yellowpages.com (sold to Bellsouth and SBC), and MPower Communications. </p><p>Constant population growth means that the housing construction industry is vitally important. In 2000 more than 21,000 new homes and 26,000 resale homes were purchased; more than one third of Las Vegas homes are only five years old or less.{{Fact}} In early 2005 there were 20 residential development projects of more than 300 acres each currently underway. </p><p>See also: List of foreign consulates in Las Vegas. </p><p>E.5.2 Climate </p><p>Las Vegas' climate is an arid desert climate (Köppen climate classification BWh) typical of the Mojave Desert, in which it is located, marked with very hot summers, mild winters, abundant sunshine year-round, and very little rainfall.
Temperatures in the 90s °F (mid-30s °C) are common in the months of May, June, and September and temperatures normally exceed 100 °F (38 °C) most days in the months of July and August, with very low humidity, frequently under 10%. The hottest temperature ever recorded is 117 °F (47 °C) set twice, on July 19, 2005, at McCarran International Airport (the warmest ever recorded there) and July 24, 1942, at present-day Nellis Air Force Base. Winters are cool and windy, with the majority of Las Vegas' annual 4.49 in (114 mm) of rainfall coming from January to March.<ref>KLAS-TV on many broadcasts along with other stations broadcasts</ref> Winter daytime highs are normally around 60 °F (16 °C) and winter nighttime lows are usually around 40 °F (4 °C). The coldest temperature ever recorded is 8 °F (-13 °C) set on January 25, 1937, at present-day Nellis Air Force Base. Showers occur less frequently in the Spring or Autumn. July through September, the Mexican Monsoon often brings enough moisture from the Gulf of California across Mexico and into the southwest to cause afternoon and evening thunderstorms. Although winter snow is usually visible from December to May on the mountains surrounding Las Vegas, it rarely snows in the city itself. </p><p>E.5.3 Culture </p><p>Tourism </p><p><!-- This section is linked from Las Vegas culture (archaeology) --> </p><p>The major attractions in Vegas are the casinos. The most famous casinos line Las Vegas Boulevard South, also known as the Las Vegas Strip. There are many casinos in the city's downtown area as well, which was the original focal point of the city's gaming industry in its early days. Several large casinos are also located in the county around the city. 
</p><p>Some of the most notable casinos located downtown are on the Fremont Street Experience and include: </p><p>Golden Nugget </p><p>Four Queens </p><p>Binion's Gambling Hall and Hotel </p><p>Fremont Casino </p><p>Plaza Hotel & Casino </p><p>Las Vegas Club </p><p>Fitzgeralds Las Vegas </p><p>Golden Gate Hotel and Casino </p><p>Parks </p><p>City of Las Vegas Parks listing </p><p>Las Vegas Springs Preserve Recreational and educational facility </p><p>Floyd Lamb State Park </p><p>Music </p><p>A number of popular music acts have originated from Las Vegas including rock bands The Killers, Panic! at the Disco, The Higher, Escape The Fate, Slaughter and <a href="/tags/Rhythm_and_blues/" rel="tag">rhythm and blues</a> group 702. </p><p>Sports </p><p>{{main}} </p><p>E.6 Los Angeles </p><p>E.6.1 Economy </p><p>The Southern Portion of Downtown Los Angeles, consisting of many older buildings and towering <a href="/tags/Skyscraper/" rel="tag">skyscrapers</a>. Companies such as Ernst & Young, Aon, Manulife, Paul Hastings, City National Bank, Union Bank of California and more have offices here. </p><p>The economy of Los Angeles is driven by international trade, entertainment (television, motion pictures, recorded music), aerospace, technology, petroleum, fashion, apparel, and tourism. Los Angeles is also the largest manufacturing center in the United States.<ref name='citydata'>City-data.com</ref> The contiguous ports of Los Angeles and Long Beach together comprise the most significant port in North America and one of the most important ports in the world, and they are vital to trade within the Pacific Rim.<ref name='citydata' /> Other significant industries include media production, finance, telecommunications, law, health and medicine, and transportation.
</p><p>For many years, up until the mid-1990s, Los Angeles was home to many major financial institutions in the western United States, including First Interstate Bank, which merged with Wells Fargo in 1996, Great Western Bank, which merged with Washington Mutual in 1998, and Security Pacific National Bank, which merged with Bank of America in 1992. Los Angeles was also home to the Pacific Stock Exchange until it closed in 2001. </p><p>The city is home to five major Fortune 500 companies, including aerospace contractor Northrop Grumman, energy company Occidental Petroleum, healthcare provider Health Net, homebuilding company KB Home, and metals distributor Reliance Steel & Aluminum. The University of <a href="/tags/Southern_California/" rel="tag">Southern California</a> (USC) is the city's largest private sector employer.<ref>Evan George, Trojan Dollars: Study Finds USC Worth $4 Billion Annually to L.A. County, Los Angeles Downtown News, December 11, 2006.</ref> </p><p>The Northern portion of Downtown Los Angeles consists of several large Glass Office Towers, Plazas, and Gardens. Companies like Citigroup, Wells Fargo, KPMG, US Bank, Bank Of America, Deloitte Touche Tohmatsu and more have offices here. </p><p>Other companies headquartered in Los Angeles include Twentieth Century Fox, Latham & Watkins, Univision, Metro Interactive, LLC, Premier America, CB Richard Ellis, Gibson, Dunn & Crutcher LLP, Guess?, O'Melveny & Myers LLP, Paul, Hastings, Janofsky & Walker LLP, Tokyopop, The Jim Henson Company, Paramount Pictures, Robinsons-May, Sunkist, Fox Sports Net, Capital Group, 21st Century Insurance, L.E.K. Consulting, and The Coffee Bean & Tea Leaf. </p><p>The metropolitan area contains the headquarters of even more companies, many of which wish to escape the city's high taxes.<ref>EVALUATION OF ALTERNATIVES TO THE CITY’S GROSS RECEIPTS BUSINESS TAXUT Strategies, et. al. Competitiveness of City Taxes and Fees.
1997.</ref> For example, Los Angeles charges a gross receipts tax based on a percentage of business revenue, while many neighboring cities charge only small flat fees.<ref>Competitiveness 22.</ref> The companies below benefit from their proximity to Los Angeles, while at the same time avoiding the city's taxes (and other problems). Some of the major companies headquartered in the cities of Los Angeles County are Shakey's Pizza (Alhambra), Academy of Motion Picture Arts and Sciences (Beverly Hills), City National Bank (Beverly Hills), Hilton Hotels (Beverly Hills), DiC Entertainment (Burbank), The Walt Disney Company (Fortune 500 - Burbank), Warner Bros. (Burbank), Countrywide Financial Corporation (Fortune 500 - Calabasas), THQ (Calabasas), Belkin (Compton), Sony Pictures Entertainment (parent of Columbia Pictures, located in Culver City), Computer Sciences Corporation (Fortune 500 - El Segundo), DirecTV (El Segundo), Mattel (Fortune 500 - El Segundo), Unocal (Fortune 500 - El Segundo), DreamWorks SKG (Glendale), Sea Launch (Long Beach), ICANN (Marina Del Rey), Cunard Line (Santa Clarita), Princess Cruises (Santa Clarita), Activision (Santa Monica), and RAND (Santa Monica). The L.A. area is also home to the U.S. headquarters of all but two of the major Asian automobile manufacturers (Nissan North America is in the process of relocating its headquarters from Gardena to the Nashville area, and Subaru's U.S. operations are based in Cherry Hill, New Jersey). Further, virtually all the world's automakers have design and/or tech centers in the L.A. region. </p><p>Downtown Los Angeles also is the home of the Los Angeles Convention Center, which hosts many popular events including the annual LA Auto Show in December. <!-- moved recently to Jan - last one was in dec.
2006 --> </p><p>E.6.2 Climate </p><p>The city is situated in a Mediterranean climate zone (Köppen climate classification Csb on the coast, Csa inland), experiencing mild, somewhat wet winters and warm to hot summers. Breezes from the Pacific Ocean tend to keep the beach communities of the Los Angeles area cooler in summer and warmer in winter than those further inland; summer temperatures can sometimes be as much as 18°F (10°C) warmer in the inland communities compared to that of the coastal communities. Coastal areas also see a phenomenon known as the "marine layer," a dense cloud cover caused by the proximity of the ocean that helps keep the temperatures cooler throughout the year. When the marine layer becomes more common and pervades farther inland during the months of May and June, it is called June Gloom.<ref>http://www.city-data.com/city/Los-Angeles-California.html</ref> </p><p>Echo Park, as seen with Lotus Plants and Palm Trees. </p><p>Temperatures in the summer can get well over 90 °F (32 °C), but average summer daytime highs in downtown are 82 °F (27 °C), with overnight lows of 63 °F (17 °C). Winter daytime high temperatures will get up to around 65 °F (18 °C), on average, with overnight lows of 48 °F (10 °C) and during this season rain is common. The warmest month is July, followed by August and then September. This somewhat large case of seasonal lag is caused by Los Angeles' proximity to the ocean and its latitude of 34° north. </p><p>The median temperature in January is 57 °F (13 °C) and 73 °F (22 °C) in August. The highest temperature recorded within city borders was 119.0 °F (48.33 °C) in Woodland Hills on July 22, 2006;<ref>Pool, Bob. "In Woodland Hills, It's Just Too Darn Hot." Los Angeles Times July 26, 2006, B1.</ref> the lowest temperature recorded was 18.0 °F (-7.8 °C) in 1989, in Canoga Park.
The highest temperature recorded for Downtown Los Angeles was 112.0 °F (44.4 °C) on June 26, 1990, and the lowest temperature recorded was 24.0 °F (-5.0 °C) on January 9, 1937. </p><p>Rain occurs mainly in the winter and spring months (February being the wettest month) with great annual variations in storm severity. Los Angeles averages 15 inches (38 cm) of precipitation per year. Snow is extraordinarily rare in the city basin, but the mountainous slopes within city limits typically receive snow every year. The greatest snowfall recorded in downtown Los Angeles was 2.0 inches (5 cm) on January 15, 1932.<ref>Burt, Christopher. Extreme Weather: A Guide and Record Book. New York: Norton, 2004: 100.</ref> </p><p>E.6.3 Culture </p><p>The Walt Disney Concert Hall, designed by award-winning architect Frank Gehry, is home to the Los Angeles Philharmonic. </p><p>{{Main}} </p><p>The people of Los Angeles are known as Angelenos. Nighttime hot spots include places such as Downtown Los Angeles, Silver Lake, Hollywood, and West Hollywood, which is the home of the world-famous Sunset Strip. </p><p>Some well-known shopping areas are the Hollywood and Highland complex, the Beverly Center, Melrose Avenue, Robertson Boulevard, Rodeo Drive, 3rd St. Promenade in Santa Monica, The Grove, Westside Pavilion, The Promenade at Howard Hughes Center and Venice Boardwalk.
</p><p>{{seealso}} </p><p>Sports </p><p>Los Angeles is the home of the Los Angeles Dodgers of Major League Baseball, the <a href="/tags/Los_Angeles_Kings/" rel="tag">Los Angeles Kings</a> of the National Hockey League, the <a href="/tags/Los_Angeles_Clippers/" rel="tag">Los Angeles Clippers</a> and <a href="/tags/Los_Angeles_Lakers/" rel="tag">Los Angeles Lakers</a> of the National Basketball Association, the Los Angeles Sparks of the WNBA, the Los Angeles Galaxy and Club Deportivo Chivas USA of Major League Soccer, the Los Angeles Riptide of Major League <a href="/tags/Lacrosse/" rel="tag">Lacrosse</a>, and the Los Angeles Avengers of the Arena Football League. Los Angeles is also home to the USC Trojans and the UCLA Bruins in the NCAA, both of which are Division I teams in the Pacific 10 Conference. UCLA has more NCAA national championships, all sports combined, than any other university in America. USC has the third most NCAA national championships, all sports combined, in the United States. The Los Angeles Angels of Anaheim and Anaheim Ducks are both based in nearby Anaheim. </p><p>Dodger Stadium is the home of the Los Angeles Dodgers. </p><p>There is currently no NFL franchise in the Los Angeles market, which is the second-largest television market in North America{{Fact}}. However, for the past several years, several billionaire entrepreneurs have shown interest in returning football to L.A., meeting with both the city and the NFL. Prior to 1995, the Rams called Memorial Coliseum (1946-1979) and Anaheim Stadium (1980-1994) home;<ref>http://www.stlouisrams.com/History/HomesOfTheRams/</ref> and the Raiders played their home games at Memorial Coliseum from 1982 to 1994.<ref>Hong, Peter. "Few Tears Here." Los Angeles Times 29 June 1995: B1.</ref> </p><p>Los Angeles has twice played host to the summer Olympic Games, in 1932 and in 1984. When the tenth Olympic Games were hosted in 1932, the former 10th Street was renamed Olympic Blvd.
The 1984 Summer Olympics inspired the creation of the Los Angeles Marathon, which has been held every year in March since 1986. Super Bowls I and VII were also held in the city as well as soccer's international World Cup in 1994. </p><p>Los Angeles applied to represent the USOC in international bidding for the 2016 Summer Olympics, but lost to Chicago. </p><p>Beach volleyball and windsurfing were both invented in the area (though predecessors of both were invented in some form by Duke Kahanamoku in Hawaii). Venice, also known as Dogtown, is credited with being the birthplace of skateboarding and the place where Rollerblading first became popular. Area beaches are popular with parties, sunbathers, surfers, swimmers and barefooters, who have created their own subcultures. </p><p>Staples Center, a premier venue for sports and entertainment, is home to five professional sports teams. </p><p>The Los Angeles area contains varied topography, notably the hills and mountains rising around the metropolis, making Los Angeles the only major city in the United States bisected by a mountain range; four mountain ranges extend into city boundaries. Thousands of miles of trails crisscross the city and neighboring areas, providing opportunities for exercise and wilderness access on foot, bike, or horse. Across the county a great variety of outdoor activities are available, such as skiing, rock climbing, gold panning, hang gliding, and windsurfing. Numerous outdoor clubs serve these sports, including the Angeles Chapter of the Sierra Club, which leads over 4,000 outings annually in the area. </p><p>Los Angeles also boasts a number of sports venues, including the Staples Center, a sports and entertainment complex that also hosts concerts and awards shows such as the Grammys.
The Staples Center also serves as the home arena for the Los Angeles Lakers and Los Angeles Clippers of the NBA, the Los Angeles Sparks of the WNBA, the Los Angeles Kings of the NHL and the Avengers of the AFL. </p><p>Media </p><p>The Fox Plaza, headquarters for 20th Century Fox, in <a href="/tags/Century_City/" rel="tag">Century City</a>, a major financial district for West Los Angeles. </p><p>The major daily newspaper in the area is The Los Angeles Times. La Opinión is the city's major Spanish-language paper. There is also a wide variety of smaller regional newspapers, alternative weeklies and magazines, including the Daily News (which focuses coverage on the San Fernando Valley), L.A. Weekly, Los Angeles CityBeat, Los Angeles magazine, Los Angeles Business Journal, Los Angeles Daily Journal (legal industry paper), The Hollywood Reporter and Variety (entertainment industry papers), and Los Angeles Downtown News. In addition to the English- and Spanish-language papers, numerous local periodicals serve immigrant communities in their native languages, including Armenian, Korean, Persian, Russian and Japanese. </p><p>Many cities adjacent to Los Angeles also have their own daily newspapers whose coverage and availability overlap into certain Los Angeles neighborhoods. Examples include the Daily Breeze (serving the South Bay), and The Long Beach Press-Telegram. </p><p>The Los Angeles metro area is served by a wide variety of local television stations, and is the second-largest designated market area in the U.S. with 5,431,140 homes (4.956% of the U.S.). </p><p>Los Angeles is the only city to have all 7 VHF allocations possible assigned to it. Other markets have 7 VHF but they are split among different cities. For instance, New York City has 7 VHF allocations but two of these are assigned to cities in New Jersey.
</p><p>Los Angeles Times Headquarters </p><p>Los Angeles, along with Washington, DC, is one of the few TV markets that did not have a VHF allocation reserved for public broadcasting. </p><p>The major network television affiliates include KABC-TV 7 (ABC), KCBS 2 (CBS), KNBC 4 (NBC), KTTV 11 (FOX), KTLA 5 (The CW), KCOP 13 (My Network TV), and KPXN 30 (i). There are also four PBS stations in the area, including KVCR 24, KCET 28, KOCE 50, and KLCS 58. World TV operates on two channels, KNET-LP 25 and KSFV-LP 6. There are also several Spanish-language television networks, including KMEX 34 (Univision), KFTR 46 (Telefutura), KVEA 52 (Telemundo), and KAZA 54 (Azteca America). KTBN 40 (Trinity Broadcasting Network) is a religious station in the area. </p><p>Several independent television stations also operate in the area, including KCAL 9 (owned by CBS Corporation), KSCI 18 (focuses primarily on Asian language programming), KWHY 22 (Spanish-language), KNLA-LP 27 (Spanish-language), KSMV-LP 33 (variety; a low power relay of Ventura-based KJLA 57), KPAL-LP 38, KXLA 44, KDOC 56 (classic programming and local sports), KJLA 57 (variety), and KRCA 62 (Spanish-language). </p><p>Religion </p><p>Los Angeles is one of the most religiously diverse communities in the world. </p><p>Los Angeles is home to adherents of many religions, and has over 100 Christian denominations, with Roman Catholicism being the largest due to the high numbers of Hispanic, Filipino, and Irish Americans. </p><p>The Roman Catholic Archbishop of Los Angeles leads the largest archdiocese in the country<ref>Pomfret, John. Cardinal Puts Church in Fight for Immigration Rights. Washington Post. April 2, 2006, accessed May 28, 2007.</ref>. Roger Cardinal Mahony oversaw construction of the Cathedral of Our Lady of the Angels, completed in 2002 at the north end of downtown.
</p><p>Built in 1956, the Los Angeles California Temple of The Church of Jesus Christ of Latter-day Saints is the second largest Mormon temple in the world. </p><p>The Los Angeles California Temple, the second largest temple operated by The Church of Jesus Christ of Latter-day Saints, is on Santa Monica Boulevard in the Westwood district of Los Angeles. Dedicated in 1956, it was the first Mormon temple built in California. The grounds include a visitors' center open to the public, the Los Angeles Regional Family History Center, also open to the public, and the headquarters for the Los Angeles mission. </p><p>Los Angeles is home to the second largest population of Jews in the United States. Many synagogues of the Reform, Conservative, Orthodox, and Reconstructionist movements can be found throughout the city. Most are located in the San Fernando Valley and West Los Angeles. The area in West LA around Fairfax and Pico Boulevards contains a large number of Orthodox Jews. The oldest synagogue in Los Angeles is the Breed Street Shul in East Los Angeles, which is being renovated{{Dubious}}. </p><p>The Azusa Street Revival (1906 - 1909) in Los Angeles was a key milestone in the history of the Pentecostal movement, not long after Christian Fundamentalism received its name and crucial promotion in Los Angeles. In 1909, the Bible Institute of Los Angeles (B.I.O.L.A., now <a href="/tags/Biola_University/" rel="tag">Biola University</a>) published and widely distributed a set of books called The Fundamentals, which presented a defense of the traditional conservative interpretation of the Bible. The term fundamentalism is derived from these books. </p><p>The Cathedral of Our Lady of the Angels is the mother church of the Roman Catholic Archdiocese of Los Angeles. </p><p>In the 1920s, Aimee Semple McPherson established a thriving evangelical ministry, with her Angelus Temple in Echo Park open to both black and white church members of the Foursquare Church.
Billy Graham became a celebrity during a successful revival campaign in Los Angeles in 1949. Herbert W. Armstrong's Worldwide Church of God used to have its headquarters in nearby Pasadena, now in Glendale. Until his death in 2005, Dr. Gene Scott was based near downtown. The Metropolitan Community Church, a fellowship of Christian congregations with a focus on outreach to gays and lesbians, was started in Los Angeles in 1968 by Troy Perry. Jack Chick, of "Chick Tracts," was born in Boyle Heights and lived in the area most of his life. </p><p>Because of Los Angeles' large multi-ethnic population, there are numerous organizations in the area representing a wide variety of faiths, including Islam, Buddhism, Hinduism, Zoroastrianism, Sikhism, Bahá'í, various Eastern Orthodox Churches, Sufism and others. Immigrants from Asia, for example, have formed a number of significant Buddhist congregations, making the city home to the biggest variety of Buddhists in the world. Los Angeles currently has the largest Buddhist population in the United States. There are over 300 temples in Los Angeles. Los Angeles has been a destination for Swamis and Gurus since as early as 1900, including Paramahansa Yogananda (1920). The Self-Realization Fellowship is headquartered in Hollywood and has a private park in Pacific Palisades. Los Angeles is the home to a number of Neopagans, as well as adherents of various other mystical religions. One wing of the Theosophist movement is centered in Los Angeles, and another is in neighboring Pasadena. Maharishi Mahesh Yogi, considered a spiritual, rather than a religious leader,<ref>http://en.wikipedia.org/wiki/Transcendental_Meditation#Relationship_to_religion_and_spirituality</ref> founded the Transcendental Meditation movement in Los Angeles in the late 1950s. The Kabbalah Centre is in the city.
The Church of Scientology has had a presence in Los Angeles since it opened February 18, 1954, and it has several churches and museums in the area, most notably the Celebrity Centre in Hollywood; in fact, the world's largest community of Scientologists can be found in LA. </p><p>E.7 Philadelphia </p><p>E.7.1 Economy </p><p>Comcast Center, Philadelphia's newest office building, under construction </p><p>Philadelphia's economy is heavily based upon manufacturing, refining, food, and financial services. </p><p>The city is home to the Philadelphia Stock Exchange and many major Fortune 500 companies, including cable television and internet provider Comcast, insurance companies CIGNA and Lincoln Financial Group, energy company Sunoco, food services company Aramark, Crown Holdings Incorporated, chemical makers Rohm and Haas Company and FMC Corporation, the pharmaceutical company GlaxoSmithKline, Boeing helicopters division, and automotive parts retailer Pep Boys. </p><p>The federal government plays a large role in Philadelphia as well. The city served as the capital city of the United States, before the construction of Washington, D.C. Today, the East Coast operations of the United States Mint are based near the historic district, and the Federal Reserve Bank's Philadelphia division is based there as well. Philadelphia is also home to the U.S. District Court for the Eastern District of Pennsylvania and the U.S. Court of Appeals for the Third Circuit. </p><p>Partly because of the historical presence of the Pennsylvania Railroad, and the large ridership at 30th Street Station, Amtrak also maintains a significant presence in the city. These jobs include customer service representatives and ticket processing and other behind-the-scenes personnel, in addition to the normal functions of the railroad. </p><p>Because of the presence of the federal government, the city has a large contingent of law firms.
{{Fact}} The city is also a national center of law because of the prestigious University of Pennsylvania Law School, Temple University Beasley School of Law, Villanova University School of Law, and Drexel University College of Law. Additionally, the headquarters of the American Law Institute is located in the city. </p><p>Philadelphia is also an important center for medicine, a distinction that it has held since the colonial period, when Pennsylvania Hospital was North America's first. The University of Pennsylvania, the city's largest private employer, runs an extensive medical system. There are also major hospitals affiliated with Temple University School of Medicine, Drexel University College of Medicine, and Thomas Jefferson University. Philadelphia also has three distinguished children's hospitals: Children's Hospital of Philadelphia (located adjacent to the Hospitals of the University of Pennsylvania), St. Christopher's Hospital, and the Shriners' Hospital. In the city's northeast section are Albert Einstein Hospital and the Fox Chase Cancer Center. Together, health care is the largest sector of employment in the city. Several medical professional associations are headquartered in Philadelphia. </p><p>In part because of Philadelphia's long-running importance as a center for medical research, the region is a major center for the pharmaceutical industry. GlaxoSmithKline, AstraZeneca, Wyeth, Merck, GE Healthcare, Johnson and Johnson and Siemens Medical Solutions are just some of the large pharmaceutical companies with operations in the region. </p><p>See also: List of companies based in the Philadelphia area, List of foreign consulates in Philadelphia. </p><p>Innovation </p><p>Philadelphia has been the home of several notable innovations for modern American society.
While there have been many more, the following is a list of some of the national firsts that have happened in this city:<ref>http://www.ushistory.org/philadelphia/philadelphiafirsts.html</ref><ref>http://philadelphia.about.com/cs/history/a/philly_firsts.htm</ref> </p><p>• Fire insurance company • Botanical garden • Public library • hospital • Fire engine • Fire company • medical school </p><p>• Pediatric hospital • Cancer hospital • Eye hospital • university • art school & museum • Municipal water system • post office </p><p>• bank • stock exchange • mint </p><p>• zoo • computer • modernist <a href="/tags/Skyscraper/" rel="tag">skyscraper</a> in North America </p><p>E.7.2 Climate </p><p>Philadelphia falls in the humid subtropical climate zone, although it is the northernmost U.S. city that falls in this classification. Because Philadelphia lies in the northern end of this zone, some of its outlying suburbs, especially to the north and west, fall in the humid continental zone. Summers are typically hot and muggy, fall and spring are generally mild, and winter is cold, although periods of extreme cold are infrequent. Snowfall is variable, with some winters bringing light snow and others bringing some significant snowstorms. It is common for the heavier snowfall to occur north and west of the city. Annual snowfall averages 21 in (534 mm). Precipitation is generally spread throughout the year, with eight to eleven wet days per month, at an average annual rate of 42 in (1068 mm). </p><p>January lows average 23 °F (-5 °C) and highs average 38 °F (3 °C). The lowest officially recorded temperature was -11 °F (-24 °C) on February 9, 1934, but temperatures below 14 °F (-10 °C) occur only a few times a year. July lows average 67 °F (20 °C) and highs average 86 °F (30 °C), although heat waves see highs above 95 °F (35 °C) with the heat index running as high as 110 °F (43 °C). The highest recorded temperature was 106 °F (41 °C) on August 7, 1918.
Early fall and late winter are generally driest, with February being the driest month, averaging only 2.74 in (69.8 mm) of precipitation. </p><p>E.7.3 Culture </p><p>Philadelphia has become notable in various arts and in culture. Philadelphia has had a prominent role in music, including its own sound, known as Philadelphia soul. On July 13, 1985, Philadelphia hosted the American end of the Live Aid concert at John F. Kennedy Stadium. On July 2, 2005, Bob Geldof, who organized the Live Aid concert, chose Philadelphia as the American host of the Live 8 concert. This time the show was held as a free concert on the Ben Franklin Parkway, where an estimated 600,000 to 800,000 people showed up for the global supershow. The city is home to many art galleries, many of which participate in the First Friday event. On the first Friday of every month, galleries in Old City are open late and free of charge. Annual events include film festivals and parades, the most famous being the New Year's Day Mummers Parade. In cuisine, the city is well known for its hoagies, soft pretzels, and water ice, and is home to the cheesesteak. </p><p>Tourism </p><p>Independence Hall </p><p>Philadelphia contains many national historical sites that relate to the founding of the United States. Independence National Historical Park is the center of these historical landmarks. Independence Hall, where the Declaration of Independence was signed, and the Liberty Bell are the city's most famous attractions. Other historic sites include homes for Edgar Allan Poe and Betsy Ross and early government buildings like the First and Second Banks of the United States. </p><p>The city contains many museums such as the Pennsylvania Academy of the Fine Arts and the Rodin Museum, the largest collection of work by Auguste Rodin outside of France. 
The city’s major art museum, the Philadelphia Museum of Art, is one of the largest art museums in the United States and features the steps made popular by the film Rocky.<ref name="Dallasnews">{{cite journal}}</ref> Philadelphia's major science museums include the Franklin Institute, which contains the Benjamin Franklin National Memorial, the Academy of Natural Sciences, and the University of Pennsylvania Museum of Archaeology and Anthropology. History museums include the National Constitution Center, the Atwater Kent Museum of Philadelphia History, the National Museum of American Jewish History, the Historical Society of Pennsylvania, the Grand Lodge of Free and Accepted Masons in the state of Pennsylvania and Masonic Museum and Eastern State Penitentiary. Philadelphia is home to the United States' first zoo and hospital. </p><p>Areas such as South Street and Old City have a vibrant night life. The Avenue of the Arts in Center City contains many restaurants and theaters, such as the Kimmel Center for the Performing Arts, which is home to the Philadelphia Orchestra, and the Academy of Music, the nation's oldest continually operating venue, home to the Philadelphia Opera.<ref name="Dallasnews" /> </p><p>Shopping </p><p>Philadelphia has a strong retail community reflected by both small scale local selections and large malls. Center City is home to The Gallery at Market East, The Shops at Liberty Place and The Shops at the Bellevue, upscale boutique malls, and The Philadelphia Bourse, which orients its offerings towards tourists and visitors. Rittenhouse Row, a section of Walnut Street in Center City, is home to some of the most high end stores and boutiques in the region. Old City and Society Hill, as well, feature upscale boutiques and retailers from local and international merchandisers. Philadelphia also has several neighborhood shopping districts, most notably Manayunk and Chestnut Hill. Also noteworthy is South Street with blocks of inexpensive boutiques. 
</p><p>232 The Italian Market in South Philadelphia offers a wide assortment of groceries, meats, cheeses and housewares from a diverse array of countries in addition to its Italian flavor. Geno's and Pat's, two famed cheesesteak outlets, are located here. The Reading Terminal Market in Center City includes dozens of restaurants, farm stalls, and shops, many run by Amish farmers from Lancaster County. There are also neighborhood farmers' markets throughout the city. </p><p>The Philadelphia metropolitan area also features shopping malls and outlets. Most notably, the King of Prussia Mall, the second-largest mall in the United States, is thirty minutes away from Center City. Outlet malls, such as Franklin Mills and Lancaster Outlets, are also nearby. </p><p>Media </p><p>Philadelphia's two major daily newspapers are The Philadelphia Inquirer and the Philadelphia Daily News, both of which are owned by Philadelphia Media Holdings L.L.C. The Philadelphia Inquirer, founded in 1829, is the third-oldest surviving daily newspaper in the United States. </p><p>The first experimental radio license was issued in Philadelphia in August, 1912 to St. Joseph's College. The first commercial radio stations appeared in 1922. WIP, then owned by Gimbel's department store, became the first on March 17. Also launched that year were WFIL, WOO, WCAU and WDAS.<ref name="Media">{{cite journal}}</ref> The highest rated stations in Philadelphia today include soft rock WBEB, KYW Newsradio, and urban adult contemporary WDAS-FM. </p><p>During the 1930s, the experimental station W3XE, which was owned by Philco Corp, became the first television station in Philadelphia. The station, which would later become KYW-TV (CBS), became NBC's first affiliate in 1939. 
By the 1970s WCAU-TV, WPVI- TV, WHYY-TV, WPHL-TV, and WTXF-TV were founded.<ref name="Media" /> In 1952 WFIL (now WPVI), premiered the television show Bandstand, which later became the nationally broadcast show American Bandstand hosted by Dick Clark. </p><p>Philadelphia has a competitive rock radio market, especially between WMMR and WYSP, which both specialize in playing modern and <a href="/tags/Classic_rock/" rel="tag">classic rock</a>. The two stations enjoy a very intense rivalry with each station's listeners being faithfully loyal to their favorite station in most cases. Since 2005, WMMR now plays more music due to a shift in WYSP's programming from a rock station (which also carried controversial <a href="/tags/Shock_jock/" rel="tag">shock jock</a> <a href="/tags/Howard_Stern/" rel="tag">Howard Stern</a>) to a Free FM station (which now carries the syndicated <a href="/tags/Opie_and_Anthony/" rel="tag">Opie and Anthony</a> morning show and The Kidd Chris afternoon show). WYSP also carries live radio broadcasts of all <a href="/tags/Philadelphia_Eagles/" rel="tag">Philadelphia Eagles</a> home and road games. WMMR has the top rated morning show in the Philadelphia area, The Preston and Steve Show, which has been at the top of the ratings since leaving former rock station Y100. </p><p>233 Philadelphia's four urban stations (WUSL ("Power 99"), WPHI ("100.3 The Beat"), WDAS and WRNB) are popular choices on the FM dial. WJJZ is the city's smooth jazz station. When WJJZ was discontinued in August 2006, it caused an uproar among listeners, but it was revived three months later, under new ownership (Greater Media) and with a new frequency (97.5). The former WJJZ is now WISX, "Philly's 106.1". </p><p>234 </p><p>E.8 Reno </p><p>E.8.1 Gaming industry </p><p>Downtown Reno, including the city's famous arch over <a href="/tags/Virginia/" rel="tag">Virginia</a> Street. 
</p><p>Before the 1960s, Reno was the gambling capital of the United States, but Las Vegas' rapid rise, American Airlines' buyout of Reno Air and the growth of Indian gaming in California have seriously reduced its business. Older casinos (Mapes, Nevada Club, Harold's Club, Palace Club) were torn down, and smaller casinos like the Comstock, Sundowner, Golden Phoenix, Kings Inn, Money Tree, Virginian, and Riverboat closed. Reno casinos experience slow days during the week, especially during winter, when mountain passes are closed to through traffic from California. Only during weekends, holidays and special events does Reno see an increase in business. </p><p>Two local casinos have shown significant growth, and have moved downtown gaming further south on Virginia Street. These are the Atlantis and the Peppermill. The Peppermill is viewed as the most outstanding Reno gaming/hotel property by Casino Player and Nevada Magazines. In 2005, the Peppermill Hotel Casino began a $300 million Tuscan-themed expansion. The Peppermill is adding a 600-room all-suite hotel tower, 62,000 square feet of convention space, a resort-style pool complex, and many additional restaurants and lounges. The Grand Sierra Resort has a sports center where visitors can bet and eat at Johnny Rockets. It also has a magazine. </p><p>In an effort to bring more tourism to the area, Reno holds several events throughout the year, all of which have been extremely successful. They include Hot August Nights (http://www.hotaugustnights.net/; a classic car convention and rally featuring old rock 'n' roll), Street Vibrations (a motorcycle fan gathering and rally), The Great Reno Balloon Race, the Best in the West Nugget Rib Cook-off (held in Sparks), a Cinco de Mayo celebration, bowling tournaments (held in the National Bowling Stadium) and the Reno Air Races. 
</p><p>Reno is the location of the corporate headquarters for International Game Technology, which manufactures slot machines used throughout the world. Ballys Technology and Gaming and GameTech also have a development and manufacturing presence in Reno. </p><p>E.8.2 Climate </p><p>Reno is situated in a high desert valley approximately 4,400 feet (1,300 m) above sea level. There are four distinct seasons, all of moderate intensity. Winters see some snowfall, but it is typically light. Most precipitation occurs in winter and spring, with summer and fall being extremely dry. Mid-summer highs are typically in the low to mid 90s (degrees Fahrenheit, 30s in degrees Celsius), but temperatures of 100 °F (38 °C) and above do occur regularly. The low humidity and high elevation generally make even the hottest and coldest days quite bearable. July high and low temperatures average 92 °F (33 °C) and 51 °F (11 °C), respectively; in January they are 46 °F (7 °C) and 22 °F (-6 °C). </p><p>E.8.3 Culture </p><p>• National Automobile Museum • Nevada Shakespeare Company • Nevada Museum of Art • University of Nevada, Reno Arboretum • Wilbur D. May Arboretum and Botanical Garden • Reno Pops Orchestra • Artown </p><p>E.9 San Diego </p><p>E.9.1 Economy </p><p>Downtown San Diego at night </p><p>San Diego Marriott Hotel and Marina </p><p>Several areas of San Diego (in particular <a href="/tags/La_Jolla/" rel="tag">La Jolla</a> and surrounding Sorrento Valley areas) are home to offices and research facilities for numerous biotechnology companies. Major biotechnology companies like Neurocrine Biosciences and Nventa Biopharmaceuticals are headquartered in San Diego, while many biotech and pharmaceutical companies, such as BD Biosciences, Biogen Idec, Merck, Pfizer, Élan, Genzyme, Celgene and Vertex, have offices or research facilities in San Diego. 
There are also several non-profit biotech institutes, such as the Salk Institute for Biological Studies, the Scripps Research Institute and the Burnham Institute. The presence of the <a href="/tags/University_of_California/" rel="tag">University of California</a>, San Diego and other research institutions helped fuel biotechnology growth. In June 2004, San Diego was ranked the top biotech cluster in the U.S. by the Milken Institute. </p><p>San Diego is home to companies that develop wireless cellular technology. <a href="/tags/Qualcomm/" rel="tag">Qualcomm</a> Incorporated was founded and is headquartered in San Diego; Qualcomm is the largest private-sector technology employer (excluding hospitals) in San Diego County. Other companies also have research and development labs in San Diego, principally focused on cloning Qualcomm's CDMA cellular technology. </p><p>The largest software company in San Diego (according to the San Diego Business Journal) is security software company Websense Inc. Websense was founded and is headquartered in San Diego. </p><p>The economy of San Diego is influenced by its port, which includes the only major submarine and shipbuilding yards on the West Coast, as well as the largest naval fleet in the world. The cruise ship industry, which is the second largest in California, generates an estimated $2 million annually from the purchase of food, fuel, supplies, and maintenance services.<ref>{{cite news}}</ref> </p><p>Due to San Diego's military influence, major national defense contractors, such as General Atomics and Science Applications International Corporation, are headquartered in San Diego. </p><p>Tourism is also a major industry owing to the city's climate. Major tourist destinations include <a href="/tags/Balboa_Park_(San_Diego)/" rel="tag">Balboa Park</a>, the <a href="/tags/San_Diego_Zoo/" rel="tag">San Diego Zoo</a>, SeaWorld, the nearby Wild Animal Park and Legoland, the city's beaches and golf tournaments like the Buick Invitational. 
</p><p>Real estate </p><p>San Diego has experienced dramatic growth of real estate prices in the last decade, to the extent that the current situation is sometimes described as a "housing affordability crisis". Median house prices more than tripled between 1998 and 2007. According to the California Association of Realtors<ref>C.A.R. reports sales decrease 25 percent in May</ref>, in May 2007, a median house in San Diego cost $612,370. Growth of real estate prices has not been accompanied by comparable growth of household incomes: the housing affordability index (the percentage of households that can afford to buy a median-priced house) fell below 20% in the early 2000s and remains very low. The San Diego metropolitan area has the second-worst median multiple (ratio of median house price to median household income) of all metropolitan areas in the United States. As a consequence, San Diego has been experiencing negative net migration since 2004, with significant numbers of people moving to <a href="/tags/Baja_California/" rel="tag">Baja California</a> and Riverside County, and many commuting daily from Tijuana, Temecula, and Murrieta to their jobs in San Diego. Others are leaving the state altogether and moving to more affordable regions. <ref>{{cite news}}</ref> </p><p>E.9.2 Climate </p><p>San Diego predominantly has a semi-arid warm steppe climate (Köppen climate classification BSh). It enjoys mild, sunny weather throughout the year. Average monthly temperatures range from about 57 °F (14 °C) in January to 72 °F (22 °C) in July, although late summer and early autumn are typically the hottest times of the year. The average annual daily temperature is 70.5 °F. Snow and ice are virtually nonexistent in the wintertime, typically occurring only inland from the coast when present. 
"May gray and June gloom", a local saying, refers to the way in which San Diego sometimes has trouble shaking off the marine layer, a cloudy layer typically higher in the atmosphere than fog, that comes in during those months. Temperatures soar to very high readings only on rare occasions, chiefly when easterly winds bring hot, dry air from the inland deserts (these winds are called "Santa Anas"). The average annual precipitation is less than 12 inches (300 mm), resulting in a borderline arid climate. Rainfall is strongly concentrated in the cooler half of the year, particularly the months December through March, although precipitation is lower than in any other part of the U.S. West Coast. The summer months are virtually rainless. Rainfall is highly variable from year to year and from month to month, and San Diego is subject to both droughts and floods. Thunderstorms and hurricanes are very rare. </p><p>Climate in the San Diego area often varies dramatically over short geographical distances, due to the city's topography (the Bay, and the numerous hills, mountains, and canyons). Frequently, particularly during the "May gray / June gloom" period, a thick "marine layer" cloud cover will keep the air cool and damp within a few miles of the coast, but will yield to bright cloudless sunshine between about 5 and 15 miles inland; the cities of El Cajon and Santee, for example, rarely experience the cloud cover. This phenomenon is known as microclimate. </p><p>E.9.3 Culture </p><p>The Museum of Man is one of several museums in Balboa Park. </p><p>Many popular museums, such as the <a href="/tags/San_Diego_Museum_of_Art/" rel="tag">San Diego Museum of Art</a>, the San Diego Natural History Museum, the San Diego Museum of Man, and the Museum of Photographic Arts, are located in Balboa Park. The Museum of Contemporary Art San Diego (MCASD) is located in an oceanfront building in La Jolla and has a branch located at the Santa Fe Depot downtown. 
The Columbia district downtown is home to historic ship exhibits as well as the San Diego <a href="/tags/Aircraft_carrier/" rel="tag">Aircraft Carrier</a> Museum featuring the USS Midway aircraft carrier. </p><p>San Diego has a growing art scene. "Kettner Nights" at the Art and Design District in Little Italy has art and design exhibitions throughout many retail design stores and galleries on selected Friday nights. "Ray at Night" at North Park hosts a variety of small-scale art galleries on the second Saturday evening of each month. La Jolla and nearby Solana Beach also have a variety of art galleries. </p><p>The San Diego Symphony at Symphony Towers performs on a regular basis and is directed by Jahja Ling. The San Diego Opera at Civic Center Plaza was ranked by Opera America as one of the top 10 opera companies in the United States. The <a href="/tags/Old_Globe_Theatre/" rel="tag">Old Globe Theatre</a> at Balboa Park produces about 15 plays and musicals annually. The <a href="/tags/La_Jolla_Playhouse/" rel="tag">La Jolla Playhouse</a> at UCSD is directed by two-time Tony Award winner Des McAnuff. The Joan B. Kroc Theatre at Kroc Center's Performing Arts Center is a 600-seat state-of-the-art theatre that hosts music, dance and theatre performances. The San Diego Repertory Theatre at the Lyceum Theatres in Horton Plaza produces a variety of plays and musicals. Serving the northeastern part of San Diego is the California Center for the Arts in Escondido, a 400-seat performing arts theater. </p><p>Tourism has affected the city's culture, as San Diego houses many tourist attractions, such as SeaWorld San Diego, Belmont amusement park, San Diego Zoo, San Diego Wild Animal Park, and nearby Legoland. San Diego's Spanish influence can be seen in the many historic sites across the city, such as the Spanish missions and Balboa Park. 
Cuisine in San Diego is diverse, with an abundance of wood-fired California-style pizzas and Mexican and East Asian cuisine. Annual events in San Diego include Comic-Con, San Diego/Del Mar Fair, and Street Scene Music Festival. </p><p>San Diego has a fairly large gay population and gay culture. The annual Gay Pride Parade usually draws crowds in excess of 100,000 people. According to U.S. Census data from the year 2000, San Diego had a gay index of 186 (gay male index of 226 and a lesbian index of 144); the national average gay index is 100. San Diego has the largest gay index in Southern California, surpassing Los Angeles (168). Most of the gay community, including the LGBT center and every gay bar in San Diego, is located in Hillcrest and the surrounding neighborhoods of University Heights and North Park. </p><p>San Diego Board Culture </p><p>A surfer at Black's. </p><p>San Diego has always been a hotbed for surf and skate culture. Headquartered here are some of the industry's biggest names including Sector 9 Skateboards, TransWorld Surf, and Rusty Surfboards. Some very well known surf spots include Swamis, Black's Beach, and Windansea. The region even has its own chain of surf shops, Sun Diego. Pro surfers Rob Machado and Taylor Knox, pro skateboarder Tony Hawk, and pro snowboarder Shaun White call the San Diego area their home. </p><p>Sports </p><table>
<tr><th>Club</th><th>Sport</th><th>League</th><th>Stadium</th></tr>
<tr><td><a href="/tags/San_Diego_Padres/" rel="tag">San Diego Padres</a></td><td>Baseball</td><td>MLB (National League)</td><td><a href="/tags/Petco_Park/" rel="tag">PETCO Park</a></td></tr>
<tr><td>San Diego Chargers</td><td>American Football</td><td>AFL 1961-1969, NFL 1970-Present</td><td>Qualcomm Stadium</td></tr>
<tr><td>O.M.B.A.C. RFC</td><td>Rugby</td><td>Rugby Super League (US)</td><td>Little Q Rugby Pitch at Qualcomm</td></tr>
<tr><td>San Diego Pumitas</td><td>Soccer</td><td>National Premier Soccer League</td><td><a href="/tags/Balboa_Stadium/" rel="tag">Balboa Stadium</a></td></tr>
<tr><td>San Diego WFC SeaLions</td><td>Soccer</td><td>Women's Premier Soccer League</td><td>Cathedral Catholic High School</td></tr>
<tr><td>San Diego Sunwaves</td><td>Soccer</td><td>USL W-League</td><td><a href="/tags/Torero_Stadium/" rel="tag">Torero Stadium</a></td></tr>
<tr><td>San Diego Shockwave</td><td>Indoor football</td><td>National Indoor Football League</td><td>Cox Arena</td></tr>
<tr><td>San Diego Wildcats</td><td>Basketball</td><td>ABA</td><td><a href="/tags/San_Diego_High_School/" rel="tag">San Diego High School</a></td></tr>
</table><p>San Diego has several sports venues. Qualcomm Stadium is the home of the NFL San Diego Chargers and the NCAA Division I <a href="/tags/San_Diego_State_Aztecs/" rel="tag">San Diego State Aztecs</a>, and hosts local high school football championships. Qualcomm Stadium also hosts international soccer games and Supercross events, and formerly hosted Major League Baseball. Three NFL Super Bowl championships and many college football bowl games have been held there. Balboa Stadium, constructed in 1914, is the city's first stadium and the former home of the San Diego Chargers. Balboa Stadium currently hosts soccer, football and track and field. </p><p>PETCO Park in <a href="/tags/Downtown_San_Diego/" rel="tag">downtown San Diego</a> hosts Major League Baseball along with occasional soccer and rugby events. The San Diego Sports Arena hosts basketball, and has also hosted ice hockey, <a href="/tags/Indoor_soccer/" rel="tag">indoor soccer</a> and boxing. Cox Arena at <a href="/tags/Aztec_Bowl_(stadium)/" rel="tag">Aztec Bowl</a> on the campus of San Diego State University hosts the NCAA Division I San Diego State Aztecs men's and women's basketball games and also hosts the San Diego Shockwave of the National <a href="/tags/Indoor_Football_League/" rel="tag">Indoor Football League</a>. 
Torero Stadium at the <a href="/tags/University_of_San_Diego/" rel="tag">University of San Diego</a> hosts college football and soccer, and the <a href="/tags/Jenny_Craig_Pavilion/" rel="tag">Jenny Craig Pavilion</a> at USD hosts basketball and volleyball. </p><p>The San Diego State Aztecs (MWC) and the <a href="/tags/San_Diego_Toreros/" rel="tag">San Diego Toreros</a> (WCC) are NCAA Division I teams. The UCSD Tritons (CCAA) are members of NCAA Division II while the Point Loma Nazarene Sea Lions (GSAC) are members of the NAIA. </p><p>San Diego has been the home of two NBA franchises, the first of which was called the San Diego Rockets. The Rockets represented the city of San Diego from 1967 until 1971. After the conclusion of the 1970-1971 season, they moved to Texas where they became the Houston Rockets. Seven years later, San Diego received a relocated NBA franchise (the <a href="/tags/Buffalo_Braves/" rel="tag">Buffalo Braves</a>), which was renamed the San Diego Clippers. The Clippers played in the San Diego Sports Arena from 1978 until 1984. Prior to the start of the 1984-1985 season, the team was moved to Los Angeles, and is now called the Los Angeles Clippers. </p><p>241 Unfortunately, San Diego has the dubious distinction of being the largest United States city to have not won a Super Bowl, World Series, Stanley Cup, NBA Finals or any other Major League sports championship; this is known as the San Diego Sports Curse. 
</p><p>Other sports franchises that represented San Diego include the <a href="/tags/San_Diego_Conquistadors/" rel="tag">San Diego Conquistadors</a> of the American Basketball Association, the <a href="/tags/San_Diego_Sockers_(2009)/" rel="tag">San Diego Sockers</a> (which played in various indoor and outdoor soccer leagues during their existence), the <a href="/tags/San_Diego_Flash/" rel="tag">San Diego Flash</a> and the San Diego Gauchos, both playing in different divisions of the <a href="/tags/United_Soccer_League/" rel="tag">United Soccer League</a>, the San Diego Spirit of the Women's United Soccer Association, the San Diego Mariners of the <a href="/tags/World_Hockey_Association/" rel="tag">World Hockey Association</a>, and the <a href="/tags/San_Diego_Gulls/" rel="tag">San Diego Gulls</a> who were in different hockey leagues during each of their three incarnations. </p><p>The annual Rock 'n' Roll Marathon in the city draws 20,000 participants annually. San Diego is also home to Camp La Jolla, the nation's largest fitness camp.{{Fact}} </p><p>San Diego also hosts the prestigious USA Sevens, an event in the annual IRB Sevens World Series for international teams in rugby sevens, a variant of <a href="/tags/Rugby_union/" rel="tag">rugby union</a> with seven players per side instead of 15. The USA Sevens moved from the Los Angeles area to San Diego in 2007. </p><p>Media </p><p>San Diego is served by ''The San Diego Daily Transcript'', as well as the mainstream daily newspaper, The San Diego Union-Tribune and its online portal, ''signonsandiego.com'', the online newspaper ''Voiceofsandiego.org'', and the alternative newsweeklies, the San Diego City Beat and San Diego Reader. Another newspaper with high readership in the region is the North County Times, which serves San Diego's North County area. Business publications include San Diego Metropolitan magazine, and the ''San Diego Business Journal''. 
</p><p>San Diego's television stations include XETV 6 (FOX), KFMB 8 (CBS), KGTV 10 (ABC), KPBS 15 (PBS), KBNT 17 (Univision), XHAS 33 (Telemundo), K35DG 35 (UCSD-TV), KNSD 39 (NBC), XHDTV 49 (MNTV), KUSI 51 (Independent), and KSWB 69 (CW). Most of the city's stations air on their own cable channel number for each area: </p><p>• Channel 6: Cable 6 • Channel 8: Cable 8 • Channel 10: Cable 10 • Channel 15: Cable 11 • Channel 39: Cable 7 • Channel 49: Cable 13 • Channel 51: Cable 9 • Channel 69: Cable 5 </p><p>242 The radio station skyline in San Diego is headed by nationwide broadcaster, Clear Channel Communications, followed up by CBS Radio, Midwest Television, Lincoln Financial Media, Finest City Broadcasting, and many other smaller stations and networks. Stations include: KOGO AM 600, KFMB AM 760, KCEO AM 1000, KCBQ AM 1170, KLSD AM 1360, KFSD 1450 AM, KPBS 89.5, Z 90.3, 91X, Magic 92.5, Channel 933, Star 94.1, FM 94/9, KyXy 96.5, KSON 97.3/92.1, KIFM 98.1, XMOR Blazin 98.9, ESPN Radio 800, XX Sports Radio AM 1090/FM 105.7, Jack-FM 100.7, 101.5 KGB-FM, KPRI 102.1, Rock 105.3, and a number of popular local Spanish language radio stations. </p><p>243 </p><p>E.10 San Francisco </p><p>E.10.1 Economy </p><p>Alcatraz receives 1.5 million visitors per year.<ref>Wildlife Field Guids: Wildlife Habitats in GGNRA. National Park Labs, National Park Service. Accessed September 4, 2006.</ref> </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> </p><p>Tourism is the backbone of the San Francisco economy. Its frequent portrayal in music, film, and popular culture has made the city and its landmarks recognizable worldwide. It is the city where Tony Bennett left his heart, where the Birdman of Alcatraz spent many of his final years, and where Rice-a-Roni<ref>Finz, Stacy (July 16, 2006) RICE-A- REDUX After a 7-year hiatus, it's billed once again as the San Francisco treat. San Francisco Chronicle. 
Retrieved on September 5, 2006.</ref> was said to be the favorite treat. San Francisco attracts the third-highest number of foreign tourists of any city in the United States<ref name="TravelandTourism">Overseas Visitors To Select U.S. Cities/Hawaiian Islands 2006-2005 U.S. Department of Commerce, Office of Travel & Tourism Industries. Accessed August 27, 2006.</ref> and claims Pier 39 near Fisherman's Wharf to be the third-most popular tourist attraction in the nation.<ref>City and County of San Francisco: Sights in San Francisco. City and County of San Francisco. Accessed September 4, 2006.</ref> More than 15 million visitors came to San Francisco in 2005, injecting nearly $7.5 billion into the economy.<ref name="SFGATE_Raine">Raine, George. (May 13, 2006). Tourism dollars add up: San Francisco seeing more visitors, more cash -- it's our No. 1 industry. San Francisco Chronicle. Accessed August 23, 2006.</ref> With a large hotel and restaurant infrastructure and a world-class facility in the Moscone Center, San Francisco also is a top-ten North American destination for conventions and conferences.<ref>Spain, William (November 13, 2004). Cost factors: Top convention cities boast most-affordable lodging. CBS MarketWatch. Accessed September 3, 2006.</ref> </p><p>The San Francisco skyline centered within the Financial District </p><p>The legacy of the California Gold Rush turned San Francisco into the principal banking and finance center of the West Coast. Montgomery Street in the Financial District is known as the "Wall Street of the West", home to the Federal Reserve Bank of San Francisco, the Wells Fargo corporate headquarters, and the site of the now-defunct Pacific Coast Stock Exchange. Bank of America, a pioneer in making banking services accessible to the middle class, was founded in San Francisco and built one of the first modern skyscrapers in the city: Bank of America Center. 
Many large financial institutions, multinational banks and venture capital firms are based in or have set up regional headquarters in the city. With over thirty international financial institutions,<ref>San Francisco: Economy city-data.com Accessed September 30, 2006.</ref> six Fortune 500 companies<ref>Fortune 500 2006 CNNMoney.com Accessed August 31, 2006.</ref> and a large support infrastructure of professional services, including law, public relations, architecture, and graphic design, also populating the downtown, San Francisco is one of ten Beta World Cities. </p><p>San Francisco's economy has increasingly become tied to that of Silicon Valley to the south, sharing a need for highly educated workers with specialized skills. It has been positioning itself as a biotechnology and biomedical hub and research center. The Mission Bay neighborhood, site of a second campus of UCSF, fosters a budding industry and serves as headquarters of the California Institute for Regenerative Medicine, the public agency funding stem cell research programs statewide. </p><p>Small businesses with fewer than ten employees and self-employed firms make up 85 percent of city establishments. The number of San Franciscans employed by firms of more than 1,000 employees has fallen by half since 1977.<ref name="SFEconomicStrategy"/> The penetration of national big box retail chains into the city has been slow. In an effort to buoy small privately owned businesses in San Francisco, the Small Business Commission<ref>San Francisco Small Business Commission</ref> supports a publicity campaign to keep a larger share of retail dollars in the local economy, while the Board of Supervisors has used the planning code to limit the neighborhoods in which "formula retail" establishments can set up shop,<ref> Supervisors OK limits on chain-store expansion San Francisco Chronicle. 
Accessed January 19, 2007.</ref> an effort affirmed by San Francisco voters.<ref> Proposition G: Limitations on Formula Retail Stores, City of San Francisco smartvoter.org. Accessed January 19, 2007.</ref> </p><p>E.10.2 Climate </p><p>Fog envelops the Golden Gate Bridge and approaches Crissy Field. </p><p>A quotation incorrectly attributed to Mark Twain goes, "The coldest winter I ever spent was a summer in San Francisco."<ref name="marktwain">{{cite web}}</ref> San Francisco benefits from California’s Mediterranean climate, characterized by mild wet winters and warm dry summers.<ref>Climate of San Francisco: Narrative Description Golden Gate Weather Services, Accessed on September 5, 2006</ref> However, surrounded on three sides by water, San Francisco has a climate strongly influenced by the cool currents of the Pacific Ocean, which tend to moderate temperature swings and produce a remarkably mild climate with little seasonal temperature variation. Average summertime high temperatures in San Francisco peak at 70 °F (21 °C) and are 20 °F (11 °C) lower than in nearby inland locations like Livermore.<ref name="LivermoreClimate">{{cite web}}</ref> The highest temperature ever recorded in San Francisco was 103 °F (39 °C) on June 14, 2000.<ref>National Climatic Data Center, Climate-2000/June/Climate-Watch/Selected Extremes, "Climatography of the United States," National Climatic Data Center, Accessed 2006-12-03</ref> Winters are mild, with daytime highs near 60 °F (15 °C). Lows almost never reach freezing temperatures, though the lowest temperature ever recorded in San Francisco was 27 °F (-3 °C) on December 11, 1932.<ref>Climate of San Francisco: Top 10 Temperatures Golden Gate Weather Services, Accessed on 2006-12-03</ref> May through September are quite dry, and rain is a common occurrence from November through March. Snow is extraordinarily rare, with only 10 instances recorded since 1852. 
The greatest snowfall on record was 3.7 inches (9.4 cm) in downtown San Francisco, and up to 7 inches (17.8 cm) elsewhere, on February 5, 1887.<ref name="SFClimate">{{cite web}}</ref> The last measurable snowfall in San Francisco was on February 5, 1976, when most of the city received an inch (2.5 cm) of snow.<ref>Climate of San Francisco: Snowfall Golden Gate Weather Services, Accessed on 2006-12-03</ref> </p><p>The combination of cold ocean water and the high heat of the California mainland creates the city's characteristic fog that can cover the western half of the city all day during the spring and early summer. The fog is less pronounced in eastern neighborhoods, in the late summer, and during the fall, which are the warmest months of the year. Due to its sharp topography and maritime influences, San Francisco exhibits a multitude of distinct microclimates. The high hills in the geographic center of the city are responsible for a 20% variance in annual rainfall between different parts of the city.<ref name="SFClimate"/> They also protect neighborhoods directly to their east from the foggy and cool conditions experienced in the Sunset District; for those who live on the eastern side of the city, San Francisco is sunnier, with an average of 160 clear days, and only 105 cloudy days per year.<ref>Historical Climate Information Western Regional Climate Center, Accessed September 5, 2006</ref> </p><p>E.10.3 Culture and entertainment </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> </p><p>San Francisco is characterized by a high standard of living.<ref>San Francisco by the Numbers: Planning after the 2000 Census. San Francisco Planning and Urban Research Association, Accessed August 28, 2006.</ref> The great wealth and opportunity generated by the Internet revolution drew many highly educated and high income workers and residents to San Francisco. Many poorer neighborhoods have become gentrified. 
The downtown has seen a renaissance driven by the redevelopment of the Embarcadero, including the neighborhoods South Beach and Mission Bay. Property values and household income have escalated to among the highest in the nation,<ref name="MedianIncome"/><ref>It may not feel like it, but your shot at the good life is getting better. Here's why San Francisco Magazine. Accessed August 28, 2006.</ref> allowing the city to support a large restaurant and entertainment infrastructure. Because the cost of living in San Francisco is exceptionally high, many middle class families have decided they can no longer afford to live within the city and have left to the suburbs of the <a href="/tags/San_Francisco_Bay_Area/" rel="tag">San Francisco Bay Area</a>.<ref name="MiddleClass"/> </p><p>Boutiques along Fillmore Street in Pacific Heights </p><p>Although the centralized commerce and shopping districts downtown, including the Financial District and the area around Union Square, are well-known, San Francisco is also characterized by a rich street environment featuring many mixed-use neighborhoods anchored around central commercial corridors to which residents and visitors alike can walk. They feature a mix of businesses and restaurants catering to the daily needs of the community and drawing in visitors. Some are highly gentrified, dotted with boutiques, cafes and nightlife, such as Union Street in Cow Hollow, and 24th Street in Noe Valley. Others are less so, including Irving Street in the Sunset, or Mission Street in the Mission. This approach has influenced the South of Market redevelopment, with businesses and neighborhood services rising alongside highrise residences.<ref name="FogDev">Wach, Bonnie (October 3, 2003) Fog City rises from the funk. USA Today. Retrieved on September 4, 2006.</ref> </p><p>A streetlight rainbow flag in The Castro. 
The international character San Francisco has had since its founding is witnessed today by large numbers of immigrants from Asia and Latin America. With 39 percent of its residents born overseas,<ref name="SFEconomicStrategy"/> San Francisco has numerous neighborhoods filled with businesses and civic institutions catering to new arrivals. In particular, the arrival of many ethnic Chinese, which accelerated beginning in the 1970s, complemented the already-established community based in Chinatown and has transformed the annual Chinese New Year Parade into the largest cultural event of its kind.<ref>Lam, Eric (December 22, 2005). San Francisco Chinese New Year Parade Embroiled in Controversy. The Epoch Times. Retrieved on August 31, 2006.</ref> </p><p>Following the arrival of writers and artists of the 1950s, who established the modern coffeehouse culture, and the social upheavals of the 1960s, San Francisco became one of the hypocenters of liberal activism, with Democrats, Greens, and progressives dominating city politics. Indeed, San Francisco has not given the Republican candidate for president greater than 20 percent of the vote since 1988.<ref>Dave Leip's Atlas of U.S. Presidential Elections. Accessed September 6, 2006.</ref> The gay rights contributions and leadership the city has shown since the 1970s have resulted in the powerful presence gays and lesbians have in civic life. A popular destination for gay tourists, it hosts San Francisco Pride, the world's best-known gay pride parade and festival. </p><p>Entertainment and performing arts </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> </p><p>Inside the War Memorial Opera House </p><p>San Francisco's War Memorial and Performing Arts Center features some of the longest operating performing arts companies in the United States. 
The War Memorial Opera House houses the San Francisco Opera and San Francisco Ballet, while the San Francisco Symphony plays in Davies Symphony Hall. The Herbst Theatre stages an eclectic mix of music performances, as well as public radio's City Arts & Lectures. </p><p>The Fillmore is a music venue located in the Western Addition. It is the second incarnation of a venue which gained fame in the 1960s under concert promoter Bill Graham and was where the Grateful Dead, Janis Joplin, and <a href="/tags/Jefferson_Airplane/" rel="tag">Jefferson Airplane</a> got their start and fostered the San Francisco Sound. Beach Blanket Babylon is a zany musical revue and civic institution. It has performed to sold out crowds in North Beach since 1974. </p><p>The American Conservatory Theater (A.C.T.) has been a leading force in Bay Area performing arts since its arrival in San Francisco in 1967, routinely staging original productions. San Francisco frequently hosts national touring productions of Broadway theatre shows in a number of vintage 1920s-era venues in the Theater District including the Curran, Orpheum, and Golden Gate Theatres. </p><p>SFMOMA from Yerba Buena Gardens </p><p>Museums </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> {{see also}} The Museum of Modern Art (SFMOMA) contains 20th century and contemporary pieces. It moved to its iconic building in South of Market in 1995 and attracts 600,000 visitors annually.<ref>Corporate Sponsorship (SFMOMA Facts and Audience) San Francisco Museum of Modern Art. Accessed September 1, 2006.</ref> The Palace of the Legion of Honor contains primarily European works. The De Young Museum and the Asian Art Museum have significant anthropological and non-European holdings. 
</p><p>The Palace of Fine Arts, originally built for the 1915 Panama-Pacific Exposition, today houses the Exploratorium, a popular science museum dedicated to teaching through hands-on interaction. The California Academy of Sciences is a natural history museum and hosts the Morrison Planetarium and Steinhart Aquarium. The <a href="/tags/San_Francisco_Zoo/" rel="tag">San Francisco Zoo</a> cares for a total of about 250 animal species out of which 39 have been deemed endangered or threatened.<ref>About the Zoo: Media Center (Press Kit) San Francisco Zoo. Accessed September 3, 2006.</ref> </p><p>Media </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> {{see also}} <!-- FAIR USE of Caen.jpg: see image description page at for rationale --> </p><p>"One day if I do go to heaven...I'll look around and say, 'It ain't bad, but it ain't San Francisco.'" - Herb Caen (1916 - 1997), columnist for the San Francisco Chronicle </p><p>San Francisco Chronicle, a broadsheet for which Herb Caen famously published his daily musings, is northern California's most widely circulated newspaper.<ref>Top 200 Newspapers by Largest Reported Circulation. (March 31, 2006) Audit Bureau of Circulations. Accessed August 28, 2006.</ref> The San Francisco Examiner, once the cornerstone of William Randolph Hearst's media empire and the home of Ambrose Bierce, declined in circulation over the years and has been reduced to a small tabloid. Sing Tao Daily claims to be the largest of several Chinese language dailies that serve the Bay Area. Alternative weekly newspapers include the San Francisco Bay Guardian and SF Weekly. San Francisco Magazine is a major glossy magazine. </p><p>The San Francisco metro area is the fifth largest TV market<ref>Nielsen Reports 1.1% increase in U.S. 
Television Households for the 2006 - 2007 Season (Press Release) (August 23, 2006) Nielsen Media, Accessed September 20, 2006.</ref> and the fourth largest Radio market<ref>ARBITRON RADIO MARKET RANKINGS: Spring 2006 Arbitron, Accessed September 20, 2006.</ref> in the United States. All the major television networks have affiliates serving the Bay Area region, with most of them based in the city. There are also some unaffiliated stations, and CNN, ESPN, and BBC have regional offices in San Francisco. </p><p>Public broadcasting outlets include both a television station and a radio station, broadcasting under the name KQED out of a facility near the Potrero Hill district. KQED-FM is the most-listened to National Public Radio affiliate in the country.<ref>{{PDFlink}} Radio Research Consortium. Accessed August 27, 2006.</ref> San Francisco companies such as CNET and Salon.com pioneered the use of the internet as a media outlet. Leading global media which are marketed specifically to gay and lesbian audiences are centered in San Francisco, with PlanetOut the parent company of major print newsmagazines and online communities. </p><p>Sports </p><p><!-- PLEASE CONSIDER MAKING YOUR ADDITIONS TO THE SAN FRANCISCO DAUGHTER PAGES. THIS ARTICLE IS MATURE. --> The <a href="/tags/San_Francisco_49ers/" rel="tag">San Francisco 49ers</a> of the NFL are the longest-tenured major professional sports franchise in the city. They began playing in 1946 and moved to their present location in Monster Park on Candlestick Point in 1971. They reached prominence in the 1980s and 1990s, winning five Super Bowl titles behind stars <a href="/tags/Joe_Montana/" rel="tag">Joe Montana</a>, Steve Young, Ronnie Lott, and Jerry Rice. s of the <a href="/tags/San_Francisco_Giants/" rel="tag">San Francisco Giants</a> </p><p>Major League Baseball's San Francisco Giants left New York for California prior to the 1958 season. 
Though boasting stars such as Willie Mays, Willie McCovey, and Barry Bonds, they have yet to win the World Series while based in San Francisco. Game 3 of the 1989 World Series in San Francisco was infamously pre-empted by the Loma Prieta earthquake. The Giants play at AT&T Park, which was opened in 2000, a cornerstone project of the South Beach and Mission Bay redevelopment.<ref>{{PDFlink}} Environmental Protection Agency. Accessed August 28, 2006.</ref> </p><p>The Dons, the athletic teams of the University of San Francisco, compete in NCAA Division I. Bill Russell led the Dons to NCAA men's basketball championships in 1955 and 1956. The San Francisco State Gators compete in Division II. The <a href="/tags/San_Francisco_Dragons/" rel="tag">San Francisco Dragons</a> of <a href="/tags/Major_League_Lacrosse/" rel="tag">Major League Lacrosse</a> play at <a href="/tags/Kezar_Stadium/" rel="tag">Kezar Stadium</a>, which they will share with the California Victory of United Soccer League First Division. The semi-professional San Francisco Bay Seals of the USL's developmental league are a second soccer team in the city. </p><p>San Francisco has ample resources and opportunities for participatory sports and recreation. The Bay to Breakers footrace, held annually since 1912, is best known for colorful costumes and a celebratory community spirit. The San Francisco Marathon is an annual event that attracts more than 7,000 participants.<ref>San Francisco Marathon Expands Cool Reputation The San Francisco Marathon. Accessed September 3, 2006.</ref> There are more than 200 miles (320 km) of bicycle lanes in the city<ref>San Francisco Bicycle Program City and County of San Francisco. Accessed September 3, 2006.</ref> and the Embarcadero and Marina Green are favored sites for in-line skating. Extensive public tennis facilities exist in Golden Gate Park and Dolores Park. 
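</p><p>Boating, sailing, windsurfing and kitesurfing are popular activities on the San Francisco Bay, and the city operates a yacht harbor in the Marina District. San Francisco's residents have been judged to be among the fittest in the United States.<ref name="fitness">{{cite web}}</ref> {{-}} </p><p>Appendix F </p><p>Tables of Intersection Terms </p><p>Table F-1: Intersection Terms of Ann Arbor, MI vs. San Diego, CA </p>
<table>
<tr><th colspan="2">Ann Arbor, MI → San Diego, CA</th><th colspan="2">San Diego, CA → Ann Arbor, MI</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>5</td><td>research</td><td>5</td><td>research</td></tr>
<tr><td>5</td><td>companies</td><td>5</td><td>companies</td></tr>
<tr><td>2</td><td>software</td><td>3</td><td>technology</td></tr>
<tr><td>2</td><td>offices</td><td>2</td><td>software</td></tr>
<tr><td>2</td><td>engineering</td><td>2</td><td>offices</td></tr>
<tr><td>1</td><td>world</td><td>2</td><td>home</td></tr>
<tr><td>1</td><td>technology</td><td>1</td><td>world</td></tr>
<tr><td>1</td><td>state</td><td>1</td><td>state</td></tr>
<tr><td>1</td><td>site</td><td>1</td><td>situation</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>security</td><td>1</td><td>security</td></tr>
<tr><td>1</td><td>pharmaceutical</td><td>1</td><td>role</td></tr>
<tr><td>1</td><td>people</td><td>1</td><td>pharmaceutical</td></tr>
<tr><td>1</td><td>office</td><td>1</td><td>people</td></tr>
<tr><td>1</td><td>national</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>housing</td><td>1</td><td>housing</td></tr>
<tr><td>1</td><td>house</td><td>1</td><td>house</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>growth</td></tr>
<tr><td>1</td><td>general</td><td>1</td><td>general</td></tr>
<tr><td>1</td><td>fuel</td><td>1</td><td>fuel</td></tr>
<tr><td>1</td><td>food</td><td>1</td><td>food</td></tr>
<tr><td>1</td><td>family</td><td>1</td><td>business</td></tr>
<tr><td>1</td><td>development</td><td></td><td></td></tr>
<tr><td>1</td><td>business</td><td></td><td></td></tr>
</table>
<p>Table F-2: Intersection Terms of Ann Arbor, MI vs. Reno, NV </p>
<table>
<tr><th colspan="2">Ann Arbor, MI → Reno, NV</th><th colspan="2">Reno, NV → Ann Arbor, MI</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>1</td><td>world</td><td>2</td><td>technology</td></tr>
<tr><td>1</td><td>technology</td><td>1</td><td>world</td></tr>
<tr><td>1</td><td>order</td><td>1</td><td>slot</td></tr>
<tr><td>1</td><td>open</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>office</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>national</td><td>1</td><td>growth</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>club</td></tr>
<tr><td>1</td><td>engineering</td><td>1</td><td>car</td></tr>
<tr><td>1</td><td>development</td><td>1</td><td>business</td></tr>
<tr><td>1</td><td>business</td><td>1</td><td>air</td></tr>
<tr><td>1</td><td>automobile</td><td></td><td></td></tr>
</table>
<p>Table F-3: Intersection Terms of Las Vegas, NV vs. Reno, NV </p>
<table>
<tr><th colspan="2">Las Vegas, NV → Reno, NV</th><th colspan="2">Reno, NV → Las Vegas, NV</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>3</td><td>gaming</td><td>3</td><td>gaming</td></tr>
<tr><td>2</td><td>technology</td><td>2</td><td>technology</td></tr>
<tr><td>2</td><td>development</td><td>2</td><td>growth</td></tr>
<tr><td>1</td><td>slot</td><td>1</td><td>slot</td></tr>
<tr><td>1</td><td>manufacture</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>manufacturing</td></tr>
<tr><td>1</td><td>growth</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>gambling</td><td>1</td><td>gambling</td></tr>
<tr><td></td><td></td><td>1</td><td>development</td></tr>
</table>
<p>Table F-4: Intersection Terms of Las Vegas, NV vs. San Diego, CA </p>
<table>
<tr><th colspan="2">Las Vegas, NV → San Diego, CA</th><th colspan="2">San Diego, CA → Las Vegas, NV</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>5</td><td>companies</td><td>5</td><td>companies</td></tr>
<tr><td>3</td><td>technology</td><td>3</td><td>technology</td></tr>
<tr><td>2</td><td>development</td><td>2</td><td>industry</td></tr>
<tr><td>1</td><td>slot</td><td>2</td><td>growth</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>situation</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>part</td><td>1</td><td>role</td></tr>
<tr><td>1</td><td>manufacture</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>industry</td><td>1</td><td>index</td></tr>
<tr><td>1</td><td>index</td><td>1</td><td>housing</td></tr>
<tr><td>1</td><td>housing</td><td>1</td><td>home</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>estate</td></tr>
<tr><td>1</td><td>growth</td><td>1</td><td>development</td></tr>
<tr><td>1</td><td>building</td><td>1</td><td>construction</td></tr>
<tr><td>1</td><td>acres</td><td></td><td></td></tr>
</table>
<p>Table F-5: Intersection Terms of Las Vegas, NV vs. Boston, MA </p>
<table>
<tr><th colspan="2">Las Vegas, NV → Boston, MA</th><th colspan="2">Boston, MA → Las Vegas, NV</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>5</td><td>companies</td><td>5</td><td>companies</td></tr>
<tr><td>2</td><td>industries</td><td>2</td><td>industries</td></tr>
<tr><td>1</td><td>switch</td><td>1</td><td>trade</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>part</td><td>1</td><td>part</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>electronic</td><td>1</td><td>electronic</td></tr>
<tr><td>1</td><td>authority</td><td>1</td><td>authority</td></tr>
</table>
<p>Table F-6: Intersection Terms of San Francisco, CA vs. Chicago, IL </p>
<table>
<tr><th colspan="2">San Francisco, CA → Chicago, IL</th><th colspan="2">Chicago, IL → San Francisco, CA</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>4</td><td>financial</td><td>4</td><td>world</td></tr>
<tr><td>3</td><td>national</td><td>4</td><td>financial</td></tr>
<tr><td>2</td><td>world</td><td>2</td><td>services</td></tr>
<tr><td>2</td><td>services</td><td>2</td><td>business</td></tr>
<tr><td>2</td><td>public</td><td>1</td><td>worldwide</td></tr>
<tr><td>2</td><td>business</td><td>1</td><td>study</td></tr>
<tr><td>1</td><td>worldwide</td><td>1</td><td>stock</td></tr>
<tr><td>1</td><td>stock</td><td>1</td><td>site</td></tr>
<tr><td>1</td><td>site</td><td>1</td><td>reserve</td></tr>
<tr><td>1</td><td>reserve</td><td>1</td><td>nation</td></tr>
<tr><td>1</td><td>nation</td><td>1</td><td>international</td></tr>
<tr><td>1</td><td>international</td><td>1</td><td>information</td></tr>
<tr><td>1</td><td>industry</td><td>1</td><td>industry</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>home</td></tr>
<tr><td>1</td><td>field</td><td>1</td><td>companies</td></tr>
<tr><td>1</td><td>data</td><td>1</td><td>board</td></tr>
<tr><td>1</td><td>companies</td><td>1</td><td>bank</td></tr>
<tr><td>1</td><td>board</td><td></td><td></td></tr>
<tr><td>1</td><td>bank</td><td></td><td></td></tr>
</table>
<p>Table F-7: Intersection Terms of San Francisco, CA vs. Philadelphia, PA </p>
<table>
<tr><th colspan="2">San Francisco, CA → Philadelphia, PA</th><th colspan="2">Philadelphia, PA → San Francisco, CA</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>3</td><td>national</td><td>2</td><td>stock</td></tr>
<tr><td>2</td><td>services</td><td>2</td><td>services</td></tr>
<tr><td>2</td><td>financial</td><td>2</td><td>national</td></tr>
<tr><td>2</td><td>bank</td><td>2</td><td>medicine</td></tr>
<tr><td>1</td><td>world</td><td>2</td><td>financial</td></tr>
<tr><td>1</td><td>store</td><td>2</td><td>bank</td></tr>
<tr><td>1</td><td>stock</td><td>1</td><td>service</td></tr>
<tr><td>1</td><td>service</td><td>1</td><td>reserve</td></tr>
<tr><td>1</td><td>reserve</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>public</td></tr>
<tr><td>1</td><td>professional</td><td>1</td><td>professional</td></tr>
<tr><td>1</td><td>music</td><td>1</td><td>office</td></tr>
<tr><td>1</td><td>medicine</td><td>1</td><td>manufacturing</td></tr>
<tr><td>1</td><td>institute</td><td>1</td><td>institute</td></tr>
<tr><td>1</td><td>industry</td><td>1</td><td>innovation</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>home</td><td>1</td><td>home</td></tr>
<tr><td>1</td><td>design</td><td>1</td><td>companies</td></tr>
<tr><td>1</td><td>companies</td><td>1</td><td>coast</td></tr>
<tr><td>1</td><td>coast</td><td></td><td></td></tr>
<tr><td>1</td><td>agency</td><td></td><td></td></tr>
</table>
<p>Table F-8: Intersection Terms of San Francisco, CA vs. Las Vegas, NV </p>
<table>
<tr><th colspan="2">San Francisco, CA → Las Vegas, NV</th><th colspan="2">Las Vegas, NV → San Francisco, CA</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>2</td><td>retail</td><td>2</td><td>retail</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>service</td><td>1</td><td>service</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>national</td><td>1</td><td>manufacture</td></tr>
<tr><td>1</td><td>lodging</td><td>1</td><td>information</td></tr>
<tr><td>1</td><td>industry</td><td>1</td><td>including</td></tr>
<tr><td>1</td><td>including</td><td>1</td><td>housing</td></tr>
<tr><td>1</td><td>data</td><td>1</td><td>home</td></tr>
<tr><td>1</td><td>companies</td><td>1</td><td>companies</td></tr>
<tr><td>1</td><td>agency</td><td>1</td><td>authority</td></tr>
</table>
<p>Table F-9: Intersection Terms of Las Vegas, NV vs. Houston, TX </p>
<table>
<tr><th colspan="2">Las Vegas, NV → Houston, TX</th><th colspan="2">Houston, TX → Las Vegas, NV</th></tr>
<tr><th>Freq.</th><th>Term</th><th>Freq.</th><th>Term</th></tr>
<tr><td>2</td><td>development</td><td>3</td><td>growth</td></tr>
<tr><td>1</td><td>technology</td><td>2</td><td>industry</td></tr>
<tr><td>1</td><td>switch</td><td>2</td><td>authority</td></tr>
<tr><td>1</td><td>services</td><td>1</td><td>trade</td></tr>
<tr><td>1</td><td>research</td><td>1</td><td>services</td></tr>
<tr><td>1</td><td>post</td><td>1</td><td>role</td></tr>
<tr><td>1</td><td>part</td><td>1</td><td>research</td></tr>
<tr><td>1</td><td>manufacture</td><td>1</td><td>national</td></tr>
<tr><td>1</td><td>information</td><td>1</td><td>land</td></tr>
<tr><td>1</td><td>industry</td><td>1</td><td>information</td></tr>
<tr><td>1</td><td>growth</td><td>1</td><td>engineering</td></tr>
<tr><td>1</td><td>building</td><td>1</td><td>building</td></tr>
<tr><td>1</td><td>authority</td><td>1</td><td>base</td></tr>
<tr><td>1</td><td>agency</td><td></td><td></td></tr>
<tr><td>1</td><td>acres</td><td></td><td></td></tr>
</table>
<p>Appendix G </p><p>Correlation Coefficient Matrix </p><p>Table G-1: Spearman Correlation Coefficient Matrix </p>
<table>
<tr><th></th><th></th><th>PC I</th><th>PC II</th><th>PC I & II</th><th>PCs</th><th>BSC</th><th>SCV</th><th>CBSC1</th><th>CSSC2</th></tr>
<tr><th rowspan="3">PC I</th><td>ρ</td><td>1.000</td><td>0.354</td><td>0.783</td><td>0.711</td><td>0.229</td><td>0.037</td><td>-0.002</td><td>0.073</td></tr>
<tr><td>Sig.</td><td>.</td><td>0.000</td><td>0.000</td><td>0.000</td><td>0.022</td><td>0.713</td><td>0.981</td><td>0.468</td></tr>
<tr><td>N</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">PC II</th><td>ρ</td><td></td><td>1.000</td><td>0.777</td><td>0.485</td><td>0.126</td><td>0.118</td><td>0.176</td><td>0.171</td></tr>
<tr><td>Sig.</td><td></td><td>.</td><td>0.000</td><td>0.000</td><td>0.212</td><td>0.243</td><td>0.079</td><td>0.088</td></tr>
<tr><td>N</td><td></td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">PC I & II</th><td>ρ</td><td></td><td></td><td>1.000</td><td>0.754</td><td>0.244</td><td>0.104</td><td>0.063</td><td>0.164</td></tr>
<tr><td>Sig.</td><td></td><td></td><td>.</td><td>0.000</td><td>0.015</td><td>0.304</td><td>0.533</td><td>0.102</td></tr>
<tr><td>N</td><td></td><td></td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">PCs</th><td>ρ</td><td></td><td></td><td></td><td>1.000</td><td>0.275</td><td>0.043</td><td>0.070</td><td>0.123</td></tr>
<tr><td>Sig.</td><td></td><td></td><td></td><td>.</td><td>0.006</td><td>0.669</td><td>0.490</td><td>0.222</td></tr>
<tr><td>N</td><td></td><td></td><td></td><td>100</td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">BSC</th><td>ρ</td><td></td><td></td><td></td><td></td><td>1.000</td><td>0.691</td><td>0.607</td><td>0.744</td></tr>
<tr><td>Sig.</td><td></td><td></td><td></td><td></td><td>.</td><td>0.000</td><td>0.000</td><td>0.000</td></tr>
<tr><td>N</td><td></td><td></td><td></td><td></td><td>100</td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">SCV</th><td>ρ</td><td></td><td></td><td></td><td></td><td></td><td>1.000</td><td>0.788</td><td>0.862</td></tr>
<tr><td>Sig.</td><td></td><td></td><td></td><td></td><td></td><td>.</td><td>0.000</td><td>0.000</td></tr>
<tr><td>N</td><td></td><td></td><td></td><td></td><td></td><td>100</td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">CBSC1</th><td>ρ</td><td></td><td></td><td></td><td></td><td></td><td></td><td>1.000</td><td>0.917</td></tr>
<tr><td>Sig.</td><td></td><td></td><td></td><td></td><td></td><td></td><td>.</td><td>0.000</td></tr>
<tr><td>N</td><td></td><td></td><td></td><td></td><td></td><td></td><td>100</td><td>100</td></tr>
<tr><th rowspan="3">CBSC2</th><td>ρ</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>1.000</td></tr>
<tr><td>Sig.</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>.</td></tr>
<tr><td>N</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>100</td></tr>
</table>
<p>ρ: Spearman correlation coefficient; Sig.: statistical significance level; N: number of scores included in the calculation </p><p>VITA </p><p>Tawan Banchuen </p>
<table>
<tr><td>2003-2008</td><td>The Pennsylvania State University</td><td>Ph.D.</td><td>Geography</td></tr>
<tr><td>1999-2002</td><td>Virginia Polytechnic Institute and State University</td><td>M.S.</td><td>Civil and Environmental Engineering</td></tr>
<tr><td>1995-1999</td><td>Chulalongkorn University</td><td>B.S.</td><td>Civil and Environmental Engineering</td></tr>
</table>
<p>Tawan Banchuen was born in Chonburi, a medium-sized town east of Bangkok. He spent most of his childhood in the hospital with his parents, who work there as physicians. His grandfather always wanted him to become a physician like his parents, but he never wanted to become one. While in the hospital, he would play games on his parents’ computers that explore strange places or trade commodities across the world. At that time, he did not know anything about geography. He pursued his bachelor’s degree and master’s degree in Civil and Environmental Engineering as he aspired to save humankind from self-destruction. Later on during his master’s study, he came across geographic information systems (GISystems) and was intrigued by how the systems allowed him to explore, manipulate and analyze information about the earth, providing a comprehensive view of the earth’s systems and enabling development of strategic plans. As a result, he continued on to pursue his doctorate in geography at the Pennsylvania State University. His doctoral dissertation explores not just how humans can use GISystems, but also how machines, namely GISystems, can autonomously exploit themselves. After receiving his doctorate, he founded a company, C & T Research and Venture Capitol <www.ctrvc.com>. The company develops expert systems that suggest to investors when to trade, where to trade and which commodities to trade. The company also owns and operates renewable energy production plants and cotton and electricity arbitrage facilities. 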
</p> </div> </div> </div> </div> </div> </div> </div> </body> </html>