
UC Berkeley Electronic Theses and Dissertations

Title A Framework for Integrating Statistical Modeling into a Culturally Competent Evaluation

Permalink https://escholarship.org/uc/item/6k53x83p

Author Pryor, Laura Susan

Publication Date 2017

Peer reviewed|Thesis/dissertation

A Framework for Integrating Statistical Modeling into a Culturally Competent Program Evaluation

By Laura Susan Pryor

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Education in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Mark Wilson, Co-Chair
Professor Derek Van Rheenen, Co-Chair
Professor Sophia Rabe-Hesketh
Professor Julian Chow

Fall 2017


Abstract

A Framework for Integrating Statistical Modeling into a Culturally Competent Evaluation

By Laura Pryor

Doctor of Philosophy in Education

University of California, Berkeley

Professor Mark Wilson, Co-Chair
Professor Derek Van Rheenen, Co-Chair

In 2011, the American Evaluation Association published the Public Statement on Cultural Competence in Evaluation, highlighting the need for evaluators to take a culturally competent stance. While some may view this charge as primarily applicable to qualitative methods, evaluators with quantitative skillsets are critical contributors. Advances in quantitative techniques allow statistical modeling to address nuanced questions appropriate to a culturally competent evaluation design; however, the link between cultural competence and statistical modeling remains unclear. Prior research has called for an inquiry into if and how quantitative methods can help evaluators better respond to and engage with culture (Chouinard and Cousins, 2009). Chapter One of this dissertation responds to this need by reviewing 110 evaluation cases demonstrating culturally competent approaches and synthesizing their prevalence and use of quantitative methods. Findings describe the predominant methods and purposes for integrating quantitative methods with culturally competent approaches, and illustrate key themes for aligning quantitative practices with culturally competent approaches.

One of the key themes from Chapter One is the need for quantitative evaluation measures to better align with the cultural context of an evaluation. Chapter Two therefore outlines the BEAR Assessment System process for creating, validating, and calibrating quantitative measures in a manner that complements a culturally competent evaluation approach. This process is explained through reflection on the case of UC Berkeley's Athletic Study Center evaluation, in which quantitative measures of sense of belonging and self-reliance were created for both formative and summative purposes.

While Chapter Two explains how to create, validate, and calibrate quantitative measures, the literature still lacks a clear example of how to analyze measures in a way that reflects cultural competence. Chapter Three extends the evaluation case presented in Chapter Two to demonstrate how an evaluator can apply the Latent Growth Item Response Model (LG-IRM) to analyze longitudinal data with both statistical rigor and cultural competence.

Keywords: Program Evaluation, Cultural Competence, Item Response Theory, Constructing Measures, Latent Growth Item Response Model, Student Athletes, Sense of Belonging, Self-Reliance

Dedication

To my Grandma Ito.


Table of Contents

Abstract

Dedication

List of Figures

List of Tables

Acknowledgements

Chapter 1: Understanding the Role of Quantitative Methods in Culturally Competent Approaches to Evaluation: A Review and Synthesis of Existing Cases

Chapter 2: Culturally Competent Quantitative Measures: Integrating the BEAR Assessment System with a Culturally Competent Program Evaluation Approach

Chapter 3: Applying a Culturally Competent Approach to a Latent Growth Item Response Model Analysis

Concluding Statement: Additional Consideration for Conducting a Culturally Competent Quantitative Evaluation

References

Appendices


List of Figures

Figure 1: Mixed methods study classification

Figure 2: Frequency of measure designs

Figure 3: Summary of quantitative culturally competent approaches

Figure 4: The Athletic Study Center Service Delivery Model

Figure 5: Hypothesized Construct Map for Sense of Belonging

Figure 6: Hypothesized Construct Map for Self-Reliance

Figure 7: Example Comic Scenario and Item for Sense of Belonging Measure

Figure 8: Sense of Belonging Wright Map

Figure 9: Self-Reliance Wright Map

Figure 10: Sense of Belonging Wright Map by Construct Level

Figure 11: Self-Reliance Construct Map by Construct Level

Figure 12: Sense of Belonging Infit Meansquare

Figure 13: Self-Reliance Infit Meansquare

Figure 14: Sense of Belonging Person Fit

Figure 15: Self-Reliance Person Fit

Figure 16: Standard Error of Measurement for Sense of Belonging

Figure 17: Standard Error of Measurement for Self-Reliance

Figure 18: Self-Reliance Individual Growth Trajectories

Figure 19: Sense of Belonging Individual Growth Trajectories


List of Tables

Table 1: Countries represented in the 1991-2016 sample

Table 2: Evaluation contexts represented in the 1991-2016 sample

Table 3: Quantitative purposes used in the 1991-2016 sample

Table 4: Comparing BAS with AEA’s Essential Practices for Culturally Competent Evaluators

Table 5: Example Table from Sense of Belonging Outcome Space

Table 6: Racial/Ethnic Distribution of Pilot Survey Participants

Table 7: Instrument Item Labels

Table 8: Empty Categories

Table 9: The Six Strands of Validity as Described by the Standards for Educational and Psychological Testing

Table 10: Racial/Ethnic Distribution of Pilot Survey Participants

Table 11: Self-Reliance Person Parameter Estimates

Table 12: Sense of Belonging Person Parameter Estimates

Table 13: Latent Regression Estimates


Acknowledgements

Thank you to the members of my dissertation committee for their thoughtful guidance and feedback. Furthermore, thank you to Derek Van Rheenen for making it possible for me to work with the Athletic Study Center. Additionally, thank you to my QME classmates for their continuous support and generous assistance and advice. Without the willing participation of the Athletic Study Center staff and student athletes, this dissertation would not have been possible. In particular, thank you to Kasra Sotudeh and Tarik Glen for allowing me to work with the ED 98 course for two semesters. Finally, thank you to my friends and family for believing in me, especially Shane Bryant for his unrelenting support.


Chapter One

Understanding the Role of Quantitative Methods in Culturally Competent Approaches to Evaluation: A Review and Synthesis of Existing Cases


The evaluation field has evolved since its beginnings in the 1800s and early 1900s (Russ-Eft & Preskill, 2009). Moving into the field's current era, evaluation contexts, evaluation purposes, and evaluator backgrounds have expanded. This expansion demands that evaluators innovate and create new approaches that can best address current evaluation contexts. Cultural competency in evaluation represents one area of growth resulting from the field's evolution. Generally defined, cultural competency is a stance taken toward the evaluation context that allows the evaluator to engage with and respond to culture. In 2011, the American Evaluation Association (AEA) published the Public Statement on Cultural Competence in Evaluation, highlighting the need for evaluators to take a culturally competent stance (American Evaluation Association, 2011). In many ways, the essential practices outlined in AEA's Public Statement on Cultural Competence in Evaluation represent the field's recent thinking regarding how and why evaluators can be responsive to local culture. The document emphasizes that cultural competency is needed in order to produce valid findings: "Without attention to the complexity and multiple determinants of behavior, evaluations can arrive at flawed findings with potentially devastating consequences" (American Evaluation Association, 2011, p. 2).

The evaluation field has recognized the integral role cultural competency plays in evaluation. Evaluators are therefore continuing to learn how to incorporate the principles from AEA's Public Statement on Cultural Competence in Evaluation into their current evaluation practice. In an attempt to better understand how evaluators have incorporated culturally competent approaches into their evaluation work, Chouinard and Cousins (2009) conducted a review of cross-cultural evaluations from 1991 to 2008. The primary purpose of their review was "to provide a descriptive review of the empirical literature on culture in evaluation and contribute to the development of a theoretical framework to facilitate future research and understanding concerning the complexity and multidimensionality of evaluation with cross-cultural settings" (p. 458). Chouinard and Cousins (2009) concluded their review with an Agenda for Further Research, which included the following proposition:

"In our review, we noted strong arguments in support of qualitative approaches as a means of giving primacy to the local context, as well as equally persuasive arguments in favor of methodological pluralism to ensure a more thorough rendering of the program and its context. In cross-cultural program and evaluation settings, is one methodological approach preferable over another? Is there a significant added value in mixing multiple methods? Can quantitative approaches help evaluators better engage culture? If so, which (if any) approaches would be consistent with cross-cultural evaluation (e.g., comparative studies)? Which approaches would further cross-cultural understanding?" (p. 487).

Evaluation scholars promoting cultural competency in evaluation have stated that, when appropriate, quantitative methods are welcomed within evaluation designs (Hood, 2001). Yet, quantitative methods are seldom aligned with culturally competent approaches.
Given Chouinard and Cousins' (2009) Agenda for Further Research, as well as the ongoing need for evaluators to innovate in ways that best respond to the ever-changing evaluation landscape, a systematic inquiry into the role of quantitative methods in culturally competent evaluation approaches is timely. Therefore, the purpose of this chapter is to respond to Chouinard and Cousins' (2009) questions of whether or not, and how, quantitative approaches can help evaluators better engage with culture and which approaches would be consistent with a culturally competent evaluation approach. This response contributes to the theoretical framework regarding culturally competent evaluation approaches, as well as to the body of practical knowledge related to cultural competency in evaluation.

This chapter also places a special focus on quantitative measures in culturally competent evaluation approaches. This focus stems from the following finding from Chouinard and Cousins (2009): "A number of the studies in our sample discussed the difficulty of using predetermined or standardized measures, outcome indicators and instruments to evaluate programs in culturally diverse communities, as they can conflict with localized community and culturally specific practices" (p. 481). This finding suggests that evaluators using quantitative measures often struggle with how to align them with a culturally competent evaluation approach. Using instruments inappropriate for the local context may result in inaccurate findings or failure to identify program impact. Therefore, to address such issues, evaluators need to understand how measurement has been incorporated into culturally competent evaluations and the ways in which quantitative measures can enhance cultural understanding and responsiveness.

Through synthesizing existing cases with culturally competent approaches to evaluation, this review addresses the following questions:

1. To what extent do evaluators using culturally competent approaches incorporate quantitative methods?
   a. What are the predominant quantitative methods used that align with culturally competent evaluation approaches?
   b. What are the key purposes and benefits of incorporating quantitative methods into a culturally competent evaluation approach?
2. How do evaluators incorporate quantitative measures in a way that aligns with culturally competent evaluation approaches?
   a. What are the key challenges with aligning quantitative measures with culturally competent evaluation approaches?
3. When using quantitative methods in evaluation, what are common strategies that align with a culturally competent evaluation approach?

Chapter One begins with a conceptual orientation to cultural competence and culturally competent approaches in evaluation, followed by an explanation of the methods used to conduct this review. The chapter then discusses the use and purpose of quantitative methods in culturally competent evaluation. Next, it focuses on how quantitative measures have been used and the challenges with doing so. Through synthesizing this information, the chapter then presents common strategies related to aligning quantitative methods with culturally competent approaches. The chapter concludes with a discussion of what is suggested by the findings as well as areas for future research.

Conceptual Orientation

Understanding cultural competency in evaluation primarily requires a definition of 'culture.' King et al. (2004) offer the following definition: "The term culture refers to cognitive, affective, and behavioral patterns that human groups share, that is, the rules and norms by which people live" (p. 68). The rules and norms by which people live impact the ways in which services are received and communities grow; thus, culture is a significant variable that evaluators must consider. As stated by Bledsoe and Hopson (2008): "Considerations of contextual and relationship factors, including socioeconomic status, respect, and partnerships between researcher and participants, will likely generate more accurate data" (p. 396).

The need for evaluators to consider culture implies that evaluators should be 'competent' at engaging with and responding to culture. However, determining what is meant by 'competent,' and thus defining the term cultural competency, is both complex and nuanced. As a result, "there is little agreement on terminology (cultural competence, cultural sensitivity, cultural awareness), definitions, or core approaches" (King, Nielson, & Colby, 2004, p. 68). Inconsistent terminology may create confusion as to how scholars and practitioners should approach cultural competency; however, ambiguity also provides an opportunity for productive dialogue between scholars, practitioners, and stakeholders to best understand cultural competency in individual evaluation contexts.

As an example from social welfare, Chow and Austin (2008) explored definitions of cultural competence for social service organizations. Their definition included five major components: (1) Multicultural Service Delivery Philosophy, (2) Responsive Organizational Processes, (3) Responsive Organizational Policies and Procedures, (4) Continuous Organizational Renewal, and (5) Effective Agency-Community Relations. Each component was used as a topic of reflection among individual organizations to define and assess organizational cultural competency. Thus, Chow and Austin (2008) stated that the definition of cultural competency in social welfare organizations is continuously evolving. This evolution should prompt an ongoing reflection and dialogue among social service agencies that contributes to improving service delivery. The definition of cultural competence is not static; rather, it is constantly debated and discussed for the purpose of improving applied practice.

Defining Cultural Competence in Evaluation

The evaluation literature uses several terms related to cultural competence, among them multicultural, cross-cultural, transformative, culturally sensitive, values-based, culturally responsive, and culturally anchored (Chouinard & Cousins, 2009). For example, Karen Kirkhart, in her 1994 Presidential Address to the American Evaluation Association (AEA), presented her conceptualization of culture in evaluation by coining the term "multicultural validity." Kirkhart posited that multicultural validity results from evaluators including key social and cultural factors in the evaluation process and findings. Multicultural validity is essential for producing evaluation findings that are ethical and utilizable. Donna Mertens (2008) addressed culture in evaluation through her inclusive/transformative evaluation framework. This framework intentionally includes minority and underrepresented groups in the evaluation process as a means to improve societal inequalities (Mertens, 2008). Bledsoe and Donaldson (2015) provided a review of the definition of cultural competence in evaluation. In their review, the authors cite Hopson (2009), who defines cultural competence in evaluation as "the development of program standards and criteria, programs and interventions, and measures, so that they are relevant, specifically tailored, credible, and valid for the unique groups and communities of focus" (p. 7). Similarly, SenGupta et al. (2004) defined cultural competence as "a systematic and responsive inquiry that is cognizant of, understands, and addresses the cultural context in which evaluation takes place" (p. 5). In addition to individual scholars, AEA as a professional organization has defined cultural competence.
The Public Statement on Cultural Competence in Evaluation defines cultural competence as follows:

"A stance taken toward culture, not a discrete status or simple mastery of particular knowledge and skills. A culturally competent evaluator is prepared to engage with diverse segments of communities to include cultural and contextual dimensions important to the evaluation" (p. 1).


These definitions suggest that a 'one size fits all' process for conducting a culturally competent evaluation does not exist. Rather, a culturally competent evaluation implies that evaluators select and implement evaluation designs and corresponding tools and strategies that are appropriate for the unique cultural context of the study (SenGupta, Hopson, & Thompson-Robinson, 2004). Therefore, instead of identifying a single checklist for conducting a culturally competent evaluation that applies to all contexts, evaluators may thoughtfully consider which designs, tools, and strategies would best allow them to engage with and respond to a given culture. This dissertation defines these approaches as 'culturally competent approaches.' Using culturally competent approaches does not guarantee that an evaluation will be deemed 'culturally competent.' Rather, they represent the ways in which an evaluator can work toward cultural competency in evaluation. Culturally competent approaches not only represent how evaluators attempt to conduct culturally competent evaluations, but also provide the context for the ongoing conversation around cultural competence in evaluation.

Culturally Competent Approaches to Evaluation

Culturally competent approaches to evaluation are not represented by a defined toolkit, but rather by an ever-evolving compendium of ways in which evaluators can best engage with and respond to the local cultural context. Bledsoe and Donaldson (2015) state that evaluators naturally look for a set of tools to apply that would deem their evaluation practice culturally competent. However, given the individualized context of each evaluation, what is considered culturally competent in one setting may not be in another. Pon (2003) warns that cultural competence should not be considered a finite set of skills to be acquired and transferred to any setting; this may lead to ignoring the unique circumstances and situations of communities and individual differences. With that said, there are a number of existing evaluation approaches that evaluators can select and adapt to best respond to the evaluation's cultural context. Specifically, Empowerment Evaluation (Fetterman, Kaftarian, & Wandersman, 1996), Culturally Responsive Evaluation (Hopson, 2009), Transformative Evaluation (Mertens, 2008), and Developmental Evaluation (Patton, 1994) represent evaluation frameworks that place specific emphasis on local culture. While these frameworks are distinct from one another in terms of step-by-step processes, they each promote participatory methods as a way of using evaluation to better engage with and respond to local culture.

Chouinard and Cousins (2009) also note that many evaluators predominantly use collaborative or participatory approaches. The methods associated with a collaborative approach, however, can vary depending on the specific evaluation context. Qualitative methods, such as focus groups, in-depth interviews, and participant observation, are commonly used to allow evaluators to consider local culture. To a lesser extent, evaluators have incorporated quantitative analyses through survey data and existing data. Therefore, when it comes to engaging with and responding to culture, the current literature typically argues in favor of qualitative approaches and seldom points toward quantitative methods.


The Role of Quantitative Methods in Cultural Competence

Qualitative methods are more closely aligned with process-oriented approaches in which deep, detailed descriptions of program implementation, stakeholder reactions, and site observations are used for program improvement. Scholars have criticized quantitative techniques as a method for understanding the social and cultural factors related to an evaluation. Past criticisms of quantitative methods relate to procedures being overly rigid, promoting masculine dominance, and de-contextualizing human behavior (Mertens, 1998). Moreover, Davis (1992) stated that quantitative approaches assume homogeneity in participants' conditions through racial categorization. Failure to understand the target population may result in perpetuating stereotypes (Davis, 1992). Furthermore, Marin and Marin (1991) pointed out that among some Hispanic populations, there is a greater tendency to answer in superlative categories when presented with a Likert-type question, due to the cultural value of showing sincerity in expression of feelings. Thus, quantitative results may be culturally biased (Marin & Marin, 1991) unless one knows about Marin and Marin's (1991) work and corrects for it. Regarding generalizability, quantitative analyses often pay little attention to within-group differences among racial and ethnic minority groups or people with disabilities (Mertens, 1998), although analyses that report within-group distributions of variables would not be subject to this criticism. When evaluators use inappropriate approaches, they can erroneously determine that programs are or are not effective (Frierson, Hood, Hughes, & Thomas, 2010).

Despite these critiques, as Newton and Llosa (2010) point out, multilevel modeling can help evaluators ask more nuanced questions regarding how the quality of implementation affected overall program impact. Furthermore, item response theory approaches allow evaluators to analyze individual responses to surveys and assessments and to check whether differential item functioning exists between subgroups (Green, 1996). Therefore, quantitative approaches can address questions related to program context and differences within and between subgroups. Scott-Jones (1993) suggests that in order to use quantitative methods in a culturally competent way, evaluators should: (1) use appropriate labels and language, (2) disaggregate by key characteristics such as educational level, and (3) report differences within racial groups (such as urban versus rural African Americans). Therefore, quantitative methods can align with culturally competent evaluation approaches. This chapter defines culturally competent quantitative approaches as strategies, methods, and tools that both contribute to quantitative findings and allow evaluators to engage with and respond to culture. Understanding how quantitative methods have been used in culturally competent evaluation frameworks, and the challenges and benefits of doing so, adds to the evaluation field's body of knowledge regarding cultural competency in evaluation.
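To make the item-level checks mentioned above concrete, the sketch below illustrates one common way of screening a single dichotomous item for differential item functioning: a likelihood-ratio comparison of logistic regression models with and without a group term. This is a minimal, hypothetical example with simulated data; it is not drawn from any of the reviewed studies, and the variable names are invented for illustration only.

```python
# Hypothetical sketch of a logistic-regression DIF screen (simulated data only).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)          # 0/1 indicator for two respondent subgroups
ability = rng.normal(0, 1, n)          # stand-in for the matching variable (e.g., total score)

# Simulate one dichotomous item whose difficulty differs by group (uniform DIF).
logit = 1.2 * ability - 0.5 - 0.8 * group
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Model 1: item response predicted by the matching variable only.
m1 = sm.Logit(item, sm.add_constant(ability)).fit(disp=0)

# Model 2: add the group indicator; a significant improvement suggests uniform DIF.
m2 = sm.Logit(item, sm.add_constant(np.column_stack([ability, group]))).fit(disp=0)

lr = 2 * (m2.llf - m1.llf)                   # likelihood-ratio statistic, 1 df
p_value = stats.chi2.sf(lr, df=1)
print(f"LR = {lr:.2f}, p = {p_value:.4f}")   # a small p-value flags the item for review
```

Flagged items would then be reviewed with local stakeholders rather than dropped automatically, consistent with the participatory strategies described later in this chapter.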

Methods

Sample Selection

The sample of articles selected for this study drew upon and extended the original review completed by Chouinard and Cousins (2009), who reviewed 52 evaluation cases dating from 1991-2008. Their sample represented all of the community-based empirical evaluations found under their literature search, which was intended to be broad and far-reaching.[1] Chouinard and Cousins (2009) stated that "empirical research was understood to include not only traditional social science methods (e.g., case studies, mixed-method inquiry), but reflective narratives based on participant experiences with one or more program contexts" (p. 462). Furthermore, Chouinard and Cousins (2009) selected cases that highlighted culture as a key consideration for methodological processes and reflected culture in findings or lessons learned. This study reviewed the 52 studies cited in Chouinard and Cousins (2009) and used the authors' same criteria to update the list to 2016. Specifically, modeled after Chouinard and Cousins (2009), this study updated the original 52 cases using the following key words: "cross-cultural evaluation," "culturally responsive evaluation," "cultural context," "culturally competent evaluation," and "anthropological evaluation." These keywords were used as search criteria for the following journals: American Journal of Community Psychology, American Journal of Evaluation, American Journal of Preventive Medicine, Canadian Journal of Program Evaluation, Educational Evaluation and Policy Analysis, Evaluation, Evaluation and the Health Professions, Evaluation and Program Planning, Journal of Multidisciplinary Evaluation, Journal of Primary Prevention, New Directions for Evaluation, and Studies in Educational Evaluation. Chouinard and Cousins selected these journals in order for the literature search to cover multiple disciplines and geographic locations. Evaluation cases were selected based on Chouinard and Cousins' (2009) definition of what constituted an empirical study, as well as on whether the studies had a focus on culture. This updated search resulted in an additional 58 studies from 2009-2016. The studies reviewed in this chapter represent evaluation cases that sought to incorporate culturally competent approaches as a central component of the evaluation.

Sample Characteristics

In total, this study reviewed 110 evaluation cases from 1991-2016. Of these studies, 52 were from 1991-2008, 23 were published in 2009-2013, and 35 were published in 2014-2016. Focusing on the articles published in 2009-2016,[2] the reviewed cases were often a reflection on a specific evaluation study, rather than the evaluation study itself. In a small number of cases, the articles described multiple evaluation studies as part of an inquiry into an overarching framework or theory. While only 21 percent (n=11) of the articles selected from 1991-2008 represented an international context, 62 percent (n=36) of the articles selected from 2009-2016 took place in a context outside of the US. Table 1 outlines the frequency and percentage of countries represented in the 1991-2016 sample.

[1] Chouinard and Cousins (2009) do not make the claim that their sample is exhaustive, but the authors state that they "are satisfied that it is sufficiently extensive so as to capture the state of the art of empirical research in the area" (p. 463).
[2] See Chouinard and Cousins (2009) for a complete description of the studies from 1991-2008.


Table 1: Countries represented in the 1991-2016 sample

Country                Percentage (n)
Australia              3 (3)
Africa                 6 (7)
Brazil                 5 (6)
Canada                 9 (10)
Chile                  2 (2)
China                  1 (1)
Greece                 1 (1)
India                  4 (5)
Ireland                1 (1)
Israel/Palestine       3 (3)
New Zealand            3 (3)
Papua New Guinea       1 (1)
Sweden                 1 (1)
United Kingdom         1 (1)
Peru                   1 (1)
Pacific Islands        3 (3)
                       56 (63)

With regard to context, articles were coded as taking place in one of the following contexts: health, education, social services, and community development. Table 2 displays the percentage and frequency of contexts for all articles from 1991-2016.

Table 2: Evaluation contexts represented in the 1991-2016 sample

Context                   Percentage (n)
Community Development     18 (20)
Education                 36 (40)
Health                    25 (28)
Social Services           20 (22)

Review Strategy and Analysis

A complete review of all 110 articles was conducted. Specifically, systematic notes were taken in the following categories for each study: (1) Evaluation framework(s); (2) Qualitative methods; (3) Quantitative methods; (4) Actual/Potential benefit of using quantitative methods; (5) Limitations from not incorporating quantitative methods (when applicable); (6) Role of quantitative measures (when applicable); (7) Challenges with using quantitative measures (when applicable); and (8) Opportunities for including additional quantitative analyses. After this initial review, studies were then coded into the following categories: quantitative methods only, qualitative methods only, and mixed methods. A study was considered to include quantitative methods if the methods or findings reported frequencies, descriptive statistics, or inferential statistics (see the Appendix for a complete list of the quantitative methods reported in each study). All studies were also coded for the intended purpose of the quantitative data and analysis (when applicable).
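As a purely illustrative aside, descriptive coding of this kind can be tabulated into the frequency and percentage summaries reported in Tables 1-3 with a few lines of code. The sketch below is hypothetical: the study entries and code values are invented placeholders, not the actual coded sample or the author's analysis workflow.

```python
# Hypothetical sketch: tabulating coded studies into frequencies and percentages.
import pandas as pd

coded_studies = pd.DataFrame([
    {"study": "Case A", "methods": "mixed",        "context": "Education"},
    {"study": "Case B", "methods": "qualitative",  "context": "Health"},
    {"study": "Case C", "methods": "quantitative", "context": "Education"},
    {"study": "Case D", "methods": "mixed",        "context": "Social Services"},
])

# Frequency and percentage summaries analogous to Tables 1 and 2.
counts = coded_studies["context"].value_counts()
percents = (coded_studies["context"].value_counts(normalize=True) * 100).round(1)
print(pd.DataFrame({"n": counts, "percent": percents}))
```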


Furthermore, studies were identified as to whether they included a quantitative measure. Studies that included a quantitative measure were then coded as representing one of the following: existing measure, evaluator-adapted measure, and evaluator-created measure.[3] Additionally, studies that included a quantitative measure were coded as either discussing or not discussing a challenge with the quantitative measure. This analysis revealed themes and patterns regarding the frequency of quantitative methodology, the predominant quantitative methods used in culturally competent evaluations, the purposes of using quantitative methods, and the challenges with using quantitative measures. After the descriptive coding, all studies were coded for key themes related to quantitative methods that aligned with culturally competent approaches. This thematic coding was informed by the finding from Chouinard and Cousins (2009) related to quantitative methods and quantitative measures. Furthermore, the chapter's conceptual orientation guided the analysis process. Matrices displaying each study's results are provided in the Appendix.

Findings: Quantitative Use and Purpose

This section explores how the sampled articles used quantitative methods, focusing on the purposes for using quantitative methods and trends in the use of specific methods. Regarding the extent to which the sample used quantitative methods, only five percent (n=6) used quantitative methods alone. In contrast, 33 percent (n=36) used only qualitative methods. The reviewed studies predominantly used quantitative methods in combination with qualitative methods; specifically, 62 percent (n=68) used both quantitative and qualitative methods in the evaluation design. Furthermore, the distribution of quantitative, qualitative, and mixed methods studies remained relatively constant throughout the years covered in this review. These findings are consistent with Chouinard and Cousins' (2009) assertion that there are strong arguments for the use of qualitative methods and methodological pluralism when conducting culturally competent evaluations. Sixty-eight percent of the studies utilized some form of quantitative method (either alone or in a mixed methods context). This suggests that quantitative methods have a significant role in evaluations using culturally competent approaches.

Mixed methods studies require an additional layer of investigation to understand the extent to which quantitative methods were used in the reviewed studies. Tashakkori and Teddlie (1998) define mixed methods as studies "that combine the qualitative and quantitative approaches into research methodology of a single study or multiphased study" (p. 17-18). However, researchers and evaluators can combine qualitative and quantitative approaches in a number of ways; thus, several types of designs can be classified as mixed methods. Tashakkori and Teddlie (1998) present a typology for understanding the combination of quantitative and qualitative approaches within a mixed methods design. Specifically, mixed methods designs can be 'dominant-less dominant,' in which an evaluator or researcher conducts the study with an emphasis on either a quantitative or qualitative paradigm. In such studies, the less-dominant approach is a small component of the overall design. Alternatively, mixed methods designs can be 'equivalent,' in which quantitative and qualitative approaches are used equally within the study design.
Using the typology presented in Tashakkori and Teddlie (1998), mixed methods studies were classified as quantitative-dominant, qualitative-dominant, or equivalent status.

[3] Studies were given multiple codes if they contained more than one quantitative measure.


Additionally, due to the reflective nature of several studies, some cases did not provide sufficient information for mixed methods classification and were therefore labeled as ‘unclear.’ Figure 1 outlines the frequency of mixed method study types reviewed in this chapter.

[Figure 1 data: Equivalent 44% (n=30); Qualitative-dominant 28% (n=19); Quantitative-dominant 19% (n=13); Unclear 9% (n=6)]

Figure 1: Mixed methods study classification

Figure 1 shows that most of the mixed methods studies utilized an equivalent design, followed by qualitative-dominant and, lastly, quantitative-dominant designs. For studies with a qualitative-dominant approach, evaluators often included the quantitative component to comply with funder requests; quantitative approaches were largely not seen by the evaluators/authors as an integral part of the evaluation design and subsequent findings. The remaining 63 percent of mixed methods studies, those representing an equivalent or quantitative-dominant approach, used quantitative methods in a way that substantively contributed to the evaluation findings. In some cases, quantitative and qualitative methods complemented one another and triangulated findings. In other cases, the two approaches were used to address different evaluation purposes or questions. Regardless, Figure 1 illustrates that most of the studies classified as mixed methods used quantitative approaches as an integral part of the design.

Specific Quantitative Approaches

Reviewing the specific quantitative methods used in the sample provided insight into the role of quantitative methods in evaluations using culturally competent evaluation approaches. Because many of the studies were written as a reflection on an evaluation case, specific quantitative methods were not detailed in every article. For example, an article might state that Likert-type survey questions were administered, but not provide detail on how these data were analyzed. Despite this lack of detail, findings from the review suggest that basic frequencies and descriptive statistics represented most of the quantitative analytic approaches. Studies often reported frequencies of participation, attendance, and service utilization. Furthermore, many studies compiled descriptive statistics of survey results as the only form of quantitative data analysis.

While frequencies and descriptive statistics were the predominant quantitative approach, a handful of studies did take the quantitative analysis a step further by incorporating inferential analyses of differences between groups. Two studies described using t-tests to detect differences in mean outcomes between population subgroups (Alkon, Tschann, Ruane, Wolff, & Hittner, 2001; Janzen, Nguyen, Stobbe, & Araujo, 2015).


Furthermore, two studies used ANOVAs to explore statistically significant differences in mean outcomes between key population subgroups (Janzen, Nguyen, Stobbe, & Araujo, 2015; Prilleltensky, Nelson, & Valdes, 2000). Notably, one study utilized multivariate regressions to better understand the variables impacting the evaluation outcome (Garaway, 1996). With regard to quantitative evaluation design, three studies implemented a randomized controlled trial (RCT) design (Uhl, Robinson, Westover, Bockting, & Cherry-Porter, 2004; Botcheva, Shih, & Huffman, 2009; Hesse-Biber, 2013). Furthermore, 24 studies used a pre/post or longitudinal design. It is possible that the sampled studies utilized more complicated quantitative methods that they did not report; however, this review can only draw upon what was reported in each study.
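For readers less familiar with these inferential approaches, the sketch below shows the kind of subgroup comparison just described: a t-test for two subgroups and a one-way ANOVA for three. The outcome scores and subgroup labels are simulated for illustration; they are not data from the reviewed evaluations.

```python
# Hypothetical sketch of subgroup mean comparisons (simulated outcome scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(3.4, 0.8, 40)   # mean outcome for an illustrative subgroup A
group_b = rng.normal(3.1, 0.8, 40)   # subgroup B
group_c = rng.normal(3.6, 0.8, 40)   # subgroup C

t_stat, t_p = stats.ttest_ind(group_a, group_b)           # two-group mean difference
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)   # three-group comparison
print(f"t-test: t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"ANOVA:  F = {f_stat:.2f}, p = {f_p:.3f}")
```

As the reviewed studies illustrate, such tests are most meaningful when the subgroup definitions themselves are negotiated with local stakeholders rather than imposed by the evaluator.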

Primary Purposes of Quantitative Measures

To provide context for why evaluators may have chosen specific quantitative methods, this section outlines the studies' purposes for using quantitative methods. Patton (2011) states that there are six primary purposes for evaluation: (1) Overall summative judgement; (2) Formative improvement and learning; (3) Accountability; (4) Monitoring; (5) Developmental; and (6) Knowledge generation. Furthermore, Mertens (2008) asserts that evaluation should be used for social justice purposes, in which the evaluation process and findings are intentionally used to improve societal inequities; studies that emphasized social justice as their primary purpose were therefore categorized under this label. Several studies had multiple purposes; however, the quantitative approach used within each evaluation design served a purpose related to one of the following categories: overall summative judgement, formative improvement and learning, developmental, knowledge generation, and social justice. For the studies that included quantitative methods, Table 3 outlines the distribution of purposes.

Table 3: Quantitative purposes used in the 1991-2016 sample

Purpose                               Percentage (n)
Overall summative judgment            49 (36)
Formative improvement and learning    23 (17)
Developmental                         4 (3)
Knowledge generation                  12 (9)
Social justice                        11 (8)

Quantitative methods are often used for impact or outcomes-focused evaluations; therefore, it is not surprising that the most common quantitative purpose in Table 3 was 'overall summative judgement.' What is perhaps more interesting is that 51 percent of the studies used quantitative methods for a purpose other than an overall summative judgement. For the cases that used quantitative methods for formative purposes, evaluators mainly used quantitative descriptive statistics from surveys to understand stakeholder opinions about a program. In other situations, quantitative data were collected as part of a needs assessment effort that informed program development. Moreover, quantitative student achievement and behavioral data were used to help teachers provide targeted interventions for their students.


Patton (2011) described the developmental purpose of evaluation as "adaptation in complex, emergent, and dynamic conditions" (p. 130). Under this purpose, studies primarily used quantitative data to understand whether programmatic progress was being made in the desired direction. This purpose typically occurred in a context in which the program or intervention was still under development and specific outcomes were still unknown. While only four percent of studies fell under this category, this review suggests that quantitative methods can contribute to developmental purposes.

Knowledge generation refers to evaluations that contribute to a general understanding about programs and what makes them successful (Patton, 2011). Quantitative analyses allow for capturing breadth and making generalizations. Accordingly, the studies that fell into this category had a randomized controlled trial design or used multiple regressions. These studies used quantitative data to show trends and patterns that enhanced general knowledge about a specific issue related to the evaluation context.

The social justice purpose represents evaluations that explicitly consider the needs and circumstances of the program population. While these studies may also overlap with Patton's purposes, they are unique in that the purpose of the quantitative analysis was to promote social justice. This purpose is therefore closely aligned with cultural competence. Studies categorized under this purpose used quantitative approaches to help communities and populations advocate for themselves. For example, in Robertson et al. (2004), the evaluation created the opportunity for a Native American community to collect data about their community. Specifically, the community collected data related to policing activity and police turnover rates. This data was used to advocate for an improved policing system in the community and empower the local tribal courts. In other studies, such as Small et al. (2006), evaluators used descriptive statistics from a survey to better understand life outcomes for an underrepresented population and improve service delivery. In other cases, such as Thurman et al. (2004), the data collected for the evaluation served as a way for the local population to better articulate their needs to external stakeholders, as well as to illustrate a vision for future funding applications and resources. The cases that used quantitative methods for a social justice purpose did not describe quantitative analyses beyond frequencies and descriptive statistics, raising the question of whether and how additional analyses could improve social justice efforts (to be explored in Chapter Three).

Many studies using quantitative methods did so by including quantitative measures, primarily in the form of a survey. Given the prevalence of quantitative measures, as well as Chouinard and Cousins' (2009) finding regarding the challenges with using pre-existing measures, the following section focuses on the uses of, and common challenges in developing, quantitative measures with culturally competent approaches.

Findings: Measurement Use and Challenges

Among the 110 reviewed studies, 58 percent (n=64) included at least one quantitative measure in the evaluation design. Quantitative measures were defined as instruments that numerically assessed the degree or prevalence of a particular latent or physical trait. This included cognitive and non-cognitive instruments as well as biological or physical assessments.

Furthermore, measures were categorized as one of three types: (1) Existing, (2) Adapted, or (3) Evaluator-created. Figure 2 displays the frequency for each of the categories.[4]

[Figure 2 data: Evaluator-created 50%; Existing 36%; Adapted 14%]

Figure 2: Frequency of measure designs

Existing measures were not specifically designed for the evaluation case reported in the study. Examples of common existing measures included social/emotional measures and standardized tests. Among the studies using quantitative measures, 27 studies used an existing measure in the evaluation design. Adapted measures were altered at the item level and/or translated to make items more appropriate for the evaluation context. Evaluators often adapted measures when a funder or external agency mandated specific existing measures that needed to be modified to fit the local context. Ten of the studies adapted an existing measure as part of the evaluation design. An evaluator-created measure referred to an instrument created specifically for the evaluation context. Thirty-seven of the studies included an evaluator-created measure as part of the evaluation design. Given Chouinard and Cousins' (2009) finding regarding the frequent mismatch between an existing instrument and the local culture, Figure 2 may suggest that some evaluators approach this issue by creating a measure specifically for the evaluation context. However, 37 of the studies still used an existing or adapted measure.

Consistent with Chouinard and Cousins' (2009) finding, 31 studies reported a challenge with using quantitative measures. Five central challenges emerged related to the use of quantitative measures: (1) Challenges with translation; (2) Challenges with funder expectations; (3) Challenges with existing instrument content validity; (4) Challenges with data collection; and (5) Challenges with item formats.

Challenges with translation. The first challenge related to translating an existing instrument into a local language or languages. Often, an existing instrument's vernacular did not conceptually transfer to the local language (Alkon, Tschann, Ruane, Wolff, & Hittner, 2001). This challenge arose when translating concepts that have culturally-specific meanings (Clayson, 2002; Small, Tiwari, & Huser, 2006; Stokes, Chaplin, Dessouky, Aklilu, & Hopson, 2011). For example, Clayson et al. (2002) noted that the term "self-sufficiency" did not have the same cultural meaning among the Hispanic/Latino program participants, which created issues when translating existing measures from English to Spanish. Furthermore, evaluators working in K-12 education settings found difficulty with translating student assessment materials (such as reading passages and item stems) in a way that was relevant for the local student population (Slaughter, 1991).

[4] Some studies had more than one measure. All measures were included in Figure 2; therefore, the total is greater than the total number of studies using quantitative measures.

Challenges with funder expectations. In several of the studies, the funder mandated specific instruments for evaluation. Often, funders sought to aggregate evaluation findings across multiple sites and therefore wanted consistency in instrumentation. However, these funder-mandated measures were typically not relevant to the local culture, both linguistically and conceptually (Coppens, Page, & Chan Thou, 2006; Small, Tiwari, & Huser, 2006; Hilton & Libretto, 2016; Ziabakhsh, 2015). In some instances, the local evaluator chose not to implement the funder-mandated measures and used another method for assessing program impact. Copeland-Carson (2005) reported a case in which the local program evaluators conflicted with the funder over the evaluators' decision to refrain from using culturally inappropriate measures of leadership and empowerment.

Challenges with existing instrument content validity. The most prevalent challenge related to misalignment between survey content and local norms or ways of knowing. In other words, the construct measured in the existing instrument did not match the construct as seen by the community (Butty, Reid, & LaPoint, 2004; Cervantes & Pena, 1998; Clayson, 2002; Running Wolf, et al., 2002; Ryan, Chandler, & Samuels, 2007; Chilisa, Major, Gaotlhobogwe, & Mokgolodi, 2016; Easton, 2012; Le Menestrel, Walahoski, & Mielke, 2014). Existing measures are often validated on one, typically dominant, population. Content validity issues arose when the population used to validate the existing measure was not the same as the population in the evaluation context. For example, Coppens et al. (2006) reported that the existing instruments mandated for the evaluation were based on an 'independent' American culture, which was misaligned with the 'collective' culture observed in the Cambodian program participants. Furthermore, Novins et al. (2004) noted that several measures focused on participant deficits, whereas it may have been more culturally appropriate to focus on participant strengths. In one study, Botcheva et al. (2009) noted that the results from standardized instruments contradicted the qualitative interview data. Given this discrepancy, Botcheva et al. (2009) suggested that the standardized instruments were not accurately measuring participant perspectives of program impact. In another case, Janzen et al. (2015) administered a combination of evaluator-designed and existing measures. When assessing the reliability of each measure via Cronbach's alpha, Janzen et al. (2015) found that the evaluator-designed measures had a higher Cronbach's alpha, suggesting that the evaluator-designed measures were more reliable than the existing measures.
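For context on the reliability comparison just described, Cronbach's alpha is computed from the item variances and the variance of the total score. The sketch below is a minimal, hypothetical illustration with simulated Likert-type responses; it is not the analysis reported by Janzen et al. (2015), and the function and data are invented for demonstration.

```python
# Hypothetical sketch: Cronbach's alpha for a small set of simulated Likert items.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of numeric responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
# Simulate 5 correlated 1-5 items for 60 respondents (shared trait plus noise).
trait = rng.normal(0, 1, (60, 1))
responses = np.clip(np.round(3 + trait + rng.normal(0, 0.7, (60, 5))), 1, 5)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

A comparison like the one Janzen et al. (2015) describe would simply apply such a calculation to the evaluator-designed and existing measures separately and compare the resulting coefficients.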
Challenges with data collection. In several cases, administering a measure created a set of challenges. For example, Stokes et al. (2011) reported a challenge with administering a measure in an online format. The program participants had limited access to technology and could therefore not take the quantitative online survey. While not necessarily isolated to quantitative measures, Bevan-Brown (2001) noted a low survey response rate due to the local population's preference for oral rather than written communication. Similarly, low response rates also occurred due to the highly mobile nature of certain populations (Baizerman & Compton, 1992). Additionally, Bowen & Tillman (2014) discussed challenges with using survey administrators who were not a part of the local culture. These administrators used an interview style for survey administration but were not able to pick up on particular cultural nuances, such as local terminology and measuring systems, that local survey administrators understood.


Challenges with item formats. In some cases, item design created challenges for the evaluator. For example, Coppens et al. (2006) noted that semantic differential items with contrasting adjectives were difficult for the local population to interpret due to limited English language skills and cultural differences. Furthermore, Bowen & Tillman (2014) noted challenges with converting units from the local measurement system (which was not consistent across farms, and thus difficult to understand) to the metric system when interpreting responses about the size of farm plots. The local population operated via their own measurement system that did not directly translate to the metric system used in the survey.

These five challenges illustrate how and why evaluators struggle to align quantitative methods and measures with culturally competent evaluation approaches. However, despite these challenges, the reviewed studies also reported using quantitative approaches in ways that positively complemented culturally competent evaluation approaches. The following section outlines key themes related to aligning quantitative methods with culturally competent approaches.

Culturally Competent Quantitative Approaches

While no single 'toolkit' or 'how-to' manual exists for conducting culturally competent evaluations, examining evaluators' approaches for engaging with a culture provides direction for evaluation practice. As noted by Chouinard and Cousins (2009), a systematic investigation into the ways in which existing evaluations have utilized quantitative methods that align with culturally competent approaches has yet to be conducted. Therefore, this section presents culturally competent quantitative approaches uncovered through reviewing this sample of 110 studies. This compilation of approaches is not meant to serve as a comprehensive guide for how evaluators should align quantitative methods with culturally competent approaches. Rather, this review is intended to contribute to the conversation around the role of quantitative methods in culturally competent evaluations in order to improve evaluators' responsiveness to local culture and context. Key approaches are organized into the following sequential categories: evaluation design, instrument selection, instrument design, instrument validation, instrument administration, data analysis, and dissemination and use of findings.

Key Approaches Related to Quantitative Evaluation Design

As mentioned, this sample of 110 studies included randomized controlled trial and quasi-experimental evaluation designs. When using these designs, evaluators implemented specific steps to ensure that the design was responsive to the local culture. For example, Connor et al. (2004) used a delayed control group to ensure that both the treatment and control groups received the positive benefits from the intervention. Similarly, Fisher and Ball (2002) used a multiple baseline design to ensure equity in program benefits across all members of the local community. Arsenault et al. (2016) stated that when working within the prison system, equity was maintained by using a separate but demographically comparable prison population as the comparison group, rather than separating the intervention prison population into treatment and control groups.
As another approach, evaluators consulted the community when creating a procedure for randomization (Uhl, Robinson, Westover, Bockting, & Cherry-Porter, 2004; Hesse-Biber, 2013). For example, Uhl et al. (2004) conducted focus groups with community members around randomization procedures. The publicly transparent nature of the randomization procedure was viewed as critical to the evaluation's success. Specifically, community members decided that randomization should be done in a community meeting setting in which participants would blindly select a coffee mug. The color of the coffee mug determined whether a participant would be placed in the treatment or control group (all participants were compensated regardless of group selection). Consulting the community when designing the randomization built critical rapport and buy-in (Uhl, Robinson, Westover, Bockting, & Cherry-Porter, 2004).

In addition to approaches specifically related to experimental designs, other studies noted approaches that applied to non-experimental conditions. For example, Letichevsky & Penna Firme (2012) incorporated a meta-evaluation into their evaluation design. Reflecting on the evaluation process, including the quantitative aspects, at the end of the evaluation helped ensure the quality and cultural responsiveness of the evaluation. Moreover, Luo and Liu (2014) used a design called the Participatory Rural Appraisal (PRA) approach for assessing the impact of an intervention in rural China. This approach first gathered community members at a public meeting. They were then asked to rate their satisfaction with the intervention by marking a numeric score on butcher paper. Participants felt comfortable in a community setting. They also felt that they had agency in the evaluation by participating in the data collection and tabulation of results. The PRA strategy yielded quantitative evaluation data considered more valid than the questionnaire data (which received a low response rate due to content validity and data collection challenges).

Key Approaches Related to Quantitative Instrument Selection

In cases in which evaluators used existing quantitative instruments, evaluators took specific steps that allowed them to engage with and respond to local culture. Most often, evaluators used specific criteria for selecting instruments. These criteria included variables related to local culture and cultural context (Alkon, Tschann, Ruane, Wolff, & Hittner, 2001; Butty, Reid, & LaPoint, 2004; Cervantes & Pena, 1998; Slaughter, 1991). Evaluators selected instrument criteria through literature reviews and consultation with community members. Additionally, Mertens & Zimmerman (2015) selected an instrument that had already been culturally validated with the local population. As another approach, Dura et al. (2014) described the use of 'cultural beacons' to identify instruments and measures. A cultural beacon is an object or response that provides important information about the local culture; the cultural beacon is identified by local stakeholders as an important measure. For example, Dura et al. (2014) described how participatory sketches were used to measure program impact. Several participants included images of themselves resting on a mat underneath a tree. Local stakeholders pointed out that this image indicated that the participant could afford self-care and leisure time; thus, the image of a mat underneath a tree became a measure for monitoring community integration (Dura et al., 2014).

Key Approaches Related to Quantitative Instrument Design

Half of the sampled studies created quantitative instruments specifically for the evaluation context.
When designing an instrument intended for a quantitative analysis, evaluators considered and incorporated local culture by generating instrument questions with community stakeholders (Bevan-Brown, 2001; Richmond, Peterson, & Betts, 2008; Cornachione, Trombetta, & Nova, 2010; Hopson R. , 2009; Janzen, Nguyen, Stobbe, & Araujo, 2015; Letichevsky & Penna Firme, 2012; Mitakidou, Tressou, & Karagianni, 2015). Most often, this entailed a series of workshop-style meetings in which members from the community drafted and confirmed

questions alongside the evaluator. Similarly, other evaluators organized focus groups to help generate instrument questions (Zulli & Frierson, 2004; Bowen & Tillman, 2015; Mertens, 2013). Evaluators used key themes that emerged from focus groups to create culturally specific domains and construct quantitative survey questions. Two cases discussed incorporating African proverbs into the instrument design as a way to set measurement standards that judged the merit or worth of an intervention (Chilisa, Major, Gaotlhobogwe, & Mokgolodi, 2016; Easton, 2012). Other cases incorporated community input when adapting an existing instrument to the local context. This input was used to modify the language of specific items to make them more appropriate for local community norms (Fisher & Ball, 2002; Running Wolf, et al., 2002). With regard to the way questions were organized into an instrument, Conner (2004) suggested asking the most non-sensitive questions first as a means of building rapport and comfort with participants. By doing so, participants were more likely to respond honestly to sensitive questions asked later in the instrument. Furthermore, Botcheva et al. (2009) discussed using vignettes and imaginary characters as question stems. This approach allowed Botcheva et al. (2009) to present questions in a format that was congruent with the local population’s narrative culture. And finally, when translating an instrument into another language, Alkon et al. (2004) used back-translation as an important way to respond to local cultural nuances during the translation process. Key Approaches Related to Quantitative Instrument Validation After instrument selection and design, the validation process assists with ensuring that the instrument is appropriate for the local culture. Evaluators using quantitative instruments highlighted a variety of strategies to ensure a measure’s validity. With regard to face validity5, many evaluation cases presented instruments to community stakeholders for review (Bevan-Brown, 2001; King, Nielson, & Colby, 2004; LaPoint & Jackson, 2004; Prilleltensky, Nelson, & Valdes, 2000; Nevarez, et al., 2013; Willging, Helitzer, & Thompson, 2006; Blackman, Wistow, & Byrne, 2013; Grimes, Dankovchik, Cahn, & Warren-Mears, 2016; Zoabi & Awad, 2015). Letiecq & Bailey (2004) stated that they consulted a specific cultural facilitator to ensure question validity. Furthermore, a set of cases reviewed instruments with program participants (as opposed to stakeholders who were not directly participants in the program) as a way to ensure validity (Bevan-Brown, 2001; Conner, 2004; Coppens, Page, & Chan Thou, 2006; LaPoint & Jackson, 2004). Moreover, Willging et al. (2006) described using cognitive interviews with program participants to validate instrument terminology. Finally, several cases piloted instruments with a representative population prior to formal administration (Botcheva, Shih, & Huffman, 2009; Grimes, Dankovchik, Cahn, & Warren-Mears, 2016; Janzen, Nguyen, Stobbe, & Araujo, 2015; Letichevsky & Penna Firme, 2012; Luo & Liu, 2014; Shoultz, et al., 2015). This piloting process allowed evaluators to modify questions that survey participants saw as inappropriate or confusing. Key Approaches Related to Quantitative Instrument Administration A quantitative instrument’s administration process has direct implications on the quality of the data. Often, there may be culturally-specific nuances related to instrument administration.
Several cases highlighted ways in which evaluators were responsive to such nuances when

5 Face validity is considered a subjective test of whether or not an instrument measures what it is intended to measure.

administering quantitative instruments. The most prevalent strategy was utilizing community members for data collection (Baizerman & Compton, 1992; LaPoint & Jackson, 2004; Uhl, Robinson, Westover, Bockting, & Cherry-Porter, 2004; Arseneault, Plourde, & Alain, 2016; Bowen & Tillman, 2015; Le Menestrel, Walahoski, & Mielke, 2014; Nevarez, et al., 2013). In such cases, members from the participant community administered surveys or measures to ensure that participants felt comfortable and administrators were trusted. Moreover, some cases used youth from the community for instrument administration (Letiecq & Bailey, 2004; Brandão, Silva, & Codas, 2012; O'Hara, McNamara, & Harrison, 2015). This strategy was chosen when the participants themselves were youth. Hiring youth to administer surveys built rapport among participants and empowered community youth to learn employable skills. When survey administrators were not from the community, Bowen & Tillman (2014) provided culturally responsive evaluation training to ensure that administrators would be responsive to the local community. With regard to the format of instrument administration, some cases noted that the online medium was not culturally relevant for specific populations and was therefore not used (LaFrance, Nelson-Barber, Rechebei, & Gordon, 2015; Stokes, Chaplin, Dessouky, Aklilu, & Hopson, 2011). When working with populations with low literacy levels, Conner (2004) described administering surveys in small groups. Using a small group format, survey administrators gave each participant a survey and then wrote the question and answer options on large flipcharts. The administrator walked survey participants through each question, reading each question and answer option aloud and pointing to each word as it was read. Key Approaches Related to Quantitative Data Analysis The studies in this review pointed to specific strategies used during quantitative data analysis that allowed the evaluator to engage with and respond to culture. Primarily, several evaluators reported disaggregating findings by key subgroups such as gender or race (Butty, Reid, & LaPoint, 2004; Nelson-Barber, LaFrance, Trumbull, & Aburto, 2005; Ryan, Chandler, & Samuels, 2007; Voyle & Simmons, 1999; Luo & Liu, 2014; Samuels & Ryan, 2010; Zamir & Abu Jaber, 2015). In doing so, evaluators noticed differences and explored the reasons for these differences using qualitative techniques. Furthermore, disaggregation allowed evaluators to point out inequities between subgroups and advocate for underrepresented groups. Alkon et al. (2001) added more rigor to describing differences in subgroups by using t-tests and ANOVAs to determine statistically significant differences. Additionally, Lustig et al. (2015) illustrated a case in which student achievement data was reported at both the grade level and the individual level. This disaggregation allowed teachers to review each individual student in comparison to the whole group. In turn, the teacher directed appropriate interventions to individual students. In a few cases, evaluators discussed specific analytic techniques that allowed them to better engage with participant culture. In general, Thomas (2004) suggested using more than one strategy or statistical technique for analyzing the same data. Garaway (1996) described how multivariate statistical techniques helped the evaluator better understand the relationship between intermediate variables and the outcome measure of literacy acquisition.
Most of the cases reviewed in this study used basic descriptive statistics or frequencies; a handful of cases noted that community members or program participants completed this analysis (Fetterman D. , 2005; Peter, et al., 2003; Brandão, Silva, & Codas, 2012). Doing so allowed for a greater sense of community engagement with the analysis and a richer interpretation of quantitative findings.
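To make these analytic strategies concrete, the brief sketch below (written in Python with the pandas and SciPy libraries) shows one way an evaluator might disaggregate satisfaction ratings by a subgroup variable, tabulate frequencies and descriptive statistics, and test for a between-group difference. The data, variable names, and subgroup labels are invented for illustration and are not drawn from any of the reviewed studies.

```python
# Hypothetical illustration of subgroup disaggregation and a significance test.
# Column names ("subgroup", "satisfaction") and the data are placeholders.
import pandas as pd
from scipy import stats

# Participant-level survey responses (synthetic).
df = pd.DataFrame({
    "subgroup": ["A", "A", "A", "B", "B", "B", "A", "B"],
    "satisfaction": [4, 5, 3, 2, 3, 2, 4, 3],
})

# Descriptive statistics and frequencies, disaggregated by subgroup.
print(df.groupby("subgroup")["satisfaction"].describe())
print(pd.crosstab(df["subgroup"], df["satisfaction"]))

# Welch's t-test for a difference in mean satisfaction between the subgroups.
a = df.loc[df["subgroup"] == "A", "satisfaction"]
b = df.loc[df["subgroup"] == "B", "satisfaction"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

As the cases above caution, a significance test of this kind is only a starting point; the reasons behind any observed difference would still be explored with, and interpreted by, the local community.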


Key Approaches Related to Dissemination and Use of Quantitative Data Approaches related to dissemination and use of quantitative data often occurred in performance management and monitoring contexts. In such contexts, evaluators stressed building capacity among participants to continue collecting and utilizing data in meaningful ways (Nelson-Barber, LaFrance, Trumbull, & Aburto, 2005; Hilton & Libretto, 2016). Similarly, evaluators noted that all approaches should be conducted with the explicit intention of giving back to the community (Hesse-Biber, 2013; Mertens & Zimmerman, 2015). Thus, it was important to involve the community with interpreting findings and ensuring that all stakeholder voices were heard in the process (Christie & Barela, 2005; Mitakidou, Tressou, & Karagianni, 2015). Some studies noted publishing findings in community newsletters or holding public forums to discuss quantitative findings (Prilleltensky, Nelson, & Valdes, 2000; Thurman, Allen, & Deters, 2004). This compilation of culturally competent quantitative approaches presents a variety of strategies and techniques for engaging with and responding to culture. Contextualizing these approaches with the findings presented earlier in the chapter provides a more complete picture of the role of quantitative methods in evaluations using a culturally competent approach. Thus, the following section summarizes findings to show what is suggested and implied by this review. Summary of Findings This chapter systematically reviewed the ways in which quantitative methods have been integrated into evaluations using culturally competent approaches. Findings revealed that despite the conceptual alignment between qualitative methods and culturally competent evaluation, most of the studies utilized quantitative methods (either alone or in a mixed methods design). Evaluators most commonly used quantitative methods to support an overall summative judgment. This finding reminds us that evaluators often operate within a context of external funders and pressures to demonstrate impact. While qualitative methods can be appropriate for showing impact, several funders, including the federal government, prioritize quantitative approaches (American Evaluation Association, 2003). Evaluators seeking to use culturally competent approaches are often confronted with the unavoidable use of quantitative methods. This review shows that this reality does not have to come at the cost of prioritizing cultural competence in evaluation. As findings demonstrate, in addition to summative purposes, quantitative methods have been used for social justice purposes. Quantitative methods do not have to be restricted to an exercise conducted to appease a funder, but could in fact help the evaluator meaningfully consider and include culture, as well as empower the local community. Based on the findings from this review, quantitative measures can play a positive role in helping evaluators thoughtfully incorporate local culture and settings. While Chouinard and Cousins (2009) noted the challenges with using existing or standardized measures in culturally diverse contexts, this review points to the use of evaluator-created or adapted measures as a way for evaluators to better include cultural knowledge(s). This study explicated five specific challenges associated with using quantitative measures in culturally competent evaluations: 1. Challenges with translation; 2. Challenges with funder expectations; 3.
Challenges with existing instrument content validity;


4. Challenges with data collection; 5. Challenges with item formats. These challenges point to an underlying issue: a lack of alignment with, or understanding of, the local cultural context. Thus, addressing these potential challenges through creating or adapting quantitative measures would help an evaluator better engage with and respond to culture, as demonstrated by several cases in this review. For example, if an evaluator is presented with a funder-mandated standardized instrument, the evaluator could use this as an opportunity to explore the ways in which the instrument is or is not appropriate for the local culture. By doing so, the evaluator gains a deeper understanding of the local culture. Furthermore, the evaluator can adapt an existing measure and/or create a new one to make quantitative instruments more relevant to the local culture. As a result, findings become more meaningful for the local context. The process of using a quantitative measure with a culturally competent lens can therefore result in the evaluator becoming more engaged with local culture and the quantitative findings becoming more meaningful and useful for the local community. Furthermore, findings indicate where the evaluation field currently stands with regard to culturally competent quantitative approaches. In other words, findings outline the ways in which evaluators currently align quantitative methods with culturally competent approaches. Such findings are summarized in Figure 3.


Category / Summary of Approaches

Quantitative Evaluation Designs
1. When implementing an experimental design, consider using a delayed control group or multiple baseline design to ensure equity across the community.
2. Consult participants about the best way to organize randomization procedures and instrument administration.
3. Plan for a meta-evaluation to assess the quality of the evaluation process, including quantitative methods.

Quantitative Instrument Selection
1. Before selecting an instrument, consult community stakeholders and participants about specific criteria that should be used for selection.
2. Specifically search for instruments that have already been culturally validated.

Quantitative Instrument Design
1. Back-translation of instruments.
2. Generate instrument questions with community stakeholders and/or participants.
3. Utilize design features consistent with narrative culture such as vignettes and imagery.

Quantitative Instrument Validation
1. Review instrument questions with community stakeholders and/or participants.
2. Conduct cognitive interviews with program participants.
3. Pilot instruments with groups representative of the evaluation population.

Quantitative Instrument Administration
1. Use community members and/or youth from the community to collect data/administer surveys.
2. For survey administrators not from the community, provide culturally specific training.
3. Use a format (such as in-person or small group) consistent with the local culture.

Quantitative Data Analysis
1. Disaggregate findings among key subgroups and, when possible, conduct tests of statistical significance for differences between groups.
2. Use more than one strategy or statistical technique for analyzing quantitative data.
3. To the extent possible, engage the local community and/or participants with tabulating frequencies and descriptive statistics.

Quantitative Dissemination/Use
1. Build capacity among local stakeholders to analyze and use quantitative data that is collected on an ongoing basis.
2. Share findings with stakeholders and engage them with interpretation, ensuring that all voices are heard.
3. Use community newsletters, radio, and other public forums to share findings.

Figure 3: Summary of quantitative culturally competent approaches


Figure 3 demonstrates that many evaluators can consciously think about how to use quantitative methods in a way that is consistent with culturally competent approaches to evaluation. While the strategies outlined in Figure 3 provide useful insight into common practices associated with using quantitative measures in culturally competent evaluations, not all of these tactics are necessarily novel or innovative. Considering the advances in statistical modeling and psychometrics that have occurred over the past few decades, Figure 3 also provides a starting point for considering where the evaluation field might develop by further aligning quantitative methods with culturally competent approaches. Conclusion and Next Steps In addition to highlighting existing culturally competent practices, this review prompts a critical reflection as to what is missing from the literature with regards to aligning quantitative methods with culturally competent approaches. While not all cases in this sample discussed the details of the quantitative analysis, it was clear that most studies did not extend their quantitative analyses beyond tabulating frequencies and descriptive statistics. Often, the lack of further analysis was due to small sample sizes; however, in cases with sufficient sample sizes to conduct inferential statistical analyses, it is possible that additional statistical analyses might have helped the evaluator further engage with culture. For example, showing statistically significant differences in satisfaction between or within subgroups may allow a local community to advocate for better resources for a specific subgroup. External stakeholders may find the additional tests of significance as more compelling than a frequency or descriptive statistic, and thus be more likely to pay attention to the evaluation. Furthermore, 24 of the studies used a pre/post or longitudinal design. The studies reviewed in this sample did not provide a thorough insight into how the design and subsequent analyses may or may not help evaluators better engage with and respond to culture. This raises the question: are there specific statistical models or analytic techniques that can complement a culturally competent approach to evaluation? Additionally, findings related to quantitative measures point to the possibility of using the process of designing and analyzing a quantitative measure as a way to meaningfully consider and include local culture. This chapter pointed to strategies such as engaging stakeholders with instrument design and working with stakeholders to administer instruments. These practices suggest a deeper engagement in culture; however, such strategies are not comprehensive. The evaluation field still lacks a clear picture of how the complete process of quantitative instrument design, validation, and analysis can help evaluators better engage with and respond to culture. In other words, while this review suggests that measurement might serve as a bridge between quantitative and culturally competent approaches, the literature lacks a clear, focused example of how this can unfold. Next Steps The purpose of this synthesis and review is not to suggest that quantitative methods can and should be used in every evaluation. Rather, this chapter proposes that there are ways in which quantitative methods can align with culturally competent approaches. 
Thus, in order to improve the evaluation field’s ability to best respond to local culture, future research is needed in the area of culturally competent quantitative methods. Given the predominance of quantitative measures in evaluation, the evaluation field may consider using techniques and practices from

23 measurement and psychometrics as an opportunity to appropriately incorporate culture. In order to better understand how to do so, the evaluation community could benefit from a detailed explanation of a measurement design, calibration, and validation process that aligns with a culturally competent approach. Therefore, Chapter Two of this dissertation presents the case of an evaluation that called for the design of new quantitative measures as part of the evaluation. Chapter Two outlines how the measure design, calibration, and validation process was conducted in order to explicate how evaluators can use measurement to better engage with and respond to local culture. Moreover, once measures are designed, validated and data is collected, the analysis process can provide an additional opportunity for evaluators to align their methods with a culturally competent approach. Yet, the existing literature does not provide significant detail regarding specific quantitative analytic approaches that best allow evaluators to thoughtfully incorporate culture. In other words, how can evaluators move beyond descriptive statistics, t- tests, and ANOVAs to analyze data with a culturally competent lens? When is a complicated quantitative analysis necessary, and how does it help evaluators respond to the local culture? Such questions are particularly salient when analyzing pre/post data. Therefore, Chapter Three of this dissertation extends the evaluation case presented in Chapter Two to show how the pre/post data collected from the quantitative measures can be analyzed in a way that aligns with a culturally competent approach. This dissertation concludes with an overarching discussion of the strengths and limitations of using quantitative methods and the ways in which they may or may not align with a culturally competent evaluation approach. Additionally, this concluding discussion addresses questions related to how the use of quantitative methods interacts with the epistemological assumptions and conceptual frameworks that underlie a culturally competent approach to evaluation.
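As a concrete, hedged illustration of what moving beyond descriptive statistics, t-tests, and ANOVAs might look like for pre/post data, the sketch below fits a simple random-intercept growth model with the statsmodels library. This is not the Latent Growth Item Response model applied in Chapter Three; it is only a minimal example, and the simulated data and variable names (student_id, time, subgroup, score) are hypothetical.

```python
# Minimal sketch, assuming long-format pre/post data (time 0 = pre, 1 = post).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_students = 30
student_id = np.repeat(np.arange(n_students), 2)
time = np.tile([0, 1], n_students)
subgroup = np.repeat(rng.choice(["A", "B"], size=n_students), 2)
person_effect = np.repeat(rng.normal(0.0, 0.5, size=n_students), 2)  # person-level variation
score = (2.0 + 0.4 * time + 0.2 * time * (subgroup == "B")
         + person_effect + rng.normal(0.0, 0.3, size=2 * n_students))

data = pd.DataFrame({"student_id": student_id, "time": time,
                     "subgroup": subgroup, "score": score})

# Random intercept for each student; the time*subgroup interaction asks
# whether pre-to-post growth differs between subgroups.
model = smf.mixedlm("score ~ time * subgroup", data, groups=data["student_id"])
result = model.fit()
print(result.summary())
```

A model of this kind estimates average growth from pre to post while allowing each participant their own starting point, and the time-by-subgroup term connects the analysis back to the disaggregation practices highlighted throughout this chapter.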


Chapter Two Culturally Competent Quantitative Measures: Integrating the BEAR Assessment System with a Culturally Competent Program Evaluation Approach


Chapter One presented a research synthesis exploring the extent to which quantitative methods have been used in program evaluations using culturally competent approaches. This synthesis put an emphasis on how quantitative measures (i.e. surveys, questionnaires) were incorporated into program evaluations using culturally competent approaches, as well as the ways in which quantitative measures added to or detracted from the utility of evaluations aiming to reflect cultural competence. Consistent with Chouinard and Cousins’ (2009) original research synthesis findings, Chapter One’s review showed that several evaluators felt that existing, or ‘off the shelf,’ measures were not appropriate or relevant for the local evaluation context. Specifically, in 48 percent (n=31) of the reviewed articles using quantitative measures from Chapter One, evaluators reported challenges with using quantitative measures, often due to existing measures’ lack of appropriateness and relevancy to the program population. However, this tendency to underutilize quantitative measures due to their lack of consistency with culturally competent approaches begs the question of if and how culturally competent quantitative measures may have improved the evaluations’ impact, utilization, and cultural competency. While Chapter One identified that 58 percent (n=64) of studies did use quantitative measures adapted to or designed for the local context, in these cases, the measures’ calibration and validation approach could have been improved in order to build the psychometric validity and cultural competence of the evaluation. The evaluation literature is therefore in need of a clear example of how to create, calibrate, and validate a measure that is both appropriate for the local evaluation context as well as psychometrically rigorous. To address this gap, Chapter Two presents the case of a university-based evaluation and outlines how the evaluator designed, calibrated and validated two quantitative measures. The evaluator carried out this process through a culturally competent lens using a framework for constructing measures known as the BEAR Assessment System (Wilson & Sloane, 2000). The specific research question addressed in Chapter Two is as follows: How can an evaluator create measures specific to the local program context in a way that integrates a culturally competent approach with the BEAR Assessment System? Furthermore, how can an evaluator calibrate and validate culturally competent measures via the BEAR Assessment System framework? To what extent is this process aligned with the Culturally Responsive Evaluation framework and the American Evaluation Association’s Statement on Cultural Competence in Evaluation? This chapter therefore begins with an overview of the BEAR Assessment System and its complements with culturally competent program evaluation approaches. After doing so, the chapter presents the evaluation case study and proceeds to outline the instrument design process, calibration, validation, and revision of two quantitative measures that incorporated a culturally competent lens. This chapter concludes with a discussion of how the evaluator’s approach to constructing and validating measures was aligned with a Culturally Responsive Evaluation (CRE) framework as well as American Evaluation Association’s (AEA) Statement on Cultural Competence in Evaluation. 
Furthermore, the discussion explores the challenges and limitations that exist when creating and validating quantitative measures with a culturally competent lens. The BEAR Assessment System and Cultural Competence As pointed out by Chouinard and Cousins (2009), using instruments and measures reflective of the local context has been a challenge for evaluators seeking to use quantitative methods within a culturally competent evaluation framework. Specifically, in their review of culturally competent evaluations, Chouinard and Cousins (2009) stated: “A number of the

studies in our sample discussed the difficulty of using predetermined or standardized measures, outcome indicators and instruments to evaluate programs in culturally diverse communities, as they can conflict with localized community and culturally specific practices” (p. 481). Findings from Chapter One confirm that this statement holds true for the literature published from 2009-2016. The BEAR (Berkeley Evaluation and Assessment Research) Assessment System offers a potential framework for overcoming this issue, thus allowing evaluators to create and use measures that are responsive to the local community and culturally specific practices. The BEAR Assessment System (BAS) is a framework for producing embedded assessments (Wilson & Sloane, 2000). This embedded approach is intended to improve the ‘ecological validity’ of an assessment through integrating assessment into instructional materials and day-to-day activities. The BAS framework is therefore complementary to culturally competent evaluation approaches in that it is intended to reflect the activities and practices of the local context. BAS’s four principles, as highlighted by Wilson & Sloane (2000), further reinforce that the BAS framework is complementary to culturally competent evaluation approaches: 1. Developmental Perspective: Prioritizes the process of learning or growth at an individual level; 2. Match between Instruction and Assessment: Ensures that instruction or program activities match the assessment; 3. Teacher Management and Responsibility: Gives agency to the local stakeholders with regard to the analysis and use of assessment results; 4. Quality Evidence: Uses item response models to ensure that the standards of fairness are at the same level as those of traditional or ‘off the shelf’ assessments. Regarding implementation of the four BAS principles, Wilson (2004) relates BAS to the four building blocks for constructing measures. The iterative process of going through the four building blocks allows one to ground the measures in the context of the evaluation. The four building blocks, which will be detailed in the sections below, present the following four steps for constructing measures: 1. Building Block One: Construct Definition; 2. Building Block Two: The Items Design; 3. Building Block Three: Outcome Space Definition; 4. Building Block Four: Measurement Model. The developmental perspective relates to Building Block One in that the construct map is linked to instruction in order to outline how individuals will move through a learning progression or growth. The second BAS principle, match between assessment and instruction, relates to the items design in that the items should deliberately reflect the construct map. Building Block Three, outcome space definition, corresponds to management by teachers; the outcome space provides the roadmap for teachers to understand and use assessment results. And finally, the

fourth step relates to BAS’s fourth principle, in which ‘evidence of high quality’ is reflected in the measurement model via creation of a Wright Map (discussed in later sections). Thus, creating measures using the BAS framework and corresponding four building blocks may provide a promising approach for evaluators looking to create culturally competent measures. These BAS principles and corresponding four building blocks reinforce essential practices highlighted in AEA’s Public Statement on Cultural Competence (American Evaluation Association, 2011). More specifically, Table 4 outlines the essential practices for culturally competent evaluators (as stated in the public statement) alongside the BAS principles to illustrate how they complement one another. Table 4. Comparing BAS with AEA’s Essential Practices for Culturally Competent Evaluators

AEA Essential Practice: Recognizing and responding to differences both between and within cultures
BAS Principle: Developmental Perspective & Quality Evidence

AEA Essential Practice: Accurately reflecting how individuals view their own group membership, treating cultural categories as fluid
BAS Principle: Developmental Perspective

AEA Essential Practice: Consult and engage with the groups who are the focus of the data in order to determine analysis approaches
BAS Principle: Match between Instruction and Assessment

AEA Essential Practice: Engage stakeholders in reflections on assumptions about what constitutes meaningful, reliable, and valid data
BAS Principle: Teacher Management and Responsibility & Quality Evidence

Specifically, BAS’s developmental perspective prioritizes the individual, which allows an evaluator to look at differences both between and within cultures, as well as reflect how individuals want to view their own group membership. Furthermore, BAS’s principle to match instruction with assessment ensures that the evaluator will consult with the group who is the focus of the data to ensure that the instrument is aligned. Additionally, evaluators engage stakeholders by following the BAS principle of teacher management and responsibility, which puts the interpretation of findings in the stakeholders’ hands. Lastly, BAS ensures quality evidence through engaging stakeholders in constructing a validity argument for an instrument. Thus, theoretically, using the BAS approach to constructing measures aligns with the essential practices of culturally competent evaluators. Furthermore, the BAS principles and four building blocks approach align with a culturally responsive evaluation framework. Frierson, Hood, and Hughes (2010) published a guide to conducting culturally responsive evaluation for the National Science Foundation’s 2010 User- Friendly Handbook for Project Evaluation. Specific steps are summarized as follows: Prior to the evaluation: Evaluators recognize their own personal cultural preferences and make a conscious effort to restrict any undue influence they might have on the work.


1) Preparing the evaluation: The evaluator or evaluation team should be fully aware of and responsive to the participants’ and stakeholders’ culture, particularly as it relates to and influences the program. The evaluation director(s) should not assume that racial/ethnic congruence among the evaluation team, participants, and stakeholders equates to cultural congruence or competence. 2) Engaging Stakeholders: Develop a stakeholder group representative of the population the project serves, assuring that individuals from all sectors have the chance for input. 3) Identifying the Purpose/Intent of the Evaluation: a. Culturally responsive process evaluations include careful documentation of program implementation in order to ensure that any cultural nuances are captured. b. Culturally responsive progress evaluation help determine whether the original goals and objectives are appropriate for the target population. c. Culturally responsive summative evaluations fully measure effectiveness of the program and determine its true rather than superficial worth. Thus, it is important to identify the correlates of participant outcomes and measure their effects as well. 4) Framing the right questions: The questions of significant stakeholders must be heard and where appropriate, addressed. Evaluators must ask: what will we accept as evidence when we seek answers to our evaluative questions? 5) Designing the evaluation: This occurs after the evaluation questions have been properly framed, sources of data have been identified, and the type of evidence to be collected has been decided. Stakeholders are involved during each phase. 6) Selecting and Adapting the Instrumentation: Additional pilot testing of instruments should be done with the cultural group or groups involved in the study to examine their appropriateness. If problems are identified, refinements and adaptations of the instruments should be made so that they are culturally sensitive and thus provide reliable and valid information about the target population. 7) Collecting the data: Evaluation data collectors should be trained to understand the culture in which they are working. If interviewer training is entered with the spirit of openness and self-improvement, the results for collecting culturally responsive evaluative data can be considerable. 8) Analyzing the data: Create review panels principally composed of representatives from stakeholder groups to examine evaluative findings gathered by the principal evaluator and/or an evaluation team. Disaggregation of data sets is highly recommended because evaluative findings that dwell exclusively on whole-group statistics can blur rather than reveal important findings. 9) Disseminating and utilizing results: Information from the program evaluation is received by a wide range of individuals who have an interest or stake in the program or project. The dissemination and use of evaluation outcomes is thought through early when preparing an evaluation, that is, during the evaluation planning phase.


The BAS approach complements the stakeholder involvement that permeates the CRE framework. Additionally, the CRE steps related to instrumentation, data collection, analysis, and dissemination are incorporated into the four building blocks approach through ensuring piloting, review panels, culturally sensitive instrument revisions, and appropriate dissemination of findings. However, BAS has largely been implemented in an educational assessment context, as opposed to a program evaluation context. Therefore, the question remains regarding exactly how evaluators can intentionally integrate quantitative measures into a culturally competent evaluation approach. Furthermore, what does this process look like in a real-world evaluation? This inquiry is reflective of contemporary research agendas related to culturally competent evaluation, as discussed in Chapter One. Thus, this chapter presents a case in which I (the evaluator) used quantitative measures within a culturally competent evaluation approach. Both the CRE framework and AEA’s essential practices for culturally competent evaluators are used as guidelines to demonstrate how the BAS and four building blocks approach can be used to create culturally competent quantitative measures. The Evaluation Case: The UC Berkeley Athletic Study Center The National Context Among National Collegiate Athletic Association (NCAA) division one universities, significant gaps in graduation rates exist between student athletes – in particular, Black male athletes – and the rest of the undergraduate population (Comeaux, 2015). Academic information on student athletes systematically collected by the NCAA exposes this issue. Specifically, all NCAA division one sports teams are subject to academic progress rate (APR) standards. The APR is represented by a numeric score in which a team is given one point for every team member who stays in school and one point for every team member who is athletically eligible to compete. During the 2014-15 season, 36 NCAA division one teams did not meet the APR standards, suggesting that student athletes are not maintaining academic eligibility or staying in school (Hosick, 2014). Specifically, the teams receiving sanctions were: baseball (one team), football (nine teams), men’s basketball (eight teams), men’s cross country (two teams), men’s golf (one team), men’s soccer (one team), men’s tennis (two teams), men’s indoor track (six teams), men’s outdoor track (three teams), wrestling (two teams), and women’s lacrosse (one team). Therefore, sanctions were relatively concentrated in men’s basketball and football. UC Berkeley is no exception to the aforementioned issues. For example, in 2014, the average GPA of UC Berkeley undergraduates was 3.30 while the average GPA of UC Berkeley student-athletes was 2.92. Furthermore, 4.1% of student athletes were on academic probation, whereas 2.3% of UC Berkeley undergraduates were on probation (DeShong, 2014). Thus, on average, student athletes have noticeably lower GPAs and a higher frequency of academic probation than the rest of the undergraduate population. The above narrative, however, is not meant to paint a picture of the student athlete as academically unmotivated or disengaged. The reasons underpinning the gap in achievement between student athletes (particularly in the revenue-generating sports) and non-student athletes are complicated and nuanced. Student athletes have unique demands and pressures as compared with the rest of the undergraduate population.
For example, student athletes must balance a rigorous training and competition schedule with the demands of a top research university (Raedeke, Lunney, & Venables, 2002). Moreover, student athletes are stigmatized in

30 academic spaces, making it difficult to fully engage with faculty and fellow students (Simons, Bosworth, Fujita, & Jensen, 2007). Student athletes’ demanding schedules, coupled with a lack of acceptance into academic spaces, often results in burnout that in turn, negatively impacts a student athlete’s expectation to graduate (Fearon, Barnard-Brak, Robinson, & Harris, 2011). Thus, institutions have a role to play in ensuring that student athletes are able to manage their schedules and feel accepted on campus; otherwise, a student athlete may be set up for academic failure. As stated by Comeaux (2015): “Institutions need to make sure that the student athlete who is cheered in the stadium, glorified in national television, and prominently displayed in institutional materials is also earning academic credits toward on-time graduation, participating in career-enhancing internships, and receiving academic support from faculty and others,” (p. vii). At some universities, such as the one described in this chapter, student athletes enroll with significantly lower SAT scores and GPA levels than their non-student athlete peers; these student athletes are less academically prepared for the rigor of a top research university. Therefore, NCAA division one universities have student athlete support centers to address the unique issues student athletes face when navigating their paths to graduation. The Local Context UC Berkeley is an NCAA division one university and therefore has specific support systems aimed at ensuring student athlete academic, athletic, and personal success. Specifically, the Athletic Study Center (ASC) at UC Berkeley provides a comprehensive set of services aimed to help student athletes make progress toward degree as well as engage and participate as full members of the academic community. Figure 1 below highlights the ASC’s service model.

Figure 4: The Athletic Study Center Service Delivery Model


Given the diversity of the student athlete population, the ASC prioritizes cultural competency in service delivery. This emphasis on culturally competent service delivery is reflected in the fact that the ASC consciously designs their services to incorporate the multiple backgrounds represented among UC Berkeley student athletes. As shown in Figure 1, the ASC’s services are intended to do more than simply retain students and keep them athletically eligible (per the APR requirements). The ASC takes a holistic approach to service delivery, with the intention of developing student athletes who are “independent, self-reliant, successful young adults.” Thus, the ASC’s programmatic goals seek to tell a complete story of the student athlete experience and go beyond APR standards to give a well-rounded and balanced perspective of student athlete success. As reflected in Figure 1, the ASC believes that academic development and personal development are not mutually exclusive; in other words, one’s academic development is affected by personal development. This model reflects best practices of academic support for student athletes. As highlighted by Van Rheenen (2015), academic support centers solely focusing on academic monitoring do not adequately support student athletes; rather, support centers must balance student athlete development and APR requirements. To bolster the ASC’s services related to personal development, in 2016 the ASC created a two-unit seminar course (referred to as ED 98) for freshman student athletes. This course was intended to help incoming student athletes develop their sense of identity, community, and leadership skills on the UC Berkeley campus. ED 98 focused on empowering student athletes to ‘own’ their experience at UC Berkeley. In other words, the course prompted student athletes to identify what they want out of their experience at UC Berkeley and provided them with the tools, resources, and mindset approaches to do so. Activities and exercises guided student growth in three areas: consciousness of self, consciousness of others, and consciousness of context. Growth in these three areas was intended to foster a better understanding of self-efficacy, a platform for setting meaningful goals, and a skill set to navigate and utilize campus resources. Therefore, in thinking about an evaluation of the ASC, evaluative measures needed to extend beyond the APR in order to capture the full range of services provided. Evaluation at the ASC As an organization, the ASC is committed to consistently reflecting on and improving their services. Throughout 2016, the ASC Leadership Team6 engaged in a series of workshop meetings to identify, refine, and measure the program’s mission, goals, and outcomes (the complete list of goals, outcomes, and measures can be found in the Appendix). The Leadership Team viewed this exercise as necessary for integrating the ASC’s three departments (see Figure 1) and developing a better understanding of whether services were resulting in the intended impact. With the mission, goals, and outcomes solidified, the next step was to identify measures for these outcomes. Therefore, the ASC solicited the help of an external evaluator to assist with refining existing instruments, as well as identifying and creating new measures. Prior to this process, the ASC had been engaging in internal evaluative efforts.
Specifically, departments designed and implemented student, tutor, and coach feedback surveys, as well as maintained an extensive database system related to student athlete academic progress. As the ASC and the external evaluator discussed how existing measures aligned with the revised

6 The Leadership Team is comprised of five ASC staff members: The Director, Assistant Director for Advising, Assistant Director for Transition and Development, Academic Director for Academics and Tutorial, and Business Manager

32 goals and outcomes, it became clear that the ASC staff sought a better understanding of how the holistic aspects of their service delivery impacted students throughout the course of their time at UC Berkeley. Referring to Figure 1, the ASC was interested in the extent to which their services contributed to a greater sense of belonging on campus, as well as the development of student athletes as self-reliant individuals. The ASC viewed the ED 98 freshman seminar course as a starting point for student athletes to begin building their identity, academic confidence, and athletic development on campus. Thus, the evaluator recognized that tracking growth in personal development factors was useful to the ASC for two reasons: 1) Understanding the factors mitigating a student athlete’s academic development and 2) Understanding the growth trajectory in personal development attributes emphasized by ASC services. When thinking about which factors represented what the ASC was attempting to foster, the evaluator and the Leadership Team reviewed the ASC goals and outcomes, as well as the syllabus for ED 98. Through this review and synthesis, the ASC and the evaluator decided on two measures that would best reflect ASC services and complement existing evaluative activity. These two measures were designed to assess sense of belonging and self-reliance. Regarding sense of belonging, the literature has clearly documented a student athlete’s propensity to feel alienated on campus (Simons, Bosworth, Fujita, & Jensen, 2007; Fearon, Barnard-Brak, Robinson, & Harris, 2011; Van Rheenen, McNeil, Minjares, & Atwood, 2012; Van Rheenen D. , 2011). Such feelings may result in a lack of engagement with the academic community and poor academic performance. The activities promoted by the ASC via ED 98, tutorial, advising, and academic development departments were all intended to support students to engage with the university and overcome feelings of alienation.

Self-reliance refers to a student’s ability to take ownership over their experience, which was a primary focus of the ED 98 course. Furthermore, advising intends to give students the knowledge they need in order to make progress toward degree and tutorial services are meant to foster the skills necessary for students to rely on themselves to complete their work with academic rigor. While students receive services throughout their enrollment, services are scaffolded over the years so that students begin to navigate campus life and academics on their own. Thus, the ASC’s services are intended to foster self-reliance which, in turn, contributes to achieving the ASC’s stated goals.

Therefore, after a series of meetings and discussions, the ASC and evaluator clearly identified and justified two measures (sense of belonging and self-reliance) that would help the ASC understand and improve their services, as well as better communicate student athletes’ experiences at UC Berkeley to external stakeholders7. While measures of sense of belonging and self-reliance already exist (DeBacker Roedel, Schraw, & Plake, 1994; Ashley & Reiter-Palmon, 2012), none of them were specific to the student athlete population and culture. Student athletes constitute a distinct subgroup at universities such as UC Berkeley; therefore, the ASC and the evaluator felt that sense of belonging and self-reliance meant something unique to the student athlete population. As a result, the evaluator and the Leadership Team felt that an existing measure intended for the general undergraduate population would not be valid for

7 External stakeholders include the UC Berkeley Athletic Coaches, ASC Board of Directors, UC Berkeley Office of the Chancellor, and the NCAA

student athletes. To ensure that the measures were most responsive to student athlete culture, the evaluator chose to create measures specifically for the ASC. To do so in both a rigorous and culturally competent manner, the evaluator turned to the BAS approach for constructing measures and followed the four building blocks through a culturally competent lens. The following section details the four building blocks process that the evaluator undertook for developing measures of sense of belonging and self-reliance.

Building Block One: Construct Development As a first step to constructing measures for the ASC, the evaluator developed construct maps. At the heart of the construct map is the underlying ‘construct.’ A construct is the “idea or concept that is the theoretical object of our interest in the respondent” (Wilson, 2004, p. 6). In this evaluation case, the evaluator set out to create two constructs: sense of belonging and self-reliance. The ‘construct map’ outlines the construct along a continuum of extremes; in this case, from strong to weak. Furthermore, the construct map outlines qualitative levels between these two extremes. In other words, the construct map shows the different levels (from strong to weak) that a student athlete could fall under. Because the evaluator sought to create measures specific to student athlete culture, the construct required a student athlete-specific definition, and the construct map needed to be created within the context of the ASC’s service delivery model. Therefore, the evaluator did not use a pre-existing construct for sense of belonging and self-reliance and instead worked with the ASC staff and student athletes to create new, hypothesized construct maps. To accumulate the necessary information for creating these culturally specific construct maps, the evaluator first met individually with all members of the ASC Leadership Team, as well as the primary facilitator for ED 98. During these meetings, the evaluator asked how staff viewed sense of belonging and self-reliance in relation to the activities and outcomes that they had mapped out for their respective departments. Additionally, the evaluator obtained the final syllabus and course textbook for ED 98. As a next step, the evaluator reviewed and coded meeting notes to ascertain a definition of, and various levels for, sense of belonging and self-reliance. Additionally, the evaluator reviewed the ED 98 course syllabus’ stated objectives and coded them as related to either sense of belonging or self-reliance. While conducting this review process, it became clear to the evaluator that sense of belonging and self-reliance were defined by several underlying characteristics. In order to synthesize these characteristics into a construct map draft, the evaluator drew inspiration from the survey instrument design technique known as ‘Guttman Mapping Sentences.’ Guttman Mapping Sentences provide researchers and evaluators with a method to systematically organize a range of relevant characteristics under one overarching latent variable (Randall & Engelhard, 2000). Each characteristic, while related to the others, possesses its own definition at each level of the construct. The evaluator therefore synthesized a set of characteristics, based on the objectives stated in the ED 98 course material, and then used meeting notes and the course textbook to explicate four different levels of each characteristic. The hypothesized sense of belonging construct contained six characteristics and four levels, and the hypothesized self-reliance construct contained five characteristics and four levels. Both construct maps were presented to the Leadership Team and ED 98 facilitators. This stakeholder feedback was incorporated into the final hypothesized constructs, displayed in Figures 2 and 3.


Additionally, it is important to note that because these constructs were new, they had not been empirically verified. Therefore, the construct maps were considered to be ‘hypothesized.’
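Although the construct maps themselves are presented in the figures that follow, a brief sketch may help clarify the structure being described: a construct defined by a set of named characteristics, each with four ordered levels. The characteristic name and level wording below are invented placeholders, not the ASC’s actual construct map content.

```python
# Minimal sketch of one way a hypothesized construct map could be represented.
# The characteristic and level descriptions are hypothetical placeholders.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ConstructMap:
    """A construct composed of characteristics, each with ordered levels
    running from weakest (level 1) to strongest (highest level)."""
    name: str
    characteristics: Dict[str, List[str]]  # characteristic -> level descriptions

    def level_description(self, characteristic: str, level: int) -> str:
        # Levels are 1-indexed to match the construct map levels in the text.
        return self.characteristics[characteristic][level - 1]

sense_of_belonging = ConstructMap(
    name="Sense of Belonging",
    characteristics={
        "engagement with campus community": [
            "Level 1: avoids non-athletic campus spaces",
            "Level 2: participates only when required",
            "Level 3: participates when encouraged by others",
            "Level 4: seeks out campus involvement independently",
        ],
        # ...remaining characteristics would be defined in the same way
    },
)

print(sense_of_belonging.level_description("engagement with campus community", 3))
```

Representing the hypothesized construct map in an explicit form such as this also makes the later building blocks easier to trace, since each item option and each score can be linked back to a named characteristic and level.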


Figure 5: Hypothesized Construct Map for Sense of Belonging


Figure 6: Hypothesized Construct Map for Self-Reliance


Building Block One therefore allowed the evaluator to consider and create constructs for sense of belonging and self-reliance that were specific to the cultural context of the program population. This specificity aligns with the essential practices of culturally competent evaluators, as well as the culturally responsive evaluation framework. Using stakeholder materials and engaging stakeholders in the process of creating the hypothesized constructs ensured that the definitions of sense of belonging and self-reliance were grounded in the local context and responsive to differences between student athletes and the rest of the undergraduate population. Building Block Two: The Items Design With the hypothesized construct maps completed, the next step to creating the sense of belonging and self-reliance measures was the items design. Primarily, the stakeholders and evaluator determined that a closed-response survey would be the best design for the measures, given the time and resources available in the ASC context. When designing measures, the items design should relate to the construct; in other words, the levels of the construct should be reflected in the items design (Wilson, 2004). To fulfill this requirement, the evaluator decided to create two items for each characteristic outlined in the sense of belonging and self-reliance construct maps. Furthermore, the evaluator decided to use a multiple-choice format in which each of the multiple-choice options related to one of the levels defined in the construct (see Figures 2 and 3). Therefore, each item reflected one characteristic and every level on either the sense of belonging or self-reliance construct maps. Given the decision to construct each measure using multiple choice questions, the evaluator considered whether to use Likert-type items or Guttman-type items. Likert-type items refer to a standard set of options on a continuous scale. For example: strongly agree, agree, disagree, and strongly disagree. Each of the options would then relate back to one of the hypothesized levels on the construct. As explained in Wilson (2004), while Likert-type items are relatively simple to construct, the standard options do not provide much context; it may be difficult for respondents to know the qualitative difference between disagree and strongly disagree. Guttman-type items, on the other hand, provide explanatory statements that theoretically represent a continuous scale (Guttman, 1944). Guttman (1944) describes this approach as Guttman scaling, which is considered to provide a better interpretation of the construct than Likert-type items. In a culturally responsive evaluation, instruments should be culturally sensitive and reflective of the cultural group involved in the evaluation (Frierson, Hood, Hughes, & Thomas, 2010). In accordance with this CRE requirement, the evaluator determined that Guttman-type items, as opposed to Likert-type options, allowed for question option statements that better represented the cultural context. Writing the Items Because the evaluator and Leadership Team prioritized cultural context, the items needed to be written in a way that resonated with the UC Berkeley student athlete population. The evaluator wanted the student athletes to imagine themselves in real-world settings in which sense of belonging or self-reliance characteristics would be displayed. 
Athletes routinely use images (such as videotape analysis, playbook diagrams, or visualization exercises) as part of their training; therefore, the evaluator believed that incorporating images into the items design would enhance the engagement level and cultural competence of the instrument. Specifically, the evaluator decided to create comic strips portraying real-life situations student athletes might find themselves in. These comic strips would act as prompts for a subsequent set of questions. This

38 approach is further justified by existing examples of program evaluations that incorporate visual imagery into the evaluation design as means of improving the cultural competence of an evaluation (Brandao, Silva, & Codas, 2012; Dura, Felt, & Singhal, 2014). Additionally, while sparse, existing literature validates the use of comic strips for measurement and for research purposes (Cohn, 2014; Cohn, 2007). Specifically, Cohn (2014) conducted four experiments to uncover the extent to which participants could identify and follow a comic strip structure and narrative when presented in an assessment format. Results showed that the comic strip narrative could be identified through diagnostic tests, thus implying that comic strips are valid and interpretable elements to an item stem. In order to create the comic strip scenarios, as well as write the Guttman-type options, the evaluator worked closely with the Leadership Team and ED 98 staff (as further detailed in the proceeding section). Specifically, the evaluator drew upon the conversations from the construct mapping process. Furthermore, the evaluator observed four sections of the ED 98 seminar and took notes regarding specific situations and scenarios that were discussed, as well as students’ responses regarding how they handled these specific situations. The evaluator generated six comic strip scenarios (three for sense of belonging and three for self-reliance) based on conversation and meeting notes. Additionally, the course textbook provided guidance when writing the Guttman-type items. Thus, the evaluator wrote the initial draft of items to incorporate the statements made during the ED 98 course, as well as the descriptions from the course textbook. In total, the first instrument draft contained 22 items and all, except for one, were written with Guttman-type response options.8 Figure 4 presents an example of a comic strip scenario related to sense of belonging and one of the items pertaining to this scenario.

8 One item, due to the nature of the question stem, contained Likert-type options.


Figure 7: Example Comic Scenario and Item for Sense of Belonging Measure

Revising the Items

The next step in the items design was to solicit feedback from key stakeholders. Therefore, the evaluator conducted four cognitive labs with current UC Berkeley student athletes. During these cognitive labs, the student athlete and the evaluator went through each item, and the student athlete provided feedback regarding: 1) the clarity of item wording; 2) the degree of relevance of each item context; and 3) the appropriateness/sensitivity of question options. At the end of the survey, the evaluator then asked the student athlete to provide any general suggestions and additional comic-strip scenarios that might be applicable. The evaluator incorporated the suggested revisions and then conducted an item panel with the ASC Leadership Team. During this item panel, the evaluator and the Leadership Team went through each question one by one. The Leadership Team provided feedback on both the comic strip scenarios and item wording, paying close attention to if and how certain items might unintentionally promote negative student athlete stereotypes. This detailed process produced a set of revisions that further enhanced the cultural competence of the instrument. Furthermore, the evaluator discussed any additional demographic questions that needed to be added to the pilot instrument. After this final round of revisions, the evaluator felt that the instrument was ready to be piloted with the ED 98 students.

The final draft contained a total of 22 items: twelve for sense of belonging and ten for self-reliance. Furthermore, the evaluator felt that it was valuable to add an 'other, please explain' option for all the items as a means to generate additional feedback on the representativeness of the question options. A complete draft of the pilot instrument can be found in the Appendix. Building Block Two was consistent with a culturally competent approach in that the items were constructed to reflect the lived realities of the instrument's population. The comic strip scenarios and Guttman-type items allowed the student athlete voice and experience to surface in the survey instrument. Furthermore, the revision process incorporated the perspectives of student athletes themselves, as well as the program staff. Building Block Two added on to the work done in Building Block One in a manner that allowed the evaluator to continue reflecting upon and improving the cultural competence of the survey instrument.

Building Block Three: Designing the Outcome Space

With the items design complete, the evaluator needed to articulate how each item should be scored in relation to the levels of the construct map. The explication of how responses to items relate to the construct map represents Building Block Three: the outcome space. Wilson (2004) describes the outcome space as "a set of categories that are well-defined, finite and exhaustive, ordered, context-specific, and research-based" (p. 62). In many ways, the outcome space can be interpreted as a rubric. For example, the outcome space for open-ended items would provide example responses at each level of the construct. For closed-response items, such as the items contained in the sense of belonging and self-reliance measures, the outcome space links each possible multiple-choice response to one level of the construct map. As previously explained, the evaluator designed the sense of belonging and self-reliance measures so that each multiple-choice response related to one level on the hypothetical construct map. Therefore, the outcome space included a series of tables that matched each Guttman-type option to one level of the construct. Table 5 below presents one table included in the outcome space. The complete outcome space can be found in the Appendix. As seen in Table 5, each table in the outcome space is labeled with the measure and the characteristic. The first column lists the question, referring to the comic scenario (in italics) anchoring the question. The second column lists the Guttman-type options for the question. The third column presents the four levels of the construct for the question's corresponding characteristic. Specifically, characteristic level one relates to 'option a,' and so on.


Table 5. Example Table from Sense of Belonging Outcome Space

Measure: Sense of Belonging – Characteristic: Shared Purpose

Question (GSI Scenario), Q1: If you were in this situation, in your opinion, what would be the GSI's most likely response (fill in the blank in Panel 2)?
Options:
a) N/A – I would not be in this situation because GSIs don't help students like me.
b) "It's your responsibility to make up for missed class material, so I can't help you."
c) "Sure, maybe you and your classmate have the same questions, let's all sit down together."
d) N/A – I would not be in this situation because I would have met with my GSI before my game about missing lab.
e) Other, please write:

Question (Student Flyer Scenario), Q12: Imagine that you take the flyer and it says: "Sponsored by the Haas School of Business Alumni Association – Alumni will be in attendance." What are you most likely to think about this?
Options:
a) Nothing, alumni associations don't mean anything to you.
b) Interested in how the alumni might help you connect with people who can offer you a future job, but you are not willing to go to the event and actually meet alumni.
c) Would like to meet alumni, and you decide to go to the event if you can convince some friends to go too.
d) Excited by the possibility of meeting with alumni, and you ask the business student if there are other ways to connect with the alumni association outside of the ice cream social.
e) Other, please write:

Characteristic Levels (for both questions):
1. Is not aware of specific resources because he/she feels like campus resources outside of the ASC are not meant for students like him/her
2. Is aware of specific campus resources outside the ASC that may help him/her, but does not feel comfortable connecting with such resources
3. Has made appointments, but still needs encouragement from others to connect with the campus resources outside the ASC to best meet his/her academic and personal needs
4. Has made appointments and feels comfortable communicating and investing in relevant relationships with campus resources outside the ASC on his/her own
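To make the connection between the outcome space and scoring concrete, the short base-R sketch below applies a scoring guide of this form to raw responses. The option-to-level key (a through d mapping to levels 1 through 4, with 'other' treated as missing) and the function name are illustrative placeholders rather than the actual ASC scoring key.

```r
# Hypothetical scoring helper: map Guttman-type options to construct-map levels.
# 'key' encodes one row of an outcome-space table; option 'e' (other) is not in
# the key, so it scores as NA, mirroring how 'other' responses were set to missing.
score_item <- function(responses, key = c(a = 1, b = 2, c = 3, d = 4)) {
  unname(key[responses])
}

# Example: five hypothetical responses to one sense of belonging item
score_item(c("a", "c", "e", "d", "b"))
#> [1]  1  3 NA  4  2
```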


The outcome space allowed the evaluator to gain clarity regarding how to code and score the responses. Additionally, the outcome space provided a scoring guide for stakeholders. Because this document clearly explained how each option related back to the construct, stakeholders knew that if a student selected 'option a,' this response hypothetically put the student on level one of the construct for that item. One of the guiding principles behind BAS is 'teacher management and responsibility,' meaning that stakeholders have the tools to use the instrument data effectively (Wilson & Sloane, 2000). The outcome space provided stakeholders with the guide to be able to do so. The essential practices of a culturally competent evaluator, as well as the CRE framework, parallel this aspect of stakeholder engagement; stakeholders should be engaged at every step of the evaluation and should understand how the instruments and their scoring can produce specific findings (Frierson, Hood, Hughes, & Thomas, 2010). Therefore, Building Block Three follows from Building Blocks One and Two to further promote the cultural competence of the measures.

Building Block Four: Choosing and Evaluating a Measurement Model

With the outcome space clarified, the evaluator's next step toward constructing the sense of belonging and self-reliance measures was choosing and evaluating a measurement model (Building Block Four). A measurement model "must help us understand and evaluate the scores that come from the item responses and hence tell us about the construct" (Wilson, 2004, p. 16). In this evaluation case, the measurement model represented the statistical or psychometric model chosen to analyze the instrument data. Per the BAS approach, the measurement model should produce quality evidence that uses generalized forms of item response models (Wilson & Sloane, 2000). Item response models allow for the interpretation of the distance between a respondent and a response on the construct map, the distance between different respondents, and the distance between different responses (Wilson, 2004). Therefore, with an item response modeling approach, the evaluator could better understand the appropriateness of the instrument for the student athlete population, as well as the instrument's psychometric validity. One of the simplest item response models is the Rasch model. The basic Rasch model analyzes dichotomous data; however, the self-reliance and sense of belonging measures are polytomous. Therefore, the evaluator selected the Rasch partial credit model as the measurement model. The partial credit model, developed by Wright and Masters (1982), is an extension of the dichotomous Rasch model. Instead of answers coded as either right or wrong, the partial credit model allows for intermediate steps. For example, the sense of belonging and self-reliance measures have four performance levels for each item; such a scheme can be thought of as having "three steps." Thus, to incorporate multiple steps, the original dichotomous Rasch model extends to include the probability of completing each step (Wright & Masters, 1982). The partial credit model is shown in Equation One.

$\eta_{pik} = \theta_p - (\delta_i + \tau_{ik})$   (1)
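To illustrate how Equation One generates category probabilities, the sketch below implements the standard adjacent-category form of the partial credit model in R (the software used for the analysis); the parameter values are invented for illustration and are not estimates from the ASC data. The symbols are the same as those defined in the paragraph that follows.

```r
# Partial credit model category probabilities for one person on one item.
# theta: person location; delta: overall item difficulty; tau: step parameters.
# Category k (k = 0, ..., K) is proportional to exp of the cumulative sum of the
# adjacent-step logits eta_k = theta - (delta + tau_k), with an empty sum for k = 0.
pcm_probs <- function(theta, delta, tau) {
  eta <- theta - (delta + tau)
  numer <- exp(c(0, cumsum(eta)))
  numer / sum(numer)
}

# Hypothetical three-step (four-category) item
round(pcm_probs(theta = 0.5, delta = -0.2, tau = c(-1.0, 0.3, 1.2)), 2)
#> [1] 0.05 0.28 0.42 0.25   (probabilities of scoring 0, 1, 2, 3; they sum to 1)
```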

In Equation One, the subscript $i$ indexes the item and the subscript $k$ indexes the step. $\theta_p$ represents a person's ability (location on the construct), $\delta_i$ represents the overall difficulty of item $i$, and $\tau_{ik}$ represents the difficulty of each step level. $\eta_{pik}$ represents the difference in log odds of scoring $k$ rather than $k-1$ on item $i$. For a three-step item, when $k = 1$, this is the difference in log odds of scoring a 1 rather than a 0 on item $i$, which is analogous to the dichotomous Rasch model.

When $k = 2$, it is the difference in log odds of scoring a 2 rather than a 1 on item $i$, and when $k = 3$, the difference in log odds of scoring a 3 rather than a 2. With the partial credit model, the step difficulties do not need to be uniform within an item. Furthermore, the step difficulties do not need to be the same across items (as they do in the rating scale model) (Wright & Masters, 1982).

The partial credit model can be combined with the construct map to produce a Wright Map. The Wright Map places items and respondents on the same scale in order to produce a rich interpretation of the measure. More specifically, Wright Maps provide a visual display of a respondent's degree of sense of belonging and self-reliance as compared to the difficulty of each item (Wilson, 2004). Both the partial credit model and the Wright Map are used to evaluate the validity and reliability of the measurement model. The following sections further detail the Wright Map in the context of the sense of belonging and self-reliance measure calibration and validation. The Wright Map, as well as additional tests of item and person fit, allowed the evaluator to evaluate the appropriateness of the chosen measurement model. In the following sections, the evaluation of the measurement model is discussed alongside a description of the piloting process and the measures' validity and reliability analyses.

While the measurement model may appear technical and unrelated to the CRE framework and essential practices of culturally competent evaluators, it is a critical element for ensuring the cultural competence of an instrument. Specifically, using item response models (in this case, the partial credit model) within the four building blocks approach facilitates an analysis that allows the evaluator to examine the appropriateness of the instrument for the intended population. Even though several steps were taken in Building Blocks One, Two, and Three to create a culturally competent instrument, without Building Block Four the evaluator lacks empirical evidence regarding how the entire instrument, as well as specific items, is or is not appropriate for the student athlete population as a whole and for individual subgroups within the student athlete population.

The Instrument Pilot Test

In order to evaluate the measurement model, as well as the reliability and validity of the instrument, the evaluator piloted the sense of belonging and self-reliance measures with the ED 98 students enrolled during the fall 2016 term. The ED 98 course was offered in four sections and consisted of freshman varsity student athletes from across multiple sports. All freshman varsity student athletes were invited to take the course; however, their participation in the course was voluntary. The evaluator piloted the instrument during all four sections. Prior to administration, the evaluator was introduced by the section facilitator, and the evaluator explained the purpose and intent behind the survey. The survey took approximately fifteen minutes for students to complete. A total of 66 students participated in the survey.9 Of these students, 62 percent identified as female (n=41), 20 percent were international students (n=13), and nine percent had participated in Summer Bridge (n=6).10

9 The original participation number was 72; however, 6 students opted to participate in the evaluation only and not have their data included in any research materials.
10 Summer Bridge is a program offered to all incoming UC Berkeley freshmen. It is a six-week summer session in which students take two credit-bearing courses and a seminar on personal wellness. The program is meant to introduce students to the structure and rigor of university student life.

Regarding the sports teams represented in the pilot, baseball and women's rowing had the most participation in the survey. A complete table of the sports teams represented in the pilot is provided in the Appendix. Furthermore, Table 6 shows the distribution of race/ethnicity from the pilot survey.

Table 6: Racial/Ethnic Distribution of Pilot Survey Participants

Race/Ethnicity | % | n
Asian | 11 | 7
Black/African American | 6 | 4
Hispanic/Latino | 3 | 2
Native Hawaiian/Pacific Islander | 2 | 1
White/Caucasian | 60 | 37
Mixed Race | 13 | 8
Other | 5 | 3

The instrument was administered via paper and pencil. All sections were surveyed during the same week, in the final month of the course. The evaluator manually entered all of the survey data per the outcome space scoring guide. As mentioned, an 'other' option was provided on all of the questions in the event that a question's options did not capture a student's true response. Any question that was answered with an 'other' explanation was coded as missing. However, the evaluator recorded all 'other' comments to analyze as part of the revision process.

Results from the Pilot Study – Instrument Calibration

As discussed in Building Block Four, the evaluator selected the Rasch partial credit model to calibrate the instrument. Using the R statistical software 'crasch' package (Arneson, 2015), the evaluator computed the item parameters for the scored data and produced a Wright Map for each measure, as shown in Figure 8 for sense of belonging and Figure 9 for self-reliance. The Wright Maps show the respondent locations (ability) on the left and the item locations (difficulty) on the right. The respondent locations resemble a histogram. Both distributions look relatively normal, with a slight right-tailed skew for self-reliance and left-tailed skew for sense of belonging. For both Wright Maps, the item locations cover the distribution of respondents. This means that the instruments were neither too hard nor too easy for any one respondent. Such coverage indicates that the instruments were appropriate for the respondent population. On the item side, each item was labeled according to Table 7. Labels were created to illustrate the comic strip scenario and construct map characteristic reflected in each item. For example, the label 'Dorm.Achieve' reflects the question pertaining to the college dorm scenario and associated with the 'Achievement Standards' characteristic in the self-reliance construct map.

The Wright Maps show that some items have three step levels while others have only two. This indicates that particular multiple-choice options (or 'categories') for those items received zero responses and were therefore collapsed. Table 8 outlines which questions included empty categories and how they were collapsed.

Table 7. Instrument Item Labels

Sense of Belonging
Q1: GSI.Purpose
Q2: GSI.Recognize
Q3: GSI.Interpret
Q4: GSI.Community
Q5: Group.IdentityA
Q6: Group.Recognize
Q7: Group.IdentityB
Q8: Group.Holistic
Q9: Flyer.Community
Q10: Flyer.Interpret
Q11: Flyer.Holistic
Q12: Flyer.Purpose

Self-Reliance
Q13: Dorm.Achieve
Q14: Dorm.Network
Q15: Dorm.Growth
Q16: Career.Network
Q17: Career.Input
Q18: Career.Learn
Q19: Paper.Achieve
Q20: Paper.Input
Q21: Paper.Learn
Q22: Paper.Growth

Table 8. Empty Categories

Question | Missing Category | Collapsed
GSI.Purpose | Category 1 (option 'd') | Level 2 = Level 1
GSI.Recognize | Category 1 (option 'd') | Level 2 = Level 1
Group.IdentityA | Category 2 (option 'c') | Level 3 = Level 2
Group.Recognize | Category 2 (option 'c') | Level 3 = Level 2
Group.IdentityB | Category 1 (option 'd') | Level 2 = Level 1
Flyer.Holistic | Category 1 (option 'd') | Level 2 = Level 1
Paper.Input | Category 1 (option 'd') | Level 2 = Level 1
Paper.Growth | Category 1 (option 'd') | Level 2 = Level 1
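The recoding summarized in Table 8 amounts to merging an empty category with an adjacent level before re-estimating the model. The base-R sketch below shows one way to express such a merge; the data vector and function name are illustrative, not the evaluator's actual recoding script.

```r
# Merge one scored level into another (e.g. "Level 2 = Level 1" from Table 8).
# which() is used so that missing responses are left untouched.
collapse_levels <- function(scores, from, to) {
  scores[which(scores == from)] <- to
  scores
}

# Hypothetical scored responses for one item with an under-used category
collapse_levels(c(1, 2, 3, 4, 4, 2, NA), from = 2, to = 1)
#> [1]  1  1  3  4  4  1 NA
```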


Figure 8: Sense of Belonging Wright Map

Figure 9: Self-Reliance Wright Map


To look more closely at which item steps needed revision, the evaluator created Wright Maps organized by construct level, as shown in Figure 10 and Figure 11.

Figure 10: Sense of Belonging Wright Map by Construct Level

Figure 11: Self-Reliance Wright Map by Construct Level

The construct-level Wright Maps revealed that the overlap between levels for both instruments was not as clean as in an ideal case. Ideally, the Wright Map would show clear banding so that one can distinguish respondents between levels. One would like to see a clear point on the logit scale to indicate a move from one level to another.

On both Wright Maps, this clear break is seen when moving from Level 3 to Level 4. This break, however, is less clear when moving from Level 2 to Level 3. The overlap is an indication that the instruments' step levels are inconsistent. As a result, for respondents located at approximately one logit on the sense of belonging measure and approximately negative one and a half logits on the self-reliance measure, it is unclear whether the respondent should be placed at Level 2 or Level 3. In addition to the Wright Map, the evaluator also examined item fit and person fit. Item fit was analyzed by calculating the infit meansquare for each item. The infit meansquare summarizes how much the observed residuals differ from what the model expects. Item fit should fall within the range of 0.75 to 1.33. Figure 12 shows the infit meansquare for sense of belonging and Figure 13 shows the infit meansquare for self-reliance. For both measures, the items were well within the prescribed range.
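For readers who want the computation behind these figures, the infit meansquare is an information-weighted ratio of squared residuals to model variances. The sketch below is a generic hand-rolled version in base R; the matrices of observed scores, expected scores, and model variances are assumed to come from whatever IRT software estimated the partial credit model, and the object names are placeholders.

```r
# Infit mean square per item: squared residuals summed over persons, weighted by
# the model variance of each response. X = observed scores, E = expected scores,
# W = model variances; all are persons-by-items matrices with matching dimensions.
infit_msq <- function(X, E, W) {
  res2 <- (X - E)^2
  W[is.na(res2)] <- NA                 # skip cells with missing responses
  colSums(res2, na.rm = TRUE) / colSums(W, na.rm = TRUE)
}

# Items outside the 0.75-1.33 range quoted in the text could then be flagged:
# names(which(infit < 0.75 | infit > 1.33))
```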


Figure 12: Sense of Belonging Infit Meansquare

Figure 13: Self-Reliance Infit Meansquare


To analyze person fit, the evaluator created a scatterplot of the respondents' infit meansquares. The infit meansquares were estimated using the weighted likelihood estimates (WLE) via ConQuest (Wu, Adams, & Wilson, 2012). Similar to the item fit analysis, the evaluator used the scatterplot to see if respondents fell outside of the prescribed 0.75-1.33 range. As Figure 14 and Figure 15 show, both measures had several respondents who fell below and above the range. For respondents who fell below 0.75, the evaluator interpreted this as being 'too consistent,' possibly indicating that a respondent did not take the survey seriously. However, because Guttman-type items were used, this consistency is not a large concern. Respondents falling above the 1.33 boundary have response patterns that one would not predict. In other words, these respondents have poor person fit.
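A small, hypothetical sketch of this follow-up step: given each respondent's infit meansquare (however estimated) and the pilot demographics, misfitting respondents can be flagged and cross-tabulated against subgroup membership to check for the kinds of patterns described below. All object names and values here are placeholders.

```r
# Classify person fit against the 0.75-1.33 range used in the text.
flag_person_fit <- function(infit, lower = 0.75, upper = 1.33) {
  ifelse(infit > upper, "unpredictable",
         ifelse(infit < lower, "too consistent", "within range"))
}

person_infit  <- c(0.62, 0.98, 1.45, 1.10, 0.81)     # placeholder values
international <- c("yes", "no", "no", "yes", "no")    # placeholder subgroup

# Cross-tabulate to see whether misfit clusters within a subgroup
table(flag_person_fit(person_infit), international)
```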


Figure 14: Sense of Belonging Person Fit


Figure 15: Self-Reliance Person Fit

To address this issue, the evaluator created Kidmaps for the respondents falling above the 1.33 threshold. A Kidmap indicates which questions the respondent scored higher than expected, as well as which questions the respondent scored lower than expected. Furthermore, the evaluator looked at the demographic data for respondents to see if there were any patterns. These person fit analyses revealed that GSI.Community, Group.IdentityA, Group.Recognize, Career.Input, and Paper.Achieve had consistent person misfit. Interestingly, these questions also displayed some step consistency issues, indicating that revisions to the step difficulties may remedy the person misfit. When looking at demographics, a consistent misfit pattern did not arise in terms of sport, gender, international student status, or Summer Bridge participation. Thus, the instrument's person fit will likely improve through revising the step difficulties.

Reliability

Following the instrument calibration analyses, the evaluator conducted a series of reliability and validity analyses. These analyses allowed the evaluator to address and improve upon the psychometric validity of the instrument, as well as assess the extent to which the instrument was appropriate for the student athlete population. This section reports on the reliability analysis for the sense of belonging and self-reliance measures. Tests of reliability measure the 'consistency' or 'repeatability' of the instrument. Primarily, the evaluator analyzed the standard error of measurement for both measures. The standard error of measurement provides an indication of the estimates' accuracy. Figure 16 and Figure 17 show the standard error of measurement (sem) for both measures.


Figure 16: Standard Error of Measurement for Sense of Belonging

Figure 17: Standard Error of Measurement for Self-Reliance

Both figures were similar in that they showed a cup-shaped curve, which is typical. Specifically, this shape indicated that the sem was lowest for respondents located between approximately -1.5 and +1.5 logits, which represented the majority of respondents in the pilot sample. The sem increased for outliers, which is to be expected. In order to assess internal consistency, the evaluator then calculated Cronbach's Alpha for each of the measures. The Cronbach's Alpha for sense of belonging was 0.53 for complete cases and 0.46 for all cases. For self-reliance, the Cronbach's Alpha was 0.58 for complete cases and 0.53 for all cases. Even though one may anticipate lower internal consistency for social/emotional measures, as opposed to achievement measures, the relatively low Cronbach's Alpha indicated that the evaluator should revise questions with poor fit, as discussed in the prior section. Alternatively, because each construct comprised multiple characteristics, this low internal consistency may reflect some multidimensionality. However, it was not feasible to create an instrument with the number of items and respondents needed to conduct a multidimensional analysis. As another measure of internal consistency, the evaluator examined person separation reliability. Person separation reliability reports the proportion of variance accounted for by the model. The person separation reliability was 0.43 for sense of belonging and 0.63 for self-reliance. This statistic is not a convincing improvement over Cronbach's Alpha. However, with item revisions, the internal consistency of both measures should improve. As a final assessment of reliability, the evaluator used the alternate forms reliability coefficient. Using this technique, the evaluator divided each measure into two parallel forms. Because each measure contained two items per construct characteristic, the forms were split so that each half contained one of the two items for each characteristic. The evaluator correlated the two halves to determine the coefficient. The alternate forms coefficient was 0.52 for sense of belonging and 0.56 for self-reliance. However, these statistics did not account for the decrease in the number of items when splitting the instrument in two. Therefore, the evaluator applied the Spearman-Brown formula to the alternate forms analysis, which is considered to be more accurate. Using Spearman-Brown, the alternate forms coefficient was 0.68 for sense of belonging and 0.72 for self-reliance. While the Spearman-Brown numbers are not exceptionally low and the sem for both measures is reasonable, the evaluator's reliability assessment showed that the instrument had room for improvement. Reliability results may be due in part to the somewhat limited range of the sample of students. In addition, several of the items yielded a fair number of 'other' responses, thus creating several incomplete cases. Incorporating these 'other' responses into the question options would increase the number of complete cases and therefore improve the overall reliability of the instrument.
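To keep this part of the workflow transparent for stakeholders, the coefficients reported above can be reproduced with a few lines of base R. The sketch assumes a complete-case persons-by-items matrix of scored responses named 'scores' (a placeholder); the Spearman-Brown calls simply recompute the corrected coefficients from the split-half correlations reported in the text.

```r
# Cronbach's alpha for a persons-by-items matrix of scored responses (complete cases).
cronbach_alpha <- function(scores) {
  k <- ncol(scores)
  (k / (k - 1)) * (1 - sum(apply(scores, 2, var)) / var(rowSums(scores)))
}

# Spearman-Brown correction for a split-half (alternate forms) correlation.
spearman_brown <- function(r) 2 * r / (1 + r)

spearman_brown(0.52)   # approximately 0.68, as reported for sense of belonging
spearman_brown(0.56)   # approximately 0.72, as reported for self-reliance
```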

Validity

Building from the reliability analyses, the evaluator analyzed the measures' validity. Tests of validity demonstrate whether the sense of belonging and self-reliance instruments actually measure what they were designed to measure. The evaluator used the strands highlighted in the "Standards for Educational and Psychological Testing" to frame the validity analysis (American Psychological Association, 2014), as shown in Table 9.

Table 9. The Six Strands of Validity as Described by the Standards for Educational and Psychological Testing

Strand One: Evidence based on instrument content
Strand Two: Evidence based on response processes
Strand Three: Evidence based on internal structure
Strand Four: Evidence based on relations to other variables
Strand Five: Evidence regarding relationships with criteria
Strand Six: Evidence based on the consequences of using the instrument


Strand One: Evidence Based on Instrument Content

Strand One, evidence based on instrument content, sets the structure for the rest of the validity argument. Evidence is collected regarding the definition of the construct to be measured by the observation instrument (Building Block One), a description of the set of items that comprise the instrument (Building Block Two), a strategy for coding the items and relating them back to the outcome space (Building Block Three), as well as a technically calibrated version of the construct (Building Block Four) (Wilson, 2004). As already discussed, the sense of belonging and self-reliance measures were constructed using the BEAR Assessment System, as guided by the four building blocks approach. The evaluator worked with stakeholders to develop a construct for each measure, designed the items, created an outcome space, and used the partial credit model to analyze the pilot responses. Furthermore, the Wright Maps for the two measures showed that items were neither too easy nor too hard for the sample of student athletes. Additionally, the Wright Maps show clear banding for Levels 3 and 4 of the construct map, thus providing further evidence for instrument content. Therefore, the two measures have strong evidence based on instrument content.

Strand Two: Evidence Based on Response Processes

Strand Two, evidence based on response processes, relates to the use of think-alouds and exit interviews during item development. The purpose of think-alouds and exit interviews is to gain insight into any confusing language, item wording, and potential biases. Therefore, think-alouds and exit interviews should be conducted with individuals representative of the population to be surveyed. As mentioned under Building Block Two, the evaluator conducted four think-alouds with current UC Berkeley student athletes. Specifically, the evaluator conducted think-alouds with one women's basketball student, one men's track and field student, and two football students. Information collected during the think-alouds prompted the evaluator to include narrative information in the comic strip scenes and modify the Guttman-type options to better reflect how a student athlete would respond to the comic strip scenarios. For example, one student pointed out that for the "Dorm.Network" question, he would simply respond with something such as "just don't fall behind on your paper again." An option like this wasn't available at the time; therefore, Option C was modified to state: "Just ask for an extension and next time, don't wait until the last minute." Furthermore, immediately after the survey administration was complete, the evaluator asked the student athletes in each section (as a whole group) if there were any comic strip scenarios or questions that were particularly confusing. Student athletes did not comment on any confusing items, but rather complimented the instrument as being relevant to their lives. Thus, the think-alouds and the group exit interviews contributed to evidence based on response processes.

Strand Three: Evidence Based on Internal Structure

Evidence based on internal structure can be broken down into three levels: the instrument level, the item level, and the item-by-group level. To provide internal structure evidence at the instrument level, one would calculate a Spearman's Rho statistic to determine if the intended order of difficulty matched the actual order of difficulty.
However, the evaluator intended for each of the items to be at the same difficulty level; rather, each item option would have different levels of difficulty. Therefore, the ‘banding’ shown in the Wright Maps for each measure, most significantly between Level 3 and Level 4, provided evidence based on internal structure.


However, the overlap observed between Level 2 and Level 3 presented a validity concern based on internal structure at the instrument level. At the item level, the evaluator analyzed whether the instrument items behaved properly by examining the mean location of each item by response category. One expects that respondents who score higher on the construct generally score higher on each item; thus, mean locations should increase across response categories. However, if responses are sparse for a particular answer option, the mean location for that step does not provide an interpretable mean person location estimate. Mean location and response frequency tables are provided in the Appendix. Mean location inconsistencies were only noted if they occurred in item steps with more than five respondents per category. Four items showed mean location inconsistencies: GSI.Recognize, Flyer.Community, Career.Network, and Career.Input. The Wright Maps show that for Flyer.Community, 'option c' and 'option d' are too distant from one another; conversely, for Career.Input, 'option c' and 'option d' are too close to one another. Modifying these question options accordingly should also improve the mean locations. Furthermore, Career.Network and GSI.Recognize drew many 'other' responses. Revising these questions to incorporate themes from the 'other' comments should improve this issue.
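This item-level check can be automated. The sketch below assumes a long-format data frame (one row per person-item response) with hypothetical column names for the person's estimated location (wle), the item, and the scored category; it returns, for each item, whether the mean person location increases across sufficiently filled categories.

```r
# For each item, compute the mean person location per response category (only
# where at least min_n respondents chose the category) and test whether the
# means are non-decreasing across categories.
mean_location_check <- function(d, min_n = 5) {
  sapply(split(d, d$item), function(x) {
    m <- tapply(x$wle, x$category, mean)
    n <- tapply(x$wle, x$category, length)
    m <- m[n >= min_n]
    m <- m[order(as.numeric(names(m)))]
    all(diff(m) >= 0)        # FALSE marks a mean location inconsistency
  })
}
```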

To assess evidence at the item-by-group level, the evaluator conducted a series of differential item functioning (DIF) analyses using ConQuest software (Wu, Adams, & Wilson, 2012). DIF provides insight into whether respondents from two groups at the same location give different responses. The evaluator conducted DIF analyses for the following groups: male vs. female, White vs. non-White, contact sport vs. non-contact sport, and spring sport vs. fall sport. The evaluator had hoped to conduct DIF analyses for Summer Bridge participant vs. Summer Bridge non-participant and international student vs. non-international student; however, the sample size for each of these sub-groups was not large enough to do so. To assess the degree of DIF detected in each question, the evaluator used the Educational Testing Service (ETS) standards, as stated in Paek & Wilson (2011). These standards report three levels of DIF based on the estimate γ:

• A: if |γ| ≤ 0.426, or if H0: γ = 0 is not rejected at the .05 level;
• C: if |γ| ≥ 0.638 and H0: |γ| ≤ 0.426 is rejected at the .05 level; and
• B: otherwise.

Given this classification, the evaluator noted items that fell into category C, which represents significant DIF. The self-reliance measure did not contain any items with significant DIF. However, the sense of belonging measure contained one item with significant DIF. Specifically, GSI.Community showed bias in question difficulty against non-White student athletes, spring sport student athletes, and non-contact sport student athletes. The evaluator therefore noted that GSI.Community should be addressed and revised to eliminate this bias. Specific revisions are detailed in the 'Revisions' section below. However, aside from GSI.Community, the DIF analysis presented strong evidence for internal structure.
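The ETS-style rules quoted above translate directly into code. The sketch below encodes those thresholds; it assumes the DIF estimate and the two hypothesis-test p-values have already been obtained from the estimation software (the argument names are placeholders).

```r
# Classify a DIF estimate into ETS-style categories A, B, or C.
# gamma: estimated DIF; p_zero: p-value for H0: gamma = 0;
# p_small: p-value for H0: |gamma| <= 0.426.
classify_dif <- function(gamma, p_zero, p_small) {
  if (abs(gamma) <= 0.426 || p_zero >= 0.05) return("A")  # negligible DIF
  if (abs(gamma) >= 0.638 && p_small < 0.05) return("C")  # significant DIF
  "B"                                                     # intermediate DIF
}

classify_dif(gamma = 0.70, p_zero = 0.01, p_small = 0.03)  # hypothetical item: "C"
```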

Strand Four: Evidence Based on Relations to Other Variables

Measures of sense of belonging and self-reliance would hypothetically correlate with a student athlete's semester GPA. However, at the time of survey administration, student athletes were in their first semester of college and therefore did not have a GPA to report. Additionally, a student athlete's grade on the final portfolio project in the freshman seminar course would hypothetically correlate with the sense of belonging and self-reliance measures; but, like GPA, this information could not be collected at the time of survey administration. Another possibility would be to correlate the sense of belonging and self-reliance measures with results from another instrument that claims to measure a similar construct. This approach, however, does not seem valid given the purpose of creating these two measures in the first place; if a student athlete's construct for sense of belonging and self-reliance is distinct from that of the general undergraduate population, then another instrument would not necessarily correlate. Therefore, to collect evidence based on relations to other variables, the evaluator considered collecting GPA and course grade data once the term had ended.11

Strand Five: Evidence Regarding Relations with Criteria

Strand Five relates to the ability of an instrument to predict adequate or inadequate criterion performance. Specifically, this strand states that the criterion that relates to the instrument be technically sound and associated with the given levels of assessment scores. In the context of the sense of belonging and self-reliance instruments, Strand Five does not apply because stakeholders would not be using scores as criteria for specific decisions regarding student athletes. However, if stakeholders expressed the desire to use this instrument as a basis for decisions such as mandating tutoring or Summer Bridge participation, the evaluator should further explore this strand.

Strand Six: Evidence Based on the Consequences of Using the Instrument

Evidence based on the consequences of using the instrument addresses the real-world application of an instrument. In an ideal situation, instrument use would be monitored for any unintended consequences or systematic misuse. Given the often high-stakes nature of program evaluations, Strand Six is especially important in an evaluation context. Over the course of instrument development, the evaluator engaged in an ongoing conversation about how the instrument could and should be used. A fear arose amongst ASC staff that the instrument would be used to punish the student athlete in some way. For example, a coach may see the results and alter play-time or positions for a student athlete. This sort of application is not what the instrument is intended for, and the evaluator recommended that ASC staff monitor whether a student athlete is being punished in some way as a result of the instrument. Furthermore, the act of taking the survey alone should not make student athletes feel even more alienated from the rest of campus. Student athletes are a heavily monitored group, which may lead to feelings of disempowerment and lack of agency. The sense of belonging and self-reliance measures should not contribute to the feelings of surveillance that student athletes may already feel. Rather, taking the survey should foster a critical reflection that contributes to their academic and personal well-being. To address this issue, the evaluator and ASC staff could conduct follow-up interviews addressing these concerns. The sense of belonging and self-reliance measures are meant to assist the ED 98 course with improvements. Results from the instrument should not be used as a basis for terminating the course. The evaluator acknowledged this concern with ASC staff and ED 98 facilitators and planned to follow up with staff and facilitators in the future to monitor use of the instruments for course modification purposes.

11 Due to IRB restrictions, the evaluator cannot obtain GPA and course grade information. Future evaluative activities not connected to dissertation or publication research can utilize GPA and course grade information.

Moreover, the Leadership Team and evaluator discussed the possibility of administering the instruments to student athletes who do not take ED 98. The purpose of this survey administration would be to use the two measures to help guide ASC service delivery for all student athletes (regardless of their participation in ED 98), as well as provide a comparison group to judge the effectiveness of the ED 98 course. However, before high-stakes decisions are made about the ED 98 course and its effectiveness, the evaluator will work with the Leadership Team to ensure that the instruments have reached the level of validity and reliability necessary to make such judgements.

Summary of the Validity Argument

Overall, the analyses of the sense of belonging and self-reliance measures built a strong validity argument with regard to instrument content and response processes. Furthermore, the DIF analyses showed a lack of DIF between key population sub-groups for all but one of the questions. The open and consistent dialogue the evaluator had with key stakeholders also built a strong argument for evidence regarding relations with criteria and evidence based on the consequences of using the instrument. Despite these positive points, the validity analyses revealed specific areas for improvement. Primarily, the question that showed significant DIF needed to be addressed and revised. Additionally, the lack of banding that occurred between Level 2 and Level 3 of the Wright Map weakened the validity argument for evidence based on internal structure. Furthermore, several items showed mean location inconsistencies that also detracted from the internal structure strand. These shortcomings, however, provided critical information for the evaluator to return to the four building blocks and revise the instrument.

Revisions: A Second Iteration of the Four Building Blocks

The BAS and four building blocks approach to constructing measures is intended to be an iterative process. As demonstrated, once the evaluator analyzed the pilot responses, a series of issues arose that called for revisions to the original instrument. Thus, the evaluator returned to Building Blocks One and Two in order to revise and improve the instrument. Based on the findings from the instrument calibration, validity, and reliability tests, the evaluator came up with the following set of questions and issues to discuss with ASC staff, ED 98 facilitators, and student athletes:

1. For the following questions, options 'c' and 'd' seem to be too distinct or 'difficult' from one another (i.e., option d could be revised so more students are likely to select this option): 9, 10, 12, 16. Why might this be the case and how can we revise?
2. For the following questions, options 'c' and 'd' seem to be too similar to one another (i.e., option c could be revised so it is less likely for students to select this option): 1, 5, 6, 7, 15, 17, 19. Why might this be the case and how can we revise?
3. Possible DIF with Question 4: it is significantly harder for non-White students, spring sport students, and non-contact sport students to score higher on Question 4 (i.e., option a or b) than for White students, fall sport students, and contact sport students. Why might this be the case and how can we revise?


4. Question 10 had a large number of 'other' responses that tended to relate to 'indifference.' Why might this be the case and how can we revise?
5. Question 11 had a large number of 'other' responses that tended to relate to the statement 'wouldn't go to section.' Why might this be the case and how can we revise?
6. Question 5: options b and c are too similar to one another. Why might this be the case and how can we revise?

Primarily, the evaluator consulted the ED 98 facilitators about this list of issues. The evaluator prepared a document containing the distribution of responses for each item, as well as a compilation of the 'other' comments, for the ED 98 facilitators to review alongside the list of questions. The discussion with ED 98 facilitators resulted in several revision suggestions. Notably, comments from this session provided critical insight into the problem associated with the DIF shown in GSI.Community. Facilitators believed that the question was too far removed from the comic strip scene. Furthermore, they suggested that the use of student groups in the comic strip scenarios may have resulted in the DIF bias because some student athletes may have a negative disposition toward student groups on campus. Using the feedback from the ED 98 facilitators, the evaluator then conducted four post-hoc interviews with student athletes who had taken the survey during fall 2016. Specifically, one women's swimming, one women's water polo, and two men's golf student athletes were interviewed. These student athletes provided valuable insights into why certain question options may be too closely related and whether any questions contained specific biases. Most notably, student athletes spoke to how time management and sport-specific issues influenced their survey responses. Their comments showed that the student athlete context is diverse, featuring a number of sport-specific contexts; yet student athletes all share a common institutional identity. Student athletes also provided suggestions for improving the Guttman-type options. Finally, the evaluator compiled the comments and suggested revisions gathered from the facilitators and student athletes and presented them to the ASC Leadership Team. The Leadership Team and the evaluator reviewed the list of issues and proposed revisions. This conversation supported the comments made by the ED 98 facilitators and student athletes and yielded additional suggested revisions. The evaluator then incorporated this final feedback into a revised version of the instrument. With this revised version, the evaluator could then administer the survey to a new sample of student athletes (as presented in Chapter Three) with confidence that items had been improved from both a psychometric and a culturally competent standpoint.

Discussion: Reviewing How the BEAR Assessment System Complements a Culturally Competent Evaluation Approach

This chapter presented the case of the Athletic Study Center evaluation to show how the four building blocks were used to construct, calibrate, and validate two measures specific to the UC Berkeley student athlete population. In doing so, this chapter provides an illustration of how the four building blocks process and the resulting measures were consistent with both the essential practices of culturally competent evaluators and the CRE framework.
This section further explores how BAS complements a culturally competent approach by revisiting Step Seven for conducting a CRE evaluation (Frierson, Hood, Hughes, & Thomas, 2010), as well as the essential practices for culturally competent evaluators (see Table 1).


CRE Step Seven focuses on selecting and adapting the instrumentation. Primarily, Step Seven states: "It is very important the instruments be identified, developed, or adapted to reliably capture the kind and type of information needed to answer the evaluation questions" (Frierson, Hood, Hughes, & Thomas, 2010, p. 68). Given that the ASC sought to further understand student athletes' trajectories in the key domains of sense of belonging and self-reliance, the stakeholders and evaluator collectively determined that a survey instrument would be the best medium for investigating growth in these two domains. Per the guidelines outlined in the CRE framework, the survey instrument needed to be appropriate for the student athlete population. The CRE framework explains that while existing measures may have been validated, this does not guarantee cultural responsiveness. Therefore, the evaluator felt that new measures needed to be developed because existing instruments would not reliably capture the student athlete experience in the domains of sense of belonging and self-reliance. The evaluator chose to use the BAS framework and the four building blocks approach for constructing new measures due to its consistency with the AEA essential practices for culturally competent evaluators. Building Blocks One and Two provided the foundation for ensuring that the instrument would be appropriate for the student athlete population, and thus yield valid inferences about the target population. More specifically, in Building Block One, the evaluator worked alongside stakeholders to create construct maps for the proposed measures. These construct maps presented an explicit definition of the sense of belonging and self-reliance constructs. The construct maps reflected the ASC goals and outcomes and were specific to the student athlete population. By creating a construct specific to the target population, the instrument items would be grounded in the local perspective of what it means to feel a sense of belonging and to demonstrate self-reliance. In Building Block Two, the evaluator prioritized ensuring that the items would generate valid responses from the student athlete population, and therefore used a variety of methods and techniques to do so. Specifically, the items design process enhanced the cultural competence of the instrument through the following actions:

• The evaluator used qualitative information collected via ED 98 course document reviews, conversations with ASC staff, and observation of the ED 98 course to construct the items.
• The evaluator utilized comic strips as a means to engage the student athlete population in a culturally competent manner.
• The evaluator created Guttman-type items (as opposed to Likert) to add additional context to the items.
• The items were revised after comments from think-alouds with student athletes and an item panel with the ASC Leadership Team. These comments improved the relevancy and appropriateness of items for student athletes.

The above actions complemented the essential practices for culturally competent evaluators (see Table 1) through consulting with and engaging the student athlete population in the items design process. Moreover, the evaluator utilized comic strips and Guttman-type items to recognize and respond to differences between student athletes and the rest of the undergraduate population, as well as accurately reflect how student athletes view their own group membership.
Comic strip scenarios and Guttman-type items were based on comments and feedback from student athletes; thus, the essential practices of culturally competent evaluators were reflected in the items design.


Furthermore, Building Block Three, the outcome space, allowed the evaluator to translate the items design into a scoring approach in a manner that stakeholders could readily utilize. The intent of the outcome space was to allow stakeholders to interpret and use the instrument data for themselves. Therefore, the process of designing and refining the outcome space naturally led to the evaluator engaging stakeholders in discussions regarding what constitutes meaningful, reliable, and valid data. The outcome space provided the stakeholders with a way to interpret how each item response related back to the construct map, which created the opportunity for the evaluator to further refine the items and the construct maps in a way that was valid for stakeholders. The measurement model and instrument calibration also strengthened the instrument's cultural competence. The Rasch partial credit model was selected due to its ability to model both items and respondents; this feature allowed for analyses regarding potential biases and the appropriateness of the instrument. The Wright Map produced during instrument calibration presented a clear visual representation of the extent to which the items matched the respondent sample. Specifically, viewing the distribution of both respondents and items on the same scale provided a way to determine if the difficulty of the items matched the ability of the respondents. Additionally, the measurement model allowed for an analysis of 'person fit' to determine if there were any consistent similarities among respondents who responded in a way that did not fit the general pattern. This inquiry allowed the evaluator to determine if there might be potential problems with the instrument. Moreover, the DIF analyses allowed the evaluator to examine DIF between subgroups at the item level, and therefore to determine if specific items were significantly easier or more difficult for certain subgroups. Therefore, as shown, the four building blocks approach guided by the BAS framework allowed the evaluator to design, calibrate, and validate quantitative measures in a culturally competent manner. However, just as the AEA Public Statement on Cultural Competence states that cultural competence is not a finite destination, but something evaluators are always striving toward, such is the case with constructing measures. The BAS framework and four building blocks are iterative, thus not assuming that instrument perfection is ever completely obtained. Rather, the BAS framework suggests that the evaluator should repeat the four building blocks to continue improving the reliability and validity of the measure. In doing so, the evaluator not only strengthens the measure's psychometric validity, but also continuously reflects upon and revisits the cultural competence of the measure. However, this surfaces the question of at what point in this iterative cycle an instrument can yield inferences considered valid by the local population without causing any undue harm (such as reinforcing stereotypes or prompting any unfair high-stakes decisions). Therefore, it is critical for an evaluator to be transparent about an instrument's limitations (as discussed in the following section) and also to frame limitations as a learning tool for further improving the psychometric validity and cultural competence of a measure.
Limitations and Next Steps

While the BAS framework and corresponding four building blocks approach to constructing measures facilitates a process consistent with culturally competent approaches to evaluation, the evaluator did experience limitations when striving to align the instrument with the CRE framework and essential practices of culturally competent evaluators. Primarily, the evaluator wanted to ensure that the student athlete population was not treated homogeneously, and therefore looked for differences between key subgroups via DIF analyses.

However, these DIF analyses were limited by the respondent sample size. Identifying and understanding any differences between Summer Bridge participating athletes and non-Summer Bridge participating athletes was a key point noted by ASC staff. The small sample size of Summer Bridge student athletes did not allow for the DIF analyses; thus, any potential bias with this subgroup would have to be explored through qualitative means, or in later studies. Furthermore, a larger sample size would allow for DIF analyses regarding gender and sport affiliation that included more categories, thus avoiding homogenization. Aside from sample size, the construct maps created in Building Block One may restrict the fluidity of a student athlete's growth in the sense of belonging and self-reliance constructs. Specifically, each construct is comprised of several characteristics (see Figures 2 and 3). While each construct map is hypothetical, it assumes that a student athlete would possess the same level of each characteristic; however, this may not be the case. For example, on the sense of belonging construct map, a student athlete may be higher on the 'Holistic Perspective' characteristic than on the 'Identity Acceptance' characteristic. Accurately measuring such a difference would require a multidimensional modeling approach, which would only be possible with about twice as many items and a larger calibration sample of respondents. Such an extended survey instrument is not feasible given the ASC program context. Therefore, the evaluator and stakeholders should exercise caution when attempting to use the instrument to assess individual characteristics within the construct map. Because sample size and multidimensionality did represent limitations for ensuring the cultural competence of the instrument, the evaluator continued to reflect upon improvements to the instrument via the four building blocks iterative cycle. Despite these limitations, the evaluator felt that the validity argument (as discussed), as well as the subsequent revisions, rendered the instrument ready to implement as an evaluative measure to determine student athlete growth in sense of belonging and self-reliance. Given the intended use of the instrument and the strength of the validity argument, the evaluator felt that inferences made using the instrument would yield valid and useful data and would not produce any undue negative consequences for student athletes and the ASC. Therefore, the revised instrument was used to conduct a pre- and post-assessment of student athletes in the fall 2017 ED 98 course. The following chapter thus continues the case of the ASC evaluation to present how an evaluator can use item response modeling to analyze pre/post quantitative measures in a manner consistent with the CRE framework and essential practices of a culturally competent evaluator.


Chapter Three

Applying a Culturally Competent Approach to a Latent Growth Item Response Model Analysis


Chapter Two outlined how an evaluator can design, validate, and calibrate quantitative measures that are consistent with a culturally competent evaluation approach. After the instrument piloting process described in Chapter Two, the evaluator used the revised instrument to collect and analyze data that address the evaluation questions. The analytic technique must correspond to the evaluation's approach (i.e., cultural competence) and design (e.g., one-group pretest/posttest, time series, etc.). Findings from Chapter One illustrated that 22 percent (n=24) of the reviewed studies utilized a pre/post or longitudinal (i.e., time series) design. Many of these evaluations used a quantitative data collection instrument. Following the culturally responsive evaluation (CRE) framework, the tools and techniques used to analyze the data collected from the quantitative instrument should be consistent with a culturally competent approach to evaluation (Frierson, Hood, Hughes, & Thomas, 2010). Chapter One's research synthesis outlined the predominant quantitative analytic techniques used in evaluations consistent with a culturally competent approach. Descriptive statistics and frequencies were the most prevalent quantitative techniques. In a few cases, evaluators used ANOVAs and t-tests to test for statistically significant differences between groups. Furthermore, one evaluation used a multiple regression to understand how specific variables interacted with the outcome measure (Garaway, 1996). Given that most quantitative analyses reviewed in Chapter One used descriptive statistics and frequencies and seldom used inferential statistics, one may question whether advanced quantitative analyses12 can align with a culturally competent evaluation approach. This chapter investigates this question by extending the evaluation case presented in Chapter Two and describing the analysis phase of the Athletic Study Center (ASC) evaluation. Specifically, this chapter addresses the following research question: How can an evaluator apply the latent growth item response model (LG-IRM) to analyze pre-post data in a way that aligns with a culturally competent evaluation approach? Through addressing this primary question, this chapter also investigates the following sub-questions: 1) How can an advanced quantitative analysis help evaluators better respond to and engage with local culture? and 2) When is an advanced quantitative analysis appropriate? This chapter begins with a conceptual orientation toward cultural competence during the analysis phase of an evaluation. Following this orientation, the chapter discusses the existing literature regarding culturally competent data analysis and if and how the LG-IRM fits into this approach. The chapter then revisits the ASC evaluation case and discusses the methods used to analyze the evaluation case's pre-post data. Next, the chapter presents findings from the evaluation case and discusses how these findings helped the evaluator better engage with and respond to local culture. The chapter concludes with a discussion regarding the benefits and limitations of using the LG-IRM, as well as the contexts in which an advanced quantitative analysis may be appropriate.

Conceptual Orientation

This chapter focuses on an evaluation's analysis phase. Culturally competent approaches during the analysis phase are informed by the American Evaluation Association's (AEA) Public Statement on Cultural Competence in Evaluation as well as the CRE framework.
The essential practices listed in AEA’s Public Statement on Cultural Competence in Evaluation state that

12 Advanced quantitative analyses are considered to be quantitative analyses that move beyond descriptive statistics, frequencies, ANOVAs, and t-tests.

64 evaluators should recognize and respond to differences both within and between cultures and accurately reflect how individuals view their own group membership (American Evaluation Association, 2011). Furthermore, the CRE framework recommends disaggregating data in order to reveal important findings that may be masked through aggregate data (Frierson, Hood, Hughes, & Thomas, 2010). The overarching message in both statements is that the analysis phase should allow the evaluator to better engage with and respond to culture. Before exploring possible analytic approaches, the proceeding section first outlines why cultural competence is important during the analysis phase. Credible Evidence and Cultural Competence The Encyclopedia of Evaluation defines evaluation as “an applied inquiry process…Conclusions made in evaluations encompass both an empirical aspect (that something is the case) and a normative aspect (judgement about the value of something). It is the value feature that distinguishes evaluation from other types of inquiry” (Fourneir, 2005, pp. 139-140). Evaluation scholars have written extensively about the concept of value in evaluation, noting that evaluation practice and values are deeply intertwined (Schwandt, 2005). Specifically, Greene (2006) states that evaluators should acknowledge and engage with the plurality of values in each evaluation context. Within that engagement, evaluators must decide which and whose values are best to promote and which values are most defensible. Greene (2006) argues that defensible values relate to: “political democratic ideals, namely, social justice, equality, empowerment, and emancipation,” (p.118). Greene’s stance holds implications for how evaluators view the nature and purpose of social knowledge and the purpose of evaluation in society. These arguments do not necessarily imply a specific methodology or analytic approach but rather, argue that the method and analysis must be subservient to the overarching evaluation questions and the values underlying those questions. Thus, when planning for the analysis phase, evaluators should consider the question: Whose interests does the evaluation serve? While this question does not prescribe a particular methodological technique, it holds implications for how evaluators select and use analytic techniques. Conducting an analysis that aligns with a culturally competent approach promotes the values of social justice, equality, empowerment, and emancipation, which is the orientation taken in this dissertation. With regards to specific methods, MacDonald highlights case studies for educational evaluations as a way to promote democracy (MacDonald, 1977). Furthermore, Abma (2001) suggests that constructing and analyzing narratives is a promising approach to ensuring all stakeholder voices are heard. These qualitative analytic methods are seen as best practices for allowing evaluators to engage with and respond to local culture. However, quantitative methods can also correspond with a culturally competent approach. The existing literature does not highlight any specific quantitative method as a best practice; however, this dissertation suggests that item response modeling can be viewed as a best practice when using quantitative methods for a culturally competent approach. Chapter One illustrated that qualitative methods are the dominant approach taken by evaluations aligning with a culturally competent approach. 
This finding leads to the question of whether quantitative analyses can allow the evaluator to better engage with and respond to culture, and for what purpose. One naturally thinks of quantitative analyses for outcomes-focused evaluations (Greene & Henry, 2005). Classic impact and outcome evaluations often have a quantitative focus and have been criticized for having a myopic view on outcomes without properly accounting for contextual factors that help explain such outcomes (Greene & Henry, 2005). Furthermore, critiques presented in the American Evaluation Association's response to the US Department of Education's prioritization of randomized controlled trials suggest that qualitative methods yield findings that are more sensitive and valid for the local culture. Specifically, the AEA 2003 response stated: "RCTs examine a limited number of isolated factors that are neither limited nor isolated in natural settings. The complex nature of causality and the multitude of actual influences on outcomes render RCTs less capable of discovering causality than designs sensitive to local culture and conditions and open to unanticipated causal factors" (American Evaluation Association, 2003). This statement implies that RCTs, and perhaps quantitative methods in general, are not the preferred design for evaluations seeking to be sensitive to local culture. However, if quantitative methods are not rejected as part of the CRE framework (see Chapter Two), then what would a quantitative analysis look like, and what technical tools do evaluators have for responding to local culture through quantitative methods? The following section outlines quantitative analytic techniques that may respond to local culture and specifically focuses on the latent growth item response model as one possible quantitative tool that can align with a culturally competent approach to evaluation.

Literature Review

Table 4 in Chapter One summarized the quantitative data analysis techniques that aligned with a culturally competent evaluation approach. Building from Chapter One's review, this section explores additional literature that does not explicitly take a culturally competent approach but uses quantitative analyses to address contextual and cultural issues. In particular, Newton and Llosa (2010) discussed how hierarchical linear models facilitated more nuanced findings in quantitative program evaluations. Newton and Llosa (2010) used hierarchical linear models (HLM) to evaluate a reading curriculum. The authors determined that HLM: (1) enabled a better estimate of the average program outcomes; (2) allowed for an estimate of student outcome variation between classrooms, within classrooms, and between schools; and (3) provided a way for linking implementation data to program outcomes via a cross-level interaction term (Newton & Llosa, 2010). Therefore, through using HLM, Newton and Llosa (2010) accounted for contextual factors related to implementation quality, as well as differences in outcomes between subpopulations.

Furthermore, Gee (2014) outlined how to apply multilevel growth modeling to better understand participants' individual growth. Specifically, Gee (2014) described how multilevel growth modeling was applied to an evaluation examining how infants' growth in cognitive and linguistic functioning differed by treatment status. Through using multilevel modeling, Gee (2014) allowed for individual variation in growth rates, which is not possible through a simple ordinary least squares regression. By allowing for variation in growth rates, Gee (2014) used multilevel modeling to respond to contextual nuance.

The Latent Growth Item Response Model

Longitudinal data is composed of measures that have been repeated over a length of time. Longitudinal data can be analyzed through a variety of techniques, such as the models presented by Gee (2014) and Newton & Llosa (2010). However, item response modeling, as described in Chapter Two, allows for additional analyses that address contextual elements of an evaluation. While HLM and multilevel growth modeling have been discussed in the evaluation literature, the evaluation literature lacks examples of growth models using item response modeling. One particular model, the latent growth item response model (LG-IRM) (Wilson, Zheng, & McGuire, 2012), is an item response model that can align with both AEA's Public Statement on Cultural Competence in Evaluation and the CRE framework.

The LG-IRM is used for longitudinal data analysis. The LG-IRM is essentially a multidimensional model in which each dimension represents a separate time point. As described by Wilson et al. (2012) and elaborated upon in the methods section, the LG-IRM is a special case of the Multidimensional Random Coefficients Multinomial Logit (MRCML) model. The LG-IRM uses data from across all time points to create a more accurate estimate of ability and growth, in a similar way as the Andersen (1985) and Embretson (1991) models. The LG-IRM represents a hybrid between a standard growth model and an item response model. Wilson et al. (2012) conducted a series of simulations showing that the LG-IRM provided slightly better estimates than an HLM model. In addition, the LG-IRM facilitates the following extensions to allow for more responsiveness to local context:

1. Curvilinear model fitting to allow non-linear growth over time;
2. Addition of covariates to incorporate time-varying characteristics and other characteristics;
3. Multidimensional growth to show change in multiple domains;
4. Differential item functioning to show significant differences in items between specific subgroups; this analysis can hold time constant or allow time to vary.

Therefore, the LG-IRM is considered both statistically rigorous and appropriately flexible to examine multiple contextual factors. The LG-IRM characteristics may be highly suitable for several program evaluation contexts utilizing longitudinal data.

While the LG-IRM has not been specifically discussed in the program evaluation literature, additional literature points to its utility. Wilson et al. (2012) demonstrated how the LG-IRM can be applied to measure growth in both academic achievement and self-esteem. Regarding academic achievement, the authors analyzed three years of social studies assessments, as reported in the National Education Longitudinal Study of 1988 (NELS:88). The analysis showed mean student growth estimates and comparisons between student ability and test difficulty. Wilson et al. (2012) also applied the LG-IRM to six administrations of the Rosenberg Self-Esteem Scale. The analysis showed that the average level of self-esteem decreased over time. The authors also plotted individual student changes to show that some students start at lower levels of self-esteem and have a steeper upward slope, and vice versa. Furthermore, individual quadratic growth trajectories were plotted for a random sample of 50 students to show how self-esteem growth trajectories are not always linear. Wilson et al. (2012) also investigated differential item functioning (DIF) and found several items in which the mean for females was significantly lower than the mean for males.

Similar to Wilson et al. (2012), McGuire (2010) also applied the LG-IRM to self-esteem data. Specifically, McGuire (2010) used data from the National Survey of Black Americans to quantify improvements in low self-esteem over time. Distinct from Wilson et al. (2012), this analysis showed how the LG-IRM can be applied to polytomous data to provide interpretations of the amount of change in low self-esteem over time. Additionally, Weber (2012) applied the LG-IRM to pre-post data.
Weber (2012) described a vocabulary test that was administered at two time points for children in Madagascar and demonstrated how the LG-IRM reduced standard errors for abilities in the posttest through borrowing information from the pretest. Furthermore, Weber (2012) used the LG-IRM to estimate abilities for children with missing scores.

The existing literature shows that the LG-IRM is appropriate for longitudinal data (including pre-post data). Moreover, the LG-IRM is suited for longitudinal data derived from a measure, such as an academic or psychological assessment, because it incorporates item response modeling properties with properties contained in other growth models such as hierarchical and multilevel growth modeling. Thus, the LG-IRM may produce more precise estimates for repeated measures data than hierarchical or multilevel growth models. Furthermore, LG-IRM applications have been successful for social-emotional measures, such as those described in Chapter Two. Finally, allowing for individual growth curves, conducting DIF analyses, and adding covariates to the growth model facilitates a contextual responsiveness that aligns with a culturally competent approach to evaluation. The LG-IRM therefore matches the analytic needs and theoretical framework of the ASC evaluation.

Case Study Context

Chapter Two presented the background and evaluation context for the Athletic Study Center (ASC). Specifically, Chapter Two focused on outlining the ASC's context, evaluation needs, and purpose for creating measures of sense of belonging and self-reliance. This chapter extends the ASC evaluation case to focus on how the sense of belonging and self-reliance measures were administered and analyzed.

As discussed in Chapter Two, the ASC sought a better understanding of how the holistic aspects of their service delivery model (see Figure 1 in Chapter Two) impacted student athletes throughout their enrollment at UC Berkeley, thus gauging the ASC's progress toward their overall mission. By creating the sense of belonging and self-reliance measures, the ASC sought to understand student athletes' growth trajectories in these two domains. This information would assist the ASC with collecting both formative and summative information.

Formatively, longitudinal data collected from the sense of belonging and self-reliance measures can be used to examine individual student growth trajectories. Such information may inform ASC advising and academic support. Furthermore, student athlete subgroups such as athletic teams, international student athletes, or racial/ethnic groups can be analyzed to observe any trends among specific populations. Findings from such analyses may inform intervention targets for student athlete subpopulations. Growth information via the sense of belonging and self-reliance measures can provide findings that inform potential improvements and modifications to ASC service delivery in collaboration with campus partners.

A longitudinal analysis of the sense of belonging and self-reliance measures also provides information for summative purposes. As discussed in Chapter Two, the academic progress rate (APR) is an imperfect measure of summative student athlete success (Van Rheenen, 2015). Growth scores on the sense of belonging and self-reliance measures provide additional data points to tell a more complete picture of the student athlete experience at UC Berkeley. Given the ethical issues related to randomization of student athletes, as well as the current impracticality of accessing student athletes from a representative institution, an experimental design cannot be implemented at the ASC in the current context.
Thus, any observed growth in the sense of belonging and self-reliance domains cannot be causally attributed to the ASC's services alone. However, student athlete growth trajectories on these two measures provide a summative judgement of the extent to which student athletes are working toward the ASC's goals of fostering a sense of belonging and self-reliance among student athletes. Such information may be of interest to ASC external stakeholders.

Given both summative and formative needs, the ASC evaluation called for a longitudinal analysis of student athletes' sense of belonging and self-reliance. As a start to administering and collecting these data, the ASC Leadership Team chose to focus on growth among student athletes enrolled in the fall 2017 ED 98 course. All services at the ASC focus on building student sense of belonging and self-reliance; however, the ED 98 course places a particular emphasis on fostering these characteristics.13 Therefore, student athletes in the fall 2017 ED 98 course were given the sense of belonging and self-reliance instruments as pre-post measures at the beginning and end of the semester. In order to meet the ASC's evaluation needs, as well as align the evaluation with a culturally competent approach, the evaluator needed to apply a quantitative growth model that was capable of producing a statistically rigorous growth estimate and examining contextual variables to provide a richer explanation of student athlete growth trajectories. Therefore, the evaluator chose to apply the LG-IRM quantitative growth model to the ED 98 pre-post data.

Description of the ED 98 Fall 2017 Students

During the fall 2017 semester, 80 freshman student athletes enrolled in ED 98. This represented 33 percent of the freshman student athlete population (240 freshman student athletes enrolled in fall 2017). These student athletes were all freshmen newly admitted to UC Berkeley. While ASC advisers encouraged all freshman student athletes to enroll in the course, it was not mandatory, and student athletes themselves made the decision to enroll in ED 98. Furthermore, ED 98 student athletes were spread across four sections, each with approximately 20 students.

The evaluator administered the sense of belonging and self-reliance instruments to all ED 98 sections during week three for the pretest and week twelve for the posttest. Sixty-two student athletes took both the pretest and the posttest. Twelve students took the pretest but were not in class to take the posttest (they either dropped the course or were absent); these students were not included in the analysis. Only students with both a pretest and posttest were analyzed. The student athletes participating in the pretest/posttest represented 17 different sports teams. Mirroring the pilot test described in Chapter Two, the survey took approximately fifteen minutes to complete. Of the participating students, 40 percent identified as female (n=25), 11 percent were international students (n=7), 15 percent participated in Freshman Edge (n=9), and three percent participated in Summer Bridge (n=3).14 Furthermore, Table 10 outlines the race/ethnicity of students participating in the pre/post survey.

13 Refer to Chapter Two for a detailed description of the ASC's services and ED 98 course content.
14 Freshman Edge is a summer program for UC Berkeley students enrolling in the fall semester. Freshman Edge offers popular and required courses, as well as program events designed to help students transition into college life. Similarly, Summer Bridge is a six-week academic residential summer program in which students enroll in two credit-bearing courses and receive counseling services to help transition into college.


Table 10. Racial/Ethnic Distribution of Pre/Post Survey Participants

Race/Ethnicity                      Percentage (n)
Asian                               5 (3)
Black/African American              8 (5)
Hispanic/Latino                     0 (0)
Native Hawaiian/Pacific Islander    3 (2)
White/Caucasian                     63 (39)
Mixed Race                          21 (14)

The characteristics of the student athletes participating in the pre/post survey largely resembled the student athlete population that participated in the fall 2016 pilot survey. Therefore, the survey revised after the fall 2016 pilot was considered appropriate for the fall 2017 participants. Similar to the pilot survey, the pre and post surveys were administered via paper and pencil. The evaluator scored the survey data per the outcome space scoring guide (see Chapter Two) and entered it manually into the data file. To continue refining the survey for future iterations, an 'other' option was provided on all of the questions to collect additional data on if and how question options could be modified. Any question that was answered with an 'other' explanation was coded as missing, and the evaluator recorded all 'other' comments to help improve the survey.

Methods

This chapter presents the LG-IRM (Wilson, Zheng, & McGuire, 2012) and relevant extensions to show how quantitative analytic techniques can align with a culturally competent evaluation approach. More specifically, the LG-IRM and its extensions are applied to produce accurate growth estimates as well as incorporate contextual and cultural factors. The ASC evaluation case demonstrates how the LG-IRM can respond to local culture and context.

Chapter Two outlined how the sense of belonging and self-reliance measures were calibrated and validated using the Rasch partial credit model. As a follow-up, the revised survey instruments were implemented as a pre-post measure. The evaluator analyzed this pre-post data using a polytomous latent growth item response model. The LG-IRM is part of the Rasch model family and has been used to analyze longitudinal data. More specifically, the LG-IRM is a multidimensional item response theory (IRT) model that uses the formulation of the Multidimensional Random Coefficients Multinomial Logit (MRCML) model. A multidimensional item response model assumes that items correspond to a specific dimension within the overall instrument. The LG-IRM is used because it retains the estimate for each item while modeling dependencies between items. Thus, the LG-IRM is preferable to the Andersen (1985) model (which treats each time point as a separate latent variable) or the consecutive model (Embretson, 1991) because the LG-IRM provides more accurate parameter estimates when item dependence is present within an assessment (Wilson, Zheng, & McGuire, 2012).

The LG-IRM takes the framework of the MRCML but adds a linear change dimension to explain average growth and variation in growth. Both the initial dimension and the linear change dimension correspond to the same latent ability measure; however, the model estimates two person-specific latent variables (a baseline ability and a linear change) so that ability can differ across time points. Equation Two demonstrates the LG-IRM for three time points. While the example in this chapter has only two time points, three time points are shown in order to illustrate how one would measure growth beyond the pre/posttest, as the ASC may plan to implement.

\operatorname{logit}(P_{ip1}) = \theta_p + 0 \cdot \gamma_p - \delta_i,
\operatorname{logit}(P_{ip2}) = \theta_p + 1 \cdot \gamma_p - \delta_i, \quad \text{and} \qquad (2)
\operatorname{logit}(P_{ip3}) = \theta_p + 2 \cdot \gamma_p - \delta_i
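The individual symbols in Equation Two are defined in the paragraph that follows. As a minimal, purely illustrative sketch (not the ConQuest estimation actually used; the function names and numeric values below are hypothetical), the logic of Equation Two and its polytomous partial credit extension can be written out in Python, with the person's location at time t taken to be θ_p + t·γ_p:

import numpy as np

def pcm_category_probs(propensity, step_difficulties):
    # Partial credit category probabilities for one item: category k's
    # numerator is exp of the cumulative sum of (propensity - step_j) for
    # j <= k, with an empty sum (zero) for category 0.
    cum = np.concatenate(([0.0], np.cumsum(propensity - step_difficulties)))
    expcum = np.exp(cum)
    return expcum / expcum.sum()

def lgirm_propensity(theta_p, gamma_p, time_point):
    # Equation Two's person location: baseline ability plus time * growth.
    return theta_p + time_point * gamma_p

# Hypothetical values for illustration only (not estimates from the ASC data).
theta_p, gamma_p = 0.8, 0.2          # baseline ability and linear growth (logits)
steps = np.array([-0.5, 0.3, 1.1])   # step difficulties for a four-category item

for t in (0, 1):                     # pretest (t = 0) and posttest (t = 1)
    probs = pcm_category_probs(lgirm_propensity(theta_p, gamma_p, t), steps)
    print(f"time {t}: category probabilities {np.round(probs, 3)}")

Running the sketch simply shows the category probabilities shifting toward higher categories from pretest to posttest whenever γ_p is positive, which is the sense in which the second dimension captures growth.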

In Equation Two, γ_p represents the linear change dimension. θ_p and γ_p have a bivariate normal distribution with a nonzero mean. Thus, the LG-IRM will estimate growth from the pre-post data. The evaluator applied the LG-IRM to the pre/post data using ConQuest software (Wu, Adams, & Wilson, 2012). It should be noted that in the pre/post format the simple LG-IRM is equivalent to the Embretson (1991) model, though this is not true when DIF and/or covariates are included.

To estimate the polytomous LG-IRM, a scoring matrix and design matrix must be specified (per the MRCML formulation). The MRCML specifies scoring vectors b_i, which represent the scores associated with selecting category j of each item i. ConQuest defines the scoring matrix in the command syntax (see Appendix). The design matrix is composed of individual design vectors. In a polytomous context, a design vector a_ij specifies the combination of item and step parameters that correspond to a response in any category. Because the LG-IRM is designed for longitudinal data contexts, the vectors include responses from each time point (pre and post in this example). Therefore, the design matrix represents the design vectors during each wave. Since the same item and step parameters were used to model responses at both time points, the design matrix is essentially replicated from the pretest to the posttest. As an example, the design matrix for the self-reliance instrument is shown in the Appendix. Unlike the scoring matrix, the evaluator created a design matrix for both the sense of belonging and self-reliance measures and imported them into ConQuest for estimation.

Extensions to the LG-IRM

The first extension explored in this chapter is differential item functioning. The differential item functioning (DIF) analysis examines whether individual items behave differently between specific subgroups. The subgroups analyzed are as follows:

1) Gender;
2) Freshman Edge/Summer Bridge vs. non-Freshman Edge/Summer Bridge;15
3) Non-White student athletes vs. White student athletes.

To determine DIF, item difficulties and corresponding standard errors were estimated separately for each subgroup. The DIF effect was constrained to be equal over time. Plotting the estimates against each other provides an indication of whether items behaved differently between the aforementioned subgroups (Wright & Masters, 1982).
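As an illustration of this plotting approach only (the item names, difficulties, and standard errors below are hypothetical, not the ASC calibrations), subgroup estimates can be plotted against the identity line with rough 95 percent control bands formed from the pooled standard errors:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical subgroup calibrations (logits) and standard errors; in practice
# these come from separate calibration runs for each subgroup.
items  = ["item1", "item2", "item3", "item4"]
diff_a = np.array([-0.40, 0.10, 0.65, 1.20])   # e.g., female calibration
se_a   = np.array([0.21, 0.19, 0.20, 0.24])
diff_b = np.array([-0.35, 0.55, 0.60, 1.10])   # e.g., male calibration
se_b   = np.array([0.20, 0.18, 0.21, 0.23])

fig, ax = plt.subplots()
ax.errorbar(diff_a, diff_b, xerr=se_a, yerr=se_b, fmt="o")
for name, x, y in zip(items, diff_a, diff_b):
    ax.annotate(name, (x, y))

# Identity line plus rough 95 percent control lines (plus/minus 1.96 times the
# pooled SE); items falling outside the band are flagged for possible DIF.
grid = np.linspace(-1.5, 2.0, 50)
band = 1.96 * np.sqrt(se_a.mean() ** 2 + se_b.mean() ** 2)
ax.plot(grid, grid, "k-")
ax.plot(grid, grid + band, "k--")
ax.plot(grid, grid - band, "k--")
ax.set_xlabel("Item difficulty, group 1 (logits)")
ax.set_ylabel("Item difficulty, group 2 (logits)")
plt.show()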

15 Freshman Edge/Summer Bridge DIF was only conducted for the Self-Reliance instrument; missing data prevented a DIF analysis from being conducted on the Sense of Belonging instrument.


The second extension incorporates covariates into the LG-IRM. The inclusion of covariates allowed the evaluator to analyze whether differences in growth were associated with additional student athlete characteristics. The specific covariates included are as follows:

1. Gender;
2. Revenue vs. non-revenue athletes;
3. Race/Ethnicity;
4. Summer Bridge/Freshman Edge participation.

A latent regression can be performed under the MRCML framework. Thus, because the LG-IRM utilizes the MRCML framework, a latent regression can be conducted within the LG-IRM. The latent Rasch regression model is a generalized linear mixed model (GLMM) that decomposes the person parameter; this model is considered 'person explanatory' (Wilson & De Boeck, 2004). Thus, the latent Rasch regression model allows person properties (such as international student athlete versus domestic student athlete) to be predictors. One can think of this model as the latent person parameter θ_p regressed on external person variables. To expand the one-parameter Rasch model in this way, person ability θ_p is replaced with a linear regression equation, as seen in Equation Three.

\eta_{pi} = \sum_{j=1}^{J} \vartheta_j X_{pj} + \varepsilon_p - \beta_i \qquad (3)
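The individual symbols are defined in the paragraph that follows. As a hedged sketch (an illustration of the idea rather than the exact parameterization used by the estimation software), the same person-explanatory logic can be carried into the pre-post LG-IRM by letting both person parameters from Equation Two depend on the person properties:

\operatorname{logit}(P_{ipt}) = \theta_p + t\,\gamma_p - \delta_i, \qquad t = 0, 1,
\theta_p = \sum_{j=1}^{J} \vartheta_j^{(0)} X_{pj} + \varepsilon_p^{(0)}, \qquad
\gamma_p = \sum_{j=1}^{J} \vartheta_j^{(1)} X_{pj} + \varepsilon_p^{(1)}.

Under this reading, the coefficients attached to the growth parameter describe how growth differs by person property, which is the kind of quantity summarized later in Table 13.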

In Equation Three, the person parameter θ_p is regressed on the person characteristics: X_pj is the value of person p on characteristic j (such as male or female), ϑ_j is the fixed regression coefficient of person property j, ε_p represents the remaining person effect left over after accounting for the personal characteristics, and β_i is the item difficulty (the δ_i of Equation Two). The analyses conducted in this chapter extend the validation and calibration analysis presented in Chapter Two in order to demonstrate how item response models can be applied to align with culturally competent approaches to evaluation.

Findings

The evaluator applied the LG-IRM to the pre-post data to estimate ED 98 student growth in sense of belonging and self-reliance. Findings show that the mean baseline ability for Self-Reliance was 0.634 logits (Table 11). This baseline ability increased on average by 0.235 logits. Relating these results back to the construct map and Wright Map presented in Chapter Two, on average, students began on the border of level 3 and level 4 and finished the ED 98 course at level 4 on the construct map (the highest level).

Table 11. Self-Reliance Person Parameter Estimates

Dimension    Mean     SE       Variance
Baseline     0.794    0.101    0.380
Growth       0.235    0.114    0.306

Note: The correlation between baseline EAP and growth equals 0.197, and the correlation between baseline and post EAP equals 0.858.


For the Sense of Belonging instrument, findings show that the mean baseline ability was 1.023 logits (Table 12). This baseline ability increased on average by 0.004 logits. Relating these results back to the construct map and Wright Map presented in Chapter Two, on average, students began at level 4 (the highest level) and finished the course in a relatively similar place.

Table 12. Sense of Belonging Person Parameter Estimates

Dimension    Mean     SE       Variance
Baseline     1.023    0.095    0.333
Growth       0.004    0.081    0.039

Note: The correlation between baseline EAP and growth equals 0.785, and the correlation between baseline and post EAP equals 0.996.

Individual pre and post EAP scores were also plotted to see how growth varied among ED 98 student athletes; Figure 18 and Figure 19 show that growth rates were often particular to the individual student athlete. Looking at the trajectories, one can see that student athletes who started with the lowest abilities on both measures often had a steeper slope. Furthermore, there were student athletes who exhibited negative growth on either or both the sense of belonging and self-reliance instruments. Specifically, 18 students showed negative growth on the self-reliance instrument and 26 students showed negative growth on the sense of belonging instrument. Stakeholders interpreted this finding to imply that some students began the year with high athletic and academic expectations and, despite the ED 98 course content, became overwhelmed throughout the semester by the rigor of coursework and student athlete demands.
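A display of this kind is straightforward to reproduce; in the following minimal sketch the pre and post EAP values are random placeholders rather than the ASC estimates, which were taken from the ConQuest output.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical pre/post EAP estimates (logits) for a handful of students;
# in the actual analysis these came from the ConQuest EAP output.
rng = np.random.default_rng(0)
pre_eap  = rng.normal(0.8, 0.6, size=15)
post_eap = pre_eap + rng.normal(0.2, 0.4, size=15)

fig, ax = plt.subplots()
for pre, post in zip(pre_eap, post_eap):
    ax.plot([1, 2], [pre, post], marker="o", color="grey", alpha=0.7)
# Overlay the mean trajectory for reference.
ax.plot([1, 2], [pre_eap.mean(), post_eap.mean()], color="black",
        linewidth=3, label="mean trajectory")
ax.set_xticks([1, 2])
ax.set_xticklabels(["pretest", "posttest"])
ax.set_ylabel("Self-Reliance (logits)")
ax.legend()
plt.show()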

Figure 18. Self-Reliance Individual Growth Trajectories (y-axis: Self-Reliance in logits; x-axis: time)


Figure 19. Sense of Belonging Individual Growth Trajectories (y-axis: Sense of Belonging in logits; x-axis: time)

The fit statistics for both sense of belonging and self-reliance show appropriate fit for all items. Additionally, regarding the instrument items, the evaluator calculated a Cronbach's alpha of 0.71 for sense of belonging and 0.60 for self-reliance. This is an improvement over the reliability statistics reported in Chapter Two (0.53 for sense of belonging and 0.58 for self-reliance). A lower number of 'other' responses was recorded in this administration of the survey, which may explain the improvement in reliability.

Differential Item Functioning

The evaluator conducted a differential item functioning analysis to detect bias in individual items among subgroups. DIF parameters were explored for the following subgroups: male vs. female, Summer Bridge/Freshman Edge participants vs. non-participants, and White vs. non-White student athletes. These parameters indicated whether a particular item was more or less difficult for a specific subgroup after allowing for group differences in overall mean ability. DIF analyses were conducted on both the pretest scores and the posttest scores.

For the pretest, DIF was not present between non-White and White student athletes, nor between Summer Bridge/Freshman Edge participants and non-participants. However, significant DIF16 was shown between females and males for one item, the Group.IdentityB item (see Chapter Two for an explanation of item labels). Specifically, this question was significantly easier for females than for males. ED 98 course facilitators commented that they were not surprised by this finding, as they felt that the female students entered the course with more confidence than the males.

16 Refer to Chapter Two for detail regarding how significant DIF was determined.


For the posttest, DIF was also not present between Summer Bridge/Freshman Edge participants and non-participants. However, significant DIF was shown between females and males for the Flyer.Holistic item. Specifically, this question was significantly easier for females than for males, perhaps indicating that females were more open to on-campus social experiences than males. Interestingly, the Group.IdentityB item did not show significant DIF in the posttest. Furthermore, the Dorm.Achieve question on the posttest self-reliance instrument was significantly easier for non-White student athletes. No difference was seen on this question for the pretest. ED 98 stakeholders suggested that this difference might be due to the fact that more students of color have assigned 'learning specialists' who encouraged them to complete their assignments on time. This DIF analysis also indicates that the instrument may still have validity issues that should be explored and appropriately revised.

Covariates

In addition to examining DIF across subgroups, the evaluator analyzed differences in growth associated with specific characteristics. The latent regression analysis was performed separately from the DIF analysis using Mplus software (Muthen & Muthen, 1981-2012). These characteristics were: gender, revenue-generating status, race/ethnicity, and Summer Bridge/Freshman Edge participation. Table 13 displays the estimates and standard errors from this latent regression for each of the covariates.

Table 13. Latent Regression Estimates

Covariate                                                 Sense of Belonging    Self-Reliance
Female (male reference)                                   -0.268 (0.153)        0.109 (0.183)
Non-White (White reference)                               -0.062 (0.149)        0.089 (0.103)
Revenue Generating (non-revenue generating reference)      0.088 (0.245)        0.169 (0.303)
Summer Bridge/Freshman Edge (non-participant reference)    0.244 (0.216)        0.393 (0.250)
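As a rough Wald-type check of the pattern described next (assuming approximate normality of the estimates; the evaluator's own significance criterion may have differed), each coefficient in Table 13 can be compared with 1.96 times its standard error:

# Rough Wald-style check of the latent regression estimates in Table 13:
# a coefficient is flagged only if |estimate| exceeds 1.96 standard errors.
estimates = {
    "Female (SoB)": (-0.268, 0.153), "Female (SR)": (0.109, 0.183),
    "Non-White (SoB)": (-0.062, 0.149), "Non-White (SR)": (0.089, 0.103),
    "Revenue (SoB)": (0.088, 0.245), "Revenue (SR)": (0.169, 0.303),
    "Bridge/Edge (SoB)": (0.244, 0.216), "Bridge/Edge (SR)": (0.393, 0.250),
}
for name, (est, se) in estimates.items():
    z = est / se
    print(f"{name:18s} z = {z:5.2f}  significant at .05: {abs(z) > 1.96}")

None of the ratios exceed 1.96 in absolute value, which is consistent with the lack of significance reported below.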

None of the estimates reported in Table 13 were statistically significant. Thus, specific subgroups did not have statistically significantly different growth trajectories when compared with their reference group. This lack of significance may be due to the small sample size, as well as the relatively small number of athletes belonging to each sub-category explored in this analysis. Therefore, this analysis should be revisited once a larger sample size is obtained.

Summary of Findings

The findings discussed in this section reveal that, on average, growth occurred in both the sense of belonging and self-reliance measures. However, the rate of growth differed across individuals. The correlation between the baseline and growth estimates (see Tables 11 and 12) illustrates that individuals who began with lower levels of sense of belonging and self-reliance were more likely to have steeper growth slopes than those who began with higher levels, and some student athletes declined. Furthermore, DIF estimates revealed that a few items were significantly more difficult for males and for White student athletes. These items came from both the sense of belonging and self-reliance instruments.


The latent trait was also regressed on subgroup dummy variables. The latent regression analyses revealed that growth in sense of belonging and self-reliance was relatively similar between subgroups. Implications of these results are discussed in the following section.

Discussion

Findings demonstrate that the LG-IRM can produce growth estimates while considering specific contextual and cultural factors. In the ASC evaluation case, the LG-IRM estimated individual growth trajectories, differential item functioning among subgroups, as well as growth patterns among subgroups. This section discusses how this quantitative analysis was appropriate and useful for both formative and summative components of the ASC evaluation and also corresponded to a culturally competent evaluation approach.

Formative Use of Findings

Formatively, assessing individual growth patterns can assist ASC advisers and learning specialists with their case-by-case advising, with standard errors available to assess the significance of the growth. Burns et al. (2012) stated that student athlete satisfaction is dependent on individual characteristics; ASC advisers therefore need to know individual student athlete levels of self-reliance and sense of belonging in order to provide advising tailored to the individual student. Understanding a student athlete's growth pattern for sense of belonging and self-reliance can help ASC advisers understand where students are located on the self-reliance and sense of belonging constructs (see Chapter Two for construct maps). While this information could be qualitatively assessed through advising conversations, the instruments and resulting quantitative data created a systematic way to examine trends across a moderately large population of student athletes. Furthermore, these quantitative data provide a record that can be compared with an ASC staff member's qualitative assessment of a student athlete's sense of belonging and self-reliance levels. Such triangulation can provide a more comprehensive picture of student athlete well-being.

With regard to the ED 98 course, as findings show, the LG-IRM located some cases in which sense of belonging and self-reliance decreased throughout the semester. Comeaux et al. (2011) state that most student athletes enter as freshmen with a positive outlook; however, due to the rigor of student athletes' schedules and the realities of increased levels of competition in both athletics and academics, student athletes may feel disillusioned by the start of their second semester. While the ED 98 course is intended to prevent such disillusionment, quantitative data displaying the frequency of students who feel a weaker sense of belonging and self-reliance toward the end of the course can help guide curriculum improvements for the future. This quantitative information can help ED 98 facilitators adjust their curriculum for future semesters, as well as help advisers target interventions to specific subgroups. Additionally, the DIF analysis showed differences in item difficulty for specific subgroups. This information helps with future revisions to the instrument to ensure the data are as valid as possible, thus enabling advisers to provide accurate, data-informed guidance.

Summative Interpretation of Findings

For summative purposes, results indicated the extent to which student athletes developed their sense of belonging and self-reliance attributes, per the constructs detailed in Chapter Two.
While causation cannot be determined in this evaluation case, growth data from the ED 98 sample provide an indication of how student athletes at UC Berkeley developed over their first semester. For example, findings show that, on average, ED 98 student athletes improved their sense of belonging and self-reliance levels over the duration of the fall semester. Given that the literature suggests that student athletes may feel disillusioned by the second semester (Comeaux et al., 2011), this finding may indicate that the ED 98 course and additional ASC services are helping students to acquire a strong sense of belonging and self-reliance skills, which is an improvement over cases reported in the literature.

Examining growth trajectories of specific subgroups also contributes to summative purposes. Existing literature states that when entering as freshmen, revenue-generating athletes do not identify as strongly with the academic culture of their university as compared with non-revenue-generating athletes (Comeaux et al., 2011). This finding can be compared with the self-reliance and sense of belonging analyses from ED 98 students, which showed that UC Berkeley revenue-generating athletes' sense of belonging, on average, did not decline.

The quantitative LG-IRM findings helped to illustrate a more complete picture of student athletes that moved beyond national quantitative student athlete measures such as the APR. While this information could have been collected through qualitative means, the quantitative data were appropriate given both formative and summative purposes. Quantitative information can help the ASC advocate for specific resources or interventions to make sure that all student athletes receive the services they need for both academic and personal success at UC Berkeley.

Conclusion and Future Research

The LG-IRM represents a quantitative analytic technique that not only produced accurate growth estimates, but also incorporated contextual and cultural issues. Specifically, through using item response modeling, the evaluator obtained individual growth trajectories, DIF estimates, and growth by subgroups. Doing so ensured that the evaluator did not treat student athletes as a homogenous group. The evaluator responded to the fact that revenue-generating and non-revenue-generating student athletes (and other subgroups, such as by gender) have distinct backgrounds and experiences that may result in different growth trajectories. Furthermore, the LG-IRM aligned with the stakeholder needs for formative improvement. Given that the ASC aims to operate with cultural competence, the LG-IRM findings needed to correspond with the ASC staff's approach of treating student athletes as individuals and responding to their unique life circumstances.

The LG-IRM results represented a rigorous approach to analyzing longitudinal data. Given that this information is intended for use by stakeholders, findings should not only be culturally relevant, but also statistically valid. For the ASC, the quantitative analysis was appropriate due to the need to collect ongoing information from a relatively large population, as well as to provide findings relevant for both ASC staff and external stakeholders. An advantage of quantitative methods is the ability to analyze larger datasets (work that is often too time-consuming with qualitative methods) relatively quickly. While this often comes at the cost of in-depth case study knowledge, quantitative methods can include cultural factors to provide more nuanced findings.
Thus, when the evaluation context and questions call for a culturally responsive approach, involve large populations, include quantitatively-oriented stakeholders, and incorporate repeated measures, the LG-IRM may be appropriate. The ASC case therefore represents how quantitative analytic techniques can allow the evaluator to respond to and engage with culture in a way that improves the statistical and multicultural validity of the findings.


Limitations and Future Research

The sense of belonging and self-reliance instruments were created by the evaluator for the ASC to use beyond a one-time pre-posttest. Furthermore, the instruments were intended to be administered across all student athletes, not just those who self-selected into taking ED 98. As a result, the findings presented in this chapter represent a limited scope and sample of the UC Berkeley student athlete population. It is possible that the student sample reported in this chapter may be more or less likely to show growth in sense of belonging and self-reliance. Furthermore, due to the limited sample size, the evaluator was not able to analyze all subgroups useful for the ASC or other support services.

Despite these limitations, the LG-IRM and its extensions can be applied in future evaluation settings. Thus, as the ASC continues to implement the sense of belonging and self-reliance instruments across a broader sample of student athletes and more time points, the evaluator will have more data with which to implement further extensions that improve the LG-IRM's ability to address cultural factors. For example, with multiple time points, the evaluator can fit an LG-IRM that shows non-linear growth by including polynomials of time (sketched at the end of this section). As discussed, student athletes may go through several phases which would not equate to linear growth; for example, a student athlete may enter UC Berkeley with a strong sense of belonging, which may decrease due to injury or adjustment difficulties, and then increase again after receiving support from the ASC. The LG-IRM would be able to model this trajectory.

The ASC evaluation case had a clear purpose and intended use for quantitative data. Using the LG-IRM allowed the evaluator to apply a quantitative analytic technique that aligned with the culturally competent approach desired by the ASC. As discussed in the conceptual orientation, the method should follow the evaluation question. Thus, the LG-IRM may not be appropriate in cases in which the evaluation question and purpose call for an in-depth exploratory inquiry. However, the LG-IRM does allow for answering questions that extend beyond a traditional impact analysis. As demonstrated, the LG-IRM can show how growth may differ for specific subgroups and in what ways. Thus, while the LG-IRM may not be appropriate in every context, when the evaluation question calls for a longitudinal growth analysis, it can be applied in a way that is consistent with a culturally competent approach to evaluation.
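As a hedged sketch of the curvilinear extension mentioned above (the notation ζ_p for the person-specific quadratic component is introduced here for illustration only, and the exact parameterization would depend on the estimation software), a quadratic time term could be added to Equation Two:

\operatorname{logit}(P_{ipt}) = \theta_p + t\,\gamma_p + t^{2}\,\zeta_p - \delta_i, \qquad t = 0, 1, 2, \ldots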


Concluding Statement

This dissertation began with a systematic review of published evaluations that incorporated culturally competent approaches. Chapter One revealed that while quantitative methods were used, they seldom moved beyond basic descriptive statistics and frequencies. Furthermore, Chapter One illustrated that a significant amount of quantitative data was derived from either an evaluator-created or standardized instrument. Thus, Chapter Two and Chapter Three highlighted the case of the ASC evaluation to show how an evaluator can create, validate, and calibrate an instrument that aligns with a culturally competent approach, as well as how an evaluator can apply advanced quantitative techniques to pre-post data in a culturally competent manner. By doing so, this dissertation contributes to the body of knowledge regarding how evaluators can take a culturally competent approach in contexts in which the evaluation questions call for a quantitative analysis.

In closing, I would like to outline three additional points to help understand how to conceptualize the use of quantitative methods in a culturally competent evaluation. First, this dissertation has not yet discussed the role of evaluator reflection, a key component of culturally competent evaluations. Reflection helps to ensure that the chosen quantitative instruments and analytic techniques appropriately respond to the local culture. Second, this reflection also prompts a further investigation into the epistemological assumptions underlying the choice to use quantitative methods under a culturally competent framework. And finally, to fully judge the appropriateness of quantitative methods in the evaluation case, this dissertation warrants a more focused explanation of the long-term utilization of the sense of belonging and self-reliance instruments and growth findings.

Evaluator Reflexivity

AEA's Public Statement on Cultural Competence in Evaluation states that in order to plan and implement a culturally competent evaluation, an evaluator must "engage in ongoing critical reflection on assumptions about what constitutes meaningful, reliable, and valid data and how these data are derived" (American Evaluation Association, 2011). Furthermore, when preparing for an evaluation, the CRE framework promotes evaluator self-reflection as a way "to become acutely aware of their own cultural values, assumptions, prejudices, and stereotypes and how these factors may affect their evaluation practice within the particular setting" (Frierson, Hood, Hughes, & Thomas, 2010, p. 80). Thus, evaluator self-reflection should not only inform methodological choices, but also help the evaluator discover his or her own positionality. The two foci complement one another, allowing the evaluator to uncover whether methodological choices are inappropriate due to a lack of awareness or bias toward the context and population of the evaluation.

Relating this concept back to the evaluation case presented in this dissertation, I (the evaluator) began my self-reflection by focusing on my own experiences and perceptions of the student athlete population. This self-reflection began while taking Professor Derek Van Rheenen's course entitled "Theoretical Foundations for the Cultural Study of Sport in Education." The course required weekly written summaries of selected readings, including a personal reflection on how the readings pertained to one's own life history. Several weeks of readings dealt specifically with student athletes.
Thus, I had the opportunity to reflect on my own experiences with sport and academics and discuss these issues with others to gain a better understanding of both my own experiences and the experiences of others. Furthermore, the course final assignment entailed a written personal athletic-academic autobiography. The final assignment, coupled with the weekly reflections and group discussions, allowed me to understand how my own feelings of sense of belonging and self-reliance were impacted by my participation in sports. Moreover, through listening to my peers, I gained an understanding of how my experiences as a biracial female were similar to and different from those of other genders, races/ethnicities, and sports.

This self-reflection provided me with the foundational understanding that student athlete success was measured by more than just retention, graduation, and GPA. I understood that sense of belonging and self-reliance were characteristics that may be understood differently for student athletes than for the general undergraduate population. Additionally, I understood the external stakeholder demands and pressures imposed on the student athlete. Thus, when the ASC explained their evaluation needs, I could reflect in a way that allowed me to consider if and how quantitative data could meaningfully capture the complexity and nuance that comprises the student athlete population. Once the evaluation began, I consistently reflected on the appropriateness of the instruments and methods, both individually and with stakeholders. This reflection allowed me to create an instrument and subsequent analysis that were appropriate and meaningful for the student athlete population, ASC staff, and stakeholders.

Epistemological Questions

On the surface, the choice to use quantitative methods to explore sense of belonging and self-reliance appears to follow an objectivist perspective. Specifically, each measure related back to a construct with clear levels. By mapping each measure onto its corresponding construct, one could say this assumes objective levels of sense of belonging and self-reliance. An objectivist stance may run the risk of reinforcing existing power structures if those in a position of power are seen to have what is considered the 'objective knowledge' and are the sole creators of the construct map. A central objective of a culturally competent evaluation is to combat detrimental inequalities through recognizing and incorporating cultural components into the evaluation. Thus, it may seem at the outset that using a quantitative method unfamiliar to several of the stakeholders would run against the epistemological assumptions underlying culturally responsive evaluation.

However, when reading through the evaluation process described in this dissertation, it is clear that the construct and the subsequent instruments were developed to directly incorporate the student athlete experience and perspective. Several rounds of feedback with student athletes and ASC staff were incorporated into both the construct and the instrument to ensure that the contents accurately reflected the student experience. In this way, the constructs for sense of belonging and self-reliance could be seen as specific to the student athlete community, derived from their experiences and not from a generalized source. Furthermore, the LG-IRM analysis allowed for flexibility in how growth on these constructs occurred, thus not assuming that every student athlete grows on a linear upward trajectory. In this dissertation, quantitative methods allowed for individual and subgroup interpretations, which aligned with a culturally competent approach to evaluation.
Utilization Next Steps

The ASC case study discussed in Chapter Two and Chapter Three primarily focused on how the sense of belonging and self-reliance measures were created, validated, calibrated, and analyzed. However, this content lacked an overarching discussion of how the measures can be incorporated in the ASC's evaluative activities over time. The measures were primarily piloted and analyzed with the incoming group of freshman student athletes who elected to enroll in ED 98. However, in order to maximize the utilization potential of these measures, the ASC may want to expand survey administration beyond incoming freshmen. Specifically, the ASC may select specific time points throughout a student athlete's enrollment at UC Berkeley to administer the measures. For example, the ASC could administer the measures during the freshman fall semester, sophomore spring semester, and senior fall semester. These data could then provide a longitudinal assessment of student athletes' sense of belonging and self-reliance that would not only help advisors, learning specialists, and ASC staff improve service delivery, but also provide insight into broader trends associated with the development of student athletes' noncognitive skills. The longitudinal data could be correlated with other student athlete variables such as GPA, playing time, and demographic characteristics to better understand the needs of student athletes during the different phases of their undergraduate careers.

Summatively speaking, while this dissertation was not able to investigate the causal impact of the ASC on improving a student athlete's sense of belonging and self-reliance, the measures could be used for this purpose. Specifically, these measures could be administered at a university of similar size and student population. Using matching methods such as propensity score matching, a causal analysis could be conducted to determine whether the ASC improved student athletes' sense of belonging and self-reliance in comparison to similar NCAA programs.

I intend for the sense of belonging and self-reliance instruments to provide ongoing use to the ASC. They can be integrated into the ASC's evaluative efforts through consistent administration and analysis, as well as provide data sets for both ASC staff and UC Berkeley scholars to learn from. This ongoing use of the sense of belonging and self-reliance instruments reflects a culturally competent approach and contributes toward the overarching goal of improving the undergraduate experience and future success of UC Berkeley student athletes.


References

Aktan, G. (1999). A cultural consistency evaluation of a substance abuse prevention program with inner city African-American families. The Journal of Primary Prevention, 19, 227-239.
Al Hudib, H., Cousins, J. B., Oza, J., Lakshminarayana, U., & Bhat, V. D. (2016). A Cross-cultural Evaluation Conversation in India: Benefits, Challenges, and Lessons Learned. Canadian Journal of Program Evaluation, 30(3), 329-343.
Alkon, A., Tschann, J., Ruane, S., Wolff, M., & Hittner, A. (2001). A violence-prevention and evaluation project with ethnically diverse populations. American Journal of Preventive Medicine, 20(1 Suppl.), 38-44.
American Evaluation Association. (2003). American evaluation association response to US department of education notice of proposed priority. Federal Register RIN 1890-ZA00. Retrieved March 13, 2017.
American Evaluation Association. (2011). Public Statement on Cultural Competence in Evaluation. Fairhaven, MA. Retrieved from www.eval.org
American Psychological Association. (2014). The Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Andersen, E. (1985). Estimating latent correlations between repeated testings. Psychometrika, 3-16.
Anderson-Draper, M. (2006). Understanding cultural competence through the evaluation of "breaking the silence: a project to generate critical knowledge about family violence within immigrant communities". The Canadian Journal of Program Evaluation, 21, 59-79.
Arneson, A. (2015). crasch: Construct Mapping with Rasch (R package version 0000.0000.9000). Retrieved from github.com/amyarneson/crasch
Arseneault, C., Plourde, C., & Alain, M. (2016). Evaluating a Prison-Based Intervention Program: Approaches and Challenges. Canadian Journal of Program Evaluation, 31(1), 61-81.
Ashley, G., & Reiter-Palmon, R. (2012). Self-Awareness and the Evolution of Leaders: The Need for a Better Measure of Self-Awareness. Psychology Faculty Publications (Paper 7).
Baizerman, M., & Compton, D. (1992). From respondent and informant to consultant and participant: The evolution of a state agency policy evaluation. New Directions for Evaluation, 53, 5-15.
Bamanyaki, P. A., & Holvoet, N. (2016). Integrating theory-based evaluation and process tracing in the evaluation of civil society gender budget initiatives. Evaluation, 22(1), 72-90.
Barnes, H. (2000). Collaboration in community action: A successful partnership between indigenous communities and researchers. Health Promotion International, 15, 17-25.
Berends, L., & Roberts, B. (2003). Evaluation standards and their application to indigenous programs in Victoria, Australia. Evaluation Journal of Australasia, 3, 54-59.

Bevan-Brown, J. (2001). Evaluating special education services for learners from ethnically diverse groups: Getting it right. The Journal of the Association for Persons with Severe Handicaps, 26, 138-147.
Blackman, T., Wistow, J., & Byrne, D. (2013). Using Qualitative Comparative Analysis to understand complex policy problems. Evaluation, 19(2), 126-140.
Bledsoe, K. L. (2014). Truth, beauty, and justice: Conceptualizing House's framework for evaluation in community-based settings. New Directions for Evaluation(142), 71-82.
Bledsoe, K., & Donaldson, S. (2015). Culturally Responsive Theory-Driven Evaluation. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 3-28). Charlotte: Information Age Publishing, Inc.
Botcheva, L., Shih, J., & Huffman, L. (2009). Emphasizing Cultural Competence in Evaluation: A Process-Oriented Approach. American Journal of Evaluation, 30(2), 176-188.
Bowen, M., & Tillman, A. (2015). Developing Culturally Responsive Surveys: Lessons in Development, Implementation, and Analysis From Brazil's African Descent Communities. American Journal of Evaluation, 36(1), 25-41.
Bowman, N., Francis, D. C., & Tyndall, M. (2015). Culturally Responsive Indigenous Evaluation: A Practical Approach for Evaluating Indigenous Projects in Tribal Reservation Contexts. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 335-360). Charlotte: Information Age Publishing.
Brandão, D. B., Silva, R. R., & Codas, R. (2012). Youth participation in evaluation: The Pró-Menino Program in Brazil. New Directions for Evaluation(134), 49-59.
Briggs, D., & Wilson, M. (2003). An Introduction to Multidimensional Measurement Using Rasch Models. Journal of Applied Measurement, 87-100.
Burns, G., Jasinski, D., Dunn, S., & Fletcher, D. (2012). Athlete identity and athlete satisfaction: The nonconformity of exclusivity. Personality and Individual Differences, 280-284.
Butty, J., Reid, M., & LaPoint, V. (2004). A culturally responsive evaluation approach applied to the talent development school-to-career intervention program. New Directions for Evaluation, 101, 37-47.
Caldwell, J., Davis, J., Du Bois, B., Echo-Hawk, H., Erickson, J., Goins, R., . . . Stone, J. (2005). Culturally Competent Research with American Indians and Alaska Natives: Findings and Recommendations of the First Symposium of the Work Group on American Indian Research and Program Evaluation Methodology. The Journal of the National Centre, 12, 1-21.
Cavino, H. (2013). Across the colonial divide: Conversations about evaluation in Indigenous contexts. American Journal of Evaluation, 34(3), 339-355.
Cervantes, R., & Pena, C. (1998). Evaluating Hispanic/Latino programs: Ensuring cultural competence. Alcoholism Treatment Quarterly, 16, 109-131.
Chilisa, B., Major, T. E., Gaotlhobogwe, M., & Mokgolodi, H. (2016). Decolonizing and Indigenizing Evaluation Practice in Africa: Toward African Relational Evaluation Approaches. Canadian Journal of Program Evaluation, 30(3), 313-328.

Chouinard, J., & Cousins, B. (2009). A review and synthesis of current research on cross-cultural evaluation. American Journal of Evaluation, 30(4), 457-494.
Chow, J., & Austin, M. (2008). The culturally responsive social service agency: The application of an evolving definition to a case study. Administration in Social Work, 32(4), 39-64.
Christie, C., & Barela, E. (2005). The Delphi technique as a method for increasing inclusion in the evaluation process. The Canadian Journal of Program Evaluation, 20, 105-122.
Clayson, Z. C. (2002). Unequal power-changing landscapes: Negotiations between evaluation stakeholders in Latino communities. American Journal of Evaluation, 23, 33-44.
Cohn, N. (2007). A Visual Lexicon. The Public Journal of Semiotics, 1(1), 35-66.
Cohn, N. (2014). You're a Good Structure, Charlie Brown: The Distribution of Narrative Categories in Comic Strips. Cognitive Science(38), 1317-1359.
Comeaux, E. (2015). Making the Connection: Data-Informed Practices in Academic Support Centers for College Athletes. Charlotte: Information Age Publishing.
Comeaux, E., Taustine, L., & Harrison, C. (2011). Purposeful engagement of first-year Division I student-athletes. Journal of the First-Year Experience & Students in Transition, 35-52.
Conner, R. (2004). Developing and implementing culturally competent evaluation: A discussion of multicultural validity in two HIV prevention programs for Latinos. New Directions for Evaluation, 102, 51-65.
Cooper, C., & Christie, C. (2005). Evaluating parent empowerment: A look at the potential of social justice evaluation in education. Teachers College Record, 107, 2248-2274.
Copeland-Carson, J. (2005). Applying theory and method in evaluation anthropology: the example of the South Bronx's comprehensive community revitalization project. NAPA Bulletin, 24, 89-106.
Coppens, N., Page, R., & Chan Thou, T. (2006). Reflections on the evaluation of a Cambodian youth dance program. American Journal of Community Psychology, 37, 321-331.
Cornachione, E. B., Trombetta, M. R., & Nova, S. P. (2010). Evaluation use and involvement of internal stakeholders: The case of a new non-degree online program in Brazil. Studies in Educational Evaluation, 36(1), 69-81.
Cram, F. (2016). Lessons on Decolonizing Evaluation From Kaupapa Māori Evaluation. Canadian Journal of Program Evaluation, 30(3), 296-312.
Cullen, P., Clapham, K., Byrne, J., Hunter, K., Senserrick, T., Keay, L., & Ivers, R. (2016). The importance of context in logic model construction for a multi-site community-based Aboriginal driver licensing program. Evaluation and program planning(57), 8-15.
Davis, J. (1992). Reconsidering the use of race as an explanatory variable in program evaluation. New Directions in Evaluation, 55-68.
DeBacker Roedel, T., Schraw, G., & Plake, B. (1994). Validation of a Measure of Learning and Performance Goal Orientations. Educational and Psychological Measurement, 54(4), 1013-1021.

DeShong, R. (2014). Student Athlete Academic Performance Summary - Fall 2014. Berkeley: University of California Berkeley.
Draanen, J. (2016). Introducing Reflexivity to Evaluation Practice. American Journal of Evaluation, 1-16. doi:10.1177/1098214016668401
Durá, L., Felt, L. J., & Singhal, A. (2014). What counts? For whom?: Cultural beacons and unexpected areas of programmatic impact. Evaluation and program planning(44), 98-109.
Easton, P. (2012). Identifying the evaluative impulse in local culture: Insights from West African proverbs. American Journal of Evaluation, 33(4), 515-531.
Embretson, S. (1991). A multidimensional latent trait model for measuring learning change. Psychometrika, 494-515.
Fearon, D., Barnard-Brak, L., Robinson, E., & Harris, F. (2011). Sense of belonging and burnout among first-year student athletes. Journal for the Study of Sports and Athletes in Education, 5(2), 139-156.
Fetterman, D. (2005). Empowerment evaluation: From the digital divide to academic distress. In D. Fetterman, & A. Wandersman, Empowerment evaluation principles in practice (pp. 92-122). New York: The Guilford Press.
Fetterman, D., Kaftarian, S., & Wandersman, A. (1996). Empowerment Evaluation: Knowledge and tools for self-assessment and accountability. Thousand Oaks: Sage.
Fisher, P., & Ball, T. (2002). The Indian family wellness project: An application of the tribal participatory research model. Prevention Science, 3, 235-240.
Fletcher, G., & Dyson, S. (2013). Evaluation as work in progress: Stories of shared learning and development. Evaluation, 19(4), 419-430.
Foreman-Peck, L., & Travers, K. (2015). Developing expertise in managing dialogue in the ‘third space’: Lessons from a responsive participatory evaluation. Evaluation, 21(3), 344-358.
Fournier, D. (2005). Evaluation. In S. Mathison, Encyclopedia of Evaluation (pp. 139-140). Thousand Oaks: Sage Publications, Inc.
Freeman, M., & Hall, J. (2012). The complexity of practice: Participant observation and values engagement in a responsive evaluation of a professional development school partnership. American Journal of Evaluation, 33(4), 483-495.
Freeman, M., Preissle, J., & Havick, S. (2010). Moral knowledge and responsibilities in evaluation implementation: When critical theory and responsive evaluation collide. New Directions for Evaluation(127), 45-57.
Frierson, H., Hood, S., Hughes, G., & Thomas, V. (2010). A Guide to Conducting Culturally Responsive Evaluations. In J. Frechtling, The 2010 User-Friendly Handbook for Project Evaluation (pp. 75-96). National Science Foundation.
Garaway, G. (1996). The case-study model: An organizational strategy for cross-cultural evaluation. Evaluation, 2, 201-211.

Green, K. (1996). Applications of the Rasch Model to Evaluation of Survey Data Quality. New Directions for Evaluation, 81-92.
Greene, J. (2006). Evaluation, democracy, and social change. In I. Shaw, J. Greene, & M. Mark, The Sage Handbook of Evaluation (pp. 118-140). London: Sage Publications Ltd.
Greene, J., & Henry, G. (2005). Qualitative-Quantitative Debate in Evaluation. In S. Mathison, Encyclopedia of Evaluation (pp. 345-350). Thousand Oaks: Sage Publications, Inc.
Grimes, C., Dankovchik, J., Cahn, M., & Warren-Mears, V. (2016). American Indian and Alaska Native Cancer Patients’ Perceptions of a Culturally Specific Patient Navigator Program. The Journal of Primary Prevention, 38(1), 1-15.
Gruskin, S., Waller, E., Safreed-Harmon, K., Ezer, T., Cohen, J., Gathumbi, A., & Kameri-Mbote, P. (2015). Integrating human rights in program evaluation: Lessons from law and health programs in Kenya. New Directions for Evaluation(146), 57-69.
Guttman, L. (1944). A basis for scaling qualitative data. American sociological review, 9(2), 139-150.
Hall, J., & Freeman, M. (2014). Shadowing in Formative Evaluation: Making Capacity Building Visible in a Professional Development School. American Journal of Evaluation, 35(4), 562-578.
Hall, J., Freeman, M., & Roulston, K. (2014). Right timing in formative program evaluation. Evaluation and program planning, 45, 151-156.
Hanberger, A. (2010). Multicultural awareness in evaluation: Dilemmas and challenges. Evaluation, 16(2), 177-191.
Hannay, J., Dudley, R., Milan, S., & Leibovitz, P. (2013). Combining Photovoice and Focus Groups: Engaging Latina Teens in Community Assessment. American Journal of Preventive Medicine, 44(3), 215–224.
Harklau, L., & Norwood, R. (2005). Negotiating research roles in ethnographic program evaluation: A post-modern lens. Anthropology and Education Quarterly, 36, 278-288.
Hesse-Biber, S. (2013). Thinking outside the randomized controlled trials experimental box: Strategies for enhancing credibility and social justice. New Directions for Evaluation(138), 49-60.
Hilton, L., & Libretto, S. (2016). Evaluation Capacity Building in the Context of Military Psychological Health: Utilizing Preskill and Boyle’s Multidisciplinary Model. American Journal of Evaluation, 1-12.
Hong, F., Wu, P., & Xiong, H. (2005). Beijing Ambitions: An Analysis of the Chinese Elite Sports System and its Olympic Strategy for the 2008 Olympic Games. The International Journal of the History of Sport, 510-529.
Hood, S. (2001). Nobody Knows my Name: In Praise of African American Evaluators who were Responsive. In J. Greene, & T. Abma, Responsive Evaluation: Roots and Wings. Jossey-Bass.
Hood, S. (2009). Evaluation for and by Navajos: A narrative case of the irrelevance of globalization. In K. Ryan, & J. Cousins, The SAGE International Handbook of Educational Evaluation. Thousand Oaks: SAGE.

Hopson, R. (2009). Reclaiming knowledge at the margins: Culturally responsive evaluation in the current evaluation moment. In Sage International Handbook of Educational Evaluation (pp. 429-446).
Hopson, R. K. (2014). Justice signposts in evaluation theory, practice, and policy. New Directions for Evaluation(142), 83-94.
Hosick, M. B. (2014, May 14). Student athletes continue to achieve academically. Retrieved from NCAA: http://www.ncaa.org/about/resources/media-center/news/student-athletes-continue-achieve-academically
Hubberstey, C., Rutman, D., Hume, S., Van Bibber, M., & Poole, N. (2015). Toward an Evaluation Framework for Community-Based FASD Prevention Programs. Canadian Journal of Program Evaluation, 30(1), 78-89.
Janzen, R., Nguyen, N., Stobbe, A., & Araujo, L. (2015). Assessing the Value of Inductive and Deductive Outcome Measures in Community-Based Programs: Lessons from the City Kidz Evaluation. Canadian Journal of Program Evaluation, 30(1), 41-63.
Jay, M., Eatmon, D., & Frierson, H. (2005). Cultural reflections stemming from the evaluation of an undergraduate research program. In S. Hood, R. Hopson, & H. Frierson, The role of culture and cultural context: A mandate for inclusion, the discovery of truth, and understanding in evaluative theory and practice (pp. 201-216). Greenwich: Information Age Publishing.
Johnson, E. (2005). The use of contextually relevant evaluation practices with programs designed to increase participation of minorities in science, technology, engineering, and mathematics (STEM) education. In S. Hood, R. Hopson, & H. Frierson, The role of culture and cultural context: A mandate for inclusion, the discovery of truth, and understanding in evaluative theory and practice (pp. 217-235). Greenwich: Information Age Publishing.
Johnson-Turbes, A., Schlueter, D., Moore, A., Buchanan, N., & Fairley, T. (2015). Evaluation of a Web-Based Program for African American Young Breast Cancer Survivors. American Journal of Preventive Medicine, 49(6), 543–549.
Johnston, A. L. (2013). To Case Study or Not to Case Study: Our Experience with the Canadian Government's Evaluation Practices and the Use of Case Studies as an Evaluation Methodology for First Nations Programs. The Canadian Journal of Program Evaluation, 28(2), 109-117.
King, J., Nielson, J., & Colby, J. (2004). Lessons for culturally competent evaluation from the study of a multicultural initiative. New Directions for Evaluation, 102, 67-79.
LaFrance, J. (2004). Culturally competent evaluation in Indian country. New Directions for Evaluation, 102, 39-50.
LaFrance, J., Nelson-Barber, S., Rechebei, E., & Gordon, J. (2015). Partnering with Pacific Communities to Ground Evaluation in Local Culture and Context: Promises and Challenges. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 361-378). Charlotte: Information Age Publishing, Inc.
LaFrance, J., Nichols, R., & Kirkhart, K. E. (2012). Culture writes the script: On the centrality of context in indigenous evaluation. New Directions for Evaluation(135), 59-74.

Laperriere, H. (2006). Taking evaluation contexts seriously: A cross-cultural evaluation in extreme unpredictability. Journal of Multidisciplinary Evaluation, 3, 41-57.
Lapidot-Lefler, N., Friedman, V., Arieli, D., Haj, N., Sykes, I., & Kais, N. (2015). Social space and field as constructs for evaluating social inclusion. New Directions for Evaluation(146), 33-43.
LaPoint, V., & Jackson, H. (2004). Evaluating the co-construction of the family, school, and community partnership program in a low-income urban high school. New Directions for Evaluation, 101, 25-36.
Le Menestrel, S., Walahoski, J., & Mielke, M. (2014). A Partnership Model for Evaluation: Considering an Alternate Approach to the Internal–External Evaluation Debate. American Journal of Evaluation, 35, 61-72.
Letichevsky, A., & Penna Firme, T. (2012). Evaluating with at-risk communities: Learning from a social program in a Brazilian slum. New Directions for Evaluation(134), 61-76.
Letiecq, B., & Bailey, S. (2004). Evaluating from the outside: Conducting cross-cultural evaluation research on an American Indian reservation. Evaluation Review, 342-357.
Luo, L., & Liu, L. (2014). Reflections on conducting evaluations for rural development interventions in China. Evaluation and program planning(47), 1-8.
Lustig, R., Ben Baruch-Koskas, S., Makhani-Belkin, T., & Hirsch, T. (2015). Evaluation in the Branco Weiss Institute: From social vision to educational practice. New Directions for Evaluation(146), 95-105.
MacDonald, B. (1977). The portrayal of persons as evaluation data. In N. Norris, Safari: Theory in practice (pp. 50-67). Norwich: Centre for Applied Research in Education, University of East Anglia.
Maciak, B., Guzman, R., Santiago, A., Villalobos, G., & Israel, B. (1999). Establishing LA VIDA: A community-based partnership to prevent intimate violence against Latina women. Health Education & Behavior, 26, 821-840.
Marin, G., & Marin, B. (1991). Research with Hispanic Populations. Newbury Park: Sage.
McKenzie, B. (1997). Developing First Nations child welfare standards: Using evaluation research within a participatory framework. The Canadian Journal of Program Evaluation, 12, 133-148.
Mertens, D. (1998). Research Methods in Education and Psychology. Thousand Oaks: Sage Publications.
Mertens, D. (2008). Transformative Research and Evaluation. New York: Guilford Press.
Mertens, D. (2013). What does a transformative lens bring to credible evidence in mixed methods evaluations? New Directions for Evaluation(138), 27-35.
Mertens, D., & Hopson, R. (2006). Advancing evaluation of STEM efforts through attention to diversity and culture. New Directions for Evaluation, 109, 35-51.
Mertens, D., & Zimmerman, H. (2015). A Transformative Framework for Culturally Responsive Evaluation. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 275-288). Charlotte: Information Age Publishing, Inc.
Mitakidou, S., Tressou, E., & Karagianni, P. (2015). Implementing Culturally Sensitive Assessment Tools for the Inclusion of Roma Children in Mainstream Schools. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 233-250). Charlotte: Information Age Publishing, Inc.
Muthen, L. K., & Muthen, B. O. (1998-2012). Mplus User's Guide. Seventh Edition. Los Angeles, CA: Muthen & Muthen.
Nagai, Y. (2001). Developing assessment and evaluation strategies for vernacular elementary school classrooms: A collaborative study in Papua New Guinea. Anthropology & Education Quarterly, 32, 80-103.
Nastasi, B., & Hitchcock, J. (2009). Challenges of evaluating multilevel interventions. American Journal of Community Psychology, 43(3-4), 360-376.
Nelson-Barber, S., LaFrance, J., Trumbull, E., & Aburto, S. (2005). Promoting culturally reliable and valid evaluation practice. In S. Hood, R. Hopson, & H. Frierson, The role of culture and cultural context: A mandate for inclusion, the discovery of truth, and understanding in evaluative theory and practice (pp. 61-85). Greenwich: Information Age Publishing.
Nevarez, C., Lafleur, M., Schwarte, L., Rodin, B., de Silva, P., & Samuels, S. (2013). Salud Tiene Sabor: A Model for Healthier Restaurants in a Latino Community. American Journal of Preventive Medicine, 44(3), 186-192.
Newton, X., & Llosa, L. (2010). Toward a more nuanced approach to program effectiveness assessment: Hierarchical linear models in K–12 program evaluation. American Journal of Evaluation, 31(2), 162-179.
Noblit, G., & Jay, M. (2010). Against the majoritarian story of school reform: The Comer Schools Evaluation as a critical race counternarrative. New Directions for Evaluation, 127, 71-82.
Novins, D., King, M., & Son Stone, L. (2004). Developing a plan for measuring outcomes in model systems of care for American Indian and Alaska Native children and youth. American Indian and Alaska Native Mental Health Research, The Journal of the National Centre, 11, 88-98.
O'Hara, J., McNamara, G., & Harrison, K. (2015). Culture Changes, Irish Evaluation and Assessment Traditions Stay the Same? Exploring Peer- and Self-Assessment as a Means of Empowering Ethnic Minority Students. In S. Hood, R. Hopson, & H. Frierson, Continuing the Journey to Reposition Culture and Cultural Context in Evaluation Theory and Practice (pp. 205-232). Charlotte: Information Age Publishing, Inc.
Patton, M. Q. (1994). Developmental Evaluation. Evaluation Practice, 311-319.
Patton, M. Q. (2011). Essentials of Utilization-Focused Evaluation. Thousand Oaks: Sage.
Peter, L., Christie, E., Cochrane, M., Dunn, D., Elk, L., Fields, E., . . . Yamamoto, A. (2003). Assessing the impact of total immersion on Cherokee language revitalization: A culturally responsive, participatory approach. In J. Reyhner, O. Trujillo, R. Carrasco, & L. Lockard, Nurturing native languages. Flagstaff: Northern Arizona University.

Pon, G. (2009). Cultural competency as new racism: An ontology of forgetting. Journal of Progressive Human Services, 20(1), 59-71.
Prilleltensky, I., Nelson, G., & Valdes, L. (2000). A value-based approach to smoking prevention with immigrants from Latin America: Program evaluation. Journal of Ethnic & Cultural Diversity in Social Work, 9, 97-117.
Raedeke, T., Lunney, K., & Venables, K. (2002). Understanding athlete burnout: coach's perspectives. Journal of Sport Behavior, 25, 181-206.
Randall, J., & Engelhard, G. (2000). Using Guttman's mapping sentences and Many Facet Rasch Measurement Theory to develop an instrument that examines the grading philosophies of teachers. Journal of Applied Measurement, 11(2), 122-141.
Richmond, L., Peterson, D., & Betts, S. (2008). The evolution of an evaluation: A case study using the tribal participatory research model. Health Promotion Practice, 9, 368-377.
Robertson, P., Jorgensen, M., & Garrow, C. (2004). Indigenizing evaluation research. American Indian Quarterly, 28, 499-526.
Running Wolf, P., Soler, R., Manteuffel, B., Sondheimer, D., Santiago, R., & Erickson, J. (2002). Cultural competence approaches to evaluation in tribal communities. Symposium on Research on Evaluation Method: Lifespan Issues related to American Indians/Alaska Natives with Disabilities. Washington, DC: AIRPEM Working Group on American Indian Research and Program Evaluation Methodology.
Russ-Eft, D., & Preskill, H. (2009). Evaluation in organizations: A systematic approach to enhancing learning, performance, and change. New York: Basic Books.
Ryan, K., Chandler, M., & Samuels, M. (2007). What should school-based evaluation look like? Studies in Educational Evaluation, 33, 197-212.
Samuels, M., & Ryan, K. (2010). Grounding evaluations in culture. American Journal of Evaluation, 32(2), 183-198.
Schwandt, T. (2005). Values. In S. Mathison, Encyclopedia of Evaluation (pp. 443-444). Thousand Oaks: Sage Publications, Inc.
Scott-Jones. (1993). Ethical issues in reporting and referring in research with minority and low-income populations. Society for Research in Child Development. New Orleans.
Senese, G. (2005). The PENAL project: program evaluation and Native American liability. In S. Hood, R. Hopson, & H. Frierson, The role of culture and cultural context: A mandate for inclusion, the discovery of truth, and understanding in evaluative theory and practice (pp. 129-147). Greenwich: Information Age Publishing.
SenGupta, S., Hopson, R., & Thompson-Robinson, M. (2004). Cultural competence in evaluation: An overview. New Directions for Evaluation, 5-19.
Shoultz, J., Magnussen, L., Kreidman, N., Oneha, M., Iannce-Spencer, C., & Hayashi-Simpliciano, R. (2015). Engaging Native Hawaiians and Pilipinos in creating supportive and safe violence-free communities for women through a piloted “talkstory” intervention: Implications for program development. Evaluation and program planning, 51, 78-84.

Simons, H., Bosworth, C., Fujita, S., & Jensen, M. (2007). The Student Athlete Stigma in Higher Education. College Student Journal, 41(2), 251-273.
Slaughter, H. (1991). The participation of cultural informants on bilingual and cross-cultural evaluation teams. Evaluation Practice, 12, 149-157.
Small, S., Tiwari, G., & Huser, M. (2006). The cultural education of academic evaluators: Lessons from a university-Hmong community partnership. American Journal of Community Psychology, 37, 357-364.
Steinberg, S., & Zamir, J. (2015). A different light on normalization: Critical theory and responsive evaluation studying social justice in participation practices. New Directions for Evaluation, 146, 119-127.
Stockdill, S., Duhon-Sells, R., Olson, R., & Patton, M. (1992). Voices in the design and evaluation of a multicultural education program: A developmental approach. New Directions for Evaluation, 53, 17-33.
Stokes, H., Chaplin, S., Dessouky, S., Aklilu, L., & Hopson, R. (2011). Addressing Social Injustices, Displacement, and Minority Rights Through Cases of Culturally Responsive Evaluation. Diaspora, Indigenous, and Minority Education, 5(3), 167-177.
Tashakkori, A., & Teddlie, C. (1998). Mixed Methodology: Combining Qualitative and Quantitative Approaches (Applied Social Research Methods Series ed., Vol. 46). Thousand Oaks: Sage Publications.
Thomas, R., Teel, T., & Bruyere, B. (2014). Seeking excellence for the land of paradise: integrating cultural information into an environmental education program in a rural Hawai’ian community. Studies in Educational Evaluation, 41, 58-67.
Thomas, V. (2004). Building a contextually responsive evaluation framework: Lessons from working with urban school interventions. New Directions for Evaluation, 101, 3-23.
Thomas, W., & Bellefeuille, G. (2006). An evidence-based formative evaluation of a cross-cultural Aboriginal mental health program in Canada. Australian e-Journal for the Advancement of Mental Health, 5, 1-14.
Thurman, P., Allen, J., & Deters, P. (2004). The circles of care evaluation: Doing participatory evaluation with American Indian and Alaska Native communities. American Indian and Alaska Native Mental Health Research. The Journal of the National Centre, 11, 139-154.
Uhl, G., Robinson, B., Westover, B., Bockting, W., & Cherry-Porter, T. (2004). Involving the community in HIV prevention program evaluation. Health Promotion Practice, 5, 289-296.
Van Rheenen, D. (2011). Exploitation in the American academy: College athletes and self-perceptions of value. The International Journal of Sport & Society, 2(4), 11-26.
Van Rheenen, D. (2013). Exploitation in College Sports: Race, Revenue and Educational Reward. International Review for the Sociology of Sport, 550-557.
Van Rheenen, D. (2015). The Dilemma of Academic Support for College Athletes. In E. Comeaux, Introduction to Intercollegiate Athletics (pp. 355-365). Baltimore: Johns Hopkins University Press.

Van Rheenen, D., McNeil, N., Minjares, V., & Atwood, J. (2012). Becoming REGS: The Impact of institutional sport elimination on Division 1 student athletes. International Journal of Sport & Society, 3(2), 91-105.
Voyle, J., & Simmons, D. (1999). Community development through partnership: Promoting health in an urban indigenous community in New Zealand. Social Science & Medicine, 49, 1035-1050.
White, C., & Hermes, M. (2005). Learning to play scholarly jazz: A culturally responsive evaluation of the Hopi teachers for Hopi schools project. In S. Hood, R. Hopson, & H. Frierson, The role of culture and cultural context: A mandate for inclusion, the discovery of truth, and understanding in evaluative theory and practice (pp. 105-128). Greenwich: Information Age Publishing.
Willging, C., Helitzer, D., & Thompson, J. (2006). 'Sharing wisdom': Lessons learned during the development of a diabetes prevention program for urban American Indian women. Evaluation and Program Planning, 29, 130-140.
Wilson, M. (2004). Constructing Measures: An Item Response Modeling Approach. New York: Psychology Press.
Wilson, M., & De Boeck, P. (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer-Verlag.
Wilson, M., & Sloane, K. (2000). From principles to practice: an embedded assessment system. Applied measurement in education, 13(2), 181-208.
Wilson, M., Zheng, X. Z., & McGuire, L. (2012). Formulating latent growth using an explanatory item response model approach. Journal of Applied Measurement, 13(1), 1-22.
Woelders, S., & Abma, T. (2015). A different light on normalization: Critical theory and responsive evaluation studying social justice in participation practices. New Directions for Evaluation, 146, 9-18.
Wright, B., & Masters, G. (1982). Rating Scale Analysis. Chicago: Mesa Press.
Wu, M., Adams, R., & Wilson, M. (2012). ACER ConQuest: Generalized Item Response Modeling Software [Computer software and manual]. Melbourne, Australia: Australian Council for Educational Research.
Zamir, J., & Abu Jaber, S. (2015). Promoting social justice through a new teacher training program for the Bedouin population in the Negev: An evaluation case study. New Directions for Evaluation, 146, 71-82.
Ziabakhsh, S. (2015). Reflexivity in evaluating an Aboriginal women heart health promotion program. Canadian Journal of Program Evaluation, 30(1), 23-40.
Zoabi, K., & Awad, Y. (2015). The role of evaluation in affirmative action-type programs. New Directions for Evaluation, 146, 83-93.
Zulli, R., & Frierson, H. (2004). A focus on cultural variables in evaluating an upward bound program. New Directions for Evaluation, 102, 81-93.


Appendix

A. Fall Pilot Survey
B. Distribution of Sports Teams in Fall Pilot Survey
C. Frequency of Fall Pilot Survey Responses
D. Mean Person Locations
E. Sample Descriptors
F. Sample Measurement-Specific Descriptors
G. ASC Goals, Outcomes, and Measures
H. Fall Pilot Outcome Space


A. ASC Self-Reliance and Sense of Belonging Survey – PILOT Please review the comic strips below and respond to the relevant questions. This survey is part of an internal evaluation of the Athletic Study Center. Additionally, with your consent, your responses will be part of an evaluation case study for a PhD student dissertation project. Your responses will be kept confidential; your coaches and peers will not have access to your responses. You are not obligated to answer any single question, but please be as honest as possible when responding to questions. Your honest answers will help the Athletic Study Center improve its services. If you have any questions about the survey or corresponding evaluation/research please contact Laura Pryor (phone: 760-815-1765; email: [email protected])

*UC Berkeley Student ID Number:______

*Please write your self-identified gender:______

*Did you participate in Summer Bridge (please circle)?: Yes No

*Are you an international student (please circle)?: Yes No

*Please write the sport you play at Cal:______

*Please write your high school GPA:______

*What is your race/ethnicity (circle all that apply): American Indian/Alaskan Native - Asian - Black or African American - Hispanic/Latino - Native Hawaiian/Pacific Islander Other (please indicate):______


Please read the comic strip below and respond to items (a) through (d). In the scene below, you go to GSI office hours for Econ 1 to get help on an assignment. Shortly after you enter, another student named Sarah (whom you don’t know) joins you with her own questions.

a) If you were in this situation, in general, what would be the GSI’s most likely response to your request for help? a. N/A – I would not be in this situation because I would have met with my GSI before my game about missing lab. b. “Sure, maybe you and your classmate have the same questions, let’s all sit down together.” c. “It’s your responsibility to make up for missed class material, so I can’t help you.” d. N/A – I would not be in this situation because GSIs don’t help students like me, and I would never go to office hours. e. Other, please write:

b) If your GSI responded with Option B: “Sure, maybe you and your classmate have the same questions, let’s all sit down together,” how are you most likely to react? a. Ask both the GSI and your classmate questions about the material, trying to learn as a group. b. You get your questions answered by the GSI and then hang around to see if your classmate’s questions can help you understand the material. c. Sit in the group but feel hesitant to ask your own questions, letting the classmate ask all their questions. d. Leave office hours immediately, not wanting to interact with an unfamiliar student. e. Other, please write:

c) If your GSI responded to item (a) with Option C: “It’s your responsibility to make up for missed class material, so I can’t help you,” what are the thoughts going through your head? a. You believe this GSI may think you are not committed to the course. You will make sure you are engaged in section and ask questions in lecture. b. You believe that this GSI is just communicating the course policy. You will try to go to a tutor for help. c. You will never go back to this GSI’s office hours, but now you are not sure how to ask for help from other GSIs in the course. d. There’s no point in going to any GSI office hours because nobody will ever help you. e. Other, please write:

d) Imagine that Sarah (the other student in the comic strip) invites you to come to the next Undergraduate Business Student Association meeting. How would you feel if a student invited you to join an organization related to your major? a. I’m already involved with student organizations related to my major and regularly participate in organizations like the Undergraduate Business Student Association. b. I’m slightly interested in student groups that relate to my major and would like to attend a few meetings to see what they’re about. c. I’m not part of any student organization and while I might be interested, I have no time for something like this. d. I’m not a part of any student organization and have no interest in joining one, even if it was about something I had time for. e. Other, please explain:

Please read the comic strip below and respond to items (e) through (h). In the scene below, it’s the end of class and the professor has asked students to organize into groups for the upcoming group project. You’re the only student athlete in the class (wearing the Cal sweatshirt) and don’t know your other classmates.

e) Imagine you are the student athlete in the above scenario. How are you most likely to respond in Panel 3? a. “Unfortunately, the site visits will be a problem with my schedule. But, I can put together the final presentation, and I will try to work as many site visits into my schedule as possible.” b. “The site visits are going to be a problem for me. I’ll try to think of some ways to either work it into my schedule or make up for it.” c. “I’ll probably have to miss a lot of the site visits, and I’m not sure how to make up for that.” d. “I won’t be able to go do the site visits, so it will just be you two doing that part of the project.” e. Other, please write:


f) How are you most likely to feel about the comment in Panel 1: “It looks like everyone else has a group. I guess us three should work together.” a. Even though my groupmates are different from me, it’s an opportunity for me to show that student athletes work just as hard as everyone else and are great collaborators. b. My classmates might judge me because I’m a student athlete, but I’ll try to show them that I’ll do my fair share of the work. c. Hopefully we can all work together. It seems that I always struggle to work with people I don’t know. d. I always feel like non-student athletes purposely avoid picking me for group projects. I will probably feel uncomfortable during this whole project. e. Other, please write:

g) Please select the option that best describes how you feel about the comic scenario: a. The group project for this class will be challenging for me, but I know how to plan my time and utilize the resources I need to help me do well in the course. b. There might be some challenges with fitting in all the work that is required, and I’ll just try my best to figure out how to manage everything. c. I can’t fit in all the work that’s required; I should probably drop the course. d. The group project will not present any challenges for me; my groupmates are smart enough to figure it out on their own. e. Other, please write:

h) One week after the above scenario, your groupmates ask if you can do the library research for the project. How are you most likely to feel about this request? a. Confident – you know how to seek out a librarian and use library resources to get the information you need. b. Slightly Concerned – you have used the library a few times but still need help looking up good references. c. Very Concerned – you generally don’t go to the library and are not sure how to look up references. d. Fine – you’ll look at a few websites and send them to your groupmates to write about. e. Other, please write:


Please read the comic strip below and respond to items (i) through (l). In the scene below, you have decided to go check out the booths and explore campus. Student groups and other organizations are there to talk to students and promote their organizations.

i) Imagine you are the student athlete in the green sweatshirt. How are you most likely to respond in Panel 3? a. “That sounds really interesting. If it fits in my schedule, I’ll definitely check it out.” b. “Sure, I’ll take the flyer, and maybe I’ll see you there.” c. “No, I don’t think so.” d. N/A – I would keep walking, ignore the flyer, and not say anything. e. Other, please write:

j) Imagine that the business student in the above scenario did not offer you a flyer. How are you most likely to feel? a. Assertive – you know that other students may have stereotypes against student athletes, but this doesn’t stop you from seeking out opportunities on campus that interest you. b. Annoyed – you know you can always approach the business student yourself, but you lack motivation to do so. c. Frustrated – you feel stereotyped and powerless to do anything about it. d. Isolated - as if you are not part of the regular student population at UC Berkeley. e. Other, please write:

k) Imagine that you are interested in going to the ice cream social, but realize that the event conflicts with one of your class section meetings. In general, how are you most likely to react? a. You ask your GSI if it is okay to attend another section that week so you can participate in a campus event. b. You ask your friend ahead of time what you should do about missing section. c. You go to the ice cream social and ask one of your friends what happened in section. d. You go to the ice cream social and don’t think about your section meeting. e. Other, please write:


l) Imagine that you take the flyer and it says: “Sponsored by the Haas School of Business Alumni Association – Alumni will be in attendance.” What are you most likely to think about this? a. Excited by the possibility of meeting with alumni, and you ask the business student if there are other ways to connect with the alumni association outside of the ice cream social. b. You would like to meet alumni, and you decide to go to the event if you can convince some friends to go too. c. You recognize that alumni might help you connect with people who can offer you a future job, but it’s not worth your time to go to the event and actually meet alumni. d. Nothing, alumni associations don’t mean anything to you. e. Other, please write:

Please read the comic strip below and respond to items (m) through (o). Scene: In the comic strip below, Dana and Judy are roommates and are both on the women’s soccer team. Dana completed a draft of her midterm paper before the weekend’s soccer game in Los Angeles, while Judy is still confused about the material and hasn’t started hers. It’s Thursday, and the soccer team will leave that afternoon.

m) In general, my behavior is more like Judy than Dana a. All of the time b. Some of the time c. Almost never d. Never e. Other (please explain)

n) If you were Dana, how would you fill in the blank word bubble in Panel Three? a. “I can show you how to make a tutor appointment and on the bus ride down, we can make a scheduling timeline and discuss an outline for your paper.” b. “I can show you how to make a tutor appointment and on the bus ride down, we can look at your schedule and make a plan for how you will get your paper done.” c. “Just ask for an extension and next time, don’t wait until the last minute.” d. “Forget about it, you’ll figure it out when we get back.” e. Other (please write):


o) If you were Judy, how would you feel about this scenario? a. Determined – you can reflect on the things that went wrong with your time management in order to make a better plan for the future. b. Anxious/Stressed – unsure of how to better manage your time but feel like improvements can and should be made. c. Overwhelmed – there’s no way a student athlete can ever manage a difficult course schedule with athletic commitments. d. Wouldn’t really care – student athletes face this situation all the time and manage to survive. e. Other (please write):

Please read the comic strip below and respond to items (p) through (r).

p) How likely are you to find yourself in the above scenario? a. Very Likely – I have already sought out campus resources and asked similar questions. b. Likely – I would like to go to this kind of a workshop and could see myself asking questions. c. Somewhat likely – I may go to a workshop, but only if it was something I went to with my team. d. Not likely at all – I would not go to a Career Center workshop. e. N/A – my schedule would never allow me to go to any kind of career workshop. f. Other, please write:

q) If you were the student in this scenario, how are you most likely to respond in panel 3? a. “Yes, thank you. Do you think you and I could set up an appointment to talk about what I might ask when I call or email?” b. “Yes, thank you. I will try to call a few alumni.” c. “No, that’s okay. I think I can ask my friends if they know about any internships or just look for some on my own.” d. “No, that’s okay. I’m not sure what I would ask alumni.” e. Other, please write:


r) Imagine that you leave the Career Center workshop feeling like you still don’t know how to get a summer internship that is right for you. What are you most likely to do next? a. Contact an advisor or career center staff for further direction on how to find a summer internship that matches your interests. b. The next time you see a friend in your same major, you ask them about their summer internship plans. c. You think about the people you know with summer internships who might be able to help you, but you don’t actually contact them. d. Probably nothing – you don’t think anyone else will have useful information about how to find a summer internship. e. Other, please explain:

Please read the comic strip below and respond to items (s) through (v).

s) Imagine you are the student with the C- in the above scenario. How are you most likely to feel? a. Motivated - you will re-submit your paper, and you are determined to get at least a B+ or higher. b. Disappointed - you tried to get a good grade on this paper and will re-submit for a better grade. c. Concerned - you’ll have to do better on the next paper to make sure you pass the class. d. Fine - a C- is passing and that’s all you need to do. e. Other, please write:

t) Imagine that you receive a C- and decide to re-submit your paper. What are you most likely to do next? a. Go to instructor office hours and incorporate all feedback, then seek out tutorial support to make sure your revisions make sense and improve the paper. b. Incorporate the instructor comments you understand and go to instructor office hours to ask about the comments you do not understand. c. Look at the instructor comments and incorporate the suggestions that you understand, but you don’t go to office hours. d. Make a few changes based on what you think was wrong with the paper, but you do not consult the instructor or look at instructor feedback. e. Other, please write:


u) Imagine your instructor gives you the name and email address of a student who got a high grade on the midterm paper and recommends you do a peer review (exchanging papers) before submitting your final paper. What are you most likely to do? a. Exchange final papers with the other student and then meet to go over specific feedback and general writing tips. b. Exchange final papers with the other student and incorporate the feedback you understand. c. Next class, ask the other student to do a peer review, but then forget to actually exchange before turning in the final paper. d. Forget about the suggestion and never contact the other student. e. Other, please write:

v) Imagine that you miss the one-week deadline to re-submit your C- paper for a higher grade. How are you most likely to feel? a. Proactive - you identify exactly how your time management system allowed you to miss an important deadline and know that this same mistake won’t be repeated. b. Disappointed - you recognize how and why you missed the deadline and will try not to do it again in the future. c. Frustrated - you intended to and should have been able to meet the deadline; you are not sure how you missed it. d. Fine - it was optional to re-submit the paper. e. Other, please write:


B. Distribution of sports teams included in fall pilot sample

Sport | % | n
Baseball | 11 | 7
Field Hockey | 6 | 4
Football | 2 | 1
Lacrosse | 3 | 2
Men’s Cross Country & Track | 2 | 1
Men’s Diving | 2 | 1
Men’s Golf | 2 | 1
Men’s Soccer | 5 | 3
Men’s Swimming | 5 | 3
Men’s Tennis | 3 | 2
Men’s Track | 2 | 1
Rugby | 8 | 5
Softball | 2 | 1
Women’s Basketball | 5 | 3
Women’s Cross Country & Track | 5 | 3
Women’s Golf | 2 | 1
Women’s Rowing | 12 | 8
Women’s Soccer | 9 | 6
Women’s Swim & Dive | 2 | 1
Women’s Swimming | 6 | 4
Women’s Track | 3 | 2
Women’s Water Polo | 9 | 6


C. Frequency of Responses

Question | Option A | Option B | Option C | Option D
GSI.Purpose | 25 | 39 | 3 | 0
GSI.Recognize | 31 | 29 | 8 | 0
GSI.Interpret | 20 | 26 | 8 | 3
GSI.Community | 1 | 45 | 22 | 1
Group.IdentityA | 37 | 33 | 0 | 2
Group.Recognize | 36 | 30 | 0 | 3
Group.IndentityB | 28 | 39 | 3 | 0
Group.Holistic | 15 | 29 | 23 | 4
Flyer.Community | 19 | 29 | 13 | 8
Flyer.Interpret | 26 | 14 | 1 | 8
Flyer.Holistic | 34 | 6 | 13 | 0
Flyer.Purpose | 43 | 22 | 2 | 4
Dorm.Achieve | 11 | 22 | 32 | 4
Dorm.Network | 21 | 33 | 7 | 5
Dorm.Growth | 17 | 45 | 5 | 4
Career.Network | 8 | 28 | 16 | 12
Career.Input | 32 | 34 | 2 | 3
Career.Learn | 48 | 15 | 6 | 2
Paper.Achieve | 33 | 30 | 5 | 2
Paper.Input | 39 | 20 | 13 | 0
Paper.Learn | 38 | 25 | 2 | 6
Paper.Growth | 13 | 28 | 24 | 0
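The counts above can be tabulated directly from the raw pilot responses. The following is a minimal sketch of that tabulation, assuming a hypothetical file named pilot_responses.csv with a student_id column and one column per item holding the selected option (A–D); the file and column names are illustrative, not the actual evaluation data files.

```python
# Minimal sketch (hypothetical file and column names): count how many
# respondents selected each option for every pilot item, as in Appendix C.
import pandas as pd

responses = pd.read_csv("pilot_responses.csv")            # one row per student
items = [c for c in responses.columns if c != "student_id"]

counts = (
    pd.DataFrame({item: responses[item].value_counts() for item in items})
    .T                                                    # rows = items, columns = options
    .reindex(columns=["A", "B", "C", "D"], fill_value=0)
    .fillna(0)
    .astype(int)
)
print(counts)
```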


D. Person Mean Locations

Question | Option A | Option B | Option C | Option D
GSI.Purpose | -0.034 | 0.039 | -0.065 | N/A
GSI.Recognize | -0.083 | 0.059 | -0.054 | N/A
GSI.Interpret | 0.036 | 0.093 | -0.009 | -0.054
GSI.Community | -0.473 | -0.029 | 0.018 | 0.072
Group.IdentityA | -0.934 | -0.072 | N/A | 0.116
Group.Recognize | -0.326 | 0.006 | N/A | 0.017
Group.IndentityB | -0.613 | -0.009 | 0.085 | N/A
Group.Holistic | -0.614 | -0.116 | 0.046 | 0.207
Flyer.Community | -0.002 | -0.044 | -0.038 | 0.076
Flyer.Interpret | -0.117 | 0.355 | 0.046 | -0.021
Flyer.Holistic | -0.191 | -0.060 | 0.060 | N/A
Flyer.Purpose | -0.815 | -0.060 | -0.154 | 0.158
Dorm.Achieve | -0.276 | -0.075 | 0.083 | 0.148
Dorm.Network | -0.604 | -0.172 | 0.058 | 0.047
Dorm.Growth | -0.486 | 0.551 | -0.100 | 0.264
Career.Network | -0.054 | 0.057 | 0.051 | -0.172
Career.Input | -0.043 | -0.568 | 0.003 | 0.042
Career.Learn | -0.873 | -0.139 | -0.297 | 0.153
Paper.Achieve | -0.31898 | -0.162 | 0.035 | 0.0159
Paper.Input | -0.516 | -0.168 | 0.291 | N/A
Paper.Learn | -0.549 | 0.269 | -0.116 | 0.124
Paper.Growth | -0.167 | -0.052 | 0.454 | N/A
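Each cell above is the mean estimated person location (in logits) of the respondents who selected that option for that item; in the validation step these means are inspected to check whether options intended to reflect higher levels of the construct attract respondents with higher locations. Below is a minimal sketch of that computation, assuming two hypothetical files, person_estimates.csv (student_id, location) and pilot_responses.csv (student_id plus one column per item); the names are illustrative only.

```python
# Minimal sketch (hypothetical files): mean person location by selected option
# for each pilot item, as summarized in Appendix D.
import pandas as pd

responses = pd.read_csv("pilot_responses.csv")            # student_id + item columns ("A"-"D")
locations = pd.read_csv("person_estimates.csv")           # student_id, location (logits)

merged = responses.merge(locations, on="student_id")
items = [c for c in responses.columns if c != "student_id"]

# Average the person locations of the respondents who chose each option.
mean_locations = pd.DataFrame(
    {item: merged.groupby(item)["location"].mean() for item in items}
).T.round(3)                                              # rows = items, columns = options

print(mean_locations)
```

Options that no respondent selected simply drop out of the group means, which is why the options with zero counts in Appendix C appear as N/A in this table.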

E. Sample Descriptors

Article | Context | Population | Only Quant | Mixed Methods | Only Qual | Specific Quant Methods | Purpose
Aktan (1999) | health | African American | 0 | 0 | 1 | N/A | N/A
Alkon et al (2001) | social services | US Minorities | 0 | 1 | 0 | frequency data from questionnaires; means and standard dev; t-tests to look at differences between subgroups | summative judgement
Anderson-Draper (2006) | social services | Canadian Immigrants | 0 | 1 | 0 | Tracking of hours - frequency data | summative judgement
Baizerman & Compton (1992) | education | US Minorities | 0 | 1 | 0 | frequency data of surveys | formative learning
Barnes (2000) | social services | Maori | 0 | 0 | 1 | N/A | N/A
Berends & Roberts (2003) | social services | Australian Aboriginal | 0 | 0 | 1 | N/A | N/A
Bevan-Brown (2001) | education | Maori | 0 | 1 | 0 | not described, assume frequency data from surveys | formative learning
Butty et al (2004) | education | US Minorities | 0 | 1 | 0 | pre/post analysis was not discussed | summative judgement
Caldwell et al (2005) | social services | Native American | 0 | 0 | 1 | N/A | N/A
Cervantes and Pena (1998) | social services | Hispanic/Latino | 0 | 1 | 0 | frequency data of participant background | summative judgement
Christie & Barela (2005) | education | US Minorities | 0 | 1 | 0 | means and sds via the Delphi technique | formative learning
Clayson et al (2002) | community | Hispanic/Latino | 0 | 1 | 0 | not described, assume frequency data from surveys | knowledge generation
Conner (2004) | health | Hispanic/Latino | 0 | 1 | 0 | compared with matched sample using inferential stats | summative judgement
Cooper & Christie (2005) | education | Hispanic/Latino | 0 | 0 | 1 | N/A | N/A
Copeland-Carson (2005) | community | US Minorities | 0 | 1 | 0 | use of census data | knowledge generation
Coppens et al. (2006) | community | US Immigrants | 0 | 1 | 0 | pre and post analysis to show growth; actual analysis not described | summative judgement
Fetterman (2005) | community | US Minorities | 0 | 1 | 0 | descriptive statistics of student test scores and means/sds | formative learning
Fisher & Ball (2002) | health | Native American | 1 | 0 | 0 | longitudinal analyses, yet it's unclear what statistical procedures will be used to analyze data | summative judgement
Garaway (1996) | education | India | 0 | 1 | 0 | multivariate regression; descriptive statistics; examination of frequency distributions, measures of central tendency, dispersion and shape | knowledge generation
Harklau and Norwood (2005) | education | US youth | 0 | 1 | 0 | quant collected but not described | N/A
Hong et al (2005) | health | US Minorities | 0 | 0 | 1 | N/A | N/A
Jay et al (2005) | education | US Minorities | 0 | 0 | 1 | N/A | N/A
Johnson (2005) | education | US Minorities | 0 | 0 | 1 | N/A | N/A
King et al (2004) | education | US youth | 0 | 1 | 0 | survey used, but specific analyses unclear | knowledge generation
LaFrance (2004) | education | Alaskan Native | 0 | 0 | 1 | N/A | N/A
Laperriere (2006) | health | Brazil | 0 | 0 | 1 | N/A | N/A
LaPoint & Jackson (2004) | community | US Minorities | 0 | 1 | 0 | states 'measurable differences' found on pre/post test, but does not specify analysis | summative judgement
Letiecq & Bailey (2004) | community | Native American | 0 | 1 | 0 | states that Likert questions were used, but analyses were not explicated - longitudinal study | summative judgement
Maciak et al (1999) | social services | Hispanic/Latino | 0 | 0 | 1 | N/A | N/A
McKenzie (1997) | community | Canadian Aboriginal | 0 | 1 | 0 | descriptive survey data | developmental
Mertens & Hopson (2006) | education | US Minorities | 0 | 0 | 1 | N/A | N/A
Nagai (2001) | education | Papua New Guinea Indigenous | 0 | 0 | 1 | N/A | N/A
Nelson-Barber et al (2005) | community | Native American | 0 | 1 | 0 | frequencies/descriptives | formative learning
Novins et al (2004) | health | Native American | 1 | 0 | 0 | longitudinal design; however, analyses not specified | summative judgement
Peter et al (2003) | education | Native American | 0 | 1 | 0 | Frequency of 1-10 rating on Likert Scale | formative learning
Prilleltensky et al (2000) | health | Hispanic/Latino | 0 | 1 | 0 | pre/post analysis between intervention and comparison group; descriptives on background vars; mean scores for each item; test for interaction between group and time; mixed model ANOVAs | summative judgement
Richmond et al (2008) | social services | Native American | 0 | 1 | 0 | pre/post measures analyzed, but not described how analyzed | summative judgement
Robertson et al (2004) | community | Native American | 1 | 0 | 0 | frequency and descriptive data | advocacy
Running Wolf et al (2002) | health | Native American | 0 | 1 | 0 | descriptives of background characteristics; analysis of surveys not described | knowledge generation
Ryan et al (2007) | education | Native American | 0 | 1 | 0 | descriptive stats of school achievement data | formative learning
Senese (2005) | health | Native American | 0 | 0 | 1 | N/A | N/A
Slaughter (2001) | education | Native Hawaiian | 0 | 1 | 0 | tabulation of language assessment (raw scores) - not specific | summative judgement
Small et al (2006) | social services | US Immigrants | 0 | 1 | 0 | pre/post test analysis, not described | advocacy
Stockdill et al (1992) | education | US youth | 0 | 1 | 0 | frequency stats of district stats and staff surveys - not specified how | summative judgement
Thomas (2004) | education | US Minorities | 0 | 1 | 0 | not described | advocacy
Thomas & Bellefeuille (2004) | health | Canadian Aboriginal | 0 | 0 | 1 | N/A | N/A
Thurman et al (2004) | health | Native American | 0 | 1 | 0 | not described | advocacy
Uhl et al (2004) | health | African American | 0 | 1 | 0 | RCT; three time points, analysis not discussed | summative judgement
Voyle & Simmons (1999) | social services | Maori | 0 | 1 | 0 | descriptive stats for participation rates, disaggregated by sub pops | knowledge generation
White & Hermes (2005) | education | Native American | 0 | 0 | 1 | N/A | N/A
Willging et al. (2006) | health | Native American | 0 | 1 | 0 | pre/post analyses; details not described | summative judgement
Zulli & Frierson (2004) | education | US Minorities | 0 | 1 | 0 | frequency data regarding staff/student backgrounds; descriptive data of surveys | formative learning
Al Hudib et al. (2016) | education | India | 0 | 1 | 0 | descriptive data of teacher questionnaires - actual analyses not described | formative learning
Arseneault et al (2016) | social services | Canadian Prison | 0 | 1 | 0 | comparison of repeated measures with comparison and control group; analysis not described | summative judgement
Bamanyaki & Holvoet (2016) | health | Africa | 0 | 0 | 1 | N/A | N/A
Blackman et al. (2013) | health | UK Youth | 0 | 1 | 0 | descriptive survey data w/ Mann-Whitney to detect differences between respondents and non-respondents | knowledge generation
Bledsoe (2014) | community | African American | 0 | 0 | 1 | N/A | N/A
Botcheva et al. (2009) | health | Africa | 0 | 1 | 0 | pre/post RCT showing no significant change; inferential stats not discussed | summative judgement
Bowen & Tillman (2015) | community | Brazil | 0 | 1 | 0 | descriptive statistics of survey data - analyses not thoroughly described in detail | advocacy
Bowman et al. (2014) | health | Native American | 0 | 1 | 0 | frequency measures of pedometers | summative judgement
Brandao & Codas (2012) | social services | Brazil | 0 | 1 | 0 | one time point survey data - descriptives, analysis not specified | advocacy
Cavino (2013) | community | Maori | 0 | 0 | 1 | N/A | N/A
Chilisa et al. (2016) | community | Africa | 0 | 1 | 0 | not described, assume frequency data from surveys | summative judgement
Cornachione et al. (2010) | education | Brazil | 0 | 1 | 0 | descriptive stats on reactionnaire | summative judgement
Cram (2016) | community | Maori | 0 | 0 | 1 | N/A | N/A
Cullen et al. (2016) | social services | Australian Aboriginal | 0 | 0 | 1 | N/A | N/A
Draanen (2016) | social services | Canadian homeless | 0 | 0 | 1 | N/A | N/A
Durá et al. (2014) | community | international | 0 | 1 | 0 | pre/post measures analyzed | summative judgement
Easton (2012) | education | Africa | 0 | 1 | 0 | not described | summative judgement
Fletcher (2013) | community | Australia | 0 | 1 | 0 | pre/post survey analysis with comparison group, actual analyses not described | formative learning
Foreman-Peck & Travers (2015) | education | US youth | 0 | 1 | 0 | frequency counts of observation codes to measure student participation | formative learning
Freeman & Hall (2012) | education | US youth | 0 | 0 | 1 | N/A | N/A
Freeman et al. (2010) | education | US youth | 0 | 0 | 1 | N/A | N/A
Grimes et al. (2016) | health | Native American | 0 | 1 | 0 | descriptives of survey and background variables | summative judgement
Gruskin et al. (2015) | health | Africa | 0 | 1 | 0 | descriptives of quant indicators | advocacy
Hall & Freeman (2014) | education | US youth | 0 | 0 | 1 | N/A | N/A
Hall et al. (2013) | education | US youth | 0 | 0 | 1 | N/A | N/A
Hanberger (2010) | social services | Swedish Immigrants | 0 | 0 | 1 | N/A | N/A
Hannay et al. (2013) | health | Hispanic/Latino | 0 | 0 | 1 | N/A | N/A
Hesse-Biber (2013) | health | Chile | 0 | 1 | 0 | RCT | summative judgement
Hilton & Libretto (2016) | health | US Military | 0 | 1 | 0 | descriptives collected from monitoring data | formative learning
Hood (2009) | education | Native American | 0 | 1 | 0 | frequencies of checklists | formative learning
Hopson (2014) | education | Africa | 0 | 1 | 0 | descriptive survey data | advocacy
Hubberstey et al (2015) | health | Canadian Aboriginal | 0 | 0 | 1 | N/A | N/A
Janzen et al. (2015) | social services | Canadian Youth | 0 | 1 | 0 | sample error was calculated; t-tests/ANOVAs to show statistical differences in outcomes between higher and lower attendance students; EFA & CFA were used to assess internal structure; Cronbach's to test reliability | summative judgement
Johnson-Turbes et al. (2015) | health | African American | 0 | 1 | 0 | frequency stats of post-use survey: total number of visitors, demographics, and satisfaction rates | summative judgement
Johnston (2013) | community | Canadian Aboriginal | 0 | 0 | 1 | N/A | N/A
LaFrance et al. (2015) | education | Pacific Islands | 0 | 1 | 0 | survey descriptive analysis | summative judgement
LaFrance et al. (2012) | community | Native American | 0 | 0 | 1 | N/A | N/A
Lapidot-Lefler et al. (2015) | social services | Disability | 0 | 0 | 1 | N/A | N/A
Le Menestrel et al. (2013) | community | US youth | 0 | 1 | 0 | longitudinal analyses | summative judgement
Letichevsky & Penna Firme (2012) | social services | Brazil | 0 | 1 | 0 | frequency data from checklists and unobtrusive measures to measure longitudinal change in key domains | summative judgement
Luo & Liu (2014) | community | China | 0 | 1 | 0 | frequency (average score and average score by gender) of PRA method via scores that participating villagers assigned to the project (1 to 10) | summative judgement
Lustig et al. (2015) | education | Immigrants | 1 | 0 | 0 | pre/post survey analysis; descriptive data organized into a pre/post bar chart | formative learning
Mertens (2013) | health | Disability | 0 | 1 | 0 | survey analysis - descriptive; frequency of biomarkers | knowledge generation
Mitakidou (2015) | education | Greece | 0 | 1 | 0 | frequency of attendance, by percentage of monthly attendance; descriptive data of each tool's checklist ratings | formative learning
Mertens & Zimmerman (2015) | social services | international | 0 | 1 | 0 | # of boys and girls completing school; pre/post to measure change on the GEM scale | summative judgement
Nastasi & Hitchcock (2009) | health | India | 0 | 1 | 0 | quasi-experiment that tested the impact of the intervention at the patient, provider, and community levels, details not provided | summative judgement
Nevarez (2013) | health | Hispanic/Latino | 0 | 1 | 0 | means and descriptive stats from patron surveys & calorie counts from menus | knowledge generation
Noblit & Jay (2010) | education | African American | 0 | 0 | 1 | N/A | N/A
O'Hara et al. (2015) | education | Ireland | 0 | 1 | 0 | frequency/descriptive data from peer and self assessments | formative learning
Samuels & Ryan (2011) | education | US Minorities | 0 | 1 | 0 | descriptives of student assessment data, disaggregated by subgroups | formative learning
Shoultz et al. (2015) | social services | Pacific Islands | 1 | 0 | 0 | demographic descriptives and survey descriptives analyzed at three time points to measure growth and compared with comparison group | developmental
Steinberg & Zamir (2015) | education | Israel/Palestine | 0 | 0 | 1 | N/A | N/A
Stokes et al. (2011) | social services | US Refugees | 0 | 1 | 0 | survey analysis - descriptives | summative judgement
Thomas et al. (2014) | education | Pacific Islands | 0 | 0 | 1 | N/A | N/A
Woelders & Abma (2015) | social services | Disability | 0 | 0 | 1 | N/A | N/A
Zamir & Abu Jaber (2015) | education | Israel/Palestine | 0 | 1 | 0 | descriptives from surveys and student assessments - analyses unclear | developmental
Ziabakhsh (2015) | health | Canadian Aboriginal | 0 | 1 | 0 | descriptives of hard health measures | summative judgement
Zoabi & Awad (2015) | education | Israel/Palestine | 1 | 0 | 0 | point in time study, descriptives from survey only | summative judgement
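The coded descriptors above are the raw material for the prevalence counts synthesized in Chapter One. Below is a minimal sketch of that tabulation, assuming the table has been saved to a hypothetical CSV file (sample_descriptors.csv) using the column names shown above; the file name is illustrative, not part of the evaluation materials.

```python
# Minimal sketch (hypothetical file): summarize the coded cases by method mix
# and, for cases with any quantitative data, by evaluation purpose.
import pandas as pd

cases = pd.read_csv("sample_descriptors.csv")

# Prevalence of method mix across the reviewed cases.
print(cases[["Only Quant", "Mixed Methods", "Only Qual"]].sum())

# Among cases that used any quantitative data, how purposes are distributed.
quant_cases = cases[cases["Only Qual"] == 0]
print(quant_cases["Purpose"].value_counts())
```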

F. Sample Measurement-Specific Descriptors

Measure Off-The Adapted Off The Evaluator Measure B. Article Evaluator Designed Used Shelf Shelf Validated Issue Aktan (1999) 0 0 0 0 0 0 Alkon et al (2001) 1 1 0 0 0 1 Anderson-Draper 0 0 0 0 0 0 (2006) Baizerman & 1 0 0 1 0 1 Compton (1992) Barnes (2000) 0 0 0 0 0 0 Berends & Roberts 0 0 0 0 0 0 (2003) Bevan-Brown (2001) 1 0 0 1 0 1

Butty et al (2004) 1 1 0 0 0 1

Caldwell et al (2005) 0 0 0 0 0 0 Cervantes and Pena 1 1 0 0 0 1 (1998) Christe & Barela 0 0 0 0 0 0 (2005) Clayson et al (2002) 1 1 0 0 0 1 Connor (2004) 1 0 0 1 0 0 Cooper & Christie 0 0 0 0 0 0 (2005) Copeland-Carson 0 0 0 0 0 1 (2005)

Coppens et al. (2006) 1 1 0 0 0 1

1 1 0 0 0 0

Fetterman (2005)

1 0 1 0 0 1

Fisher & Ball (2002)

Garaway (1996) 1 1 0 1 0 0

117 Harklau and Norwood

0 0 0 0 0 0

(2005) Hong et al (2005) 0 0 0 0 0 0 Jay et al (2005) 0 0 0 0 0 0 Johnson (2005) 0 0 0 0 0 1 King et al (2004) 1 1 0 0 0 1 LaFrance (2004) 0 0 0 0 0 1 Laperriere (2006) 0 0 0 0 0 0 LaPoint & Jackson 1 0 1 0 1 0 (2004) Letiecq & Bailey 1 1 0 0 0 1 (2004) Maciak et al (1999) 0 0 0 0 0 0 McKenzie (1997) 1 0 0 0 0 0 Mertens & Hopson 0 0 0 0 0 0 (2006) Nagai (2001) 0 0 0 0 0 0 Nelson-Barber et al 0 0 0 0 0 0 (2005) Novins et al (2004) 1 1 1 1 1 1 Peter et al (2003) 1 0 0 1 0 1 Prilleltensky et al 1 0 1 0 1 1 (2000) Richmond et al (2008) 1 0 0 1 0 0 Robertson et al (2004) 1 0 0 1 0 0 Running Wolf et al 1 1 1 1 1 1

(2002)

1 1 0 0 0 1

Ryan et al (2007)

Senese (2005) 0 0 0 0 0 0

Slaughter (2001) 1 0 1 1 0 1

118 Small et al (2006) 1 1 0 0 0 1

Stockdill et al (1992) 1 0 0 1 0 0 Thomas (2004) 1 1 0 0 0 0 Thomas & Bellefeuille 0 0 0 0 0 0 (2004) Thurman et al (2004) 1 0 0 1 0 0 Uhl et al (2004) 1 1 0 0 0 0 Voyle & Simmons 0 0 0 0 0 0 (1999) White & Hermes 0 0 0 0 0 0 (2005) Willging et al. (2006) 1 1 0 1 0 1 Zulli & Frierson 1 1 1 1 1 1 (2004) Al Hudib et al. (2016) 0 0 0 0 0 0 Arseneault et al (2016) 0 0 0 0 0 0 Bamanyaki & Holvoet 0 0 0 0 0 0 (2016) Blackman et al. (2013) 1 0 0 1 1 0 Bledsoe (2014) 0 0 0 0 0 0 Botcheva et al. (2009) 1 0 1 0 0 1 Bowen & Tillman 1 0 1 0 0 1 (2015) Bowman et al. (2014) 1 0 0 1 0 0 Brandao & Codas 1 0 0 1 0 0 (2012)

Cavino (2013) 0 0 0 0 0 0

1 1 0 0 0 1

Chilisa et al. (2016)

Cornacione et al.

1 0 1 0 0 0

(2010)

119 Cram (2016) 0 0 0 0 0 0

Cullen et al. (2016) 0 0 0 0 0 0 Draanen (2016) 0 0 0 0 0 0 Dura et al. (2014) 1 0 0 1 0 0 Easton (2012) 1 0 0 1 0 1 Fletcher (2013) 1 0 0 0 0 0 Foreman-Peck & 1 1 0 0 0 1 Travers (2015) Freeman & Hall 0 0 0 0 0 0 (2012) Freeman et al. (2010) 0 0 0 0 0 0 Grimes et al. (2016) 1 0 0 1 0 0 Gruskin et al. (2015) 0 0 0 0 0 0 Hall & Freeman 0 0 0 0 0 0 (2014)

Hall et al. (2013) 0 0 0 0 0 0 Hanberger (2010) 0 0 0 0 0 0 Hannay et al. (2013) 0 0 0 0 0 0 Hesse-Biber (2013) 0 0 0 0 0 0 Hilton & Libretto 1 1 0 0 0 1 (2016) Hood (2009) 1 0 0 1 0 0 Hopson (2014) 1 1 0 0 0 0 Hubberstey et al 0 0 0 0 0 0 (2015) Janzen et al. (2015) 1 1 0 1 1 1

Johnson-Turbes et al. 1 0 0 1 0 0

(2015)

Johnston (2013) 0 0 0 0 0 0

1 1 0 1 0 1

LaFrance et al. (2015)

120 LaFrance et al. (2012) 0 0 0 0 0 0

Lapidot-Lefler et al. 0 0 0 0 0 0 (2015) Le Menestrel et al. 1 1 0 0 0 1 (2013) Letichevesky & Penna 1 0 0 1 0 0 Firme (2012) Luo & Liu (2014) 1 0 0 1 1 1 Lustig et al. (2015) 1 0 0 1 0 0 Mertens (2013) 1 0 0 1 0 0 Mitakidou (2015) 1 0 0 1 0 0 Mertens & 1 1 0 0 0 0 Zimmerman (2015) Nastasi & Hitchcock 1 0 0 1 1 0

(2009)

Nevarez (2013) 1 0 0 1 0 0 Noblit & Jay (2010) 0 0 0 0 0 0 O'Hara et al. (2015) 1 0 0 1 0 0 Samuels & Ryan 1 1 0 1 0 0 (2011) Shoultz et al. (2015) 1 0 0 1 1 0 Steinberg & Zamir 0 0 0 0 0 0 (2015) Stokes et al. (2011) 1 0 0 1 0 1 Thomas et al. (2014) 0 0 0 0 0 0 Woelders & Abma 0 0 0 0 0 0

(2015)

Zamir & Abu Jaber

1 0 0 1 0 0

(2015)

Ziabkhsh (2015) 1 1 0 0 0 1

121 Zoabi & Awad (2015) 1 0 0 1 0 0
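One way to synthesize the coding above is to tally how many articles received a 1 on each descriptor. The following is a minimal sketch of that tally; the descriptor labels follow the reconstructed column order above, and the dictionary reproduces only four rows from the table for illustration.

```python
from collections import Counter

# Descriptor names follow the column order used in Appendix B.
DESCRIPTORS = ["measure_used", "off_the_shelf", "adapted_off_the_shelf",
               "evaluator_designed", "evaluator_validated", "measure_issue"]

# Excerpt of the Appendix B coding (four articles shown for illustration).
coding = {
    "Alkon et al (2001)":   [1, 1, 0, 0, 0, 1],
    "Janzen et al. (2015)": [1, 1, 0, 1, 1, 1],
    "Luo & Liu (2014)":     [1, 0, 0, 1, 1, 1],
    "Noblit & Jay (2010)":  [0, 0, 0, 0, 0, 0],
}

# Count how many articles were coded 1 on each descriptor.
counts = Counter()
for flags in coding.values():
    for name, flag in zip(DESCRIPTORS, flags):
        counts[name] += flag

for name in DESCRIPTORS:
    print(f"{name}: {counts[name]} of {len(coding)} articles")
```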

Appendix C. ASC Goals, Outcomes, and Measures

Goal: To have an internal culture that consistently reflects our core values; to have our actions as a team and individuals reflect our values.
- Objective: Have all employees fluent in our values. | Measurable: 100% of ASC staff know the 5 values of integrity, inclusion, teamwork, learning, and compassion. | Existing instrument: 360 Tool
- Measurable: 100% of staff rate individual and team behaviors at 3 or above on 360 tool. | Existing instrument: 360 Tool

Goal: To be a department that fosters academic competence and confidence among student athletes through individualized skill development and full integration into the campus community.
- Objective: Understanding Modes of communication and their appropriate uses (face to face, word processing, texting, email, etc.) | Measurable: Student, tutor, learning specialist surveys ranking 4 out of 5 on a Likert scale; Pre/Post assessment (for all who take the course) | Existing instruments: TQ14d, TQ14d, TQ14e
- Objective: Navigating Research/Academic Resources | Measurable: Student, tutor, learning specialist surveys ranking 4 out of 5 on a Likert scale | Existing instrument: TQ20
- Objective: Understanding Writing process | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ14a, TQ20
- Objective: Understanding Reading process | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ14a, TQ20
- Objective: Understanding Note taking process | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ14a, TQ20
- Objective: Understanding Professor meeting process | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ6a, TQ7, TQ14b
- Objective: Understanding Time Management process | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ5, TQ14c
- Objective: Understanding Annotating and Quote Extraction | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instruments: TQ14a, TQ20
- Objective: Understanding Citation and Referencing | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instrument: TQ20
- Objective: Learning how to negotiate relationships including tutoring, librarians, and faculty | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of encouragement | Existing instruments: TQ6a/b/d/e, TQ7, TQ14b, TQ14e
- Objective: Regularly meeting with a professor or GSI | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of encouragement (e.g. Student Athletes self report that they attended Professor/GSI office hours regularly: 70% Fall '14, 72% Spring '15) | Existing instruments: TQ6a, TQ7, TQ14b
- Objective: Showing up to class and meetings prepared | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of encouragement | No existing instrument
- Objective: Meeting with a Tutor | Existing instrument: TQ6b
- Objective: Meeting with a Study Group (e.g. ASC/SLC) | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of encouragement | Existing instrument: TQ6d
- Objective: Working independently (e.g. doing homework) without the guidance or facilitation of a tutor | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of encouragement | Existing instrument: TQ4
- Objective: Participating in study groups with classmates or teammates | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instrument: TQ6f
- Objective: Attending Exam review sessions | Measurable: Student, tutor, learning specialist surveys scoring 70% or better in frequency of promotion | Existing instrument: TQ6c

Goal: To be a department that helps student athletes navigate and take ownership of their academic and personal development through graduation and beyond.
- Objective: Have every SA find support in establishing a major that is authentic, inspiring and career related for them. | Measurable: Self reported feelings of support in a survey administered by the ASC | Existing instrument: Advising survey Section 2
- Objective: Have every SA understand academic/major requirements. | Measurable: Self reported feelings of understanding in a survey administered by the ASC | Existing instrument: Advising survey Section 2
- Objective: Have SAs see the ASC as providing more than academic advising. | Measurable: Self-reported acknowledgement that ASC staff provide more than academic advising in survey administered by the ASC | No existing instrument

Goal: To be a department that collaborates with campus partners to provide every student athlete the opportunity to receive transitional support, leadership development, and student development enrichment.
- Objective: Have a formal alumni mentoring program that exposes SA to a valuable CAL network and to help provide a vision for CAL postgrad success | Measurable: Assess program with SA/Alumni survey that measures engagement, impact and support. | No existing instrument

Note: TQ refers to the 'Tutorial Questionnaire'. Italicized text indicates open-ended response.

Appendix D. Fall Pilot Outcome Space

Measure: Sense of Belonging

Characteristic: Shared Purpose

Questions and options:
Q1 (GSI Scenario): If you were in this situation, in general, what would be the GSI's most likely response to your request for help?
a) N/A – I would not be in this situation because I would have met with my GSI before my game about missing lab.
b) "Sure, maybe you and your classmate have the same questions, let's all sit down together."
c) "It's your responsibility to make up for missed class material, so I can't help you."
d) N/A – I would not be in this situation because GSIs don't help students like me, and I would never go to office hours.

Q12 (Student Flyer Scenario): Imagine that you take the flyer and it says: "Sponsored by the Haas School of Business Alumni Association – Alumni will be in attendance." What are you most likely to think about this?
a) Excited by the possibility of meeting with alumni, and you ask the business student if there are other ways to connect with the alumni association outside of the ice cream social.
b) You would like to meet alumni, and you decide to go to the event if you can convince some friends to go too.
c) You recognize that alumni might help you connect with people who can offer you a future job, but it's not worth your time to go to the event and actually meet alumni.
d) Nothing, alumni associations don't mean anything to you.

Characteristic levels:
1. Has made appointments and feels comfortable communicating and investing in relevant relationships with campus resources outside the ASC on his/her own
2. Has made appointments, but still needs encouragement from others to connect with the campus resources outside the ASC to best meet his/her academic and personal needs
3. Is aware of specific campus resources outside the ASC that may help him/her, but does not feel comfortable connecting with such resources
4. Is not aware of specific resources because he/she feels like campus resources outside of the ASC are not meant for students like him/her

Characteristic: Recognize Group Dynamics

Questions and options:
Q2 (GSI Scenario): If your GSI responded with Option B: "Sure, maybe you and your classmate have the same questions, let's all sit down together," how are you most likely to react?
a) Ask both the GSI and your classmate questions about the material, trying to learn as a group.
b) You get your questions answered by the GSI and then hang around to see if your classmate's questions can help you understand the material.
c) Sit in the group but feel hesitant to ask your own questions, letting the classmate ask all their questions.
d) Leave office hours immediately, not wanting to interact with an unfamiliar student.

Q6 (Group Project Scenario): How are you most likely to feel about the comment in Panel 1: "It looks like everyone else has a group. I guess us three should work together."
a) Even though my groupmates are different from me, it's an opportunity for me to show that student athletes work just as hard as everyone else and are great collaborators.
b) My classmates might judge me because I'm a student athlete, but I'll try to show them that I'll do my fair share of the work.
c) Hopefully we can all work together. It seems that I always struggle to work with people I don't know.
d) I always feel like non-student athletes purposely avoid picking me for group projects. I will probably feel uncomfortable during this whole project.

Characteristic levels:
1. Recognizes that different students have different values, and feels that he/she can understand and productively work with students from backgrounds different from him/herself
2. Recognizes that different students have different values, but feels like he/she is still learning how to productively engage with students from different backgrounds
3. Does not recognize that different students have different values, and tries but struggles to work with others different from him/herself
4. Does not recognize that different students have different values, and disassociates from students that are different from him/herself

Characteristic: Interpret External Forces

Questions and options:
Q3 (GSI Scenario): If your GSI responded to Question 1 with Option C: "It's your responsibility to make up for missed class material, so I can't help you," what are the thoughts going through your head?
a) You believe this GSI may think you are not committed to the course. You will make sure you are engaged in section and ask questions in lecture.
b) You believe that this GSI is just communicating the course policy. You will try to go to a tutor for help.
c) You will never go back to this GSI's office hours, but now you are not sure how to ask for help from other GSIs in the course.
d) There's no point in going to any GSI office hours because nobody will ever help you.

Q10 (Student Flyer Scenario): Imagine that the business student in the above scenario did not offer you a flyer. How are you most likely to feel?
a) Assertive – you know that other students may have stereotypes against student athletes, but this doesn't stop you from seeking out opportunities on campus that interest you.
b) Annoyed – you know you can always approach the business student yourself, but you lack motivation to do so.
c) Frustrated – you feel stereotyped and powerless to do anything about it.
d) Isolated – as if you are not part of the regular student population at UC Berkeley.

Characteristic levels:
1. Is aware of what kind of external/political forces impact his/her life and does not allow them to derail him/her from finding a community at Cal
2. Is aware of what kind of external/political forces impact his/her life, but still lets these forces result in feelings of alienation from time to time
3. Is aware that external/political forces may impact his/her life but does not know how to overcome this
4. Does not realize that external/political forces impact his/her life and feels alienated as a result

Characteristic: Community Enthusiasm

Questions and options:
Q4 (Group Project Scenario): Imagine that Sarah (the other student in the comic strip) invites you to come to the next Undergraduate Business Student Association meeting. How would you feel if a student invited you to join an organization related to your major?
a) I'm already involved with student organizations related to my major and regularly participate in organizations like the Undergraduate Business Student Association.
b) I'm slightly interested in student groups that relate to my major and would like to attend a few meetings to see what they're about.
c) I'm not part of any student organization and while I might be interested, I have no time for something like this.
d) I'm not a part of any student organization and have no interest in joining one, even if it was about something I had time for.

Q9 (Student Flyer Scenario): Imagine you are the student athlete in the green sweatshirt. How are you most likely to respond in Panel 3?
a) "That sounds really interesting. If it fits in my schedule, I'll definitely check it out."
b) "Sure, I'll take the flyer, and maybe I'll see you there."
c) "No, I don't think so."
d) N/A – I would keep walking, ignore the flyer, and not say anything.

Characteristic levels:
1. Meaningfully involves him/herself with on-campus or extracurricular activities
2. Is hesitant to, but still participates in on-campus or extracurricular activities
3. Does not understand why he/she should participate in on-campus or extracurricular activities
4. Would never want to participate in on-campus or extracurricular activities

Characteristic: Identity Acceptance

Questions and options:
Q5 (Group Project Scenario): Imagine you are the student athlete in the above scenario, how are you most likely to respond in Panel 3?
a) "Unfortunately, the site visits will be a problem with my schedule. But, I can put together the final presentation, and I will try to work as many site visits into my schedule as possible."
b) "The site visits are going to be a problem for me. I'll try to think of some ways to either work it into my schedule or make up for it."
c) "I'll probably have to miss a lot of the site visits, and I'm not sure how to make up for that."
d) "I won't be able to go do the site visits, so it will just be you two doing that part of the project."

Q7 (Group Project Scenario): Please select the option that best describes how you feel about the comic scenario.
a) The group project for this class will be challenging for me, but I know how to plan my time and utilize the resources I need to help me do well in the course.
b) There might be some challenges with fitting in all the work that is required, and I'll just try my best to figure out how to manage everything.
c) I can't fit in all the work that's required; I should probably drop the course.
d) The group project will not present any challenges for me; my groupmates are smart enough to figure it out on their own.

Characteristic levels:
1. Accepts and consistently addresses the challenges resulting from his/her unique identity
2. Accepts that there are challenges that result from his/her unique identity but does not consistently address them
3. Acknowledges, but feels like he/she cannot address the challenges that arise from his/her unique identity
4. Does not acknowledge or address the challenges that arise from his/her unique identity

Characteristic: Holistic Perspective

Questions and options:
Q8 (Group Project Scenario): One week after the above scenario, your groupmates ask if you can do the library research for the project. How are you most likely to feel about this request?
a) Confident – you know how to seek out a librarian and use library resources to get the information you need.
b) Slightly Concerned – you have used the library a few times but still need help looking up good references.
c) Very Concerned – you generally don't go to the library and are not sure how to look up references.
d) Fine – you'll look at a few websites and send them to your groupmates to write about.

Q11 (Student Flyer Scenario): Imagine that you are interested in going to the ice cream social, but realize that the event conflicts with one of your class section meetings. In general, how are you most likely to react?
a) You ask your GSI if it is okay to attend another section that week so you can participate in a campus event.
b) You ask your friend ahead of time what you should do about missing section.
c) You go to the ice cream social and ask one of your friends what happened in section.
d) You go to the ice cream social and don't think about your section meeting.

Characteristic levels:
1. Is aware of procedures and protocols, and feels that he/she can successfully navigate campus and community spaces
2. Is aware of, but is still learning how to follow the procedures and protocols necessary to navigate campus spaces
3. Is aware of, but does not know how to follow the procedures and protocols for navigating campus spaces
4. Does not feel that he/she has to abide by certain procedures and protocols for navigating campus spaces and is therefore not aware of how to do so

Measure: Self Reliance

Characteristic: Achievement Standards

Questions and options:
Q13 (Roommate Scenario): In general, my behavior is more like Judy than Dana.
a. Never
b. Almost never
c. Some of the time
d. All of the time

Q19 (Midterm Paper Scenario): Imagine you are the student with the C- in the above scenario, how are you most likely to feel?
a) Motivated - you will re-submit your paper, and you are determined to get at least a B+ or higher.
b) Disappointed - you tried to get a good grade on this paper and will re-submit for a better grade.
c) Concerned - you'll have to do better on the next paper to make sure you pass the class.
d) Fine - a C- is passing and that's all you need to do.

Characteristic levels:
1. Knows how to and frequently evaluates and revises his/her standards of achievement to consistently improve him/herself
2. Is developing his/her standards of achievement to help work toward personal/academic growth
3. Does not have any standards of achievement but has a desire to make personal/academic improvements
4. Does not have any standards for achievement nor sees the need; he/she just tries to get by with minimal effort

Characteristic: Networking Skills

Questions and options:
Q14 (Roommate Scenario): If you were Dana, how would you fill in the blank word bubble in Panel Three?
a. "I can show you how to make a tutor appointment and on the bus ride down, we can make a scheduling timeline and discuss an outline for your paper."
b. "I can show you how to make a tutor appointment and on the bus ride down, we can look at your schedule and make a plan for how you will get your paper done."
c. "Just ask for an extension and next time, don't wait until the last minute."
d. "Forget about it, you'll figure it out when we get back."

Q16 (Career Center Scenario): How likely are you to find yourself in the above scenario?
a) Very Likely – I have already sought out campus resources and asked similar questions.
b) Likely – I would like to go to this kind of a workshop and could see myself asking questions.
c) Somewhat likely – I may go to a workshop, but only if it was something I went to with my team.
d) Not likely at all – I would not go to a Career Center workshop.

Characteristic levels:
1. Has an established network of campus support and has therefore taken several concrete actions (acquiring career development tools, working independently, meeting university requirements, etc.) to achieve his/her goals
2. Has made a few connections with campus networks and is therefore learning what he/she needs to do (acquiring career development tools, working independently, meeting university requirements, etc.) to achieve his/her goals
3. Does not take initiative with seeking out relationships, but is aware that he/she is responsible for making meaningful connections in order to grow
4. Does not seek out meaningful engagement, and would rather others do everything for him/her than figure out what steps he/she needs to take for him/herself

Characteristic: Growth Mindset

Questions and options:
Q15 (Roommate Scenario): If you were Judy, how would you feel about this scenario?
a. Determined – you can reflect on the things that went wrong with your time management in order to make a better plan for the future.
b. Anxious/Stressed – unsure of how to better manage your time but feel like improvements can and should be made.
c. Overwhelmed – there's no way a student athlete can ever manage a difficult course schedule with athletic commitments.
d. Wouldn't really care – student athletes face this situation all the time and manage to survive.

Q22 (Midterm Paper Scenario): Imagine that you miss the one-week deadline to re-submit your C- paper for a higher grade. How are you most likely to feel?
a) Proactive - you identify exactly how your time management system allowed you to miss an important deadline and know that this same mistake won't be repeated.
b) Disappointed - you recognize how and why you missed the deadline and will try not to do it again in the future.
c) Frustrated - you intended to and should have been able to meet the deadline; you are not sure how you missed it.
d) Fine - it was optional to re-submit the paper.

Characteristic levels:
1. Is aware that he/she can grow from setbacks and consistently uses setbacks as means for growth
2. Is aware that he/she can grow from setbacks, and is learning to apply lessons to his/her own personal development
3. Is aware that he/she can grow from setbacks, but does not know how to do so and often feels frustrated
4. Does not learn from setbacks and chooses to ignore them

Characteristic: Seeks Input

Questions and options:
Q17 (Career Center Scenario): If you were the student in this scenario, how are you most likely to respond in Panel 3?
a) "Yes, thank you. Do you think you and I could set up an appointment to talk about what I might ask when I call or email?"
b) "Yes, thank you. I will try to call a few alumni."
c) "No, that's okay. I think I can ask my friends if they know about any internships or just look for some on my own."
d) "No, that's okay. I'm not sure what I would ask alumni."

Q20 (Midterm Paper Scenario): Imagine that you receive a C- and decide to re-submit your paper. What are you most likely to do next?
a) Go to instructor office hours and incorporate all feedback, then seek out tutorial support to make sure your revisions make sense and improve the paper.
b) Incorporate the instructor comments you understand and go to instructor office hours to ask about the comments you do not understand.
c) Look at the instructor comments and incorporate the suggestions that you understand, but you don't go to office hours.
d) Make a few changes based on what you think was wrong with the paper, but you do not consult the instructor or look at instructor feedback.

Characteristic levels:
1. Seeks input and feedback from others and uses it to work toward his/her own academic goals
2. Seeks input and feedback from others and is trying to apply this knowledge to his or her own development
3. Sometimes seeks input and feedback from others, but does not apply this to his/her own development
4. Does not seek feedback from others, and does not think it is helpful to do so

Characteristic: Learns from Others

Questions and options:
Q18 (Career Center Scenario): Imagine that you leave the Career Center workshop feeling like you still don't know how to get a summer internship that is right for you. What are you most likely to do next?
a) Contact an advisor or career center staff for further direction on how to find a summer internship that matches your interests.
b) The next time you see a friend in your same major, you ask them about their summer internship plans.
c) You think about the people you know with summer internships who might be able to help you, but you don't actually contact them.
d) Probably nothing – you don't think anyone else will have useful information about how to find a summer internship.

Q21 (Midterm Paper Scenario): Imagine your instructor gives you the name and email address of a student that got a high grade on the midterm paper and recommends you do a peer review (exchanging papers) before submitting your final paper. What are you most likely to do?
a) Exchange final papers with the other student and then meet to go over specific feedback and general writing tips.
b) Exchange final papers with the other student and incorporate the feedback you understand.
c) Next class, ask the other student to do a peer review, but then forget to actually exchange before turning in the final paper.
d) Forget about the suggestion and never contact the other student.

Characteristic levels:
1. Learns from others in a way that helps his/her own personal development
2. Recognizes that others have skills and talents that he/she can learn from and is learning to tap into them
3. Does not know how to learn from the skills and talents of others, but thinks others may have useful information
4. Does not know how to learn from the skills and talents of others and does not think that others may have skills/talents that he/she can learn from
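The characteristic levels above define an ordered outcome space, and pilot responses must be converted to those ordinal levels before they can be summarized or analyzed. The following is an illustrative sketch only: the item groupings follow the appendix (e.g., Q1 and Q12 target Shared Purpose), but the option-to-level mapping shown is a placeholder assumption made for demonstration, not the pilot's actual scoring guide.

```python
# Illustrative only: a hypothetical scoring pass that converts option letters
# for the outcome-space items into ordinal characteristic levels.
# ASSUMPTION: options a-d are mapped to levels 1 (highest) through 4 (lowest);
# the actual scoring guide is not reproduced here.
OPTION_TO_LEVEL = {"a": 1, "b": 2, "c": 3, "d": 4}

# Items grouped by the characteristic they target (subset shown).
CHARACTERISTIC_ITEMS = {
    "Shared Purpose": ["Q1", "Q12"],
    "Recognize Group Dynamics": ["Q2", "Q6"],
    "Growth Mindset": ["Q15", "Q22"],
}

def score_respondent(responses):
    """Map a respondent's option letters to ordinal levels per characteristic."""
    scored = {}
    for characteristic, items in CHARACTERISTIC_ITEMS.items():
        scored[characteristic] = [
            OPTION_TO_LEVEL[responses[item]] for item in items if item in responses
        ]
    return scored

# Hypothetical respondent
print(score_respondent({"Q1": "b", "Q12": "a", "Q2": "c",
                        "Q6": "b", "Q15": "a", "Q22": "d"}))
```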