PROFESSIONAL FILE ARTICLE 139

© Copyright 2016, Association for Institutional Research

MINING TEXT DATA: MAKING SENSE OF WHAT STUDENTS TELL US

John Zilvinskis
Greg V. Michalski

About the Authors
John Zilvinskis is a doctoral student in the Higher Education and Student Affairs (HESA) program at Indiana University. Greg V. Michalski is Director of Institutional Analytics and Research, Florida State College at Jacksonville.

Abstract
Text mining presents an efficient means to access the comprehensive amount of data found in written records by converting words into numbers and using algorithms to detect relevant patterns. This article presents the fundamentals of text mining, including an overview of key concepts, prevalent methodologies in this work, and popular software packages. The utility of text mining is demonstrated through description of two promising practices and presentation of two detailed examples. The two promising practices are (1) using text analytics to understand and minimize course withdrawals, and (2) assessing student understanding and depth of learning in science, technology, engineering, and mathematics (STEM) (physics). The two detailed examples are (1) refining survey items on the National Survey of Student Engagement (NSSE), and (2) using text to create a learning analytics system at a community college (City University of New York [CUNY]: the Stella and Charles Guttman Community College, or CUNY Guttman). Results of this study include identification of additional item choices for the survey and discovery of a relationship between e-portfolio content and academic performance. Additional examples of text mining in higher education and ethical considerations pertaining to this technology are also discussed.

Examples of text data accessible to IR staff are
• Application essays,
• Written assignments,
• Open-ended survey responses,
• Course Management Software (CMS) postings,
• Student blogs,
• Course evaluations,
• Surveys, and
• E-portfolios.

IR professionals recognize the depth of text data that are—or potentially can be—collected, but might be uncertain how to process those data and use them for campus research informing decision-making. This article presents an overview of the concepts and strategies of text mining and text analytics, and acquaints the reader with the terminology, methodology, and software associated with this technology. Two examples using text data in higher education illustrate how mining can be carried out. It is hoped that the reader will develop a fundamental understanding of text mining, be able to suggest how it can be used to aid decision-makers, and understand the advantages as well as the constraints of this approach to data analysis.

FRAMING THE ISSUE OF TEXT MINING
Students generate copious amounts of thick, rich data; however, these data often go unexamined because the traditional qualitative methodologies used to examine thousands of submissions require extensive resources. Text mining (the machine coding of text with the goal of integrating converted submissions with quantitative methods) offers timely, accurate, and actionable assistance (Zhang & Segall, 2010). This article presents detailed information on how text mining can be used by staff who work in institutional research (IR) and collect, but often are forced to neglect, text-based data.

FALL 2016 VOLUME | PAGE 1

BASIC CONCEPTS
Before describing how the researcher would approach text data, it is important to distinguish between data mining and analytics. Data mining often implies that the researcher is employing algorithms to explore large data sets (Baker & Yacef, 2009; Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Analytics incorporates data mining techniques to create "actionable intelligence," meaning information that guides decision-making (Campbell, DeBlois, & Oblinger, 2007, p. 42). This information is used to predict case-level behavior that will guide intervention (van Barneveld, Arnold, & Campbell, 2012). In education, learning analytics takes the form of collecting real-time data to measure the effectiveness of teaching practices for a particular student, and of suggesting intervention in near real time when those practices are not effective (Suthers & Verbert, 2013). As will be described later, the timing of when data are collected, processed, implemented, and acted on is critical in moving from data mining to learning analytics (Arnold & Pistilli, 2012). This article uses similar definitions when describing text mining and text analytics.

A variety of related definitions for text mining can be found in the literature. The following definition is an adaptation of data mining that emphasizes the type of data being mined: "the discovery of useful and previously unknown 'gems' of information from textual document data repositories" (Zhang & Segall, 2010, p. 626). Other definitions involve the use of specific methodological/technological approaches: "Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics" (Singh & Raghuvanshi, 2012, p. 139).

The rationale for text mining stems from "the need to turn text into numbers so powerful algorithms can be applied to large document databases" (Miner et al., 2012, p. 30). According to Miner et al., text mining and text analytics are broad umbrella terms describing a range of technologies for analyzing and processing semi-structured and unstructured text data. Text mining is the practical application of many techniques of analytical processing in text analytics.

There are several ways that researchers can use software to access, label, and study text data. Familiar to anyone who has used an Internet search engine, information access uses recall systems common in most text-mining approaches; however, because it generates no new information, this practice is excluded from true text mining status (Hearst, 1999). Another step toward text mining is the categorization of text (such as categorizing libraries or academic journals), which can be—in its own right—a means to mine text. Similar techniques can be used in the processes of clustering documents and mining Web content. Beyond pulling specific words or clustering proximity of terms, researchers also intend to extract meaning from the text they study. Those who study computational linguistics contribute to text mining by developing algorithms (a subfield called natural language processing) to measure sentiment and meaning from text. Educational researchers can utilize text in a unique way to measure student learning of important concepts or reflection on developmental milestones (Baker & Yacef, 2009).

Researchers familiar with qualitative research might be skeptical about the effort of those who work in natural language processing and wonder, "Why wouldn't the researcher simply perform traditional qualitative methods with text data?" Despite the development of intelligent graders and machine learning, computer programs do not have the capacity to interpret the nuance of writing at the level of the human brain. Nonetheless, there are several reasons why text mining is a viable research technique.

First, data sets can include thousands, millions, or billions of text submissions, precluding the use of traditional qualitative techniques. Second, even if the text data were of a size that could be reviewed by researchers, hiring and training coders increases the resources (time and money) needed to complete the process; for most IR departments, hiring a team of coders might not be practical. When analytics projects are used to identify students who need support, the resulting information can come too late to provide that support if data coding, processing, and interpreting take too long, and having coders introduces the need to account for inter-rater reliability. Third, text data are converted into numeric values so that they can be combined with other sources of data in statistical models built to predict student behavior. Text mining allows for the sizable and efficient processing of text data.
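The core move named above, turning text into numbers, can be illustrated with a minimal sketch. The snippet below is written in Python for illustration only; the student comments are invented sample data, not drawn from any study cited here:

```python
from collections import Counter
import re

def to_term_counts(document):
    """Convert one free-text submission into numeric term counts."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)

# Two invented student comments (hypothetical data, for illustration only).
comments = [
    "I withdrew because my work schedule changed.",
    "Work and family made the schedule impossible.",
]
counts = [to_term_counts(c) for c in comments]

# Once the text is numeric, ordinary algorithms apply; for example,
# tallying how often "schedule" appears across all submissions.
total_schedule = sum(c["schedule"] for c in counts)
```

From here, the counts can feed any statistical routine, which is precisely the rationale quoted from Miner et al. (2012).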

THE PROCESSES OF MINING TEXT
In their book Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Miner et al. (2012) present a comprehensive step-by-step guide for text mining. Their text-mining methodology is very similar to other types of research: the researcher has to define the purpose of the project, manage the data (seek, organize, clean, and extract model data), evaluate the results, and, finally, disseminate the results. The aspects of data management (organizing, cleaning, and extracting data) are particular to text mining.

First, the researcher generates a corpus, or a collection of documents or cases in which the desired text exists. During this phase, the researcher removes all nonessential information, such as e-mail addresses or Web links. Second, the researcher either uses an established "stop or include" word list or creates one that filters out words that lack informational value (the, a, his) while highlighting words that do have value. This phase also includes limiting the number of terms by accounting for inflection (plural vs. singular, past vs. present) or the word root (teach for teaching, teacher, teaches). Third, the researcher organizes the terms within cases. This can be done by using a simple binary notation (1 = term is present) or by noting semifrequent but unique terms (i.e., terms that are featured in some, but not all, of the cases). Another way to organize the terms is by using singular value decomposition (SVD), which reduces the input matrix to a smaller version representing the variability of each case (Berry & Kogan, 2010; Manning & Schütze, 1999). This latter method is important when working with large text data sets that otherwise could take a long time to process.

Once the researcher has organized the data, she can use several ways to extract important information from these cases (Miner et al., 2012):
• Classification. The researcher creates a dictionary to organize terms based on their definition and hierarchical connection; for example, she might file "classes" under the domain of "education."
• Clustering. The researcher groups terms based on the frequency and pattern of their use, compared to the number of students who use those terms.
• Association. The researcher examines the use of text in connection with some event that is occurring. For example, she might want to compare the positive or negative descriptions of faculty teaching, as reflected in course evaluations before and after final grades are posted.
• Trend analysis. The researcher measures the change of text responses to an identical prompt over time.

SOFTWARE TO CONSIDER
Numerous free and commercial software programs are available for text mining. RapidMiner is an open source analytics program featuring tools that include the analysis of text data. The visual point-and-click nature of the interface allows non–computer programmers to access, clean, and analyze their data. The RapidMiner Web site includes numerous resources, and the user community has posted helpful videos on YouTube. The text extensions include easy-to-use functions such as the ability to group documents based on term frequency. Although the base-level package is free, users might want to purchase the professional package that includes more-advanced options. RapidMiner provides sizable discounts to researchers who use its software, and also provides an extension that allows for the export of data into Tableau. This is an ideal program for IR professionals who want to begin experimenting with text data.

Waikato Environment for Knowledge Analysis (WEKA) is another open source program available for download. WEKA uses a system of algorithms through Java to perform analytic functions; Keyphrase Extraction Algorithm (KEA), on the other hand, is an extension of that project focused on text mining. Both WEKA and KEA were developed at the University of Waikato, New Zealand. WEKA is a premier program for machine learning, a field of computer science that emphasizes the development of programs that can recognize patterns and self-evolve, which is relevant in text mining, where programs recognize and adapt to changes in text.
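The data-management steps just described (removing nonessential information from the corpus, filtering stop words, folding inflected forms to a root, and coding each case in binary notation) can be sketched in miniature. Everything below is invented for illustration, and the suffix-stripping `root()` function is a deliberately crude stand-in for the real stemming a production tool would perform:

```python
import re

# A small, invented stop word list (a real project would use an established one).
STOP_WORDS = {"the", "a", "an", "at", "was", "with", "and", "i", "my"}

def clean(document):
    """Remove nonessential information such as e-mail addresses and Web links."""
    document = re.sub(r"\S+@\S+", " ", document)       # e-mail addresses
    document = re.sub(r"https?://\S+", " ", document)  # Web links
    return document

def root(term):
    """Crude inflection folding (teach for teaching, teacher, teaches);
    a stand-in for real stemming, which handles many more cases."""
    for suffix in ("ing", "ed", "er", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def binary_vector(document, vocabulary):
    """Simple binary notation: 1 = term is present in the case."""
    terms = {
        root(t)
        for t in re.findall(r"[a-z]+", clean(document).lower())
        if t not in STOP_WORDS
    }
    return [1 if v in terms else 0 for v in vocabulary]

# An invented two-case corpus (hypothetical course comments).
corpus = [
    "Great teacher, see the syllabus at https://example.edu/phys101",
    "The teaching was rushed; email prof@example.edu with questions",
]
vocabulary = ["teach", "rush", "great"]
matrix = [binary_vector(doc, vocabulary) for doc in corpus]
# matrix is now a cases-by-terms table ready for statistical modeling.
```

Note how "teacher" and "teaching" land on the same root, so both cases register the term "teach" even though neither uses that exact word.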

KEA has a clean function for recognizing text within a corpus. However, the program does not have a graphical user interface and therefore is most accessible to those with computer science backgrounds. Also, although the WEKA Web site continues to collect publications from researchers using the system, as of this writing the KEA Web site has not been updated for some time and has limited resources.

Most IR professionals are familiar with R as a program, but they might not be fluent in the programming needed to make it work. R is a premier statistical software, in part because it is free, but also because it has been adopted by a large contingent of statisticians worldwide. There are numerous books and articles on how to use R for text mining; however, between the time it would take staff to become expert in the use of R and to teach themselves how to program text mining, the software can lose its advantage. Nonetheless, for staff familiar with the program, it can be a great platform to begin to explore and experiment with text data.

The commercial products do distinguish themselves from the free software. IBM SPSS Modeler Premium offers text mining in addition to the analytics suite; the suite includes other data analysis processes, including variable selection, various analysis applications, and visualization options. Like most SPSS platforms, the interface is extremely user friendly, probably to a fault, considering how statistical analysis is reduced to a few button clicks. However, for the purposes of text mining, the software is quite advanced. The text function has a built-in dictionary that classifies terms and allows the user to create a dictionary. This function creates a hierarchy of similar terms and places them into broad categories such as "athletics," "emotion," and "education." The application also has built-in dictionaries that include a classification scheme for terms in areas such as customer satisfaction. The natural sorting of terms, along with the ease of creating a dictionary, allows for comprehensive and easy classification of text data.

SAS® Enterprise Miner™ software also has a user interface and offers comprehensive tools for organizing text and clustering cases.1 Like IBM SPSS Modeler Premium, SAS Enterprise Miner software is an analytics suite that includes some text-mining applications. However, this product distinguishes itself from others in that the clustering of terms is a component of the text-mining function. Often programs will reduce terms to dichotomous values (not-present vs. present) and then employ clustering methods afterward. SAS Enterprise Miner software allows the user to augment the clustering parameters within the text-mining function and creates data visualizations that are more compelling because of their use of text-mining terms. These features within SAS Enterprise Miner software offer a comprehensive way to cluster terms for analysis.

PROMISING PRACTICES
In higher education there are numerous opportunities to mine text data to predict important student behavior. For example, application essays can be mined to predict student matriculation or even retention. These data can be incorporated into an analytics project aimed at awarding the appropriate amount of financial aid needed to secure a student's enrollment. Another way in which text mining could be used would be to measure virtual student participation in course management software, such as evaluating students' contributions in a course message board within a learning management system like Blackboard or Canvas. Already researchers are using data to model and predict students' academic performance and assign interventions, such as e-mail notifications, conversations with advisors, and alerts to faculty (Arnold & Pistilli, 2012). These efforts can be enhanced with the analysis of text data.

Using Text Analytics to Understand and Minimize Course Withdrawals
Two current and promising applications of text analytics involve the areas of student course retention and course outcomes. In mining open-ended comments

1 SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

captured from students withdrawing from college courses, Michalski (2014) produced and tested a model that succeeded in categorizing over 95% of student explanations for course withdrawal decisions. This model includes 11 major categories for course withdrawal (i.e., reasons) and corresponds well to existing theoretical and empirical research in the area of college course withdrawal. Of the 11 categories, the top three coding categories were (1) time–schedule, (2) job–work, and (3) personal–other reasons provided by students. Other categories included finances, health, family, course/faculty negative, and online course (mentioned by students who stated their desire to take the same course via instructor-led, rather than online, delivery). Subsequent research (Michalski, 2015) further demonstrated how output from the resulting text model can be statistically analyzed using selected quantitative procedures (including Hierarchical Agglomerative Cluster Analysis, Principal Components Analysis, and Multiple Correspondence Analysis) for validation (i.e., creating clusters of terms describing course withdrawal that are both mathematically and conceptually related). These results are currently being used to develop a course Re-enrollment Assessment Online (REASON) process to minimize course withdrawals, in part by identifying and providing appropriate support, services, advising, and other assistance to encourage and assist student course reenrollment decisions.

Assessing Student Understanding and Depth of Learning in STEM
A second promising example of the application of text mining is the analysis of student responses to open-ended assessment questions at Michigan State University's (2015) Collaborative Research in Education, Assessment and Teaching Environments for the fields of science, technology, engineering and mathematics (CREATE for STEM). There, researchers analyzed student responses to open-ended test questions and related these answers to course outcomes. Within this project at Michigan State, as part of a National Science Foundation grant, Park, Haudek, and Urban-Lurain (2015) used IBM SPSS Modeler to mine the text of short-answer physics test questions about the course topic "energy." The purpose of this study was to explore the degree to which term use is associated with overall knowledge of energy-related constructs. The researchers were able to classify terms used in open-ended responses as reflecting either surface-level understanding or scientific ideas. Not surprisingly, students who wrote using scientific ideas were more likely to answer corresponding multiple-choice questions correctly. Innovative uses of text mining like these allow for a more robust understanding of student learning and assist in test design.

TWO DETAILED EXAMPLES
This article now demonstrates the utility of text mining through presentation of two detailed examples: (1) refining survey items on the NSSE, and (2) relating e-portfolio text to student performance at a community college (CUNY Guttman).

Example One: Classifying Open-Ended Survey Questions
When creating a survey, researchers strive to develop closed-ended items that present all of the possible answers for a given question (Dillman, Christian, & Smyth, 2014). Answering open-ended questions requires more mental energy of the respondent and can lead to answers that are more difficult for the researcher to process than closed-ended items. It can be challenging for institutional researchers to develop a list of all the possible survey responses to a particular question. What do researchers do if they are not sure that all possible responses are presented? A simple answer is to create a text box next to an "other" option where a respondent can write in an answer. Using text mining, the researcher can organize these write-ins to create a more comprehensive list of options.

In the 2014 administration of the NSSE, an experimental item set was developed to identify types of leadership positions held by students. In the development of this survey item set, researchers wanted to know how often students said they had held a formal leadership position in the core survey, as compared to how often they identified serving in a specific

leadership role later in the survey. Of the students who responded to the items, 1,482 of 4,836 (31%) wrote in a leadership position not listed in the core survey item. Text mining was used to determine (1) what new leadership roles should be added to the survey set and (2) the degree to which responses to the leadership item in the core survey related to answers in the experimental item set.

Write-in entries were classified using the text-mining capability of IBM SPSS Modeler Premium. Of the 1,482 students who entered a response, 830 (56%) entries were summarized into eight categories. For example, the concept "teaching assistant" included variations in term, such as "teaching assistant," "teachering assistant," and "teacher's assistant." The top eight leadership positions written into the open-ended text box, and their percent of total written comments, are shown in Table 1.

Table 1. Counts and Percentage of Write-ins of Additional Leadership Positions Added Through Text Mining

Position              n     %
Tutoring              145   9.8%
Teaching assistant    87    5.9%
Research assistant    60    4.0%
Secretary             55    3.7%
Treasurer             57    3.8%
Mentor                54    3.6%
Member                51    3.4%
Editor                25    1.7%

The researchers then calculated how well these positions represented the respondents' understanding of a "formal leadership position." In the core survey, respondents were asked if they had completed (or were in the process of completing) a formal leadership position. By comparing the number of students who identified one of these positions to the number of students who said they had participated in a formal leadership position, the researchers had a better understanding of the positions that constitute "formal leadership" from the perspective of the respondents. For example, students who wrote in "secretaries," "treasurers," and "editors" were more likely to have said they had completed a leadership role.

As a result of this study, the response options "Instructor or Teaching Assistant," "Tutor," and "Editor" were added as leadership positions for the second administration of this experimental item set. Researchers considered installing skip logic that would allow only affirmative responses to the formal leadership item on the core survey to access the experimental item set. However, since a large number of respondents who held positions (both the original and write-in options) had not reported completing a formal leadership experience in the core survey, the skip logic was not included. In this instance, text mining allowed the researchers to expand the item bank and improve the survey design.

Example Two: Clustering E-Portfolio Submissions
As part of a Bill and Melinda Gates Foundation–funded project, researchers with CUNY Guttman were interested in mining first-year e-portfolio introductions to see if student text data could predict academic performance. All students were required to attend a summer bridge program prior to their first semester. During that program, students began their e-portfolios by writing introductions of themselves. Some submissions were informal, reading like an online "shout out" instead of an academic piece of work. Other submissions described ambitions and backgrounds.

Using SAS Enterprise Miner, terms from 163 student e-portfolio introductions were clustered to predict student outcomes such as the number of credit hours earned and GPA. The terms were grouped using a K-cluster analysis within the Enterprise Miner suite; groups of terms were given names by researchers based on their seemingly shared concept. These concepts, which aligned with student development theory, emerged from the terms; for example, worrying about making friends (e.g., shy, person, friend, know, quiet) and commitment to society (e.g., social, worker, work, believe, help). These results are summarized in Table 2.

Table 2. Results from Text Clustering of Student E-Portfolio Introductions

Concept                          Clustered terms
Connection to family             family, york,* high school, college, child
Learning                         class, teacher, art, math, subject
Everyday                         know, day, love, life
College participation            high school, school, attend, guttman
Gaming                           game, movie, favorite, watch, video
Worrying about making friends    shy, person, friend, know, quiet
Recreation                       art, basketball, play, sport, travel
Commitment to society            social, worker, work, believe, help
Technology integration           technology, information, art, health, mind
Aspirations to work in business  guttman, business, manhattan, administration, graduate

Note: * CUNY Guttman is located in Manhattan, so it might be common for students who are describing their connection to family and education to include their connection to New York City.
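To make the clustering behind Table 2 concrete, here is a toy version of the idea. All term-usage patterns below are invented, and a plain k-means routine stands in for the K-cluster analysis actually run in SAS Enterprise Miner; the point is only that terms are grouped by the similarity of the cases in which they appear, after which the researcher names each group:

```python
# Invented binary usage of six terms across five hypothetical introductions
# (1 = the student used the term), echoing two concepts from Table 2.
usage = {
    "shy":    [1, 1, 0, 0, 0],
    "friend": [1, 1, 0, 0, 0],
    "quiet":  [1, 0, 0, 0, 0],
    "social": [0, 0, 1, 1, 0],
    "help":   [0, 0, 1, 1, 1],
    "worker": [0, 0, 0, 1, 1],
}

def dist(a, b):
    """Squared Euclidean distance between two usage patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, rounds=10):
    """Plain k-means with a deterministic start (adequate for a toy example)."""
    centroids = vectors[:: len(vectors) // k][:k]
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: dist(v, centroids[i]))
            groups[nearest].append(v)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

terms = list(usage)
vectors = [usage[t] for t in terms]
centroids = k_means(vectors, k=2)

# Assign each term to its nearest centroid; the researcher then names the
# resulting groups (e.g., "worrying about making friends").
clusters = {}
for term, v in zip(terms, vectors):
    label = min(range(2), key=lambda i: dist(v, centroids[i]))
    clusters.setdefault(label, []).append(term)
```

With real data, each of the 163 introductions would contribute a column, and a term's cluster membership becomes a unit of analysis that can enter later statistical models.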

Variables were coded dichotomously based on whether the term was contained in a concept's cluster. In an ordinary least squares (OLS) regression analysis, after controlling for college preparation (SAT and writing proficiency scores) and age, a relationship (p = 0.06) was found between the concept "connection to family" and credit hours earned that fall. Students who used terms within this concept earned fewer credit hours than students who did not. Besides identifying the topics that students write about in their e-portfolio introductions, the results of this research identified one concept—connection to family—that predicted the number of credit hours earned. In this case, students who wrote about connection to family earned fewer credit hours than students who did not. This finding is consistent with the Theory of Student Departure, in which Tinto (1987) argues that family obligations can lead to student attrition. The ultimate goal of this project was to explore whether text mining could be used in learning analytics at CUNY Guttman. A limitation of this study is that the institution has a small enrollment, so many of the principles of big data (specifically, a large number of cases or students) will not apply. However, by identifying which types of information (such as e-portfolio data) are predictive of student success, researchers at this institution can create analytic models that are accurate using few variables.

IR professionals often have access to text data such as admissions essays, answers to online tests, and e-portfolio submissions, which are often not incorporated in data analysis because they are too difficult to synthesize compared with basic metrics such as SAT score or GPA. However, clustering terms from these corpora can help to identify themes within these documents. Although clustering terms requires more mathematical training, and the naming of the concepts is subjective (as it is in factor analysis), the mathematical grouping of terms provides insight into what students are writing about while providing units of analysis that can be included in models to predict student success. Clustering student text provides a means to harness these often-overlooked data.

CONSIDERATIONS FOR IMPLEMENTATION
Text mining has the possibility of being a featured tool in analytics projects; however, institutional

stakeholders need to be intentional in how they collect and process these data so they can use the information in a timely manner. Processes for data collection, distribution, project implementation, and analysis of results must be coordinated. In any analytic project, timing is important. Text data need to be cleaned, processed, and incorporated into predictive models in time for interventions to occur; because of the quick turnaround, most projects include automation.

Another aspect to consider is the data source or origin. The two examples in this article illustrate different sources of text data. In the first example, asking respondents to submit one "other" leadership position seems straightforward and easy to sort; however, there can be issues with classifying even these basic data. For example, some responses do not easily or naturally fit into a category, while others might be placed in more than one category. In the second example, researchers were mining students' introduction statements for specific information, but the prompts might have been too vague. Alternatively, prompts can be too prescriptive. Contrast the prompt "What do you think it takes to succeed in college?" with "Who will you ask to mentor you so you will be successful in college?" Similar to survey item design, the creation of prompts for text mining might develop into its own science. (For information regarding the intersection of text mining and survey question design, see George et al., 2012.)

Another aspect to consider when using text data for predictive models is the use of sentiment analysis: Is it only important to know what a student is writing, or is it also important to understand how a student feels about a particular topic? There might be text sources, such as student course evaluations or satisfaction surveys, where it would be important to consider sentiment. In these cases, the sentiment analysis component of customer satisfaction software can be implemented to measure student attitude when describing certain phenomena as positive, negative, or indifferent. IBM SPSS Modeler Premium has a built-in library for measuring customer satisfaction, allowing for a more refined and adjustable view of individual likes and dislikes. Measuring sentiment might be crucial for stakeholders (e.g., administrators and faculty); therefore, for researchers implementing text-mining projects, the analysis of sentiment might be an important aspect when trying to garner buy-in to approve those projects.

ETHICAL CONSIDERATIONS
As we continue to evolve in this era of analytics, it is not uncommon to hear concerns about how data are being collected and used. Students might believe that their intellectual property or even identity has been exploited when researchers use their text data, some of which can be very personal (Slade & Prinsloo, 2013). Because there is much at stake in terms of student response to analytics, institutions using these technologies need to develop sophisticated policies regarding data ownership, transparency, and the security of data (Pardo & Siemens, 2014). Furthermore, institutions are sometimes ill-equipped to grapple with the more complex issues around the use of data within analytics, such as what to do when unexpected outcomes occur, for example when predictive models misclassify student potential (Willis, 2014). All of these factors reflect a campus culture in which administrators act ethically in using data to improve student success, work to ensure that students and faculty are valued partners with technologies, and create an environment that views these technologies as advancing those aims (Willis, Campbell, & Pistilli, 2013). These concerns over analytics also relate to the use of text mining.

One of the benefits of text mining is that it presents a bridge between the atheoretical process of data mining/analytics and the more theoretically driven approaches used in higher education research. It can be unnerving for campus stakeholders, including faculty, administrators, and students, to rely on the black box of analytics. Text mining raises ethical concerns about the results of analytics processes. What is the implication when an algorithm predicts that a student is less likely to pass a course or persist within a particular major? What aspects of a student's background and institutional environment are either being overlooked or considered in making these predictions? Furthermore, institutional policy is traditionally built on empirical studies of students that reference some theoretical framework, or at least a plausible hypothesis. For example, first-generation college students are disadvantaged, and therefore

PAGE 8 | FALL 2016 VOLUME

administrators argue that they should receive additional resources as a matter of policy. Gathering stakeholders to support this one aspect of identity is easy because it is an understandable concept: “Students who did not grow up in a home where at least one parent completed college are at a disadvantage compared with students who did.” However, in an analytic project, first-generation status might be one variable among many used to predict student success. These complex models replace an understandable narrative, such as the one related to first-generation status, with a narrative that is reliant on numerous factors. On the other hand, the results of text mining can be easily interpreted. In the second example of this study, there was an inverse relationship between a student mentioning connection to family and earning credit hours. An educator meeting with students and reviewing e-portfolio introductions would find it easy to act on this information.

There are other institutional cultural aspects to consider when implementing text-mining programs on campus: students could react to the common knowledge that all of their submissions on the campus management system are being mined. Students could change their responses once they realize that their text is being mined. In the second example of this study, it might be the case that upper-class students will warn incoming first-years, “Make sure not to mention connection to your family in your e-portfolio introduction; otherwise, you’ll have to schedule an additional meeting with your advisor.”

When considering equity, institutional stakeholders need to take the increasingly heterogeneous nature of student demography into consideration. Researchers need to account for diverse levels of familiarity with English, resources, and prior education when working with student text data. Simply relying on text data without regard to student background variables can mislead: students who have better educational preparation or more resources may be advantaged in the types of phenomena they describe in text. If so, text mining could be used to reinforce inequity within higher education. These aspects, though not the focus of this article, ask whether the researcher should mine text data. Researchers need to consider that question at length before they consider whether they are able to mine text data.

PUTTING TEXT MINING TO USE

This article describes the ways in which text mining can be implemented in the work of institutional researchers who are asked to create analytics programs on their campuses. The article described two promising application areas and detailed two examples of text-mining projects. Important considerations for any text-mining project are the context and timing of text collection. Although text mining offers several promising avenues within higher education, ethical considerations should provoke deep questions about the appropriateness of these methods balanced with the needs of the institution.

Like many projects within an IR shop, successful text mining requires multiple areas of expertise (Haight, 2014). First, expertise in data acquisition and management is needed to procure and then clean data (e.g., to remove extraneous text) while pairing the data with other sources of student information. Second, someone must mine and model text data into predictive schemes that are statistically sound. Third, expertise in graphic design or data visualization is needed to successfully communicate the connection between data source, usable text, and student outcomes. Fourth, as described earlier, automation (i.e., creating systems to automatically collect, clean, and process data) is paramount in any analytics process, so a team member needs to be in charge of this aspect of the project. Finally, text-mining projects need a team champion to advertise and demonstrate the value of this technology and to promote it to campus stakeholders. Although these roles might not require multiple team members, these skills are essential to integrating text mining into campus decision-making. Text mining presents a microcosm of other IR processes while offering an exciting way to make use of the dense text data waiting to be unlocked. As a field, IR is in a unique position: text data have been collected for decades, and the complex decision-making on college campuses can only be further informed by the inclusion of those data.
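To make the areas of expertise described above concrete, the sketch below illustrates the first two steps: cleaning raw submissions and converting them into a simple model-ready feature (here, a flag for whether an e-portfolio introduction mentions family, echoing the second example). The records, keyword list, and credit-hour values are invented for illustration; an actual project would draw on the campus CMS and student records, and would use richer, validated text features rather than this naive keyword match.

```python
import re
from statistics import mean

# Hypothetical e-portfolio introductions paired with credits earned;
# in practice these would come from the campus CMS and student records.
records = [
    {"text": "<p>My family   supports me in everything.</p>", "credits": 9},
    {"text": "I want to study nursing and join clubs.", "credits": 15},
    {"text": "My mom and my brothers inspire me.", "credits": 6},
    {"text": "I plan to transfer and earn a bachelor's degree.", "credits": 12},
]

# Illustrative keyword list (an assumption, not a validated instrument).
FAMILY_TERMS = {"family", "mom", "mother", "dad", "father",
                "brother", "brothers", "sister", "sisters", "parents"}

def clean(text: str) -> str:
    """Data-cleaning step: strip markup and normalize whitespace/case."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove extraneous HTML tags
    return re.sub(r"\s+", " ", text).strip().lower()

def mentions_family(text: str) -> bool:
    """Derive a binary text feature via simple keyword matching."""
    return any(tok in FAMILY_TERMS for tok in re.findall(r"[a-z']+", clean(text)))

# Compare average credits earned across the text-derived flag.
flagged = [r["credits"] for r in records if mentions_family(r["text"])]
others = [r["credits"] for r in records if not mentions_family(r["text"])]
print(mean(flagged), mean(others))  # → 7.5 13.5
```

In the made-up data the flagged group averages fewer credits, mimicking the kind of inverse relationship reported in the second example; a real analysis would of course test such a pattern statistically and control for background variables, as discussed above.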

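The sentiment measurement discussed earlier can likewise be sketched with a minimal lexicon approach: count positive words minus negative words in each open-ended comment. The word lists and comments below are illustrative assumptions only; production work would rely on an established sentiment lexicon or a dedicated text-analytics library.

```python
# Illustrative (hypothetical) word lists for course-evaluation comments.
POSITIVE = {"helpful", "great", "clear", "engaging", "supportive"}
NEGATIVE = {"confusing", "boring", "unfair", "slow", "unhelpful"}

def sentiment_score(comment: str) -> int:
    """Positive-word count minus negative-word count; >0 leans positive."""
    words = [w.strip(".,!?") for w in comment.lower().split()]
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

comments = [
    "The instructor was helpful and the labs were engaging.",
    "Lectures were confusing and the pacing was slow.",
]
print([sentiment_score(c) for c in comments])  # → [2, -2]
```

Even a crude score like this can help summarize thousands of comments for stakeholders, which is one way to garner the buy-in described above.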
REFERENCES

Arnold, K. E., & Pistilli, M. D. (2012, April). Course signals at Purdue: Using learning analytics to increase student success. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 267–270). doi:10.1145/2330601.2330666

Baker, R. S. J. D., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3–17.

Berry, M. W., & Kogan, J. (Eds.). (2010). Text mining: Applications and theory. Hoboken, NJ: Wiley & Sons.

Campbell, J. P., DeBlois, P. B., & Oblinger, D. G. (2007). Academic analytics: A new tool for a new era. EDUCAUSE Review, 42(4), 40–57.

Dillman, D. A., Christian, L. M., & Smyth, J. D. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Hoboken, NJ: Wiley & Sons.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.

George, C. P., Wang, D. Z., Wilson, J. N., Epstein, L. M., Garland, P., & Suh, A. (2012, December). A machine learning based topic exploration and categorization on surveys. In 11th International Conference on Machine Learning and Applications (Vol. 2, pp. 7–12). doi:10.1109/ICMLA.2012.132

Haight, D. (2014). The five faces of analytics. Dark Horse Analytics. http://www.darkhorseanalytics.com/blog/the-five-faces-of-analytics

Hearst, M. A. (1999, June). Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 3–10). doi:10.3115/1034678.1034679

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Michalski, G. V. (2014). In their own words: A text analytics investigation of college course attrition. Community College Journal of Research and Practice, 38(9), 811–826.

Michalski, G. V. (2015, August). Using analytics to minimize student course withdrawals. Paper presented at the Association for Institutional Research (AIR) Annual Forum, Denver, CO. http://www.forum.airweb.org/2015/Documents/Presentations/1095_6bd9881f-0d68-457f-8557-8da3d450cbb8.pdf

Michigan State University. (2015). Home page. CREATE for STEM Institute. http://create4stem.msu.edu/

Miner, M., Delen, D., Elder, J., Fast, A., Hill, S. T., & Nisbet, B. (2012). Practical text mining and statistical analysis for non-structured text data applications. Waltham, MA: Academic Press.

National Survey of Student Engagement (NSSE). (2014). NSSE 2014 experimental items codebook: Formal leadership. http://nsse.indiana.edu/pdf/exp_items/2014/NSSE%202014%20Exp_FOL_Codebook.pdf

Pardo, A., & Siemens, G. (2014). Ethical and privacy principles for learning analytics. British Journal of Educational Technology, 45(3), 438–450.

Park, M., Haudek, K., & Urban-Lurain, M. (2015, April). Computerized lexical analysis of students’ written responses for diagnosing conceptual understanding of energy. Paper presented at the National Association for Research in Science Teaching (NARST) 2015 Annual International Conference. http://create4stem.msu.edu/publication/3361

Singh, P. D., & Raghuvanshi, J. (2012). Rising of text mining technique: An unforeseen part of data mining. International Journal of Advanced Research in Computer Science and Electronics Engineering, 1(3), 139–144.

Slade, S., & Prinsloo, P. (2013). Learning analytics: Ethical issues and dilemmas. American Behavioral Scientist, 57(10), 1510–1529.

Suthers, D., & Verbert, K. (2013, April). Learning analytics as a middle space. In Proceedings of the Third International Conference on Learning Analytics and Knowledge (pp. 1–4). Leuven, Belgium.

Tinto, V. (1987). Leaving college: Rethinking the causes and cures of student attrition. Chicago: University of Chicago Press.

van Barneveld, A., Arnold, K. E., & Campbell, J. P. (2012). Analytics in higher education: Establishing a common language. EDUCAUSE Learning Initiative, 1, 1–11.

Willis, J. E., III. (2014). Learning analytics and ethics: A framework beyond utilitarianism. EDUCAUSE Review. http://er.educause.edu/articles/2014/8/learning-analytics-and-ethics-a-framework-beyond-utilitarianism

Willis, J. E., Campbell, J. P., & Pistilli, M. D. (2013). Ethics, big data, and analytics: A model for application. EDUCAUSE Review. http://er.educause.edu/articles/2013/5/ethics-big-data-and-analytics-a-model-for-application

Zhang, Q., & Segall, R. (2010). Review of data, text and web mining software. Kybernetes, 39(4), 625–655.
