Automatic Summarization of Student Course Feedback

Wencan Luo†, Fei Liu‡, Zitao Liu†, Diane Litman†
†University of Pittsburgh, Pittsburgh, PA 15260
‡University of Central Florida, Orlando, FL 32716
{wencan, ztliu, litman}@cs.pitt.edu, [email protected]

Abstract

Student course feedback is generated daily in both classrooms and online course discussion forums. Traditionally, instructors manually analyze these responses in a costly manner. In this work, we propose a new approach to summarizing student course feedback based on the integer linear programming (ILP) framework. Our approach allows different student responses to share co-occurrence statistics and alleviates sparsity issues. Experimental results on a student feedback corpus show that our approach outperforms a range of baselines in terms of both ROUGE scores and human evaluation.

Prompt: Describe what you found most interesting in today's class
Student Responses:
S1: The main topics of this course seem interesting and correspond with my major (Chemical engineering)
S2: I found the group activity most interesting
S3: Process that make materials
S4: I found the properties of bike elements to be most interesting
S5: How materials are manufactured
S6: Finding out what we will learn in this class was interesting to me
S7: The activity with the bicycle parts
S8: "part of a bike" activity
... (rest omitted, 53 responses in total.)
Reference Summary:
- group activity of analyzing bicycle's parts
- materials processing
- the main topic of this course

Table 1: Example student responses and a reference summary created by the teaching assistant. 'S1'–'S8' are student IDs.

1 Introduction

Instructors love to solicit feedback from students. Rich information from student responses can reveal complex teaching problems, help teachers adjust their teaching strategies, and create more effective teaching and learning experiences. Text-based student feedback is often manually analyzed by teaching evaluation centers in a costly manner. Albeit useful, the approach does not scale well. It is therefore desirable to automatically summarize the student feedback produced in online and offline environments. In this work, student responses are collected from an introductory materials science and engineering course taught in a classroom setting. Students are presented with prompts after each lecture and asked to provide feedback. These prompts solicit "reflective feedback" (Boud et al., 2013) from the students. An example is presented in Table 1.

In this work, we aim to summarize the student responses. This is formulated as an extractive summarization task, where a set of representative sentences is extracted from the student responses to form a textual summary. One of the challenges of summarizing student feedback is its lexical variety. For example, in Table 1, "bike elements" (S4) and "bicycle parts" (S7), as well as "the main topics of this course" (S1) and "what we will learn in this class" (S6), are different expressions that communicate the same or similar meanings. In fact, we observe that 97% of the bigrams appear only once or twice in the student feedback corpus (§4), whereas in a typical news dataset (DUC 2004) the figure is about 80%. To tackle this challenge, we propose a new approach to summarizing student feedback which extends the standard ILP framework by approximating the co-occurrence matrix with a low-rank alternative. The resulting system allows sentences authored by different students to share co-occurrence statistics. For example, "The activity with the bicycle parts" (S7) will be allowed to partially contain "bike elements" (S4) although the latter does not appear in the sentence. Experiments show that our approach produces better results on the student feedback summarization task in terms of both ROUGE scores and human evaluation.


2 ILP Formulation

Let D be a set of student responses consisting of M sentences in total. Let y_j ∈ {0, 1}, j = 1, ..., M indicate whether sentence j is selected (y_j = 1) or not (y_j = 0) in the summary. Similarly, let N be the number of unique concepts in D, and let z_i ∈ {0, 1}, i = 1, ..., N indicate the appearance of concept i in the summary. Each concept i is assigned a weight w_i, often measured by the number of sentences or documents that contain the concept. The ILP-based summarization approach (Gillick and Favre, 2009) searches for an optimal assignment to the sentence and concept variables so that the selected summary sentences maximize coverage of important concepts. The relationship between concepts and sentences is captured by a co-occurrence matrix A ∈ R^{N×M}, where A_{ij} = 1 indicates that the i-th concept appears in the j-th sentence, and A_{ij} = 0 otherwise. In the literature, bigrams are frequently used as a surrogate for concepts (Gillick et al., 2008; Berg-Kirkpatrick et al., 2011). We follow this convention and use 'concept' and 'bigram' interchangeably in this paper.

\max_{y,z} \sum_{i=1}^{N} w_i z_i    (1)

\text{s.t.} \quad \sum_{j=1}^{M} A_{ij} y_j \geq z_i    (2)

\quad A_{ij} y_j \leq z_i    (3)

\quad \sum_{j=1}^{M} l_j y_j \leq L    (4)

\quad y_j \in \{0, 1\}, \; z_i \in \{0, 1\}    (5)

Two sets of linear constraints are specified to ensure the validity of the ILP: (1) a concept is selected if and only if at least one sentence carrying it has been selected (Eq. 2), and (2) all concepts in a sentence are selected if that sentence is selected (Eq. 3). Finally, the selected summary sentences are allowed to contain a total of L words or less (Eq. 4), where l_j denotes the length of sentence j.
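To make the formulation concrete, below is a minimal sketch of Eqs. 1–5 in Python using the PuLP solver. The solver choice, variable names, and toy data are our own illustration, not the system described in the paper.

```python
# A minimal sketch of the concept-coverage ILP (Eqs. 1-5) using PuLP.
# The toy data and variable names are illustrative, not from the paper.
import pulp

# Toy co-occurrence data: A[i][j] = 1 if concept i appears in sentence j.
A = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 0]]
w = [2, 1, 3]          # concept weights (e.g., sentence frequency)
lengths = [5, 4, 6]    # l_j: number of words in each sentence
L = 10                 # summary length budget in words
N, M = len(A), len(A[0])

prob = pulp.LpProblem("concept_ilp", pulp.LpMaximize)
y = [pulp.LpVariable(f"y_{j}", cat="Binary") for j in range(M)]
z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(N)]

# Objective (Eq. 1): maximize the total weight of covered concepts.
prob += pulp.lpSum(w[i] * z[i] for i in range(N))

for i in range(N):
    # Eq. 2: a concept counts only if some sentence containing it is selected.
    prob += pulp.lpSum(A[i][j] * y[j] for j in range(M)) >= z[i]
    for j in range(M):
        # Eq. 3: selecting a sentence selects all of its concepts.
        prob += A[i][j] * y[j] <= z[i]

# Eq. 4: the total summary length must not exceed L words.
prob += pulp.lpSum(lengths[j] * y[j] for j in range(M)) <= L

prob.solve(pulp.PULP_CBC_CMD(msg=False))
summary = [j for j in range(M) if y[j].value() == 1]
print("Selected sentences:", summary)
```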
3 Our Approach

Because of the lexical diversity in student responses, we suspect the co-occurrence matrix A may not establish a faithful correspondence between sentences and concepts. A concept may be conveyed using multiple expressions; however, the current co-occurrence matrix only captures a binary relationship between sentences and concepts. For example, we ought to give partial credit to "bicycle parts" (S7) given that a similar expression "bike elements" (S4) appears in the sentence. Domain-specific synonyms may be captured as well. For example, the sentence "I tried to follow along but I couldn't grasp the concepts" is expected to partially contain the concept "understand the", although the latter does not appear in the sentence.

The existing matrix A is highly sparse: only 2.7% of its entries are non-zero in our dataset (§4). We therefore propose to impute the co-occurrence matrix by filling in missing values. This is accomplished by approximating the original co-occurrence matrix with a low-rank matrix. The low-rankness encourages similar concepts to be shared across sentences. The data imputation process makes two notable changes to the existing ILP framework. First, it extends the domain of A_{ij} from binary values to the continuous scale [0, 1] (Eq. 2), which offers a better sentence-level semantic representation. Second, the binary concept variables z_i are relaxed to the continuous domain [0, 1] (Eq. 5), which allows concepts to be "partially" included in the summary.

Concretely, given the co-occurrence matrix A ∈ R^{N×M}, we aim to find a low-rank matrix B ∈ R^{N×M} whose values are close to those of A at the observed positions. Our objective function is

\min_{B \in \mathbb{R}^{N \times M}} \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - B_{ij})^2 + \lambda \|B\|_*    (6)

where Ω represents the set of observed value positions, and \|B\|_* denotes the trace norm of B, i.e., \|B\|_* = \sum_{i=1}^{r} \sigma_i, where r is the rank of B and \sigma_i are its singular values. By defining the following projection operator P_Ω,

[P_\Omega(B)]_{ij} = \begin{cases} B_{ij}, & (i,j) \in \Omega \\ 0, & (i,j) \notin \Omega \end{cases}    (7)

our objective function (Eq. 6) can be succinctly represented as

\min_{B \in \mathbb{R}^{N \times M}} \frac{1}{2} \|P_\Omega(A) - P_\Omega(B)\|_F^2 + \lambda \|B\|_*    (8)

where \|\cdot\|_F denotes the Frobenius norm.

Following Mazumder et al. (2010), we optimize Eq. 8 using the proximal gradient descent algorithm. The update rule is

B^{(k+1)} = \text{prox}_{\lambda \rho_k}\big(B^{(k)} + \rho_k (P_\Omega(A) - P_\Omega(B^{(k)}))\big)    (9)

where ρ_k is the step size at iteration k and the proximal function prox_t(B) is the singular value soft-thresholding operator, \text{prox}_t(B) = U \cdot \text{diag}((\sigma_i - t)_+) \cdot V^\top, where B = U \text{diag}(\sigma_1, \ldots, \sigma_r) V^\top is the singular value decomposition (SVD) of B and (x)_+ = \max(x, 0).

Since the gradient of \frac{1}{2} \|P_\Omega(A) - P_\Omega(B)\|_F^2 is Lipschitz continuous with Lipschitz constant 1, we follow Mazumder et al. (2010) and choose the fixed step size ρ_k = 1, which yields a provable convergence rate of O(1/k), where k is the number of iterations.
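As an illustration, with the fixed step size ρ_k = 1 the update rule in Eq. 9 reduces to iterated singular value soft-thresholding, i.e., the Soft-Impute scheme of Mazumder et al. (2010). Below is a minimal numpy sketch; the function names, the stopping rule, and the convention that the non-zero entries constitute Ω are our own assumptions rather than the paper's code.

```python
# A minimal numpy sketch of the proximal gradient update in Eq. 9 with
# fixed step size rho_k = 1 (equivalent to Soft-Impute; Mazumder et al., 2010).
# Function names and the stopping tolerance are our own illustrative choices.
import numpy as np

def svt(B, t):
    """Singular value soft-thresholding: U diag((sigma_i - t)_+) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - t, 0.0)) @ Vt

def soft_impute(A, mask, lam, max_iter=200, tol=1e-4):
    """Impute A on unobserved entries; `mask` is 1 at observed positions."""
    B = np.zeros_like(A, dtype=float)
    for _ in range(max_iter):
        # Gradient step with rho = 1: keep the observed entries of A and
        # fill the unobserved entries with the current estimate B.
        B_new = svt(mask * A + (1 - mask) * B, lam)
        if np.linalg.norm(B_new - B) <= tol * max(np.linalg.norm(B), 1.0):
            return B_new
        B = B_new
    return B

# Toy usage: a sparse binary concept-sentence matrix.
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0]], dtype=float)
mask = (A != 0).astype(float)  # assumption: observed positions are the non-zeros
B = soft_impute(A, mask, lam=0.5)
print(np.round(B, 2))
```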
4 Dataset

Our dataset is collected from an introductory materials science and engineering class taught at a major U.S. university. The class has 25 lectures and enrolled 53 undergraduate students. The students are asked to provide feedback after each lecture based on three prompts: 1) "describe what you found most interesting in today's class," 2) "describe what was confusing or needed more detail," and 3) "describe what you learned about how you learn." These open-ended prompts are carefully designed to encourage students to self-reflect, allowing them to "recapture experience, think about it and evaluate it" (Boud et al., 2013). The average response length is 10±8.3 words. If we concatenate all the responses to each lecture and prompt into a "pseudo-document", the document contains 378 words on average.

The reference summaries are created by a teaching assistant. She is allowed to create summaries using her own words in addition to selecting phrases directly from the responses. Because summary annotation is costly and recruiting annotators with the proper background is nontrivial, 12 of the 25 lectures are annotated with reference summaries. There is one gold-standard summary per lecture and question prompt, yielding 36 document-summary pairs¹. On average, a reference summary contains 30 words, corresponding to 7.9% of the total words in the student responses. 43.5% of the bigrams in human summaries appear in the responses.

5 Experiments

Our proposed approach is compared against a range of baselines: 1) MEAD (Radev et al., 2004), a centroid-based summarization system that scores sentences based on length, centroid, and position; 2) LEXRANK (Erkan and Radev, 2004), a graph-based summarization approach based on eigenvector centrality; 3) SUMBASIC (Vanderwende et al., 2007), an approach that assumes words occurring frequently in a document cluster have a higher chance of being included in the summary; and 4) BASELINE-ILP (Berg-Kirkpatrick et al., 2011), an ILP framework without data imputation.

For the ILP-based approaches, we use bigrams as concepts (bigrams consisting only of stopwords are removed²) and sentence frequency as concept weights. We use all the sentences in the 25 lectures to construct the concept-sentence co-occurrence matrix and perform data imputation, which allows us to leverage co-occurrence statistics both within and across lectures. For the soft-impute algorithm, we perform grid search (on a scale of [0, 5] with step size 0.5) to tune the hyper-parameter λ. To make the most of the annotated lectures, we split them into three folds; in each case, we tune λ on two folds and test on the remaining fold, and we report the averaged results. In all experiments, summary length is set to 30 words or less, corresponding to the average number of words in human summaries.

¹This dataset is publicly available at http://www.coursemirror.com/download/dataset.
²Bigrams with one stopword are not removed because 1) they are informative ("a bike", "the activity", "how materials"); 2) such bigrams appear in multiple sentences and are thus helpful for matrix imputation.
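As a rough sketch of the concept-sentence matrix construction described above (the tokenizer, the tiny stopword list, and all helper names below are our own illustrative choices, not the paper's code):

```python
# Illustrative sketch of building the binary concept-sentence matrix A,
# with bigram concepts and stopword-only bigrams removed (Section 5).
# The tokenization, stopword list, and all names here are our own choices.
import numpy as np

STOPWORDS = {"a", "an", "the", "of", "to", "in", "how", "it", "was", "i"}

def bigrams(sentence):
    tokens = sentence.lower().split()
    return [(tokens[k], tokens[k + 1]) for k in range(len(tokens) - 1)]

def build_matrix(sentences):
    # Keep a bigram unless BOTH of its words are stopwords.
    concepts = sorted({b for s in sentences for b in bigrams(s)
                       if not (b[0] in STOPWORDS and b[1] in STOPWORDS)})
    index = {b: i for i, b in enumerate(concepts)}
    A = np.zeros((len(concepts), len(sentences)))
    for j, s in enumerate(sentences):
        for b in bigrams(s):
            if b in index:
                A[index[b], j] = 1.0
    # Concept weight w_i = sentence frequency of the bigram.
    w = A.sum(axis=1)
    return A, w, concepts

sentences = ["the activity with the bicycle parts",
             "i found the properties of bike elements to be most interesting"]
A, w, concepts = build_matrix(sentences)
print(A.shape, w)
```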

                 ROUGE-1             ROUGE-2             ROUGE-SU4           Human
System           R (%) P (%) F (%)   R (%) P (%) F (%)   R (%) P (%) F (%)   Preference
MEAD             26.4  23.3  21.8     6.7   7.6   6.3     8.8   8.0   5.4    24.8%
LEXRANK          30.0  27.6  25.7     8.1   8.3   7.6     9.6   9.6   6.6    —
SUMBASIC         36.6  31.4  30.4     8.2   8.1   7.5    13.9  11.0   8.7    —
ILP BASELINE     35.5  31.8  29.8    11.1  10.7   9.9    12.9  11.5   8.2    69.6%
OUR APPROACH     38.0  34.6  32.2    12.7  12.9  11.4    15.5  14.4  10.1    89.6%

Table 2: Summarization results evaluated by ROUGE (%) and human judges. The shaded area indicates that the performance difference with OUR APPROACH is statistically significant (p < 0.05) using a two-tailed paired t-test on the 36 document-summary pairs.

In Table 2, we present summarization results evaluated by ROUGE (Lin, 2004) and human judges. ROUGE is a standard evaluation metric that compares system and reference summaries based on n-gram overlaps. Our proposed approach outperforms all the baselines on the three standard ROUGE metrics³. When examining the imputed sentence-concept co-occurrence matrix, we notice some interesting examples that indicate the effectiveness of the proposed approach, shown in Table 3.

Because ROUGE cannot thoroughly capture the resemblance between system and reference summaries, we further perform human evaluation. For each lecture and prompt, we present the prompt, a pair of system outputs in a random order, and the human summary to five Amazon turkers. The turkers are asked to indicate their preference for system A or B based on semantic resemblance to the human summary on a 5-point Likert scale ('Strongly preferred A', 'Slightly preferred A', 'No preference', 'Slightly preferred B', 'Strongly preferred B'). They are rewarded $0.08 per task. We use two strategies to control the quality of the human evaluation. First, we require the turkers to have a Human Intelligence Task (HIT) approval rate of 90% or above. Second, we insert quality checkpoints by asking the turkers to compare two summaries with the same text content but different sentence orders; turkers who did not pass these tests are filtered out. Due to budget constraints, we conduct pairwise comparisons for three systems. The total number of comparisons is 3 system-system pairs × 12 lectures × 3 prompts × 5 turkers = 540 pairs. We calculate the percentage of "wins" (strong or slight preference) for each system among all comparisons with its counterparts. Results are reported in the last column of Table 2. OUR APPROACH is preferred significantly more often than the other two systems⁴. Regarding inter-annotator agreement, we find that 74.3% of the individual judgements agree with the majority votes when using a 3-point Likert scale ('preferred A', 'no preference', 'preferred B').

Table 4 presents example system outputs, offering an intuitive understanding of our proposed approach.

Sentence                                                                      | Assoc. Bigrams
the printing needs to better so it can be easier to read                     | the graph
graphs make it easier to understand concepts                                 | hard to
the naming system for the 2 phase regions                                    | phase diagram
I tried to follow along but I couldn't grasp the concepts                    | understand the
no problems except for the specific equations used to determine properties   | strain curves
from the stress - strain graph                                                |

Table 3: The associated bigrams do not appear in the sentence, but after matrix imputation they yield a decent correlation (cell value greater than 0.9) with the corresponding sentence.

³F-scores are slightly lower than P/R because of the averaging effect, which can be illustrated with one example. Suppose we have P1 = 0.1, R1 = 0.4, F1 = 0.16 and P2 = 0.4, R2 = 0.1, F2 = 0.16. The macro-averaged P/R/F-scores are then P = 0.25, R = 0.25, F = 0.16. In this case, the F-score is lower than both P and R.
⁴For the significance test, we convert a preference to a score ranging from -2 to 2 ('2' means 'Strongly preferred' for a system and '-2' means 'Strongly preferred' for the counterpart system), and use a two-tailed paired t-test with p < 0.05 to compare the scores.
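One plausible reading of the significance test in footnote 4, sketched with scipy: the score mapping follows the footnote, while the pairing convention and the example judgements are our own assumptions.

```python
# Minimal sketch of the preference significance test in footnote 4.
# The example judgements here are fabricated for illustration only.
from scipy import stats

# Map each 5-point judgement to a score in [-2, 2] from system A's view:
# 'Strongly preferred A' -> 2, ..., 'Strongly preferred B' -> -2.
SCALE = {"strong_A": 2, "slight_A": 1, "none": 0, "slight_B": -1, "strong_B": -2}

judgements = ["strong_A", "slight_A", "none", "strong_A", "slight_B"]
scores_a = [SCALE[j] for j in judgements]
scores_b = [-s for s in scores_a]  # the counterpart system's mirrored scores

# Two-tailed paired t-test on the per-comparison scores.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```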

Prompt: Describe what you found most interesting in today's class
Reference Summary:
- unit cell direction drawing and indexing
- real world examples
- importance of cell direction on materials properties
System Summary (ILP BASELINE):
- drawing and indexing unit cell direction
- it was interesting to understand how to find apf and fd from last weeks class
- south pole explorers died due to properties of tin
System Summary (OUR APPROACH):
- crystal structure directions
- surprisingly i found nothing interesting today .
- unit cell indexing
- vectors in unit cells
- unit cell drawing and indexing
- the importance of cell direction on material properties

Table 4: Example reference and system summaries.

6 Related Work

Our previous work (Luo and Litman, 2015) proposes to summarize student responses by extracting phrases rather than sentences in order to meet the needs of aggregating and displaying student responses in a mobile application (Luo et al., 2015; Fan et al., 2015). It adopts a clustering paradigm to address the lexical variety issue. In this work, we leverage matrix imputation to solve this problem and summarize student responses at the sentence level.

The integer linear programming framework has demonstrated substantial success in summarizing news documents (Gillick et al., 2008; Gillick et al., 2009; Woodsend and Lapata, 2012; Li et al., 2013). Previous studies try to improve this line of work by generating better estimates of concept weights. Galanis et al. (2012) proposed a support vector regression model to estimate bigram frequency in the summary. Berg-Kirkpatrick et al. (2011) explored a supervised approach to learning parameters using a cost-augmentative SVM. Different from the above approaches, we focus on the co-occurrence matrix instead of the concept weights, which is another important component of the ILP framework.

Most summarization work focuses on summarizing news documents, as driven by the DUC/TAC conferences. Notable systems include maximal marginal relevance (Carbonell and Goldstein, 1998), submodular functions (Lin and Bilmes, 2010), joint sentence extraction and compression (Zajic et al., 2007), joint optimization of content selection and surface realization (Woodsend and Lapata, 2012), reconstruction error minimization (He et al., 2012), and dual decomposition (Almeida and Martins, 2013). Despite the encouraging performance of our proposed approach on summarizing student responses, when it is applied to the DUC 2004 dataset (Hong et al., 2014) and evaluated using ROUGE, we observe only comparable or marginal improvement over the ILP baseline. However, this is not surprising, since the lexical variety of the DUC data is low (20% of its bigrams appear more than twice, compared to 3% in the student responses), so there is less data sparsity and the DUC data cannot benefit much from imputation.

7 Conclusion

We make a first effort to summarize student feedback using an integer linear programming framework with data imputation. Our approach allows sentences to share co-occurrence statistics and alleviates the sparsity issue. Our experiments show that the proposed approach performs competitively against a range of baselines and shows promise for future automation of student feedback analysis.

In the future, we may take advantage of high-quality student responses (Luo and Litman, 2016) and explore helpfulness-guided summarization (Xiong and Litman, 2014) to improve summarization performance. We will also investigate whether the proposed approach benefits other informal text, such as product reviews, social media discussions, or spontaneous speech conversations, in which we expect the same sparsity issue to occur and the language expression to be similarly diverse.

Acknowledgments

This research is supported by an internal grant from the Learning Research and Development Center at the University of Pittsburgh. We thank Muhsin Menekse for providing the data set. We thank Jingtao Wang, Fan Zhang, Huy Nguyen and Zahra Rahimi for valuable suggestions about the proposed summarization algorithm. We also thank the anonymous reviewers for insightful comments and suggestions.

References

Miguel B. Almeida and Andre F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of ACL.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of ACL.

David Boud, Rosemary Keogh, David Walker, et al. 2013. Reflection: Turning Experience into Learning. Routledge.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 335–336.

Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1).

Xiangmin Fan, Wencan Luo, Muhsin Menekse, Diane Litman, and Jingtao Wang. 2015. CourseMIRROR: Enhancing large classroom instructor-student interactions via mobile interfaces and natural language processing. In Works-In-Progress of the ACM Conference on Human Factors in Computing Systems. ACM.

Dimitrios Galanis, Gerasimos Lampouras, and Ion Androutsopoulos. 2012. Extractive multi-document summarization with integer linear programming and support vector regression. In Proceedings of COLING.

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of NAACL.

Dan Gillick, Benoit Favre, and Dilek Hakkani-Tur. 2008. The ICSI summarization system at TAC 2008. In Proceedings of TAC.

Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu, and Shasha Xie. 2009. The ICSI/UTD summarization system at TAC 2009. In Proceedings of TAC.

Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In Proceedings of AAAI.

Kai Hong, John Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1608–1616, Reykjavik, Iceland, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L14-1070.

Chen Li, Xian Qian, and Yang Liu. 2013. Using supervised bigram-based ILP for extractive summarization. In Proceedings of ACL.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out.

Wencan Luo and Diane Litman. 2015. Summarizing student responses to reflection prompts. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, Lisbon, Portugal, September. Association for Computational Linguistics.

Wencan Luo and Diane Litman. 2016. Determining the quality of a student reflective response. In Proceedings of the 29th International FLAIRS Conference, Key Largo, FL, May.

Wencan Luo, Xiangmin Fan, Muhsin Menekse, Jingtao Wang, and Diane Litman. 2015. Enhancing instructor-student and student-student interactions with mobile interfaces and summarization. In Proceedings of NAACL (Demo).

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research.

Dragomir R. Radev, Hongyan Jing, Małgorzata Styś, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing and Management, 40(6):919–938.

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management, 43(6):1606–1618.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In Proceedings of EMNLP.

Wenting Xiong and Diane Litman. 2014. Empirical analysis of exploiting review helpfulness for extractive summarization of online reviews. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1985–1995, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. 2007. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management.
