Assessing ‘desirable difficulties’ to improve learning: Testing and font effects

Aryn Pyke1

This paper was completed and submitted in partial fulfillment of the Master Teacher Program, a 2-year faculty professional development program conducted by the Center for Faculty Excellence, United States Military Academy, West Point, NY, 2020.

Abstract: There is abundant support for the “testing effect” – subsequent recall is improved if

students try to recall information during study (self-testing) versus simply re-reading it. This

effect is consistent with a broader notion of “desirable difficulties” (Bjork & Bjork, 2011). If the

learning experience is characterized by difficulties that induce extra effort, then retention may be

improved (i.e., attempting recall is more difficult than re-reading). That said, not all difficulties

may be ‘desirable’ (i.e., facilitate learning). In this study, the testing effect was replicated and a

new potential desirable difficulty was investigated - a difficult-to-read font called Sans Forgetica

(Francis, 2018). Gaps and other irregularities in the letters are presumed to force a deeper level

of processing. However, when this study commenced, there was no published evidence on its

efficacy. It would be of great practical value if learning could be enhanced simply by changing

the text font. Freshman Psychology students (8 class sections, N = 120) were each given two

passages to learn via different learning methods (study then re-study vs. study then self-test).

Half the students saw the passages in Times New Roman and half in Sans Forgetica. On an

unexpected recall test a week later, scores were higher for passages learned via study-test than

study-study (a testing effect); however, the effect of font and the interaction were not significant.

Possible reasons why this font manipulation may not have been efficacious are discussed.

1 I would like to acknowledge non-MTP BS&L colleagues Beth Wetzler, Adam Werner and David Feltner, who collaborated in this research. I am, however, the sole author of this write-up.


Keywords: testing effect, desirable difficulties, levels of processing, Sans Forgetica, font, recall

Introduction

One objective of education is to help learners internalize and retain important factual

information. Such information can be integrated meaningfully into their mental knowledge bases

and subsequently accessed and applied to critically interpret new information and to inform

decision making and problem solving. As such, pedagogical practices and factors that facilitate

learning, retention and recall are of key interest to educators and educational psychologists. It has long been known that learners’ success at retrieving content from memory tends to increase with the number of exposures to the information (e.g., Logan & Klapp, 1991). However, beyond the number of exposures, the nature of these exposures and of the learning task can also influence success at subsequent recall. There is a broad notion that “desirable difficulties” can facilitate learning (Bjork & Bjork, 2011). In brief, sometimes when the learning experience is

characterized by difficulties that induce extra effort, then retention may be improved. We later

discuss in a bit more detail the theoretical frameworks and potential cognitive mechanisms that

may explain such effects.

Examples of Desirable Difficulties

A well-established example of a desirable difficulty is the testing effect – i.e., subsequent

recall is improved if students try to recall information during study (i.e., self-testing) versus

simply re-reading it (Karpicke & Roediger, 2008). Flash cards are one way that self-testing can

be implemented during learning. The testing effect has been established for a variety of contents

including learning arbitrary word pairs, foreign words, answers to arithmetic problems, and

semantic facts (e.g., Carrier & Pashler, 1992; Gaspelin, Ruthruff, & Pashler, 2013; Karpicke &

Roediger, 2008, Pyke, Bourque & LeFevre, 2019). For the testing effect, the ‘desirable


difficulty’ refers to the fact that attempting recall during learning is harder than passively re-

reading the content.

Engaging in several study sessions that are spaced apart rather than conducting a

single cram session is another example of a desirable difficulty that aids learning (for a review

see Cepeda et al., 2006). The desirable difficulty underlying this spacing effect (or distributed practice) is presumably that in the intervals between study sessions, the content becomes less active and/or accessible in memory, so that in each study session extra effort is necessary to re-acquaint oneself with the content. A variation of this learning method is to interleave study sessions on different subjects. In this case, re-acquainting oneself with the content of one subject when it is next studied is even more challenging, because the content has become less accessible not only due to the passage of time but also due to interference from the content of the intervening subject(s) studied (for a good computational model of memory accessibility factors see

ACT-R, Anderson et al., 2004). At a mechanistic level, spacing, interleaving, and testing effect benefits may all owe to the process of attempting to recall content into working memory.

Spacing and interleaving make such recall more difficult.
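The accessibility dynamics described above can be illustrated with ACT-R’s base-level activation equation, in which a trace’s activation is the log of summed, decaying contributions from each past practice (Anderson et al., 2004). The sketch below uses hypothetical practice schedules and arbitrary time units, chosen only to contrast massed versus spaced practice:

```python
import math

def base_level_activation(practice_times, test_time, decay=0.5):
    """ACT-R base-level activation: B = ln(sum over practices of t_j^-d),
    where t_j is the time elapsed since practice j and d is the decay rate."""
    return math.log(sum((test_time - t) ** -decay for t in practice_times))

# Three practices each, recall test at t = 1000 (hypothetical, arbitrary units)
massed = base_level_activation([0, 10, 20], test_time=1000)   # crammed early
spaced = base_level_activation([0, 450, 900], test_time=1000) # spread out

print(f"massed: {massed:.2f}, spaced: {spaced:.2f}")
```

At test time the spaced schedule leaves the trace more active (hence more accessible) than the massed schedule, consistent with the spacing effect described above.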

Not All Difficulties Are Desirable

Unfortunately, not all difficulties or investments of effort during learning produce gains in retention and recall. For example, math students can gain exposures to answers to arithmetic facts (e.g., 3*4) by computing the answer themselves (3*4 = 4+4+4) or by using a calculator.

Although self-computation is a more difficult and effortful way to practice, the act of self- computation itself does not provide a benefit for committing the problem-answer association to memory (Pyke & LeFevre, 2011). That said, to avoid arduous computation, students may be motivated to first attempt to recall the answer before trying to compute it – and this recall

attempt (rather than effort spent on computation itself) can facilitate fact learning (i.e., a testing effect, Pyke, Bourque & LeFevre, 2019).

Evidence also suggests that having a learner’s attention divided (e.g., multi-tasking) during learning tends to be an undesirable difficulty (e.g., Fernandes & Moscovitch,

2000). That said, Gaspelin, Ruthruff and Pashler (2013) hypothesized that divided attention during subsequent study (recall practice) might still be beneficial, because like spacing and interleaving, it might make recall more (desirably) difficult. However, Gaspelin et al.’s results

(see also Craik et al., 1996) led them to conclude that even in the context of subsequent study, divided attention was not a desirable difficulty. In summary, it may not always be obvious a priori which difficulties are ‘desirable’ for facilitating learning.

Font Manipulations: Desirable or Undesirable Difficulties?

An aim of the current research was to investigate a new potential desirable difficulty in the form of a difficult-to-read font, Sans Forgetica (Francis, 2018). As shown in Figure 1, Sans

Forgetica includes gaps and other irregularities in the letters. The designers, who are cognitive psychologists, claim that because of these properties, this font will improve retention because readers will engage in deeper cognitive processing (Sansforgetica.rmit). It would obviously be of great practical value if learning could be enhanced simply by changing materials into this font, but when this study commenced, there was no published scientific evidence on its efficacy.

Figure 1: Illustrating the gaps, missing sections and back slant that may make Sans

Forgetica more difficult to read.


Prior memory research had, however, been done on other font manipulations to seek

possible ‘disfluency’ effects, as potential desirable difficulties are referred to in this context (e.g.,

Kuhl & Eitel, 2016). For example, in a meta-analysis, Halamish (2018) reported a U-shaped relationship for the effect of font size on memory – that is, compared to intermediate font sizes, recall was better for both large fonts and small fonts. In other research, however, visually degrading a text (e.g., blurring) to induce disfluency did not always show expected benefits (Yue et al.,

2013). In terms of comparisons across font types, Diemand-Yauman, Oppenheimer and

Vaughan (2011, Study 1) reported that when learners were tested after a 15-minute distractor task, learners were better able to recall facts from passages that had been presented in ‘disfluent’ fonts (Comic Sans & Bodoni) versus in Arial. In that study, however, the disfluent fonts also differed in size (12 point) and ink saturation (75% greyscale) in comparison to the Arial control

(16 point, black).

In general, however, it seems that disfluency effects are not always readily obtained or replicated. Kuhl and Eitel (2016) summarized the results from a subsequent special issue investigating disfluency outcomes, and in all 13 studies, disfluency did not yield an overall benefit to performance. In that issue disfluency was operationalized via one or more of the following manipulations: making the text smaller, grey (vs. black), blurred, italicized and/or in a different font than the Arial control (e.g., Times New Roman, Comic Sans, Brush Script or

Haettenschweiler).

In terms of Sans Forgetica specifically, after the data for the current study were collected, a study by Eskenazi and Nix (2020) was published suggesting that Sans Forgetica might induce desirable difficulties, relative to Courier, in a lexical acquisition task. Subjects had to learn the spelling and infer the meaning of low frequency words, each presented in the context of two


sentences (15 words, total of 30 sentences). The efficacy of orthographic learning (spelling) was

then assessed by having learners choose the correct spelling of each word from among four

options (multiple choice recognition). Learning of semantic meanings was assessed by

presenting the word as a cue and having subjects recall the definition they had inferred from the

context sentences. These researchers reported a benefit for both orthographic and semantic

learning when the original sentences were presented in Sans Forgetica (vs. Courier), but only

among subjects that were high- (but not low-) skilled at spelling. It was not clear, however,

whether the skill levels among ‘high-skill’ subjects were matched across the two font groups, or

whether the effect may have been potentially due to an a priori skill imbalance in favor of the

“high-skilled” Sans Forgetica group.

The current research investigated semantic learning and recall of facts gleaned from short passages presented in either Sans Forgetica or Times New Roman.

Method

Participants

Participants were 8 class sections of freshman students (N=120) taking the course

General Psychology for Leaders during the fall semester in 2019. Half the class sections (4 of 8) saw the stimulus passages in Times New Roman font (N=62) and half the class sections saw the passages in Sans Forgetica (N=58).

Materials

The materials to be learned were two short passages originally designed to assess comprehension in English-as-a-Second-Language students (Rogers, 2001), but that have also since been used and validated in experimental studies on the testing effect (Einstein, Mullet &

Harrison, 2012; Roediger & Karpicke 2006). One passage was about Sea Otters (275 words) and


one was about the Sun (256 words)2. Each passage was associated with a quiz consisting of 12

short-answer questions3. For example, the sun passage contains the information that: “The sun

today is a yellow dwarf star”; and its quiz contains the question: “What type of star is the sun

today?”. The sea otter passage contains the information: “Sea otters dwell in the North Pacific”;

and its quiz contains the question: “Where do sea otters dwell?”.

Procedure

The current procedure largely replicated that in Einstein et al. (2012) and occurred in two

phases: Learning (Session 1) and Recall Quiz (Session 2), which were conducted a week apart.

These activities then served as a basis for discussion in a subsequent lesson on study habits and

research methods (debriefing lesson).

Learning Session

During a psychology class on learning, each student was given both passages (sun and

sea otter), each printed on its own sheet of paper, to learn sequentially. A key difference from

Einstein et al. (2012) was that half our participating class sections (4) were given these passages

in Sans Forgetica font and the other sections received them in Times New Roman font. Learning

method, however, was a within-subjects variable – each student learned one passage via a study-

study method and one via study-test. Each student was allocated 8 minutes of total learning time

per passage. For both learning conditions, the first 4 minutes were spent studying a passage (i.e.,

reading it and taking notes if desired). In the study-study condition, the next 4

minutes were spent doing more of the same, but in the study-test condition, they flipped the

passage over and out of view and spent 4 minutes recalling and writing down information they

2 The Sun and Sea Otter passages can be found at http://psych.wustl.edu/memory/stimuli/Stimuli-Roediger&Karpicke2006b.pdf.
3 The short-answer questions and answers can be found at http://www2.furman.edu/academics/psychology/FacultyandStaff/Einstein/Pages/TeachingEffectDemo.aspx.


could remember on a blank page. Both the association of passages with learning methods and the order that the learning methods were executed were counterbalanced across class sections.

Students were not told there would be a subsequent recall quiz on these materials.
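The counterbalancing described above can be sketched as a full crossing of font (between sections), passage-method pairing, and method order across the eight class sections. The assignment below is a hypothetical illustration; the actual mapping of specific sections to conditions was not specified:

```python
from itertools import product

# Between-sections factor: font (4 sections each).
fonts = ["Times New Roman", "Sans Forgetica"]
# Which passage was learned by which method (counterbalanced).
pairings = [("sun: study-study", "otter: study-test"),
            ("sun: study-test", "otter: study-study")]
# Which learning method was executed first (counterbalanced).
orders = ["study-study first", "study-test first"]

# Fully crossing the three factors yields one condition per section (2x2x2 = 8).
sections = [
    {"section": i + 1, "font": f, "pairing": p, "order": o}
    for i, (f, p, o) in enumerate(product(fonts, pairings, orders))
]

for s in sections:
    print(s)
```

Fully crossing the pairing and order factors within each font group ensures that any passage-difficulty or order effects are balanced across the study-study and study-test conditions.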

Recall Quiz Session

A week after the learning session, in a class on memory, two recall quizzes were administered – one per passage, each with 12 short-answer questions. The students took these quizzes in the same order in which they had initially read the passages, and were given 8 minutes to complete each quiz.

Debriefing Lesson

Pedagogically, the above activities served as the basis for a discussion in a subsequent lesson about: i) which study habits students used and which they found most effective; ii) the testing effect; and iii) research methods. Learning, memory, and research methods are key topics in psychology and this hands-on demonstration served as an excellent springboard for discussion.

Results

Figure 2 presents the students’ scores on the recall quiz that occurred one week after the students were exposed to the prose passages in class. A mixed model analysis of variance was used to analyze the results from this 2(learning method: study-study, study-test) X 2(passage font: Sans Forgetica, Times New Roman) design, with learning method as a repeated measure and font as a between-groups factor. The ANOVA revealed a significant main effect of learning method on recall scores, F(1, 118) = 14.01, partial-eta^2 = .11, p < .001 – specifically, consistent with prior research, recall was significantly better after study-test learning (M = 44.3%,

SD = 19.1) than study-study learning (M = 36.7%, SD = 20.2). However, there was neither a


significant effect of font type, F(1,118) = 0.01, partial-eta^2 <.01, p=.883, nor a significant

interaction, F(1,118) = 0.02, partial-eta^2 <.01, p=.942.
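As a consistency check on the reported effect sizes, partial eta-squared can be recovered from each F statistic and its degrees of freedom. A minimal sketch using the values reported above:

```python
def partial_eta_squared(F, df_effect, df_error):
    """Partial eta^2 recovered from an F statistic:
    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (F * df_effect) / (F * df_effect + df_error)

# Main effect of learning method: F(1, 118) = 14.01
print(round(partial_eta_squared(14.01, 1, 118), 2))  # -> 0.11

# Main effect of font: F(1, 118) = 0.01 (negligible effect size)
print(round(partial_eta_squared(0.01, 1, 118), 4))
```

The recovered value of .11 for learning method matches the partial-eta^2 reported above, while the font effect is effectively zero.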

Figure 2: Percent of answers correct on the delayed (1-week) recall test after study-study learning and study-test learning of passages in Times New Roman or Sans Forgetica. Error bars are standard errors.

Discussion

The results replicated the expected testing effect (Einstein et al., 2012; Roediger &

Karpicke, 2006): Recall was better after study-test learning than study-study learning. However, the results did not provide support for the hypothesis that the Sans Forgetica font induces a desirable difficulty that improves recall relative to a more conventional font like Times New

Roman. Note, a testing effect - but not a font effect - occurred even though students were not told they would be subsequently tested on the material. The following discussion considers


several possible explanations for the discrepancy between the current results and the effect

reported in Eskenazi and Nix (2020).

A possible factor that might influence results is the nature of the task. Eskenazi and Nix

(2020) emphasize that their task focuses on lexical acquisition. Indeed, learning the spelling

(orthographic features) of a word as assessed via a recognition task likely engages some different

resources than learning semantic facts as assessed by a recall task, which is the current focus.

However, Eskenazi and Nix also assessed - and reported a font effect for - the recall of the

semantic meaning of the words. The common task of semantic recall across their study and the

current research would suggest that task demands may not be a key locus of the difference in

results.

Another possible factor is learner control over the time they invest reading the material.

Sans Forgetica’s benefits may stem from inducing students to take longer to read a given number

of words. Thus, it may be the extra time (vs. the font per se) that is beneficial, which seems

compatible with the views of its creators (Francis, 2018). This possibility is also compatible with

the finding in Eskenazi and Nix (2020) that the group that performed best spent more total time

reading the target words (in Sans Forgetica) than the control group spent reading them (in

Courier). The procedure in the current research allocated equal time for learning the passages in

each font. That said, the allotted time was enough to allow students to read the complete

passages, so it is unclear if allocating equal time was relevant to the differences in results.

A third possibility for why recall scores were not higher in the current Sans Forgetica condition is that the recall quiz questions were in Times New Roman font. Context-dependent

learning studies (for a review see Smith & Vela, 2001) sometimes find a recall benefit if the

recall context (here, font of the question) matches the learning context (here, font of passage),


even superficially. However, the locus of the memory benefit of a desirable difficulty is

typically expected to occur in the encoding – rather than the recall – stage, via greater depth of

processing. If Sans Forgetica font is only beneficial in learning in contexts where the recall cues

must also be in Sans Forgetica font, then the practical benefits for pedagogy seem substantially

limited, and the effect would reduce to a specific instance of context-dependent learning. It is

not clear from Eskenazi and Nix (2020) what font was used to present the questions to assess

orthographic and semantic learning after a subject read the learning material (30 sentences) in

Sans Forgetica or Courier. Each individual subject read all learning-material sentences in a

single font to avoid ‘task’-switching between fonts during the learning phase. It is suspected,

however, that in their study, the assessment questions after learning may have been presented in

a common, conventional font across groups. If so, the font of the questions is also not the locus

of the difference in results.

Finally, in contrast to Eskenazi and Nix (2020), in the current study, the protocol did not include an independent metric to classify and compare high- versus low-skilled students. In their

lexical acquisition task, Eskenazi and Nix reported that Sans Forgetica benefits were limited to

students who were high-skilled (in spelling). In that article, however, the reported data do not

clarify whether or not the skill levels were matched across font condition, which was a between-

subjects variable. Thus, despite random assignment, the ‘high-skilled’ students in their Sans

Forgetica group may have been higher-skilled than the ‘high-skilled’ students in the control

(Courier) group. If so, higher performance in the Sans Forgetica group may simply have owed to

those students having higher a priori skill levels, rather than to the font itself. In the current

study, one could also argue that, despite random assignment, there may have been an imbalance

of skill across groups that cancelled out the font effect in both the study-test subgroups and the


study-study subgroups. Nonetheless, it is a notable concern that Eskenazi and Nix had the independent metrics to check for a potential skill imbalance across their groups and apparently did not do so.

In summary, we have considered several possible factors that could influence font effects: task type, cue font, reading time, and a priori skill imbalances across font groups. We suggest that reading time and checking for skill imbalances are key factors to consider for future research. If Sans Forgetica induces students to spend longer studying in the wild, then it may lead to learning gains. Alternately, however, students may be inclined to give up on a reading if it is taking them too long to finish it. In future work, to explore this potential speed/accuracy trade-off (speed of encoding vs. accuracy of recall), it will be important to assess the relative amount of time that students voluntarily spend on course material - and the extent to which they complete their assigned readings - in each font. In future work, it also seems important to assess students’ a priori skill for two reasons: i) in case advantages only apply to a subset of skill levels; and ii) to allow researchers to check that there are no skill imbalances across groups that could skew the results.

In conclusion, the inconsistent results between the current study and Eskenazi and

Nix (2020) suggest caution before assuming that Sans Forgetica will serve as a desirable difficulty. It did not promote delayed recall of semantic knowledge above and beyond that afforded by the Times New Roman font in the current study. More generally, published studies reporting significant disfluency effects may belie a silent majority of unpublished studies that obtained null-effects and ended up in the file drawer (Kuhl & Eitel, 2016). As such it may be prudent to exercise caution before rushing to put any disfluency manipulations into pedagogical practice.


References

Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036- 1060.

Bjork, E. L., & Bjork, R. A. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In Psychology and the real world: Essays illustrating fundamental contributions to society (pp. 59-68).

Carrier, M., & Pashler, H. (1992). The influence of retrieval on retention. Memory & Cognition, 20(6), 633-642.

Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354-380.

Craik, F. I., Govoni, R., Naveh-Benjamin, M., & Anderson, N. D. (1996). The effects of divided attention on encoding and retrieval processes in human memory. Journal of Experimental Psychology: General, 125(2), 159-176.

Diemand-Yauman, C., Oppenheimer, D. M., & Vaughan, E. B. (2011). Fortune favors the bold (and the italicized): Effects of disfluency on educational outcomes. Cognition, 118(1), 111-115.

Einstein, G. O., Mullet, H. G., & Harrison, T. L. (2012). The testing effect: Illustrating a fundamental concept and changing study strategies. Teaching of Psychology, 39(3), 190-193.

Eskenazi, M. A., & Nix, B. (2020). Individual differences in the desirable difficulty effect during lexical acquisition. Journal of Experimental Psychology: Learning, Memory, and Cognition. http://dx.doi.org/10.1037/xlm0000809

Fernandes, M. A., & Moscovitch, M. (2000). Divided attention and memory: evidence of substantial interference effects at retrieval and encoding. Journal of Experimental Psychology: General, 129(2), 155-176.

Francis, D. (2018). Introducing Sans Forgetica. Word Ways, 51(4), 10.

Gaspelin, N., Ruthruff, E., & Pashler, H. (2013). Divided attention: An undesirable difficulty in memory retention. Memory & Cognition, 41(7), 978-988.

Halamish, V. (2018). Can very small font size enhance memory?. Memory & Cognition, 46(6), 979-993.

Karpicke, J. D., & Roediger, H. L. (2008). The critical importance of retrieval for learning. Science, 319(5865), 966-968.


Kühl, T., & Eitel, A. (2016). Effects of disfluency on cognitive and metacognitive processes and outcomes. Metacognition and Learning, 11(1), 1-13.

Logan, G. D., & Klapp, S. T. (1991). Automatizing alphabet arithmetic: I. Is extended practice necessary to produce automaticity?. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(2), 179-195.

Pyke, A., Bourque, G., & LeFevre, J. A. (2019). Expediting arithmetic automaticity: Do inefficient computation methods induce spontaneous testing effects?. Journal of Cognitive Psychology, 31(1), 104-115.

Pyke, A. A., & LeFevre, J. A. (2011). Calculator use need not undermine direct-access ability: The roles of retrieval, calculation, and calculator use in the acquisition of arithmetic facts. Journal of Educational Psychology, 103(3), 607-616.

Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249-255.

Rogers, B. (2001). TOEFL CBT Success. Princeton, NJ: Peterson’s.

Smith, S. M., & Vela, E. (2001). Environmental context-dependent memory: A review and meta-analysis. Psychonomic Bulletin & Review, 8(2), 203-220.

Yue, C. L., Castel, A. D., & Bjork, R. A. (2013). When disfluency is—and is not—a desirable difficulty: The influence of clarity on metacognitive judgments and memory. Memory & Cognition, 41(2), 229-241.
