Learning efficiency correlates of using SuperMemo with specially crafted flashcards in medical scholarship.

Authors: Jacopo Michettoni, Alexis Pujo, Daniel Nadolny, Raj Thimmiah.

Abstract

Computer-assisted learning has been growing in popularity in higher education and in the research literature. A subset of these novel approaches to learning claims that predictive algorithms called spaced repetition algorithms can significantly improve retention rates of studied knowledge while minimizing the time investment required for learning.

SuperMemo is a brand of commercial software and the publisher of the SuperMemo spaced repetition algorithm.

Medical scholarship is well known for requiring students to acquire large amounts of information in a short span of time. Anatomy, in particular, relies heavily on rote memorization.

Using the SuperMemo web platform1, we are setting up a non-randomized trial and inviting medical students completing an anatomy course to take part. Usage of SuperMemo and performance on a test will be measured and compared with a concurrent control group that will not be provided with the SuperMemo software.

Hypotheses

A) Increased average grade for memorization-intensive examinations. If spaced repetition positively affects the average retrievability and stability of memory over a term of one to four months, then consistent2 users should obtain better grades than their peers on memorization-intensive examination material.

B) Grades increase with consistency. There is a negative relationship between the variability of daily usage of spaced repetition software (SRS) and grades.

C) Increased stability of memory in the long term. If spaced repetition positively affects knowledge stability, consistent users should have more durable memories even after reviews of learned material have ceased.

1 https://www.supermemo.com/
2 Defined in Criteria for inclusion: SuperMemo group.
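Hypothesis B can be illustrated with a minimal sketch. It assumes that variability of daily usage is measured as the standard deviation of each student's daily repetition counts and that the relationship is summarized by a Pearson correlation coefficient. Both choices, as well as the student names and figures, are hypothetical and are not prescribed by this protocol.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: daily repetition counts per student and exam grade (0-30 scale).
daily_reps = {
    "student_a": [50, 52, 48, 51, 49],   # steady usage
    "student_b": [10, 90, 0, 120, 30],   # irregular usage
    "student_c": [40, 60, 45, 55, 50],
}
grades = {"student_a": 29, "student_b": 21, "student_c": 26}

students = sorted(daily_reps)
# Assumed measure of variability: standard deviation of the daily counts.
variability = [pstdev(daily_reps[s]) for s in students]
r = pearson(variability, [grades[s] for s in students])
print(f"correlation between usage variability and grade: {r:.2f}")
```

Under hypothesis B, the coefficient computed on real data would be expected to be negative.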

Study design

This non-randomized controlled intervention study will be conducted during the 2021-2022 anatomy class at the Padua Faculty of medicine in Italy. The study is divided into two phases.

At the time of writing, the authors are waiting for Padua university's thesis committee to arbitrate which data are permissible in this study. To accommodate this uncertainty, we define several scenarios in a later chapter.

Phase one

In parallel with their normal scholarship, students will be distributed into two groups. The SuperMemo Group (SG) will study their anatomy course material using flashcards3 that have been collaboratively prepared by the students4 of this group and published on the SuperMemo (SM) web platform. The Traditional Group (TG) or control group will study anatomy course material using traditional methods of learning5.

All students of the 2021-2022 anatomy class will participate in the study, and each student will have the choice of joining either the TG or the SG. A presentation will be given before the beginning of the course’s term to introduce SuperMemo, the principles of spaced repetition, and the objectives of the study.

Depending on the scenario, the authors of this study may conduct additional simulation tests involving some or all of the students participating in phase one. These simulation tests will cover the same material as the official examinations given by the Padua university. If conducted, these tests will provide a more accurate and detailed measure of the students’ understanding of the course material.

Samples from the TG and SG will be analysed and compared to test hypotheses A and B.

The sample design for phase one is a non-randomized, two-group, unpaired relative comparison. No blinding will be involved.

3 A card bearing information, each with a question and an answer field.
4 Defined in Course material.
5 Defined in Traditional learning.

Phase two

In the aftermath of their 2020-2021 scholarship, students participating in phase one will be offered the opportunity to participate in a follow-up test. The test will cover the same questions as the simulation tests conducted in phase one.

Samples from the TG and SG will be analysed and compared to test hypothesis C.

The sample design for phase two is a non-randomized, two-group, paired relative comparison with repeated measures.

Considerations about samples

The target population (TP) of this study consists of medical students who employ traditional learning methods.

We will infer properties about the TP from one population and two sub-populations:

1. Population 1: all students from the 2021-2022 anatomy class (phase one),
2. Sub-population 1: a subset of students from population 1 participating in the simulation tests,
3. Sub-population 2: a subset of students from population 1 participating in the follow-up test (phase two).

Within these populations we define an additional dimension for enrolment within either the SuperMemo group or the traditional group. For the remainder of this paper, we define the terminology:
• Sampled population or sub-population (SP) refers to the entire group of students participating in the test under consideration,
• Traditional group (TG) and SuperMemo group (SG) refer to the subset of students within a SP filtered by their membership in either group.

We use this compartmentalization to systematically consider errors introduced by each layer of the selection process. For example, the selection process for student scholarships differs significantly across academic institutions. Prestigious universities often exercise strict control on surface criteria such as academic success or financial means, which may correlate with features such as personal motivations or social status.

Therefore, it is not evident that students of a given institution are representative of the general population. Furthermore, any subsequent sample would inherit the same attributes and likely produce biased results. We address these concerns in the later chapters about the representativity of samples.

We begin by defining the qualities of TP and SP, as well as SG and TG. Each of these sections will also discuss the possible biases introduced at each stage of the study.

[Figure: diagram of the sampling process, with labels: Population, Selection, Sample units, Measurement, Sample data, Estimation, Inferred.]

Target population, sampled populations and sub-populations

SP 1 consists of the entire 2021-2022 anatomy class. Samples from this population will be in the form of two sets of grades, corresponding to exams taken in July and September 2021. They are conducted by the Padua faculty of medicine and are mandatory for graduation.

SP 2 consists of a subset of students from SP 1 that are participating in the simulation tests. Samples from this population will be in the form of grades corresponding to the successive simulation test. Depending on the scenario, the simulation tests may occur 3 to 21 days before or after Padua university's official tests. Questions will cover the same corpus of knowledge as the official examinations. The simulation tests do not have any direct6 influence on the graduation process.

SP 3 consists of a subset of students from SP 1 participating in a follow-up test. Participation will be voluntary. Samples from this population will be in the form of a single set of grades, and questions will target the entire corpus of knowledge from the July and September exams. The test is conducted by the authors of this study, and it will not have any influence on the students’ graduation process.

6 No influence other than helping students assess their readiness for the official test, if taken beforehand.

All of these sampled populations and sub-populations are further subdivided within two groups: the TG and the SG.

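[Figure: nested populations. The target population (size: more than 90,000) contains sampled population 1, the students of Padua’s 2021-2022 anatomy class taking the Faculty’s anatomy exams (size: 336 students). Sampled population 1 contains sub-population 1, simulation tests (size: unknown), and sub-population 2, 6 months follow-up test (size: unknown). Students not part of Padua’s 2021-2022 anatomy class are not included in the sampling frame.]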

The size of the TP is a conservative estimate based on a paper by (Curtoni & Sutnick, 1995): There is approximately one first-year medical student per 10,000 people in Europe, compared with one per 15,000 people in the United States.

Assuming at least similar proportions of first-year students, given the 1995 populations of 267 million in the U.S. and 728 million in Europe, and given the strict growth of their respective populations after 1995 based on public data for years 1995-2020, we calculate:

\[ \operatorname{card}(TP) > \frac{267\,000\,000}{15\,000} + \frac{728\,000\,000}{10\,000} = 90\,600 \]

We expect yet larger numbers for the TP when factoring in countries and continents other than America and Europe. Moreover, this estimate only includes first-year students. Medical scholarships often last many years, therefore the number of students who could benefit from the learning treatment at any given time would be significantly greater than 90,600.
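The arithmetic behind this lower bound can be checked with a short sketch. The population figures and student ratios are the ones quoted above; the variable names are only illustrative.

```python
# Lower-bound estimate of the number of first-year medical students in the TP.
# Population figures and ratios are the ones quoted in the text.
US_POPULATION_1995 = 267_000_000
EUROPE_POPULATION_1995 = 728_000_000
PEOPLE_PER_STUDENT_US = 15_000       # one first-year student per 15,000 people
PEOPLE_PER_STUDENT_EUROPE = 10_000   # one first-year student per 10,000 people

us_students = US_POPULATION_1995 // PEOPLE_PER_STUDENT_US
europe_students = EUROPE_POPULATION_1995 // PEOPLE_PER_STUDENT_EUROPE
tp_lower_bound = us_students + europe_students

print(us_students, europe_students, tp_lower_bound)  # 17800 72800 90600
```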

At the time of writing, the sizes of SP 2 and 3 are unknown. If no form of incentive is offered in exchange for participation, we expect participation levels to be low.

Representativity of SP 1

To evaluate the representativity of SP 1 with respect to our hypotheses and the TP, we consider the following factors and whether they contribute to sampling and non-sampling error7:

Non-sampling error: curriculum, teaching methods, studied discipline, geographic location, and randomness of scholarship.

Curriculum may differ between medicine faculties. Namely, the Padua faculty of medicine is a prestigious university, and its curriculum may differ significantly from other institutions. Therefore, the efficacy of the learning treatment may differ as well.

The applicability of spaced repetition as defined in this paper may vary by discipline. However, medical scholarship relies primarily on memorization, and the learning treatment may generalize well to other classes.

Padua is one of the wealthiest regions in Italy with a rich cultural history. The Padua faculty is a highly ranked medical university in Italy, and tuition fees average around 2500€ per year.

If data about the demographic background of students studying at the Faculty are made available for this study, we may examine how these parameters affect representativity.

Ideally, the sampled population would have students from many schools. However, in these circumstances, we have to consider the possibility that SP1 may not be representative of the TP.

Sampling error: Attrition.

Because the official examinations are required for graduation, we expect all students to participate. Attrition should be minimal and well under 20%.

Representativity of SP 2

For SP 2, we carry over the analysis of SP 1 and consider additional factors of sampling error.

Depending on the scenario, participation in the simulation test may be voluntary. Without an incentive, SP 2 may be significantly smaller than SP 1, with participation rates well under 80%.

Due to the additional memory consolidation, participation in simulation tests may improve scores on the official exam. To mitigate this issue, we will avoid pre-exam simulation tests when possible.

7 Errors that can be attributed to inter-sample variability. In other words, errors due to considering a sample instead of the entire population.

Representativity of SP 3

For SP 3 we carry over the analysis of SP 1 and consider additional factors of sampling error.

For ethical reasons, participation in the long-term follow-up test will be voluntary. Due to the absence of intrinsic and extrinsic incentives, we expect a substantial loss of participants. SP 3 may be significantly smaller than SP 1, with attrition rates well over 20%.

Other factors may play a role in students’ decision to participate in the follow-up test. For example, participants from the SG who scored poorly during exams may be less enthusiastic about offering their time if they perceive the learning treatment as a potential cause of their poor results.

In addition to loss of participants, loss to follow-up may occur. Contact information will be shared by students on a voluntary basis and may be erroneous (e.g., a misspelled email address) or outdated (e.g., the student may have changed contact address by the time the follow-up test is conducted).

We conclude that there exists a non-negligible chance of SP 3 not being representative of the TP.

Representativity of the SuperMemo group

To evaluate the representativity of the SG, we carry over the analysis of SP1, SP2, and SP3 and consider the following factors: self-selection, over-coverage, attrition, randomness.

We then discuss the means of mitigating error introduced by the sampling method.

Self-selection

Due to ethical reasons8, students will freely decide whether to join the SG or not. Students who do not join the SG will be assigned to the TG by default. The process of student self-selection introduces confounding variables that interfere with the ability to draw causal claims of SuperMemo’s impact on grades.

For example, conscientious students may be more likely to join the SuperMemo group because of its potential to improve their grades. At the same time, assiduity might positively correlate with higher grades regardless of studying method. A positive relationship between SuperMemo usage and higher grades might then be explained in part or totality by the assiduity of students in the SG when compared with the TG.

8 Discussed in Ethical considerations.

Over-coverage

We have identified a possible case of over-coverage where repeat students taking the anatomy class for the second time are introduced into the sample without normalization. Including only their second attempt may introduce bias. We have examined three strategies for dealing with such cases:
1. Including grades from both years,
2. Including a normalized grade based on both years,
3. Excluding grades from the sample.

Including grades from both years and excluding grades may aggravate the bias by respectively over- and under- representing certain grades. For example, the student may score very poorly in the first year and very well on the second year, or very poorly both years.

Including a normalized grade involves aggregating grades from both years, for example by calculating the mean grade. This process may also introduce bias when grades from both years cannot be compared.

For example, let us assume the mean scores of the 2020-2021 and 2021-2022 classes are respectively equal to 10 and 12 on a scale of 20. A student who scored 11 in the first class may have scored differently (e.g., 13) if he instead participated in the second class. Calculating the mean of 11 and 13 wouldn’t be representative of his true score.

To compensate for this difference, scores may be adjusted by using the difference between the means of both years. Let $u_1$ and $u_2$ be the mean grades of the first- and second-year classes respectively. We define the difference $d = u_2 - u_1$ and

\[ f(x_1, x_2) = \frac{(x_1 + d) + x_2}{2} \]

the aggregate function of grades from both years, where $x_1$ and $x_2$ are the student’s first- and second-year grades respectively.

However, the normalization function f assumes a linear relationship between grade and mean grade, which may not always be the case. For example, the spread of grades may grow non-linearly with distance from the mean, on either or both sides of it. In other words, a student who scores below the average grade may score increasingly poorly as his performance departs from the mean.
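The aggregate function f defined above can be written as a short sketch. The function name and argument names are illustrative; the worked example reuses the hypothetical figures from the text (class means of 10 and 12, a first-year grade of 11, and a second-year grade of 13).

```python
def normalized_grade(x1: float, x2: float, u1: float, u2: float) -> float:
    """Aggregate a repeat student's two grades.

    The first-year grade x1 is shifted by d = u2 - u1, the difference between
    the second- and first-year class means, then averaged with x2.
    """
    d = u2 - u1
    return ((x1 + d) + x2) / 2


# Worked example from the text: class means 10 and 12, grades 11 and 13.
example = normalized_grade(x1=11, x2=13, u1=10, u2=12)
print(example)  # 13.0
```

In this example the shifted first-year grade (11 + 2 = 13) coincides with the second-year grade, so the aggregate equals 13 instead of the unadjusted mean of 12.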

We have chosen to use normalized grades for our analyses for this group of participants, though we will note where this leads to different conclusions compared to including grades from both years, or excluding grades from the sample.

Attrition

We expect two modes of attrition to occur in SP1 during the course of the study: false positives and loss of participants.

False positives are a benign form of attrition which, in this study, results from the advertisement effect of SuperMemo during the presentation to the students. The difficulty and intensity of medical scholarship generate high-stakes situations for students and their careers, so any asset that can provide a competitive advantage is prized.

The premise of this study is that SuperMemo may provide such an advantage. We expect to find a category of students who, based on the announcement effect, will seize the opportunity and then drop out of the learning treatment shortly after the beginning of the study. Another, closely related category of students who inconsistently follow the learning treatment may also emerge.

A previous study comparing learning efficiency between two groups similar to the ones defined in this study reported a similar phenomenon in a different context (Łodyga, 2011).

We define criteria for inclusion of samples from students of the SG later in this paper9. Samples that do not meet inclusion criteria for the SG will be removed from the sample.

Using data from the survey, we will examine the explanatory variables of students removed from the SG sample and filter out students from the TG who meet similar criteria. We will note where this leads to different conclusions compared to including all the samples that would otherwise not meet inclusion criteria.

Loss of participants is another form of attrition where participants who meet criteria for inclusion in the SG interrupt their learning treatment. We have determined at least three strategies for dealing with it:
1. Excluding the samples,
2. Including the samples,
3. Including a part of the sample.

Excluding the sample introduces bias by ignoring or under-covering certain categories of students. For example, after receiving poor grades at the first exam a student may drop out of the learning treatment if he judges it to be the cause, in part or in totality. In the case that the student’s judgement was correct, excluding his sample causes sampling error by under-representing samples with negative relationships between the learning treatment and higher grades.

9 Defined in Criteria for inclusion: SuperMemo group.

Including the samples introduces similar bias by over-covering certain categories of students. For example, if the student’s perception was incorrect, including his sample causes sampling error by over-representing samples with negative relationships between the learning treatment and higher grades.

Including a part of the sample is an intermediate strategy. When a student meets the inclusion criteria for only one of the tests, the grade from that test is included in the SG and the grade from the other test is included in the TG. This strategy reduces sampling error but does not completely eliminate it. This is the strategy we will apply.

Course material

Due to the volume of material required for the anatomy class (estimated around 20,000 flashcards), course material used on the SuperMemo.com platform will be created collaboratively by the students. This process is defined in Course material.

The creation process is spread amongst 180

Addressing bias

In regression models, it is the correlation between unobserved determinants of the outcome and unobserved determinants of selection into the sample that biases estimates, and this correlation between unobservables cannot be directly assessed by the observed determinants of treatment (Heckman, 1979).

If at its conclusion, the present study finds a positive and significant relationship between usage of SuperMemo and increase in grades, its parameters preclude us from establishing a causal relationship based on the relative comparison of the SG and the TG.

To address this limitation, we introduce methods of dealing with selection bias. While they do not eliminate the error introduced by self-selection, these techniques may help identify its presence in the SG and mitigate its effects during the analysis.

Estimating selection bias

There are no human-accessible means of exhaustively listing unobserved determinants. This is a generic problem of understanding processes whose nature is opaque. Experiments dealing with natural sciences rely on induction to progressively reach increasing levels of confidence in their hypotheses.

In that sense we propose two approaches to identify determinants of grades in medical scholarship and generate weights to normalize the samples.

The choice of the method will be determined by the data available at the time of conducting the analysis, as defined in Scenarios.

If both methods are eligible, we will conduct both methods. In cases where at least one method is eligible, we will also publish results without factoring in the weights.

Comparing grades across disciplines

Using data provided by the Padua faculty, we will compare grades of disciplines distinct from the anatomy class scored by students of SP 1 with grades scored during the anatomy class.

Students who systematically score better than the mean grade may see their anatomy grade adjusted accordingly by calculating weights. The converse applies to students with grades lower than the mean.

Survey

A survey will be distributed amongst the students of SP 1 to sample the variables defined in Annex A.

We will test whether the variables listed in the table are predictors of the grades obtained from SP 1, and estimate their respective weights.

A regression analysis will be run in combination with an F-test to determine the relationship between the explanatory variables and the grades. In preparation, diagnostic tests will be executed to ensure the chosen variables and the modalities of the regression do not violate the underlying assumptions of regression.
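As an illustration of the kind of analysis described above, the following minimal sketch fits an ordinary least squares line for a single explanatory variable and computes the corresponding F-statistic. It is a simplification: the actual analysis would involve several survey variables, and the variable chosen here (weekly study hours) and all figures are hypothetical.

```python
from statistics import mean


def simple_regression_f(xs, ys):
    """Fit y = a + b*x by ordinary least squares.

    Returns (a, b, F), where F is the F-statistic of the regression
    with (1, n - 2) degrees of freedom.
    """
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_reg = sum((a + b * x - my) ** 2 for x in xs)                # explained variation
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # residual variation
    f_stat = (ss_reg / 1) / (ss_res / (n - 2))
    return a, b, f_stat


# Hypothetical survey variable (weekly study hours) and grades on the 0-30 scale.
hours = [5, 8, 10, 12, 15, 18, 20]
grades = [19, 21, 22, 25, 24, 28, 29]
a, b, f_stat = simple_regression_f(hours, grades)
print(f"intercept={a:.2f}, slope={b:.2f}, F={f_stat:.1f}")
```

The F-statistic would then be compared with the critical value of the F distribution for (1, n - 2) degrees of freedom at the chosen significance level.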

Low participation rates may introduce bias in the relationships between the explanatory variables and the grades. For transparency, we will report the participation rate in the final paper.

Mitigating selection bias

While it may be impossible to eliminate selection bias entirely, we propose an alternative method of analysis.

We have enumerated several biases introduced during the selection of our SP and the selection of students for the TG and SG. These biases apply when comparing between samples of the TG and the SG.

To address this problem, when the scenario is applicable, we will conduct an additional analysis on an existing set of data consisting of the mean grades for classes of the anatomy discipline taking place during years strictly preceding our 2020-2021 class. This set of data will be provided by the Padua faculty. We will compare the mean grade of the 2020-2021 anatomy class with the sample mean of the mean grade distribution of these past years.

This approach offers evidence of a different nature to further support or reject hypothesis A. It however bears its own weaknesses which we describe in Representativity of the anatomy exams across years.
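One possible way to carry out this comparison is sketched below. It assumes the comparison is expressed as a standardized distance (z-score) between the studied class's mean grade and the distribution of past yearly means; this choice of statistic is an assumption, and all figures are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical yearly mean grades (0-30 scale) for past anatomy classes.
past_yearly_means = [24.1, 23.8, 24.5, 24.0, 23.6, 24.3]
current_year_mean = 25.2  # hypothetical mean grade of the studied class

historical_mean = mean(past_yearly_means)
historical_sd = stdev(past_yearly_means)  # sample standard deviation
z = (current_year_mean - historical_mean) / historical_sd

print(f"historical mean={historical_mean:.2f}, sd={historical_sd:.2f}, z={z:.2f}")
```

A large positive value would indicate that the studied class scored unusually well relative to past years, subject to the weaknesses of cross-year comparison noted above.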

Representativity of the traditional group

To evaluate the representativity of the TG, we carry over the analysis of SP1, SP2 and SP3 and consider the following factors: over-coverage, attrition.

Over-coverage

We have identified one case of possible over-coverage within the TG. Students who do not choose to follow the learning treatment may still wish to use a spaced repetition software of their choice for studying the course material.

To mitigate that issue, we have included a question in the survey to identify such cases. There is, however, a risk that this process cannot filter out all the students using spaced repetition in the TG. In that case there is an increased risk of making a type II error.

Attrition

The likelihood that students from the TG drop out of the study is equivalent to the chance of students dropping out from the anatomy class. Because students have a high incentive to successfully graduate, we expect attrition for the TG to be minimal.

General considerations on the representativity of both groups

In addition to the biases discussed for each group in the previous chapters, we have examined errors that may occur simultaneously in both groups.

Measurement bias

We consider two moments of the measurement process:

1. Assigning a grade to the work produced by the student,
2. Accessing the grade as part of the study.

Grade assignment can be cause for measurement bias.

In ideal experimental conditions, testing would apply to individual pieces of knowledge that can either be tested to be true or false. Examples of exact knowledge include dates, mathematical constants, geographical data, etc.

Human intervention is a factor of bias when assessing individual questions and assigning them a value based on a non-deterministic process. Such a process typically involves evaluating complex information, answers to ambiguous questions or questions that accept multiple answers. Examples of non-exact knowledge include vocabulary, compound information, problem solving, etc.

This study involves two distinct measurement processes, each repeated twice:
1. Official examination conducted by the university,
2. Simulation test conducted by the authors of this study.

Official examinations are least susceptible to selection bias, but most susceptible to measurement bias. Due to the compulsory nature of official examinations and their requirement for successfully graduating we expect all students to partake. The tests are made up of open questions where students are expected to summarize their knowledge into concise descriptions of the requested systems of the human anatomy.

We expect bias in the measurement based on at least two factors. First, grade assignment is a non-deterministic process relying on personal interpretation by the reviewer. We expect grades to differ from the theoretical value they would receive if graded objectively against a specific evaluation grid. However, because the same reviewer will grade all the tests in random order, we expect these differences to be evenly distributed across students. In that case, the grades will remain representative when coalesced into a single value through operations such as mean calculation. Second, the teacher’s personal appreciation of the student whose test is being graded might factor into the grading process. Because tests are not anonymized, it is possible that bias might be introduced into the samples.

Simulation tests are additional exams conducted by authors of this study. They may be susceptible to selection bias in scenarios where participation is voluntary. They are however least susceptible to measurement bias. The questions included in the tests will specifically target individual pieces of knowledge with the intent of limiting subjectivity in grading.

Accessing the grade is the process of retrieving the grade (sample) for usage in the study. For the purpose of this study, the Faculty of Padua will communicate all the grades and the internal identification number of the student. We have not identified cause for bias at this stage.

Representativity of the anatomy exams across years

We have identified a risk of measurement bias in the anatomy grade samples obtained from previous years up until the date of this study, the 2020-2021 anatomy class.

Grades may not be fair measures for comparison across years because their value depends on a complex evaluation process by human beings, as established in Measurement bias. In addition, the context within which students go through their scholarship may change from one year to the other, or even within a single year. Examples include:
- The teacher may be a different person from one year to the other,
- A teacher may refine his teaching methods over the years,
- Outside circumstances such as economic or political crisis may have broad impacts on the students’ lives.

The content of the official examinations conducted by the Padua university has remained identical for the last few decades, and therefore shouldn’t introduce additional bias. We will only select samples from years in which the tests remained identical and comparable.

The anatomy class of 2019-2020 is a particular case where the tests and teaching methods differed as a consequence of the sanitary measures taken against Covid-19 in Italy. Data from this year will be excluded from the samples.

Course material

A new type of study material, the “Sbobine”, is gaining popularity amongst students of the Faculty of Medicine and Surgery in Italy. They are an alternative to the traditional textbook that students can use independently from official material and textbooks. The Sbobine are documents transcribed by students from recordings of lessons given by teachers of their university. They are subsequently supplemented with slides and excerpts from manuals relating to the subject (Fabrizio Consorti, 2018).

SuperMemo Group

At the onset of phase one, students from the SG will collaborate to create flashcards using the Sbobine as their main source material. The creation process will be guided and monitored by the authors of this paper. However, students will also be able to use books and personal notes where they deem necessary.

The two most common books in the Faculty of Medicine and Surgery in Padova, which are also recommended by the professors, are Gray’s Anatomy and Human Anatomy (Anastasi et al.).

Once the process of creating flashcards reaches its conclusion, the material will be uploaded on the supermemo.com platform and studying will begin.

Traditional Group

The traditional group will not have access to flashcards; they will keep studying as they already did in the past, without any intervention.

Formulation of knowledge

Formulation of knowledge is a key aspect of understanding and memorizing knowledge. Complex or poorly worded flashcards may significantly decrease the student’s ability to learn the course material (Giorgio Bolondia, 2018).

The specific material created for the SuperMemo.com platform is an integral part of the effect being tested in this paper. Students participating in the creation process will receive a short guide to formulation prepared by authors of this study.

Additional considerations

This is a pilot study for renewable research that will be gradually improved. The intent of the authors is to draw attention to techniques that could improve education in academic institutions and the students’ well-being.

In addition, this study is part of a project to establish a template for repeatable models of research using spaced repetition and (Alexis, 2021).

Weaknesses in the parameters of the study constrain it to correlational relationships.

Traditional learning

Different learning paradigms are listed throughout this paper. We find it necessary to define the scope of what "traditional" and "non-traditional" learning entail.

Research by (Dunlosky, 2013) on effective learning techniques identified 10 learning techniques: elaborative interrogation, self-explanation, summarization, highlighting, keyword mnemonic, imagery for text learning, re-reading, practice testing, distributed practice, and interleaved practice.

We define traditional learning as the use of any of the aforementioned techniques in mutual exclusion with use of the spaced repetition technique. Moreover, use of spaced repetition without high utility techniques as defined by (Dunlosky, 2013) is still considered traditional learning. For example, applying spaced repetition to re-reading or highlighting is considered part of traditional learning in the scope of this paper.

Spaced repetition

Spaced repetition is a technique for organizing a student’s learning calendar in the most optimized way possible. Spaced repetition has two functions:
- Remember knowledge for as long as the student desires,
- Spend the least amount of time possible learning each individual piece of knowledge.

Spaced repetition is used in combination with testing at retrieval, also called practice testing in (Dunlosky, 2013). Spacing slows forgetting and therefore improves the likelihood of later retrieval. Testing postpones forgetting by prompting a user to retrieve a memory, thereby consolidating it (Arnold, 2013).

Spaced repetition naturally integrates the effects of distributed and interleaved practice. It constitutes an amalgam of both high and moderate utility techniques coupled with algorithmic principles to anticipate forgetting.

SuperMemo-18 Algorithm

The SuperMemo.com platform implements the latest algorithm (SM-18) developed by the company (Krzysztof Biedalak, Piotr Woźniak, 2017). This latest iteration is based on three components of memory:

- The R component, or retrievability, defines the probability of recall,

- The S component, or stability, determines how fast retrievability drops over time. Stability increases at each review,

- The D component, or difficulty, defines the complexity of a memory and is used to determine how hard it is to increase the stability of that memory.

Statistical protocol

Scenarios

At the time of writing, the authors are waiting for Padua university's thesis committee to arbitrate which data are permissible in this study. To accommodate this uncertainty, we define several scenarios.

| Scenario n° | Available grade range [10] | Granularity | Available period |
|---|---|---|---|
| S1: University database | Grades ≥ 18. Grades < 18 are marked as “fail”. | Single grade calculated as the average of all exams taken during the two semesters of the anatomy class. | Current year [11]. |
| S2: Published scores | Grades ≥ 18. Grades < 18 are marked as “fail”. | Individual grade for each exam taken during the anatomy class. | Current year. |
| S3: Professor’s ledger | Full grade range, from 0 to 30. | Individual grade for each exam taken during the anatomy class. | Current year. |
| S4: Professor’s ledger | Full grade range, from 0 to 30. | Individual grade for each exam taken during the anatomy class. | Current and previous years. |
| S5: Physical paper | Full grade range, from 0 to 30. | Individual mark for each question of each exam taken during the anatomy class. | Current year. Possibly previous years. |
| S6: Simulation test [12] | Full grade range. | Individual mark for each question of each exam taken during the simulation test. | Current year. |

These scenarios are non-exclusive and may be combined to obtain the best data scope.

Sampled data

Data sampled during the course of this study refers to samples obtained from the TP1 and TP2. Additional, pre-existing data used during the analysis is described in chapter Existing data.

The variables sampled for testing each of the three hypotheses are described in the table below. The hypotheses columns indicate the variables used when testing a given hypothesis.

[10] The grading scale for examinations consists of a score between 0 and 30. However, for ethical reasons grades under 18 are obfuscated from public knowledge and replaced with a single numeric value equating to “fail”.

[11] Class of 2020-2021.

[12] Simulation tests are independent examinations conducted by the authors of this study that do not count towards the grading system of Padua university.

Independent variables

| Variable | Hypothesis A | Hypothesis B | Hypothesis C |
|---|---|---|---|
| Following the learning treatment? | Yes | Yes | Yes |
| Number of days studied | | Yes | |
| Average size of outstanding queue | | Yes | |
| Percentage of course material studied | | Yes | |

Following the learning treatment refers to the binary variable which takes the value 1 when the sample belongs to a student in the SG, and 0 when it belongs to a student in the TG.

Number of days studied refers to the duration in days of one of two periods. Each period corresponds to one of the two exams. For each student that period is bounded by the date when the first review took place and the date when the last review before the corresponding exam took place.

Average size of the outstanding queue refers to the number of flashcards the student still had to review at the end of every day for one of the two periods.

Let $S$ be the collection of outstanding queues (each a set of flashcards), one for every day in the studying period, $n = \mathrm{card}(S)$ the number of days in the studied period, and $\mathrm{card}(Q)$ the size of the outstanding queue for a given day, with $Q \in S$.

The average size of the outstanding queue is defined by:

$$\mu = \frac{1}{n} \sum_{Q \in S} \mathrm{card}(Q)$$

Percentage of course material studied refers to the number of flashcards learned by the student divided by the total number of flashcards for one of the two periods.
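The three usage variables described above can be sketched in Python as follows. This is only an illustration: the `StudyPeriod` record and its field names are hypothetical and do not correspond to any SuperMemo export format, and counting both the first and the last review day in the duration is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class StudyPeriod:
    """Hypothetical per-student record for one exam period."""
    first_review: date             # date of the first review
    last_review: date              # date of the last review before the exam
    daily_queue_sizes: List[int]   # size of the outstanding queue at the end of each day
    cards_learned: int             # flashcards learned by the student
    cards_total: int               # total flashcards for this period


def days_studied(p: StudyPeriod) -> int:
    # Assumption: both bounding dates are counted as studied days.
    return (p.last_review - p.first_review).days + 1


def average_outstanding_queue(p: StudyPeriod) -> float:
    # Sum of the daily queue sizes divided by the number of days.
    return sum(p.daily_queue_sizes) / len(p.daily_queue_sizes)


def percentage_studied(p: StudyPeriod) -> float:
    return 100.0 * p.cards_learned / p.cards_total


period = StudyPeriod(date(2021, 10, 1), date(2021, 10, 4), [10, 20, 30, 20], 50, 200)
print(days_studied(period), average_outstanding_queue(period), percentage_studied(period))
```

With the made-up record above, the period spans 4 days, the average queue holds 20 flashcards, and 25% of the material has been studied.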

Dependent variables

Only a single dependent variable is sampled for all the hypotheses: the grade scored by students at each exam.

Criteria for inclusion: SuperMemo group

We expect a subset of students within the SG to either cease the learning treatment shortly after the beginning of the study, or to complete only a small portion of the online course. We define thresholds beyond which we consider the effect of the learning treatment to be negligible.

For each exam period, a sample is included only when both of the following criteria are met:

1) The percentage of course material studied is above 5% of the total course material,

2) The number of days studied is greater than the number of weeks leading up to that exam period.

This only applies to between-groups (SG and TG) analysis. Data used for between-years analysis is not subject to those criteria.
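As a minimal sketch, the two SG inclusion criteria could be applied to each sample with a filter such as the one below. The function and parameter names are hypothetical, and `weeks_before_exam` stands for the number of weeks leading up to the exam period, however the authors end up counting it.

```python
def meets_inclusion_criteria(percentage_studied: float,
                             days_studied: int,
                             weeks_before_exam: int) -> bool:
    """Return True when an SG sample satisfies both inclusion criteria."""
    enough_material = percentage_studied > 5.0       # criterion 1: above 5%
    enough_days = days_studied > weeks_before_exam   # criterion 2: more days than weeks
    return enough_material and enough_days


print(meets_inclusion_criteria(25.0, 20, 12))  # both criteria met
print(meets_inclusion_criteria(5.0, 20, 12))   # exactly 5% is not "above 5%"
```

Both comparisons are strict, following the wording "above" and "greater than".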

Criteria for inclusion: Traditional group

Samples from students in the TG are eligible for inclusion in the data set when the following criteria are met:

1) The student does not follow the learning treatment,

2) The student meets the criteria defined in the chapter Traditional learning (e.g. the student cannot use another spaced repetition tool to practice for the exams).

This only applies to between-groups (SG and TG) analysis. Data used for between-years analysis is not subject to those criteria.

Existing data

Grades scored at the anatomy exams during previous years may be provided by the Padua faculty of medicine in the context of this study. If available, this data will be compared against samples from this year in the analysis of hypothesis A.

Hypothesis A: Increased average grade for memorization-intensive examinations

Depending on the data available as defined in chapter Scenarios (S1 to S5), we will conduct up to two different forms of analysis: between-group comparison and between-years comparison.

Between-group comparison

Between-group refers to the TG and SG. We will use samples collected from both groups to compare the students’ results.

Scenarios 1 and 2

Using the nominal samples defined by PASS and FAIL obtained from the SG and TG, we compare the proportions of passing and failing students in the two groups. We use the chi-squared statistic to analyse the data.

Let:

- $H_0$ the null hypothesis defined as: there is no difference between the proportion of students passing in the TG and in the SG,

- $H_1$ the alternative hypothesis defined as: there is a greater proportion of students passing in the SG than in the TG.

We have:

- $p_{pT}$ the proportion of passing students in the TG,

- $p_{fT}$ the proportion of failing students in the TG,

- $n_S$ the number of samples in the SG,

- $n_{pS}$ the number of passing students in the SG,

- $n_{fS}$ the number of failing students in the SG,

- $df$ the degrees of freedom defined by $df = \text{number of categories} - 1 = 1$.

We define:

- $E_{pS}$ the expected number of passing students in the SG when $H_0$ is correct, defined by:

$$E_{pS} = n_S \cdot p_{pT}$$

- $E_{fS}$ the expected number of failing students in the SG when $H_0$ is correct, defined by:

$$E_{fS} = n_S \cdot p_{fT}$$

We choose a significance level $\alpha = 5\%$, with a chi-square critical value $\chi^2_{crit} = 3.841$.

If we assume $H_0$ is correct, the probability that the inequality below holds is less than 5%:

$$\chi^2 = \frac{(n_{pS} - E_{pS})^2}{E_{pS}} + \frac{(n_{fS} - E_{fS})^2}{E_{fS}} > \chi^2_{crit}$$

If the inequality holds, we reject $H_0$. In that case, we can be reasonably confident $H_1$ is correct.
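The pass/fail comparison described above can be sketched as follows. The counts in the example are invented for illustration only, and the function name is hypothetical; the critical value 3.841 is the one quoted in the text.

```python
CHI2_CRIT_DF1 = 3.841  # critical value quoted in the text (one degree of freedom, 5%)


def chi_squared_pass_fail(pass_tg, fail_tg, pass_sg, fail_sg):
    """Chi-squared statistic comparing SG counts with the TG proportions."""
    n_tg = pass_tg + fail_tg
    n_sg = pass_sg + fail_sg
    # Expected SG counts if the SG passed and failed at the TG rates.
    expected_pass = n_sg * (pass_tg / n_tg)
    expected_fail = n_sg * (fail_tg / n_tg)
    return ((pass_sg - expected_pass) ** 2 / expected_pass
            + (fail_sg - expected_fail) ** 2 / expected_fail)


# Invented counts: 60 pass / 40 fail in the TG, 40 pass / 10 fail in the SG.
chi2 = chi_squared_pass_fail(pass_tg=60, fail_tg=40, pass_sg=40, fail_sg=10)
reject_h0 = chi2 > CHI2_CRIT_DF1
print(round(chi2, 2), reject_h0)  # about 8.33, above 3.841
```

The statistic itself does not carry a direction, so in practice one would also check that the pass rate in the SG is the higher of the two before interpreting a rejection as support for the alternative hypothesis.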

Scenarios 3, 4 and 5

Using the grades obtained from the SG and the TG, we compare the sample means of their respective probability distributions. We use the t-test statistic to analyse the data.

Let the random variables:

- $X_T$ representing grades from students using the traditional learning methods;

- $X_S$ representing grades from students following the learning treatment.

We have:

- $\bar{X}_T$ the mean score of the TG;

- $\bar{X}_S$ the mean score of the SG;

- $m$ the sample size of the TG;

- $n$ the sample size of the SG.

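We define:

- $\mu_{\bar{X}_T}$ and $\mu_{\bar{X}_S}$ the mean values of the sampling distributions of $\bar{X}_T$ and $\bar{X}_S$;

- $\mu_{X_T}$ and $\mu_{X_S}$ the mean values of their true populations;

- $\sigma_{\bar{X}_T}$ and $\sigma_{\bar{X}_S}$ the standard deviations of the sampling distributions of $\bar{X}_T$ and $\bar{X}_S$;

- $\sigma_{X_T}$ and $\sigma_{X_S}$ the standard deviations of their true populations;

- $S_T$ and $S_S$ respectively the sample standard deviations of the TG and the SG.

We define $d = \bar{X}_S - \bar{X}_T$ the difference between the mean scores.

Building on standard results about sampling distributions and using the central limit theorem, we have the relationships:

- $\sigma^2_{\bar{X}_S} = \frac{\sigma^2_{X_S}}{n} \approx \frac{S_S^2}{n}$;

- $\sigma^2_{\bar{X}_T} = \frac{\sigma^2_{X_T}}{m} \approx \frac{S_T^2}{m}$;

- $\sigma^2_d = \sigma^2_{\bar{X}_S - \bar{X}_T} = \sigma^2_{\bar{X}_S} + \sigma^2_{\bar{X}_T} = \frac{\sigma^2_{X_S}}{n} + \frac{\sigma^2_{X_T}}{m} \approx \frac{S_S^2}{n} + \frac{S_T^2}{m}$;

- $\sigma_d = \sigma_{\bar{X}_S - \bar{X}_T} \approx \sqrt{\frac{S_S^2}{n} + \frac{S_T^2}{m}}$;

- $\bar{X}_T \approx \mu_{\bar{X}_T} = \mu_{X_T}$;

- $\bar{X}_S \approx \mu_{\bar{X}_S} = \mu_{X_S}$;

- $\mu_d = \mu_{\bar{X}_S} - \mu_{\bar{X}_T} \approx \bar{X}_S - \bar{X}_T$.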

To test hypothesis A, we need to verify that $d > 0$ and that $d$ is sufficiently large to be statistically significant. Additionally, we will calculate a confidence interval to better understand by how much we can expect the learning treatment to improve students’ grades.

We choose a significance level $\alpha = 5\%$.

Hypothesis testing

Let:

- $H_0$ the null hypothesis defined with $\mu_{X_S} - \mu_{X_T} \leq 0$;

- $H_1$ the alternative hypothesis defined with $\mu_{X_S} - \mu_{X_T} > 0$.

We conduct a one-tailed t-test. The critical value $t_{crit}$ depends on the degrees of freedom calculated from the total sample size with $df = n + m - 2$.

If we assume $H_0$ is correct, there is less than a 5% chance of observing a difference greater than the critical difference $p$:

$$p = \sigma_d \cdot t_{crit}$$

If $\mu_d \approx \bar{X}_S - \bar{X}_T > p$, then we reject $H_0$. In that case we can be reasonably confident $H_1$ is correct.

Confidence interval

The critical value $t_{crit}$ depends on the degrees of freedom calculated from the total sample size with $df = n + m - 2$.

We define the confidence interval:

$$CI = \mu_d \pm \sigma_d \cdot t_{crit} \iff CI \approx \bar{X}_S - \bar{X}_T \pm \sqrt{\frac{S_S^2}{n} + \frac{S_T^2}{m}} \cdot t_{crit}$$

We can be reasonably confident there is a 95% chance that $\mu_{X_S} - \mu_{X_T}$ is contained within $CI$.
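The between-group test and the confidence interval can be sketched together in a few lines. The grades below are invented, the function name is hypothetical, and the critical value is passed in by hand (from a t-table for the relevant degrees of freedom) because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, variance


def compare_groups(grades_sg, grades_tg, t_crit):
    """One-tailed test and interval for the difference mean(SG) - mean(TG)."""
    n, m = len(grades_sg), len(grades_tg)
    d = mean(grades_sg) - mean(grades_tg)
    # Standard deviation of the difference, estimated from the sample variances.
    sigma_d = sqrt(variance(grades_sg) / n + variance(grades_tg) / m)
    threshold = sigma_d * t_crit
    return {
        "d": d,
        "sigma_d": sigma_d,
        "reject_h0": d > threshold,
        "ci": (d - threshold, d + threshold),
    }


# Invented grades, 5 students per group, so 8 degrees of freedom.
# 1.860 is the one-tailed 5% critical value of the t-distribution for 8 degrees of freedom.
result = compare_groups([27, 28, 29, 30, 26], [22, 24, 23, 25, 21], t_crit=1.860)
print(result)
```

The function uses whichever critical value it is given for both the test and the interval. A two-sided 95% interval is usually built with the two-tailed critical value, which is larger than the one-tailed value used for the test.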

Between-year comparison

Scenarios 1, 2 and 3

The data available in scenarios 1 through 3 do not contain the information required for the between-years analysis.

Scenarios 4 and 5

Using the mean grade scored each year by students of the anatomy class at the Padua faculty of medicine, we compare the mean grade of the 2020-2021 anatomy class (the class participating in the present study) with the sample mean of the distribution of mean grades for years strictly before 2020-2021, when students used traditional learning methods.

Let the random variables:

- $B$ representing the mean grade from a Padua university anatomy class where students follow the learning treatment;

- $A$ representing the mean grade from a Padua university anatomy class where students follow traditional learning methods.

We have:

- $\bar{B}$ the mean grade of the 2020-2021 anatomy class participating in the study, calculated from the TG and the SG;

- $\bar{A}$ the sample mean calculated from a sample of population $A$ provided by the faculty;

- $n$ the size of the sample taken from population $A$.

We define:

- $S_A$ the sample standard deviation of the sample taken from population $A$;

- $\mu_{\bar{A}}$ the mean of the sampling distribution of $\bar{A}$;

- $\mu_A$ the mean grade of population $A$;

- $\mu_B$ the mean grade of population $B$;

- $\sigma_{\bar{A}}$ the standard deviation of the sampling distribution of $\bar{A}$;

- $\sigma_A$ the standard deviation of population $A$.

Using the central limit theorem, if the sample is sufficiently large, we have the relationships:

- $\sigma_{\bar{A}} = \sqrt{\frac{\sigma_A^2}{n}} \approx \sqrt{\frac{S_A^2}{n}}$;

- $\bar{A} \approx \mu_{\bar{A}} = \mu_A$.

We will test whether the mean grade scored by students of the 2020-2021 anatomy class is significantly greater than the sample mean, and calculate its t-statistic.

Hypothesis testing

Let:

- $H_0$ the null hypothesis defined with $\mu_B - \mu_A \leq 0$;

- $H_1$ the alternative hypothesis defined with $\mu_B - \mu_A > 0$.

We use the t-distribution to model the sampling distribution of mean grades. We conduct a one-tailed test.

We choose a significance level $\alpha = 5\%$. The critical value $t_{crit}$ depends on the degrees of freedom calculated from the sample size with $df = n - 1$.

If we assume $H_0$ is correct, there is less than a 5% chance of observing a difference greater than the critical difference $p$:

$$p = t_{crit} \cdot \sigma_{\bar{A}}$$

If $\mu_B > \mu_A + t_{crit} \cdot \sigma_{\bar{A}}$, which entails $\bar{B} > \bar{A} + t_{crit} \cdot \sqrt{\frac{S_A^2}{n}}$, we reject $H_0$. In that case we can be reasonably confident $H_1$ is correct.

T-statistics and difference

We define the t-statistic of $\bar{B}$ with $t_{\bar{B}} = \frac{\bar{B} - \bar{A}}{\sqrt{S_A^2 / n}}$.

We also define the difference $d = \bar{B} - \bar{A}$, which will be used in association with a box plot to better understand the significance of the results.
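A minimal sketch of the between-year decision rule follows, using invented yearly mean grades. The function name is hypothetical, and the critical value is again supplied by hand from a t-table for the number of previous years minus one degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def between_year_test(previous_year_means, current_mean, t_crit):
    """Compare the current class mean with the mean grades of previous years."""
    n = len(previous_year_means)
    a_bar = mean(previous_year_means)
    # Spread of the sample mean, estimated from the sample standard deviation.
    sigma_a_bar = stdev(previous_year_means) / sqrt(n)
    threshold = a_bar + t_crit * sigma_a_bar
    reject_h0 = current_mean > threshold
    difference = current_mean - a_bar
    return reject_h0, difference


# Invented data: 5 previous years, so 4 degrees of freedom.
# 2.132 is the one-tailed 5% critical value of the t-distribution for 4 degrees of freedom.
previous = [24.0, 25.0, 23.0, 24.5, 23.5]
print(between_year_test(previous, current_mean=26.0, t_crit=2.132))
```

With so few yearly means the central limit theorem argument is weak, so a sketch like this mainly illustrates the arithmetic rather than a dependable inference.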

Hypothesis B: Grades increase with consistency

Scenarios 1, 2 and 3

The data available in scenarios 1 through 3 do not contain the information required for the analysis.

Scenarios 4 and 5

We will use the learning data available on the supermemo.com platform to test whether the following variables are predictors of the grades obtained from the SG, and to estimate their respective weights:

- $X_d$ the number of days studied,

- $X_q$ the average size of the outstanding queue,

- $X_m$ the percentage of course material studied.

A regression analysis will be run in combination with an F-test to determine the relationship between the explanatory variables and the grades. Beforehand, diagnostics will be run to ensure that the chosen variables and the form of the regression do not violate our assumptions.
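The text does not specify the form of the regression. As one possible reading, the sketch below assumes an ordinary least squares multiple linear regression of the grade on the three usage variables and computes the overall F-statistic. All names are hypothetical and the data are invented; comparing the statistic with a critical value from an F-table is left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_f_test(rows, grades):
    """Fit grade = b0 + b1*days + b2*queue + b3*percentage; return (coefficients, F)."""
    x = [[1.0] + list(r) for r in rows]   # prepend the intercept column
    k = len(x[0]) - 1                     # number of predictors
    n = len(grades)
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k + 1)] for i in range(k + 1)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, grades)) for i in range(k + 1)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, xi)) for xi in x]
    y_bar = sum(grades) / n
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)              # explained variation
    ss_res = sum((y - f) ** 2 for y, f in zip(grades, fitted))  # residual variation
    f_stat = (ss_reg / k) / (ss_res / (n - k - 1))
    return beta, f_stat


# Invented samples: (days studied, average queue size, percentage studied) and grades.
rows = [(10, 5.0, 20.0), (20, 3.0, 40.0), (30, 8.0, 35.0), (40, 2.0, 60.0),
        (50, 6.0, 80.0), (60, 1.0, 90.0), (25, 4.0, 50.0)]
grades = [19.0, 22.0, 21.0, 25.0, 27.0, 30.0, 23.0]
beta, f_stat = ols_f_test(rows, grades)
print(beta, f_stat)
```

Solving the normal equations directly keeps the example dependency-free; a statistics package would normally be preferred for real data, since it also reports the diagnostics mentioned above.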

Hypothesis C: Increased stability of memory in the long-term

Scenarios 1 through 5

The data available in scenarios 1 through 5 do not contain the information required for the analysis.

Scenario 6

We will test whether participation in the learning treatment is a predictor of grades obtained from the follow-up test as a repeated measure of the simulation tests conducted during the first phase of the study.

A regression analysis will be run in combination with an F-test to determine the relationship between the explanatory variables and the grades. Beforehand, diagnostics will be run to ensure that the chosen variables and the form of the regression do not violate our assumptions.

Ethical considerations

Due to the importance of these exams and concerns of fairness for students, all students in participating classes were offered the opportunity to participate in the study and be placed in the SG condition.

Note regarding authorship

This study is designed and conducted jointly by all of its authors. In particular, main authorship of the paper has been assigned to Jacopo Michettoni as part of his thesis. He will lead the flashcard creation process. Alexis Pujo shares equal credit with Jacopo Michettoni for his work on writing and designing the pre-registration paper.

Annex A: Survey questions

Are you going to participate in the experiment?

- Yes.
- No.

Have you ever used a spaced repetition software like SuperMemo, Remnote and so on?

- No.
- Yes, I'm using it and I will use one (not the one used in the experiment) for this exam.
- Yes, and I am using it but not for this exam.
- Yes, I have used it in the past and I'm not using it anymore.

How much time do you spend on social media every day?

- 0-1 hour.
- 1-2 hours.
- 2-3 hours.
- 3+ hours.

When do you usually start actively studying? (e.g. repeating aloud, answering questions, quizzing yourself and flashcards)

- I only read, summarize and highlight.
- One week before.
- Two weeks before.
- Around one month before.
- Two months before.
- From the start of the semester.

If you don't know something in your daily life, do you typically look it up online or in books?

- Never.
- Always.

How many days do you work out every week?

- Linear scale of 1 to 7.

What is your housing situation?

- I live in Padova by myself.
- I live in Padova with flatmates.
- I live in Padova and I share a room with another person.
- I live outside Padova and I travel to it when I need to.

From 1 to 5, how anxious do you feel before an exam?

- Linear scale of 1 “Not at all” to 5 “A lot”.

From 1 to 5, how depressed do you feel before an exam?

- Linear scale of 1 “Not at all” to 5 “A lot”.

References

Alexis, P. (2021). The spaced repetition research protocol. Retrieved from Incogito: https://www.incogito.org/en/projects/the-spaced-repetition-research-protocol

Arnold, K. M. (2013). Test-potentiated learning: Distinguishing between direct and indirect effects of tests. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Curtoni, S., & Sutnick, A. I. (1995). Numbers of physicians and medical students in Europe and the United States. Acad Med.

Dunlosky, J. &. (2013). Improving Students’ Learning With Effective Learning Techniques. Psychological Science in the Public Interest.

Fabrizio Consorti, O. R. (2018). L’attualità del libro di testo: opinioni a confronto. Tutor – Attualità, Proposte E Ricerche Per l’Educazione Nelle Scienze Della Salute.

Giorgio Bolondia, L. B. (2018). A quantitative methodology for analyzing the impact of the formulation of a mathematical item on students learning assessment. Studies in Educational Evaluation.

Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica.

Journal of Computer Assisted Learning. (n.d.). Retrieved from Wiley: https://onlinelibrary.wiley.com/journal/13652729

Koziol, & Budding. (2012). Procedural Learning. Springer.

Krzysztof Biedalak, Piotr Woźniak. (2017, 01 23). Licensing and copyrighting of SuperMemo algorithms. Retrieved from SuperMemo: https://www.supermemo.com/en/blog/post/licensing-and-copyrighting-of-supermemo-algorithms

Łodyga, O. (2011). THE EFFECTIVENESS OF FOREIGN LANGUAGE LEARNING SUPPORTED BY THE SUPERMEMO.NET PLATFORM (BASED ON THE HIERUNDDA1COURSE).

M Schittek, N. M. (2001). Computer assisted learning. A review. European Journal of Dental Education.

Piotr Woźniak; Krzysztof Biedalak. (2017). History of SuperMemo. Retrieved from Super Memory: http://super-memory.com/english/history.htm