MASARYKOVA UNIVERZITA
FAKULTA INFORMATIKY

Modeling Times in Tutoring Systems

PhD Thesis

Petr Jarušek

Brno, 2013

Declaration

I hereby declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during its elaboration are properly cited and listed with a complete reference to the due source.

Petr Jarušek

Advisor: prof. RNDr. Ivana Černá, CSc.

Acknowledgement

I would like to thank my advisor Ivana Černá for her guidance and support during my PhD studies. It was my pleasure to work with her.

I am deeply grateful to my consultant Radek Pelánek for many things. Since he has been my advisor since my master thesis, I am really grateful for all the knowledge and the critical and analytical worldview he has shared with me. He has broadened my horizons, shaped many of my opinions, and changed my mind-set in certain areas. But not only that. I am grateful for all the PhD years when we struggled with our rather experimental research – many times failing, sometimes succeeding, but always experimenting. I remember that when we started, even I myself was not convinced that the topic we were dealing with would lead to a successful conclusion. But it did, and now I can see the interesting results that we have achieved. Among many other things, I deeply admire his working effort and his ability to finish work long before any deadlines appear on the horizon. I also admire his sense of humanity and his modesty, and I am curious to see what he is going to deal with in the future.

But it has not been only the academy that has supported me and shaped my life during my studies. I am very grateful to my family, especially to my mother, for all the care and support she has given me throughout my life. This may sound like a cliché, but as I grow older I can see more clearly how much she has sacrificed for her children, and I am very thankful for that. I look forward to one day supporting my kids in the same way she has supported us. I am also deeply thankful to the other important members of my family – my father, my sister, and my broader family.

Abstract

We study problem solving in the context of intelligent tutoring systems, particularly with a focus on timing information as opposed to just the correctness of answers. This leads to different types of educational problems and requires new student models. We describe a simple model which assumes a linear relationship between a latent problem solving skill and the logarithm of the time to solve a problem. We show that this model is related to models from two different areas: item response theory and collaborative filtering. We also propose model extensions for learning and for dealing with multidimensional skills. Using both synthesized data and real data from a widely used “Problem Solving Tutor” we evaluate the model, analyze its parameter values and estimation techniques, and discuss the insight into problem difficulty which the model brings.

As a direct application of the model we developed the “Problem Solving Tutor” (tutor.fi.muni.cz) – a web-based educational tool for learning through problem solving. The tool makes predictions of problem solving times and thus is able to recommend to each student a problem of suitable difficulty. The tool contains 30 problem types and more than 2 000 problems, mainly programming problems, math problems, and logic puzzles. All problems are interactive and the system gives students immediate feedback on their performance. The system is already widely used – it has more than 460 000 problems solved and 10 000 users. It also supports “virtual classes” and is already used in more than 50 high schools in the Czech Republic.

Finally, we study six transport puzzles – Minotaurus, Number Maze, Replacement Puzzle, Rush Hour, Sokoban, and Tilt Maze. Using the Tutor we collect large-scale data on human problem solving of these puzzles. The results show that there are large differences among the difficulty of individual problem instances and that these differences are not explained by previous research. In order to explain the differences, we propose a computational model of human problem solving behavior based on state space navigation and provide its evaluation and discussion. We also derive the concepts of a state space bottleneck and of problem decomposition for the Sokoban puzzle. We evaluate both methods and compare them to other metrics.

Contents

1 Introduction
  1.1 Contribution of the Thesis
    1.1.1 Model of Problem Solving Times
    1.1.2 Problem Solving Tutor
    1.1.3 Model of Human Problem Solving of Transport Puzzles
  1.2 Outline of the Thesis
2 Background
  2.1 Item Response Theory
    2.1.1 Basics
    2.1.2 Features
    2.1.3 Computerized Adaptive Testing
  2.2 Modeling Response Times
    2.2.1 Approaches
    2.2.2 Lognormal Model
    2.2.3 Application of Response Times in Adaptive Tests
    2.2.4 Maximum Information Criterion for Response Times
  2.3 Intelligent Tutoring Systems
    2.3.1 Outer Loop and Inner Loop
    2.3.2 Model Tracing
    2.3.3 Knowledge Tracing
  2.4 Educational Data Mining and Recommender Systems
    2.4.1 Educational Data Mining
    2.4.2 Recommender Systems
    2.4.3 Collaborative Filtering
  2.5 Human Problem Solving and Puzzles
    2.5.1 Difficulty and Puzzles
  2.6 Our Approach: Focus on Timing Information
    2.6.1 Correctness Versus Timing Approach
    2.6.2 Tutoring Based on Timing Information
    2.6.3 Examples of Problems
3 Model of Problem Solving Times
  3.1 Motivation
    3.1.1 Preliminaries
  3.2 Basic Model
    3.2.1 Group Invariance
    3.2.2 Relations to Item Response Theory and Collaborative Filtering
  3.3 Model with Variability of Students’ Performance
  3.4 Basic Model with Learning
    3.4.1 Model with Multidimensional Skill
  3.5 Introduction to Maximum Likelihood and Estimation Methods
    3.5.1 Maximum Likelihood for Univariate Gaussian Linear Regression
    3.5.2 Analytical Estimation
    3.5.3 Gradient Descent Estimation
  3.6 Parameter Estimation Using Maximum Likelihood
  3.7 Parameter Estimation Using Iterative Joint Estimation
    3.7.1 Approach
    3.7.2 Estimating Skill
    3.7.3 Estimating Problem Parameters
    3.7.4 Joint Estimation
    3.7.5 Estimating Skill for Model with Learning
4 Evaluation of the Model
  4.1 Evaluation Using Synthesized Data
    4.1.1 Synthesized Data for Basic Model and Model with Students’ Variability
    4.1.2 Synthesized Data for Basic Model with Learning
    4.1.3 Evaluation of Parameter Estimation Techniques
  4.2 Evaluation Using Real Data
    4.2.1 Parameter Values for Real Data
    4.2.2 Evaluation of Predictions
    4.2.3 Reliability of Parameter Values
    4.2.4 Insight Gained from Parameter Values
    4.2.5 Detection of Multidimensional Skill
  4.3 Open Issues
    4.3.1 Problem Completion
    4.3.2 Detection of Cheating
    4.3.3 Application for Adaptive Testing
5 Problem Solving Tutor
  5.1 Main Approach
  5.2 Main Components
    5.2.1 Typical Usage
    5.2.2 Problem Simulators
    5.2.3 Data Collection
    5.2.4 Predictions
    5.2.5 Recommendations
    5.2.6 Class Mode
    5.2.7 Motivational Features
  5.3 Problems in the Tutor
    5.3.1 Robot Programming Problems
    5.3.2 Programming Problems
    5.3.3 Computer Science Problems
    5.3.4 Math Problems
    5.3.5 Logic Puzzles
  5.4 Implementation
    5.4.1 Technologies
    5.4.2 Main Entities
    5.4.3 Entity Relationship Model
    5.4.4 Logging Interface for Simulators
    5.4.5 Problem Locker
    5.4.6 Gradual Start
  5.5 Statistics of Usage
6 Difficulty of Transport Puzzles
  6.1 Motivation
  6.2 Studied Problems
    6.2.1 Sokoban
    6.2.2 Minotaurus Puzzle
    6.2.3 Number Maze
    6.2.4 Tilt Maze
    6.2.5 Rush Hour
    6.2.6 Replacement Puzzle
  6.3 Data Collection and Analysis
    6.3.1 Data Collection
    6.3.2 Data Analysis
    6.3.3 Problem Difficulty
    6.3.4 Analysis of Individual Moves in Sokoban Puzzle
  6.4 Model of Human Behaviour
    6.4.1 Basic Principle
    6.4.2 Model Formalization
    6.4.3 Model with Dead States
    6.4.4 Other Extensions
  6.5 Evaluation
    6.5.1 Difficulty Rating Metrics
    6.5.2 Value of the Parameter B
    6.5.3 Differences among Problems
    6.5.4 Relation to the Model of Problem Solving Times
  6.6 State Space Bottleneck
    6.6.1 Analysis of Bottleneck
    6.6.2 Network Flows
    6.6.3 Bottleneck Coefficient
    6.6.4 Possible Applications
  6.7 Problem Decomposition
    6.7.1 Approach
7 Conclusion
  7.1 Future Work
A First Appendix
  A.1 Author’s Contribution
    A.1.1 Conference Papers
    A.1.2 Software
    A.1.3 Technical Reports

1 Introduction

Imagine you are a teacher of a computer science course at a university and you have several hundred online problem solving exercises on binary numbers on your website (see an example in Fig. 1.1). In a given academic year, several hundred students enroll in your course and would like to prepare for a test on your website.

Now you are confronted with a natural question: how should you order your exercises? For a weaker student, it would be beneficial to spend more time on simpler problems and then move to moderately difficult exercises. For an advanced student, it would be beneficial to start with simpler exercises but to proceed quickly to more difficult problems. Therefore you would like to adapt your website individually to each student’s performance and to problem difficulty. And what if some problems show a high variance in solving results even for equally skilled students? If you are preparing a test, you would not want to use such problems. Therefore you would also like to categorize problems according to their parameters. This thesis attempts to answer these questions.

Today, it is easy to develop interactive online learning environments and deliver them to students through the Internet. One of the tools we developed is called Graphs and Functions, where the task is to identify a formula describing a depicted function. Students fill a formula into a text field and the tool plots their attempts on a graph (see the right image in Fig. 1.1). Incorrect attempts are not penalized; students try different functions until they find a solution. Tools like this belong to online interactive educational tools, yet measuring student skill and problem parameters within these tools has not been studied. We focus on this area and propose models of problem solving times.
Previous research in the area of skill assessment focused primarily on the correctness of test answers, using the solving time only as additional information. As already mentioned, our setting is different. We deal with problem solving activities, specifically with well-structured problems. Well-structured problems have clear boundaries, rules, and goals. For example, Graphs and Functions has clear rules: a displayed function, a text field for input, and a syntax to specify a formula. The goal is also clear – to find a formula of the displayed function. Solutions in our setting differ only in their solving times.

Figure 1.1: Two examples of interactive problems. Left: the goal is to fill in the boxes with binary numbers according to the given condition. Right: the goal is to find a formula describing the depicted function.

In this thesis we present a simple model for predicting problem solving times. The model assumes a linear relationship between the latent problem solving skill and the logarithm of the time to solve a problem. We also describe model extensions which incorporate student variance, learning, and multidimensional skills. Using both synthesized data and real data from a “Problem Solving Tutor,” we evaluate the model, analyze its parameter values, and discuss the insight into problem difficulty which the model brings.

As a direct application of the model we developed the “Problem Solving Tutor” (tutor.fi.muni.cz) – a web-based educational tool for learning through problem solving. The tool makes predictions of problem solving times and thus is able to recommend to each student a problem of suitable difficulty. It contains 30 problem types and more than 2 000 problems, mainly programming problems, math problems, and logic puzzles. All problems are interactive and the system gives students immediate feedback on their performance. The system is already widely used – it has more than 460 000 problems solved and 10 000 users. It also supports “virtual classes” and is used in more than 50 high schools in the Czech Republic.

Besides modeling solving times, we also analyze the process of human problem solving for a particular type of well-structured problems – transport puzzles. We focus on human navigation in the underlying state space. We study six transport puzzles – Minotaurus, Number Maze, Replacement Puzzle, Rush Hour, Sokoban, and Tilt Maze. Using the Tutor we collect large-scale data on human problem solving of these puzzles. Our results show that there are large differences among the difficulty of individual problem instances and that these differences are not explained by previous research. In order

to explain the differences, we propose a computational model of human problem solving behavior based on state space navigation and provide its evaluation and discussion. We also derive the concepts of a state space bottleneck and of problem decomposition for the Sokoban puzzle. We evaluate both methods and compare them to other metrics.

1.1 Contribution of the Thesis

We summarize the main contributions of the thesis.

1.1.1 Model of Problem Solving Times

We propose a simple model of problem solving times which assumes a linear relationship between the latent problem solving skill and the logarithm of the time to solve a problem. We propose extensions of the model to incorporate student variance, learning, and multidimensional skills. We evaluate the models using synthesized data and real data from our Tutor. This is a novel approach, since previous research in the area of skill assessment has focused primarily on the correctness of test answers, using the solving time only as additional information. The main contributions of the model of problem solving times are as follows.

• The proposed model brings a novel approach to the estimation of latent skill based on problem solving times rather than the correctness of answers.

• The model is group invariant and gives a better ordering of problems with respect to their difficulty.

• The model brings additional insight – we can determine not just the average difficulty of problems, but also their discrimination, students’ skill variance, and learning.

• We derive and evaluate two estimation methods for model parameters – one based on stochastic gradient descent and one on iterative joint estimation.

• Since the proposed model is generative, we evaluate it over synthesized data. The experiments provide insight into how much data is needed to obtain usable estimates of parameter values.


• We evaluate the model using large-scale data collected with the Tutor. The model brings a slight improvement in predictions of solving times (measured by RMSE).

• We run experiments with synthesized and real data for the model with multidimensional skills and show how it can be used for automated concept detection.

Author’s publications relevant for the topic:

• P. Jarušek, V. Klusáček, and R. Pelánek. Modeling students’ learning and variability of performance in problem solving. In Educational Data Mining, to appear, 2013

• P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. In Proc. of Intelligent Tutoring Systems (ITS), volume 7315 of LNCS, pages 379–388. Springer, 2012

• P. Jarušek and R. Pelánek. Modeling and predicting students problem solving times. In Proc. of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2012), volume 7147 of LNCS, pages 637–648. Springer, 2012

• P. Jarušek and R. Pelánek. Problem response theory and its applica- tion for tutoring. In Educational Data Mining, pages 374–375, 2011

1.1.2 Problem Solving Tutor

We have developed the “Problem Solving Tutor” – a free web-based tutoring system for practicing problem solving skills, available at tutor.fi.muni.cz. The Tutor is widely used – it has more than 10 000 users who have solved more than 460 000 problems. Many students from the Faculty of Informatics have been engaged in the development of interactive problem simulators, and more than 10 bachelor theses have been written on this topic. The tool is also used in more than 50 high schools in the Czech Republic and at the Faculty of Informatics at Masaryk University for learning concepts from computer science. More than 100 teachers have registered; they run 221 classes to which they have assigned more than 2 400 students. The main contributions of the Problem Solving Tutor are as follows.

• The above-described model is directly applied in the Tutor for skill and problem assessment based on problem solving times.


• The Tutor is widely used in high schools as a tool for introductory programming and computer science lessons.

• The Tutor provides a large source of problem solving data used for the evaluation of the proposed models.

Author’s publications relevant for the topic:

• P. Jarušek and R. Pelánek. A web-based problem solving tool for introductory computer science. In Proceedings of the 17th ACM annual conference on Innovation and technology in computer science education, page 371. ACM, 2012

1.1.3 Model of Human Problem Solving of Transport Puzzles

We propose a dynamic computational model which simulates human behaviour during a state space search. The model captures structural differences in problem state spaces which have not been fully studied in previous research. We evaluate the model on real data from our Tutor using more than 400 problems from six different transport puzzles (Minotaurus, Number Maze, Rush Hour, Sokoban, Tilt Maze, Replacement Puzzle). The main contributions of the model of human problem solving of transport puzzles are as follows.

• We present a novel approach to modeling the difficulty of transport puzzles based on the state space structure.

• We propose a general computational model of human navigation in a state space, together with its extensions.

• We evaluate the model over a large data set from the Tutor. The model improves difficulty predictions for the selected transport puzzles in comparison with other state space metrics (e.g., shortest path).

• We derive the concept of a state space bottleneck for the determination of key states on the solution path.

• We derive and evaluate a problem-specific metric for Sokoban based on problem decomposition. The metric significantly improves difficulty predictions.

Author’s publications relevant for the topic:


• P. Jarušek and R. Pelánek. What determines difficulty of transport puzzles? In Proc. of Florida Artificial Intelligence Research Society Conference, pages 428–433. AAAI Press, 2011

• P. Jarušek and R. Pelánek. Difficulty rating of sokoban puzzle. In Proc. of the Fifth Starting AI Researchers’ Symposium (STAIRS 2010), pages 140–150. IOS Press, 2010

• P. Jarušek and R. Pelánek. Human problem solving: Sokoban case study. Technical Report FIMU-RS-2010-01, Masaryk University Brno, 2010

• P. Jarušek and R. Pelánek. Analýza obtížnosti logických úloh na základě modelu lidského chování [Difficulty analysis of logic puzzles based on a model of human behaviour]. In Kognice a umělý život X, pages 171–176. Slezská univerzita v Opavě, 2010

1.2 Outline of the Thesis

The thesis is organized as follows. Chapter 2 reports on the state of the art in areas related to our research – item response theory, extensions of item response theory incorporating timing information, intelligent tutoring systems, learner models, recommendation systems, collaborative filtering, educational data mining, and human problem solving and difficulty.

Chapter 3 describes a simple model of latent skills and solving times. More specifically, we describe a basic model, a model with student variance, and a model with learning, and we also outline extensions for multidimensional skills. We present two methods for parameter estimation – a gradient descent technique and iterative joint estimation.

Chapter 4 evaluates the model over synthesized and real data. Since the model is generative, we report on results over synthesized data and compare the two methods for parameter estimation. We then evaluate the model over a large set of real solving data generated by the Tutor, report on the insight gained from parameter values for problems, summarize the results, and discuss applications.

Chapter 5 introduces the Problem Solving Tutor – an online educational tool which is one of the main results of our work. We report on the system architecture, main components, implementation details, and the implementation of the prediction algorithm based on the proposed model. We describe the problems in the Tutor and present statistics of system usage.


Chapter 6 focuses on modeling human problem solving in the area of transport puzzles. We analyze human state space navigation based on data collected by the Tutor and show students’ patterns in solving. We generalize our observations and introduce a human-like model for transport puzzles. More specifically, we present an artificial model which predicts the difficulty of transport puzzles based on the structure of the underlying problem state space. We evaluate the model on data from six different transport puzzles collected from the Tutor and relate it to the model of problem solving times. We derive and evaluate the concept of a state space bottleneck. We present a problem-specific metric for difficulty predictions in the Sokoban puzzle – problem decomposition. Conclusions of the thesis and possible future directions are outlined in Chapter 7.

2 Background

In this chapter we present an overview of the research related to our thesis. The proposed model of problem solving times is closely related to item response theory and its extensions incorporating timing information, to computerized adaptive testing, and to collaborative filtering. Our tool – the Problem Solving Tutor – is related to research on intelligent tutoring systems, learner modeling, recommendation systems, and educational data mining. Our proposed model of human navigation in a state space is related to research on human problem solving and puzzle difficulty. We conclude the chapter by emphasizing our approach based on timing information (i.e., not on the correctness of test answers), which is crucial for this work.

2.1 Item Response Theory

We give an overview of the item response theory, its features and its appli- cation in adaptive testing.

2.1.1 Basics

Item response theory is used mainly in testing, where it has many advantages over classical test theory [50, 75, 78, 108, 109]. Its main assumption is that a given test measures one latent ability θ; models give a relation between this ability θ and the probability P that a test item is answered correctly. This relation is expressed by an S-shaped item response function, which is a form of the logistic function (see Fig. 2.1). The item response functions differ in the number of parameters they involve. We start with the one-parameter Rasch model [126]:

\[ P_{b,\theta} = \frac{e^{\theta - b}}{1 + e^{\theta - b}} \]

The parameter b is the basic difficulty of an item, i.e., the level of ability at which the probability of a correct answer is 50%. The idea behind this parameter is that it moves the response function along the ability scale. Moving to the two-parameter logistic model, we add a discrimination parameter a – now we can also change the slope of the response function:

\[ P_{a,b,\theta} = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}} \]


Figure 2.1: Illustration of the item response function (left) and group invariance (right). Group invariance: even if an item has been answered only by below-average students, the estimated response function should be similar to the one obtained if the item was answered by a representative subset of students.

When the discrimination is high, even a small change in ability (around b) leads to a substantial increase of P. Finally, the most common model is the three-parameter logistic model, which has the following parameters: b is the basic difficulty of an item, a is a discrimination factor, and c is a pseudo-guessing parameter which describes the probability of answering the item correctly simply by guessing (for more information see [51]):

\[ P_{a,b,c,\theta} = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}} \]

To get an intuition for these parameters, look at the shapes of the response functions in Fig. 2.2. On the left we see response functions for different values of the difficulty parameter b – the functions move along the ability scale. In the middle we see functions for different values of the discrimination a – the function changes its slope. On the right we see functions for different values of the guessing parameter c – the lower bound of the probability values rises.

To apply these models, it is necessary to estimate the values of their parameters. Since we know neither the item parameters (a, b, c) nor the persons’ abilities (θ), we need to estimate both at the same time. This is usually done by joint maximum likelihood estimation, which proceeds by repeating two steps: estimating abilities from item parameters and estimating item parameters from abilities. These steps are repeated until the parameter values converge [11]. The estimation process has an important disadvantage – it demands a large pre-test data collection (e.g., 500 or more students).
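As an illustration, the three-parameter response function above can be written in a few lines of Python. This is a minimal sketch (the function name `p_3pl` is ours, not from the thesis):

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) response function: probability of a
    correct answer given ability theta, discrimination a, basic difficulty b,
    and pseudo-guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the logistic part equals 1/2, so P = c + (1 - c) / 2;
# with c = 0.2 this gives 0.6.
print(round(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2), 3))  # → 0.6
```

Setting c = 0 and a = 1 recovers the one-parameter Rasch model, matching the progression of models above.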


Figure 2.2: Illustration of the item response functions for different values of b (left), different values of a (middle) and different values of c (right).

2.1.2 Features

Item response theory models suffer from the “indeterminacy of the scale” problem – for example, we can add a constant k to the parameters θ and b and obtain an equivalent model. This issue is usually addressed by some kind of normalization, e.g., requiring that the mean value of ability is 0 and its variance is 1.

An interesting feature of the theory is that the estimated parameters of the response function do not depend on the ability level of the examinees who answered the given item. Even if some item is answered only by below-average persons, the estimated item parameters should be similar to those obtained if the item was answered by a representative subset of persons [11]. This is known as group invariance (see Fig. 2.1).

The information function describes the contribution of a given item to the estimation of a student’s ability. According to Fisher, information is the reciprocal of the precision with which a parameter can be estimated; the precision is measured by the variability of the estimates around the given value of the parameter. An example is shown in Fig. 2.3. If the value of the information function is large, the ability of a student with the given level of θ̂ can be estimated with precision; if it is small, the student’s ability cannot be estimated with precision. In the given example we obtain the highest information contribution for a student with θ = 0. The information function is used in computerized adaptive testing to propose items with the highest information contribution.
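For the two-parameter logistic model the item information has the well-known closed form I(θ) = a²P(θ)(1 − P(θ)), which peaks where P = 0.5, i.e. at θ = b. A minimal sketch (function names are ours):

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Information is maximal at theta == b, where it equals a^2 / 4:
print(information_2pl(0.0, a=1.5, b=0.0))  # → 0.5625
```

Away from θ = b the information decreases, which is exactly why adaptive testing prefers items whose difficulty matches the current ability estimate.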

2.1.3 Computerized Adaptive Testing

Computerized adaptive testing is a method of administering a test that adapts to a student’s ability level [76, 77, 123, 124]. The goal of the method is


Figure 2.3: Information function for a given item: the relationship between the information contribution of an item and a student’s ability θ.

to select appropriate and informative items adjusted to the student [41]. The method differs from a paper test in that students with different ability levels are tested with different sets of items; in a paper format, all students receive the same set. The goal of the method is to estimate a student’s latent ability θ and to select test items from an item pool based on the student’s current performance. The major advantage of the method is that it provides more precise θ estimates with relatively fewer items than would be required in conventional tests [82].

A widely used estimator of ability is the maximum likelihood estimator, which is the value of θ that maximizes the likelihood function of the responses. The variance of the estimate is inversely related to the Fisher information [25] at the given value of the ability estimate. The maximum information criterion [77] selects the item with the highest information (i.e., with the lowest variance) at the current ability estimate (see also [119]). In other words [41]:

\[ j_{m+1} = \max_{l}\,\{\, I_l(\hat{\theta}_m) : l \in R \,\} \]

where \(j_{m+1}\) is the chosen item, \(I_l(\hat{\theta}_m)\) is the information function of item \(l\) given the ability estimate \(\hat{\theta}_m\), and \(R\) is the pool of remaining items.

This traditional method for item selection focuses only on the item information, without taking into consideration the time required to answer an item. We now move on to extensions incorporating estimates of response times; the improved item selection algorithm will also incorporate these estimates.
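The maximum information criterion itself is a single argmax over the remaining pool. A sketch using the 2PL information formula I(θ) = a²P(1 − P); the item pool and the function names are hypothetical:

```python
import math

def information(theta, a, b):
    """2PL item information a^2 * P * (1 - P) at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, remaining):
    """Maximum information criterion: pick the item from the remaining
    pool whose information at the current ability estimate is highest."""
    return max(remaining,
               key=lambda item: information(theta_hat, item["a"], item["b"]))

pool = [
    {"id": 1, "a": 1.0, "b": -2.0},  # easy item
    {"id": 2, "a": 1.0, "b": 0.1},   # medium item
    {"id": 3, "a": 1.0, "b": 2.5},   # hard item
]
# With equal discriminations, the most informative item is the one
# whose difficulty is closest to the ability estimate.
print(select_next_item(0.0, pool)["id"])  # → 2
```

After each administered item the ability estimate \(\hat{\theta}_m\) is updated and the selection is repeated on the shrunken pool.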


2.2 Modeling Response Times

Adaptive tests are typically organized to have a fixed time length, but the items selected from the item pool may show large variation in response times. When different students get different sets of items, the time to answer the items may vary considerably. Because the solving time of items typically correlates with their difficulty, students with higher ability may experience higher time pressure during testing [115]. It is therefore reasonable to study item response times more deeply. We describe the main approaches to dealing with response times.

2.2.1 Approaches

There are two main approaches to modeling response times. The first approach models response times within the framework of item response theory. The second approach models response times independently of item response parameters. We conclude with a mixed approach, which is the most relevant to our model.

A typical example of the first approach is Roskam’s model, which incorporates response times into response models [99]. The model is based on the one-parameter item response function; the ability parameter θ is replaced by an “effective ability” θ + ln t. The model is stated as:

\[ p_i(\theta) = \frac{1}{1 + e^{-(\theta + \ln t - b)}} \]

In this model, an increase in the difficulty b of an item can be compensated by spending more time on the given item. This model has been extended in [120] with a parameter measuring the speed of a given person. Another proposed model [122] incorporated response times into the three-parameter model and also added a parameter ρ_j for the slowness of a student and d_i for the slowness of an item. This model is stated as:

\[ p_i(\theta_j) = \frac{1}{1 + e^{-a_i\left(\theta_j - \frac{\rho_j d_i}{t_{ij}} - b_i\right)}} \]

There are also reverse types which incorporate response parameters into a model for response times. For example, the time can be estimated from the parameters a_i, θ_j, b_i and an item-dependent variance σ_i, see [45]. In another model [107] the response time is derived from item parameters with extra parameters for the person and item effect. This model has been further extended in [42].
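To make the “effective ability” idea concrete, here is a sketch of Roskam’s model, assuming the response function has the form p(θ) = 1/(1 + e^{−(θ + ln t − b)}); the function name is ours:

```python
import math

def p_roskam(theta, t, b):
    """Roskam's model: the effective ability is theta + ln(t), so spending
    more time on an item raises the probability of answering it correctly."""
    return 1.0 / (1.0 + math.exp(-(theta + math.log(t) - b)))

# Doubling the time compensates exactly for an increase of ln(2) in difficulty:
p1 = p_roskam(theta=0.0, t=1.0, b=1.0)
p2 = p_roskam(theta=0.0, t=2.0, b=1.0 + math.log(2))
print(abs(p1 - p2) < 1e-12)  # → True
```

This illustrates the trade-off stated above: an increase in the difficulty b can be offset by a proportional increase in solving time.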

16 2. BACKGROUND

In the second approach, response times are modeled independently of item parameters. For example, Scheiblechner proposed a model [102] that assumed an exponential density for the response time t_ij with time parameters for both the person and the item. Other models can be found in [80, 103, 116, 117]. A review of both approaches is given in [112].

The mixed approach assumes that the item response theory parameters and the response time are determined by distinct parameters. The approach assumes that a person operates at a constant speed and a constant ability, which are constrained by the person's choice between speed and accuracy [112]. Once a person has made a choice in the speed–accuracy trade-off, speed and ability act as constants. Van der Linden proposes a lognormal model [110] for this approach; it is closely related to our model and we discuss it further in the following section.

2.2.2 Lognormal Model

The model [110] posits a normal density for the distribution of the log response time ln t_ij. For a given student j and an item i, the response time is modeled as:

$$f(t_{ij}; \tau_j, \alpha_i, \beta_i) = \frac{\alpha_i}{t_{ij}\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[\alpha_i\left(\ln t_{ij} - (\beta_i - \tau_j)\right)\right]^2\right)$$

i.e., ln t_ij is normally distributed with mean β_i − τ_j and standard deviation 1/α_i.

The mean of the log-time distribution is µ_ij = β_i − τ_j, with β_i ∈ (−∞, ∞) and τ_j ∈ (−∞, ∞). Parameter τ_j can be interpreted as the speed of the person; for an illustration see Fig. 2.4. The larger τ_j, the smaller the amount of time the person tends to spend on the item. Likewise, β_i is a parameter which describes the time intensity, or time consumingness, of a given item i. This parameter controls the time item i demands from the persons; the larger β_i, the larger the amount of time persons tend to spend on it (see [110, 111]). The discrimination parameter α_i (α_i > 0) is the reciprocal of the standard deviation of the normal distribution. A larger value of α_i means less dispersion of the log response time distribution on item i, and hence better discrimination by the item between the distributions of persons with different levels of speed. We proceed with a description of how to incorporate response times into adaptive tests.
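The density above can be written out directly; a minimal sketch (the function names are ours):

```python
import math

def lognormal_density(t, tau, alpha, beta):
    """Density of the lognormal response time model: ln t is normal
    with mean beta - tau and standard deviation 1/alpha."""
    z = alpha * (math.log(t) - (beta - tau))
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)

def median_time(tau, beta):
    """Median solving time, following directly from the mean of ln t."""
    return math.exp(beta - tau)
```

A faster person (larger τ) has a smaller median time; a more time intensive item (larger β) has a larger one.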


Figure 2.4: Illustration of van der Linden's lognormal model of item response time. Parameter τ_j represents the speed of the person, β_i describes the time intensity or time consumingness of a given item i, and α_i can be interpreted as a discrimination parameter.

2.2.3 Application of Response Times in Adaptive Tests

For standard paper tests, the lognormal model provides solving time estimates for the items in the item pool. Using this model, it is possible to assemble tests from the pool with the expected time below a given time limit [114]. If the total time for the test t_total has been chosen, the items in the test should be selected such that [115]:

$$\sum_{i=1}^{n} \exp(\beta_i) \leq t_{total}$$

In adaptive tests, the problem of differing solving times occurs often, because different students get different items. This can be solved by making the tests equally time consuming using a constraint of this type on the selection of the items. If the items have been calibrated, it also becomes easy to score students for their speed.

The process of item selection considering response times consists of the following steps: use the response times to continuously update an estimate of the student's speed τ_j during the adaptive test; then use the speed estimate together with the estimates of the time intensities β_i of the remaining items in the pool to predict the time necessary for each of them; and then select the items in the adaptive test subject to a constraint on their predicted times that guarantees completion [115]. We proceed with an extension of the maximum information criterion incorporating response times.
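Under the lognormal model, the first two of these steps can be sketched as follows (a simplified illustration; the closed-form speed estimate is our own derivation from the model, a precision-weighted average over the answered items):

```python
import math

def estimate_speed(log_times, betas, alphas):
    """Maximum likelihood estimate of the speed tau from answered items:
    ln t_i is normal with mean beta_i - tau and precision alpha_i ** 2."""
    num = sum(a * a * (b - lt) for lt, b, a in zip(log_times, betas, alphas))
    den = sum(a * a for a in alphas)
    return num / den

def predicted_times(tau, remaining_betas):
    """Predicted (median) solving times for the remaining items in the pool."""
    return [math.exp(beta - tau) for beta in remaining_betas]
```

The predicted times can then feed a completion constraint of the type above, with t_total reduced by the time already spent.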


2.2.4 Maximum Information Criterion for Response Times

The maximum information criterion selects the item with the highest information at a given level of ability. The criterion does not take into consideration the time required to answer the given item. The response time can be useful, as a highly informative item can often be time consuming, so it has less practical value than an equally or slightly less informative item that requires less time to complete. Instead of maximizing the item information, Wang proposed a new criterion – maximum information per time unit (see [41]). The criterion is stated as:

$$j_{m+1} = \max_{l} \left\{ \frac{I_l(\hat{\theta}_m)}{E[T_l \mid \hat{\tau}_m]} : l \in R_m \right\}$$

where j_{m+1} is the chosen item, I_l(\hat{\theta}_m) is the information function of item l at the ability estimate \hat{\theta}_m, T_l is the time required for item l, \hat{\tau}_m is the maximum likelihood estimate of the current speed parameter τ, E[T_l | \hat{\tau}_m] is the expected time for the l-th item, and R_m is the pool of remaining items.

Instead of selecting the item with the highest item information, an item is selected according to both the information and the timing estimate. Using this criterion, items with high information will tend to be chosen, but are less likely to be chosen if they require a great amount of time. By continually updating the speed parameter and the ability parameter, items may be chosen for a student that assist in obtaining information about the student's ability faster. By doing so, an exam with a fixed amount of required information can be completed more quickly, possibly affording the chance to seek a higher level of information by adding more items. Also, an exam of fixed length can be completed more quickly.
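A sketch of this selection rule, assuming two parameter logistic items and lognormal response times (the pool layout and item ids are hypothetical):

```python
import math

def information(theta, a, b):
    """Fisher information of a two parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta_hat, tau_hat, pool):
    """Maximum information per time unit: pool maps an item id to its
    IRT parameters (a, b) and time intensity beta; the expected time
    comes from the lognormal response time model."""
    def rate(item_id):
        a, b, beta = pool[item_id]
        return information(theta_hat, a, b) / math.exp(beta - tau_hat)
    return max(pool, key=rate)
```

For example, with `pool = {"quick": (1.0, 0.0, 0.0), "slow": (1.0, 0.0, 3.0)}` and estimates `theta_hat = tau_hat = 0`, `next_item` picks `"quick"` – the items are equally informative, but one is far cheaper in time.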

2.3 Intelligent Tutoring Systems

Intelligent tutoring systems (ITS) [3] are computer programs used to make the learning process more adaptive and student oriented. They provide background information, problems to solve, hints and feedback on learning progress. A well-known example of an ITS is a system for teaching algebra [68, 67]. Tutoring systems organize knowledge into knowledge chunks and confront students with problems according to their success in solving problems of particular knowledge chunks. Therefore, intelligent tutoring systems are suitable for "incremental learning" and usually have a static structure which is determined by an expert in the particular domain. Our system is

dynamic and recommends problems based on collected problem solving data.

2.3.1 Outer Loop and Inner Loop

Tutoring systems have two loops – the outer loop and the inner loop [118]. The outer loop executes after solving a task and its responsibility is to decide which task the student should do next. The outer loop uses several methods to select the task. First, a tutor may display a menu and let the student select the next task. Second, a tutor may organize problems in a fixed order; to move forward, the student must finish the current task. Third, the curriculum may be structured as a sequence of units; while a student is working inside a unit, the tutor keeps recommending problems from the unit until the student masters the unit's knowledge, and only then may the student move forward. Fourth, knowledge components may be assigned to each task to describe what kind of knowledge the task exercises; a tutor then organizes tasks according to the student's degree of mastery of the given knowledge components.

The inner loop focuses on the following question: how to give hints and feedback about a problem within a task? For example, a tutor may give a response after each step of the student, or it may give error specific feedback on an incorrect step to help the student understand the mistake. A tutor may also give hints on the next step (for example, a student may demand this by clicking a button). The inner loop may incorporate an assessment of the student's knowledge, which may subsequently be used for the selection of the next problem in the outer loop. To deal with these tasks, the inner loop incorporates learner models, which we discuss in the following section.

Most research on tutoring systems focuses on the "inner loop" (how to give hints about a problem); we focus solely on the "outer loop" (how to order problems). A specifically relevant work, which focuses on problem difficulty, is [8], where the authors study several aspects of difficulty, including the time needed to solve a problem.
However, they use just a simple mean for estimating the difficulty from data, whereas we consider more complex models.

2.3.2 Model Tracing

In intelligent tutoring systems, the learner model for skill assessment plays a key role. Based on the student's answers, the model can decide what a student knows and does not know (assess the student's skill) and subsequently


a tutor can intervene with additional hints, comments or other exercises. While a student is solving a problem, a tutor analyzes her actions. This process is termed model tracing [7]. A tutor solves each problem along with the student [40] (for an example see Fig. 2.5). The primary goal of model tracing is to provide the guidance the student needs to succeed in problem solving [31].

Figure 2.5: Model tracing in cognitive tutors: skills are represented as rules and a skill is considered correctly applied when a rule is matched to the student's action. Illustration altered from [40].

For problem solving tutors, we distinguish two relevant concepts of model tracing – cognitive tutor modeling and constraint based modeling [84]. In constraint based modeling, skills are represented as predicates and a skill is considered mastered when a predicate is matched over the student's responses [83]. In cognitive tutor modeling [5, 6], skills are represented as rules and a skill is considered correctly applied when a rule is matched to the student's action. Both approaches are similar and can also handle failed answers: if rules or constraints for failures are formulated, a model can match the student's failure and respond with appropriate content. With these models, a tutor may support students with hints and advice.

2.3.3 Knowledge Tracing

Knowledge tracing is used to monitor students' learning from problem to problem, i.e., to maintain a student model which provides an assessment of relevant skills over time [30, 31, 34]. For example, suppose that a student has several opportunities to apply given knowledge and we received the sequence of correct (1) and incorrect (0) responses [31]: 0, 0, 0, 1, 0, 1, 1, 1. Has the student learned the rule? Each time the student has an opportunity to apply a rule in the model, the tutor updates its estimate of whether the student knows the rule, based on the student's action. We briefly describe this

approach, which is used in Cognitive Tutors.

Bayesian knowledge tracing is a model for determining whether learning occurs during a given problem solving step [31]. The model assumes that each step calls for a single skill. A student can either succeed or fail at the step. For the basic model, four parameters are defined: the probability that the skill is already mastered P(L_0), the probability that the skill will be learned P(T), the probability g that the student will guess correctly if the skill is not mastered, and the probability s that the student will slip (make a mistake) if the skill is mastered. The four parameters are fitted for each skill using data from students. The goal of parameter fitting is to find the combination of parameters that best predicts the pattern of correct and incorrect responses [32]. A tutor can then make predictions of students' knowledge as they use the tutor.
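The update rule of this model can be sketched in a few lines (the parameter values in the example are hypothetical defaults, not fitted values):

```python
def bkt_step(p_know, correct, p_learn, g, s):
    """One Bayesian knowledge tracing step: Bayes update of P(skill known)
    from the observed response, followed by the learning transition P(T)."""
    if correct:
        posterior = p_know * (1 - s) / (p_know * (1 - s) + (1 - p_know) * g)
    else:
        posterior = p_know * s / (p_know * s + (1 - p_know) * (1 - g))
    return posterior + (1 - posterior) * p_learn

def trace(responses, p_l0=0.2, p_t=0.1, g=0.2, s=0.1):
    """Run knowledge tracing over a 0/1 response sequence."""
    p = p_l0
    for r in responses:
        p = bkt_step(p, r, p_t, g, s)
    return p
```

For the sequence 0, 0, 0, 1, 0, 1, 1, 1 from the example above, the final estimate is much higher than for an all-incorrect sequence of the same length.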

2.4 Educational Data Mining and Recommender Systems

We describe educational data mining, which is essentially about how to improve learning based on collected educational data. We proceed with a description of recommender systems and their heavily used method – collaborative filtering. All three areas are relevant to our research, since we use recommendations in our Tutor, data mining techniques in the parameter estimation of the proposed model, and our model is also related to the matrix factorization method from collaborative filtering.

2.4.1 Educational Data Mining

Educational data mining uses computational approaches to analyze educational data in order to study educational questions [98]. The goals of educational data mining vary a lot, but it is essentially about the improvement of learning.

For learners this could mean personalization of e-learning, recommendation of activities, suggesting interesting learning experiences or recommending relevant discussions (see examples: [28, 48, 92, 105, 121]). For educators this could mean analyzing students' behaviour, detecting students that need support, and finding frequently made mistakes or irregular patterns (see examples: [15, 127, 128]). For course developers this could mean evaluating or maintaining the courseware, improving students' learning and evaluating the structure of the course.

It is also widely used to predict students' performance – for example to predict grades in school or to model students' knowledge and predict

their performance within a tutoring system. Prediction of a student's performance is one of the oldest and most popular applications of data mining (see [10, 13, 49, 53, 90]).

Educational data mining usually uses raw data from educational systems such as offline systems, intelligent tutoring systems or other e-learning or web based educational systems. The data are usually very specific to the area, with intrinsic semantic information. An example is the Q-matrix, which represents relations among students and learning concepts (see [12]).

Techniques used in educational data mining are mostly only slightly altered data mining techniques. Some can be applied directly, others have to be adapted. Educational data mining uses techniques such as classification, text mining, clustering, decision trees, neural networks, Bayesian networks, and statistical techniques such as regression analysis and correlation. In our setting we have a large set of problem solving data from a tutor, which we use to predict students' performance and problem parameters based on solving times. As far as the techniques are concerned, we use maximum likelihood estimation, linear regression, the k-means clustering algorithm, gradient descent and statistical techniques.

2.4.2 Recommender Systems

Recommender systems [1, 65] are used mainly in e-commerce. These systems recommend to users products that may be interesting for them. For example, they are used for recommending books on Amazon [73], for recommending Google news [38], or for recommending films on Netflix [19, 17, 20, 106]. One of the approaches to recommendation – collaborative filtering – is based on the use of data on user behaviour. With this approach a recommender system collects data and uses these data to make predictions and recommendations at the same time. We build our Tutor in the same way, although we are not interested in recommending products, but problems of suitable difficulty. Since students vary in their skills, it is crucial to make problem recommendations individually adaptive. This approach is in contrast with the mainly linear approach (collect data, calibrate models, use models) used in the item response theory and in education in general. Since collaborative filtering is highly relevant to our model, we investigate the method more deeply.


2.4.3 Collaborative Filtering

Collaborative filtering is typically used in recommender systems, but its techniques have also been used in the context of the item response theory [22]. The goal is to predict future user ratings based on past ratings. There are two basic methods for collaborative filtering: neighbourhood based (memory based) and matrix factorization (model based).

Both methods capture interactions between users and items. However, a large part of the observed ratings is independent of the user–item interaction. It is caused by a systematic tendency of some users to give higher ratings than others and of some items to receive higher ratings than others. These biases can be captured by a baseline predictor model. We use the following notation: r_ui is the rating of user u for item i, where high values mean stronger preference, and the predicted rating is denoted \hat{r}_{ui}. The baseline predictor is then [65]:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

where µ is the overall average rating, and b_u and b_i indicate the observed deviations of user u and item i from the average. The parameters b_u and b_i are estimated by solving the least squares problem:

$$\min_{b_*} \sum_{u,i} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \left( \sum_u b_u^2 + \sum_i b_i^2 \right)$$

The first term seeks to find values of b_u and b_i that fit the given ratings. The second term seeks to avoid overfitting by penalizing the magnitudes of the parameters. The problem can be solved efficiently by stochastic gradient descent.

Matrix factorization transforms both items and users to the same latent factor space. The factor space tries to explain ratings by characterizing both items and users on factors derived from user feedback. Every item i is associated with a vector q_i ∈ R^f, and every user u with a vector p_u ∈ R^f. For a given user u, the elements of p_u measure the interest of the user in the corresponding factors. The product q_i^T p_u captures the overall interest of user u in item i. The final rating is constructed by also adding the baseline predictors that depend only on the user or the item [65]:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$$

To learn the model parameters, the regularized squared error is minimized:


$$\min_{b_*, q_*, p_*} \sum_{u,i} \left(r_{ui} - \mu - b_u - b_i - q_i^T p_u\right)^2 + \lambda \left(b_u^2 + b_i^2 + \|q_i\|^2 + \|p_u\|^2\right)$$

The constant λ controls the degree of regularization. The parameters can again be estimated by stochastic gradient descent.

The neighborhood methods focus on relationships between items or between users [18, 74, 100]. An item–item approach models the rating of a user for an item based on the ratings of the user for similar items. Similarity is typically based on Pearson's correlation coefficient, which measures the tendency of users to rate items similarly.

If the goal is to predict r_ui for an unrated item i, then using the similarity measure the k closest items are identified; this set is denoted S^k(i; u). Parameter s_ij denotes the similarity of items i and j. The value of \hat{r}_{ui} is taken as a weighted average of the ratings of the neighboring items, with an adjustment for the baseline predictors:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}}$$

Even though matrix factorization provides better results, the most widely used approach is based on neighborhood methods. There are two reasons for this. First, they provide intuitive explanations of recommendations; the similarity measure also allows users to determine which of the rated items have an impact on a prediction. Second, they can provide immediate recommendations for new users: when a user provides feedback to the system, it can be used immediately to improve recommendations, whereas matrix factorization requires retraining the model.
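The stochastic gradient descent updates for the biased factorization model follow directly from the gradient of the regularized squared error; a minimal sketch (the learning rate, regularization constant and toy data are illustrative choices of ours):

```python
import random

def factorize(ratings, n_users, n_items, f=2, lr=0.01, reg=0.05, epochs=300):
    """Fit r_ui ~ mu + b_u + b_i + q_i . p_u by stochastic gradient descent.
    ratings is a list of (user, item, rating) triples."""
    rng = random.Random(0)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi = [0.0] * n_users, [0.0] * n_items
    p = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_items)]

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(q[i][k] * p[u][k] for k in range(f))

    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - predict(u, i)  # prediction error drives all updates
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for k in range(f):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += lr * (e * qi - reg * pu)
                q[i][k] += lr * (e * pu - reg * qi)
    return predict
```

On a tiny rating matrix the fitted predictor reproduces the observed preference ordering of each user, while the regularization keeps the parameter magnitudes bounded.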

2.5 Human Problem Solving and Puzzles

In this section we give an overview of relevant research in human problem solving and puzzle difficulty. Human problem solving has been studied for a long time, starting with the seminal work of Simon and Newell [104]; for a recent overview see [95]. In this thesis we focus concretely on puzzles. Logic puzzles are a typical example of well-structured problems [125]. They contain all important information in the statement of the problem


(and hence do not depend on knowledge), they are amenable to automated analysis, and they are also attractive for humans. The use of puzzles has a long tradition both in computer science (particularly artificial intelligence) [101] and in cognitive psychology [104, 125]. Research concerned with puzzles which can be directly expressed as state space traversal, e.g., the Tower of Hanoi puzzle [70], river crossing problems [47], the water jug puzzle [9], the Fifteen puzzle [96], and the Chinese ring puzzle [71], is particularly relevant for our work.

2.5.1 Difficulty and Puzzles

There are several types of factors that influence problem difficulty. First, problem difficulty is influenced by the context of the problem. The same problem may have different difficulty depending on the context in which it is presented. A typical example of this is the Einstellung effect [79], first studied for the water jug problem.

Second, the overall problem difficulty depends on the difficulty of individual steps in the solution. This effect was demonstrated particularly with the use of isomorphic problems, i.e., problems which have the same underlying structure but a different cover story. The most famous results in this regard concern the Tower of Hanoi [70] and the Chinese ring puzzle [71]. Depending on the representation of the problem, there are different requirements on working memory and processing of information, and this influences the difficulty.

Third, problem difficulty is influenced by the overall structure of the problem state space. Previous research has focused on straightforward measures such as the size of the state space or the length of the solution path, and on the effectiveness of the hill-climbing heuristic, which was studied for example for river crossing problems [47], the Fifteen puzzle [96], and the water jug puzzle [9, 26].

In this work we report on experiments where we show that these factors do not fully explain the differences in problem difficulty that we obtained from experiments with human solvers (see Fig. 2.6). We believe that these unexplained differences are caused by differences in the structure of the problems' state spaces.

2.6 Our Approach: Focus on Timing Information

Problem solving is an important part of education in general and of intelligent tutoring systems in particular. To use problem solving activities efficiently, it is important to estimate their difficulty well. Easy problems are boring, difficult problems are frustrating – this observation is elaborated in detail by the flow concept [35, 36].

Figure 2.6: An example of two similar Sokoban puzzles. The median solving time for the problem on the left is 1 minute, for the right one it is 49 minutes. Where are the origins of these huge differences?

In this work, we focus on modeling students' performance in this kind of timing based exercise. To attain a clear focus, we consider only the information on problem solving times. It may be useful to combine this approach with other data about students and problems (e.g., data about students from other learner models, data about problems from a human expert in the given domain). Nevertheless, even this basic approach is applicable in a practical system and it has the important advantage of being simple and cheap (e.g., compared to knowledge tracing models, which require significant expertise).

2.6.1 Correctness Versus Timing Approach

In intelligent tutoring systems [3, 118], student models typically focus on the correctness of students' answers [40], and correspondingly, problems in tutoring systems are designed mainly with a focus on correctness. This focus is partly due to historical and technical reasons – the easiest way to collect and evaluate student responses is multiple choice questions. Thanks to advances in technology, however, it is now relatively easy to create rich interactive problem solving activities. In such environments it is useful to analyze not only the correctness of students' answers, but also timing information about the solution process.

We can illustrate this difference in focus using a problem from mathematics education (mentioned in Chapter 1). A student is given a graph of a function and the task is to find a formula which corresponds to the


graph. This general task can be realized in several ways (see Fig. 2.7). One approach is to use the traditional "textbook style" multiple choice question. Another approach is to use an interactive environment in which students are provided with a text field, they type in their answers, and when they submit an answer they see the graph corresponding to their formula. With this interactive approach, the student modeling, and thus the tutor responses, can focus either on correctness (particularly the correctness of the first answer) or on the time needed to find the correct solution (irrespective of the number of attempts).

Figure 2.7: Illustration of two approaches to problem solving activities: A) test question, B) interactive exercise.

2.6.2 Tutoring Based on Timing Information

Intelligent tutoring systems and their student models focus mainly on the correctness of students' answers. Even if the tool provides an interactive problem solving setting, the data that are collected and modeled are mostly of the test question type (correct/incorrect). For example, a recent UMUAI review of learner models [40] describes only models of this type and does not consider timing information. If the timing information is considered, it is usually only as a supplement to the correctness data, e.g., [16].

In this work we take a different approach. We focus solely on the timing information. We consider well-structured problems with a clearly defined

correct solution, where the problem solving time is the single measure of students' performance. There are no other "quality of solution" measures, i.e., neither hints given during solutions nor acceptance of partial solutions. Students have to continue solving a problem until they reach the correct solution (or they can abandon the problem and try a simpler one).

2.6.3 Examples of Problems

We describe three example problems in order to give the reader a better idea of the kind of problems we are considering and to emphasize the contrast between these problems and test items.

• Linear transformations. The goal is to transform a given shape into a target shape using linear transformations (rotation, scaling, reflection). Transformations are specified either using buttons for individual transformations or using a matrix notation.

• Standard programming. The goal is to write a program for a specified task, programs are tested over (hidden) testing data. If a submitted program is incorrect, students are provided with specific inputs on which the program does not work correctly.

• Logic puzzles. Well-structured logic puzzles with clearly defined rules and goals, e.g., well-known puzzles like Sudoku, Nurikabe, Sokoban, Rush Hour.

In all cases the problems are well-defined and the system can recognize a correct solution. Problems are interactive and students get immediate feedback on their individual attempts. This problem formulation leads to an iterative approach to the solution and helps to build intuition about the problem domain. Moreover, when we are able to find a suitable problem solving formulation, students often find the problems quite attractive and are willing to do activities that would otherwise be considered rather "boring" (like practicing functions).

Each approach leads to a slightly different learning experience. The multiple choice version leads to practice of deductive reasoning with a focus on details. The interactive timing based version is suitable for building intuition and gradual approximation to a solution. In this case, both realizations have their merits. In some domains, like computer programming, the interactive timing based formulation is much more natural – debugging is an inherent part of programming.

3 Model of Problem Solving Times

We describe a model which assumes a linear relationship between a problem solving skill and the logarithm of the time to solve a problem, i.e., an exponential relation between skill and time. We discuss connections of the model to two different areas – the item response theory and collaborative filtering. We describe three extensions of the model – a model with students' variability, a model with learning, and a model with a multidimensional skill. For the model we derive two parameter estimation procedures, based on stochastic gradient descent and on iterative joint estimation. For these methods we give a short introduction with a simple example.

3.1 Motivation

The aim of our model is to describe the relation between a latent skill and problem solving time. If we know the level of a student's skill, with the model we can compute a prediction of the solving time for yet unsolved problems. Concretely, we model the probability distribution of the solving time rather than an exact solving time; for practical use we then take the expected value as the prediction of the solving time. The reason why we model a probability distribution is that we expect inherent variance in the solving performance of students and also variance tied to problems.

3.1.1 Preliminaries

In this work we do not consider any other information about students and problems except for the problem solving times. We study models for predicting future problem solving times based on the available data.

We work with the logarithm of time instead of the untransformed time itself (see Fig. 3.1). There are several good reasons to do so. First, problem solving times have a natural "multiplicative" (not "additive") nature, e.g., if Alice is a slightly better problem solver than Bob, then we expect her times to be 0.8 of Bob's times (not 20 seconds smaller than Bob's times). Second, previous research on response times in the item response theory successfully used the assumption of a log-normal distribution of response times [110, 113], and analysis of our data also suggests that problem solving times are log-normally distributed. Third, the use of the logarithm of time has both theoretical advantages (e.g., applicability of simple linear models) and pragmatic advantages (e.g., reduction of the effect of outliers).


Figure 3.1: We assume exponential relation between skill θ and time (left). We work with a logarithm of time (right).

We study models which, for given student parameters ~x and problem parameters ~y, provide a probability distribution of problem solving times, i.e., a model is given by specifying p(t|~x, ~y). Such models can be used to generate synthesized data – a useful feature that we use in evaluation. To give predictions of problem solving times we use the expected value of p(t|~x, ~y).

3.2 Basic model

We assume that we have a set of students S, a set of problems P, and data about problem solving times: t_sp is the logarithm of the time it took student s ∈ S to solve a problem p ∈ P (T is a matrix with missing values). In the following we use the subscript s to index student parameters and the subscript p to index problem parameters; t always stands for the logarithm of the solving time. We assume that problem solving performance depends on one latent problem solving skill θ_s and two main problem parameters: a basic difficulty of the problem b_p and a discrimination factor a_p. The basic structure of the model is simple – a linear model with Gaussian noise:

$$t_{sp} = b_p + a_p \theta_s + \epsilon$$

The basic difficulty b_p describes the expected solving time of a student with average skill. The discrimination factor a_p describes the slope of the function, i.e., it specifies how the problem distinguishes between students with different skills. We illustrate the influence of different values of discrimination with a simple example. Imagine that we have two students – student 1 and student 2 with θ_1 = 0 and θ_2 = 1 respectively – and a problem with parameter a_p = −1. The basic


Figure 3.2: Examples of three different problem types and their parameters for the basic model. Below every picture one representative of typical prob- lem is given. model predicts that student 2 will be two times faster – he solves problem in 50% of solving time of student 1 (we use logarithm with the base two – log2 tsp). Now when problem has low discrimination ap = −0.5 then the difference between predicted times for student 1 and 2 would be only 29%. On the other hand given the high value of discrimination a = −1.5 the difference in solving time prediction would be 65%. Finally,  is a random noise given by a normal distribution with zero mean and a constant variance. The model in [60, 61] assumes problem de- pendent variance cp:

t_sp = b_p + a_p·θ_s + N(0, c_p)

The predicted time for a student s and a problem p is the expected value of tsp, i.e.,

t̂_sp = b_p + a_p·θ_s

The model is relatively simple, yet it can capture different aspects of problem difficulty and their combinations (see Fig. 3.2). Note that the presented model is not yet identified, as it suffers from the “indeterminacy of the scale” (analogously to many IRT models). This is solved by normalization (again analogously to IRT models) – we require that the mean of all θ_s is 0 and the mean of all a_p is −1.
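The basic model is generative, so it can be sketched directly in code. The following minimal numpy sketch samples a matrix of log-times t_sp = b_p + a_p·θ_s + ε; the meta-parameter values are illustrative assumptions only (loosely following the simulated-data settings used later in the evaluation, e.g., θ ~ N(0, 0.7), b ~ N(7, 2)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population meta-parameters (illustrative values only).
n_students, n_problems = 100, 25
theta = rng.normal(0.0, 0.7, n_students)   # student skills, mean 0
a = rng.normal(-1.0, 0.4, n_problems)      # discrimination factors, mean -1
b = rng.normal(7.0, 2.0, n_problems)       # basic difficulties (log-time scale)
c = 1.0                                    # constant noise standard deviation

# t_sp = b_p + a_p * theta_s + noise  (logarithm of solving time)
t = b[None, :] + np.outer(theta, a) + rng.normal(0.0, c, (n_students, n_problems))

# Predicted (expected) log-time: b_p + a_p * theta_s
t_hat = b[None, :] + np.outer(theta, a)
```

The prediction t_hat is the expected value of the sampled times; the residual t − t_hat is pure N(0, c²) noise.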


Figure 3.3: If a problem is solved by above-average students, the mean time underestimates the difficulty of a problem, whereas our model can capture it correctly.

3.2.1 Group Invariance

A typical metric of problem difficulty is the mean time to solve a problem. This metric is misleading if the subgroup of students who solved the problem is not representative of the whole population. An important feature of our approach is that the model is “group invariant” – similarly to IRT models [39]: problem (student) parameters do not depend on the subgroup of students who solved the problem (on the problems solved by a student). Let us explain this important feature by comparing our model with the mean time to solve a problem (Fig. 3.3). In both cases the basic problem difficulty is captured by one number – by the difficulty parameter b in our model or by the mean m. If we have a set of problems, then it typically happens that harder problems are solved only by students with above-average skill. In this case the mean time underestimates the real difficulty of the problem, whereas our model is not biased by the selection of students.

3.2.2 Relations to Item Response Theory and Collaborative Filtering

The model has interesting connections to item response theory and collaborative filtering. Item response theory deals with test items with discrete sets of answers and models the probability of a correct answer. The basic models of IRT assume that the probability of a correct response depends on one latent skill θ. The most often used model is the three-parameter logistic model:


Figure 3.4: An intuitive illustration of the analogy between the IRT three pa- rameter model (A) and our model (B). Dashed lines illustrate distributions for certain skill; solid line denotes the expected problem solving time, grey area depicts the area into which most attempts should fall.

p(correct | a, b, c, θ) = c + (1 − c) · e^(a(θ−b)) / (1 + e^(a(θ−b)))

This model has three parameters (see Fig. 3.4): b is the basic difficulty of an item, a is the discrimination factor (slope of the curve – how well the item discriminates based on skill), and c is the pseudo-guessing parameter (lower limit of the curve – the probability that even a student with very low skill guesses the correct answer). We have intentionally chosen the notation for our model to be analogous to the IRT three-parameter model; the analogy is illustrated in Fig. 3.4. For more information on the theory see Section 2.1. The goal of collaborative filtering is typically to predict future user ratings based on past ratings. Instead of predicting ratings, we predict problem solving times, but otherwise our situation is analogous – in both cases the input is a large sparse matrix and the outputs are predictions of its missing values. The main principle of matrix factorization methods in collaborative filtering is based on the singular value decomposition and leads to the following model [69]:

r_ui = b_i + b_u + q_i^T · p_u + ε

where u is a user, i is an item, r_ui is the predicted rating, q_i and p_u are vectors of length k which specify item-feature and user-feature interactions, b_i and b_u are item and user biases, and ε is random noise. The parameters of the model are typically estimated using stochastic gradient descent with the goal of minimizing the sum of squared errors. Our model is similar to this approach for k = 1.

Figure 3.5: The model with variance is equivalent to the following approach: “at first determine a student's local skill for the attempt (based on his variance σ_s²) and then determine the problem solving time with respect to this local skill”.

3.3 Model with Variability of Students’ Performance

The basic model outlined above assumes a constant variance of the noise. But clearly some students are more consistent in their performance than others, and problem characteristics also influence the variance of the noise. To incorporate these effects we assume that the variance is given as:

ε ~ N(0, c_p² + a_p²·σ_s²)

i.e., the variance is a weighted sum of a problem variance c_p² and a student variance σ_s², where the student's contribution to the variance depends on the discrimination of the problem. Intuitively, a student's characteristics matter particularly for more discriminating problems. Thus we now have three problem parameters a, b, c and two student parameters θ, σ. The model is the same as before, only the noise is modeled in more detail [54]:

t_sp = b_p + a_p·θ_s + N(0, c_p² + a_p²·σ_s²)

The model with variance c_p² + a_p²·σ_s² is equivalent to the following approach: “at first determine a student's local skill θ′ for the attempt (based on his variance σ_s²) and then determine the problem solving time with respect to this local skill”:

p(θ′|s) = N(θ′ | θ_s, σ_s²)

p(t|θ′, p) = N(t | b_p + a_p·θ′, c_p²)

For an illustration of the procedure see Fig. 3.5. The equivalence of these two definitions is a special case of a general result for Gaussian distributions (see e.g. [24]).
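The equivalence can also be checked numerically. The following sketch (all parameter values are arbitrary assumptions) samples times via the two-step local-skill decomposition and compares the empirical mean and variance with the direct formula:

```python
import numpy as np

rng = np.random.default_rng(1)

theta_s, sigma_s = 0.5, 0.6      # student skill and variability (assumed values)
a_p, b_p, c_p = -1.0, 7.0, 0.8   # problem parameters (assumed values)
n = 200_000

# Two-step sampling: draw a local skill, then the time given that local skill.
theta_local = rng.normal(theta_s, sigma_s, n)   # p(theta'|s)
t = rng.normal(b_p + a_p * theta_local, c_p)    # p(t|theta', p)

# Direct model: t ~ N(b_p + a_p*theta_s, c_p^2 + a_p^2*sigma_s^2)
mean_direct = b_p + a_p * theta_s               # = 6.5
var_direct = c_p**2 + a_p**2 * sigma_s**2       # = 1.0
```

With 200 000 samples the empirical mean and variance of t match mean_direct and var_direct to within sampling error, illustrating the Gaussian marginalization result cited above.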

3.4 Basic Model with Learning

It is sensible to incorporate learning into the model. The basic model assumes a fixed problem solving skill, but students' problem solving skill should improve as they solve more problems – that is, after all, our main aim. The model extension is inspired by the extensive research on learning curves. A learning curve is a graph which plots the performance on a task (usually time or error rate) against the number of trials. In the majority of human activities the shape of the learning curve is driven by a power law: T = B·N^(−α), where T is the predicted time, N the number of trials, α the learning rate, and B the performance at the first trial. Other curves like hyperbolic, logistic or exponential have been tested as well [85], but seem to do better only in a few cases. Imagine a simple activity like tying shoelaces – the first time in our life it takes a very long time, then we improve rapidly, and finally the curve slowly flattens out since we cannot get much faster. Learning curves are often examined in psychology (for example [72]) to capture cognitive skills or memory processes, as well as in economics [52]. The slope and fit of the curve are measured to assess the quality of learning or the progress taking place. Finally, these curves are used to compare students, tasks or methods in order to improve the learning process, tutoring systems [81] or business strategies. If we take the logarithm of the above mentioned form of the power law, it can be naturally combined with our basic model of problem solving times [54]:

t_sp = b_p + a_p·(θ_s + δ_s·log(k_sp)) + ε

where δ_s is a student's learning rate and k_sp is the order of problem p in the problem solving sequence of student s. In the current analysis of this model we assume a constant variance; nevertheless, the model can be easily combined with the more detailed model of the noise presented above.


Table 3.1: Overview of proposed models

Model                              Prediction                               Noise ε
Basic model                        b_p + a_p·θ_s + ε                        N(0, c_p)
Model with student's variability   b_p + a_p·θ_s + ε                        N(0, c_p² + a_p²·σ_s²)
Model with learning                b_p + a_p·(θ_s + δ_s·log(k_sp)) + ε      N(0, k)
Model with multidim. skill         b_p + θ_0s + a_p^T·θ_s + ε               N(0, k)

3.4.1 Model with Multidimensional Skill

We can relax the assumption of a single latent problem solving skill and assume a multidimensional skill. The model can be extended in a quite straightforward way; the extension corresponds to the above described model for collaborative filtering:

t_sp = b_p + θ_0s + a_p^T·θ_s + ε

where θ_0s is a “basic skill” and θ_s is a vector of “corrective skills” which correspond to the respective discrimination parameters given by a vector a_p. The parameter estimation procedure can be derived analogously to the basic model, as will be described subsequently in Section 3.6. In this case, however, we do not have any natural good initialization for the parameters a_p and θ_s, and thus the behaviour of gradient descent is less predictable and can lead to a local minimum. A summary of the presented models can be found in Table 3.1.

3.5 Introduction to Maximum Likelihood and Estimation Methods

Now that we have formalized the models, we would like to derive their parameters from observed data. To estimate the parameters of our models we use the maximum likelihood method and two estimation techniques – iterative joint estimation and gradient descent. Firstly, we use the maximum likelihood method to derive an error function which describes how well our model fits the given data. Then we use an analytical approach and a gradient descent approach to minimize this error function, i.e., to find the best fit between the model and the observed data. Applying these methods directly to our models can make the whole procedure harder to follow; therefore, in this section we give a short introduction to these methods applied to the simple problem of linear regression.


3.5.1 Maximum Likelihood for Univariate Gaussian Linear Regression

Imagine we have a training set of n points represented by two vectors x and y. The values y_i are normally distributed with mean x_i·θ_1 and variance σ²:

y_i ~ N(x_i·θ_1, σ²)

Our goal is to fit the data with a simple linear function y = θ_1·x, which we will call a hypothesis h_θ(x) or simply a model. Once we calibrate the model we can use it to predict y for a given x. For a given point (x_i, y_i) and parameters θ_1, σ², the probability density function p(y_i|θ_1, σ²) gives the probability of observing the value y_i for the given value x_i:

p(y_i | θ_1, σ²) = N(x_i·θ_1, σ²)

We are seeking the values of the model parameters θ_1, σ² which maximize the joint probability distribution of y_1 ... y_n. In other words, we are trying to maximize the likelihood of seeing the training data by modifying the parameters (θ_1, σ):

L = p(y_1, ..., y_n | θ_1, σ²) = ∏_{i=1}^n p(y_i | θ_1, σ²) = ∏_{i=1}^n (2πσ²)^(−1/2) · e^(−(y_i − x_i·θ_1)² / (2σ²))

Notice that to maximize the likelihood L we need to minimize the terms (y_i − x_i·θ_1)². The illustration in Fig. 3.6 shows the main principle of the method for two data points. We take the logarithm of the likelihood function, since it is easier to work in log space (we get rid of the exponents). The logarithm of the likelihood function is:

ln L(θ_1, σ² | y) = −(n/2)·ln(2π) − (n/2)·ln(σ²) − (1/(2σ²)) · Σ_{i=1}^n (y_i − θ_1·x_i)²

With respect to the parameter θ_1, maximizing ln L is equivalent to minimizing the following error function J(θ_1):

J(θ_1) = Σ_{i=1}^n (y_i − θ_1·x_i)²


Figure 3.6: Illustration of the maximum likelihood method for two data points (x_1, y_1) and (x_2, y_2). We are seeking the values θ_1 and σ² which maximize the product of the probabilities p_1, p_2 of the observed values y_1, y_2.

We will show two different methods for locating the minimum of the error function – an analytical approach and a gradient descent approach.

3.5.2 Analytical Estimation

We are looking for the maximum of the likelihood function. Setting the first partial derivatives of ln L with respect to θ_1 and σ² to zero we obtain:

Σ_{i=1}^n x_i·(y_i − θ_1·x_i) = 0

Σ_{i=1}^n (y_i − θ_1·x_i)² = n·σ²

Solving the above equations leads to:

θ_1 = Σ_{i=1}^n x_i·y_i / Σ_{i=1}^n x_i²

σ² = (1/n) · Σ_{i=1}^n (y_i − θ_1·x_i)²

These are the values θ_1, σ² of our model which maximize the joint probability distribution of the observed data.
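The closed-form estimates fit in a few lines of Python; this is a minimal sketch of the standard least-squares computation (the function name is ours):

```python
import numpy as np

def fit_univariate(x: np.ndarray, y: np.ndarray):
    """Closed-form maximum likelihood estimates for y_i ~ N(x_i*theta_1, sigma^2)."""
    theta1 = float(np.sum(x * y) / np.sum(x * x))   # least-squares slope (no intercept)
    residuals = y - theta1 * x
    sigma2 = float(np.mean(residuals ** 2))         # ML estimate of the variance
    return theta1, sigma2

# Noise-free data on the line y = 2x: the estimate recovers the slope exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
theta1, sigma2 = fit_univariate(x, 2.0 * x)
```

For noise-free data the residuals vanish, so sigma2 is zero and theta1 equals the true slope.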


Figure 3.7: The training set and the model h(x) (left); the error function J(θ_1) for different values of θ_1 (right). Notice that the error function is minimized for θ_1 = 1.

3.5.3 Gradient Descent Estimation

Even though the analytical approach would be sufficient here, we show another approach – the gradient descent method. For our models an analytical solution may be intractable, so we use gradient descent to approximate the model parameters which minimize the error function. Let us recall that our goal is to fit the data with the simple linear function y = θ_1·x, which we call our hypothesis h_θ(x). We would like to find a value of θ_1 such that h_θ(x) is close to y on our training set (x, y). To measure “closeness” we use the previously defined error function J(θ_1) – a function which describes how well our model fits:

J(θ_1) = Σ_{i=1}^n (y_i − θ_1·x_i)²

We are looking for the value of the parameter θ_1 for which the error function is minimized. The left image of Fig. 3.7 shows the training set and a not very good estimate of the model. The right image depicts the error function J(θ_1) for different values of the parameter θ_1; notice that the error function is minimized for θ_1 = 1. In summary, we have:

1. Model (hypothesis): hθ(x) = θ1x

2. Model parameter: θ1

3. Error function: J(θ1)

4. Goal: minimize J(θ1)


The outline of the gradient descent algorithm for minimizing the error function is:

1. Start with some value of θ_1.

2. Keep changing θ_1 according to the gradient of J(θ_1),

3. until we end up at a local minimum.

Concretely, the definition of the gradient descent algorithm for our case with one parameter is:

Repeat until convergence: θ_1 := θ_1 − α · (∂/∂θ_1) J(θ_1)

where α is the learning rate (how big steps we take), θ_1 is our parameter, and (∂/∂θ_1)J(θ_1) is the partial derivative of our error function. Let us explain the intuition behind the algorithm using image A in Fig. 3.8. Suppose we start with a value of θ_1 which lies to the right of the minimum. The derivative of J(θ_1) is then a positive number; we multiply the derivative by the learning rate α and subtract it from θ_1. Thus θ_1 gets a new, smaller value and moves toward the minimum. As we approach the minimum of J(θ_1), the slope decreases and therefore our steps become smaller and smaller. Notice the role of the learning rate α. When α is too large, the algorithm overshoots the minimum and diverges (see image B in Fig. 3.8). On the other hand, when α is too small, the algorithm takes only small steps toward the global minimum and tends to be very slow (see image C in Fig. 3.8). A suitable value of α is therefore crucial. The initial setting of the parameter θ_1 is also crucial, since gradient descent may end up in a local minimum instead of the global minimum. In image D of Fig. 3.8 we see two different initial settings of θ_1: the left one ends up in the local minimum A, whereas the right one ends up in the global minimum B.
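The update rule above can be sketched in a few lines of plain Python (names and the toy data are ours); for data lying exactly on the line y = x, the procedure converges to θ_1 = 1:

```python
def gradient_descent(xs, ys, alpha=0.01, steps=500, theta1=0.0):
    """Minimize J(theta1) = sum_i (y_i - theta1*x_i)^2 by gradient descent."""
    for _ in range(steps):
        # dJ/dtheta1 = -2 * sum_i x_i * (y_i - theta1 * x_i)
        grad = -2.0 * sum(x * (y - theta1 * x) for x, y in zip(xs, ys))
        theta1 -= alpha * grad   # step against the gradient
    return theta1

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]        # exact fit at theta1 = 1
theta_est = gradient_descent(xs, ys)
```

Here α = 0.01 keeps the effective step factor below 1, so the iteration contracts toward the minimum; a much larger α would make it diverge, exactly as described for image B.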

3.6 Parameter Estimation Using Maximum Likelihood

We need to estimate the model parameters from given data. To do so we use maximum likelihood estimation and stochastic gradient descent. As this is a rather standard approach (see e.g. [24]), in the following description we focus only on the derivation of the error function and the gradient. The likelihood of the observed times t_sp is:

L = ∏_{s,p} N(b_p + a_p·θ_s, c_p² + a_p²·σ_s²)(t_sp)


Figure 3.8: Four different scenarios for the gradient descent algorithm are depicted. See text for an explanation.

To make the derivation more readable, we introduce the following notation:

• e_sp = t_sp − (b_p + a_p·θ_s) (prediction error for a student and a problem),

• v_sp = c_p² + a_p²·σ_s² (variance for a student and a problem).

When we take the log-likelihood we can thus write it as:

ln L = Σ_{s,p} ( −e_sp²/(2·v_sp) − (1/2)·ln(v_sp) − (1/2)·ln(2π) )

Maximizing the log-likelihood is thus equivalent to minimizing the following error function:

E = Σ_{s,p} E_sp,   where   E_sp = (1/2)·( e_sp²/v_sp + ln(v_sp) )

It is intractable to find the minimum analytically, but we can minimize the function using stochastic gradient descent. To do so we need to compute the gradient of E_sp:

∂E_sp/∂a_p = (−e_sp·θ_s·v_sp − a_p·σ_s²·e_sp²)/v_sp² + a_p·σ_s²/v_sp = −(e_sp/v_sp)·(θ_s + a_p·σ_s²·e_sp/v_sp) + a_p·σ_s²/v_sp

∂E_sp/∂b_p = −e_sp/v_sp

∂E_sp/∂θ_s = −a_p·e_sp/v_sp

∂E_sp/∂c_p² = (1/2)·(−e_sp²/v_sp² + 1/v_sp) = −(1/(2·v_sp²))·(e_sp² − v_sp)

∂E_sp/∂σ_s² = (1/2)·(−a_p²·e_sp²/v_sp² + a_p²/v_sp) = −(a_p²/(2·v_sp²))·(e_sp² − v_sp)

Note that the obtained expressions in most cases have a straightforward intuitive interpretation. For example, the gradient with respect to θ_s is −a_p·e_sp/v_sp, which means that the estimation procedure gives more weight to attempts on problems which are more discriminating and have smaller variance. Stochastic gradient descent can find only local minima. However, with a good initialization we can improve the chance of finding the global optimum. In our case there is a straightforward way to get a good initial estimate of the parameters:

• b_p = mean of t_sp (for the given p),

• a_p = −1,

• θ_s = mean of b_p − t_sp (for the given s),

• c_p = 1/2 of the variance of b_p − t_sp (for the given p),

• σ_s = 1/2 of the variance of b_p − t_sp (for the given s).

If we make the simplifying assumption that the variance is constant (independent of the particular problem and student), then the error function is the basic sum-of-squares error function and the computation of the gradient simplifies to:

∂E_sp/∂a_p = −θ_s·e_sp,   ∂E_sp/∂b_p = −e_sp,   ∂E_sp/∂θ_s = −a_p·e_sp
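The constant-variance special case can be sketched compactly; this is a minimal illustration assuming a dense matrix with no missing values, using the simplified gradients and the initialization described above (step size and epoch count are our own choices):

```python
import random
import numpy as np

def sgd_basic_model(T, alpha=0.005, epochs=200, seed=0):
    """Fit t_sp = b_p + a_p*theta_s to a dense matrix T of log-times
    (rows = students, columns = problems) by stochastic gradient descent
    on the sum-of-squares error (constant-variance assumption)."""
    random.seed(seed)
    S, P = T.shape
    b = T.mean(axis=0)                       # init: b_p = mean of t_sp
    a = -np.ones(P)                          # init: a_p = -1
    theta = (b[None, :] - T).mean(axis=1)    # init: theta_s = mean of b_p - t_sp
    pairs = [(s, p) for s in range(S) for p in range(P)]
    for _ in range(epochs):
        random.shuffle(pairs)
        for s, p in pairs:
            e = T[s, p] - (b[p] + a[p] * theta[s])   # prediction error e_sp
            # gradient steps: dE/da_p = -theta_s*e, dE/db_p = -e, dE/dtheta_s = -a_p*e
            a[p] += alpha * theta[s] * e
            b[p] += alpha * e
            theta[s] += alpha * a[p] * e
    return a, b, theta
```

On noise-free synthetic data the procedure drives the prediction error close to zero, since the model can represent such data exactly (the parameters themselves are identified only up to the scale normalization discussed earlier).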


3.7 Parameter Estimation Using Iterative Joint Estimation

Problem parameters a, b, c and student skills θ are estimated using an iterative computation: problem parameters are computed using estimates of student skills; student skills are improved using estimates of problem parameters (both directions are computed by maximum likelihood estimation); and this process continues until the requested precision is reached. Based on these estimates the system predicts problem solving times and recommends a suitable problem to solve. The collected problem solving data are continuously used to further improve the parameter estimates and problem recommendations.

3.7.1 Approach

Since we know neither the parameters of problems nor the parameters of persons, we need to estimate them. To compute these estimates we use data of the following type: problem p was solved by person s in time t_sp. From these data we need to estimate both the problem parameters a_p, b_p, c_p and the person parameters θ_s. Here we discuss an iterative approach which is analogous to the joint maximum likelihood calculation in IRT. An advantage of the iterative approach is that it computes estimates for each person (problem) separately from the others, so it is possible to update estimates locally without recomputing the whole set of parameters – a useful feature for the application of the theory in our Tutor, which needs to make predictions in real time. Moreover, the iterative approach gives better insight into the computation.

3.7.2 Estimating Skill

Suppose that a person solved n problems, where problem p has parameters a_p, b_p, c_p and was solved in time t_sp. Based on these data we want to estimate the skill θ_s of the person. We do this by finding the maximum likelihood θ_s. The likelihood of the observed times t_s1, ..., t_sn given our basic model is:

L = ∏_{p=1}^n f_{a_p,b_p,c_p,θ_s}(t_sp) = ∏_{p=1}^n k·(1/c_p)·e^(−(t_sp − (b_p + a_p·θ_s))²/(2·c_p²))

We need to find the value of θ_s such that L is maximized. As usual, we proceed by finding the maximum of ln L:


ln L = k + Σ_{p=1}^n ( ln(1/c_p) − (1/(2·c_p²))·(a_p²·θ_s² − 2·a_p·θ_s·(t_sp − b_p) + (t_sp − b_p)²) )

Since this is a quadratic function in θ_s, we can find the maximum by finding the value of θ_s for which the derivative is zero:

∂ln L/∂θ_s = Σ_{p=1}^n ( −(a_p²/c_p²)·θ_s + (a_p/c_p²)·(t_sp − b_p) ) = 0

θ_s = ( Σ_{p=1}^n (a_p²/c_p²) · (t_sp − b_p)/a_p ) / ( Σ_{p=1}^n a_p²/c_p² )

The resulting expression for θ_s has a clear intuitive interpretation. The expression (t_sp − b_p)/a_p is a local estimate of the skill for problem p – it is the value of θ_s for which the expected solving time is t_sp. The overall estimate of θ_s is obtained as a weighted average of these local estimates, where the weight is given by the expression a_p²/c_p², i.e., the more discriminating and less random a problem is, the more weight it gets (which is exactly what we would intuitively expect). For the one-parameter model this expression simplifies to:

θ_s = ( Σ_{p=1}^n (b_p − t_sp) ) / n
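The weighted-average estimate can be written directly as code; a minimal sketch in plain Python (the function name and the example values are ours):

```python
def estimate_skill(times, a, b, c):
    """ML estimate of a student's skill: a weighted average of the local
    estimates (t_sp - b_p)/a_p with weights a_p^2/c_p^2."""
    num = sum((ai ** 2 / ci ** 2) * ((ti - bi) / ai)
              for ti, ai, bi, ci in zip(times, a, b, c))
    den = sum(ai ** 2 / ci ** 2 for ai, ci in zip(a, c))
    return num / den

# Noise-free check: times generated with true skill 0.5 are recovered exactly.
a = [-1.0, -1.5, -0.8]
b = [7.0, 8.0, 6.0]
c = [1.0, 0.5, 2.0]
times = [bi + ai * 0.5 for ai, bi in zip(a, b)]
skill = estimate_skill(times, a, b, c)
```

With noisy times, more discriminating and less noisy problems (larger a_p²/c_p²) dominate the average, as described above.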

Alternatively it is possible to derive the estimate of θ_s by a Bayesian calculation – if we start with an a priori estimate of the person's skill, then we can use Bayes' theorem to compute the estimate of the skill after solving a problem p. If the a priori estimate is given by a normal distribution, then the resulting distribution can be computed analytically and is again a normal distribution. Thus we can apply this procedure iteratively to all problems and derive an explicit expression for θ_s by induction. This approach leads to the same result as the maximum likelihood estimation described above; the only difference is that in the Bayesian computation the resulting weighted sum has an additional element which corresponds to the initial estimate of the person's skill.

3.7.3 Estimating Problem Parameters

Suppose that a problem was solved by n persons, where the s-th person has skill θ_s and solved the problem in time t_s. Now we want to estimate the problem parameters a, b, c. Maximum likelihood estimates can be found by regression analysis. For the two- and three-parameter models we can use standard linear regression (the least squares method), because for our model (linear dependence with normally distributed errors) the least squares method gives the maximum likelihood estimate. The parameter c is then estimated from the error residuals. For the one-parameter model we are looking for a linear regression line with a fixed slope a = −1, thus we need to minimize the following sum of squares:

Σ_{s=1}^n (t_s − (b − θ_s))²

This is a quadratic function with a minimum at:

b = ( Σ_{s=1}^n (t_s + θ_s) ) / n

For the model with learning we use θ_s + δ_s·log(k_sp) instead of only θ_s.

3.7.4 Joint Estimation

So far we assumed that either the abilities are known exactly and we estimate problem parameters, or that the problem parameters are known exactly and we estimate the person's skill. In reality, of course, we know exactly neither the person skills nor the problem parameters. We compute their estimates by an iterative bootstrapping process:

1. initialization: for each problem p, set the problem parameters as follows: a_p = −1, b_p = mean time, c_p = k,

2. repeat until a selected convergence criterion is satisfied:

(a) for each student s, update the estimate of θ_s based on the current problem parameters,

(b) for each problem p, update the estimates of a_p, b_p, c_p based on the current skill estimates.

Although each of the steps computes maximum likelihood estimates (with respect to fixed input parameters), overall it is only an approximation of the joint maximum likelihood. One of the reasons is that the input parameters in each step of the iteration are only estimates and they differ in their confidence, e.g., for persons who solved more problems we have better estimates of their skill. However, this aspect is not included in the described computation. The issue can be (pragmatically) addressed by using weighted least squares for estimating the parameters a_p and b_p, with the weight for each person dependent on the number of solved problems.
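For the one-parameter model (a_p fixed at −1) the whole bootstrapping loop fits in a few lines; a sketch assuming a dense matrix with no missing values (function name ours):

```python
import numpy as np

def joint_estimate(T, iterations=20):
    """Iterative joint estimation for the one-parameter model t_sp = b_p - theta_s,
    alternating between skill updates and difficulty updates (dense matrix T,
    rows = students, columns = problems)."""
    b = T.mean(axis=0)                            # init: b_p = mean time
    theta = np.zeros(T.shape[0])
    for _ in range(iterations):
        theta = (b[None, :] - T).mean(axis=1)     # theta_s = mean_p (b_p - t_sp)
        b = (T + theta[:, None]).mean(axis=0)     # b_p = mean_s (t_sp + theta_s)
    return b, theta

# Noise-free data: predictions b_p - theta_s reproduce T exactly
# (b and theta themselves are identified only up to a common shift).
b_true = np.array([5.0, 7.0, 9.0])
theta_true = np.array([-1.0, 0.0, 1.0, 0.5])
T = b_true[None, :] - theta_true[:, None]
b_est, theta_est = joint_estimate(T)
```

Note that only the predictions are uniquely determined: adding a constant to all skills and to all difficulties leaves them unchanged, which is the indeterminacy of scale discussed earlier.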

3.7.5 Estimating Skill for Model with Learning

We present an extension of the estimation method for the model with learning. Again we suppose that a person solved n problems, where the p-th problem has parameters a_p, b_p, c_p, was solved in time t_sp, and the individual order of the problem is k_sp. We want to estimate the skill θ_s and the learning rate δ_s of the person by finding the maximum likelihood θ_s and δ_s. The likelihood of the observed times t_s1, ..., t_sn given our model with learning is:

L = ∏_{p=1}^n f_{a_p,b_p,c_p,θ_s,δ_s}(t_sp) = ∏_{p=1}^n (2π·c_p²)^(−1/2) · e^(−(t_sp − b_p − a_p·(θ_s + δ_s·ln(k_sp)))²/(2·c_p²))

We are looking for the values of θ_s and δ_s such that L is maximized. As usual, we proceed by finding the maximum of ln L (which is the same as the maximum of L):

ln L = −(n/2)·ln(2π) − Σ_{p=1}^n ( ln(c_p) + (1/(2·c_p²))·(t_sp − b_p − a_p·(θ_s + δ_s·ln(k_sp)))² )

Since this is a quadratic function in θ_s and δ_s, we can find the maximum by finding the values of θ_s, δ_s for which the derivatives are zero:

Σ_{p=1}^n (a_p/c_p²)·(t_sp − b_p − a_p·(θ_s + δ_s·ln(k_sp))) = 0

Σ_{p=1}^n (a_p·ln(k_sp)/c_p²)·(t_sp − b_p − a_p·(θ_s + δ_s·ln(k_sp))) = 0

By solving these equations we can estimate the values of our model which maximize the joint probability distribution of the observed data. If we set a_p to a constant (for the one-parameter model we use a_p = −1), we obtain the equivalent of linear regression for the student's skill θ_s and learning rate δ_s.


Figure 3.9: Iterative joint estimation: we use linear regression to estimate the level of the student's skill θ_s and the learning rate δ_s.

linear regression for student’s skill θs and learning rate δs. On the figure (see Fig. 3.9) we plot logarithm of student’s problem order ksp and an estimation of student’s skill θsp based on problem parameters. We use linear regression to estimate basic level of student’s skill θs and appropriate learning rate δs.

4 Evaluation of the Model

In this chapter we report on the evaluation of model predictions, parameter values, the parameter estimation procedure, and extensions of the model. The experiments were performed using both synthetic data and extensive data about real students from the Problem Solving Tutor. We also give a specific example of the insight into problem difficulty which the model brings. The proposed model is generative and thus we can perform experiments with synthesized data. These experiments provide insight into how much data are needed in order to get usable estimates of parameter values. The evaluation shows several interesting results. The real data support the basic model assumption of a linear relationship between skill and the logarithm of the time to solve a problem. For predicting future times even a simple baseline predictor provides reasonable results; the model provides only a slight improvement in predictions. Nevertheless, it brings several advantages. The model is group invariant and gives a better ordering of problems with respect to difficulty. It also brings additional insight – we can determine not just the average difficulty of problems, but also their discrimination, problem and student variance, and learning.

4.1 Evaluation Using Synthesized Data

The model can be easily used to generate synthesized data. Even though we have large scale data about real students, the synthesized data are still useful, because for these data we know the “correct answers” and thus we can thoroughly evaluate the parameter estimation procedure. To generate data we have to specify the “meta-parameters” of the student and problem populations (distributions of student skills and problem parameters).

4.1.1 Synthesized Data for Basic Model and Model with Students’ Variability

We have specified the meta-parameters in such a way that the simulated data are similar to the data about real students and problems from the Problem Solving Tutor [58, 63], e.g., the student skill θ ~ N(0, 0.7), δ ~ N(0.15, 0.07), b ~ N(7, 2). The results are reported for simulated data that contain values for all student-problem combinations (real data contain missing values); nevertheless, the results are very similar even with missing values.


Figure 4.1: Results for simulated data: Spearman's correlation coefficient between the true and computed values of the problem parameters a, b, c and the student parameters θ, σ. Left: fixed number of problems (50) and varying number of students. Right: fixed number of students (100) and varying number of problems.

Using such simulated data we can get an insight into how well the parameter estimation procedure determines the values of the parameters and how many students (problems) are needed to get a good estimate of the parameter values. Figure 4.1 shows results for a varying number of students (left) and a varying number of problems (right). The graph shows the Spearman's correlation coefficient between the estimated values of the parameters and their true values; the results are averaged over 10 runs. We have chosen Spearman's correlation coefficient as the metric for this evaluation because we are typically interested more in the relative values of the parameters than in their absolute values – the values are used to sort problems (based on their difficulty or discrimination) or students (based on their skill). The results show that the basic difficulty of problems b and the students' skill θ can be estimated easily even from relatively few data. Estimating the problem discrimination a is more difficult – to get a good estimate we need data about at least 30 solvers, and even with more data the further improvement of the estimates is slow. As could be expected, it is most difficult to get a reasonable estimate of the student and problem variance; to do so we need data about at least 50 problems and 150 students.
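Spearman's correlation coefficient used throughout this evaluation is simply the Pearson correlation of rank vectors; a small numpy sketch for tie-free data (function name ours):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for tie-free data:
    Pearson correlation of the rank vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Invariant under monotone transformations: ranks of x and x**3 agree.
r = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 8.0, 27.0, 64.0])
```

This invariance under monotone transformations is precisely why the metric suits an evaluation that cares about the ordering of problems and students rather than absolute parameter values.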

4.1.2 Synthesized Data for Basic Model with Learning

Now we investigate the model with learning, i.e., how well we are able to detect students' learning rates and differentiate between students who are learning and who are not.


Figure 4.2: The color visualizes the Spearman's correlation coefficient between the true and computed values of the learning rate δ. The experiment was run for varying noise and deviation of the learning rate δ (with other meta-parameters fixed).

The basic observation here is straightforward: if the differences in learning rates are high and the noise is low, then it is easy to detect the learning in the data. If the students' learning rates are very similar and the noise in the data is high, it is impossible to detect the learning. Figure 4.2 shows the transition between these two extremes; for every colored square we ran the experiment 10 times and used average values. In many practical cases the ordering in which students solve problems is very similar: often students proceed from simpler problems to more difficult ones (this is certainly true for our data, which are used in the next section). Does this correlated ordering influence the estimation of parameters from the data? When a group of students exercises problems in a similar sequence, they experience the learning effect together, i.e., they improve at a similar rate and perform better on the following problems. Since the problems are then solved in better times, they appear to be easier and their difficulty is underestimated. Moreover, since the students progressed together, we cannot determine whether the problems are really easier or whether the effect is caused by learning – we are not able to estimate the learning effect well. Consider the extreme case when all students solve the problems in the same order.


Figure 4.3: Estimates of the learning rate for data with different “sameness”. When the sameness is high (left), the model is unable to estimate absolute learning rates. When the sameness is low (right), the model is able to estimate both absolute and relative learning rates.

Then the model is not well identified: if we increase the values of all student learning rates δ_s by x and decrease the values of all problem parameters b_p by x·a_p·log(k) (where k is the order of the problem, by assumption the same for all students), then we get the same predictions. So there is no way to distinguish between the absolute values of student learning and the intrinsic difficulty of problems. On the other hand, the ordering of problems does not impact the estimation of relative learning rates (i.e., comparing students' learning rates, as reported in Figure 4.2). To reliably detect the relative learning rates we need them to be sufficiently different. But how can we estimate the absolute values? We need to avoid the bias from the joint improvement of students, i.e., to have sufficiently diverse data. To measure the “sameness” of the problem solving order we use the mean correlation between the students' orderings and the mean ordering of all students. The higher this index, the more coherent the data are, and thus the less accurate the estimation of absolute values will be. Figure 4.4 shows the dependence between this index and the quality of the absolute predictions (measured by RMSE). Figure 4.3 shows two model estimates for data generated for 150 students and 60 problems with high sameness (sameness = 1.0) and low sameness (sameness = 0.1). Notice that for high sameness (left image) the model is able to estimate relative but not absolute values of the learning rate.


Figure 4.4: Estimation of the absolute values of the learning rate. The x axis shows how strongly the problem solving orders of individual students are correlated, the y axis shows the precision of the estimates (measured by RMSE).

For low sameness (right image) the model is able to estimate both relative and absolute learning rates correctly.

Both kinds of information (relative and absolute predictions) are useful. Absolute predictions of learning rates can improve recommendations of problems and adapt the pace of a tutor to a particular student. They can also serve to evaluate different problem sets designed for the same learning goal: a problem set with higher absolute learning rates should be preferred, since it enables faster learning progress. Relative predictions are useful as well, since they can serve as a tool for teachers to determine learning trends in a class.
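The identifiability argument above can be checked numerically: shifting every learning rate δ_s by x and compensating in the problem parameters b_p leaves all predicted times unchanged. The following sketch uses our own toy parameter values and assumes every student solves the problems in the same order:

```python
import numpy as np

rng = np.random.default_rng(0)
S, P = 50, 20                      # students, problems

a = rng.normal(-1.0, 0.4, P)       # discrimination
b = rng.normal(7.0, 2.0, P)        # basic difficulty
theta = rng.normal(0.0, 0.7, S)    # skill
delta = rng.normal(0.21, 0.07, S)  # learning rate
k = np.arange(1, P + 1)            # all students solve in the same order

def predict(theta, delta, a, b):
    # t_sp = b_p + a_p * (theta_s + delta_s * log k_p)
    return b[None, :] + a[None, :] * (theta[:, None]
                                      + delta[:, None] * np.log(k)[None, :])

t1 = predict(theta, delta, a, b)

x = 0.5                            # arbitrary shift of all learning rates
t2 = predict(theta, delta + x, a, b - x * a * np.log(k))

print(np.allclose(t1, t2))         # the predictions are identical
```

Because the compensated model produces exactly the same data, no fitting procedure can recover the absolute values from such perfectly ordered data.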

4.1.3 Evaluation of Parameter Estimation Techniques

In the previous chapter we presented two methods of estimating parameter values – stochastic gradient descent and iterative joint estimation. Now we investigate whether they lead to the same estimates. We generated data for the basic model:

t_{sp} = b_p + a_p θ_s + ε

with a ∼ N(−1, 0.4), θ ∼ N(0, 0.7), b ∼ N(7, 2), ε ∼ N(0, 1). We


Table 4.1: Spearman’s correlation between estimated and true parameters for the two methods – gradient descent and iterative joint estimation.

Model                 Parameter   Gradient   Iterative
Basic model           a           0.91       0.91
Basic model           b           1.0        1.0
Basic model           θ           0.96       0.96
Model with learning   δ           0.83       0.78

used these two methods to estimate the parameter values. We ran the experiment 10 times and used the averaged correlations of the parameters. Both methods give very good results (see Table 4.1) even for a small data set (25 problems and 100 students). The estimated parameters correlate highly with the true parameters and also with each other (for all three parameters r = 1.0). Since iterative joint estimation does not depend on method-specific meta-parameters (such as the learning rate α or the initial parameter setting), it is more robust and suitable for practical purposes, e.g., for use in our Problem Solving Tutor.

Next we evaluated both methods for the model with learning. We generated data for the model:

t_{sp} = b_p + a_p (θ_s + δ_s · log(k_{sp})) + ε

with a ∼ N(−1, 0.4), θ ∼ N(0, 0.7), b ∼ N(7, 2), ε ∼ N(0, 0.3), δ_s ∼ N(0.21, 0.07). Again we generated data for 100 students and 25 problems and used averages from 10 runs. Both methods give good results for the estimation of students’ learning rates, and their estimates correlate highly with each other (r = 0.84). Notice that we are using a lower value of the noise ε: with too much noise in the data we would not be able to estimate the learning rate δ_s.
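Such a synthetic data set can be generated in a few lines of NumPy. This is only a sketch with our own variable names; we interpret the second argument of N(·,·) as the standard deviation and, unlike the extreme case discussed earlier, shuffle the problem order individually per student:

```python
import numpy as np

rng = np.random.default_rng(1)
S, P = 100, 25                       # students, problems (as above)

a = rng.normal(-1.0, 0.4, P)         # discrimination
b = rng.normal(7.0, 2.0, P)          # basic difficulty
theta = rng.normal(0.0, 0.7, S)      # skill
delta = rng.normal(0.21, 0.07, S)    # learning rate
eps = rng.normal(0.0, 0.3, (S, P))   # noise

# k_sp: position of problem p in the solving order of student s
k = np.array([rng.permutation(P) + 1 for _ in range(S)])

# t_sp = b_p + a_p * (theta_s + delta_s * log k_sp) + eps
t = b + a * (theta[:, None] + delta[:, None] * np.log(k)) + eps
print(t.shape)                       # one log-time per student-problem pair
```

The resulting matrix `t` plays the role of the observed (logarithmically transformed) solving times in the experiments above.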

4.2 Evaluation Using Real Data

Now we turn to experiments with real data from the Problem Solving Tutor. We start with the analysis of the estimated values of model parameters. We proceed with the evaluation of predictions and the reliability of parameter values. Finally, we discuss insight gained from the parameter values and the detection of multidimensional skills.


Figure 4.5: Distributions of abilities for the Robotanist (top two) and the Number Maze puzzle (bottom two). Left: distribution of skills. Right: skill versus variation in the student performance. Note that Robotanist, which is an educational problem, shows more variation in the skill distribution.

4.2.1 Parameter Values for Real Data

For the evaluation we have used only problems for which we have enough data (based on the results of the experiments with synthetic data).

The estimated values of the parameters a, b, and θ are nearly the same for the basic model (assuming constant variance) and for the model with variability of the student’s performance. The advantage of the full model thus lies in the additional information (about problem and student variance), not in more precise estimates.

For the parameter θ the results show that the estimated values are, as expected, approximately normally distributed. The variance of the distribution depends on the problem type – for educational problems the variance of skills is larger than for logic puzzles. There is a negative correlation between θ and the estimated σ (the deviation of skill across individual attempts), i.e., students with lower skill show larger variability of performance. This correlation differs across problem types, but typically lies in the range from r = −0.2 to r = −0.4.

Figure 4.6: Relations between the parameter values a, b, c; the figure combines data about 24 problem types.

We have also studied the correlations between the problem parameters a, b, c. There is almost no correlation between the basic problem difficulty and its discrimination (r = −0.17). The problem variance is correlated with the difficulty even more weakly (r = 0.09). Variance and discrimination are also very weakly correlated (r = −0.16). Although there are some correlations among the parameters, they are largely independent, i.e., each of them provides useful information about the problem difficulty. For example, in an intelligent tutoring system it may be suitable to filter out problems with large variance or low discrimination.

Note that these results indirectly support the application of the logarithmic transformation of times. If we had used untransformed times or some different transformation, the dependencies would be much stronger.

4.2.2 Evaluation of Predictions

Now we report on the evaluation of the predictions of problem solving times. We compare the model predictions with two simpler predictors. First, we consider the mean time – the simplest reasonable way to predict solving times (note that, consistently with the rest of the work, we compute the mean over the logarithm of time, so the influence of outliers is limited and the mean is nearly the same as the median). Second, we consider a simple “personalized” predictor:


Table 4.2: Data used for evaluation

Problem type       Students   Problem instances   Solved problems
Tilt Maze          2091       110                 43544
Robotanist         1254       68                  30467
Binary crossword   778        57                  23983
Region puzzle      313        112                 14113
Slitherlink        204        88                  10264
Sokoban            294        69                  9471
Rush Hour          1092       69                  9471
Nurikabe           132        46                  4665

t̂_{sp} = m_p − δ_s

where m_p is the mean time for problem p and δ_s is the “mean performance of student s with respect to other solvers”, i.e.,

δ_s = (Σ_p (m_p − t_{sp})) / n_s

where n_s is the number of problems solved by the student. Note that this corresponds to the initialization of our basic model (Section 3.6); we call it the baseline predictor.

For the experiment we used the 8 most solved problem types from the Problem Solving Tutor; basic statistics about these problems are given in Table 4.2. For each problem type we consider only students who solved at least 15 of its instances.

The evaluation of model predictions was done by repeated random subsample cross-validation with 10 repetitions. The training and testing sets are constructed in the following way: we randomly choose 70% of the students, and all their data go into the training set. For the remaining 30% of students, the first 80% of their attempts go into the training set and the last 20% into the testing set.

Table 4.3 compares the results using the root mean square error metric. We have also evaluated other metrics (the Pearson and Spearman correlation coefficients, mean absolute error); the relative results are very similar. The results show that the model provides an improvement over the use of the mean time as a predictor. Most of the improvement in prediction is already captured by the baseline predictor; the basic model brings a consistent but


Table 4.3: Quality of predictions for different models and problems, measured by the root mean square error metric.

                      Tilt    Robot.  Binary  Region
Mean time predictor   1.045   1.376   1.259   1.37
Baseline predictor    0.925   1.324   1.174   1.28
Basic model           0.92    1.301   1.148   1.28
Model with variance   0.918   1.304   1.161   1.278
Model with learning   0.948   1.313   1.181   1.322

                      Slith.  Sokoban Rush.   Nurik.
Mean time predictor   1.195   1.246   1.077   1.143
Baseline predictor    0.976   1.037   0.995   1.026
Basic model           0.948   1.021   0.981   1.025
Model with variance   0.947   1.016   0.978   1.025
Model with learning   0.967   1.034   0.993   1.04

slight improvement (see the example for Sokoban in Fig. 4.7). This improvement is larger for educational problems (e.g., Binary numbers) than for logic puzzles (e.g., Tilt Maze).

The different variants of the model (basic model with constant variance, individual variance, learning) lead to similar predictions and similar RMSE values. The model with individual variance leads in some cases to improved RMSE; the model with learning leads to slightly worse results than the basic model – on the current dataset the model with more parameters slightly overfits the data. With a more finely tuned version of gradient descent (using different step sizes for individual parameters, particularly a smaller step size for the parameter δ), the model with learning leads to improved predictions for some problems (particularly Slitherlink, a puzzle with many opportunities for improving performance).
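The baseline predictor described above can be sketched in a few lines of NumPy. This is our own minimal version; we assume a boolean mask marks which attempts belong to the training set, and all variable names are ours:

```python
import numpy as np

def baseline_predict(t, mask):
    """Baseline predictor: t_hat_sp = m_p - delta_s.

    t    -- matrix of observed log-times (students x problems)
    mask -- boolean matrix, True where the attempt is in the training set
    """
    # mean log-time per problem over training attempts
    m = np.where(mask, t, np.nan)
    m_p = np.nanmean(m, axis=0)
    # delta_s: mean advantage of student s over the problem means
    delta_s = np.nanmean(m_p[None, :] - m, axis=1)
    return m_p[None, :] - delta_s[:, None]

# toy check: a uniformly faster student gets uniformly faster predictions
t = np.array([[3.0, 5.0], [4.0, 6.0]])
mask = np.ones_like(t, dtype=bool)
pred = baseline_predict(t, mask)
print(pred)
```

On this tiny fully observed example the predictions reproduce the observations exactly, because the two students differ by a constant offset on every problem.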

4.2.3 Reliability of Parameter Values

Even though the more complex models do not lead to substantially improved predictions, they can still carry interesting information. Predictions are useful for guiding the behaviour of the tutoring system, but a small improvement in prediction precision will not change the behaviour of the system in a significant way. The important aim of the more complex models is to give us additional information about students and problems (e.g., the student’s learning rate,


Figure 4.7: Two predictors of solving times: on the left a comparison with the mean time predictor, on the right a comparison with the basic model.

which can be used for guiding the behaviour of the tutoring system and for providing feedback to students).

Table 4.4: Spearman’s correlation coefficient for parameter values obtained from two independent halves of the data.

                             Tilt    Robot.  Binary  Region
student skill θ              0.748   0.641   0.822   0.472
student learning rate δ      0.525   0.394   0.623   0.576
basic problem difficulty b   0.994   0.961   0.951   0.927
problem discrimination a     0.469   0.564   0.569   0.282

                             Slith.  Sokoban Rush.   Nurik.
student skill θ              0.816   0.789   0.737   0.904
student learning rate δ      0.455   0.394   0.509   0.570
basic problem difficulty b   0.981   0.963   0.962   0.837
problem discrimination a     0.533   0.347   0.434   0.195

Since the model with learning does not improve predictions, it may be that the additional parameters overfit the data and thus do not contain any valuable information. To test this hypothesis we performed the following experiment: we split the data into two disjoint halves, used each half to train one model, and then compared the parameter values in the two independent models. Specifically, we measured the Spearman’s correlation coefficient for the values of each parameter.
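The split-half experiment can be illustrated on synthetic data. The sketch below uses a deliberately simplified model (discrimination fixed at −1, moment-based fitting, and a tiny rank-based Spearman helper); all names and the fitting shortcut are ours, not the thesis’ actual estimation procedure:

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(2)
S, P = 200, 30
b = rng.normal(7, 2, P)            # true difficulties
theta = rng.normal(0, 0.7, S)      # true skills
# simplified model with a_p = -1 fixed: t_sp = b_p - theta_s + noise
t = b[None, :] - theta[:, None] + rng.normal(0, 1, (S, P))

# assign each attempt to one of two disjoint halves at random
half = rng.random((S, P)) < 0.5

def fit(t, mask):
    # moment-based fit: difficulty = mean time, skill = mean advantage
    tm = np.where(mask, t, np.nan)
    b_hat = np.nanmean(tm, axis=0)
    theta_hat = np.nanmean(b_hat[None, :] - tm, axis=1)
    return b_hat, theta_hat

b1, th1 = fit(t, half)
b2, th2 = fit(t, ~half)
print(spearman(b1, b2), spearman(th1, th2))
```

As in Table 4.4, the problem difficulty (averaged over many students) is far more reliable across the two halves than the student skill (averaged over far fewer attempts).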


Table 4.4 shows the results for the model with learning. The estimates of basic difficulty and basic skill correlate highly; the weakest correlation between the estimates from the two halves is for the discrimination parameter. For the students’ learning rate, the additional parameter of the extended model, we get a correlation coefficient between 0.5 and 0.7 – a significant correlation which signals that the fitted parameters contain meaningful values.

We also analyzed correlations among different model parameters, e.g., between the skill θ and the learning rate δ. Generally there is only a weak correlation between the parameters, which shows that the new parameters bring additional information.

4.2.4 Insight Gained from Parameter Values

Let us illustrate the insight gained from the values of the model parameters on the problem Graphs and functions, which is described in Section 1 and illustrated in Fig. 2.7. Fig. 4.8 shows the collected data and the values of model parameters for three examples. The first and the second problem have similar difficulty, but the second one has much larger variance. The third problem is more difficult than the first two and is also more discriminating.

These parameters provide a valuable insight which can potentially be used for further improvement of intelligent tutoring systems. Problems with small discrimination and large randomness clearly depend more on luck than on skill and thus are probably not very good pedagogical problems, so we may want to filter them out. At the beginning of a problem solving session (when we do not have a good estimate of a student’s skill), we may prefer problems with small discrimination and variance (so that we have high confidence in the solving time estimate); later we may prefer problems with higher discrimination (so that we select problems “tuned” for a particular student).

4.2.5 Detection of Multidimensional Skill

So far we have assumed a single, independent latent problem solving skill for each problem type. This is, of course, a simplifying assumption. On one hand, for similar problems (e.g., the Rush Hour and Sokoban puzzles) the problem solving performance is clearly related, and our results show that the estimated problem solving skills are indeed highly correlated. For example, for two of the mathematical problems (Graphs and functions and Math pairs) the correlation between skills is 0.78; a typical correlation between two


Figure 4.8: “Graphs and functions” problem – three specific examples; for each of them we provide the collected data and the values of the parameters of the three parameter model.

problems is around 0.6. It may be advantageous to group several similar problems together and fit the data with one model with a two-dimensional skill.

To study the dependence among skills for problems in the Tutor we investigated the skill correlations for students who solved more than one problem type. Fig. 4.9 shows the correlation network for selected problems from the Tutor. Nodes represent problems and links connect nodes with a high Spearman’s correlation; we depict only links with correlation higher than 0.7 and p-value below 0.01. Graphs like these may serve as a useful tool for the Tutor developers, since they reveal relations between different educational problems.

On the other hand, in some cases it would be natural to assume a multidimensional skill even for a single problem. For example, in the case of the interactive graphs problem discussed above, some students may be proficient with polynomials but struggle with trigonometric functions, and thus it may be useful to include at least 2 skills in the model. To evaluate whether the model is able to detect different skills from problem solving times, we performed the following experiment.

First, we analyzed the approach using synthesized data. We generated


Figure 4.9: Correlation network for selected problems from the Tutor. Nodes represent problems and links represent Spearman’s correlations of skills. We depict only links with correlation higher than 0.7 and p-value below 0.01.

data for two sets of 50 problems and 300 students using the model:

t_{sp} = b_p + a_p θ_s + ε

The parameters were set as a_1, a_2 ∼ N(−1, 0.4), θ_1, θ_2 ∼ N(0, 0.7), b_1, b_2 ∼ N(7, 2), ε ∼ N(0, 1). Then we fitted the data with a model with two skills:

t_{sp} = b_p + θ_{0s} + a_{1p} θ_{1s} + ε

We analyzed how well the problems are separated by the discrimination parameter a_1 (see Figure 4.10). To distinguish the two groups of problems we used the K-means algorithm with two clusters. Measured as the ratio of correctly assigned problems to all problems, we were able to divide 98.5% of the problems correctly (we ran the experiment 10 times and used average values). Notice that we did not introduce any correlation between the two skills in the generated data.

In the real world we can expect correlation among skills. We therefore added a correlation between the two sets of skills to the generated data and reran the experiment. For low values of the skill correlation (r ≤ 0.4) the method showed excellent results, achieving 98% precision of the division. With rising correlation the precision decreased, and at r = 0.9 the results only weakly exceeded 50% precision (see Fig. 4.10). All experiments were run 10 times and we used average values.

Second, we analyzed the approach using real data. We mixed data for two different types of problems and fitted them with the model with two skills. Then we checked how well the problems are separated according to the discrimination parameter a_1. Fig. 4.11 shows the results for two pairs of


Figure 4.10: Synthesized data for the extended model with multidimensional skills: we divided 100 problems of two types into groups according to the values of the parameter a1. With the K-means algorithm we achieved 98.5% precision of the division (left). On the right, the relation between the skill correlation and the precision of the division is depicted.

problems. As we can see, the two problem types are separated quite well by the automatically learned parameter a1. If we classify problems into two classes depending on whether a1 > 0, this simple approach classifies on average 80% of the problems correctly. Note that we have used just the basic version of the extended model, without any “fine-tuning” of the model or the parameter fitting procedure for this particular task.
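The clustering step of this experiment can be sketched with a tiny 1-D K-means on the fitted discrimination values. The toy a1 values below are our own stand-in for the fitted parameters (one group loading on the second skill, the other not):

```python
import numpy as np

def kmeans_1d(values, iters=25):
    """Tiny 1-D K-means with two clusters, used to separate problems
    by their fitted discrimination parameter a1."""
    v = np.asarray(values, dtype=float)
    centers = np.array([v.min(), v.max()])      # initial centroids
    for _ in range(iters):
        labels = (np.abs(v - centers[0]) > np.abs(v - centers[1])).astype(int)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    return labels

# toy a1 values: the first 50 problems load on the second skill
# (a1 around -1), the other 50 do not (a1 around 0)
rng = np.random.default_rng(5)
a1 = np.concatenate([rng.normal(-1, 0.3, 50), rng.normal(0, 0.3, 50)])
true = np.array([0] * 50 + [1] * 50)

labels = kmeans_1d(a1)
# cluster labels are arbitrary, so accept either assignment of the two groups
accuracy = max(np.mean(labels == true), np.mean(labels != true))
print(accuracy)
```

With well-separated groups almost all problems land in the correct cluster; as the overlap between the two a1 distributions grows, the precision of the division drops, mirroring the effect of correlated skills described above.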

4.3 Open Issues

We briefly describe three open issues for the model. Concretely, we deal with the issues of problem completion, detection of cheating, and adaptive testing using our model. These issues are left for further research.

4.3.1 Problem Completion

In the Tutor people often spend time solving a problem but leave it unsolved due to its difficulty. We investigate the relation between student skill and the probability of problem completion.

For a given problem instance we take the students who solved the problem and those who spent a significant amount of time but did not reach the solution. Concretely, for unsuccessful students we take only those who spent at


Figure 4.11: Extended model with multidimensional skills – values of the parameter a1 for different types of problems.

least the amount of time predicted by our model. We divide the skill axis into distinct slots (one slot corresponds to a skill interval of 0.3) and compute the probability of problem completion as the ratio of students who succeeded to all solvers.

Fig. 4.12 shows two problems with high and low probability of success for an average student. Notice that the higher the skill, the higher the probability of success. The probability of completion can enhance problem recommendation, since it brings further information about the suitability of problems. The main drawback is the large amount of data needed to estimate the curve. Notice that this is similar to item response theory models, where large data sets are also needed to estimate an item response function.
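The binning procedure can be written down directly. This is a sketch with our own function and argument names, following the description above (slots of width 0.3 on the skill axis, completion probability as the ratio of successful attempts in each slot):

```python
import numpy as np

def completion_curve(skills, solved, bin_width=0.3):
    """Empirical probability of problem completion as a function of skill.

    skills -- estimated skill theta of each student who attempted the problem
    solved -- boolean flags, True if the attempt ended with a solution
    """
    skills = np.asarray(skills, dtype=float)
    solved = np.asarray(solved, dtype=bool)
    # slot edges of width bin_width covering the observed skill range
    lo = np.floor(skills.min() / bin_width) * bin_width
    edges = np.arange(lo, skills.max() + bin_width, bin_width)
    idx = np.digitize(skills, edges)
    centers, probs = [], []
    for i in np.unique(idx):
        in_bin = idx == i
        centers.append(skills[in_bin].mean())   # mean skill in the slot
        probs.append(solved[in_bin].mean())     # completion ratio
    return np.array(centers), np.array(probs)

skills = np.array([-1.0, -1.0, 0.1, 0.1, 1.0, 1.0])
solved = np.array([False, False, True, False, True, True])
centers, probs = completion_curve(skills, solved)
print(probs)
```

Only non-empty slots are returned, which avoids undefined ratios in skill ranges with no attempts.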

4.3.2 Detection of Cheating

Detection of cheating relates to the problem of finding patterns in data that do not correspond to expected behaviour (see [27]). The basic model can be used to detect such anomalies in the problem solving performance of students. Students may solve easier problems on their own, but harder problems are tempting for cheating. For example, students may solve the problem outside the computer and then quickly enter the solution. Such a sudden change in a student’s performance is detectable. In Fig. 4.13 we identify a group of extremely skilled students (skill rising almost to the value of 6) who also show a very high individual variance (i.e., deviation in their performance). This has almost certainly been caused by cheating.


Figure 4.12: Probability of problem completion for two problems: on the left an easy problem (Training of color from Robotanist), on the right a hard problem (Counting lesson from Robotanist).

4.3.3 Application for Adaptive Testing

We give an informal approach to performing adaptive tests with our model, i.e., given a limited amount of time, which problems should be chosen to maximize the precision of the skill estimate? We construct a criterion similar to the one used in computerized adaptive testing. Given a pool of unsolved problem instances, we seek the next problem instance which will contribute the most to the estimate of the student’s skill.

For the basic model, in the maximum likelihood method used for finding θ_s, the overall estimate is obtained as a weighted average of local estimates, where the weights of the local estimates are given by the expression a_p²/c_p². I.e., the more discriminating and the less random the problem is, the more weight it gets (which is exactly what we would intuitively expect). With this intuition we define the estimate contribution of a given problem p as:

I(p) = a_p² / c_p²

Selecting the problem with the highest contribution would then be:

j_{m+1} = argmax_{p ∈ R} I(p)


Figure 4.13: “Graphs and functions” problem – cheating issue. The marked outliers were detected as cheaters.

where j_{m+1} denotes the selected problem after m problems have been solved and R denotes the remaining problems in the problem pool. This criterion does not take time into consideration: if a problem is time consuming, it has less practical value than an equally informative problem that requires less time to complete. Therefore we extend the criterion to select the highest contribution per time unit:

j_{m+1} = argmax_{p ∈ R} I(p) / (b_p + a_p θ_s)

where b_p + a_p θ_s is the expected time for solving problem p by student s, as estimated by our model. Notice that this method is only an informal extension, but it can still bring a useful improvement in practice. A more precise derivation of the method and of the contribution function is left for further research.
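The selection rule fits in a few lines. This is a sketch with our own function and argument names, directly following the two criteria above:

```python
import numpy as np

def next_problem(a, b, c, theta, remaining, per_time=True):
    """Pick the next problem instance for adaptive testing.

    Contribution of problem p to the skill estimate: I(p) = a_p^2 / c_p^2.
    With per_time=True the contribution is divided by the expected
    solving time b_p + a_p * theta of the current student.
    """
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    remaining = list(remaining)
    info = a[remaining] ** 2 / c[remaining] ** 2
    if per_time:
        info = info / (b[remaining] + a[remaining] * theta)
    return remaining[int(np.argmax(info))]
```

For example, with a = [−1, −2], b = [5, 21], c = [1, 1] and θ = 0, the raw criterion picks the more discriminating second problem, while the per-time criterion prefers the first one because the second takes about four times as long.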

5 Problem Solving Tutor

In this chapter we present the “Problem Solving Tutor” (tutor.fi.muni.cz) – a web-based educational tool for learning through problem solving. The system focuses solely on the “outer loop” of intelligent tutoring [118], i.e., on recommending problem instances of the right difficulty. The system adapts to an individual student: based on past problem solving attempts it estimates the student’s problem solving skill, uses this estimate to predict problem solving times for new problems, and chooses a suitable problem for the student. The system also contains support for virtual classes and thus can easily be used in a classroom. The tool contains more than 2 000 problems, mainly educational problems and logic puzzles.

5.1 Main Approach

The Tutor is an online educational tool for learning different areas through problem solving activities. It contains sets of different problem types, each with 25–100 problems. For example, a student can practice her knowledge of binary numbers in an activity called Binary crossword, with problems ranging from trivial ones (taking 10 seconds to solve) to very difficult ones (taking 10 minutes to solve). While a student solves problems, the problem solving data are stored. Based on these data the Tutor estimates the student’s latent skill, predicts solving times for unsolved problems, and makes recommendations for further solving.

Although classical cognitive tutors use the same approach (estimation of and adaptation to the student’s skill), the structure of possible passages through their learning materials is static and designed by domain experts. In that case it is important for a tutor to reason about the problem in the same way as humans do, therefore the expert module is the heart of an intelligent tutoring system [33]. Since experts prepare a suitable ordering of problems, hints, and questions, the preparation of such a tutor is expensive.

Intelligent tutoring systems usually have two loops – an outer loop and an inner loop. The outer loop executes once for each task, where a task usually consists of solving a complex, multi-step problem. The inner loop executes once for each step taken by the student in the solution of a task [118]. The inner loop can also assess the student’s evolving competence and update a student model [4], which is used by the outer loop to select the next task appropriate for the student. In the Tutor we focus solely on the outer loop, but we also use data from this loop (i.e., solving times) for estimating

student skill and problem difficulty parameters. We do not combine the problems with any hints or study material.

Instead of relying on experts for problem ordering and difficulty estimation, we learn from data. This approach is also close to the recommendation system domain, where more data enable better recommendations [21]. A similar approach has already been used for recommending related papers and documents for learning [29].

5.2 Main Components

We provide an overview of the main Tutor components. Concretely, we describe the typical usage of the system, the problem solving simulators, data collection, predictions (which directly implement our basic model), recommendations of suitable problems, the class mode used by teachers for administering their classes, and motivational features which support the concept of flow.

5.2.1 Typical Usage

When students register, they are redirected to the main page with a list of 30 problem types. Students choose the problem type they want to practice and the Tutor recommends them two problems (an easier and a harder one). Alternatively, they can choose from the list of all problems of the given type. For the selected problem the Tutor displays its simulator and the student begins to solve. As the student solves the problem, the Tutor logs every step into the database. When the student finishes, the Tutor offers immediate feedback comparing her performance with other students (loaded as a part of the page) and encourages her to solve further problems.

The Tutor continuously updates the problem parameters and the students’ skill estimates. A whole section of the Tutor is devoted to detailed statistics on problem solving, where students can compare themselves with their classmates and other Tutor users. For teachers, a special section for administering their class is available: they can manage their class and assign problems to practice. For administrators, the Tutor offers a data section with detailed logs on problem solving and sections for administering problem instances. The interaction of the main components is depicted in Fig 5.1. Notice that the collected data are continuously used to update and support the other components.


Figure 5.1: How the main modules of the Tutor cooperate.

5.2.2 Problem Simulators

Problem simulators provide the environment for solving individual problems. They are mostly based on Javascript libraries and run completely in a web browser. When a student selects a problem instance, the proper simulator is loaded; the simulator receives the instance description in a text format and displays the problem environment. Simulators are highly interactive – students can control entities with the mouse or keyboard and immediately see the response. For example, in Computational Trees students are confronted with a set of operations (mathematical and logical), a set of numbers, and a set of result entities placed on an interactive screen. The goal is to establish connections between these entities by linking them with curves and to run a simulation of the computation. If the entities are connected correctly, the computation gives the required results. In other problems students may interact with mathematical curves or transform given shapes in the plane. After a problem instance is completed in the simulator, the Tutor displays problem statistics and the recommendation panel.

5.2.3 Data Collection

The Tutor logs extensive problem solving data, which are further used in the experiments, evaluations, and analyses in this thesis. While students solve a problem, every step is logged, i.e., the data from the step are immediately sent to the Tutor interface and stored in a database. Steps are defined specifically


for every problem type. For example, a step can be the evaluation of a math expression, running the robot with a command, or filling in a number in the binary crossword.

Table 5.1: Summary statistics for the Tutor (April 2013)

Problem sessions           1 023 873
Attempted problems         558 589
Solved problems            463 419
Spent time                 14 502 hours
Moves logged               20 790 840
Users                      10272
Schools                    87
Teachers                   106
Classes                    221
Students in classes        2415
Active problem types       30
Active problem instances   2065

The Tutor now contains data about more than 460 000 solved problems and more than 20 million logged steps (see Table 5.1). These detailed data can be used for further research on problem difficulty and human problem solving (see e.g., [59, 70]). The Tutor has a special account for accessing the data, where the logs for every problem instance can be downloaded separately.

5.2.4 Predictions

This component gives solving time estimates based on the basic model from Section 3.2. The computations of the parameters a, b, c and the student skill θ are run regularly, and some minor computations are run after a student finishes a problem. Predicted times are displayed on the problem list page and are an integral part of the recommendation algorithm.

We briefly describe the prediction algorithm, which is an implementation of the iterative joint estimation described in Section 3.7. For a student s and a problem p our goal is to estimate the parameters ap, bp, cp and the student’s skill θs. The algorithm has the following structure:

1. Gradual start (for every problem, set an initial problem function and add 1-5 virtual students)

2. Estimation of local θsp


3. Estimation of global θs

4. Normalization of θs over all students

5. Normalization of ap over all problems

6. Estimation of parameters ap, bp, cp

Steps 2-6 are repeated until the problem parameters ap, bp, cp and the skills θs stabilize. Typically the algorithm converges very quickly – within 3 repetitions.
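The core of the iteration can be sketched for a simplified setting: the basic model with constant variance and fully observed data (the real system additionally estimates c_p, handles sparse data, and uses the gradual start with virtual students). All names in this sketch are ours:

```python
import numpy as np

def joint_estimate(t, iters=20):
    """Simplified iterative joint estimation for the basic model
    t_sp = b_p + a_p * theta_s."""
    S, P = t.shape
    b = t.mean(axis=0)            # initialization: difficulty = mean time
    a = -np.ones(P)               # initial discrimination
    for _ in range(iters):
        # steps 2-3: skill estimate given a, b -- a weighted average of
        # local estimates (t_sp - b_p) / a_p with weights a_p^2
        theta = ((t - b[None, :]) * a[None, :]).sum(axis=1) / (a ** 2).sum()
        # step 4: normalize skills to mean 0 and unit variance
        theta = (theta - theta.mean()) / theta.std()
        # step 6: per-problem least squares for a_p and b_p given the skills
        X = np.column_stack([theta, np.ones(S)])
        (a, b), *_ = np.linalg.lstsq(X, t, rcond=None)
    return a, b, theta

# sanity check on synthetic data from the basic model
rng = np.random.default_rng(0)
a_true = rng.normal(-1, 0.4, 30)
b_true = rng.normal(7, 2, 30)
theta_true = rng.normal(0, 0.7, 200)
t = b_true + a_true * theta_true[:, None] + rng.normal(0, 1, (200, 30))
a_est, b_est, theta_est = joint_estimate(t)
print(np.corrcoef(b_est, b_true)[0, 1])
```

This is an alternating least squares scheme on a bilinear model, which is why only a few repetitions are needed in practice; the skill normalization fixes the scale indeterminacy between a and θ.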

5.2.5 Recommendations

Based on the predictions, the Tutor recommends unsolved problems of suitable difficulty.

First, the Tutor takes the estimate of the student’s skill and predicts solving times for the unsolved problems based on the model. Second, the Tutor analyzes the difficulty parameters ap, bp, cp of the unsolved problems p ∈ P. These parameters are combined with other factors (e.g., whether the student has already attempted the given problem, or whether the predicted time is close to that of the recently solved problem) in a scoring function which assigns every unsolved problem its “suitability score”. The Tutor then selects a pair of problems and recommends them to the student. The recommended problem jm+1 is chosen as the problem with the maximal score:

j_{m+1} = argmax_{p ∈ R} (score_b(p) + score_c(p) + score_t(p))

where:

score_c(p) = max{5 − c_p, 0}

score_b(p) = max{5 − |t_{sp} − rt|, 0}

score_t(p) = 0 if st_{sp} > t_{sp};  3 if 0 < st_{sp} < t_{sp};  5 if st_{sp} = 0        (5.1)

The parameter t_{sp} represents the predicted time for student s and problem p, st_{sp} is the time already spent on the problem, c_p is the problem randomness, and rt is the recommended time for the next problem. The recommended time is the time of the recently solved problem multiplied by 0.7 or 1.3 (the Tutor recommends two problems – an easier and a harder one).
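The scoring heuristic translates directly into code. This is a sketch of the rule above with our own function and argument names, not the Tutor’s actual implementation:

```python
def score(p, predicted, spent, c, recommended_time):
    """Suitability score of an unsolved problem p.

    predicted        -- predicted solving times t_sp for the student
    spent            -- time st_sp already spent on each problem (0 = not tried)
    c                -- randomness parameter c_p of each problem
    recommended_time -- rt: last solved time scaled by 0.7 or 1.3
    """
    score_c = max(5 - c[p], 0)                                  # low randomness
    score_b = max(5 - abs(predicted[p] - recommended_time), 0)  # right difficulty
    if spent[p] == 0:
        score_t = 5            # never attempted
    elif spent[p] < predicted[p]:
        score_t = 3            # attempted, gave up before the predicted time
    else:
        score_t = 0            # already spent more than the predicted time
    return score_c + score_b + score_t

def recommend(unsolved, predicted, spent, c, recommended_time):
    # pick the unsolved problem with the maximal suitability score
    return max(unsolved, key=lambda p: score(p, predicted, spent, c,
                                             recommended_time))

predicted, spent, c = [4.0, 4.0], [0.0, 10.0], [1.0, 1.0]
print(recommend([0, 1], predicted, spent, c, 4.0))
```

In the toy call, both problems match the recommended time and have the same randomness, so the never-attempted problem 0 wins on the score_t term.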


The scoring function defined above is only a heuristic; we have not evaluated the best setting of its parameters. It prefers problems with lower randomness whose predicted time is close to the given solving time. It also penalizes problems which students previously attempted but did not solve. Therefore students are more likely to be recommended a problem with less randomness which they have not attempted yet. Since every student has his own pace of learning, the Tutor recommends two problems (an easier and a harder one), which allows students to set the pace most suitable for them.

5.2.6 Class Mode

In the “class mode” students solve a subset of problems chosen by their teacher. The Tutor provides teachers with several tools for administrating their class. They can create their own class and fill it with students. Then they can choose the problems they want to practice with the students (e.g., binary numbers). During a lesson the teacher may monitor students' progress through an interactive “live view” tool. Students solve problems at their own pace, neither slowed down nor spurred on by the progress of the whole class. Teachers may focus on struggling students or explain in more depth problems which seem to be difficult for the whole class. A similar approach to monitoring students is used in Khan Academy [66] (and other ITSs), where students have to work through a series of problems on a given topic to proceed. In addition, the Tutor applies its models to order problems suitably and also supports better problem set design by exposing the difficulty parameters.

5.2.7 Motivational Features

To motivate students in the process of learning, the Tutor supports a flow state. The concept of flow describes a mental state in which a student is fully energized and involved in an activity. As Csikszentmihalyi states, students enjoy the learning process the most when they reach flow. In the flow state students are completely motivated to push their skills to the limit [89]. There are several conditions which support the flow effect (the full list of conditions can be found in [37]):

• There are clear goals. The Tutor offers different problems with clear rules and clear goals.

• There is immediate feedback to one's action. The Tutor provides immediate feedback with statistics on the finished problem and a comparison with other students.

• There is a balance between challenges and skills. The Tutor uses adaptive recommendations, i.e., it recommends easier or harder problems based on the recent solving results.

In the Tutor we focus on immediate feedback. Based on the collected data it gives the user a comparison with other users: in particular, we provide immediate feedback on solving activity after a problem is finished, and we display per-problem statistics and a comparison with other students.

5.3 Problems in the Tutor

In this section we give a short overview of the problems in the Tutor. Our system focuses on practicing problem solving skills and on training knowledge (acquired, for example, in lessons) through problem solving. The system uses only well-structured problems with well-defined (and easy to verify) solutions. An important part of the system development is the creation of good problem solving activities, i.e., formulating learning topics as problem solving activities. For this purpose we have modified some well-known problems and developed several new ones. We found that when we manage to find a suitable problem solving formulation, students often willingly do activities that would otherwise be considered rather “dry” (like practicing binary numbers or recursion). At the moment the Tutor offers 30 different problem types; for each problem type the Tutor has about 30–100 problem instances. Together the Tutor contains more than 2,000 problem instances (for more statistics see Table 5.1). Most of the problems are available in English and Czech; a few are currently available only in Czech.

5.3.1 Robot Programming Problems

We begin with robot programming problems, which give a general and attractive introduction to programming. Robotanist, Robot Karel and Turtle Graphics are three problems in which students program a robot to perform a specified action (collect flowers on a grid, draw a picture). Robots are programmed via simple commands (move forward, turn left, repeat, conditional execution); programs are specified graphically (Robotanist) or using a simple programming language (Robot Karel, Turtle Graphics). These problems can be used as an intuitive introduction to programming and later

as good practice for training recursion. These types of problems have been used previously; we adapted them for a web environment and created a suitable set of specific problems [88, 91].

Figure 5.2: Illustration of the Robotanist problem. The goal is to program a simple robot to gather all flowers on the given plan.

We describe in some detail the Robotanist problem (Figure 5.2), the most popular problem in the Tutor. Robotanist is a clone of the RoboZZle game [87], a puzzle illustrating some basic concepts of programming, particularly recursion. The goal of Robotanist is to program a simple robot to gather all flowers on a given plan. The plan consists of square cells with grass, flowers, or stones. The robot can be programmed using several commands such as go forward, turn left, or a function call; commands can also be conditioned on the color of the current cell. The program code is limited by the length of the programming lines. Although the problem rules are simple, they allow a wide variety of interesting puzzles which illustrate important computer science concepts like “traversal of a binary tree” or “counting with the use of the recursion stack”. The robot environment is highly interactive. Students test their program code over many runs before achieving the goal and have immediate feedback on whether their program works or needs correction. Problem solving thus gives evident feedback and is more entertaining. At the beginning students are confronted with simple problems like “go to the end of the plan” or “avoid the obstacle”, where they practice the basic commands for controlling the robot. In further problems students are limited in the number of programming lines and their length, which forces them to decompose problems appropriately. Multiple programming lines and conditions allow students to solve problems like “collect all flowers in

the square shape” or “walk criss-cross through the plan, turning on the red squares”. Finally, the most difficult problems require a deep understanding of recursion and of storing information on the recursion stack (the example in Figure 5.2 is the simplest problem of this type).

5.3.2 Programming Problems

Other programming problems focus on classical programming languages. Python and C are two problems designed to practice language syntax and basic algorithms (e.g., Bubble sort, Sieve of Eratosthenes). Students are confronted with an incomplete program code where gaps are filled from a multiple-choice menu. Their goal is to fill in the gaps with appropriate code to generate the provided output (see Figure 5.3). After filling the gaps, students run the program and see the difference between the generated output and the expected output. Since students do not have to write the code on their own, they can start solving interesting programming problems early, without getting stuck on syntactic issues (and they learn the syntax “by the way”). At the same time the problem is still interactive and students may approach the correct solution by gradual testing of their hypotheses – as opposed to plain multiple-choice tests with feedback consisting of a single correct/incorrect flag. The realization of the problem simulator is rather simple and can be easily extended to other programming languages.

Figure 5.3: Illustration of the Python problem. The goal is to fill in the pro- gram gaps in order to generate pyramid shape from the stars.

5.3.3 Computer Science Problems

Besides programming problems, the Tutor includes other educational and logic problems. In Binary Crossword the goal is to fill a grid with zeros and

ones in such a way that all specified conditions are met (see Figure 5.4). This setting can be used for easy problems practicing the basics of binary numbers and logic operations, but also for more challenging problems where the conditions are specified in a self-referential crossword manner, which leads to quite entertaining practice of binary numbers and operations.

Figure 5.4: Illustration of the Binary Crossword problem. The goal is to fill the table with numbers 0 and 1 and satisfy all the given constraints.

In Finite Automaton the goal is to construct a deterministic automaton which accepts a given language. On an interactive screen students move states and set the transition rules between states. Students have a test set of words which can be run on their automaton, showing the sequence of transitions step by step. Another computer science problem is Regular Expressions, where students must specify a regular expression to filter a given data set according to a specified criterion or to transform the set into a required form. An example is shown in Figure 5.5: the goal is to fill in a regular expression finding all occurrences of the words “hip”, “hap”, “hop” without repetitions. Black font marks correctly chosen words, italic font marks incorrectly chosen words.

Figure 5.5: Illustration of the regular expressions problem. The goal is to fill in the regular expression in order to find all occurrences of the words “hip”, “hap”, “hop” without repetitions.


5.3.4 Math Problems

In the Graphs and Functions problem students are given a graph of a function and their goal is to find a formula for this function. In Math Pairs students are given a set of cards with mathematical expressions or diagrams and the goal is to pair the matching cards. We have examples for a wide range of mathematical topics, e.g., basic arithmetic operations, trigonometric functions, areas of geometrical objects, and combinatorics. In Broken Calculator students are given a calculator with a limited set of numbers and operations and a set of target numbers. The goal is to use the available numbers and operations to reach the given target numbers. A similar principle of setting up a computation for given target entities is used in the Computational Trees problem. Here a set of operations (mathematical and logical), a set of numbers, and a set of target entities are placed on an interactive screen. The student's goal is to establish connections between these entities by linking them with curves and to run a simulation of the computation. If the entities are connected correctly, the computation gives the required results. In Linear Transformations students are given a shape in the Euclidean plane. The goal is to use a matrix transformation to transform the shape into the required size and position. The environment is interactive and students may examine transitions and transformations for different matrix settings. In Patterns students are given colored shapes in the Euclidean plane constructed from restrictive conditions (e.g., y <= x − 5). Students have to find suitable restrictions to construct a given shape. In the graphical environment students may experiment with different conditions and get an immediate response.

5.3.5 Logic Puzzles

The Tutor also contains another 9 logic puzzles (Rush Hour, Sokoban, Number Maze, Tents, Nurikabe, Tilt Maze, Region Puzzle, Loop Finder, Polyominoes). Although they do not train specific computer science skills, they train generic problem solving skills and can be used to illustrate particular computer science concepts (e.g., state space, backtracking, depth-first search). Some of them are described in the next chapter.

5.4 Implementation

We provide a brief survey of the applied technologies and the main entities of the application. We also discuss notable issues specific to this domain: the logging interface, the problem instance locker, and the gradual start.


5.4.1 Technologies

The Tutor is based on the model-view-controller architecture and commonly available technologies – PHP, JavaScript, a MySQL database, XHTML and CSS. Every action is processed by its controller and the data are displayed with the corresponding view. The application is localized in two languages; every text exists in both language versions.

5.4.2 Main Entities

The main entities in the database schema are:

• Puzzle user: description of a student

• Puzzle problem: description of a problem type

• Puzzle instance: description of a problem instance and its solution

• Puzzle results: results of solving a given problem instance by a given student

• Puzzle session: attempts of a student to solve a given problem instance

• Puzzle log: logged steps from solving a given problem by a given student

• User stats: entity for quick display of a student's statistics

• Puzzle stats: entity for quick display of a problem's statistics

5.4.3 Entity Relationship Model

The main entities collaborate in the following way. When a user starts solving a new problem instance (table puzzle instance), a session is created (table puzzle session). A detailed log of the solving process is recorded (table puzzle log). When the student succeeds, the result record is updated (table puzzle results). A problem instance can be locked for a particular student (table puzzle lock). Problem statistics are also updated (tables puzzle user stats and puzzle stats) and the student's skill is refreshed (table user problem skill). Students can be enrolled in several classes (table tag) and teachers can organize problems into sets (table puzzle set) and define concepts for problem instances (table puzzle concept).


Figure 5.6: ER diagram of the database schema for the Tutor.

5.4.4 Logging Interface for Simulators

While a student solves a problem, every step is logged using a common JavaScript function which sends data to the logging interface. This function is available to all problem simulators. The simulator calls the function with appropriate variables (such as session id, session hash, move number and move description). The function sends the data via AJAX, using the jQuery library, to an interface script which validates the information and logs the step into the database.

5.4.5 Problem Locker

As mentioned before (see Section 4.1.2), to evaluate the model with learning it is crucial to have diverse solving sequences for different students. We implemented a locking function for pseudo-randomly selected problems. When a student starts solving, the problems are divided into three groups and only one group is active. As the student proceeds, problems from the other groups become gradually unlocked. This provides sufficiently unique orderings of problems, which are necessary for the model with learning.
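The locking scheme can be sketched as follows. The group count comes from the text; the unlock threshold (`unlock_every`) and the round-robin split are illustrative assumptions, not the Tutor's actual parameters:

```python
import random

def assign_groups(problem_ids, n_groups=3, seed=None):
    """Pseudo-randomly split problems into n_groups groups;
    only the first group is unlocked at the start."""
    rng = random.Random(seed)
    ids = list(problem_ids)
    rng.shuffle(ids)
    return [ids[i::n_groups] for i in range(n_groups)]

def unlocked_problems(groups, n_solved, unlock_every=5):
    """Gradually unlock further groups as the student solves more problems
    (one extra group per `unlock_every` solved problems, an assumed rate)."""
    n_open = min(1 + n_solved // unlock_every, len(groups))
    return [p for g in groups[:n_open] for p in g]
```

Because the shuffle differs per student (different seeds), different students see the problems in different orders, which is exactly the diversity the learning model needs.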


5.4.6 Gradual Start

Similarly to other recommendation systems which make recommendations based on collected data, we have to face the “cold start” problem: how to make recommendations when we do not have enough data? We use a “gradual start” approach. At the beginning we provide estimates of the solving times of individual problems and add into the system several artificial users with similar times. These estimates can be obtained from naive metrics (like the length of the shortest path to a solution) or by using computational models of human problem solvers [59, 93]. We start with the one-parameter model, because it is more robust to random noise in the data (this noise is particularly significant with sparse data). As the system is used and more data are collected, we still make predictions using the one-parameter model, but at the same time we compute parameters for the two- and three-parameter models and evaluate their predictions. As soon as the predictions of the more parametric models become better, we start using them for displaying predictions and recommendations.
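The model switching logic can be sketched as follows. This is a simplification: the predictor interface, the error metric (mean absolute error) and the data threshold are assumptions for illustration, not the Tutor's actual code.

```python
def choose_model(models, history, min_data=200):
    """Gradual start: keep using the robust simplest model until a more
    parametric model demonstrably predicts better on the collected data.
    models: ordered list [(name, predictor)], fewest parameters first.
    history: list of (features, observed_log_time) pairs."""
    def mae(predict):
        # mean absolute prediction error over the collected history
        return sum(abs(predict(x) - t) for x, t in history) / len(history)
    best_name, best_predict = models[0]
    if len(history) < min_data:
        return best_name            # too little data: stay with the simplest model
    best_err = mae(best_predict)
    for name, predict in models[1:]:
        err = mae(predict)
        if err < best_err:          # switch only when predictions actually improve
            best_name, best_err = name, err
    return best_name
```

With sparse data the function returns the first (simplest) model unconditionally; once enough observations accumulate, a richer model takes over only if its measured error is lower.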

5.5 Statistics of Usage

The system is already used by more than 50 schools and has more than 10,000 registered users (mainly university and high school students). Users have spent more than 14,000 hours solving more than 460,000 problems. The number of solved problems is distributed unevenly among the different problem types (see Tables 5.1 and 5.2). More than 100 teachers from 88 schools have registered; they run 221 classes to which they have assigned more than 2,400 students.


Table 5.2: Statistics for problems in the Tutor in April 2013. The Instances column gives the number of successfully solved problem instances. The Spent time column gives the time spent on successful attempts. The Unsuccessful column gives the time spent on unsuccessful attempts.

Problem | Instances | Spent time | Unsuccessful
Binary Crosswords | 38,405 | 736:25:21 | 126:50:13
Calculator | 12,590 | 350:40:56 | 79:50:39
Circuits | 8,635 | 216:14:55 | 60:45:54
Color Maze | 10,905 | 200:14:44 | 51:17:08
Corector | 11,759 | 110:16:54 | 76:36:39
Eternity | 5,081 | 171:48:26 | 98:55:35
Finite Automaton | 2,803 | 120:07:00 | 29:42:32
Graphs and Functions | 17,757 | 345:33:07 | 202:46:22
Graphs and Functions Advanced | 1,292 | 15:58:56 | 3:36:46
History | 4,915 | 100:20:20 | 37:10:26
Interactive Python | 1,918 | 217:21:55 | 118:32:36
Interactive C | 3,549 | 133:22:13 | 41:53:01
Loop Finder | 13,687 | 978:37:18 | 273:11:34
Math Pairs | 18,701 | 259:41:22 | 67:55:43
Math Pairs Advanced | 1,130 | 21:04:50 | 1:33:26
Minotaurus | 5,113 | 161:44:10 | 128:12:30
Musician | 1,683 | 37:56:42 | 12:41:32
Number Maze | 34,593 | 402:31:19 | 99:46:06
Nurikabe | 5,508 | 656:32:22 | 297:25:51
Regular Expressions | 5,689 | 177:34:58 | 47:59:52
Robotanist | 57,833 | 2992:40:29 | 1495:56:51
Patterns | 2,054 | 43:01:00 | 18:51:30
Polyominoes | 14,028 | 359:49:06 | 257:17:52
Region Puzzle | 19,853 | 470:58:33 | 183:08:52
Robot Karel | 2,303 | 224:52:04 | 87:44:58
Rush Hour | 41,815 | 668:01:53 | 368:13:51
Sokoban | 19,927 | 753:41:41 | 445:45:41
Tents | 12,072 | 314:54:23 | 50:13:19
Tents Numbers | 9,774 | 303:11:49 | 38:32:57
Tilt Maze | 61,046 | 1420:53:34 | 483:49:11
Transformations | 1,809 | 47:45:13 | 16:23:36
Turtle Graphics | 5,388 | 410:53:15 | 220:38:52

6 Difficulty of Transport Puzzles

In this chapter we study six transport puzzles (Minotaurus, Number Maze, Rush Hour, Sokoban, Tilt Maze and Replacement Puzzle). Using the Tutor, we collected large-scale data about human problem solving on these puzzles. The results show that there are large differences in the difficulty of individual problem instances. We argue that these differences are partly caused by the global structure of the problem state space and that they are not explained by previous research. In order to explain these differences, we develop and evaluate a computational model of human behaviour during state space navigation. We introduce the concept of a state space bottleneck to determine key states on the solution path and derive a method for its computation. We also study problem-specific methods to enhance difficulty predictions; concretely, we study problem decomposition and counterintuitive moves for the Sokoban puzzle and propose their evaluation.

6.1 Motivation

Previous research on problem difficulty focused particularly on the following concepts:

• the hill-climbing heuristic, which was studied for example for river crossing problems [47], the Fifteen puzzle [96], and the Water jug puzzle [9, 26],

• means-end analysis, which was proposed as a key concept in the “General Problem Solver” [86] and was studied for example for Tower of Hanoi puzzle,

• differences between comprehension of isomorphic problems, which focus on the difficulty of successor generation and were studied for example for Tower of Hanoi [70] and Chinese ring puzzle [70].

Let us illustrate on the Sokoban puzzle that these concepts are not sufficient. In our experiments there are very similar Sokoban problems with a large difference in difficulty (more than 10-fold) – whereas the problems in Figure 6.1 took humans on average nearly one hour, other problems were solved within a few minutes. Yet with respect to the above-mentioned concepts the problems are nearly the same. Hill-climbing is not directly applicable to solving Sokoban problems (except very easy ones). Means-end analysis is applicable only in a very limited sense and it is not clear how this concept could explain the large differences in difficulty of different Sokoban problems.

82 6. DIFFICULTY OF TRANSPORT PUZZLES

Figure 6.1: Example of two difficult Sokoban puzzles. The median solving time for the left problem (further denoted example 1) is 43 minutes, for the right one (denoted example 2) it is 49 minutes.

Differences between comprehension of isomorphic problems and successor generation also cannot be responsible for differences in difficulty, because all instances are stated in the same way.

6.2 Studied Problems

All studied problems are single-player transport puzzles. The “transport” designation does not mean that there is necessarily some physical movement involved in solving the puzzle, but rather that the solution is a sequence of moves. Transport puzzles can be expressed directly using state space terminology [104] – states are configurations of the puzzle, transitions are given by the allowed operations, and the goal of the puzzle is to find a path from the initial to the final state. Below we briefly describe the rules of the puzzles and the instances used, and we comment on their state spaces and the collected data about solving times.

6.2.1 Sokoban

Sokoban is a well-known puzzle created by Hiroyuki Imabayashi. There is a simple maze with several boxes and one man. The goal of the puzzle is to move the boxes onto the target squares (see Fig. 6.1). The only allowed operation is a push by the man; the man can push only one box at a time. The state space of the game is formalized as a directed graph G = (V, E), where V is the set of game states and E is the set of edges corresponding to moves of a single box. A naive formulation of the state space would consider each step of the man as a move. This formulation does not add any important information to the analysis and leads to unnecessarily large state spaces. The state

of the game is thus given by the positions of the boxes in the maze and by the area reachable by the man. We denote by s0 the vertex corresponding to the initial position of the game. In our discussion we consider only states reachable from the initial state. There can be several states corresponding to a solved problem – all boxes have to be on their target positions in the final state, but there can be several final states due to the position of the man. Nevertheless, in nearly all cases there is just one final state; thus, to simplify the discussion, in the following we assume just one final state, denoted sf. This simplification is only for the sake of readability; implementations of all our techniques work correctly in the general case of multiple final states. We say that a state s is “live” if there exists a path from s to sf; otherwise we call the state s “dead”. State spaces for Sokoban are directed (moves are irreversible); their sizes range from tens of states to tens of thousands of states. The median time to solve a Sokoban puzzle ranges from 23 seconds to 16 minutes. For the experiments we used 79 problem instances with two, three or four boxes, on which solvers spent 753 hours. Most of the instances were selected from standard level collections [94].
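The live/dead classification can be computed with two breadth-first searches: one forward from s0 and one backward from sf over reversed edges. A sketch on a toy state space (the dictionary representation of the graph is an assumption):

```python
from collections import deque

def reachable(graph, start):
    """Vertices reachable from `start` in a directed graph given as
    {state: [successor states]}."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def live_states(graph, s0, sf):
    """A state is live iff it is reachable from s0 and the final state sf
    is reachable from it; all other reachable states are dead."""
    forward = reachable(graph, s0)
    # build the reversed graph for the backward search from sf
    reverse = {}
    for u, succs in graph.items():
        for v in succs:
            reverse.setdefault(v, []).append(u)
    backward = reachable(reverse, sf)
    return forward & backward
```

On a tiny example with one dead branch, only the states on some path from s0 to sf survive the intersection.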

6.2.2 Minotaurus Puzzle

The goal of the Minotaurus Puzzle is to move Theseus to the exit of the maze to escape from Minotaurus (see Fig. 6.2). For every move of Theseus, Minotaurus can move twice, but with one restriction: Minotaurus can move only when the move shortens his distance from Theseus. Minotaurus first tries to move in the horizontal and then in the vertical direction. States are represented by the positions of Minotaurus and Theseus in the maze. State spaces are directed and small – their sizes range from tens to hundreds of states. The median solving time ranges from 13 seconds to 9 minutes. For the experiments we used 46 problem instances on which solvers spent 160 hours.

6.2.3 Number Maze

In this puzzle the solver starts in the top left corner of a regular square board. The goal is to move from the top left to the bottom right corner while respecting the jump lengths marked on the squares (see Fig. 6.2). The solver can move only in the horizontal or vertical direction and must move by exactly the number of squares displayed on the current square. States are represented by the position of the solver in the maze. The state space is


Figure 6.2: Instances of the Minotaurus puzzle and the Number Maze. The goal of the Minotaurus puzzle is to move Theseus through the maze to escape from Minotaurus. The goal of the Number Maze is to move from the top left to the bottom right corner while respecting the jump lengths marked on the squares.

directed and very small – its size ranges from 4 to 45 states. Nevertheless, the median time to solve a puzzle ranges from 5 seconds to 5 minutes. For the experiments we used 85 problem instances generated by our own generator. Users spent 400 hours solving the puzzle. A visualization of the state space of one of the problems (concretely Maze 3c) is depicted in Fig. 6.3. The starting state of the puzzle is at the bottom left, the goal state at the bottom right. A grey square marks the current position of a solver. States with a light-grey background represent dead states. The median solving time for this puzzle is 11 seconds.

6.2.4 Tilt Maze

The goal of the Tilt Maze puzzle is to move a ball through the maze and collect all marked squares (see Fig. 6.4). The ball can move in the vertical or horizontal direction and it moves straight until it hits a wall of the maze. States are represented by the position of the ball and by the sequence of collected squares. State spaces are directed; their sizes range from tens of states to tens of thousands of states. The median solving time ranges from 19 seconds to 10 minutes. For the experiments we used 109 problem instances on which solvers spent 1420 hours.

85 6. DIFFICULTY OF TRANSPORT PUZZLES

Figure 6.3: Illustration of the state space of the problem Maze 3c. The starting state is at the bottom left, the goal state at the bottom right. A grey square marks the current position of a solver. States with a light-grey background represent dead states.

6.2.5 Rush Hour

Rush Hour is a well-known transport puzzle created by Nob Yoshigahara. In a grid there are several cars. Each car can move either in the vertical or the horizontal direction; cars cannot be rotated. Each square of the grid can be occupied by at most one car. The goal of the puzzle is to move the marked car out of the grid (see Fig. 6.4). States are represented by the positions of the cars in the plan. State spaces are undirected (all moves are reversible) and therefore contain no dead states. Their sizes range from hundreds of states to tens of thousands of states. The median time to solve a puzzle ranges from 10 seconds to 13 minutes. Experiments were performed with 60 instances, all of them using a 6×6 grid and cars of size 1×2 or 1×3. Most of the instances were taken from the standard Rush Hour set. Solvers spent 668 hours solving the puzzle.

6.2.6 Replacement Puzzle

Replacement Puzzle is a lesser-known puzzle created by Erich Friedman [44]. In this case we are manipulating a sequence of symbols. Given a starting sequence of symbols (see Fig. 6.5), the aim is to derive a goal sequence by using provided replacement rules. Replacement rules are applied one at a time; replacement can be applied on any consecutive sequence of symbols.


Figure 6.4: The goal of Tilt Maze (left) is to move a ball through the maze and collect all marked squares. The goal of Rush Hour (right) is to move the marked car out of the grid by moving the other cars out of the way.

Figure 6.5: Instance of Replacement Puzzle. The aim is to derive a goal se- quence by using provided replacement rules.

At any time the sequence may not contain more than 6 symbols. The original formulation by Erich Friedman requires that the puzzle be solved in a fixed number of steps (to ensure a single solution); we allow an arbitrary number of steps. The experiments were done with 40 instances; each of them used two types of symbols and three rules. States are represented by the derived symbol sequence. State spaces are directed (moves are irreversible) and their sizes range between 10 and 120 states, i.e., in this case the state spaces are much smaller than for Sokoban and Rush Hour. Nevertheless, the puzzle is still nontrivial: the median time to solve a puzzle is 30 seconds for the easiest instance and 5 minutes for the hardest one.

6.3 Data Collection and Analysis

Here we describe the solving data collected for our six puzzles, obtained mostly from the Problem Solving Tutor. We proceed with an analysis: first we compare different metrics for measuring difficulty and show that they are highly correlated; then we analyze individual moves in the Sokoban state space, which will give us an intuition for developing our model.

Figure 6.6: Boxplot of solving times for our problems.

6.3.1 Data Collection

We performed our own data collection using the Problem Solving Tutor (see Tables 6.1 and 6.2). In the case of the Replacement Puzzle we collected data using the Tutor's predecessor, a web-based tool for solving problems. Participants were not paid and we did not have direct control over them, since the whole experiment ran over the Internet. As a motivation to perform well, the Tutor provides a public results list – for most people this is sufficient motivation, and at the same time it is sufficiently weak that there is no tendency to cheat. This Internet-based approach certainly has some disadvantages compared to the standard “laboratory” approach to experiments on human problem solving; in particular, we do not have direct control over our subjects. Nevertheless, we believe that the advantages significantly outweigh these disadvantages. Figure 6.6 shows the distribution of median solving times for our puzzles. The median solving time varies from a few seconds to tens of minutes.

6.3.2 Data Analysis

Here we describe the selection of a suitable parameter to measure problem difficulty. We also analyze the collected data for the Sokoban puzzle, which will give us


problem | state space | instances
Minotaurus | small, directed | 46
Number Maze | small, directed | 85
Replacement Puzzle | small, directed | 40
Rush Hour | large, undirected | 60
Sokoban | large, directed | 79
Tilt Maze | large, directed | 109

Table 6.1: Summary information about puzzle state spaces.

problem | total time | easiest | median | hardest
Minotaurus | 160 h | 13 sec | 2:35 min | 9 min
Number Maze | 400 h | 5 sec | 38 sec | 5 min
Replacement Puzzle | 55 h | 34 sec | 2 min | 5 min
Rush Hour | 668 h | 10 sec | 2 min | 13 min
Sokoban | 753 h | 23 sec | 3 min | 16 min
Tilt Maze | 1420 h | 19 sec | 2 min | 10 min

Table 6.2: Summary information about the collected data. “Total time” is the total time spent by solvers on the provided puzzles (only successful attempts are included); the last three columns give the median time to solve the easiest, median, and hardest instance.

an intuition for deriving a computational model.

6.3.3 Problem Difficulty

Our aim is to explain and predict the difficulty of individual problems. As a first step it is necessary to specify a fair and robust measure of difficulty. There are several natural measures: the time taken to solve a problem, the number of moves necessary to solve it, the number of solvers who successfully solved it, or the parameter b from the model presented in Chapter 3. It turns out that all these measures are highly correlated, i.e., it seems plausible that any single one of them sufficiently captures the concept of problem difficulty. In the rest of the chapter we use as the difficulty measure the median solving time of successful attempts. Figure 6.7 shows the relation between this measure and the number of successful solvers (Spearman's r = −0.98) for the Tilt Maze puzzle. The median time is also closely related to the parameter b of the model presented in Section 3.2 (Spearman's r = 0.98). The median time


Figure 6.7: Median solving time compared to other metrics – the number of successful solvers (left) and the parameter b of the basic model presented in Section 3.2 (right).

is also related to the average number of moves needed to solve a given problem (Spearman's r = 0.87).
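Spearman's rank correlation used above is simply the Pearson correlation computed on ranks. A minimal sketch, without tie handling; the difficulty numbers below are purely illustrative, not from our data:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (simple version, no correction for tied values)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each value in x
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each value in y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# hypothetical difficulty metrics for five problems:
median_times = [10, 25, 60, 300, 900]   # seconds
solvers = [500, 320, 200, 90, 15]       # number of successful solvers
```

A strongly negative r between median solving time and the number of successful solvers (as in Figure 6.7) indicates that harder problems are solved by fewer students, regardless of the absolute scale of either metric.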

6.3.4 Analysis of Individual Moves in Sokoban Puzzle

We analyze the Sokoban puzzle in detail to obtain information about human navigation through a state space [55, 56, 57]. This will give us an intuition for the construction of an artificial model of human state space navigation. In our Tutor we log information about all moves performed by solvers (including the time taken to make each move). Here we provide an analysis of these individual moves; based on its results we built our computational model of human problem solving behavior. Although 73.8% of Sokoban game configurations are dead (i.e., states from which it is impossible to reach the goal), humans usually do not spend much time in dead states (14.5% on average). They can relatively quickly discover that they are in a bad configuration and restart the game. Humans spend more time far from the goal position (see Figure 6.8); once they get to one half of the distance between the start and goal state, they finish the problem rather quickly. To get better insight we visualized the state spaces. Even for our rather small instances of Sokoban problems, whole state spaces are too big to be visualized directly, so it is necessary to prune the state space to obtain a reasonable visualization. We prune the state spaces in two ways. First, we keep


Figure 6.8: Time spent at a given distance from the goal. Both metrics (time and distance) are normalized. The bold line is the mean over all problems; the two dashed lines are the problems from Figure 6.1.

only live states. This decreases the average number of states in the visualization from 3697 to 823; the most important information is retained since humans spend most of their time in live states. Second, we cut long back-level edges [1]. According to our experience, in most cases there are only a few long back-level edges and their removal makes the visualizations much more comprehensible. The visualizations displayed in this thesis cut off edges of length three or more, which we consider a reasonable compromise between loss of information and comprehensibility of the visualization. To visualize the pruned state space we use automated graph drawing [2]. Figure 6.9 shows examples of the resulting visualizations. The size of each state is proportional to the average time spent by humans in the state; the thickness of each edge is proportional to the average number of times humans performed the given move.

6.4 Model of Human Behaviour

The presented problems vary with respect to state space size, solution length, and heuristic effectiveness, but we show that for some puzzles these factors

1. Back-level edges are edges which go to a lower level with respect to breadth-first search. 2. Specifically, the tool Pajek [14] with the Kamada-Kawai algorithm [64].


Figure 6.9: Examples of state space visualizations of the problems given in Figure 6.1. The size of each vertex is proportional to the average time spent by human solvers. Some edges are omitted from the visualization (see text).

do not fully explain the differences in problem difficulty that we obtained from experiments with human solvers. We believe that the unexplained differences are caused by differences in the structure of the problems' state spaces. Figure 6.10 demonstrates on artificial examples how the structure can influence difficulty. Both examples have the same number of states and edges and the same distance from start to goal. In the left example it is easy to find the path to the goal – whatever path we choose, we arrive at the goal. In the example on the right it is much more difficult to succeed – we have to select the right sequence of moves, and each wrong move makes the solution path much longer. To capture these structural differences among problems, we propose a dynamic computational model which simulates human behaviour during state space search. The model is very abstract – it approximates human behaviour as a mix of randomness and optimality. The model does not explain "how people think", it just simulates behaviour; i.e., it is a cognitive engineering model rather than a cognitive science model [46]. Cognitive science models provide a better explanation of experimental results, but they typically contain many problem-specific rules and many parameters, which makes them prone to overfitting [97] and hard to generalize.


Figure 6.10: Two artificial examples which illustrate how the structure of the state space can influence the difficulty of problem solving.

Our model can be easily applied to other problems. Our main aim in developing the model is to use it for predicting problem difficulty. Nevertheless, the model can be useful for other applications, e.g., for providing hints or detecting cheating in intelligent tutoring systems [3] and other online problem solving applications. Using the computational model we specify a metric for rating the difficulty of problems and we evaluate this metric on the collected data. We compare it with other possible difficulty rating metrics, particularly with the metric "length of the shortest path to the goal". For each of the three studied puzzles the results are slightly different. We argue that our results open several new questions about human problem solving which have not been addressed by research so far. We do not try to model the actual human cognitive processes during problem solving, i.e., this is a cognitive engineering model rather than a cognitive science model [46]. Our model is very abstract and is based only on information about the underlying problem state space, i.e., the model is not specific to a single problem.

6.4.1 Basic Principle

Our model is based on the analysis of human behaviour discussed in Section 6.3.4. At the beginning humans explore the state space rather randomly; later, as they get closer to the solution, they move more straightforwardly toward the goal. Since humans spend most of their time in live states, the basic model works only with these states and completely avoids dead states. The model starts at the initial state and then repeatedly selects a successor state. In the basic model this selection is local and very simple – it is a combination of two tendencies: a "random walk" (selection of a random successor) and an "optimal walk" (selection of a successor which is closer


Figure 6.11: Example of a human state space traversal for a Rush Hour problem. Nodes represent states; the size of a node corresponds to the time spent in the given state.

to a goal state). Human decisions are usually neither completely random nor completely optimal. Nevertheless, the model assumes that a weighted combination of these two tendencies can provide a reasonable fit of human behaviour.

6.4.2 Model Formalization

The general principle of our model is the following. In each step the model considers all successors s′ of the current state s. Each successor s′ is assigned a value score(s′); the sum of all score values is denoted SumScore. The model moves to a successor which is selected randomly according to the probability distribution:

P(s′) = score(s′) / SumScore

This general model is specified by the choice of a score function. In this chapter we evaluate the basic version of the model, which uses a simple function based on the distance d(s) of a state s from the goal state. The function is defined as follows (B is the single parameter of the model – the "optimality bonus"):

score(s′) = d(s) + B   if d(s′) < d(s)
score(s′) = d(s)       if d(s′) ≥ d(s)

Successors that lead toward a solution get an "optimality bonus", i.e., they have a higher chance of being selected. The use of the distance from the goal


Figure 6.12: Example of a model state space traversal for a Rush Hour problem. Nodes represent states; the size of a node corresponds to the number of moves leading to the given state.

in the formula has the consequence that the relative advantage of the bonus increases as the model gets closer to the goal, i.e., the model behaves less randomly when it is close to the goal (as do humans). If B = 0 then the model behaves as a pure random walk. As B increases, the behaviour of the model converges to the optimal path. Hence by tuning the parameter B the model captures a continuous spectrum of behaviour between randomness and optimality.
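The random/optimal walk described above is easy to sketch in code. The following is a minimal illustration, not the thesis implementation: the dictionary-based state space representation, the function names, the step cap, and the default B = 25 are our assumptions; the mean over repeated runs anticipates the difficulty metric used in the evaluation.

```python
import random

def simulate(successors, d, start, goal, B=25, max_steps=100000):
    """One run of the random/optimal-walk model; returns the number of steps.
    successors[s] lists successor states; d[s] is the distance to the goal."""
    s, steps = start, 0
    while s != goal and steps < max_steps:
        succ = successors[s]
        # successors closer to the goal get the optimality bonus B
        scores = [d[s] + B if d[t] < d[s] else d[s] for t in succ]
        r = random.uniform(0, sum(scores))
        for t, w in zip(succ, scores):     # roulette-wheel selection
            r -= w
            if r <= 0:
                s = t
                break
        steps += 1
    return steps

def difficulty(successors, d, start, goal, B=25, runs=100):
    """Mean number of model steps over repeated runs (difficulty metric)."""
    return sum(simulate(successors, d, start, goal, B) for _ in range(runs)) / runs
```

With B = 0 the roulette wheel degenerates to a uniform random walk; with large B the walk almost always follows the distance gradient, which mirrors the spectrum described above.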

6.4.3 Model with Dead States

When the state space is directed (as is the case for most of our puzzles), a goal state cannot be reached from some states – we call these states "dead". Once the model reaches a dead state, it will cycle in dead states forever. Since this does not correspond to human behaviour, we have to extend the model for directed state spaces. We consider two different extensions:

1. dead states are never visited, i.e., score(s′) = 0 if s′ is dead;

2. the model resets back to the initial state when it reaches a state without any successor or when it revisits the same dead state a second time.

We use the first extension for all puzzles with dead states, and the second extension for the Minotaurus, Tilt Maze and Replacement Puzzle. This choice corresponds to the collected data about human problem solving. For example, in the case of Sokoban humans are good at avoiding dead states, whereas

in the case of the Replacement Puzzle and Tilt Maze humans visit dead states a lot.

6.4.4 Other Extensions

There are several other possible extensions of the model. All of them are quite natural and can be realized simply by extending the scoring function:

• Hill climbing heuristic (specific for the particular problem), e.g., for Sokoban the natural heuristic is the total distance of boxes from goal positions. We study this topic more deeply in Section 6.7.

• Use of memory (loop avoidance heuristic), e.g., the model would re- member states that were already visited and in the scoring function would prefer unvisited states.

• Penalization of long back edges – humans can recognize not just moves which lead to dead states, but also moves which lead "backwards".
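As an illustration, the loop-avoidance extension above amounts to a small change in the scoring function. This is a hypothetical sketch – the penalty factor of 0.5 and the function signature are our assumptions, not values from the thesis:

```python
def score_with_memory(d, s, t, visited, B=25, penalty=0.5):
    """Basic score (distance d(s), plus optimality bonus B for successors
    closer to the goal), scaled down when the successor t was already
    visited -- a simple loop-avoidance heuristic."""
    base = d[s] + B if d[t] < d[s] else d[s]
    return base * penalty if t in visited else base
```

The penalty adds one parameter to the model, which matches the remark below about the cost of such extensions.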

Each of these extensions incorporates at least one additional parameter into the model. Given the size of our testing data, it could be misleading to evaluate versions of the model with more parameters due to the potential overfitting of the data [97].

6.5 Evaluation

In this section we evaluate the model on the collected data. We discuss the difficulty rating metrics used and present the obtained results. We also analyze different values of the parameter B. Since the results differ across problems, we discuss these differences and offer hypotheses for possible explanations. Finally, we outline a hypothesis about the connection between the state space structure and the model of problem solving times presented in Section 3.2.

6.5.1 Difficulty Rating Metrics

Does the computational model provide an explanation of differences in problem difficulty? To answer this question we formalize a metric based on the computational model and compare it with other possible metrics. The metric based on the computational model works as follows: for a given problem we run the model repeatedly (100 times) over the state space and compute the mean number of steps necessary to reach the final state.


Figure 6.13: Results for Tilt Maze. Relations between different rating metrics and humans’ median solving times are displayed. In all examples we use log-scale axes.

For comparison we used several other metrics, e.g., parameters of the state space (particularly its size), the length of the shortest path to the goal state, and metrics based on simple heuristics like the number of counterintuitive moves [26] necessary to reach the goal. From these other metrics we report here only the length of the shortest path, because metrics based on state space parameters do not provide a statistically significant correlation with problem difficulty, and metrics based on problem-specific heuristics work similarly to metrics based on the shortest path [3]; as they are problem dependent, we do not discuss them in detail. Thus we focus on comparing the shortest path metric with the computational model metric. Table 6.3 provides a summary of correlation coefficients. We measure Spearman's correlation coefficient, which gives the correlation with respect to the ordering of values – for practical applications of difficulty metrics the ordering is often more important than absolute values.

6.5.2 Value of the Parameter B

The metric based on the computational model depends on the parameter B (optimality bonus). We performed a sensitivity analysis of the model behaviour with respect to this parameter. For the basic model, good results were obtained for values of the optimality bonus B around 25. Note that the studied problems are quite different – their state spaces are

3. The only notable exception is the Sokoban puzzle, for which we were able to get a successful problem-specific metric based on "chunks" along the shortest path [57].


Table 6.3: Summary

Problem              Metric               Spearman
Sokoban              shortest path        0.6
                     model B = 25         0.68
Rush Hour            shortest path        0.89
                     model B = 25         0.92
Number Maze          shortest path        0.74
                     model B = 25         0.73
Minotaurus           shortest path        0.57
                     model B = 25         0.5
                     model dead B = 200   0.66
Tilt Maze            shortest path        0.33
                     model B = 25         0.28
                     model dead B = 25    0.76
Replacement Puzzle   shortest path        0.21
                     model B = 25         0.49

different combinations of large/small and directed/undirected types (see Table 6.3.1). On the other hand, for the Minotaurus puzzle and the model with dead states we used a higher value of B. This is for practical reasons: the Minotaurus puzzle has longer shortest paths with many dead branches, and with low values of B the computation would take a huge number of model steps (which is also unrealistic with respect to the human navigation the model attempts to simulate).

6.5.3 Differences among Problems

Table 6.3 shows that there are quite large differences among the studied problems. For Rush Hour and Number Maze the shortest path metric provides quite a good explanation of problem difficulty; in these cases the computational model metric does not bring any improvement. However, for the Sokoban puzzle, Minotaurus, and particularly the Replacement Puzzle and Tilt Maze, the shortest path metric provides a poorer explanation, and in these cases the computational model metric does bring an improvement. These results open a new interesting question: why does the shortest path metric sometimes provide a sufficient explanation of problem difficulty and sometimes not? We offer several hypotheses. First, problems differ in their "dead state recognition time". In the Tilt Maze and Replacement Puzzle it takes longer to distinguish a dead state

from a regular state. Therefore solvers spend more time in these states if there are enough dead branches. This is captured by the model with dead states, which gives good results for the Minotaurus, Replacement Puzzle and Tilt Maze. Second, problems differ in their "local difficulty". It is much harder to imagine successor states for the Replacement Puzzle than for the Rush Hour or Number Maze puzzles. Consequently, solvers can do more analysis and planning for Rush Hour and Number Maze, and thus the structural differences among problems may be less important there. Third, the state space of Rush Hour is undirected (all moves are reversible), whereas the state spaces of the other puzzles are directed [4]. We believe that the issue of directionality may be quite important in the study of problem solving. So far this issue has not been adequately addressed, as most research focused on undirected problems (e.g., [47, 70, 96]).

6.5.4 Relation to the Model of Problem Solving Times

We show the relation between the state space structure and the parameters of the basic model obtained in Section 3.2. Recall the model formula:

tsp = bp + apθs + N (0, cp)

where ap is the problem discrimination, bp the problem difficulty, cp the problem randomness, and θs the student's skill. Are these parameters related to the state space structure in any way? We describe two examples from Tilt Maze and formulate a possible hypothesis about the parameter values. We begin with the problem Maze 966, which has a very large value of randomness and a low value of discrimination (cp = 1.9, ap = −0.69) and was solved by 910 solvers. Figure 6.14 shows its state space. At the beginning the solver can choose between two directions – one leading directly toward the goal, the other leading inside a cluster of interconnected states where he gets stuck. Notice that even a less skilled solver can by chance choose the right direction and complete the puzzle very quickly. Second, we present the problem Maze 815 with low randomness and standard discrimination (cp = 0.58, ap = −1.05), solved by 2357 solvers. Its state space shows a quite straightforward way to the goal, with few branches and clusters in which to get stuck (see Figure 6.15). For both less and more skilled solvers, there are not many chances to get stuck or to significantly shorten

4. Although we did provide solvers with a "back" button for the Sokoban puzzle.


Figure 6.14: State space for Tilt Maze puzzle instance Maze 966. Model pa- rameters: cp = 1.9, ap = −0.69.

Figure 6.15: State space for Tilt Maze puzzle instance Maze 815. Model pa- rameters: cp = 0.58, ap = −1.05.


Figure 6.16: Maze 815 and Maze 966: solving times for given students' abilities are plotted. We see large differences in problem randomness cp (left: cp = 0.58; right: cp = 1.9).

the path. Figure 6.16 shows the correlation between solving time and skill for both puzzles. Notice that for Maze 966 the points roughly form two groups. This seems to support a simple hypothesis: if the state space has many clusters in which solvers can get stuck or lost, it will tend to have a higher variance in solving times – a higher model randomness cp. Nevertheless, to confirm or reject this hypothesis we would need a large amount of data for a stable estimation of the parameter cp.

6.6 State Space Bottleneck

In this section we return to the analysis of the state space structure and extend it further; in particular, we describe the concept of a state space "bottleneck" and elaborate a metric for scoring bottlenecks. Even though the concept is not applicable to difficulty prediction, it may serve as a useful hint, since a bottleneck provides a natural decomposition of the problem, and its explicit identification could help humans better understand problems.


6.6.1 Analysis of Bottleneck

Using visualizations of the state spaces of difficult levels we identified a recurring feature – often there is a "bottleneck" (a narrow part of the state space) and people spend most of their time in states before the bottleneck (see Figure 6.9). Once people find the bottleneck, they usually quickly find a path to the final state. Based on this observation we propose a formalization of the concept of a state space bottleneck using network flows with non-uniform prices. The concept is based only on the structure of the state space, i.e., it is not specific to Sokoban. We are aware of only one related notion in the literature. Berlekamp et al. [23] briefly mention the notion of a "narrow bridge" while discussing an abstract map of the state space of the sliding block puzzle Century. They attribute the high difficulty of the puzzle to the existence of this narrow bridge in the state space. They do not, however, provide any formalization of the concept.

6.6.2 Network Flows

A straightforward approach to formalizing the bottleneck concept is to employ graph connectivity notions, e.g., to find a minimum cut between the initial and final state. This approach, however, has two disadvantages. First, a bottleneck is not an absolute measure, but rather a relative one – it is important to consider the "width" of the state space before the bottleneck, not just the "width" of the bottleneck itself. Second, connectivity measures are hard to compute. Therefore, we employ an approach based on network flows. Intuitively, we compute a maximum flow from the initial to the final state and study in which states the flow accumulates. Let G be a directed graph with two distinguished vertices, a source s and a sink t, and an edge capacity function c : V × V → N. A network flow is a function f : V × V → R with the following properties:

• capacity constraints: ∀u, v : f(u, v) ≤ c(u, v),

• skew symmetry: ∀u, v : f(u, v) = −f(v, u),

• flow conservation: for every v other than s and t, Σ(u,v)∈E f(u, v) = Σ(v,w)∈E f(v, w).

A maximal flow can be computed by the Ford-Fulkerson algorithm [43]. The algorithm chooses an augmenting path from the source to the sink and increases the flow along this path. This process is repeated until there is no


Figure 6.17: Example of a network and a maximum flow (all edges have capacity 10). The flow computed by the standard Ford-Fulkerson algorithm does not identify the bottleneck; the flow computed with non-uniform prices splits the flow among more edges and thus finds the bottleneck.

path that can be augmented. This basic algorithm is not suitable for finding a bottleneck, because it pumps the shortest path through the graph to the capacity of the minimal cut (see Figure 6.17). We need to make the flow more uniform. To achieve this goal we associate a price with each edge and find a "minimum cost maximal network flow" [2]. The pricing function p gives every edge (u, v) a real-valued price depending on the flow f(u, v). We consider only a simple polynomial price function of the current flow f(u, v), specifically a quadratic price. To find a minimum cost maximal network flow, we can use an extension of the Ford-Fulkerson algorithm [2]: when choosing an augmenting path, we choose the cheapest one with respect to the price function (this can be done efficiently by Dijkstra's algorithm). The resulting flow spreads out in the wide part of the state space and keeps a large flow through the bottleneck (see Figure 6.17).
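The procedure can be sketched as successive cheapest augmenting paths with marginal quadratic prices. This is a minimal sketch under our own assumptions (not the thesis code): we augment one unit at a time, and we use a Bellman-Ford/SPFA path search instead of Dijkstra, since residual arcs can have negative marginal cost.

```python
from collections import deque

def quadratic_min_cost_flow(n, edges, s, t):
    """Maximum s-t flow minimizing the sum of f(e)^2 over edges.
    One unit is pushed per augmentation; the marginal price of the
    (f+1)-th unit on an edge is (f+1)^2 - f^2 = 2f + 1."""
    graph = [[] for _ in range(n)]     # arcs: [to, residual, orig_cap, rev]
    for u, v, cap in edges:
        graph[u].append([v, cap, cap, len(graph[v])])
        graph[v].append([u, 0, 0, len(graph[u]) - 1])

    def marginal(arc):
        to, res, orig, rev = arc
        if orig > 0:                   # forward arc, current flow = orig - res
            return 2 * (orig - res) + 1
        return -(2 * res - 1)          # backward arc: cancel one unit

    total = 0
    while True:
        INF = float('inf')
        dist = [INF] * n; prev = [None] * n
        dist[s] = 0
        q = deque([s]); inq = [False] * n; inq[s] = True
        while q:                       # SPFA (Bellman-Ford with a queue)
            u = q.popleft(); inq[u] = False
            for i, arc in enumerate(graph[u]):
                if arc[1] > 0 and dist[u] + marginal(arc) < dist[arc[0]]:
                    dist[arc[0]] = dist[u] + marginal(arc)
                    prev[arc[0]] = (u, i)
                    if not inq[arc[0]]:
                        q.append(arc[0]); inq[arc[0]] = True
        if dist[t] == INF:
            break                      # no augmenting path left
        v = t
        while v != s:                  # push one unit along the cheapest path
            u, i = prev[v]
            graph[u][i][1] -= 1
            graph[v][graph[u][i][3]][1] += 1
            v = u
        total += 1
    flows = {}
    for u in range(n):
        for to, res, orig, rev in graph[u]:
            if orig > 0:
                flows[(u, to)] = orig - res
    return total, flows
```

Because the quadratic price makes a second unit on an edge cost more than a first unit elsewhere, the flow splits across parallel paths instead of saturating a single one, which is exactly the "uniform" behaviour wanted here.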

6.6.3 Bottleneck Coefficient

Even the network flow computed with respect to the non-uniform price does not directly identify bottleneck states. In particular, states close to the initial and final state typically have a large flow, but they should not be considered bottleneck states. We need to further "recalibrate" the results. For each state we compute how much the flow "spreads" before and after this state. Let f be the minimum cost maximal flow. We define the flow spread before vertex v (denoted sb(v)) as the maximum of min over u in p of f(u) over all paths p from the initial state to v; analogously, the flow spread after vertex v (denoted sa(v)) is the maximum of min over u in p of f(u) over all paths from v to the final state. The bottleneck coefficient of vertex v is then defined as:


b(v) = f(v) / (sb(v)² + sa(v)²)

The bottleneck coefficient can be computed efficiently, since sb and sa can be computed by a simple dynamic programming algorithm. Figure 6.18 provides examples of the resulting bottleneck coefficients for two state spaces. We see that the coefficient clearly identifies what we would naturally call a bottleneck.
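The computation can be sketched as two widest-path passes plus the coefficient formula. This is a sketch under our assumptions: f maps each vertex to the flow through it, succ/pred give the graph in both directions, and we use a Dijkstra-style widest-path search in place of the dynamic programming mentioned in the text; the formula b(v) = f(v)/(sb(v)² + sa(v)²) follows the definition above.

```python
import heapq

def flow_spread(succ, f, source):
    """Widest-path values: spread[v] = max over source->v paths of the
    minimum vertex flow f(u) along the path (Dijkstra with a max-heap)."""
    spread = {source: f[source]}
    heap = [(-f[source], source)]
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if width < spread[u]:
            continue                         # stale heap entry
        for v in succ[u]:
            w = min(width, f[v])
            if w > spread.get(v, float('-inf')):
                spread[v] = w
                heapq.heappush(heap, (-w, v))
    return spread

def bottleneck_coefficients(succ, pred, f, source, sink):
    """b(v) = f(v) / (sb(v)^2 + sa(v)^2): a large flow through v combined
    with spread-out flow before and after v marks v as a bottleneck."""
    sb = flow_spread(succ, f, source)        # spread before v
    sa = flow_spread(pred, f, sink)          # spread after v (reversed graph)
    return {v: f[v] / (sb[v] ** 2 + sa[v] ** 2) for v in f}
```

On a graph where the flow splits around two side branches, rejoins in a single middle vertex, and splits again, the middle vertex gets the highest coefficient, while the source and sink (large flow but no spread on one side) are correctly de-emphasized.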

Figure 6.18: Bottleneck coefficients. State spaces of two sample problems from Figure 6.1; the size of each vertex v corresponds to its bottleneck coefficient b(v).

6.6.4 Possible Applications

Our study of bottlenecks was motivated by a visual analysis of the state spaces of difficult Sokoban problems. Can we use bottleneck coefficients as a difficulty rating metric? Unfortunately, it does not seem so. A straightforward metric based on bottleneck coefficients (the maximum over all coefficients in the state space) gives a poor correlation with problem difficulty. Some problems are difficult even though the set of live states is rather small and thus the bottleneck is not very dominant (even though it exists). Some problems have a "strong" bottleneck but are not very difficult, since the structure of

the state space and the problem makes it easy to find the bottleneck. So, in order to serve as a difficulty metric, the bottleneck coefficient would have to be combined with other information about the structure of the state space, and it is not clear how to do this. Nevertheless, the concept of a bottleneck could be useful, for example, for understanding human behaviour or for tutoring humans during problem solving. If a human cannot solve a problem, bottleneck states can serve as useful hints, since they provide a natural decomposition of the problem. By their nature they even provide some insight into the problem structure. Their explicit identification could help humans learn to understand problems better. These issues require further study. Note that the concept depends only on the state space and is completely independent of the particular problem. Therefore it should be applicable to other problems as well. For example, the much-studied Tower of Hanoi problem does have a very strong bottleneck in its state space; this bottleneck state corresponds to a natural problem decomposition identified by means-ends analysis.

6.7 Problem Decomposition

In the previous sections we presented a general approach to estimating difficulty based on the state space structure which did not use any problem-specific information. In this section we elaborate on an approach based on problem-specific information. Concretely, we deal with counterintuitive moves, area changes and problem decomposition. We evaluate the methods on the Sokoban puzzle and show that problem decomposition brings better results compared to other state space metrics and even to the presented computational model. This research was carried out on a slightly different data set, from the Tutor's predecessor, in which all Sokoban problems contained exactly four boxes. For more information about the data set see [56].

6.7.1 Approach

The most intuitive difficulty metric is the number of steps necessary to reach a solution, i.e., the length of the shortest path from the initial to the goal state in the state space. Other metrics can be obtained as variations on this basic principle. One of the concepts intensively studied in previous research on problem solving is the hill-climbing heuristic [9, 26, 96]. This concept can be quite directly applied as a difficulty metric. The straightforward hill-


Figure 6.19: Illustration of the "area change" metric. The example shows three consecutive states of a Sokoban problem. The numbers of areas in the states are one, two, and one; thus the "area change" metric for this sequence is computed as |1 − 2| + |2 − 1| = 2.

climbing heuristic for Sokoban is to minimize the total distance of boxes from their goal positions. Given this heuristic, we can define the metric "counterintuitive moves" as the number of steps on the shortest path which go against this hill-climbing heuristic. We assume that humans tend to avoid counterintuitive moves in their solutions and that a higher number of these moves therefore indicates a harder problem. A similar metric is based on the number of changes of connected areas in the problem maze. An area is a part of the maze which the man can reach without pushing a box. We define the metric "area change" as the sum of the absolute differences of area counts along the shortest path (see Figure 6.19). The metric assumes that humans are less willing to make moves which change areas. Another intuitively important concept in problem solving is problem decomposition. Humans are not good at systematic search, but they are good at tasks such as abstraction, pattern recognition, and problem decomposition. If a problem can be decomposed into several subproblems, it is usually much simpler (for humans) than a problem of the same type which is highly interwoven and indecomposable (see the example in Figure 6.20). The concept of problem decomposition is, however, more difficult to grasp than the hill-climbing heuristic. We propose a way to formalize problem decomposition for the Sokoban puzzle. A natural unit of "composition" is a single box. Thus we can consider a decomposition of a problem into single boxes and then count as a single move any series of pushes of the same box. We can also generalize this idea and decompose the problem into two pairs of boxes and then count as a single


decomposition    ABCD  AABB  ABAB  ABBA
left problem       10     2     6     5
right problem      14     7    12    10

Figure 6.20: Example of two Sokoban puzzles; the first one can be easily decomposed into two subproblems and is thus easy (median solving time 3:02 minutes), the second one is rather indecomposable and thus very difficult (median solving time 53:49 minutes). The table gives the number of "steps" for different decompositions (see text). The bold column corresponds to the decomposition shown in the figure.

move any series of box pushes within the group. Let D be a division of n boxes into several groups (at most n); in our case n = 4 and we denote the division by a 4-letter string, e.g., "ABAB" is the division in which the first and third box are in group A and the second and fourth box are in group B; "ABCD" is the division in which each box is in a separate group. Each edge in the state space is labeled by the identification of the group to which the moved box belongs. Let p be a path in the state space (a sequence of valid moves). We are interested in the number of label alternations a(p, D) along the path. The optimal solution of the problem with respect to a division D (denoted s(D)) is the minimum a(p, D) over all paths from the initial to the goal state. This optimal solution can be computed by Dijkstra's algorithm over an augmented state space – vertices are tuples (s, g) where s is a state in the state space and g is the identification of a group, and edges have weights 0 or 1. Figure 6.20 gives results for different decompositions of the two provided examples. For our evaluation we use two metrics based on these concepts. First, we use the "box change" metric, which is based on the division "ABCD",


Table 6.4: Correlation coefficients for different difficulty metrics; results given in bold are statistically significant (α = 0.05).

type                   metric                   Spearman
state space            size                     -0.07
                       number of live states    -0.15
                       average "live" degree    -0.36
shortest paths         shortest path             0.47
                       counterintuitive moves    0.69
                       area change               0.35
problem decomposition  box change                0.74
                       2-decomposition           0.82
model B = 25           average number of steps   0.66

i.e., each box forms its own group. Second, we use the "2-decomposition" metric, which is the minimum number of steps over all possible divisions into two groups (divisions "AABB", "ABAB", "ABBA"). We also tried other types of decompositions (e.g., a 3-1 decomposition such as "AAAB"), but the results were similar to the 2-decomposition and we do not report them explicitly. Metrics based on the state space do not work at all (no statistically significant correlation). The intuitively plausible metric based on the length of the solution is better, and a further improvement is brought by the Sokoban-specific extensions of shortest paths (the counterintuitive moves and area change metrics). The best results are obtained by the metric based on problem decompositions and by the metric based on the computational model.
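The augmented-space search described above (Dijkstra over (state, group) pairs with 0/1 edge weights) can be sketched as a 0-1 BFS, which is equivalent for 0/1 weights. This is a minimal sketch; the succ/labels encoding of the state space and counting the result as the number of macro-moves (label runs) are our assumptions:

```python
from collections import deque

def decomposition_steps(succ, labels, start, goal):
    """Minimum number of macro-moves (maximal runs of the same group
    label) over all start->goal paths.  succ[s] lists successor states;
    labels[(s, t)] is the group of the box moved by edge s->t.
    0-1 BFS over (state, group) pairs: staying in the same group costs 0,
    switching groups costs 1."""
    INF = float('inf')
    dist = {}
    dq = deque()
    for t in succ[start]:
        g = labels[(start, t)]
        if dist.get((t, g), INF) > 1:
            dist[(t, g)] = 1              # the first push opens the first run
            dq.append((t, g))
    best = INF
    while dq:
        s, g = dq.popleft()
        d = dist[(s, g)]
        if s == goal:
            best = min(best, d)
            continue
        for t in succ[s]:
            g2 = labels[(s, t)]
            nd = d + (0 if g2 == g else 1)
            if dist.get((t, g2), INF) > nd:
                dist[(t, g2)] = nd
                if g2 == g:
                    dq.appendleft((t, g2))   # 0-weight edge: front of deque
                else:
                    dq.append((t, g2))       # 1-weight edge: back of deque
    return best
```

Running this with each box in its own group yields the "box change" metric; minimizing over the pair divisions "AABB", "ABAB", "ABBA" yields the "2-decomposition" metric.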

7 Conclusion

In this thesis we study problem solving in the context of intelligent tutoring systems, particularly focusing on timing information as opposed to focusing only on the correctness of answers. This focus leads to different types of problems and requires new student models. We propose a model of students' problem solving times, which assumes a linear relationship between a problem solving skill and the logarithm of time. The model brings a novel approach to estimating latent skill based on problem solving times rather than the correctness of answers. We derive the model details and propose two methods for parameter estimation – one based on stochastic gradient descent and one on iterative joint estimation. The model is related to two different areas: item response theory and collaborative filtering. Based on the data collected for a given student, the model is able to predict solving times for yet unsolved problems. Since the model is group invariant, it does not suffer from the bias caused by the fact that the most skilled students solve more difficult problems. We also present several model extensions – a model with variance, a model with learning, and a model with multidimensional skills. We evaluate the model and its extensions using synthesized data and show that the basic difficulty of problems b and students' skill θ can be estimated easily from relatively few data. Estimating problem discrimination a is more difficult, and even with more data the further improvement of the estimates is slow. The most difficult task is to get reasonable estimates of student and problem variance. For the model with learning we analyze the role of noise and the bias from the joint improvement of students. We show that, depending on the bias and the level of noise, we are able to produce absolute and relative predictions of learning rates for students. With synthesized data we also evaluate the two estimation techniques and show that they lead to very similar results.
However, the technique based on iterative joint estimation does not need any additional meta-parameters (e.g., the learning rate in gradient descent) and is therefore more suitable for practical applications. For the model with multidimensional skills we show how concepts can be automatically detected on synthesized data. We also show the relation between the correlation of skills and the correctness of the division into concepts.

The model has already been applied in the online "Problem Solving Tutor" to recommend problems of suitable difficulty. We used the collected data for an evaluation of the model and its extensions. The results on real data show that the model brings only a slight improvement compared to the baseline predictor, but also that the model provides interesting insight into problem parameters: we can determine not just the average difficulty of problems, but also their discrimination, problem and student variance, and learning. This information is useful for problem selection and recommendation. An extension of the model can be used for successful classification of problems based only on problem solving times. The absolute predictions of learning rates can improve recommendations of problems and adapt the pace of the Tutor to a particular student. The relative predictions of learning are also useful, since they can serve as a tool for teachers to determine learning trends in a class. We show on real data that the model can also be used for cheating detection based on skill variance. The model parameters are useful for automatic problem selection and recommendation in intelligent tutoring systems.

We present the "Problem Solving Tutor", a web-based educational tool for learning through problem solving. The system focuses solely on the "outer loop" of intelligent tutoring, i.e., recommending problem instances of the right difficulty. We do not combine any hints or study materials with the problems. Instead of relying on experts for problem ordering and difficulty estimation, we learn from data. For predicting the solving times, the Tutor uses the model presented above. We describe the system design, its main components and how they interact together, the problems which are available in the Tutor, implementation issues and usage statistics. The system is already widely used: more than 460,000 problems have been solved by its 10,000 users. The tool contains 30 problem types and more than 2,000 problem instances, mainly educational problems and logic puzzles. More than 100 teachers from 88 schools have registered, and they run 221 classes to which they have assigned more than 2,400 students.
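As an illustration of how an outer loop can exploit these parameters, the sketch below recommends the unsolved problem whose predicted time is closest to a target; the selection rule, the target and all values are hypothetical, not the Tutor's actual policy:

```python
import math

# Sketch of an "outer loop" recommender: given per-problem parameters
# (a, b) and a student's estimated skill theta, pick the unsolved
# problem whose predicted solving time is closest to a target time.
# Sign convention assumed here: ln(time) = b - a * theta.

def predicted_time(a, b, theta):
    """Model prediction of solving time in seconds."""
    return math.exp(b - a * theta)

def recommend(problems, solved, theta, target_seconds=120.0):
    """problems: dict name -> (a, b); solved: set of already solved names."""
    candidates = [(abs(predicted_time(a, b, theta) - target_seconds), name)
                  for name, (a, b) in problems.items() if name not in solved]
    return min(candidates)[1] if candidates else None

# Illustrative parameters for three problems:
problems = {"easy": (1.0, 4.0), "medium": (1.0, 5.0), "hard": (1.0, 6.5)}
print(recommend(problems, solved={"easy"}, theta=0.2))  # -> medium
```

With θ = 0.2 the predicted times are roughly 45 s, 122 s and 545 s, so a 120-second target selects the medium problem.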
Using the Tutor, we collect extensive data on human problem solving for six transport puzzles (Minotaurus, Number Maze, Rush Hour, Sokoban, Tilt Maze, Replacement Puzzle). The results show that there are large differences in the difficulty of individual problem instances. We argue that these differences are partly caused by the global structure of the problem state space and that they are not explained by previous research. In order to explain these differences, we develop a computational model of human behaviour during state space traversal. The model is a simple combination of random and optimal behaviour. It has just one parameter, and the optimal value of this parameter is nearly the same for all six studied problems. We also present a model extension for dealing with dead states (i.e., states from which the goal cannot be reached).

We evaluate the model over the collected data and compare it with other metrics for difficulty rating. We show that difficulty measures based on median solving time, solving attempts, the average number of moves and the parameter b from our model are highly correlated. The results differ for the six studied puzzles. In the case of Rush Hour and Number Maze it is easy to predict difficulty even with the shortest path metric; in the case of Sokoban and Minotaurus the computational model brings a significant improvement; and in the case of the Replacement Puzzle and the Tilt Maze puzzle the model brings a large improvement.

We describe a hypothesis about the relationship between the state space structure and the parameters of the proposed model of problem solving times. We also describe the concept of a state space bottleneck and a technique for its detection. Bottlenecks may serve for generating hints and provide a natural decomposition of a problem. We also study a problem-specific metric, problem decomposition, and show that this metric significantly enhances difficulty predictions for the Sokoban puzzle.
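The one-parameter computational model of state space traversal described above can be sketched as a simulation: with some probability the solver makes an optimal move (one step closer to the goal), otherwise a random one, and the average number of simulated moves serves as a difficulty estimate. The graph, the parameter name p and the averaging scheme below are our illustrative choices, not the thesis's exact implementation:

```python
import random
from collections import deque

# Sketch of a one-parameter model of state space traversal: at each
# state the simulated solver moves optimally with probability p and
# uniformly at random otherwise. Average moves-to-goal over many runs
# is used as a difficulty proxy. The toy graph is illustrative.

def distances_to_goal(graph, goal):
    """BFS distances from the goal over an undirected state space graph."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def simulated_moves(graph, start, goal, p, rnd):
    dist = distances_to_goal(graph, goal)
    state, moves = start, 0
    while state != goal:
        neighbors = graph[state]
        if rnd.random() < p:  # optimal move: a neighbor closest to the goal
            state = min(neighbors, key=lambda v: dist[v])
        else:                 # random move
            state = rnd.choice(neighbors)
        moves += 1
    return moves

def predicted_difficulty(graph, start, goal, p=0.5, runs=2000, seed=0):
    """Average number of moves over many runs, a proxy for difficulty."""
    rnd = random.Random(seed)
    return sum(simulated_moves(graph, start, goal, p, rnd)
               for _ in range(runs)) / runs

# A toy 5-state path graph 0 - 1 - 2 - 3 - 4, solving from 0 to 4:
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(predicted_difficulty(path, start=0, goal=4, p=1.0))  # optimal: 4.0
```

A fully optimal solver (p = 1) needs exactly the shortest-path length, while smaller p inflates the expected number of moves, which is what lets the model separate problems with similar shortest paths but different state space structure.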

7.1 Future Work

For future work it may be interesting to combine the two types of exercises illustrated in Figure 2.7: test questions (correctness based) and interactive exercises (timing based). Interactive exercises may be more suitable for the actual learning, whereas test questions may be better for measuring the resulting learning. From the theoretical perspective, it is interesting to combine the correctness based and timing based models.

It would also be useful to extend the model to deal with unfinished attempts. When a student spends some time trying to solve a problem and then abandons it, we do not include this information in our computation, although it could plausibly improve our parameter estimates. Inspired by item response theory, it would be interesting to formally derive an information function for the model and apply it to adaptive testing based on solving time. Another concept which could be developed further is the probability of abandoning a problem as a function of a student's skill, which we briefly outlined in this thesis. With more data it should be possible to better employ the model with learning, which currently suffers from overfitting (note that the data typically used in collaborative filtering are still orders of magnitude larger than ours). It would also be interesting to combine our approach, which uses only problem solving times, with other approaches used in intelligent tutoring systems (e.g., knowledge tracing), and to study more deeply the application of the multidimensional model for automated concept detection.

The Tutor may be broadened in many directions. We mention several modules relevant both for our research and for students' motivation:

• Automated concept detection – it would be interesting to apply a multidimensional model for automated concept detection inside one problem type (or across several problem types). By detecting concepts in a problem set, the Tutor could recommend problems which share similar concepts.

• Detection of learning rates – based on our model for learning, it would be interesting for students and teachers to access data about their learning rates. Unfortunately, the Tutor has not generated enough data for such estimations yet.

• Inner loop – the Tutor can be extended by including an inner loop (hints for problem solving) and more sophisticated recommendations (e.g., using session history).

• Badges – a badges module would serve to better motivate students. Students would be awarded badges for their performance and for the number of solved problems. This is a common motivational component in similar systems.

From the perspective of the practical development of the Tutor, it would be challenging to join some of the big online educational projects (such as Khan Academy or Coursera) and combine our approach and models with their applications. This would certainly bring new impulses and goals for the development of the Tutor. Moreover, the Tutor already uses libraries adapted from Khan Academy for some of its tools for teaching math.

The obtained results for transport puzzles open several new questions about problem solving, e.g., the role of directionality, the time spent in dead states, or the interaction between the difficulty of local steps and the global structure. For further research, it may be interesting to analyze other transport puzzles or to study further data on human problem solving, particularly the differences among individual solvers. We believe that the results of the computational model can be further improved by the extensions discussed in Section 6.4.4.

Further analysis of state space bottlenecks should bring practical tools for generating hints in problem solving. For example, after failing to solve a problem, a student could receive as a hint the state which corresponds to the bottleneck.


For educational problems (as opposed to transport puzzles) it would be interesting to analyze semantically typical solving strategies of students and to find the key steps and mistakes made during solving (analogous to bottleneck states in a state space). By pointing out these strategies and mistakes during problem solving, a new approach to automated hint generation may emerge.

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, pages 734–749, 2005.

[2] R. Ahuja, T. Magnanti, and J. Orlin. Network flows: theory, algorithms, and applications. Prentice Hall, 1993.

[3] J. Anderson, C. Boyle, and B. Reiser. Intelligent tutoring systems. Science, 228(4698):456–462, 1985.

[4] J. Anderson and K. Gluck. What role do cognitive architectures play in intelligent tutoring systems? Cognition and Instruction: Twenty-five years of progress, pages 227–262, 2001.

[5] J. R. Anderson, C. F. Boyle, A. T. Corbett, and M. Lewis. Cognitive modeling and intelligent tutoring. Artificial Intelligence, 42, 1990.

[6] J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, pages 167–207, 1995.

[7] J. R. Anderson and R. Pelletier. A development system for model tracing tutors. In Proceedings of the International Conference of the Learning Sciences, pages 1–8, 1991.

[8] I. Arroyo, H. Meheranian, and B. P. Woolf. Effort-based tutoring: An empirical approach to intelligent tutoring. In Proc. of Educational Data Mining 2010, pages 1–10, 2010.

[9] M. Atwood and P. Polson. Further explorations with a process model for water jug problems. Memory & Cognition, 8(2):182–192, 1980.

[10] E. Ayers and B. W. Junker. Do skills combine additively to predict task difficulty in eighth grade mathematics? In Proc. AAAI Workshop Educ. Data Mining, pages 14–20, 2006.

[11] F. Baker. The basics of item response theory. University of Wisconsin, 2001.

[12] T. Barnes. The q-matrix method: Mining student response data for knowledge. In Proceedings of the American Association for Artificial Intelligence 2005 Educational Data Mining Workshop, 2005.


[13] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In Proc. Int. Conf. Intell. Tutoring Syst., pages 373–382, 2008.

[14] V. Batagelj and A. Mrvar. Pajek – analysis and visualization of large networks. In Graph Drawing, volume 2265 of LNCS, pages 8–11. Springer, 2002.

[15] C. R. Beal and P. R. Cohen. Temporal data mining for educational applications. In Proc. 10th Pacific Rim Int. Conf. Artif. Intell.: Trends Artif. Intell., pages 66–77, 2008.

[16] J. Beck and B. Woolf. High-level student modeling with machine learning. In Intelligent tutoring systems, pages 584–593. Springer, 2000.

[17] R. Bell and Y. Koren. Lessons from the Netflix Prize challenge. ACM SIGKDD Explorations Newsletter, pages 75–79, 2007.

[18] R. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In IEEE International Conference on Data Mining (ICDM’07), pages 43–52, 2007.

[19] R. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. Netflix Prize Progress Award, October 2007, 2007.

[20] R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. KDD, pages 95–104, 2007.

[21] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.

[22] Y. Bergner, S. Droschler, G. Kortemeyer, S. Rayyan, D. Seaton, and D. Pritchard. Model-based collaborative filtering analysis of student response data: Machine-learning item response theory. In Educational Data Mining, pages 95–102, 2012.

[23] E. Berlekamp, J. Conway, and R. Guy. Winning ways for your mathematical plays. AK Peters Ltd, 2003.

[24] C. Bishop. Pattern recognition and machine learning. Springer, 2006.


[25] R. A. Bradley and J. J. Gart. The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 4:205–214, 1962.

[26] H. Carder, S. Handley, and T. Perfect. Counterintuitive and alternative moves choice in the Water Jug task. Brain and Cognition, 66(1):11–20, 2008.

[27] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. Technical report, ACM Computing Surveys (CSUR), 2009.

[28] C. Chen, L. Duh, and C. Liu. A personalized courseware recommendation system based on fuzzy item response theory. In Proc. IEEE Int. Conf. E-Technol., pages 305–308, 2004.

[29] C. Chen, H. Lee, and Y. Chen. Personalized e-learning system using Item Response Theory. Computers and Education, pages 237–255, 2005.

[30] A. Corbett and J. Anderson. Student modeling and mastery learning in a computer-based programming tutor. In Proceedings of the Second International Conference on Intelligent Tutoring Systems, 1992.

[31] A. Corbett and J. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4:253–278, 1995.

[32] A. Corbett, R. S. Baker, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. Human-Computer Interaction Institute, pages 167–207, 2008.

[33] A. Corbett, K. Koedinger, and J. Anderson. Intelligent Tutoring Systems. Handbook of Human-Computer Interaction, pages 839–870, 1997.

[34] A. T. Corbett, J. R. Anderson, and A. T. O'Brien. Student modeling in the ACT programming tutor. Cognitively diagnostic assessment, pages 19–41, 1995.

[35] M. Csikszentmihalyi. Beyond boredom and anxiety. Jossey-Bass, 1975.


[36] M. Csikszentmihalyi. Flow: The psychology of optimal experience. HarperPerennial New York, 1991.

[37] M. Csikszentmihalyi. Creativity: Flow and the Psychology of Discov- ery and Invention. Harper Perennial, 1997.

[38] A. Das, M. Datar, A. Garg, and S. Rajaram. Google News personalization: Scalable online collaborative filtering. In WWW '07: the 16th International Conference on World Wide Web, pages 271–280, 2007.

[39] R. De Ayala. The theory and practice of item response theory. The Guilford Press, 2008.

[40] M. C. Desmarais and R. S. J. de Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Model. User-Adapt. Interact., 22(1-2):9–38, 2012.

[41] Z. Fan, C. Wang, H.-H. Chang, and J. Douglas. Utilizing response time distributions for item selection in CAT. Journal of Educational and Behavioral Statistics, 7(5):655, 2012.

[42] P. J. Ferrando and U. Lorenzo-Seva. An item-response model incorporating response time data in binary personality items. Applied Psychological Measurement, 31:525–543, 2007.

[43] L. Ford and D. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3):399–404, 1956.

[44] E. Friedman. Erich’s puzzle palace, 2006. http://www2.stetson.edu/~efriedma/puzzle.html.

[45] J.-L. Gaviria. Increase in precision when estimating parameters in computer assisted testing using response times. Quality and Quantity, 39:45–69, 2005.

[46] W. D. Gray. The Cambridge Handbook of Computational Psychology, chapter Cognitive Modeling for Cognitive Engineering, pages 565–588. Cambridge University Press, 2008.

[47] J. Greeno. Hobbits and orcs: Acquisition of a sequential concept. Cognitive Psychology, 6(2):270–292, 1974.

[48] Q. Guo and M. Zhang. Implement web learning environment based on data mining. Knowl.-Based Syst., 22:439–442, 2009.


[49] W. Hamalainen and M. Vinni. Comparison of machine learning methods for intelligent tutoring systems. In Proc. Int. Conf. Intell. Tutoring Syst., pages 525–534, 2006.

[50] R. K. Hambleton and R. W. Jones. Comparison of classical test theory and item response theory and their application to test development. Educational measurement: issues and practice, pages 38–47, 1993.

[51] R. J. Harvey and A. L. Hammer. Item response theory. Consulting Psychologists Press, 27:353, 1999.

[52] W. Z. Hirsch. Manufacturing progress functions. The Review of Economics and Statistics, pages 143–155, 1952.

[53] Z. Ibrahim and D. Rusli. Predicting students' academic performance: Comparing artificial neural network, decision tree and linear regression. In Proc. Annu. SAS Malaysia Forum, pages 1–6, 2007.

[54] P. Jarušek, V. Klusáček, and R. Pelánek. Modeling students' learning and variability of performance in problem solving. In Educational Data Mining, to appear, 2013.

[55] P. Jarušek and R. Pelánek. Analýza obtížnosti logických úloh na základě modelu lidského chování. In Kognice a umělý život X, pages 171–176. Slezská univerzita v Opavě, 2010.

[56] P. Jarušek and R. Pelánek. Difficulty rating of Sokoban puzzle. In Proc. of the Fifth Starting AI Researchers' Symposium (STAIRS 2010), pages 140–150. IOS Press, 2010.

[57] P. Jarušek and R. Pelánek. Human problem solving: Sokoban case study. Technical Report FIMU-RS-2010-01, Masaryk University Brno, 2010.

[58] P. Jarušek and R. Pelánek. Problem response theory and its application for tutoring. In Educational Data Mining, pages 374–375, 2011.

[59] P. Jarušek and R. Pelánek. What determines difficulty of transport puzzles? In Proc. of Florida Artificial Intelligence Research Society Conference, pages 428–433. AAAI Press, 2011.

[60] P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. In Proc. of Intelligent Tutoring Systems (ITS), volume 7315 of LNCS, pages 379–388. Springer, 2012.


[61] P. Jarušek and R. Pelánek. Modeling and predicting students problem solving times. In Proc. of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2012), volume 7147 of LNCS, pages 637–648. Springer, 2012.

[62] P. Jarušek and R. Pelánek. Problem solving tutor, 2012.

[63] P. Jarušek and R. Pelánek. A web-based problem solving tool for introductory computer science. In Proceedings of the 17th ACM annual conference on Innovation and technology in computer science education, page 371. ACM, 2012.

[64] T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Information processing letters, 31(1):7–15, 1989.

[65] P. Kantor, F. Ricci, L. Rokach, and B. Shapira. Recommender systems handbook. Springer, 2010.

[66] S. Khan. Khan academy, 2011. www.khanacademy.org.

[67] K. Koedinger, J. Anderson, W. Hadley, and M. Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8(1):30–43, 1997.

[68] K. Koedinger, A. Corbett, S. Ritter, and L. Shapiro. Carnegie Learning's Cognitive Tutor: Summary research results. White paper. Available from Carnegie Learning Inc, 1200, 2000.

[69] Y. Koren and R. Bell. Advances in collaborative filtering. Recommender Systems Handbook, pages 145–186, 2011.

[70] K. Kotovsky, J. Hayes, and H. Simon. Why are some problems hard? Evidence from Tower of Hanoi. Cognitive psychology, 17(2):248–294, 1985.

[71] K. Kotovsky and H. Simon. What Makes Some Problems Really Hard: Explorations in the Problem Space of Difficulty. Cognitive Psychology, 22(2):143–83, 1990.

[72] D. LaBerge. Acquisition of automatic processing in perceptual and associative learning. Attention and performance, 5:50–64, 1975.

[73] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 2003.


[74] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. In IEEE Internet Computing 7 (2003), pages 76–80, 2003.

[75] F. M. Lord. An analysis of the verbal scholastic aptitude test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 28:989–1020, 1968.

[76] F. M. Lord. A theoretical study of two-stage testing. Psychometrika, 36:227–242, 1971.

[77] F. M. Lord. A broad-range tailored test of verbal ability. Applied Psychological Measurement, 1:95–100, 1977.

[78] F. M. Lord. Applications of item response theory to practical testing problems. 1980.

[79] A. Luchins. Mechanization in problem solving: The effect of Einstellung. Psychological Monographs, 54(6):95, 1942.

[80] E. Maris. Additive and multiplicative models for gamma distributed variables, and their application as psychometric models for response times. Psychometrika, 58:445–469, 1993.

[81] B. Martin, A. Mitrovic, K. R. Koedinger, and S. Mathan. Evaluating and improving adaptive educational systems with learning curves. User Modeling and User-Adapted Interaction, 21(3):249–283, 2011.

[82] R. R. Meijer and M. L. Nering. Computerized adaptive testing: Overview and introduction. Applied Psychological Measurement, 23(3):187, 1999.

[83] A. Mitrovic. Fifteen years of constraint-based tutors: what we have achieved and where we are going. User Model. User-Adapt. Interact., 2012.

[84] A. Mitrovic, K. Koedinger, and B. Martin. A comparative analysis of cognitive tutoring and constraint-based modeling. In User Modeling 2003, 9th International Conference, UM 2003, pages 313–322, 2003.

[85] A. Newell and P. Rosenbloom. Mechanisms of skill acquisition and the law of practice. Cognitive skills and their acquisition, pages 1–55, 1981.


[86] A. Newell and H. Simon. GPS, a program that simulates human thought. Computers and thought, pages 279–293, 1963.

[87] I. Ostrovsky. Robozzle online puzzle game, 2009. robozzle.com.

[88] S. Papert. Mindstorms: Children, computers, and powerful ideas. Basic Books, Inc., 1980.

[89] B. Paras and J. Bizzocchi. Game, Motivation, and Effective Learning: An Integrated Model for Educational Game Design. In DiGRA conference, volume 2005, 2005.

[90] Z. Pardos, J. E. Beck, C. Ruiz, and N. Heffernan. Predicting students' academic performance: Comparing artificial neural network, decision tree and linear regression. In Proc. Int. Conf. Educ. Data Mining, pages 147–156, 2008.

[91] R. Pattis. Karel the robot: a gentle introduction to the art of programming. John Wiley & Sons, Inc., 1994.

[92] P. Pavlik, H. Cen, and K. Koedinger. Learning factors transfer analysis: Using learning curve analysis to automatically generate domain models. In Proc. Int. Conf. Educ. Data Mining, pages 121–130, 2009.

[93] R. Pelánek. Difficulty rating of sudoku puzzles by a computational model. In Proc. of Florida Artificial Intelligence Research Society Conference (FLAIRS 2011), 2011.

[94] A. d. Peloux, D. Skinner, M. Hiroshi, and M. H. Caussa. Sokoban levels. http://www.sourcecode.se/sokoban/levels.php.

[95] Z. Pizlo. Human Problem Solving in 2006. The Journal of Problem Solving, 1(2):3, 2007.

[96] Z. Pizlo and Z. Li. Solving combinatorial problems: The 15-puzzle. Memory and Cognition, 33(6):1069, 2005.

[97] S. Roberts and H. Pashler. How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2):358–367, 2000.

[98] C. Romero and S. Ventura. Educational data mining: A review of the state of the art. International Journal of Artificial Intelligence in Education, 40(6):601, 2010.


[99] E. E. Roskam. Models for speed and time-limit tests. Handbook of modern item response theory, pages 187–208, 1997.

[100] B. Sarwar, G. J. Karypis, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. 10th International Conference on the World Wide Web, pages 285–295, 2001.

[101] J. Schaeffer and H. Van den Herik. Games, computers, and artificial intelligence. Artificial Intelligence, 134(1-2):1–8, 2002.

[102] H. Scheiblechner. Specific objective stochastic latency mechanisms. Journal of Mathematical Psychology, 19:18–38, 1979.

[103] D. L. Schnipke and D. J. Scrams. Representing response time information in item banks. LSAC Computerized Testing Report No. 97-09, 1997.

[104] H. Simon and A. Newell. Human problem solving. Prentice Hall, 1972.

[105] J. Stamper and T. Barnes. Unsupervised MDP value selection for automating ITS capabilities. In Proc. Int. Conf. Educ. Data Mining, pages 121–130, 2009.

[106] G. Takács et al. Matrix factorization and neighbor based algorithms for the Netflix Prize problem. In Proceedings of the 2008 ACM conference on Recommender systems, 2008.

[107] D. Thissen. Timed testing: An approach using item response theory. New horizons in testing: Latent trait test theory and computerized adaptive testing, pages 179–203, 1983.

[108] L. R. Tucker. Maximum validity of a test with equivalent items. Psychometrika, 1:1–13, 1946.

[109] L. R. Tucker. A theory of test scores. Psychometric Monograph, 7, 1952.

[110] W. Van der Linden. A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2):181, 2006.

[111] W. Van der Linden. Using response times for item selection in adap- tive testing. Journal of Educational and Behavioral Statistics, 33(1):5, 2008.


[112] W. Van der Linden. Conceptual issues in response-time modeling. Journal of Educational Measurement, 46(3):247, 2009.

[113] W. Van Der Linden. Conceptual issues in response-time modeling. Journal of Educational Measurement, 46(3):247–272, 2009.

[114] W. Van der Linden. Predictive control of speededness in adaptive testing. Research Report 07-02, 2009.

[115] W. Van der Linden and X. Xiong. Speededness and adaptive testing. Journal of Educational and Behavioral Statistics, 0(0):1, 2008.

[116] W. J. Van Der Linden, D. J. Scrams, and D. L. Schnipke. Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23(3):195, 1999.

[117] W. J. van der Linden and E. M. L. A. van Krimpen-Stoop. Using response times to detect aberrant response patterns in computerized adaptive testing. Psychometrika, 68:251–265, 2003.

[118] K. Vanlehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16(3):227–265, 2006.

[119] W. J. J. Veerkamp and M. P. F. Berger. Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22:203–226, 1997.

[120] N. D. Verhelst, H. H. F. M. Verstraalen, and M. G. Jansen. A logistic model for time-limit tests. Handbook of modern item response theory, pages 169–185, 1997.

[121] C. Vialardi, J. Bravo, L. Shafti, and A. Ortigosa. Web usage mining for a better web-based learning environment. In Proc. Int. Conf. Educ. Conf., pages 190–198, 2009.

[122] T. Wang and B. A. Hanson. Development and calibration of an item response model that incorporates response time. Applied Psychological Measurement, 29:323–339, 2005.

[123] D. J. Weiss. Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 4:273 – 285, 1971.


[124] D. J. Weiss and G. G. Kingsbury. Application of computerized adaptive testing to educational problems. Journal of educational measurement, 21:361–375, 1984.

[125] R. Wilson and F. Keil. The MIT encyclopedia of the cognitive sciences. MIT Press, 1999.

[126] B. D. Wright. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14:97–116, 1977.

[127] O. Zaiane and J. Luo. Web usage mining for a better web-based learning environment. In Proc. Conf. Adv. Technol. Educ., pages 60–64, 2001.

[128] C. Zhang and S. Zhang. Association rule mining: Models and algorithms. Lecture Notes in Artificial Intelligence, 2002.

A First Appendix

A.1 Author’s Contribution

A.1.1 Conference Papers

• P. Jarušek, V. Klusáček, and R. Pelánek. Modeling students' learning and variability of performance in problem solving. In Educational Data Mining, to appear, 2013. [Author's contribution: 50%]

• P. Jarušek and R. Pelánek. A web-based problem solving tool for introductory computer science. In Proceedings of the 17th ACM annual conference on Innovation and technology in computer science education, page 371. ACM, 2012. [Author's contribution: 60%]

• P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. In Proc. of Intelligent Tutoring Systems (ITS), volume 7315 of LNCS, pages 379–388. Springer, 2012. [Author’s contribution: 40%]

• P. Jarušek and R. Pelánek. Modeling and predicting students problem solving times. In Proc. of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2012), volume 7147 of LNCS, pages 637–648. Springer, 2012. [Author’s contribution: 50%]

• P. Jarušek and R. Pelánek. Problem response theory and its application for tutoring. In Educational Data Mining, pages 374–375, 2011. [Author's contribution: 40%]

• P. Jarušek and R. Pelánek. What determines difficulty of transport puzzles? In Proc. of Florida Artificial Intelligence Research Society Conference, pages 428–433. AAAI Press, 2011. [Author's contribution: 60%]

• P. Jarušek and R. Pelánek. Difficulty rating of Sokoban puzzle. In Proc. of the Fifth Starting AI Researchers' Symposium (STAIRS 2010), pages 140–150. IOS Press, 2010. [Author's contribution: 50%]

• P. Jarušek and R. Pelánek. Analýza obtížnosti logických úloh na základě modelu lidského chování. In Kognice a umělý život X, pages 171–176. Slezská univerzita v Opavě, 2010. [Author's contribution: 50%]


A.1.2 Software

• P. Jarušek and R. Pelánek. Problem solving tutor, 2012. [Author's contribution: 75%]

A.1.3 Technical Reports

• P. Jarušek and R. Pelánek. Human problem solving: Sokoban case study. Technical Report FIMU-RS-2010-01, Masaryk University Brno, 2010. [Author's contribution: 40%]
