PROBABILISTIC MODELS OF LEARNING AND MEMORY
WORKING MEMORY & EPISODIC MEMORY

MÁTÉ LENGYEL

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

CEU, Budapest, 22-26 June 2009
http://www.eng.cam.ac.uk/~m.lengyel

MULTIPLE INTERACTING MEMORY SYSTEMS

[slide figure: schematic of multiple interacting memory systems, with check marks highlighting the systems discussed so far]

LONG-TERM MEMORY → SHORT-TERM MEMORY

[figure, Chase & Simon, 1973, Fig. 1: learning curves of the master (M), class A player (A) and beginner (B) for middle-game and random middle-game positions; number of pieces placed correctly over trials 1-7; brackets are standard errors over five positions]

From Chase & Simon (1973): for middle-game and end-game positions taken from actual games, the ability to retain information from a 5-sec view of the board was closely related to playing strength; for random, unstructured positions there was no relation at all between memory of the position and playing strength.

• short-term memory is capacity limited
‣ capacity traditionally defined by number of items (or chunks)
‣ information content does not influence capacity (Miller, 1956)
• knowledge held in long-term memory affects apparent short-term memory capacity (Chase & Simon, 1973)
‣ role of long-term memory is to define chunks (toy sketch after this slide)
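A minimal toy sketch (Python; the chunk dictionary, the capacity value and the sequences are invented for illustration, not taken from the slides) of why a fixed capacity in chunks lets long-term knowledge inflate apparent short-term capacity: structured material parses into a few familiar chunks, random material does not.

import random

# hypothetical chunk dictionary held in long-term memory: familiar 3-piece configurations
CHUNKS = {("K", "R", "P"), ("Q", "B", "N"), ("R", "N", "P"), ("K", "Q", "R")}
CAPACITY = 7  # short-term capacity measured in chunks, not items (Miller, 1956)

def items_recalled(sequence, chunk_size=3):
    """Greedily parse the sequence into known chunks (one slot each) or single
    items (one slot each) and count how many raw items fit into CAPACITY slots."""
    slots, items, i = 0, 0, 0
    while i < len(sequence) and slots < CAPACITY:
        window = tuple(sequence[i:i + chunk_size])
        if window in CHUNKS:          # familiar chunk: several items, one slot
            items += len(window)
            i += chunk_size
        else:                         # unfamiliar item: one item, one slot
            items += 1
            i += 1
        slots += 1
    return items

structured = ["K", "R", "P", "Q", "B", "N", "R", "N", "P", "K", "Q", "R"]  # "game-like" position
scrambled = structured[:]
random.shuffle(scrambled)                                                  # "random" position
print(items_recalled(structured), items_recalled(scrambled))               # e.g. 12 vs about 7-9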

DESCRIPTION LENGTH

[experimental design: familiarization ('pay attention') → short-term test ('same or different?') → long-term test ('which one looks more familiar?')]

visual scenes x_1, x_2, …, x_n → predictive distribution P̂(x) → description length DL(x) ∝ log 1/P̂(x)
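A hedged sketch (Python with numpy; the independent-shape predictive model is an assumption made for illustration, not the model used in the study) of assigning a description length to a scene: fit P̂ to the familiarization scenes, then score a scene by DL(x) = −log2 P̂(x).

import numpy as np

rng = np.random.default_rng(0)

# toy scenes: a binary vector indicating which of 10 possible shapes is present
familiarization = rng.random((200, 10)) < 0.3
test_scenes = rng.random((5, 10)) < 0.3

# assumed predictive distribution: independent Bernoulli probability per shape
p_shape = familiarization.mean(axis=0).clip(1e-3, 1 - 1e-3)

def description_length_bits(scene, p=p_shape):
    """DL(x) = -log2 P_hat(x): less predictable scenes get longer codes."""
    log_p = scene * np.log2(p) + (~scene) * np.log2(1 - p)
    return -log_p.sum()

for scene in test_scenes:
    print(f"{description_length_bits(scene):.1f} bits")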

INDIVIDUAL DIFFERENCES

• candidate models of each subject: ideal learner, quasi-ideal learners, naive learner (each characterized by a different distribution P over scenes)

[figures: subjects' percent correct plotted against the ideal learner's percent correct (50-100%); histograms of prediction errors (percent correct, roughly −50 to +50); percent correct, and its deviation from the prediction, as a function of set size (2-8); summary comparison of naive vs trained subjects and of good vs bad learners, with significant differences starred]

Lengyel & al, in prep
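A minimal sketch (Python/numpy; the per-subject numbers are placeholders, not data from the study) of the comparison the scatter plot and histograms summarize: relate each subject's accuracy to the ideal learner's accuracy and inspect the prediction errors.

import numpy as np

rng = np.random.default_rng(1)

# hypothetical per-subject accuracies (percent correct) on the familiarity test
ideal = rng.uniform(55, 95, size=40)                   # ideal-learner accuracy on each subject's stimuli
subjects = 0.8 * ideal + 10 + rng.normal(0, 6, 40)     # subjects track the ideal learner, noisily

slope, intercept = np.polyfit(ideal, subjects, 1)      # least-squares fit
prediction_error = subjects - (slope * ideal + intercept)

print(f"slope = {slope:.2f}, intercept = {intercept:.1f}")
print(f"prediction error: mean = {prediction_error.mean():.1f}, sd = {prediction_error.std():.1f} (percent correct)")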

EFFECTS OF DESCRIPTION LENGTH

[figures: relative performance (percent correct) for low-, mid- and high-DL scenes, separately for bad and good learners (significant DL effect in good learners); set size plotted against DL (bits, roughly 20-60); performance (percent correct) as a function of set size (2-8) and as a function of DL (bits), with significant trends starred]

Lengyel & al, in prep
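A sketch (Python/numpy; the trial data are invented for illustration) of the kind of binning behind the low/mid/high-DL comparison: split trials by description-length terciles and compare mean accuracy per bin for good and bad learners.

import numpy as np

rng = np.random.default_rng(2)
n_trials = 300
dl = rng.uniform(20, 60, n_trials)                     # description length of each trial's scene (bits)
good = rng.integers(0, 2, n_trials).astype(bool)       # hypothetical good-learner flag per trial
# hypothetical accuracies: good learners are hurt by high-DL scenes, bad learners are flat
p_correct = np.where(good, 0.9 - 0.004 * (dl - 20), 0.65)
correct = rng.random(n_trials) < p_correct

bins = np.digitize(dl, np.quantile(dl, [1 / 3, 2 / 3]))   # 0 = low, 1 = mid, 2 = high DL
for label, mask in [("good", good), ("bad", ~good)]:
    means = [100 * correct[mask & (bins == b)].mean() for b in range(3)]
    print(label, "learners, % correct (low, mid, high DL):", [round(m, 1) for m in means])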

EPISODIC MEMORY = ROTE LEARNING: HOW?

• effects of time: forgetting, practice, …
• effects of context: semantic "pop-out", false memories, …
• assume resource limitations for retrieval (e.g. serially considering potential traces for retrieval)
• important quantity to compute: the need probability (or need odds), P(need A | history of A, context)
• related to experimentally measurable quantities, e.g. recall probability (sketch after this slide):

  log [ P(recall A | history of A, context) / (1 − P(recall A | history of A, context)) ]
    = ( log [ P(need A | history of A, context) / (1 − P(need A | history of A, context)) ] − θ ) / s

• match need probabilities to environmental statistics → analyse corpora

Anderson & Schooler, 1991
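A hedged sketch (Python/numpy; the θ and s values are arbitrary) of this mapping: predicted recall probability is a logistic function of the need log-odds, with threshold θ and noise scale s.

import numpy as np

def recall_probability(p_need, theta=-4.0, s=1.0):
    """log-odds(recall) = (log-odds(need) - theta) / s."""
    need_log_odds = np.log(p_need / (1 - p_need))
    recall_log_odds = (need_log_odds - theta) / s
    return 1 / (1 + np.exp(-recall_log_odds))

for p in [0.001, 0.01, 0.05, 0.2]:
    print(f"P(need) = {p:.3f}  ->  predicted P(recall) = {recall_probability(p):.2f}")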

EFFECTS OF HISTORY: FORGETTING

[timeline: past retrievals of an item, then a retention delay before the test]

[figures: retention as a function of delay in experiments (Hermann Ebbinghaus) compared with need probability as a function of time since last use, estimated from environmental statistics (John Anderson)]

Anderson & Schooler, 1991
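A sketch (Python/numpy; the "environment" is synthetic, standing in for the corpora Anderson & Schooler analysed) of estimating need probability as a function of how long ago an item last occurred.

import numpy as np

rng = np.random.default_rng(3)

# synthetic environment: items (e.g. words in daily headlines) recur with item-specific base rates
n_items, n_days = 2000, 400
rates = 10 ** rng.uniform(-2.5, -0.5, n_items)            # per-day occurrence probability
occurs = rng.random((n_items, n_days)) < rates[:, None]

test_day = n_days - 1
last_seen = np.array([np.max(np.nonzero(row[:test_day])[0], initial=-1) for row in occurs])
delay = test_day - last_seen
needed = occurs[:, test_day]

# need probability falls off with the delay since the last occurrence
for lo, hi in [(1, 3), (3, 10), (10, 30), (30, 100)]:
    mask = (last_seen >= 0) & (delay >= lo) & (delay < hi)
    if mask.any():
        print(f"delay {lo:>3}-{hi:<3} days: P(need) = {needed[mask].mean():.3f}")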

time

Hermann Ebbinghaus

John Anderson Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 9 EFFECTS OF HISTORY: PRACTICE retrievals test

time frequency

! "# $

Hermann Ebbinghaus

John Anderson Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 9 EFFECTS OF HISTORY: PRACTICE retrievals test

time frequency

! "# $

Hermann Ebbinghaus

John Anderson Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 9 EFFECTS OF HISTORY: PRACTICE retrievals test

time frequency

! "# $

Hermann Ebbinghaus

John Anderson Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 9 EFFECTS OF HISTORY: PRACTICE retrievals test

time frequency

! "# $

Hermann Ebbinghaus Anderson & Schooler, 1991 John Anderson Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 9 EFFECTS OF HISTORY: SPACING retrievals test
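The same synthetic-environment idea for practice (Python/numpy; illustrative only): estimate the probability that an item is needed today as a function of how often it occurred in a recent window.

import numpy as np

rng = np.random.default_rng(4)
n_items, n_days = 2000, 130
rates = 10 ** rng.uniform(-2.5, -0.5, n_items)
occurs = rng.random((n_items, n_days)) < rates[:, None]

window, test_day = 100, n_days - 1
frequency = occurs[:, test_day - window:test_day].sum(axis=1)   # past uses in the window
needed = occurs[:, test_day]

# need probability rises with the frequency of past use
for f in range(6):
    mask = frequency == f
    if mask.any():
        print(f"{f} past uses: P(need) = {needed[mask].mean():.3f}")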

EFFECTS OF HISTORY: SPACING

[timeline: past retrievals separated by a spacing interval, followed by a retention delay before the test]

[figures: spacing effects in experiments compared with the corresponding corpus statistics: need probability as a joint function of spacing and retention delay]

Anderson & Schooler, 1991
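A sketch of the spacing analysis (Python/numpy; a deliberately bursty synthetic environment is assumed so that spacing carries information, as it does in real corpora): among items seen exactly twice in a past period, estimate the probability of being needed at a much later test as a function of the spacing between the two occurrences.

import numpy as np

rng = np.random.default_rng(5)
n_items, n_days = 40000, 60
history, test_day = 30, n_days - 1

# bursty environment: half the items are in steady use, half are only briefly "topical"
steady = rng.random(n_items) < 0.5
occurs = rng.random((n_items, n_days)) < np.where(steady, 0.06, 0.0)[:, None]
burst_start = rng.integers(0, history - 5, n_items)
for i in np.nonzero(~steady)[0]:                      # topical items: one short high-rate burst
    occurs[i, burst_start[i]:burst_start[i] + 5] |= rng.random(5) < 0.4

twice = occurs[:, :history].sum(axis=1) == 2          # items seen exactly twice in the past period
needed = occurs[:, test_day]
spacing = np.full(n_items, -1)
for i in np.nonzero(twice)[0]:
    d1, d2 = np.nonzero(occurs[i, :history])[0]
    spacing[i] = d2 - d1

# widely spaced pairs tend to come from items in steady use, so they are more likely to be needed later
for lo, hi in [(1, 4), (4, 10), (10, 30)]:
    mask = (spacing >= lo) & (spacing < hi)
    if mask.any():
        print(f"spacing {lo:>2}-{hi:<2} days: P(need at test) = {needed[mask].mean():.3f}")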

DISCOVERING SEMANTIC TOPICS

topics:     P(word = i | topic = j) = θ^t_ij
documents:  P(topic = j | document = k) = θ^d_jk

[figure, adapted from Steyvers & al, 2006, Fig. 1: (a) example topics extracted by the LDA model, each a probability distribution over words; the highest-probability words of four example topics relate to music, literature/drama, rivers and reading; (b) two documents with each word token assigned to a topic; assignments depend on document context, so 'play' is assigned to the music topic in one document and to the literature/drama topic in the other]

Steyvers & al, 2006
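A hedged sketch (Python; scikit-learn's LatentDirichletAllocation is used here as a stand-in for the Gibbs-sampling approach described in the review) of extracting θ^t (word given topic) and θ^d (topic given document) from a tiny corpus.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the band played music and the song had rhythm",
    "she sang a song and played piano music",
    "the river flows through the valley and floods the land",
    "rivers carry water from the dam through the valley",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                      # word counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# theta^t: P(word | topic), one row per topic
word_given_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
vocab = vectorizer.get_feature_names_out()
for j, row in enumerate(word_given_topic):
    print(f"topic {j}:", list(vocab[row.argsort()[::-1][:4]]))

# theta^d: P(topic | document), one row per document
print(lda.transform(X).round(2))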

Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 12 Review TRENDS in Cognitive Sciences Vol.10 No.7 July 2006 329

word with some other word. The resulting high-dimen- material), and then applies matrix decomposition tech- sional space has been shown to capture neighborhood niques to reduce the dimensionality of the original effects in lexical decision and naming [23]. matrix to a much smaller size while preserving as The Latent Semantic Analysis (LSA) model also much as possible the covariation structure of words and derives a high-dimensional semantic space for words documents. The dimensionality reduction allows words but uses the co-occurrence information not between with similar meaning to have similar vector represen- words and words but between words and the passages tations even though they might never have co-occurred they occur in [24–26]. The model starts with a matrix of in the same document, and can thus result in a more counts of the number of times a word occurs in a set of accurate representation of the relationships documents (usually extracted from educational text between words.

Box 2. Probabilistic topic models

A variety of probabilistic topic models have been used to analyze the Typically, these words exhibit strong semantic associations with the content of documents and the meaning of words [31,13–15,32]. These words that did appear on the study list [47,48]. To evaluate the effects models all use the same fundamental idea – that a document is a mixture of semantic association on human memory, word association norms of topics – but make slightly different statistical assumptions. To have been developed to measure the associative strength between introduce notation, we will write P(z) for the distribution over topics z pairs of words. These norms are typically collected by showing a in a particular document and P(w z) for the probability distribution over participant a cue word and asking them to write down the first word j words w given topic z. Several topic-word distributions P(w z) were that comes to mind. Word association norms exist for over 5000 j illustrated in Figure 1, each giving different weight to thematically related words, with hundreds of participants providing responses for each cue

words. Each word wi in a document (where the index refers to the i th word [28]. word token) is generated by first sampling a topic from the topic In the topic model, word association can be thought of as a problem distribution, then choosing a word from the topic-word distribution. We of prediction. Given that a cue if presented, what new words might

write P(ziZj) as the probability that the j th topic was sampled for the i th occur next in that context? More formally, the problem is to predict the word token and P(w z Zj ) as the probability of word w under topic j. The conditional probability of word w (the response word) given the cue ij i i 2 model specifies the following distribution over words within a document: word w1. The first step in making this prediction is determining which topic w is likely to have been drawn from. This can be done by T 1 P w Z P w z Z j P z Z j (Eqn I) applying Bayes’ rule, with ð i Þ ð i j i Þ ð i Þ jZ1 P z Z j w fP w z Z j P z Z j (Eqn II) X ð j 1Þ ð 1j Þ ð Þ

where T is the number of topics. The two terms on the right hand side It is then possible to predict w2, summing over all of the topics that indicate which words are important for which topic and which topics could have generated w1. The resulting conditional probability is are important for a particular document, respectively. Several statistical techniques can be used to infer these quantities in a T Z Z Z completely unsupervised fashion from a collection of documents. The P w2 w1 P w2 z j P z j w1 (Eqn III) ð j Þ jZ1 ð j Þ ð j Þ result is a representation for words (in terms of their probabilities X under the different topics) and for documents (in terms of the probabilities of topics appearing in those documents). The set of which we can use to model word association. topics we use in this article were found using Gibbs sampling, a Figure I (a) shows the observed and predicted word associations for Markov chain Monte Carlo technique (see the online article by Griffiths the word ‘PLAY’. Figure I(b) compares the performance of the topic and Yuille: Supplementary material online; and Refs [15,16]). model and LSA in predicting the first associate in the word association The associative semantic structure of words plays an important role norms. The topic model outperforms LSA slightly when either the in episodic memory. For example, participants performing a free recall cosine or the inner product between word vectors is used as a task sometimes produce responsesEFFECTS that were not on the study OF list [47]. CONTEXT:measure of word association. VIA TOPICS (a) (b) Humans Topics Topics LSA 40 40 Cosine FUN BALL BALL GAME Inner GAME CHILDREN product WORK ROLE 30 30 GROUND GAMES MATE MUSIC CHILD BASEBALL 20 ENJOY HIT 20 WIN FUN ACTOR TEAM Median rank FIGHT IMPORTANT 10 10 HORSE BAT KID RUN MUSIC STAGE 0 0

free recall 0.00 0.05 0.10 0.15 0.000 0.025 0.050 300 500 700 900 1100130015001700 P(wordP( w | 'PLAY')2 word1 P(= w | ‘play’)'PLAY') No. of Topics Figure I. (a) Observed and predicted response distributions for the word PLAY. The responses from humans reveal that associations can be based on different senses of the () | cue word (e.g. PLAY–BALL and PLAY–ACTOR). The model predictions are based on a 500 topic solution from the TASA corpus. Note that the model gives similar responses to humans although the ordering is different. One way to score the model is to measure the rank of the first associate (e.g. FUN), which should be as low as possible. (b) Median rank of the first associate as predicted by the topic model and LSA. Note that the dimensionality in LSA has been optimized. Adapted from [14,16].

www.sciencedirect.com

Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 12 Review TRENDS in Cognitive Sciences Vol.10 No.7 July 2006 329

word with some other word. The resulting high-dimen- material), and then applies matrix decomposition tech- sional space has been shown to capture neighborhood niques to reduce the dimensionality of the original effects in lexical decision and naming [23]. matrix to a much smaller size while preserving as The Latent Semantic Analysis (LSA) model also much as possible the covariation structure of words and derives a high-dimensional semantic space for words documents. The dimensionality reduction allows words but uses the co-occurrence information not between with similar meaning to have similar vector represen- words and words but between words and the passages tations even though they might never have co-occurred they occur in [24–26]. The model starts with a matrix of in the same document, and can thus result in a more counts of the number of times a word occurs in a set of accurate representation of the relationships documents (usually extracted from educational text between words.

Box 2. Probabilistic topic models

A variety of probabilistic topic models have been used to analyze the Typically, these words exhibit strong semantic associations with the content of documents and the meaning of words [31,13–15,32]. These words that did appear on the study list [47,48]. To evaluate the effects models all use the same fundamental idea – that a document is a mixture of semantic association on human memory, word association norms of topics – but make slightly different statistical assumptions. To have been developed to measure the associative strength between introduce notation, we will write P(z) for the distribution over topics z pairs of words. These norms are typically collected by showing a in a particular document and P(w z) for the probability distribution over participant a cue word and asking them to write down the first word j words w given topic z. Several topic-word distributions P(w z) were that comes to mind. Word association norms exist for over 5000 j illustrated in Figure 1, each giving different weight to thematically related words, with hundreds of participants providing responses for each cue

words. Each word wi in a document (where the index refers to the i th word [28]. word token) is generated by first sampling a topic from the topic In the topic model, word association can be thought of as a problem distribution, then choosing a word from the topic-word distribution. We of prediction. Given that a cue if presented, what new words might

write P(ziZj) as the probability that the j th topic was sampled for the i th occur next in that context? More formally, the problem is to predict the word token and P(w z Zj ) as the probability of word w under topic j. The conditional probability of word w (the response word) given the cue ij i i 2 model specifies the following distribution over words within a document: word w1. The first step in making this prediction is determining which topic w is likely to have been drawn from. This can be done by T 1 P w Z P w z Z j P z Z j (Eqn I) applying Bayes’ rule, with ð i Þ ð i j i Þ ð i Þ jZ1 P z Z j w fP w z Z j P z Z j (Eqn II) X ð j 1Þ ð 1j Þ ð Þ

where T is the number of topics. The two terms on the right hand side It is then possible to predict w2, summing over all of the topics that indicate which words are important for which topic and which topics could have generated w1. The resulting conditional probability is are important for a particular document, respectively. Several statistical techniques can be used to infer these quantities in a T Z Z Z completely unsupervised fashion from a collection of documents. The P w2 w1 P w2 z j P z j w1 (Eqn III) ð j Þ jZ1 ð j Þ ð j Þ result is a representation for words (in terms of their probabilities X under the different topics) and for documents (in terms of the probabilities of topics appearing in those documents). The set of which we can use to model word association. topics we use in this article were found using Gibbs sampling, a Figure I (a) shows the observed and predicted word associations for Markov chain Monte Carlo technique (see the online article by Griffiths the word ‘PLAY’. Figure I(b) compares the performance of the topic and Yuille: Supplementary material online; and Refs [15,16]). model and LSA in predicting the first associate in the word association The associative semantic structure of words plays an important role norms. The topic model outperforms LSA slightly when either the in episodic memory. For example, participants performing a free recall cosine or the inner product between word vectors is used as a task sometimes produce responsesEFFECTS that were not on the study OF list [47]. CONTEXT:measure of word association. VIA TOPICS (a) (b) Humans Topics Topics LSA 40 40 Cosine FUN BALL BALL GAME Inner GAME CHILDREN product WORK ROLE 30 30 GROUND GAMES MATE MUSIC CHILD BASEBALL 20 ENJOY HIT 20 WIN FUN rank=9 ACTOR TEAM Median rank FIGHT IMPORTANT 10 10 HORSE BAT KID RUN MUSIC STAGE 0 0

free recall 0.00 0.05 0.10 0.15 0.000 0.025 0.050 300 500 700 900 1100130015001700 P(wordP( w | 'PLAY')2 word1 P(= w | ‘play’)'PLAY') No. of Topics Figure I. (a) Observed and predicted response distributions for the word PLAY. The responses from humans reveal that associations can be based on different senses of the (semantic memory) | cue word (e.g. PLAY–BALL and PLAY–ACTOR). The model predictions are based on a 500 topic solution from the TASA corpus. Note that the model gives similar responses to humans although the ordering is different. One way to score the model is to measure the rank of the first associate (e.g. FUN), which should be as low as possible. (b) Median rank of the first associate as predicted by the topic model and LSA. Note that the dimensionality in LSA has been optimized. Adapted from [14,16].

www.sciencedirect.com

Probabilistic models of learning and memory — Working memory and episodic memory CEU, Budapest, 22-26 June 2009 http://www.eng.cam.ac.uk/~m.lengyel 12 ReviewReview TRENDSTRENDS in in Cognitive Cognitive Sciences Sciences Vol.10Vol.10 No.7 No.7 July July 2006 2006 329 329

wordword with with some some other other word. word. The The resulting resulting high-dimen- high-dimen- material),material), and and then then applies applies matrix matrix decomposition decomposition tech- tech- sionalsional space space has has been been shown shown to to capture capture neighborhood neighborhood niquesniques to to reduce reduce the the dimensionality dimensionality of of the the original original effectseffects in in lexical lexical decision decision and and naming naming[23][23]. . matrixmatrix to to a a much much smaller smaller size size while while preserving preserving as as TheThe Latent Latent Semantic Semantic Analysis Analysis (LSA) (LSA) model model also also muchmuch as as possible possible the the covariation covariation structure structure of of words words and and derivesderives a a high-dimensional high-dimensional semantic semantic space space for for words words documents.documents. The The dimensionality dimensionality reduction reduction allows allows words words butbut uses uses the the co-occurrence co-occurrence information information not not between between withwith similar similar meaning meaning to to have have similar similar vector vector represen- represen- wordswords and and words words but but between between words words and and the the passages passages tationstations even even though though they they might might never never have have co-occurred co-occurred theythey occur occur in in[24–26][24–26].. The The model model starts starts with with a a matrix matrix of of inin the the same same document, document, and and can can thus thus result result in in a a more more countscounts of of the the number number of of times times a a word word occurs occurs in in a a set set of of accurateaccurate representation representation of of the the relationships relationships documentsdocuments (usually (usually extracted extracted from from educational educational text text betweenbetween words. words.

BoxBox 2. 2. Probabilistic Probabilistic topic topic models models

A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words [31,13–15,32]. These models all use the same fundamental idea – that a document is a mixture of topics – but make slightly different statistical assumptions. To introduce notation, we will write P(z) for the distribution over topics z in a particular document and P(w|z) for the probability distribution over words w given topic z. Several topic–word distributions P(w|z) were illustrated in Figure 1, each giving different weight to thematically related words. Each word w_i in a document (where the index refers to the ith word token) is generated by first sampling a topic from the topic distribution, then choosing a word from the topic–word distribution. We write P(z_i = j) as the probability that the jth topic was sampled for the ith word token and P(w_i | z_i = j) as the probability of word w_i under topic j. The model specifies the following distribution over words within a document:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j) \qquad \text{(Eqn I)}$$

where T is the number of topics. The two terms on the right hand side indicate which words are important for which topic and which topics are important for a particular document, respectively. Several statistical techniques can be used to infer these quantities in a completely unsupervised fashion from a collection of documents. The result is a representation for words (in terms of their probabilities under the different topics) and for documents (in terms of the probabilities of topics appearing in those documents). The set of topics we use in this article were found using Gibbs sampling, a Markov chain Monte Carlo technique (see the online article by Griffiths and Yuille: Supplementary material online; and Refs [15,16]).

The associative semantic structure of words plays an important role in episodic memory. For example, participants performing a free recall task sometimes produce responses that were not on the study list [47]. Typically, these words exhibit strong semantic associations with the words that did appear on the study list [47,48]. To evaluate the effects of semantic association on human memory, word association norms have been developed to measure the associative strength between pairs of words. These norms are typically collected by showing a participant a cue word and asking them to write down the first word that comes to mind. Word association norms exist for over 5000 words, with hundreds of participants providing responses for each cue word [28].

In the topic model, word association can be thought of as a problem of prediction. Given that a cue is presented, what new words might occur next in that context? More formally, the problem is to predict the conditional probability of word w_2 (the response word) given the cue word w_1. The first step in making this prediction is determining which topic w_1 is likely to have been drawn from. This can be done by applying Bayes' rule, with

$$P(z = j \mid w_1) \propto P(w_1 \mid z = j)\, P(z = j) \qquad \text{(Eqn II)}$$

It is then possible to predict w_2, summing over all of the topics that could have generated w_1. The resulting conditional probability is

$$P(w_2 \mid w_1) = \sum_{j=1}^{T} P(w_2 \mid z = j)\, P(z = j \mid w_1) \qquad \text{(Eqn III)}$$

which we can use to model word association.

Figure I(a) shows the observed and predicted word associations for the word 'PLAY'. Figure I(b) compares the performance of the topic model and LSA in predicting the first associate in the word association norms. The topic model outperforms LSA slightly when either the cosine or the inner product between word vectors is used as a measure of word association.
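A small sketch of Eqns I–III may help make the prediction computation concrete. The topic–word matrix P(w|z) and topic distribution P(z) below are invented toy numbers, not the 500-topic TASA solution described in the text; the sketch only shows how the cue re-weights the topics (Eqn II) and how the associate distribution is obtained by summing over topics (Eqn III).

```python
import numpy as np

# Toy topic model over a 6-word vocabulary (all numbers are invented).
vocab = ["play", "ball", "game", "actor", "stage", "fun"]

# P(w | z): one row per topic, each row a distribution over the vocabulary.
P_w_given_z = np.array([
    [0.25, 0.30, 0.30, 0.00, 0.00, 0.15],   # a sports-like topic
    [0.25, 0.00, 0.05, 0.35, 0.30, 0.05],   # a theatre-like topic
])
# P(z): topic mixture for the current document/context.
P_z = np.array([0.5, 0.5])

# Eqn I: P(w) = sum_j P(w | z = j) P(z = j)
P_w = P_z @ P_w_given_z

# Eqn II: P(z = j | w1) is proportional to P(w1 | z = j) P(z = j)
cue = vocab.index("play")
P_z_given_cue = P_w_given_z[:, cue] * P_z
P_z_given_cue /= P_z_given_cue.sum()

# Eqn III: P(w2 | w1) = sum_j P(w2 | z = j) P(z = j | w1)
P_w2_given_cue = P_z_given_cue @ P_w_given_z

# Note: Eqn III also assigns probability to the cue word itself; when reading
# off associates one would typically look at the remaining words.
for w, p in sorted(zip(vocab, P_w2_given_cue), key=lambda x: -x[1]):
    print(f"P({w} | play) = {p:.3f}")
```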
EFFECTS OF CONTEXT: free recall (semantic memory) VIA TOPICS (Steyvers et al., 2006)

Figure I. (a) Observed and predicted response distributions for the word PLAY. The responses from humans reveal that associations can be based on different senses of the cue word (e.g. PLAY–BALL and PLAY–ACTOR). The model predictions are based on a 500-topic solution from the TASA corpus. Note that the model gives similar responses to humans although the ordering is different. One way to score the model is to measure the rank of the first associate (e.g. FUN), which should be as low as possible. (b) Median rank of the first associate as predicted by the topic model and LSA (cosine and inner-product measures), as a function of the number of topics; the dimensionality in LSA has been optimized. Adapted from [14,16].



ENCODING / RECONSTRUCTION (dual-route topic model)

Figure 4. Example encoding and reconstruction of a list of words with the dual-route topic model. Note that the topic distribution is truncated and only shows the top 3 topics. Similarly, the special-word and retrieval distributions only show the top 9 words from a vocabulary of 26,000+ words.

4.4. Explaining false memory effects

The dual-route topic model can also be used to explain false memory effects (Deese, 1959; McEvoy, Nelson, & Komatsu, 1999; Roediger, Watson, McDermott, & Gallo, 2001). In a typical experiment that elicits the false memory effect, participants study a list of words that are associatively related to one word, the lure word, which is not presented on the list. At test, participants are instructed to recall only the words from the study list, but they falsely recall the lure word with high probability (in some cases the lure word is recalled more often than list words). Results of this kind have led to the development of dual-route memory models in which verbatim-level information supports accurate recall, whereas gist-level information, activated by the semantic organization of the list, supports the intrusion of the lure word (Brainerd, Reyna, & Mojardin, 1999; Brainerd, Wright, & Reyna, 2002). These models were designed to measure the relative contribution of gist and verbatim information in memory but do not provide a computational account of how the gist and verbatim information is encoded in memory.

To explain how the dual-route topic model accounts for the false memory effect, we applied the model to a recall experiment by Robinson and Roediger (1997). In this experiment, each study list contains a number of words that are associatively related to the lure word, which itself is not presented on the study list.

Before presenting the false-memory simulations, it is useful to see how the model encodes a single study list, using the semantic isolation experiment of Hunt and Lamb (2001) as an example.

Figure 3. (a) Two example lists used in semantic isolation experiments by Hunt and Lamb (2001). The outlier list has one target word (HAMMER) which is semantically isolated from the background; the control list uses the same target word in a semantically congruous background. (Outlier list: PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH. Control list: SAW, SCREW, CHISEL, DRILL, SANDPAPER, HAMMER, NAILS, BENCH, RULER, ANVIL.) (b) Data from Experiment 1 of Hunt and Lamb (2001) showing the semantic isolation effect. (c) The predictions of the dual-route topic model.

We encoded the outlier and control lists with the dual-route topic model. To simplify the simulations, we used the same 1500 topics illustrated in Figure 2 that were derived by the standard topic model. We therefore inferred the special-word distribution and topic weights for this list while holding the 1500 topics fixed. We also made one change to the model. Instead of using a Dirichlet prior for the multinomial of the special-word distribution that has a single hyperparameter for all words, we used a prior with hyperparameter values that were higher for words that are present on the list than for words that were absent (0.001 and 0.0001, respectively). This change forces the model to put more a priori weight on the words that are part of the study list.

Figure 4 shows the model encoding for the isolate list shown in Figure 3(a). The most likely topic is the vegetable topic, with smaller probability going toward the farming and tools topics, reflecting the distribution of semantic themes in the list. The special-word distribution gives relatively high probability to the word HAMMER. This happens because the model encodes words either through the topic route or the special-word route, and the probability of assigning a word to a route depends on how well each route can explain the occurrence of that word in the context of the other list words. Because most of the vegetable-related words can be explained by the topic route, these words receive lower probability from the special-word route. On the other hand, the word HAMMER, which is semantically isolated from the vegetable words, cannot be explained well by the topic route, which makes it more likely to be associated with the special-word route.

To simulate recall, Equation (4) can be applied to calculate the posterior predictive probability over the whole vocabulary (26,000+ words) using the model encoding. We will refer to this as the retrieval distribution. The retrieval distribution shown in Figure 4 shows an advantage for the isolate word. This occurs because the special-word distribution concentrates probability on the isolate word, which is preserved in the reconstruction using both routes (the topic route distributes probability over all words semantically related to the list, leading to a more diffuse distribution).

Figure 3(c) shows the model predictions for the experiment by Hunt and Lamb (2001), which exhibit the same qualitative pattern as the experimental data. Note that the retrieval probability can only be compared qualitatively to the observed recall probability. In order to fully simulate recall, we would have to implement a sampling process with a stopping rule to simulate how human participants typically produce only a subset of words from the list. For reasons of simplicity, we chose not to implement such a sampling process.
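Since Equation (4) itself is not reproduced in this handout, the sketch below only illustrates the general shape of the retrieval distribution described above: a mixture of a diffuse topic-route distribution and a concentrated special-word distribution over the vocabulary. The route weight and all probabilities are invented toy numbers; the real model infers them from the list, using 1500 topics and a 26,000+ word vocabulary.

```python
import numpy as np

# Toy dual-route reconstruction of a study list (all numbers are invented).
vocab = ["peas", "carrots", "beans", "spinach", "hammer", "saw", "nails"]

# Topic route: a predictive distribution obtained by mixing topics, here
# dominated by a "vegetables" theme that spreads mass over related words.
P_topic_route = np.array([0.22, 0.22, 0.22, 0.22, 0.02, 0.05, 0.05])

# Special-word route: concentrates mass on items the topics explain poorly,
# here the semantically isolated word HAMMER.
P_special_route = np.array([0.05, 0.05, 0.05, 0.05, 0.70, 0.05, 0.05])

# Route weight: probability of reconstructing a word via the topic route
# (an assumed value; the model infers it from the list).
lam = 0.6

# Retrieval distribution: mixture of the two routes over the vocabulary.
P_retrieval = lam * P_topic_route + (1.0 - lam) * P_special_route

for w, p in sorted(zip(vocab, P_retrieval), key=lambda x: -x[1]):
    print(f"{w:8s} {p:.3f}")   # the isolate (hammer) comes out on top
```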

Returning to the false-memory experiment of Robinson and Roediger (1997): the remaining words on each study list were random filler words that did not have any obvious associative structure. The number of associatively related words was varied while keeping the total number of study words constant. Figure 5(a) shows some example lists that contain 3, 6, and 9 associates of the word ANGER, which itself is not present on the list. Figure 5(b) shows the observed recall probabilities for the studied items and the lure word as a function of the number of associates on the list. With an increase in the number of associates, the results show an increase in false recall of the lure word and a decrease in veridical recall. We applied the dual-route topic model to this experimental setup and simulated word lists similar to those used by Robinson and Roediger (1997). Figure 5(c) shows that the model predicts retrieval probabilities that are qualitatively similar to the observed recall probabilities.

As the number of associates increases, the model puts increasingly more weight on the topic route, because the topic route can better explain the associative structure when more associates are present. Putting more weight on the topic route leads to an increase in generalization beyond the list words, which is associated with an increase in false recall. Similarly, with an increasing weight on the topic route, there is a corresponding decrease in weight for the special-word route. This route is needed to reconstruct the specific words present on a list, and as the weight on this route decreases, there is a decrease in veridical recall. Therefore, the model explains these findings in a qualitative fashion by an underlying change in the balance between gist- and verbatim-level information. One advantage of this model over other dual-route memory models (e.g. Brainerd, Reyna, & Mojardin, 1999; Brainerd, Wright, & Reyna, 2002) is that the model explains performance at the level of individual words and specifies a representation for gist and verbatim information.
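The gist/verbatim tradeoff described above can be illustrated with the same toy mixture as before: shifting weight toward the topic route raises the retrieval probability of a semantically related but nonstudied lure while lowering that of the studied words. The route weights below are simply stepped by hand for illustration; in the actual model they are inferred from how well each route explains the list.

```python
import numpy as np

# Toy illustration of the gist/verbatim tradeoff (all numbers are invented).
vocab = ["list_word_1", "list_word_2", "list_word_3", "lure"]

# Topic (gist) route: spreads mass over everything semantically related,
# including the nonstudied lure word.
P_topic_route = np.array([0.25, 0.25, 0.25, 0.25])

# Special-word (verbatim) route: only the words actually on the list.
P_special_route = np.array([1/3, 1/3, 1/3, 0.0])

# More associates on the list -> more weight on the topic route (assumed
# here; inferred from the list by the real model).
for lam in [0.3, 0.5, 0.7, 0.9]:
    P_retrieval = lam * P_topic_route + (1 - lam) * P_special_route
    print(f"topic-route weight {lam:.1f}: "
          f"P(lure) = {P_retrieval[3]:.3f}, "
          f"P(studied word) = {P_retrieval[0]:.3f}")
```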


Figure 5. (a) Example study lists varying the number of words associated to the lure ANGER, which is not presented on the list. (b) Data from Robinson and Roediger (1997), Experiment 2, showing the observed recall probabilities for studied items and the lure item as a function of the number of associates on the list. (c) Predictions from the dual-route topic model.

(Steyvers & Griffiths, 2008)

4.5. Application to Information Retrieval

The dual-route topic model can be applied to documents to probabilistically decompose words into contextually unique and gist-related words. Such a decomposition can be

13 REFERENCES

Anderson JR, Schooler LJ (1991) Reflections of the environment in memory. Psychol Sci 2:396-408.
Chase W, Simon H (1973) Perception in chess. Cognit Psychol 4:55-81.
Daw ND, Niv Y, Dayan P (2005) Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci 8:1704-1711.
Lengyel M, Dayan P (2008) Hippocampal contributions to control: the third way. Advances in Neural Information Processing Systems 20:889-896.
Miller GA (1956) The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol Rev 63:81-97.
Packard MG, McGaugh JL (1996) Inactivation of hippocampus or caudate nucleus with lidocaine differentially affects expression of place and response learning. Neurobiol Learn Mem 65:65-72.
Poldrack RA, Clark J, Pare-Blagoev EJ, Shohamy D, Creso Moyano J, Myers C, Gluck MA (2001) Interactive memory systems in the human brain. Nature 414:546-550.
Steyvers M, Griffiths TL (2008) Rational analysis as a link between human memory and information retrieval. In: Oaksford M, Chater N (eds) The probabilistic mind: Prospects for Bayesian cognitive science. Oxford University Press, Oxford.
Steyvers M, Griffiths TL, Dennis S (2006) Probabilistic inference in human semantic memory. Trends Cogn Sci 10:327-334.
