Online Submission ID: 549

Non-Sequential Authoring of Handwritten Video Lectures with Pentimento

(a) User records temporal lecture, runs out of space; fortunately, can write in margin
(b) retroactively edits layout, timing is preserved
(c) records more derivation, decides extra step is needed
(d) moves content to make space for new step
(e) moves to insertion time, records new step
(f) new step properly inserted, finishes derivation
(g) wants to change µ, makes space
(h) redraws E(x) to replace µ, rest of timing is preserved

Figure 1: Authoring of a handwritten video lecture using Pentimento. Users record their lecture, in which strokes are drawn over time with a pen-tablet interface while audio is recorded. At any time, they can perform retroactive edits with the flexibility of static 2D vector graphics. However, these operations also preserve temporal information and synchronization with the audio. These edits would be challenging with current screen-capture approaches because they lack a sparse and structured space-time representation of the data. Other asynchronous recording scenarios are possible, such as audio first or visuals first, and can be combined iteratively.

Abstract

We facilitate the authoring of handwritten educational video lectures and seek to combine the advantages of vector graphics and non-linear video editing. Handwritten videos are currently recorded using video screen capture, which makes editing challenging. We believe that the power of digital content creation is in non-sequential iterative refinement. Teachers should be able to start a video lecture without a perfect script, and they should be able to improve their material over the year. If they want to add an extra line in a derivation, they should not need to restart a video lecture from scratch. We make the editing of audio and visuals as orthogonal as possible by decoupling the notion of time of these two modalities. To maintain synchronization and provide fine-grained control, we rely on a simple retiming data structure that maps from audio time to visual time and is controlled by point-wise correspondences. This also facilitates silence suppression and the corresponding speedup of visuals. We identify and implement four types of retroactive editing operations, which can modify content after recording. The user can 1/ change the layout, 2/ insert content back in time, respecting the other modality when inserting only audio or visuals, 3/ replace audio or visuals while preserving synchronization with the untouched modality, and 4/ shift content in time. Our approach supports a variety of recording scenarios in which audio and visuals can be recorded synchronously or independently.

Keywords: Pen, education, authoring, 2D animation

1 Introduction

Our work facilitates the creation of handwritten video lectures, in the style of the virtual white board popularized by, e.g., the Khan Academy [Khan 2012]. These videos feature text and diagrams captured as the teacher writes, synchronized with a narration. Unfortunately, their authoring currently relies on archaic technology that resembles a typewriter more than a flexible modern word processor. When recording digitally, authors use screen-capture software and draw with a painting program, which makes it difficult to correct or edit content after recording. We want to combine the benefits of 2D vector graphics and non-linear video editing to make non-sequential and iterative authoring easy.

Consider the variance derivation in the teaser (Fig.1), recorded with our approach. After recording the audio and visuals, the instructor later decided to replace µ by E(x) (Fig.1f-h). This requires shifting strokes to the right, going back in time and replacing the (temporal) strokes corresponding to µ by the longer writing E(x), all the while preserving synchronization with the audio. 2D vector graphics applications make this type of spatial editing trivial, but they do not support temporal content synchronized with audio. Video editing software supports editing and synchronization but requires tedious effort because visual and temporal content is dense and unstructured. Similarly, when the lecturer starts writing an equation too big and runs out of space (Fig.1(a)), there is traditionally little that can be done in post-production; they must restart this segment from scratch and composite it spatially in the […]. The creation of temporal handwritten content should be as easy as the creation of static 2D vector graphics, allowing for non-sequential recording and iterative refinements.

An additional challenge stems from the difficulty of keeping audio and visuals in sync. First, synchronously recording both modalities is cognitively difficult. Second, they have different natural speeds. The writing often needs to catch up with the narration, leading to awkward silences: video editors working on education content have told us that they spend most of their time suppressing silences and speeding up visuals. It is also often useful to shift timing to make sure that a new term gets written at the time when it is uttered. These edits, as well as the asynchronous recording of visuals and audio, are possible using non-linear video editors, but they remain tedious because visuals and audio need to be sliced up finely and the speed of video segments must be adjusted to match the audio. However, we observe that, compared to traditional video editing, the time scale of the visuals in handwritten lectures is more flexible and the recorded writing speed rarely needs to be respected, which provides us with great flexibility. We want to make the synchronization of audio and visuals direct and flexible, and let the user explicitly specify correspondences rather than edit the duration of disconnected segments.

Flexible synchronization is also critical to allow users to edit or record one modality (audio or visuals) without worrying about the other one. Consider the scenario where a user has recorded synchronized audio and visuals, then realizes that she needs to insert new visuals without changing the audio because her diagram is incomplete. Time needs to be made to accommodate the new visuals, but without modifying the audio, which requires speeding up other visuals. It is, however, unclear how far from the insertion time visuals should be sped up, and the user might want to edit the acceleration later.

Our work makes a step towards the flexible creation of hand-drawn lectures composed of strokes and optional background slides. We leave geometric shapes, typefaced text, or video as future work.

In order to make the editing of different modalities orthogonal, we decouple the time dimension of audio and visuals. To maintain synchronization, we introduce a simple retiming data structure based on a list of correspondences between audio and visual time values. This retiming enables a number of editing operations while respecting synchronization. It also permits flexible synchronization, in particular, the acceleration of handwriting to keep up with narration.

We identify four families of retroactive operations that enable asynchronous and non-sequential authoring. Retroactive layout edits change the spatial characteristics without affecting time. Insertion adds new content at a given time, respecting the other modality when inserting only audio or visuals. Content replacement preserves synchronization with the other modality, such as redrawing the visual content while preserving audio synchronization. Finally, temporal reordering enables visuals to appear in a different order than they were drawn.

We implemented our approach as a Mac OS application (10.7 and above), Pentimento, and it was informally tested by a number of users. It will be released as open-source free software and we include an executable with our submission.

1.1 Related work

We discuss traditional workflows for the creation of handwritten video, and related work in multimedia synchronization. Other areas of synergy will be discussed at the end of the paper.

Handwritten tutorials and screencasts. Presentation software such as Powerpoint offers limited support for sketching. It focuses on live annotations during a presentation and offers limited temporal editing. The main tools available to video lecture authors who want to edit their work are traditional non-linear video editors such as Apple's iMovie and Final Cut Pro, Adobe's Premiere or Avid. These tools unfortunately don't have a structured representation of the visuals as vector graphics software would, which makes it challenging at best to modify handwritten lectures.

The Flash animation language is often used to display the creation of drawings on the web, e.g. [Sclipo 2008], but the ability to record and edit freehand drawings is limited. In fact, authors often use masks to reveal static drawings or hand-written text, e.g. [Adriatic11 2008]. Vector graphics formats such as SVG have also been extended to the temporal dimension, e.g. [W3C 2011], but there is limited support for editing and audio synchronization. Loviscach et al. developed a technique to automatically generate a handwritten-looking animation [2011]. Other tools focus on pre-recorded clipart, e.g. [Sparkol 2013].

Most handwritten video lectures are screencasts captured during a live session, e.g. [Talbert 2011]. A number of apps, e.g. [Tablo 2012; Showme 2013; Queeky 2013; Everything 2013] and electronic pens or white boards, e.g. [Livescribe 2013], enable the recording of freehand drawing for applications such as education, but they usually do not offer post-editing.

Smart tablets for education. Most work on tablets and sketching for education has focused on use by students; artificial intelligence to understand an input handwriting or sketch; and feedback or simulation of the depicted configuration, e.g. [Zeleznik et al. 2008; LaViola Jr. and Zeleznik 2004; Hammond and Davis 2003; Alvarado et al. 2001]. In contrast, we focus on teachers. We do not want to embed intelligence in our software and, instead, focus on the flexibility of editing a recorded handwritten tutorial.

Pen and audio notes. The integration of handwriting and audio is also powerful in the context of white board meetings, e.g. [Pedersen et al. 1993; Moran et al. 1997] and note taking, e.g. [Wilcox et al. 1997; Stifelman et al. 2001]. The focus in this context is usually automatic organization and retrieval, while we cater to users intent on creating tightly synchronized content. However, for the creation of longer lectures, our users could benefit from powerful search and other advanced retrieval to better navigate a time line.

Multimedia synchronization. Our retiming between audio and visuals builds on work in multimedia synchronization, where challenges have been real-time and distributed systems issues, e.g. [Drapeau 1993; Chen and Wu 1996], or dealing with user interaction that can make the playback last a variable amount of time locally, e.g. [Wahl and Rothermel 1994; Schnepf et al. 1996; Bailey et al. 1998; Kurihara et al. 2005]. Our context is different because we want to produce a synced video, which is why we can get away with simple point-wise representations of constraints. We focus instead on maintaining meaningful synchronization during editing, especially when the user modifies one modality without the other one.

2 User Scenarios

Pentimento was designed to support a variety of recording workflows where audio and visuals can be recorded synchronously or not. In all cases, recorded content can be edited after the fact and new content can be inserted at arbitrary times in the lecture. While we discuss different workflows separately, users can mix and match approaches and all operations are available at all times. One of our users even invented a way of using Pentimento that we had not anticipated, half-way between synchronous and asynchronous recording.

Pentimento's UI reflects its combination of free-hand drawing and audio-video editing. A main window acts like a white board and is where the drawing and layout editing happen (Fig.2.a). A second window is dedicated to audio synchronization and temporal editing (Fig.2.b). It resembles a non-linear video editor, with audio and visual time lines, but it additionally shows editable synchronization arrows between audio and visuals, as discussed below.


[Figure 2 panel labels: play/record controls; editing operations; display options; selection for editing; formatting bar; thumbnail timeline; retiming correspondences; audio waveform and segments; in-lecture tools (every action with these tools will show in the final video lecture); main drawing area; safety margin; editing buttons; satellite view; time slider. Panels: (a) Main drawing window, (b) Audio/time window.]

Figure 2: Pentimento's UI combines (a) a drawing window and (b) an audio/time window inspired by non-linear video editors. (a) All the drawing tools (lower left) have their effect recorded over time. The selection/editing marquee (upper left) acts like in static vector graphics software but preserves time. The time slider (bottom) and skip buttons (upper right) control the current time. Insertion buttons at the bottom right allow the user to set the time to just after or just before a set of strokes, similar to insertion cursors in word processing. The optional safety margin (in grey when not recording) is not shown in the final video but offers more room in case the user runs out of space when recording. (b) Between the traditional video thumbnails and audio waveform, Pentimento offers editable synchronization arrows. They are usually vertical, except while editing, because the video thumbnails reflect the retiming. Thumbnails are augmented thanks to the structured space-time vector graphics information to de-emphasize inactive content and make strokes drawn during a thumbnail bolder. A satellite view displays the current time and enables vector-graphics selection for features such as the recording of audio for a visual selection.
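The synchronization arrows of Fig.2.b correspond to the retiming data structure introduced earlier: a list of point-wise correspondences from audio time to visual time, interpolated linearly and constrained to stay monotonic. The following is a minimal sketch of such a structure, not Pentimento's actual implementation. The class name `Retiming`, its methods, and the unit-speed behavior outside the first and last correspondence are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


class Retiming:
    """Hypothetical piecewise-linear map from audio time to visual time."""

    def __init__(self):
        # Correspondences kept sorted by audio time: (audio_time, visual_time).
        self.points = []

    def add(self, audio_t, visual_t):
        """Add a correspondence; reject it if visual time would run backward."""
        audios = [a for a, _ in self.points]
        i = bisect_left(audios, audio_t)
        if i < len(self.points) and self.points[i][0] == audio_t:
            raise ValueError("a correspondence already exists at this audio time")
        if i > 0 and self.points[i - 1][1] > visual_t:
            raise ValueError("visual time would run backward (previous arrow)")
        if i < len(self.points) and self.points[i][1] < visual_t:
            raise ValueError("visual time would run backward (next arrow)")
        self.points.insert(i, (audio_t, visual_t))

    def visual_time(self, audio_t):
        """Return the visual instant shown at the given audio time."""
        pts = self.points
        if not pts:
            return audio_t  # assumption: identity when nothing is constrained
        if audio_t <= pts[0][0]:
            # Assumption: unit speed before the first correspondence.
            return pts[0][1] + (audio_t - pts[0][0])
        if audio_t >= pts[-1][0]:
            # Assumption: unit speed after the last correspondence.
            return pts[-1][1] + (audio_t - pts[-1][0])
        audios = [a for a, _ in pts]
        i = bisect_right(audios, audio_t)
        (a0, v0), (a1, v1) = pts[i - 1], pts[i]
        w = (audio_t - a0) / (a1 - a0)
        return v0 + w * (v1 - v0)  # linear interpolation between neighbors
```

With identity correspondences at audio times 0 and 10, adding a correspondence from audio time 5 to visual time 8 speeds up the visuals before it and slows down those after it, while the audio is untouched. Only the interval between the two neighboring correspondences is affected, which is why identity "sentinel" correspondences limit the reach of an edit.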

182 To best use Pentimento, it is important to understand its synchro- 217 function, they can access the time where a page was completed (just 183 nization and timing model. Pentimento considers the audio time to 218 before page down) using the skip forward and backward buttons. 184 be the absolute reference time. The visual time, on the other hand, 219 For a single page lectures, these go to the beginning and end of the 185 is flexible and can be sped up or slowed down locally. Synchro- 220 lecture. 186 nization between audio and visuals are controlled by point-wise 187 correspondences from a given audio time to a visual instant. They 221 Layout editing In contrast to standard tools, users have the flexi- 188 are shown as arrows between the audio waveform and the visual 222 bility to edit their captured audio-visual content. Strokes are vector- 189 thumbnail time line (Fig.2.b). The thumbnail time line always re- 223 based and can be modified like in any vector graphics software, 190 flects the retiming of the visuals, which is why the arrows are al- 224 while preserving their temporal nature and synchronization with the 191 ways vertical except when the user edits them. Some constraints are 225 audio. A standard marquee selection tool can be accessed from a 192 inserted automatically (shown in grey), for example when deleting 226 button (upper left of Fig.2), but right clicking or using the eraser 193 a silence or redrawing strokes, and some are specified by the user 227 are preferred because they make the workflow less modal. Once 194 (shown in solid black.) The retiming is interpolated linearly be- 228 strokes are selected, traditional handles appear at the corner of the 195 tween correspondences, which means that a constraint impact the 229 bounding box of the selection, and dragging them rescales the con- 196 timing between the previous one and the next one. retiming and 230 tent. 
Translation is obtained by clicking on a selected stroke and 197 correspondences make it possible to edit one modality (audio or 231 dragging. the color, width and fill of strokes can also be changed 198 visuals) without worrying about the other one. 232 from a formatting toolbar. Together, scale, translation and appear- 233 ance changes make it easy to manage space, and beautify the lay- 199 We first focus on the user’s perspective and the technical implemen- 234 out after the fact and free the user from this cognitive task during 200 tation that makes these workflows possible will be described in the 235 recording (Fig.3.a). 201 next section.

236 As discussed above, one great feature of Pentimento is that the time 202 2.1 Synchronous recording of audio and visuals 237 dimension can largely be ignored when performing spatial edits. 238 One exception is strokes that get deleted during the lecture, such 203 In this workflow, the recording itself is closest to traditional tablet 239 as highlighter strokes. To edit them consistently, users turn on the 204 lecture authoring. Instructors press record and write with a sty- 240 display of ”ghosts” using a checkbox, and deleted or future strokes 205 lus while speaking out the audio. Audio and visuals are recorded 241 appear as dashed lines that can be edited with the same interface. 206 in real time until stop is pressed. Users can use background slide 207 (PDF) if needed and move through pages (from the keyboard or a 242 Motivated by our experience with traditional tools where poor space 208 button), which clears out the strokes that were drawn. They can 243 planning can ruin a recording, Pentimento includes an optional 209 also scroll the canvas with a dragging interface to get more space. 244 safety margin. This space is not shown in the final lecture, but users 210 A highlighter pen draws strokes that slowly fade over time to add 245 can write there, for example if they need to finish a long equation 211 emphasis without cluttering the lecture. Content can also be made 246 that they started to write too big (see Fig.1.) They can then later 212 to disappear at a given time in the video using an eraser tool. It is 247 select and scale down the equation to make it fit in the lecture. 213 also possible to stop and resume recording at any time. 248 Precisely content spatially can be hard, especially with tightly 214 At the end of a recording session, users obtain synchronized audio 249 packed lines of text where it is hard to select all the dots on the 215 and visual content. 
They can replay it or directly access any time 250 ’i’ and the bars on the ‘t’ without including content in the adjacent 216 using a time slider (bottom of Fig.2.a.) If they used the page down 251 lines. This why we introduced a temporal closure of the selection,

3 Online Submission ID: 549

252 accessible through a key stroke, where all the visuals between the 310 constraints to be less critical and a user can drag an arrow, passed 253 beginning of a selection and its end are added. 311 them, which automatically deletes them.

254 Audio edits Conversations with MOOC producers indicate that 312 Audio or visuals replacement Both audio and visual content 255 they spend vast resources deleting audio silences and tediously 313 can be replaced while preserving the synchronization with the other 256 speeding up the corresponding visual content. Pentimento makes 314 modality. If the user has made a typo or wants to draw something 257 this easy because of its flexible retiming model between audio and 315 more cleanly, they can select the objectionable visual content, press 258 visuals. Silences are easy to spot in the audio waveform view. 316 a redraw button, and draw new strokes. When they press stop, 259 Users can select the corresponding time interval and press delete 317 the new visual content is automatically sped up or slowed down 260 to remove it. Pentimento also offers an automatic silence detection 318 to match the timing of the older content, thereby preserving coarse- 261 and deletion based on local audio volume. Because silences often 319 grain audio-visual synchronization. 262 happen when users finish writing what they just said, Pentimento 320 Similarly, a piece of audio can be replaced by a new recording with- 263 automatically speeds up the visuals during and before the silence, 2 321 out changing the visuals. The user can go to the audio/time window, 264 until a default maximum time before the beginning of the silence . 322 select a time interval and press record audio to replace the audio. 265 This looks reasonable most of the time, and users are free to edit 323 Similar to the above case, the visuals in the selected time interval 266 the synchronization later if needed. 324 get sped up or slowed down automatically to accommodate the pos- 267 Users can tune the precise timing of the beginning and end of audio 325 sibly different length of the new audio. 268 segments using a dragging interface. 
For example, if a silence dele- 269 tion was too aggressive and cut the end of a word, the user can click 326 Insertion If they forgot to say and write something in the mid- 270 on the end of the segment and drag it to when the word is finished. 327 dle of a lecture, users can move the time slider to when content is 271 The later audio gets shifted forward or backward to accommodate 328 missing and press record like above. The existing audio gets split at 272 the new duration, and visuals get sped up or slowed down slightly 329 the insertion time, the new content is added, and the later existing 273 differently to accommodate the different duration. 330 audio and visuals get shifted by the duration of the new recording. 331 Setting the right insertion time is facilitated by selecting visuals and 274 Pentimento does not support audio speed up or slow down yet. 332 clicking the insert before or after buttons, which sets the time to the 333 beginning, or end respectively, of the selection. 275 Flexible synchronization MOOC producers told us they want 276 the ability to finely control the synchronization between audio and 334 The desired insertion time is often not exactly the same in the audio 277 visuals. For example, when a term in an equation is introduced, the 335 and visuals, but Pentimento’s flexible synchronization and editing 278 instructor does not always write the term in sync with the audio. For 336 make it easy to address. One option is to change the synchroniza- 279 this, Pentimento users can select the visual using the marque, and 337 tion of audio and visuals first, using the synchronization arrows de- 280 move the time slider to the desired time where these visuals should 338 scribed above, so that the desired insertion time be the same. An- 281 start or end. 
They then press a synchronization button or a key- 339 other option is to insert content at the correct visual instant and fix 282 board shortcut, and a point-wise synchronization correspondence is 340 the audio later by dragging the audio segment boundaries. 283 added. Visuals before and after the synchronization or sped up or 341 It is also possible to insert visuals alone without affecting the audio 284 slowed down to reach the right visual instant at the required audio 342 (and vice versa.) The user can uncheck the audio checkbox before 285 time. 343 pressing record. The new visuals are inserted and they and the later 286 A correspondence affects the timing from the previous to the next 344 visuals get sped up to make time for the extra content. Existing 287 one. In order to limit the time interval affected by a new syn- 345 synchronization correspondences are preserved, and the user can 288 chronization correspondence, users can beforehand insert “iden- 346 later fine tune which parts should be sped by inserting and editing 289 tity” correspondences on both sides to limit the affected interval. 347 correspondences. They can also use re-synchronization described 290 They move the time slider and press a button or a keyboard short- 348 above, to restore the synchronization of content if audio and visuals 291 cut. 349 were recorded together.

292 If users don’t place such “sentinel” correspondences beforehand, 350 2.2 Visuals first 293 the effect of a synchronization constraint might extend farther than 294 desired. If the content was recorded synchronously, they can use a 351 While synchronous recording was the original intended use, we and 295 re-synchronization feature based on the wall clock time when audio 352 most of our users have found it easier to focus on visuals first, and 296 and visuals were recorded. They move the time slider to the desired 353 record the audio later. Pentimento has an optional automatic record- 297 time, and request a synchronization constraint from the current au- 354 ing feature, which makes drawing frictionless. The recording of 298 dio time to the visual recorded at the same time. Alternatively, they 355 visuals without audio starts whenever the stylus is pressed or when 299 can resynchronize from visuals to audio. 356 visual tools (such as page down) are used, and it stops after one sec- 357 ond of inactivity. The availability of vector-graphics editing such 300 All correspondences (automatic or user-specified) can be edited by 358 as scale, translate, and color change make this part very similar to a 301 dragging the origin or head of the arrows, which changes the audio 359 drawing program, although Pentimento does not yet support typed 302 and visual times respectively. In all case, the visuals get retimed 360 text or shapes. 303 and the audio remains unchanged. A simple strategy for synchro- 304 nization is to insert identity correspondences and edit their visual 361 A convenient way to aid the drawing is to insert images or slides 305 times. 362 that won’t be shown in the final video. For example, we have im- 363 ported maps, traced over them, and deleted them later. Similarly, 306 The retiming is constrained to remain monotonic: visual time can- 364 Pentimento has an optional grid to help write and draw straight. 307 not run backward. 
This means that the head (resp. origin) of an ar- 365 While trivial to implement. these features make a big difference 308 row cannot be dragged passed another user arrow. That other arrow 366 and are not possible with current tablet authoring tools. 309 would need to be deleted first. In contrast, we consider automatic

2 Except when the deleted silence is at the very beginning of the video, 367 Temporal reordering The similarity to free-hand drawing has a in which case things have to be sped up after the silence. 368 major exception: Pentimento does record the timing of strokes and

4 Online Submission ID: 549

369 they get replayed in the order in which they were drawn. Of course, blur_x[x,y] = (input[x,y]+input[x+1,y]+input[x+2,y])/3 blur_y[x,y] = (blur_x[x,y]+blur_x[x,y+1]+blur_x[x,y+2])/3

370 the user can move the insertion time back if they realize that they blur_y.tile(x, y, xo, yo, xi, yi, 256, 32) blur_x.compute_at(blur_y, xo) 371 have forgotten to draw or write something, but this can break the 372 flow of writing and it is all too easy to forget. For example, users 373 might realize that they forgot a prime after an x and draw it later. 374 It is then desirable to move it back in time so that it appears right 375 after the x is drawn. In other cases, we have drawn a diagram after 376 a series of pseudocode lines, and realized later that the diagram 377 drawing should be interleaved temporally (Fig.3.b.) Figure 3: (a) The ray-tracing lecture was created by recording 378 Users select visuals (which must be contiguous in time), move the the audio first and then creating the visuals. It made heavy use 379 current time using the slider, and press ”move to time”. The ap- of retroactive layout edits to fit all the content in the canvas and to 380 plication removes content from the old time, and inserts it back choose colors after the fact. We also made good use of the insertion 381 at the new time. Alternatively, users can first move to the desired of visuals because we wrote the pseudocode first and later inserted 382 new time (for example using the insert before/after buttons), enable the corresponding illustrations in the diagram at the proper time. 383 ghosts if the content to be moved is later, select it and press move This figure is a resolution-independent PDF exported directly from 384 to time. Pentimento. (b) Compute at was created with visuals first and in- volved a number of temporal reordering because, while the right 385 Pentimento currently tries to prevent multiple strokes being drawn diagram was drawn after the pseudocode, it was later decided to 386 at once. This can be useful in specific cases, but most of the time, interlave it. The text is a background slide. 
the (spatial) semantics of word processing are desired: inserting and moving content such as hand-drawn text should place it temporally between letters, not overlapping a letter. This is why we snap insertion and moving time to the nearest empty time. One could imagine keyboard modifiers to disable snapping, but we have not implemented it yet.

Changing the temporal order is most useful when drawing visuals alone, and synchronization is then not an issue. If synchronization exists between audio and visuals, we do not currently attempt to modify them at the old and new time, since it is unclear what users would want. Fortunately, synchronization is easy to fix later.

Audio and synchronization. At any time, users can record audio and synchronize it to the visuals. They select a set of visuals, and press record audio for selection (both can be done in either window, thanks to a satellite view of the visuals). When they press stop, the selected visuals are automatically synchronized with the recorded audio thanks to synchronization correspondences automatically placed at the beginning and end of the recording. While it is tempting to achieve fine-grained synchronization by selecting small sets of visuals and recording a single sentence at a time, we have found that this can make the narration flat and choppy, and we prefer to record longer pieces of audio, and possibly synchronize later.

Users can also record audio without selecting visuals and worry about synchronization later. A common strategy is to start synchronizing at the beginning of a video. Users play the lecture and stop when audio and visuals are out of sync. They add a synchronization constraint to fix it and replay starting a little earlier to verify that the timing is right. This strategy essentially requires the user to replay audio at objectionable locations, which means that synchronization is usually achieved in about twice the total duration of the audio.

Audio waveforms are not easy to read. While we hope to incorporate speech recognition in the future, we currently support "tick marks" to indicate audio times for future editing with a red glyph on top of the waveform. Users can insert them at any time with a keyboard shortcut or a button. For example, while recording audio, if users make a mistake, rather than interrupt their flow to stop and delete the recording, they can insert a tick mark to remember where the problem happened and get back to it later while editing. Similarly, while playing a recording, they can mark key instants for later synchronization.

2.3 Other strategies

Audio first. It is possible to record audio first, and then draw the visuals, e.g. Fig. 3.a. Similar to the opposite scenario, users can select a time interval in the audio and record visuals which then automatically conform to the interval. Or, they can worry about synchronization later.

Audio + placeholder visuals. One of our testers came up with a workflow we had not anticipated. She records audio and visuals synchronously but only draws placeholders, such as a box for a diagram or a wiggly scribble for an equation. This allows her to focus on what she says, but prepares the layout and synchronization for visuals. Afterwards, she uses the "redraw" operation to replace the placeholders with the final visuals, and the synchronization with the audio is preserved.

Language translation. Pentimento can facilitate translation to a new language, in particular if the visuals are mostly diagrams and equations. The user can select parts of the original audio and record a translation. The old audio gets replaced and the visuals are retimed if the two languages don't have the same duration. Words in the visuals can also be translated using the redraw feature.

Animations. Pentimento supports limited animation of the visuals. Users can select visuals while recording using a lasso interface, and then translate the selection over time. Unlike the layout edits discussed so far, such movement is shown in the final video. For example, it can be used in a derivation when moving a term from the left-hand side to the right-hand side. The instructor draws a lasso stroke around the term to be moved, which selects it, and manipulation handles appear (inspired by the K-Sketch interface, although only translation is currently supported). Users then press "duplicate" to leave a second version of the term at the original location, and then drag the selected content to the right-hand side. When they release the pen or mouse, the content gets unselected and writing or drawing can resume.

Inspired by, e.g. [Perlin and Fox 1993; Bederson et al. 1994; Bederson and Hollan 1994], Prezi [Prezi 2013] and Preshak [McCann 2012], Pentimento supports an infinite and zoomable canvas from the lecturer's perspective. For example, they can draw a stroke around the area where they want to zoom in, or they can scroll with a dragging interface to make more space.
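The snapping of insertion and move times to the nearest empty time, mentioned earlier in this section, can be illustrated with a minimal sketch. It assumes that each stroke occupies a known (start, end) interval in visual time and that "empty time" means any instant not strictly inside such an interval; the function names and the tie-breaking rule are hypothetical and are not taken from Pentimento.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of one stroke in visual time (assumed representation)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping stroke intervals into disjoint busy periods."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous busy period: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def snap_to_empty_time(t: float, intervals: List[Interval]) -> float:
    """Return t unchanged if no stroke is being drawn at t,
    otherwise the closest boundary of the busy period containing t
    (ties go to the earlier boundary; this is an assumption)."""
    for start, end in merge_intervals(intervals):
        if start < t < end:
            return start if t - start <= end - t else end
    return t
```

With strokes drawn over [1.0, 2.0] and [3.0, 3.5], a requested time of 1.2 falls inside the first stroke and is moved to 1.0, while 2.5 already lies between strokes and is left alone.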


3 Implementation

Pentimento relies on a simple structured representation of content inspired by vector graphics and non-linear video editing. The central component that makes it easy to edit audio and visuals independently while preserving synchronization is the decoupling of audio and visual time and the retiming data structure based on point-wise correspondences. These data structures enable powerful retroactive operations: spatial editing, content insertion, content replacement, temporal reordering, and fine-grained synchronization. The main design decisions for these operations are how the retiming is maintained when editing one modality without the other one.

Figure 4 (rendering pseudocode):

    Render lecture at time tAudio:
        Retimer: translate tAudio into tVisual
        query camera transformation at tVis, apply transform
        render slides
        for each visual primitive
            render at time tVis

    Render stroke at tVis:
        query transform at tVis
        push transformation
        query color, width, fill color at tVis
        start polyline rendering
        for each space-time vertex
            if vertex.time
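As a rough illustration of this rendering loop, the following sketch translates an audio time into a visual time by piecewise-linear interpolation between correspondences and then collects the part of each stroke that is visible at that time. The class and function names are hypothetical, camera transforms and appearance attributes are omitted, and two behaviors are assumptions rather than statements about Pentimento: the mapping continues with slope 1 outside the constrained range, and a vertex is drawn once its time stamp is at or before the current visual time.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Retimer:
    # (t_audio, t_visual) pairs, sorted and strictly increasing in t_audio.
    constraints: List[Tuple[float, float]] = field(default_factory=list)

    def audio_to_visual(self, t_audio: float) -> float:
        c = self.constraints
        if not c:
            return t_audio  # assumption: identity when nothing is constrained
        if t_audio <= c[0][0]:
            return c[0][1] + (t_audio - c[0][0])  # assumption: slope 1 before first
        if t_audio >= c[-1][0]:
            return c[-1][1] + (t_audio - c[-1][0])  # assumption: slope 1 after last
        # Locate the two constraints that bracket t_audio and interpolate.
        i = bisect_right([a for a, _ in c], t_audio)
        (a0, v0), (a1, v1) = c[i - 1], c[i]
        return v0 + (v1 - v0) * (t_audio - a0) / (a1 - a0)


@dataclass
class Stroke:
    vertices: List[Tuple[float, float, float]]  # space-time vertices (x, y, t_visual)

    def visible_polyline(self, t_visual: float) -> List[Tuple[float, float]]:
        # Assumption: a vertex is shown once its time is <= the visual time.
        return [(x, y) for x, y, t in self.vertices if t <= t_visual]


def render_lecture(retimer: Retimer, strokes: List[Stroke], t_audio: float):
    """Return the visible polyline of every stroke at the given audio time."""
    t_vis = retimer.audio_to_visual(t_audio)
    return [s.visible_polyline(t_vis) for s in strokes]
```

Because playback is driven by audio time, only the retimer needs to know about both clocks; the strokes themselves are queried purely in visual time.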

Visuals also include images, for which we store the time of appearance, location, size, and a list of transformations over time.

All visuals (and audio) store the wall clock time of recording, which enables the re-synchronization of synchronous content discussed in Section 2.1.

The lecture stores a list of global spatial transformations over time for effects such as scrolling and page changes. Currently, all the visuals and pages share a global coordinate system, and different pages are simply laid out vertically. In a future version, we plan to assign each page an independent coordinate system to make it easier to insert or delete pages.

Audio. Audio is stored in standard audio track data structures. Recordings are stored into streams of samples. In order to facilitate editing, and in particular deletion and insertion, we also use a notion of segments, which are subsets of a stream. A segment encodes a pointer to a stream, the time at which it starts in the lecture, a duration, and the beginning time in the stream. Segments are organized into tracks, which are lists of segments. We currently support a single track and segments cannot overlap.

3.2 Synchronization and the retimer

The playback of the lecture is driven by the audio time and a retimer data structure is used to translate from audio to visual time (and back). The retimer stores a list of synchronization constraints (often called retiming correspondences), which are pairs of floating point values (t_audio, t_visual). The constraints are between time values and do not point to actual content.

We use piecewise-linear interpolation between constraints. We enforce the retiming constraints to be monotonic, and most editing operations described below are designed to respect this. When users drag constraints, they cannot drag past the next or previous constraints. They can, however, drag past an automatic one, which deletes it.

Footnote 3: We originally implemented subdivision curves, but found that the high throughput of modern tablets such as the Wacom Intuos or Cintiq series provides such a dense temporal sampling that fitting or subdivision are unnecessary.

To make the data structures more concrete, Fig. 4 shows pseudocode for the rendering at a given audio time.

3.3 Layout and other basic edits

The representation of visuals makes layout changes mostly orthogonal to the time dimension, and their implementation is very similar to a standard drawing program. The spatial selection, however, is restricted to visuals visible at the current display time, unless ghosts are enabled.

Changing the color of selected strokes modifies the appearance record active at the display time, that is, the one with the largest t_min smaller than the display time. This means that if the color was changed later in the lecture, this change still occurs.

A number of dynamic events such as color changes are instantaneous. The easiest editing strategy is to delete them and re-perform them. The user can select a time interval and delete the type of event they choose (camera transformation, color changes) from the audio/time window's lower left (Fig. 2.b).

Keyframes and animations. We also extend the spatial layout editing described above for static objects to animation. We want to offer an interface that is as similar as possible and let users select and move keyframes. For this, we exploit the nature of our animations. When objects are moved in handwritten lectures, their temporal list of transformations usually consists of long empty static periods followed by a burst of transformation when the user drags them from one place to another. We take advantage of this structure and define keyframes that correspond to the static periods. We simply go through the list of transformations of an object and define keyframes as static periods longer than a given time. These keyframes could be explicitly encoded by the UI, but we currently generate them on the fly when needed.

Imagine we have recorded a derivation animation, taking a term from the left-hand side to the right-hand side of the equation below. We later need to edit the layout and move the first line. We turn on ghosts so that the original location of the term is selectable. We drag the selection marquee over the line to select it. The ghost of the initial location of the animated term gets selected, but not its final


Figure 5: Insertion of synchronous audio and visuals, shown (a) before insertion and (b) after insertion. The panels plot visual time against audio time. Synchronization is preserved by inserting two new correspondences.

Figure 6: Visuals can be inserted without affecting the audio, shown (a) before insertion of visuals and (b) after insertion of visuals. A new synchronization constraint is added to preserve synchronization at the insertion time.
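The constraint bookkeeping illustrated in Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pentimento implementation: constraints are kept as a sorted list of (t_audio, t_visual) pairs with strictly increasing audio times, the insertion time lies inside the constrained range, and the helper names are hypothetical. The sketch adds one correspondence at the insertion time, a copy of it shifted by the recording duration in both audio and visuals, and shifts every later correspondence by the same amount.

```python
from typing import List, Tuple

Constraint = Tuple[float, float]  # (t_audio, t_visual)


def audio_to_visual(constraints: List[Constraint], t_audio: float) -> float:
    """Piecewise-linear lookup between sorted constraints (t_audio must be in range)."""
    for (a0, v0), (a1, v1) in zip(constraints, constraints[1:]):
        if a0 <= t_audio <= a1:
            return v0 + (v1 - v0) * (t_audio - a0) / (a1 - a0)
    raise ValueError("t_audio outside the constrained range")


def insert_synchronous(constraints: List[Constraint],
                       t_audio: float,
                       duration: float) -> List[Constraint]:
    """Return the constraints after inserting `duration` seconds of
    synchronous audio and visuals at audio time `t_audio`."""
    # Visual time that currently plays at the insertion point.
    t_visual = audio_to_visual(constraints, t_audio)
    before = [(a, v) for a, v in constraints if a < t_audio]
    # Later content moves by the same amount in both audio and visual time.
    after = [(a + duration, v + duration) for a, v in constraints if a > t_audio]
    start = (t_audio, t_visual)                       # preserves earlier retiming
    end = (t_audio + duration, t_visual + duration)   # slope 1 over the new segment
    return before + [start, end] + after
```

For example, with constraints [(0, 0), (10, 20)], inserting 3 seconds at audio time 5 yields [(0, 0), (5, 10), (8, 13), (13, 23)]: the mapping before the insertion is unchanged, the new segment plays at slope 1, and the old content keeps its original retiming, offset by 3 seconds on both axes.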

location. When we move the selection, only the initial position is affected, and intermediate locations are modified with a linear falloff so that the initial one is fully modified and the final one is not.

3.4 Insertion

Audio and visuals. The user can insert new audio and visual content at arbitrary moments in the middle of the lecture. The old content after the insertion time gets shifted by the duration of the new recording and we must make sure that existing retiming between audio and visuals is preserved (Fig. 5).

When insertion starts, we record the audio time of insertion and the corresponding visual time. They might be different because of retiming. We need to add a synchronization constraint at the beginning of the inserted segment that maps the insertion audio time to the visual one, in order to make sure that previous retimings are preserved.

At the end of the insertion, all the previous content that was after the insertion time gets shifted by the duration of the recording. Both audio and visual content get shifted by the same amount. In addition, we need to insert a new retiming constraint that is the insertion-time constraint shifted by the recording duration in both audio and visuals. As Fig. 5 shows, the retiming before and after the insertion gets preserved and the slope of the curve during the new segment is 1.

Single modality. When a single modality, such as visuals, is inserted at a time that also has the other modality, Pentimento keeps the other one untouched. A synchronization correspondence is added at the beginning to preserve the audio-visual synchronization that users see when they start recording content. The new visuals are inserted and the existing ones after the insertion time are shifted by the length of the new recording (Fig. 6). No other correspondence is added, which means that content between the insertion point and the next existing correspondence will be sped up.

Single modality when the other one is empty. When we insert content, such as visuals, at a time where the other modality is empty, Pentimento uses a different strategy that avoids unnecessary content acceleration and helps reduce the number of correspondences. This is particularly relevant when users start by recording only visuals and there is no audio to keep pace with. In this case, Pentimento simply adds the content at the insertion time, shifts the existing later content, and does not add any correspondence.

3.5 Content replacement

The user can select a set of visuals and redraw them without affecting the audio. We need to insert a new identity retiming constraint at the minimum time of the selection to preserve earlier timing. The user then draws the new version of the selection as in a normal recording session, except that no audio is recorded. At the end of redrawing, we need to delete the old primitives, and update the timing of the visuals and the retimer. The existing visual content after the modified one gets shifted by the difference between the old and new visuals' duration. We also add a new retiming constraint between the audio time corresponding to the end of the old primitives and the visual time of the end of the new primitives (Fig. 7.b). We currently delete synchronization constraints during the redrawn segment because the precise timing is usually not the same and because such constraints are easy for the user to add back.

Audio replacement proceeds similarly, but starts with a temporal interval selection.

Figure 7: Redrawing, shown (a) before redrawing and (b) after redrawing; the audio is unaffected. Redrawing involves two correspondences that match the speed of the new visuals to the duration of the old ones.

3.6 Temporal reordering

Users can change the temporal ordering of visuals. For this, they select a set of primitives, change the time slider to a new value, and


request to move the beginning of the selected primitives to the current time. We require that the visuals be contiguous, which means that no unselected visual event or primitive can occur during the selection. This allows us to restrict the timing changes to the interval encompassing the selection and the specified time (Fig. 8).

Figure 8: Temporal reordering of the visuals, shown for (a) a negative shift in time and (b) a positive shift in time. The audio is unaffected. Only the selected visuals and the time interval between them and the new selected time need to be swapped.

We do not modify the retimer when shifting content in time, but if constraints exist in the interval, they might need editing. Recall that constraints are between time values, not between audio and visual content. The reordering of visuals is fundamentally a non-monotonic operation.

3.7 Silence deletion

When the user deletes a time interval in the audio, we first place an identity retiming constraint at the end of the silence to preserve later timing. We then need to speed up the visuals to catch up with the audio, and the main decision is how early to start speeding up. We specify a maximum time where visuals are sped up (in practice it is set to 5 seconds). We check if there is already a synchronization constraint, and if not we add one before deleting. Finally, we delete the piece of audio. This requires us to split at least one audio segment into two audio segments that point to the same audio stream. In some cases, the selected interval overlaps multiple segments and we treat each of them in turn.

In addition, there is an edge case when the user deletes the very beginning of the audio, in which case there is no prior visual to speed up and we just add a new constraint between 0 and the beginning of the visuals. We also offer automatic silence suppression. We detect intervals in the audio that have a low audio level and that are longer than a given duration and apply the above procedure. To refine the removal of silence, users can edit where audio segments start and end using a traditional dragging interface over the segment boundaries in the audio waveform view.

4 Results

We implemented Pentimento as a Mac OS (10.7 and above) Cocoa application. We used the document architecture for UI, storage and undo, and Quartz for rendering. Core Audio provided support for audio recording and playing. An executable is included with this submission and we encourage you to try it. Tutorials are provided as supplementary material.

Pentimento has been informally tested by about twenty users and we include a number of handwritten video lectures in the supplementary videos (Fig. 3). Only one user was an author, but most users had a CS background. The majority of them did not have prior experience with handwritten videos, but five did. The created content includes a 15-minute statistics tutorial mostly based on slides for an economics MOOC, a mostly handwritten 4:30 minute supplementary video lecture for an on-campus CS course, and a few 30-second research project overviews for paper submissions. Other lectures were made to explore the possibilities of Pentimento. Table 1 reports statistics about the features used by our users in a subset of the produced content (see footnote 4).

Footnote 4: Version 0.5 of Pentimento did not have the arrow visualization for correspondences and did not properly report some usage statistics, hence the missing numbers. Version 0.6 fixed the statistics as well as a number of other bugs. Version 0.9 introduced the arrows, the cleanup of unnecessary correspondences, and a UI overhaul.

Three of the users with prior experience chose to use the synchronous audio+visual recording strategy they were used to, and one came up with the placeholder strategy. Most users worked on the visuals first, and added audio later. In one instance, the handwriting of the educator was replaced by someone else's handwriting for improved clarity. Only two of the users had access to the tutorials, because they were created later, but others usually received a brief ten-minute introduction by an author.

The main feedback from the experienced users was their excitement at not having to restart when they make mistakes. The advanced users who tried early versions of the tool that did not have canvas scrolling or the duplication and animation of strokes requested these features because they used them regularly in their lectures.

Two of the videos, ray tracing and microphone, had the audio recorded first (the former by a different person than the one who drew the visuals later). In ray tracing, the visuals were then drawn without worrying about audio, and synchronization arrows were added later. For microphone, the user selected time intervals in the audio in small chunks and recorded the visuals with automatic synchronization.

The lecture on large bounds (see additional videos) and one on statistics are the only ones that were recorded in real time with audio and visuals at the same time. The authors were experienced with screen capture and only made a few audio mistakes. Editing consisted of deleting silences that occurred while the author finished writing and fine-tuning the synchronization of terms that were not drawn in good-enough sync with the audio. In this case, no layout change was needed.

The ray-tracing lecture (Fig. 3) illustrates the benefits of retroactive edits for complex layouts. Many elements such as the light sources needed to be moved multiple times as more content was added. It also relies on many insertions because the pseudocode was written first and the corresponding elements in the diagram were then inserted at the appropriate time. Audio recording took about five minutes, visual drawing one hour, and synchronization another five minutes.

Overall, users found the drawing part of the UI easier than temporal editing. However, one of the experienced users found the time part the most exciting because it alleviates the need for unwieldy non-linear video editors, and provides a much quicker workflow than they do.

Some of the videos with large numbers of audio takes indicate the recording of small chunks synchronized with selected visuals, as shown by the number of uses of "audio for selection." In other cases, the user relied on the ability to cancel a recording when making a mistake.

Beginners use a smaller number of tools. Animation by translation was used by only one beginner, and zoom by three. Of the main retroactive edits, spatial layout changes were used heavily by only


half of the beginners (and all advanced users), while the others relied on scrolling to make space. Insertion was universally used, but shift in time was only used by advanced users. None of our beginners used automatic silence detection, possibly because its UI is not salient.

Pentimento can record and display the mouse cursor over time, which can be useful for emphasis. However, this feature still has bugs and usability issues and has been deactivated in current versions.

Performance can become an issue when the number of strokes gets large. We tried polyline simplification, but the bottleneck appears to be fill rate and the number of pixels covered by strokes more than the number of vertices. Incremental rendering is used while recording strokes (only the current stroke gets refreshed) to provide low-latency feedback. On older machines (more than 5 years old), the large number of events generated by tablets can overwhelm our UI and we need to optimize performance.

Retroactivity. A number of operations, such as deletion and stroke color change, have two different semantics depending on whether they are performed while recording the lecture or while editing. If a user deletes a stroke while recording, it still appears at earlier times and the final video shows the disappearance at the moment chosen by the author. In contrast, if the user deletes it while editing, the stroke never gets shown in the video. Similarly, changing color can either mean to show the change in the video, or just change it from its inception. Such modal semantics can be hard to convey and can be a source of confusion.

Retiming. Table 1 shows the number of correspondences in our videos. The ratio of automatic to user-defined correspondences gives a sense of the amount of time users spent editing the arrows (note that once a user edits an automatic constraint, we declare it user-defined). The ratio varies significantly per user, with some users clearly relying on automatic synchronization, while others like to manually tweak fine-grained correspondences. A few videos involved no correspondence editing at all because the users entirely relied on recording visuals for temporal selections (such as microphone), or its opposite, recording audio for a visual selection (diffraction, Pythagoras). The videos made with earlier versions of Pentimento tend to have more automatic constraints because the automatic simplification wasn't implemented yet.

On average, we observed one correspondence every 5 seconds, and a maximum of 4 seconds per user correspondence. This number is higher than we expected, but our users indicated that synchronization was a small part of authoring. Even for equations with many terms like an energy function, it is easy to select the visuals for each term one by one, go to the audio time where the term is uttered, and press a keyboard shortcut to synchronize. The ease of adding and editing correspondences might explain their number. It is also a consequence of asynchronous recording, and the synchronous recordings that were made exhibit significantly fewer correspondences (5 for 15 minutes in statistics).

The average speedup per stroke due to retiming was typically between 2 and 10 (Table 1), and close to 1 for synchronous recordings. Research overview videos tend to be faster-paced, e.g. heart and microphone both average about 20x, while content used for classes tends to have slower visuals (1.9x for compute at and 1.0x for statistics, which was recorded synchronously and mostly consists of slides+audio).

In earlier versions of the software, users had a hard time understanding retiming and correspondences, partially because they were visualized as simple ellipses rather than the current arrows. The early prototype also did not have the constraint cleanup feature. This resulted in an excessive number of automatic constraints, which added to the confusion. As soon as we switched the visualization to arrows between the visual and audio timelines, and introduced correspondence cleanup, people had a much easier time understanding the retiming.

We do not support audio acceleration. The main feature requested by our users (and not implemented yet) is the export of multiple versions at different speeds, where visuals and audio are equally sped up.

Time shifts. Temporal reordering was not widely used, but proved invaluable in some cases. In the compute at video, the visuals were created first, starting with pseudocode and proceeding with a diagram (Fig. 3). It was later realized (but before recording audio) that interleaving the diagram with the writing of the code would make more sense, and various parts were moved to different times.

Moving visuals in time is one of the few operations where time is not orthogonal to space and visuals, and the user interface is less intuitive. Pentimento could benefit from better representation of the temporal evolution of visuals, e.g. [Su et al.; Barnes et al. 2010; Grossman et al. 2010].

Slides and pages. Authors who use slides requested a preview of the next slide, which will be included in a future version. One user tried to use the "page down" feature to check his slides, which unfortunately recorded the page change in the video. In general, Pentimento lacks the ability to navigate space without recording. The challenge in enabling this is to present the two options (in-lecture vs. just to check) without confusing the user.

A number of users also requested the equivalent of a slide sorter and a more hierarchical and discrete organization of videos. Pentimento currently considers space and time to be unstructured and continuous, which limits structured editing, and in particular the addition or deletion of pages. We, however, offer discrete time navigation with the skip buttons, which move the time to the next big camera movement or page change.

4.1 Discussion and opportunities

Our focus on freehand strokes is motivated by their low entry barrier and their flexibility, but it is by no means dogmatic. The extension to other geometric primitives such as lines, rectangles and arrows should be straightforward. The evolution of text fields over time can probably be encoded in a way similar to undo stacks, with the additional challenge that users will want to perform retroactive edits that are not sequential with respect to the lecture time. The extension to video will require an extension of the retiming data structure, or at least of the way retiming is edited, because unlike handwritten video lectures, videos often require visuals and audio to remain synchronized to their real time. We want to support external video streams of the lecture and automatically edit them in the same way the audio is edited. This should be facilitated by our encoding of the wall clock time with audio streams and with dynamic events.

Animation capabilities are limited at this point and Pentimento could benefit from the full range of animations offered by a tool such as, e.g., K-Sketch [Davis et al. 2008], from procedural animation, e.g. [Zongker and Salesin 2003; Perlin and Fox 1993; Bederson et al. 1994; Bederson and Hollan 1994], and from visualization [Lee et al. 2013].

Newly inserted audio can have an inconsistent sound, which can be heard in the variance lecture. Some automatic equalization is on


Title        By paper  Version  Strategy  Length  Slides  Camera   Strokes  Corresp.  Corresp.  Speedup  Layout,  Delete   Time     Redraw  Audio  Audio for  Delete
             author                       (sec)           changes           auto      user               color    visuals  reorder          takes  selection  audio
Odissey      no        0.9      visuals   393     no      69       1695     10        80        9.2      663      61       16       9       52     51         83
microphone   no        0.9      audio     55      no      1        128      71        0         19.4     255      4        0        33      1      1          0
water cycle  no        0.9      visuals   152     no      11       925      11        30        4.7      6        9        1        2       16     16         27
statistics   no        0.9      sync      941     yes     29       27       2         3         1.0      0        0        0        0       45     0          0
compute at   yes       0.6      visuals   273     yes     1        321      15        24        1.9      34       0        4        1       10     4          11
heart        yes       0.6      visuals   24      no      1        205      5         2         19.3     20       5        1        3       0      0          1
Pythagoras   no        0.5      visuals   169     no      3        226      27        0         1.8      1        55       0        -       45     15         48
diffraction  no        0.5      visuals   151     no      10       836      126       0         5.7      16       120      0        -       31     35         92
convex hull  no        0.5      visuals   82      no      4        269      2         10        5.2      0        0        0        -       86     0          0
bar code     no        0.5      visuals   296     yes     32       895      9         55        4.9      13       28       0        -       30     4          8
odd squares  no        0.5      visuals   107     no      6        214      36        0         2.8      4        9        0        -       18     18         23
ray tracing  yes       0.5      audio     74      no      1        382      2         25        7.2      30       54       1        -       2      0          0
variance     yes       0.5      visuals   55      no      1        242      5         12        4.1      16       9        0        -       6      6          10

Table 1: The usage statistics of Pentimento reflect the variety of strategies used by authors. Note that version 0.5 of Pentimento had a poor visualization and UI for correspondences, as well as a bug in the redrawing statistics.

our wish list. We haven't suffered too much from the discontinuity at insertion points. In practice, microphone quality is the biggest issue (e.g. barcode video).

The audio timeline would benefit greatly from speech recognition and transcript and an interface similar to the ones introduced by Berthouzoz et al. [2012] and Rubin et al. [2013].

Gesture

We hope that a low barrier to authoring will vastly increase the number of educators who can create online content, and it will allow them to polish their material over the years. The combination of large numbers and iterative refinement will make it easy for better and better content to bubble up, and to present students with many different perspectives and levels of complexity about a given topic. We also hope that editable video lectures will make it easy for instructors and students to work collaboratively and build on each other's content in a way similar to what is currently done with slides. We are interested in integrating technology that facilitates this, e.g. [Coughlan et al. 2013; Chen et al. 2011; Bergman et al. 2010; Spicer et al. 2012; Drucker et al. 2006; Edge et al. 2013].

Format. The flexibility of our approach makes it possible to create content that ranges from near real-time, similar to the Khan style where the pace is slow in order to let the student absorb the material, all the way to fast visuals where complex drawings are traced in seconds and elaborate writing keeps up with a fast-paced audio, such as the ray-tracing lecture, which was designed to fit in the supplementary video and can be seen more as a teaser. This flexibility can in turn be a curse because it makes it too easy for educators to go too fast for students to follow. This is partly mitigated by the adjustability of digital players where students can slow down or go back, but the high-level issue is not unlike the dangers of presentation software for live lectures [Tufte 2006], which make it too easy to cram too much material into a lecture.

A clear limitation of handwritten lectures is that poor handwriting might affect clarity. Beautification techniques can help, e.g. [Julia and Faure 1995; Baran et al. 2010; Igarashi et al.; Simard et al. 2005; Yang and Byun 2008; Chok et al. 1999; Smithies et al. 1999; Zeleznik et al. 2007; Zitnick 2013], in particular in conjunction with [Arvo and Novins 2000]. Our system, however, allows an educator to create a first version of their content and then ask somebody with better handwriting to use the redraw feature to replace the content while preserving timing and synchronization. We used this strategy in some of the material created with our system. An exciting feature that we plan to support is type-righting [Cross et al. 2013; Cross et al. 2014] where hand-drawn text fades into fonts.

The need for synchronous audio and visuals, vs. saying first and writing afterwards, is still debated and might depend on learners, e.g. [Mayer 2002; Oviatt et al. 2005]. The flexible retiming in Pentimento should make it easy to create both versions and perform experiments, or to encode both at the same time and make it an option for viewers, possibly as part of variable-speed players.

Interaction. Our efforts have focused on the creation of handwritten lectures, but their viewing is equally interesting. First of all, the structured nature of our content makes it possible to bypass video encoding and save bandwidth by directly transmitting the vector information. We are experimenting with an HTML5 viewer that renders directly from space-time strokes. In addition, in the spirit of zoomable interfaces, e.g. [Perlin and Fox 1993; Bederson et al. 1994; Bederson and Hollan 1994], the viewer should be able to scroll and zoom to go back and check prior content in a flexible manner. Finally, the strong link between space and time should allow a student to select a piece of text in order to replay the corresponding segment, e.g. [Kim 2013; Monserrat et al. 2013].

Pentimento currently focuses on the production of videos, but educational content is best when the student is active. We want to merge the notion of replay and recording, and allow students to take notes, write derivations themselves, do exercises in the middle of a lecture, or play with a simulation. We want to add an API that will allow authors to control Pentimento visuals procedurally as a function of time or user input. We want to enable non-linear video lectures that branch out and where students can choose between a quick or a thorough derivation.

Analytics. One of the benefits of online content is the ability to gather analytics about student usage and learn where they pause, rewind or skip, e.g. [Seaton et al. 2010; Kim et al. 2014]. The direct display of these analytics in the timeline together with the ability to iteratively refine lectures can make it easier for educators to improve their content over time.

5 Conclusions

Handwritten video lectures are growing in importance but their authoring has so far relied on rigid approaches based on screen capture, which make it challenging at best to edit them. Our work seeks to bring the best of vector graphics and non-linear video editing together and to leverage the specific space-time characteristics of handwritten video lectures. Our decoupling of visual and audio time makes the editing of these modalities orthogonal while preserving synchronization, and simplifies silence suppression and

fine-grain synchronization. We introduced data structures and operations that make it possible to perform retroactive edits including layout changes, content insertion, content replacement, and temporal reordering. We hope that flexible and easy educational video authoring tools will empower a large number of educators to create online educational material and to improve it over the years. A non-sequential and asynchronous workflow makes it easier to get started and to refine the material.

References

ADRIATIC11, 2008. Handwriting flash cs3 tutorial. http://www.youtube.com/watch?v=5msr1ObsBDQ.

ALVARADO, J., C., AND DAVIS, R. 2001. Preserving the freedom of paper in a computer-based sketch tool. In Proceedings of the Ninth International Conference on Human-Computer Interaction, vol. 2, 687–691.

ARVO, J., AND NOVINS, K. 2000. Fluid sketches: continuous recognition and morphing of simple hand-drawn shapes. In UIST, 73–80.

BAILEY, B., KONSTAN, J. A., COOLEY, R., AND DEJONG, M. 1998. Nsync - a toolkit for building interactive multimedia presentations. In Proceedings of the sixth ACM international conference on Multimedia, ACM, 257–266.

BARAN, I., LEHTINEN, J., AND POPOVIC, J. 2010. Sketching clothoid splines using shortest paths. Comput. Graph. Forum 29, 2, 655–664.

BARNES, C., GOLDMAN, D. B., SHECHTMAN, E., AND FINKELSTEIN, A. 2010. Video tapestries with continuous temporal zoom. ACM Trans. Graph 29, 4.

BEDERSON, B. B., AND HOLLAN, J. D. 1994. Pad++: A zooming graphical interface for exploring alternate interface physics. In ACM Symposium on User Interface Software and Technology, 17–26.

BEDERSON, B. B., STEAD, L., AND HOLLAN, J. D. 1994. Pad++: Advances in multiscale interfaces. In Proceedings of ACM CHI’94 Conference on Human Factors in Computing Systems, vol. 2 of SHORT PAPERS: Virtual and Visual Environments, 315–316.

BERGMAN, L., LU, J., KONURU, R., MACNAUGHT, J., AND YEH, D. 2010. Outline wizard: Presentation composition and search. In Proceedings of the 15th International Conference on Intelligent User Interfaces, ACM, IUI ’10, 209–218.

BERTHOUZOZ, F., LI, W., AND AGRAWALA, M. 2012. Tools for placing cuts and transitions in interview video. ACM Trans. Graph 31, 4, 67.

CHEN, H.-Y., AND WU, J.-L. 1996. Multisync: A synchronization model for multimedia systems. Selected Areas in Communications, IEEE Journal on 14, 1, 238–248.

CHEN, H.-T., WEI, L.-Y., AND CHANG, C.-F. 2011. Nonlinear revision control for images. ACM Trans. Graph. 30, 4 (July), 105:1–105:10.

CHOK, S. S., MARRIOTT, K., AND PATON, T. 1999. Constraint-based diagram beautification. In VL, 12–19.

COUGHLAN, T., PITT, R., AND MCANDREW, P. 2013. Building open bridges: Collaborative remixing and reuse of open educational resources across organisations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, CHI ’13, 991–1000.

CROSS, A., BAYYAPUNEDI, M., CUTRELL, E., AGARWAL, A., AND THIES, W. 2013. Typerighting: combining the benefits of handwriting and typeface in online educational videos. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 793–796.

CROSS, A., BAYYAPUNEDI, M., RAVINDRAN, D., CUTRELL, E., AND THIES, W. 2014. Vidwiki: Enabling the crowd to improve the legibility of online educational videos. In CSCW 14, ACM.

DAVIS, R. C., COLWELL, B., AND LANDAY, J. A. 2008. K-sketch: a ’kinetic’ sketch pad for novice animators. In CHI, ACM, M. Czerwinski, A. M. Lund, and D. S. Tan, Eds., 413–422.

DRAPEAU, G. D. 1993. Synchronization in the maestro multimedia authoring environment. In Proceedings of the first ACM international conference on Multimedia, ACM, 331–339.

DRUCKER, S. M., PETSCHNIGG, G., AND AGRAWALA, M. 2006. Comparing and managing multiple versions of slide presentations. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology, ACM, UIST ’06, 47–56.

EDGE, D., SAVAGE, J., AND YATANI, K. 2013. Hyperslides: Dynamic presentation prototyping. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, CHI ’13, 671–680.

EVERYTHING, E., 2013. http://www.explaineverything.com/.

GROSSMAN, T., MATEJKA, J., AND FITZMAURICE, G. 2010. Chronicle: capture, exploration, and playback of document workflow histories. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, ACM, 143–152.

HAMMOND, T., AND DAVIS, R. 2003. LADDER: A language to describe drawing, display, and editing in sketch recognition. In IJCIA, Morgan Kaufmann, G. Gottlob and T. Walsh, Eds., 461–467.

IGARASHI, T., MATSUOKA, S., KAWACHIYA, S., AND TANAKA, H. Interactive beautification: A technique for rapid geometric design. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST-97), ACM Press, 105–114.

JULIA, L., AND FAURE, C. 1995. Pattern recognition and beautification for a pen based interface. In ICDAR-1, IEEE Computer Society, 58–63.

KHAN, S. 2012. The One World Schoolhouse: Education Reimagined. Twelve.

KIM, J., GUO, P. J., SEATON, D. T., MITROS, P., GAJOS, K. Z., AND MILLER, R. C. 2014. Understanding in-video dropouts and interaction peaks in online lecture videos. In Learning at Scale.

KIM, J. 2013. Toolscape: enhancing the learning experience of how-to videos. In CHI’13 Extended Abstracts on Human Factors in Computing Systems, ACM, 2707–2712.

KURIHARA, K., VRONAY, D., AND IGARASHI, T. 2005. Flexible timeline user interface using constraints. In CHI ’05 Extended Abstracts on Human Factors in Computing Systems, ACM, CHI EA ’05, 1581–1584.


LAVIOLA JR., J. J., AND ZELEZNIK, R. C. 2004. Mathpad²: a system for the creation and exploration of mathematical sketches. ACM Trans. Graph 23, 3, 432–440.

LEE, B., KAZI, R. H., AND SMITH, G. 2013. Sketchstory: Telling more engaging stories with data through freeform sketching. Visualization and Computer Graphics, IEEE Transactions on 19, 12, 2416–2425.

LIVESCRIBE, 2013. Pencasts. http://www.livescribe.com/en-us/pencasts.

LOVISCACH, J. 2011. A real-time production tool for animated hand sketches. In CVMP.

MAYER, R. E. 2002. Multimedia learning. Psychology of Learning and Motivation 41, 85–139.

MCCANN, J., 2012. Preshack. https://github.com/ixchow/Preshack.

MONSERRAT, T.-J. K. P., ZHAO, S., MCGEE, K., AND PANDEY, A. V. 2013. Notevideo: Facilitating navigation of blackboard-style lecture videos. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, CHI ’13, 1139–1148.

MORAN, T. P., CHIU, P., AND VAN MELLE, W. 1997. Pen-based interaction techniques for organizing material on an electronic whiteboard. In Proceedings of the 10th annual ACM symposium on User interface software and technology, ACM, 45–54.

OVIATT, S., LUNSFORD, R., AND COULSTON, R. 2005. Individual differences in multimodal integration patterns: What are they and why do they exist? In Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, 241–249.

PEDERSEN, E. R., MCCALL, K., MORAN, T. P., AND HALASZ, F. G. 1993. Tivoli: an electronic whiteboard for informal workgroup meetings. In Proceedings of the INTERACT’93 and CHI’93 conference on Human factors in computing systems, ACM, 391–398.

PERLIN, K., AND FOX, D. 1993. Pad: An alternative approach to the computer interface. SIGGRAPH ’93, Computer Graphics (Aug.).

PREZI, 2013. http://prezi.com/.

QUEEKY, 2013. http://www.queeky.com/tools.

RUBIN, S., BERTHOUZOZ, F., MYSORE, G. J., LI, W., AND AGRAWALA, M. 2013. Content-based tools for editing audio stories. In Proceedings of the 26th annual ACM symposium on User interface software and technology, ACM, 113–122.

SCHNEPF, J., KONSTAN, J. A., AND HUNG-CHANG DU, D. 1996. Doing flips: Flexible interactive presentation synchronization. Selected Areas in Communications, IEEE Journal on 14, 1, 114–125.

SCLIPO, 2008. How to create flash animations with a wacom tablet. http://adobe-flash.wonderhowto.com/how-to/create-flash-animations-with-wacom-tablet-154574/.

SEATON, D. T., BERGNER, Y., CHUANG, I., MITROS, P., AND PRITCHARD, D. E. 2010. Who does what in a massive open online course? Int. J. Hum.-Comput. Stud 68, 223–241.

SHOWME, 2013. http://www.showme.com/.

SIMARD, P. Y., STEINKRAUS, D., AND AGRAWALA, M. 2005. Ink normalization and beautification. In ICDAR, II: 1182–1187.

SMITHIES, S., NOVINS, K., AND ARVO, J. 1999. A handwriting-based equation editor. In Graphics Interface, 84–91.

SPARKOL, 2013. http://www.sparkol.com/.

SPICER, R., LIN, Y.-R., KELLIHER, A., AND SUNDARAM, H. 2012. Nextslideplease: Authoring and delivering agile multimedia presentations. ACM Trans. Multimedia Comput. Commun. Appl. 8, 4 (Nov.), 53:1–53:20.

STIFELMAN, L., ARONS, B., AND SCHMANDT, C. 2001. The audio notebook: Paper and pen interaction with structured speech. In Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, 182–189.

SU, S., PARIS, S., ALIAGA, F., SCULL, C., JOHNSON, S., AND DURAND, F. Interactive visual histories for vector graphics. Tech. Rep. MIT-CSAIL-TR-2009-031.

TABLO, 2012. http://teachontablo.com/.

TALBERT, R., 2011. How i make screencasts: The whiteboard screencast. The Chronicle of Higher Education blog, http://chronicle.com/blognetwork/castingoutnines/2011/06/07/how-i-make-screencasts-the-whiteboard-screencast/.

TUFTE, E. R. 2006. The cognitive style of PowerPoint: pitching out corrupts within, 2 ed. Graphics Press LLC, Cheshire, Connecticut.

W3C, 2011. Scalable vector graphics (svg) 1.1. http://www.w3.org/TR/SVG/animate.html.

WAHL, T., AND ROTHERMEL, K. 1994. Representing time in multimedia systems. In Multimedia Computing and Systems, 1994., Proceedings of the International Conference on, IEEE, 538–543.

WILCOX, L. D., SCHILIT, B. N., AND SAWHNEY, N. 1997. Dynomite: a dynamically organized ink and audio notebook. In Proceedings of the ACM SIGCHI Conference on Human factors in computing systems, ACM, 186–193.

YANG, J. Y., AND BYUN, H. R. 2008. Curve fitting algorithm using iterative error minimization for sketch beautification. In ICPR, 1–4.

ZELEZNIK, R. C., MILLER, T., AND LI, C. 2007. Designing UI techniques for handwritten mathematics. In SBM, Eurographics Association, M. van de Panne and E. Saund, Eds., 91–98.

ZELEZNIK, R. C., MILLER, T. S., VAN DAM, A., LI, C., TENNESON, D., MALONEY, C., AND JR, J. J. L. 2008. Applications and issues in pen-centric computing. IEEE MultiMedia 15, 4, 14–21.

ZITNICK, C. L. 2013. Handwriting beautification using token means. ACM Trans. Graph. 32, 4 (July), 53:1–53:8.

ZONGKER, D., AND SALESIN, D. 2003. On creating animated presentations. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA-03), Eurographics, 298–308.
