Animation of Natural Virtual Characters

Data-Driven Approach to Synthesizing Facial Animation Using Motion Capture

Kerstin Ruhland, Mukta Prasad, and Rachel McDonnell ■ Trinity College Dublin

Producing cartoon animation is a laborious task, and there is a distinct lack of automatic tools to help animators, particularly with creating facial animation. To speed up and ease this process, the proposed method uses real-time video-based motion tracking to generate facial motion as input and then matches it to existing hand-created animation curves.

In the entertainment industry, cartoon animators have a long tradition of creating expressive characters that are highly appealing to audiences. It is particularly important to get right the movement of the face, eyes, and head because they are the main communicators of the character's internal state and emotions and an indication of its level of engagement with others. However, creating these highly appealing animations is a labor-intensive task that is not supported by automatic tools. Reducing the animator's effort and speeding up this process, while allowing for artistic style and creativity, would be invaluable to production companies that create content for weekly cartoons and feature-length films.

Animation production typically goes through a number of creation stages: acting (where the animator acts out the expressions in front of a mirror), blocking (where key body and face poses are created), a refining pass (where overlapping actions are added, movements are exaggerated, and the timing and spacing of motions is refined), and final polish (where animation curves are cleaned and small details are added). We propose an alternate, more automated solution for use when production speed is most important. Our approach aims to produce facial animation that closely resembles the quality of the refining stage, with easily editable curves for final polish by an animator.

The proposed approach replaces the acting and blocking phases by recording the artists' movements in front of a real-time video-based motion tracking and retargeting system (see Figure 1). We use a commercially available and affordable real-time facial-retargeting system for this stage, which creates the initial motion curves of the character. These motion curves contain realistic human movements that are typically dense in keyframes, making them difficult for an animator to edit.

Our approach thus focuses on the next stage of the pipeline. The initial motion curves of the character animation provide the input to our algorithm. Our pattern-matching algorithm replaces the refining stage by matching the motion curves to a database of hand-animated motion curves. This creates synthesized animations that reflect the artist's keyframing and animation style. In the final stage, the animator can focus on polishing the animation for final production. We expect that artists need to be acquainted with polishing synthesized animations, as opposed to animations of their own work, before such a system can be incorporated into a production pipeline. Therefore, our solution will be mostly suitable for productions that require a lot of animations in a short amount of time (such as weekly TV cartoons).

This article demonstrates our approach's ability to dramatically reduce the number of keyframes of the motion-capture data. We also describe a set of user studies that show that our synthesized animations can achieve the same expressiveness and cartooniness as hand-created animations and that an animator can improve them with a small amount of polish. (See the web extra video for an illustration of the proposed approach: https://youtu.be/PQP9965jcCk.)


Figure 1. Producing facial animation at the refining stage. A motion-capture system lets us predefine key facial poses and timings. Then, by utilizing a database of hand-created animations, our approach adjusts the motion-captured animations to match the artist's style. The animator can easily edit the animation for final production in the final polish stage.

To evaluate our method, we synthesized a series of emotional sentences, using some input motion capture, a database of hand-animated motions, and three different pattern-matching algorithms (distance measurement, symbolic aggregate approximation, and a hidden Markov model). For this work, we were limited to a database consisting of just one minute of high-quality hand-animated content that we commissioned for the purpose of this project. The results show that our approach can effectively synthesize new animations (see Figure 2). The large number of motion-capture keyframes generated by the real-time retargeting, usually one per frame, is replaced by well-placed keyframes and in-betweens.

Motion Capture and 3D Animation
Motion capture is the favored method for applying realistic motion to 3D characters, but the results can appear too realistic and lack expressiveness when applied to highly stylized cartoon characters. Editing motion-capture data is difficult and work intensive because keyframes are created for each frame, which can limit the animator's creativity.


Figure 2. Our pattern-matching approach uses example curves from professional cartoon animators, matching them to a motion-capture sequence to create a new synthesized animation. Each example shows the motion-capture input alongside the synthesized result. The example emotions are (a) happy, (b) angry, and (c) happy.


Various keyframe simplification methods have been implemented to make motion capture more accessible.1 Ideally, any animation tool using motion capture created for an animator should simplify the keyframes. Our method automatically reduces motion-capture keyframes using a pattern-matching approach, so it is well suited as an animator tool.

Professional cartoon animators draw on a wealth of established animation principles to create appealing characters. The literature has comprehensively documented the application of these traditional principles to 3D computer animation, explaining their meaning in 2D animation and how they can best be translated into 3D.2 Squash-and-stretch, for example, makes animations more fluid, while exaggeration and anticipation help to translate the character's emotions and thoughts to the audience.

Some previous research focused on synthesizing these principles for 3D objects, for example, by creating squash-and-stretch or exaggerated motions. The Cartoon Animation Filter3 applies an inverted Laplacian of a Gaussian filter to the motion signal to enrich a variety of motion signals with anticipation, follow-through, exaggeration, and squash-and-stretch. Relatively few papers concentrate on the application of traditional animation principles to virtual characters. Ji-yong Kwon and In-Kwon Lee transformed motion-capture data into rubber-like exaggerated motion by breaking down the original skeleton into shorter segments and applying a Bézier curve interpolation for the bending effect and a mass-spring emulation for a smooth movement.4 Jong-Hyuk Kim and his colleagues implemented an anticipation effect added to character animation.5 With the notion that a main movement is preceded by an opposite movement, they calculated the rotations and translations of the center of mass and then calculated the anticipation pose with a nonlinear optimization technique and incorporated it into the given motion. Eakta Jain and her colleagues utilized an existing motion-capture database to match and compare the 3D poses to hand-drawn 2D character poses, generating a stylized 3D animation.6 Our proposed method takes the opposite approach by using a database of hand-created stylized motion curves to replace realistic human motion curves.

Although there has been an abundance of work contributing to the automatic generation of realistic facial animation, facial animation stylization has received less attention. Previous attempts have used a 2D approach, typically using video data as input and producing low-resolution facial animation, where details in the face are minimized and the important features are therefore exaggerated. Jung-Ju Choi and his colleagues added the traditional principle of anticipation to motion-captured facial expressions.7 They used principal component analysis (PCA) to classify facial features into components with similar motions and extract expressions from each component's animation graph. An anticipation effect for a specific expression is found by searching the component's animation graph and finding the best-matching inverted expression, which is blended into the original expression. Unlike this approach, we present a pattern-matching method that matches similar patterns in a hand-created animation to a motion-capture curve. The newly synthesized animation can exhibit an animator's unique style.

Automatically modifying motion-capture curves to add the traditional animation principles is an effective method of changing the style of the realistic motion data and making it appear more cartoonlike. However, this approach lacks the input of a trained animator, resulting in somewhat predictable results; for example, the same exaggeration applied to all motions can look repetitive over time. Our aim is to incorporate input from a trained animator in order to create more varied results. We observe that a motion-capture curve retargeted to a virtual character is basically a sequence of data points over a time interval. Therefore, we draw inspiration from the study of time-series data analysis, the goal of which is to detect and estimate patterns and predict future trends. Our objective is similar in that we wish to detect the animator patterns in a given motion-capture curve.

The search for matching patterns in time-series data has been widely researched. The most commonly used strategy to compare two sets of time-series data with each other is the evaluation of Euclidean distance. Dynamic time-warping (DTW) techniques and hidden Markov models (HMMs) have also been successfully applied to mine time-series data. In addition to being robust to noise, offset translation, and amplitude scaling when matching

patterns, Eamonn Keogh's method8 accounts for longitudinal scaling, which is an important feature for our use. To reduce storage space and speed up the similarity search, Jessica Lin and her colleagues introduced a symbolic aggregate approximation (SAX) that reduces the data's dimensionality and enables indexing with a lower-bound distance measure.9 The time-series data are approximated by a piecewise aggregate approximation (PAA) model and mapped to SAX symbols, generating a word of a predefined alphabet size. The word comparisons are fast and less computationally expensive than the Euclidean distance measure.

Algorithm
Our method takes as input an unseen motion-capture curve and, using a pattern-matching approach, finds similar curve segments in a database of hand-animated motion curves. The motion capture is then adjusted to match the detected hand-animated motion curve. The curves for each facial feature are processed independently. This allows more flexibility, as a more exaggerated stylized animation can be achieved by nonsynchronous movement of facial features, and simpler computation. Our work assumes that the distinctiveness of a hand-created animation is derived from the spacing of keyframes, tangents, interpolation methods, and the timing of the curves.

A prerequisite for our approach is that the input motion capture is retargeted to the same character rig as the database of hand-animated motion curves it is matched to. The character rig itself can be blendshape or bone based. Our proposed method works independently of the motion-capture and retargeting technique used, so any state-of-the-art method to capture realistic motion and apply it onto a 3D character can be used. For practicality, a markerless video-based method for real-time capture is preferable because it is cheaper and easier to fit into an existing animation pipeline, rather than the more costly and difficult-to-configure alternative of optical motion capture. Although the transfer of performance-capture data to the virtual model is important, we implemented existing approaches because the main focus of our work is on reusing the motion.

In our approach, each retargeted motion-capture curve describing the xyz translation and rotation of the corresponding facial feature or body part (such as the head or eyes) along the time axis is referred to as the motion curve. These motion curves are compared to curves hand-created by an animator using the same character rig, which are referred to as example curves.

Our method consists of three main stages. First, the example curves from the database of artist-created animations are divided into chunks of progressively smaller sizes, which serve as patterns to be found in the motion-capture curve. In the second stage, a pattern-matching approach finds subsegments in the example curve that are similar to subsegments in the motion curve. For comparison, we implemented three straightforward and well-established pattern-matching approaches. The first approach finds the similarity of the curves using a distance measurement between subsegments on the data points of both curves. The second approach transfers both curves into a symbolic representation before comparison, and the third approach uses an HMM to find the best match. We assume that the first approach will achieve more accurate results but will have a higher processing overhead; the second approach will speed up the processing but might introduce errors due to the approximation of the original curves; and the third approach will be superior in warping the motion-capture curve to the animator curve, as it models the smoothness of the match, but, depending on the parameter definition, might require a longer processing time. The third stage of the method is the continuous adjustment of the motion curve, based on the segments found by the chosen pattern-matching approach.

Example Curve Segmentation
In preparation for our approach, the example curves have to be segmented in order to match subsegments to the motion-capture curve. We create a dictionary of all curve segments from the initial animation curves so that each curve segment is at least k keyframes in length and at most as large as the example curve. In our implementation, we choose a top-down segmentation algorithm, where each example curve in the database is recursively partitioned. Each keyframe of a newly defined curve segment stores the time, the translation/rotation value, the type of tangent, and the tangent's in and out angles. For later processing, the time codes of the keys are recalculated to start at zero.

Our method's objective is to transfer the artistic keyframing style of the hand-created animation to the motion curve. Experimental validation shows that the best style-transfer results are achieved by transferring as much information as possible from larger artist-created segments. Therefore, the dictionary of curve segments is sorted in decreasing order of length when searching for a pattern in the motion curve.
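As a concrete illustration of this segmentation, here is a minimal Python sketch (the article's plug-in was written against the Maya Python API, but this standalone version avoids Maya dependencies). The Keyframe fields mirror the attributes listed above; the midpoint split and the default k = 3 are assumptions for illustration, not details from the article.

```python
from dataclasses import dataclass

@dataclass
class Keyframe:
    time: float        # key time, recalculated to start at zero per segment
    value: float       # translation/rotation value of the facial controller
    tangent_type: str  # tangent type, e.g., "spline" or "linear"
    in_angle: float    # tangent in angle
    out_angle: float   # tangent out angle

def normalize_times(segment):
    """Recalculate time codes so the segment starts at zero."""
    t0 = segment[0].time
    return [Keyframe(k.time - t0, k.value, k.tangent_type,
                     k.in_angle, k.out_angle) for k in segment]

def build_dictionary(example_curve, k=3):
    """Top-down recursive partitioning of one example curve into
    subsegments of at least k keyframes, sorted by decreasing length."""
    segments = []

    def split(lo, hi):
        if hi - lo < k:
            return
        segments.append(normalize_times(example_curve[lo:hi]))
        mid = (lo + hi) // 2            # assumed split point for the recursion
        split(lo, mid)
        split(mid, hi)

    split(0, len(example_curve))
    segments.sort(key=len, reverse=True)  # larger artist segments first
    return segments
```

Sorting by decreasing length implements the preference for larger artist-created segments described above: the matcher tries the most informative patterns first.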


For scale invariance, we adopted Keogh's approach8 by including scaling in the x axis. To do so, each subsegment of the example sequence is divided into three parts, with smaller left (ql) and right (qr) subsegments. For each ql and qr, a matching pattern in the motion curve is found. New segments are created by taking a match found for ql and qr and compressing or stretching the middle part along the x axis. The scaling factor with which the x values of the middle part are multiplied is computed as

ScalingFactor = (qr(x1) − ql(x1)) / qlength,

where ql(x1) and qr(x1) are the starting points on the x axis of the left and right subsegments of found matches, respectively, and qlength is the original length of the example segment q.

With the available subsequences, each pattern-matching approach must solve the subsequence matching problem. As we explained earlier, we implemented three different pattern-matching approaches for comparison purposes.

First Approach: Distance Measurement. The simplest and most common approach to determine the similarity between two curves is the distance measurement. Let Q represent a dataset of curve segments gleaned from the original example curve. The example curve segments are represented as qs(i), where i ∈ {1 ... n(s)} denotes the time indices for a segment with a start point that is zero centered to account for translational offset. The 1D motion curve is represented by p(t) at every time index t.

For each point in the example curve segments qs, the corresponding point in the motion curve P must be found. The distance between a segment qs and a subsequence p of P starting at r is the sum of squared errors (SSE) calculated over the distance measure between each point in qs and the corresponding point in p. This can be defined as

d(p(r), qs) = SSE{qs(i) − p(r + (i − 1)) | i ∈ {1 ... n(s)}}.

A distance of zero describes identical segments, as the distances between the points are equal. The distances between each segment qs and the subsequences of P are stored in an M × N matrix, where M is the number of subsequences of P and N is the length of Q. The goal of subsequence matching is to find r* such that

r* = argmin_r d(p(r), qs).

The final synthesized animation curve can be further influenced by introducing a threshold. Instead of an exact search for the minimal argmin_r, an approximate search satisfying some threshold can be performed. A low threshold will search for the best match between the motion curve and the example curves, whereas a higher threshold will allow approximate matches. A combination of a low threshold and small subsegments will closely replicate the motion curve. The higher the threshold, the more the final curve will deviate.

Second Approach: SAX. The second approach applies symbolic aggregate approximation (SAX)9 to the motion data. SAX approximates a time series by converting the curve into symbols, thus reducing the curve's dimensionality to allow faster comparison. We describe only the main steps here; for more detailed information, see the earlier research.9

Before converting the example curves Q and motion curves P into symbols, they are normalized to have a mean of zero and a standard deviation of one. This is necessary because previous tests have shown that the distance measurement between nonnormalized time-series data is of no relevance due to possible differences in offset and amplitude. Both curves are then converted using a piecewise aggregate approximation (PAA) method. PAA reduces a curve of length n into a representation of w dimensions. The user-defined w specifies the number of PAA segments representing the curve. The curve is divided into w segments of equal size, and the mean value of the data points is calculated for each segment. The quality of this approximation of the original curve can be controlled by the number of segments.

The next step discretizes the data representation further, based on the assumption that normalized time-series data has a Gaussian distribution. Given this knowledge, breakpoints are determined from a statistical table, which divides a Gaussian distribution into a user-defined number (3 to 10) of equiprobable areas. One SAX symbol is mapped to each of these areas. Finally, the PAA coefficients are converted into a symbolic representation by applying the corresponding SAX symbol, depending on the area under the Gaussian distribution in which the PAA coefficient falls.

To compare two symbols, SAX provides a lookup table with the distances between two symbols. The minimum distance between two equally sized curves Q′ and P′ in a SAX representation can then be calculated as

mindist(Q′, P′) = sqrt(n/w) × sqrt(Σ_{i=1..w} dist(q′i, p′i)²).

The comparison of curves of varying length is implemented by using a sliding-window approach. As in the previous approach, we can introduce a threshold to influence the result of the synthesized animation curve.
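The sketch below shows the SAX steps just described for a four-symbol alphabet: z-normalization, PAA reduction to w means, symbol assignment via the Gaussian breakpoints, and the lower-bounding mindist. The breakpoint values are the standard ones from the SAX literature; fixing the alphabet at four symbols is a simplification for illustration.

```python
import numpy as np

BREAKPOINTS = [-0.67, 0.0, 0.67]   # split N(0, 1) into 4 equiprobable areas
SYMBOLS = "abcd"

def paa(values, w):
    """Piecewise aggregate approximation: means of w equal-size chunks."""
    return np.array([c.mean() for c in np.array_split(values, w)])

def sax_word(curve, w):
    """Convert a curve into a SAX word of length w."""
    z = (np.asarray(curve, dtype=float) - np.mean(curve)) / np.std(curve)
    return "".join(SYMBOLS[int(np.searchsorted(BREAKPOINTS, c))]
                   for c in paa(z, w))

def symbol_dist(a, b):
    """SAX lookup-table distance; zero for identical or adjacent symbols."""
    i, j = sorted((SYMBOLS.index(a), SYMBOLS.index(b)))
    return 0.0 if j - i <= 1 else BREAKPOINTS[j - 1] - BREAKPOINTS[i]

def mindist(word_q, word_p, n):
    """Lower-bound distance between two equal-length SAX words for curves
    of original length n, scaled by sqrt(n / w) as in the formula above."""
    w = len(word_q)
    return np.sqrt(n / w) * np.sqrt(
        sum(symbol_dist(a, b) ** 2 for a, b in zip(word_q, word_p)))
```

Because mindist lower-bounds the true Euclidean distance, word comparisons can safely prune candidate windows before any exact distance is computed.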

Third Approach: HMM-Viterbi Algorithm. In the last approach, the first curve is compared with the second, based on both the quality of individual feature matches and the smoothness of the overall mapping across all points. This problem is formulated using an HMM, where the example curve Q forms the chain-graph nodes and the keyframes of the motion curve P are the states that each node on the first curve can take (denoting matches). The model parameters consist of the initial state distribution I, the unitary emission matrix E, and the pairwise transition matrix T.

The emission matrix E(P × Q) defines how likely it is that the qth example node in the graph takes the pth motion-curve label, based on similarity between the features. For each point p(x) in the motion curve and q(x) in the example curve, a feature vector is defined based on curve characteristics. For example, for P,

φ(p(x)) = (p(x), p′(x), p″(x))ᵀ.

The aim is to match each point in the example curve to a point in the motion curve that is similar. The distance between the mathematical representations of both points in terms of features defines their similarity:

d(p(i), q(j)) = ||φ(p(i)) − φ(q(j))||,

where i ∈ {1 ... pm} and j ∈ {1 ... qn}.

The pairwise transition matrix T(P × P) defines how likely it is for two neighboring nodes to take one particular combination of two state labels. Without temporal smoothness, correspondences can still be determined, but using it helps us define smoothly varying, monotonic curve correspondences. The initial state distribution I is constrained to always start with the first segment: I = [1, 0, 0, ..., 0]. To account for the fact that the two curves have different overall scales (similar to Keogh's approach8), we devised a normalization factor λ, based on the ratio of overall lengths, as a constant:

λ = Σ_{i=1..P} (p(i) − p(i+1))² / Σ_{j=1..Q} (q(j) − q(j+1))².

If T(i, j, k) is the cost of going to state k in node i + 1 when the process is in state j in node i, the transition matrix takes the following form:

T(i, j, k) = (p(j) − p(k))² − λ(q(i) − q(i+1))² if k > j, and ∞ if k ≤ j.

In this implementation, we are interested in finding the most probable correspondence assignment (between the motion and example curves) such that the correspondences are smooth and monotonic in time. Thus, T is right triangular, and the rest of the elements have infinite cost. Such an HMM solved with the Viterbi algorithm is similar to a dynamic time warping (DTW) formulation. Here the parameters are set empirically, but in the presence of more data, they could be learned in a maximum-likelihood framework.
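A dynamic-programming sketch of this Viterbi search follows, using finite differences for the feature vector φ = (p, p′, p″) and the transition cost T defined above. Treating the costs additively (rather than as log-probabilities) and assuming the motion curve has at least as many samples as the example segment are simplifications.

```python
import numpy as np

def features(curve):
    """phi(p(x)) = (p, p', p''): value plus finite-difference derivatives."""
    c = np.asarray(curve, dtype=float)
    d1 = np.gradient(c)
    d2 = np.gradient(d1)
    return np.stack([c, d1, d2], axis=1)

def viterbi_match(example, motion):
    """Most probable monotonic assignment of motion indices to example
    keyframes, minimizing emission plus transition cost (DTW-like).
    Assumes len(motion) >= len(example) so a strict path exists."""
    example = np.asarray(example, dtype=float)
    motion = np.asarray(motion, dtype=float)
    q, p = features(example), features(motion)
    # Normalization factor lambda: ratio of summed squared differences.
    lam = np.sum(np.diff(motion) ** 2) / np.sum(np.diff(example) ** 2)
    n, m = len(example), len(motion)
    emit = np.linalg.norm(q[:, None, :] - p[None, :, :], axis=2)  # d(p, q)
    cost = np.full((n, m), np.inf)
    back = np.zeros((n, m), dtype=int)
    cost[0, 0] = emit[0, 0]             # I = [1, 0, ..., 0]: start at state 0
    for i in range(1, n):
        # T(i, j, k) = (p(j) - p(k))^2 - lam * (q(i) - q(i+1))^2 for k > j
        trans = (motion[None, :] - motion[:, None]) ** 2 \
                - lam * (example[i - 1] - example[i]) ** 2
        for k in range(1, m):
            prev = cost[i - 1, :k] + trans[:k, k]  # right triangular: j < k
            j = int(np.argmin(prev))
            cost[i, k] = prev[j] + emit[i, k]
            back[i, k] = j
    path = [int(np.argmin(cost[-1]))]   # cheapest final state
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]                   # motion index matched to each keyframe
```

Because T is infinite for k ≤ j, the recovered path is strictly monotonic in time, which is what makes this formulation behave like DTW.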


Curve Warping
Once the best match is found, the motion curve is warped. Each subsequence r* of the motion curve is manipulated such that the motion curve segment r* matches the corresponding example curve segment qs as closely as possible. This can be written as a warping function w with the given parameter θ:

min_θ d(w(p(r*), θ), qs).

To ensure a smooth transition between two successive segments, endpoint constraints are defined as

p(r*) = w(p(r*), θ) and p(r* + n(s) − 1) = w(p(r* + n(s) − 1), θ),

and endpoint tangent constraints are defined as

p′(r*) = w′(p(r*), θ) and p′(r* + n(s) − 1) = w′(p(r* + n(s) − 1), θ).

These constraints enforce C1 continuity at the joining points.

A simplified approach is to compute the difference of the values on the y axis of each point in the matching subsegments qs and r*. The deviation of the values in r* from the values in qs defines how much the motion curve has to be warped (implemented similarly to Poisson interpolation). All points in the motion curve that are not part of a matching pattern are deleted.
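As one possible reading of this simplified warp, the sketch below offsets the matched motion window toward the example segment and subtracts a linear ramp through the endpoint deltas, so the endpoints (and thus the joins) stay fixed. The linear falloff is an assumption; the article specifies only the endpoint and tangent constraints and the Poisson-like interpolation, and this sketch does not enforce the tangent constraints.

```python
import numpy as np

def warp_subsequence(motion, r, example_segment):
    """Warp motion[r : r + n] toward the example segment while keeping
    the window endpoints fixed, for smooth joins between segments."""
    motion = np.asarray(motion, dtype=float)
    seg = np.asarray(example_segment, dtype=float)
    n = len(seg)
    window = motion[r:r + n]
    delta = seg - window                       # y-axis deviation per point
    t = np.linspace(0.0, 1.0, n)
    ramp = delta[0] * (1 - t) + delta[-1] * t  # cancels the endpoint deltas
    out = motion.copy()
    out[r:r + n] = window + (delta - ramp)     # endpoints remain unchanged
    return out
```

In a full implementation, the tangent constraints would additionally be imposed on the in/out angles of the surviving keyframes.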
Evaluation
For our evaluation, we chose a stylized female character that resembles characters from top animation studios and features the blendshapes and squash-and-stretch functionality needed to create convincing cartoon animation. For all our experiments, the lip-syncing was directly derived from the motion-capture animation and not altered by our approach. We chose this approach because lip motion is complex and is controlled by multiple blendshapes and controllers; to synthesize animation of this level of complexity, we would need a much larger database. Furthermore, the use of automated lip-sync software (from audio) is already a commonplace approach to speeding up animation production of lip movement, so we do not attempt to tackle this particular issue here.

Animation Database
For testing, we commissioned a small database (just 1 minute) of high-quality artist-created animation. The cost of commissioning this database was approximately €6,000, due to the amount of time required for the animators to create it. The expense of this small animation database highlights the need for more cost-effective production of cartoon facial animation and data reuse. Ideally, our approach would be used in a production setting, where high-quality pregenerated animation would be readily available.

The database consisted of two emotional 30-second monologues (anger and happiness), for which we used a professional female actor and recorded the audio. The professional 3D character animator we commissioned created high-quality cartoon animations in his own style following the provided audio clips.

Motion-Captured Sequences
To test our approach, we generated a set of motion-capture sequences that were used as input. To provide expressive content, the dataset consisted of animations that convey anger and happiness, two of the six basic emotions in Paul Ekman's classification of emotional speech.10 We chose a sentence for each emotion from previously validated lists of affective sentences for emotion identification through content.11,12

A state-of-the-art video-based head-mounted camera system was used to capture the professional female actor's facial movements while she acted out the two sentences. The motion capture was then applied to the character rig using the real-time, example-based retargeting software Faceware (facewaretech.com). Faceware analyzes the video input from the head-mounted camera and automatically tracks 37 feature points (five for each brow, four for the outer eye, two for the pupil, 14 for the outer lip, and 12 for the inner lip). The motion is then retargeted to the user-defined controls (including 14 facial blendshape controllers and additional controllers for the eyes, head, neck, and chest). The final curves are the xyz translations and rotations of these controllers over time. From the recordings, the best interpretations were hand-picked and were each approximately 4 seconds long.

Synthesized Sequences
For the synthesized motions, the motion-captured sequences were used as the "unseen input," and the algorithms searched the animation database for matches. The three proposed approaches (distance measure, SAX, and HMM) were implemented as a plug-in for Autodesk Maya using the Maya Python API 1.0 interface.

Keyframe Reduction
The aim of our approach is to synthesize animation curves that reflect the keyframing and animation style of an animator. Keyframe reduction is achieved by adjusting the motion-capture curve to the patterns found in the hand-created animations. Both the motion-captured and synthesized animations were approximately 4 seconds long, at 30 frames per second.

Figure 3 shows the average number of keyframes for the motion-captured curve and the three different pattern-matching methods. Because we used retargeting software, the number of keyframes differs slightly among the different motion-capture curves. All the tested methods achieved a keyframe reduction of approximately 90 percent relative to the motion-capture curves. The number of keyframes depends on the matching patterns found and therefore varies among the different methods. By reducing the number of keyframes, we are able to transform the formerly difficult-to-edit motion-capture curves into curves that are much easier for an animator to refine in the final polishing stage.

Figure 3. Average number of keyframes for a 4-second animation clip for the motion-captured curve (145.21 keyframes) and the three different pattern-matching methods (between 11.93 and 17.88 keyframes). All the tested methods achieved a keyframe reduction of approximately 90 percent.

Perceptual Tests
To evaluate how the synthesized results (produced by applying the three different methods) compared with the animations created by the artists and the original motion capture, we conducted a set of user studies. The stimuli set consisted of 20 animations: two examples each of the five animation types (artist-created, motion capture, distance measure, SAX, and HMM) portraying both the anger and happiness emotions.

Experiment 1: Unrefined
Our first experiment consisted of three blocks displaying animations in random order with three repetitions. The first two blocks showed one emotion at a time without sound. The sound was muted for this part of the experiment to avoid influencing the participants' judgment of the character's motion. The third block showed all animations, with audio, in random order. To avoid ordering effects, the first group viewed the angry animations first, followed by the happy animations, and the second group viewed them in reverse order.

In total, the participants were presented with 10 animations (5 animation types × 2 examples × 1 emotion) in blocks one and two and 20 animations (5 animation types × 2 examples × 2 emotions) in block three, with three repetitions in each block. University ethical guidelines were followed as we administered this experiment. Eighteen participants (seven female and 11 male), aged 19 to 30, took part in the experiment, which lasted approximately 20 minutes. The participants were not trained animators, had different disciplinary backgrounds, were recruited mainly from the university student and staff mailing lists, and were all unaware of the intention of the experiment.

After the participants read and signed the consent form, they viewed each animation in the first two blocks and were asked to answer three questions after each clip:

■ How expressive was the virtual character in portraying the emotion? The participants were asked to base their decision on the strength of their impression of the character's indicated emotion, using a seven-point Likert scale, where 1 was not expressive at all and 7 was extremely expressive.
■ How appealing did you find the virtual character's motion? Participants were advised to base their decision on the appeal of the character's overall motion, using a seven-point Likert scale, where 1 was not appealing at all and 7 was extremely appealing.
■ How cartoony did you find the virtual character's motion? Cartoony was described to the participants as a highly stylized cartoon animation, similar to what would be seen in animated movies, with highly exaggerated and animated character facial expressions. In their answers, the participants used a seven-point Likert scale, where 1 was not cartoony at all and 7 was extremely cartoony.

The participants used the number keys on the keyboard to input their ratings.

After completing the first two blocks, the participants completed the third block and then were asked a final question:

■ How good was the facial expression at conveying the message? The participants were asked to concentrate on their overall impression of the character's facial and head movement and how well it conveyed what the character was trying to say. Again, the participants used a seven-point Likert scale, where 1 was not good at all and 7 was extremely good.

Because we used the motion-captured lip movements for all the clips, we asked the participants not to focus on how well the lip movements matched the audio.

For the analysis, we averaged the data over the two examples and the three repetitions. A repeated-measures analysis of variance (ANOVA) was conducted for each question with animation type (artist-created, motion capture, distance measure, SAX, and HMM) and emotion (anger and happiness) as within-subject factors, as well as participant sex as a between-subject factor.


Table 1. Main results of experiment 1, unrefined.

| Variable | Effect | F-test | Post hoc analysis |
|---|---|---|---|
| Expressiveness | Animation type | F(4, 64) = 36.633, p ≈ 0 | Artist-created animations were rated most expressive; synthesized and motion-capture animations were similar. |
| | Emotion | F(1, 16) = 8.434, p = 0.0103 | Angry emotions were more expressive than happy. |
| | Animation type × emotion | F(4, 64) = 12.992, p ≈ 0 | Artist-created and synthesized HMM angry animations had similar ratings. |
| Appeal | Animation type | F(4, 64) = 21.588, p ≈ 0 | Synthesized animations were relatively appealing, but less so than artist-created animations. |
| Cartooniness | Animation type | F(4, 64) = 13.640, p ≈ 0 | Artist-created animations were rated most cartoony. |
| | Animation type × participant sex | F(4, 64) = 4.109, p = 0.005 | Female participants rated the HMM and artist-created animations similarly. |
| | Animation type × emotion | F(4, 64) = 6.166, p = 0.0003 | HMM was rated best at synthesizing the animation style from the artist-created animations. |
| Expression of the message | Animation type | F(4, 64) = 46.031, p ≈ 0 | Artist-created and motion-capture animations were rated better at conveying the message than synthesized animations. |
| | Emotion | F(1, 16) = 18.335, p = 0.001 | Anger was better expressed than happiness. |
| | Animation type × emotion | F(4, 64) = 2.916, p = 0.028 | Motion capture, distance measure, and HMM were rated similarly for angry animations; synthesized happy animations were rated the lowest. |

For all post hoc analyses, Newman-Keuls tests were conducted for a comparison of means. Table 1 summarizes all the significant main effects.

The post hoc analysis showed that the artist-created animations were rated by far the most expressive (p < 0.0002 for all). Among the other animation types, the results show only a significant difference between the distance measure and the HMM method (p = 0.0148), which indicates that the synthesized animations using HMM were perceived as more expressive than the synthesized animations using the distance measure. In general, however, motion capture and all three synthesized animations were found to be similarly expressive.

We also found angry emotions to be more expressive than happy (p = 0.0086). An interaction between animation type and emotion shows the differences between the expressiveness ratings in more detail. Post hoc tests showed that there was no difference in the ratings of the artist-created and synthesized HMM animations for the angry emotion and that both differed significantly from the other animations (p < 0.012), indicating that the unrefined synthesized angry animations using the HMM method were as expressive as the artist-created animations. On the other hand, for the happy emotion, the artist-created and motion-capture animations were significantly more expressive than the synthesized animations (p < 0.036 for all).

The artist-created animations received the highest ratings (mean 5.25) in terms of appeal (p < 0.0004 for all). The motion-capture (mean 4.49) and distance measure (mean 4.2) animations received similar ratings, whereas the SAX (mean 3.55) and HMM (mean 3.67) animations received the lowest ratings. No other main effects or interactions were found. This implies that participants found the unrefined synthesized animations relatively appealing, but less so than the artist-created animations. This is unsurprising because the animations were generated directly from the algorithm and had not yet undergone a final polishing by an animator.

The artist-created animations were also perceived as the most cartoony (p < 0.0002 for all). However, this was more true for male participants; the female participants rated the unrefined HMM animations as highly as the artist-created ones on cartooniness. An interaction was also found between animation type and emotion (see Figure 4).

As seen with expressiveness, the post hoc analysis shows no difference between the high ratings of the artist-created and HMM-synthesized angry animations for cartooniness, indicating that the HMM method was the best at synthesizing the animation style from the artist-created animations. However, for the happy emotion, the ratings for the artist-created animations were significantly different from all other animations (p < 0.0001 for all).

Lastly, our results show that anger was better expressed than happiness (p = 0.001). The artist-created and motion-capture animations were found to be significantly better at conveying the message than the synthesized animations (p < 0.003 for all). An interaction was found between animation type and emotion, indicating that

artist-created angry and happy animations were the best at conveying the message (p < 0.0002 for all). No difference was found among the motion-capture, distance measure, and HMM angry animations. However, the analysis of the happy animations does show a significant difference between the high ratings of the motion-capture animations and the low ratings of the synthesized animations (p < 0.005).

Figure 4. Cartooniness ratings for the unrefined experiment. The bars show the cartooniness ratings averaged over all the participants for each emotion.

Thus, although in general the hand-created artist animations received significantly better results, our unrefined HMM approach was found to be as expressive and cartoony for the angry animation. The synthesized animations were not intended to be final, polished animations for production, whereas the artist-created and motion-captured animations were. We therefore tested whether a quick polishing of the animations by a professional animator would improve the ratings of the synthesized animations.

Experiment 2: Refined
To investigate whether the results for the synthesized animations could be easily improved, we asked a professional animator to refine a subset of the synthesized animations. For refinement, we chose the animations that were rated the best in the unrefined experiment. In total, we asked the animator to refine five animation clips (two happy and three angry synthesized animations) and fix artifacts, improving the overall animation. The animator worked for approximately 30 minutes on each animation clip, making small refinements. The mouth animation was excluded from the refinement because we used the motion-capture lip movements for all the clips.

Next, we conducted a user study with 14 of the participants who took part in the previous experiment (six female and eight male). The participants were asked to rate the refined animations using the same set of questions. The stimuli for this experiment consisted of the five refined animations: one angry animation synthesized using the distance measure, two angry animations synthesized using HMM, one happy animation synthesized using the distance measure, and one happy animation synthesized using HMM. As before, the animations were shown in three blocks in random order with three repetitions. The experiment lasted approximately 15 minutes.

For the analysis, we averaged the data over the three repetitions and the two examples. To compare the new ratings with the ratings of the previous experiment, a two-way repeated-measures ANOVA was conducted on the data with animation type (artist-created, motion capture, distance measure, SAX, HMM, distance measure refined, and HMM refined) and emotion (anger and happiness) as within-subject factors. Table 2 summarizes all significant main effects.

Table 2. Main results of experiment 2, refined.

| Variable | Effect | F-test | Post hoc analysis |
|---|---|---|---|
| Expressiveness | Animation type | F(6, 72) = 10.123, p ≈ 0 | No significant improvement. |
| | Animation type × emotion | F(6, 72) = 5.7226, p = 0.00007 | Angry HMM animations were rated similarly to artist-created animations; refined happy animations using the distance measure were rated higher than unrefined animations. |
| Appeal | Animation type | F(6, 72) = 11.364, p ≈ 0 | No improvement. |
| Cartooniness | Animation type | F(6, 72) = 6.881, p = 0.00001 | Artist-created animations were rated significantly more cartoony. |
| | Animation type × emotion | F(6, 72) = 5.521, p = 0.00009 | Refined angry HMM animations were rated as cartoony as artist-created animations. |
| Expression of the message | Animation type | F(6, 72) = 14.201, p ≈ 0 | Refined animations conveyed the message as well as motion-capture animations. |
| | Animation type × emotion | F(6, 72) = 5.202, p = 0.0002 | Refined happy animations using the distance measure were rated similarly to artist-created animations. |

In terms of expressiveness, we found a small but not significant improvement in the ratings for the refined animations.



Figure 5. Expressiveness ratings for the refined experiment. The bars show the expressiveness ratings averaged over the participants for each emotion.

As before, the artist-created animations were perceived as the most expressive (p < 0.0003 for all). An interaction between animation type and emotion provides a more detailed analysis. For the angry animations, no difference was found between the artist-created and the unrefined and refined synthesized HMM animations. For the happy animations, the refined synthesized animations using the distance measure were rated significantly higher than the unrefined animations (p < 0.028), as Figure 5 shows. Therefore, there was an improvement in expressiveness for the refined animations.

Our results show the refinement did not improve the appeal of the animations. The two refined animations were rated as high on appeal as the motion-capture animations, but they were therefore still also equal to the unrefined animations.

The artist-created animations were still found to be significantly more cartoony than any other animation (p < 0.0004 for all). An interaction was found between animation type and emotion, where post hoc analysis shows that the angry refined animations using HMM were found to be as cartoony as the artist-created animations (p > 0.065). Both refined happy animations were perceived as less cartoony than the artist-created animations (p < 0.0004 for both). No difference was found between the unrefined and refined angry and happy animations, implying that the refinement did not improve cartooniness.

Lastly, the artist-created animations still received the best ratings in terms of the expression of the message (p < 0.0005 for all). However, the refined animations were found to convey the message as well as the motion-capture animations (p > 0.212) and significantly better than the unrefined animations (p < 0.005 for all). Furthermore, the interaction between animation type and emotion shows that the refined happy animation using the distance measure received ratings similar to the artist-created animations (p = 0.469) and significantly better than the unrefined animations (p = 0.0002). The happy animations using HMM were only found to be similar to the unrefined distance measure and motion-capture animations (p > 0.084). For the angry animations, we found no differences between the unrefined and refined motion-capture animations.

Our comparison of the three pattern-matching methods revealed no significant differences. They achieved good results in keyframe reduction and were rated as similar in terms of quality in our user studies. The functional difference between the methods is their similarity search. The distance measure is the most accurate due to the point-to-point distance calculation. When using SAX, it is important to find the correct combination of the number of segments representing the curve and the number of SAX symbols; a high abstraction of the curves can result in less suitable matching patterns. HMM was found to be the best at warping the motion-capture curve to the artist-created curve by finding the best-matching pairs of keyframes.

The results of our perceptual tests showed that the unrefined animations created by our approach received good ratings in terms of expressiveness, cartooniness, and appeal for the animations with an angry emotion. With some minor refinement by a professional animator, we achieved significantly better results for the expressiveness and expression of the message for the happy animations.

The general trends in research show that methods such as the one shown here (and those in other, more complex machine-learning applications) do generalize and perform better, but the choice of model and the amount of data are crucial for learning. In our tests, our database consisted of just one minute of artist-created animation, and our results were still good. With more publicly available data, the methods and results should only improve.

One drawback of our approach is that the resulting synthesized animations sometimes contain artifacts. Processing curves individually means that the overall animation can exhibit uncoordinated movements or unnatural timing. However, these artifacts are generally easy for an animator to edit, remove, and refine in the final polishing stage. In future work, we plan to extend our method to handle curves jointly.

In the future, we could also create a more sophisticated curve-segment dictionary to improve the accuracy and matching speed without increasing the memory constraint. This could be achieved by segmenting the curves based on salient points or using a graph simplification method1 to optimize the set of keyframes before segmentation. Going forward, we would like to extend this work to use machine learning on more data and a larger range of emotions to test generalizability. Recent deep-learning techniques can exploit large amounts of data, even with annotations on only a small portion. While an initial unsupervised learning algorithm learns features, a subsequent supervised phase could learn to perform a task such as curve mapping.

Acknowledgements
This research was funded by Science Foundation Ireland under the ADAPT Center for Digital Content Technology (grant 13/RC/2106) and the Cartoon Motion Project (12/IP/1565). We thank the Mery Project for the character model and rig.

References
1. Y. Seol et al., "Artist Friendly Facial Animation Retargeting," ACM Trans. Graphics, vol. 30, no. 6, 2011, article no. 162.
2. J. Lasseter, "Principles of Traditional Animation Applied to 3D Computer Animation," Proc. 14th Ann. ACM Conf. Computer Graphics and Interactive Techniques (Siggraph), 1987, pp. 35–44.
3. J. Wang et al., "The Cartoon Animation Filter," Proc. ACM SIGGRAPH 2006 Papers, 2006, pp. 1169–1173.
4. J.-Y. Kwon and I.-K. Lee, "Exaggerating Character Motions Using Sub-Joint Hierarchy," Computer Graphics Forum, vol. 27, no. 6, 2008, pp. 1677–1686.
5. J.-H. Kim et al., "Anticipation Effect Generation for Character Animation," Advances in Computer Graphics, 2006, pp. 639–646.
6. E. Jain, Y. Sheikh, and J. Hodgins, "Leveraging the Talent of Hand Animators to Create Three-Dimensional Animation," Proc. 2009 ACM SIGGRAPH/Eurographics Symp. Computer Animation, 2009, pp. 93–102.
7. J.-J. Choi, D.-S. Kim, and I.-K. Lee, "Anticipation for Facial Animation," Proc. 17th Int'l Conf. Computer Animation and Social Agents, 2004, pp. 1–8.
8. E. Keogh, "Fast Similarity Search in the Presence of Longitudinal Scaling in Time Series Databases," Proc. 9th Int'l Conf. Tools with Artificial Intelligence (ICTAI), 1997, pp. 578–584.
9. J. Lin et al., "A Symbolic Representation of Time Series, with Implications for Streaming Algorithms," Proc. 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, pp. 2–11.
10. P. Ekman, "An Argument for Basic Emotions," Cognition & Emotion, vol. 6, nos. 3–4, 1992, pp. 169–200.
11. B.M. Ben-David, P.H. van Lieshout, and T. Leszcz, "A Resource of Validated Affective and Neutral Sentences to Assess Identification of Emotion in Spoken Language after a Brain Injury," Brain Injury, vol. 25, no. 2, 2011, pp. 206–220.
12. O. Martin et al., "The eNTERFACE'05 Audio-Visual Emotion Database," Proc. 22nd IEEE Int'l Conf. Data Eng., 2006; doi:10.1109/ICDEW.2006.145.

Kerstin Ruhland is a PhD student at Trinity College Dublin. Her research interests include realistic facial and eye animation. Ruhland has a diploma in computer science from the Ludwig Maximilian University of Munich. Contact her at [email protected].

Mukta Prasad is an adjunct assistant professor of creative technologies at Trinity College Dublin and a research lead at Daedalean AG (an autonomous flight startup). Her research interests include computer vision, machine learning, and graphics. Prasad has a PhD in computer vision from the University of Oxford. Contact her at [email protected].

Rachel McDonnell is an assistant professor of creative technologies in the School of Computer Science and Statistics at Trinity College Dublin. Her research interests include computer graphics, character animation, and perceptually adaptive graphics, with a focus on emotionally expressive facial animation and perception. McDonnell has a PhD in computer science from Trinity College Dublin. Contact her at [email protected].