

Article

Comparative Study of Movie Shot Classification Based on Semantic Segmentation

Hui-Yong Bak 1 and Seung-Bo Park 2,*

1 Department of Mechatronics Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea; [email protected]
2 Department of Software Convergence Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea
* Correspondence: [email protected]; Tel.: +82-32-860-8831

Received: 22 April 2020; Accepted: 12 May 2020; Published: 14 May 2020

Abstract: The shot-type decision is a very important pre-task in movie analysis because the chosen shot type carries a great deal of information, such as the emotions and psychology of the characters and spatial information. In order to analyze a variety of movies, a technique that automatically classifies shot types is required. Previous shot type classification studies have classified shot types by the proportion of the face on-screen or by using a convolutional neural network (CNN). Studies that classify shot types by the proportion of the face on-screen cannot classify a shot if no person is on the screen. A CNN classifies shot types even in the absence of a person on the screen, but certain shots still cannot be classified because, instead of semantically analyzing the image, the method classifies them only by the characteristics and patterns of the image. Therefore, additional information is needed to access the image semantically, which can be done through semantic segmentation. Consequently, in the present study, the performance of shot type classification was improved by preprocessing the frames extracted from the movie with semantic segmentation. Semantic segmentation approaches the images semantically and distinguishes the boundary relationships among objects. The representative technologies of semantic segmentation include Mask R-CNN and Yolact. A study was conducted to compare and evaluate performance using these as preprocessing for shot type classification. As a result, the average accuracy of shot type classification using frames preprocessed with semantic segmentation increased by 1.9%, from 93% to 94.9%, compared with shot type classification using frames without such preprocessing. In particular, when using ResNet-50 and Yolact, the classification of shot type showed a 3% performance improvement (to 96% accuracy from 93%).

Keywords: shot type classification; semantic segmentation; CNN; Mask R-CNN; Yolact

1. Introduction

In films, movie shot types are classified based on the distance between the camera and the subject, and the general types of shots are the close-up shot, the medium shot, and the long shot [1,2]. Among them, close-up shots are used for expressing the emotions and psychology of the characters, with the subject occupying most of the screen. As shown in Figure 1a, emotion or psychology is expressed with the character's eyes, mouth, and facial muscles by making the character's face occupy most of the screen [3]. In medium shots, a portion of the character's body below the waist or elbow is located at the bottom of the screen. Medium shots are also used to express a character's gaze direction and movement, since the character's body above the waist appears on the screen [3]. Figure 1b is an example of the medium shot, which shows the character's gaze direction, motion, and conversation partner. In long shots, the subject occupies about one-sixth of the screen, giving the audience information about the place (inside or outside, in an apartment, a shop, a forest, etc.) and time (day, night, season) [3].

Additionally, the director uses the close-up, the medium shot, and the long shot alternately to appropriately place the flow of emotion in the film and the importance of the scene [4]. As such, the shots contain a lot of information, such as the arrangement and flow of emotion, the psychology of the characters, and the important scenes, so classifying the shot type is a very important pre-task in movie analysis.


Figure 1. Types of shots in the movie Veteran: (a) Close-up shot; (b) Medium shot; (c) Long shot.

Two representative studies of shot type classification include classifying shot types based on the proportion of the face occupying the screen and classifying shot types using a convolutional neural network (CNN). In one study that classified shot types from the proportion of the face on-screen, the accuracy of shot type classification was high for close-ups and medium shots (with faces in the frame), but the accuracy for long shots (without faces) was low. Also, shot type classification cannot be performed if there is no face on-screen. Shot type classification using the CNN is able to classify the shot type even if no one is on-screen, but a certain portion of the shots cannot be classified, owing to the absence of a semantic approach that people can grasp (that is, the image cannot be semantically analyzed). Therefore, in the present study, the performance of shot type classification was improved by preprocessing the frames extracted from the movie with semantic segmentation. Semantic segmentation approaches the images in order to distinguish boundary relationships among objects. The representative technologies of semantic segmentation include Mask R-CNN and Yolact. We conducted a study for comparative evaluation of the performance attained from using these as preprocessing for shot type classification.

The proposed approach automatically classifies close-up shots, medium shots, and long shots using a CNN and semantic segmentation, and the accuracy of shot type classification is enhanced by applying the boundary relationships of objects to the frames extracted from films using Mask R-CNN and Yolact. For the data, a total of 11,066 frames extracted from 89 films, such as Inception, were used.

Section 2 of this article explains the works related to the background knowledge of shot type classification technology. Section 3 explains the structure suggested for shot type classification and describes the experiment conducted and its results. Lastly, the results are summarized in Section 4.

2. Related Works

2.1. Shot Type Classification

The various studies that have classified shot types can be largely categorized into two methods: methods that use the face, and methods that use the CNN. The first method classifies shot types based on the proportion of the face that occupies the screen. In this type of study, although the accuracy of movie shot classification is high when there is a face in the frame, as shown in the two frames of Figure 2a, there is a disadvantage in that the accuracy of shot type classification is low when there is no face in the frame, as shown in the frame for Figure 2b. Also, if no person is on-screen, the shot type cannot be classified [5–8].



Figure 2. (a) Shots with faces; (b) a shot without a visible face.

The second type of study uses a CNN to classify close-up shots, medium shots, and long shots [1]. In one study using 400,000 frames extracted from 120 movies as data, learning was done using CNNs with AlexNet, GoogLeNet, and VGG-16 structures to check and compare accuracy [9–11]. Figure 3 is a confusion matrix of the experiment results, in which the shot types were classified at 74% accuracy with AlexNet, 75% accuracy with GoogLeNet, and 94% with VGG-16. Unlike the first study, this study can classify shot types even if there is no person on-screen. VGG-16, with the highest accuracy, is a structure using 16 network layers among the VGGNets. The VGGNet has a disadvantage, however, in that its performance decreases as the network layers increase, and ResNet solves this disadvantage [12].

Shot type classification using a CNN does not semantically analyze and classify images. Instead, it classifies shot types based on image characteristics and patterns. Therefore, additional information is needed in order to access the images semantically, and this can be done through semantic segmentation, which distinguishes the boundary relationships among objects.


Figure 3. Comparison of shot scale classification [1]: (a) AlexNet; (b) GoogLeNet; (c) VGG-16.

2.2. CNN Technology Used for Shot Type Classification

Recently, shot type classification studies using the CNN have been performed, and the most representative structures of the CNN are VGGNet and ResNet [11,12]. When classifying shot types only using a CNN, some types of shots cannot be classified because the method classifies the shot types based on image characteristics and patterns. Some approaches have been studied for shot type classification based on CNNs [13,14], and they have accomplished some advances. However, their approaches apply only to specific domains such as sport videos and music concerts. On the other hand, the purpose of our approach was to classify shots of general movies. Thus, we compare VGG-16 and ResNet-50 when applied to movies.

In our study, semantic segmentation was applied to improve on the existing studies using Mask R-CNN and Yolact. To understand these technologies, we discuss VGGNet, ResNet, and semantic segmentation in detail.

2.2.1. VGGNet

VGGNet was a study conducted to understand the influence of network layers on CNN performance [11,15]. To isolate the effect of the network layers, the experiments used the smallest filter size, 3 × 3, and tested 11, 13, 16, and 19 network layers. As a result, it was confirmed that the error rate was similar or worse when the performances of 16 network layers and 19 network layers were compared. Based on this, the researchers of VGGNet stopped experimenting with larger numbers of network layers. The structure of CNN that improved this disadvantage is ResNet, which is explained in Section 2.2.2. Figure 4 shows the structure of VGG-16, which uses 16 network layers among the VGGNets. As illustrated in Figure 4, VGGNet is a simple structure that consists only of convolution, max pooling, and full connection, so it is widely used based on its simplicity for understanding the structure and convenience in modifying and testing the structure. The previous shot type classification study using VGG-16 had 94% accuracy, but accuracy may change when applied to a different test group [1].
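As a brief illustration (our own, not part of the original experiments), the VGG-16 structure described above can be inspected directly with the standard Keras implementation mentioned later in Section 3.1; the printed summary shows that the network consists only of convolution, max pooling, and fully connected layers.

```python
# Build VGG-16 (16 weight layers with 3 x 3 convolutions) and list its layers.
from tensorflow.keras.applications import VGG16

model = VGG16(weights=None, include_top=True)  # random weights; no download needed
model.summary()
```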

Figure 4. Structure of VGG-16 [16].

2.2.2. ResNet

ResNet is a study conducted to investigate whether the performance of a CNN improves as the number of network layers increases. To find this, a comparison test was conducted for 20 layers and 56 layers. Figure 5 shows the experimental results. As shown in Figure 5, the experiment with 56 network layers had a higher error rate than the one with 20 network layers. In other words, the higher the number of network layers, the higher the error rate [12,17]. This is because gradient vanishing occurs as the number of network layers increases [18]. In general learning, a back-propagation method that finds the weights and biases that minimize the loss function value is performed by executing forward propagation, which obtains the loss function value through the neural network, and then calculating the inverse of forward propagation. Gradient vanishing is a phenomenon in which the gradient propagated backward gradually disappears, resulting in insufficient learning.

Figure 5. Results of a performance experiment based on network layer change [12,17].
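A small numeric illustration (our own, not from the cited works) makes the gradient vanishing described above concrete: when back-propagation multiplies many layer-wise derivatives smaller than one, the gradient reaching the early layers shrinks toward zero as the depth grows.

```python
# Toy illustration of gradient vanishing: multiplying a per-layer derivative of 0.25
# (the maximum slope of a sigmoid) across many layers drives the gradient toward zero.
local_derivative = 0.25
for depth in (20, 56):
    gradient_factor = local_derivative ** depth
    print(f"{depth} layers -> gradient factor ~ {gradient_factor:.3e}")
```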

To solve this, the residual block shown in Figure 6b was applied. In general, a CNN uses the plain block shown in Figure 6a and learns to find the H(x) that processes the input value x into the output y. As shown in Figure 6b, ResNet applies a skip connection that adds the input x to f(x). This performs learning while minimizing f(x) + x. Here, since x is a value that does not change, the learning proceeds in order to make f(x) return 0. Because there is a value for x that does not change during the progress of learning through back propagation, at least 1 always exists after performing differentiation. This solved the problem of gradient vanishing by creating a minimum slope for learning [12,17].


Figure 6. Structures of (a) the plain block; (b) the residual block [12,17].
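For illustration, the residual block of Figure 6b can be sketched with Keras layers as below; the filter count and kernel size are illustrative assumptions rather than the exact ResNet-50 configuration.

```python
# Sketch of a residual block: two weight layers compute f(x), and a skip connection
# adds the unchanged input x, so the block outputs f(x) + x.
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
from tensorflow.keras.models import Model

def residual_block(x, filters=64):
    shortcut = x                                   # skip connection carries x unchanged
    y = Conv2D(filters, 3, padding="same")(x)      # first weight layer of f(x)
    y = BatchNormalization()(y)
    y = Activation("relu")(y)
    y = Conv2D(filters, 3, padding="same")(y)      # second weight layer of f(x)
    y = BatchNormalization()(y)
    y = Add()([y, shortcut])                       # output = f(x) + x
    return Activation("relu")(y)

inputs = Input(shape=(56, 56, 64))
model = Model(inputs, residual_block(inputs))
```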

2.2.3. Semantic Segmentation

Semantic segmentation classifies which class each pixel of an image belongs to [19]. As shown in Figure 7, since the boundary of an object can be delineated by classifying its class at the pixel level, semantic segmentation allows the image to be accessed semantically. In this study, Mask R-CNN and Yolact were used for applying semantic segmentation.

Mask R-CNN combines the function to detect objects using a Faster R-CNN and the function to perform semantic segmentation using a fully convolutional network (FCN) [20–24]. Yolact predicts the mask coefficients for each instance, creates prototype masks for the entire image, and then combines the two linearly to identify the object and to set the object boundary [25–27]. Unlike Mask R-CNN, Yolact uses the full range of the image space without compressing the image, resulting in better semantic segmentation performance than Mask R-CNN [25].

Instance segmentation identifies different objects for each segment in more detail than semantic segmentation, but our study used semantic segmentation only, since it only required determining the object boundary relationships.

Figure 7. Example of semantic segmentation.
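The Yolact mask assembly described above (a linear combination of prototype masks weighted by per-instance coefficients, followed by a sigmoid) can be sketched as follows; the prototype resolution and the number of prototypes are illustrative assumptions.

```python
# Sketch of Yolact-style mask assembly for one detected instance.
import numpy as np

def assemble_mask(prototypes, coefficients):
    """prototypes: (H, W, k) prototype masks; coefficients: (k,) per-instance weights."""
    combined = prototypes @ coefficients           # linear combination over k prototypes
    return 1.0 / (1.0 + np.exp(-combined))         # sigmoid -> per-pixel probability

prototypes = np.random.rand(138, 138, 32)          # k = 32 prototypes (assumed size)
coefficients = np.random.randn(32)                 # coefficients predicted for one instance
mask = assemble_mask(prototypes, coefficients) > 0.5
```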


3. Comparison of Shot Type Classification Methods Based on Semantic Segmentation

3.1. Shot Type Based on Semantic Segmentation


This article proposed shot type classification using semantic segmentation and ResNet-50. As shown in Figure 8, semantic segmentation was applied to the frames extracted from the films in order to classify the boundary relationships among objects; ResNet-50 alone cannot semantically approach images. Additionally, semantic segmentation was preprocessed on the frames extracted from the movies in anticipation that the shot type and the surfaces of the objects were closely related. To apply semantic segmentation, Mask R-CNN and Yolact were used, as shown in Figure 9. We used the two in order to investigate in detail how the preprocessing with semantic segmentation affects the performance of shot type classification. Unlike Mask R-CNN, Yolact uses the full range of the image space without compressing the image, resulting in better semantic segmentation performance than Mask R-CNN [25]. We can see that Figure 9c, to which semantic segmentation was applied via Yolact, shows semantic segmentation performance superior to that of Figure 9b, for which semantic segmentation was applied via Mask R-CNN. After resizing the preprocessed frames to 224 × 224, close-up shots, medium shots, and long shots were classified using ResNet-50 as the CNN-based classifier in Figure 8.

The previous shot type classification studies used VGG-16 to classify shot types. In our study, ResNet, a deeper network than VGGNet, was used. In order to use a pretrained model in Keras, ResNet-50 was used from among the ResNets having various numbers of network layers. Keras is a deep learning library implemented in Python. Pretrained models were also used for Mask R-CNN and Yolact for semantic segmentation. The PC used for the experiment comprised an i7-7700K CPU, 32 GB of memory, and a GTX 1080Ti graphics card with 11 GB of memory.

Figure 8. Structure of the suggested semantic segmentation.
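A minimal sketch of the pipeline in Figure 8 is given below. It assumes a hypothetical segment(frame) helper that returns binary object masks from a pretrained Mask R-CNN or Yolact model; the mask overlay style, classifier head, and training settings are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Sketch: overlay segmentation masks on a frame, resize to 224 x 224, and classify the
# shot type (close-up / medium / long) with a ResNet-50-based classifier in Keras.
import numpy as np
import cv2
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

def preprocess(frame, segment):
    """segment(frame) is a hypothetical helper returning a list of boolean (H, W) masks."""
    overlaid = frame.astype(np.float32)
    for mask in segment(frame):
        color = np.random.randint(0, 255, 3).astype(np.float32)
        overlaid[mask] = 0.5 * overlaid[mask] + 0.5 * color   # blend each object region
    return cv2.resize(overlaid.astype(np.uint8), (224, 224))

# ResNet-50 backbone with a three-way softmax head for the shot types.
base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(3, activation="softmax")(x)
classifier = Model(base.input, outputs)
classifier.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```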

Figure 9. Regular frame examples of semantic segmentation applications: (a) Non-mask; (b) Mask R-CNN; (c) Yolact.

3.2. Experimental Results

In the experimental data, 11,066 frames were extracted randomly from 89 movies, such as Inception, and were classified by shot type through the ground truth operation. Of the 89 movies, 75 were used for training and validation, and the remaining 14 were used for testing. As shown in Table 1, 8679 frames were used for training and 1787 frames were used for validation, with 600 frames used to measure the accuracy of the shot type classification. The ratios of training data to validation data were set to about 82% and 15% for the close-up and long shots, and 85% and 15% for the medium shot, to provide enough training data.

Table 1. Number of Frames.

Shot Type    Training    Validation    Testing
Close-up     3586        782           200
Medium       2392        413           200
Long         2701        592           200
Total        8679        1787          600

In order to evaluate the accuracy of the shot type classifications, a general frame and a frame to which semantic segmentation was applied using Mask R-CNN and Yolact were classified using VGG-16 and ResNet-50. Afterward, the accuracies were evaluated and compared. As a result, we confirmed that the average accuracy of the shot type classification preprocessed with semantic segmentation was 1.9% higher than shot type classification using a general frame, as seen in the results presented in Table 2. Also, as shown in Table 3, shot type classification using ResNet-50 was more accurate than shot type classification using VGG-16.

Table 2. Average Accuracy of Shot Type Classification.

Without Semantic Segmentation    Semantic Segmentation-Based Classification Method
93%                              94.9%

Table 3. Accuracy of Shot Type Classification.

Classification Type    Without Semantic Segmentation    With Semantic Segmentation (Mask R-CNN Added)    With Semantic Segmentation (Yolact Added)
VGG-16                 92.5%                            94.8%                                            93.8%
ResNet-50              93.5%                            95.3%                                            96%

A conventional shot type classification study using a CNN classified shot types with 94% accuracy for close-ups, medium shots, and long shots using VGG-16 [1]. In the present study, when classifying close-ups, medium shots, and long shots using VGG-16 on general frames, the shot types were classified with 92.5% accuracy, as seen in Table 3. This was expected to occur due to the differences in the data. In this study, the method to classify the shot types with ResNet-50 after preprocessing for semantic segmentation using Yolact is illustrated as a confusion matrix in Figure 10 and showed the highest accuracy among the experimental results (96%). This classified shots with performance better than classifying shot types with only VGG-16 and ResNet-50, as done in the previous study.

Figure 10. Confusion Matrix of ResNet-50 with Yolact.
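The accuracy figures and the confusion matrix of Figure 10 can be reproduced from predicted labels as in the sketch below, assuming the label encoding 0 = close-up, 1 = medium, 2 = long; the prediction array here is a random placeholder.

```python
# Sketch: computing accuracy and the confusion matrix for the 600 test frames (Table 1).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.repeat([0, 1, 2], 200)            # 200 test frames per shot type
y_pred = np.random.randint(0, 3, 600)         # placeholder; use the classifier's argmax output

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
```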


The training time of ResNet-50 and Yolact, the configuration with the highest accuracy, was 1654 s, and this model took 66 s to classify the 600 test frames, approximately 0.11 s per frame. Therefore, this model is difficult to apply in real time to process 30 frames per second, which would require roughly 0.033 s per frame.

3.3. Discussion

The existing study of shot type classification performed classification with 94% accuracy using VGG-16 from among the structures of CNNs. The CNN does not classify shot types based on semantic analysis of frames in the way humans classify shot types, because it classifies shot types only based on image characteristics and patterns. To improve this, the present study applied preprocessing with semantic segmentation on the frames extracted from the films. This was to approach the frame after distinguishing the boundaries of the objects in the images. By separating the boundaries of the objects, the frames could be accessed semantically. Mask R-CNN and Yolact were used. The results of the experiment illustrate that shot type classification using ResNet-50 and Yolact had the highest accuracy among the experiments, with 96% accuracy. The average accuracy of shot type classification after preprocessing with semantic segmentation increased by 1.9% (to 94.9%) compared to shot type classification of normal frames. The reason shot type classification using ResNet-50 and Yolact is superior to shot type classification using ResNet-50 and Mask R-CNN is assumed to be Yolact's better semantic segmentation performance [25]. Additionally, the reason the performance of shot type classification after preprocessing with semantic segmentation is higher than that of shot type classification using frames without such preprocessing is that the frames were accessed semantically through semantic segmentation. As a result of analyzing the error rate of 4% in shot type classification by ResNet-50 and Yolact, we confirmed that there are frames in the ground truth task where even humans find it difficult to classify the shot type because two shot types are mixed. This portion accounts for a fraction of the 4% error rate, an example of which is shown in Figure 11. Figure 11 was classified as a close-up in the ground truth work but was classified as a medium shot by ResNet-50 and Yolact.

Figure 11. Frame with an Error in Shot Type Classification.

4. Conclusions

The shot type decision is a very important pre-task in movie analysis because each shot contains a lot of information, such as the emotions and the psychology of the characters and the space information. In order to analyze a large number of movies, a technique that automatically classifies shot types is required. Unlike previous studies, our study classified close-up shots, medium shots, and long shots using preprocessing with semantic segmentation and ResNet-50. Throughout this study, shot types were classified with an average accuracy of 94.9%, which is better than the average accuracy of 93% in a previous study without semantic segmentation preprocessing. In particular, when categorized with ResNet-50 and Yolact, the performance improved by 3% to 96% accuracy, which is superior to the 93% accuracy of the previous study. As a result of analyzing the 4% error rate of the shot type classification using ResNet-50 and Yolact, we found that a portion of the errors came from frames in the ground truth task where it was difficult even for humans to classify the shot type. To solve this, a further study will be conducted for shot type classification using additional information from each segment based on instance segmentation (which identifies different objects for each segment) instead of semantic segmentation.



Author Contributions: Data curation, H.-Y.B.; Software, H.-Y.B.; Validation, H.-Y.B. and S.-B.P.; Writing—original draft, H.-Y.B.; Writing—review and editing, S.-B.P. All authors have read and agreed to the published version of the manuscript.

Funding: This article is the result of a basic research project performed under the support of the National Research Foundation of Korea funded by the Korean government (Ministry of Education) in 2017 (No. NRF-2017R1D1A1B03036046).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Savardi, M.; Signoroni, A.; Migliorati, P.; Benini, S. Shot scale analysis in movies by convolutional neural networks. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2620–2624.
2. Katz, S.D.; Katz, S. Film Directing: Shot by Shot; Michael Wiese Productions: Studio City, CA, USA, 1991.
3. Thompson, R. Grammar of the Edit, 2nd ed.; Focal Press: Waltham, MA, USA, 2009.
4. Canini, L.; Benini, S.; Leonardi, R. Affective analysis on patterns of shot types in movies. In Proceedings of the 7th International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, 4–6 September 2011; pp. 253–258.
5. Cherif, I.; Solachidis, V.; Pitas, I. Shot type identification of movie content. In Proceedings of the 9th International Symposium on Signal Processing and Its Applications, Sharjah, UAE, 12–15 February 2007; pp. 1–4.
6. Tsingalis, I.; Vretos, N.; Nikolaidis, N.; Pitas, I. Svm-based shot type classification of movie content. In Proceedings of the 9th Mediterranean Electro Technical Conference, Istanbul, Turkey, 16–18 October 2012; pp. 104–107.
7. Marín-Reyes, P.A.; Lorenzo-Navarro, J.; Castrillón-Santana, M.; Sánchez-Nielsen, E. Shot classification and keyframe detection for vision based speakers diarization in parliamentary debates. In Conference of the Spanish Association for Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2016; pp. 48–57.
8. Baek, Y.T.; Park, S.-B. Shot Type Detecting System using Face Detection. J. Korea Soc. Comput. Inf. 2012, 17, 49–56. [CrossRef]
9. Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Neural Information Processing Systems: Nevada, MA, USA, 2012.
10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
11. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Lin, J.C.; Wei, W.L.; Liu, T.L.; Yang, Y.H.; Wang, H.M.; Tyan, H.R.; Liao, H.Y.M. Coherent Deep-Net Fusion to Classify Shots in Concert Videos. IEEE Trans. Multimed. 2018, 20, 3123–3136. [CrossRef]
14. Minhas, R.A.; Javed, A.; Irtaza, A.; Mahmood, M.T.; Joo, Y.B. Shot classification of field sports videos using AlexNet convolutional neural network. Appl. Sci. 2019, 9, 483. [CrossRef]

15. Jun, H.; Shuai, L.; Jinming, S.; Yue, L.; Jingwei, W.; Peng, J. Facial Expression Recognition Based on VGGNet Convolutional Neural Network. In Proceedings of the 2018 Chinese Automation Congress (CAC), Xi'an, China, 30 November–2 December 2018; pp. 4146–4151.
16. Muhammad, U.; Wang, W.; Chattha, S.P.; Ali, S. Pre-trained VGGNet Architecture for Remote-Sensing Image Scene Classification. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1622–1627.
17. Li, B.; He, Y. An Improved ResNet Based on the Adjustable Shortcut Connections. IEEE Access 2018, 6, 18967–18974. [CrossRef]
18. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 646–661.
19. Romera-Paredes, B.; Torr, P.H. Recurrent instance segmentation. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 312–329.
20. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
23. Almutairi, A.; Almashan, M. Instance Segmentation of Newspaper Elements Using Mask R-CNN. In Proceedings of the 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1371–1375.
24. Su, H.; Wei, S.; Yan, M.; Wang, C.; Shi, J.; Zhang, X. Object Detection and Instance Segmentation in Remote Sensing Imagery Based on Precise Mask R-CNN. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1454–1457.
25. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 9156–9165.
26. Ying, H.; Huang, Z.; Liu, S.; Shao, T.; Zhou, K. Embedmask: Embedding coupling for one-stage instance segmentation. arXiv 2019, arXiv:1912.01954.
27. Benbarka, N.; Zell, A. FourierNet: Compact mask representation for instance segmentation using differentiable shape decoders. arXiv 2020, arXiv:2002.02709.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).