Pattern Recognition 48 (2015) 2129–2140


A tree conditional random field model for panel detection in comic images

Luyuan Li a, Yongtao Wang a,*, Ching Y. Suen b, Zhi Tang a, Dong Liu a

a Institute of Computer Science and Technology, Peking University, No. 128 North Zhongguancun Rd., Beijing 100871, China
b Concordia University, 1455 de Maisonneuve Blvd., Montréal, Canada H3G 1M8

Article history: Received 14 May 2014; received in revised form 7 November 2014; accepted 15 January 2015; available online 22 January 2015.

Abstract: The goal of panel detection is to decompose a comic image into several panels (or frames), which is the fundamental step in producing digital comic books suitable for reading on mobile devices. The existing methods are limited in that they present the extracted panels as squares or rectangles and rely on a single type of visual pattern, so they are not generic enough to handle comic images with multiple styles or complex layouts. To overcome the shortcomings of the existing approaches, we propose a novel method to detect panels within comic images. The method incorporates three types of visual patterns extracted from the comic image at different levels, and a tree conditional random field framework is used to label each visual pattern by modeling its contextual dependencies. The final panel detection results are obtained from the visual pattern labels and a post-processing stage. Notably, the detected panels are presented as convex polygons in order to preserve their content integrity. Experimental results demonstrate that the proposed method achieves better performance than the existing ones.

Keywords: Comic image document; Panel detection; Probabilistic graphical model; Conditional random field; Context modeling

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Comics, defined as "juxtaposed pictorial and other images in deliberate sequence, intended to convey information and/or to produce an aesthetic response in the viewer" [1], have gained increasing popularity since their appearance in the 19th century. During the past few decades, a huge amount of comic publications has been produced all over the world. In this era, comics are inevitably drawn into digitization by technological advancements, and more and more people are willing to read scanned copies of comic books on cell phones or tablet PCs. In order to cater to these growing needs, the most fundamental and crucial step is to decompose the whole comic page into panels (i.e. panel detection) so that they can be viewed sequentially on mobile devices. Moreover, the detected panels can be further analyzed to extract features and drawing objects for content-based indexing and retrieval of comic documents.

Comics are popular almost throughout the world; Europe, China and Japan produce a large number of comic publications each year. Although the drawing styles among them are disparate, the comic structure is somewhat similar [2]. A comic book consists of several comic pages, which are further split into panels. Each panel consists of a single drawing depicting an important stage of the storyline. It contains drawings such as characters and sceneries, and texts (speech balloons, narrative boxes, etc.). Most panels have discernible enclosing boxes composed of line segments, while the others contain only drawing objects. At the same time, panels are in many cases separated from each other completely by blank spaces; in other cases, some drawings (such as characters or speech balloons) of one panel are drawn over the adjacent panels. Fig. 1 shows an example of a comic page containing the scenarios mentioned above. For some comic pages with rather complicated layouts, on the one hand, the panels are still approximately polygonal, which makes them easy for readers to discern; on the other hand, the severe adhesion between the frames makes the task of panel detection rather challenging.

In this paper, we present a novel tree conditional random field model for comic panel detection. Within comic pages, there are several kinds of basic elements that encode structural information about the page layout at different levels; we name them visual patterns in our work. Contextual interactions among these visual patterns are of great importance, and our model is able to capture these interactions within a well-constructed framework. To the best of our knowledge, we make the first attempt to model the contextual information within comic pages to address the panel detection task, whereas the existing methods are empirical rules designed for certain types of page layouts. We also give an explicit definition of the panels, which better preserves their content integrity.

* Corresponding author. Tel.: +86 10 82529542; fax: +86 10 62754532.
E-mail addresses: [email protected] (L. Li), [email protected] (Y. Wang), [email protected] (C.Y. Suen), [email protected] (Z. Tang), [email protected] (D. Liu).
http://dx.doi.org/10.1016/j.patcog.2015.01.011
0031-3203/© 2015 Elsevier Ltd. All rights reserved.

The next section of this paper reviews past research on panel detection and on image segmentation using structured probabilistic models. Section 3 highlights the importance of context modeling by integrating multiple visual patterns to address the task of panel detection. Our tree conditional random field formulation is presented and described in Section 4. In Section 5, we discuss the implementation details of this panel detection work, including potential and feature design, parameter estimation, model inference and post-processing. We test our method on multiple available datasets in Section 6, and conclusions as well as future works are drawn in Section 7.

2. Related works

2.1. Panel detection methods

Panel detection in comic images (i.e. automatic comic page segmentation) is a relatively new problem which started receiving attention after the publication mentioning manual panel labeling [3]. Several categories of approaches for panel detection in comic images have been proposed, which can be distinguished according to the type of basic elements extracted from the comic image. To be more specific, the existing methods can be classified into three types: division line based [4–7], connected component based [8,2,9] and polygon detection based [10,11].

2.1.1. Division line based methods

The division line based method, first proposed by Tanaka et al. [4], iteratively cuts the comic image by horizontal and vertical division lines until the sub-images cannot be divided anymore. At the end of the process, the comic image can be represented by a tree structure. Chan et al. [5] extend the work by focusing on processing color e-comic books consisting of multiple pages. The division line based method is further refined by Yusuke et al. [6], which weights the difference between the contour and other pixels during the division line detection process. Ishii et al. [7] extend Tanaka's work by utilizing the results of frame separation and frame corner detection. The division line based methods are intuitive, but they rely heavily on the results of division lines, which are difficult to acquire when the layout of the comic image becomes complex. Moreover, these methods cannot handle the blank margins in the comic image very well.

2.1.2. Connected component based methods

Arai et al. [8] propose another type of method for comic panel extraction. They define comic images in which no overlap of balloons and comic art exists as flat comics, and extract panels from these flat comics using a modified connected component labeling algorithm. Overlapped panels are separated by line segments. This method cannot handle the situation where multiple panels are agglomerated together by extended objects. Another connected component based method is proposed by Ngo et al. [2]. They first extract connected components from the comic image and then filter them according to their sizes. For the overlapped panels, morphological operations such as dilation and erosion are executed. Such an approach relies on many thresholds which are hard to optimize for multiple layouts. Moreover, the morphological operations mentioned in the literature are not very computationally efficient. Recently, Rigaud et al. [9] propose a frame extraction method for comic books. This method processes the book page by page and begins by acquiring Regions Of Interest (ROIs) through connected component generation. These ROIs are then classified as "noise", "text" and "frame" depending on their sizes and topological relations. However, some panels are composed of multiple ROIs, and contextual information is not exploited within the framework.

Fig. 1. A typical comic page layout, with panels with and without borderlines, extended drawing objects, extended speech balloons, text and characters annotated (Source: Toriyama, Dr. Slump, Shueisha, vol. 2, p. 20).

2.1.3. Polygon detection based methods

The polygon detection methods [10,11] treat page segmentation as a polygon detection problem which approximates the panels as quadrilaterals. To obtain the polygons, the first step is to detect straight line segments and then combine them to form complete polygons. A post-processing stage is needed to tackle the overlapped ones. Ref. [10] achieves this by designating three empirical rules. Ref. [11] refines the post-processing stage by proposing a line cluster finite-state machine to construct the quadrilateral polygons hierarchically. Such approaches make a strong assumption that the frames are all quadrilateral polygons enclosed by borderlines composed of straight line segments, and would therefore completely fail when the panels do not have enclosing borderlines.

2.1.4. Panel definition

Most of the existing methods do not give an explicit definition of the panels. Therefore, in this section, we formally define the panels in order to clarify the panel detection goals. In this paper, we present our detection results by defining a panel as a convex polygon, linked by a set of convex points, that encloses all the contents of the panel. Such a definition is based on the following observations:

- Some panels do not have enclosing borderlines.
- Even panels with borderlines do not have a strictly quadrilateral shape.

As shown in Fig. 2, our definition allows a better representation of the panels than that in [11], which is attributed to its ability to include all the contents belonging to a panel. Such a characteristic becomes more important in terms of content integrity and user experience when readers are reading ordered panels on mobile devices.

2.2. Image segmentation using structured probabilistic models

Image segmentation using structured probabilistic models such as conditional random fields has been widely studied during the

past decade, and several systems have been proposed for natural image segmentation [12], offline document segmentation [13] and graphics extraction from heterogeneous online documents [14,15]. Reynolds et al. [12] propose a system for detecting and segmenting generic object classes from natural images that organizes the different scales of superpixels in a novel way. The system combines the regions together into a hierarchical, tree-structured conditional random field, which applies a classifier to each node and then integrates the links between the nodes by inferencing the model. Ref. [13] describes an approach to extract physical and logical layouts in unconstrained handwritten letters using a Conditional Random Field model. In this approach, the extraction is considered as a labeling task consisting in assigning a label to each pixel of the image, and the proposed CRF model contains both the association between the pixels and labels as well as the relationship of one label with respect to its neighboring labels. More recently, Delaye et al. [14] introduce a solution for automatically segmenting diagrams and drawings from unstructured online documents. These documents are represented as a multi-scale tree and modeled as a hierarchical conditional random field to predict the detection of graphical elements at the stroke level. They then extend the system [15] by predicting the segmentation of online handwritten documents into multiple blocks, such as text paragraphs, tables and mathematical expressions besides the graphical objects, which are all modeled in a tree conditional random field.

Fig. 2. (a) Extracted panel in [11]. (b) Extracted panel in this work. (Source: Toriyama, Dragon Ball, Shueisha, vol. 34, p. 117).

3. Context modeling by integrating multiple visual patterns

Similar to conventional text documents, comic image documents contain several kinds of basic elements that encode important contextual information about the comic page layout, which we term visual patterns in our work. All the previous works mentioned in Section 2 concentrate on one kind of object extracted from the comic image (such as connected components, division lines, or borderline segments) and rely heavily on certain types of empirical rules. Therefore, they can only be applied to certain types of comic layouts. In order to fully exploit the comic layout information and thus process various types of comic layouts, we extract three types of visual patterns at different levels in this work, namely connected components, edge segments and line segments (as shown in Fig. 3).

3.1. Visual patterns

Connected component: Connected components of the foreground pixels are the building blocks of the comic images. Moreover, for most comic images with simple layouts, the contours of some extracted connected components can even perfectly represent the areas of the panels (in other cases, where the panels are overlapped by objects, they do not). Meanwhile, when a panel does not have surrounding borderlines, connected components become the sole source of contextual information for that panel. The final label results of all the visual patterns are used for separating the connected components containing multiple panels.

Before generating connected components, we first convert the input image into a binary image using Otsu's adaptive thresholding algorithm [16]. Connected component labeling has long been studied in terms of time and storage efficiency; using a connected component labeling algorithm such as [17] can significantly reduce the processing time.

As the extended objects of a panel sometimes tear the panel apart (meaning the content of a panel belongs to multiple connected components), we perform a preliminary clustering stage to assemble the connected components, in order to make sure that the contents of a panel belong to only one connected component. To achieve this, we use agglomerative clustering and set the distance threshold between the center points of two connected components to one-tenth of the diagonal of the whole comic image. After obtaining the clustered connected components, their morphological information is also calculated and stored, including the convex points of the connected components, the span on the x- and y-axes, and the center of mass.

Edge segment: Most panels within comic images are surrounded by enclosing borderlines. Therefore, we extract edge segments (chains of 8-connected edge pixels) as one kind of visual pattern within the framework, as also used by [18]. For a given grayscale comic image, the Canny edge points are obtained first. We calculate the gradient of the image (smoothed by a 5×5 Gaussian kernel with σ = 1.0) with the Sobel operator

(the aperture size is 3) and adaptively determine the high threshold Th and the low threshold Tl. Further, non-maximum suppression is performed on the pixels whose gradient magnitude is larger than Tl. At this point, the pixels can be classified into three types as follows: (1) p is a strong edge point if the gradient magnitude at p is a local maximum and larger than Th; (2) p is a weak edge point if the gradient magnitude at p is a local maximum and in the range of Tl to Th; (3) p is not an edge point in the other cases. Then we execute the chaining procedure among the obtained edge points to form each edge segment in two steps: (1) we pick the point with the largest gradient magnitude from the unsearched strong edge points as the starting point p0 to trace an edge segment; (2) we construct a list and add p0 to it as the tail, then iteratively add an unsearched edge point from the 8-neighbors of the tail to the list as the new tail, until none of the 8-neighbors of the tail is an unsearched edge point.

Line segment: The borderlines of the panels are composed of straight line segments in most cases. Therefore, the obtained edge segments can be further split into line segments, which provide crucial contextual information for panel detection as well. In order to split the edge segments into line segments, we first develop a straightness criterion for a chain of edge points. Then, for each given edge segment, we employ a top-down scheme and gradually verify its sub-chains from the longer ones to the shorter ones. The methods for detecting edge segments (searching and chaining edge points) and further splitting them into straight line segments can be further refined in terms of computational efficiency. We use the method from [19], where the process is described in greater detail.

Fig. 3. Example of the three visual patterns (connected components, edge segments and line segments) extracted from a panel (Source: Dragon Ball, vol. 1, p. 38).

3.2. Towards integrating multiple visual patterns

As stated in Section 3.1, we extract three kinds of visual patterns from the comic image, which offer a (albeit rather coarse) segmentation of the comic image at different levels. The obtained connected components are the 8-connected component clusters of the binary comic image. These connected components are under-segmented, so as to make sure that the contents of a panel belong to one connected component. Thus, each connected component may contain zero, one or more panels. In order to separate the under-segmented connected components containing multiple panels and thereby obtain the panel detection results, we need to find clues about whether and how to separate each connected component. An edge segment, a chain of 8-connected edge pixels, is in some cases a complete or incomplete contour of a panel, and can therefore behave differently from the contours of clustered connected components containing one or more panels. Therefore, such connections between connected components and edge segments are crucial when modeling context information. Meanwhile, as explained in Section 3.1, the line segments are obtained by splitting edge segments in a recursive fashion. For those edge segments lying on the contours of the panels, the split line segment patterns tend to be predictable (for example, adjacent lines tend to be perpendicular to one another), while the edge segments in other cases are split in a more stochastic way. As a result, the connections between edge segments and line segments are also used to model context information within our framework described below.

As shown in Fig. 4, the connected components, containing zero, one or multiple panels, stand at the top (coarsest) level of the hierarchy. The edge segments, on the other hand, provide a finer segmentation of the comic image, especially the ones lying on the borders of the frames. The line segments, resulting from further decomposition of the edge segments, stand at the lowest level of the formulation. To achieve the task of panel detection, i.e. finding the enclosing convex polygon of each panel, we intend to label each of the visual patterns and execute post-processing to form the convex polygons of the panel contours.

Fig. 4. A hierarchical view of the comic page structure, from the original image down to connected components, edge segments and line segments (Source: Dragon Ball, vol. 1, p. 38).
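The two-step chaining procedure described in Section 3.1 can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the strong and weak edge points are assumed to be given as dictionaries mapping (x, y) coordinates to gradient magnitudes, and the function name is ours.

```python
def chain_edge_segments(strong, weak):
    """Greedy chaining of edge points into 8-connected edge segments.
    Step 1: start a trace at the unsearched strong edge point with the
    largest gradient magnitude. Step 2: repeatedly move an unsearched
    8-neighbor of the current tail onto the segment until none is left."""
    unsearched = set(strong) | set(weak)
    segments = []
    while True:
        # step 1: the strongest remaining strong edge point becomes p0
        candidates = [p for p in strong if p in unsearched]
        if not candidates:
            break
        p0 = max(candidates, key=lambda p: strong[p])
        segment = [p0]
        unsearched.discard(p0)
        # step 2: grow the list from its tail through unsearched 8-neighbors
        while True:
            x, y = segment[-1]
            nxt = next(((x + dx, y + dy)
                        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                        if (dx, dy) != (0, 0)
                        and (x + dx, y + dy) in unsearched), None)
            if nxt is None:
                break
            segment.append(nxt)
            unsearched.discard(nxt)
        segments.append(segment)
    return segments
```

Note that weak edge points never seed a trace on their own; they are only absorbed into segments reachable from strong points, mirroring the hysteresis behavior described above.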

It is crucial to model the contextual dependencies among these visual patterns in order to achieve satisfactory detection results. The Conditional Random Field (CRF) [20] is known for its ability to model the contextual dependencies among variables and has proven successful in the fields of Natural Language Processing [20] and Computer Vision [21]. Therefore, we formulate the panel detection task as a tree conditional random field formed by the three types of visual patterns and their interactions. The model describes the spatial dependencies among these visual patterns, and its inference combines local evidence and inter-layer interactions to label the elements. Moreover, the discriminative nature of CRFs permits us to model and propagate label dependencies between visual patterns without requiring observation independence, thus enabling the combination of overlapping and long-range features [22].

Fig. 5 gives an overview of the proposed system for panel detection in comic images. The CRF model is trained by manually labeling the visual patterns extracted from the training comic images. For test images, the CRF model provides the discriminative labeling results for the extracted and constructed visual patterns, and then a post-processing stage is executed to obtain the final detection results for the panels within the comic images.

4. Tree conditional random field formulation

Conditional random fields were introduced as a kind of probabilistic graphical model that directly represents a distribution over labels conditioned on the observations, as opposed to the generative nature of the Markov Random Field [23]. A graphical model is the combination of graph theory and probability theory, which provides a framework for modeling a global probability based on some observations of the local functions. The discriminative nature of the CRF permits arbitrary interactions among the observations without independence assumptions.

4.1. CRF formulation

As mentioned in Section 3, several works achieve image segmentation by using conditional random fields, where general descriptions of the conditional random field are presented. Here we give a similar general presentation of the conditional random field. The conditional random field defines a joint distribution over the label variable set y given the input observation x in a given image. The general CRF model can be expressed as a product of local functions, namely factors, defined on the clique sets of a graph G, where the graph models the dependencies over the variables [14]. It has the general form:

p(y|x) = (1/Z(x)) ∏_{c ∈ G} Φ_c(x_c, y_c),   (1)

where c denotes a clique in the graph, Φ_c denotes the potential attached to that clique, and the denominator Z(x) is the normalization constant known as the observation-specific partition function. The graph structure and the clique system strongly depend on the observation x, but a repetition of similar structures called clique templates can be constructed, which are able to factorize the model parameters [22]. If ζ is the set of clique templates within the graph structure, the CRF can be expressed alternatively as follows:

p(y|x) = (1/Z(x)) ∏_{C_i ∈ ζ} ∏_{c ∈ C_i} Φ_c(x_c, y_c; θ_i),   (2)

where all the cliques in clique template C_i are parameterized by θ_i [22]. In most practical definitions of CRF models, the cliques are low-order ones in order to guarantee computational efficiency during the inference process.

4.2. Modeling visual patterns with tree CRF

In this section, we present our tree CRF formulation to address the task of panel detection in comic images. The three kinds of visual patterns (namely connected components, edge segments and line segments) extracted from the comic images are denoted as the set V = {C, E, L}. Let x be the set of observations from the comic image, where

x = {x_m^C} ∪ {x_n^E} ∪ {x_k^L},   m ∈ C, n ∈ E, k ∈ L.   (3)

Each observation x_k^L is connected to a higher-level observation x_n^E, whereas each x_n^E is connected to an x_m^C. Therefore, they form a hierarchical view of the comic page. With each observation x, a label value y is associated, which is defined over the set ℓ. The set ℓ contains only two values: 1 and −1, although the meanings of positive and negative vary slightly between the sets of visual patterns. For y_m^C, positive means the contour of the corresponding connected component matches the contour of a panel, negative otherwise; for y_n^E and y_k^L, positive means the edge or line segment lies on the border of a panel, negative otherwise. It is well worth noting that if a panel does not have an enclosing borderline, the y_n^E and y_k^L linked to the corresponding connected component are always labeled as negative. An example is given in Fig. 6.

The remaining task is the construction of the tree-structured observation of the comic page. In other words, we need to find the x_n^E for each x_k^L and the x_m^C for each x_n^E. Finding the parent edge segment for each line segment is an intuitive and easy task, as the lines are obtained by splitting an edge segment; therefore, x_n^E = parent(x_k^L) if x_k^L is obtained by splitting x_n^E. As for the link between connected component and edge segment, x_m^C = parent(x_n^E) if

m = argmax_i |x_n^E ∩ x_i^C|,   (4)

where the intersection operation is defined in terms of the overlapping foreground pixel count between the edge segment and the connected component. This formulation also guarantees that an edge segment has only one parent, whereas a connected component can have multiple descendants. Therefore, the observation derived from the comic image forms a tree. It is worth noting that our model is distinguished from the one in [24], as the nodes in our model are essentially heterogeneous, while the homogeneous nodes within the model of [24] have different state sets. Such a configuration offers the flexibility of modeling multiple kinds of objects within a single framework.

Fig. 5. System workflow of the proposed method, divided into training and testing parts: visual patterns are extracted and a graph is constructed for each training image, the CRF model is learned from manual labels, and for a test image the label results are post-processed into the detected panels.
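The parent assignment of Eq. (4) can be sketched as follows, under the simplifying assumption that each connected component and edge segment is represented by its set of foreground pixel coordinates; the function name and data representation are illustrative, not from the paper.

```python
def assign_edge_parents(components, edge_segments):
    """For each edge segment, pick as parent the connected component
    with the largest foreground-pixel overlap (Eq. (4)). `components`
    and `edge_segments` are lists of sets of (x, y) coordinates."""
    parents = {}
    for n, edge in enumerate(edge_segments):
        # argmax over components of |edge ∩ component|
        parents[n] = max(range(len(components)),
                         key=lambda m: len(edge & components[m]))
    return parents
```

Because each edge segment receives exactly one argmax parent while a component may be chosen by many edges, the resulting observation structure is necessarily a tree.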

Fig. 6. (a) Extracted clustered connected components; the red and the white ones each contain only one panel and thus are labeled as 1, while the blue one is labeled as −1 for containing two panels. (b) The corresponding edge segments; the white edge segment does not lie on the contour of a panel and is therefore labeled as −1, while the red one is labeled as 1 for lying on the contour of a panel. (Source: Toriyama, Dr. Slump, Shueisha, vol. 1, p. 9). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
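To make the binary labeling concrete, the toy sketch below evaluates the conditional probability of one joint labeling of a minimal tree (a connected component with a single child edge segment) by brute-force enumeration of the partition function Z(x). The potential functions passed in are arbitrary stand-ins for the learned clique templates, not the paper's.

```python
import itertools
import math

def labeling_probability(labels, nodes, edges, assoc, inter):
    """p(y|x) for a tiny tree CRF: exp(sum of potentials), normalized by
    the partition function computed over all labelings in {-1, 1}^n.
    assoc(node, y) gives a unary potential, inter(y_parent, y_child)
    a pairwise one."""
    def score(y):
        s = sum(assoc(v, y[v]) for v in nodes)          # unary potentials
        s += sum(inter(y[a], y[b]) for a, b in edges)   # pairwise potentials
        return s
    z = sum(math.exp(score(dict(zip(nodes, combo))))
            for combo in itertools.product((-1, 1), repeat=len(nodes)))
    return math.exp(score(labels)) / z
```

With zero potentials everywhere, each of the four labelings of a component-edge pair gets probability 1/4; an interaction potential that rewards agreement shifts probability mass toward consistent labelings, which is exactly the contextual effect the model exploits.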

The cliques in our work fall into two categories: unary nodes and binary nodes (as shown in Fig. 7). The unary nodes encode the local evidence of each visual pattern, and the binary nodes encode the interactions between adjacent levels of visual patterns. Since the tree has at most three levels, we have three clique templates for the unary nodes and two for the binary ones, where the parameters are shared within each clique template. We give the tree conditional random field representation as follows:

p(y|x) = (1/Z(x)) exp( Σ_{i=1}^{3} Σ_{c ∈ A_i} Φ_c(x_c, y_c; θ_i) + Σ_{j=1}^{2} Σ_{c ∈ I_j} Φ_c(x_c, y_c; ω_j) ).   (5)

Our model uses the log-linear form, which is easy to represent and manipulate. The clique template A_i encodes the association potentials attached to the observations at the three different levels, whereas I_j describes the interactions between edge segments and line segments, as well as those between edge segments and connected components. The form of the factors also indicates that the two kinds of potentials are parameterized by θ_i and ω_j.

Fig. 7. The factor graph representing our tree conditional random field model. Each circle corresponds to the label y_{m,n,k} of a visual pattern. Factors (squares) consist of a singleton potential for each kind of visual pattern and second-order functions for the interactions between adjacent levels of visual patterns within the graph structure.

5. Implementation details

In this section, we present the implementation details of the tree conditional random field for panel detection in comic images, which include potential (the form of the clique templates) and descriptor design, parameter estimation, model inference, and the post-processing stage.

5.1. Potential and descriptor design

Association potentials: Association potentials are obtained by a local discriminative model that calculates the association of a site with a class. Inspired by [25], we formulate it as a logistic binary classifier:

Φ_c(x_c, y_c; θ_i) = log(σ(y_c θ_i^T h(x))),   (6)

where

σ(x) = 1 / (1 + exp(−x)).

The θ_i are model parameters which vary among the different clique templates. The function h(x) allows a nonlinear transformation of the feature vector extracted from the observation; in this work we use a normalization function which guarantees that all the descriptor values range from 0 to 1. We also fix the first element of each vector to 1 to accommodate a bias parameter [25]. This representation ensures that the tree CRF is equivalent to the logistic binary classifier if the interaction potentials are not attached. It is worth noting that the feature function is defined on the whole set of observations rather than only on x_c, which exhibits the greater expressive power of the CRF compared with the Markov Random Field.

When extracting features (more precisely, descriptors) from the visual patterns, we consider many geometric properties as well as the context characteristics of comic images. The descriptors extracted from connected components, edge segments and line segments are shown in Tables 1, 2 and 3 respectively. The features extracted from each kind of visual pattern reflect the likelihood of that visual pattern being related to a panel. For example, the "Center Point Distance" in Table 1 measures the distance between the center of mass and the bounding box middle point of the connected component, where a shorter distance indicates a more probable panel contour; if an edge segment lies on the contour of a

Table 1. Connected component (CC) level descriptors.

Geometric properties
- Span: the span along the x- and y-axes, normalized by image height and width
- Occupation rate: the number of foreground pixels divided by the area of the CC and of the image, respectively
- Aspect ratio: the aspect ratio of the CC bounding box
- Center Point Distance: the distance between the center of mass and the midpoint of the CC bounding box
- Centripetal Force: the average and standard deviation of the distance between the convex points and the center of mass
- Contour smoothness: the standard deviation of the lengths of the lines which lie on the convex polygon

Contextual characteristics
- Edge segment count: the number of edge segments linked to the CC
- Binary label: whether the CC has linked edge segments
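A few of the geometric descriptors in Table 1 could be computed along the following lines, given a component's foreground pixels as (x, y) tuples. The exact normalizations (for instance, whether the occupation rate is taken against the bounding box) are our reading of the table rather than a specification from the paper.

```python
import math

def cc_geometric_descriptors(pixels, image_w, image_h):
    """Span, occupation rates, aspect ratio and Center Point Distance
    for one connected component (a subset of Table 1)."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    w = max(xs) - min(xs) + 1
    h = max(ys) - min(ys) + 1
    # center of mass vs. bounding-box midpoint ("Center Point Distance")
    cx, cy = sum(xs) / len(pixels), sum(ys) / len(pixels)
    mx, my = min(xs) + (w - 1) / 2, min(ys) + (h - 1) / 2
    return {
        "span": (w / image_w, h / image_h),
        "occupation_rate_cc": len(pixels) / (w * h),      # vs. CC bounding box
        "occupation_rate_image": len(pixels) / (image_w * image_h),
        "aspect_ratio": w / h,
        "center_point_distance": math.hypot(cx - mx, cy - my),
    }
```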

Table 2. Edge segment level descriptors.

Geometric properties
- Span: the span along the x- and y-axes, normalized by image height and width respectively
- Occupation rate: the pixel count divided by the perimeter of the image and of its bounding box, respectively
- Center Point Distance: the distance between the center of mass and the midpoint of its bounding box
- Centripetal Force: the standard deviation of the distance from each point to the center of mass
- Binary label 1: whether the edge segment is closed

Contextual characteristics
- Line count: the number of lines split from the edge segment
- Line proportion: the proportion of the split lines to the edge segment length
- Line deviation: the standard deviation of the lengths of the split lines
- Binary label 2: whether the edge segment has child lines and a parent connected component

Table 3
Line segment level descriptors.

Geometric properties:
- Length: the length normalized by its parent edge segment
- Absolute angle: the angle formed by the line and the x-axis
- Relative angle: the angle formed with the adjacent lines on both sides

Contextual characteristics:
- Side Pixel Count: the foreground pixel count in a small area on both sides of the line
- Endpoint distance: the distance of the two end points to the end points of the adjacent lines
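The geometric descriptors of Table 3 can be sketched as follows; folding the angle into [0, 180) and normalizing by the parent edge-segment pixel count are assumptions of this illustration, since the paper does not spell out either convention:

```python
import math

def line_descriptors(p, q, parent_edge_len):
    """Sketch of the Table 3 geometric descriptors for a line segment p-q.

    parent_edge_len -- pixel count of the parent edge segment, used here
    to normalize the line length (an assumed normalization).
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    # Absolute angle with the x-axis, folded into [0, 180) degrees
    angle = math.degrees(math.atan2(dy, dx)) % 180.0
    return {"length_norm": length / parent_edge_len, "abs_angle": angle}
```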

panel, the deviation of the distance between each point and the center of mass is usually low, which is encoded by the "Centripetal Force" in Table 2; in the small area on either side of a line segment that lies on the border of a panel, there are few foreground pixels, which is captured by the "Side Pixel Count" in Table 3 (as shown in Fig. 8).

Interaction potentials: Designing the interaction potentials is very important, as they encode the geometric relationships between the pairs of visual patterns. As the visual patterns only provide a preliminary segmentation of the comic image, label inconsistencies can arise between the pairs of visual patterns. Therefore, our model provides a more flexible definition of the interaction potentials, which differs from the smoothing definition used in prevalent CRF models [14]. The interaction potential is defined as follows:

Φ_c(x_c, y_c; ω_j) =
  Σ_{k=1}^{K_j} ω_j^{4k}   f_j^k(x_c)   if y_a = 1, y_b = 1
  Σ_{k=1}^{K_j} ω_j^{4k+1} f_j^k(x_c)   if y_a = 1, y_b = 0
  Σ_{k=1}^{K_j} ω_j^{4k+2} f_j^k(x_c)   if y_a = 0, y_b = 1
  Σ_{k=1}^{K_j} ω_j^{4k+3} f_j^k(x_c)   if y_a = 0, y_b = 0   (7)

In this definition, y_c is a pair of labels (y_a, y_b) whose elements are linked by an edge in the tree-structured observations, with y_a being the parent node. We define a total of K_j descriptors for each pair of observations. As we can see, this model can tolerate label inconsistencies better than the smoothing models used for tasks such as contour extraction with a CRF [26]. Moreover, our interaction potential differs from the one used in [14]: the pair of labels in our work is ordered, while the one in [14] is unordered, where only two scenarios are considered. The descriptors extracted from a connected component and edge segment pair as well as an edge segment and line segment pair are listed in Tables 4 and 5 respectively. These descriptors also model the interaction between adjacent levels of visual patterns, which provides crucial contextual information within the comic page. For example, the "Contour Match 1" feature in Table 4 encodes the degree to which the edge segment and the connected component contours match each other.

5.2. Parameter estimation and inference

The goal of parameter estimation in this work is to decide the θ_i for the first-order factors and the ω_j for the second-order factors (a total of 12 + 13 + 9 + 4×14 + 4×4 = 106 parameters). We adopt the pseudo-likelihood (PL) as a surrogate for the true likelihood, as done in [14,22]. The PL is a local approximation under the assumption that the training objective only relies on the conditional distributions over single variables, which makes the computation of the partition function efficient [22]. The log-PL function is the sum over the log of p(y_i | y_{N_i}, x; θ, ω) at each site, where

p(y_i | y_{N_i}, x; θ, ω) = ∏_{c ∋ i} Φ_c(x_c, y_c) / Σ_{y'_i} ∏_{c ∋ i} Φ_c(x_c, y'_c).   (8)

This PL representation only requires summing over the labels y'_i of a single site, combined with the fixed ground-truth labeling of its neighbors y_{N_i} [22]. The log-sum over multiple training instances is generally a convex function of the parameters [25] and can therefore be optimized by a family of quasi-Newton methods such as L-BFGS [27]. In our experiments, the optimization converges after a few hundred iterations.

For the model inference, we apply the Belief Propagation algorithm, which gives a Maximum-A-Posteriori labeling over the visual patterns [14].

5.3. Post-processing

By inferring the constructed model, we obtain the labels of all the visual patterns. A post-processing stage utilizes these label results to extract the complete convex polygon of each panel.

If the label of a connected component is 1, which means that its contour matches the contour of a panel, we simply extract its convex polygon to represent a detected panel. Otherwise, we collect all its descendant line segments with label value 1 that do not lie on the border of the connected component, and employ an x–y cut [28] recursive strategy to separate the connected component (as shown in Fig. 9). For each separated part, its convex polygon is calculated and used to represent a detected panel.

6. Experiments

The goal of the experiments is to evaluate the improvements over the existing approaches brought by the tree conditional random field model. First, the evaluation criteria as well as the threshold for connected component clustering are discussed. We then describe the page-level complexity measurement and the two datasets used in this paper. At last, the experimental results on the datasets are presented along with the result analysis.

6.1. Evaluation criteria

In this work, our algorithm for panel detection is evaluated on the panel level as well as the page level, where the definition of a correctly detected panel within a page is crucial. A correctly detected panel is defined as follows: a set of convex polygons D is obtained as the detection result for each page. Meanwhile, we manually label the convex polygon p for each panel within the page. For each p, we calculate the proportion

max_i |p ∩ D_i| / |p ∪ D_i|,   (9)

where the intersection and union operations are calculated in terms of area. If this value is above 90%, the panel is considered correctly detected. The evaluation metrics on the panel level are (1) Precision (the percentage of correctly detected panels among the detected ones); (2) Recall (the percentage of correctly detected panels among the panels to be detected); (3) F1 measure (the harmonic mean of Precision and Recall). If each panel within a comic page is correctly detected, the page counts as a correct detection result.
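The four-case parameter switch of Eq. (7) can be sketched as follows. The zero-based indexing and the dictionary dispatch are implementation choices of this illustration, not the paper's exact notation:

```python
# Offsets into the omega parameter vector for each ordered label pair
# (parent y_a, child y_b), mirroring the four cases of Eq. (7).
_CASE_OFFSET = {(1, 1): 0, (1, 0): 1, (0, 1): 2, (0, 0): 3}

def interaction_potential(y_a, y_b, feats, omega):
    """Sum_k omega[4k + offset] * f_k(x_c) for the ordered pair (y_a, y_b).

    feats -- the K_j pairwise descriptors f_j^k(x_c)
    omega -- 4 * K_j parameters of this clique template
    """
    off = _CASE_OFFSET[(y_a, y_b)]
    return sum(omega[4 * k + off] * f for k, f in enumerate(feats))
```

Because the pair is ordered, the cases (1, 0) and (0, 1) draw on different parameters, unlike the unordered smoothing potential of [14].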

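The overlap proportion of Eq. (9) needs the intersection and union areas of two convex polygons. A minimal pure-Python sketch, assuming counter-clockwise vertex order and using Sutherland–Hodgman clipping:

```python
def _area(poly):
    """Shoelace area of a simple polygon."""
    return abs(sum(x0 * y1 - x1 * y0
                   for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1]))) / 2.0

def _isect(s, e, a, b):
    """Intersection of segment s-e with the infinite line through a-b."""
    dcx, dcy = a[0] - b[0], a[1] - b[1]
    dpx, dpy = s[0] - e[0], s[1] - e[1]
    n1 = a[0] * b[1] - a[1] * b[0]
    n2 = s[0] * e[1] - s[1] * e[0]
    n3 = dcx * dpy - dcy * dpx
    return ((n1 * dpx - n2 * dcx) / n3, (n1 * dpy - n2 * dcy) / n3)

def _clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by convex `clipper` (both CCW)."""
    out = list(subject)
    n = len(clipper)
    for i in range(n):
        (ax, ay), (bx, by) = clipper[i], clipper[(i + 1) % n]
        inside = lambda p: (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) >= 0
        inp, out = out, []
        if not inp:
            break
        s = inp[-1]
        for e in inp:
            if inside(e):
                if not inside(s):
                    out.append(_isect(s, e, (ax, ay), (bx, by)))
                out.append(e)
            elif inside(s):
                out.append(_isect(s, e, (ax, ay), (bx, by)))
            s = e
    return out

def overlap_ratio(p, detections):
    """max_i |p ∩ D_i| / |p ∪ D_i| from Eq. (9)."""
    best = 0.0
    for d in detections:
        inter = _area(_clip(p, d))
        union = _area(p) + _area(d) - inter
        if union > 0:
            best = max(best, inter / union)
    return best
```

A detection whose ratio is at least 0.9 counts as correct under the criterion above.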
6.2. Connected component clustering parameter

Fig. 8. (a) Center Point Distance feature illustration, where the center point and the middle point of a connected component containing two panels diverge from each other (Source: Dragon Ball, vol. 34, p. 105); (b) Centripetal Force feature illustration; (c) Side Pixel Count feature illustration (Source: Mitsuru Adachi, , Shogakukan Inc., vol. 6, p. 65).

As stated in Section 3.1, we perform a preliminary clustering stage to assemble the connected components, in order to make sure that the contents of a panel belong to only one connected component. The agglomerative clustering distance threshold t is defined proportionally to the diagonal d_0 of the whole comic image, where t = λd_0 with λ ranging from 0.00 to 0.16. Table 6 shows how changing λ affects the number of connected components included in the CRF model. The first column presents the different λ values; the second column is the total number of resulting connected components of all

Table 4
Interaction descriptors for a connected component and edge segment pair.

Geometric properties:
- Span ratio: the ratio of the x- and y-spans of their bounding boxes
- Diagonal ratio: the ratio of the diagonals of their bounding boxes
- Area ratio: the ratio of the areas of their bounding boxes
- Overlapped area: the x- and y-axes offset between their center points

Contextual characteristics:
- Minimum distance: the minimum distance between the convex points and the edge segment
- Contour match 1: the average and standard deviation of the distance from the center point of the CC to the edge segment
- Contour match 2: the average and standard deviation of the minimum distance between points on them
- Maximum distance: the maximum distance from the edge segment points to the convex points
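The bounding-box ratios of Table 4 are straightforward to compute. A sketch under the assumption that each pattern is summarized by an axis-aligned box (x0, y0, x1, y1):

```python
import math

def pair_bbox_descriptors(bb1, bb2):
    """Sketch of the Table 4 geometric descriptors for a CC / edge-segment
    pair, each given as a bounding box (x0, y0, x1, y1)."""
    w1, h1 = bb1[2] - bb1[0], bb1[3] - bb1[1]
    w2, h2 = bb2[2] - bb2[0], bb2[3] - bb2[1]
    span_ratio = (w2 / w1, h2 / h1)
    diag_ratio = math.hypot(w2, h2) / math.hypot(w1, h1)
    area_ratio = (w2 * h2) / (w1 * h1)
    # Offset between the bounding-box center points
    c1 = ((bb1[0] + bb1[2]) / 2, (bb1[1] + bb1[3]) / 2)
    c2 = ((bb2[0] + bb2[2]) / 2, (bb2[1] + bb2[3]) / 2)
    center_offset = (abs(c1[0] - c2[0]), abs(c1[1] - c2[1]))
    return {"span_ratio": span_ratio, "diag_ratio": diag_ratio,
            "area_ratio": area_ratio, "center_offset": center_offset}
```

When an edge segment really traces a CC's contour, all three ratios approach 1 and the center offset approaches zero, which is what makes these descriptors informative.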

Table 5
Interaction descriptors for an edge segment and line segment pair.

Geometric properties:
- Length ratio 1: the length of the line segment normalized by the edge segment pixel count
- Length ratio 2: the length of the line segment normalized by the diagonal of the edge segment

Contextual characteristics:
- Distance difference: the difference of the distances from the two end points of the line segment to the center point of the edge segment

our improvements over comic pages with a broader range of layouts. The four types are simple, medium, complex and extra complex. To quantify the complexity measurement, we first summarize the panel drawing patterns within the dataset and assign a "difficulty score" to each pattern. To measure the layout complexity of a comic page, we sum up the difficulty scores of all the patterns appearing within the page. Finally, the comic pages are classified by this sum. As shown in Fig. 10, each drawing pattern is assigned a score designating its difficulty for panel detection. The score ranges for the simple, medium, complex and extra complex layouts are [0, 1], (1, 3], (3, 5] and (5, +∞) respectively.
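The classification by summed difficulty scores can be sketched directly from the ranges above:

```python
def classify_layout(pattern_scores):
    """Classify a page by the sum of its pattern difficulty scores,
    using the ranges [0, 1], (1, 3], (3, 5] and (5, +inf) from the text."""
    total = sum(pattern_scores)
    if total <= 1:
        return "simple"
    if total <= 3:
        return "medium"
    if total <= 5:
        return "complex"
    return "extra complex"
```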

6.3.2. Dataset 1
We test our context-modeling panel detection algorithm on two datasets. Due to copyright and other issues, these two are the only ones available to us for comparison. Most of the publications so far, such as [4,8,10,18], use Dragon Ball, vol. 42 and report the correct page count over its total of 237 comic images. Therefore, the first part of the experiment is conducted on this dataset, where we compare the correct page counts as well as evaluate the panel-level detection accuracy. The number of comic images for each type of layout is shown in Table 7.

Fig. 9. Illustration of the post-processing stage. The whole connected component is not labeled as 1, but its descendant red lines L1 and L2 are labeled as 1 and can therefore be used to separate and extract each panel from it (Source: Dragon Ball, vol. 1, p. 58). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

6.3.3. Dataset 2
We also test our method by using the dataset from [11,18], which includes a total of 2000 comic images. These images are collected from 40 comic books, which come from ten different Japanese and Hong Kong comic series. The comic images cover a wide range of topics, including campus, chivalry, youth and girls' comics. We select 50 pages from each book and manually label the convex polygon of

each panel. The sizes of the comic page images vary from 960 × 1300 to 1500 × 2200 pixels. The number of comic pages for each type of layout is shown in Table 8.

Table 6
The relationship between λ and the connected component count.

λ value   Total    Per image
0.00      21,003   10.50
0.04      18,004   9.02
0.07      14,693   7.35
0.10      11,607   5.80
0.13      7783     3.89
0.16      4723     2.36

the 2000 test images, and the third column is the per-image connected component count. In order to avoid over-segmentation of the connected components, the selected threshold should not be too small. At the same time, a large threshold helps agglomerate connected components but leads to a significant increase in the connections among the visual patterns, and thus in the overall model complexity. In this work, we choose λ = 0.10, which agglomerates most of the fragmented connected components while limiting the model complexity.

6.4. Experiment I

The first part of the experiment in our work is to test our method on Dragon Ball, vol. 42, which consists of a total of 237 images. The dataset is described in Section 6.3.2. We estimate the parameters by randomly selecting 100 comic pages from the other volumes of Dragon Ball. The comparisons with other methods are shown in Table 9, where our method achieves the highest correct page count among all the existing approaches. As the authors in [4,8,10,18] do not provide evaluation on the panel level, Table 10 only provides the details of the panel-level results of our method. Moreover, we are able to present all the contents belonging to a panel, which brings a better user experience when applied to mobile reading (as shown in Fig. 2).
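The threshold-based agglomeration of connected components can be sketched as single-link grouping with t = λ·d_0. The gap metric between bounding boxes is an assumption of this sketch; the paper does not spell out the distance used:

```python
import math

def cluster_components(boxes, img_w, img_h, lam=0.10):
    """Single-link agglomerative grouping of CC bounding boxes: two
    components merge when the gap between their boxes (an assumed
    metric) is below t = lam * d0, where d0 is the image diagonal."""
    t = lam * math.hypot(img_w, img_h)

    def gap(a, b):
        # Shortest distance between two axis-aligned boxes (0 if they touch)
        dx = max(b[0] - a[2], a[0] - b[2], 0)
        dy = max(b[1] - a[3], a[1] - b[3], 0)
        return math.hypot(dx, dy)

    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) < t:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```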

6.3. Datasets

6.3.1. Page-level complexity measurement
We classify the test comic images into four types based on the layout complexity of the comic pages, in order to truly measure

6.5. Experiment II

We also test our method by using the dataset from [11,18], which includes a total of 2000 comic images. For training the parameters, we select 20 more images from each of the 10 comic

Table 9
The detection results of the proposed and other methods on Dataset 1 described in Section 6.3.2.

Method     Total pages   Correct   Percentage
[4]        237           195       82.28
[8]        237           218       91.98
[10]       237           223       94.09
[18]       237           222       93.67
Proposed   237           227       95.78

Table 10
Panel-level evaluation on Dataset 1 described in Section 6.3.2.

Metric                      Simple   Medium   Complex   Extra   Overall
Ground truth panels         664      268      237       61      1230
Detected panels             665      268      239       63      1235
Correctly detected panels   663      268      229       55      1215
Precision                   99.70    100.0    95.82     87.30   98.38
Recall                      99.85    100.0    96.62     90.16   98.78
F1                          99.77    100.0    96.22     88.71   98.58
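The panel-level metrics follow directly from the counts; a small sketch, checked here against the Overall column of Table 10:

```python
def panel_metrics(correct, detected, ground_truth):
    """Precision, Recall and F1 (in percent) from panel counts."""
    precision = 100.0 * correct / detected
    recall = 100.0 * correct / ground_truth
    # Harmonic mean of Precision and Recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Overall column of Table 10: 1215 correct, 1235 detected, 1230 ground truth
p, r, f1 = panel_metrics(1215, 1235, 1230)
```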

series, which form a total of 200 training images. The comparisons on the results are shown in Table 11.

The first four groups of columns show the comparison among the methods on the four types of layouts, while the last group of columns gives an overall comparison on the dataset. The results demonstrate that the proposed method outperforms the existing ones, especially as the comic page layouts become more complex. Therefore, our method achieves better performance on a broader range of comic layouts with a better representation of the detected panels, as shown in Fig. 11.

Our method has a relatively low computational cost: the learning process takes less than 1 min, and the whole detection process per image (including visual pattern extraction, model construction, inference and post-processing) averages 1.75 s over all the test images, executed on a 3.4 GHz processor. The average processing time of the method proposed in [10] is 5.63 s, while the method proposed in [18] takes 0.19 s per image on average. It is worth noting that, when applied to mobile reading, most parts of the processing pipeline could be executed on the cloud side with the help of cloud computing.

The limitations of the existing methods such as [10,18] lie in several major aspects. First, they make strong assumptions on the panel shapes and therefore may fail in cases such as the absence of enclosing borderlines. Second, these methods rely on a single source of visual patterns extracted from comic images, which has a limited ability in expressing the panels. As a result, the detected panels are presented as rectangles or quadrilateral polygons, which is not user-friendly when applied to mobile reading. Finally, by relying on empirical rules, these methods have higher failure rates when the comic page layouts become more complicated.

By incorporating three kinds of visual patterns and modeling their context interactions within one tree conditional random field framework, our method is able to process more diverse kinds of page layouts and presents the detected panels as convex polygons. Our method not only achieves better experimental results on panel detection but also opens a new prospect for addressing the panel detection task.

Fig. 12 gives two examples of the error cases and limitations of our method. Fig. 12(a) shows that inappropriate x–y cut orders can lead to false detection results, although all the separating line segments are correctly obtained. Even when both the cut orders and the segmenting line segments are correct, selecting the bottom line of the upper panel in Fig. 12(b) as the cutting line would bring better results than the one shown in Fig. 12(b). Other error cases involve false edge segment extraction and false labeling of the connected components as well as the line segments, which we will try to correct in follow-up works.

Fig. 10. Drawing patterns illustration with the score assignment: (a) quadrilateral polygon without extending objects (Source: Dragon Ball, vol. 34, p. 208); (b) the borders of the panel overlapping with image borders (Source: Dragon Ball, vol. 34, p. 116); (c) panel with an extending object on one side (Source: Rumiko, , , vol. 55, p. 15); (d) panel without borderline segments (Source: Dragon Ball, vol. 01, p. 16); (e) panel with extending objects on more than one side (Source: Dragon Ball, vol. 34, p. 20); (f) two panels overlapped by extending objects (Source: Dragon Ball, vol. 34, p. 7); (g) N panels overlapped by extending objects (Source: Dragon Ball, vol. 34, p. 181).

Table 7
The number of comic pages for each type of layout in Dataset 1 (237 images in total).

Layout type     Number   Percentage
Simple          114      48.1
Medium          62       26.2
Complex         49       20.7
Extra Complex   12       5.1

Table 8
Comic page count for each type of layout in Dataset 2 (2000 images in total).

Layout type     Number   Percentage
Simple          793      39.7
Medium          757      37.9
Complex         312      15.6
Extra Complex   138      6.9

Table 11
The detection results of [11,18] and the proposed method on Dataset 2 described in Section 6.3.3.

          Simple               Medium               Complex              Extra                Overall
Method    P(%)   R     F1      P     R     F1       P     R     F1      P     R     F1       P     R     F1
[11]      98.27  95.04 96.62   95.64 92.77 94.18    90.02 85.82 87.56   78.01 69.74 73.64    95.18 91.63 93.30
[18]      99.28  96.74 97.99   97.76 95.39 96.56    92.25 86.24 89.14   79.92 72.14 75.83    96.85 93.54 95.16
Ours      98.70  97.99 98.34   98.23 97.64 97.93    95.12 87.48 91.14   90.18 86.32 88.21    97.64 95.82 96.73

Fig. 11. (a) Detection result of [11] on a test image with complexity score of 4, (b) detection result of [18], and (c) detection result of the proposed method.

Fig. 12. (a) Incorrect cutting orders can lead to false detection results (Source: Toriyama, Dragon Ball, Shueisha, vol. 34, p. 238). (b) Selecting more appropriate cutting lines can bring better detection results (Source: Toriyama, Dragon Ball, Shueisha, vol. 34, p. 238).

7. Conclusions and future works

In this paper, we make the first attempt to model the contextual relationships among the visual patterns (instead of using certain empirical rules) to address the task of panel detection in comic images. Our major contributions are the following: (1) we extract three kinds of visual patterns (namely connected components, edge segments and line segments) from the comic images at different levels; (2) a tree conditional random field model is proposed to model these visual patterns and their interactions; (3) the presentation of the detected panels is generalized as convex polygons in an attempt to keep their content integrity.

The experimental results show that our method achieves a higher correct page segmentation percentage as well as higher precision and recall rates for panel detection than the existing ones, especially on the test images with more complex page layouts. Our future work will focus on integrating drawing object recognition and panel detection to process images where all the panels are interconnected, or pages which contain objects that expand over multiple panels.

Conflict of interest

None declared.

Acknowledgments

We would like to thank all the reviewers for providing encouraging and valuable comments on our work. This work is supported by the National Natural Science Foundation of China under Grant 61300061 and the Beijing Natural Science Foundation (4132033). We also would lik