21st International Conference on Pattern Recognition (ICPR 2012) November 11-15, 2012. Tsukuba,

Multiple-Food Recognition Considering Co-occurrence Employing Manifold Ranking

Yuji Matsuda and Keiji Yanai Graduate School of Informatics, The University of Electro-Communications Email: [email protected], [email protected]

Abstract In this paper, we propose a method to recog- nize food images which include multiple food items Figure 1. Examples of multiple-food pho- considering co-occurrence statistics of food items. tos. Theproposedmethodemploysamanifoldrank- ing method which has been applied to image re- search of scene recognition the targets of which usu- trieval successfully in the literature. In the exper- ally contain multiple objects, relations between ob- iments, we prepared co-occurrence matrices of 100 jects were considered as important cue for scene food items using various kinds of data sources in- recognition in some works [8, 9]. Inspired by these cluding Web texts, Web food blogs and our own food works, we introduce relation information between database, and evaluated the final results obtained by food items for recognizing multiple-food meal pho- applying manifold ranking. As results, it has been tos. As relations which have been used in object proved that co-occurrence statistics obtained from a recognition research so far, co-occurrence [8] and food photo database is very helpful to improve the relative location [9] are common. In case of meal classification rate within the top ten candidates. photos, we think co-occurrence relation is more im- portant than relative locations, because some com- 1 Introduction binations of foods such as “hamburger and french Recently, personal services to recode people’s fries” are very common, while the way to place food food habits by taking meal photos with mobile items on the table is not strictly restricted in gen- phones have become popular. Currently, labeling eral. food names to meal photos requires human labor, In this paper, we propose a method to rec- which is a quite troublesome task. Therefore, it ognize multiple-food meal photos considering co- is desired to make recording of food items more occurrence statistics by extending our previous easier and quickly. To this end, several methods work [4]. To do that, we use manifold ranking [10] to recognize food images have been proposed so which have been used as a method for relevance far [1, 2, 3, 4]. feedback of image retrieval [11]. Manifold rank- Most of the existing works assumed that one ing is a ranking method to consider similarities meal image contained only one food item. They between items. In the image retrieval research, cannot handle a meal photo which contains two or manifold ranking was used to rank images con- more food items such as a hamburger-and-french- sidering both user preference and visual similar- fries image. Then, in [4], we proposed a new ities between images. In this paper, we use the method for recognizing meal photos which con- manifold ranking method to rank food candidates tain two or more food items as shown in Figure considering both recognition results by our previ- 1. In our proposed method, firstly, we detect candi- ous method [4] and co-occurrence statistics between date regions with several methods including Felzen- food items. That is, the method proposed in this szwalb’s deformable part model (DPM) [5], a cir- paper re-ranks the original candidate ranking ob- cle detector and the JSEG region segmentation tained by the multiple-food recognition method, [6]. Then, we extract various kinds of image fea- which does not consider co-occurrence statistics, by tures from each candidate region. After applying taking into account co-occurrence statistics. To our the classification models trained by multiple kernel best knowledge, this is the first work to recognize learning [7], we obtain the names of the top N food multiple-food items in one meal image taking into item candidates over the given image. account co-occurrence statistics. In the proposed method [4], each food item is The rest of this paper is organized as follows: recognized independently. Meanwhile, in the re- Section 2 explains the proposed method, and Sec-

978-4-9906441-1-6 ©2012 IAPR 2017 tion 3 describes the experimental results. Finally photo. Then, we select the maximum SVM out- in Section 4 we conclude this paper. put score over all the candidate region regarding each food category. Finally, we obtain the evalua- tion scores which express likelihood that the corre- 2 Proposed Method sponding food items appear in the given photo. As The proposed method refine the results obtained a system output, we obtain the names of the top by our previous method [4] with manifold rank- N food items over the given image regarding the ing [10]. Before explaining the propose method, we descending order of the evaluation score. describe the previous method to detect food items for multiple-food photos as shown in Figure 1. 2.2 Manifold Ranking with Co- occurrence Statistics

Query Image Some common combinations of food items ex-

Classification ist such as “hamburger and french-fries” and “rice Candidate Region Detection MKL-SVM and -”, while unlikely combinations ex- Whole DPM Circle JSEG ist such as “ and hamburger” or “ Ranking of 100 Foods Integration of Candidate Region and french-fries”. From these observation, co- occurrence statistics is expected to be able to Re-Ranking by Manifold Ranking Coding Image Feature Vector enhance the performance of multiple-food image Color SIFT CSIFT HOG Gabor Result recognition. By considering co-occurrence statis- tics, we can reduce unlikely combinations and boost Figure 2. Recognition Flow possible combinations which are included in the higher ranked candidates. As results, obtained ranking of food item candidates become more pre- 2.1 Previous Method cise. In [4], we proposed a food image recognition sys- For multi-food recognition with co-occurrence tem which outputs the names of food items that are statistics, we use manifold ranking [10], which is a expected to be shown in a given meal photo. We re-ranking method to consider similarities between show the overview of the processing flow of the pro- items. In this paper, we propose to re-rank the posed system including the co-occurrence extension food candidate ranking obtained by our previous proposed in this paper in Figure 2. Given an input work [4] with the manifold ranking method. image, first, the system detects candidate regions The equation of the manifold ranking is as fol- of dishes. We use four types of detectors includ- lows: ing the deformable part model (DPM) [5], a circle r∗ =(I − αS)−1r, (1) detector, the JSEG region segmentation [6], and r r∗ whole image. Next, we integrate bounding boxes where and represent the initial ranking vector I S of the candidate regions detected by the four meth- and the manifold ranking vector, and and rep- ods. Then, we check the aspect ratio of width and resent an identity matrix and a similarity matrix, α height of the bounding boxes, and exclude irrele- respectively. Note that is a constant varying from S vant bounding boxes regarding their shapes from 0 to 1, which adjusts the effect of for the initial r the candidate set. vector . r Next, the system extracts various kinds of image The initial ranking vector is calculated by ap- features including Bag-of-Features (BoF) of SIFT plying a standard sigmoid function to the SVM out- v and color SIFT with spatial pyramid, Histogram of put value i of each category and L1-normalized as Oriented Gradients (HoG), and Gabor texture fea- shown in the following equation: tures from the selected regions, and calculate SVM −1 (1 + exp(−vi)) scores by multiple-kernel learning (MKL) [7] with ri  , = −1 (2) chi-square RBF kernels for each candidate region j(1 + exp(−vj)) in the one-vs-rest manner. MKL can estimate the optimal weights for linear combination of kernels As a similarity matrix to re-rank the initial rank- each of which corresponds to one kind of a visual ing, we use a co-occurrence probability matrix, feature. which can be calculated by counting the number of After that, we obtain SVM output scores for co-occurrence of two food item pairs in a training all the candidate regions and all the food cate- dataset. Each element of the co-occurrence matrix gories. Note that our objective is not associating Si,j can be obtained with extracted regions with possible names of food items directly, but listing all the names of the food items ci,j Si,j =  , (3) c which are estimated to be shown in the given meal k∈F ∧k= j k,j

2018 3 Experiments chicken-’n’- rice eels on rice pilaf egg on port cutlet beef curry sushi chicken fried rice bibimbap toast croissant roll bread raisin chip butty hamburger rice on rice rice bowl bread

Japanese- tempura tensin fried sauteed grilled sauteed In the experiments, we used our own food im- pizza spaghetti style gratin croquette sandwiches noodle udon noodle noodle beef noodle noodle noodle vegetables eggplant spinach pancake age dataset built for [4] as shown in Figure 3 which

vegetable grilled sweet and potage sausage oden omelet ganmodoki stew fried fish grilled salmon sashimi pacific tempura grilled fish salmon meuniere saury sour pork includes 100 Japanese food categories with bound- ing boxes on each food item. It contains about one lightly steamed seasoned spicy chili- egg roasted egg tempura fried sirloin nanbanzuke boiled fish beef with hambarg steak dried fish ginger pork flavored cabbage omelet sunny-side fish hotchpotch chicken cutlet potatoes steak saute roll up hundred images for each category and 9132 images

stir-fried boiled fish-shaped steamed chilled simmered chicken sashimi pancake shrimp roast omelet cutlet spaghetti fried natto cold tofu egg roll beef and pork and sushi bowl with bean with chill meat with fried curry shrimp noodle peppers bowl source chicken dumpling rice meat sauce totally. For the experiments, we selected 500 multi- vegetables jam ple food-item images from them which contain 1178

Japanese kinpira- potato green salad macaroni tofu and pork miso chinese pizza toast dipping goya style hot dog salad salad vegetable soup soup beef bowl rice ball noodles french fries chanpuru sauteed mixed rice chowder burdock food items for test, and used the rest of all the im- ages for training. Figure 3. 100 food categories in our To evaluate the performance, we use a classifi- dataset. Please see this figure on a PDF cation rate within the top N candidates CR@N re- viewer with magnification. garding food items, which is defined in the following equation:

1 where ci,j represents the number of co-occurrence num. of correctly-detected food items in top N i j CR@N = pairs of food category and over whole the train- num. of all the food items in all the test image ing dataset, and F represents a set of all the pre- defined food categories. Note that all the diagonal If the top N candidates include the names of the elements of S aredefinedas0. food items appearing in the given food image, we In addition to a food image database, we also count them as the correctly-detected food items. use World Wide Web as knowledge source to cal- Firstly,wecomparetheresultsbeforeandafter S culate co-occurrence matrix . In[8],inad- applying manifold ranking using the co-occurrence dition to co-occurrence statistics extracted from matrix built from the food image dataset. Figure 4 image database, they proposed constructing co- shows the classification rates within the top N can- occurrence matrix with Google Web Search. Fol- didates. In this figure, “Baseline” represents the lowing that, we construct co-occurrence matrix us- initial results obtained by the previous method [4], ing Web data. Note that Rabinovich et al. used while “DB” represents the refined results by man- conditional random field (CRF) [8], while we use ifold ranking. Regarding classification rate within the manifold ranking method to re-rank the initial top 10 (CR@10), “DB” outperformed “Baseline” output. by 7.67 points, which shows the effectiveness of the As a method to estimate co-occurrence of two proposed co-occurrence-based refinement method. words from the Web, we use Normalized Google Figure 4 shows co-occurrence frequency among the Distance (NGD), which is calculated based on the top 15 frequent foods in our database. “Red boxes” hit number of Web search results. NGD between means more high frequent co-occurrence pairs. word x and word y is given by In the previous experiment, we set the constant value α in Eq.1 as 0.1. This value was decided α max(log f(x), log f(y)) − log f(x, y) by the experimental results varying from 0.0 to NGD(x, y)= , 0.9 as shown in Figure 6. α is a constant adjust- M − f x , f y log min(log ( ) log ( )) ing how extent co-occurrence is taken into account (4) in the manifold ranking computation. When α is M where is the total number of Web pages over 0.1, CR@10 achieved the maximum value, 63.50%. f x f y whole the Web, and ( )and ( ) represent the That is why we set α as 0.1. x y number of hit pages for the search word and , Next, we carried out manifold ranking with co- respectively. Then, a co-occurrence matrix can be occurrence matrix obtained by the knowledge on computed as follows: the Web as well. As explained in the previous sec- tion, we used Normalized Google Distance which −NGD , can be calculated based on the hit numbers of given S  exp( (Namei Namej) , i,j = words returned by Google Search. We built three exp(−NGD(Namek, Namej)) k∈F ∧k= j kinds of NGD-based co-occurrence matrices: the (5) first one is computed using Google Search, the sec- where “Namei” and “Namej”meansthenametext ond one is obtained using Google Image Search in- of food category i and j, respectively. steadofnormalGoogleSearch,andthethirdone As a system output, we obtain the names of the is calculated using Google Search the search tar- top N food items over the given image regarding get sites of which were restricted to blogs mainly the descending order of the elements of the manifold related to foods. Figure 7 shows the results by ∗ ranking vector r . three kinds of matrices as well as baseline, and

2019 Table 1. Classification rate within the top ten candidate (CR@10). Baseline DB NGD(text) NGD(img) NGD(blog) 55.84 63.50 56.27 56.71 56.10 gain ⇒ +7.67 +0.43 +0.87 +0.26

Figure 6. Classification rate varying the value of α

Figure 4. Classification rate before and af- ter applying manifold ranking

(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) rice (1) beef curry (2) tempura bowl (3) Figure 7. Classification rate with Web re- hamburger (4) miso soup (5) sources jiaozi (6) fried chicken (7) egg sunny-sidee up (8) natto (9) externalknowledgesuchasWebtextsmoredeeply, cold tofu (10) green salad (11) and compare other methods than manifold ranking

Japanese chowder (12) to consider co-occurrence statistics for multiple ob- beef bowl (13) ject recognition such as conditional random fields french fries (14) mixed rice (15) (CRF). Figure 5. Co-occurrence matrix extracted References from our food photo database [1] T. Joutou and K. Yanai, “A food image recognition system with multiple kernel learning,” in ICIP, pp. 285–288, 2009. [2] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local fea- [email protected] tures,” in CVPR, 2010. [3] Z. Zong, D.T. Nguyen, P. Ogunbona, and W. Li, “On the though Google Search with food blogs achieved the combination of local texture and global structure for food best results among the Web-based methods, its im- classification,” in ISM, pp. 204–211, 2010. [4] Y. Matsuda and K. Yanai, “Recognition of multiple-food provement was not so much. Overall, Web-based images by detecting candidate regions,” in ICME, 2012. co-occurrence was not effective in this experiment. [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, vol. 32, no. 9, pp. 4 Conclusions 1627–1645, 2010. [6] Y. Deng and B. S. Manjunath, “Unsupervised segmenta- In this paper, we proposed a method to detect tion of color-texture regions in images and video,” PAMI, vol. 23, no. 8, pp. 800–810, 2001. multiple food items from one food image consider- [7] S. Sonnenburg, G. R¨atsch, C. Sch¨afer, and B. Sch¨olkopf, ing co-occurrence statistics with the manifold rank- “Large Scale Multiple Kernel Learning,” JMLR, vol. 7, pp. 1531–1565, 2006. ing method. In the experiments, we could im- [8] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, proved the result from 55.85% to 65.62% regard- and S. Belongie, “Objects in context,” in ICCV, 2007. ing the classification rate within top ten using co- [9] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, “Learning hierarchical models of scenes, objects, occurrence statistics estimated from a food image and parts,” in ICCV, pp. 1331–1338, 2005. dataset. For these results, it has been shown that [10] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Sch¨olkopf, “Ranking on data manifolds,” NIPS, vol. the proposed method is very effective. On the other 16, pp. 169–176, 2004. hand, Web-based co-occurrence does not improve [11] J. He, M. Li, H.J. Zhang, H. Tong, and C. Zhang, “Manifold-ranking based image retrieval,” in ACM MM, the initial results so much. pp. 9–16, 2004. For future work, we need to examine how to use

2020