Multimed Tools Appl (2009) 42:97–113 DOI 10.1007/s11042-008-0247-7

Automatic image annotation using visual content and folksonomies

Stefanie Lindstaedt · Roland Mörzinger · Robert Sorschag · Viktoria Pammer · Georg Thallinger

Published online: 13 November 2008 © Springer Science + Business Media, LLC 2008

Abstract Automatic image annotation is an important and challenging task, and becomes increasingly necessary when managing large image collections. This paper describes techniques for automatic image annotation that take advantage of collaboratively annotated image databases, so-called visual folksonomies. Our approach applies two techniques based on image analysis: first, classification, which annotates images with a controlled vocabulary, and second, tag propagation along visually similar images. The latter propagates user-generated, folksonomic annotations and is therefore capable of dealing with an unlimited vocabulary. Experiments with a pool of Flickr images demonstrate the high accuracy and efficiency of the proposed methods in the task of automatic image annotation. Both techniques were applied in the prototypical tag recommender “tagr”.

Keywords Auto-annotation · Tagging · Image retrieval · Classification

S. Lindstaedt · V. Pammer
Know-Center / KMI TU Graz, Inffeldgasse 21A, Graz, Austria
S. Lindstaedt
e-mail: [email protected]
V. Pammer
e-mail: [email protected]

R. Mörzinger (B) · R. Sorschag · G. Thallinger
Joanneum Research, Institute of Information Systems and Information Management, Steyrergasse 17, 8010 Graz, Austria
e-mail: [email protected]
R. Sorschag
e-mail: [email protected]
G. Thallinger
e-mail: [email protected]

1 Introduction

With the prevalence of personal digital cameras, more and more images are hosted and shared on the Web. Systems organizing and locating images in these image databases heavily depend on textual annotations of images. For example, major existing image search engines like Google, Yahoo! and MSN rely on associated text in the web page, file names, etc. Naturally, the value of image databases grows immensely with rich textual annotations. Image annotation or image tagging is the process by which metadata is added to an image in the form of captions or keywords. For this task humans interpret an image using their background knowledge and their capability of imagination. Therefore humans are able to annotate concepts which are not captured in an image itself. It is worth noting that the human labeling process is subjective and may therefore lead to ambiguous annotations, especially in the absence of a fixed vocabulary. This issue can be addressed by making use of a shared annotation vocabulary.

The practice of collaboratively creating and managing annotations, known as folksonomic tagging in the context of Web2.0, aims at facilitating the sharing of personal content between users. A widely used website that relies on folksonomies as the main method for the organization of its images is Flickr.1 Flickr has a repository that is quickly approaching 2.0 billion images (as of November 2007)2 and growing. The large amount of information provided by such databases of annotated images is in the following referred to as a visual folksonomy. Users can describe pictures on Flickr in free text (title, caption) and via keywords (tags). If the picture owner allows it, users can also tag pictures of others. The general motivation of Flickr users to tag their own and others’ pictures is to make the pictures findable.

In Section 2 we discuss related work. Section 3 then examines in depth a hybrid approach for automatic image annotation, namely the combination of visual content analysis with existing folksonomies. A detailed analysis of our experiments is given in Section 4. The techniques for automatic image annotation have also been applied in the tag recommender tagr, which follows two goals: (1) to reduce human effort for annotation and (2) to work towards a collectively shared vocabulary in Flickr, exploiting multiple dimensions of content and interaction. Section 5 provides a description of the idea underlying the tagr, a system description and an investigation of tagging behavior when the tagr is used.

2 Related work

Automatic image annotation is a challenging task which has not been solved satisfactorily for general real-world applications. Many solutions cover only small domains and usually work with a very limited vocabulary. While there has been work on content-based image retrieval and object recognition over the last decades [9, 11, 24], work on automatic image annotation is a relatively new field.

1 http://www.flickr.com
2 http://www.techcrunch.com/2007/11/13/2-billion-photos-on-flickr

The topic of automatic image annotation has recently been boosted by the rise of photo sharing sites such as Flickr, in which users massively tag their own pictures as well as those of others. This provides researchers with a huge new amount of data with which to experiment and research various questions.3 From the point of view of image retrieval, one possible drawback is the “noisiness” of the data. In the context of image retrieval/annotation this means that a tag cannot be expected to describe the image’s content. Among the currently researched topics is therefore the question of how users actually tag and why they tag. The first question is addressed in detail e.g. in [12], in which the authors describe multiple functions that tags can have and come to the conclusion that describing an image’s content is only one possible function among many. In [23] a similar analysis of how users tag is described. The results are directly integrated into the implementation of various tag recommendation systems. The tag recommendation system described there is based on tag co-occurrence, similar to one of the functionalities underlying the tagr. Another related research question is of course how to effectively navigate through large image databases with imperfect annotation. This is addressed e.g. by [1], where both tags and general visual features, such as “colorfulness” (standard deviation between the means of the three color channels) and other color and texture features, are combined for navigation.

Automatic image annotation can be formulated as a classification problem, where e.g. supervised learning by Support Vector Machines (SVMs) is used to classify images and image parts into a number of concepts [7]. In a different approach, the probability of words associated with image features [14, 26] is considered. The work in [16] proposes an image annotation technique that uses content-based image retrieval to find visually similar images on the Web and uses the textual information associated with these images to annotate the query image. A similar approach is described in [13], where SIFT features [18] are utilized to retrieve similar images and to map keywords directly to these image descriptors. In [22] the Flickr image database is used to investigate collective annotations.

Some existing web-based image search engines use some of these techniques and operate on a pool of Flickr images. ‘Flickr suggestions’4 combines regular text-based and content-based search, ‘beholdsearch’5 includes a search for a number of predefined and trained high level concepts, and ‘retrievr’6 provides query-by-sketch using multiresolution wavelet decompositions. ‘ALIPR’7 [16] suggests automatically generated annotations for any online image specified by its URL.

3 Automatic image annotation

In our approach for automatic image annotation, we aim at taking advantage of the large amount of information provided by visual folksonomies. The visual content of untagged images is a valuable source of information that allows cross-linking these images to the images of the visual folksonomy. Given the high computational cost of content-based image analysis, efficient methods are required to enable annotation in real-time. In the context of automatic image annotation, we consider techniques to work in real-time if they are fast enough to let the user work in an acceptable way, e.g. with a guaranteed response time of less than one second.

Our approach to automatic image annotation is split into two strategies:

– automated image classification by off-line supervised learning of concepts from a folksonomy, and
– tag propagation from visually similar images.

Figure 1 shows the basic architecture incorporating the two strategies and related components. In the following, the content-based features we use and the two strategies are described in detail.

3 Flickr gives API-level access to the photos and tags, as long as the pictures are set as public by their owners.
4 http://www.imgseek.net
5 http://www.beholdsearch.com
6 http://labs.systemone.at/retrievr
7 http://www.alipr.com

3.1 Content-based features

We use three different MPEG-7 color features [19] and two texture features. They are computed globally to ensure that the image annotation is fast and scales with large visual folksonomies.

MPEG-7 color features

Color Layout describes the spatial distribution of colors. It is computed by dividing the image into 8 × 8 blocks and deriving the average color for each block. A DCT is then computed and a set of low-frequency DCT coefficients is selected and encoded.
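As an illustration, the following Python sketch (not the authors' implementation) computes a ColorLayout-style descriptor with NumPy and SciPy: the image is reduced to an 8 × 8 grid of block-average colors, a 2-D DCT is applied per channel, and a few low-frequency coefficients are kept. The YCbCr conversion, zigzag scan and quantization of the MPEG-7 standard are omitted for brevity.

```python
import numpy as np
from scipy.fftpack import dct

def color_layout(image, n_coeffs=6):
    """Simplified ColorLayout-style descriptor.

    image: H x W x 3 uint8 array (assumed RGB; MPEG-7 uses YCbCr).
    Returns a vector of low-frequency DCT coefficients per channel.
    """
    h, w, _ = image.shape
    # 1. Reduce the image to an 8 x 8 grid of block-average colors.
    grid = np.zeros((8, 8, 3))
    ys = np.linspace(0, h, 9, dtype=int)
    xs = np.linspace(0, w, 9, dtype=int)
    for i in range(8):
        for j in range(8):
            block = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            grid[i, j] = block.reshape(-1, 3).mean(axis=0)

    # 2. 2-D DCT per channel and selection of low-frequency coefficients
    #    (row-major selection instead of the zigzag scan used by MPEG-7).
    feats = []
    for c in range(3):
        coeffs = dct(dct(grid[:, :, c], axis=0, norm='ortho'),
                     axis=1, norm='ortho')
        feats.extend(coeffs.flatten()[:n_coeffs])
    return np.array(feats)
```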

Fig. 1 Basic system architecture. Dashed lines indicate the semi-automatic off-line process; solid lines and shaded boxes depict components used on-line for automatic image annotation

Dominant Color consists of a small number of representative colors, the fraction of the image represented by each color cluster and its variance. We use three dominant colors extracted by mean shift color clustering [6].

Color Structure captures both color content and information about the spatial arrangement of the colors. Specifically, we compute a 32-bin histogram that counts the number of times a color is present in an 8 × 8 windowed neighborhood, as this window progresses over the image rows and columns.
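A rough sketch of the DominantColor idea is given below, assuming scikit-learn's MeanShift in place of the mean shift implementation of [6]; cluster centers, pixel fractions and per-cluster variances are returned for the largest clusters. The bandwidth and sample size are illustrative choices.

```python
import numpy as np
from sklearn.cluster import MeanShift

def dominant_colors(image, n_colors=3, sample=2000, bandwidth=25.0):
    """Sketch of a DominantColor-style descriptor via mean shift clustering.

    image: H x W x 3 uint8 array. Returns up to n_colors tuples of
    (cluster center, fraction of sampled pixels, per-channel variance).
    """
    pixels = image.reshape(-1, 3).astype(float)
    # Subsample pixels to keep the clustering fast.
    idx = np.random.default_rng(0).choice(len(pixels),
                                           size=min(sample, len(pixels)),
                                           replace=False)
    sampled = pixels[idx]
    ms = MeanShift(bandwidth=bandwidth).fit(sampled)

    labels, counts = np.unique(ms.labels_, return_counts=True)
    order = np.argsort(-counts)            # largest clusters first
    result = []
    for k in order[:n_colors]:
        members = sampled[ms.labels_ == labels[k]]
        result.append((ms.cluster_centers_[labels[k]],
                       len(members) / len(sampled),
                       members.var(axis=0)))
    return result
```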

Texture features

Gabor Energy is computed by filtering the image with a bank of orientation- and scale-sensitive filters and calculating the mean and standard deviation of the filter output in the frequency space. We applied fast recursive Gabor filtering [27] for 4 scales and 6 orientations.

Orientation Histograms are computed by dividing an image into a certain number of subregions and by accumulating a 1-D histogram of gradient directions for every subregion. Specifically, we use 4 × 4 slightly overlapping subregions with 8 histogram bins, which makes the feature similar to SIFT features [18] and Histograms of Oriented Gradients [8].
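The orientation histogram feature can be sketched as follows, in a simplified plain-NumPy version that omits the slight subregion overlap (and does not cover the Gabor energy part).

```python
import numpy as np

def orientation_histograms(gray, grid=4, bins=8):
    """Sketch of the orientation histogram feature: the image is split into
    grid x grid subregions and a histogram of gradient directions (weighted
    by gradient magnitude) is accumulated per subregion.

    gray: H x W float array. Returns a grid*grid*bins vector (128-D for 4x4x8).
    """
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # range 0 .. 2*pi

    h, w = gray.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    feats = []
    for i in range(grid):
        for j in range(grid):
            a = angle[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            m = magnitude[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            hist, _ = np.histogram(a, bins=bins, range=(0, 2 * np.pi),
                                   weights=m)
            feats.extend(hist)
    feats = np.array(feats, dtype=float)
    return feats / (np.linalg.norm(feats) + 1e-9)      # L2 normalization
```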

3.2 Image annotation by classification

Automatic image annotation with concepts of a folksonomy is realized by supervised learning of classifiers. For this purpose, we selected a number of concepts by investigating high-level tags of the folksonomy. Afterwards, a training set characteristic of the target concepts has to be generated. In our work this was accomplished in a supervised manner. Concepts are chosen manually based on the most frequent tags amongst those which are at the same semantic level and non-overlapping, i.e. tags which are not umbrella terms or subtopics of other tags. The training sets can initially be compiled automatically by text search; however, it is also necessary to manually exclude (visually) unrelated images for better classification results.

Further, input features have to be selected and their representation determined. Based on feature selection experiments, we set the feature vector as a concatenation of the feature values extracted for ColorLayout, DominantColor, ColorStructure and GaborEnergy. As learning algorithm we used a multi-class SVM. As argued in [2], early normalized fusion is a good choice for low- or intermediate-level features such as the above. We apply early feature fusion, statistical feature normalization (adjusting each feature dimension to have zero mean and unit standard deviation) and scaling of the feature value range to −1 and 1. The SVM model is trained in a one-against-rest fashion using a C++ implementation based on LibSVM [4]. This kind of feature set has provided good performance in variable data environments, as reported in the TRECVid 2007 campaign [20] for evaluating the effectiveness of detection methods for semantic concepts (High Level Feature Extraction).
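The training pipeline can be approximated with scikit-learn, whose SVC is itself based on LibSVM, although the system described here is a C++ implementation. The sketch below applies early fusion, z-score normalization, scaling to [−1, 1] and a one-against-rest RBF SVM; the kernel parameters are the ones reported in Section 4, while the helper names are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def build_classifier():
    """One-against-rest RBF SVM on early-fused, normalized features."""
    return make_pipeline(
        StandardScaler(),                        # zero mean, unit std per dimension
        MinMaxScaler(feature_range=(-1, 1)),     # scale feature values to [-1, 1]
        OneVsRestClassifier(SVC(kernel='rbf', gamma=0.125, C=32)),
    )

def fuse(color_layout, dominant, structure, gabor):
    """Early fusion: concatenate the per-image feature arrays (n_images x d_i)."""
    return np.hstack([color_layout, dominant, structure, gabor])

# Usage sketch:
# X_train = fuse(cl, dc, cs, ge); y_train = concept_labels
# clf = build_classifier().fit(X_train, y_train)
# predicted = clf.predict(fuse(cl_test, dc_test, cs_test, ge_test))
```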

3.3 Tag propagation using CBIR

We use content-based image retrieval for automatic image annotation. In this approach, untagged images are compared to the images of a visual folksonomy in order to obtain the tags of visually similar images. Similar images are retrieved using the two image features ColorLayout and Orientation Histograms. These features capture both color and texture properties effectively and can be used for efficient image matching using the similarity search approach outlined below. In an off-line learning process the two features are computed for the images of a visual folksonomy. These image features are stored in a relational database along with the user tags.

In order to produce tags for an untagged image, the ColorLayout and Orientation Histogram features are computed and an image search is started to find visually similar images. The image search uses both features separately to find similar images before the resulting images are combined. As metric for ColorLayout similarity, a non-linear distance measure [19] is used, with the help of a hybrid tree (a multidimensional data structure for indexing high dimensional feature spaces) to achieve fast computation times [5]. Since the first dimensions of ColorLayout convey more information about relevant image content than the subsequent ones, we assigned higher weights to the first dimensions of the ColorLayout descriptors. This is done for matching as well as for constructing the hybrid tree index. The Orientation Histograms are compared with the Euclidean distance and break-up conditions to accelerate the process, i.e. a histogram comparison is stopped if the intermediate distance already exceeds the Euclidean distance of the previously retrieved similar images. Therefore not all 128 dimensions of the histograms need to be compared for all images in the database.

After a merged set of visually similar images is generated, candidate tags for the untagged image are selected from the tags of the visually similar images, see Fig. 2. We use tags that occur multiple times or tags with relationships to other tags (synonyms, hypernyms, and hyponyms) in this set of tags from similar images.

Fig. 2 Automatic image annotation examples. For three input images, the three visually most similar images resulting from similarity search, the propagated tags, the classified concept and the user generated tags of Flickr are shown

Table 1 Classifier results computed on the test set

               Precision   Recall
Banana         0.50        0.71
Blueberry      0.88        0.72
Kiwi           0.72        0.64
Orange         0.80        0.72
Raspberry      0.82        0.52
Strawberry     0.64        0.70
Other-fruits   0.55        0.72
Mean           0.70        0.67

The available parameters for this method are the maximum number of tags to propagate and the number of visually similar images to use.
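A minimal sketch of this tag propagation step is given below, assuming plain Euclidean matching with the break-up condition described above; the hybrid tree index, the weighted ColorLayout metric and the WordNet-relation criterion for candidate tags are left out. The function names and the simple "occurs more than once" voting rule are illustrative.

```python
import numpy as np
from collections import Counter

def early_exit_distance(a, b, best_so_far):
    """Squared Euclidean distance with the break-up condition: stop as soon
    as the partial sum exceeds the worst distance currently kept."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > best_so_far:
            return None                      # cannot enter the result set
    return total

def propagate_tags(query_feat, db_feats, db_tags, n_similar=50, max_tags=8):
    """db_feats: list of feature vectors, db_tags: list of tag lists
    (one per folksonomy image). Returns up to max_tags candidate tags."""
    best = []                                # (distance, index) pairs
    worst = np.inf
    for i, feat in enumerate(db_feats):
        d = early_exit_distance(query_feat, feat, worst)
        if d is None:
            continue
        best.append((d, i))
        best.sort()
        best = best[:n_similar]
        if len(best) == n_similar:
            worst = best[-1][0]

    # Keep tags that occur multiple times among the similar images.
    counts = Counter(tag for _, i in best for tag in db_tags[i])
    candidates = [t for t, c in counts.most_common() if c > 1]
    return candidates[:max_tags]
```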

4 Experiments, results and evaluation

The set of images used for our experiments is taken from the Flickr pool Fruit&Veg.8 The training set is composed of around 15,000 images and their tags from a snapshot of this pool from March 2007. The test set for evaluating classification and image similarity is made up of 100 relevant images per concept (700 in total), which were added to Flickr afterwards (until June 2008). The performance of the two strategies is measured separately using precision and recall.

For classification, we considered the following concepts: banana, blueberry, kiwi, orange, strawberry, raspberry and the class other-fruits, which does not contain any of these concepts. In this setup we intentionally included visually rather distinct concepts as well as visually similar concepts like raspberry and strawberry. For each of the concepts to be predicted, we subjectively selected a set of the 50 most relevant images from the training set, that is, images where the concept was most clearly predominant. After experimenting with different kernels and tuning the parameters by a grid search using 5-fold cross-validation, we selected the RBF kernel with the gamma parameter equal to 0.125 and the trade-off parameter C equal to 32.

Table 1 shows the results of the classifier applied to the test set. Generally, high precision was achieved, with up to 88% for the blueberry concept. The classifier produced fewer positive predictions for the concepts banana, strawberry and other-fruits. The precision values for banana and strawberry are impaired by false positives from kiwi and raspberry images, respectively. The high recall of 70% for strawberry sacrifices the recall of the visually similar raspberry. As can be seen in Table 2, 32% of the strawberry images were misclassified as raspberry. Issues with learning the separation from the exclusion class other-fruits affected the results for banana and kiwi (about 11% false negatives). We observe that the classifier performed very well in discriminating blueberries against most of the other concepts, that is, there was no confusion between blueberry and the concepts kiwi, orange, raspberry and strawberry in either direction. In total, we achieved a mean precision of 70% and recall of 67%.
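The parameter search described above could be reproduced roughly as follows with scikit-learn; the powers-of-two grid is a common convention and only an assumption, not necessarily the exact grid used here.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    """Grid search over RBF-SVM parameters with 5-fold cross-validation.
    In our experiments the search selected gamma=0.125 and C=32."""
    param_grid = {
        'C': [2 ** k for k in range(-1, 9)],        # 0.5 .. 256
        'gamma': [2 ** k for k in range(-7, 2)],    # 1/128 .. 2
    }
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5,
                          scoring='accuracy')
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```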

8 http://flickr.com/groups/fruitandveg

Table 2 Classifier confusion matrix computed on the test set

True           Predicted
               Banana   Blueberry   Kiwi   Orange   Raspberry   Strawberry   Other-fruits
Banana         0.51     0.04        0.17   0.11     0.02        0.04         0.11
Blueberry      0.06     0.89        0.01   0.02     0.00        0.00         0.01
Kiwi           0.07     0.03        0.73   0.00     0.05        0.02         0.10
Orange         0.07     0.00        0.03   0.81     0.00        0.07         0.02
Raspberry      0.00     0.02        0.02   0.00     0.83        0.14         0.00
Strawberry     0.00     0.02        0.01   0.00     0.32        0.64         0.01
Other-fruits   0.09     0.12        0.05   0.08     0.05        0.06         0.55

The underlined data indicates the maximum of each row, i.e. the most frequently predicted label for a certain class

In the evaluation of the automatic tag propagation technique, the training set serves as the reference database for similarity search. As in [16], we propagate tags for all 15,000 images, using the remaining images as the database for similarity search, and compare the propagated tags with the user tags. This evaluation shows the percentage of propagated tags which have also been generated by the user. Propagated tags are considered as false positives if they are not contained in the user annotations, even if they describe the image content correctly.

The two diagrams in Fig. 3 present results of tag propagation using the tags from the 50 most similar images. The diagram on the left shows precision (percentage of propagated tags contained in the user tags) and recall (percentage of user tags retrieved by the automatic tag propagation). These two measures behave in a complementary fashion when the number of propagated tags is increased. The propagation of 8 tags, a number that corresponds to the average number of tags per image (7.85) in the Fruit&Veg pool, leads to 19% precision and recall. The diagram on the right shows the absolute number of different tags (vocabulary size) generated with our approach. Although this experiment uses a rather small image pool, more than 10,000 different tags can be propagated. The dashed curve indicates that several thousand different tags are propagated by the system when propagating 32 tags per image. The solid line shows the number of different tags which are both propagated and user generated (true positives).
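For reference, the precision/recall computation underlying Fig. 3 can be expressed as in the following sketch, where a propagated tag counts as a true positive only if the user also assigned it; the helper name is hypothetical.

```python
def tag_precision_recall(propagated, user_tags):
    """propagated, user_tags: lists of tag collections, one pair per image.
    A propagated tag is a true positive only if the user also assigned it,
    mirroring the evaluation protocol described above."""
    tp = fp = fn = 0
    for pred, truth in zip(propagated, user_tags):
        pred, truth = set(pred), set(truth)
        tp += len(pred & truth)
        fp += len(pred - truth)
        fn += len(truth - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```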

Fig. 3 Tag propagation results computed on a set of 15,000 images. Precision/recall (left) and propagated/matched vocabulary sizes (right) are plotted against the number of propagated tags (logarithmic scale)

Moreover, we manually evaluated the tag propagation system with 250 randomly selected images from the test set described above. For that purpose, 8 tags per image were propagated using the 50 most similar images. A manual investigation of the relevance of these tags shows that about 70% of the propagated tags are useful annotations (precision) and only the remaining 30% are wrong annotations. The reason for the difference between the results shown in Fig. 3 and the manual evaluation is that many propagated tags are actually judged as correct by humans while being counted as false positives when compared only to the image tags from a single user. Examples of this dilemma are shown in Fig. 2. The classification and tag propagation of an untagged image of size 512 × 512 takes 0.9 s on average, mostly spent on feature extraction.

5 Application of image classification and tag propagation in the tag recommendation system tagr

The image classification and image similarity methods described above have been deployed within a prototypical tag recommender system, the tagr. The tagr’s innovation lies in the fact that it recommends tags based on three dimensions: visual content, text and user context. It was originally described in [17]. The two automatic annotation techniques described in the main part of this paper are used for recommending tags in the tagr system, i.e. the tags are not automatically assigned to a picture.

The scenario for the target users of the tagr is the following: Imagine you are a member of an online social community. You want to share your latest picture and thus you upload it. Tagr analyzes your picture and tries to identify its topic. The identified concept is returned to you as a suggested tag. In addition, tagr visually compares the uploaded picture to other pictures on the site. As a result, tagr returns a number of similar pictures together with the tags associated with them. You can now choose one or many of the recommended tags and add new ones. As an extra feature, tagr suggests a number of community members to you who have a user profile similar to yours (this is independent of images). In a real application, the picture and all collected tags would then be used in the upload process to the online social system.

We approach recommending tags for a picture from two sides: looking for tags by analyzing the picture and possibly already existing tags for it, or looking for tags by taking user preferences into account. The latter is the typical approach in collaborative filtering or recommendation systems [3]. One of the assumptions underlying the tagr is that of a “limited world”, i.e. we assume a reasonably sized community with reasonably well-defined interests. In our experiments this was achieved by using data from one group on Flickr, namely Fruit&Veg.8 This group is moderated, which means that uploaded images are checked for topic compliance before they are published. Otherwise, the group is public to all Flickr users. This assumption is relevant insofar as it allows us to assume that terms within our setting are unambiguous, that similar users also use similar vocabulary (as in addition to their social similarity we also observe them in a similar context) and that similar pictures also show the same content on an object level.

5.1 Tagr: description

The tagr’s user interface is shown in Fig. 4a. The upper left corner displays the uploaded picture and the upper right corner displays the tag-area. The tag-area shows all tags that the user has added to the picture. Below are three rows: The first is the picture row, showing pictures visually similar to the reference picture. Initially, the uploaded picture is taken as the reference picture and is displayed left-most in the picture row. Second is the tag row, showing tags similar to all tags in the tag-area.

Fig. 4 Screenshot of the tag recommendation system tagr (a, b)

In the tag row, the tagr-user can take over tags into the tag-area by clicking on them. Third is the user row, showing users similar to the reference user. Initially the reference user is the tagr-user uploading the picture. The reference user’s profile picture is depicted left-most in the user row. The user can click on a row’s icon (representing an image or a community user) to get a detail view as shown in Fig. 4b. The detail view shows the tags associated with the picture or user, which in turn can be chosen as a new reference for the picture or user row. Choosing a new reference updates the corresponding row. In the detail view, the user can also take over tags.

The tag row changes when a new tag is taken over from the tag row or the detail view, or when the user of tagr manually enters a new tag into the tag-area. Image classification is triggered as a new image is uploaded, and the classification result is shown directly in the tag-area. Retrieval of similar images feeds the picture row, and tag propagation occurs as users click on a picture to get the associated tags in the detail view. The tag row shows tags similar to the tags in the tag-area. Tag similarity is given by two slightly different functionalities, of which one is domain-dependent (tag association service) and one is domain-independent (WordNet term association service). Both are described in some more detail below. The underlying cognitive theory of both techniques has been described in [21].

The results shown in the user row are inherently independent of the uploaded picture, as the user row represents a kind of community view and relies on similarities between user profiles. The user row is primarily intended as an additional feature to make the existing community in Flickr visible to the tagr-user. Additionally, however, it is used for tag recommendation as well, thus potentially fostering a shared vocabulary among Flickr users.

WordNet term association service WordNet is a lexical database for English [10]. It organises terms around senses: a number of terms (a synset) belongs to each sense, and hierarchical and non-hierarchical relations are defined between senses. Our WordNet term association service makes use of a Java interface9 to access the WordNet database. Within the tagr, the WordNet term association service is used to find all senses of a term, all synonyms that go with these senses, as well as the hypernym senses and their synsets. This service feeds the tag row together with the tag association service. For application in the tag row, each group of words gets ranked higher for every term in it that is also a result of the tag association service. As the main source of data covers only one topic (fruit), the general-knowledge database WordNet is used to integrate knowledge lying outside this narrow scope in order to get a handle on unexpected and new topics.
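A comparable lookup can be sketched with NLTK's WordNet interface (the tagr itself uses a Java interface): for each sense of a term, the synonyms of that sense and of its hypernym senses are collected into one group.

```python
from nltk.corpus import wordnet as wn   # requires the WordNet corpus data

def wordnet_associations(term):
    """For each sense of `term`, collect its synonyms and the synonyms of
    its hypernym senses, grouped per sense as in the tag row."""
    groups = []
    for synset in wn.synsets(term):
        words = set(lemma.name() for lemma in synset.lemmas())
        for hypernym in synset.hypernyms():
            words.update(lemma.name() for lemma in hypernym.lemmas())
        groups.append(words)
    return groups

# e.g. wordnet_associations('strawberry') yields sense groups containing
# terms such as 'berry' coming from the hypernym synsets.
```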

Tag association service The tag association service is based on the statistical distribution of tags in relation to the photos. Similar tags are found using a co-occurrence, or rather a co-tagging, analysis. For a single tag, the result set of recommended tags is the list of all tags that share at least one photo with it, sorted by their similarity weight. This service feeds the tag row together with the WordNet term association service. A tag gets additional points for ranking if it also occurs in the result set of the WordNet term association service. More details on this service can be found in [15].

9 http://www.mit.edu/~markaf/projects/wordnet/

Fig. 5 Overview of the data and functionality behind the tagr
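The tag association (co-tagging) idea described above can be sketched as follows: an inverted index from tags to photos is built, and for a query tag all tags sharing at least one photo are ranked by a co-occurrence weight. Jaccard similarity is used here as an illustrative weight; the exact similarity measure of the service is not specified above.

```python
from collections import defaultdict

def build_tag_index(photo_tags):
    """photo_tags: dict photo_id -> set of tags. Returns tag -> set of photo ids."""
    index = defaultdict(set)
    for photo, tags in photo_tags.items():
        for tag in tags:
            index[tag].add(photo)
    return index

def associated_tags(tag, index):
    """All tags sharing at least one photo with `tag`, sorted by a
    co-occurrence weight (Jaccard similarity as an illustrative choice)."""
    photos = index.get(tag, set())
    scores = {}
    for other, other_photos in index.items():
        if other == tag:
            continue
        shared = photos & other_photos
        if shared:
            scores[other] = len(shared) / len(photos | other_photos)
    return sorted(scores, key=scores.get, reverse=True)
```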

System architecture Tagr is designed as a modular system, making it possible to easily reuse the underlying services for different applications. A data service collects relevant data from a data source and makes it available to other services. Tagr relies on two data sources, namely WordNet10 and Flickr.11 The data service grabs data from the Flickr group Fruit&Veg. The functionality services consist of an image similarity service, an image classification service, a WordNet term association service, a tag association service and a user similarity service.

The system is currently not integrated online with Flickr but works only on a snapshot of Flickr data. Most of the analysis can therefore be done offline. The functionality evaluated in this paper relies on a snapshot covering about 15,000 pictures. The data is aggregated into new structures which are queried at runtime, e.g. the tag association service creates a tag association index. The data is grabbed from Flickr using the Flickr API. Only information that is publicly available is used. The interaction between the services is sketched in Fig. 5.

5.2 Investigation of tagging behavior with and without tagr

The goal of the following experiment was to investigate if and how the tagging behavior of users changes when a tag recommendation system is provided.

10 http://wordnet.princeton.edu/
11 http://www.flickr.com

A set of 40 pictures was taken from the Flickr group “fruit & veg”. These pictures are distinct from the set of pictures on which our backend services rely. However, no new Flickr users were introduced to our system. Four test users tagged these pictures. Every test user had to tag 20 pictures manually and use the tagr to tag 20 more pictures. Each picture was tagged at least once manually and at least once with the support of tagr. Additionally, each of the test users filled out an informal questionnaire giving general feedback on the tagr. We assume that all test users gave only correct tags to the pictures.

We first measured how the quantity of tags changes when a tag recommendation system is used. The mean number of tags given per picture per user when tagging manually is 4.99, whereas it is 5.79 when the tagr was used, which means that in the presence of the tag recommendation the images were tagged slightly more.

We then measured whether the vocabulary was more varied when users were tagging manually. This was our hypothesis, as it seemed logical that an unrestricted vocabulary (used in manual tagging) would be more varied than a restricted vocabulary (restricted by the tags already in the dataset on which tagr is based and further restricted by which tags are recommended given a certain picture, user etc.). The mean number of distinct tags per picture was 8.28 for manual tagging and 9.28 when the tagr was used. This means that overall, users used more distinct tags when using the tagr. On a picture-by-picture basis, however, the ratio between tags given manually and tags given when supported by the tagr is 1.09. This shows a slight tendency towards our original hypothesis; the numbers are not clear enough, however, to definitely support it.

Finally, an informal questionnaire was used to collect the test users’ impressions on the ease of use, and to detect potential show-stoppers. The users were asked to rank the different features according to how useful they perceived them. In this ranking, the tag row turned up at the top, followed by the picture row, with the user row in third place. Our interpretation is that, for the task of tag recommendation, the tag row was favored because it most directly supports tagging. The other two rows require at least one click on a similar picture or user before a tag can be added. A consequence would be to directly show propagated tags from image or user similarity in the user interface. This interpretation is supported by results presented in [17], where we showed that the actual quality of the results in the three rows does not differ significantly. Tags that were recommended but irrelevant were not rated as a large problem by the users, since the total number of tags recommended by the tagr is still manageable. Results of this evaluation were first reported in [17]. There we also describe an additional study in which precision and recall are measured separately for image classification, tag recommendation by image retrieval, tag association and user similarity.

6 Conclusion

We presented two strategies for automatic image annotation that build on existing knowledge from labeled image databases, so-called visual folksonomies. On the one hand, we implemented a supervised learning approach for classification using an SVM and a controlled vocabulary. On the other hand, we presented a tag propagation system that uses content-based image retrieval for automatic annotation and works with an uncontrolled folksonomic vocabulary. The proposed approaches use state-of-the-art image features and compute tags in less than one second per image, which enables real-world applications on top of large-scale image databases.

Experiments on a set of images from Flickr’s pool Fruit & Veg show that classification works with a high mean precision and recall of about 70% on a limited set of target concepts. The behavior of the tag propagation approach heavily depends on the number of propagated tags per image. When only few tags are propagated, high precision but low recall is achieved, while propagating many tags leads to low precision and high recall. Although the classification setup presented in this work only implements the prediction of one concept per image, it is planned to extend the approach and learn several binary classifiers, one for each concept, in order to annotate an image with multiple concepts. Future work might include the automatic definition of concepts and experiments using more general and larger visual folksonomies.

Acknowledgements The authors would like to thank their colleagues Marcus Thaler and Werner Haas for their support and feedback. The Know-Center is funded within the Austrian COMET Program—Competence Centers for Excellent Technologies—under the auspices of the Austrian Ministry of Transport, Innovation and Technology, the Austrian Ministry of Economics and Labor and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

References

1. Aurnhammer M, Hanappe P, Steels L (2006) Integrating collaborative tagging and emergent semantics for image retrieval. In: Proceedings WWW2006, collaborative web tagging workshop, Southampton, May 2006
2. Ayache S, Quenot G, Gensel J (2007) Classifier fusion for SVM-based multimedia semantic indexing. In: European conf. on information retrieval, Rome, 2–5 April 2007
3. Breese JS, Heckerman D, Kadie C (1998) Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the 14th annual conference on uncertainty in artificial intelligence, pp 43–52. citeseer.ist.psu.edu/breese98empirical.html
4. Chang C, Lin CJ (2008) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Chakrabarti K, Mehrotra S (1999) The hybrid tree: an index structure for high dimensional feature spaces. In: Proceedings of the 15th international conference on data engineering, Washington, DC, 23–26 March 1999
6. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24:603–619
7. Cusano C, Ciocca G, Schettini R (2003) Image annotation using SVM. In: Proceedings of internet imaging IV, SPIE, Santa Clara, 21–22 January 2003
8. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Conf. on computer vision and pattern recognition, vol 1. IEEE, Piscataway, pp 886–893
9. Datta R, Li J, Wang J (2005) Content-based image retrieval: approaches and trends of the new age. In: MIR ’05: proceedings of the 7th ACM SIGMM international workshop on multimedia information retrieval. ACM, New York
10. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT, Cambridge
11. Forsyth D, Ponce J (2002) Computer vision: a modern approach. Prentice Hall, Englewood Cliffs
12. Golder SA, Huberman BA (2006) The structure of collaborative tagging systems. J Inf Sci 32(2):198–208
13. Hardoon DR, Saunders C, Szedmak S, Shawe-Taylor J (2006) A correlation approach for automatic image annotation. Int Conf Adv Data Mining Appl, Springer LNAI 4093:681–692

14. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR conference on research and development in information retrieval, Toronto, pp 119–126
15. Kern R, Granitzer M, Pammer V (2008) Extending folksonomies for image tagging. In: 9th international workshop on image analysis for multimedia interactive services WIAMIS 2008, Klagenfurt, 7–9 May 2008
16. Li X, Chen L, Zhang L, Lin F, Ma WY (2006) Image annotation by large-scale content-based image retrieval. In: Proceedings of ACM int. conf. on multimedia, Santa Barbara, 23–27 October 2006
17. Lindstaedt S, Pammer V, Mörzinger R, Kern R, Mülner H, Wagner C (2008) Recommending tags for pictures based on text, visual content and user context. In: Proceedings of the 3rd international conference on internet and web applications and services (ICIW 2008), Athens, 8–13 June 2008
18. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
19. Manjunath BS, Ohm J-R, Vasudevan VV, Yamada A (2001) MPEG-7 color and texture descriptors. IEEE Trans Circuits Syst Video Technol 11(6):703–715
20. Mörzinger R, Thallinger G (2007) TRECVid 2007 high level feature extraction experiments at JOANNEUM RESEARCH. In: Proceedings of TRECVID workshop, Gaithersburg, 5–6 November 2007
21. Pammer V, Ley T, Lindstaedt S (2008) tagr: Unterstützung in kollaborativen Tagging-Umgebungen durch semantische und assoziative Netzwerke. In: Medien in der Wissenschaft. Waxmann Verlag
22. Shaw B (2006) Learning from a visual folksonomy: automatically annotating images from Flickr, May 2006
23. Sigurbjörnsson B, van Zwol R (2008) Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th international conference on world wide web, WWW 2008, Beijing, 21–25 April 2008
24. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22:1349–1380
25. Wang X, Zhang L, Jing F, Ma WY (2006) AnnoSearch: image auto-annotation by search. In: Proceedings of the international conference on computer vision and pattern recognition, New York, 17–22 June 2006
26. Yavlinsky A, Schofield E, Rüger SM (2005) Automated image annotation using global features and robust nonparametric density estimation. In: Proceedings of the 4th int. conf. on image and video retrieval (CIVR), vol 3568, Singapore, July 2005, pp 507–517
27. Young IT, van Vliet LJ, van Ginkel M (2002) Recursive Gabor filtering. IEEE Trans Signal Process 50(11):2798–2805

Stefanie Lindstaedt is head of the division Knowledge Services at the Know-Center in Graz and is scientific coordinator of the APOSDLE IP. In these roles she is responsible for the management of many large, multi-firm projects and the scientific strategy of the division. Her research interests are in knowledge management, technology enhanced learning, and software engineering. Specifically, she focuses on hybrid approaches which combine semantic technologies with the power of soft computing (methods which are able to deal with uncertainty, such as heuristics, statistical analysis, neural networks, etc.) in order to provide intelligent knowledge work support. She holds a PhD and an M.S., both in Computer Science, from the University of Colorado (CU) at Boulder (USA). She is a member of the Center for LifeLong Learning and Design and the Institute of Cognitive Science at CU, has chaired several Special Tracks on Semantic Technologies and Work-integrated Learning at I-Know conferences, and has published more than 65 scientific publications.

Roland Mörzinger finished his studies in ‘Software Engineering for Medical Purposes’ at the Hagenberg University of Applied Sciences with the diploma thesis ‘Detection of Grain and Noise for Regraining in Film and Video’ in 2005. Since then he has been working as a research associate for the JOANNEUM RESEARCH Institute of Information Systems & Information Management, where he is involved in large international R&D projects. He does research in computer vision and multimedia retrieval with a focus on film restoration, machine learning, image and video classification and multi-view video analysis.

Robert Sorschag was born in 1982 in Klagenfurt, Austria. After finishing high school he started to study Applied Informatics in 2000 at the Alpen-Adria University of Klagenfurt. Since 2004 he has worked for the Institute of Information Systems & Information Management of the Austrian research company JOANNEUM RESEARCH as an R&D employee. In 2006 he received his Master of Science for his master thesis “Object Recognition for Media Monitoring with Emphasis on Didactical Purpose”. His areas of experience and interest cover computer vision, object recognition and classification, image and video copy detection, machine learning, multimedia content analysis and metadata generation, and software design & development.

Viktoria Pammer is working at the Know-Center in Graz, and at the Knowledge Management Institute at Graz University of Technology. Her working areas cover both semantic web and Web2.0 technologies. On the Web2.0 side, she is interested in how tags can automatically be organised into (semi-)formal structures. Her special research interest is in bringing together Web2.0 with the formalisms of the Semantic Web. She holds an MSc in Telematics from Graz University of Technology, where her major subjects were signal processing and machine learning. Currently she is pursuing a PhD program in Telematics, also at Graz University of Technology.

Georg Thallinger received an MSc in Telematics from Graz University of Technology, Austria, in 1992. He joined the Institute of Information Systems & Information Management at JOANNEUM RESEARCH right after university as a research engineer in the domain of scientific visualization. Since 2002 Georg has been co-leader of the Digital Media group at the institute and as such coordinates large, international projects. His areas of interest and expertise are content-based analysis and retrieval of audiovisual information and the application of these in the domains of audiovisual archives, film restoration and surveillance.