Article

Semantic Scene Graph Generation Using RDF Model and Deep Learning
Seongyong Kim 1, Tae Hyeon Jeon 2, Ilsun Rhiu 3 , Jinhyun Ahn 4 and Dong-Hyuk Im 5,*
1 Department of Big Data and AI, Hoseo University, Asan 31499, Korea; [email protected]
2 Department of Computer Engineering, Hoseo University, Asan 31499, Korea; [email protected]
3 Division of Future Convergence (HCI Science Major), Dongduk Women’s University, Seoul 02748, Korea; [email protected]
4 Department of Management Information Systems, Jeju National University, Jeju 63243, Korea; [email protected]
5 School of Information Convergence, Kwangwoon University, Seoul 01890, Korea
* Correspondence: [email protected]
Abstract: Over the last several years, in parallel with the general global advancement in mobile technology and a rise in social media network content consumption, multimedia content production and reproduction has increased exponentially. Therefore, enabled by the rapid recent advancements in deep learning technology, research on scene graph generation is being actively conducted to more efficiently search for and classify images desired by users within a large amount of content. This approach lets users accurately find images they are searching for by expressing meaningful information on image content as nodes and edges of a graph. In this study, we propose a scene graph generation method based on the Resource Description Framework (RDF) model to clarify semantic relations. Furthermore, we use convolutional neural network (CNN) and recurrent neural network (RNN) deep learning models to generate a scene graph expressed in a controlled vocabulary of the RDF model and to understand the relations between image object tags. Finally, we experimentally demonstrate through testing that our proposed technique can express semantic content more effectively than existing approaches.

Keywords: scene graph; RDF model; deep learning; image annotation

1. Introduction

The total amount of digital image content has increased exponentially in recent years, owing to the advancement of mobile technology and, simultaneously, content consumption has greatly increased on several social media platforms. Thus, it has become increasingly imperative to effectively store and manage these large amounts of images. Hitherto, image
annotation techniques able to efficiently search through large amounts of image content have been proposed [1–5]. These techniques represent images in several ways by storing semantic information describing the image content along with the images themselves. This enables users to accurately search for and classify a desired image within a large amount of image data. Recently, scene graph generation methods for detecting objects in images and expressing their relations have been actively studied with advancements in deep learning technology [6,7]. Scene graph generation involves the detection of objects in an image and relations between these objects. Generally, objects are detected first, and the relations between them are then predicted. Relation detection using language modules based on a pretrained word vector [6], and through message interaction between object detection and relation detection [7], have been used for this prediction. Although scene graph generation complements the ambiguity of image captions expressed in natural language, it does not effectively express semantic information describing an image. Therefore, a method of adding semantic information by incorporating the Resource Description Framework (RDF)
model [8] into conventional scene graph generation is proposed in this study. Although several works in the relevant literature [2–4,9,10] also applied the RDF model to image content, these studies did not utilize deep learning technology-based models. The proposed method detects image objects and relations using deep learning and attempts to express them using an RDF model, as shown in Figure 1.
Figure 1. Overview of semantic scene graph generation.

The contributions of the present study are as follows. First, an RDF model for generating a scene graph is proposed. Semantic expression in a controlled vocabulary is possible by expressing the existing scene graph using an RDF model. Specifically, scene graph sentences that are logically contradictory can be filtered out using an RDF schema (RDFS). Furthermore, an image can be queried using SPARQL [11], an RDF query language. Second, deep learning technology was applied to the RDF model-based scene graphs. Although image content was also described in [3,10] using the RDF model, a user had to manually input the relation in [3], and a machine learning method was applied to detect only image tags in [10]. In this study, relations between image tags were detected using a deep learning model during generation of image scene graphs. Furthermore, subjects and objects were detected using a Convolutional Neural Network (CNN) model in the RDF-based scene graph, and a property relation was trained using a Recurrent Neural Network (RNN) model. Third, our results indicate that the expression of scene graphs can be improved significantly using the proposed inference approach with RDF-based scene graphs.

The remainder of this paper proceeds as follows. Various image annotations and image search methods are examined as related work in Section 2, and the proposed methods are described in Section 3. We conclude by presenting test results through a comparison with the conventional method in Section 4.
2. Related Work

Ontology-based image annotation has been studied extensively [1–5,8]. Image annotation using tags is mainly used for image search by identifying rankings and meanings of tags. I-TagRanker [5] is a system that selects the tag that best represents the content of a given image among the several tags attached to it. In this system, a tag-extension step is first performed to find images similar to the given image, after which tags are attached to them. The next step is to rank the tags using WordNet (https://wordnet.princeton.edu/) in the order in which they are judged to be related to the image while representing detailed content. The tag with the highest ranking is the one that best represents the image.

The authors of [12] suggested semiautomated semantic photo annotation, proposing an effective combination of context-based methods for annotation suggestion. Contextual concepts are organized into ontologies that include locations, events, people, things and time. An image annotation and search method based on an object-relation network was proposed in [13]. This method entails finding objects within images based on a probability model by identifying parts of images, referred to as segments, and searching images using ontologies that represent the relations between objects. Compared with other approaches, this method can distinguish images that are much more semantically similar.

There have also been extensive studies on image processing and tagging on mobile devices [14]. Images were searched using context information provided by smart devices in [14]. The context information used included time and location, as well as social and personal information. Context information was annotated when capturing a camera image; subsequently, the annotated information enabled a user to find the desired image through a search. However, only the query words provided by the system can be used because the annotation form is standardized.

A model that represents the semantic relations between tags in an image using a knowledge graph was proposed in [15]. A relational regularized regression CNN (R3CNN) model was initialized with AlexNet pretrained on ImageNet, using five convolutional layers, three pooling layers and two fully connected layers [15]. First, tags were extracted from an image; the image tags were then vectorized using equations proposed in the study, and the distance between vectors was calculated. Subsequently, the vectors were used for training. Although the method in [15] is extremely useful for detecting relations between image tags, it requires a well-established knowledge base.
3. Proposed System

3.1. Semantic Scene Graph

In this study, we constructed semantic scene graphs using RDF. The RDF was established by the W3C (World Wide Web Consortium) and was designed to describe additional information pertaining to the objects being described, as well as hierarchical and similarity relations between data. In other words, it provides a way to define data and provide a description or relation. The RDF is usually described using a triple model with a subject, predicate and object. The subject denotes the resource data to be represented, and the predicate denotes characteristics of the subject and indicates a relation between the subject and the object. The object, then, is the content or value of a description. Each value can be described using a uniform resource identifier (URI). The RDF not only describes actual data but also supports the RDFS, a schema that describes the types of terms used in the data and the relations between them. If the scene graph is described using the RDF(S) model, it can be utilized as shown in Figure 2. The upper and lower parts of Figure 2 represent the RDF model and RDFS, respectively.

Our proposed scene graph generation method creates a representation of relations between objects in an image and can prevent semantically incorrect scene graphs, such as “John wears banana” or “Jane eats shirt,” from being generated. For instance, RDFS can set “person” as the subject and “fruit” as the object for the predicate “eats.” Likewise, “person” could be used as the subject and “cloth” as the object of the predicate “wears,” to generate a semantically correct scene graph.
Although image annotations were represented using the RDF model in [3,4], in these works the user manually inputs the subject, predicate and object. The uniqueness of the present study lies in our application of the RDF model to scene graph generation using a machine learning algorithm.
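To make the RDF(S) usage concrete, the following is a minimal sketch using the Python rdflib library. The paper does not publish its vocabulary, so the namespace URI, class names and the helper function below are illustrative assumptions rather than the authors' actual implementation.

```python
from rdflib import Graph, Namespace, RDF, RDFS

# Hypothetical namespace; every name in it is an illustrative assumption.
EX = Namespace("http://example.org/scene#")

g = Graph()
g.bind("ex", EX)

# RDFS part (lower half of Figure 2): declare that the predicate "eats"
# relates a Person subject to a Fruit object.
g.add((EX.eats, RDF.type, RDF.Property))
g.add((EX.eats, RDFS.domain, EX.Person))
g.add((EX.eats, RDFS.range, EX.Fruit))

# RDF part (upper half of Figure 2): one detected scene-graph triple.
g.add((EX.John, RDF.type, EX.Person))
g.add((EX.banana, RDF.type, EX.Fruit))
g.add((EX.John, EX.eats, EX.banana))

# A contradictory candidate such as <John, eats, shirt> can be rejected
# before insertion by checking the object's type against rdfs:range.
def range_ok(graph, predicate, obj):
    declared = set(graph.objects(predicate, RDFS.range))
    return any((obj, RDF.type, r) in graph for r in declared)

print(range_ok(g, EX.eats, EX.banana))  # True

# SPARQL [11] query over the scene graph: who eats what?
for s, o in g.query("SELECT ?s ?o WHERE { ?s ex:eats ?o . }", initNs={"ex": EX}):
    print(s, o)
```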
Figure 2. Resource Description Framework (RDF) and RDF schema for semantics.

3.2. Creating a Deep Learning-Based Scene Graph

Image tagging was performed using a CNN for image annotation in a previous study [10]. However, training images using a CNN makes it impossible to understand the semantics of objects. Furthermore, since the previous method made a direct SPARQL query on DBPedia (https://wiki.dbpedia.org/) to detect relations between image tags, it was impossible to find a relation that was not already stored in DBPedia. Therefore, although we performed image tagging using a CNN, as in the previous study, our approach predicts relations between tags after vectorization and image tagging, using a long short-term memory (LSTM) network as well as gated recurrent units (GRU), an RNN model, as sketched below. The overall process of the proposed method is shown in Figure 3.
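As a rough illustration of that relation-prediction stage, the following PyTorch sketch feeds the embedded subject and object tag vectors (see Equation (4) below) through a GRU and scores candidate predicates with a linear layer. All layer sizes, the relation count and the class names are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class RelationPredictor(nn.Module):
    """Score candidate predicates for an embedded (subject, object) tag pair."""
    def __init__(self, embed_dim: int = 300, hidden_dim: int = 256,
                 num_relations: int = 50):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_relations)

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: (batch, 2, embed_dim) — subject vector, then object vector.
        _, h = self.gru(pair)           # h: (1, batch, hidden_dim)
        return self.classifier(h[-1])   # logits over candidate relations

model = RelationPredictor()
subj = torch.randn(1, 300)   # e.g., the embedding of "person"
obj = torch.randn(1, 300)    # e.g., the embedding of "horse"
logits = model(torch.stack([subj, obj], dim=1))
```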
Figure 3. Proposed system.

First, in our proposed method, tags are created for image objects using a CNN model. In particular, we used the ResNet-101 model [16] as the state-of-the-art (SOTA) CNN model for image tagging because it uses skip-connection techniques and has effective performance on ImageNet. The process for image tagging using ResNet-101 is as follows. First, the CNN model extracts the features of images using a filter and then stores them in
various feature maps. The fully connected (FC) layer uses a value projected by the number of classes (tags) from each layer to classify the final image of the extracted feature map. In the output layer, the predicted value of the image tag t_i is output using the sigmoid activation function, shown in Equation (1) below:

σ(x) = 1/(1 + e^(−x))    (1)

The sigmoid function σ is used for outputting values for each class and is defined as
follows in Equation (2):

t_i = σ(W x_i + b)    (2)

where W is the FC layer weight matrix, and b is the bias term. Then, we adopt a binary cross-entropy loss function, given as follows:

L_1 = −∑_{i=1}^{N} ∑_{j=1}^{M} [ y_ij log t_ij + (1 − y_ij) log(1 − t_ij) ]    (3)

where x_i is the input image, N is the number of images in the dataset, M is the number of tags, t_ij is the predicted value of tag j for image x_i, and y_ij is the ground-truth label for image x_i. For example, y_ij = 1 indicates that image x_i contains tag j. The process of detecting an object in an image using the CNN model is shown in Figure 4.
Figure 4. Convolutional neural network (CNN) model for image tagging.
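As a concrete illustration of Equations (1)–(3), the following is a minimal PyTorch sketch of a ResNet-101 multi-label tagging head. The tag count, batch size and training details are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageTagger(nn.Module):
    """ResNet-101 backbone with a multi-label tag head (Equation (2))."""
    def __init__(self, num_tags: int):
        super().__init__()
        self.backbone = models.resnet101(weights="IMAGENET1K_V1")
        # Replace the 1000-way ImageNet head with an FC layer over the tag set.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_tags)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns logits W x_i + b; the sigmoid of Equation (1) is folded
        # into the loss below for numerical stability.
        return self.backbone(x)

model = ImageTagger(num_tags=80)                 # M = 80 is an arbitrary example
criterion = nn.BCEWithLogitsLoss()               # sigmoid (Eq. 1) + BCE (Eq. 3)
images = torch.randn(4, 3, 224, 224)             # dummy batch of 4 images
targets = torch.randint(0, 2, (4, 80)).float()   # y_ij ∈ {0, 1}
loss = criterion(model(images), targets)
```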
The disadvantage of processing images using a CNN is that it becomes impossible to understand the semantics of the objects (tag words). To understand the meaning of words, techniques such as Word2Vec [17], which expresses words as vectors, have been created. Furthermore, systems such as ConceptNet [18] have become available to help computers understand the relationships between words. ConceptNet is a semantic network designed to enable computers to understand the semantics of words used by people. It provides a 300-dimensional representation vector w_i per embedded word. w_i is the 300-dimensional vector obtained by passing the image tag t_i through the embedding layer, represented as follows:

w_i = word2vec(t_i), w_i ∈ ℝ^300    (4)
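Equation (4) can be realized with any pretrained 300-dimensional embedding. The sketch below assumes the gensim library and a ConceptNet Numberbatch file in word2vec text format; the file name is a placeholder, not a path from the paper.

```python
from gensim.models import KeyedVectors

# Placeholder file: any pretrained 300-dimensional embedding in word2vec
# text format, e.g., ConceptNet Numberbatch English vectors.
vectors = KeyedVectors.load_word2vec_format("numberbatch-en.txt", binary=False)

def embed_tag(tag: str):
    """w_i = word2vec(t_i) ∈ R^300 (Equation (4)); None if out of vocabulary."""
    return vectors[tag] if tag in vectors else None

w = embed_tag("horse")
print(None if w is None else w.shape)  # (300,)
```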
In this paper, we propose an RDF model-based scene graph describing an image using triple sets of the form ⟨subject, predicate, object⟩.