
International Journal of Latest Research in Science and Technology
ISSN (Online): 2278-5299
Volume 2, Issue 5, Page No. 52-57, September-October 2013
https://www.mnkpublication.com/journal/ijlrst/index.php

IMAGE, SPEECH AND NATURAL LANGUAGE PROCESSING APPROACH FOR BUILDING A SOCIAL NETWORK FOR THE VISUALLY IMPAIRED

N. Vignesh (1), S. Sowmya (2)
(1) Research Associate, IIM Ahmedabad
(2) SDE, ACS Oracle India

Abstract- A social network is a social structure made of nodes (which are generally individuals or organizations) that are tied by one or more specific types of interdependency, such as values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict or trade. The resulting graph-based structures are often very complex. Social network analysis views social relationships in terms of nodes and ties. Nodes are the individual actors within the networks, and ties are the relationships between the actors. There can be many kinds of ties between the nodes. Research in a number of academic fields has shown that social networks operate on many levels, from families up to the level of nations, and play a critical role in determining the way problems are solved, organizations are run, and the degree to which individuals succeed in achieving their goals. Our system puts forth integrated software that also converts multimedia content into speech, to give a better definition of social networking to the visually impaired.

Keywords: Multimedia, Orally, Visually Impaired

Publication History: Manuscript Received: 18 October 2013; Manuscript Accepted: 25 October 2013; Revision Received: 28 October 2013; Manuscript Published: 31 October 2013

1. INTRODUCTION

Presently, social networking has become part and parcel of everyone's life. The advent of visual multimedia applications has added value to the same. In the present day there are many social networking sites targeted purposefully at specific age groups. The interesting thing is that there is not yet a single networking site that allows the visually impaired to get hands-on experience of the social networking world. Although there is application software available to overcome the above-mentioned problem, scarcely any effort has been made to integrate it into a social networking platform. Currently the most widely used application software that allows the visually impaired to enter a social networking site is JAWS. The problem with JAWS is that it is only capable of vocalizing information that is textual. It is merely an old-school screen-reading program that does not suffice for present-day social networking. So we thought that integrating software that also converts multimedia content to speech would give a better definition of social networking to the visually impaired. But parsing an image into various sub-components (using AND-OR trees) and then feeding them to a text-to-speech converter does not really make sense, because the text so generated is grammatically insufficient to explain itself. So we made use of the I2T converter designed by Benjamin Z. Yao et al. to parse images into text components, which provides a reasonably better solution to the problem. It uses a text planner and a text realizer to provide a grammatically almost-correct description of a parsed picture. The working of this I2T converter is very much related to our problem, since it also uses the AND-OR tree hierarchy for parsing images into text, along with the aforementioned text enhancements. Our problem also includes importing non-lexical fillers into the AND-OR tree to provide a more realistic view. So we made an effort to integrate I2T and T2S software into a social networking base, to help the visually impaired enjoy the pleasure of social networking.

The paper is organized as follows: Section 2 deals with the related works (prior arts) and a comparison of our system with existing works. Section 3 describes the working of the proposed system in its design and implementation aspects. Section 4 concludes with future works.

2. PRIOR ARTS AND PROPOSED MODEL

2.1 JAWS (Job Access With Speech)

JAWS (Job Access With Speech) is a computer screen reader program that allows blind and visually impaired users to read the screen either with text-to-speech output or with a refreshable Braille display. JAWS was originally created for the MS-DOS operating system. It was one of several screen readers giving blind users access to text-mode MS-DOS applications. A feature unique to JAWS at the time was its use of cascading menus, in the style of the popular Lotus 1-2-3 application. What set JAWS apart from other screen readers of the era was its use of macros that allowed users to customize the user interface and work better with various applications. Ted Henter and Rex Skipper wrote the original JAWS code in the mid-1980s, releasing version 2.0 in mid-1990. Skipper left the company after the release of version 2.0, and following his departure, Charles Oppermann was hired to maintain and improve the product. Oppermann and Henter regularly added minor and major features and frequently released new versions. Freedom Scientific now offers JAWS for MS-DOS as a freeware download from their web site. In 1993, Henter-Joyce released a highly modified version of JAWS for people with learning disabilities. This product, called Word Scholar, is no longer available.

2.1.1 INABILITIES OF JAWS

JAWS was able to convince people that the visually impaired can access computers in the same way as their visually able counterparts. But with the advent of multimedia applications in social networking sites, the inability of JAWS was exposed, as it cannot parse images to produce vocal descriptions of them. Not to disrespect JAWS, it enabled about 867 visually impaired users to enjoy the pleasure of social networking (Facebook).

2.1.2 OUR ADVANCEMENT OVER JAWS

We thought that by adding image parsing capabilities to an interface, we would take the level of social networking to another high. Our project mainly aims at integrating an interface that would in turn take care of the overheads of converting text to speech and vice versa, and of parsing images and delivering them in text format. Not to mention, it would also replace the conventional Braille key-logger method with a fully fledged voice interface.
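To make the proposed integration concrete, the following sketch outlines the overall pipeline: an item in the user's feed is parsed to text where it is pictorial (I2T) and then vocalized (T2S). This is a minimal illustration in Python; the i2t and tts objects and their parse, generate_text and speak methods are hypothetical stand-ins for the actual converters, not a published API.

class FeedItem:
    """One item in the user's news feed."""
    def __init__(self, text=None, image=None):
        self.text = text    # plain text of the post, if any
        self.image = image  # raw image data, if any

def deliver(item, i2t, tts):
    """Render a feed item as speech for a visually impaired user."""
    parts = []
    if item.text:
        parts.append(item.text)
    if item.image:
        # I2T: parse the image into a parse graph, then into a sentence
        parse_graph = i2t.parse(item.image)           # hypothetical interface
        parts.append(i2t.generate_text(parse_graph))  # hypothetical interface
    tts.speak(". ".join(parts))                       # T2S: vocalize everything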

2.2 IMAGE TO TEXT PARSER (I2T)

2.2.1 OVERVIEW

The fast growth of public photo and video sharing websites, such as Flickr and YouTube, provides a huge corpus of unstructured image and video data over the Internet. Searching and retrieving visual information from the Web, however, has been mostly limited to the use of meta-data, user-annotated tags, captions and surrounding text (e.g. the image search engine used by Google). The I2T framework of Yao et al. generates text descriptions in natural language based on an understanding of image and video content. The framework comprises two major tasks, namely image parsing and text description. By analogy to natural language understanding, image parsing computes a parse graph of the most probable interpretation of an input image. This parse graph includes a tree-structured decomposition of the contents of the scene, from scene labels, to objects, to parts and primitives, so that all pixels are explained. It also has a number of spatial and functional relations between nodes, providing context at all levels of the hierarchy. The parse graph is similar in spirit to the parse trees used in speech and natural language understanding, except that it can include horizontal connections for specifying relationships and boundary sharing between different visual patterns. From a given parse graph, the task of text description is to generate semantically meaningful, human-readable and query-able text reports. To achieve this goal, the I2T framework has four major components.

2.2.2 COMPONENTS

An image parsing engine that parses input images into parse graphs. For specific domains, such as the two case-study systems presented, the image/video frame parsing is automatic. For parsing general images from the Internet for the purpose of building a large-scale image dataset, an interactive image parser is used.

An And-Or Graph (AoG) visual knowledge representation that embodies vocabularies of visual elements, including primitives, parts, objects and scenes, as well as a stochastic image grammar that specifies syntactic (compositional) relations and semantic relations (e.g. categorical, spatial, temporal and functional relations) between these visual elements. The categorical relationships are inherited from WordNet, a lexical semantic network of English. The AoG not only guides the image parsing engine with top-down hypotheses but also serves as an ontology for mapping parse graphs into semantic representations.

A Semantic Web that interconnects different domain-specific ontologies with the semantic representations of parse graphs. This step helps to enrich parse graphs derived purely from visual cues with other sources of semantic information. For example, suppose an input picture has the text tag "Oven's mouth river". With the help of a GIS database embedded in the Semantic Web, it can be related to a geo-location: "Oven's mouth preserve of Maine State". Another benefit of using Semantic Web technology is that end users not only can access the semantic information of an image by reading the natural language text report, but can also query the Semantic Web using standard semantic query languages.

A text generation engine that converts semantic representations into human-readable and query-able natural language descriptions.

As simple as the I2T task may seem to a human, it is by no means an easy task for any computer vision system today, especially when the input images are of great diversity in content (i.e. number and category of objects) and structure (i.e. spatial layout of objects), which is certainly the case for images from the Internet. But in certain controlled domains, automatic image parsing is practical. For this reason, the objective of the I2T work is twofold: (a) a semi-automatic (interactive) method is used to parse general images from the Internet in order to build a large-scale ground-truth image dataset, from which the AoG is learned for visual knowledge representation, the goal being to make the parsing process more and more automatic using the learned AoG models; and (b) in the case-study domains the camera is static, so only the background needs to be parsed (interactively) once at the beginning, and all other components are handled automatically.

2.3 AND-OR GRAPH (AoG)

The And-Or Graph (AoG) is a compact yet expressive data structure for representing diverse visual patterns of objects (such as a clock). The AoG is a general representation of visual knowledge that entails (1) a stochastic attributed image grammar specifying compositional, spatio-temporal and functional relations between visual elements; and (2) vocabularies of visual elements of scenes, objects, parts and image primitives.

2.3.1 STOCHASTIC IMAGE GRAMMAR

The AoG representation embodies a stochastic attributed image grammar. Grammars, studied mostly in language, are known for their expressive power to generate a very large set of configurations or instances (i.e. their language) by composing a relatively small set of words (i.e. shared and reusable elements) using production rules. The image grammar is therefore a parsimonious yet expressive way to account for the structural variability of visual patterns.
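As an illustration of this vocabulary-plus-grammar idea, the following is a minimal Python sketch of an And-Or structure, anticipating the definitions given below: And-nodes decompose an entity into all of its parts, Or-nodes switch between alternative substructures, and terminal nodes carry image primitives. The field names and the clock example are illustrative only, not the original I2T data structures.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    kind: str                                      # "AND" | "OR" | "TERMINAL"
    children: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # horizontal links (bonds, joints, semantics)

# A clock decomposes into a frame and hands; the frame is round OR square.
square = Node("square frame", "TERMINAL")
round_ = Node("round frame", "TERMINAL")
frame  = Node("frame", "OR", [square, round_])     # a "switch" between alternatives
hands  = Node("hands", "TERMINAL")
clock  = Node("clock", "AND", [frame, hands])      # object-into-parts decomposition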


An And-Or tree is an AoG without horizontal connections (relations). An And-node represents the decomposition of an entity into its parts. There are two types of decomposition: (i) an object into its parts, and (ii) a scene into its objects. The object-into-parts decomposition has a fixed number of child nodes, which correspond to grammar rules; by sampling from such a model one can create a vast number of unique scene configurations. The Or-nodes act as "switches" between alternative substructures, and stand for labels of classification at various levels, such as scene categories, object classes and parts. They correspond to production rules. Due to this recursive definition, one may merge the AoGs for many objects or scene categories into a larger graph; in theory, all scene and object categories can be represented by one huge AoG, as is the case for natural language. The nodes in an AoG may share common parts. For example, both cars and trucks have rubber wheels as parts, and both clocks and pictures have frames.

Relations represent the horizontal links for contextual information between the children of an And-node, at all levels of the hierarchy. Each link may represent one or several relations. There are three types of relations of increasing abstraction for the horizontal links and context. The first type is the bond type, which connects image primitives into bigger and bigger graphs. The second type comprises the various joints and grouping rules for organizing parts and objects in a planar layout. The third type covers the functional and semantic relations between objects in a scene.

Relations type 1: bonds (relations between image primitives). An image primitive has a number of open bonds, shown as half disks, to connect with others and form bigger image patterns. Two bonds are said to be connected if they are aligned in position and orientation.

Relations type 2: joints and junctions (relations between object parts). When image primitives are connected into larger parts, further spatial and functional relations must be found. Besides its open bonds to connect with others, usually its immediate neighbours, a part may be bound to other parts in various ways. Parts can be linked over large distances by collinear, parallel or symmetric relations, a phenomenon sometimes called gestalt grouping.

Relations type 3: interactions and semantics (relations between objects). When letters are grouped into words, semantic meanings emerge. When parts are grouped into objects, semantic relations are created for their interactions. Very often these relations are directed. For example, the occluding relation is a viewpoint-dependent binary relation between objects or surfaces, and it is important for figure-ground segregation. A supporting relation is a viewpoint-independent relation. There are other functional relations among objects in a scene, for example a person carrying a backpack, or a person eating an apple. These directed relations are usually partially ordered.

A configuration is a planar attributed graph formed by linking the open bonds of the primitives in the image plane. A configuration inherits the relations from its ancestor nodes, and can be viewed as a Markov network with a reconfigurable neighbourhood. A mixed random field model is used to represent configurations; it extends conventional Markov random field models by allowing address variables, which lets it handle the non-local connections caused by occlusions. In this generative model, a configuration corresponds to a "primal sketch" graph.

The language of a grammar is the set of all possible valid configurations produced by the grammar. In a stochastic grammar, each configuration is associated with a probability. As the AoG is directed and recursive, the sub-graph underneath any node A can be considered a sub-grammar for the concept represented by node A. Thus the sub-language of the node is the set of all valid configurations produced by the AoG rooted at A. For example, if A is an object category, say a car, then this sub-language defines all the valid configurations of a car. In the extreme case, the sub-language of a terminal node contains only atomic configurations and is thus called a dictionary.

2.4 PARSE GRAPH

A parse graph is augmented from a parse tree, as used mostly for natural or programming languages, by adding a number of relations, shown as side links, among the nodes. A parse graph is derived from the AoG by selecting the switches, or classification labels, at the relevant Or-nodes. The part shared by two nodes may have different instances; for example, a node I may be a child of both nodes C and D.
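Reusing the Node class sketched above, the derivation just described can be illustrated in a few lines: a parse graph is obtained from the AoG by fixing a choice at every Or-node. Here the choice function simply takes the first alternative; a stochastic parser would choose according to the grammar's probabilities. This is a sketch of the idea, not the actual I2T implementation.

def derive_parse_graph(node, choose=lambda alternatives: alternatives[0]):
    """Fix a switch at every Or-node to turn the AoG into a parse (sub)graph."""
    if node.kind == "TERMINAL":
        return (node.label, [])
    if node.kind == "OR":
        # an Or-node contributes only its chosen alternative
        return derive_parse_graph(choose(node.children), choose)
    # an And-node keeps every child
    return (node.label, [derive_parse_graph(c, choose) for c in node.children])

print(derive_parse_graph(clock))  # ('clock', [('square frame', []), ('hands', [])])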
2.5 AUTOMATIC IMAGE PARSING

2.5.1 Bottom-up/top-down inference with AoG

We extend the previous algorithm to work on an arbitrary node A in an AoG, defining three inference processes for each node A. The first process handles situations in which node A is at middle resolution without occlusion: node A can be detected directly (based on its compact image data) and alone (without taking advantage of surrounding context), while its children or parts are not recognizable alone in cropped patches. Most of the sliding-window detection methods in the computer vision literature belong to this process. It can be either bottom-up or top-down, in terms of whether discriminative models, such as the AdaBoost method, or generative models, such as the active basis model, are used. When node A is at high resolution, it is more likely to be occluded in the scene, and node A itself is then not detectable by this process due to occlusion; a subset of node A's children nodes can, however, still be detected, and node A is inferred from these detected parts.

2.5.2 Cluster sampling

Aside from the bottom-up/top-down detection of a single object, another important issue pertinent to automatic image parsing is how to coordinate the detection of multiple objects in one image. It is important to have an algorithm that can optimally pick the most coherent set of objects. Previous methods commonly used a greedy algorithm, which first assigned a weight to each of the currently unselected candidate objects based on how well it maximized the posterior. The object with the highest weight was selected and added to the running parse of the scene. The remaining objects were then re-weighted according to how much they would improve the overall explanation of the scene, and this process iterated until no objects above a certain weight remained. The problem with this approach is that it is greedy and cannot backtrack from a poor decision. We would like the algorithm to be able to backtrack from such mistakes, so an algorithm called Clustering via Cooperative and Competitive Constraints (C4) is used to deal with these problems.
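The greedy baseline criticised above is easy to state in code. The following minimal sketch (with a hypothetical gain function standing in for the posterior improvement) shows why it cannot backtrack: once an object is committed to the parse, it is never reconsidered. C4 replaces this loop with cluster sampling over coupled candidates.

def greedy_parse(candidates, gain, threshold):
    """candidates: detected objects; gain(obj, selected): posterior improvement."""
    selected = []
    while candidates:
        # re-weight each remaining candidate given what is already selected
        weighted = [(gain(obj, selected), obj) for obj in candidates]
        best_gain, best = max(weighted, key=lambda pair: pair[0])
        if best_gain < threshold:
            break                    # nothing left that explains the scene better
        selected.append(best)        # committed: a poor choice cannot be undone
        candidates.remove(best)
    return selected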


2.6 TEXT GENERATION

While OWL provides an unambiguous representation of image and video content, it is not easy for humans to read. Natural language text remains the best way to describe image and video content to humans, and can be used for image captions, scene descriptions and event alerts. Natural language generation (NLG) is an important sub-field of natural language processing. NLG technology is already widely used in Internet applications such as weather reporting and giving driving directions. A commonly used NLG approach is template filling, but it is inflexible and inadequate for describing images. An image NLG system should be able to consume OWL data, select relevant content, and generate text describing the objects in images, their properties, events and the relationships between objects. The text generation process is usually designed as a pipeline of two distinct tasks: text planning and text realization. The text planner selects the content to be expressed and decides how to organize it into sections, paragraphs and sentences. Based on this formation, the text realizer generates each sentence using the correct grammatical structure.

2.6.1 Text planner

The text planner module translates the semantic representation into a sentence-level representation that can readily be used by the text realizer to generate text. This intermediate step is useful because it converts a representation that is semantic and ontology-based into one that is based more on functional structure. The output of the text planner is based on a functional description (FD), which has a feature-value pair structure commonly used in text generation input schemes. For each sentence, the functional description specifies the details of the text that is to be generated, such as the process (or event), actor, agent, time, location, and other predicates or functional properties. The text planner module also organizes the layout of the text report document. The planning of the document structure is strongly dependent on the intended application. For example, a video surveillance report may contain separate sections describing the scene context, a summary of the objects that appeared in the scene, and a detailed list of detected events. Other applications, such as an email report or an instant alert, would warrant different document structures, but the underlying sentence representation using functional descriptions remains the same.
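As an illustration, a functional description for one sentence might look like the following feature-value structure. The feature names are illustrative, not the exact schema of the original I2T system; the event itself (a person carrying a backpack) is borrowed from the relation examples given earlier.

fd = {
    "process":  "carry",                   # the event or process
    "actor":    {"category": "person"},    # who performs it
    "object":   {"category": "backpack"},  # what it is performed on
    "location": {"scene": "street"},       # hypothetical scene context
}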


2.6.2 Text realizer

From the functional description, the text realizer generates each sentence independently, using a simplified head-driven phrase structure grammar (HPSG). The HPSG consists of a set of production rules that transform the functional description into a structured representation of grammatical categories. The FD is first transformed into a part-of-speech (POS) tree, into which additional syntactic terms (such as "between" and "and") are inserted. The POS tree is then linearized to form a sentence. Notice that the FD and the POS tree share a similar hierarchical structure, but there are notable differences: in the FD, children nodes are unordered, while in the POS tree the ordering of children nodes is important and additional syntactic nodes are inserted. A unification process matches the input features against the grammar recursively, and the derived lexical tree is then linearized to form the sentence output. While general-purpose text realization is still an active research area, current NLG technology is sufficiently capable of expressing video content: the lexicon of visual objects and relationships between objects is relatively small, and textual descriptions of visual events are mostly indicative or declarative sentences, which simplifies the grammar structure of the resulting text significantly.
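A toy realizer over the FD sketched earlier shows this last step: order the constituents (the ordering that the FD leaves unspecified) and linearize them into a declarative sentence. A real HPSG-based realizer would do this through unification with production rules; this template merely illustrates the input/output behaviour.

def realize(fd):
    """Linearize a functional description into one declarative sentence."""
    subject = f"a {fd['actor']['category']}"
    verb    = f"is {fd['process']}ing"     # naive inflection, adequate for "carry"
    obj     = f"a {fd['object']['category']}"
    place   = f"in the {fd['location']['scene']}"
    # unlike in the FD, the ordering of these children now matters
    return " ".join([subject.capitalize(), verb, obj, place]) + "."

print(realize(fd))  # -> "A person is carrying a backpack in the street."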



3. DESIGN AND IMPLEMENTATION

At an extreme level of abstraction, there are five primary use cases, viz. know notifications, know shared items, know friends online, update status, and begin voice chat. Note that the use case "begin voice chat" has an "includes" dependency on the use case "know friends online". The interaction between the user and the social networking site takes place via an interface that consists of four component controllers, namely the speech-to-text controller, the text-to-speech controller, the image-to-text controller and the key logger. Again, it is to be noted that every interaction between the user and the site takes place through one of the aforementioned component controllers. Here we are not trying to depict the actual flow of events in the scenarios that the corresponding use cases abstract; rather, we project the kind of interaction that would take place among the various controllers, the user and the social networking site itself.

[Use case diagram: the user interacts with the system through the interface (speech-to-text converter), which exposes the use cases know notifications, know shared items, update status, know friends online and begin voice chat.]

KNOW NOTIFICATIONS

Any scenario that this use case abstracts begins with an UP key press by the user. The social networking site understands that this input is for viewing any new notifications available, and returns the result set containing all the notifications (which could be text or images) to the respective interface controller, which in turn delivers the result to the requester vocally.

KNOW SHARED ITEMS

Any scenario that this use case abstracts begins with a RIGHT key press by the user. The social networking site understands that this input is for viewing newly shared posts. The site returns a list of friends who have unviewed posts shared with the user. The user chooses friends from that list vocally and retrieves all kinds of posts shared with him/her. It is to be noted that all communication between the user and the social networking site takes place through the various interface component controllers.

UPDATE STATUS

Any scenario that this use case abstracts begins with a DOWN key press by the user. The social networking site understands that this input is for updating his/her status vocally, waits for the status, and then confirms it vocally. It uses the speech-to-text and text-to-speech controllers for doing the same, respectively.
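The arrow-key bindings running through these use cases (including the LEFT key used by the two use cases that follow) can be summarized in one dispatch routine. This is a minimal sketch: the site object and its per-use-case methods, and the tts controller, are hypothetical names for the components described above.

KEY_TO_USE_CASE = {
    "UP":    "know_notifications",
    "RIGHT": "know_shared_items",
    "DOWN":  "update_status",
    "LEFT":  "know_friends_online",  # also the entry point for begin voice chat
}

def on_key_press(key, site, tts):
    use_case = KEY_TO_USE_CASE.get(key)
    if use_case is None:
        return                          # unmapped key: ignore
    result = getattr(site, use_case)()  # the site answers textually...
    tts.speak(result)                   # ...and the controller delivers it orally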

KNOW FRIENDS ONLINE

Any scenario that this use case abstracts begins with a LEFT key press by the user. The use case is itself self-explanatory: it returns, vocally, the list of friends who are currently logged on. It is obvious that this use case is included by the "begin voice chat" use case.

BEGIN VOICE CHAT

Any scenario that this use case abstracts begins with a LEFT key press by the user. This use case includes the "know friends online" use case, and is used to begin a voice chat session with a friend, as requested by the user vocally.

[Sequence diagram: begin voice chat. Messages pass between the user, the speech-to-text controller, the text-to-speech controller and the social networking site: 1: pressLeftKey(); 2: returnOnlineFriendListTextually(); 3: returnOnlineFriendListOrally(); 4: chooseFriendFromListOrally(); 5: chooseFriendFromListTextually(); 6: chatConnectConfirmationTextually(); 7: chatConnectionConfirmationOrally(); 8: pressLeftKey().]

CHALLENGES FACED

The first and foremost challenge we faced during the implementation phase was that of authorization and authentication. The login system is complex to implement, as it demands a high level of security features. We thought of implementing it with a special Braille keyboard for the password alone, but that demands the installation of special hardware for this purpose alone, which would be of no use later. Using fingerprint analysis likewise demands special hardware, which made us reject that method too. Finally, we used a plain method: the user name is provided vocally and converted into text using the speech-to-text converter, while the password is provided by means of the track pad, which is default hardware on all modern laptops today. So we ask the impaired user to provide his/her unique password as patterns recognizable by the track pad. This provides security to some level, rather than using a vocal password.

The next challenge we faced was what happens when the visually impaired user gets the vocal information that two or more of his friends with the same user name are online, and the user wants to have a voice chat with one particular friend. This same-name conflict is resolved in our system by using the "ABOUT ME" field as a differentiator, so that whenever the interface finds friends with the same name, it gives the impaired user a clear indication of who is who using that field, and the user can then proceed with the intended person alone.
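The same-name resolution just described amounts to qualifying ambiguous names with the "ABOUT ME" field before they are spoken. A minimal sketch, assuming each friend record carries name and about_me fields (illustrative keys) and the hypothetical tts controller used earlier:

def announce_online_friends(friends, tts):
    """Vocalize the online-friends list, disambiguating duplicate names."""
    names = [f["name"] for f in friends]
    for friend in friends:
        if names.count(friend["name"]) > 1:
            # ambiguous display name: read the ABOUT ME field as a differentiator
            tts.speak(f"{friend['name']}, {friend['about_me']}, is online")
        else:
            tts.speak(f"{friend['name']} is online")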



4. CONCLUSION

Although there are many software applications that aid in conversion between text and speech and vice versa, the application software named JAWS was the one that motivated us to develop this kind of web application for the visually impaired, so we ought to mention this base in our reference section. The I2T (image-to-text parser) has also played a vital role in our project, by giving us the opportunity to present pictorial data vocally to the users. As every project has its own pros and cons, our future aim is to remove the various cons of this system. As of now, the major problem posed by this environment is authentication; we are aiming at a still better method to implement the same in the near future. The other major drawback of our system is that impaired users are not yet given the feeling that they too are prime users: they are only able to present their updates in vocal formats, and are not able to share pictures from their own domain, which is too tedious for them and would definitely need the help of someone else.

We aim at rectifying this defect by allowing the impaired user to give his updates even in pictorial formats, by taking our controller interfaces (speech-to-text engine) to the operating system level, so that users could browse images from their own system, or even on other sites, and upload them. This would definitely make the impaired users feel that they too are primary users and not just asynchronous receivers. But since every such process of globalization demands a high level of security, we ought to consider the authorization and authentication processes keenly, to make this social network phisher-proof.

5. ACKNOWLEDGEMENT

We express our deep gratitude to our guide, Dr. T.V. Geetha, College of Engineering Guindy, Anna University Chennai, for guiding us through every phase of this project. We appreciate her thoroughness, tolerance and ability to share her knowledge with us. We thank her for being easily approachable and quite thoughtful. Apart from adding her own input, she has encouraged us to think on our own and give form to our thoughts. We owe her for harnessing our potential and bringing out the best in us. Without her immense support through every step of the way, we could never have taken the project to this extent.

REFERENCES

[1] Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi. Collective generation of natural image descriptions. In ACL, 2012.
[2] Ahmet Aker and Robert Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[3] Deb K. Roy. Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language, 2002.
[4] Xuedong Huang, Alex Acero, Jim Adcock, Hsiao-Wuen Hon, John Goldsmith, Jingsong Liu and Mike Plumpe. Whistler: a trainable text-to-speech system.
[5] Allen J., Hunnicutt S., and Klatt D. From Text to Speech: The MITalk System. MIT Press, Cambridge, Massachusetts, 1987.
[6] Klatt D. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737-793, 1987.

Authors' Profile

S. Sowmya is working as a Software Development Engineer at ACS, Oracle India. She received her B.E. degree in Computer Science and Engineering from College of Engineering Guindy, Anna University Chennai, in 2013. Her research interests include Natural Language Processing and Computer Networks.

N. Vignesh is working as a Research Associate at the Indian Institute of Management, Ahmedabad. He received his B.E. degree in Computer Science and Engineering from College of Engineering Guindy, Anna University Chennai, in 2013. His research interests include Natural Language Processing and Computer Networks.
