
International Journal of Pure and Applied Mathematics, Volume 116, No. 21, 2017, 719-727.
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version).
URL: http://www.ijpam.eu (Special Issue)

Information Extraction Using Semantic Similarity Features in Natural Language Processing

1S. Jayalakshmi and 2Ananthi Sheshasaayee
1Periyar University, Vels University, India. [email protected]
2PG & Research Dept. of Computer Science, Queen Marys College for Women, Chennai, India. [email protected]

Abstract: Question answering systems have achieved significant progress, and question classification is an essential part of them. Retrieving the most relevant information from the web remains a major challenge: the system must interpret difficult questions posted by users and furnish the appropriate answer, together with the significant supporting information, in adequate sentences. Analysing a natural language question means assigning it a semantic category that represents the type of answer required, which leads to an accurate question answering system. Methods: The proposed approach focuses on predicting the original intention of a question and providing the candidate answer with adequate, significant information drawn from both a web corpus and an ontology. It incorporates three modules: generating relevant documents, ranking the documents, and predicting the precise answer. Findings: The approach identifies an appropriate, concise candidate answer automatically. Applications: Improved system accuracy and better results for the user with less effort and time.


1. Introduction

Web searching is growing rapidly, and users demand more sophisticated search tools capable of providing highly relevant information with ease [1]; the web has become a global and easily accessible repository of textual information. The web search engine has been exploited as the most significant tool on the internet for extracting information from the web, enabling users to retrieve relevant information through an internet search engine [2]. Hence, information management systems aim to create a comfortable searching environment for users amid the flood of online information; information retrieval [3] is the process of converting users' information needs into a list of documents relevant to those needs through a web search engine.

Question Answering System

Question Answering (QA) [4] is a service that satisfies users' needs by providing adequate sentences as the answer to a specific natural language question, instead of providing a set of relevant web documents. The QA system has received considerable attention due to the increasing amount of web content and the high demand for digital information. IR engines are designed to retrieve only documents, not the specific information that answers a question from the abundance of relevant information [5]. If a user requires specific information, the user has to examine the retrieved documents manually to find an accurate answer. The QA system addresses this problem with the help of NLP methods and the IR system.

The QA system jointly applies IR and NLP techniques to access online information flexibly. An automated QA system is related to an IR system in that it offers the desired answers to the queries submitted by users, but the QA system differs in providing the information needs as direct answers.

Significance of Semantic Similarity in QA

Semantics is the most complex and essential aspect of natural language. It is involved in three processes of a QA system: identifying the question, recognizing the topic and retrieving the relevant answer. Semantic analysis is based on different semantic resources, such as ontology classes, WordNet and FrameNet, which are used to understand the input query. QA also benefits from semantic role labelling (SRL), which improves the accuracy of the returned answers.

Semantic search finds ways to improve the accuracy of search by using the conceptual knowledge hidden on the internet. It does not assign ranks merely to predict relevancy; it uses hidden meaning to produce its outputs.

Problem Statement and Scope of the Work

Conventional QA systems still confront the answer generation problem for questions in which the WH-operator is missing. Improper questions that do not begin with 'W' or 'H' are likely to mislead syntactic question processing. Conventional QA systems apply semantic analysis to natural language questions throughout the QA stages, yet they still fall short of identifying the accurate answer for every question. To overcome this problem, answer extraction based on lexical and syntactic features, reinforced by the semantic relations of the arguments, is necessary for a QA system.

The main scope of the QA system is to improve the ease with which users understand information, together with the benefits of the IR system. In web searching, users need to retrieve the relevant answer quickly from the web search engine, even when their questions are lexically incorrect or improperly formed. Thus the QA system provides the accurate result as the answer at the top of the result page, rather than retrieving a set of documents containing the answer.

The QA technique enables the system to answer both proper and improper natural language questions: it provides the appropriate answer precisely by constructing a proper question from an awkward one using syntactic, semantic and pragmatic examination techniques. Syntactic analysis arranges the words in an appropriate way. Semantic analysis provides the meaning of the question by applying semantic features. The semantic features are: synonyms, which carry a related meaning of the word; hypernyms, which denote a more general term (e.g., flowers); and hyponyms, which denote subdivisions of a more general term (e.g., flowers: jasmine, rose, lotus). Pragmatic analysis identifies content about the context. All of the above techniques are used to identify the most relevant answer to the posted query.
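As a minimal sketch of how these semantic features could expand a question term, the following Java fragment uses a small in-memory lexicon in place of WordNet; the class, the lexicon entries and the method names are assumptions made for illustration, not part of the published system.

import java.util.*;

// Illustrative semantic expansion of a query term using the three
// relations described above (synonyms, hypernyms, hyponyms).
public class SemanticExpansion {

    // Toy lexicon standing in for a WordNet lookup.
    static final Map<String, Map<String, List<String>>> LEXICON = new LinkedHashMap<>();
    static {
        Map<String, List<String>> flower = new LinkedHashMap<>();
        flower.put("synonym", Arrays.asList("bloom", "blossom"));
        flower.put("hypernym", Arrays.asList("plant"));                    // more general term
        flower.put("hyponym", Arrays.asList("jasmine", "rose", "lotus"));  // more specific terms
        LEXICON.put("flower", flower);
    }

    // Collect all related terms of a word; unknown words expand to themselves only.
    static Set<String> expand(String term) {
        Set<String> expanded = new LinkedHashSet<>();
        expanded.add(term);
        Map<String, List<String>> relations = LEXICON.get(term.toLowerCase(Locale.ROOT));
        if (relations != null) {
            relations.values().forEach(expanded::addAll);
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Prints: [flower, bloom, blossom, plant, jasmine, rose, lotus]
        System.out.println(expand("flower"));
    }
}

An expanded term set of this kind can then be matched against candidate answer sentences, so that a sentence mentioning "rose" can still satisfy a question about flowers.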

2. An Overview of SSSR-QAS

SSSR-QAS (Semantic and Syntactic Structure Representation based Question Answering System) is an automated question answering system based on lexical, syntactic and semantic measures. It comprises three stages, namely Question Processing, Document Processing and Answer Validation, as shown in Fig. 1.
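The three stages can be pictured as the following skeleton, in which each stage consumes the output of the previous one; all class and method names here are illustrative placeholders rather than the published implementation.

import java.util.List;

// Skeleton of the three SSSR-QAS stages (illustrative only).
public class SSSRQASPipeline {

    String processQuestion(String rawQuestion) {
        // classify the question (coarse/fine class) and reformulate it
        // into an overt WH form if necessary
        return rawQuestion.trim();
    }

    List<String> processDocuments(String question, List<String> retrieved) {
        // filter retrieved documents by title and snippet, generate
        // POS-based patterns and order sentences by answer weight
        return retrieved;
    }

    String validateAnswer(String question, List<String> rankedSentences) {
        // validate the main verb / named-entity type and re-rank the
        // candidate sentences, returning the top answer
        return rankedSentences.isEmpty() ? "" : rankedSentences.get(0);
    }

    String answer(String rawQuestion, List<String> retrieved) {
        String q = processQuestion(rawQuestion);
        List<String> candidates = processDocuments(q, retrieved);
        return validateAnswer(q, candidates);
    }
}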

Question Processing

Question Processing is the process of identifying the correct format of the question type. It contains two major modules, question classification and question reformulation. Question classification [6] is mainly used to identify the question type, i.e. a WH-overt or WH-covert question, and to generate main and sub classes: coarse classes such as ABBR, DESC, ENTY and LOC, and fine classes such as LOC:country and LOC:city. The linear order of the arguments technique is applied for question extraction.
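The following Java fragment is a heuristic stand-in for the classifier, intended only to illustrate what the coarse labels look like; the published system classifies questions with an SVM trained on the TREC question set, and any rules or class names beyond those listed above are assumptions.

import java.util.Locale;

// Heuristic stand-in for the SVM question classifier: maps a question
// to a coarse class such as those mentioned above.
public class CoarseQuestionClassifier {

    static String classify(String question) {
        String q = question.toLowerCase(Locale.ROOT);
        if (q.startsWith("where"))                                         return "LOC";   // e.g. LOC:city, LOC:country
        if (q.startsWith("who"))                                           return "HUM";
        if (q.startsWith("when") || q.matches(".*how (many|much|long).*")) return "NUM";
        if (q.startsWith("what does") && q.contains("stand for"))          return "ABBR";
        if (q.startsWith("what is") || q.startsWith("define"))             return "DESC";
        return "ENTY";   // default coarse class for entity-type questions
    }

    public static void main(String[] args) {
        System.out.println(classify("Where is the Taj Mahal?"));     // LOC
        System.out.println(classify("What does NASA stand for?"));   // ABBR
    }
}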

Document Processing

Document Processing matches [7] the appropriate question terms and removes irrelevant content, providing correct sentence formation for the answer retrieval process. The answers [8] are filtered on the basis of the title and snippet in order to retrieve the most relevant answer types. Pattern generation [9] and answer-weight based pattern ordering are used to arrange the content in weighted order, from the most relevant to the least relevant text.
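The fragment below illustrates title/snippet filtering and weight-based ordering in the spirit of this step; the Result class, the term-overlap score and all names are assumptions made for this sketch, not the published implementation.

import java.util.*;
import java.util.stream.Collectors;

// Illustrative title/snippet filtering and weight-based ordering of retrieved results.
public class DocumentFilter {

    static class Result {
        final String title, snippet;
        Result(String title, String snippet) { this.title = title; this.snippet = snippet; }
    }

    // Count how many question terms occur in a piece of text.
    static long overlap(Set<String> questionTerms, String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return questionTerms.stream().filter(lower::contains).count();
    }

    // Keep results whose title or snippet shares at least one question term,
    // ordered from most to least overlapping (a stand-in for the answer weight).
    static List<Result> filterAndRank(String question, List<Result> retrieved) {
        Set<String> terms = new HashSet<>(Arrays.asList(question.toLowerCase(Locale.ROOT).split("\\W+")));
        return retrieved.stream()
                .filter(r -> overlap(terms, r.title) + overlap(terms, r.snippet) > 0)
                .sorted(Comparator.comparingLong(
                        (Result r) -> overlap(terms, r.title) + overlap(terms, r.snippet)).reversed())
                .collect(Collectors.toList());
    }
}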

Answer Processing

Answer Processing identifies the candidate answer sentences [10] and validates [11] the main verb in order to display the correct answer in the ranked list. It re-ranks the content [12] according to the posted query and assigns a score to each pattern according to the semantic relation between the question and the answer type.
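As a toy illustration of type validation and re-ranking, the fragment below keeps only candidate sentences compatible with the expected answer type and then orders them by overlap with the question; the type patterns and the scoring are assumptions made for this sketch.

import java.util.*;
import java.util.regex.Pattern;

// Toy answer validation and re-ranking of candidate sentences.
public class AnswerValidator {

    static final Map<String, Pattern> TYPE_PATTERNS = new HashMap<>();
    static {
        TYPE_PATTERNS.put("NUM", Pattern.compile("\\b\\d+(\\.\\d+)?\\b"));   // numeric answers
        TYPE_PATTERNS.put("LOC", Pattern.compile("\\b[A-Z][a-z]+\\b"));      // crude proper-noun check
    }

    // A sentence validates if it matches the pattern of the expected answer type.
    static boolean validates(String expectedType, String sentence) {
        Pattern p = TYPE_PATTERNS.get(expectedType);
        return p == null || p.matcher(sentence).find();
    }

    // Drop type-incompatible sentences, then rank by question-term overlap.
    static List<String> rerank(String question, String expectedType, List<String> candidates) {
        Set<String> qTerms = new HashSet<>(Arrays.asList(question.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> valid = new ArrayList<>();
        for (String s : candidates) {
            if (validates(expectedType, s)) valid.add(s);
        }
        valid.sort(Comparator.comparingLong(
                (String s) -> Arrays.stream(s.toLowerCase(Locale.ROOT).split("\\W+"))
                        .filter(qTerms::contains).count()).reversed());
        return valid;
    }
}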

Fig. 1: SSSR-QAS approach (architecture diagram)

SSSR-QAS Algorithm
Input: posted query Q
Output: correct answer

For all posted queries Q do
    For all QA samples do
        Construct the training corpus (TC) for QA
    // Phase 1: Question Processing
    While (checking the user query) do
        If (Q is of overt type) then
            Identify whether Q is a WH or non-WH question
            Identify the sub and main class for Q from the TC
        End if
        If (Q is of covert type) then
            Convert Q into an overt-type question (SVO structure)
        End if
        Q(NLP) <- preprocessing of Q
        If (Q is of overt type) then
            If (the question type maps to an answer type) then
                Classify the question using the SVM machine-learning algorithm
            End if
        End if
        For all Q of all answer types do
            Extract the answer using the linear order of words in the syntactic representation
        End for
    End while
    // Phase 2: Document Processing
    For all documents D retrieved from the web search engine do
        If (title(D) matches the user query) then
            Add D to the document list
        Else
            Add D to the removed list
        End if
        If (snippet(D) matches the user query) then
            Add the snippet S to the snippet list
        Else
            Add S to the removed list
        End if
        If (S is to be pattern-ranked) then
            Generate the pattern using POS tags
        End if
    End for
    // Phase 3: Answer Validation
    For all ranked answer sentences do
        Select the highest-ranked sentence with a matching pattern S(D)
        For all relevant sentences S do
            If (S is a relevant answer) then
                Validate the named-entity type based on TC(QA)
            Else if (pattern(S) = pattern(Q)) then
                Assign the top rank to the answer for Q
            End if
        End for
    End for
End for

Algorithm 1: SSSR-QAS

3. Performance Analysis for Question Answering System

The SSSR-QAS approach is implemented on the Java platform with the Java Expert System Shell (JESS) rule engine. The Docjax search engine and an IR engine are used to retrieve semantic relationships, and the Java API for WordNet Searching (JAWS) provides the interface for retrieving content from the WordNet database. Non-WH (covert) questions are the most complex question type; during preprocessing, the SSSR-QAS approach converts a non-WH question into a WH-overt question, using the Porter stemmer and the Stanford parser to extract the question pattern.
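A rough illustration of this covert-to-overt rewriting is sketched below; the actual system relies on the Porter stemmer and the Stanford parser, so the heuristic rules here are only assumed stand-ins for the intent of the preprocessing step.

import java.util.Locale;

// Illustrative reformulation of a covert (non-WH) question into an overt WH question.
public class CovertToOvert {

    static String toOvert(String question) {
        String q = question.trim().replaceAll("[.?!]+$", "");
        String lower = q.toLowerCase(Locale.ROOT);
        if (lower.matches("^(what|who|when|where|why|which|how)\\b.*")) {
            return q + "?";                                          // already an overt WH question
        }
        if (lower.startsWith("name ") || lower.startsWith("give ")) {
            return "What is " + q.substring(q.indexOf(' ') + 1) + "?";
        }
        return "What is " + q + "?";                                 // default rewrite for keyword queries
    }

    public static void main(String[] args) {
        System.out.println(toOvert("Name the capital of France"));  // What is the capital of France?
        System.out.println(toOvert("Where is the Taj Mahal?"));     // Where is the Taj Mahal?
    }
}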


Dataset

The SSSR-QAS approach collects questions from the TREC-8, TREC-9 and TREC-10 datasets, which together contain 5,952 questions; of these, 5,452 questions are used as the training set and 500 questions as the test set [13]. The main aim of this evaluation is to analyse the accuracy of the SSSR-QAS system and compare it with the existing Improving Question Answering System (IQAS) approach [14].

Results and Analysis

Precision

The number of all retrieved answers = true positives TP (correct and relevant) + false positives FP (retrieved but not relevant).

Precision = TP / (TP + FP)

Table 1: Precision Ratio

No. of trained questions: 1000 to 5426 (measured at 2726)
Correct answers retrieved:  SSSR-QAS 95%,  IQAS 90.8%

Fig. 2: Precision (%) vs. number of questions (in thousands) for SSSR-QAS and IQAS

Recall

The number of all correct answers = true positives TP (correct and relevant) + false negatives FN (relevant but not retrieved).

Recall = TP / (TP + FN)

Table 2: Recall Ratio

No. of trained questions    Level 1    Level 2    Level 3    Level 4
SSSR-QAS                    9.3        9.0        8.9        8.7
IQAS                        9.0        8.8        8.5        8.4

Fig. 3: Recall (%) vs. number of trained questions (levels 1-4) for SSSR-QAS and IQAS


F-Measure

F-measure is the weighted harmonic mean of precision and recall. In terms of overall accuracy, SSSR-QAS predicts answers more accurately than the IQAS approach, improving by 9% at a complex-question factor of 6.8%.
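The three metrics can be computed as in the small worked example below; the TP/FP/FN counts are assumed values for illustration, not figures taken from the evaluation above.

// Worked example of precision, recall and F-measure.
public class EvaluationMetrics {

    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }

    // F-measure as the evenly weighted harmonic mean of precision and recall.
    static double fMeasure(double p, double r) { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        int tp = 95, fp = 5, fn = 9;                      // assumed counts for illustration
        double p = precision(tp, fp);                     // 0.950
        double r = recall(tp, fn);                        // about 0.913
        System.out.printf("P=%.3f R=%.3f F=%.3f%n", p, r, fMeasure(p, r));
    }
}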

4. Conclusion

This study has presented the SSSR-QAS approach for tackling answer ambiguity and answer-selection complexity in QA systems. The goal of this approach is accomplished by exploiting web and semantic knowledge together across three phases. The approach provides precise answers and disambiguates the candidate answer sentences, which are ranked according to the posted query.

5. Future Work

Future directions include semantic resolution and recognizing different questions about the same answer. QA systems need to bridge the gap between syntactically and semantically different natural language questions and answer-bearing texts. Questions and answers are semantically or syntactically interrelated with each other; hence, a deep understanding of questions and answers is crucial in QA systems, especially for descriptive questions.

References

[1] Etzioni O., Search needs a shake-up, Nature 476(7358) (2011), 25-26.
[2] Kolomiyets O., Moens M.F., A survey on question answering technology from an information retrieval perspective, Information Sciences 181(24) (2011), 5412-5434.
[3] Singh V., Dwivedi S.K., Question Answering: A Survey of Research, Techniques and Issues, Procedia Technology 10 (2013), 417-424.
[4] Zhang D., Lee W.S., Question Classification using Support Vector Machines, Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2003), 26-32.


[5] Li X., Roth D., Learning Question Classifiers: The Role of Semantic Information, Natural Language Engineering 12(3) (2006), 229-249.
[6] Quarteroni S., Moschitti A., Manandhar S., Basili R., Advanced Structural Representations for Question Classification and Answer Re-ranking, Proceedings of the European Conference on Information Retrieval, Springer-Verlag (2007), 234-245.
[7] Roberts I., Gaizauskas R., Evaluating passage retrieval approaches for question answering, Advances in Information Retrieval, Springer (2004), 72-84.
[8] Cui H., Sun R., Li K., Kan M.Y., Chua T.S., Question answering passage retrieval using dependency relations, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2005), 400-407.
[9] Agirre E., Ansa O., Arregi X., De Lacalle M.L., Otegi A., Saralegi X., Zaragoza H., Elhuyar-IXA: Semantic relatedness and cross-lingual passage retrieval, Workshop of the Cross-Language Evaluation Forum for European Languages (2009), 273-280.
[10] Gabrilovich E., Markovitch S., Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of the 20th International Joint Conference on Artificial Intelligence (2007), 6-12.
[11] Gomez-Adorno H., Pinto D., Ayala D.V., Semantic Answer Validation in Question Answering Systems for Reading Comprehension Tests, Pattern Recognition, Lecture Notes in Computer Science (2013).
[12] Gunawardena T., Lokuhetti M., Pathirana N., Ragel R., Deegalla S., An automatic answering system with template matching for natural language questions, 5th IEEE International Conference on Information and Automation for Sustainability (2010), 353-358.
[13] Saxena A.K., Sambhu G.V., Kaushik S., Subramaniam L.V., IITD-IBMIRL System for Question Answering Using Pattern Matching, Semantic Type and Semantic Category Recognition, Proceedings of TREC (2007).
[14] Cui H., Kan M.Y., Chua T.S., Soft pattern matching models for definitional question answering, ACM Transactions on Information Systems 25(2) (2007).
