MATEC Web of Conferences 164, 01047 (2018) https://doi.org/10.1051/matecconf/201816401047 ICESTI 2017
Spelling Correction for Text Documents in Bahasa Indonesia Using Finite State Automata and Levinshtein Distance Method
Viny Christanti Mawardi*, Niko Susanto, and Dali Santun Naga
Faculty of Information Technology, Tarumanagara University, Letjend S. Parman No.1, West Jakarta 11440, Indonesia
Abstract. Any mistake in writing of a document will cause the information to be told falsely. These days, most of the document is written with a computer. For that reason, spelling correction is needed to solve any writing mistakes. This design process discuss about the making of spelling correction for document text in Indonesian language with document's text as its input and a .txt file as its output. For the realization, 5 000 news articles have been used as training data. Methods used includes Finite State Automata (FSA), Levenshtein distance, and N-gram. The results of this designing process are shown by perplexity evaluation, correction hit rate and false positive rate. Perplexity with the smallest value is a unigram with value 1.14. On the other hand, the highest percentage of correction hit rate is bigram and trigram with value 71.20 %, but bigram is superior in processing time average which is 01:21.23 min. The false positive rate of unigram, bigram, and trigram has the same percentage which is 4.15 %. Due to the disadvantages at using FSA method, modification is done and produce bigram's correction hit rate as high as 85.44 %.
Key words: Finite state automata, Levenshtein distance, N-gram, spelling correction
1 Introduction Language is one of the most important component in human life, which could be expressed by either spoken word or written text. As a text, language became an important element in document writing. Any mistake in document writing will cause the information told falsely. Nowadays, most of the document is written with a computer. In its writing process, there could be some mistakes happened because of human error. The mistake or error can be caused by either the letters from the adjacent keyboard keys, errors due to mechanical failure or slip of the hand or finger. Error when writing a document is often occurred. Especially nowadays, people’s awareness to pour his idea into the article, scientific journals, college assignment or other documents have increased. For that reason, spelling correction is needed to solve any writing mistakes. This research aim to objectify spelling correction on Indonesian text documents, to overcome non-word error. This system helps
* Corresponding author: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). MATEC Web of Conferences 164, 01047 (2018) https://doi.org/10.1051/matecconf/201816401047 ICESTI 2017
the user to o erco e docu ent te t writin errors, input on this syste is a te t docu ent and its output is new docu ent te t which error o writin ha e een corrected ethod is used to deter ine which letter caused error in a word e enshtein distance ethod is used to calculate the di erence etween the word error and the word su estion he word su estion se uence is deter ined y the pro a ility o ra
2 Literature review n this case, the inite tate uto ata, e enshtein distance and ra ethods are used to create application spellin correction or docu ent te t
2.1 Pre-processing re processin is the initial sta e in processin input data e ore enterin the ain sta e process o spellin correction or ndonesian te t docu ents re processin consists o se eral sta es he pre processin sta e is done case oldin , eli inatin nu ers, punctuation and characters other than the alpha et letter, to eni ation he pre processin step are e plained elow i ase oldin ase oldin is the sta e that chan es all the letters in the docu ent into lowercase nly the letters a to are accepted ase oldin is done or inputs, ecause the input can e capitali ed letter or non capitali ed letter or e a ple ro the input i ust Want to Eat ", we’ll get output "i ust want to eat ii Eli inatin nu ers, punctuation and characters other than the alpha et Eli inatin nu ers, punctuation and characters other than the alpha et is done ecause the characters are considered as deli iters and ha e no e ect on te t processin here are e ceptions to the eli inatin punctuation ecause it will a ect the process o uildin ra s, the e a ples is dots and , co a iii o eni ation okenization is commonly understood as “splitting text into words.” usin white space characters as deli iters and isolate punctuation sy ols when necessary or e a ple ro input te t "i just want to eat ", we can get tokens "i”, “just”, “want”, “to”, “eat ".
2.2 Finite state automata (FSA) inite tate uto ata is a athe atical odel that can accept inputs and outputs with two alues that are accepted and re ected he has a li ited nu er o states and can o e ro one state to another y input and transition unctions he has no stora e or e ory, so it can only re e er the current state here are two types o , the irst is eter inistic inite tate uto ata e pressed y i e tuples , Σ, δ, S, F inite set o states Σ = Finite set of input symbols called the alphabet δ = Transition function δ Q × Σ nitial or start state, inal state, he second is the ondeter inistic∈ inite tate uto ata which di ers ro the Deterministic Finite State⊆ Automata (DFA) only in the transition function Q × (Σ ε
∪ 2 MATEC Web of Conferences 164, 01047 (2018) https://doi.org/10.1051/matecconf/201816401047 ICESTI 2017 the user to o erco e docu ent te t writin errors, input on this syste is a te t docu ent at means, or eac state t ere may e more t an one transition, and eac may lead to a and its output is new docu ent te t which error o writin ha e een corrected di erent state. inite tate utomata is a good data structure or representing natural ethod is used to deter ine which letter caused error in a word e enshtein language dictionaries. is used to represent dictionaries and word errors to acilitate distance ethod is used to calculate the di erence etween the word error and the word e ens tein distance calculations . su estion he word su estion se uence is deter ined y the pro a ility o ra 2.3 Levenshtein distance 2 Literature review Levenshtein distance, could also be referred to as edit distance, is a measurement (metric) n this case, the inite tate uto ata, e enshtein distance and ra ethods are used by calculating number of difference in the two strings. The Levenshtein distance from two to create application spellin correction or docu ent te t strings is the minimum number of point mutation required to replace a string with another string. Point mutations consist of: (i) Substitutions: substitutions is an operation to exchange a character with another 2.1 Pre-processing character. For example the author writes the string "eat" become "eay", in this case re processin is the initial sta e in processin input data e ore enterin the ain sta e the character "t" is replaced with the character "y". process o spellin correction or ndonesian te t docu ents re processin consists o (ii) Insertions: insertions is the addition of characters into a string. For example the se eral sta es he pre processin sta e is done case oldin , eli inatin nu ers, string "therfor" becomes "therefore" with an addition of the character "e" at the end punctuation and characters other than the alpha et letter, to eni ation of the string. The addition of characters is not restricted at the end of a word, but can he pre processin step are e plained elow be added in the beginning or middle of the string. i ase oldin (iii) Deletions: deletions is deleting the character of a string. For example the string ase oldin is the sta e that chan es all the letters in the docu ent into lowercase "womanm" become "woman", in this operation the character "m" is deleted [4]. nly the letters a to are accepted ase oldin is done or inputs, ecause the input Distance is the number of changes needed to convert a string into another string. The can e capitali ed letter or non capitali ed letter or e a ple ro the input i ust following equation or algorithm used to calculate the Levenshtein distance between two Want to Eat ", we’ll get output "i ust want to eat words a a1.....a n and b b1.....bm are: ii Eli inatin nu ers, punctuation and characters other than the alpha et Eli inatin nu ers, punctuation and characters other than the alpha et is done d i w b , for 1 i m (1) ecause the characters are considered as deli iters and ha e no e ect on te t i0 k 1 del k processin here are e ceptions to the eli inatin punctuation ecause it will a ect j d0 j winsak , for 1 j m (2) the process o uildin ra s, the e a ples is dots and , co a k1 iii o eni ation fo r 1 a j b j okenization is commonly understood as “splitting text into words.” usin white di1, j1 d ij {?(d_(i - 1, j - 1) fo r a_j = b_i@min?( @{?( d_(i - 1, j) + w_del (b_i ) fo r 1? i ?m,1? j? n@d_(i, j - 1) + w_ins (a_j ) fo r a_j?b_i@d_(i - 1, j - 1) + w_sub (?a_j, b?_i ) d ij )? ))? fo r 1 i m, 1 j n (3) space characters as deli iters and isolate punctuation sy ols when necessary or { di-1,jwdel bk e a ple ro input te t "i just want to eat ", we can get tokens "i”, “just”, “want”, “to”, fo r a j b j min “eat ". {di-1,jwinsa j di1,j1wsuba j ,b j 2.2 Finite state automata (FSA) inite tate uto ata is a athe atical odel that can accept inputs and outputs Where : a = Words represented as rows with two alues that are accepted and re ected he has a li ited nu er o states and b = Words represented as columns can o e ro one state to another y input and transition unctions he has no i = Letter index of word a stora e or e ory, so it can only re e er the current state j = Letter index of word b here are two types o , the irst is eter inistic inite tate uto ata n = Letter length of the word a e pressed y i e tuples m = Letter length of the word a , Σ, δ, S, F w = The value of deletions, insertions, or substitutions (w = 1) inite set o states is algorit m start rom t e top le t corner o a two dimensional array t at as illed in Σ = Finite set of input symbols called the alphabet a num er o initial sring c aracters and target strings wit cost. e cost at t e lower rig t end ecomes t e le ens tein distance alue, t at descri es t e num er o di erences δ = Transition function δ Q × Σ etween t e two strings. nitial or start state,