PySBD: Pragmatic Sentence Boundary Disambiguation

Nipun Sadvilkar (Episource LLC) [email protected]
Mark Neumann (Allen Institute for Artificial Intelligence) [email protected]

Abstract

In this paper, we present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can produce logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as the ruby gem pragmatic segmenter[1], which we ported to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open source Python tool.

1 Introduction

Sentence Boundary Disambiguation (SBD), also known as sentence boundary detection, is a key underlying task for natural language processing. In many NLP pipelines, gold standard SBD is often assumed, and acts as a primary input to downstream NLP tasks such as named entity recognition and coreference resolution. However, in real world scenarios, text occurs in a variety of input modalities, such as HTML forms, PDFs and word processing document formats.

Although SBD is considered to be a simple problem, it becomes more complex in other domains due to unorthodox use of punctuation symbols. For example, drug names in medical documents, case citations in legal text and references in academic articles all use punctuation in ways which are uncommon in newswire documents. Simple SBD approaches for English web text (treating "?!:;." as end of sentence markers) cover a majority of cases, but an ideal SBD system should be able to disambiguate these edge case scenarios and be robust to known textual variation.

Our contributions in this paper are describing an open-source, freely available tool for pragmatic sentence boundary disambiguation. In particular, we describe the implementation details of PySBD, evaluate it in comparison to other open source SBD tools, and discuss its natural extensibility due to its rule-based nature.

2 Related Work

Sentence segmentation methods can be broadly divided into 3 approaches: i) rule-based, requiring hand crafted rules/heuristics; ii) supervised machine learning, requiring annotated datasets; and iii) unsupervised machine learning, requiring distributional statistics derived from raw text.

(Palmer and Hearst, 1997) use decision trees and neural networks in a supervised, feature based SBD model, requiring part of speech information and training data. (Kiss and Strunk, 2006) design punkt, an unsupervised SBD model centered around the observation that abbreviations are the main confounders for rule based sentence boundary models. Although it is unsupervised, punkt requires the computation of various co-occurrence and distributional statistics from a relevant corpus; as PySBD is rule-based, it does not require an initial corpus of text. (Evang et al., 2013) cast SBD as a character sequence labelling problem and use features from a recurrent neural network language model to train a CRF labelling model.

Many SBD papers reject rule-based approaches due to non-robustness, maintainability and performance. We reject these conclusions, and instead focus on the positive features of rule-based systems, namely that their errors are interpretable, rules can be adjusted incrementally, and their performance is often on-par with learnt statistical models.

[1] https://github.com/diasks2/pragmatic_segmenter

Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 110–114, Virtual Conference, November 19, 2020. © 2020 Association for Computational Linguistics

Issues with benchmarks on PTB/WSJ corpora
SBD systems have historically been benchmarked on the Wall Street Journal/Penn corpora (Read et al., 2012). The majority of the sentences found in the Penn Tree Bank end with a regular word followed by a period, testing the same sentence boundary cases repeatedly. In the Brown Corpus, 90% of potential sentence boundaries come after a regular word. Although the Wall Street Journal corpus is richer with numerical values and abbreviations, only 53% of its sentences end with a regular word followed by a period according to (Gale and Church, 1993) (Mikheev, 2002).

Given that commonly used training/evaluation corpora do not contain a particularly large amount of sentence marker variation, we use a Golden Rule Set to enumerate edge cases observed in sentence boundaries. The Golden Rule Set contains 48 hand-constructed rules, designed to cover sentence boundaries across a variety of domains. The GRS is interpretable (each rule targets a specific type of sentence boundary) and easy to extend with new examples of particular sentence boundary markers.

3 Implementation

PySBD is divided into four high level components: the Segmenter, Processor, Language and Cleaner sub-modules.

Segmenter
The Segmenter class is the public API to PySBD. It allows a user to set up a Segmenter in their language of choice, as well as specify additional options such as text cleaning and char span functionality. The Segmenter requires a two character ISO 639-1 code[2] to process input text. Text extracted from a PDF or obtained from OCR systems typically contains unusually formatted text, such as line breaks in the middle of sentences. This can be handled with the doc type option, or, for more aggressive text cleaning, the clean functionality performs additional pre-filtering of the input text, removing repeated and unnecessary punctuation.

The Processor contains the sentence segmentation logic, using rules to segment the input text. It contains several groups of sentence segmentation rules, some of which are universal across languages, and some of which are language specific. These are grouped as follows:

• Common
• Standard
• ListItemReplacer
• AbbreviationReplacer
• ExclamationWords
• BetweenPunctuation

The Processor identifies sentence boundaries by manipulating input text in 3 stages. Firstly, rules are applied to alter the input text by adding intermediate unicode characters as placeholders to signify that particular pieces of punctuation are not sentence boundaries. The segment stage then identifies true sentence boundaries by bypassing these unicode characters and splits the text into sentences using a much simpler regex rule. Finally, the manipulated text is transformed back into its original form by replacing the unicode placeholders with their original characters.

The Language sub-module holds all the languages supported by PySBD. Each language is built on top of two sub-components, Common and Standard, involving basic rules prevalent across languages. Common rules encompass the main sentence boundary regexes; AM-PM regexes handle numerically expressed time periods; number regexes handle period/newline characters before or after single/multi-digit numbers; and additional rules handle quotation, parenthesis, and numerical references within the input text. The Standard rule set contains regex patterns to handle single/double punctuation, geolocation references, file format mentions and ellipses in input text. The ListItemReplacer rule set handles itemized, ordered/unordered lists; the AbbreviationReplacer contains language specific common abbreviations. Finally, the ExclamationWords and BetweenPunctuation rules handle language specific exclamations and more complicated punctuation cases.

In practice, text encountered in the wild is noisy, containing extraneous line breaks, unicode characters, uncommon spacing and hangovers from document structure. To handle this, PySBD provides an optional component for such texts. The Cleaner is passed as an option through the top-level Segmenter component and provides text cleaning rules for cases like irregular newline characters, tables of contents, URLs, HTML tags and text with no space between sentences. As the text cleaning rules perform a destructive operation, this feature is incompatible with the char span functionality, as mapping back to character indices within the original text is no longer possible.

[2] https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
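The three stage mask/segment/restore strategy described above can be sketched in a few lines of Python. This is a deliberately minimal illustration of the idea rather than PySBD's actual rules; the placeholder character and the tiny abbreviation list are our own assumptions:

```python
import re

# Stage 1 masks punctuation that is NOT a sentence boundary with an
# intermediate unicode placeholder; stage 2 splits on a simple regex;
# stage 3 restores the original characters.
PLACEHOLDER = "\u2024"  # ONE DOT LEADER, standing in for a masked "."
ABBREVIATIONS = ("Mr.", "Dr.", "e.g.", "No.")  # illustrative list only

def segment(text):
    # Stage 1: mark abbreviation periods as non-boundaries.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", PLACEHOLDER))
    # Stage 2: with the confounders masked, a much simpler regex
    # suffices to find the true boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Stage 3: map the placeholders back to the original characters.
    return [s.replace(PLACEHOLDER, ".") for s in sentences]

print(segment("Mr. Smith arrived. He met Dr. Jones."))
# ['Mr. Smith arrived.', 'He met Dr. Jones.']
```

Because the masking step is reversible, the final sentences are character-for-character substrings of the input, which is what makes the non-destructive char span feature possible.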

4 Experimental Setup

Data
Contrary to the WSJ, Brown and GENIA datasets mentioned in (Read et al., 2012), we use language specific Golden Rules Sets for our experiments. There are 22 Golden Rules Sets in total, for the following languages: English, Marathi, Hindi, Bulgarian, Español (Spanish), Russian, Arabic, Amharic, Armenian, Persian, Urdu, Polish, Chinese, Dutch, Danish, French, Italian, Greek, Burmese, Japanese, Deutsch (German) and Kazakh.

The Golden Rules Sets are devised by considering possible sentence boundaries per language as well as by considering different domains. For example, the English language GRS is comprised of 48 golden rules[3] from formal and informal domains to cover a wide variety of phenomena: news articles are grammatically and punctuationally correct; scientific literature often involves numbers, abbreviations and bibliography references; and informal domains like web text (e-mail, social media) involve irregular punctuation and ellipses. To ensure that our rule-based system built with respect to the Golden Rules Set generalizes well in the real world, we have also performed a benchmark comparison on the GENIA corpus (Kim et al., 2003), a dataset of linguistic annotations on top of the abstracts of biomedical papers. The GENIA corpus provides both raw and segmented abstracts, which we use as natural data for our evaluation.

Setup
We evaluate PySBD and other segmentation tools on two corpora: the English Golden Rules Set and the GENIA corpus.

Tool        GRS     GENIA
blingfire   75.00   86.95
syntok      68.75   80.90
spaCy       52.08   76.80
spacy dep   54.17   39.20
stanza      72.92   63.40
NLTK        56.25   87.95
PySBD       97.92   97.00

Table 1: Accuracy (%) of PySBD compared to other open source SBD packages with respect to the English Golden Rule Set and the GENIA corpus.

5 Comparison to alternatives

Table 1 summarizes the accuracy of PySBD and the alternative Python SBD modules on the English Golden Rules Set and the GENIA corpus. The supervised machine learning based sentence segmenters, stanza (Qi et al., 2020) and spaCy dependency parsing (spacy dep) (Honnibal and Montani, 2017), are slower compared to the other Python modules and seem to segment incorrectly when text contains mixed case words or abrupt punctuation within words, which is prevalent in the biomedical domain. This inability to generalise to out of domain corpora is a main drawback of using supervised learning for SBD.

NLTK's (Bird, 2006) PunktSentenceTokenizer is based on an unsupervised algorithm (Kiss and Strunk, 2006) and fails to segment text containing brackets, itemized text and abbreviations at sentence boundaries. In contrast, practically all the modules following a rule-based approach (blingfire (Bling Team, 2020), syntok (Leitner, 2020) and PySBD) appear to be faster and more accurate on both corpora. The blingfire and syntok modules struggle when text has decimal numbers, abbreviations, brackets or mixed case words prior to or following the true sentence boundary. Lastly, the 3% drop in PySBD's accuracy on the GENIA corpus is caused by splitting itemized text into segments, whereas the GENIA abstracts contain single sentences with multiple bulleted lists. We feel that this segmentation choice comes down to preference, and both are equally valid.

Table 2 shows the runtime performance of each module on the entire text of "The Adventures of Sherlock Holmes"[4], which contains 10K lines and 100K words. The experiment was performed on an Intel Core i5 processor running at a 2.9 GHz clock speed. The blingfire module is the fastest since it is extremely performance oriented (at the cost of reduced maintainability/extensibility). Although PySBD is slower than several alternatives, it is considerably faster than running full pipelines and is a good choice for users who require high accuracy segmentations. Our comparisons cover a variety of different implementations (C++, Cython, PyTorch), approaches (model-based, distributional, rule-based) and are representative of practical choices for an NLP practitioner.
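The Golden Rule Set accuracy reported above can be computed with a small harness: a rule passes only when a segmenter returns exactly the expected list of sentences. The two example rules and the naive baseline segmenter below are our own illustrations, not entries from the actual GRS:

```python
import re

def grs_accuracy(segmenter, golden_rules):
    # golden_rules: (input_text, expected_sentence_list) pairs.
    # A rule counts as passed only on an exact match of the full list.
    passed = sum(segmenter(text) == expected for text, expected in golden_rules)
    return 100.0 * passed / len(golden_rules)

# Two hand-made example rules in the spirit of the GRS.
rules = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    ("Dr. Jones left early.", ["Dr. Jones left early."]),
]

# A naive baseline splits on every period and fails the abbreviation rule.
naive = lambda text: re.split(r"(?<=[.!?])\s+", text)
print(grs_accuracy(naive, rules))
# 50.0
```

Exact-match scoring is deliberately strict: a segmenter gets no partial credit for finding most boundaries in a rule, which keeps the benchmark focused on the edge case each rule encodes.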

[3] https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt
[4] http://www.gutenberg.org/files/1661/1661-0.txt

Tool        Speed (ms)
blingfire        85.24
syntok         1764.11
spaCy          1523.20
spacy dep     26850.69
stanza        48383.46
NLTK            780.49
PySBD          9483.96

Table 2: Speed benchmark on the entire text of "The Adventures of Sherlock Holmes" for PySBD compared to other open source SBD packages.

Language    Accuracy (%)
Amharic        80.95
Arabic         70.40
Armenian       63.75
Bulgarian      93.35
Burmese        48.05
Chinese        85.35
Danish         91.40
Deutsch        80.95
Dutch          91.40
French         91.90
Greek          91.05
Hindi          88.50
Italian        90.55
Japanese       96.45
Kazakh         63.20
Marathi        92.60
Persian        84.95
Polish         55.48
Russian        88.55
Spanish        92.65
Urdu           77.55

Table 3: Accuracy of PySBD's multilingual modules on the OPUS-100 multilingual corpus test sets, containing 2000 sentences per language. Each language module is built with respect to its own GRS.

6 Discussion

6.1 Package Development

We use Test-Driven Development (TDD): we first write a test for one of the rules from the Golden Rules Set that intentionally fails, before writing the functional code to make it pass. The approach used by our Python module is rule-based. We employ Python's standard library module re[5], which provides regular expression (regex) matching operations for text processing.

6.2 Non-destructive Segmentation

Our module performs non-destructive sentence tokenization, since character offsets into the original documents are often desirable when dealing with noisy text. The indices are obtained after the postprocessing stage by mapping post-processed sentence start indices back into the original input text. Upon enabling this feature, the output format is a list of TextSpan objects containing the sentence and its start and end character indices. The character spans make it easy to navigate to any sentence within the unaltered original text.

6.3 Multilingual Implementation

NLP research predominantly focuses on developing datasets and methods for the English language despite the many benefits of working on other languages (Ruder, 2020). An advantage of PySBD's rule-based approach is straightforward extension to new languages. PySBD has support for 22 languages spanning many language families, each having its own Golden Rules Set. Adding support for a new language involves adding language specific rules to the Golden Rules Set and adding language specific punctuation markers to a new language module. This modular language support allows PySBD to be maintained in a community driven way by open source NLP practitioners. For example, we extended the original Ruby pragmatic segmenter by adding support for Marathi, forming the Golden Rules and identifying Marathi-specific sentence syntax, punctuation and abbreviations, resulting in a usable module within 1 hour of work. If you would like to contribute to the PySBD module by updating an existing GRS or by adding support for a new language, refer to our contributing guidelines[6] to get started.

We benchmarked PySBD on the OPUS-100 parallel multilingual corpus, covering 100 languages (Tiedemann, 2012). We used the test sets of 21 of the languages excluding English, which contain 2000 sentences per language (due to the unavailability of a test set for Armenian, we used its train set, containing 7000 sentences). Due to the noisy nature of OPUS (on inspection, multiple sentences were present on individual lines in the test sets) and a lack

[5] https://docs.python.org/3/library/re.html
[6] https://github.com/nipunsadvilkar/pySBD/blob/master/CONTRIBUTING.md

of language specific knowledge to form rules and abbreviation lists, we observed weak performance in a few languages such as Burmese, Polish, Kazakh and Armenian. These shortcomings can be addressed in a community driven way by collaborating with multilingual NLP practitioners.

7 Conclusion

In this paper, we have described PySBD, a pragmatic sentence boundary disambiguation module. PySBD is open source, has over 98% test coverage and integrates easily with existing natural language processing pipelines. PySBD currently supports 22 languages and is easily extensible, with 57 projects depending on it at the time of writing. Although slower than some alternatives implemented in low level languages such as C++, PySBD successfully disambiguates 97% of sentence boundaries in a Golden Rule Set and is robust across domains and noisy text.

References

Steven Bird. 2006. NLTK: The Natural Language Toolkit. ArXiv, cs.CL/0205028.

Bling Team, Microsoft. 2020. BlingFire: A lightning fast finite state machine and regular expression manipulation library. https://github.com/microsoft/BlingFire/tree/master/ldbsrc/sbd.

Kilian Evang, Valerio Basile, Grzegorz Chrupała, and Johan Bos. 2013. Elephant: Sequence labeling for word and sentence segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1422–1426, Seattle, Washington, USA. Association for Computational Linguistics.

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://github.com/explosion/spaCy. To appear.

Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1:i180-2.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Florian Leitner. 2020. Syntok: Text tokenization and sentence segmentation (segtok v2). https://github.com/fnl/syntok.

Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics, 28(3):289–318.

David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241–267.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In ACL.

Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. 2012. Sentence boundary detection: A long solved problem? Computational Linguistics, pages 985–994.

Sebastian Ruder. 2020. Why You Should Do NLP Beyond English. http://ruder.io/nlp-beyond-english.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.
