arXiv:2102.09507v4 [cs.CL] 21 Jun 2021

Regular Expressions for Fast-response COVID-19 Text Classification

Igor L. Markov, Facebook Inc., Menlo Park, CA — [email protected]
Jacqueline Liu, Facebook Inc., Menlo Park, CA — [email protected]
Adam Vagner, Facebook Inc., Menlo Park, CA — [email protected]

ABSTRACT

Text classifiers are at the core of many NLP applications and use a variety of algorithmic approaches and software. This paper introduces infrastructure and methodologies for text classifiers based on large-scale regular expressions. In particular, we describe how Facebook determines if a given piece of text — anything from a hashtag to a post — belongs to a narrow topic such as COVID-19. To fully define a topic and evaluate classifier performance, we employ human-guided iterations of keyword discovery, but do not require labeled data. For COVID-19, we build two sets of regular expressions: (1) for 66 languages, with 99% precision and recall >50%, and (2) for the 11 most common languages, with precision and recall >90%. Regular expressions enable low-latency queries from multiple platforms. Response to challenges like COVID-19 is fast, and so are revisions. Comparisons to a DNN classifier show explainable results, higher precision and recall, and less overfitting. Our learnings can be applied to other narrow-topic classifiers.

CCS CONCEPTS

• Information systems → Information retrieval; • Computing methodologies → Machine learning; Data mining.

KEYWORDS

NLP, text classification, regular expressions, COVID-19

1 INTRODUCTION

Most application surfaces supported by Facebook implement special treatments of material currently related to COVID-19. For example, users searching for COVID-19-related material are presented with authoritative information, and various integrity treatments [5] are applied, with customization often desired. Content types handled by our solutions often intermix:

• Entire posts and individual comments
• Search queries that form full sentences
• URLs and Instagram hashtags that often merge words, as well as account, page and group names that usually separate words, but don’t always

To support such applications, we determine the salient COVID-19 keywords and search queries in each language and check for unintended matches. The search context imposes a very strict interpretation of relevance to COVID-19 — only the words and expressions that name the virus or the illness, such as “sars-covid-2”, are accepted. Misspellings, such as “korrana vairos”, are within the scope, but “corona beer” and “Corona Del Mar, CA” are not. Among non-dictionary words considered, we note the derisive jargon “covidiots” and the ayurvedic treatment “coronil”. This strict interpretation is broadened in applications such as page/group recommendations and News Feed ranking [4]. For example, “corona symptoms” and “corona jokes” are considered relevant, but “quarantine” alone is not.

The global reach of Facebook-family products mandates multilingual support using NLP tools. Language diversity brings additional challenges; complications include support of right-to-left languages and collecting sufficient data to automatically train classifiers in all languages. Automatic language detection is challenged by short and mixed-language examples. While automatic translation is useful on a case-by-case basis, it often misses multiple terms and context nuance. Some languages use several spellings for “corona” and “covid”. In Arabic, we found six word forms for “corona” as a description of the virus. But in Spanish, “corona” (“crown”) also appears in other contexts, as do related words in Slavic languages (Russian, Ukrainian, Bulgarian). Given sufficient context, enough training data can be collected for the world’s many languages, from Amharic to Vietnamese. Our work is set apart from prior art [2, 3, 7, 9, 10, 12, 14] by the ability to handle many content types and categories, a broader vocabulary, and much longer phrases. Checking entire posts calls for more subtlety than short-text matching.

The use of many NLP libraries [6] is limited in our application by language support. For example, robust lemmatization solutions are not available even for English, and they do not handle the large number of misspellings of “corona”, “covid”, etc. The use of transfer learning and word embeddings is handicapped by the neural nets’ training sets and the many emerging words and their spelling variants. Reasonable training sets are difficult to find (especially for the less popular languages), short text lacks context, URLs and hashtags require dedicated tokenization, etc. Keywords showing up in geographic and personal names make keyword matching risky for COVID-19 relevance. But our solutions provide even higher precision for the most common languages, without the external dependencies, thus reducing software complexity. As a matter of careful evaluation, it is important to know how much performance we leave on the table compared to a well-designed deep-learning text classifier, especially on English posts, where tokenization is not critical; checking results does not require our colleagues’ language expertise. Fortunately, our colleagues built a competent deep-learning based solution [3, 7], validated it by manual labeling, and improved it iteratively. Results of an empirical comparison back our approach.

Urgent change requests must be handled quickly. Classifier coverage must be increased whenever new names for the virus (“NCoV 2019”, “SARS-CoV-2”) and the illness (“COVID-19”) arise, and in many other cases. On August 17, 2020, media researcher Ben Decker warned that Dr. Judy Mikovits — the author of the conspiracy-theory “Plandemic” video — was going to release a second video on August 18, using the word “indoctorination”. According to [11], on August 19 the leading social-network platforms succeeded in limiting the distribution of this video, unlike the first “Plandemic” video in May. For such special cases, our classifiers can be revised quickly without labeled data. As part of due diligence, we found additional keywords (“plannedemic”) and similar but unrelated keywords, excluded to avoid false positives.

To support urgent changes, we use regular expressions (regexes). Skepticism about regular expressions abounds in the industry, despite common use at small scale. The infrastructure for large regexes remains limited, while some practitioners lack the mastery of regexes and technical understanding of pros and cons. An important insight is that matching is significantly easier than parsing [13]. Contrary to a common misunderstanding, modern practical regular expressions (regexes) can match many context-sensitive character configurations (formal languages) by using backreferences and zero-width lookaround assertions [1, 13]. A regex can match “virus” if preceded by “corona”, or match “corona” but not “corona beer”. With two regexes, we can match the set-difference of two formal languages, but context-free languages are not closed under set-difference. Regexes with backreferences are NP-complete. In applications, many engineers focus on deep learning solutions, leaving low-hanging fruit for other methods (Section 4). We explain how to harvest this low-hanging fruit, noting gaps in the literature on regex-building (Section 5). In academia, regex methods are criticized for hand-optimization. However, developing competent ML models also requires manual effort, although of a different kind. Our regex methodology is faster (per language) than building ML models because data collection is automated. For serving, regexes require much simpler infrastructure and resources.

In the remainder of this paper, we describe our infrastructure to store and access regexes from several application platforms. We then explain how we scale regexes in size and improve their precision. Several aspects of regex syntax are key to our work; those we illustrate by example, and we show the regex patterns we use. The coverage of high-precision Tier 1 regexes is rounded up by a discussion of validation and our development methodology. Tier 2 regexes increase recall for a moderate loss in precision. Here we use automated translation and develop keyword discovery methods. To ensure explainability, our regexes produce matched snippets useful for application-based customizations and fast evaluation of regex performance. We pay special attention to evaluation protocols and illustrate them by examples. We optimize regexes for timing performance and introduce structural regex analysis to identify portions that have minimal impact on recall. Empirically, regex matching is very fast. We summarize lessons learned from developing COVID-19 regexes, outline how our infrastructure and techniques can be applied to other classification challenges, and compare our work to prior literature.

Text detection techniques we describe in this paper are complemented by many others in production solutions [5, 8], including algorithms and manual content review. Our examples illustrate proposed techniques by simplified variants of production solutions.
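The lookaround behavior described in the introduction (matching “virus” only after “corona”, and “corona” but not “corona beer”) can be sketched with Python’s re module; the patterns and test strings below are illustrative, not production regexes:

```python
import re

# "corona" not followed by an optional space and "beer" (negative lookahead),
# and "virus" only when immediately preceded by "corona" (positive lookbehind).
not_beer = re.compile(r"corona(?!\s?beer)")
after_corona = re.compile(r"(?<=corona)virus")

assert not_beer.search("corona virus")        # plain "corona" matches
assert not not_beer.search("corona beer")     # excluded by the lookahead
assert after_corona.search("coronavirus")     # "virus" preceded by "corona"
assert not after_corona.search("flu virus")   # no match without the prefix
```

Both assertions are zero-width: they constrain the match position without consuming characters, which is what lets a single clause encode “corona, except corona beer”.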
2 BASIC INFRASTRUCTURE

The initial effort on COVID-19 classification at Facebook collected several keywords in high-prevalence languages to form short regular expressions. These regexes were populated in a Unicode-compatible key-value storage directly readable from PHP, Python, Java and SQL code in multiple existing applications. Compared to other classifier infrastructures, regexes are (1) easy to use, (2) very inexpensive, and (3) provide low latency and high throughput. Software integration complexity is low, and regex updates can be distributed worldwide in under one hour. However, the risk of mistakes was initially high, and prospects for scalability in size were unclear.

Our initial mandate for COVID-19 regexes required synonymy to either the virus or the illness, with very high precision (>95%, in practice usually >99%). This kept regexes short and allowed us to quickly extend coverage to over 60 languages, with help from native speakers only for particularly challenging languages, such as Arabic and Thai. The availability of these initial regexes supported accelerated development of downstream applications for many surfaces in Facebook-family products and helped put together training data for a DNN-based classifier developed later. However, recall was clearly limited by the “strict mandate” of specific keyword matching, by the size of these regexes, and by the initial focus on the keywords with the highest prevalence. In some cases, the recall of the initial regexes was doubled by adding overlooked keywords or word forms, as well as by careful generalization. However, the greatest improvements were attained by developing Tier 2 regexes that were not limited by size or by strict semantic requirements.

Tier 2 regexes match many more keywords and expressions to trade off the exceptionally high precision of Tier 1 regexes for significantly higher recall. In 2020, we supported Tier 1 regexes for 66 languages and Tier 2 regexes for 11 languages.

Given that terms and text in English often appear in posts written predominantly in other languages, it is common practice to concatenate the English regex to each individual language regex before matching, or to apply the two regexes separately. We do not recommend concatenating more than several language regexes because this may lead to very large regexes and unexpected cross-lingual matches. To illustrate, an early Czech regex included the overly aggressive clause “(v[iýy]r).*([ck]o[rn].[nr][ao])” for “virus.*corona” and several spelling variants, and unintentionally matched “environmental pest control” in English. In this particular case, the dot was replaced with a more restrictive character class.

2.1 Scaling up regex size and precision

Using common regex syntax [9], Sections 2.1 and 2.2 introduce our higher-level structures and illustrate them for the COVID-19 topic. Improving the recall of COVID-19 regexes clearly required larger regexes. Our infrastructure comfortably supports individual regexes up to 30KB, and some of the applications can handle megabyte-sized regexes. However, regexes with more than 80 characters, especially in foreign languages, become hard to read and awkward to maintain. Scaling the regexes up required adding structure, including support for visual line breaks (not line breaks in text) and comments ignored during matching. Simply splitting a regex adds line-break characters, which we don’t want to match. To this end, different regex engines support line breaks, but we prefer not to rely on engine-dependent matching options or maintain different regex variants for applications in PHP, Python, Java and SQL. Therefore, we have developed a portable way to support such comments and line breaks, compatible with major regex engines.

But before explaining the details, we first introduce necessary syntax and solve a different problem — matching “corona” not followed by “ beer” or “ry” (as in “coronary”). Recall that the (?!) syntax expresses a negative zero-width lookahead assertion; this means the regex “corona(?!(beer|ry))” matches “corona” and “coronavirus”, but not “coronabeer” and not “coronary”. To add an optional space before “beer”, use “corona(?!(\\s?beer|ry))”. To add up to 3 optional non-word characters before “beer”, use “corona(?!(\\W{0,3}beer|ry))”. Our regexes also exclude “coronation”, “Corona, Queens”, etc. Since all text is lowercased before matching, we only use lowercase characters. Cross-platform compatibility for negative lookahead covers the popular PCRE regex engine, but not the C++ re2 library.

Negative lookahead assertions also help to embed inert text into a regex. Consider the regex “(?!x)x”. It tries to match “x”, but precedes it by a negative lookahead assertion that insists that “x” does not appear next, creating a logical contradiction. In other words, this regex cannot match anything. And neither can the regex “(?!x)xhello”, which appends the inert text “hello” to the contradiction. It can be conjoined with a meaningful regex as an alternative: “foo|bar|(?!x)xhello”. Here we match “foo” or “bar”, but nothing else. We embed version numbers (“(?!x)x_Version_1234|foo”) and section names (“foo|(?!x)x_Section_III|bar”). Adding a line break within inert text does not affect matching:

foo|(?!x)x
|Section_III|(?!x)x
|bar

Multiplatform support entails several nuances. Given that PHP requires delimiters on both sides of regexes, as in “/foo|bar/” or the like, the stored regex must be interpolated into a PHP string, which resolves backslashes. For example, \b (word boundary) turns into a non-printable ASCII character for “backspace”. To avoid this, we double all backslashes in stored regexes: “\\b”. While Python does not require interpolation, it is easy to handle “\\” correctly.
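The contradiction-based inert text described above can be checked directly; a minimal Python sketch, with an illustrative version tag rather than the stored production form:

```python
import re

# The alternative "(?!x)x..." is a contradiction and can never match,
# so it turns adjacent text (version tags, section names) into inert
# decoration inside a larger regex.
pattern = re.compile("foo|(?!x)x_Version_1234|bar")

assert pattern.search("foo")
assert pattern.search("bar")
assert not pattern.search("x_Version_1234")  # the inert branch never fires
assert not pattern.search("xx")              # neither does plain "(?!x)x"

# A line break inside the dead branch is equally inert:
multi = re.compile("foo|(?!x)x\n|bar")
assert not multi.search("\n")
assert multi.search("foo")
```

The dead alternative also absorbs the newline that follows it, which is the basis of the portable visual line breaks described above.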
It tries to match “x”, but pre- cedes it by a negative lookahead assertion that insists that “x” does Here, N=50 and M=13, so 650*2 different keyword combinations not appear next, creating a logical contradiction. In other words, are captured with arbitrary characters in between. In practice, we this regex cannot match anything. And neither can this regex: capture more combinations by orders of magnitude. “(!?x)xhello” which appends the inert text “hello” to the contradic- To allow arbitrary characters between words, we initially used tion. It can be conjoined with a meaningful regex as an alternative: “.*” but found that far-away matching words introduce spurious “foo|bar|(!?x)xhello”. Here we match “foo” or “bar”, but nothing matches. Multiple variable-length wildcards in the same regex, e.g., else. We embed version numbers (“(!?x)x_Version_1234|foo”) “severe.*acute.*respiratory.*syndrome” slow down the match- and section names (“foo|(!?x)x_Section_III|bar”). Adding a line ing process. Therefore, we limit the distance between matched break within inert text does not affect matching" words and replace “.*” with “.{0,80}”. Measured in characters, this max distance is language-dependent, e.g., French is more verbose than English, which is more verbose than Hebrew. For individual foo|(!?x)x clauses, tuning maximal distance balances precision and recall. |Section_III|(!?x)x Recall that regex matching is greedy by default, i.e., it tries to |bar find the longest possible matching fragment. However, we prefer the shortest matches as they do not merge unrelated phrases, take less time to find, and provide more concise explanations. In regex Multiplatform support entails several nuances. Given that PHP lingo, we are replacing greedy matching with reluctant matching, requires delimiters on both sides of regexes as in “/foo|bar/” or the e.g., by using “.{0,80}?” instead of “.{0,80}”. 
2.3 Regex validation and basic development

Our infrastructure improvements support scaling regexes to large sizes, where the risk of mistakes calls for testing infrastructure. We now perform instantaneous checks before a regex change is submitted for review. These checks cover backslashes that should be doubled (in \\b, \\d, etc.), require a valid regex, and try known matching and nonmatching examples. We intercept several known bug types (empty alternatives in ||, (| and |) ) and ban regex syntax which SQL and PHP interpret differently.

The basic infrastructure above is sufficient to support large regexes, but has a number of blind spots, such as checks for regex-engine compatibility, numerical estimation of precision and recall, etc. Our early methodology for checking regex modifications was based on interactive SQL queries. It was automated using Python-based Jupyter notebooks that

• test the candidate regex in Python,
• launch SQL queries comparing the candidate regex to the latest production regex to estimate recall gain and aggregate the most common “new” and “lost” matches,
• can be posted read-only, linked from a reviewed regex change, and cloned for future code changes.

Matched user search queries capture many keywords that also appear in other contexts. Being short, queries repeat and can be sorted by frequency, unlike posts, which tend to be unique. Table 1 illustrates new and lost matches triggered by a change to the English regex.

Table 1: Evaluating a change to COVID-19 regexes.

  2690 new matches        | 230 lost matches
  ------------------------|------------------------------
  corona news             | corona beer
  corona news 24          | koronadal news update
  news corona live        | brigada news koronadal update
  #coronacrisis           | convit19
  zoom news live corona   | corona beer virus joke
  korona news             | corona beer virus
  corona news bangladesh  | koronadal updates
  nepal corona news       | corona beer virus funny
  #coronanews             | sharp coronado hospital
  corona news nepal       | coronation hospital
  ...                     | ...

Going through lists of the 50 most frequent matches usually makes it clear if a regex works as intended and if it hits many false positives. Unacceptable changes are rejected, but otherwise the next iteration attempts to improve recall (by adding or generalizing clauses) or precision (by removing or narrowing clauses, e.g., adding negative lookaround). We iterate on both search queries and posts.

Table 2: Recall gain for Tier 2 regexes in eleven languages. The untypically large recall gains for Punjabi and Vietnamese are due to the inclusion of several English words. Not shown here, Tier 1 regexes cover 66 languages.

  Language        Recall gain
  AR Arabic       56.7%
  BN Bengali      93.5%
  EN English      68.4%
  ES Spanish      79.2%
  HI Hindi        50.4%
  ID Indonesian   37.3%
  FR French       59.2%
  PA Punjabi      9.62x
  PT Portuguese   1.07x
  RU Russian      1.13x
  VI Vietnamese   3.06x
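The pre-submit checks described in Section 2.3 can be sketched as follows; the `validate` helper and its toy rules are hypothetical simplifications, not the production checks:

```python
import re

# Sketch of pre-submit regex checks: ban known bug patterns, require a
# valid regex, and try known matching and nonmatching examples.
def validate(regex_src, should_match, should_not_match):
    # Intercept empty alternatives such as "||", "(|", "|)".
    for bad in ("||", "(|", "|)"):
        if bad in regex_src:
            return f"empty alternative near {bad!r}"
    try:
        rx = re.compile(regex_src)
    except re.error as e:
        return f"invalid regex: {e}"
    for text in should_match:
        if not rx.search(text):
            return f"missed required example: {text!r}"
    for text in should_not_match:
        if rx.search(text):
            return f"matched forbidden example: {text!r}"
    return "ok"

assert validate(r"corona(?!\s?beer)", ["coronavirus"], ["corona beer"]) == "ok"
assert validate(r"corona||covid", [], []).startswith("empty alternative")
assert validate(r"corona(", [], []).startswith("invalid regex")
```

Running such checks before review keeps trivially broken changes out of the iteration loop, so reviewers can focus on precision/recall trade-offs.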
3 KEYWORD DISCOVERY AND EVALUATION

Tier 2 regexes increase recall by 30–900% compared to Tier 1 regexes by adding many more keywords, expressions and bipartite terms. While they allow some false positives for the sake of increasing recall, both precision and recall are routinely >90%. Negative lookahead assertions and other techniques help maintain precision >95% in many cases. Such improvements require a greater development effort and additional automation. Table 2 shows recall gains of Tier 2 regexes over Tier 1 regexes for the 11 languages supported as of September 2020. We estimate precision in all cases to be >90%.

3.1 Automatic translation and its pitfalls

During keyword discovery, we routinely rely on automatic translation to understand words and sentences, but the process requires some human expertise. While unable to process massive amounts of information, people provide insights into candidate solutions, formulate generalizations and identify unexpected phenomena. For example, through human insight we learned that in Bengali (Bangla), inserting a space in the term for “corona” (splitting it into “coro” and “na”) produces the words “do not”, and in rare cases this meaning can appear even without a space character. Additionally, swapping the words for “virus” and “corona” may change the meaning of the word for “corona” to “don’t”.

We found subword issues particularly difficult in Arabic, where “corona” can be spelled in different ways, but combining some spelling variants leads to irrelevant matches. Germanic and Slavic languages have numerous word forms, some of them irregular, that must be taken into account by regexes to avoid lengthy enumerations. Word meanings and connotations are important when trading off precision for recall. For example, common Kazakh queries with the word “ковид” (“covid”) include “ковид белгiлерi” (“signs of covid”). This phrase can be added as a two-word regex clause or used in a larger bipartite clause as a generalization and to produce more explicable matches. However, being a very general word, “белгiлерi” (“signs”) cannot be used by itself, compared to “symptoms”, which has a medical connotation.

Automatic translation is sufficient to support regex development for many languages when used with our keyword discovery tools and large searchable corpora. Languages with sophisticated grammar and word forms, such as Arabic, require native-speaker expertise because (a) automatic translation can be misleading, especially for short phrases, and (b) constructing regexes without language expertise can miss relevant word forms while running the risk of false positives due to unexpected substring matches. Any fluent speaker can sanity-check individual keywords and regex clauses, while a working knowledge of regexes can help even more by flagging pitfalls, suggesting improvements, and expressing word stems and spelling variants to increase recall without undermining precision. Successful regex development is possible without language fluency (via automated translation), but a fluent speaker can accelerate work for difficult languages.
Successful regex development is of Tier 2 regexes over Tier 1 regexes for 11 languages supported as possible without language fluency (via automated translation), but of September 2020. We estimate precision in all cases to be >90%. a fluent speaker can accelerate work for difficult languages. 4 covid coronavirus cas corona While labeled data are not necessary in our approach, even contre covid1 virus test noisy data enable refinements that focus on “false positives” and covi0 sars tests symptomesˆ “false negatives” for a given regex (many of these turn out correct morts nouvelle confinement teste´ upon inspection). Initially, we used sampled posts from known mort testees´ testes coronavirus groups, as positively-labeled examples, along with infection tester cov dernier posts that matched our earlier regexes. Incidentally, this collection teste epidemie c0vid derniere` was also used by our colleagues to train a supervised classifier. mortalite´ derniers pandemie symptomes Subsequently, we used classifier decisions precomputed for large chine dec´ ed´ ees´ symptomeˆ testee´ numbers of posts. We used several types of keyword discovery contrer dernieres` stopcovid covid_ techniques: dec´ ed´ e´ dec´ ed´ ee´ pneumoni infections derniere contrepoints testicules covidinfos • Adjacent co-occurrence: e.g., starting with “corona”, we may dec´ ed´ es´ morte ncov neumonies find “news”, “symptoms”, etc. This technique is useful to symptomatiques mortes testes covids find clarifying words for bipartite clauses, e.g., “halloween _covid testing mortalite masks” vs “surgical masks”. covit testez mortel covi1 • Most frequent words in true positives: e.g., starting with mortalite symptomatique covi0 cov19 “coronavirus” and “covid-19”, we may discover “doctor”, corana symptome notocovid c0v1 “hospital”, etc. 
When including these keywords in the regex, symptomatiqueˆ coronafakevirus _covid_ testee one should check the incremental gains, as some words only symptoms decede decedent symptomatic appear next to stronger keywords. testte masquecovid testant quarantine • Most frequent words in false positives can indicate erro- neumonie dec´ eder´ chinese neous keywords, unintended matches, overlapping topics, symptomˆ confinements kovid contrevenant etc. Keywords found this way can be used in negative looka- Table 3: Starting with just "covid" and "coronavirus", our head assertions. When valid keywords appear among false keyword discovery finds related French words. positives, this indicates incorrect labels (see Section 4). • Most frequent words in false negatives: e.g., starting with “corona”, we find in a Russian corpus respective terms in Rus- 3.2 Keyword discovery techniques sian. However, this turns up many prepositions and modal When adding a new language, one starts with “corona” and “corona verbs. Hence, we compare word frequencies in false nega- virus”, used even in languages with non-Latin scripts. For example, tives to those in all texts, then rank keywords by the ratio of the word “corona” has low precision in Spanish where it means frequencies — we seek words more likely to appear in false “crown”, but high precision in Portuguese (where “crown” trans- negatives than in general texts. This technique turned up lates as “coroa”). Where relevant, translations of "corona" and "coro- the French word “quarantaine” when the regex misspelled it navirus" should be considered as well, noting that many language as “quarantine”. For Bengali and Punjabi, the most common groups share words and spelling variants (Indonesian and Javanese, missing words were in English. languages that use Cyrillics, etc). • Most frequent word bigrams and trigrams with words fre- Additional keywords can be found among words that appear quently in false negatives and false positives. 
Bigrams and most commonly in matching search queries, as well as posts. To trigrams give context, make automated translation more ef- this end, posts use a broader lexicon than search queries. To il- fective, help distinguishing different word meanings, and lustrate, we randomly sample a large number of French-language suggest bipartite clauses for regexes. posts with “covid” or “coronavirus”. Table 3 shows the most com- mon words in those posts. These words can be additionally cat- Here is a small sampler of bigrams found in false negatives for egorized by Wikipedia-text frequencies. Here we see the origi- English at one point during the regex development: nal two keywords and their variants (covid1, cov, ncov, _covid, covid_, _covid_, covit, cov0, covi1, cov19, c0v1), as well as many ’’, ’in lockdown’, ’you touch’, ’but liquor’, other relevant keywords (virus, pandemie, symptom/symptˆ omes,ˆ ’self isolation’, ’not social’, ’non essential’, ’to thank’, quarantaine, confinement), including multiple word forms (testing, ’home please’, ’gloves masks’, ’safe everyone’, ’doctors tests, testez, teste,´ testes,´ testee, testees)´ and typos (testte). Some nurses’ words are foreign (china, chinese). Some words are short and must be used with word boundaries (sars, cas) while others are fairly generic (cas, contre, mort, nouvelles, chine). There are also sev- This suggests adding bigrams “social.distancing”, “self.isolation” eral apparent hashtags (coronafakevirus, ensemblecontrelacovid). and “gloves.masks” as regex clauses, while watching out for false For words that cannot be used individually, we find common bi- positives this may introduce. These clauses can also be gener- grams. Even with a limited understanding of French, one can build alized to allow for more interstitial characters between words as a compact but comprehensive regex. We initially erred on the side well as multiple word forms, such as “distanc(e|ing)”. 
Notably, “self- of high precision and later gradually relaxed precision to signifi- isolation” brings many false positives, but can be included if used cantly increase recall. with additional weak keywords. For French, Portuguese and other 5 languages, it was important to distinguish masks related to COVID- • Use additional “negative” regexes to weed out false positives. 19 from cosmetic and entertainment masks. The manual regex revi- We found posts on lockdowns during hurricanes, surgical sion process (Section 2.3) sometimes uses transformations similar masks used during surgery, non-COVID viral pneumonias, to those in [9], but we do not automate what people do quite well, vaccines development for non-COVID infections, etc. and instead focus on big-data aspects of the problem. 3.4 Performance evaluation with "safe" regexes 3.3 Explainability and customization To evaluate a change in a regex, our basic regex development Unlike entire posts, short strings do not require laborious scanning methodology (see above) uses search traffic to find the most com- to understand which parts matched and why. However, in order to mon “new” and “lost” matches. We also use posts to estimate recall deal with longer texts, we leverage the ability of regex matching gain. Given that posts use longer phrases and a larger vocabulary functions to return a full list of matches. For each matched text, a than search queries, we use a more comprehensive evaluation later. downstream table contains a list of up to 30 of the matched snip- When working with (possibly noisy) labeled data, we zoom in on pets, sorted by increasing length. For example, a long text may pro- classifier disagreements and evaluate them manually. For example, duce the following array of matches: [ "covid1", "virus disease seeing 10 actual false positives out of 200 formal “false positives”, (covid", "health department corona" ]. This helps confirming (pessimistically) supports a >95% precision bound for our regexes. 
correct matches without reading the entire post. Evaluating 200 examples can be laborious even for a native Tier 2 regexes are optimized for more explainable matches by speaker, whereas we do not understand most languages involved trying to extend each match to full words or entire expressions, without translation. To streamline evaluation, we extract a list of and including known keywords that clarify meaning. For example, matched snippets, as illustrated in the example on explainability. we match “pandemi[ac]” in English (allowing the Latin form of the While most posts are unique, their matched keywords exhibit sta- word often used by doctors). “Pandemi” is good enough, but not as tistics similar to those in search queries. Instead of always check- clear to the reader. Along the same lines, we try to match “cases of ing individual examples like the one in Section 3.3, we create a coronavirus” rather than just “cases of corona”. On the other hand, “safe” (100% precision) regular expression for matched fragments. for interstitials between keywords, we replace the default greedy This safe regex includes “coronavirus” and automatically confirms regex matching with reluctant regex matching that uses as few many examples is true positive matches. A handful of such key- words as possible. In practice, these adjustments provide concise words usually confirm > 80% of matches. For the Bengali (Bangla) but clear explanations, particularly valuable for long texts. language, adding just two non-English keywords to the general English safe regex covers > 96% “false positive” matches where Original text: Facebook Data for Good has a number of the regex disagrees with the DNN classifier. Thus, out of 200 ex- tools and initiatives that can help organizations respond to amples, we only need to check 7 with human intervention using the COVID-19 pandemic. 
Here’s how you can make use automated translations, and they all come out correct, confirming of these tools: https://dataforgood.fb.com/docs/covid19/ a near-100% recall. This process is fast and automated translation is now accessible from spreadsheet applications in batch mode. With a safe regex for a particular language, one can quickly iter- ate changes to the main regex. A more traditional (and laborious) Extracted keywords: COVID-19, covid19 human-labeling process can validate the final result, as discussed in Section 4. Downstream applications may assume different criteria for COVID- 19 relevance. To accommodate multiple applications without mod- 3.5 Matching-time evaluation and optimization ifying regexes, our Tier 2 regexes are inclusive and open to cus- Regex size is an important parameter for both maintainability and tomization. We distinguish several types of customization: matching time which can limit the bandwidth of the COVID-19 classifier. Downstream tables may take hours to perform matching • Remove or normalize some text during preprocessing. For on all rows. We time regexes in Python and SQL on a small set of example, leaving only the first : lines of text would require long posts, including matching and non-matching examples. Non- matches early on and emphasize the significance of the topic. matching posts usually take longer. When matches exist, returning In another example, one can remove “vaping epidemic” to all of them excludes early termination. When matches and explain- avoid false positives with “epidemic”, if this word is matched ability are unimportant, one can perform matching with early standalone. Normalization can be illustrated by stemming termination. Among common techniques for regex-matching ac- and lemmatization for Slavic and Germanic languages that celeration, we found that avoiding avoiding multiple “.*” per clause use numerous word forms. is most impactful (Section 2.2). 
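The clause-generalization step described above (a few interstitial characters between keywords, plus word-form alternation) can be sketched in Python. The clause list, the interstitial pattern, and all names here are illustrative assumptions, not the production regex:

```python
import re

# Hypothetical sketch: turn bigrams mined from false negatives into
# generalized regex clauses. "[\s\W]{1,4}" allows a few interstitial
# characters (space, hyphen, etc.) between the two words of a bigram.
INTERSTITIAL = r"[\s\W]{1,4}"

bigram_clauses = [
    r"social" + INTERSTITIAL + r"distanc(e|ing)",   # word-form generalization
    r"self" + INTERSTITIAL + r"isolat(e|ion|ing)",
    r"gloves" + INTERSTITIAL + r"masks",
]
topic_regex = re.compile("|".join(bigram_clauses), re.IGNORECASE)

print(bool(topic_regex.search("Practice social distancing!")))  # True
print(bool(topic_regex.search("I am in self-isolation now")))   # True
print(bool(topic_regex.search("new gloves and shoes")))         # False
```

Joining clauses with "|" into one alternation mirrors how such keyword regexes are typically assembled; each clause can then be reviewed or removed independently when it introduces false positives.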
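The snippet-extraction idea from the explainability discussion can be sketched as follows. The toy pattern, the `matched_snippets` helper, and the 40-character reluctant gap are hypothetical stand-ins for the actual Tier 2 regex and the downstream table logic:

```python
import re

# Toy stand-in for a Tier 2 regex. Note the reluctant quantifier ".{1,40}?"
# between keywords instead of a greedy ".*", so snippets stay short.
pattern = re.compile(
    r"covid\w*|coronavirus|virus .{1,40}?\(?covid|health department corona",
    re.IGNORECASE,
)

def matched_snippets(text, limit=30):
    """Return up to `limit` matched snippets, sorted by increasing length,
    so a reviewer can confirm matches without reading the whole post."""
    return sorted(pattern.findall(text), key=len)[:limit]

text = ("The health department corona briefing discussed "
        "the virus disease (covid) today.")
print(matched_snippets(text))
# → ['virus disease (covid', 'health department corona']
```

With greedy matching, the "virus ... covid" clause could swallow a long stretch of text; the reluctant form keeps the returned snippet close to the minimal evidence for the match.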
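The safe-regex triage in Section 3.4 might look like the following sketch, assuming matched snippets were already extracted per post. The `safe_regex` keyword list and the `triage` helper are assumed for illustration:

```python
import re

# A "safe" (100%-precision) regex: every keyword in it unambiguously
# refers to COVID-19, so any hit auto-confirms a true positive.
safe_regex = re.compile(r"coronavirus|covid.?19|sars.cov.2", re.IGNORECASE)

def triage(snippet_lists):
    """Auto-confirm posts whose snippets hit the safe regex; queue the rest
    for human review (e.g., via batch automated translation)."""
    confirmed, needs_review = [], []
    for snippets in snippet_lists:
        if any(safe_regex.search(s) for s in snippets):
            confirmed.append(snippets)
        else:
            needs_review.append(snippets)
    return confirmed, needs_review

posts = [["coronavirus cases"], ["corona beer party"], ["covid-19 testing"]]
confirmed, review = triage(posts)
print(len(confirmed), len(review))  # 2 auto-confirmed, 1 left for review
```

Only the residue (here, the ambiguous "corona beer party") needs manual inspection, which is what shrinks 200 disagreements down to a handful of human checks.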
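A minimal timing harness in the spirit of Section 3.5, using Python's `re` and `timeit`; the two patterns and the synthetic long post are made up for illustration, not taken from the production pipeline:

```python
import re
import timeit

# Compare a clause with two greedy ".*" against a bounded alternative.
slow = re.compile(r"corona.*virus.*outbreak")         # multiple ".*" per clause
fast = re.compile(r"corona.{0,60}virus.{0,60}outbreak")

# Long non-matching post: the worst case, since the whole text is scanned.
post = "corona " + "x" * 5000 + " unrelated text"

# re.search stops at the first match (early termination); re.findall would
# have to enumerate all matches and so cannot terminate early.
t_slow = timeit.timeit(lambda: slow.search(post), number=200)
t_fast = timeit.timeit(lambda: fast.search(post), number=200)
print(f"greedy .*: {t_slow:.4f}s  bounded: {t_fast:.4f}s")
```

The bounded interstitial `.{0,60}` limits how far the engine looks for the next keyword, which is one way to realize the "avoid multiple '.*' per clause" advice.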
• Apply the regex only to the first k words (or some fraction) of a long text. This addresses a known issue with texts

Tier Regex Runtime in