Community Notebook Free Software Projects Projects on the Move Who says have to be a nightmare? Also in this issue: After the Deadline takes care of language issues.

By Carsten Schnober and Heike Jurzik

Jan Prchal, 123RF

ven hardened nerds are often over-challenged by the less than intuitive field of statistics. Besides the theory, you Kheng Ho Toh, 123RF need to know how to use the software that converts all Ethe theory into a practical application. The environment [1] can be a big help on the software side. This free implementation of the S statistics programming lan- guage was launched in 1992. Initially, you will not be able to do much with R and its spartan command-line tools until you learn the language. Although SPSS [2] is a commercial alternative, even students are asked to pay a three-digit licensing fee, and the quality isn’t always on par with R. Sofa Statistics for All Free statistics software with an intuitive web interface is a niche that the Sofa (Statistics Open For All) [3] project fills. One of the project’s aims is to give statisticians an easy-to-use tool. Like many other mathematical disciplines, statistics is an ancillary science that is used in many areas to interpret empiric surveys. For example, you can use a series of results to calcu- late the probability of dice throws. Social scientists can also forecast whether a survey is significant or will just output random values. Many statistical tests have different strengths and weaknesses that only specialists can assess. Statistical tests are generally regarded as essential because human intuition is prone to misjudgment. The Monte-Carlo fallacy is just one example that leads to an incorrect assumption of coherencies and the resulting shift of probabilities between two unrelated events – for example, between the result of the next throw of the dice and the previous results with the same dice. To cope with these and even more complex problems, Sofa (Figure 1) supports standards such as ANOVA (), Pearson’s chi-square test, t-tests, and other methods. The statistics tool also provides basic values like means, medi- ans, standard deviations, sums, maximums, minimums, and more. Quick Change Artist As mentioned previously, Sofa’s strength is not just its functionality; R and others have all the features listed in the previous section. But Sofa also impresses with its easy accessibility and its import and export options.

92 DECEMBER 2010 ISSUE 121 LINUX-MAGAZINE.COM | LINUXPROMAGAZINE.COM

092-094_projects.indd 92 07.10.2010 12:11:52 Uhr Community Notebook Free Software Projects

Sofa can read data from an SQL data- and text summarization base (MySQL, SQLite, and PostgreSQL), and correction, quickly from Open Document spreadsheets (as reached their limits, a used by OpenOffice and others) and new approach has dom- from MS Access. The program presents inated recent research: the results in HTML format and relies on statistical models. JavaScript to highlight specific rows of Whereas statisticians data in a targeted way when the user used to apply complex mouses over them, giving users the abil- rules manually in the ity to build statistics directly into a web- language of their site. Alternatively, Sofa stores its results choice, encountering directly in spreadsheet formats used by many new exceptions OpenOffice Calc and . and supplements in the Users can point and click to define the process, the current ap- style and content of their Sofa reports proach is for programs (Figure 2). Also, the software can auto- to learn from existing Figure 1: Sofa offers an intuitive user interface that pleases mate output, via Python to create, for ex- texts on the basis of sta- professional statisticians and newcomers alike. ample, a new program in a blog when tistical methods. new data occur. The advantage of sta- tistical language models Functionality and of this kind is that they Usability need virtually no man- Sofa relies on the free Dojo toolkit [4] to ual preparation. All render the JavaScript-enhanced HTML they require is a large output. Sofa itself is implemented in Java volume of text from and thus runs on other platforms besides which to learn. The dis- Linux. The homepage includes prebuilt advantage is that, with packages for , Windows, and Mac no upper limit for the OS X, as well as the source code. volume of text, a system The roadmap for future versions of the of this kind will need a program includes the ability to export to fairly large amount of the Oracle database format and other di- space, a couple of hun- agram types. Additionally, a plugin ar- dred megabytes at least, chitecture is planned to facilitate the in- both on disk and in tegration of further statistics functions. RAM, which is a big Sofa developers also continue to pur- issue, especially on mo- Figure 2: Sofa can create reports in a variety of file formats by sue their original goal – that of making bile devices. simply selecting the data and choosing a presentation format. statistical methods more accessible to Therefore, statistics- newcomers. To this end, the tool will in- based grammar and spelling checkers clude tips on typical applications and are less well suited to use on the desk- graphical examples in the future. They top. Who wants to swap their lean client also plan to expand localization efforts: fast enough for word processing for a fat Right now, Sofa is only available in Eng- one, just to run a single function? lish and Galician, but more languages The alternative is a spell checker that are planned. If you are interested in con- compares words with a dictionary and tributing to a translation or promoting reports an error if it encounters some- the development of Sofa in any other thing unknown. The mistakes it finds are way, you can contact the developers via fairly simple, and it will fail to discover the project homepage. Grant Paton- errors in sentence structure, stylistic Simpson encourages users to report is- bugs, grammatical errors, and incorrect sues and vote through the Freshmeat usage of words. Some commercial word and SourceForge open source portals. processors do provide ruleset-based grammar checkers, but manual defini- From Theory to Practice tions will never be able to cover all the Automated processing of natural lan- rules that govern human language. guages is one example of a practical ap- plication of sophisticated statistical Deadline for Typos methods. Whereas the early applications After the Deadline [5] offers an innova- in this field, such as machine translation tive solution to users with broadband In-

LINUX-MAGAZINE.COM | LINUXPROMAGAZINE.COM ISSUE 121 DECEMBER 2010 93

092-094_projects.indd 93 07.10.2010 12:11:54 Uhr Community Notebook Free Software Projects

ternet connections. The most likely candidate to replace an un- free web service accepts known word. Besides considering the texts and returns proof- probability of the word occurring in the reading proposals. The given context (i.e., with the preceding name comes from the and following words), After the Deadline New York Times column looks at other factors: the number of that investigates stylistic changes required to create another word blunders in day-to-day from the word with the typo, and the English. match between the first letter and the After the Deadline is a context-dependent frequency of a word. statistical language A neuronal network helps assess the cor- model that weighs in at rection suggestions. Figure 3: The OpenOffice plugin for After the Deadline sup- 1GB and resides in RAM After the Deadline does resort to man- ports stylistic corrections in addition to spell checking. server side, thus reduc- ual rules for any further checks; to this ing the client require- end, it integrates LanguageTool [6], ments to something min- which was developed in 2003 and is imal. Users simply need available as an independent OpenOffice a program to access the plugin. Additionally, After the Deadline online service, and there can structure a body of text in a mean- are plugins to handle ingful way and generate suggestions for this for the Open Office paragraph borders. word processor, Firefox, Google Chrome brows- Unlimited Options ers, and the WordPress If you are interested in checking out the blogging software. program’s capabilities, you can test After Also, extensions for the Deadline online [7]. The plugins re- jQuery and TinyMCE let ferred to will also cooperate with one of users build the spell- the project’s servers, although use is re- Figure 4: Known words can still be incorrect depending on the checker into forms on stricted to the English language and context: After the Deadline identifies mistakes that occur websites. With open must be non-commercial. because of different words with identical pronunciation. standards and source Users who are interested in deploying code, there is nothing to the style and language checker for com- prevent more interfaces being developed mercial purposes, or for multiple lan- in future. guages, will find an installation how-to Figure 3 shows the setup options for for the server component on the pro- the OpenOffice extension. After the ject’s website. Deadline offers a long list of optional The GPL’d software requires Sun’s style checks that supplement the tool’s Java version 1.6.0 and 1.5GB of RAM for core functionality. The tests will warn the low-memory variant or 4GB of RAM you, for example, if you use overly com- for the full version. The 1.6.0 version is plicated sentence structure, nominaliza- useful for short documents and a re- tion (Hidden Verbs) and passive sen- stricted amount of client access. Besides tences. the software itself, a number of language INFO models are also freely available. After the Deadline speaks not only English, [1] R Project: http:// www. r-project. org Error Evaluation The language model that After the Dead- but also Dutch, French, German, Indone- [2] SPSS: http:// www. . com line uses is based on bigrams – that is, sian, Italian, Polish, Portuguese, Span- [3] Sofa project: sequences of two words. Their relative ish, and Russian. http:// www. sofastatistics. com frequency in the training material repre- Future versions of the software will be [4] Dojo toolkit: sents the probability of their occurrence capable of finding omitted words. Also, http:// www. dojotoolkit. org in new text. The developers evaluated plans are afoot to let the server automat- [5] After the Deadline: the texts of Wikipedia, the Gutenberg ically extend its language models at reg- http:// afterthedeadline. com project, and various web logs to support ular intervals. [6] LanguageTool: this. The model is supplemented by a tri- The roadmap also envisages improve- http:// www. languagetool. org gram-based language model that is used ments to the plugins and a function that to search for easily confused words, such checks text for errors during input. If [7] After the Deadline online spell as homonyns (Figure 4). you want to help, or you have your own checker: http:// www. polishmywriting. com The probability values derived by this ideas, the project website provides de- approach are first used to provide the tails of how to contribute. ■■■

94 DECEMBER 2010 ISSUE 121 LINUX-MAGAZINE.COM | LINUXPROMAGAZINE.COM

092-094_projects.indd 94 07.10.2010 12:11:56 Uhr