A Safety-Oriented Platform for Web Applications


Richard S. Cox†, Jacob Gorm Hansen‡, Steven D. Gribble†, and Henry M. Levy†
†Department of Computer Science & Engineering, University of Washington ({rick, gribble, levy}@cs.washington.edu)
‡Department of Computer Science, University of Copenhagen ([email protected])

Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P'06), 1081-6011/06 $20.00 © 2006 IEEE

Abstract

The Web browser has become the dominant interface to a broad range of applications, including online banking, Web-based email, digital media delivery, gaming, and e-commerce services. Early Web browsers provided simple access to static hypertext documents. In contrast, modern browsers serve as de facto operating systems that must manage dynamic and potentially malicious applications. Unfortunately, browsers have not properly adapted to their new role. As a consequence, they fail to provide adequate isolation across applications, exposing both users and Web services to attack.

This paper describes the architecture and implementation of the Tahoma Web browsing system. Key to Tahoma is the browser operating system (BOS), a new trusted software layer on which Web browsers execute. The benefits of this architecture are threefold. First, the BOS runs the client-side component of each Web application (e.g., on-line banking, Web mail) in its own virtual machine. This provides strong isolation between Web services and the user's local resources. Second, Tahoma lets Web publishers limit the scope of their Web applications by specifying which URLs and other resources their browsers are allowed to access. This limits the harm that can be caused by a compromised browser. Third, Tahoma treats Web applications as first-class objects that users explicitly install and manage, giving them explicit knowledge about and control over downloaded content and code.

We have implemented a prototype of Tahoma using Linux and the Xen virtual machine monitor. Our security evaluation shows that Tahoma can prevent or contain 87% of the vulnerabilities that have been identified in the widely used Mozilla browser. In addition, our measurements of latency, throughput, and responsiveness demonstrate that users need not sacrifice performance for the benefits of stronger isolation and safety.

1 Introduction

The 1993 release of the Mosaic browser sparked the onset of the modern Web revolution [24]. The nascent Web was a hypertext document system for which the browser performed two functions: it fetched simple, static content from Web servers, and it presented that content to the user. A key Web feature was the ability for one Web site to link to (or embed) content published by other sites. As a result, users navigating the early Web perceived it as a vast repository of interconnected, passive documents.

Since that time, the Web has become increasingly complex in both scale and function. It provides access to an enormous number of services and resources, including financial accounts, Web mail, archival file storage, multimedia, and e-commerce services of all types. Users transfer funds, purchase tickets, file their taxes, apply for employment, and seek medical advice through the Web. Their perceptions of the Web have evolved, as well. Today's users see the modern Web as a portal to a collection of independent, dynamic applications interacting with remote servers. Moreover, they expect that Web applications will behave like the applications on their desktops. For example, users trust that Web applications are sufficiently isolated from one another that tampering or unintended access to sensitive data will not occur.

To respond to the demands of dynamic services, the browser has evolved from a simple document rendering engine to an execution environment for complex, distributed applications that execute partially on servers and partially within clients' browsers. Modern Web browsers download and execute programs that mix passive content with active scripts, code, or applets. These programs effect transactions with remote sites; interact with users through menus, dialog boxes, and pop-up windows; and access and modify local resources, such as files, registry keys, and browser components. The browser, then, has transcended its original role to become a de facto operating system for executing client-side components of Web applications.

Unfortunately, current browsers are not adequately designed for their new role and environment. Despite many attempts to retrofit isolation and security, the browser's original roots remain evident. Simply clicking on a hyperlink can cause hostile software to be downloaded and executed on the user's machine. Such "drive-by downloads" are a common cause of spyware infections [23]. Trusted plug-ins may have security holes that permit content-based attacks. Browser extensibility features, such as ActiveX components and JavaScript, expose users to vulnerabilities that can potentially result in the takeover of their machines.

Users assume that Web applications cannot interfere with one another or with the browser itself. However, today's browsers fail to provide either kind of isolation. For example, attackers can take advantage of cross-site scripting vulnerabilities to fool otherwise benign Web applications into delivering harmful scripted content to users, leaking sensitive data from those services. Other browser flaws let malicious Web sites hijack browser windows [26] or spoof browser fields, such as the displayed URL [37]. Such flaws facilitate "phishing" attacks, in which a hostile application masquerades as another to capture information from the user.

Overall, it is clear that current browsers cannot cope with the demands and threats of today's Web. While holes can be patched on an ad hoc basis, a thorough re-examination of the basic browser architecture is required. To this end, we have designed and implemented a new browsing system architecture, called Tahoma. The Tahoma architecture adheres to three key principles:

1. Web applications should not be trusted. Active content in today's Internet is potentially dangerous. Both users and Web services must protect themselves against a myriad of online threats. Therefore, Web applications should be contained within appropriate sandboxes to mitigate potential damage.

2. Web browsers should not be trusted. Modern browsers are complex and prone to bugs and security flaws that can be easily exploited, making compromised browsers a reality in the modern Internet. Therefore, browsers should be isolated from the rest of the system to mitigate potential damage.

3. Users should be able to identify and manage downloaded Web applications. Web applications should be user visible and controllable, much like desktop applications. Users should be able to list all Web applications and associated servers that provide code or data, and ascribe browsing-related windows to the Web applications that generated them.

By following these principles, Tahoma substantially improves security and trustworthiness for both users and Web services. Users gain knowledge and control of the active Web content being downloaded and executed. Web services gain the ability to restrict the set of sites with which their applications can communicate, thereby limiting damage from hijacked browsers. Active Web content and the browser that interprets and renders it are isolated in a private virtual machine, protecting the user's desktop from side-effects, malicious or otherwise.

The idea of sandboxing Web browsers is not new. For example, VMware has recently released a virtual-machine-based "Web browser appliance," containing a checkpointed image of the Firefox browser on Linux [32]. As another example, GreenBorder [12] augments Windows with an OS-level sandbox mechanism similar to BSD jails [17], in order to contain malicious content arriving through Internet Explorer or Outlook. Our work uses VMs to provide strong sandboxes for Web browser instances, but our contribution is much broader than the containment this provides. Tahoma isolates Web applications from each other, in addition to isolating Web browsers from the host operating system. As well, Tahoma permits Web services to customize the browsers used to access them, and to control which remote sites their browser instances can access.

We have implemented a prototype of the Tahoma browsing system using Linux and the Xen virtual machine monitor [4] and modified the Konqueror browser to execute on top of it. Our experience shows that the Tahoma architecture is straightforward to implement, protects against the majority of existing threats, and is compatible with existing Web services and browsers. We also demonstrate that the benefits of our architecture can be achieved without compromising user-visible performance, even for video-intensive browsing applications.

The remainder of this paper is organized as follows. Section 2 defines Tahoma's high-level architecture and abstractions. Section 3 describes the implementation, while Section 4 evaluates our prototype with respect to both function and performance. Section 5 presents related work, and we conclude our discussion in Section 6.

2 Architecture

The Tahoma architecture has six key features:

1. It defines a new trusted system layer, the browser operating system (BOS), on top of which browser implementations (such as Netscape or IE) can run.

2. It provides explicit support for Web applications. A Web application consists of a browser instance, which includes a client-side browser executing dynamic Web content and code, and a Web service, which is a collection of Web sites with which the browser instance is permitted to communicate.
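To make this notion of a Web application concrete, the containment it implies can be sketched as a simple policy check: the trusted layer consults a per-application list of permitted sites before allowing the browser instance to contact any host. The following Python sketch is purely illustrative; the manifest fields and function names are hypothetical and do not reflect Tahoma's actual interface.

```python
from urllib.parse import urlparse

# Hypothetical manifest for a banking Web application: the Web
# service is modeled as the set of hosts its browser instance may
# contact. Field names are illustrative, not Tahoma's format.
BANK_MANIFEST = {
    "name": "example-bank",
    "allowed_hosts": {"www.bank.example", "static.bank.example"},
}

def may_fetch(manifest, url):
    """Return True if the browser instance governed by `manifest`
    may retrieve `url`; the trusted layer would block everything
    else, limiting the damage a hijacked browser can cause."""
    host = urlparse(url).hostname
    return host in manifest["allowed_hosts"]

# The trusted layer mediates every fetch the browser VM attempts:
may_fetch(BANK_MANIFEST, "https://www.bank.example/login")  # True
may_fetch(BANK_MANIFEST, "https://evil.example/exfiltrate")  # False
```

Because the check runs beneath the browser rather than inside it, even a fully compromised browser instance cannot widen its own communication scope, which is the property the architecture relies on.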