Mining for Co-Changes in the Context of Web Localization

Huzefa Kagdi and Jonathan I. Maletic
Department of Computer Science
Kent State University
Kent Ohio 44242
{hkagdi, jmaletic}@cs.kent.edu

Abstract

An approach for mining repositories of web-based user documentation for patterns of evolutionary change in the context of internationalization and localization is presented. Sets of documents that are changed together during the translation process are uncovered and documented to support future evolution of the system. A sequential-pattern mining technique is used to uncover the patterns from Subversion repositories. The approach is applied to the open source KDE system. KDE maintains documentation for over fifty different natural languages and presents a prime example of the problem. Characteristics of the uncovered patterns such as size, frequency, and occurrences within a single language or across multiple languages are discussed. Such patterns help provide insight as to the effort required in retranslation due to a change in the documentation and help user communities estimate the progress of documentation in their respective languages.

1. Introduction

Open-source products have a world-wide user base. In order to better serve and retain a diverse user community, open-source projects are increasingly developed with the ability to adapt to a number of natural languages and environments. Products such as Linux, OpenOffice, and KDE provide localization of user interfaces and user documentation (including online help) in several languages and locales (e.g., character sets, date/time, and currency). For example, KDE provides online user guides in nearly twenty-six languages [13].

The majority of the open-source projects are based on the GNU gettext model for localization purposes [10]. Such projects produce source code (i.e., string literals) and documentation in a base language (typically US English), extract the strings and documentation that require a language-specific translation, and translate them to another language. Localization is a semi-automatic process where the translation process (i.e., converting text from one language to another) largely remains a human-intensive activity. It is not uncommon for large projects to have thousands of documents that need to be kept in alignment with the corresponding versions of the software. As such, the evolution process of document translation is similar to the evolution process of source code. There are multiple teams and contributors involved in the evolution process. Therefore, the translation process is managed via version-control systems in document repositories¹.

¹ Here document or documentation includes the web pages or sources from which web pages are generated.

The versions stored in document repositories can be utilized to extract pertinent information and/or uncover relationships or trends about a particular evolutionary characteristic. In this paper, we present an approach for mining the document repositories (web pages) in order to uncover patterns of translated documents that frequently co-change in a single language or across multiple languages. Our approach is applied to over four thousand versions of the KDE document repository. The KDE document repository includes over fifty different languages for localization.

The recovered patterns provide information not only about the documents that are likely to be retranslated or updated in a single version but also in a series of versions. For example, the pattern {kexi.po}→{krita.po} mined from the KDE document repository indicates that the documents kexi.po and krita.po are changed in two successive versions. This pattern is found in the change history of 506 translated documents across 44 languages. Such patterns directly support the evolution of documentation. They can be utilized to assess the impact of changes with regard to translation. This supports the translators and users in the following ways:

• Supports their decision to allow (or not allow) changes in the documentation during the string (hard) freeze,
• Assists translators to plan and examine potential out-of-date/obsolete documents, and
• Assists user communities to anticipate updates in the documentation.

The rest of the paper is organized as follows. In section 2, we examine the document translation paradigm based on gettext, section 3 discusses the evolution data found in the document repository, section 4 presents our pattern mining approach, section 5 evaluates our approach on the KDE project, section 6 presents related work, and finally our conclusions and future work are presented in section 7.

2. Document Translation

Open-source systems are very often developed and distributed with the Native Language Support (NLS) paradigm. Native language support encompasses internationalization (better known as i18n) and localization (better known as l10n) [10]. Internationalization is a generalization process that gives a program the ability to understand and support multiple locales (e.g., interaction messages, input, output, date/currency formats in multiple languages). Localization is a specialization process that produces a locale-specific instance from an internationalized program. Though NLS (specifically localization) requires a non-trivial human effort, tools such as GNU gettext and KBabel help simplify the task. Multilingual websites that contain documentation (online help) of open-source systems typically follow the same internationalization and localization process that is established for the source code of a project. For example, in the case of KDE, the documentation (with internationalization support) is written in US English and then translated (localized) to a number of other languages with the help of tools [14]. This is an exact replication of the KDE source-code internationalization and localization process.

In the KDE project, the online documents produced in US English are said to be untranslated documents and serve as templates for translation in other languages. A translation team localizes the untranslated documents to a specific language (e.g., German). The documents specialized to a specific language from the untranslated documents are known as translated documents.

    <title>KDE User's Manual</title>

Figure 1. A portion of man-kdeoptions.7.docbook from kdelibs – Untranslated document in US English.

The translation process is based on the Portable Object (PO) files [10]. A PO file consists of textual entries. Each entry is a pair of untranslated and translated strings. An untranslated document (i.e., the parts of it that require internationalization) is converted to a corresponding PO file. Such PO files typically contain an empty translated string in the pair of strings and are stored as POT (Portable Object Template) files. These template files are then used to produce a language-specific translation by (manually) filling in the empty translated strings. The language-specific instance copies of these documents are then converted back to the original format of the documents.

    #. Tag: title
    #: man-kdeoptions.7.docbook:8
    …
    msgid "KDE User's Manual"
    msgstr ""
    …

Figure 2. A portion of kdelibs_man-kdeoptions.7.pot portable object file from kdelibs – Internationalized template for translation to other languages.

For example, consider a portion of a document in the docbook format (www.docbook.org) shown in Figure 1. The content of the element title is in US English. In order for this document to provide native language support, it is converted to the intermediate POT format as shown in Figure 2. An entry (msgid, msgstr) of strings is created. The string msgid contains the untranslated content of the element title, whereas the string msgstr is left empty. This conversion is facilitated by the tool xml2po. The tool also inserts additional context information in the form of comments. For example, the comments (lines starting with the symbol #) provide information about the element name whose content is translated and its location in the original document.

    #. Tag: title
    #: man-kdeoptions.7.docbook:8
    …
    msgid "KDE User's Manual"
    msgstr "KDE Benutzerhandbuch"
    …

Figure 3. A portion of kdelibs_man-kdeoptions.7.po portable object file from kdelibs – Specialized portable file for German.

An instance (copy) of the template is created as shown in Figure 3. In this case, the string msgid is translated to German and the translated content is manually entered in the string msgstr. Once such a translation is completed, the document in the PO format is converted to the docbook format as shown in Figure 4.

    <title>KDE Benutzerhandbuch</title>

Figure 4. A portion of man-kdeoptions.7.docbook from kdelibs – Translated document to German.

3. Repositories of Translated Documents

The translation of documents is similar to source-code evolution in that it is a continuous process in which changes are performed incrementally. That is, translation is often performed on a single entity or group of related entities during a single step. Also, new untranslated documents (POT files) may be introduced that require creation and translation of new language-specific instances (PO files). Incremental translation of changes is easier to handle when compared with modification of […]

[…] logentry. Subversion's log-entries include the attributes committer, date, and paths (i.e., files) involved in a change-set. Each logentry is uniquely identified by a revision number. Figure 5 shows four logentries (i.e., change-sets) of a hypothetical document repository.

Revision 1
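The logentry attributes described above (revision number, committer, date, and paths) can be read from the XML that `svn log --xml -v` emits. The following is a minimal sketch, not the authors' tooling: the sample log, the committer names, the dates, and the repository paths are all invented for illustration, and the helper name `parse_change_sets` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the output of `svn log --xml -v`.
# Revisions, authors, dates, and paths are invented for illustration only.
SAMPLE_LOG = """<log>
  <logentry revision="2">
    <author>bob</author>
    <date>2006-01-02T09:30:00.000000Z</date>
    <paths>
      <path action="M">/trunk/l10n/de/docmessages/krita.po</path>
    </paths>
    <msg>update German translation</msg>
  </logentry>
  <logentry revision="1">
    <author>alice</author>
    <date>2006-01-01T18:00:00.000000Z</date>
    <paths>
      <path action="M">/trunk/l10n/de/docmessages/kexi.po</path>
    </paths>
    <msg>update German translation</msg>
  </logentry>
</log>"""


def parse_change_sets(log_xml):
    """Map each revision number to its committer, date, and changed paths."""
    change_sets = {}
    for entry in ET.fromstring(log_xml).iter("logentry"):
        revision = int(entry.get("revision"))
        change_sets[revision] = {
            # Subversion's XML calls the committer element <author>.
            "committer": entry.findtext("author", default=""),
            "date": entry.findtext("date", default=""),
            "paths": [p.text for p in entry.findall("paths/path")],
        }
    return change_sets


change_sets = parse_change_sets(SAMPLE_LOG)
# Print the change-sets in revision order, oldest first.
for revision in sorted(change_sets):
    entry = change_sets[revision]
    print(revision, entry["committer"], entry["paths"])
```

Keying the result by revision number mirrors the statement that each logentry is uniquely identified by its revision, and it makes it straightforward to walk change-sets in version order when looking for documents changed in successive versions.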
