
A framework for developing cross-browser data intensive Web applications

Conference Paper · October 2012 DOI: 10.1109/ICCTA.2012.6523547


3 authors:

Mahmoud Youssef, George Washington University
Nourhan Hamdi, Arab Academy for Science, Technology & Maritime Transport


Salma Rayan University of Strathclyde


All content following this page was uploaded by Salma Rayan on 20 June 2020.

A Framework for Developing Cross-Browser Data Intensive Arabic Web Applications

Mahmoud Youssef, Nourhan Hamdi, and Salma Rayan
Business Information Systems Department
Arab Academy for Science, Technology, and Maritime Transport
Alexandria, Egypt

Abstract: The frequent encounter of incorrectly functioning Arabic Websites, especially those of large businesses and governments, calls for a clear and applied framework to help practitioners develop properly functioning Websites. The issue of Website internationalization has been addressed in a plethora of standards, guidelines, good practices, and tutorials. However, these standards, with their formal language, may not be directly comprehensible to the practitioner. In addition, guidelines and tutorials are usually designed to address one issue, requiring the practitioner to integrate knowledge from different sources. Moreover, these standards, guidelines, and tutorials are limited to the Web technologies themselves, excluding other parts of the Web application architecture. In this paper, we propose a comprehensive step-by-step framework addressed to the practitioner. The framework integrates knowledge from different standards and technologies and calls attention to issues that could be overlooked in designing and implementing data-intensive Arabic Web applications.

Keywords: Internationalization; Arabic; Data-intensive; Web applications

I. INTRODUCTION

While the discussion of Web application internationalization and localization might sound like an antiquated topic, the frequent encounter of incorrectly functioning Arabic Websites, especially those of large businesses and governments (see Figure 1), calls for a clear and applied framework to help practitioners develop properly functioning Arabic Websites.

Figure 1. Improperly functioning Website of a major bank

The impact of properly functioning Arabic data-intensive Websites can be seen on different fronts. From a societal perspective, it enables the spread of e-commerce and e-government applications in the Arab world with their associated benefits. Moreover, it enhances trust in the organizations that own these Websites [19]. From a technical perspective, it provides the ability to integrate correctly within Web 2.0 mashups, the ability to handle future application needs such as linked data, and the ability to integrate with other components within distributed architectures such as the Service-Oriented Architecture (SOA).

The needs to provide information in languages other than those written in Latin script and to provide more than one language at the same time are addressed in computer applications, and specifically in Web applications, under the titles globalization, internationalization (i18n), and localization (l10n), which we explain below.

Globalization refers to the ability of an e-commerce Website to address a global community, taking into consideration the diversified needs of this community.

Internationalization refers to the design of a Website so that it can be adapted to different countries, taking into consideration their various languages, scripts, and cultures [21].

Localization refers to the adaptation of a Website to a specific locale such as Arabic/Egypt. This may impact, among other factors, numeric, date, and time formats, currency format, collation and sort order, and calendar system.

A basic requirement of a Website is to perform properly across different Web browsers. While browsers are expected to render content in a consistent way, experience has shown that the behavior of Websites may differ between browsers. Current statistics show that Google Chrome, Mozilla Firefox (FF), Microsoft Internet Explorer (IE), and Apple Safari are the dominant browsers in the market [22]. In this research we examine the proposed framework against these browsers.

A data-intensive Website is characterized by its dynamicity, which typically involves an architecture of several layers that interact together. Providing Arabic textual information correctly on the user interface requires proper interchange of data among these layers as well as proper representation of the data at each layer.

To address the needs for internationalization and localization, the research community, through its standardization organizations, has developed a plethora of standards (e.g., [7], [8], [9]). However, these standards, with their formal language, may not be directly comprehensible to the practitioner. To make such knowledge accessible, many guidelines and tutorials (e.g., [10], [16], [20]) were developed, yet they are usually designed to address one issue, requiring the practitioner to internalize knowledge from different sources. Moreover, these standards, guidelines, and tutorials are frequently limited to the Web technologies themselves, excluding other parts of the architecture.

In this paper, we propose a comprehensive step-by-step framework addressed to the practitioner that integrates the knowledge from different standards and technologies and calls attention to issues that could be overlooked in designing and implementing data-intensive Arabic Web applications. Throughout the discussion, we strive to provide solutions that are applicable to different development environments; however, we conducted our experiments mostly using open source applications and tools.

We limit our work to addressing issues related to enforcing proper Arabic text representation, transport, and processing. As such, we exclude discussion of other issues such as calendar systems, time zones, date and time, and currency formats.

The rest of the paper is organized as follows. Section 2 provides background, Section 3 presents related work, Section 4 introduces the proposed solution, and Section 5 concludes the paper.

II. PRELIMINARIES

Most of the Web standards emphasize the use of the globally accepted Unicode encoding and maintaining that encoding throughout the different layers of the architecture. In this section, we describe the characteristics of Arabic, the historical development of Arabic character sets, and the current status of the top Arabic Websites.

A. Arabic Language Representation

When displaying characters, it is important to distinguish between the characters themselves and their visual representation (glyphs). This is particularly important in Arabic, where the shapes of characters depend on the context. To display Arabic text correctly, a context analysis program is needed to select the right shape of glyph. The context is not necessarily the preceding and following characters only. Arabic script is highly decorative, and many ligatures (a glyph for multiple characters) are used. This implies that Internet client programs that display Arabic (such as Web browsers) must employ contextual analysis or rely on an underlying operating system to do that. Furthermore, Arabic has a number of diacritic marks that are written above and below the characters to aid in pronunciation. The diacritics must naturally be displayed in their places. In addition, Arabic requires special display algorithms because it is written right-to-left. Interestingly, numbers in Arabic, whether using Indic or Arabic numerals, are displayed left-to-right, adding to the complexity.

1) Arabic Character Sets History

In order for computers to process and interchange textual data correctly, character encoding must be standardized. For the English language, the American Standard Code for Information Interchange (ASCII) character set has been the standard representation since the early days of computers. ASCII used 7 bits per character, which allowed 128 characters only. These characters included the upper and lower case English alphabet, the digits from 0 to 9, and punctuation symbols. As such, ASCII allowed representation of English text only.

With the existence of large amounts of information in ASCII and the need to represent other languages, 8-bit character sets that extend ASCII were developed and standardized. They were referred to as Extended ASCII. As 8-bit representations allowed 256 characters (code points) and the first 128 of them were always occupied by ASCII, the other 128 code points were used to represent one or more languages, or character graphics, and they were referred to as the extended set. As an example, the ISO-8859-1 character set extends ASCII with characters for other Western European languages.

Operating systems utilized Extended ASCII character sets to provide multilingual support and referred to them as "code pages". Each user was able to add other locales by selecting their code pages. However, since only one code page can be active at a time in an application, it was impossible to have languages from different code pages concurrently. The problem was not solved until the development of Unicode [18], presented in the next section.

Arabic character set encoding went through several developments. In 1981, the CUDAR-U encoding appeared as the first Arabic character set, which used 7 bits per character. In 1982, the Arab Standards and Metrology Organization (ASMO) produced its first character set standard, ASMO-449, which used 7 bits per character as well. ASMO-449 then became the basis for all subsequent standard sets, playing a role similar to ASCII for Latin characters; however, like ASCII, it allowed the representation of one language only, Arabic in this case. In 1986, the ASMO-708 standard appeared. It uses 8 bits per character. Later, it became the international standard ISO-8859-6 and the code page for the Arabic Macintosh system [5].

In the 1980s, more than 20 Arabic code sets coexisted. With the spread of personal computers and the MS-Windows operating system in the 1990s, Microsoft's MS-Windows Arabic code page (MSCP-1256, a.k.a. Windows-1256) almost became a standard. Microsoft decided not to use the ISO-8859-6 standard character set and developed its own set to allow simultaneous use of Arabic and French as well as control characters.

While a major part of the Arabic content on the Web was encoded in Windows-1256 and many Internet applications used it as the default encoding, other encoding schemes existed and were growing. The coexistence of multiple standards led to a common situation where Arabic information was not interchanged correctly and was presented to the end user incorrectly.

2) Unicode

Unicode's Universal Character Set (UCS) has the potential to solve the problem of the plurality of character sets. It was jointly developed by the International Organization for Standardization (as ISO-10646) and by the Unicode Consortium. The two organizations have kept their modifications in sync until now and intend to do so in the future. The latest version of Unicode so far, version 6.1, maps to ISO-10646:2012 [18].

UCS allows 1,114,112 code points, enough to represent all writing systems of the world [7]. By January 2012, 110,181 characters had been assigned to code points. UCS separates the mapping of characters to code points from their encoding. While in previous systems the integer value of the character's code point was usually used as its encoding, UCS provides various encoding forms such as UCS-4 (32 bits per character), UCS-2 (16 bits), UTF-16 (multiples of 16 bits), UTF-8 (multiples of 8 bits), and UTF-7 (multiples of 7 bits). The latter three schemes are called transformation formats. With the exception of UTF-8, all other encoding forms require multiple bytes per character. UTF-8 is an IETF standard described in RFC 3629 [7].

In fact, UTF-8 possesses several useful properties that make it ideal for internationalization purposes. First, UTF-8 is a superset of ASCII; as such, all information represented in ASCII is directly compatible with UTF-8. This has a very important impact, as many of the Internet transport protocols, software applications, and parsers were designed to use ASCII. In addition, a large portion of the stored information is still in ASCII format. Second, UTF-8 can represent any UCS character in an unambiguous representation that takes between one and four octets (each octet is 8 bits). Third, Latin information, which represents a major chunk of the information on the Internet, does not require a larger size, since Latin characters are represented in one-byte codes, as is the case with the other Latin standards (ISO-8859-1 and Windows-1252).

The Arabic coding in UCS corresponds to the ASMO-449 code page. Unicode has detailed character property tables and algorithms (e.g., bi-directional text display), which are particularly suited for Arabic.

The term character set has been used in the literature to describe two separate but related concepts [5]. First, a character set is a mapping from a set of abstract characters to a set of integers. In UCS, that mapping represents the repertoire. The discussion of 7-bit and 8-bit representations in the previous section, as well as the UCS-2 representation in this section, should be understood within this use. Second, a character set is an encoding scheme that encodes each integer into one or more bytes or octets. UTF-8, UTF-16, and UTF-32 should be understood within this meaning.

While UTF-8 encoding is the clear choice for developing Web applications (particularly Arabic Web applications), other standards, in particular Windows-1256 [11], are still being used. This can be explained by the following:

1- Since most of the Websites in the world are in Latin script, most Web development tools and environments do not default to Unicode, and sometimes they do not support it.

2- The dominance of the Windows environment at the client side (the development side), especially in the Arab World, leads to Websites that are based on Microsoft technology and consequently default to its character sets. We reviewed the 100 locally provided top sites according to user traffic [4] and found that 30% of them use this encoding.

3- Since the encoding of Arabic information in UTF-8 requires two bytes per character versus one byte per character in other encodings, many developers are inclined to avoid it.

3) Displaying bi-directional text

Arabic text is stored in logical order. Before it is displayed, it must be reordered correctly on the screen. This is an important issue because most computer systems are designed to display text left-to-right. Unicode [18] defines a direction property for each character and provides a text directionality algorithm for displaying bi-directional text. The directional property of Arabic characters is strong right-to-left, while the characters of most other scripts, such as Latin, are strong left-to-right. The text directionality algorithm uses a set of directional ordering codes to influence the ordering of text.

The original specifications of the World Wide Web allowed transport of Arabic pages in 8-bit character sets. These specifications are based on three standards: a language to describe the contents of Web pages (Hypertext Markup Language, HTML) [6], a method to define the location of a Web page on the Web (Uniform Resource Identifier, URI) [15], and a mechanism to retrieve Web pages from the address specified in the URI (Hypertext Transfer Protocol, HTTP) [12]. As these standards were upgraded to support internationalization [8], it became possible to transfer and render multi-byte character encodings such as UTF-8.

The internationalization of the HTML standard introduced many new features that facilitate the use of Arabic on the Web. These features include indicating the character set, tagging the language, marking bi-directional text, and controlling cursive joining behavior [8].

B. Architecture of Data Intensive Web Applications

Web application architectures can vary widely. However, almost all implementations follow the multi-tier client/server architecture, a computing model that supports the separation of functionality into tiers that can be located on physically separate machines, on a cluster of machines or, as is increasingly the case, on virtual machines that merge hardware resources from a large number of computers. The software functionality in the different tiers can be described through the three layers of the client/server model: the presentation layer, the business logic layer, and the data layer (see Figure 2).

The Presentation Layer is the code that produces the output on the user screen and accepts user input. In Web applications, this functionality is provided partially by the rendering engine of the browser with support from the underlying operating

system and complemented by the code in HTML, CSS, JavaScript, and other languages and plugins (e.g., Adobe Flash).

Figure 2. Architecture of dynamic Web applications.

The Business Logic Layer implements the functional and non-functional requirements of the application. This layer is the most complex layer of a Web application, as it may include several combinations of application servers (such as Apache Tomcat and Apache Geronimo), message servers, and programming frameworks (such as Struts and Zend Framework), in addition to Web application code. The functionality of this layer involves the following:

1- Receiving a request from the Web server, parsing the request, extracting the address and name of the program to be executed, and extracting the values to be passed to that program.

2- Executing the code with the parameters passed in the request, which may involve querying the DBMS or a Web Service.

3- Creating the HTML code of the response, including any data retrieved from the data layer.

4- Passing the response back to the Web server to deliver it to the client.

The Data Layer includes the code that provides the functionality for the Create, Read, Update, and Delete (CRUD) operations on data objects. In Web applications, the Data Layer functionality is provided by the DBMS and the SQL code that interacts with it. Procedural SQL code such as PL/SQL and Transact-SQL is usually considered Business Logic code even when it resides in the database in the form of stored procedures.

The Business Logic Layer and the Data Layer are usually implemented through predefined solution stacks that include an operating system, a Web server, a database management system, and a programming language. One of the most common of such stacks is the open source LAMP, which includes the Linux OS, the Apache Web server, the MySQL database management server, and the PHP, Perl, and Python programming languages. Other variations of this stack are the Microsoft Windows-based WAMP and the Mac OS based MAMP. Several other stacks are based on the Java programming language. Moreover, Microsoft provides a complete stack with its associated development environment based on the .Net technology. This stack includes Windows Server, Internet Information Server (IIS), MS SQL Server, and several programming languages including C#, Visual C++, and Visual Basic.

C. Interaction in Web Application

Web application interaction can be simply described as a set of request/response cycles between the client (typically a Web browser) and the Web server.

The request created by the browser is formulated according to the HTTP protocol rules. A typical request includes a method (e.g., GET or POST), the directory part of a URL, an optional document or program name, and an optional set of parameters representing pairs of variables and values. The request header also includes the host name (or an IP address) and another set of parameters that are vital for Web applications. Figure 3 shows an example of an HTTP request as captured by the FireFox add-on Live HTTP Headers.

Figure 3. HTTP Request

If the request is asking for a static page, the Web server may merely retrieve the file containing the Web page, attach it to an HTTP response header, and submit them to the TCP protocol for transport. However, if the client request is dynamic, it may involve other recursive cycles in the Business Logic and Data layers. The server side processing starts as the Web server extracts information about the program to be executed and the data to be forwarded to it. It also forwards several pieces of information from the request header. The server then communicates all this information to the programming language that can execute that program and waits for a response. Upon receiving a response from the application, the Web server forwards it to the client in a similar fashion to that of a static page request.

III. RELATED WORK

Belussi and Posenato [1] proposed an approach for application internationalization that focuses on correct handling of storage and retrieval according to the locale of the user. The proposed solution uses database extensions and query rewriting tools. While this work is related, it does not cover the particular problem of this research but rather complements it.

Takada [17] presented one of the early works that drew attention to the need for internationalization and localization of Web data. Most of the issues addressed by Takada were covered in the HTML internationalization standard [8].

IV. THE FRAMEWORK

We divide the framework into four sections: three that map to the architecture layers and an additional one for development environments. In each section, we present the handling of UTF-8 and the issues that may arise in that layer.

A. The Presentation Layer

When the browser receives the HTTP response, it first parses the response, creates a DOM (Document Object Model) tree of the HTML elements, and presents it to the user. For a detailed discussion of browser functionality, the reader is referred to [2] and [14]. In this part of the interaction, it is very important to notice that the character encoding can be conveyed from the Web server to the browser in the HTTP response (see Figure 4). If such an encoding exists, it will override other encodings declared within the HTML/XML code. On the other hand, the client usually states the character encodings that it accepts in the HTTP request (see Figure 3), but the Web server may not honor that.

Figure 4. HTTP Response

1) Declaring the Character Encoding

Both in-document and HTTP header declarations can be used to declare the character encoding. Since HTTP declarations have higher precedence than in-document ones, it is recommended to use them if there is a chance for the character encoding to be changed by intermediary servers. If both kinds of declaration are used in conjunction with each other, they should be consistent and declare the same character encoding.

For pages written in HTML4, the pragma directive should be used:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The encoding of the document is specified just after "charset=". In order for this declaration to be processed correctly, especially if the server does not set the encoding, two conditions must be satisfied. First, the HTML document must be saved in UTF-8 encoding. Second, the meta element containing the character encoding must be within the first 1024 bytes of the file.

TABLE I. TESTING MAJOR BROWSER BEHAVIOR WITH DIFFERENT ENCODINGS
(✓ = Arabic text displayed correctly, ✗ = displayed incorrectly)

HTTP header     File encoding         Meta tag       FF   Chrome   Safari   IE
UTF-8           UTF-8 with BOM        UTF-8          ✓    ✓        ✓        ✓
UTF-8           UTF-8 without BOM     UTF-8          ✓    ✓        ✓        ✓
UTF-8           ISO-8859-6            ISO-8859-6     ✗    ✗        ✗        ✗
UTF-8           Windows-1256          Windows-1256   ✗    ✗        ✗        ✗
ISO-8859-6      ISO-8859-6            ISO-8859-6     ✓    ✓        ✓        ✓
ISO-8859-6      UTF-8 with BOM        UTF-8          ✗    ✓        ✓        ✓
ISO-8859-6      UTF-8 without BOM     UTF-8          ✗    ✗        ✗        ✗
ISO-8859-6      Windows-1256          Windows-1256   ✗    ✗        ✗        ✗
Windows-1256    Windows-1256          Windows-1256   ✓    ✓        ✓        ✓
Windows-1256    ISO-8859-6            ISO-8859-6     ✗    ✗        ✗        ✗
Windows-1256    UTF-8 with BOM        UTF-8          ✗    ✓        ✓        ✓
Windows-1256    UTF-8 without BOM     UTF-8          ✗    ✗        ✗        ✗
Not specified   UTF-8 with BOM        UTF-8          ✓    ✓        ✓        ✓
Not specified   UTF-8 without BOM     UTF-8          ✓    ✓        ✓        ✓
Not specified   Windows-1256          Windows-1256   ✓    ✓        ✓        ✓
Not specified   ISO-8859-6            ISO-8859-6     ✓    ✓        ✓        ✓
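The failures recorded in Table I can be reproduced without a browser, because they come down to decoding the same bytes under a different charset label. The following is a minimal Python sketch under stated assumptions, not the test setup used in the paper: the sample word and the helper names utf8_width, decode_as, and has_utf8_bom are hypothetical, and Python is used here only for illustration rather than the PHP mentioned in the text.

```python
import codecs

# Illustrative four-letter Arabic word; an assumption, not taken from the paper.
SAMPLE = "عربي"


def utf8_width(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def decode_as(data: bytes, declared_charset: str) -> str:
    """Decode bytes the way a client would if it trusted the declared charset."""
    return data.decode(declared_charset, errors="replace")


def has_utf8_bom(data: bytes) -> bool:
    """Report whether the byte stream starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    payload = SAMPLE.encode("utf-8")
    # Characters versus bytes: each Arabic letter here needs two bytes in UTF-8.
    print(len(SAMPLE), utf8_width(SAMPLE))
    # A label that matches the actual file encoding round-trips the text.
    print(decode_as(payload, "utf-8") == SAMPLE)
    # A Windows-1256 label applied to UTF-8 bytes does not.
    print(decode_as(payload, "cp1256") == SAMPLE)
    # A client can only let the BOM override a wrong label if it checks for it.
    print(has_utf8_bom(codecs.BOM_UTF8 + payload))
```

The byte count returned by utf8_width also illustrates why field and variable sizes chosen for single-byte encodings may be too small for Arabic text stored as UTF-8.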

Changing the Web server encoding can be done by directly changing the server defaults or by changing the settings for a specific application (e.g., using .htaccess on Apache). If the document is dynamically created using a scripting language, the encoding declaration can be explicitly added to the HTTP header using the language's functions and constructs (e.g., the header() function in PHP).

2) Content Language

It is recommended to use the language attribute lang on the html tag to declare the default language of the text in the page. This helps tools and applications that interact with the page to use this information for language-sensitive tasks [3]. For instance, the Arabic code points in Unicode are used by several languages including Urdu and Farsi. However, the glyphs of some characters look different from one language to the other.

3) Text Direction

In order for text to look right when an HTML page is displayed, the base direction of that text needs to be established. The dir attribute should be used with the html element to set the base direction of text for display. This is essential for enabling HTML in right-to-left scripts. It is important to set up the appropriate base direction so that the Unicode bidirectional algorithm can generate the expected ordering of the displayed text. Correct specification of the base direction also establishes a proper default alignment for the text. Other elements on the page can have a left-to-right direction overriding the base direction if the dir attribute is specified for them [3].

B. The Business Logic Layer

Currently, there are tens of Application Servers and Programming Frameworks that facilitate the development of Web applications. The role of these servers and frameworks is to offer clean and efficient code for the standard functionality of Web applications. As such, they typically offer modules for authentication, authorization, form design and validation, database access, Web services interaction, full-text search, etc. Most of these Application Servers and Frameworks support UTF-8, yet this may require explicit settings.

One major issue that arises in this layer is the choice of variable sizes in the programming environment and the choice of variable types in the query or the service request.

1) Character Widths

Since every character in English is encoded within a single byte, a database column of width CHAR(10) implies 10 characters in English or in any language that is encoded in ISO 8859. A character in Arabic encoded in UTF-8, however, will span two bytes. Consequently, all variables that handle Arabic data need to have at least two times the size used for English data.

C. The Data Layer

To ensure the consistency of the framework, configurations in the data layer should also be set up to support the UTF-8 character set and its corresponding collation. A collation is a set of rules for comparing characters in a character set. Such an order is needed, since the numerical order of the character codes may not match the order of the alphabet.

The data should be stored in the database using the UTF-8 character set, which can be defined at different levels; namely, the server, the database, the table, and the column levels. Figure 5 shows a page where the static Arabic text is presented properly while the dynamic data is not.

Figure 5: Improperly functioning data layer

1) Database Server Setup

A database server has a server character set and a server collation. In MySQL, for instance, these can be set at server startup on the command line or in an option file, and they can be changed at runtime.

    [mysqld]
    character-set-server = utf8
    collation-server = utf8_general_ci

2) Database Creation

While creating the database, the character set and collation should be set to UTF-8. The "CREATE DATABASE" and "ALTER DATABASE" statements have optional clauses for specifying the database character set and collation.

    CREATE DATABASE mydb
    DEFAULT CHARACTER SET utf8
    DEFAULT COLLATE utf8_general_ci;

3) Tables and Columns

All tables have a table character set and a table collation. The "CREATE TABLE" and "ALTER TABLE" statements have optional clauses for specifying them. If the character set is specified without the collation, the database server uses the default collation associated with that character set, and the other way round. If neither is specified, the values configured for the database are used.

The character set and collation of any character column, of type CHAR, VARCHAR, or TEXT, should also be changed to UTF-8. If they are not modified, they inherit their table's configuration.

4) Connection Handling

When a connection starts, the client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets or error messages, over the connection back to the client. If the connection character set and collation are not set to UTF-8, the data will not be transferred correctly. The server identifies the character set of the statement as it left the client from the character_set_client system variable. After receiving the statement, the server translates it to the character set configured in its character_set_connection system variable. This is why it is important to configure the server's connection handling character set to UTF-8. The same applies to the collation.

The server's default connection character set can be overridden by the client by running the "SET NAMES utf8" query. If the defaults are not set to UTF-8, the character settings for each connection to the server should be changed.

If the data is not valid UTF-8, or the application is dealing with data in another character set that needs to be converted into UTF-8, a function must be used to perform the conversion. The following PHP functions can be used for converting data to UTF-8:

    mbstring extension: mb_convert_encoding(string, to, from)
    iconv extension: iconv(from, to, string)
    recode extension: recode_string(request, string)
    built-in function: utf8_encode(string)

5) Data Migration

Another challenge that needs to be handled at the data layer is the migration of existing data to UTF-8 encoding. There is a multitude of tools that convert from one database format to another, typically through exporting and importing the data to switch its character set. An excellent intermediary format is XML.

D. Development Environment Issues

As mentioned above, HTML documents, whether static or dynamically generated, must be saved using the UTF-8 encoding in order to be parsed correctly by the browser. Different text editors and IDEs have specific options for setting character encodings while saving files. Some applications automatically insert a Byte Order Mark (BOM) character, while others do not. Although the W3C advises against saving files with a BOM, our experiments show that all browsers except FireFox detected the UTF-8 format from the BOM and accordingly overrode the incorrect format in the HTTP header (see Table I). FireFox, on the other hand, followed the standard strictly and displayed stray characters in place of the BOM. For this reason, it is usually best for interoperability to omit the BOM for UTF-8 content.

We examined the behavior of different Web browsers with different settings of the HTTP header and the file encoding. We controlled the HTTP header at the server side through settings in the .htaccess file of the Apache server. Table I shows the results. From the table, it is clear that the best results are obtained when the encoding is not specified by the server in the HTTP header.

V. CONCLUSION

In this article we presented a framework for developing Arabic data-intensive Web applications. The major contribution of this work is, clearly, not to introduce new knowledge, but rather to collect the scattered knowledge on internationalization and the best practices for implementing it and to organize it in a step-by-step approach that can benefit the practitioner.

As the proposed framework is intended for practitioners, we intend to make it available online in a more accessible language in both English and Arabic, with more details and the screenshots that we could not present here due to space limitations.

REFERENCES

[1] A. Belussi and R. Posenato, "A framework for the internationalization of data-intensive web applications," Web Engineering, pp. 775–775, 2004.
[2] A. Le Hors et al. (eds.). (Jul. 15, 2012). Document Object Model (DOM) Level 2 Core Specification [online]. Available: http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/
[3] A. Phillips and M. Davis, "Tags for Identifying Languages," RFC 5646, September 2009.
[4] Alexa. (Jul. 15, 2012). Website Ranking in the Arab World [online]. Available: http://www.alexa.com/topsites/category/World/Arabic
[5] B. Al-Badr, "Using the Internet in Arabic: Problems and Solutions," in INET98 conference proceedings, 1998.
[6] D. Raggett, A. Le Hors, and I. Jacobs. (Jul. 15, 2012). HTML 4.01 Specification, W3C Recommendation [online]. Available: http://www.w3.org/TR/1999/REC-html401-19991224, December 1999.
[7] F. Yergeau, "UTF-8, a transformation format of ISO 10646," RFC 3629, November 2003.
[8] F. Yergeau, G. Nicol, G. Adams et al., "Internationalization of the Hypertext Markup Language," RFC 2070, January 1997.
[9] H. Alvestrand, "Tags for the Identification of Languages," RFC 3066, January 2001.
[10] J. O'Conner. (Jul. 15, 2012). Character Conversions from Browser to Database, Sun Developer Network [online]. Available: http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/, Jan. 2006.
[11] Microsoft. Windows 1256 Codepage [online]. Available: http://msdn.microsoft.com/en-us/goglobal/cc305149
[12] R. Fielding, J. Gettys, J. Mogul et al., "Hypertext Transfer Protocol -- HTTP/1.1," RFC 2616, June 1999.
[13] T. Appelmans, "Web Globalization and WSDM Methodology of Web Design," Graduation thesis, WISE-Web & Information System Engineering, Department of Applied Computer Science, Vrije Universiteit Brussel, Belgium, Academic Year 2003.
[14] T. Berners-Lee. (Jul. 15, 2012). Web Architecture from 50,000 feet [online]. Available: http://www.w3.org/DesignIssues/Architecture.html
[15] T. Berners-Lee, R. Fielding, and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax," RFC 2396, August 1998.
[16] T. Olsson. (Jan. 10, 2007). The Definitive Guide to Character Encoding, SitePoint Article [online]. Available: http://www.sitepoint.com/guide-web-character-encoding
[17] T. Takada, "Multilingual information exchange through the World-Wide Web," Computer Networks and ISDN Systems, vol. 27, no. 2, pp. 235–241, 1994.
[18] The Unicode Consortium. The Unicode Standard, Version 6.1.0 (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-02-3) [online]. Available: http://www.unicode.org/versions/Unicode6.1.0/
[19] Economic and Social Commission for Western Asia, "Harmonization of ICT Standards Related to Arabic Language Use in Information Society Applications," UN, New York, 2003.
[20] W3C. (May 16, 2005). Internationalization Quick Tips for the Web [online]. Available: http://www.w3.org/International/quicktips
[21] W3C. (Jul. 15, 2012). Internationalization [online]. Available: http://www.w3.org/standards/webdesign/i18n.html
[22] W3Schools. (Jul. 15, 2012). Browser Statistics Month by Month [online]. Available: http://www.w3schools.com/browsers/browsers_stats.asp
