(12) United States Patent (10) Patent N0.: US 7,900,143 B2 Xu (45) Date of Patent: Mar
Total Page:16
File Type:pdf, Size:1020Kb
US007900143B2 (12) United States Patent (10) Patent N0.: US 7,900,143 B2 Xu (45) Date of Patent: Mar. 1, 2011 (54) LARGE CHARACTER SET BROWSER 6,718,519 B1* 4/2004 Taieb ....................... .. 715/542 (75) Inventor: Yueheng Xu, Beaverton, OR (US) OTHER PUBLICATIONS Roger Molesworth, The Open Interchange of Electronic Documents, (73) Assignee: Intel Corporation, Santa Clara, CA Standards & Practices in Electronic Data Interchange, IEEE Col (Us) loquium on May 21, 1991, pp. 1-6.* Apple Computer, Inc., Programming With The Text Encoding Con Notice: Subject to any disclaimer, the term of this version Manager, Technical Publication, Apr. 1999, pp. 1-286. patent is extended or adjusted under 35 K. Lunde, CJKJNF, CJKINF Version 1.3, Jun. 16, 1995, pp. 1-50. USC 154(b) by 2155 days. Zhu, et al., Network Working Group, Request for Comments 1922, “Chinese Character Encoding For Internet Messages,” Mar. 1996, pp. 1-27. (21) Appl. No.: 09/748,895 Searle, “A Brief History of Character Codes in North America, Europe, and East Asia,” Sakamura Laboratory, University Museum, (22) Filed: Dec. 27, 2000 University of Tokyo, 1999, pp. 1-19. Goldsmith, et al., Network Working Group, Request for Comments (65) Prior Publication Data 2152, “A Mail-Safe Transmission Format of Unicode,” May 1997, US 2002/0120654 A1 Aug. 29, 2002 pp. 1-15. Japanese Patent Of?ce, English Language Translation of Inquiry Int. Cl. Letter Dated Jul. 11, 2008, pp. 1-3. (51) Japan Standards Association, “7-bit and 8-bit double byte coded G06F 17/22 (2006.01) extended Kanji sets for information interchange,” JIS X 0213:2000, (52) US. Cl. .................................................... .. 715/269 Jan. 20, 2000, pp. 50-52 and 60-64. (58) Field of Classi?cation Search ....... .. 7l5/5l3i5l7, 7l5/523i524, 530, 536, 542, 239, 255, 269; * cited by examiner 707/10, 1004101, 203; 709/246 Primary ExamineriLaurie Ries See application ?le for complete search history. Assistant Examiner4Chau Nguyen (56) References Cited (74) Attorney, Agent, or FirmiTrop, Pruner & Hu, RC. U.S. PATENT DOCUMENTS (57) ABSTRACT 5,870,084 A 2/1999 Kanungo et a1. A large number of characters may be represented by character 6,157,905 A * 12/2000 Powell ..................... .. 715/536 codes without undue complication. For example, characters 6,204,782 B1 * 3/2001 Gonzalez et al. ............ .. 341/90 that are already represented by Unicode may be handled 6,397,259 B1* 5/2002 Lincke et al. 715/501.1 pursuant to the Unicode system within one 16-bit code. Char 6,425,123 B1* 7/2002 Rojas et a1. .. 717/136 6,446,133 B1* 9/2002 Tan et al. 709/245 acters that are not represented by Unicode may be handled in 6,512,448 B1* 1/2003 Rincon et al. .. 340/7.51 a second fashion. The characters that are not represented by 6,601,108 B1* 7/2003 Marmor .... .. .. 709/246 Unicode may be represented by two 16-bit codes. 6,631,500 B1* 10/2003 .. 715/531 6,658,625 B1* 12/2003 715/523 25 Claims, 3 Drawing Sheets II'ISGIl New 086 To lmlicalu Wind Change Ute Surrogate Value ‘in Find Fonl Fill Lncatinn 8| mm: ti General: Glyph Image @ US. Patent Mar. 1, 2011 Sheet 1 013 US 7,900,143 B2 Source HTML File 10 Encoded In ISO-2022 CNllSO-2022-CN- ‘J 12 EXT Format f / 18 l 14 16 / lSO-2022-CN/ISO- _/ J 2022-CN-EXT To . Web Content Unrcode/Untcode Wrth Web Content Parser 7 Rendering Engine Surrogate Support Converter l l ‘ 26 20 ASCII And Other J L CJK String Charset String D . s Ia M d I Display Module I p y o u e 24 l ll / 28 w ll 1 f 22 ASCII And Other GB2312/HZGBK/ lSO-2022-CN/ISO Charset Font BIG5 Font 2022-CN-EXT Font Display Drivers Display Drivers Display Driver l i ll 30 Text Content On _/ Screen (Glyphs Of Text) FIG. 1 US. Patent Mar. 1, 2011 Sheet 2 013 US 7,900,143 B2 32 f 34 / Processor f 40 f 36 f 38 Dlsplay. Brldge. MemorySystem f 42 F 44 46 f 12 Bridge HDD --— f 48 SIO Modem FIG. 2 US. Patent Mar. 1, 2011 Sheet 3 013 US 7,900,143 B2 WK Browser ‘ Receive HTML Web [Page Iln PRC [Formal f. 66 llnsertt New 68G Tl'o llnldicate Mod Change /. 701) Use Table Wrapping 5] F 72 murder Map To Surrogalte Areas 1 f 76 Use Surrogale Manure To r0110 [Font lFile Location & @?sei Generate 75 Glyph —/ Ilmage V @ Fm US 7,900,143 B2 1 2 LARGE CHARACTER SET BROWSER BRIEF DESCRIPTION OF THE DRAWINGS BACKGROUND FIG. 1 is a ?ow chart for software in accordance with one embodiment of the present invention; This invention relates generally to Internet browsers and 5 FIG. 2 is a block diagram of hardware for implementing particularly to browsers which support a large number of one embodiment of the present invention; and characters. FIG. 3 is a ?ow chart for software in accordance with one Textual data, displayed as printed material or on a com embodiment of the present invention. puter monitor, is encoded in the form of binary numerical codes. When a given key on a keyboard is stricken, a character DETAILED DESCRIPTION code for that key is generated. A computer then uses the character code to select an appropriate character shape from a A web browser 12, shown in FIG. 1, provides a platform stored font ?le listing with the same character code. level solution to the need for larger character sets. The web English language personal computer systems generally browser 12 may serve as a universal, cross-platform user employ a seven bit character code. This code is codi?ed interface for any web-based application. The browser 12 according to the American Standard Code for Information effectively enables a large character set to support all web Interchange (ASCII) (ANSIx3.4-1996) that allows for char based applications. acter sets of about 128 items of upper and lower case Latin A source hypertext markup language (HTML) ?le is letters, Arabic numerals, signs and control characters. encoded in accordance with the ISO-2022 CN and ISO-2022 The Chinese, Japanese and Korean (CJ K) languages have a 20 CN-extension formats as indicated in block 1 0. In the browser relatively large number of characters, on the order of hun 12, that ?le is converted into a Unicode format extended to dreds of thousands of characters. These characters far exceed include support of a surrogate mechanism of Unicode 2.0 the capacity, in and of themselves, of the 7-bit ASCII charac ter codes. (ISO/IEC 10646 and RFC 2152 (1996)). Using the surrogate mechanism of Unicode extends Unicode to a capacity su?i For example, personal computers in Japan presently utilize 25 the Japanese Industrial Standard (J IS) X0208-1990 that cient to recognize a character set of one million unique char accommodates only 6,879 characters. While this is adequate acters. Since existing Unicode standards use a sixteen bit character length, the sixteen bit value can provide no more for many basic functions, it may be insuf?cient for writing than 65,536 characters. people’s names, place names, historical data and other such Thus, the ISO-2022-CN and CN extension formats are information. 30 The existing CJ K character sets are not suf?cient to provide converted to Unicode with surrogate support as indicated at a wide variety of important information given the available block 14. Those CJK characters that are already de?ned in character sets. For example, with the GB-2312 and Big-5 Unicode standard are converted into their sixteen bit Unicode character sets used by the Netscape Communicator web value. The remaining characters that are not de?ned by the Unicode standard are converted into two sixteen bit values. In browser, only about 16,000 characters are available. 35 As a result, the International Organization for Standardiza one embodiment of the present invention, each of the two tion has created a standard called ISO-2022 that outlines how sixteen bit values is in the range from 0xD800 to 0xDB00 and seven bit and eight bit character codes may be structured. The from 0xDC00 to 0xDFFF respectively. This conforms to the Unicode de?nition of a surrogate mechanism without ambi Chinese language version is ISO-2022-CN, set forth in guity. Request for Comments (RFC) 1922 (Network Working 40 Group, 1996). The browser 12 does the web content parsing, as indicated The so-called Uni?cation Code or Unicode was developed in block 1 6, such as HTML parsing. Thereafter, the remaining by a number of US. software ?rms to unify all the world’s rendering steps are completed as indicated in block 18. character sets into one large character set. See International When the browser 12 is ready to render a particular text Organization for Standardization ISO/IEC 10646-1 (1993) 45 string, the browser 12 loops for each of the characters in the Geneva, Switzerland. Unicode seeks to limit the character set string. For CJK characters, a ?rst search seeks fonts in known space to sixteen bits or a maximum of 65,536 characters. This CJK font libraries as indicated in block 20. The known CJ K character space means each character must be represented by font libraries include Guo-jiu Biao-zhun (“Coding of Chinese a ?xed length code of sixteen bits or two bytes. However, even Ideogram Set for Information Interchange Basic Set”, with Unicode, all of the world’s characters, including the 50 GB-2312-80), Big-5 (Institute for Information Industry, hundreds of thousands of CJ K characters, can not be “Chinese Coded Character Set in Computer,” March 1984), expressed using a character set that only allows for 65,536 GBK (“Information Technology” Universal Multiple-Octet total characters.