Converting a Hebrew Code Page CP1255 to the UTF-8 Format
LIBICONV – An Interface to Team Developer
By Jean-Marc Gemperle, Technical Support Engineer
November, 2005

Contents
Abstract
Introduction
What Is LIBICONV?
Obtaining and Building LIBICONV for Win32
A DLL Interface to Team Developer
Team Developer ICONV Samples and Tests
A Brief Description of the Application
Converting a Hebrew Code Page CP1255 to the UTF-8 Format
Converting the Generated UTF-8 Back to the CP1255 Hebrew Code Page
Chinese ISO-2022-CN-EXT to UTF-8
ISO-8859-1 to DOS 437
ISO-8859-1 to WINDOWS-1250 Cannot Convert
ISO-8859-1 to WINDOWS-1250 Using Translit
Conclusions

Abstract
This technical white paper proposes an interface from GUPTA Team Developer to GNU LIBICONV that allows a Team Developer programmer to convert documents from one character encoding to another. See http://www.gnu.org/software/libiconv/

Introduction
International text is generally encoded using a specific, country-dependent character encoding. With the Internet, conversion between different encodings has become critical. Conversion is also a problem because some characters that are present in one encoding may not exist in another.
For all these reasons, Unicode was created as a universal encoding standard that supersedes the others, and it is the default for new text formats such as XML. See http://www.unicode.org/standard/WhatIsUnicode.html

What Is LIBICONV?
Many computers still use traditional character encodings. This is also the case for applications built using Team Developer. Team Developer 2006 will be fully Unicode enabled. However, some applications must be able to convert from one encoding to another. In Team Developer, for example, you may generate XML files using XML Table Windows, the DOM class library, or serialization of UDV, and you may want to transfer the documents to other applications that use other encodings.

GNU LIBICONV is a conversion library that converts from Unicode to traditional encodings and vice versa. LIBICONV is therefore a solution if your applications need to support multiple character encodings and that support is lacking on your current system. See http://www.gnu.org/software/libiconv/ for details on supported encodings.

Obtaining and Building LIBICONV for Win32
You can download the LIBICONV source files from http://sourceforge.net/projects/gettext. To build LIBICONV you need Microsoft Visual Studio 6 and both the libiconv-win32 and gettext-win32 sources. First build LIBICONV without NLS, then build GETTEXT, and then build LIBICONV again. Carefully follow the README.woe32 file in each of these packages. By following these steps you should obtain the WIN32 binary of the main application, ICONV.EXE, along with its dependencies ICONV.DLL, INTL.DLL, etc.

Whether you compile the sources yourself or use the binary package provided, you can easily interface ICONV.EXE to Team Developer using SalLoadApp(). Simply invoke ICONV.EXE --help for a description of the parameters.

A DLL Interface to Team Developer
This document provides a simple interface to GUPTA Team Developer.
The Dynamic Link Library (DLL) file LIBICONVDLL.dll is a wrapper around the ICONV main() entry point. See the Visual Studio project LIBICONVDLL.dsw and the source file td_iconv.c. The wrapper exports three functions to Team Developer:

bOK = iConv(BOOL bBinary, BOOL bTransLit, sFromCode, sFromFile, sToCode, sToFile)
Parameters
    bBinary: Open sFromFile in binary mode.
    bTransLit: Limited support for transliteration, i.e. when a character cannot be represented in the target character set, it can be approximated through one or several similar-looking characters.
    sFromCode: The encoding of the source document.
    sFromFile: The source file to convert.
    sToCode: The target encoding.
    sToFile: The target file.
Return
    bOK is TRUE if the function succeeds and FALSE if it fails.

iConvListEnc(sListOfCode)
Parameter
    sListOfCode: Receives the list of supported encodings.

iConvLastError(sLastError)
Parameter
    sLastError: If iConv fails, receives the error description.

The interface only converts from a source file to a target file. It does not yet convert a buffer in memory directly, although all the necessary logic is available in the convert() function of the ICONV.C source.

Team Developer ICONV Samples and Tests
To perform some interesting tests with the Team Developer interface to ICONV on Windows XP, install the supplemental language support from the Languages tab of the Regional and Language Options control panel, and select a Unicode font in Notepad. You can also test on Windows 2000, as long as you install the needed language through the Regional Options in the Control Panel.

A Brief Description of the Application
The Team Developer application that interfaces to LIBICONV can be found in \libiconv\bin\test_iconv.apt and the sample files in \libiconv\bin\Samples. You can type in the name of the file you want to convert or, by using the ... button, open a File dialog and pick one of the samples provided in the Samples directory. Several file types are supported, for instance *.TXT, UTF-8 and XML. Here we will open the HEBREW-CP1255.TXT file. This file uses the traditional encoding CP1255, and our goal is to convert it to UTF-8.

As soon as you select an existing file, the application chooses ISO-8859-1 by default for both the source and the target encoding. Note that the combo boxes are type-ahead combo boxes. They list the encodings that ICONV supports. Also note that not all entries are distinct encodings; some are aliases of others (e.g. 437 is equivalent to CP437).

The Notepad button lets you view the source file in Notepad, while the DOS button opens a command prompt in which you can type the file name. The same buttons exist for the target file, so that you can check the result of the conversion in both Windows and DOS. The Del button simply deletes the file that was generated. The Binary check box opens the file in binary mode, and the Translit option enables transliteration, i.e. when a character cannot be represented in the target character set, it can be approximated through one or several similar-looking characters.

Converting a Hebrew Code Page CP1255 to the UTF-8 Format
Microsoft Windows Notepad cannot properly display the source file, because the operating system used for this test is a United States English version and therefore does not have the Hebrew font installed. Converting the file from CP1255 to UTF-8 allows Notepad to display the Hebrew characters properly. The command prompt, which uses DOS CP437, obviously cannot show them unless the right code page and the right raster fonts are installed.
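For readers who want to reproduce the idea of this test without the Team Developer application, the following is a minimal sketch of a file-to-file conversion from CP1255 to UTF-8. It is an illustration only: it uses Python's built-in codecs rather than LIBICONV, the helper names convert_file and demo are hypothetical, and the four-letter sample text is a stand-in for HEBREW-CP1255.TXT, whose contents are not reproduced here.

```python
import os
import tempfile


def convert_file(from_code, from_file, to_code, to_file):
    """Re-encode from_file (encoded as from_code) into to_file (as to_code).

    Hypothetical helper that mirrors the file-to-file shape of iConv();
    it relies on Python's built-in codecs, not on LIBICONV.
    """
    # Read raw bytes so that no newline translation takes place.
    with open(from_file, "rb") as src:
        text = src.read().decode(from_code)
    # A character missing from the target encoding raises UnicodeEncodeError.
    with open(to_file, "wb") as dst:
        dst.write(text.encode(to_code))


def demo():
    """Convert a small CP1255 sample to UTF-8 and return both byte strings."""
    # Four Hebrew letters written as escapes; assumed sample text.
    sample = "\u05e9\u05dc\u05d5\u05dd"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "hebrew-cp1255.txt")
        dst = os.path.join(tmp, "out.utf8")
        with open(src, "wb") as f:
            f.write(sample.encode("cp1255"))
        convert_file("cp1255", src, "utf-8", dst)
        with open(src, "rb") as f:
            cp1255_bytes = f.read()
        with open(dst, "rb") as f:
            utf8_bytes = f.read()
    return cp1255_bytes, utf8_bytes


if __name__ == "__main__":
    before, after = demo()
    print(len(before), "bytes in CP1255,", len(after), "bytes in UTF-8")
```

Each Hebrew letter occupies one byte in CP1255 and two bytes in UTF-8, so the converted file is larger than the source even though the text is the same.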
Converting the Generated UTF-8 Back to the CP1255 Hebrew Code Page
This test simply shows that the resulting OUT.CP1255 file is identical to the HEBREW-CP1255.TXT file from the previous test.

Chinese ISO-2022-CN-EXT to UTF-8
This is the same test as above, but with a different code page. The C:\libiconv\bin\Samples directory contains other samples, such as Japanese snippets. In addition, the libiconv-win32 source package includes further test samples that you can use to verify that ICONV is functional.

ISO-8859-1 to DOS 437
This test converts a file from ISO-8859-1, a Windows ANSI code page, to code page 437. The command prompt shows garbage for the input file, because DOS cannot display ISO-8859-1, but it displays the converted CP437 file correctly. Conversely, once the file is converted, Windows Notepad cannot display it properly, because it is now in CP437.

ISO-8859-1 to WINDOWS-1250 Cannot Convert
This test shows that some characters from ISO-8859-1 cannot be converted to the WINDOWS-1250 code page.

ISO-8859-1 to WINDOWS-1250 Using Translit
The "Translit" option tries to approximate the characters that could not be represented in the target code page.

Conclusions
This white paper covered the ability to convert documents from one character encoding to another in Team Developer using GNU LIBICONV. All the work done by the ICONV interface to Team Developer can also be done simply with a call to ICONV.EXE. The DLL interface proposed here is only a sample and is given without any guarantee.

There are many other useful GNU tools that could be used in Team Developer. There is no real need to rebuild these tools: either a WIN32 port can be found, as is the case with LIBICONV, or one can use CYGWIN, found at http://www.cygwin.com/, and its runtime to execute UNIX tools under WIN32.

Copyright © 2005 Gupta Technologies LLC.
GUPTA, the GUPTA logo, and all GUPTA products are licensed or registered trademarks of Gupta Technologies, LLC. All other products are trademarks or registered trademarks of their respective owners. All rights reserved.