Born Broken: Fonts And Information Loss In Legacy Documents
Geoffrey Brown and Kam Woods Indiana University School of Informatics and Computing Key Questions
How pervasive are font substitution problems ?
What information is available to identify fonts ?
How well can we match the fonts required by a document collection ?
How can we assist archivists in identifying serious font issues ? Page 8 MCTM Bulletin February 2005
K: I knew what you meant. I was just kidding. I’ll do XüLLbl (W):InputQ:FnOff :"""Y =! Y",Y#:PlotsOff the dishes tonight at dinner. YüL‚(W):Goto:0!Xscl:0!Yscl:Plot1(Scatt T er,L#,L$,&) PlotsOn 1:ZoomStat:StorePic Pic1 Lbl Q:FnOff :""üY :PlotsOff Jennifer felt better so offered the following challenge to Pause :Goto T Kevin. Lbl:0üXscl:0üYscl:Plot1(Scatt S:ClrHome:2!dim(L%er,L):dim(L ,L‚,Ñ)# )!N J: What type of general statement can you make DispPlotsOn "NO. 1:ZoomStat:StorePic OF Pic1 regarding the various polygons and, better yet, what PausePTS.":Output(1,13,N):Pause :Goto T can you say about a figure that looks like this? LblFor(I,1,N):ClrHome S:ClrHome:2üdim(Lƒ):dim(L )üN Disp "NO. "PT. OF NO.","":Output(1,9,I) PTS.":Output(1,13,N):Pause L#(I)!L%(1):L$(I)!L%(2) For(I,1,N):ClrHome Disp L%:Pause :End:Goto T LblDisp "PT.0:Menu(" NO.","":Output(1,9,I) MODELS R""," LINEAR (2)",1,"L(I)üLƒ(1):L‚(I)üLƒ(2) QUADRATIC",2," CUBIC/QUARTIC",3,"Disp Lƒ:Pause :End:Goto LOGARITHMIC",4," T LblEXPONENTIAL",5," 0:Menu(" MODELS POWER",6," RÜ"," LINEAR MAIN (2)",1," MENU",T) QUADRATIC",2," CUBIC/QUARTIC",3," Lbl 1:"aX+b"!Y# Kevin was impressed. Would your students be as LOGARITHMIC",4," EXPONENTIAL",5," Menu(" LINEAR "," STANDARD",U," impressed or challenged? MEDIAN/MEDIAN",VPOWER",6," MAIN MENU",T) Born Broken: Fonts and Information Loss in Legacy Digital Documents Lbl 1:"aX+b"üY U:ClrHome:LinReg(ax+b) L#,L$ Reprinted from the California ComMuniCator, Volume /57/270/Corr.1 DispMenu(" "LinReg(ax+b) LINEAR "," STANDARD",U," " United Nation s A 29. No. 1. andMEDIAN/MEDIAN",VDisp "------",a,b,"",r Geoffrey Brown Kam Woods General Assembly Distr.: General Department of Computer Science,Output(3,1,"a Indiana University ="):Output(4,1,"b Lbl U:ClrHome:LinReg(ax+b) L ,L‚ 27 September 2002 150 S. WoodlawnDisp="):Output(6,1,"r Ave. "LinReg(ax+b) " =") Bloomington, IN 47405-71041!T:Pause :Goto 7 Original: English TI Corner Disp "------",a,b,"",r by R2 Lbl V:ClrHome:Med-Med L#,L$ DispOutput(3,1,"a "Med-Med ="):Output(4,1,"b "
Disp="):Output(6,1,"r "------=") ",a,b As statistics continues to grow to a more and more 1üT:PauseOutput(3,1,"a :Goto 7="):Output(4,1,"b =") prominent place in today’sAbstract mathematics curriculum, all for mathematical symbols may result in total information 8!T:Pause :Goto 7 resources (NCTM and others) suggest that statistics, even loss.Lbl V:ClrHome:Med For example, our- experimentalMed L ,L‚ data include 9 docu- For millions of legacy documents, correct rendering de- mentsLbl with2:ClrHome program listings for the Texas Instruments TI- in itspends early uponforms, resources should such be introduced as fonts that to are students not generally at the Disp "Med-Med " 83If series N<3:Goto calculators 0 renderedFifty-seventh with the sessionTi83Pluspc font appropriateembedded levels. within the This document program structure. (RESIDUAL) Yet there is sig- is Disp "------",a,bAgenda item 44* Start whichQuadReg provides L#,L various$:2! mathematicalT:"aX"+bX+c" symbols;!Y# these pro- Comment: <
K: I knew what you meant. I was just kidding. I’ll do XüLLbl (W):InputQ:FnOff :"""Y =! Y",Y#:PlotsOff the dishes tonight at dinner. YüL‚(W):Goto:0!Xscl:0!Yscl:Plot1(Scatt T er,L#,L$,&) PlotsOn 1:ZoomStat:StorePic Pic1 Lbl Q:FnOff :""üY :PlotsOff Jennifer felt better so offered the following challenge to Pause :Goto T Kevin. Lbl:0üXscl:0üYscl:Plot1(Scatt S:ClrHome:2!dim(L%er,L):dim(L ,L‚,Ñ)# )!N J: What type of general statement can you make DispPlotsOn "NO. 1:ZoomStat:StorePic OF Pic1 regarding the various polygons and, better yet, what PausePTS.":Output(1,13,N):Pause :Goto T can you say about a figure that looks like this? LblFor(I,1,N):ClrHome S:ClrHome:2üdim(Lƒ):dim(L )üN Disp "NO. "PT. OF NO.","":Output(1,9,I) PTS.":Output(1,13,N):Pause L#(I)!L%(1):L$(I)!L%(2) For(I,1,N):ClrHome Disp L%:Pause :End:Goto T LblDisp "PT.0:Menu(" NO.","":Output(1,9,I) MODELS R""," LINEAR (2)",1,"L(I)üLƒ(1):L‚(I)üLƒ(2) QUADRATIC",2," CUBIC/QUARTIC",3,"Disp Lƒ:Pause :End:Goto LOGARITHMIC",4," T LblEXPONENTIAL",5," 0:Menu(" MODELS POWER",6," RÜ"," LINEAR MAIN (2)",1," MENU",T) QUADRATIC",2," CUBIC/QUARTIC",3," Lbl 1:"aX+b"!Y# Kevin was impressed. Would your students be as LOGARITHMIC",4," EXPONENTIAL",5," Menu(" LINEAR "," STANDARD",U," impressed or challenged? MEDIAN/MEDIAN",VPOWER",6," MAIN MENU",T) Born Broken: Fonts and Information Loss in Legacy Digital Documents Lbl 1:"aX+b"üY U:ClrHome:LinReg(ax+b) L#,L$ Reprinted from the California ComMuniCator, Volume /57/270/Corr.1 DispMenu(" "LinReg(ax+b) LINEAR "," STANDARD",U," " United Nation s A 29. No. 1. andMEDIAN/MEDIAN",VDisp "------",a,b,"",r Geoffrey Brown Kam Woods General Assembly Distr.: General Department of Computer Science,Output(3,1,"a Indiana University ="):Output(4,1,"b Lbl U:ClrHome:LinReg(ax+b) L ,L‚ 27 September 2002 150 S. WoodlawnDisp="):Output(6,1,"r Ave. "LinReg(ax+b) " =") Bloomington, IN 47405-71041!T:Pause :Goto 7 Original: English TI Corner Disp "------",a,b,"",r by R2 Lbl V:ClrHome:Med-Med L#,L$ DispOutput(3,1,"a "Med-Med ="):Output(4,1,"b "
Disp="):Output(6,1,"r "------=") ",a,b As statistics continues to grow to a more and more 1üT:PauseOutput(3,1,"a :Goto 7="):Output(4,1,"b =") prominent place in today’sAbstract mathematics curriculum, all for mathematical symbols may result in total information 8!T:Pause :Goto 7 resources (NCTM and others) suggest that statistics, even loss.Lbl V:ClrHome:Med For example, our- experimentalMed L ,L‚ data include 9 docu- For millions of legacy documents, correct rendering de- mentsLbl with2:ClrHome program listings for the Texas Instruments TI- in itspends early uponforms, resources should such be introduced as fonts that to are students not generally at the Disp "Med-Med " 83If series N<3:Goto calculators 0 renderedFifty-seventh with the sessionTi83Pluspc font appropriateembedded levels. within the This document program structure. (RESIDUAL) Yet there is sig- is Disp "------",a,bAgenda item 44* Start whichQuadReg provides L#,L various$:2! mathematicalT:"aX"+bX+c" symbols;!Y# these pro- Comment: <
j K id=16756 suffer from font(lowercase) substitution when(CAPITAL) moved to another ma-
k L
(lowercase) (CAPITAL)
j H
(lowercase) (CAPITAL)
Q: Can a Logo Font be more than one color?
http://bowerwebsolutions.com/services/logotruetype.htmA: A logo converted into a font cannot contain color information. It is simply "Black" or "White". However, depending on your specific logo, it could be broken up into different "letters" and each color could be assigned from within your word process program.
2 of 5 9/10/09 9:26 AM Test Collections
~125,000 Collected Using Glossaries [Reichherzer and Brown 2006] 3910 Total Fonts Top 10 -- Times New Roman, Arial, Times, Verdana, Courier New, Symbol, Tahoma, Arial Narrow, Garamond, Helvetica
~105,000 Collected From .gov Sites 1920 Total Fonts Top 10 - Times New Roman, Arial, Courier New, Verdana, Times, Arial Narrow, Courier, Tahoma, CG Times, Helvetica Font “Collection”
Foundry Adobe Published Table 2374 Bitstream TrueType Fonts 1556 FontFont Foundry Supplied Table 11973 URW Foundry Supplied Table 2358 Operating System Microsoft Windows + Office Font Files 444 Mac OS X + Office Font Files 322 Application Adobe PostScript 3 Fonts Published List 103 Microsoft Applications Published List 537 WordPerfect TrueType Fonts 1080 Source Data Type Number of Fonts
Table 1: Font Data ply various heuristics for inexact matching (for example Times New Roman 42% longest matching prefixes), our examination of the names Times 3% suggests there are many special cases, thus we studied ex- CG Times 1% act matching as a baseline. We matched names against TimesNewRoman 0.5% three distinct name sets – our complete collection, the Times New Roman Bold < 0.5% names extracted from an installation of Windows XP and TimesNewRoman,Bold < 0.5% Office 2007, and the single name “Times New Roman” Times New (W1) < 0.5% which is the most common font referenced in our data sets. CG Times (W1) < 0.5% Our metric for each font collection is the percentage TimesNewRomanPSMT < 0.5% of documents whose font requirements can be completely Times-Roman < 0.5% met (satisfied) by that collection. We compare these re- ...... sults with the percentage of “satisfied documents” given Total (375) 49% font collections of the N most referenced fonts for all val- ues of N. The results for our glossary based collection Table 2: Times Variations. Percentages calculated from (3910 fonts) are illustrated in Figure and those for the number of times each font variation is encountered. “.gov” (1920 fonts) documents are illustrated in Figure . Although the total number of referenced fonts differs, the overall results are quite similar – roughly 31-39% satis- critical to preserving the information content of a docu- fied by Times New Roman, 72-79% satisfied by XP and ment. Office, and 90-94% satisfied by our more comprehensive The primary additional font metrics available from collection. Notice that in both cases the top 100 fonts sat- Word documents include the font family (Roman, Swiss, isfy approximately 92% of the documents. etc.) and pitch, the font weight, Panose number, and Code As mentioned above, exact name matching is probably sets (unicode and Windows). The two font families that too pessimistic. For example, we noticed many variations are likely to be safely substituted are Roman and Swiss on fonts including the word “Times”. Some of this varia- (serif and san serif fonts respectively). It is less clear tion is due to similar fonts published by different vendors, whether Modern or Script can be substituted, and Deco- some is due to changes in name conventions from the early rative generally cannot be substituted. As illustrated in bitmapped fonts to PostScript fonts to Truetype, and some Table 3 and Table 4, either the font family or pitch infor- is due to significant differences. The top 10 “Times” fonts mation has a “default” value for nearly 40% of all refer- and their fraction of reference from the two document col- enced fonts. The font family categories are described in lections is illustrated in Table 2. The complete list com- (Microsoft 2009b). prises 375 fonts and 49% of all font references. Note that the values in Table 2 are calculated from the total num- Default 40% ber of references rather than the percentage of documents Fixed Pitch 3% satisfied. Variable Pitch 57% Even if all variations on Times, after suitable analysis, proved to be equivalent, the overall problem isn’t signif- Table 3: Font Pitch icantly simplified. The extremely long tail on font usage means achieving a 95% satisfaction level for a document Unicode range and Codepage information can be used collection is tractable, achieving 99% may not be feasible. to analyze font data at a finer granularity, providing a Furthermore, our experience suggests that finding many of clearer picture of why a particular font may be used in the identified fonts will be quite challenging. a given document (e.g. in order to satisfy the need for par- Ultimately, preservation of documents will require se- ticular glyphs) and potential guidance on selection of a lection of suitable font substitutes either because a partic- suitable substitute. We intend to explore the use of such ular font cannot be obtained or identified. Thus, we ex- information in future research. amined other data that might aid in characterizing suitable As mentioned, many fonts are used for a single or few substitutes or in determining whether a required font is glyphs. In the case of the document collections studied, Name Problem
Printed by geobrown Sep 10, 09 10:51arial.txt Page 1/2 Sep 10, 09 10:51arial.txt Page 2/2 !"#$%& "#$%&!./&0J2 "#$%& "#$%&!./&096%&$:;- "#$%&!'()*+, "#$%&!./&0;< "#$%&!'--, "#$%&!./&0;- "#$%&!'--,!./&0 "#$%&!96%&$:;- "#$%&!'12, "#$%&.&%:8 "#$%&!./&0 "#$%&.&%:8L96%&$: "#$%&!)3# "#$%&PG#/;- "#$%&!4566 "#$%&;< "#$%&!7#558 "#$%&;- "#$%&!96%&$: "#$%&;-!./&0 "#$%&!;< "#$%&;-!PQ6#%./&0 "#$%&!;- "#$%&C%##/D "#$%&!;-!.&%:8 "#$%&C%##/DL./&0 "#$%&!;-!)/=05=>50!*$?@6 "#$%&C%##/D!./&0 "#$%&!;AB/#$ "#$%&C%##/D!./&096%&$: "#$%&!C%##/D "#$%&C%##/D!96%&$: "#$%&!C%##/D!./&0 "#$%&H=$:/05;< "#$%&!C%##/D!96%&$: .7**"9J"#$%& "#$%&!C5?#$6% .7**9CJ"#$%&L./&0 "#$%&!C/#E%%&$ )"(C)MJ"#$%& "#$%&!F/G=050!;- )794"RJ"#$%&L./&096%&$: "#$%&!F/G=050!;-!./&0 7(4MP.J"#$%& "#$%&!H=$:/05!;< 7(4MSCJ"#$%&L./&0 "#$%&!I/&0 M!"#$%& "#$%&JK M5&&%>"#$%& "#$%&L!M5&N56$:%L!>%=>!>5#$O T5#0%=%L!"#$%&L!-$E5> "#$%&L./&0 "#$%&L./&096%&$: "#$%&LM5&N56$:%L>%=>!>5#$O "#$%&L96%&$: "#$%&!.&%:8
Thursday September 10, 2009 /Users/geobrown/arial.txt 1/1 The Naming Table http://www.microsoft.com/typography/otspec/name.htm
LangTagRecord langTagRecord[langTagCount] The language-tag records where langTagCount is the number of records. (Variable) Storage for the actual string data. The 'OS/2'When & Windows format 1 is Metrics used, the Tablelanguage IDs in name records can be less than or greater than 0x8000.http://www.microsoft.com/typography/otspec/os2.htm If a language ID is less than 0x8000, it has a platform-specific interpretation as with a format 0 Naming table. If a language ID is equal to or greater than 0x8000, it is associated with a language-tag record that references a language-tag string. In this way, the language ID is associated with a language-tag string that specifies the language for name records using that language ID, regardless of the platform. These can be used for any platform that supports this language-tag mechanism. A fontMicrosoft using a format Typography 1 Naming table may| Developer... use a combination |of OpenTypeplatform-specific specification language IDs as well | asOpenType tables | The OS/2 language-tag records for a given platform and encoding. table Each langTagRecord is organized as follows: Type Name Description USHORT length Language-tag string length (in OS/2 - OS/2 andbytes) Windows Metrics USHORT offset Language-tag string offset from The OS/2 table consists of a setstart of of metricsstorage area that (in are bytes). required in OpenType fonts. The fifth version of the OS/2 table (version 4), follows: The OpenType Font File http://www.microsoft.com/typography/otspec/otff.htm#otttables Language-tag strings stored in the Naming table must be encoded in UTF-16BE. The language tags must conform to IETFType specification BCPName 47. This of provides Entry tags such as "en", "fr-CA"Comments and "zh-Hant" to identify languages,USHORT including dialects,version written form and other variations.0x0004 ULONG ulDsigLength The length (in bytes) of the DSIGThe table language-tag records are associated sequentially with language IDs starting with 0x8000. Each (null if no signature) language-tagSHORT record correspondsxAvgCharWidth to a language ID one greater than that for the previous language-tag record. ULONG ulDsigOffset The offset (in bytes) of the DSIGThus, table language IDs associated with language-tag records must be within the range 0x8000 to 0x8000 + from the beginning of the TTC filelangTagCount (nullUSHORT if - 1. If a name recordusWeightClass uses a language ID that is greater than this, the identify of the language is unknown; such name records should not be used. no signature) USHORT usWidthClass For example, suppose a font has two language-tag records referencing strings in the storage: the first Font Tables referencesUSHORT the string "en", andfsType the second references the string "zh-Hant-HK" In this case, the language ID Font Information0x8000 is used in(Truetype) name records to index English-language strings. The language ID 0x8001 is used in name The rasterizer has a much easier time traversing tables if they are paddedrecordsSHORT so to that index each strings table in begins TraditionalySubscriptXSize on a 4-byteChinese as used in Hong Kong. boundary. It is highly recommended that all tables be long-aligned and padded with zeroes. NameSHORT Records ySubscriptYSize SHORT ySubscriptXOffset OpenType Tables Each NameRecord looks like this: SHORT ySubscriptYOffset Type Name Description Whether TrueType or PostScript outlines are used in an OpenType font, the following tables are required for USHORTSHORT platformIDySuperscriptXSizePlatform ID. the font to function correctly: USHORTSHORT encodingIDySuperscriptYSizePlatform-specific encoding ID. Required Tables USHORT languageID Language ID. SHORT ySuperscriptXOffset USHORT nameID Name ID. Tag Name USHORTSHORT length ySuperscriptYOffsetString length (in bytes). cmap Character to glyph mapping USHORTSHORT offset yStrikeoutSizeString offset from start of storage head Font header area (in bytes). hhea Horizontal header FollowingSHORT are the descriptionsyStrikeoutPosition of the four kinds of ID. The specific values listed here are predefined; new ones hmtx Horizontal metrics maySHORT be added in the future. sFamilyClassSimilar to the character encoding table, the NameRecords are sorted by platform maxp Maximum profile ID, then platform-specific ID, then language ID, and then by name ID. LanguageBYTE IDs refer to a valuepanose[10] that identifies the language in which a particular string is written. Values less name Naming table than 0x8000 are defined on a platform-specific basis. Values greater than or equal to 0x8000 are associated OS/2 OS/2 and Windows specific metrics withULONG language-tag records, asulUnicodeRange1 described above. Not all platformsBits 0-31 have platform-specific language IDs, and not all platforms support language-tag records. post PostScript information ULONG ulUnicodeRange2 Bits 32-63 For OpenType fonts based on TrueType outlines, the following tables areULONG used: ulUnicodeRange3 Bits 64-95 Tables Related to TrueType Outlines 2 of 17 ULONG ulUnicodeRange4 Bits 96-127 9/10/09 11:00 AM
Tag Name CHAR achVendID[4] cvt Control Value Table USHORT fsSelection fpgm Font program USHORT usFirstCharIndex glyf Glyph data loca Index to location USHORT usLastCharIndex prep CVT Program SHORT sTypoAscender Thehttp://www.microsoft.com/typography/otspec/ PostScript font extensions define a new set of tables containing dataSHORT specific to PostScript fontssTypoDescender that are used instead of the tables listed above. SHORT sTypoLineGap Tables Related to PostScript Outlines USHORT usWinAscent Tag Name USHORT usWinDescent CFF PostScript font program (compact font format) VORG Vertical Origin ULONG ulCodePageRange1 Bits 0-31 It is strongly recommended that CFF OpenType fonts that are used for verticalULONG writing include ulCodePageRange2aVertical Bits 32-63 Origin ('VORG') table. Multiple Master support in OpenType, has been SHORTdiscontinued as of versionsxHeight 1.3 of the specification. The 'fvar', 'MMSD', 'MMFX' tables have hence been removed. SHORT sCapHeight Tables Related to Bitmap Glyphs USHORT usDefaultChar
6 of 7 9/10/09 10:58 AM 1 of 19 9/10/09 11:01 AM !"#$%&%'()*''"#+),%$-)./0122/)3"45$6)7"8+)7%$95():;-%#<)=>+#"'"#5("%4) ?5@+)AB2)!")1A2)
b10 b16 Field Type Size Bitfield Comment
) ) !"# C4&) EB) 2AF2) G+H()I$5>>"4@)9%-+)) &D%$() 2))!"#$%&'%()*%+,$-./*%0$1)"0$%2(-,!)*$%,(3$4*) A))4%)(+H()4+H()(%)&D5>+) 1))I$5>)5$%C4-)5J&%8C(+)%JK+#() L))I$5>)5&)"')4%)%JK+#()>$+&+4() B))I$5>)("@D(86)5$%C4-)%JK+#() M))I$5>)("@D(86N)JC()588%I)D%8+&) O0AM))$+&+$P+-)'%$)'C(C$+)C&+)
) ) !"$# C4&) EB) AF22) G+H()I$5>>"4@)9%-+)(6>+):P58"-)%486)'%$)I$5>>"4@) &D%$() 9%-+&)1)54-)B) 2))I$5>)J%(D)&"-+&) A))I$5>)%486)%4)8+'() 1))I$5>)%486)%4)$"@D() L))I$5>)%486)%4)85$@+&()&"-+)
) ) %&'()*+,-.# C4&) EA) 1222) ,D+4)&+(N)(+9>%$5$"86)%P+$$"-+&)/0N)J6N)'%$#"4@)(D+) &D%$() 0(1.%2N)0(&*342N)5(67,N)54-)5(87227+)'"+8-&)(%)588) J+)>5@+)$+85("P+;)
) ) %8.-7!6.02# C4&) EA) B222) A))&D5>+)"&)J+8%I)(+H() &D%$() 2))&D5>+)"&)5J%P+)(+H()
) ) %9:'47"17'$# C4&) EA) Q222) A))54#D%$)"&)8%#R+-) &D%$() 2))54#D%$)"&)4%()8%#R+-) 11Word) AO) '60/0 Document# 8%4@) ) ) FontS%C4()%')(+H(J%H+&)"4)&D5>+):C4-%)-%#)%486< Information) ) '/;)<9):#%C4()%')J6(+&)%');)<9<)"&)1O):-+#"958 2) 2) '/;%:=># C4&)#D5$) ) ) G%(58)8+4@(D)%')77U)0)A;) A) A) ,"?# C4&)#D5$) E1) 2L) ?"(#D)$+VC+&() ) ) %6"@.65,.# C4&)#D5$) EA) 2B) ,D+4)AN)'%4()"&)5)G$C+G6>+)'%4() ) ) # C4&)#D5$) EA) 2Q) W+&+$P+-) ) ) %%# C4&)#D5$) EL) /2) 7%4()'59"86)"-) ) ) # C4&)#D5$)) EA) Q2) W+&+$P+-) 1) 1) !A.*342# &D%$() ) ) 35&+)I+"@D()%')'%4() B) B) '4B# C4&)#D5$) ) ) SD5$5#(+$)&+()"-+4("'"+$) M) M) *0'4)C9-2# C4&)#D5$) ) ) X4-+H)"4(%#%%:DBC;%:)(%)(D+)459+)%')(D+) !"#$%&%'()*''"#+),%$-)./0122/)3"45$6)7"8+)7%$95():;-%#<)=>+#"'"#5("%458(+$45(+)'%4() ) ?5@+)ABA)!")1A2) O) O) ,(:7B.# ?TU*=F) ) ) ) AOb10) A2b16) %BField# 7*UG=XYUTGZWFType ) ) Size ) Bitfield ) Comment B2) 1C) !"#$%&' DEFGHIJ) ) ) K+$%)(+$9"45(+-)&($"4@)(L5()$+#%$-&)459+)%') '%4(;)?%&&"M86)'%88%N+-)M6)5)&+#%4-)!"#)NL"#L) $+#%$-&)(L+)459+)%')54)58(+$45(+)'%4()(%)O&+)"') (L+)'"$&()459+-)'%4()-%+&)4%()+P"&()%4)(L"&) &6&(+9;)!5P"958)&"Q+)%')!"#$%&)"&)RS) #L5$5#(+$&;) ) File Information Block (FIB) T4),%$-)U+$&"%4)CV)(L+)$())"&)$+%$@54"Q+-)(%)95W+)'O(O$+)+P(+4&"%4)+5&"+$V)54-)(%)95W+)"() +5&"+$)(%)95W+)M5#WN5$-)#%9>5("M8+)'"8+)'%$95()#L54@+&;)XL+)$())4%N)#%4&"&(&)%')'%O$) &OM&($O#(O$+&Y)(L+)L+5-+$)54-)(L$++)5$$56&;)XL+)$())L+5-+$V)"&)O4#L54@+-)'$%9)>5&()U+$&"%4&;) XL+)&+#%4-)>5$()"&)54)5$$56)%')AR0!"#$%&'()#&*+$,($(-$.'"/'$.0)0$1)0&02#$"2$03)4"0)$50)&"(2&$ "4)-"''+$+4()8%#5("%4&;)XL+)(L"$-)>5$()"&)54)5$$56)%')Z10M"()8%4@&V)9546)%')NL"#L)N+$+)((+$+-) (L$%O@L)(L+)>$+U"%O&)U+$&"%4)$();)7"45886V)(L+$+)"&)54)5$$56)%')$*[+*))>5"$&V)NL"#L)N+$+) -"U"-+-)"4(%)&+U+$58)-"&\%"4()5$$56&)"4)(L+)>$+U"%O&)$();)7O(O$+)U+$&"%4&)%'),%$-)N"88)5--) +4($"+&)(%)(L+)(L$++)5$$56&V)&%)$+5-+$&)%')(L+)$())9O&()M+)#5$+'O8)(%)&W">)%U+$)546)+4($"+&)"4) +5#L)5$$56)(L5()N+$+)4%()>$+&+4()"4)(L+)U+$&"%4)'%$)NL"#L)(L+)$+5-+$)N5&)-+&"@4+-;),$"(+$&)%') (L+)$())9O&()N$"(+)+P5#(86)5&)9546)+4($"+&)5&)N5&)-+'"4+-)'%$)(L+)&$,-)U58O+)(L+6)>O()"4)(L+) $();) XL+)$()$*+*))&($O#(O$+V)O&+-)"4)54)5$$56)"4)(L+)$()Y) Bitfield Bitfield Decimal Hex Name Type Size Mask Comments Introduced 2) 2P2222) $.' 8%4@) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ) B) 2P222B) +.-' O8%4@) ) ) ="Q+)%')-5(5;)T@4%$+)%.)"')/.-)"&)Q+$%;) ) ) XL+)$*0123+2)&($O#(O$+V)$+'+$+4#+-)"4)(L+)$()V)O&+-)"4(+$45886)M6),%$-Y) Bitfield Bitfield Decimal Hex Name Type Size Mask Comments Introduced 2) 2P2222) %.045' 8%4@) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ) B) 2P222B) /.-045' O8%4@) ) ) ="Q+)%')-5(5;)T@4%$+)%.)"')/.-)"&)Q+$%;) ) C) 2P222C) %.)65' 8%4@) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ) A1) 2P222E) /.-)65' O8%4@) ) ) ="Q+)%')-5(5;)T@4%$+)%.)"')/.-)"&)Q+$%;) ) ) XL+)$*012)&($O#(O$+V)$+'+$+4#+-)"4)(L+)$()V)O&+-)"4(+$45886)M6),%$-;)XL"&)9%-"'"+-)U+$&"%4)%') (L+)5M%U+)&($O#(O$+)N5&)"4($%-O#+-)"4),%$-)122ZY) Bitfield Bitfield Decimal Hex Name Type Size Mask Comments Introduced 2) 2P2222) %.045' 8%4@) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ,%$-)122Z) B) 2P222B) /.-045' O8%4@) ) ) ="Q+)%')-5(5;)T@4%$+)%.)"')/.-)"&)Q+$%;) ,%$-)122Z) C) 2P222C) %.)65' 8%4@) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ,%$-)122Z) A1) 2P222E) /.-)65' O8%4@) ) ) ="Q+)%')-5(5;)T@4%$+)%.)"')/.-)"&)Q+$%;) ,%$-)122Z) AR) 2P22A2) %.7%5' 7E) ) ) 7"8+)>%&"("%4)NL+$+)-5(5)M+@"4&;) ,%$-)122Z) Extracting Fonts from Documents libwv (used in abiword) + bug fixes on font identification Custom C program (several hundred lines) Walk document character-by-character Process results with shell scripts Match name strings (complete match) against font database 100 Font Collection 90 80 XP+Office 2007 70 60 Percentage of Files Satisfied 50 40 Times New Roman 30 10 25 50 100 200 300 400 Total Fonts Documents collected using Glossaries 100 90 Font Collection 80 XP+Office 2007 70 60 50 Percentage of Files Satisfied 40 30 Times New Roman 20 10 25 50 100 200 300 400 Total Fonts .gov Documents Windows Font Analysis Tool Word Add-In Logo Example Cyrillic Font Problems Old document (pre Unicode) used ascii character range for Cyrillic. Modern versions of “Glasnost Light” use unicode pages properly. Old font is not available It is with profound regret that we inform you that Casady & Greene will close its doors on July 3rd, 2003, Legacy substitution documentation identified Corel font “Czar”, but that’s not quite right http://nwalsh.com/comp.fonts/FAQ/cf_33.htm The Cyrillic Charset Soup -- http://czyborra.com/ charsets/cyrillic.html WD2000: Incorrect Characters Appear When You Open Document i... http://support.microsoft.com/default.aspx?scid=kb;en-us;Q260162 Article ID: 260162 - Last Review: August 5, 2004 - Revision: 2.2 WD2000: Incorrect Characters Appear When You Open Document in Earlier Eastern European Version of Word This article was previously published under Q260162 SYMPTOMS When you open a Word document that was created using an earlier Eastern European version of Microsoft Word, the text of the document may appear as square boxes or other incorrect characters when you open the document in the English (U.S.) version of Microsoft Word. CAUSE This problem is caused by font-mapping and character-mapping problems. For example, this problem can occur when you create a document by using a font that is not installed on your computer. RESOLUTION To correct these problems, Microsoft provides a Word Font Repair Macro that converts the fonts in the document (or the selected text) to Arial. The following file is available for download from the Microsoft Download Center: Eefonts.exe (http://download.microsoft.com/download/word2000/eefonts/2000/w9xnt4/en-us/eefonts.exe) Release Date: 27-Jan-2000 For additional information about how to download Microsoft Support files, click the following article number to view the article in the Microsoft Knowledge Base: 119591 (http://support.microsoft.com/kb/119591/EN-US/ ) How to Obtain Microsoft Support Files from Online Services Microsoft scanned this file for viruses. Microsoft used the most current virus-detection software that was available on the date that the file was posted. The file is stored on security-enhanced servers that help to prevent any unauthorized changes to the file. MORE INFORMATION To download and install the Word Font Repair Macro, follow these steps: 1. Click the Eefonts.exe link in the "Resolution" section of this article to download the Eefonts.exe file. 2. In the File Download dialog box, click to select Save this program to disk, and then click OK. 3. Change the Save in box to the folder where you want to save the Eefonts.exe file, and then click Save. 4. After the file has been downloaded, install the macro. To do this, follow these steps: a. Quit Microsoft Word 2000. b. Click Start, and then click Run. c. In the Run dialog box, click Browse. d. In the Browse dialog box, change the Look in box to the folder where you saved the Eefonts.exe file. e. Click to select the Eefonts.exe file, and then click Open. f. In the Run dialog box, click OK. g. In the Microsoft Word Font Repair Macro Setup dialog box, click Next. h. Click to select I accept the License Agreement, and then click Next. NOTE: If you do not select to accept the License Agreement, the Word Font Repair Macro will not be installed. i. To start the installation of the Word Font Repair Macro, click Next. 1 of 2 9/29/09 3:47 PM What can we do ? A better job of identifying “critical” font substitutions Use unicode range to assist language identification Use language identification tools to identify human language vs symbolic font use Create better tools to facilitate quality assurance testing by isolating text with font substitutions Create a clearinghouse of font identification and substitution information