
Born Broken: Fonts and Information Loss in Legacy Digital Documents

Geoffrey Brown and Kam Woods
Indiana University School of Informatics and Computing

Key Questions

How pervasive are font substitution problems?

What information is available to identify fonts?

How well can we match the fonts required by a document collection?

How can we assist archivists in identifying serious font issues?

Font Substitution

[Figure: overlapping page images of three sample documents: page 8 of the MCTM Bulletin (February 2005), whose "TI Corner" column prints a TI-83 program listing; United Nations General Assembly corrigendum A/57/270/Corr.1 (27 September 2002), which carries a barcode; and the first page of the paper of the same title (Geoffrey Brown and Kam Woods, Department of Computer Science, Indiana University, 150 S. Woodlawn Ave., Bloomington, IN 47405-7104). The legible text of the paper's first page follows; [...] marks text hidden by the overlap.]

Abstract

For millions of legacy documents, correct rendering depends upon resources such as fonts that are not generally embedded within the document structure. Yet there is [...] substituted fonts. [...] to assess the difficulty of matching font requirements with a database of fonts. We describe the identifying information contained in common font formats, font requirements stored in Word documents, the API provided by Windows to support font requests by applications, the documented substitution algorithms used by Windows when requested fonts are not available, and the ways in which support software might be used to control font substitution in a preservation environment.

Overview

Q. I need ‘font x’. ... Can you tell me where I can get it?
A. This is very unlikely, as there are over 100,000 digital fonts in existence. [1]

Doubtless, many readers have witnessed PowerPoint presentations where the slides were clearly missing glyphs (visible characters) or were otherwise poorly rendered. In most cases, this unhappy event is the direct result of copying the presentation from the machine upon which it was created to a machine provided for the presentation without ensuring that the target machine has the required fonts. The reason is not always clear to the presenter because Microsoft Office performs font substitution without warning.

Annoying font substitutions occur frequently. For example, symbols such as apostrophes and quotations are rendered with the “WP TypographicSymbol” font in WordPerfect Office 11. When these documents are migrated to Microsoft Word, this font dependence is preserved, and when the documents are rendered on machines without this font, these symbols become “A” and “@” as in: “in the AStrategies and Assessment@ Column”. [2]

The degree of information loss due to substitution depends both upon the importance of the glyphs substituted and their frequency. For example, corporate logos are often implemented with dedicated fonts containing a single glyph. There may be no substantive information loss from a missing logo for most purposes. In contrast, substitution for mathematical symbols may result in total information loss. For example, our experimental data include 9 documents with program listings for the Texas Instruments TI-83 series calculators rendered with the Ti83Pluspc font which provides various mathematical symbols; these [...] Compounding the problem, Texas Instruments has published a variety of calculator fonts with different internal names and possibly incompatible glyphs.

Figure 1: TI83plus font samples (panels: Correctly Rendered, Default Substitution)

In the process of preparing this paper we found several documents with barcodes rendered using the font “Barcode 3 of 9 by request”. When rendered with font substitutions, the bar codes of the numbers are replaced by Arabic numerals. Thus, the substitution preserved the numeric meaning, but not the functional ability to be scanned! Unfortunately, there are many barcode fonts and we were unable to find the required font; however, we were able to find a suitable substitute and to configure Office to accept our substitute as illustrated in Figure 2.

Figure 2: Barcode Rendering (panels: Barcode 3 of 9 by Request, Default Substitution, Forced Substitution of “Code39Azalea”)

It is difficult to create portable documents that will not suffer from font substitution when moved to another [...]

[1] http://www.microsoft.com/typography/FontSearchFAQ.mspx
[2] http://www.wpuniverse.com/vb/showthread.php?thread-id=16756

Font Substitution (cont.)

[Figure: the same overlapping page images as on the previous slide (the MCTM Bulletin page with its TI-83 program listing, the United Nations corrigendum A/57/270/Corr.1, and the first page of the paper), with one added page from Bower Web Solutions Inc., "Convert Your Company Logo to a TrueType font" (http://bowerwebsolutions.com/services/logotruetype.htm, captured 9/10/09). That page offers a free "Lion Heart" demo font (lionheart.zip, 16K), gives installation steps for Windows and for Macintosh (OS X), and shows which typed characters, lowercase and capital, produce the logo glyphs.]

Q: Can a Logo Font be more than one color?

A: A logo converted into a font cannot contain color information. It is simply "Black" or "White". However, depending on your specific logo, it could be broken up into different "letters" and each color could be assigned from within your word processing program.

Source: http://bowerwebsolutions.com/services/logotruetype.htm

Test Collections

~125,000 Collected Using Glossaries [Reichherzer and Brown 2006]
3910 Total Fonts
Top 10: [...], [...], Times, [...], [...] New, [...], [...], Arial Narrow, Garamond, [...]

~105,000 Collected From .gov Sites
1920 Total Fonts
Top 10: Times New Roman, Arial, Courier New, Verdana, Times, Arial Narrow, Courier, Tahoma, CG Times, Helvetica

Font “Collection”

Table 1: Font Data

Source                        Data Type                Number of Fonts
Foundry
  Adobe                       Published Table          2374
  Bitstream                   TrueType Fonts           1556
  FontFont                    Foundry Supplied Table   11973
  URW                         Foundry Supplied Table   2358
Operating system
  [...] + Office              Font Files               444
  Mac OS X + Office           Font Files               322
Application
  Adobe PostScript 3 Fonts    Published List           103
  Microsoft Applications      Published List           537
  WordPerfect                 TrueType Fonts           1080

[...] apply various heuristics for inexact matching (for example, longest matching prefixes), our examination of the names suggests there are many special cases; thus we studied exact matching as a baseline. We matched names against three distinct name sets: our complete collection, the names extracted from an installation of Windows XP and Office 2007, and the single name “Times New Roman”, which is the most common font referenced in our data sets.

Our metric for each font collection is the percentage of documents whose font requirements can be completely met (satisfied) by that collection. We compare these results with the percentage of “satisfied documents” given font collections of the N most referenced fonts for all values of N. The results for our glossary-based collection (3910 fonts) are illustrated in Figure [...] and those for the “.gov” (1920 fonts) documents are illustrated in Figure [...]. Although the total number of referenced fonts differs, the overall results are quite similar: roughly 31-39% satisfied by Times New Roman, 72-79% satisfied by XP and Office, and 90-94% satisfied by our more comprehensive collection. Notice that in both cases the top 100 fonts satisfy approximately 92% of the documents.

As mentioned above, exact name matching is probably too pessimistic. For example, we noticed many variations on fonts including the word “Times”. Some of this variation is due to similar fonts published by different vendors, some is due to changes in name conventions from the early bitmapped fonts to PostScript fonts to TrueType, and some is due to significant differences. The top 10 “Times” fonts and their fraction of references from the two document collections are illustrated in Table 2. The complete list comprises 375 fonts and 49% of all font references. Note that the values in Table 2 are calculated from the total number of references rather than the percentage of documents satisfied.

Table 2: Times Variations. Percentages calculated from the number of times each font variation is encountered.

Times New Roman         42%
Times                   3%
CG Times                1%
TimesNewRoman           0.5%
Times New Roman Bold    < 0.5%
TimesNewRoman,Bold      < 0.5%
Times New (W1)          < 0.5%
CG Times (W1)           < 0.5%
TimesNewRomanPSMT       < 0.5%
Times-Roman             < 0.5%
...
Total (375)             49%

Even if all variations on Times, after suitable analysis, proved to be equivalent, the overall problem isn’t significantly simplified. The extremely long tail on font usage means that while achieving a 95% satisfaction level for a document collection is tractable, achieving 99% may not be feasible. Furthermore, our experience suggests that finding many of the identified fonts will be quite challenging.

Ultimately, preservation of documents will require selection of suitable font substitutes because a particular font cannot be either obtained or identified. Thus, we examined other data that might aid in characterizing suitable substitutes or in determining whether a required font is critical to preserving the information content of a document.

The primary additional font metrics available from Word documents include the font family (Roman, Swiss, etc.) and pitch, the font weight, Panose number, and Code sets (Unicode and Windows). The two font families that are likely to be safely substituted are Roman and Swiss (serif and sans serif fonts respectively). It is less clear whether Modern or Script can be substituted, and Decorative generally cannot be substituted. As illustrated in Table 3 and Table 4, either the font family or pitch information has a “default” value for nearly 40% of all referenced fonts. The font family categories are described in (Microsoft 2009b).

Table 3: Font Pitch

Default           40%
Fixed Pitch       3%
Variable Pitch    57%

Unicode range and Codepage information can be used to analyze font data at a finer granularity, providing a clearer picture of why a particular font may be used in a given document (e.g. in order to satisfy the need for particular glyphs) and potential guidance on selection of a suitable substitute. We intend to explore the use of such information in future research.

As mentioned, many fonts are used for a single or few glyphs. In the case of the document collections studied, [...]

Name Problem

[Slide: two-page printout of arial.txt (printed by geobrown, Sep 10, 09 10:51), a list of font names containing "Arial". Entries include Arial, Arial Bold, Arial Italic, Arial Black, Arial Narrow, Arial Narrow Bold, Arial Greek, Arial MT, Arial MT Condensed Light, Arial,Bold, Arial,BoldItalic, ArialMT, ArialNarrow, and subset-prefixed names such as BGLLAI+Arial.]
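Name variations like these are why exact matching undercounts. The following is a minimal sketch of the satisfaction metric described earlier (the fraction of documents whose font requirements are completely met by a collection), computed once with exact names and once with normalized names. The `normalize` rules, the function names, and the toy documents are illustrative assumptions, not the authors' matching procedure; real font names have many special cases, and similarly named fonts are not necessarily equivalent.

```python
import re
from collections import Counter

def normalize(name: str) -> str:
    """Hypothetical normalization: strip subset prefix, style words, MT suffix, punctuation."""
    name = re.sub(r"^[A-Z]{6}\+", "", name)                      # PDF-style subset prefix, e.g. "BGLLAI+"
    name = re.sub(r"[,\- ]*(bold|italic|regular)", "", name, flags=re.I)
    name = re.sub(r"(PS)?MT$", "", name.strip())                 # "ArialMT", "TimesNewRomanPSMT"
    return re.sub(r"[^a-z0-9]", "", name.lower())

def satisfied_fraction(documents, available, key=lambda n: n):
    """Fraction of documents whose referenced fonts are all in `available`."""
    have = {key(n) for n in available}
    ok = sum(1 for fonts in documents if all(key(f) in have for f in fonts))
    return ok / len(documents)

def top_n_fonts(documents, n):
    """The n font names referenced by the most documents."""
    counts = Counter(f for fonts in documents for f in set(fonts))
    return [name for name, _ in counts.most_common(n)]

# Toy data: each document is the list of font names it references.
docs = [
    ["Times New Roman"],
    ["TimesNewRoman,Bold", "Arial"],
    ["Arial MT", "Ti83Pluspc"],
    ["BGLLAI+Arial"],
    ["Arial"],
]
collection = ["Times New Roman", "Arial"]

print(satisfied_fraction(docs, collection))                 # exact names
print(satisfied_fraction(docs, collection, key=normalize))  # normalized names
print(satisfied_fraction(docs, top_n_fonts(docs, 1)))       # top-1 collection
```

On this toy data, normalization raises the satisfied fraction, but the document that needs Ti83Pluspc stays unsatisfied either way, which mirrors the long-tail observation in the text.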

Font Information (Truetype)

[Slide: overlapping pages from the Microsoft OpenType specification (http://www.microsoft.com/typography/otspec/). The legible excerpts follow; [...] marks text hidden by the overlap.]

The OpenType Font File (http://www.microsoft.com/typography/otspec/otff.htm#otttables)

Font Tables

The rasterizer has a much easier time traversing tables if they are padded so that each table begins on a 4-byte boundary. It is highly recommended that all tables be long-aligned and padded with zeroes.

OpenType Tables

Whether TrueType or PostScript outlines are used in an OpenType font, the following tables are required for the font to function correctly:

Required Tables

Tag     Name
cmap    Character to glyph mapping
head    Font header
hhea    Horizontal header
hmtx    Horizontal metrics
maxp    Maximum profile
name    Naming table
OS/2    OS/2 and Windows specific metrics
post    PostScript information

For OpenType fonts based on TrueType outlines, the following tables are used:

Tables Related to TrueType Outlines

Tag     Name
cvt     Control Value Table
fpgm    Font program
glyf    Glyph data
loca    Index to location
prep    CVT Program

The PostScript font extensions define a new set of tables containing data specific to PostScript fonts that are used instead of the tables listed above.

Tables Related to PostScript Outlines

Tag     Name
CFF     PostScript font program (compact font format)
VORG    Vertical Origin

It is strongly recommended that CFF OpenType fonts that are used for vertical writing include a Vertical Origin ('VORG') table. Multiple Master support in OpenType has been discontinued as of version 1.3 of the specification. The 'fvar', 'MMSD', 'MMFX' tables have hence been removed.

The Naming Table (http://www.microsoft.com/typography/otspec/name.htm)

When format 1 is used, the language IDs in name records can be less than or greater than 0x8000. If a language ID is less than 0x8000, it has a platform-specific interpretation as with a format 0 Naming table. If a language ID is equal to or greater than 0x8000, it is associated with a language-tag record that references a language-tag string. In this way, the language ID is associated with a language-tag string that specifies the language for name records using that language ID, regardless of the platform. A font using a format 1 Naming table may use a combination of platform-specific language IDs as well as language-tag records for a given platform and encoding.

Each langTagRecord is organized as follows:

Type      Name      Description
USHORT    length    Language-tag string length (in bytes)
USHORT    offset    Language-tag string offset from start of storage area (in bytes).

Language-tag strings stored in the Naming table must be encoded in UTF-16BE. The language tags must conform to IETF specification BCP 47. This provides tags such as "en", "fr-CA" and "zh-Hant" to identify languages, including dialects, written form and other variations.

The language-tag records are associated sequentially with language IDs starting with 0x8000. Each language-tag record corresponds to a language ID one greater than that for the previous language-tag record. Thus, language IDs associated with language-tag records must be within the range 0x8000 to 0x8000 + langTagCount - 1. If a name record uses a language ID that is greater than this, the identity of the language is unknown; such name records should not be used.

For example, suppose a font has two language-tag records referencing strings in the storage: the first references the string "en", and the second references the string "zh-Hant-HK". In this case, the language ID 0x8000 is used in name records to index English-language strings. The language ID 0x8001 is used in name records to index strings in Traditional Chinese as used in Hong Kong.

Name Records

Each NameRecord looks like this:

Type      Name          Description
USHORT    platformID    Platform ID.
USHORT    encodingID    Platform-specific encoding ID.
USHORT    languageID    Language ID.
USHORT    nameID        Name ID.
USHORT    length        String length (in bytes).
USHORT    offset        String offset from start of storage area (in bytes).

Following are the descriptions of the four kinds of ID. The specific values listed here are predefined; new ones may be added in the future. Similar to the [...] table, the NameRecords are sorted by platform ID, then platform-specific ID, then language ID, and then by name ID.

Language IDs refer to a value that identifies the language in which a particular string is written. Values less than 0x8000 are defined on a platform-specific basis. Values greater than or equal to 0x8000 are associated with language-tag records, as described above. Not all platforms have platform-specific language IDs, and not all platforms support language-tag records.

OS/2 - OS/2 and Windows Metrics (http://www.microsoft.com/typography/otspec/os2.htm)

The OS/2 table consists of a set of metrics that are required in OpenType fonts. The fifth version of the OS/2 table (version 4) follows:

Type      Name of Entry          Comments
USHORT    version                0x0004
SHORT     xAvgCharWidth
USHORT    usWeightClass
USHORT    usWidthClass
USHORT    fsType
SHORT     ySubscriptXSize
SHORT     ySubscriptYSize
SHORT     ySubscriptXOffset
SHORT     ySubscriptYOffset
SHORT     ySuperscriptXSize
SHORT     ySuperscriptYSize
SHORT     ySuperscriptXOffset
SHORT     ySuperscriptYOffset
SHORT     yStrikeoutSize
SHORT     yStrikeoutPosition
SHORT     sFamilyClass
BYTE      panose[10]
ULONG     ulUnicodeRange1        Bits 0-31
ULONG     ulUnicodeRange2        Bits 32-63
ULONG     ulUnicodeRange3        Bits 64-95
ULONG     ulUnicodeRange4        Bits 96-127
CHAR      achVendID[4]
USHORT    fsSelection
USHORT    usFirstCharIndex
USHORT    usLastCharIndex
SHORT     sTypoAscender
SHORT     sTypoDescender
SHORT     sTypoLineGap
USHORT    usWinAscent
USHORT    usWinDescent
ULONG     ulCodePageRange1       Bits 0-31
ULONG     ulCodePageRange2       Bits 32-63
SHORT     sxHeight
SHORT     sCapHeight
USHORT    usDefaultChar
[...]
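The identifying information in these tables can be read with a few struct calls. The following is a minimal sketch, assuming a single-font sfnt file (not a TTC collection) laid out as in the excerpts above: a 12-byte header, 16-byte table directory entries, 12-byte NameRecords, and the OS/2 fields at the offsets implied by the field list. The function names are hypothetical, and decoding non-Unicode platform strings as Latin-1 is a simplification.

```python
import struct

def table_directory(font: bytes) -> dict:
    """Map each table tag to (offset, length) using the sfnt table directory."""
    num_tables = struct.unpack_from(">H", font, 4)[0]
    tables = {}
    for i in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(">4sLLL", font, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (offset, length)
    return tables

def name_records(font: bytes):
    """Yield (platformID, encodingID, languageID, nameID, text) for each NameRecord."""
    base, _length = table_directory(font)["name"]
    _format, count, string_offset = struct.unpack_from(">HHH", font, base)
    for i in range(count):
        pid, eid, lid, nid, length, offset = struct.unpack_from(">6H", font, base + 6 + 12 * i)
        start = base + string_offset + offset
        raw = font[start:start + length]
        # Platforms 0 (Unicode) and 3 (Windows) store UTF-16BE; others are treated as Latin-1 here.
        codec = "utf-16-be" if pid in (0, 3) else "latin-1"
        yield pid, eid, lid, nid, raw.decode(codec, errors="replace")

def os2_metrics(font: bytes) -> dict:
    """Read the OS/2 fields most relevant to choosing a substitute font."""
    base, _length = table_directory(font)["OS/2"]
    version, _avg_width, weight, width, _fs_type = struct.unpack_from(">HhHHH", font, base)
    return {
        "version": version,
        "usWeightClass": weight,
        "usWidthClass": width,
        "sFamilyClass": struct.unpack_from(">h", font, base + 30)[0],
        "panose": tuple(font[base + 32:base + 42]),
        "ulUnicodeRange": struct.unpack_from(">4L", font, base + 42),
    }
```

Given the bytes of a font file, `[text for *_ids, text in name_records(data)]` lists every name the font declares for itself; nameID 1 is the family name and nameID 4 the full name, which are the strings a document's font reference would need to match.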

Microsoft Office Word 97-2007 Binary File Format (.doc) Specification, Page 140 of 210

b10  b16  Field        Type       Size  Bitfield  Comment
          wr           uns short  :4    01E0      Text wrapping mode:
                                                  0  like 2, but doesn't require absolute object
                                                  1  no text next to shape
                                                  2  wrap around absolute object
                                                  3  wrap as if no object present
                                                  4  wrap tightly around object
                                                  5  wrap tightly, but allow holes
                                                  6-15  reserved for future use
          wrk          uns short  :4    1E00      Text wrapping mode type (valid only for wrapping modes 2 and 4):
                                                  0  wrap both sides
                                                  1  wrap only on left
                                                  2  wrap only on right
                                                  3  wrap only on largest side
          fRcaSimple   uns short  :1    2000      When set, temporarily overrides bx, by, forcing the xaLeft, xaRight, yaTop, and yaBottom fields to all be page relative.
          fBelowText   uns short  :1    4000      1  shape is below text
                                                  0  shape is above text
          fAnchorLock  uns short  :1    8000      1  anchor is locked
                                                  0  anchor is not locked
22   16   cTxbx        long                       Count of textboxes in shape (undo doc only)

cbFSPA (count of bytes of FSPA) is 26 (decimal)

Word Document Font Information

b10  b16  Field      Type           Size  Bitfield  Comment
0    0    cbFfnM1    uns char                       Total length of FFN - 1.
1    1    prq        uns char       :2    03        Pitch request
          fTrueType  uns char       :1    04        When 1, font is a TrueType font
                     uns char       :1    08        Reserved
          ff         uns char       :3    70        Font family id
                     uns char       :1    80        Reserved
2    2    wWeight    short                          Base weight of font
4    4    chs        uns char                       Character set identifier
5    5    ixchSzAlt  uns char                       Index into ffn.szFfn to the name of the alternate font
6    6    panose     PANOSE
16   10   fs         FONTSIGNATURE
40   28   xszFfn                                    Zero terminated string that records name of font. Possibly followed by a second xsz which records the name of an alternate font to use if the first named font does not exist on this system. Maximal size of xszFfn is 65 characters.
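The field offsets and bit masks in the table above are enough to pull a font name out of one FFN record. The sketch below is illustrative only and is not the libwv code the authors used. It assumes little-endian layout, that xszFfn holds 16-bit (UTF-16LE) characters, and that ixchSzAlt counts characters with 0 meaning "no alternate"; the `Ffn` class and `parse_ffn` function are hypothetical names.

```python
import struct
from dataclasses import dataclass

@dataclass
class Ffn:
    prq: int          # pitch request (mask 0x03)
    true_type: bool   # fTrueType (mask 0x04)
    ff: int           # font family id (mask 0x70)
    weight: int       # wWeight
    chs: int          # character set identifier
    name: str         # first string in xszFfn
    alt_name: str     # alternate font name, "" if none

def parse_ffn(buf: bytes) -> Ffn:
    # Offsets 0-5: cbFfnM1, flag byte, wWeight, chs, ixchSzAlt.
    cb_ffn_m1, flags, weight, chs, ixch_sz_alt = struct.unpack_from(
        "<BBhBB", buf, 0)
    # Offsets 6-15 (panose) and 16-39 (fs) are skipped in this sketch.
    # ASSUMPTION: the name area starting at offset 40 is UTF-16LE.
    text = buf[40:cb_ffn_m1 + 1].decode("utf-16-le", errors="replace")
    name = text.split("\0", 1)[0]
    # ASSUMPTION: ixchSzAlt is a character index; 0 means no alternate name.
    alt = text[ixch_sz_alt:].split("\0", 1)[0] if ixch_sz_alt else ""
    return Ffn(
        prq=flags & 0x03,
        true_type=bool(flags & 0x04),
        ff=(flags & 0x70) >> 4,
        weight=weight,
        chs=chs,
        name=name,
        alt_name=alt,
    )
```

The record stores only names and coarse hints (family, pitch, character set, PANOSE), not the glyphs themselves, which is why a missing font forces a substitution.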

File Information Block

Decimal  Hex     Name    Type   Bitfield Size  Bitfield Mask  Comments                                 Introduced
0        0x0000  fc      long                                 File position where data begins.
4        0x0004  lcb     ulong                                Size of data. Ignore fc if lcb is zero.

The FCPGDOLD structure, referenced in the FIB, used internally by Word:

Decimal  Hex     Name    Type   Bitfield Size  Bitfield Mask  Comments                                 Introduced
0        0x0000  fcPgd   long                                 File position where data begins.
4        0x0004  lcbPgd  ulong                                Size of data. Ignore fc if lcb is zero.
8        0x0008  fcBkd   long                                 File position where data begins.
12       0x000C  lcbBkd  ulong                                Size of data. Ignore fc if lcb is zero.

The FCPGD structure, referenced in the FIB, used internally by Word. This modified version of the above structure was introduced in Word 2003:

Decimal  Hex     Name    Type   Bitfield Size  Bitfield Mask  Comments                                 Introduced
0        0x0000  fcPgd   long                                 File position where data begins.         Word 2003
4        0x0004  lcbPgd  ulong                                Size of data. Ignore fc if lcb is zero.  Word 2003
8        0x0008  fcBkd   long                                 File position where data begins.         Word 2003
12       0x000C  lcbBkd  ulong                                Size of data. Ignore fc if lcb is zero.  Word 2003
16       0x0010  fcAfd   FC                                   File position where data begins.         Word 2003
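Each fc/lcb pair in these tables locates one block of data: fc is the file position and lcb is the size, with fc ignored when lcb is zero. A minimal sketch of reading such a pair follows. It assumes little-endian 4-byte fields; the function names and the `stream` parameter (whichever byte stream the fc refers to) are hypothetical.

```python
import struct

def read_fc_lcb(fib: bytes, offset: int):
    """Read one (fc, lcb) pair: a signed long followed by an unsigned long."""
    fc, lcb = struct.unpack_from("<lL", fib, offset)
    return fc, lcb

def slice_data(stream: bytes, fc: int, lcb: int) -> bytes:
    """Return the bytes the pair points at; fc is ignored when lcb is zero."""
    if lcb == 0:
        return b""
    return stream[fc:fc + lcb]
```

A font extractor would use a pair like this to find the font table before parsing the individual font records inside it.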

Extracting Fonts from Documents

libwv (used in AbiWord), plus bug fixes for font identification

Custom C program (several hundred lines)

Walk document character-by-character

Process results with shell scripts

Match name strings (complete match) against font database
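The matching step can be sketched as follows. This is not the authors' C program or shell scripts; it is a small illustration that assumes each document has already been reduced to the set of font names it references, and that a file counts as "satisfied" only when every one of those names matches a name in the font collection exactly. The function names are hypothetical.

```python
from collections import Counter

def percent_satisfied(doc_fonts, available):
    """Percentage of documents whose fonts are all present in `available`.

    doc_fonts: list of sets of font-name strings, one set per document.
    available: set of font-name strings in the font collection.
    """
    if not doc_fonts:
        return 0.0
    # Complete match: set membership compares whole strings, case-sensitively.
    satisfied = sum(1 for fonts in doc_fonts if fonts <= available)
    return 100.0 * satisfied / len(doc_fonts)

def most_common_fonts(doc_fonts, n):
    """The n font names referenced by the largest number of documents."""
    counts = Counter(name for fonts in doc_fonts for name in fonts)
    return {name for name, _count in counts.most_common(n)}
```

For example, `percent_satisfied(docs, most_common_fonts(docs, 25))` estimates how many files would render without substitution if only the 25 most frequently referenced fonts were installed.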

[Chart: Percentage of Files Satisfied (y-axis, 30 to 100) vs. Total Fonts (x-axis: 10, 25, 50, 100, 200, 300, 400); series: Font Collection, XP+Office 2007, Times New Roman]

Documents collected using Glossaries

[Chart: Percentage of Files Satisfied (y-axis, 20 to 100) vs. Total Fonts (x-axis: 10, 25, 50, 100, 200, 300, 400); series: Font Collection, XP+Office 2007, Times New Roman]

.gov Documents

Windows Font Analysis Tool

Word Add-In

Logo Example

Cyrillic Font Problems

Old document (pre-Unicode) used the ASCII character range for Cyrillic.

Modern versions of “Glasnost Light” use Unicode pages properly. Old font is not available.

It is with profound regret that we inform you that Casady & Greene will close its doors on July 3rd, 2003,

Legacy substitution documentation identified Corel font “Czar”, but that’s not quite right: http://nwalsh.com/comp.fonts/FAQ/cf_33.htm

The Cyrillic Charset Soup: http://czyborra.com/charsets/cyrillic.html

http://support.microsoft.com/default.aspx?scid=kb;en-us;Q260162

WD2000: Incorrect Characters Appear When You Open Document in Earlier Eastern European Version of Word

Article ID: 260162 - Last Review: August 5, 2004 - Revision: 2.2

This article was previously published under Q260162

SYMPTOMS

When you open a Word document that was created using an earlier Eastern European version of Microsoft Word, the text of the document may appear as square boxes or other incorrect characters when you open the document in the English (U.S.) version of Microsoft Word.

CAUSE

This problem is caused by font-mapping and character-mapping problems. For example, this problem can occur when you create a document by using a font that is not installed on your computer.
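One way a character-mapping problem of this kind can arise is when bytes written under one code page are read under another. The sketch below illustrates that mechanism only; the KB article does not say which code pages were involved, so the choice of Windows-1250 (Central European) and Windows-1252 (Western European) is an assumption, and `misread` is a hypothetical helper.

```python
def misread(text: str, written_as: str = "cp1250", read_as: str = "cp1252") -> str:
    """Encode text with one code page and decode the bytes with another."""
    raw = text.encode(written_as)
    # Bytes with no meaning in the reading code page become U+FFFD.
    return raw.decode(read_as, errors="replace")

# ASCII letters survive; letters outside ASCII may be swapped for others.
print(misread("Łódź"))
```

The bytes in the file are unchanged in both cases; only the table used to interpret them differs, which is why the damage is silent.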

RESOLUTION

To correct these problems, Microsoft provides a Word Font Repair Macro that converts the fonts in the document (or the selected text) to Arial.

The following file is available for download from the Microsoft Download Center: Eefonts.exe (http://download.microsoft.com/download/word2000/eefonts/2000/w9xnt4/en-us/eefonts.exe) Release Date: 27-Jan-2000

For additional information about how to download Microsoft Support files, click the following article number to view the article in the Microsoft Knowledge Base:

119591 (http://support.microsoft.com/kb/119591/EN-US/) How to Obtain Microsoft Support Files from Online Services

Microsoft scanned this file for viruses. Microsoft used the most current virus-detection software that was available on the date that the file was posted. The file is stored on security-enhanced servers that help to prevent any unauthorized changes to the file.

MORE INFORMATION

To download and install the Word Font Repair Macro, follow these steps:

1. Click the Eefonts.exe link in the "Resolution" section of this article to download the Eefonts.exe file.
2. In the File Download dialog box, click to select Save this program to disk, and then click OK.
3. Change the Save in box to the folder where you want to save the Eefonts.exe file, and then click Save.
4. After the file has been downloaded, install the macro. To do this, follow these steps:
   a. Quit Microsoft Word 2000.
   b. Click Start, and then click Run.
   c. In the Run dialog box, click Browse.
   d. In the Browse dialog box, change the Look in box to the folder where you saved the Eefonts.exe file.
   e. Click to select the Eefonts.exe file, and then click Open.
   f. In the Run dialog box, click OK.
   g. In the Microsoft Word Font Repair Macro Setup dialog box, click Next.
   h. Click to select I accept the License Agreement, and then click Next.
      NOTE: If you do not select to accept the License Agreement, the Word Font Repair Macro will not be installed.
   i. To start the installation of the Word Font Repair Macro, click Next.

What can we do?

A better job of identifying “critical” font substitutions

Use Unicode range to assist language identification

Use language identification tools to distinguish human language from symbolic font use

Create better tools to facilitate quality assurance testing by isolating text with font substitutions

Create a clearinghouse of font identification and substitution information
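As an illustration of the Unicode-range idea, the sketch below checks whether a candidate substitute font claims coverage of the Unicode blocks a run of text actually uses, based on the font's OS/2 ulUnicodeRange1 bits. It is a sketch under stated assumptions, not a proposed tool: only four blocks are listed, the function names are hypothetical, and a set bit only means the font declares the range as functional.

```python
# (bit in ulUnicodeRange1, first code point, last code point, block name)
BLOCKS = [
    (0, 0x0000, 0x007F, "Basic Latin"),
    (1, 0x0080, 0x00FF, "Latin-1 Supplement"),
    (7, 0x0370, 0x03FF, "Greek"),
    (9, 0x0400, 0x04FF, "Cyrillic"),
]

def blocks_used(text: str):
    """Return the (bit, name) pairs of listed blocks that the text touches."""
    used = set()
    for ch in text:
        if ch.isspace():
            continue
        for bit, first, last, name in BLOCKS:
            if first <= ord(ch) <= last:
                used.add((bit, name))
    return used

def missing_blocks(text: str, ul_unicode_range1: int):
    """Names of blocks the text needs but the font does not declare."""
    return sorted(
        name
        for bit, name in blocks_used(text)
        if not (ul_unicode_range1 >> bit) & 1
    )
```

A non-empty result marks a substitution as a candidate "critical" one: the replacement font is unlikely to show the intended characters, so the text deserves manual review.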