US007133862B2
(12) United States Patent (10) Patent No.: US 7,133,862 B2 Hubert et al. (45) Date of Patent: Nov. 7, 2006
(54) SYSTEM WITH USER DIRECTED 5,940,614 A 8, 1999 Allen et al...... 717/120 ENRICHMENT AND IMPORTFEXPORT 6,092,074. A 7/2000 Rodkin et al. . ... 707/102 CONTROL 6,138,129 A * 10/2000 Combs ...... 707/6 6,161,124. A 12/2000 Takagawa et al...... TO9,203 (75) Inventors: Laurence Hubert, St Bernard du 6.256,623 B1 7/2001 Jones ...... 707/3 Touvet (FR); Nicolas Guerin, Grenoble 6,594,662 B1* 7/2003 Sieffert et al...... 707/10 (FR) FOREIGN PATENT DOCUMENTS (73) Assignee: Xerox Corporation, Stamford, CT (US) EP O 986 010 A2 3, 2000 EP 1 087 306 A2 3, 2001 *) Notice: Subject to anyy disclaimer, the term of this WO WOOOf 67159 A2 11/2000 patent is extended or adjusted under 35 WO WO O1/31479 A1 5, 2001 U.S.C. 154(b) by 582 days. OTHER PUBLICATIONS (21) Appl. No.: 09/683,237 U.S. Appl. No. 09/746,917 “Recommender System and Method”. (22) Filed: Dec. 5, 2001 filed Dec. 22, 2000. Alexa Internet, "Alexa What's Related for Netscape Navigator'. (65) Prior Publication Data available on the Internet Apr. 2001 at http://www.alexa.com/ prod Serv/netscape.html. US 2003 FOO612OO A1 Mar. 27, 2003 Alexa Internet, "Alexa Related Links for Microsoft IE', available on the Internet Apr. 2001 at http://www.alexa.com/prod serv/msie. Related U.S. Application Data html. Autonomy, Technology White Paper, available on the Internet Apr. (60) Provisional application No. 60/311,857, filed on Aug. 2001 at http://www.autonomy.com/echo/userfile/ 13, 2001. Technology White Paper.pdf. (51) Int. Cl. (Continued) G6F 7/3 (2006.01) Primary Examiner Jean M. Corrielus (52) U.S. Cl...... 707/3; 707/2; 707/101; Assistant Examiner—Isaac M. Woo 707/102; 715/513; 715/514; 715/517 (58) Field of Classification Search ...... 707/12, (57) ABSTRACT 707/34, 5.6; 715/500, 500.1, 501.1, 514, 71.5/515 A system for enriching document content using enrichment See application file for complete search history. themes includes a directed search service and an import (56) References Cited export service. The directed search service allows users to author documents while querying information providers U.S. PATENT DOCUMENTS using the directed searches that are inserted as part of the 5,297.249 A * 3, 1994 Bernstein et al...... 71.5/5011 authored documents. The import-export service enables 5,359,514 A 10, 1994 Manthuruthil et al...... TO4f10 meta-document exchanges between systems that provide 5.367.621 A 1 1/1994 Cohen et al...... 71.5/501 document enrichment by binding imported meta-documents 5,822,720 A 10, 1998 Bookman et al...... 7043 to identical or similar information providers. 5,905,988 A * 5/1999 Schwartz et al...... TO7 104.1 5,930,787 A 7, 1999 Minakuchi et al...... 707/4 20 Claims, 69 Drawing Sheets
530
artiser
france Cklon-Premiuri C Lawsuity sightley CCyrighted Natorylighted Free Fad bix price: laxanum timber of words:
acat Resis: ent Selectors: & Aer current selection Xeroxin the past five 3afare Cutant Selection yearserroregrowth-1 has seen O As CommentTo current Selection O As footnote To Current Selection US 7,133,862 B2 Page 2
OTHER PUBLICATIONS Pazzani, M., Billsus, D., "Learning and Revising User Profiles: The Identification of Interesting Web Sites'. Machine Learning 27. Autonomy, ActiveKnowledge(TM), available on the Internet Apr. 2001 at http://www.autonomy.com/echo/userfile/ Kluwer Academic Publishers, pp. 313-331, 1997. PB Autonomy ActiveKnowledge(02.01).pdf. Thorsten Joachims. “A probabilistic analysis of the Rocchio algo Cooper et al. in “A Simple Question Answering System”, published rithm with TFIDF for text categorization”. Technical Report CMU in proceedings of the Ninth Text REtrieval Conference (TREC-9) CS-96-118, School of Computer Science, Carnegie Mellon Univer held in Gaithersburg, Maryland, Nov. 13-16, 2000. sity, Mar. 1996. "searchandfind, Inxight Thing Finder SDK'. Inxight Software, Inc., Yoelle S. Maarek, Israel Z. Ben Shaul. “Automatically Organizing ThingFinder SDK Data Sheet, available on the Internet Apr. 2001 at Bookmarks per Contents”. Fifth International World Wide Web http://www.inxight.com/pdfs/products/tf Sdk ds.pdf. Conference, May 6-10, 1996, Paris, France. Jon Kleinberg, "Authoritative sources in a hyperlinked environ ment”, Technical Report RJ 10076, IBM, May 1997. Zapper Tour, Zapper Technologies Inc., available on the Internet Kenichi Kamiya, Martin Röscheisen, and Terry Winograd. Apr. 2001 at http://www.zapper.com/tour/tour fr.html. “Grassroots: A System Providing A Uniform Framework for Com Walter Mossberg, “New Windows XP Feature Can Re-Edit Other municating, Structuring, Sharing Information, and Organizing Sites'. The Wall Street Journal, Jun. 7, 2001. People”. Fifth International World Wide Web Conference, May Luis Gravano and Hector Garca-Molina, “GIOSS: Text-Source 6-10, 1996, Paris, France. Discovery over the Internet”. ACM Transactions on Database A. McCallum, K. Nigam, “Text Classification by Bootstrapping Systems, 1999. with Keywords'. EM and Shrinkage, ACL Workshop for Unsuper vised Learning in Natural Language Processing, 1999. Bradley Rhodes and Pattie Maes, “Just-in-time information retrieval Projects, “Watson', available on the Internet Apr. 2001 at http:// agents', IBM Systems Journal special issue on the MIT Media dent.infolab.nwu.edu/infolab projects/project.asp?ID=5. Laboratory, vol. 39, Nos. 3 and 4, 2000 pp. 685-704. Budzik, J., and Hammond, K. J., “Watson: Anticipating and Scott Deerwester, Susan Dumais, Goerge Furnas, Thomas Contextualizing Information Needs”. In Proceedings of the Sixty Landauer, and Richard Harshman. “Indexing by latent Semantic Second Annual Meeting of the American Society for Information analysis”. Journal of the American Society for Information Science, Science. Information Today, Inc., Medford, NJ. 1999. 41(6):391-407, 1990. Budzik, J., Hammond K., and Birnbaum, L. “Information access in "Zapper Technology Brief, available on the Internet Apr. 2001 at context', available on the Internet Apr. 2001 at http://dent.infolab. nwu.edu/infolab projects/project.asp?ID=5. www.Zapper.com. Paul Festa, “Amazon pops into consumers reviews with ZBubbles'. "Zapper Technologies Announces Release of Zapper 2.0', available Nov. 17, 1999, available on the Internet at http://news.cnet.com/ on the Internet Apr. 2001 at www.zapper.com. news, 0-1007-200-1452140.html. “Inxight Thingfinder'. RubberTreePlant.co.uk KM Software John Lamping, Ramana Rao, and Peter Pirolli. “A focus+context Review, Apr. 3, 2001. technique based on hyperbolic geometry for visualizing large hier “Autonomy UpdateTM'. Read Between The Lines available on the archies', in Proceedings of the ACM SIGCHI Conference on Internet Apr. 2001 at www.autonomy.com. Human Factors in Computing Systems, ACM, May 1995. “Autonomy ServerTM'. Read Between The Lines available on the Inxight Software, Inc. "Star Tree Walk-Through Tour', available on Internet Apr. 2001 at www.autonomy.com. the Internet Apr. 2001 at http://www.inxight.com/products/ star tree?walk thru? index.html. “Smart Tags Overview', available on the Internet Jun. 2001 at “Network Tools Zapper Windows 95/98/2000/NT, IE 4.0+”. The www.microsoft.com. Scout Report, A Publication of the Internet Scout Project Computer Shi-Kuo Chang and Taieb Znati. “Adlet: Active Document for Sciences Department, University of Wisconsin-Madison, vol.7. No. Adaptive Information Integration', available on the Internet Jul. 4. Jun. 9, 2000 available on the Internet at http://scout.cs, wisc.edu/ 2001 at http://www.cs.pitt.edu/~chang/365/adlet.html. report/sr/2000/scout-000609.html#17. P. Ipeirotis, L. Gravano, and M. Sahami, “Probe, Count, and Netscape's “What's related service FAQ', available on the Internet Classify: Categorizing Hidden-Web Databases”. Proceedings of the Apr. 2001 at http://home.netscape.com/escapes/related/facq.html. 2001 ACM SIGMOD International Conference On Management of OpenCola Folders, Product and Technology Overview, available on Data, 2001. the Internet Apr. 2001 at http://www.opencola.com/products/ 1 folders. * cited by examiner U.S. Patent Nov. 7, 2006 Sheet 1 of 69 US 7,133,862 B2
|0|| U.S. Patent Nov. 7, 2006 Sheet 2 of 69 US 7,133,862 B2
&~~~ 922 U.S. Patent Nov. 7, 2006 Sheet 3 of 69 US 7,133,862 B2
8 IZ0
amarassuremensuuuuuusseausaurusammlunes
r------
NEWID00I INHIN0D
U.S. Patent Nov. 7, 2006 Sheet 4 of 69 US 7,133,862 B2
amus usuusaasaasmus-amunusuama-uuussmusses assemus
ZOI
U.S. Patent US 7,133,862 B2
lo
L
U.S. Patent Nov. 7, 2006 Sheet 6 of 69 US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 7 of 69 US 7,133,862 B2
S S
s r E geE 5
S U.S. Patent Nov. 7, 2006 Sheet 8 of 69 US 7,133,862 B2
PRINTER NAME: ss WBOUILANTETAE v PROPERTIES it is posis. Pisa 84 WHERE AEGRENSEREERico-PASSIHR) PRINT TO FILE COMME, KERON225 B&WDUPLE-ECHAEAU-2NDFLOOR PAGERANGE O All O CURRENT PAGE O SELECTION o PAGES: C ENERPAGENABERS AND/OR PAGERANGES SEPERATED BY COMAS, FOREXAMPLE 35-12 3O4 ENRCHENT (e) PERSONALY. BUSINESS v O SELECA PERSONALY FOR ME APPLY ENROHAENTO INSERTENRICHMENTAS O ALL O LINKS O CURRENTPAGE O SELECTION O CONTENT o PAGES.
808 OOA 802 80 PRINT WHATEOUMENT v. PAGESPERSHEET. PRINT ALPAGESINRANGE v. SCALE TOPAPERSIZENOSA.Gv. OPTIONS
80 U.S. Patent Nov. 7, 2006 Sheet 9 of 69 US 7,133,862 B2
SPRINT DOCUMENT PROPERTIES LAYOUT PAPERQUALITY ENRICHMEN 90e PERSONALT, BSNESS v OSELECTPERSONALY FORME 900
PROVIDE ENRCHMENT BY INSERTENRICHMENTAS 90 C PRINTING WITH DOCUAENT (e) LINKS O STORING ATSERVER O CONTEN
WHENSTORINGENRICHMENTATSERVER PUREPON 908 96 C EMAIL AE WHENCOMPLETED C AE/ ADDRESS. C. O AT PRINTSERVE O DO NOTNOTF ME
FIG. 9 U.S. Patent Nov. 7, 2006 Sheet 10 of 69 US 7,133,862 B2
8,10|| s??IITTIIN?E?TERIT!
døR
y10|| 910|| U.S. Patent Nov. 7, 2006 Sheet 11 of 69 US 7,133,862 B2
O4 ARCHITECTURE BUILDINGS DESIGN 7 HOMES O2 BRICKHOMES a HAY BALE HOMES LOGHOMES TIMBER HOMES 8. TRE HOMES
FIG. T. 7 U.S. Patent Nov. 7, 2006 Sheet 12 of 69 US 7,133,862 B2
270
PROPERTES FOR USERSMITH
22 e)EFAULT O RECOAEND PERSONALITY FORME 230 PROPAGAE ENRICHMENT INSERENRICHMENTAS BETENDENIS LINK-2 O NO cost-' 1226 O YES O ATACAYDEERAE ONKORNSERTONTENT WER DOCUMENTENRICHED NOTFY AEB 205 AUTOMATICALLY COMPLEE 'Vici MESSAGE T206 CITATIONS: EMAIl-2 6 NC - o SMSTEXT MESSAGE o YES, DEPTH: 4 ORGANIZEMYPERSONALITIES 1240 (REATEMODIFYKY PERSONALTE
FIG. 72 U.S. Patent Nov. 7, 2006 Sheet 13 of 69 US 7,133,862 B2
8|
818|| U.S. Patent Nov. 7, 2006 Sheet 14 of 69 US 7,133,862 B2
HId}{1135
U.S. Patent Nov. 7, 2006 Sheet 15 Of 69 US 7,133,862 B2
CREATE/MODIFY MYPERSONALTES NAE, Boys - CREATEIMODIFY AGROUP OFEXISTING PERSONALITIES 56
USESELECTEDDOCUMENTS AND/ORFOLDERS REAEANEWPue--so PERSONALY
USESELECTEDFILEIWEBSITE TO CREATE ANEW PERSONALTY Brows
REATEQUESTIONFORM USENG PERSONATYLES
56 Ya KNOWLEDGE O NOVICE EVE: O EPERT - OK I (ANCE FIG. 75 U.S. Patent Nov. 7, 2006 Sheet 16 of 69 US 7,133,862 B2
600
CREATEIMODIFY AGROUP OFEXISING PERSONALTES r NAME: TECHNOLOGY WATCH IN ORGANICCHEMISTRY OPERATION O MERGE O2 O SUBTRAC-60
PERSONATYA w
WITH FROM
PERSONALITY w
FIG. T6 U.S. Patent Nov. 7, 2006 Sheet 17 of 69 US 7,133,862 B2
RECEIVESPECIFIEDDOCUMENTS) AND/ORFOLDERS) WITH DOCUMENTS) 702 TODEFINEAEWEL N=O DOCUMENTSET Extractalunks from Televil Documents - 7
FECONTENT FETRACTEDLINKS AND DEFINEFETCHED 706 CONTENTASLEVEN-DOCUMENTSET
DESCEND ONE MORELEVEL 708 NO DEFINE ALL THE NLEVELOCUMENTSES DEFINEDAT 702AND 706 TO 7709 BEAN EXPANDED DOCUMENT
CONSTRUCTENTTYDATABASEUSING THE EXPANDEDDO(UAENT 77O IDENTIFYEACHFORMINTHE EXPANDEDDOCUMENT-72
CREATE ASERVICE FOREACHFORMIDENTIFIEDIN THE EXPANDEDDOCUMENT -74
FIRE CREATED SERVES SINGENTY DATABASE 776
DEFINEA PERSONALY SING FILTERED SERVICES ANDENTITYDATABASE 77.8 U.S. Patent Nov. 7, 2006 Sheet 18 of 69 US 7,133,862 B2
908|| 8081
U.S. Patent Nov. 7, 2006 Sheet 19 of 69 US 7,133,862 B2
SCIENTIFICCONTENT PROVIDER (SCP);
SEARCHIN: 902 ALL FOLDERS 7903 PHYSICSFODER 904 CHEMISTRY FOLDER 905 BIOLOGY FOLDER
FIG. 79 U.S. Patent Nov. 7, 2006 Sheet 20 of 69 US 7,133,862 B2
(805|| TOESOEDD||SWIWIT?SI?LÐSLIOLLSWITTNEÐTHQIMORIEN?EIINEDSEDIISRIRISMERJET T?SITTEVÕTTEÏNETEEl|| U.S. Patent Nov. 7, 2006 Sheet 21 of 69 US 7,133,862 B2
776 174 ------
WHERE THE ENTITYDATABASE INCEDES FREQUENCY OF OCCURRENCE OF EACHENTITY
243 244
COMPUEENTITY FREQUENCY"E OF NEXTENTITY INSERVICE DATABASE OF NEXTSERVICE (E, CP DBZEN1TY FREQUENCY, WEIGHT->)
ASTSERVICE IN LIST -24. YES AEASURESEMLARITYBEWEEN FREQUENCY OF ENTIESN SERVICES(LE.E.E.) AND THE EXPANDEDDOCUMENT 252 U.S. Patent Nov. 7, 2006 Sheet 22 of 69 US 7,133,862 B2
- EXPANDEDDOCUMENT ph SERVICEA SERVICEB
FREQUENCY U.S. Patent Nov. 7, 2006 Sheet 23 of 69 US 7,133,862 B2
in yis - as s m m im as a up m m is in us m -
2355 2356 GETNEXTSERVICE IN LIST OF SERVICES
2357
GETNEXTENTY IN ENTYDATABASE
FORMULATE QUERY FORSERVICEUSING NEXTENTITY
SUBMITQUERY TO SERYE 2360
NO
SUMMEASURESIMLARTY OFTOPN RESULTSFROMSEARCH ANDONTEX INFORMATION RELATED TO THESELECTEDENTITY YES
SUM GOODNESS FOREACHENTITY ACROSS ALL SERVICES
23.63 2362 LASTSERVICE NO YES 2364 SELECTNSERVICES) WITH THE HIGHEST GOODNESS MEASURE TOSPECIFY FELTERED SERVICE
wn will 4...... * ...... U.S. Patent Nov. 7, 2006 Sheet 24 of 69 US 7,133,862 B2
2402 RECEIVE UESTION
2404 DETERMINETYPE OFESTION
24O6 ONVERTESTION TOQUERY
24.08 SUBTQUERY TOINFORMAONSERVICE
24 O ERACTPASSAGES FROM OPNRESS
COMPUTE PART OFSPEECHTAGSANDSHALLOWPARSE PASSAGES 242
244 CALCULATE WEIGHT OFREEVANCE FOREACHWORD INEACH PASSAGE USINGTYPE OFUESTION ANDUESTION
SELECTSENTENESORPART OF SENTENCES 246 WHORDSHAYINGHIGHESCOMPUTED WEIGHTOFREEWANCE ASPROPOSED ANSWERS TO THE QUESTION FIG. 24
U.S. Patent Nov. 7, 2006 Sheet 27 Of 69 US 7,133,862 B2
270 STOCK QUOTEOPTIONS
PROPERTIES UPDATEFREGUENCY EVERYDAY v SERFRAS: ASFOOTNOTE YO DETERMNEonly
APPLYSERVEEHAYORTO
CAL DOUMENTS g (URRENT DOCUMENT O SELECTION ONY
FIG. 27 U.S. Patent Nov. 7, 2006 Sheet 28 of 69 US 7,133,862 B2
108Z 8082 Z08%
008?
HIWIONNYDJMINH4OH&
#7082 U.S. Patent Nov. 7, 2006 Sheet 29 of 69 US 7,133,862 B2
WAT FORNEXDOCUMENTACCESSED BYUSER 2902 ) DEFINEANACESSED OMENT 2904 ENRCHIHEACCESSED DOCUMENT WITHENTITIES IN INTERACION HISTORY O DEFINEAD0CARENTWTH PROPAGATEDENRICHAENT U.S. Patent Nov. 7, 2006 Sheet 30 of 69 US 7,133,862 B2
DEFINEASETOFRULES FORIDENTIFYINGENTITES 30O2 3004 DENTFYENTIES IN THE ACCESSEDDOCUMENTUSING THESETOFRULES
3006 FILTERIDENTFEDENTIES 3008 FILTERENTTES FROM THEDENTIFIEDENTITIES
3OTO UPDATENTERACTION HISTORY WITH THESE OF ENITIES
U.S. Patent Nov. 7, 2006 Sheet 31 of 69 US 7,133,862 B2
ea is m n n is an an a a man m
8918: U.S. Patent Nov. 7, 2006 Sheet 32 of 69 US 7,133,862 B2
qeM,!185,018 U.S. Patent US 7,133,862 B2
8IZ8
|× mae| U.S. Patent US 7,133,862 B2
s
U.S. Patent US 7,133,862 B2 0098 /Z098 U.S. Patent US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 37 Of 69 US 7,133,862 B2
8i98
08:48, 3.Tf|ClOW50NINAWET
U.S. Patent Nov. 7, 2006 Sheet 38 of 69 US 7,133,862 B2
0188 ZZ88 OZ88 8088 802
OZ98
==--r------?k=<== 0198 U.S. Patent Nov. 7, 2006 Sheet 39 Of 69 US 7,133,862 B2
08:68 0i68:
Z068
U.S. Patent Nov. 7, 2006 Sheet 40 of 69 US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 41 of 69 US 7,133,862 B2
sae;soir
Z||0 U.S. Patent Nov. 7, 2006 Sheet 42 of 69 US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 43 of 69 US 7,133,862 B2
U.S. Patent US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 46 of 69 US 7,133,862 B2
4602 RECEIVE REGUEST FOR AUTO-COAPLETONIDENTIFYING PARTANPTSTRING ANDARGEDOCUMEN 4604 EXTRATCONEX INFORMATIONSURROUNDING PARTALINPUTSTRING IN THETARGETDOCUMENT 4606 FORMAEUERY USINGETRACTED CONTET NFORMATION AND PARTALENPUTSTRING 4608 SUBATUERYOINFORMATION RETREAL SYSTEM 46 RANKRESUSFROINFORMATION RETREWALSYSTEM 6 2 DISPLAYOP RANKED RESULS 4674 RECEIVEDACCEPTANCE AND DENIFECATION YES OFREQUESTED AUTO-COMPLETON 462O REMOVE TOPRANKED RESULTS) AUTO-COMPLETEREQUEST WITH SELECTED REST
ALTERNATERESISDESRED
NC
FIG. 46 U.S. Patent US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 48 of 69 US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 49 of 69 US 7,133,862 B2
U.S. Patent Nov. 7, 2006 Sheet 51 of 69 US 7,133,862 B2
------540 SO TOKENIZETE OBJECTO ALIST OF WORDS YES 542 ASWORD ENLS NO GETNETWORD NLS 544
NO 546
548
550 RANKRESULTS OF QUERY
552 SELECTHEGHESTRANKEDRESULT OF QUERY
554 DOES ONE OF HIGHESTRANKED RESULT OF QUERY SASFIESEWA AIONCRTERA
YES 5758 (ORRECTWORD INTETOBJETUSINGSEECTED RESULTSFOR QUERYTHA SATISFY EVALUATION CRITERA (NOTE CORRECTION MODIFIES INFORMATION SPACE) U.S. Patent Nov. 7, 2006 Sheet 52 of 69 US 7,133,862 B2
•
---?------•••??-)
- a -w- mns as a a a son - - - was same --- or as as an un U.S. Patent Nov. 7, 2006 Sheet 53 of 69 US 7,133,862 B2
... it
C Low Cusly (CHigh quality C Copyrighted ONot Copyrighted 53.8 OFree Paid Max prices: E
cation CfResults: © After current Selection 534 O Before Current Selection O As Comment To Current Selection O As Footnote To Current Selection
U.S. Patent Nov. 7, 2006 Sheet 55 of 69 US 7,133,862 B2
5500
IMPORT/EXPORT
EXPORT: 5506 g ORIGINAL DOCUMENTCONTENT AAR ge DOCUMENT MARKUPE, ENRICHMENT ADVANCEDEPORT.CONTROL O PERSONALITIES 5504
O IMPORT NAME:
IMPORT O) AL CONTENT BINDUN-MATCHEDEEENTS O SELECTEDCONTENT ONLY O ORIGINAL DOCUMENTCONTENT ENACE BOUND O DOCUMENT MARKUP (I.E., ENRICHMENT) ELEMENTS WITH EXISTING O PERSONATES SERVICE PROVIDERS
F.G. 55 U.S. Patent US 7,133,862 B2
U.S. Patent US 7,133,862 B2
U.S. Patent US 7,133,862 B2
889;"SOMH U.S. Patent Nov. 7, 2006 Sheet 60 of 69 US 7,133,862 B2
META-DOCUMENT 5902 DOCUMENTD PERSONATY 594 REFERENCE
DOCUAENT 59 CONTENT PERSONALTY REFERENCE
DOCUMENT ANNOAONS SERVICE RECEST REFERENCE
O 5973 () ARKUP OFDOCUMENT SERVICE REQUEST CONTENT REFERENCE AND ANNOATONS
FIG. 59 U.S. Patent Nov. 7, 2006 Sheet 61 of 69 US 7,133,862 B2
SERVICE REQUEST 6000
O O 593
SERYICE REQUEST REFERENCE U.S. Patent Nov. 7, 2006 Sheet 62 of 69 US 7,133,862 B2
-- oESCRIPTION, BUSINESSAMAYSS 602 PROPERTIES 2 REFRESHPERIOD SO4 NOTFCATION METHOD
s oLIST OF SERVICE PROVIDERS (KEYWORD/KEY s YAHOOCKEYWORD) c HOOVERSKEY 2 2 oSTOFDICTIONARYSTRAEGES L - PEOPENAMES - NORALIZEPLURAS -- BUSINESSNARES o SOFDICTIONARES PEOPLE NAMES BUSINESS NAMES F.G. 6 U.S. Patent US 7,133,862 B2
OOZ9 U.S. Patent US 7,133,862 B2
Z1890.18990890089
MHIATYROIOTIOSINHWR)00No..1
80$9 U.S. Patent Nov. 7, 2006 Sheet 65 of 69 US 7,133,862 B2
PALOALTORESEARCH CENTER PAOAO RESEARCH CENTER 6400 EED COMPUTER SCIENCETERASDEFINITION ED COMPUTER SCIENCE ARTICLES Kiss SEARCHENGINE 64O2 GOOGLECATEGORIED 6406 GOOGLECKY 8408
F.G. 64 U.S. Patent Nov. 7, 2006 Sheet 66 of 69 US 7,133,862 B2
PARC 65OO D COMPUTER SCIENCETERMS DEFINITION ED COMPUTERSCIENCE ARTICLES SEARCHENGINE KF ORIGINAL LINK 65O2 OPENINDocumentSouls OPENIN NEW WINDOW F.G. 65 U.S. Patent Nov. 7, 2006 Sheet 67 of 69 US 7,133,862 B2
MBIMINJWHOWNYWONIHOISITIOSINGWIDOJ? M00NIMSH1350T) ? 9099 Þ099 Z099 U.S. Patent US 7,133,862 B2
SNOLld09NIMHWWI?OSINAWI1300QT
0029 U.S. Patent Nov. 7, 2006 Sheet 69 of 69 US 7,133,862 B2
??£H ######################################################################## «BIDIT,5?ILJUTISKIE?III knowledge management: are assigned to the same assignee as this patent application, anticipatory services, unstructured information manage (c) are incorporated in this patent application by reference, ment, and visualization of information and knowledge. and (d) claim priority to U.S. Provisional Patent Application Watson, for example, from the InfoLab at the University of Serial No. 60/311,857, filed Aug. 13, 2001: U.S. patent Northwestern, is a program which operates while a user is application Ser. No. 09/683.238, entitled “Meta-Document creating a document. Watson retrieves information as the Management System With Personality Identifiers'; U.S. user works, from which the user can select for further patent application Ser. No. 09/683.239, entitled “Meta investigation. Information retrieved by Watson comes from Document Management System With Document identifi a service provider, and Watson stores the retrieved informa ers'; U.S. patent application Ser. No. 09/683,240, entitled 25 tion in memory associated with Watson. "Meta-Document Management System With Transit Trig Also, Autonomy.com's ActiveKnowledgeTM analyzes gered Enrichment”; U.S. patent application Ser. No. 09/683, documents that are being prepared on the user's computer 235 (now U.S. Pat. No. 6,778,979), entitled “System For desktop and provides links to relevant information. In addi Automatically Generating Queries”; U.S. patent application tion, online services such as Alexa.com, Zapper.com, and Ser. No. 09/683.241 (now U.S. Pat. No. 6,928,425), entitled 30 FlySwat.com suggest links that are relevant to the content “System For Propagating Enrichment Between Docu currently viewed highlighted in a browser window. The ments”; U.S. patent application Ser. No. 09/683,242 (now Suggested links appear in an additional window inside or U.S. Pat. No. 6,820.075), entitled “Document-Centric Sys separate from the current browser window. These services tem. With Auto-Completion And Auto-Correction”; U.S. treat documents as static objects. Specifically, using Zap patent application Ser. No. 09/683,236 (now U.S. Pat. No. 35 per.com’s engine, when a user right clicks on selected text, 6.732,090), entitled “Meta-Document Management System words Surrounding the selected text are analyzed to under With User Definable Personalities. stand the context of the search request, and to reject pages BACKGROUND OF INVENTION that use those words in a different context. 40 Various products, such as commercial information 1. Field of the Invention retrieval systems, provide unstructured information, Such as The invention relates generally to the management and web pages, documents, emails etc. (which content may use of documents, and in particular, to improved manage consist of text, graphics, video, or audio). Typical manage ment and use of documents which may act as agents, ment services for unstructured information include: Search generating requests for information, then seeking, retrieving 45 and retrieval; navigation and browsing; content extraction, and packaging responses to enrich the documents while topic identification, categorization, Summarization, and facilitating reading comprehension, understanding relation indexing; organizing information by automatic hyperlinking ships with other documents, and content creation. In par and creation of taxonomies; user profiling by tracking what ticular, this invention relates to a directed search service and a user reads, accesses, or creates create communities; etc. an import-export service. 50 For example, Inxight's parabolic tree is an example of a 2. Description of Related Art system that organizes unstructured information and presents Knowledge management through document management it in an intuitive tree-like format. forms an important part of the knowledge creation and Furthermore, it is known how to embed executable code sharing lifecycle. A typical model of knowledge creation and in documents to perform certain functions at specified times. sharing is cyclical, consisting of three main steps: synthe 55 For example, European Patent Applications EP 0986010A2 sizing (search, gather, acquire and assimilate), sharing and EP 10873.06 A2 set forth different techniques in which (present, publish/distribute), and servicing (facilitate docu to define active documents (i.e., documents with embedded ment use for decision making, innovative creativity). executable code). More specifically, these publications set Most systems consider documents as static objects that forth that executable code within the document can be used only acquire new content when acted upon by an authorized 60 to control, Supplement, or manipulate their content. Such user. A user's decision to read and modify a document, or to active documents are said to have active properties. run a program on it which may change its contents (for Notwithstanding these existing methods for statically and example, by adding hyperlinks), is needed for the document actively enriching document content, there continues to exist to acquire new information. This view of the document as a a need to provide an improved document enrichment archi passive repository leads to the current situation in which 65 tecture that allows ubiquitous use of document enrichment documents remain static unless a user is in front of the services. Such an improved document enrichment architec screen piloting the system. OpenCola FoldersTM offers one ture would advantageously provide methods for facilitating US 7,133,862 B2 3 4 the use of Such services by automatically attaching, moni FIG. 7 illustrates a tag reader for receiving document toring, and Suggesting such services for users. identifiers from a mobile computing device or tag associated with a particular object; SUMMARY OF INVENTION FIG. 8 illustrates a client interface for invoking a print command at a computer with enrichment selections; In accordance with one aspect of the invention there is FIG. 9 illustrates a properties interface for the client provided, a system for enriching a document. The system interface shown in FIG. 8: includes a document editor for authoring the document and FIG. 10 illustrates a client interface for accessing the a directed search service for specifying a directed search meta-document server shown in FIG. 2; while the document is authored with the document editor. 10 FIG. 11 illustrates a blow up of the window 1014 shown The directed search service includes search criteria and in FIG. 10 for an architecture personality in which hay bale result parameters that may be specified by a user. The result homes and tire homes personalities are selected; parameters include information provider parameters, loca FIG. 12 illustrates an example of a properties window tion parameters and form parameters. The information pro 1210 that is displayed when the properties configuration vider parameters identify one or more information providers 15 button 1022 is selected in FIG. 10; to perform the directed search and provide search results. FIG. 13 illustrates one embodiment of a client interface The location parameters identify where in the document the for creating and/or modifying personalities; search results are to be inserted. The form parameters FIG. 14 illustrates a client window for specifying prop specify a form in which the search results are to be inserted erties of searches performed at the search engine defined in into the document. Once the directed search is specified, the FIG. 13; directed search service inserts the directed search in the FIG. 15 illustrates another embodiment of a client inter document. face for creating and/or modifying personalities; In accordance with another aspect of the invention, the FIG. 16 illustrates a client interface for creating and/or system for enriching a document includes a user interface modifying personalities by performing operations to groups for specifying a directed search while authoring the docu 25 of personalities; ment. The user interface has means for specifying search FIG. 17 is a flow diagram illustrating steps for generating criteria and result parameters of the directed search. A a personality; meta-document server communicates with the user interface FIG. 18 illustrates an example of an expanded document for carrying out the directed search. In addition, the system 1800, developed by descending two levels; includes means for exporting the document for import at 30 FIG. 19 illustrates a form that can be used to create other meta-document servers. services; In accordance with yet another aspect of the invention, FIG. 20 illustrates four services that can be generated there is provided a method, and document enrichment sys using the form shown in FIG. 19: tem therefor, for creating at a meta-document server a new FIG. 21 is a flow diagram that depicts one method for meta-document from an exported meta-document. The 35 filtering services at act 1716 in FIG. 17: method includes: (a) adding services in the new meta FIG.22 illustrates a graphical representation of a selection document available at an importing meta-document server process for selecting services with the highest similarity with services that map exactly to a predefined categoriza measure; tion; and (b) for those services not added at (a) but specified FIG. 23 is a flow diagram that depicts another method for in the exported meta-document, adding services in the new 40 filtering services at act 1716 in FIG. 17: meta-document available at the importing meta-document FIG. 24 is a flow diagram that depicts one embodiment for server with services that partially map to a predefined identifying an answer of an instantiated question; categorization, and have at least one dictionary and one key FIG. 25 illustrates an example list of services available in common. when an e-learning personality is selected to enrich docu 45 ment content; FIG. 26 illustrates an example list of services available BRIEF DESCRIPTION OF DRAWINGS when a language learning personality is selected to enrich document content; These and other aspects of the invention will become FIG. 27 illustrates a client interface for selectively speci apparent from the following description read in conjunction 50 fying personality and/or service behaviors to entities recog with the accompanying drawings wherein the same refer nized in specified content or documents; ence numerals have been applied to like parts and in which: FIG. 28 illustrates a client interface for specifying differ FIG. 1 is a schematic of a meta-document according to ent modes for determining when to annotate an identified one embodiment of the invention; entity; FIG. 2 illustrates a block diagram of a system incorpo 55 FIG. 29 is a flow diagram that sets forth the steps for rating a meta-document server; propagating enrichment between electronic documents; FIG. 3 is a schematic of meta-document enrichment FIG. 30 is a flow diagram for creating and updating an interaction history that are performed at act 2912 in FIG. 29: according to one embodiment of the invention; FIG. 31 is a flow diagram for identifying what entities to FIG. 4 illustrates an example of meta-document enrich 60 markup at act 3008 in FIG. 30: ment as illustrated in FIG. 3; FIG. 32 illustrates the propagation of enrichment between FIG. 5 illustrates an electronic identification tag having a accessed documents; specified personality that is affixed or positioned proximate FIG. 33 illustrates an interaction history; to a physical object; FIG. 34 illustrates the manner in which to apply pairs of FIG. 6 illustrates an embodiment in which a hardcopy 65 entities and in addition identify third party entities; document has encoded thereon a personality identifier in FIG. 35 illustrates entity types organized hierarchically: embedded data; FIG. 36 illustrates a text categorizer; US 7,133,862 B2 5 6 FIG. 37 illustrates a personality recommender; DETAILED DESCRIPTION FIG. 38 illustrates the elements and flow of information for generating a query: Outline Of Detailed Description: FIG. 39 illustrates an example of a query contextualized using classification labels of document categorization hier 5 A. Definition Of Terms archy; B. General Features FIG. 40 is a flow diagram which depicts one embodiment B.1 The Knowledge Management Cycle in which both categories and aspect vectors can be used to B.2 Services improve the accuracy of an information retrieval system; B.3 Personalities FIG. 41 illustrates a client interface similar to the client 10 B.4 Methods For Identifying and Using Entities interface that illustrates an augmented query that can be C. Ubiquitous Personalities performed using a recognized entity; C. 1 Personality and Service Tokens FIG. 42 illustrates an information space that surrounds C.2 Personalities Identified By Location meta-document (i.e., a meta-document information space); C.3 Transit Triggered Enrichment FIG. 43 illustrates an auto-completion module that oper 15 D. Creating and Modifying Personalities ates with a text editor and the meta-document information D.1 Generally Space; D.2 Using An Algebra FIG. 44 illustrates an alternate embodiment in which an D.3 Using A List Of Links auto-completion module operates integrally with elements D.4 Using Predefined Personalities and Knowledge Lev of the meta-document server shown in FIG. 2; els FIG. 45 is a flow diagram for creating and updating an D.5 Using Information Extraction Techniques entity database dynamically from the document information D.6 Using Learning Personalities Space; E. User Controlled Enrichment FIG. 46 illustrates a flow diagram for selecting words E.1 Automatically Inserting and/or Linking Content using the auto-completion system shown in FIG. 44; 25 E.2 Propagating Enrichment Between Documents FIG. 47 illustrates an example of the auto-completion E.3 Automatically Completing Citations process performed using the auto-completion entity data E.4 Combining Or Intersecting Entities base presented in FIG. 48: E.5 Using Entity Types Defined In A Hierarchy FIG. 48 illustrates an example of an auto-completion F. Text Categorization and Related Services and Utilities entity database; 30 F.1 Text Categorizer FIG. 49 illustrates a document-centric auto-correction F.2 Recommending Personalities system that iteratively corrects errors in meta-document F.3 Generating Queries Using Identified Entities using a meta-document information space; F.4 Finding An Expert For An Enriched Document FIG. 50 is a flow diagram for performing error correction G. Additional Meta-Document Services using the system shown in FIG. 49; 35 G.1 Notification Of Enrichment FIG. 51 is a flow diagram depicting a process for iden G.2 Document-Centric Suggestions tifying and correcting errors in document content for act G.3 User Directed Enrichment 5026 shown in FIG. 50: G.4 Exporting/importing Enriched Documents FIG. 52 illustrates a block diagram of the elements for G.5 Alternate Embodiments forming a directed search; 40 H. Miscellaneous FIG. 53 illustrates an example of a user interface for A. Definition Of Terms invoking a directed search; The terms defined below have the indicated meanings FIG. 54 illustrates an example of the output of the directed throughout this application, including the claims: search specified in FIG. 53; 45 “Annotate' is used herein to mean to create a reference FIG. 55 illustrates one embodiment of an interface for between an entity in a document, or region of a document, specifying a meta-document exchange; and some set of links, text segment, images, or embedded FIGS. 56, 57, 58A, and 58B illustrate a detailed example data (e.g., glyphs). of an export format; “Content retrieval is used herein to mean an annotation FIG. 59 illustrates another embodiment of a meta-docu 50 that consists of content obtained by following a series of one ment, or more links and retrieving their content, which content FIG. 60 illustrates an embodiment of the contents of a may be filtered or reformatted after retrieval. personality; A "document' is used herein to mean an electronic (e.g., FIG. 61 illustrates an embodiment of the contents of a digital) or physical (e.g., paper) recording of information. In service request; 55 its electronic form, a document may include image data, FIG. 62 illustrates an alternate embodiment of the client audio data, or video data. Image data may include text, interface shown in FIG. 10; graphics, or bitmaps. FIG. 63 illustrates a status window that displayed when Document “mark-up' is used herein to mean the annota enrichment is invoked for a specified document; tion applied to a document. FIGS. 64 and 65 illustrate two examples of popup win 60 A "document Soul’ is used herein to mean a personality dows that appear when identified entities are selected; that remains attached to a document for an extended period FIG. 66 illustrates an example of a document storing of time that may be indefinite or pre-specified of finite management view of a user's files; duration. FIG. 67 illustrates an example interface for selecting “Enrich' is used herein to mean to annotate a document document marking options; and 65 in accordance with a predefined personality. FIG. 68 illustrates an example of an interface for config “Entity is used herein to mean Something recognized in uring services. a document (e.g., a person's name, a location, a medical US 7,133,862 B2 7 8 term, a graphics entity that may include image data, graphics information needs of the writer or reader, either indepen data, audio data or video data) that can be in the form of an dently through a pre-defined set of document service image, text, embedded data, HTML, etc. requests or by following specific or customized instructions, “Information space' is used herein to mean the entire set and performs the sometimes tedious tasks of searching, of annotations associated with an entity, a document seg gathering, assimilating, and organizing information relevant ment, a document, or a set of documents. to the document content. A "lexicon' is used herein to mean a data structure, The actions of the synthesis phase occur through the program, object, or device that indicates a set of words that activation of one or more document service requests 106. may occur in a natural language set. Alexicon may be said Document service requests 106 may be activated while the to “accept a word it indicates, and those words may thus be 10 user is creating or working on the meta-document 100 or called “acceptable' or may be referred to as “in” or “occur when user has set aside the meta-document 100 so that the ring in the lexicon. service requests can benefit from idle computer time, unused A “link' is used herein to mean, by way of example, a network bandwidth, etc. Activating a document service URL (Uniform Resource Locator) associated with a text request 106 while the user works on the document has the segment or an image segment. 15 additional advantage of allowing the meta-document to learn A “morphological variant' is used herein to mean the about the user's preferences. Document service requests 106 conjugated form of a word or expression (e.g., plural form), may be activated automatically by a scheduler 204 or or a derivational form of a word (e.g., presidential is a manually by a user. variant of president). Morphological variants can be reduced The next phase in the knowledge management cycle is to stems or lemmas using known techniques such as stem concerned with sharing the information produced during the ming algorithms such as Porter's algorithm or a lemmati synthesizing phase. Typically the sharing phase consists of Zation scheme in linxight's LinguistX Platform. integrating the information gathered during the synthesizing A "personality” is used herein to mean a thematic set of phase into the contents of the meta-document 100 in a services that can be applied to enrich a document. format useful for the user, person, or community that will A 'service' is used herein to mean a program that 25 use the document. The document content can be further provides new markup based on content and meta-data in a enhanced for the user by assigning a personality to the document in its current state. For example, the program may document which marks up the document with information identify entities in a document, and annotate each entity with that eases the understanding of the content or that regularly data associated to that entity (e.g., in a database). For provides more recent updates related to the content. The example, a service may enrich a document with external 30 final servicing step in the cycle deals with periodic updates information and/or add new services. whereby the meta-document performs predefined service A “text segment' is used herein to mean a continuous requests on behalf of the user. For example, the meta sequence of bytes in a document, or a group of Such document can keep up-to-date information of the tempera Segments. ture of an identified city. 35 B. General Features B.2 Services A block diagram of a meta-document or "document soul” Referring again to FIG. 2, one or more meta-documents 100 is shown in FIG.1. The meta-document 100 includes an 100 are stored in a meta-document server 200 at meta identifier 101, a content portion 102, which is a document document database 202. In an alternate embodiment, docu created by a user or obtained by a user, and a personality 40 ment references (e.g., URLs) are stored in meta-document 104. The personality 104 is a set of one or more document database 202 and their content referenced on network file service requests 106 and an entity database 111. The entity server 220. Each meta-document 100 in the meta-document database may include one or more separate entity databases, server 200 is endowed with a set of document service where each entity database identifies a class of entities (e.g., requests which each meta-document 100 exercises under people names, city names, business names, etc.). In one 45 control of a scheduler or scheduling demon 204, which embodiment, the personality 104 does not include the entity wakes up each meta-document in database 202 in accor database 111 but instead includes document service requests dance with some predetermined time schedule. The sched that identify entities. In another embodiment, the entity uler 204 may be implemented in a software mechanism database 111 records document-centric entities (i.e., entities which accesses the document service requests 106, entity that are related exclusively to the document content 102) 50 database 111, and content in a meta-document 100. that are specified by a user or by the system. It will be As illustrated in FIG. 3, after the scheduler 204 wakes up appreciated by those skilled in the art that the document the meta-document 100, the meta-document 100 informs the service requests 106 and the entity database(s) 111 forming scheduler 204 of its current set of document service requests part of the meta-document 100 may include the content of a 301. Depending on the resources (e.g., service providers document service request and an entity database and/or may 55 which can fulfill or satisfy a particular document service include references to a document service request and an request) available to the meta-document server 200, the entity database (in, for example, services database 210). The scheduler 204 chooses a document service request 106 to identifier 101 may include other administrative data such as fulfill (indicated by arrow 300). Subsequently, the scheduler creator, owner, size, access permissions, etc. 204 invokes service providers 206 identified using services B.1 The Knowledge Management Cycle 60 database 210 to satisfy those requests. FIG. 2 illustrates a meta-document management system The services database 210 includes “service provider 201, within which the meta-document 100 is produced as the methods” for lookup and selecting service providers (includ result of a knowledge crystallization process, where the ing authentication data associated with each service), “entity process may last the lifetime of the document. Typically a methods” for identifying entities in document content using meta-document's life begins with a focus and purpose which 65 entity database 111 or entity databases in services database helps direct and refine the synthesis phase. During the 210 or available as a network service 206, “notification synthesis phase, the meta-document 100 anticipates the methods” for notifying a user of new enrichment, regular US 7,133,862 B2 9 10 expressions, lexicons, and a categorizer. In other embodi ence the population of the world and may need to be updated ments, the services database 210 also includes content rights periodically to remain current. management methods. Some document service requests may take a long time Fulfilling a document service request means accessing a (for example, finding all the company names mentioned on service provider from the services database 210 (e.g., select a page and accessing all WWW pages mentioning two of ing a service provider from a list of possible service pro those companies together). Other document services may be viders) which includes some processes (or programs) that satisfied periodically (for example, finding the closing price are invoked by the scheduler to access to the document of a stock share price). Besides document service requests, content 102 (indicated by arrow 302) and document markup other functions not shown can be included in the meta 108 (indicated by arrow 304). The results received from 10 document server: a coordination system to orchestrate the service providers 206 are integrated back into the original concurrent execution of the functions described for the meta-document 100 by content manager 208. That is, these scheduler, a visualization and interaction system that allows processes terminate by producing document-specific various levels of display and interaction of metadata-en markup 108 (indicated by arrow 306) and/or new document hanced documents, and a learning system that learns by service requests 106 (indicated by arrow 308), both of which 15 observing the user interactions with the document. Likewise are added to the meta-document 100 by content manager the meta-document 100 may be physically stored as a 208. number of destination files (e.g., a file corresponding to the Various standards for attaching metadata exist, for original content 102, a file corresponding to markup 108, example, DOM (Document Object Model) and XML (ex and a file corresponding to document service requests 104. tended markup language) may be used. In one embodiment, which files may all be related by known naming schemes). both meta-document document service requests and result B.3 Personalities ing knowledge can be represented as XML metadata and The meta-document server 200 provides end-to-end solu added to the document at the end of each waking cycle. For tion for document-based knowledge creation and sharing in example, a meta-document's document service requests are a customizable fashion. Customization is provided by the expressed as XML fields: ... (where DSR 25 mechanism of personalities within a meta-document server. is short for “DOCUMENT-SERVICE-REQUEST). For Personalities are assigned to a document thereby assisting a example, one document service can be expressed as: user in the acquisition, sharing and utilization of knowledge; who-am-i who-made-me document, (c) private: marked to keep the documents where-am-i 45 metadata invisible to other documents, (d) Scientific: Search During each operating cycle of the meta-document server for online versions of the papers cited in the document 200, a meta-document 100 may acquire new markup 108 content, or (e) genealogical: looking for documents contain and new document service requests 106 as a function of ing similar contents as itself. document service requests that have been fulfilled. Some B.4 Methods for Identifying and Using Entities document service requests may add markup 108, and rep 50 As shown in FIG. 3, a personality 104 identifies one or licate the same document service request or other document more service requests 106. Each service request includes service requests. Some document service requests may indi methods for: (a) recognizing entities in the document con cate to the content manager 208 markup 108 that should be tent 102; and (b) accessing a service using the recognized eliminated when these requests are fulfilled. entities. In general, document service requests 106 correspond to 55 Entities include proper names (e.g., people, places, orga services which add markup 108 to the document, based on nizations, etc.), times, locations, amounts, citations (e.g., the document's existence as a file in a file system, based on book titles), addresses, etc. Entities can be recognized using the content of the document as it was originally authored; a variety of known techniques that may include any one or and based on the content of the markup added to the a combination of regular expressions, lexicons, keywords, document by some other process. When the document or the 60 and rules. Alexicon is typically a database of tuples of the document's location is altered, the knowledge in the docu form , where frequency and weight are as “f” and acquiring entity statistics using a utility measure as set forth “w” defined above. The expanded document is represented in the flow diagram show in FIG. 23. FIG. 23 sets forth a US 7,133,862 B2 27 28 method for filtering services at 1716. Initially at 2355, a list At 2361, if it is determined that the last entity in the entity of services and entity database are received. At 2356 a next database has been examined, then the measured similarities service in the list of services is selected, and at 2357 a next are summed for all the entities related to the selected service entity is picked from the database of entities. At 2358, a at 2362 as follows: query is formulated for the selected service using the selected entity as set forth above. At 2359, the query is submitted to the service. Using the top N results of the Service Utility(Service) = X. Entity Utility(E, Service), service at 2359, a similarity measure between the entity and EeFntityDB contextual information related to the selected entity and each 10 of the top N results is computed at 2360, as follows: where E is an entity in the entity database, and service is a service. At 2363, if this is performed for all services, then the top N services are selected with the highest service utility Entity Utility(Entity, Service) = Similarity (Entity, Doc), measure to specify the filtered services; otherwise, the Doc, TopMatchesRorServiceX. 15 process continues at 2356 with the next service in the list. Services can be organized in a number of ways such as flat or hierarchically. The services as represented in these ways where “entity” is one of the entities in the entity database: could be clustered and a representative service could be “service' is a service; and 'doc' is one of the N top results. selected from each cluster. In this embodiment, a multi More specifically, “entity” in the equation denotes both an dimensional graph is defined with one dimension for each entity string and a Surrounding context. For simplicity it may entity in the entity database. The frequency of each entity be assumed that an entity only occurs in one location in the occurring in the expanded document and the services are expanded document. The Surrounding context for an entity plotted against each other. Clusters are formed and associ can be determined in a number of ways using known parsing ated with a service. These clusters can then be used to techniques that delimit sentences, paragraphs, etc. For 25 hierarchically organize the services. example, techniques for determining the context Surround In an alternative embodiment, a generic service is applied ing an entity include: (a) letting the context be the textual to the expanded document subsequent to act 2363. The content of the whole document, which forms part of an generic service uses the contents of the expanded document expanded document, be the context; (b) letting the context to query a general purpose information provider instead of 30 an information provider that specializes in a specific Subject. be the sentence in which the entity string occurs; (c) letting In yet another embodiment, a service utility is computed for the context be the paragraph in which the entity string an entity type instead of for all entity types as described occurs; or (d) letting the context be the topic text in which above. In this alternative embodiment, the utility of services then entity string occurs as detected by known topic detec can be evaluated for particular types of entities. For tion techniques. example, a service utility is computed for the entity type Also in the equation, “doc' refers to either the document 35 Summary that appears (as an element in a result list) in the biology 2002 for the service 2004 shown in FIG. 20. results page of the service or alternatively to the entire D.4 Using Predefined Personalities and Knowledge Lev document, from which the summary was derived. The els similarity measure can be performed using either resulting In yet a further embodiment, a relative ability or existing 40 knowledge level in a field may be specified as shown at 1516 form. In this equation a similarity measure is generated for in FIG. 15. The specified knowledge level 1516 can be used each entity (represented as the entity plus a context) and for example to create new personalities that access different result document 'doc' (represented as a Summary or the levels of service providers from predefined personalities entire document content). specified at 1504. For example, with a personality directed In order to compute such a similarity measure both the 45 at medical information, if knowledge of someone is novice entity and the result document are first processed as follows: (i.e., a layman) then more basic information providers are (a) stop words are eliminated; and (b) each word is stemmed specified and more basic definitional services are specified using known stemming techniques such as Porter's Stemmer. in the personality. In addition, the knowledge level can be Subsequently, a similarity measure such as the Cosine used to either include or exclude entities from an entity measure could be used to calculate the degree of similarity 50 database that is used to create a personality (as set forth between the entity and the result document based upon text above in section D.3). For example, a expert in the medical features (for details of text features see U.S. patent appli field may not be interested in the same entities that a novice cation Ser. No. 09/928,619, entitled “Fuzzy Text Catego in the medical field would be. rizer” which is incorporated herein by reference). Besides providing a knowledge level of desired person In an alternate embodiment, the text features are trans 55 ality, a hint (i.e., Subject hint) is given to the type of formed using LSI into a reduced features space. This LSI personality that is desired as shown at 1514 in FIG. 15. Upon transformation is calculated using entity and entity fre receiving a hint, the meta-document server relates the hint of quency database that is extracted as described above. Having the desired personality to a set of actions that are specifically transformed the features using LSI, a similarity measure related to subject matter of the hint. Generally, the hint 1514 Such as a Cosine distance measure can be used to calculate 60 can be used to improve any of the methods for creating the similarity between the entity (and its context) and the personalities that may be specified in FIG. 15. The hint 1514 resulting document 'doc'. and knowledge level may be used individually or in com In the instance in which an entity occurs in multiple bination. contexts exist for an entity (i.e., the entity exists in multiple In one specific example, if a hint 1514 of a medical locations in a document or expanded document), each loca 65 personality is specified to the meta-document server along tion of the entity and its associated context are treated with document content referenced by the hyperlinks at 1508 separately (i.e., as different entities). or name at 1510, then the meta-document server 200 creates US 7,133,862 B2 29 30 a personality by identifying services that enrich the identi the same question for the example given above depending fied content relating to the following: (a) an access to a on how many alternate lexicons are defined in the specified general pharmaceutical guide for drugs mentioned in the personality. Thus, with the same question, but with a dif document content; (b) medical records related to the user ferent lexicon, for example of Surgical procedures, the and to the items mentioned in the document content; (c) question form can be defined: “What is the procedure for images, video clips, etc., associated with items mentioned in of the forms. Each question form is used to create a new document ?). Alternatively, the user may be given service request that is satisfied using a known question the option that document service requests be specified for all answering system that uses a combination of information or selected ones of the identified question forms. In addition, retrieval and syntactic or pattern matching techniques. the user may be given multiple answers and multiple infor In one embodiment, question forms are created automati 55 mation sources to select from. cally using an input question defined by a user at 1520 in FIG. 24 is a flow diagram that depicts one embodiment for FIG. 15. For example, if the question is “What is the identifying an answer of an instantiated question. Initially at procedure for ablation of the liver?” and the specified 2402, the meta-document server 200 receives the instanti personality at 1504 includes a lexicon that is body organs, ated question. The type of question is determined at 2404 which includes the word “liver, then the meta-document 60 and converted to a query at 2406. At 2408, the query is server would identify the body organ found in the question Submitted to an information service adapted to handle ques 1504 (e.g., liver) and replace it with a generic symbol tions of the type identified. At 2410, passages of the top N representative of the identified lexicon. In this specific results of the query are extracted using for example a example, the word “liver” would be replaced with the Summarizer. At 2412, the passages of the extracted top N generic symbol to produce the question 65 results of the query are assigned part of speech tags and form “What is the procedure for ablation of the '? Alternate question forms can be defined using for each word in the passages of the extracted top N results US 7,133,862 B2 31 32 of the query using the Substantiated question and the deter of the structure in the reader's native language, or to a mined question type. At 2416, sentences or part of sentences textual, audio or videogrammar lesson corresponding to that of the extracted passages with words having highest com structure); (c) service 2610 linking each word, multiword puted weight of relevance are selected as proposed answers expression, phrase or sentence to other instances of the same to the instantiated question. in different contexts from the present (e.g., by retrieving D.6 Using Learning Personalities similar but differing text segments possessing the same The meta-document server 200 provides an e-learning word, multiword expression, phrase or sentence; the personality that may for example be available in the per retrieved elements could be presented, for example, in a sonality window 1014 in FIG. 10. When an e-learning format that brings the similar structure to the center of the personality is applied to a document, each service in the 10 field of vision of the user for easy comparison of the personality analyzes the contents of the document, recog differing context); (d) service 2612 that links each word, nizing entities and concepts and combinations specific to multiword expression, phrase or sentence to a one or more that service. Each service then links these entities, concepts, interactive grammar exercises concerning that element; and or combinations to new content found by a possibly web (e) services 2614 and 2616 that link to content specific based database search, or prepares the search and inserts a 15 language teaching resource that corresponds to the docu link, that when activated, performs the search. Personality ment content. A similar approach can be followed for other services are not limited to simple search, but can perform topics of learning. any actions depending on the content analyzed. FIG. 25 illustrates a list of Services 2502 available when E. User Controlled Enrichment This section describes additional properties that can be an e-learning personality is selected to enrich document specified for personalities and services. Deciding what to content. E-learning service 2504 and 2506 link words or enrich and how to enrich content can vary depending on the multi-word expressions found in the document to their personality and/or service specified. In one form, a person definitions and/or translations, respectively. This service ality annotates any phrase or word identified in its associated may perform lemmatization or Stemming before accessing a list of lexicons (e.g., sports figures), pattern matching using dictionary. In addition, this service may use the context of 25 the words or multi-word expressions Surrounding an element POS tagging, and/or regular expressions (e.g., proper names, in the content to limit the number of definitions and/or noun phrases), or some linguistic processing variant of the translations displayed. Another e-learning service 2508 links two. In another form, a personality provides a global docu each text unit (i.e., document, paragraph, phrase, word) to a ment service that annotates an entire document with for tutorial concerning that element. Yet another e-learning 30 example citations and related documents. This section service 2510 links each text unit to a tutorial concerning the describes different techniques for providing users with more text unit. Yet further e-learning services 2512, 2514, and control over what and how personalities annotate content in 2516 link each text unit to interactive courses, available a meta-document (e.g., footnotes). online courses, or online resources concerning the Subject of E.1 Automatically Inserting and/or Linking Content the text units, respectively. 35 FIG. 12 illustrates at 1220 a mechanism for selectively Advantageously, personalities prepare and perform a mul specifying at a personality level whether to insert enrich tiplicity of independent language learning tasks on a speci ment as links 1222, or content 1224, or automatically fied document(s). When the personality is applied to the determine whether to link or insert content at 1226. In either document content, each selected service in the personality case, links are drawn from entities recognized in document analyses the contents of the specified document(s), recog 40 content 102 to either content or services located at a remote nizing entities and concepts and combinations specific to location (in the case of 1222) or content located in document that service. The service then links these entities, concepts, markup 108 of a meta-document. or combinations to new content found by a possibly web In an alternate embodiment shown in FIG. 27, the user is based database search, or prepares the search and inserts a given the ability to selectively specify personality and/or link, that when activated, performs the search. 45 service behaviors to recognized entities in specified content In one variation, the e-learning personality may also or documents. In this embodiment, a user for example can include a service that tracks the user's past action (or access select a portion of the enriched document 1018 shown in a user profile) to provide new information when the same FIG. 10 and select for example the stock quote global entity is linked to other documents. In one specific embodi service results 1026. This series of actions using known ment the e-learning personality is specifically directed at 50 pointer selection techniques causes the of stock quote learning languages. In this embodiment, the meta-document options window 2710 shown in FIG. 27. server 200 provides computer assisted language learning In the options window 2710, a user may specify that a through using the herein-described document enrichment particular service behavior be applied to all selected docu mechanisms. FIG. 26 illustrates an example list of services ments, a currently selected document, or a selection at 2712. 2602 available when a language learning personality is 55 In addition, the options window 2710 permits a user to selected to enrich document content. statically or dynamically update linked information at 2714 More specifically, the language learning personality is that is inserted in a specified form at 2716. For example, defined using a personality that performs two or more of the information may be inserted as links or content as described services defined in FIG. 26, which include: (a) service 2604 above. Content that is inserted can be inserted as for and 2606 that link words or multiword expressions found in 60 example footnotes or as a list of content at the end of a the document to their definitions and/or translations, respec document. Content that is accessed dynamically is recalcu tively (possibly performing lemmatization or Stemming lated each time a link or content is accessed (e.g., using before accessing the dictionary and possibly using the Microsoft OLE-like techniques). Content that is accessed context of the element to limit the number of definitions statically is done so at a frequency specified at 2718 (e.g., displayed); (b) service 2608 that links each sentence, or 65 monthly, daily, hourly, etc.). phrase, to a grammatical description of the structure of the Advantageously, a user is given the ability to modify a sentence or phrase (possibly linking to a textual explanation default behavior of a service while specifying whether US 7,133,862 B2 33 34 changes apply to all documents the user controls, the current identified corpora is less than a second predefined threshold, document only, or the current selection of a document that then the entity in document content of a meta-document is contains one or more recognized entities. Depending on the enriched. level of change, they are either stored as properties of a It will be appreciated that the subject of a referenced particular meta-document or as part of a user's profile. corpus may relate to a specific Subject or a plurality of Whether to link or retrieve and insert content in a meta Subjects. Also in this embodiment, the user is also given the document may be specified for each personality or it may be ability to specify at 2810 in FIG. 28, whether to limit the performed automatically if specified at 2724 in FIG. 27 or at annotation of words in the document content 102 and/or 1226 in FIG. 12. Determining whether to link or insert document markup 108 to only those words that appear once content automatically is performed using information from 10 or more than once in the document. This provides that only a user's past history of interaction with the meta-document terms appearing in the document content 102 more than a server 200. If specified to automatically link or insert content certain number of times will be annotated as specified at to a specific personality at 2724 or as a property of a 2811. personality at 1226, then the decision whether to insert In operation, when a particular document service request information as links or content will depend on whether the 15 106 is invoked by the meta-document server 200, entities are information is inside or outside a users interaction history. searched in reference document(s) and/or database(s) and/or If outside a user's interaction history, then links are inserted; document content 102 and/or document markup 108 for otherwise, if inside the user's interaction history, the content their frequency of occurrence. If outside the range of the is retrieved and inserted into a meta-document. predefined threshold values, then the entity identified in the A users interaction history can be specified using a document content is not annotated, thereby advantageously history of links accessed by the user and/or a list of inter limiting document markup in a user specifiable and intelli esting concepts to the user. A list of interesting concepts to gible manner. the user can be determined using for example frequently E.2 Propagating Enrichment Between Documents followed links or from a user profile developed by recording Enrichment of a document or meta-document can also be email history or using a recommender system Such as 25 controlled by automatically propagating markup there Knowledge Pump developed by Xerox Corporation. In this between as each document or meta-document is accessed by mode of operation, information from a users interaction a user. This information can be used as a first pass to enrich history from entity browsing patterns is used to determine documents in real-time while at the same time provide whether to enrich document content. enrichment that may be contextually related to a user's In yet another embodiment, an annotation property can be 30 current work in process. This enrichment can be distin set for a specific service as shown or more generally for a guished from other document enrichment using formatting personality. In FIG. 14, each service has a defined entity type such as font color or the like. In addition, since this enrich 1412 with an annotate property 1414. The annotate property ment can be tagged for later identification, it can be easily operates in one embodiment as defined in window 2800 removed from or reinserted into a particular meta-document shown in FIG. 28 that is made available when selecting a 35 similar to a track changes function in a text document. specific annotate property for a service. In one mode of In one embodiment, enrichment is propagated between operation 2802, any identified entity is annotated according meta-documents in the meta-document server 200 as shown to an annotation that is predefined for a particular entity in FIG. 2. The propagation of enrichment between docu type. ments is a user settable property that can be selected in In two other modes of operation 2804 and 2806, a filter 40 personalities window 1210 at 1230 shown in FIG. 12. In function is applied to a list of words. The filter function operation, if enrichment is selected to be propagated determines whether to annotate an entity based on pre between meta-documents, then entities identified by the defined filtering criteria such as the frequency the word is meta-document server during enrichment are associated used in a reference document (e.g., a document identified to with their annotations and stored together in an entities be linked to an entity) or the usage of the entity in the 45 propagation list. When a new meta-document is enriched by reference document as compared to the document content in the meta-document server, it first searches through the which the entity was identified (e.g., using POS tagging). document content looking for entities that are identified in In the “expert' mode of operation 2804, only those the entities propagation list. If found, the similar entity is entities that occur in referenced document(s) or database(s) annotated as stored and defined in the entities propagation 2805 with a frequency below a predefined threshold are 50 list. Subsequently, the document service continues with annotated. In the “novice' mode of operation 1206, only other enrichment functions associated with the service as those entities that are identified in referenced document(s) or described above. database(s) 2807 with a frequency above a predefined In an alternate embodiment for automatically propagating threshold are annotated. Alternatively or in conjunction with enrichment between documents, functionality for propagat these modes of operation, an entity with few dictionary 55 ing enrichment can be included in a plug-in to any browser senses, or synonyms (e.g., as determined from an online and need not be integrally coupled to the meta-document thesaurus) might be discerned as a domain specific entity server 200 as shown in FIG. 2. The plug-in in this instance and therefore either annotated or not annotated. In one would propagate markup (e.g., hyperlinks) seen on each embodiment, categories in services are used to form a document during a current session between fetched content Vocabulary to evaluate dictionary sense. 60 (e.g., web pages and/or documents). The markup could be A variation of this embodiment allows a user to specify recorded from a predetermined number (i.e., one or more) of frequency of occurrence at 2801 and 2803 and the reference previously fetched (or browsed) documents or by session in document(s) and/or database(s) 2805 and 2807 at 2810 (i.e., a markup propagation list that associates strings in fetched referenced corpus). For example, in one embodiment this content with their markup. variation would provide when in expert mode, if the fre 65 For example, a plug-in to browserS Such as Netscape or quency of an entity identified in a document is less than a Internet Explore can be added that marks up document first predefined threshold and the frequency of the entity in content as a user browses from one document to the next. US 7,133,862 B2 35 36 That is, every page that is viewed on the browser during a The enrichment information can be filtered using any current session (e.g., starting from a first identified docu number of known techniques. For example in one embodi ment) is analyzed and all strings that are marked up (e.g., ment, enrichment information is filtered with respect to a everything between the HTML Xerox tuples where frequency denotes the Services in the services database 210 and utilities such as number of occurrences of that term in the document to define personality recommender 216 may perform a variety of the set of terms (i.e., document representations 3624). US 7,133,862 B2 43 44 Subsequently, the approximate reasoning module 3618 target function c-” that models a dependency between a accesses a knowledge base 3622 that records variables (i.e., target variable C and a set of input features f. . . . . f.). The document features and associated frequencies) that are used target variable C is discrete, taking values from the finite set to define a function that models the mapping from the {c. . . . . c.). The naive Bayes classifier accepts as input document 3612, or its transformed representation 3624, to a 5 a document “Doc' and predicts the target value C, or a class in an ontology. One specific embodiment of Such a classification, for this tuple. It uses Bayes theorem in order knowledge base is represented using a set of rules that to perform inference: describe relationships between the recorded variables. Typi cally each class is represented by one rule. In mapping the function, the inference engine 3618 matches the document 10 Pr(Doc C)Pr(C) with each class rule stored in knowledge base 3622 and uses Pr(C, Doc) = Pr(Doc) a decision maker for drawing conclusions as to which action to rely on. Consequently, this problem can be represented in terms of The function as represented by the knowledge base 3622 class probability distributions Pr (C) and class conditional and approximate reasoning module 3618 can be defined 15 using a variety of model types including the following: probability distributions Pr(DocIC). probabilistic models; fuzzy set/logic models: Boolean-val In one specific embodiment, a document Doc is repre ued logic models; nearest neighbor approaches; and neural sented in terms of features such as words that occur in the networks; some of which are described in more detail below. document Doc. Consequently, the above class conditional For background relating to some of these algorithms see the probability distributions can be rewritten as follows: following publications by: Shanahan, “Soft Computing For Pr(f, . . . . f.IC). Knowledge Discovery: Introducing Cartesian Granule Fea Within the na i ve Bayesian framework a simplifying assumption is introduced. Sometimes known as the na i ve tures”, Kluwer Academic Publishers, Boston (2000); and assumption, where the input variables (in this case the terms) Mitchell “Machine Learning, McGraw-Hill, New York are assumed to be conditionally independent given the target (1997). 25 In addition to the elements shown in FIG. 36, the catego classification value. As a result, the class conditionals reduce rizer 3610 can include a learning module. The exact make up to: Pr(f|C). of the learning module will depend on the model (e.g., Thus, inference (calculation of the posterior probabilities probabilistic, fuZZy, etc.) used by the approximate reasoning given evidence) using Bayes theorem simplifies from: module 3618 to map a set of documents to the list of 30 categories. Generally, the learning module takes as input classified document examples for each class and generates a corresponding knowledge base. Pr( < f(, ... , f, a Class = Cl) Pr(Class = Cl) F1.1 Probabilistic Model Pr( < f(, ... , f -) In one embodiment, the approximate reasoning module 35 3618 can use a probabilistic representation. The learning of probabilistic models involves determining the probabilities to the following (and hereinafter referred to as “the of various events. These are usually estimated from a labeled simplified inference equation''): training dataset. More formally, a training dataset is a collection of labeled documents consisting of tuples 40 where D, denotes the document and L, denotes the label associated with D. Pr(f; | Class = Cl)Pr(Class = Cl) In describing one specific type of probabilistic model, namely, a Naive Bayesian model, first it is described below how to represent models and perform inference approximate 45 reasoning in Such a framework, then it is described below Decision making consists of taking the classification how to learn Na i ve Bayes models from labeled example value C, whose corresponding posterior probability is the documents. The naive Bayes approach to systems modeling maximum amongst all posterior probabilities Pr(C.) for all values CeS2. This can be mathematically stated text classification to disease prediction as disclosed in: Good 50 as follows: (1965), “The Estimation Of Probabilities: An Essay On Modern Bayesian Methods' M. I. T. Press; Duda et al. (1973), “Pattern Classification and Scene Analysis”. Wiley, Class( < f(, ... , f S) = Cmax = argmaxPr(C| < f(, ... , f, s). N.Y.; and Langley et al. (1992), “An Analysis Of Bayesian CeOC Classifiers', in the proceedings of Tenth National Confer 55 ence on AI, 223–228. Since, in this decision making strategy, the denominator To simplify the description of the text categorizer 3610, it in the simplified inference equation is common to all pos is assumed that documents 3612 will be assigned to no more terior probabilities, it can be dropped from the inference than one class. However, it will be appreciated by those process. This further simplifies the reasoning process (and skilled in the art that the text categorization method 60 the representation also) to the following: described herein may be readily extended to assign docu ments to more than one class. More formally, the problem of text classification can be represented as a text classification system S that assigns a Class( < f(, ... , f ·) = series C.) Pr( f, Class = Cl). document (or body of text) class labels drawn from a 65 discrete set of possible labels C. Mathematically it can be viewed as the mapping: S:Doc->{labellabeleC} (i.e., the US 7,133,862 B2 45 46 As a result of making the naive assumption, the number filtering and decision making mechanisms, accesses the of class conditional probabilities that need to be provided knowledge base 3622 to classify the unlabelled text object reduces from being exponential in the number of variables 3612. to being polynomial. This assumption, while unlikely to be In a first embodiment, the knowledge base 3622 contains true in most problems, generally provides a Surprisingly rules for each class (i.e., category), where each rule is made high performance that has been shown to be comparable to up of a class fuZZy set and an associated class filter. During other classification systems such as logic systems (decision operation of this embodiment, the approximate reasoning trees) and neural networks (see Wiley cited above; and module 3618: (1) calculates the degree of match between the Langley et al. (1992), “An Analysis Of Bayesian Classifi document fuzzy set 3624 and a fuzzy set associated with ers, in the proceedings of Tenth National Conference on AI, 10 each class (i.e., each class fuzzy set); (2) passes the resulting 223-228). degree of match through a respective filter function (i.e., In other words, each class is represented by a series of class filter); and (3) determines a class label to assign to the word conditional probabilities for each word and a class unlabelled text object based upon the filtered degrees of conditional that are used in the calculation of the posterior match (e.g., the class label associated with the highest probability for a class given a new document to be classified. 15 degree of match is assigned to be the class label of the text Naive Bayes classifiers can quite easily be learned from object). example data. The learning algorithm operating in a learning In a second embodiment, each rule is made up of a granule module consists of estimating the class conditional prob fuzzy set. Similar to the categorizer of the first embodiment abilities and the class probabilities from a training dataset that uses fuZZy set models, this categorizer uses granule Train (a labeled collection of documents) for each possible feature based models. In operation, the categorizer of this document classification Class, where the class conditionals second embodiment performs a functional mapping from the correspond to the following: Pr(f, Class-c) “ie{1,..., n} set of features to a set of class values. Further details of the and the class probability distribution corresponds to: a text categorizer that uses fuZZy models is described by Shanahan in U.S. patent application Ser. No. 09/928,619. Pr(Class-c). 25 entitled "Fuzzy Text Categorizer, which is incorporated The class probability Pr(Class-c)corresponds to the frac herein by reference. tion of documents having the classification of c in the F.13 LSI Model training dataset Train. In yet another embodiment, the text categorizer 3610 uses Each class conditional Pr(f|Class-c) can be estimated LSI (Latent Semantic Indexing) to categorize document using the m-estimate (see Mitchell cited above): 30 3612. Text classification and learning can be performed using LSI and similarity metrics in the resulting feature space. The LSI model is used to translate feature space into X. Freq(fi, Doc) + 1 latent concepts space that can be used to explain the vari j-1 Doc ance-co-variance structure of a set of features through linear Pr(f; | Class = c) = 35 combinations of these features. Subsequently these trans formed features can be used as input to any learning algo rithm. In addition, LSI classification can be used with K nearest neighbor and a fuZZy classifier. Having identified the where: Freq(f, Doc) denotes the number of occurrences latent concepts they can be used for classification (Such as of the feature f, in the training document Doc, Vocabl 40 fuzzy classifier defined above or K nearest neighbors) or denotes the number of unique features considered as the similarity metrics (Cosine metric that can be used for language of the model (i.e., the number of variables used to ranking or re-ranking). Additional background relating to the generation of context vectors is disclosed by Deerwester, solve the problem); and Doc, denotes the length of the in “Indexing. By Latent Semantic Analysis”, Journal of the document Doc, (i.e., the number of terms, words, or features 45 American Society for Information Science, 41 (6): 391–407, in the document). 1990. F.7.2 Fuzzy Model F1.4 Vector Space Model In another embodiment, the text categorizer 3610 uses a In yet a further embodiment, the text categorizer 3610 fuzzy model to categorize document 3612. In this embodi uses a vector space model to categorize document 3612. ment, the pre-processing module 3614 includes a feature 50 Under the vector-space model, document and queries can be extractor 3615, a feature reducer 3617, and a fuzzy set conceptually viewed as vectors of features. Such as words, generator 3621 as shown in FIG. 36. The feature reducer noun phrases, and other linguistically derived features (e.g., 3617 is used to eliminate features extracted by the feature parse tree features). Typically a feature extraction module extractor 3615 that provide little class discrimination. The transforms a document (or query) into its vector of features, fuzzy set generator 3621 generates either fuzzy sets or 55 granule fuZZy sets depending on the fuzzy model used. D=, where each f, denotes the statistical Associated weights of features generated by the preprocess importance (normalized) of that feature. One common way ing module 3614 are interpreted as fuzzy set memberships or to compute each weight f, associated for document Doc is as probabilities. follows: f=freq(fDoc)*idiff), More specifically in this embodiment, the approximate 60 reasoning module 3618 computes the degree of similarity where freq(f, Doc) represents the frequency of feature f. (i.e., match) between the unlabelled text object 3612 that is in document Doc and idf(f) represents the inverse document represented in terms of: a feature vector produced by feature frequency of the feature f, in a document collection DC. The extractor 3615, a document fuzzy set produced by the fuzzy idf(f) factor corresponds to the content discriminating set generator 3621, and one or more categories as specified 65 power of i' feature: i.e. a feature that appears rarely in a by the approximate reasoning module 3618. The approxi collection has a high idfvalue. The idf(f)factor is calculated mate reasoning module 3618, which contains matching, as follows: US 7,133,862 B2 47 48 ontology using documents previously uploaded to the meta document server 200 and assigned a personality by a user. More specifically, the personality recommender system f = log (f 3700 shown in FIG. 37 is a variant of the text categorizer described above in section F.1 and shown in FIG. 36. The knowledge base 3722 can be defined manually using data where DCI denotes the number of documents in the from personality database 212, which may contain user collection DC and DF(f)denotes the number of documents specific personalities or generally available personalities that contain f. Typically, a normalized document vector, (e.g., using features and weightings chosen manually for D= is used in the vector space model of to each personality that could be applied) and documents that information retrieval, where each inf, is obtained as: were previously assigned to those personalities in the meta document database 202. Alternatively, the knowledge base can be defined semi automatically or automatically using features and weight S(f)? 15 ings chosen by machine learning techniques. In the case of automatically learning the features and weightings, the learning module 3730 may use meta-documents existing in the meta-document database 202 to train the knowledge base Queries can also be represented as normalized vectors 3722. Subsequently, the learning module 3730 validates the over the feature Space, Q-q. . . . . d’, where each entry o knowledge base 3722 using user profile database 3708. The indicates the importance of the word in the search. user profile database 3708, which includes portions of the The similarity between a query q and a document d, meta-document database 202 and the personality database sim(q., d), is defined as the inner product of the query vector 212, includes references to meta-documents that users have Q and document vector D. This yields similarity values in already applied a personality thereto. the Zero to one range 0, 1. 25 In operation, the pre-processing module 3614 (described Additional background relating to the generation of vector above in section F.1) of the personality recommender 3700 space models is disclosed in U.S. Pat. No. 5,619,709, and by extracts features 3616 from an uploaded document 3712. Salton et al., in “A Vector Space Model For Information Subsequently, the approximate reasoning module 3618 (de Retrieval', Journal of the ASIS, 18:11, 613-620, November scribed above in section F.1) derives a list of personalities 1975. 30 3720 using knowledge base 3722. These extracted features F.2 Recommending Personalities would then be exploited, again using standard techniques (using for example, Bayesian inference, cosine distance, as The meta-document server 200 provides a service for described above), to classify the new document and rank the recommending personalities at 216 in FIG. 2. In one possible list of personalities 3720 to recommend enriching instance, personalities are recommended for each document 35 specified document content. Every personality ranking after a user uploads to the meta-document server 200 and the above a certain threshold or just the top N (ND=1) person user has selected the personality property 1214 shown in alities can be recommend by the approximate reasoning FIG. 12. After a user selects the personality property 1214, module 3618. the personality recommender 216 automatically recom In a variant of the personality recommender 3700, the mends a personality for each document uploaded by the user. 40 personalities ranked for a new document are re-ranked using By recommending a personality, the personality recom the profile of the user. For example, if the approximate mender 216 aids a user to decide which of a plurality of reasoning module 3618 attaches to a document a business document enrichment themes are to be applied to an and a sports personality, but the user's own profile in 3708 uploaded by analyzing document content and other contex reveals that this user has never applied a business person tual information (e.g., actions carried out on the document) 45 ality then the ranking can be altered in 3701 so that only the of the uploaded document. sports personality is proposed, or applied with greater pri In one embodiment, personalities that are recommended ority, before the business personality. Accordingly, person by the personality recommender 216 are automatically ality recommendations can be tailored for a particular user attached to the uploaded document without requiring user using the user's interaction history with the meta-document acknowledgment and these documents are immediately so server 200 (e.g., an example interaction history is shown in enriched by the meta-document server. Alternatively, the FIG. 33 and described in section E.2). personalities that are recommended by the personality rec F.3 Generating Queries Using Identified Entities ommender 216 are attached to a meta-document only after Traditional searches for information are invoked when an the user provides an acknowledgement that the recom information need exists for an identified task. From this mended personality is acceptable to the user. 55 information need a query is formulated and a search per In order to decide which personality (or personalities) to formed, generally directed by a user. In accordance with recommend to attach to a document, the meta-document searches performed by services of the meta-document server server 200 uses an uploaded document 3712 as input to the 200, one or more documents relating to a task are identified personality recommender system 216, an embodiment 3700 and uploaded to the meta-document server 200. From these of which is shown in detail in FIG. 37. Generally, the 60 documents queries are generated for specified services auto personality recommender system 3700 shown in FIG. 37 is matically (and optionally as specified by a user). similar to the document categorizer 3610 shown in FIG. 36 As set forth above, a document service request in a except that the personality recommender assigns a list of one personality associated with an uploaded document identifies or more personalities 3720 instead of a list of one or more entities that are used to perform other document service categories as specified in section F.1 for the categorizer. The 65 requests such as queries. The manner in which to automati personality recommender 3700 can learn rules for recom cally formulate queries given an identified entity and its mending personalities and for developing a personality associated document content is the Subject of this section. US 7,133,862 B2 49 50 This technique for automatically formulating a query aims to tion of an information retrieval system. In an alternate improve the quality (e.g., in terms of precision recall) of embodiment, the classification labels are appended to the information retrieval systems. terms in the aspect vector to formulate a more precise query. FIG. 38 illustrates the elements and flow of information Adding terms in the aspect vector adds constraints to the for generating a query 3812 by query generator 3810. The query that limit the search to a set of nodes and/or Sub-nodes query generated may include some or all of the following in a document categorization structure (e.g., hierarchy, elements as discussed in more detail below: (a) a set of graphs). In yet a further embodiment, the classification entities 3808 identified by, for example, a document service labels are used to identify the characteristic vocabulary (i.e., request 106 performed by entity extractor 3804 or manually category vocabulary) 3621 associated with the correspond by a user, (b) a set of categories 3620 generated by the 10 ing classes. The terms of the characteristic vocabulary 3621 categorizer 3610 (as described above in further detail while in this embodiment are appended to the aspect vector to referring to FIG. 36), (c) an aspect vector 3822 generated by again formulate a more precise query. categorizer 3610 or short run aspect vector generator 3820, After processing the query by Submitting it to an infor and (d) a category Vocabulary 3621 generated by the cat mation retrieval system (e.g., Google, Yahoo, Northern egorizer 3610. 15 Lights), the query can be refined by filtering and/or ranking In operation as shown in FIG. 38, the document content the results returned by the query mechanism using the 3612 or alternatively limited context (i.e., words, sentences, classification labels or its associated characteristic Vocabu or paragraphs) surrounding the entity 3808 is analyzed by lary in a number of ways. For example, results can be ranked categorizer 3610 to produce a set of categories 3620. It will from most relevant to least by matching returned document be appreciated that although the description is limited to profiles against the classification labels or the characteristic document content it may in also include enriched document Vocabulary of the predicted class by: using a categorizer; or content. In addition, the document content 3612 is analyzed using a similar metric in the case of the characteristic by short length aspect vector generator 3820 to formulate a Vocabulary. Such as the cosine distance or similarity measure short length aspect vector 3822. In an alternate embodiment, base on an LSI transformation of the original feature space. the aspect vector generator 3820 forms part of the catego 25 The results of these more precise queries are used to enrich rizer 3610. original document content. In one embodiment, documents In one embodiment, the query generator 3810 coalesces are enriched by the meta-document server 200 described these four elements (i.e., entity 3808, category 3620, aspect above, the operation of which involves automatically vector 3822, and category vocabulary 3621) to automati executing the query, for example, on the Internet, and cally formulate query 3812. Advantageously, the query 3812 30 retrieving the query results and linking these results to the may be contextualized at different levels: first, the query is original terms and entities in document content. set to be directed in a specific category of an information FIG. 39 illustrates an example of a query 3930 contex retrieval system that may, for example, be hierarchically tualized using classification labels 3920 of document cat organized; second, the query may be augmented with addi egorization hierarchy 3900. Using document content 3902, tional terms defined in aspect vector 3822; third, the query 35 the categorizer 3610 identifies classification labels 3920. may be further augmented with additional terms related to These labels identify nodes 3910, 3912, and 3914 of the the category vocabulary 3621. In alternate embodiments top-level node 3904. Specifically in this example, the enti described below a query can be contextualized using just ties “seven and “up' are determined by categorizer 3610 to one of the category 3620 and the aspect vector 3822. relate most appropriately to the class of documents found in After generating the query, in one example embodiment, 40 the directory Science>biology>genetics. As specified at it is used by the meta-document server 200 to access content 3930, the search is focused on documents found in the single provided by networks services 206 (introduced in FIG. 2). node of the document hierarchy genetics, at 3910. The content provided as a result of the query can then be F3.2 Aspect Vector Generation used by the content manager 208 to enrich the original As set forth above, personalities recognize certain entities document content 3612. In another embodiment, the content 45 in a document and search for information concerning them is provided to a user as a result of performing a search on a in personality-specific data sources. Aspect vectors add a specified entity 3808. small amount of context to the entity to restrict a search for F.3.1 Category Generation information, thereby making the search more precise. In generating the set of categories 3620, the categorizer In operation when an entity is found in document content 3610 classifies input document to generate classification 50 by a document service request, that entity will be used by labels for the document content 3612. Terms and entities another document service request to gather and filter infor (i.e., typed terms, such as people organizations, locations, mation concerning that entity. Producing an aspect vector etc.) are extracted from the document content. For example, contextualizes queries related to the entities by examining a given a classification scheme such as a class hierarchy (e.g., portion of the document content that may range from all of from a DMOZ ontology that is available on the Internet at 55 it to one or more paragraphs and/or segments around the dmoZ.org) in which documents are assigned class labels (or entity. assigned to nodes in a labeled hierarchy), a classification The aspect vector is produced by analyzing a documents profile is derived that allows document content to be textual content using natural language processing in order to assigned to an existing label or to an existing class, by extract different facets of the document. In one embodiment, measuring the similarity between the new document and the 60 three facets of document content are examined (i.e., tokens known class profiles. (i.e., words), phrases, and rare words) to identify terms to Document classification labels define the set of categories retain. The retained terms are added to the recognized entity, 3620 output by the categorizer 3610. These classification in order to increase the precision of the query. labels in one embodiment are appended to the query 3812 by Tokens from the document are identified using words that query generator 3810 to restrict the scope of the query (i.e., 65 are normalized using, for example, techniques such as the entity 3808 and the context vector 3822) to folders mapping uppercase characters to lower case, Stemming, etc. corresponding to classification labels in a document collec These tokens are divided into two parts: words appearing in US 7,133,862 B2 51 52 a list of stop words (e.g., in, a, the, of etc.); and all other describe the category. In one embodiment, the category words. Tokens identified in the list of stop words are Vocabulary is generated a priori and associated with each discarded and the remaining words are sorted by decreasing category in an ontology. At 4007, for the particular category frequency to define a sorted list of words. From the sorted identified at 4004, a node in the organizational structure of list of words, the N (e.g., N=3) most frequent words are the categories is located. retained. In addition, some of these N (e.g., N=2) words are At 4008, if the node located node has not been searched specially marked so that their presence becomes mandatory with the query, then the query as it is defined is directed to in documents retrieved by the query. the located node in the category organization at 4010. At Phrases in document content are defined either using a 4009, if the root node has not already been searched using language parser which recognizes phrases, or approximated 10 the defined query, then the node in the category organization by some means (e.g., taking all sequences of words between at which the category is defined is changed at 4014 to its stopwords as a phrase). Only phrases consisting of two or parent node. The parent node in a category organization is more words are retained. These remaining phrases are sorted generally less descriptive than the child node. The root node by decreasing frequency. The top M (e.g. M-3) most fre defines the least descriptive category in the category orga quent phrases, possibly fulfilling a minimum frequency 15 nization. criteria (e.g. appearing more than once in the entire docu At 4012, if search results are obtained at 4010, then they ment), are retained. are evaluated for accuracy at 4016. If no results are obtained Rare words are defined as those (non-stopwords) appear at 4012, the node in the category organization at which the ing with a low frequency in some reference corpus (e.g. The category is defined is changed at 4014 to its parent node and British National Corpus of 100 million English words). All act 4008 is repeated. Note that if there is no parent of the non-stopwords are sorted by their frequency in the reference located node at 4014, then the node remains unchanged and corpus in ascending order. The top Pleast frequent words is by definition the root node. (e.g. P3), possibly fulfilling a minimum frequency criteria At 4016, if the search results are determined to be (e.g. appearing more than three times in the entire docu accurate (e.g., by user approval), then the process terminates ment), are retained. 25 at 4030. At this point the results of the query may, for Variants of this method include limiting the number of example, be displayed to a user or used to automatically context words used by a certain number of words or char enrich document content. acters, for example, certain information retrieval systems At 4018, if the results are not accurate at 4016, then a accept queries up to a length of 256 characters in length, determination is made whether a short run aspect vector has while others information retrieval systems accept queries 30 already been added to the query. If it has not already been that have a maximum often words. Another variant includes added then a short run aspect vector using the document using additional lists of ranked items extracted from other content and the entity as described above in section F.3.2 is facets of the text such as: (a) proper names (e.g., ranked by generated at 4020. At, 4022 the aspect vector is added to the decreasing frequency), (b) rare phrases (as with rare words, query and the node to which the query is pointing in the calculating rareness by frequency in a reference corpus, for 35 category organization is reset to the node that corresponds to example, an image of the WWW), (c) dates, (d) numbers, or its original categorization at 4024. Subsequently using this (e) geographical locations. Advantageously, mixing terms augmented query, act 4008 is repeated. from different facets of the document content to extracted Furthermore, if the query should need to be further entities improves precision of query related to marking up augmented at 4026 with the category vocabulary because of the entity. 40 inaccurate results found at 4016, then the category vocabu For example, assume a web page mentions a professor lary is added to the query at 4028, thereby further augment named Michael Jordan. Further assume that the entity iden ing the query. The node to which this augmented query is tified by the meta-document server 200 is Michael Jordan. pointing in the category organization is reset to the node that Sending the query “Michael Jordan' to an information corresponds to its original categorization at 4024 and act retrieval system Such as AltaVista identifies approximately 45 4008 is repeated. 1.2 million documents, with the 10 top-ranked documents FIG. 41 illustrates a client interface 4110 similar to the about the basketball player Michael Jordan. By augmenting client interface 1010 shown in FIG. 10. Unlike the client the entities “Michael Jordan” of the query with the aspects interface 1010, the client interface 4110 displays an aug Such as "computer Science', 'electrical engineering, and mented query that can be performed using a recognized “faculty members’ extracted from the document content, a 50 entity 1032 in a pop-up window 4102. The pop-up window more precise query can be formulated for identifying infor 4102 appears when a user locates the pointer 1030 in the mation relating to a professor named Michael Jordan. vicinity of the recognized entity 1032. The pop-up window F.3.3 Example 4102 illustrates one or more category organizations 4104 FIG. 40 sets forth a flow diagram which depicts one used in defining a query, as well as, classification aspects embodiment in which both categories and aspect vectors can 55 4106 and contextual aspect 4108 that are associated with the be used to improve the accuracy of an information retrieval query, each of which can be viewed and edited as shown in system. At 4002, one or more entities are extracted from a window 4112. To manually invoke a search based on an document. Entity identification or extraction can be per entity, the user selects the desired level in the category formed: (a) manually by a user, (b) automatically by entity organization and whether one or more aspects should be extractor 38.04 shown in FIG.38 using for example a method 60 used to augment the query. as described in section B.4, or (c) by the categorizer 3610. F.4 Finding an Expert for an Enriched Document At 4003, the extracted entity at 4002 is added to a query at In order to help a user understand a document, an expert 4003. service provides help finding experts for Subjects mentioned At 4004, the document from which the entity is extracted in a meta-document. In one embodiment, a user selects is categorized. Categorization involves producing a category 65 button 1036 in FIG. 10 after a document is uploaded. Once 3620 and a category vocabulary 3621. The category vocabu invoked, the expert Service uses as input whatever content lary for a category consists of one or more terms that (e.g., text, hyperlinks, graphics) that is available in the US 7,133,862 B2 53 54 current state of the document (e.g., the user may be com Advantageously, the notification system is not based on posing the document) to find an expert about the Subject. specifying a URL or a document repository to be watched Advantageously, a document text segment can be used by for changes. Instead this notification system is initiated by the expert utility to generate the query to access a database specifying a document service request of a meta-document. of experts, and manage the exchange of responses or docu Consequently, the notification of changes to information ments, within the context of a the meta-document system involves only that information which the user is concerned shown in FIG. 2. about. In addition, this form of notification provides a level In one embodiment, the expert utility operates by per of indirection, since the user is alerted about new informa forming the following steps: (a) the current state of a tion concerning entities in a document even if the document meta-document is input to the expert utility; (b) a profile is 10 content 102 or markup 108 never changes. created for the meta-document (or for a document segment More specifically, this change alerting document service selected by the user) either by traditional indexing means, or request is packaged in a personality that can be activated in by creating short query context as disclosed in section F.3 the meta-document server 200 (i.e., attached to a document) above or by categorizing as described in section F.1 above by the user. Initially, document service requests analyze a (Note that the profiles can be created for the entire document 15 document by linguistically processing the document to rec or for any segment of the document depending on the ognize entities within the document. These entities can be number of segments of the meta-document selected by a strings from a list (e.g., list of medicine names), or regular user.); (c) this profile is used to query a known website for expressions describing a multiplicity of entities (e.g., a experts (e.g. http://www.exp.com) or by finding the most proper name recognizer, a chemical formula recognizer, active rater for topics in that profile in Some recommenda etc.), or elements recognized by linguistic processing (e.g., tion system such as Knowledge Pump developed by Xerox noun phrases, words in a Subject-verb relations, etc.). Enti Corporation; and (d) pointers to and/or content regarding the ties may also have keys associated with them in another list experts found are referenced and/or brought back as anno or database (e.g., Xerox as an entity with stock key XRX). tation for the document segment selected. Another document service accepts these entities, their 25 associated keys, a procedure for accessing information for G. Additional Meta-Document Services each entity, an update period, information about the user This section sets forth additional services and embodi requesting the notification and a change significance level ments that in one embodiment may operate separate from or (e.g., Any Change, Minor Change, Major Change, etc.) as integral with the meta-document server. input. This document service request then performs the G.1 Notification of Enrichment 30 information access (e.g., local database access, accessing a content source on the Internet, etc.) for each entity at the As set forth above, when a personality 104 is attached to beginning of every update period. document content 102, the personality consists of many The document service request compares the data retrieved document service requests 106 identifying document Ser for each entity at the current and at the previous update vices that are periodically initiated by scheduler 204 to 35 period (i.e., the data retrieved for each entity in the previous examine the document content 102 and the document update is stored and accessible to the document service markup 108. By examining content and markup of a docu request). If the stored information is significantly different, ment, a document service may recognize a certain number of as described below, from the newly retrieved information the entities inside the document. The document service also may user is notified (e.g., via e-mail or any other notification link these entities to a multiplicity of data sources on the 40 mechanism) that new entity-specific information is available World Wide Web (i.e., WWW) or fetch the content of the and the user is also given a description of the change. The link, as provided in section 1220 of FIG. 12. In addition as document service request decides on significance using a part of the document service, the service may also filter change significance parameter that measures how much the and/or transform retrieved document content. new information differs from the stored information (e.g., by If desired, a user may specify whether to invoke a 45 comparing the number of characters, etc.). notification service that will notify a user upon completion In one embodiment, the change significance parameter of a document service. It will be appreciated by those skilled has a plurality of settings (e.g., high, low). For example, if in the art that document services may be able to be per the information retrieved for an entity previously was a web formed in real time and therefore not require notification of page, and the change significance parameter was set high, its completion to a user. In the event notification is required 50 then the user may be notified only if the length of the web to perform actions that cannot be performed in real time, as page length changes by more than 30%. If the change part of the properties 1210 of a personality shown in FIG. significance was set low, then the user would be notified if 12, a user may specify at 1204 whether to be notified by the page length changes by more than 5%. If the change email 1205, voice mail 1206, or SMS (Short Message significance parameter was set to any change, then any Service) text messaging over GSM1207 upon completion of 55 change in the page length would cause the user to be alerted. the service. An example of a service that requires significant In an alternate embodiment, the change significance param processing time is a combinatorial search of a list of words. eter is computed by storing any reduced description of the When a notification mechanism is selected at 1204, a accessed pages (e.g., hash function, significant words, all notification document service request is added to the speci non stop words, etc.) in the system and comparing the stored fied personality to alert the user who applied the personality 60 representations of the page to newly accessed representa to the document when significant changes appear on the web tions in order to determine change. or in a local database concerning any of the entities men G.2 Document-Centric Suggestions tioned in the document. The threshold amount of change that This section describes a mechanism that uses an infor invokes a notification service can be predefined by the user mation space Surrounding a document to provide an and/or system. In addition, the user may be provided with a 65 improved (e.g., more accurate and more stylish) document mechanism (not shown) for specifying a specific entity to be centric auto-completion system and auto-correction system watched for changes. that can be used during content creation. Document auto US 7,133,862 B2 55 56 completion saves a user from having to retype text (and aid of the entity database 4214 in the information space 4200 other document content such as graphics) and related Suggestions for expanding the entity fragments are defined. markup such as hyperlinks, bibliographic entries etc., by As illustrated in FIG. 43 the auto-completion-module 4302 providing Suggestions of words that have been used previ includes a tracking module 4304, a query formulation mod ously in a contextually similar manner. Document auto ule 4306, an information retrieval system a suggestion correction provides a textual correction system that dynami module 4310, and an insertion module 4312. cally updates the information space as corrections are made In the embodiment shown in FIG. 43, the entity database or accepted. 4214 in information space 4200 stores one or more text The meta-document server 200 described above is an objects (i.e., a word or collection of words that may take the example of one embodiment that can be used to create an 10 form of a string) that could be used to auto-complete users information space Surrounding a document, thereby creating textual input at editor 4314 destined to form part of docu a document-centric view of the world. An information space ment content 4203. Exactly what text objects define the includes document content, document markup, and infor entity database 4214 depends on the content 4203 and the mation relating to additions and/or changes relating to personality used to define the information space Surrounding document content (e.g., additions, changes, keystroke order 15 the content. etc.). For example, FIG. 42 illustrates an information space The tracking module 4304 interacts with the text editor 4200 that surrounds meta-document 4202. The meta-docu 4314. An example of a text editor is the Microsoft(R) Word ment 4202 includes content and markup. The markup editor. The tracking module 4304 monitors a user's input for enriches content of the information space of the meta auto-completion requests (e.g., via designated keystrokes) or document 4202, for example, by linking identified entity for partially input words (e.g., characters string of 2 or more 4204 in the meta-document content 4203 to a set of meta characters). In one embodiment, the tracking module 4304 is documents 4208. integral with the text editor 4314. In another embodiment, In addition, the markup of meta-document 4202 grows the the tracking module 4304 operates independent from the text information space 4200 on a document level (as opposed to editor 4314, for example, as an optional plug-in to text editor an entity level) at 4216 using similar documents 4206. The 25 4314. similar documents 4206 links to a set of meta-documents The query formulation module 4306 translates an auto 4210 that relate to the content 4203 as a whole and not to any completion request received from the tracking module 4304 single entity of the content 4203. Also, the document level into a query that is passed onto the information retrieval markup of the information space 4200 includes a reference module 4308. The information retrieval module 4308 4212 to an entity database 4214 of extracted entities, an 30 accepts the query derived from the auto-completion request example of which is shown in FIG. 48 and discussed in more detail below. and searches the entity database 4214 for possible auto It will be appreciated by those skilled in the art that the completions that would be best used to auto-complete (i.e., elements making up the meta-document information space match) the string currently input by the user. 4200 (e.g., document content 4203, the sets of meta-docu 35 The suggestion module 4310 either selects the most ments 4208 and 4210, and the entity database 4214) need not appropriate String match (i.e., high confidence completion) be collocated together in a single space and/or machine. or presents a list that is ranked or otherwise ordered in a Instead, the elements making up the meta-document infor predefined form of the most appropriate alternative comple mation space 4200 may be located physically distant from tions to the user of the text editor 4314. The user subse each other on different computer systems and/or file storage 40 quently selects one or none of these alternative Strings. If systems that operate independently across the network 221 one of the alternative strings is selected, the insertion module 4312 takes the selected string for auto-completion shown in FIG. 2. and auto-completes the current input string by inserting the The construction of the information space 4200 surround remaining characters of the selected String after the string ing a document can begin at document creation time by, for fragment. example, creating a document on the meta-document server 45 200. Once the information space Surrounding a document is Although the example discussed herein is limited to query created, the user or the system can exploit it during the construction in a text auto-completion context, it will be knowledge management cycle. The system in case of an appreciated by those skilled in the art that a similar analysis auto-completion service uses the information space of a can be used for other types of objects that need be auto particular meta-document(s) to aid in creating Suggestions 50 completed. For example in alternate embodiments, entity for completing input for a user. fragments in the auto-completion system may include other Auto-completion involves the process of automatically objects for completion besides text objects, such as multi completing one or more words without manually typing all media type objects. Multimedia type objects include any the characters that makeup that word(s). In one embodiment, input sequence (e.g., from an input device such as a key the user types the first few characters of a word, presses a 55 board, mouse, interaction device Such as a gesture recogni special request key to invoke completion, and the rest of the tion system, etc.), graphics object, Sound object, and images word is filled in. The completed word may also be rejected object. with the aid of another special key. If multiple alternatives As such, the database of auto-completions 4302 is no exist, the user may be prompted to select one from a longer just a list of text strings but list of tuples consisting displayed list of alternatives or to reject the proposed 60 of an access key (e.g., entity fragment), and of an object Such completions. as a string of words, a graphics object, and/or an input In one embodiment illustrated in FIG. 43, an auto sequence, that is used to auto-complete a user's input. This completion module 4302 operates with a text editor 4314 object may have associated with it various attribute descrip and the meta-document information space 4200. The auto tions that make up other fields in the database tuple. For completion module 4302 provides document-centric Sug 65 example, an auto-completion system for graphics would gestions to entity fragments (e.g., String fragments) added to Suggest the completion of a fourth side of a square once document content 4203 using the text editor 4314. With the three or even two sides have been drawn. US 7,133,862 B2 57 58 Unlike traditional auto-completion systems, which typi otherwise, or upon completion of 4512, the service 4406 cally use a static database of entities to auto-complete user's waits for additional signals from the editor 4314. input and provide facilities for the user to add one's own As illustrated in the flow diagram shown in FIG. 45, auto-completion entities, the auto-completion system 4300 populating the auto-completion database is an ongoing pro dynamically builds an auto-completion database of entities cess, which involves scanning the dynamic information (text or otherwise) from the information space that can be space of the document for entities that could prove useful for created around a document using information space creation auto-completion. The process of entity extraction for auto systems such as the meta-document server 200. Which completion varies according to the type of entity extracted. entities are extracted, how they are extracted, and indexed in Considered first is text based entity extraction. A text-based the auto-completion database is determined by the person 10 entity is defined as a word or collection of words that appear ality associated with the document. For example, biblio contiguously in the document information space. graphic entries may only be important for Scientific person An entry that is inserted into the auto-completion database alities. for a text entity as shown for example in FIG. 48 includes: FIG. 44 illustrates an alternate embodiment in which the (a) a key or multiple keys (e.g., all possible n-grams. Such as auto-completion module 4302 operates integrally with ele 15 bi-grams or tri-grams, that make up a word or phrase) for ments of the meta-document server 200 described above and specifying entity fragments to be searched; (b) the expanded shown in FIG. 2. This embodiment includes a document entity relating to the entity fragment (i.e., word or words initialization module 4404 in the user manager 214-for making up the entity, which may be delimited by punctua initializing a meta-document with a name, a personality and tion characters such as spaces, fullstops etc. or using gram other meta-values. (e.g., access privileges etc.). For mar rules which chunk words together into semantic entities example, user operating computer 226 inputs and/or edits Such as noun phrases, verb phrases etc.); (c) any markup document content using text editor 4314 that forms part of (such as hyperlinks, cross-references, footnotes etc.) that is the content manager 208 (or alternatively, part of computer associated with the entity; (d) any formatting (such as bold, 224). italic, font size, etc.) that is associated with the entity; (e) the While receiving input and/or edits to document content, 25 origin of the entity (e.g., location of the document containing the meta-document server 200 anticipates the information the entity, segment containing the entity, etc.); (f) the posi needs of the user creating and/or editing the document tion of the entity at its origin; (g) an identified part of speech content by creating an information space around the docu of the entity at its origin; and (h) the context (e.g., catego ment content that might be useful for the creator (and rization) of the entity at its origin. ultimately the reader) of the meta-document. As described 30 Other types of information stored in the database that are above, this information can be linked to the document or useful for suggesting more accurate completions include inserted into the document. The meta-document server 200 bibliographic entries and related citations. Such entries and dynamically maintains the information space 4200 such that citations can be stored in the database as markup and newly inserted input by the user causes the system to update recognized using known pattern recognition techniques and the meta-documents information space. Furthermore, some 35 machine learning techniques such as hidden Markov mod of services of a personality used to create the document els. Once recognized, this markup can be stored in the information space maybe be periodically carried out thereby auto-completion database in similar fashion as the entities. resulting in new markup/content for the document as new The key in the case of a bibliographic entry could consist of content is added to the document. 40 the authors names, a Subset of the characters that make up G.2.1 Creating and Updating Auto-Completion Database the authors names, or the citation associated with the bib Entries liographic entry. In particular, the auto-completion system shown in FIG. In addition, generic objects can also be recognized and 44 illustrates the manner in which the entity database 4214 recorded in the auto-completion entity database. A generic is used for auto-completion, as well as, service 4406 for 45 object can viewed as being made up of a sequence of inputs carrying out the process set forth in FIG. 45 for creating and Such as mouse movements, mouse clicks, keyboard inputs, updating the entity database 4214 dynamically from the human gestures as identified by a gesture recognition sys document information space 4200. In one embodiment, the tem, and facial expressions as recognized by facial recog service 4406 accessed by scheduler 204 begins at 4504 by nition system. Such input sequences can be stored in the initializing the database with entities from lexicons associ 50 auto-completion database and be indexed by the first n ated with the personality that has been assigned to the inputs in the sequence. For example, consider an input meta-document. In alternate embodiments, the database is sequence that consists of four straight lines that form a either initialized using an empty database or it is initialized rectangle. This sequence could be retrieved and used for using a database of domain specific lexicons. In operation, auto-completion of rectangles once the first one or two lines the lexicons are used to identify entities in the document 55 have been input, thereby alleviating the need for drawing the content that are to be enriched by predefined services (see rest of the rectangle. These input sequences could be iden for example FIG. 4). tified automatically using known data mining techniques, Subsequently at 4506, the module 4406 waits for a signal which search for general patterns in the input sequence. from text editor 4314 that document content 4203 has been It will be appreciated by those skilled in the art that when added and/or edited. At 4508, the information space is 60 using the method outlined in FIG. 45 for populating the updated based on the added and/or edited document content. auto-completion entity database, the entity database can At 4510, the updated information space (i.e., added and/or grow to be prohibitively large, therefore, some entity selec edited document content and enrichment associated there tion algorithms should be used at 4510 to select which with) is processed for entities that could potentially be used entities will provide the most benefit to the user in terms of for auto-completion. At 4512, if extracted entities are 65 time saved through auto-completion of these entities. For deemed to be appropriate for auto-completion, then they are example, text based entities could be selected based on the indexed and inserted into the database of entities 4214; length or the utility of the entity or combination of these.