US008768313B2
(12) United States Patent: Rodriguez
(10) Patent No.: US 8,768,313 B2
(45) Date of Patent: Jul. 1, 2014
(54) METHODS AND SYSTEMS FOR IMAGE OR AUDIO RECOGNITION PROCESSING

(75) Inventor: Tony F. Rodriguez, Portland, OR (US)

(73) Assignee: Digimarc Corporation, Beaverton, OR (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 819 days.

(21) Appl. No.: 12/855,996

(22) Filed: Aug. 13, 2010

(65) Prior Publication Data: US 2011/0143811 A1, Jun. 16, 2011

Related U.S. Application Data
(60) Provisional application No. 61/234,542, filed on Aug. 17, 2009.

(51) Int. Cl.: H04M 3/42 (2006.01); G06K 9/00 (2006.01); H04N 1/00 (2006.01); H04L 29/08 (2006.01); G06K 9/22 (2006.01)

(52) U.S. Cl.: CPC: H04N 1/00127 (2013.01); H04N 1/00244 (2013.01); H04N 1/00331 (2013.01); H04N 1/00328 (2013.01); H04L 67/16 (2013.01); G06K 9/228 (2013.01); G06K 9/00986 (2013.01); H04N 2201/0084 (2013.01); G06K 9/00288 (2013.01); H04N 1/00336 (2013.01). USPC: 455/414.1; 455/415; 382/276; 382/118

(58) Field of Classification Search: USPC 455/412, 414.1, 66.1, 556.1, 557, 415; 382/118, 173, 190, 276. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
7,272,407 B2* 9/2007 Strittmatter et al. 455/500
7,925,265 B2* 4/2011 Souissi 455/445
8,229,160 B2 7/2012 Rosenblatt
8,358,811 B2* 1/2013 Adams et al. 382/118
2002/0102966 A1 8/2002 Lev et al.
2004/0212630 A1 10/2004 Hobgood et al.
2005/0091604 A1 4/2005 Davis et al.
2006/0031684 A1 2/2006 Sharma
(Continued)

OTHER PUBLICATIONS
Ahmed et al., MACE: Adaptive Component Management Middleware for Ubiquitous Systems, Proc. 4th Int'l Workshop on Middleware for Pervasive and Ad-Hoc Computing, 2006.
(Continued)

Primary Examiner: Blane J. Jackson
(74) Attorney, Agent, or Firm: Digimarc Corporation

(57) ABSTRACT

Many of the detailed technologies are useful in enabling a smartphone to respond to a user's environment, e.g., so it can serve as an intuitive hearing and seeing device. A few of the detailed arrangements involve optimizing division of shared processing tasks between the phone and remote devices; using a phone GPU for exhaustive speculative execution and machine vision purposes (including facial recognition); novel device architectures involving abstraction layers that facilitate substitution of different local and remote services; interactions with private networks as they relate to audio/image processing; adapting the orders in which operations are executed, and the types of data that are exchanged with remote servers, in accordance with current context; reconfiguring networks based on sensed social affiliations among users and in accordance with predictive models of user behavior, etc. A great variety of other features and arrangements are also detailed.

16 Claims, 55 Drawing Sheets
[Sheet 1] USER SNAPS IMAGE -> recognition agents: FACIAL RECOGNITION; DCT, FFT and WAVELET TRANSFORMS; EDGE DETECTION; EIGENVALUE ANALYSIS; SPECTRUM ANALYSIS; OCR; PATTERN EXTRACTION; FOURIER-MELLIN TRANSFORM; TEXTURE CLASSIFIER; COLOR ANALYSIS; GIST PROCESSING; EMOTION CLASSIFIER; BARCODE DECODER; AGE DETECTION; WATERMARK DECODER; ORIENTATION DETECTION; SIMILARITY PROCESSOR; GENDER DETECTION; OBJECT SEGMENTATION; METADATA ANALYSIS; GEOLOCATION PROCESSING; ETC. -> ROUTER -> SERVICE PROVIDER 1, SERVICE PROVIDER 2, ... SERVICE PROVIDER N -> CELL PHONE USER INTERFACE.

US 8,768,313 B2, Page 2
(56) References Cited (Continued)

U.S. PATENT DOCUMENTS
2006/0047584 A1 3/2006 Vaschillo et al.
2006/0217199 A1 9/2006 Adcox et al.
2007/0100480 A1 5/2007 Sinclair et al.
2007/0159522 A1 7/2007 Neven
2009/0031381 A1 1/2009 Cohen et al.
2009/0237546 A1 9/2009 Bloebaum et al.
2009/0252383 A1 10/2009 Adam et al.
2009/0279794 A1 11/2009 Brucher et al.
2009/0285492 A1 11/2009 Ramanujapuram et al.
2010/0260426 A1 10/2010 Huang et al.

OTHER PUBLICATIONS
Balan, Tactics-Based Remote Execution for Mobile Computing, MobiSys 2003.
Chun, Augmented smartphone applications through clone cloud execution, Proc. 8th Workshop on Hot Topics in Operating Systems, May 2009.
Flinn et al., Balancing Performance, Energy, and Quality in Pervasive Computing, Proc. 22nd Int'l Conference on Distributed Computing Systems (ICDCS), Jul. 2002.
Flinn et al., Self-Tuned Remote Execution for Pervasive Computing, Proc. 8th Workshop on Hot Topics in Operating Systems (HotOS), May 2001.
Geihs et al., Modeling of Context-Aware Self-Adaptive Applications in Ubiquitous and Service-Oriented Environments, Software Engineering for Self-Adaptive Systems, Springer Berlin Heidelberg, Jun. 19, 2009, pp. 146-163.
Gu, Adaptive offloading inference for delivering applications in pervasive computing environments, Proc. IEEE Int'l Conf. on Pervasive Computing and Communication, 2003.
Kirsch-Pinheiro et al., Context-Aware Service Selection Using Graph Matching, 2nd Non Functional Properties and Service Level Agreements in Service Oriented Computing Workshop (NFPSLA-SOC'08), CEUR Workshop Proceedings, vol. 411, 2008.
Noble et al., Agile Application-Aware Adaptation for Mobility, Proc. ACM Symposium on Operating System Principles (SOSP), 1997.
Osman et al., The Design and Implementation of Zap: A System for Migrating Computing Environments, Proc. Fifth Symposium on Operating Systems Design and Implementation (OSDI), 2002.
Reichle et al., A Comprehensive Context Modeling Framework for Pervasive Computing Systems, Distributed Applications and Interoperable Systems, Springer Berlin Heidelberg, 2008.
Zachariadis et al., Adaptable Mobile Applications: Exploiting Logical Mobility in Mobile Computing, in Mobile Agents for Telecommunication Applications, Springer Berlin Heidelberg, 2003, pp. 170-179.
Zachariadis et al., Building Adaptable Mobile Middleware Services Using Logical Mobility Techniques, in Contributions to Ubiquitous Computing, Springer Berlin Heidelberg, 2007, pp. 3-26.

* cited by examiner
[Sheets 2-3] FIG. 2 and following: block diagrams, largely unrecoverable from the scan; legible (mirrored) labels include OBJECT, WIRELESS, NETWORK.
[Sheet 4] FIG. 4: KEYVECTOR DATA (PIXELS & DERIVATIVES) -> RESULT A, RESULT B, RESULT C. FIG. 4A: KEYVECTOR; PIXEL PACKET; ADDRESSING, ETC.

[Sheet 5] FIG. 4B: pixel packet detail. FIG. 5: VISUAL TASK COMMONALITY PIE CHARTS for TASK A, TASK B, TASK C, TASK D; shared component operations include FFT, RE-SAMPLING, and PDF417 BARCODE READING.
[Sheet 6] FIG. 6: RESIDENT CALL-UP VISUAL PROCESSING SERVICES (ON/OFF switches); COMMON SERVICES SORTER; LOW-LEVEL PIXEL PROCESSING LIBRARY; FLOW GATE CONFIGURATION, SOFTWARE PROGRAMMING; SENSOR HARDWARE; PIXEL/KEYVECTOR DATA (PROC'D PIXELS, ROUTING & OTHER); INTERNAL/EXTERNAL.

[Sheets 7-11] Further block diagrams, largely unrecoverable from the scan; legible labels include PROCESSOR and CELL PHONE PROCESSING.
[Sheet 12] FIG. 11: VISUAL QUERY PACKET comprising MY PROFILE (including who I am, and what I generally like/want), MY CONTEXT (including where I am, and what are my current desires), and MY VISUAL QUERY PIXELS THEMSELVES; packets/results are exchanged with a CONFIGURABLE ARRAY OF SERVICE PROVIDERS, or OFFERED AT AUCTION TO AN ARRAY OF SERVICE PROVIDERS.

[Sheet 13] Block diagram; legible (mirrored) label: USER DEVICE.
[Sheet 14] FIG. 13: SENSOR (12, 14) -> PROCESSING FOR OBJECT IDENTIFICATION (16); PROCESSING FOR HUMAN VISUAL SYSTEM, e.g., JPEG compression (15). FIG. 17: CAMERA (55) -> STAGE 1 INSTRUCTIONS (58a) -> STAGE 2 INSTRUCTIONS (58b) -> ... -> STAGE Y INSTRUCTIONS; STAGE Z INSTRUCTIONS (56); DATA (59).
[Sheet 15] FIG. 14: CELL PHONE (10); APPLICATION IN CELL PHONE; APPLICATIONS EXTERNAL FROM CELL PHONE.

[Sheet 16] Figure content unrecoverable from the scan.
[Sheets 17-20] Figures largely unrecoverable from the scan; legible (mirrored/rotated) labels include SYNCHRONIZATION, OTHER SERVICES, SERVER, PROTOCOL.
[Sheet 21] FIG. 19B: inputs to the local/remote routing decision: RESPONSE TIME NEEDED; ROUTING CONSTRAINTS; OPERATION DETAILS (high level operations; low level operations; common operations; operations that are preconditions to others; CPU pipeline length; pipeline stall risks; power consumption); USER PREFERENCES; HARDWARE (what special purpose hardware is local? what is current hardware utilization?); CONNECTIVITY STATUS; GEOGRAPHICAL CONSIDERATIONS; INFO RE REMOTE PROVIDER(S), e.g., readiness, speed, cost, attributes of importance to user. These feed RULES, HEURISTICS, which divide the work into OPERATIONS FOR LOCAL PROCESSING and OPERATIONS FOR REMOTE PROCESSING.
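A minimal sketch of how FIG. 19B-style factors might be folded into a single local-vs-remote score. The patent leaves the combining function unspecified; all field names, weights, and thresholds below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskFactors:
    battery_fraction: float      # remaining battery, 0..1
    response_deadline_ms: float  # how soon a result is needed
    link_up: bool                # connectivity status
    remote_ready: bool           # remote provider readiness
    remote_cost: float           # cost units charged by the remote provider
    stall_risk: float            # risk of a local pipeline stall, 0..1

def route_task(f: TaskFactors) -> str:
    """Fold the factors into one score and pick 'local' or 'remote'."""
    if not f.link_up or not f.remote_ready:
        return "local"                            # no usable remote path
    score = 0.0
    score += (1.0 - f.battery_fraction) * 2.0     # low battery favors remote
    score += f.stall_risk * 1.5                   # stall risk favors remote
    score -= f.remote_cost * 0.5                  # remote cost favors local
    if f.response_deadline_ms < 100:
        score -= 2.0                              # tight deadline favors local
    return "remote" if score > 0 else "local"
```

In the same spirit as the figure, the score is only one input: a real implementation would also consult user preferences and rules/heuristics before committing an operation to either side.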
[Sheet 22] MICRO MIRROR PROJECTOR.

[Sheets 23-24] Figure content unrecoverable from the scan.
[Sheet 25] FIGS. 23-25: flowcharts whose steps include: CAPTURE OR RECEIVE IMAGE; (optional) PRE-PROCESS IMAGE, e.g., IMAGE ENHANCEMENT, OBJECT SEGMENTATION/EXTRACTION, etc.; IMAGE AND/OR METADATA -> SUBMIT TO GOOGLE, RECEIVE AUXILIARY INFORMATION; FEATURE EXTRACTION; SEARCH FOR IMAGES WITH SIMILAR METRICS; AUGMENT/ENHANCE METADATA; USE SIMILAR IMAGE DATA IN LIEU OF USER IMAGE DATA; SUBMIT TO SERVICE.
[Sheet 26] FIG. 26: CAPTURE OR RECEIVE IMAGE -> FEATURE EXTRACTION -> SEARCH FOR IMAGES WITH SIMILAR METRICS -> SUBMIT PLURAL SETS OF IMAGE DATA TO SERVICE PROVIDER -> SERVICE PROVIDER, OR USER DEVICE, DETERMINES ENHANCED RESPONSE BASED ON PLURAL SETS OF DATA. FIG. 27: CAPTURE OR RECEIVE IMAGE -> FEATURE EXTRACTION -> SEARCH FOR IMAGES WITH SIMILAR METRICS -> COMPOSITE OR MERGE SEVERAL SETS OF IMAGE DATA -> SUBMIT TO SERVICE PROVIDER. FIG. 28: CAPTURE OR RECEIVE USER IMAGE -> FEATURE EXTRACTION -> SEARCH FOR IMAGES WITH SIMILAR METRICS -> USE INFO FROM SIMILAR IMAGE(S) TO ENHANCE USER IMAGE.
[Sheet 27] FIG. 28A: CAPTURE OR RECEIVE IMAGE -> FEATURE EXTRACTION -> SEARCH FOR IMAGES WITH SIMILAR METRICS -> HARVEST METADATA -> SEARCH FOR IMAGES WITH SIMILAR METADATA -> SUBMIT TO SERVICE. FIG. 30: CAPTURE OR RECEIVE IMAGE OF SUBJECT WITH GEOLOCATION INFO -> INPUT GEOLOCATION INFO TO DB; GET TEXTUAL DESCRIPTORS -> SEARCH IMAGE DATABASE USING TEXTUAL DESCRIPTORS; IDENTIFY OTHER IMAGE(S) OF SUBJECT -> SUBMIT TO SERVICE. FIG. 31: CAPTURE OR RECEIVE IMAGE OF SUBJECT WITH GEOLOCATION INFO -> SEARCH IMAGE DATABASE USING GEOLOCATION DATA; IDENTIFY OTHER IMAGE(S) OF SUBJECT -> SUBMIT OTHER IMAGE(S) TO SERVICE.

[Sheet 28] Figure content unrecoverable from the scan.
[Sheet 29] FIG. 32: CAPTURE OR RECEIVE IMAGE OF SUBJECT WITH GEOLOCATION INFO -> SEARCH FOR IMAGES WITH SIMILAR IMAGE METRICS -> ADD GEOLOCATION INFO TO MATCHING IMAGES. FIG. 33: CAPTURE OR RECEIVE IMAGE OF SUBJECT WITH GEOLOCATION INFO -> SEARCH FOR IMAGES WITH SIMILAR IMAGE METRICS -> HARVEST METADATA -> SEARCH FOR IMAGES WITH SIMILAR METADATA -> ADD GEOLOCATION INFO TO MATCHING IMAGES. FIG. 34: CAPTURE OR RECEIVE IMAGE OF SUBJECT WITH GEOLOCATION INFO -> SEARCH FOR IMAGES WITH MATCHING GEOLOCATION -> HARVEST METADATA -> SEARCH FOR IMAGES WITH SIMILAR METADATA -> ADD GEOLOCATION INFO TO MATCHING IMAGES.
[Sheet 30] FIG. 49: USER SNAPS IMAGE -> CONSUMER PRODUCT/OTHER ANALYSIS, URBAN/RURAL ANALYSIS, FAMILY/FRIEND/STRANGER ANALYSIS -> RESPONSE ENGINE #1, RESPONSE ENGINE #2, RESPONSE ENGINE #3 -> CONSOLIDATOR -> CELL PHONE USER INTERFACE. FIGS. 37A/37B: ROUTER -> RESPONSE ENGINE A, RESPONSE ENGINE B.
[Sheet 31] FIG. 39: USER IMAGE -> "IMAGE JUICER" (algorithmic, crowd-sourced, etc.) -> USER IMAGE-RELATED FACTORS (First Tier), e.g., data about one or more of: color histogram, pattern, shape, texture, facial recognition, eigenvalue, object recognition, orientation, text, numbers, OCR, barcode, watermark, emotion, location, transform data, etc. -> SEARCH ENGINE -> DATABASE OF PUBLIC IMAGES, IMAGE-RELATED FACTORS, AND OTHER METADATA -> OTHER IMAGE-RELATED FACTORS AND/OR OTHER METADATA (Second Tier) -> further search tiers -> FINAL IMAGES/CONTENT and FINAL OTHER INFORMATION (Nth Tier).
[Sheet 32] FIG. 40: as FIG. 39 (USER IMAGE -> "IMAGE JUICER" -> tiered searches of a DATABASE OF PUBLIC IMAGES, IMAGE-RELATED FACTORS, AND OTHER METADATA), with a GOOGLE TEXT SEARCH ENGINE and WEB DATABASE added in the cascade.
[Sheet 33] FIG. 41: "IMAGE JUICER" (algorithmic, crowd-sourced, etc.) emits, e.g., image eigenvalues; an image classifier concludes "drill". A SEARCH ENGINE queries the DATABASE OF PUBLIC IMAGES, IMAGE-RELATED FACTORS, AND OTHER METADATA, returning similar-looking pix and text metadata for similar-looking pix. A PROCESSOR produces ranked metadata, incl. "Drill," "Black & Decker" and "DR250B". A second search returns pix with similar metadata and cached web pages to ONE OR MORE IMAGE-BASED ROUTING SERVICES (may employ user preference data), e.g., SNAPNOW, and a BROWSER. USER INTERFACE options presented may include "Similar looks," "Similar descriptors," "Web results," "Buy," "Sell," "Manual," "More," "Snapnow," etc.
[Sheet 34] FIG. 42: as FIG. 41, but the "IMAGE JUICER" emits geolocation coordinates and data re straight-edge/arced-edge density; the database returns pix that are similarly located, with similar edge/arc density, plus text metadata for such pix; the PROCESSOR produces ranked metadata, incl. "Eiffel Tower" and "Paris". USER INTERFACE options presented may include "Similar looks," "Similar descriptors," "Similar location," "Web results," "Snapnow," Map, Directions, History, Search nearby, etc.
[Sheet 35] FIG. 43: "IMAGE JUICER" (algorithmic, crowd-sourced, etc.) emits, e.g., geolocation coordinates and eigenvectors -> SEARCH ENGINE consults a DATABASE OF PUBLIC IMAGES, IMAGE-RELATED FACTORS, AND OTHER METADATA (returning neighborhood scenes and text metadata from similarly-located images) and an MLS DATABASE OF IMAGES, IMAGE-RELATED FACTORS, AND OTHER METADATA (returning address, zip code, school district, square footage, price, bedrooms/baths, acreage, time on market, data/imagery re similarly-located houses, similarly-priced houses, similarly-featured houses, etc.) -> GOOGLE, GOOGLE EARTH -> USER INTERFACE.
[Sheets 36-37] FIG. 44, FIGS. 45A-45B: content largely unrecoverable from the scan.
[Sheet 38] FIG. 46A: GET GEOCOORDINATES FROM IMAGE -> SEARCH FLICKR, ETC., FOR SIMILARLY LOCATED IMAGES (AND/OR SIMILAR IN OTHER WAYS): "SET 1" IMAGES -> HARVEST, CLEAN-UP AND CLUSTER METADATA FROM SET 1 IMAGES -> FROM METADATA, DETERMINE LIKELIHOOD THAT EACH IMAGE IS PLACE-CENTRIC (CLASS 1) -> FROM METADATA AND FACE-FINDING, DETERMINE LIKELIHOOD THAT EACH IMAGE IS PERSON-CENTRIC (CLASS 2) -> FROM METADATA, AND ABOVE SCORES, DETERMINE LIKELIHOOD THAT EACH IMAGE IS THING-CENTRIC (CLASS 3) -> DETERMINE WHICH IMAGE METRICS RELIABLY GROUP LIKE-CLASSED IMAGES TOGETHER, AND DISTINGUISH DIFFERENTLY-CLASSED IMAGES. Example clustered metadata: Rockefeller Center (21), New York (18), NYC (10), Manhattan (8), Empire State Building (4), Midtown (4), Top of the rock (4), Nueva York (3), USA (3), 30 Rock (2), August (2), Atlas (2), Christmas (2), Cityscape (2), Fifth Avenue (2), Me (2), Prometheus (2), Skating rink (2), Skyline (2), Skyscraper (2), Clouds (1), General Motors Building (1), Gertrude (1), Jerry (1), Lights (1), Statue (1).
[Sheet 39] FIG. 46B: APPLY TESTS TO INPUT IMAGE; DETERMINE FROM METRICS ITS SCORE IN DIFFERENT CLASSES -> PERFORM SIMILARITY TESTING BETWEEN INPUT IMAGE AND EACH IMAGE IN SET 1 -> WEIGHT METADATA FROM EACH IMAGE IN ACCORDANCE WITH SIMILARITY SCORE; COMBINE (e.g., 19 Rockefeller Center, 12 Prometheus, 5 Skating rink) -> SEARCH IMAGE DATABASE FOR ADD'L SIMILARLY-LOCATED IMAGES, THIS TIME ALSO USING INFERRED POSSIBLE METADATA AS SEARCH CRITERIA: "SET 2" (e.g., New York (17), Rockefeller Center (12), Manhattan (6), Prometheus (5), Fountain (4), Christmas (3), Paul Manship (3), Sculpture (3), Skating Rink (3), National register of historical places (2), Nikon (2), RCA Building (2), Jerry (1), Greek mythology (1)) -> PERFORM SIMILARITY TESTING BETWEEN INPUT IMAGE AND EACH IMAGE IN SET 2 -> WEIGHT METADATA FROM EACH IMAGE IN ACCORDANCE WITH SIMILARITY SCORE; COMBINE -> FURTHER WEIGHT METADATA IN ACCORDANCE WITH UNUSUALNESS (e.g., 5 Prometheus, 4 Fountain, 3 Paul Manship, 3 Rockefeller Center, 3 Sculpture, 2 National register of historical places, 2 RCA Building, 1 Greek mythology).
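The FIG. 46B weighting step (weight each reference image's metadata by that image's similarity to the input image, then combine) can be sketched as below; the similarity scores and terms are invented example data, not figures from the patent.

```python
from collections import Counter

def weight_and_combine(images):
    """images: list of (similarity_score, [metadata terms]) pairs.
    Each image's terms are credited in proportion to how similar that
    image is to the input image; the credits are then summed and ranked."""
    combined = Counter()
    for score, terms in images:
        for term in terms:
            combined[term] += score
    return combined.most_common()

# Invented "Set 1" results with made-up similarity scores:
set1 = [
    (0.9, ["Rockefeller Center", "Prometheus"]),
    (0.6, ["Rockefeller Center", "Skating rink"]),
    (0.2, ["New York"]),
]
ranked = weight_and_combine(set1)
```

The same helper applies unchanged to the second round against "Set 2"; the figure's further "unusualness" weighting would just rescale each combined count before ranking.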
[Sheet 40] FIG. 47: USER SNAPS IMAGE -> IMAGE ANALYSIS, DATA PROCESSING, drawing on: GEO INFO; PROFILE DATA; DATABASE; PLACE-TERM GLOSSARY; GOOGLE, ETC.; PICASA and FLICKR DATA; WORD FREQUENCY DATABASE; PERSON-NAME GLOSSARY; FACEBOOK DATA -> RESPONSE ENGINE -> CELL PHONE USER INTERFACE.
[Sheet 41] FIG. 48A: USER SNAPS IMAGE -> PERSON/PLACE/THING ANALYSIS -> FAMILY/FRIEND/STRANGER ANALYSIS, CHILD/ADULT ANALYSIS -> RESPONSE ENGINE -> CELL PHONE USER INTERFACE. FIG. 48B: USER SNAPS IMAGE -> FAMILY/FRIEND/STRANGER ANALYSIS, PERSON/PLACE/THING ANALYSIS, CHILD/ADULT ANALYSIS (in parallel) -> RESPONSE ENGINE -> CELL PHONE USER INTERFACE.
[Sheet 42] FIG. 50: USER SNAPS IMAGE -> recognition agents: FACIAL RECOGNITION; DCT TRANSFORM; FFT TRANSFORM; WAVELET TRANSFORM; EDGE DETECTION; EIGENVALUE ANALYSIS; SPECTRUM ANALYSIS; OCR; PATTERN EXTRACTION; FOURIER-MELLIN TRANSFORM; TEXTURE CLASSIFIER; COLOR ANALYSIS; GIST PROCESSING; EMOTION CLASSIFIER; BARCODE DECODER; AGE DETECTION; WATERMARK DECODER; ORIENTATION DETECTION; SIMILARITY PROCESSOR; GENDER DETECTION; OBJECT SEGMENTATION; METADATA ANALYSIS; GEOLOCATION PROCESSING; ETC. -> ROUTER -> SERVICE PROVIDER 1, SERVICE PROVIDER 2, ... SERVICE PROVIDER N -> CELL PHONE USER INTERFACE.
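The FIG. 50 hub can be modeled as a router that forwards each kind of recognition-agent output to whichever service provider has registered for it. The provider names and handler bodies below are invented for illustration; the patent does not prescribe this API.

```python
class Router:
    """Forward recognition-agent outputs, by data kind, to registered
    service providers (a stand-in for the FIG. 50 router block)."""
    def __init__(self):
        self._providers = {}

    def register(self, kind, handler):
        self._providers[kind] = handler

    def dispatch(self, kind, payload):
        handler = self._providers.get(kind)
        if handler is None:
            raise KeyError(f"no service provider registered for {kind!r}")
        return handler(payload)

router = Router()
# Two hypothetical providers: one per kind of derived data.
router.register("barcode", lambda data: f"barcode service got {len(data)} bytes")
router.register("eigenface", lambda data: "facial-recognition service")

result = router.dispatch("barcode", b"\x01\x02\x03")
```

Results flowing back from providers would then be merged for presentation on the cell phone user interface, as the figure's bottom block suggests.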
[Sheet 43] FIG. 51: INPUT IMAGE, OR INFORMATION ABOUT THE IMAGE -> classification -> candidate actions, e.g.: (a) Post annotated photo to user's Facebook page, (b) Start a text message to depicted person; (a) Show estimations of who this may be, (b) Send "Friend" invite on MySpace, (c) What are the degrees of separation between us; (a) Buy online, (b) Show nutrition information, (c) Identify local stores that sell.
[Sheet 44] FIG. 52: COMPUTER A (CONTENT SET A) with ROUTER A and RESPONSE ENGINES 1A, 2A; COMPUTER B (CONTENT SET B) with ROUTER B and RESPONSE ENGINE B; COMPUTER C (CONTENT SET C) with RESPONSE ENGINE C; COMPUTER D (CONTENT SET D) with ROUTER D; COMPUTER E (CONTENT SET E); ROUTER (150); RESPONSE ENGINE.
[Sheet 45] FIG. 54: FEATURE VECTORS; SIGN GLOSSARY. FIG. 55: WWW.SMITH.HOME.COM/BABYCAM.HTM; WWW.TRAFFIC.COM; TRAFFIC MAP; identifier "52828A2796C7B".
[Sheet 46] FIG. 56 (Prior Art):
IMAGE SENSORS
AMPLIFY ANALOG SIGNALS
CONVERT TO DIGITAL NATIVE IMAGE DATA
BAYER INTERPOLATION
WHITE BALANCE CORRECTION
GAMMA CORRECTION
EDGE ENHANCEMENT
JPEG COMPRESSION
STORE IN BUFFER MEMORY
DISPLAY CAPTURED IMAGE ON SCREEN
ON USER COMMAND, STORE IN CARD AND/OR TRANSMIT
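The prior-art FIG. 56 chain is strictly sequential, so it can be modeled as an ordered list of stage functions applied one after another. The stage bodies below are placeholders that merely record their names; real stages would transform pixel data.

```python
# Placeholder stages named after the FIG. 56 steps; each appends a marker
# so the ordering of the chain is observable.
def bayer_interpolation(d):        return d + ["bayer"]
def white_balance_correction(d):   return d + ["white-balance"]
def gamma_correction(d):           return d + ["gamma"]
def edge_enhancement(d):           return d + ["edge"]
def jpeg_compression(d):           return d + ["jpeg"]

PIPELINE = [bayer_interpolation, white_balance_correction,
            gamma_correction, edge_enhancement, jpeg_compression]

def run_pipeline(native_image_data):
    """Apply every stage in order, as in the fixed prior-art chain."""
    data = native_image_data
    for stage in PIPELINE:
        data = stage(data)
    return data

processed = run_pipeline(["sensor-data"])
```

The point of the fixed ordering is exactly what FIG. 57 then relaxes: there, some branches run in parallel and are gated by cheap screening tests.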
[Sheet 47] FIG. 57: IMAGE SENSORS -> AMPLIFY ANALOG SIGNALS -> CONVERT TO DIGITAL NATIVE IMAGE DATA -> parallel branches: BAYER INTERPOLATION, ETC. (as in FIG. 56); FOURIER TRANSFORM, ETC.; EIGENFACE CALCULATION, ETC. Screening tests then gate further work: MATCH BARCODE TEMPLATE? yes -> MELLIN TRANSFORM -> SEND DIGITAL NATIVE IMAGE DATA AND/OR FOURIER DOMAIN DATA TO BARCODE SERVICE. MATCH TEXT TEMPLATE? yes -> SEND DIGITAL NATIVE IMAGE DATA AND/OR F-M DOMAIN DATA TO OCR SERVICE. MATCH ORIENTATION TEMPLATE? yes -> SEND DIGITAL NATIVE IMAGE DATA AND/OR F-M DOMAIN DATA TO WM SERVICE. MATCH FACE TEMPLATE? yes -> SEND DIGITAL NATIVE IMAGE DATA AND/OR F-M DOMAIN DATA AND/OR EIGENFACE DATA TO FACIAL RECOGNITION SERVICE.
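The "match template? -> yes -> send to service" branches of FIG. 57 amount to a cheap screen gating an expensive operation, echoing arrangement (N) in the specification. The screening heuristic below (a dominant spectral peak) is invented for the sketch, not the patent's test.

```python
def looks_like_barcode(fourier_data):
    """Invented screening test: one spectral coefficient dominating the
    average of the rest suggests a strongly periodic (barcode-like) scene."""
    peak = max(fourier_data)
    rest = (sum(fourier_data) - peak) / (len(fourier_data) - 1)
    return peak > 10 * rest

def process_frame(fourier_data, barcode_service):
    # Mirror the 'MATCH BARCODE TEMPLATE? yes' branch: only escalate to
    # the (heavier, possibly remote) barcode service when the screen fires.
    if looks_like_barcode(fourier_data):
        return barcode_service(fourier_data)
    return None

service = lambda data: ("barcode-service", data)
hit = process_frame([1, 1, 100, 1], service)    # peaked spectrum: escalated
miss = process_frame([1, 2, 3, 4], service)     # flat spectrum: screened out
```

Analogous screens (text, orientation, face templates) would gate the OCR, watermark, and facial-recognition branches the same way.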
[Sheet 48] FIG. 59: IMAGE CAPTURE/PROCESSING -> ANALYSIS, MEMORY -> OUTPUT DEPENDENT ON IMAGE DATA; annotations include: FROM APPARENT HORIZON; ROTATION SINCE START: 17 degrees left; SCALE CHANGE SINCE START: 30% closer.

[Sheet 49] Figure content unrecoverable from the scan.
[Sheet 50] FIG. 61: derived data feeding METADATA: FOURIER TRANSFORM DATA -> BARCODE INFORMATION; FOURIER-MELLIN TRANSFORM DATA -> OCR INFORMATION, WATERMARK INFORMATION; DCT DATA -> MOTION INFORMATION; EIGENFACE DATA -> FACE INFORMATION; SOBEL FILTER DATA; WAVELET TRANSFORM DATA; COLOR HISTOGRAM DATA.
[Sheets 51-52] FIGS. 67-69: content unrecoverable from the scan.
[Sheet 53] FIGS. 72-73: contrast sensitivity plots; axes include Spatial Frequency (Cycles/Degree), Sensitivity (1/Threshold Visible Contrast), and Spatial Frequency (Cycles/mm on Retina).
[Sheet 54] FIG. 76: THE WALL STREET JOURNAL page; CAPTURED IMAGE FRAME; OFFSET OF DECODED WATERMARK.
[Sheet 55] FIG. 77A: START -> ROUGH OUTLINES -> LARGE DETAILS -> INTERMEDIATE DETAILS -> FINE DETAILS (R, G, B Data). FIG. 77B: START -> ... -> FACIAL EIGENVALUES (Alpha Channel Data).
US 8,768,313 B2

METHODS AND SYSTEMS FOR IMAGE OR AUDIO RECOGNITION PROCESSING

RELATED APPLICATION DATA

This application is a non-provisional of, and claims priority to, provisional application 61/234,542, filed Aug. 17, 2009.

The present technology builds on, and extends, technology disclosed in other patent applications by the present assignee. The reader is thus directed to the following applications that serve to detail arrangements in which applicants intend the present technology to be applied, and that technically supplement the present disclosure:

application Ser. No. 12/271,772, filed Nov. 14, 2008 (published as 20100119208);
application Ser. No. 61/150,235, filed Feb. 5, 2009;
application Ser. No. 61/157,153, filed Mar. 3, 2009;
application Ser. No. 61/167,828, filed Apr. 8, 2009;
application Ser. No. 12/468,402, filed May 19, 2009 (now U.S. Pat. No. 8,004,576);
application Ser. No. 12/484,115, filed Jun. 12, 2009 (now U.S. Pat. No. 8,385,971); and
application Ser. No. 12/498,709, filed Jul. 7, 2009 (now published as 20100261465).

The disclosures of all the above-identified documents are incorporated herein by reference.

INTRODUCTION

The present specification details a diversity of technologies, assembled over an extended period of time, to serve a variety of different objectives. Yet they relate together in various ways, and can be used in conjunction, and so are presented collectively in this single document.

This varied, interrelated subject matter does not lend itself to a straightforward presentation. Thus, the reader's indulgence is solicited as this narrative occasionally proceeds in nonlinear fashion among the assorted topics and technologies.

BACKGROUND

Digimarc's U.S. Pat. No. 6,947,571 shows a system in which a cell phone camera captures content (e.g., image data), and processes same to derive information related to the imagery. This derived information is submitted to a data structure (e.g., a remote database), which indicates corresponding data or actions. The cell phone then displays responsive information, or takes responsive action. Such a sequence of operations is sometimes referred to as "visual search."

Related technologies are shown in patent publications 20080300011 (Digimarc), U.S. Pat. No. 7,283,983 and WO07/130,688 (Evolution Robotics), 20070175998 and 20020102966 (DSPV), 20060012677, 20060240862 and 20050185060 (Google), 20060056707 and 20050227674 (Nokia), 20060026140 (ExBiblio), U.S. Pat. No. 6,491,217, 20020152388, 20020178410 and 20050144455 (Philips), 20020072982 and 20040199387 (Shazam), 20030083098 (Canon), 20010055391 (Qualcomm), 20010001854 (AirClic), U.S. Pat. Nos. 7,251,475 (Sony), 7,174,293 (Iceberg), 7,065,559 (Organnon Wireless), 7,016,532 (Evryx Technologies), 6,993,573 and 6,199,048 (Neomedia), 6,941,275 (TuneHunter), 6,788,293 (Silverbrook Research), 6,766,363 and 6,675,165 (BarPoint), 6,389,055 (Alcatel-Lucent), 6,121,530 (Sonoda), and 6,002,946 (Reber/Motorola).

The presently-detailed technology concerns improvements to such technologies, moving towards the goal of intuitive computing: devices that can see and/or hear, and infer the user's desire in that sensed context.

SELECTED FEATURES

As will be apparent, the present specification details a wealth of novel technologies. To give the reader an introductory overview, a few such arrangements are reviewed in the following paragraphs:

(A) A portable device that receives input from one or more physical sensors, employs processing by one or more local services, and also employs processing by one or more remote services, wherein software in the device includes one or more abstraction layers through which said sensors, local services, and remote services are interfaced to the device architecture, facilitating substitution.

(B) A portable device that receives input from one or more physical sensors, processes the input and packages the result into key vector form, and transmits the key vector form from the device. Also, such an arrangement in which the device receives a further-processed counterpart to the key vector back from a remote resource to which the key vector was transmitted. Also, such an arrangement in which the key vector form is processed, on the portable device or a remote device, in accordance with one or more instructions that are implied in accord with context.

(C) A distributed processing architecture for responding to physical stimulus sensed by a cell phone (aka "smartphone"), the architecture employing a local process on the cell phone, and a remote process on a remote computer, the two processes being linked by a packet network and an inter-process communication construct, the architecture also including a protocol by which different processes may communicate, this protocol including a message passing paradigm with either a message queue, or a collision handling arrangement. Also, such an arrangement in which driver software for one or more physical sensor components provides sensor data in packet form and places the packet on an output queue, either uniquely associated with that sensor or common to plural components; wherein local processes operate on the packets and place resultant packets back on the queue, unless the packet is to be processed remotely, in which case it is directed to a remote process by a router arrangement.

(D) An arrangement in which a network associated with a particular physical venue is adapted to automatically discern whether a set of visitors to the venue have a social connection, by reference to traffic on the network. Also, such an arrangement that also includes discerning a demographic characteristic of the group. Also, such an arrangement in which the network facilitates ad hoc networking among visitors who are discerned to have a social connection.

(E) An arrangement wherein a network including computer resources at a public venue is dynamically reconfigured in accordance with a predictive model of behavior of users visiting said venue. Also, such an arrangement in which the network reconfiguration is based, in part, on context. Also, such an arrangement wherein the network reconfiguration includes caching certain content. Also, such an arrangement in which the reconfiguration includes rendering synthesized content and storing same in one or more of the computer resources to make same more rapidly available. Also, such an arrangement that includes throttling back time-insensitive network traffic in anticipation of a temporal increase in traffic from the users.

(F) An arrangement in which advertising is associated with real world content, and a charge therefor is assessed based on surveys of exposure to said content, as indicated by sensors in users' cell phones. Also, such an arrangement in which the charge is set through use of an automated auction arrangement.

(G) An arrangement including two subjects in a public venue, wherein illumination on said subjects is changed differently, based on an attribute of a person proximate to the subjects.

(H) An arrangement in which content is presented to persons in a public venue, and there is a link between the presented content and auxiliary content, wherein the linked auxiliary content is changed in accordance with a demographic attribute of a person to whom the content is presented.

(I) An arrangement wherein a temporary electronic license to certain content is granted to a person in connection with the person's visit to a public venue.

(J) An arrangement for recognition processing of stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which a decision regarding whether a first task should be performed on the device hardware or on a remote processor is made in automated fashion based on consideration of at least one factor drawn from both of the following groups: (1) bandwidth costs, external service provider costs, power costs to the cell phone battery, intangible costs in consumer (dis-)satisfaction by delaying processing, available processing capacity of the remote processor(s), and distance to the remote processor(s); and (2) routing constraints, geographical considerations other than distance to the remote processor(s), risk of pipeline stall, and the relation of the first task to other processing tasks; wherein in some circumstances the first task is performed on the device hardware, and in other circumstances the first task is performed on the remote processor. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(K) An arrangement for processing stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which a sequence in which a set of tasks should be performed is determined in automated fashion based on consideration of two or more different factors drawn from a set that includes at least: mobile device power considerations; response time needed; routing constraints; state of hardware resources within the mobile device; connectivity status; geographical considerations; risk of pipeline stall; information about the remote processor, including its readiness, processing speed, cost, and attributes of importance to a user of the mobile device; and the relation of the task to other processing tasks; wherein in some circumstances the set of tasks is performed in a first sequence, and in other circumstances the set of tasks is performed in a second, different, sequence. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(L) An arrangement for processing stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which packets are employed to convey data between processing tasks, and the contents of the packets are determined in automated fashion based on consideration of two or more different factors drawn from a set that includes at least: mobile device power considerations; response time needed; routing constraints; state of hardware resources within the mobile device; connectivity status; geographical considerations; risk of pipeline stall; information about the remote processor, including its readiness, processing speed, cost, and attributes of importance to a user of the mobile device; and the relation of the task to other processing tasks; wherein in some circumstances the packets may include data of a first form, and in other circumstances the packets may include data of a second form. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(M) An arrangement wherein a venue provides data services to users through a network, and the network is arranged to deter use of electronic imaging by users while in the venue. Also, such an arrangement in which the deterrence is effected by restricting transmission of data from user devices to certain data processing providers external to the network.

(N) An arrangement in which a mobile communications device with image capture capability includes a pipelined processing chain for performing a first operation, and a control system that has a mode in which it tests image data by performing a second operation thereon, the second operation being computationally simpler than the first operation, and the control system applies image data to the pipelined processing chain only if the second operation produces an output of a first type.

(O) An arrangement in which a cell phone is equipped with a GPU to facilitate rendering of graphics for display on a cell phone screen, e.g., for gaming, and the GPU is also employed for machine vision purposes. Also, such an arrangement in which the machine vision purpose includes facial detection.

(P) An arrangement in which plural socially-affiliated mobile devices, maintained by different individuals, cooperate in performing a machine vision operation. Also, such an arrangement wherein a first of the devices performs an operation to extract facial features from an image, and a second of the devices performs template matching on the extracted facial features produced by the first device.

(Q) An arrangement in which a voice recognition operation is performed on audio from an incoming video or phone call to identify a caller. Also, such an arrangement in which the voice recognition operation is performed only if the incoming call is not identified by CallerID data. Also, such an arrangement in which the voice recognition operation includes reference to data corresponding to one or more earlier-stored voice messages.

(R) An arrangement in which speech from an incoming video or phone call is recognized, and text data corresponding thereto is generated as the call is in process. Also, such an arrangement in which the incoming call is associated with a particular geography, and such geography is taken into account in recognizing the speech. Also, such an arrangement in which the text data is used to query a data structure for auxiliary information.

(S) An arrangement for populating overlay baubles onto a mobile device screen, derived from both local and cloud processing. Also, such an arrangement in which the overlay baubles are tuned in accordance with user preference information.

(T) An arrangement wherein a user may (1) be charged by a vendor for a data processing service, or alternatively (2) be provided the service free, or even receive credit from the vendor, if the user takes certain action in relation thereto.

(U) An arrangement in which a user receives a commercial benefit in exchange for being presented with promotional content, as sensed by a mobile device conveyed by the user.

FIG. 11 illustrates that tasks referred for external processing can be routed to a first group of service providers who routinely perform certain tasks for the cell phone, or can be routed to a second group of service providers who compete on a dynamic basis for processing tasks from the cell phone.

FIG. 12 further expands on concepts of FIG. 11, e.g.,
(T) An arrangement wherein a user may (1) be charged by a vendor for a data processing service, or alternatively (2) may be provided the service free or even receive credit from the vendor if the user takes certain action in relation thereto.

(U) An arrangement in which a user receives a commercial benefit in exchange for being presented with promotional content—as sensed by a mobile device conveyed by the user.

(V) An arrangement in which a first user allows a second party to expend credits of the first user, or incur expenses to be borne by the first user, by reason of a social networking connection between the first user and the second party. Also, such an arrangement in which a social networking web page is a construct with which the second party must interact in expending such credits, or incurring such expenses.

(W) An arrangement for charitable fundraising, in which a user interacts with a physical object associated with a charitable cause, to trigger a computer-related process facilitating a user donation to a charity.

(X) An arrangement wherein visual query data is processed in distributed fashion between a user's mobile device and cloud resources, to generate a response, and wherein related information is archived in the cloud and processed so that subsequent visual query data can generate a more intuitive response.

The foregoing and many other features and advantages of the present technology will be further apparent from the following detailed description, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high level view of an embodiment incorporating aspects of the present technology.

FIG. 2 shows some of the applications that a user may request a camera-equipped cell phone to perform.

FIG. 3 identifies some of the commercial entities in an embodiment incorporating aspects of the present technology.

FIGS. 4, 4A and 4B conceptually illustrate how pixel data, and derivatives, is applied in different tasks, and packaged into packet form.

FIG. 5 shows how different tasks may have certain image processing operations in common.

FIG. 6 is a diagram illustrating how common image processing operations can be identified, and used to configure cell phone processing hardware to perform these operations.

FIG. 7 is a diagram showing how a cell phone can send certain pixel-related data across an internal bus for local processing, and send other pixel-related data across a communications channel for processing in the cloud.

FIG. 8 shows how the cloud processing in FIG. 7 allows tremendously more "intelligence" to be applied to a task desired by a user.

FIG. 9 details how key vector data is distributed to different external service providers, who perform services in exchange for compensation, which is handled in consolidated fashion for the user.

FIG. 10 shows an embodiment incorporating aspects of the present technology, noting how cell phone-based processing is suited for simple object identification tasks—such as template matching, whereas cloud-based processing is suited for complex tasks—such as data association.

FIG. 10A shows an embodiment incorporating aspects of the present technology, noting that the user experience is optimized by performing visual key vector processing as close to a sensor as possible, and administering traffic to the cloud as low in a communications stack as possible.

FIG. 11 illustrates that tasks referred for external processing can be routed to a first group of service providers who routinely perform certain tasks for the cell phone, or can be routed to a second group of service providers who compete on a dynamic basis for processing tasks from the cell phone.

FIG. 12 further expands on concepts of FIG. 11, e.g., showing how a bid filter and broadcast agent software module may oversee a reverse auction process.

FIG. 13 is a high level block diagram of a processing arrangement incorporating aspects of the present technology.

FIG. 14 is a high level block diagram of another processing arrangement incorporating aspects of the present technology.

FIG. 15 shows an illustrative range of image types that may be captured by a cell phone camera.

FIG. 16 shows a particular hardware implementation incorporating aspects of the present technology.

FIG. 17 illustrates aspects of a packet used in an exemplary embodiment.

FIG. 18 is a block diagram illustrating an implementation of the SIFT technique.

FIG. 19 is a block diagram illustrating, e.g., how packet header data can be changed during processing, through use of a memory.

FIG. 19A shows a prior art architecture from the robotic Player Project.

FIG. 19B shows how various factors can influence how different operations may be handled.

FIG. 20 shows an arrangement by which a cell phone camera and a cell phone projector share a lens.

FIG. 20A shows a reference platform architecture that can be used in embodiments of the present technology.

FIG. 21 shows an image of a desktop telephone captured by a cell phone camera.

FIG. 22 shows a collection of similar images found in a repository of public images, by reference to characteristics discerned from the image of FIG. 21.

FIGS. 23-28A, and 30-34 are flow diagrams detailing methods incorporating aspects of the present technology.

FIG. 29 is an arty shot of the Eiffel Tower, captured by a cell phone user.

FIG. 35 is another image captured by a cell phone user.

FIG. 36 is an image of an underside of a telephone, discovered using methods according to aspects of the present technology.

FIG. 37 shows part of the physical user interface of one style of cell phone.

FIGS. 37A and 38B illustrate different linking topologies.

FIG. 38 is an image captured by a cell phone user, depicting an Appalachian Trail trail marker.

FIGS. 39-43 detail methods incorporating aspects of the present technology.

FIG. 44 shows the user interface of one style of cell phone.

FIGS. 45A and 45B illustrate how different dimensions of commonality may be explored through use of a user interface control of a cell phone.

FIGS. 46A and 46B detail a particular method incorporating aspects of the present technology, by which keywords such as Prometheus and Paul Manship are automatically determined from a cell phone image.

FIG. 47 shows some of the different data sources that may be consulted in processing imagery according to aspects of the present technology.

FIGS. 48A, 48B and 49 show different processing methods according to aspects of the present technology.

FIG. 50 identifies some of the different processing that may be performed on image data, in accordance with aspects of the present technology.

FIG. 51 shows an illustrative tree structure that can be employed in accordance with certain aspects of the present technology.

FIG. 52 shows a network of wearable computers (e.g., cell phones) that can cooperate with each other, e.g., in a peer-to-peer network.

FIGS. 53-55 detail how a glossary of signs can be identified by a cell phone, and used to trigger different actions.

FIG. 56 illustrates aspects of prior art digital camera technology.

FIG. 57 details an embodiment incorporating aspects of the present technology.

FIG. 58 shows how a cell phone can be used to sense and display affine parameters.

FIG. 59 illustrates certain state machine aspects of the present technology.
FIG. 60 illustrates how even "still" imagery can include temporal, or motion, aspects.

FIG. 61 shows some metadata that may be involved in an implementation incorporating aspects of the present technology.

FIG. 62 shows an image that may be captured by a cell phone camera user.

FIGS. 63-66 detail how the image of FIG. 62 can be processed to convey semantic metadata.

FIG. 67 shows another image that may be captured by a cell phone camera user.

FIGS. 68 and 69 detail how the image of FIG. 67 can be processed to convey semantic metadata.

FIG. 70 shows an image that may be captured by a cell phone camera user.

FIG. 71 details how the image of FIG. 70 can be processed to convey semantic metadata.

FIG. 72 is a chart showing aspects of the human visual system.

FIG. 73 shows different low, mid and high frequency components of an image.

FIG. 74 shows a newspaper page.

FIG. 75 shows the layout of the FIG. 74 page, as set by layout software.

FIG. 76 details how user interaction with imagery captured from printed text may be enhanced.

FIG. 77 illustrates how semantic conveyance of metadata can have a progressive aspect, akin to JPEG2000 and the like.

DETAILED DESCRIPTION

The present specification details a collection of interrelated work, including a great variety of technologies. A few of these include image processing architectures for cell phones, cloud-based computing, reverse auction-based delivery of services, metadata processing, image conveyance of semantic information, etc., etc., etc. Each portion of the specification details technology that desirably incorporates technical features detailed in other portions. Thus, it is difficult to identify "a beginning" from which this disclosure should logically begin. That said, we simply dive in.

Mobile Device Object Recognition and Interaction Using Distributed Network Services

There is presently a huge disconnect between the unfathomable volume of information that is contained in high quality image data streaming from a mobile device camera (e.g., in a cell phone), and the ability of that mobile device to process this data to whatever end. "Off device" processing of visual data can help handle this fire hose of data, especially when a multitude of visual processing tasks may be desired. These issues become even more critical once "real time object recognition and interaction" is contemplated, where a user of the mobile device expects virtually instantaneous results and augmented reality graphic feedback on the mobile device screen, as that user points the camera at a scene or object.

In accordance with one aspect of the present technology, a distributed network of pixel processing engines serves such mobile device users and meets most qualitative "human real-time interactivity" requirements, generally with feedback in much less than one second. Implementation desirably provides certain basic features on the mobile device, including a rather intimate relationship between the image sensor's output pixels and the native communications channel available to the mobile device. Certain levels of basic "content filtering" and classification of the pixel data on the local device, followed by attaching routing instructions to the pixel data as specified by the user's intentions and subscriptions, leads to an interactive session between a mobile device and one or more "cloud based" pixel processing services. The key word "session" further indicates fast responses transmitted back to the mobile device, where for some services marketed as "real time" or "interactive," a session essentially represents a duplex, generally packet-based, communication, where several outgoing "pixel packets" and several incoming response packets (which may be pixel packets updated with the processed data) may occur every second.

Business factors and good old competition are at the heart of the distributed network. Users can subscribe to or otherwise tap into any external services they choose. The local device itself and/or the carrier service provider to that device can be configured as the user chooses, routing filtered and pertinent pixel data to specified object interaction services. Billing mechanisms for such services can directly plug into existing cell and/or mobile device billing networks, wherein users get billed and service providers get paid.

But let's back up a bit. The addition of camera systems to mobile devices has ignited an explosion of applications. The primordial application certainly must be folks simply snapping quick visual aspects of their environment and sharing such pictures with friends and family.

The fanning out of applications from that starting point arguably hinges on a set of core plumbing features inherent in mobile cameras. In short (and non-exhaustive of course), such features include: a) higher quality pixel capture and low level processing; b) better local device CPU and GPU resources for on-device pixel processing with subsequent user feedback; c) structured connectivity into "the cloud"; and importantly, d) a maturing traffic monitoring and billing infrastructure. FIG. 1 is but one graphic perspective on some of these plumbing features of what might be called a visually intelligent network. (Conventional details of a cell phone, such as the microphone, A/D converter, modulation and demodulation systems, IF stages, cellular transceiver, etc., are not shown for clarity of illustration.)

It is all well and good to get better CPUs and GPUs, and more memory, on mobile devices. However, cost, weight and power considerations seem to favor getting "the cloud" to do as much of the "intelligence" heavy lifting as possible. Relatedly, it seems that there should be a common denominator set of "device-side" operations performed on visual data that will serve all cloud processes, including certain formatting, elemental graphic processing, and other rote operations. Similarly, it seems there should be a standardized basic header and addressing scheme for the resulting communication traffic (typically packetized) back and forth with the cloud.

This conceptualization is akin to the human visual system. The eye performs baseline operations, such as chromaticity groupings, and it optimizes necessary information for transmission along the optic nerve to the brain. The brain does the real cognitive work.
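The division of labor in this analogy, rote and service-neutral operations near the sensor with cognition deferred to the cloud, can be sketched in a few lines. This is a toy illustration only; the stage names and the particular operations are assumptions made for the sketch, not details from the specification.

```python
# Toy sketch of the "common denominator" device-side stage suggested by the
# eye/brain analogy: rote, service-neutral operations run near the sensor,
# and heavyweight cognition is deferred to the cloud. All function names and
# operations here are illustrative assumptions.

def format_frame(raw):
    """Rote formatting: clamp sensor values into 8-bit range."""
    return [min(255, max(0, v)) for v in raw]

def elemental_ops(frame):
    """Elemental graphic processing, e.g., a trivial 2-tap smoothing filter."""
    return [(a + b) // 2 for a, b in zip(frame, frame[1:] + frame[-1:])]

def device_side(raw):
    """Everything the 'eye' does before traffic heads up the 'optic nerve'."""
    return elemental_ops(format_frame(raw))

def cloud_side(prepared):
    """Stand-in for the 'brain': the real cognitive work happens remotely."""
    return {"mean_level": sum(prepared) / len(prepared)}

print(cloud_side(device_side([12, 300, -5, 40])))  # {'mean_level': 80.0}
```

The point of the split is that `device_side` is identical no matter which cloud service ultimately consumes the traffic, so it can be burned into hardware close to the sensor.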
And there's feedback the other way too—with the brain sending information controlling muscle movement: where to point eyes, scanning lines of a book, controlling the iris (lighting), etc.

FIG. 2 depicts a non-exhaustive but illustrative list of visual processing applications for mobile devices. Again, it is hard not to see analogies between this list and the fundamentals of how the human visual system and the human brain operate. It is a well studied academic area that deals with how "optimized" the human visual system is relative to any given object recognition task, where a general consensus is that the eye-retina-optic nerve-cortex system is pretty darn wonderful in how efficiently it serves a vast array of cognitive demands. This aspect of the technology relates to how similarly efficient and broadly enabling elements can be built into mobile devices, mobile device connections and network services, all with the goal of serving the applications depicted in FIG. 2, as well as those new applications which may show up as the technology dance continues.

Perhaps the central difference between the human analogy and mobile device networks must surely revolve around the basic concept of "the marketplace," where buyers buy better and better things so long as businesses know how to profit accordingly. Any technology which aims to serve the applications listed in FIG. 2 must necessarily assume that hundreds if not thousands of business entities will be developing the nitty gritty details of specific commercial offerings, with the expectation of one way or another profiting from those offerings. Yes, a few behemoths will dominate main lines of cash flows in the overall mobile industry, but an equal certainty will be that niche players will be continually developing niche applications and services. Thus, this disclosure describes how a marketplace for visual processing services can develop, whereby business interests across the spectrum have something to gain. FIG. 3 attempts a crude categorization of some of the business interests applicable to the global business ecosystem operative in the era of this filing.

FIG. 4 sprints toward the abstract in the introduction of the technology aspect now being considered. Here we find a highly abstracted bit of information derived from some batch of photons that impinged on some form of electronic image sensor, with a universe of waiting consumers of that lowly bit. FIG. 4A then quickly introduces the intuitively well-known concept that singular bits of visual information aren't worth much outside of their role in both spatial and temporal groupings. This core concept is well exploited in modern video compression standards such as MPEG4 and H.264.

The "visual" character of the bits may be pretty far removed from the visual domain by certain of the processing (consider, e.g., the vector strings representing eigenface data). Thus, we sometimes use the term "key vector data" (or "key vector strings") to refer collectively to raw sensor/stimulus data (e.g., pixel data), and/or to processed information and associated derivatives. A key vector may take the form of a container in which such information is conveyed (e.g., a data structure such as a packet). A tag or other data can be included to identify the type of information (e.g., JPEG image data, eigenface data), or the data type may be otherwise evident from the data or from context. One or more instructions, or operations, may be associated with key vector data—either expressly detailed in the key vector, or implied. An operation may be implied in default fashion, for key vector data of certain types (e.g., for JPEG data it may be "store the image"; for eigenface data it may be "match this eigenface template"). Or an implied operation may be dependent on context.

FIGS. 4A and 4B also introduce a central player in this disclosure: the packaged and address-labeled pixel packet, into a body of which key vector data is inserted. The key vector data may be a single patch, or a collection of patches, or a time-series of patches/collections. A pixel packet may be less than a kilobyte, or its size can be much much larger. It may convey information about an isolated patch of pixels excerpted from a larger image, or it may convey a massive Photosynth of Notre Dame cathedral.

(As presently conceived, a pixel packet is an application layer construct. When actually pushed around a network, however, it may be broken into smaller portions—as transport layer constraints in a network may require.)

FIG. 5 is a segue diagram—still at an abstract level, but pointing toward the concrete. A list of user-defined applications, such as illustrated in FIG. 2, will map to a state-of-the-art inventory of pixel processing methods and approaches which can accomplish each and every application. These pixel processing methods break down into common and not-so-common component sub-tasks. Object recognition textbooks are filled with a wide variety of approaches and terminologies which bring a sense of order into what at first glance might appear to be a bewildering array of "unique requirements" relative to the applications shown in FIG. 2. (In addition, multiple computer vision and image processing libraries, such as OpenCV and CMVision—discussed below—have been created that identify and render functional operations, which can be considered "atomic" functions within object recognition paradigms.) But FIG. 5 attempts to show that there are indeed a set of common steps and processes shared between visual processing applications. The differently shaded pie slices attempt to illustrate that certain pixel operations are of a specific class and may simply have differences in low level variables or optimizations. The size of the overall pie (thought of in a logarithmic sense, where a pie twice the size of another may represent 10 times more Flops, for example), and the percentage size of the slice, represent degrees of commonality.

FIG. 6 takes a major step toward the concrete, sacrificing simplicity in the process. Here we see a top portion labeled "Resident Call-Up Visual Processing Services," which represents all of the possible list of applications from FIG. 2 that a given mobile device may be aware of, or downright enabled to perform. The idea is that not all of these applications have to be active all of the time, and hence some sub-set of services is actually "turned on" at any given moment. The turned-on applications, as a one-time configuration activity, negotiate to identify their common component tasks, labeled the "Common Processes Sorter"—first generating an overall common list of pixel processing routines available for on-device processing, chosen from a library of these elemental image processing routines (e.g., FFT, filtering, edge detection, resampling, color histogramming, log-polar transform, etc.). Generation of corresponding Flow Gate Configuration/Software Programming information follows, which literally loads library elements into properly ordered places in a field programmable gate array set-up, or otherwise configures a suitable processor to perform the required component tasks.

FIG. 6 also includes depictions of the image sensor, followed by a universal pixel segmenter. This pixel segmenter breaks down the massive stream of imagery from the sensor into manageable spatial and/or temporal blobs (e.g., akin to MPEG macroblocks, wavelet transform blocks, 64x64 pixel blocks, etc.). After the torrent of pixels has been broken down into chewable chunks, they are fed into the newly programmed gate array (or other hardware), which performs the elemental image processing tasks associated with the selected applications. (Such arrangements are further detailed below, in an exemplary system employing "pixel packets.")
Various output products are sent to a routing engine, which refers the elementally-processed data (e.g., key vector data) to other resources (internal and/or external) for further processing. This further processing typically is more complex than that already performed. Examples include making associations, deriving inferences, pattern and template matching, etc. This further processing can be highly application-specific.

(Consider a promotional game from Pepsi, inviting the public to participate in a treasure hunt in a state park. Based on internet-distributed clues, people try to find a hidden six-pack of soda to earn a $500 prize. Participants must download a special application from the Pepsi-dot-com web site (or the Apple AppStore), which serves to distribute the clues (which may also be published to Twitter). The downloaded application also has a prize verification component, which processes image data captured by the users' cell phones to identify a special pattern with which the hidden six-pack is uniquely marked. SIFT object recognition is used (discussed below), with the SIFT feature descriptors for the special package conveyed with the downloaded application. When an image match is found, the cell phone immediately reports same wirelessly to Pepsi. The winner is the user whose cell phone first reports detection of the specially-marked six-pack. In the FIG. 6 arrangement, some of the component tasks in the SIFT pattern matching operation are performed by the elemental image processing in the configured hardware; others are referred for more specialized processing—either internal or external.)

FIG. 7 up-levels the picture to a generic distributed pixel services network view, where local device pixel services and "cloud based" pixel services have a kind of symmetry in how they operate. The router in FIG. 7 takes care of how any given packaged pixel packet gets sent to the appropriate pixel processing location, whether local or remote (with the style of fill pattern denoting different component processing functions; only a few of the processing functions required by the enabled visual processing services are depicted). Some of the data shipped to cloud-based pixel services may have been first processed by local device pixel services. The circles indicate that the routing functionality may have components in the cloud—nodes that serve to distribute tasks to active service providers, and collect results for transmission back to the device. In some implementations these functions may be performed at the edge of the wireless network, e.g., by modules at wireless service towers, so as to ensure the fastest action. Results collected from the active external service providers, and the active local processing stages, are fed back to Pixel Service Manager software, which then interacts with the device user interface.

FIG. 8 is an expanded view of the lower right portion of FIG. 7 and represents the moment where Dorothy's shoes turn red—and why distributed pixel services provided by the cloud, as opposed to the local device, will probably trump all but the most mundane object recognition tasks.

Object recognition in its richer form is based on visual association rather than strict template matching rules. If we all were taught that the capital letter "A" will always be strictly following some pre-historic form never to change, a universal template image if you will, then pretty clean and locally prescriptive methods can be placed into a mobile imaging device in order to get it to reliably read a capital A any time that ordained form "A" is presented to the camera.
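Strict, locally prescriptive template matching of this kind reduces to a few lines of code. The 3x3 bitmap "font" and the function names below are toy assumptions for the sketch, not anything from the specification:

```python
# Minimal sketch of strict template matching: a glyph is recognized only if
# it exactly matches one ordained reference form. The 3x3 bitmap "font" is a
# toy illustration.

TEMPLATES = {
    "A": ((0, 1, 0),
          (1, 1, 1),
          (1, 0, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}

def read_glyph(patch):
    """Return the letter whose template exactly matches the patch, else None."""
    for letter, template in TEMPLATES.items():
        if patch == template:
            return letter
    return None  # no association, no inference: strict matching only

print(read_glyph(((0, 1, 0), (1, 1, 1), (1, 0, 1))))  # A
print(read_glyph(((1, 1, 1), (1, 1, 1), (1, 1, 1))))  # None
```

A single deviant pixel defeats such a matcher, which is exactly the fragility that motivates referring association-style recognition to the cloud.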
2D and even 3D barcodes in many ways follow this template-like approach to object recognition, where for contained applications involving such objects, local processing services can largely get the job done. But even in the barcode example, flexibility in the growth and evolution of overt visual coding targets begs for an architecture which doesn't force "code upgrades" to a gazillion devices every time there is some advance in the overt symbology art.

At the other end of the spectrum, arbitrarily complex tasks can be imagined, e.g., referring to a network of supercomputers the task of predicting the apocryphal typhoon resulting from the fluttering of a butterfly's wings halfway around the world—if the application requires it. Oz beckons.

FIG. 8 attempts to illustrate this radical extra dimensionality of pixel processing in the cloud as opposed to the local device. This virtually goes without saying (or without a picture), but FIG. 8 is also a segue figure to FIG. 9, where Dorothy gets back to Kansas and is happy about it.

FIG. 9 is all about cash, cash flow, and happy humans using cameras on their mobile devices and getting highly meaningful results back from their visual queries, all the while paying one monthly bill. It turns out the Google AdWords auction genie is out of the bottle. Behind the scenes of the moment-by-moment visual scans from a mobile user of their immediate visual environment are hundreds and thousands of micro-decisions, pixel routings, results comparisons and micro-auctioned channels back to the mobile device user for the hard good they are "truly" looking for, whether they know it or not. This last point is deliberately cheeky, in that searching of any kind is inherently open ended and magical at some level, and part of the fun of searching in the first place is that surprisingly new associations are part of the results. The search user knows after the fact what they were truly looking for. The system, represented in FIG. 9 as the carrier-based financial tracking server, now sees the addition of our networked pixel services module and its role in facilitating pertinent results being sent back to a user, all the while monitoring the uses of the services in order to populate the monthly bill and send the proceeds to the proper entities.

(As detailed further elsewhere, the money flow may not exclusively be to remote service providers. Other money flows can arise, such as to users or other parties, e.g., to induce or reward certain actions.)

FIG. 10 focuses on functional division of processing—illustrating how tasks in the nature of template matching can be performed on the cell phone itself, whereas more sophisticated tasks (in the nature of data association) desirably are referred to the cloud for processing.

Elements of the foregoing are distilled in FIG. 10A, showing an implementation of aspects of the technology as a physical matter of (usually) software components. The two ovals in the figure highlight the symmetric pair of software components which are involved in setting up a "human real-time" visual recognition session between a mobile device and the generic cloud or service providers, data associations and visual query results. The oval on the left refers to "keyvectors," and more specifically "visual key vectors." As noted, this term can encompass everything from simple JPEG compressed blocks all the way through log-polar transformed facial feature vectors—and anything in between and beyond. The point of a key vector is that the essential raw information of some given visual recognition task has been optimally pre-processed and packaged (possibly compressed). The oval on the left assembles these packets, and typically inserts some addressing information by which they will be routed. (Final addressing may not be possible, as the packet may ultimately be routed to remote service providers—the details of which may not yet be known.) Desirably, this processing is performed as close to the raw sensor data as possible, such as by processing circuitry integrated on the same substrate as the image sensor, which is responsive to software instructions stored in memory or provided from another stage in packet form.

The oval on the right administers the remote processing of key vector data, e.g., attending to arranging appropriate services, directing traffic flow, etc. Desirably, this software process is implemented as low down on a communications stack as possible, generally on a "cloud side" device, access point, or cell tower. (When real-time visual key vector packets stream over a communications channel, the lower down in the communications stack they are identified and routed, the smoother the "human real-time" look and feel a given visual recognition task will be.) Remaining high level processing needed to support this arrangement is included in FIG. 10A for context, and can generally be performed through native mobile and remote hardware capabilities.

FIGS. 11 and 12 illustrate the concept that some providers of some cloud-based pixel processing services may be established in advance, in a pseudo-static fashion, whereas other providers may periodically vie for the privilege of processing a user's key vector data, through participation in a reverse auction. In many implementations, these latter providers compete each time a packet is available for processing.

Consider a user who snaps a cell phone picture of an unfamiliar car, wanting to learn the make and model. Various service providers may compete for this business. A startup vendor may offer to perform recognition for free—to build its brand or collect data. Imagery submitted to this service returns information simply indicating the car's make and model. Consumer Reports may offer an alternative service—which provides make and model data, but also provides technical specifications for the car. However, they may charge 2 cents for the service (or the cost may be bandwidth based, e.g., 1 cent per megapixel). Edmunds, or JD Powers, may offer still another service, which provides data like Consumer Reports, but pays the user for the privilege of providing data. In exchange, the vendor is given the right to have one of its partners send a text message to the user promoting goods or services. The payment may take the form of a credit on the user's monthly cell phone voice/data service billing.

Using criteria specified by the user, stored preferences, context, and other rules/heuristics, a query router and response manager (in the cell phone, in the cloud, distributed, etc.) determines whether the packet of data needing processing should be handled by one of the service providers in the […]

In some circumstances, one input to the query router and response manager may be the user's location, so that a different service provider may be selected when the user is at home in Oregon, than when she is vacationing in Mexico. In other instances, the required turnaround time is specified, which may disqualify some vendors, and make others more competitive. In some instances the query router and response manager need not decide at all, e.g., if cached results identifying a service provider selected in a previous auction are still available and not beyond a "freshness" threshold.

Pricing offered by the vendors may change with processing load, bandwidth, time of day, and other considerations. In some embodiments the providers may be informed of offers submitted by competitors (using known trust arrangements assuring data integrity), and given the opportunity to make their offers more enticing. Such a bidding war may continue until no bidder is willing to change the offered terms. The query router and response manager (or in some implementations, the user) then makes a selection.

For expository convenience and visual clarity, FIG. 12 shows a software module labeled "Bid Filter and Broadcast Agent." In most implementations this forms part of the query router and response manager module. The bid filter module decides which vendors—from a universe of possible vendors—should be given a chance to bid on a processing task. (The user's preference data, or historical experience, may indicate that certain service providers be disqualified.) The broadcast agent module then communicates with the selected bidders to inform them of a user task for processing, and provides information needed for them to make a bid.

Desirably, the bid filter and broadcast agent do at least some of their work in advance of data being available for processing. That is, as soon as a prediction can be made as to an operation that the user may likely soon request, these modules start working to identify a provider to perform a service expected to be required. A few hundred milliseconds later the user's key vector data may actually be available for processing (if the prediction turns out to be accurate).

Sometimes, as with Google's present AdWords system, the service providers are not consulted at each user transaction. Instead, each provides bidding parameters, which are stored and consulted whenever a transaction is considered, to determine which service provider wins. These stored parameters may be updated occasionally.
In some implementations the stable of static standbys, or whether it should be offered to service provider pushes updated parameters to the bid filter providers on an auction basis—in which case it arbitrates the and broadcast agent whenever available. (The bid filter and outcome of the auction. broadcast agent may serve a large population of users, such as The static standby service providers may be identified all Verizon subscribers in area code 503, or all subscribers to when the phone is initially programmed, and only reconfig 50 an ISP in a community, or all users at the domain well-dot ured when the phone is reprogrammed. (For example, Verizon com, etc.; or more localized agents may be employed. Such as may specify that all FFT operations on its phones be routed to one for each cell phone tower.) a server that it provides for this purpose.) Or, the user may be If there is a lull in traffic, a service provider may discount its able to periodically identify preferred providers for certain services for the next minute. The service provider may thus tasks, as through a configuration menu, or specify that certain 55 transmit (or post) a message stating that it will perform eigen tasks should be referred for auction. Some applications may vector extraction on an image file of up to 10 megabytes for 2 emerge where static service providers are favored; the task cents until 1244754176 Coordinated Universal Time in the may be so mundane, or one provider's services may be so Unix epoch, after which time the price will return to 3 cents. un-paralleled, that competition for the provision of services The bid filter and broadcast agent updates a table with stored isn't warranted. 60 bidding parameters accordingly. In the case of services referred to auction, Some users may (Information about the Google reverse auction, used to exalt price above all other considerations. Others may insist place sponsored advertising on web search results page, is on domestic data processing. 
Others may want to Stick to reproduced at the end of the provisional specification to service providers that meet “green.” “ethical or other stan which this application claims priority. This information was dards of corporate practice. Others may prefer richer data 65 published in Wired magazine on May 22, 2009, in an article output. Weightings of different criteria can be applied by the by Stephen Levy, entitled “Secret of Googlenomics: Data query router and response manager in making the decision. Fueled Recipe Brews Profitability.”) US 8,768,313 B2 15 16 In other implementations, the broadcast agent polls the In another example, a consumer may be rewarded for bidders—communicating relevant parameters, and soliciting accepting commercials, or commercial impressions, from a bid responses whenever a transaction is offered for process company. If a consumer goes into the Pepsi Center in Denver, ing. she may receive a reward for each Pepsi-branded experience Once a prevailing bidder is decided, and data is available she encounters. The amount of micropayment may scale with for processing, the broadcast agent transmits the key vector the amount of time that she interacts with the different Pepsi data (and other parameters as may be appropriate to a par branded objects (including audio and imagery) in the venue. ticular task) to the winning bidder. The bidder then performs Notjust large brand owners can provide credits to individu the requested operation, and returns the processed data to the als. Credits can be routed to friends and social/business 10 acquaintances. To illustrate, a user of Facebook may share query router and response manager. This module logs the credit (redeemable for goods/services, or exchangeable for processed data, and attends to any necessary accounting (e.g., cash) from his Facebook page—enticing others to visit, or crediting the service provider with the appropriate fee). The linger. 
In some cases, the credit can be made available only to response data is then forwarded back to the user device. people who navigate to the Facebook page in a certain man In a variant arrangement, one or more of the competing 15 ner—such as by linking to the page from the user's business service providers actually performs some or all of the card, or from another launch page. requested processing, but "teases the user (or the query As another example, consider a Facebook user who has router and response manager) by presenting only partial earned, or paid for, or otherwise received credit that can be results. With a taste of what’s available, the user (or the query applied to certain services—such as for downloading Songs router and response manager) may be induced to make a from iTunes, or for music recognition services, or for identi different choice than relevant criteria/heuristics would other fying clothes that go with particular shoes (for which an wise indicate. image has been Submitted), etc. These services may be asso The function calls sent to external service providers, of ciated with the particular Facebook page, so that friends can course, do not have to provide the ultimate result sought by a invoke the services from that page—essentially spending the consumer (e.g., identifying a car, or translating a menu listing 25 host’s credit (again, with Suitable authorization or invitation from French to English). They can be component operations, by that hosting user). Likewise, friends may Submit images to such as calculating an FFT, or performing a SIFT procedure a facial recognition service accessible through an application or a log-polar transform, or computing a histogram or eigen associated with the user's Facebook page. Images Submitted vectors, or identifying edges, etc. 
in such fashion are analyzed for faces of the hosts friends, In time, it is expected that a rich ecosystem of expert 30 and identification information is returned to the submitter, processors will emerge—serving myriad processing requests e.g., through a user interface presented on the originating from cell phones and other thin client devices. Facebook page. Again, the host may be assessed a fee for each More on Monetary Flow Such operation, but may allow authorized friends to avail Additional business models can be enabled, involving the themselves of Such service at no cost. subsidization of consumed remote services by the service 35 Credits, and payments, can also be routed to charities. A providers themselves in exchange for user information (e.g., viewer exiting a theatre after a particularly poignant movie for audience measurement), or in exchange for action taken about poverty in Bangladesh may capture an image of an by the user, such as completing a Survey, visiting specific associated movie poster, which serves as a portal for dona sites, locations in store, etc. tions for a charity that serves the poor in Bangladesh. Upon Services may be subsidized by third parties as well, such a 40 recognizing the movie poster, the cell phone can present a coffee shop that derives value by providing a differentiating graphical/touch user interface through which the user spins service to its customers in the form of free/discounted usage dials to specify an amount of a charitable donation, which at of remote services while they are seated in the shop. the conclusion of the transaction is transferred from a finan In one arrangement an economy is enabled wherein a cur cial account associated with the user, to one associated with rency of remote processing credits is created and exchanged 45 the charity. between users and remote service providers. 
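The provider-selection logic described above (weighted criteria such as price, turnaround time and "green" standards, with cached auction winners honored while still within a freshness threshold) can be sketched as follows. This is a hypothetical illustration, not code from the patent; the class, the weight names, and the scoring function are all assumptions.

```python
# Hypothetical sketch of the query router / response manager's bid
# selection. A cached winner from a previous auction is reused while
# "fresh"; otherwise bids are filtered by required turnaround and
# scored with user-specified weights. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Bid:
    provider: str
    price_cents: float      # offered price for the task
    turnaround_ms: float    # promised turnaround time
    green_rating: float     # 0.0-1.0 corporate-practice score

def select_provider(bids, weights, max_turnaround_ms=None,
                    cached=None, cache_age_s=None, freshness_s=60.0):
    """Return the winning provider name, or None if no bid qualifies."""
    # Reuse a cached auction result if it is still within the
    # "freshness" threshold, so no new decision is needed.
    if cached is not None and cache_age_s is not None and cache_age_s < freshness_s:
        return cached
    # Disqualify vendors that cannot meet a specified turnaround time.
    eligible = [b for b in bids
                if max_turnaround_ms is None or b.turnaround_ms <= max_turnaround_ms]
    if not eligible:
        return None
    # Lower price and turnaround are better; a higher green rating is better.
    def score(b):
        return (weights.get("price", 0.0) * -b.price_cents
                + weights.get("speed", 0.0) * -b.turnaround_ms
                + weights.get("green", 0.0) * b.green_rating)
    return max(eligible, key=score).provider
```

A user who "exalts price above all other considerations" would simply supply a weight vector containing only the price term; a latency-sensitive task would pass `max_turnaround_ms` to disqualify slow vendors outright.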
This may be entirely transparent to the user and managed as part of a service plan, e.g., with the user's cell phone or data service provider. Or it can be exposed as a very explicit aspect of the present technology. Service providers and others may award credits to users for taking actions or being part of a frequent user program, to build allegiance with specific providers.

As with other currencies, users may choose to explicitly donate, save, exchange or generally barter credits as needed.

Considering these points in further detail, a service may pay a user for opting-in to an audience measurement panel. E.g., The Nielsen Company may provide services to the public, such as identification of television programming from audio or video samples submitted by consumers. These services may be provided free to consumers who agree to share some of their media consumption data with Nielsen (such as by serving as an anonymous member of a city's audience measurement panel), and provided on a fee basis to others. Nielsen may offer, for example, 100 units of credit (micropayments or other value) to participating consumers each month, or may provide credit each time the user submits information to Nielsen.

More on a Particular Hardware Arrangement

As noted above and in the cited patent documents, there is a need for generic object recognition by a mobile device. Some approaches to specialized object recognition have emerged, and these have given rise to specific data processing approaches. However, no architecture has been proposed that goes beyond specialized object recognition toward generic object recognition.

Visually, a generic object recognition arrangement requires access to good raw visual data, preferably free of device quirks, scene quirks, user quirks, etc. Developers of systems built around object identification will best prosper and serve their users by concentrating on the object identification task at hand, and not the myriad existing roadblocks, resource sinks, and third party dependencies that currently must be confronted.

As noted, virtually all object identification techniques can make use of, or even rely upon, a pipe to "the cloud." "Cloud" can include anything external to the cell phone. An example is a nearby cell phone, or plural phones on a distributed network. Unused processing power on such other phone devices can be made available for hire (or for free) to call upon as needed. The cell phones of the implementations detailed herein can scavenge processing power from such other cell phones.

Such a cloud may be ad hoc, e.g., other cell phones within Bluetooth range of the user's phone. The ad hoc network can be extended by having such other phones also extend the local cloud to further phones that they can reach by Bluetooth, but the user cannot.

The "cloud" can also comprise other computational platforms, such as set-top boxes; processors in automobiles, thermostats, HVAC systems, wireless routers, local cell phone towers and other wireless network edges (including the processing hardware for their software-defined radio equipment), etc. Such processors can be used in conjunction with more traditional cloud computing resources, as are offered by Google, Amazon, etc.

(In view of concerns of certain users about privacy, the phone desirably has a user-configurable option indicating whether the phone can refer data to cloud resources for processing. In one arrangement, this option has a default value of "No," limiting functionality and impairing battery life, but also limiting privacy concerns. In another arrangement, this option has a default value of "Yes.")

Desirably, image-responsive techniques should produce a short-term "result or answer," which generally requires some level of interactivity with a user, hopefully measured in fractions of a second for truly interactive applications, or a few seconds or fractions of a minute for nearer-term "I'm patient to wait" applications.

As for the objects in question, they can break down into various categories, including (1) generic passive (clues to basic searches), (2) geographic passive (at least you know where you are, and may hook into geographic-specific resources), (3) "cloud supported" passive, as with identified/enumerated objects and their associated sites, and (4) active/controllable, a la ThingPipe (a reference to technology detailed in application Ser. No. 12/498,709, such as WiFi-equipped thermostats and parking meters).

An object recognition platform should not, it seems, be conceived in the classic "local device and local resources only" software mentality. However, it may be conceived as a local device optimization problem. That is, the software on the local device, and its processing hardware, should be designed in contemplation of their interaction with off-device software and hardware. Ditto the balance and interplay of control functionality, pixel crunching functionality, and application software/GUI provided on the device, versus off the device. (In many implementations, certain databases useful for object identification/recognition will reside remote from the device.)

In a particularly preferred arrangement, such a processing platform employs image processing near the sensor, optimally on the same chip, with at least some processing tasks desirably performed by dedicated, special purpose hardware.

Consider FIG. 13, which shows an architecture of a cell phone 10 in which an image sensor 12 feeds two processing paths. One, 13, is tailored for the human visual system, and includes processing such as JPEG compression. Another, 14, is tailored for object recognition. As discussed, some of this processing may be performed by the mobile device, while other processing may be referred to the cloud 16.

FIG. 14 takes an application-centric view of the object recognition processing path. Some applications reside wholly in the cell phone. Other applications reside wholly outside the cell phone, e.g., simply taking keyvector data as stimulus. More common are hybrids, such as where some processing is done in the cell phone, other processing is done externally, and the application software orchestrating the process resides in the cell phone.

To illustrate further discussion, FIG. 15 shows a range 40 of some of the different types of images 41-46 that may be captured by a particular user's cell phone. A few brief (and incomplete) comments about some of the processing that may be applied to each image are provided in the following paragraphs.

Image 41 depicts a thermostat. A steganographic digital watermark 47 is textured or printed on the thermostat's case. (The watermark is shown as visible in FIG. 15, but is typically imperceptible to the viewer.) The watermark conveys information intended for the cell phone, allowing it to present a graphic user interface by which the user can interact with the thermostat. A bar code or other data carrier can alternatively be used. Such technology is further detailed below, and in patent application Ser. No. 12/498,709, filed Apr. 14, 2009.

Image 42 depicts an item including a barcode 48. This barcode conveys Universal Product Code (UPC) data. Other barcodes may convey other information. The barcode payload is not primarily intended for reading by a user cell phone (in contrast to watermark 47), but it nonetheless may be used by the cell phone to help determine an appropriate response for the user.

Image 43 shows a product that may be identified without reference to any express machine-readable information (such as a bar code or watermark). A segmentation algorithm may be applied to edge-detected image data to distinguish the apparent image subject from the apparent background. The image subject may be identified through its shape, color and texture. Image fingerprinting may be used to identify reference images having similar labels, and metadata associated with those other images may be harvested. SIFT techniques (discussed below) may be employed for such pattern-based recognition tasks. Specular reflections in low texture regions may tend to indicate the image subject is made of glass. Optical character recognition can be applied for further information (reading the visible text). All of these clues can be employed to identify the depicted item, and help determine an appropriate response for the user.

Additionally (or alternatively), similar-image search systems, such as Google Similar Images, and Microsoft Live Search, can be employed to find similar images, and their metadata can then be harvested. (As of this writing, these services do not directly support upload of a user picture to find similar web pictures. However, the user can post the image to Flickr (using Flickr's cell phone upload functionality), and it will soon be found and processed by Google and Microsoft.)

Image 44 is a snapshot of friends. Facial detection and recognition may be employed (i.e., to indicate that there are faces in the image, and to identify particular faces and annotate the image with metadata accordingly, e.g., by reference to user-associated data maintained by Apple's iPhoto service, Google's Picasa service, Facebook, etc.). Some facial recognition applications can be trained for non-human faces, e.g., cats, dogs, animated characters including avatars, etc. Geolocation and date/time information from the cell phone may also provide useful information.

The persons wearing sunglasses pose a challenge for some facial recognition algorithms. Identification of those individuals may be aided by their association with persons whose identities can more easily be determined (e.g., by conventional facial recognition). That is, by identifying other group pictures in iPhoto/Picasa/Facebook/etc. that include one or more of the latter individuals, the other individuals depicted in such photographs may also be present in the subject image. These candidate persons form a much smaller universe of possibilities than is normally provided by unbounded iPhoto/Picasa/Facebook/etc. data. The facial vectors discernable from the sunglass-wearing faces in the subject image can then be compared against this smaller universe of possibilities in order to determine a best match. If, in the usual case of recognizing a face, a score of 90 is required to be considered a match (out of an arbitrary top match score of 100), in searching such a group-constrained set of images a score of 70 or 80 might suffice. (Where, as in image 44, there are two persons depicted without sunglasses, the occurrence of both of these individuals in a photo with one or more other individuals may increase its relevance to such an analysis, implemented, e.g., by increasing a weighting factor in a matching algorithm.)

Image 45 shows part of the statue of Prometheus in Rockefeller Center, NY. Its identification can follow teachings detailed elsewhere in this specification.

Image 46 is a landscape, depicting the Maroon Bells mountain range in Colorado. This image subject may be recognized by reference to geolocation data from the cell phone, in conjunction with geographic information services such as GeoNames or Yahoo!'s GeoPlanet.

(It should be understood that techniques noted above in connection with processing of one of the images 41-46 in FIG. 15 can likewise be applied to others of the images. Moreover, it should be understood that while in some respects the depicted images are ordered according to ease of identifying the subject and formulating a response, in other respects they are not. For example, although landscape image 46 is depicted to the far right, its geolocation data is strongly correlated with the metadata "Maroon Bells." Thus, this particular image presents an easier case than that presented by many other images.)

In one embodiment, such processing of imagery occurs automatically, without express user instruction each time. Subject to network connectivity and power constraints, information can be gleaned continuously from such processing, and may be used in processing subsequently-captured images. For example, an earlier image in a sequence that includes photograph 44 may show members of the depicted group without sunglasses, simplifying identification of the persons later depicted with sunglasses.

FIG. 16, Etc., Implementation

FIG. 16 gets into the nitty-gritty of a particular implementation incorporating certain of the features earlier discussed. (The other discussed features can be implemented by the artisan within this architecture, based on the provided disclosure.)

author the field 55 to specify that the sensor is to sum sensor charges to reduce resolution (e.g., producing a frame of 640x480 data from a sensor capable of 1280x960), output data only from red-filtered sensor cells, output data only from a horizontal line of cells across the middle of the sensor, output data only from a 128x128 patch of cells in the center of the pixel array, etc. The camera instruction field 55 may further specify the exact time that the camera is to capture data, so as to allow, e.g., desired synchronization with ambient lighting (as detailed later).

Each packet 57 issued by setup module 34 may include different camera parameters in the first header field 55. Thus, a first packet may cause camera 32 to capture a full frame image with an exposure time of 1 millisecond. A next packet may cause the camera to capture a full frame image with an exposure time of 10 milliseconds, and a third may dictate an exposure time of 100 milliseconds. (Such frames may later be processed in combination to yield a high dynamic range image.) A fourth packet may instruct the camera to down-sample data from the image sensor, and combine signals from differently color-filtered sensor cells, so as to output a 4x3 array of grayscale luminance values. A fifth packet may instruct the camera to output data only from an 8x8 patch of pixels at the center of the frame. A sixth packet may instruct the camera to output only five lines of image data, from the top, bottom, middle, and mid-upper and mid-lower rows of the sensor. A seventh packet may instruct the camera to output only data from blue-filtered sensor cells. An eighth packet may instruct the camera to disregard any auto-focus instructions but instead capture a full frame at infinity focus. And so on.

Each such packet 57 is provided from setup module 34 across a bus or other data channel 60 to a camera controller module associated with the camera. (The details of a digital camera, including an array of photosensor cells, associated analog-digital converters and control circuitry, etc., are well known to artisans and so are not belabored.) Camera 32 captures digital image data in accordance with instructions in the header field 55 of the packet and stuffs the resulting image data into a body 59 of the packet. It also deletes the camera instructions 55 from the packet header (or otherwise marks header field 55 in a manner permitting it to be disregarded by subsequent processing stages).

When the packet 57 was authored by setup module 34 it also included a series of further header fields 58, each specifying how a corresponding, successive post-sensor stage 38 should process the captured data. As shown in FIG. 16, there are several such post-sensor processing stages 38.
In this data driven arrangement 30, operation of a Camera 32 outputs the image-stuffed packet produced by cellphone camera 32 is dynamically controlled in accordance 50 the camera (a pixel packet) onto a bus or other data channel with packet data sent by a setup module 34, which in turn is 61, which conveys it to a first processing stage 38. controlled by a control processor module 36. (Control pro Stage 38 examines the header of the packet. Since the cessor module 36 may be the cellphone's primary processor, camera deleted the instruction field 55 that conveyed camera oran auxiliary processor, or this function may be distributed.) instructions (or marked it to be disregarded), the first header The packet data further specifies operations to be performed 55 field encountered by a control portion of stage 38 is field 58a. by an ensuing chain of processing stages 38. This field details parameters of an operation to be applied by In one particular implementation, setup module 34 dic stage 38 to data in the body of the packet. tates—on a frame by frame basis—the parameters that are to For example, field 58a may specify parameters of an edge be employed by camera 32 in gathering an exposure. Setup detection algorithm to be applied by stage 38 to the packets module 34 also specifies the type of data the camera is to 60 image data (or simply that such an algorithm should be output. These instructional parameters are conveyed in a first applied). It may further specify that stage 38 is to substitute field 55 of a header portion 56 of a data packet 57 correspond the resulting edge-detected set of data for the original image ing to that frame (FIG. 17). data in the body of the packet. (Substituting of data, rather For example, for each frame, the setup module 34 may than appending, may be indicated by the value of a single bit issue a packet 57 whose first field 55 instructs the camera 65 flag in the packet header.) 
Stage 38 performs the requested operation (which may involve configuring programmable hardware in certain implementations). First stage 38 then deletes instructions 58a from the packet header 56 (or marks them to be disregarded) and outputs the processed pixel packet for action by a next processing stage.

A control portion of a next processing stage (which here comprises stages 38a and 38b, discussed later) examines the header of the packet. Since field 58a was deleted (or marked to be disregarded), the first field encountered is field 58b. In this particular packet, field 58b may instruct the second stage not to perform any processing on the data in the body of the packet, but instead simply delete field 58b from the packet header and pass the pixel packet to the next stage.

A next field of the packet header may instruct the third stage 38c to perform 2D FFT operations on the image data found in the packet body, based on 16x16 blocks. It may further direct the stage to hand off the processed FFT data to a wireless interface, for internet transmission to address 216.239.32.10, accompanied by specified data (detailing, e.g., the task to be performed on the received FFT data by the computer at that address, such as texture classification). It may further direct the stage to hand off a single 16x16 block of FFT data, corresponding to the center of the captured image, to the same or a different wireless interface for transmission to address 12.232.235.27—again accompanied by corresponding instructions about its use (e.g., search an archive of stored FFT data for a match, and return information if a match is found; also, store this 16x16 block in the archive with an associated identifier). Finally, the header authored by setup module 34 may instruct stage 38c to replace the body of the packet with the single 16x16 block of FFT data dispatched to the wireless interface. As before, the stage also edits the packet header to delete (or mark) the instructions to which it responded, so that a header instruction field for the next processing stage is the first to be encountered.

In other arrangements, the addresses of the remote computers are not hard-coded. For example, the packet may include a pointer to a database record or memory location (in the phone or in the cloud), which contains the destination address. Or, stage 38c may be directed to hand off the processed pixel packet to the Query Router and Response Manager (e.g., FIG. 7). This module examines the pixel packet to determine what type of processing is next required, and it routes it to an appropriate provider (which may be in the cell phone if resources permit, or in the cloud—among the stable of static providers, or to a provider identified through an auction). The provider returns the requested output data (e.g., […]

[…] about, e.g., the length of the exposure, the aperture size, the lens focus, the depth of field, etc. Module 34 may further specify a particular type of processing (e.g., object identification). This focus/exposure information can be used as predictive setup data for the camera the next time a frame of the same or similar type is captured. The control processor module 36 can set up a frame request using a filtered or time-series prediction sequence of focus information from previous frames, or a sub-set of those frames.

Error and status reporting functions may also be accomplished using outputs 31. Each stage may also have one or more other outputs 33 for providing data to other processes or modules—locally within the cell phone, or remote ("in the cloud"). Data (in packet form, or in other format) may be directed to such outputs in accordance with instructions in packet 57, or otherwise.

For example, a processing module 38 may make a data flow selection based on some result of processing it performs. E.g., if an edge detection stage discerns a sharp contrast image, then an outgoing packet may be routed to an external service provider for FFT processing. That provider may return the resultant FFT data to other stages. However, if the image has poor edges (such as being out of focus), then the system may not want FFT and following processing to be performed on the data. Thus, the processing stages can cause branches in the data flow, dependent on parameters of the processing (such as discerned image characteristics).

Instructions specifying such conditional branching can be included in the header of packet 57, or they can be provided otherwise. FIG. 19 shows one arrangement. Instructions 58d originally in packet 57 specify a condition, and specify a location in a memory 79 from which replacement subsequent instructions (58e'-58g') can be read, and substituted into the packet header, if the condition is met. If the condition is not met, execution proceeds in accordance with header instructions already in the packet.

In other arrangements, other variations can be employed. For example, all of the possible conditional instructions can be provided in the packet. In another arrangement, a packet architecture is still used, but one or more of the header fields do not include explicit instructions. Rather, they simply point to memory locations from which corresponding instructions (or data) are retrieved, e.g., by the corresponding processing stage 38.

Memory 79 (which can include a cloud component) can also facilitate adaptation of processing flow even if conditional branching is not employed.
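The header-instruction pipeline and conditional-branching arrangement described above can be sketched in Python. This is an illustrative reconstruction, not code from the patent: the instruction tuples, the condition test, and the MEMORY dict (standing in for memory 79) are all invented for the example.

```python
# Illustrative sketch: a pixel packet whose header carries an ordered list of
# per-stage instructions. Each stage acts on the first header field, then
# strips it. A conditional instruction can splice in replacement instructions
# fetched from a shared memory store (cf. "memory 79" in the text).

class PixelPacket:
    def __init__(self, header, body):
        self.header = list(header)   # ordered instruction fields (58a, 58b, ...)
        self.body = body

# hypothetical shared memory holding replacement instruction sequences
MEMORY = {"sharp_path": [("fft", 16), ("classify", "texture")]}

def run_stage(packet, ops):
    """Consume the first header field; execute it, skip, or branch."""
    instr = packet.header.pop(0)     # strip the instruction acted upon
    kind = instr[0]
    if kind == "skip":               # e.g., field 58b: pass packet through
        return packet
    if kind == "branch":             # conditional substitution of instructions
        _, condition, mem_key = instr
        if condition(packet):
            packet.header = MEMORY[mem_key] + packet.header
        return packet
    packet.body = ops[kind](packet.body, instr[1])
    return packet

ops = {
    "fft": lambda body, n: ("fft%d" % n, body),
    "classify": lambda body, what: (what, body),
}

# a packet whose edge contrast triggers the FFT path
pkt = PixelPacket(
    header=[("skip",), ("branch", lambda p: p.body["contrast"] > 0.5, "sharp_path")],
    body={"contrast": 0.9},
)
while pkt.header:
    pkt = run_stage(pkt, ops)
```

Because each stage only ever looks at the first remaining field, the same loop serves any mix of local stages; a real implementation would dispatch some fields to remote providers instead.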
For example, a processing stage may yield output data that determines parameters of a filter or other algorithm to be applied by a later stage (e.g., a convolution kernel, a time delay, a pixel mask, etc.). Such parameters may be identified by the former processing stage in memory (e.g., determined/calculated, and stored), and recalled for use by the later stage. In FIG. 19, for example, processing stage 38 produces parameters that are stored in memory 79. A subsequent processing stage 38c later retrieves these parameters, and uses them in execution of its assigned operation. (The information in memory can be labeled to identify the module/provider from which they originated, or to which they are destined.)

[…] texture classification information, and information about any matching FFT in the archive), and processing continues per the next item of instruction in the pixel packet header. The data flow continues through as many functions as a particular operation may require.

In the particular arrangement illustrated, each processing stage 38 strips out, from the packet header, the instructions on which it acted. The instructions are ordered in the header in the sequence of processing stages, so this removal allows each stage to look to the first instructions remaining in the header for direction. Other arrangements, of course, can alternatively be employed. (For example, a module may insert […])

[…] recognition, perhaps SIFT or other feature recognition reference data should be downloaded for candidate DVDs and stored in a cell phone cache. Recent releases are good prospects (except those rated G, or rated high for violence—stored profile data indicates the user just doesn't have a history of watching those). So are movies that she's watched in the past (as indicated by historical rental records—also available to the phone).

If the user's position corresponds to a downtown street, and magnetometer and other position data indicates she is looking north, inclined up from the horizontal, what's likely to be of interest? Even without image data, a quick reference to online resources such as Google Streetview can suggest she's looking at business signage along 5th Avenue. Maybe feature recognition reference data for this geography should be downloaded into the cache for rapid matching against to-be-acquired image data.

To speed performance, the cache should be loaded in a rational fashion—so that the most likely object is considered first. Google Streetview for that location includes metadata indicating 5th Avenue has signs for a Starbucks, a Nordstrom store, and a Thai restaurant. Stored profile data for the user reveals she visits Starbucks daily (she has their branded loyalty card); she is a frequent clothes shopper (albeit with a Macy's, rather than a Nordstrom's, charge card); and she's never eaten at a Thai restaurant. Perhaps the cache should be loaded so as to most quickly identify the Starbucks sign, followed by Nordstrom, followed by the Thai restaurant.

Low resolution imagery captured for presentation on the viewfinder fails to trigger the camera's feature highlighting probable faces (e.g., for exposure optimization purposes). That helps. There's no need to pre-warm the complex processing associated with facial recognition.

She touches the virtual shutter button, capturing a frame of high resolution imagery, and image analysis gets underway—trying to recognize what's in the field of view, so that the camera application can overlay a ranked ordering of graphical links related to objects in the captured frame. (Or this may happen without user action—the camera may be watching […]

[…] APIs provide standardized abstractions so that the drivers and services do not need to concern themselves with the particular configuration of the robot or server (and vice-versa). (FIG. 20A, discussed below, also provides a layer of abstraction between the sensors, the locally-available operations, and the externally-available operations.)

Certain embodiments of the present technology can be implemented using a local process & remote process paradigm akin to that of the Player Project, connected by a packet network and inter-process & intra-process communication constructs familiar to artisans (e.g., named pipes, sockets, etc.). Above the communication minutiae is a protocol by which different processes may communicate; this may take the form of a message passing paradigm and message queue, or more of a network centric approach where collisions of key vectors are addressed after the fact (re-transmission, drop if timely in nature, etc.).

In such embodiments, data from sensors on the mobile device (e.g., microphone, camera) can be packaged in key vector form, with associated instructions. The instruction(s) associated with data may not be express; they can be implicit (such as Bayer conversion) or session specific—based on context or user desires (in a photo taking mode, face detection may be presumed).

In a particular arrangement, key vectors from each sensor are created and packaged by device driver software processes that abstract the hardware specific embodiments of the sensor and provide a fully formed key vector adhering to a selected protocol. The device driver software can then place the formed key vector on an output queue unique to that sensor, or in a common message queue shared by all the sensors. Regardless of approach, local processes can consume the key vectors and perform the needed operations before placing the resultant key vectors back on the queue. Those key vectors that are to be processed by remote services are then placed in packets and transmitted directly to a remote process for additional processing, or to a remote service that distributes the key vectors—similar to a router.
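The key vector flow described above (device drivers package sensor data into key vectors on a queue; local processes consume what they can handle; the remainder is routed toward remote services) might be sketched as follows. All names, and the two-sensor setup, are hypothetical.

```python
# Minimal sketch of the key vector queue: drivers abstract the hardware and
# enqueue fully formed key vectors; local processes consume them; key vectors
# needing remote services are set aside for a distributor, router-fashion.

from collections import deque

def make_key_vector(sensor, data, instructions):
    return {"sensor": sensor, "data": data, "instructions": instructions}

queue = deque()

# device drivers enqueue key vectors (instructions may be implicit in a real
# system, e.g., Bayer conversion for camera frames)
queue.append(make_key_vector("camera", "raw_bayer", ["bayer_convert"]))
queue.append(make_key_vector("microphone", "pcm", ["fingerprint"]))

LOCAL_OPS = {"bayer_convert": lambda d: "rgb"}

remote_outbox = []  # key vectors bound for a remote service/distributor

while queue:
    kv = queue.popleft()
    op = kv["instructions"][0]
    if op in LOCAL_OPS:
        kv["data"] = LOCAL_OPS[op](kv["data"])
        kv["instructions"] = kv["instructions"][1:]
        if kv["instructions"]:       # more work pending: back on the queue
            queue.append(kv)
    else:
        remote_outbox.append(kv)     # e.g., audio fingerprinting in the cloud
```

A per-sensor queue works identically; the shared queue merely centralizes the dispatch decision.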
It should be clear to the reader that commands to initialize or set up any of the sensors or processes in the system can be distributed in a similar fashion from a Control Process (e.g., box 36 in FIG. 16).

Branch Prediction; Commercial Incentives

The technology of branch prediction arose to meet the needs of increasingly complex processor hardware; it allows processors with lengthy pipelines to fetch data and instructions (and in some cases, execute the instructions), without waiting for conditional branches to be resolved. A similar science can be applied in the present context—predicting what action a human user will take. For example, as discussed above, the just-detailed system may "pre-warm" certain processors, or communication channels, in anticipation that certain data or processing operations will be forthcoming.

When a user removes an iPhone from her purse (exposing the sensor to increased light) and lifts it to eye level (as sensed by accelerometers), what is she about to do? Reference can be made to past behavior to make a prediction. Particularly relevant may be what the user did with the phone camera the last time it was used; what the user did with the phone camera at about the same time yesterday (and at the same time a week ago); what the user last did at about the same location; etc. Corresponding actions can be taken in anticipation.

If her latitude/longitude correspond to a location within a video rental store, that helps. Expect to maybe perform image recognition on artwork from a DVD box. To speed possible […]

[…] proactively.) Unlike Google web search—which ranks search results in an order based on aggregate user data—the camera application attempts a ranking customized to the user's profile. If a Starbucks sign or logo is found in the frame, the Starbucks link gets top position for this user.

If signs for Starbucks, Nordstrom, and the Thai restaurant are all found, links would normally be presented in that order (per the user's preferences inferred from profile data). However, the cell phone application may have a capitalistic bent, and be willing to promote a link by a position or two (although perhaps not to the top position) if circumstances warrant. In the present case, the cell phone routinely sent IP packets to the web servers at addresses associated with each of the links, alerting them that an iPhone user had recognized their corporate signage from a particular latitude/longitude. (Other user data may also be provided, if privacy considerations and user permissions allow.)

The Thai restaurant server responds back in an instant—offering to the next two customers 25% off any one item (the restaurant's point of sale system indicates only four tables are occupied and no order is pending; the cook is idle). The restaurant server offers three cents if the phone will present the discount offer to the user in its presentation of search results, or five cents if it will also promote the link to second place in the ranked list, or ten cents if it will do that and be the only discount offer presented in the results list. (Starbucks also responded with an incentive, but not as attractive.) The cell phone quickly accepts the restaurant's offer, and payments are quickly made—either to the user (e.g., defraying the monthly phone bill) or more likely to the phone carrier (e.g., AT&T). Links are presented to Starbucks, the Thai restaurant, and Nordstrom, in that order, with the restaurant's link noting the discount for the next two customers.

Google's AdWord technology has already been noted. It […]

[…] requested, or authorized, by the cell phone. In such case MySpace can tell the cell phone application to take needed steps for higher bandwidth service, and MySpace will rebate to the user (or to the carrier, for benefit of the user's account) the extra associated costs.
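A sketch of the ranking policy at work in this scenario, under the stated constraint that a paid incentive can lift a link a position or two but never into the top slot. The affinity scores and bid amounts are invented for the example.

```python
# Profile-driven ranking with a capped, paid promotion: links are ordered by
# the user's inferred affinity; the highest bidder may be promoted a limited
# number of positions, but position 0 is never for sale.

def rank_links(affinities, bids, max_promotion=2):
    """affinities/bids: dicts mapping link -> score / offered payment."""
    ranked = sorted(affinities, key=affinities.get, reverse=True)
    best_bidder = max(bids, key=bids.get) if bids else None
    if best_bidder and best_bidder in ranked:
        pos = ranked.index(best_bidder)
        new_pos = max(1, pos - max_promotion)   # top slot stays organic
        if new_pos < pos:
            ranked.insert(new_pos, ranked.pop(pos))
    return ranked

# profile: Starbucks daily, Nordstrom sometimes, Thai restaurant never
affinities = {"starbucks": 0.9, "nordstrom": 0.5, "thai": 0.1}
bids = {"thai": 0.10}            # ten cents offered to promote the Thai link
order = rank_links(affinities, bids)
```

With the bid accepted, the Thai link rises to second place while Starbucks keeps the top position, matching the narrative above.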
[…] decides, based on factors including a reverse-auction determined payment, which ads to present as Sponsored Links adjacent the results of a Google web search. Google has adapted this technology to present ads on third party web sites and blogs, based on the particular contents of those sites, terming the service AdSense. In accordance with another aspect of the present technology, the AdWord/AdSense technology is extended to visual image search on cell phones.

Consider a user located in a small bookstore who snaps a picture of the Warren Buffett biography Snowball. The book is quickly recognized, but rather than presenting a corresponding Amazon link atop the list (as may occur with a regular Google search), the cell phone recognizes that the user is located in an independent bookstore. Context-based rules consequently dictate that it present a non-commercial link first. Top ranked of this type is a Wall Street Journal review of the book, which goes to the top of the presented list of links.

Decorum, however, only goes so far. The cell phone passes the book title or ISBN (or the image itself) to Google AdSense or AdWords, which identifies sponsored links to be associated with that object. (Google may independently perform its own image analysis on any provided imagery. In some cases it may pay for such cell phone-submitted imagery—since Google has a knack for exploiting data from diverse sources.) Per Google, Barnes and Noble has the top sponsored position, followed by alldiscountbooks-dot-net. The cell phone application may present these sponsored links in a graphically distinct manner to indicate their origin (e.g., in a different part of the display, or presented in a different color), or it may insert them alternately with non-commercial search results, i.e., at positions two and four. The AdSense revenue collected by Google can again be shared with the user, or with the user's carrier.

In some embodiments, the cell phone (or Google) again pings the servers of companies for whom links will be presented—helping them track their physical world-based online visibility. The pings can include the location of the user, and an identification of the object that prompted the ping. When alldiscountbooks-dot-net receives the ping, it may check inventory and find it has a significant overstock of Snowball. As in the example earlier given, it may offer an extra payment for some extra promotion (e.g., including "We have 732 copies—cheap" in the presented link).

In addition to offering an incentive for a more prominent search listing (e.g., higher in the list, or augmented with additional information), a company may also offer additional bandwidth to serve information to a customer. For example, a user may capture video imagery from an electronic billboard, and want to download a copy to show to friends. The user's cell phone identifies the content as a popular clip of user generated content (e.g., by reference to an encoded watermark), and finds that the clip is available from several sites—the most popular of which is YouTube, followed by MySpace. To induce the user to link to MySpace, MySpace may offer to upgrade the user's baseline wireless service from 3 megabits per second to 10 megabits per second, so the video will download in a third of the time. This upgraded service can be only for the video download, or it can be longer. The link presented on the screen of the user's cell phone can be amended to highlight the availability of the faster service. (Again, MySpace may make an associated payment.)

Sometimes alleviating a bandwidth bottleneck requires opening a bandwidth throttle on a cell phone end of the wireless link. Or the bandwidth service change must be […]

[…] prebuilt/edited and ready to go), caching the Twitter feeds of the star players, buffering video from city center showing the […]

[…] In some arrangements, the quality of service (e.g., bandwidth) is managed by pipe manager 51. Instructions from MySpace may request that the pipe manager start requesting augmented service quality, and set up the expected high bandwidth session, even before the user selects the MySpace link.

In some scenarios, vendors may negotiate preferential bandwidth for their content. MySpace may make a deal with AT&T, for example, that all MySpace content delivered to AT&T phone subscribers be delivered at 10 megabits per second—even though most subscribers normally only receive 3 megabits per second service. The higher quality service may be highlighted to the user in the presented links.

Modeling of User Behavior

Aided by knowledge of a particular physical environment, a specific place and time, and behavior profiles of expected users, simulation models of human computer interaction with the physical world can be based on tools and techniques from fields as diverse as robotics and audience measurement. An example of this might be the number of expected mobile devices in a museum at a particular time; the particular sensors that such devices are likely to be using; and what stimuli are expected to be captured by those sensors (e.g., where are they pointing the camera, what is the microphone hearing, etc.). Additional information can include assumptions about social relationships between users: Are they likely to share common interests? Are they within common social circles that are likely to share content, to share experiences, or desire creating location-based experiences such as wiki-maps (cf. Barricelli, "Map-Based Wikis as Contextual and Cultural Mediators," MobileHCI, 2009)?

In addition, modeling can be based on generalized heuristics derived from observations at past events (e.g., how many people used their cell phone cameras to capture imagery from the Portland Trailblazers scoreboard during a basketball game, etc.), to more evolved predictive models that are based on innate human behavior (e.g., people are more likely to capture imagery from a scoreboard during overtime than during a game's half-time).

Such models can inform many aspects of the experience for the users, in addition to the business entities involved in provisioning and measuring the experience. These latter entities may consist of the traditional value chain participants involved in event production, and the arrangements involved in measuring interaction and monetizing it: event planners, producers, and artists on the creation side, and the associated rights societies (ASCAP, Directors Guild of America, etc.) on the royalties side.

From a measurement perspective, both sampling-based techniques from opt-in users and devices, and census-driven techniques, can be utilized. Metrics for more static environments may range from Revenue Per Unit (RPU) created by digital traffic on the digital service provider network (how much bandwidth is being consumed) to more evolved models of Click Through Rates (CTR) for particular sensor stimuli.

For example, the Mona Lisa painting in the Louvre is likely to have a much higher CTR than other paintings in the museum, informing matters such as priority for content provisioning; e.g., content related to the Mona Lisa should be cached as close to the edge of the cloud as possible, if not pre-loaded onto the mobile device itself when the user approaches or enters the museum. (Of equal importance is the role that CTR plays in monetizing the experience and environment.)
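The CTR-driven provisioning idea described above (push the highest-expected-CTR content closest to the user, onto the device itself if it fits, else to an edge cache) reduces to a simple priority sort. The CTR figures and item names below are invented for the example.

```python
# Sketch of CTR-driven content provisioning: items are sorted by expected
# click-through rate, then assigned to device storage, edge cache, or the
# general cloud, in that order of proximity to the user.

def provision(expected_ctr, device_slots, edge_slots):
    ordered = sorted(expected_ctr, key=expected_ctr.get, reverse=True)
    device = ordered[:device_slots]
    edge = ordered[device_slots:device_slots + edge_slots]
    cloud = ordered[device_slots + edge_slots:]
    return device, edge, cloud

# hypothetical museum inventory with invented CTR estimates
ctr = {"mona_lisa": 0.40, "venus_de_milo": 0.15, "wing_gallery_12": 0.02}
device, edge, cloud = provision(ctr, device_slots=1, edge_slots=1)
```

In a deployed system the CTR estimates would themselves come from the measurement techniques discussed above (sampling from opt-in users, census-driven traffic metrics).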
Consider a school group that enters a sculpture museum having a garden with a collection of Rodin works. The museum may provide content related to Rodin and his works on servers or infrastructure (e.g., router caches) that serve the garden. Moreover, because the visitors comprise a pre-established social group, the museum may expect some social connectivity. So the museum may enable sharing capabilities (e.g., ad hoc networking) that might not otherwise be used. If one student queries the museum's online content to learn more about a particular Rodin sculpture, the system may accompany delivery of the solicited information with a prompt inviting the student to share this information with others in the group. The museum server can suggest particular "friends" of the student with whom such information might be shared, if such information is publicly accessible from Facebook or another social networking data source.

In addition to names of friends, such a social networking data source can also provide device identifiers, IP addresses, profile information, etc., for the student's friends—which may be leveraged to assist the dissemination of educational material to others in the group. These other students may find this particular information relevant, since it was of interest to another in their group—even if the original student's name is not identified. If the original student is identified with the conveyed information, then this may heighten the information's interest to others in the group.

(Detection of a socially-linked group may be inferred from review of the museum's network traffic. For example, if a device sends packets of data to another, and the museum's network handles both ends of the communication—dispatch and delivery—then there's an association between two devices in the museum. If the devices are not ones that have historical patterns of network usage, e.g., employees, then the system can conclude that two visitors to the museum are socially connected. If a web of such communications is detected—involving several unfamiliar devices—then a social group of visitors can be discerned. The size of the group can be gauged by the number of different participants in such network traffic. Demographic information about the group can be inferred from external addresses with which data is exchanged; middle schoolers may have a high incidence of MySpace traffic; college students may communicate with external addresses at a university domain; senior citizens may demonstrate a different traffic profile. All such information arising from traffic analysis can be employed in automatically adapting the information and services provided to the visitors—as well as providing useful information to the museum's administration and marketing departments.)

Consider other situations. One is halftime at the U.S. football Superbowl, featuring a headline performer (e.g., Bruce Springsteen, or Prince). The show may cause hundreds of fans to capture pictures or audio-video of the event. Another context with predictable public behavior is the end of an NBA championship basketball game. Fans may want to memorialize the final buzzer excitement: the scoreboard, streamers and confetti dropping from the ceiling, etc. In such cases, actions that can be taken to prepare, or optimize, delivery of content or experience should be taken. Examples include rights clearance for associated content, rendering virtual worlds and other synthesized content, throttling down routine time-insensitive network traffic, queuing commercial resources that may be invoked as people purchase souvenir books/music from Amazon (caching pages, authenticating users to financial sites), propagating links for post-game interviews (some […]

[…] order.

[…] hometown crowds watching on a Jumbotron display—erupting with joy at the buzzer, etc.; anything relating to the experience or follow-on actions should be prepped/cached in advance, where possible.

Stimuli for sensors (audio, visual, tactile, odor, etc.) that are most likely to instigate user action and attention are much more valuable from a commercial standpoint than stimuli less likely to instigate such action (similar to the economic principles on which Google's AdWords ad-serving system is based). Such factors and metrics directly influence advertising models through auction models well understood by those in the art.

Multiple delivery mechanisms exist for advertising delivery by third parties, leveraging known protocols such as VAST. VAST (Digital Video Ad Serving Template) is a standard issued by the Interactive Advertising Bureau that establishes reference communication protocols between scriptable video rendering systems and ad servers, as well as associated XML schemas. As an example, VAST helps standardize the service of video ads to independent web sites (replacing old-style banner ads), commonly based on a bit of JavaScript included in the web page code—code that also aids in tracking traffic and managing cookies. VAST can also insert promotional messages in the pre-roll and post-roll viewing of other video content delivered by the web site. The web site owner doesn't concern itself with selling or running the advertisements, yet at the end of the month the web site owner receives payment based on audience viewership/impressions. In similar fashion, physical stimuli presented to users in the real world, sensed by mobile technology, can be the basis for payments to the parties involved.

Dynamic environments in which the stimulus presented to users and their mobile devices can be controlled (such as video displays, as contrasted with static posters) provide new opportunities for measurement and utilization of metrics such as CTR. Background music, content on digital displays, illumination, etc., can be modified to maximize CTR and shape traffic. For example, illumination on particular signage can be increased, or flash, as a targeted individual passes by. Similarly, when a flight from Japan lands at the airport, digital signage, music, etc., can all be modified overtly (changing the advertising to the interests of the expected audience) or covertly (changing the linked experience to take the user to the Japanese language website), to maximize the CTR.

Mechanisms may be introduced as well to contend with rogue or un-approved sensor stimuli. Within the confined spaces of a university or business park, stimuli (posters, music, digital signage, etc.) that don't adhere to the intentions or policies of the property owner—or the entity responsible for a domain—may need to be managed. This can be accomplished through the use of simple blocking mechanisms that are geography-specific (not dissimilar to region coding on DVDs), indicating that all attempts within specific GPS coordinates to route a key vector to a specific place in the cloud must be mediated by a routing service or gateway managed by the domain owner.

Other options include filtering the resultant experience. Is it age appropriate? Does it run counter to pre-existing advertising or branding arrangements, such as a Coca-Cola advertisement being delivered to a user inside the Pepsi Center during a Denver Nuggets game?

This may be accomplished on the device as well, through the use of content rules, such as the Movielabs Content Recognition Rules related to conflicting media content (cf. www.movielabs-dot-com/CRR), parental controls provided […]
[…] by carriers to the device, or by adhering to DMCA Automatic Take Down Notices.

Under various rights management paradigms, licenses play a key role in determining how content can be consumed, shared, modified, etc. A result of extracting semantic meaning from stimulus presented to the user (and the user's mobile device), and/or the location in which stimulus is presented, can be issuance of a license to desired content or experiences (games, etc.) by third parties. To illustrate, consider a user at a rock concert in an arena. The user may be granted a temporary license to peruse and listen to all music tracks by the performing artist (and/or others) on iTunes, beyond the 30-second preview rights normally granted to the public. However, such license may only persist during the concert, or only from when the doors open until the headline act begins its performance, or only while the user is in the arena, etc. Thereafter, such license ends.

Similarly, passengers disembarking from an international flight may be granted location-based or time-limited licenses to translation services or navigation services (e.g., an augmented reality system overlaying directions for baggage claim, bathrooms, etc., on camera-captured scenes) for their mobile devices, while they transit through customs, are in the airport, for 90 minutes after their arrival, etc.

Such arrangements can serve as metaphors for experience, and as filtering mechanisms. One embodiment in which sharing of experiences is triggered by sensor stimuli is through broadcast social networks (e.g., Twitter) and syndication protocols (e.g., RSS web feeds/channels). Other users, entities or devices can subscribe to such broadcasts/feeds as the basis for subsequent communication (social, information retrieval, etc.), as logging of activities (e.g., a person's daily journal), or measurement (audience, etc.). Traffic associated with such networks/feeds can also be measured by devices at a particular location—allowing users to traverse in time, to understand who was communicating what at a particular point in time. This enables searching for and mining additional information, e.g., was my friend here last week? Was someone from my peer group here? What content was consumed? Such traffic also enables real-time monitoring of how users share experiences. Monitoring "tweets" about a performer's song selection during a concert may cause the performer to alter the songs to be played for the remainder of a concert. The same is true for brand management. For example, if users share their opinions about a car during a car show, live keyword filtering on the traffic can allow the brand owner to re-position certain products for maximum effect (e.g., the new model of Corvette should spend more time on the spinning platform, etc.).

More on Optimization

Predicting the user's action or intent is one form of optimization. Another form involves configuring the processing so as to improve performance.

To illustrate one particular arrangement, consider again the Common Services Sorter of FIG. 6. What key vector operations should be performed locally, or remotely, or as a hybrid of some sort? In what order should key vector operations be performed? Etc. The mix of expected operations, and their scheduling, should be arranged in an appropriate fashion for the processing architecture being used, the circumstances, and the context.

One step in the process is to determine which operations need to occur. This determination can be based on express requests from the user, historical patterns of usage, context and status, etc.

Many operations are high level functions, which involve a number of component operations performed in a particular […]

[…] For example, optical character recognition may require edge detection, followed by region-of-interest segmentation, followed by template pattern matching. Facial recognition may involve skintone detection, Hough transforms (to identify oval-shaped areas), identification of feature locations (pupils, corners of mouth, nose), eigenface calculation, and template matching.

The system can identify the component operations that may need to be performed, and the order in which their respective results are required. Rules and heuristics can be applied to help determine whether these operations should be performed locally or remotely. For example, at one extreme, the rules may specify that simple operations, such as color histograms and thresholding, should generally be performed locally. At the other extreme, complex operations may usually default to outside providers.

Scheduling can be determined based on which operations are preconditions to other operations. This can also influence whether an operation is performed locally or remotely (local performance may provide quicker results—allowing subsequent operations to be started with less delay). The rules may seek to identify the operation whose output(s) is used by the greatest number of subsequent operations, and perform this operation first (its respective precedent(s) permitting). Operations that are preconditions to successively fewer other operations are performed successively later. The operations, and their sequence, may be conceived as a tree structure—with the most globally important performed first, and operations of lesser relevance to other operations performed later.

Such determinations, however, may also be tempered (or dominated) by other factors. One is power. If the cell phone battery is low, or if an operation will involve a significant drain on a low capacity battery, this can tip the balance in favor of having the operation performed remotely.

Another factor is response time. In some instances, the limited processing capability of the cell phone may mean that processing locally is slower than processing remotely (e.g., where a more robust, parallel architecture might be available to perform the operation). In other instances, the delays of establishing communication with a remote server, and establishing a session, may make local performance of an operation quicker. Depending on user demand, and the needs of other operation(s), the speed with which results are returned may be important, or not.

Still another factor is user preferences. As noted elsewhere, the user may set parameters influencing where, and when, operations are performed. For example, a user may specify that an operation may be referred to remote processing by a domestic service provider, but if none is available, then the operation should be performed locally.

Routing constraints are another factor. Sometimes the cell phone will be in a WiFi or other service area (e.g., in a concert arena) in which the local network provider places limits or conditions on remote service requests that may be accessed through that network. In a concert where photography is forbidden, for example, the local network may be configured to block access to external image processing service providers for the duration of the concert. In this case, services normally routed for external execution should be performed locally.

Yet another factor is the particular hardware with which the cell phone is equipped. If a dedicated FFT processor is available in the phone, then performing intensive FFT operations locally makes sense. If only a feeble general purpose CPU is available, then an intensive FFT operation is probably best referred out for external execution.

[…] Under one set of factors, the cell phone may perform all of the component OCR operations up until the last (template matching), and send out data only for this last operation. […]

A related factor is current hardware utilization. Even if a cell phone is equipped with hardware that is well configured for a certain task, it may be so busy and backlogged that the […]
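The precondition-driven scheduling heuristic described here (among the operations whose preconditions are satisfied, run first the one whose output feeds the most later operations) can be sketched as follows. The OCR pipeline used as the example comes from the text; the dependency encoding is an assumption.

```python
# Sketch of dependency-aware scheduling: operations become "ready" when their
# preconditions are done; ready operations are ordered by how many downstream
# operations (transitively) depend on them.

def count_dependents(graph, op, seen=None):
    """Transitive count of operations that consume op's output."""
    seen = set() if seen is None else seen
    for nxt in graph.get(op, []):
        if nxt not in seen:
            seen.add(nxt)
            count_dependents(graph, nxt, seen)
    return len(seen)

def schedule(graph, preconditions):
    done, order = set(), []
    while len(order) < len(preconditions):
        ready = [op for op in preconditions
                 if op not in done and preconditions[op] <= done]
        ready.sort(key=lambda op: count_dependents(graph, op), reverse=True)
        op = ready[0]               # most globally important ready operation
        done.add(op)
        order.append(op)
    return order

# edge detection feeds segmentation, which feeds template matching
graph = {"edge_detect": ["segment"], "segment": ["template_match"]}
preconditions = {
    "edge_detect": set(),
    "segment": {"edge_detect"},
    "template_match": {"segment"},
}
order = schedule(graph, preconditions)
```

The same ranking information could also feed the local-versus-remote decision, since an operation with many dependents is the one whose delay would stall the most downstream work.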
(Under yet system may refer a next task of this sort to an external another set of factors, the OCR operation may be completed resource for completion. wholly by the cell phone, or different components of opera Another factor may be the length of the local processing tion can be performed alternately by the cell phone and chain, and the risk of a stall. Pipelined processing architec remote service provider(s), etc.) tures may become stalled for intervals as they wait for data Reference was made to routing constraints as one possible needed to complete an operation. Such a stall can cause all factor. This is a particular example of a more general factor— other subsequent operations to be similarly delayed. The risk 10 external business rules. Consider the earlier example of a user of a possible stall can be assessed (e.g., by historical patterns, who is attending an event at the Pepsi Center in Denver. The or knowledge that completion of an operation requires further Pepsi Center may provide wireless communication services data whose timely availability is not assured—such as a result to patrons, through its own WiFi or other network. Naturally, from another external process) and, if the risk is great enough, the Pepsi Center is reluctant for its network resources to be the operation may be referred for external processing to avoid 15 used for the benefit of competitors, such as Coca Cola. The stalling the local processing chain. host network may thus influence cloud services that can be Yet another factor is connectivity status. Is a reliable, high utilized by its patrons (e.g., by making some inaccessible, or speed network connection established? Or are packets by giving lowerpriority to data traffic of certain types, or with dropped, or network speed slow (or wholly unavailable)? certain destinations). The domain owner may exert control Geographical considerations of different sorts can also be over what operations a mobile device is capable of perform factors. 
One is network proximity to the service provider. ing. This control can influence the local/remote decision, as Another is whether the cellphone has unlimited access to the well as the type of data conveyed in key vector packets. network (as in a home region), or a pay-per-use arrangement Another example is a gym, which may want to impede (as when roaming in another country). usage of cellphone cameras, e.g., by interfering with access Information about the remote service provider(s) can also 25 to remote service providers for imagery, as well as photo be factored. Is the service provider offering immediate turn sharing sites Such as Flickr and Picasa. Still another example around, or are requested operations placed in a long queue, is a School which, for privacy reasons, may want to discour behind other users awaiting service? Once the provider is age facial recognition of its students and staff. In Such case, ready to process the task, what speed of execution is access to facial recognition service providers can be blocked, expected? Costs may also be key factors, together with other 30 or granted only on a moderated case-by-case basis. Venues attributes of importance to the user (e.g., whether the service may find it difficult to stop individuals from using cell phone provider meets “green” standards of environmental responsi cameras—or using them for particular purposes, but they can bility). A great many other factors can also be considered, as take various actions to impede Such use (e.g., by denying may be appropriate in particular contexts. Sources for Such services that would promote or facilitate such use). data can include the various elements shown in the illustrative 35 The following outline identifies other factors that may be block diagrams, as well as external resources. relevant in determining which operations are performed A conceptual illustration of the foregoing is provided in where, and in what sequence: FIG. 19B. 1. 
Scheduling optimization of key vector processing units Based on the various factors, a determination can be made based on numerous factors: as to whether an operation should be performed locally, or 40 Operation mix, which operations consist of similar atomic remotely. (The same factors may be assessed in determining instructions (MicroOps, Pentium II etc.) the order in which operations should be performed.) Stall states, which operations will generate stalls due to: In some embodiments, the different factors can be quanti waiting for external key vector processing fied by scores, which can be combined in polynomial fashion poor connectivity to yield an overall score, indicating how an operation should 45 user input be handled. Such an overall score serves as a metric indicating change in user focus the relative suitability of the operation for remote or external Cost of operation based on: processing. (A similar scoring approach can be employed to published cost choose between different service providers.) expected cost based on state of auction Depending on changing circumstances, a given operation 50 state of battery and power mode may be performed locally at one instant, and performed power profile of the operation (is it expensive?) remotely at a later instant (or vice versa). Or, the same opera past history of power consumption tion may be performed on two sets of key vector data at the opportunity cost, given the current state of the device, same time—one locally, and one remotely. e.g., what other processes should take priority Such as While described in the context of determining whether an 55 a voice call, GPS navigation, etc. operation should be performed locally or remotely, the same user preferences, i.e., I want a “green” provider, or open factors can influence other matters as well. For example, they Source provider can also be used in deciding what information is conveyed by legal uncertainties (e.g., certain providers may be at key vectors. 
greater risk of patent infringement charges, e.g., due Consider a circumstance in which the cell phone is to 60 to their use of an allegedly patented method) perform OCR on captured imagery. With one set of factors, Domain owner influence: unprocessed pixel data from a captured image may be sent to privacy concerns in specific physical arenas such as no a remote service provider to make this determination. Under face recognition at Schools a different set of factors, the cell phone may perform initial pre-determined content based rules prohibiting specific processing, such as edge detection, and then package the 65 operations against specific stimuli edge-detected data in key vector form, and route to an external Voiceprint matching against broadcast songs high provider to complete the OCR operation. Under still another lighting the use other singers (Milli-Vanilli’s US 8,768,313 B2 45 46 Grammy award was revoked when officials discov shape, to determine positions of eye pupils, nose location, and ered that the actual Vocals on the Subject recording distance across the mouth. For any oval that yielded useful were performed by other singers) candidate facial information, associated parameters would be All of the above influence scheduling and ability to per packaged in key vector form, and transmitted to a cloud Ser form out of order execution of key vectors based on the Vice that checks the key vectors of analyzed facial parameters optimal path to the desired outcome against known templates, e.g., of the user's Facebook friends. uncertainty in a long chain of operations, making pre (Or, such checking could also be performed by the GPU, or by diction of need for Subsequent key vector operations another processor in the cell phone.) 
difficult (akin to the deep pipeline in processors & (It is interesting to note that this facial recognition—like branch prediction)—difficulties might be due to weak 10 others detailed in this specification—distills the volume of metrics on key vectors data, e.g., from millions of pixels (bytes) in the originally past behavior. captured image, to a key vector that may comprise a few tens, location (GPS indicates that the device is quick hundreds, or thousands of bytes. This smaller parcel of infor motion) & pattern of GPS movements mation, with its denser information content, is more quickly is there a pattern of exposure to stimuli. Such as a 15 routed for processing sometimes externally. Communica user walking through an airport terminal being tion of the distilled key vector information takes place over a repeatedly exposed to CNN that is being pre channel with a corresponding bandwidth capability—keep sented at each gate ing costs reasonable and implementation practical.) proximity sensors indicating the device was placed in Contrast the just-described GPU implementation of face a pocket, etc. detection to Such an operation as it might be implemented on other approaches such as Least Recently Used (LRU) a scalar processor. Performing Hough-transform-based oval can be used to track how infrequent the desired detection across the entire image frame is prohibitive interms key vector operation resulted or contributed to the of processing time—much of the effort would be for naught, desired effect (recognition of a song, etc.) and would delay other tasks assigned to the processor. 
Further regarding pipelined or other time-consuming 25 Instead, such an implementation would typically have the operations, a particular embodiment may undertake some processor examine pixels as they come from the camera— Suitability testing before engaging a processing resource for looking for those having color within an expected “skintone' what may be more than a threshold number of clock cycles. A range. Only if a region of skintone pixels is identified would simple Suitability test is to make Sure the image data is poten a Hough transform then be attempted on that excerpt of the tially useful for the intended purpose, as contrasted with data 30 image data. In similar fashion, attempting to extract facial that can be quickly disqualified from analysis. For example, parameters from detected ovals would be done in a laborious whether it is all black (e.g., a frame captured in the user's serial fashion—often yielding no useful result. pocket). Adequate focus can also be checked quickly before Ambient Light committing to an extended operation. Many artificial light sources do not provide a consistent (The artisan will recognize that certain of the aspects of this 35 illumination. Most exhibit a temporal variation in intensity technology discussed above have antecedents visible in hind (luminance) and/or color. These variations commonly track sight. For example, considerable work has been put into the AC power frequency (50/60 or 100/120 Hz), but some instruction optimization for pipelined processors. Also, some times do not. For example, fluorescent tubes can give off devices have allowed user configuration of power settings, infrared illumination that varies at a ~40 KHZ rate. The emit e.g., user-selectable deactivation of a power-hungry GPU in 40 ted spectra depend on the particular lighting technology. certain Apple notebooks to extend battery life.) 
Organic LEDs for domestic and industrial lighting sometimes The above-discussed determination of an appropriate can use distinct color mixtures (e.g., blue and amber) to make instruction mix (e.g., by the Common Services Sorter of FIG. white. Others employ more traditional red/green/blue clus 6) particularly considered certain issues arising in pipelined ters, or blue/UV LEDs with phosphors. architectures. Different principles can apply in embodiments 45 In one particular implementation, a processing stage 38 in which one or more GPUs is available. These devices typi monitors, e.g., the average intensity, redness, greenness or cally have hundreds or thousands of scalar processors that are other coloration of the image data contained in the bodies of adapted for parallel execution, so that costs of execution packets. This intensity data can be applied to an output 33 of (time, stall risk, etc.) are Small. Branch prediction can be that stage. With the image data, each packet can convey a handled by not predicting: instead, the GPU processes for all 50 timestamp indicating the particular time (absolute, or based of the potential outcomes of a branch in parallel, and the on a local clock) at which the image data was captured. This system uses whatever output corresponds to the actual branch time data, too, can be provided on output 33. condition when it becomes known. A synchronization processor 35 coupled to Such an output To illustrate, consider facial recognition. A GPU-equipped 33 can examine the variation in frame-to-frame intensity (or cellphone may invoke instructions—when its camera is acti 55 color), as a function of timestamp data, to discern its period vated in a user photo-shoot mode—that configure 20 clusters icity. Moreover, this module can predict the next time instant of scalar processors in the GPU. (Such a cluster is sometimes at which the intensity (or color) will have a maxima, minima, termed a 'stream processor.) 
In particular, each cluster is or other particular state. A phase-locked loop may control an configured to perform a Hough transform on a small tile from oscillator that is synced to mirror the periodicity of an aspect a captured image frame—looking for one or more oval shapes 60 of the illumination. More typically, a digital filter computes a that may be candidate faces. The GPU thus processes the time interval that is used to set or compare against timers— entire frame in parallel, by 20 concurrent Hough transforms. optionally with Software interrupts. A digital phased-locked (Many of the stream processors probably found nothing, but loop or delay-locked loop can also be used. (A Kalman filter the process speed wasn’t impaired.) is commonly used for this type of phase locking.) When these GPU Hough transform operations complete, 65 Control processor module 36 can poll the synchronization the GPU may be reconfigured into a lesser number of stream module 35 to determine when a lighting condition is expected processors—one dedicated to analyzing each candidate oval to have a desired state. With this information, control proces US 8,768,313 B2 47 48 sor module 36 can direct setup module 34 to capture a frame data originally received in the body of the packet. In other of data under favorable lighting conditions for a particular arrangements this need not be the case. For example, a stage purpose. For example, if the camera is imaging an object may output a result of its processing to a module outside the Suspected of having a digital watermark encoded in a green depicted processing chain, e.g., on an output 33. (Or, as noted, color channel, processor 36 may direct camera 32 to capture a stage may maintain in the body of the output packet—the a frame of imagery at an instant that green illumination is data originally received, and augment it with further data— expected to be at a maximum, and direct processing stages 38 Such as the result(s) of its processing.) 
to process that frame for detection of Such a watermark. Reference was made to determining focus by reference to The camera phone may be equipped with plural LED light DCT frequency spectra, or edge detected data. Many con Sources that are usually operated in tandem to produce a flash 10 of white light illumination on a subject. Operated individually Sumer cameras perform a simpler form of focus check— or in different combinations, however, they can cast different simply by determining the intensity difference (contrast) colors of light on the Subject. The phone processor may con between pairs of adjacent pixels. This difference peaks with trol the component LED sources individually, to capture correct focus. Such an arrangement can naturally be used in frames with non-white illumination. If capturing an image 15 the detailed arrangements. (Again, advantages can accrue that is to be read to decode a green-channel watermark, only from performing Such processing on the sensor chip.) green illumination may be applied when the frame is cap Each stage typically conducts a handshaking exchange tured. Or a camera may capture plural Successive frames— with an adjoining stage—each time data is passed to or with different LEDs illuminating the subject. One frame may received from the adjoining stage. Such handshaking is rou be captured at a /250" second exposure with a corresponding tine to the artisan familiar with digital system design, so is not period of red-only illumination; a Subsequent frame may be belabored here. captured at a /100" second exposure with a corresponding The detailed arrangements contemplated a single image period of green-only illumination, etc. These frames may be sensor. However, in other embodiments, multiple image sen analyzed separately, or may be combined, e.g., for analysis in sors can be used. In addition to enabling conventional stereo the aggregate. 
Or a single frame of imagery may be captured 25 scopic processing, two or more image sensors enable or over an interval of /100" of a second, with the green LED enhance many other operations. activated for that entire interval, and the red LED activated for One function that benefits from multiple cameras is distin %50" of a second during that /100" second interval. The guishing objects. To cite a simple example, a single camera is instantaneous ambient illumination can be sensed (or pre unable to distinguish a human face from a picture of a face dicted, as above), and the component LED colored light 30 Sources can be operated in a responsive manner (e.g., to (e.g., as may be found in a magazine, on a billboard, or on an counteract orangeness of tungsten illumination by adding electronic display screen). With spaced-apart sensors, in con blue illumination from a blue LED). trast, the 3D aspect of the picture can readily be discerned, Other Notes: Projectors allowing a picture to be distinguished from a person. (De While a packet-based, data driven architecture is shown in 35 pending on the implementation, it may be the 3D aspect of the FIG. 16, a variety of other implementations are of course person that is actually discerned.) possible. Such alternative architectures are straightforward to Another function that benefits from multiple cameras is the artisan, based on the details given. refinement of geolocation. From differences between two The artisan will appreciate that the arrangements and images, a processor can determine the device's distance from details noted above are arbitrary. Actual choices of arrange 40 landmarks whose location may be precisely known. This ment and detail will depend on the particular application allows refinement of other geolocation data available to the being served, and most likely will be different than those device (e.g., by WiFi node identification, GPS, etc.) noted. 
(To cite but a trivial example, FFTs need not be per Just as a cell phone may have one, two (or more) sensors, formed on 16x16 blocks, but can be done on 64x64,256x256, Such a device may also have one, two (or more) projectors. the whole image, etc.) 45 Individual projectors are being deployed in cell phones by Similarly, it will be recognized that the body of a packet can CKing (the N70 model, distributed by ChinaVision) and convey an entire frame of data, or just excerpts (e.g., a 128x Samsung (the MPB200). LG and others have shown proto 128 block). Image data from a single captured frame may thus types. (These projectors are understood to use Texas Instru span a series of several packets. Different excerpts within a ments electronically-steerable digital micro-mirror arrays, in common frame may be processed differently, depending on 50 conjunction with LED or laser illumination.) Microvision the packet with which they are conveyed. offers the PicoP Display Engine, which can be integrated into Moreover, a processing stage 38 may be instructed to break a variety of devices to yield projector capability, using a a packet into multiple packets—such as by splitting image micro-electro-mechanical scanning mirror (in conjunction data into 16 tiled Smaller Sub-images. Thus, more packets with laser sources and an optical combiner). Other suitable may be present at the end of the system than were produced at 55 projection technologies include 3M's liquid crystal on silicon the beginning. In like fashion, a single packet may contain a (LCOS) and Displaytech's ferroelectric LCOS systems. collection of data from a series of different images (e.g., Use of two projectors, or two cameras, gives differentials images taken sequentially—with different focus, aperture, or of projection or viewing, providing additional information shutter settings; a particular example is a set of focus regions about the subject. 
In addition to stereo features, it also enables from five images taken with focus bracketing, or depth offield 60 regional image correction. For example, consider two cam bracketing—overlapping, abutting, or disjoint.) This set of eras imaging a digitally watermarked object. One camera's data may then be processed by later stages—either as a set, or view of the object gives one measure of a transform that can through a process that selects one or more excerpts of the be discerned from the object's Surface (e.g., by encoded cali packet payload that meet specified criteria (e.g., a focus bration signals). This information can be used to correct a sharpness metric). 65 view of the object by the other camera. And vice versa. The In the particular example detailed, each processing stage two cameras can iterate, yielding a comprehensive character 38 generally substituted the result of its processing for the ization of the object surface. (One camera may view a better US 8,768,313 B2 49 50 illuminated region of the Surface, or see Some edges that the Part of the illumination is routed to the camera sensor 12. The other camera can’t see. One view may thus reveal information other part of the optical path goes to a micro-mirror projector that the other does not.) system 86. If a reference pattern (e.g., a grid) is projected onto a Lenses used in cell phone projectors typically are larger surface, the shape of the surface is revealed by distortions of 5 aperture than those used for cellphone cameras, so the cam the pattern. The FIG. 16 architecture can be expanded to era may gain significant performance advantages (e.g., include a projector, which projects a pattern onto an object, enabling shorter exposures) by use of such a shared lens. Or, for capture by the camera system. (Operation of the projector reciprocally, the beam splitter 84 can be asymmetrical—not can be synchronized with operation of the camera, e.g., by equally favoring both optical paths. 
For example, the beam 10 splitter can be a partially-silvered element that couples a control processor module 36 with the projector activated smaller fraction (e.g., 2%, 8%, or 25%) of externally incident only as necessary, since it imposes a significant battery drain.) light to the sensor path 83. The beam-splitter may thus serve Processing of the resulting image by modules 38 (local or to couple a larger fraction (e.g., 98%, 92%, or 75%) of illu remote) provides information about the Surface topology of mination from the micro-mirror projector externally, for pro the object. This 3D topology information can be used as a clue 15 jection. By this arrangement the camera sensor 12 receives in identifying the object. light of a conventional—for a cell phone camera—intensity In addition to providing information about the 3D configu (notwithstanding the larger aperture lens), while the light ration of an object, shape information allows a Surface to be output from the projector is only slightly dimmed by the lens virtually re-mapped to any other configuration, e.g., flat. Such sharing arrangement. remapping serves as a sort of normalization operation. In another arrangement, a camera head is separate—or In one particular arrangement, system 30 operates a pro detachable from the cellphone body. The cellphone body is jector to project a reference pattern into the camera's field of carried in a user's pocket or purse, while the camera head is view. While the pattern is being projected, the camera cap adapted for looking out over a user's pocket (e.g., in a form tures a frame of image data. The resulting image is processed factor akin to a pen, with a pocket clip, and with a battery in to detect the reference pattern, and therefrom characterize the 25 the pen barrel). The two communicate by Bluetooth or other 3D shape of an imaged object. Subsequent processing then wireless arrangement, with capture instructions sent from the follows, based on the 3D shape data. 
phone body, and image data sent from the camera head. Such (In connection with Such arrangements, the reader is configuration allows the camera to constantly Survey the referred to the Google book-scanning patent, U.S. Pat. No. scene in front of the user without requiring that the cell 7,508,978, which employs related principles. That patent 30 phone be removed from the user's pocket/purse. In a related arrangement, a strobe light for the camera is details a particularly useful reference pattern, among other separate—or detachable from the cell phone body. The relevant disclosures.) light (which may incorporate LEDs) can be placed near the If the projector uses collimated laser illumination (such as image Subject, providing illumination from a desired angle the PicoP Display Engine), the pattern will be in focus regard 35 and distance. The strobe can be fired by a wireless command less of distance to the object onto which the pattern is pro issued by the cell phone camera system. jected. This can be used as an aid to adjust focus of a cell (Those skilled in optical system design will recognize a phone camera onto an arbitrary Subject. Because the pro number of alternatives to the arrangements particularly jected pattern is known in advance by the camera, the cap noted.) tured image data can be processed to optimize detection of the 40 Some of the advantages that accrue from having two cam pattern—such as by correlation. (Or the pattern can be eras can be realized by having two projectors (with a single selected to facilitate detection—such as a checkerboard that camera). For example, the two projectors can project alter appears strongly at a single frequency in the image frequency nating or otherwise distinguishable patterns (e.g., simulta domain when properly focused.) Once the camera is adjusted neous, but of differing color, pattern, polarization, etc) into for optimum focus of the known, collimated pattern, the 45 the camera's field of view. 
By noting how the two patterns— projected pattern can be discontinued, and the camera can projected from different points—differ when presented on an then capture a properly focused image of the underlying object and viewed by the camera, stereoscopic information Subject onto which the pattern was projected. can again be discerned. Synchronous detection can also be employed. The pattern Many usage models are enabled through use a projector, may be projected during capture of one frame, and then off for 50 including new sharing models (c.f., Greaves, “View & Share: capture of the next. The two frames can then be subtracted. Exploring Co-Present Viewing and Sharing of Pictures using The common imagery in the two frames generally cancels— Personal Projection.” Mobile Interaction with the Real World leaving the projected patternata much higher signal to noise 2009). Such models employ the image created by the projec ratio. tor itself as a triggerto initiate a sharing session, either overtly A projected pattern can be used to determine correct focus 55 through a commonly understood symbol ("open sign), to for several subjects in the camera's field of view. A child may covert triggers that are machine readable. Sharing can also pose in front of the Grand Canyon. The laser-projected pat occur through ad hoc networks utilizing peer to peer applica tern allows the camera to focus on the child in a first frame, tions, or a server hosted application. and on the background in a second frame. These frames can Other output from mobile devices can be similarly shared. then be composited—taking from each the portion properly 60 Consider key vectors. One user's phone may process an image in focus. with Hough transform and other eigenface extraction tech If a lens arrangement is used in the cell phone's projector niques, and then share the resulting key vector of eigenface system, it can also be used for the cellphone's camera system. 
data with others in the user's social circle (either by pushing A mirror can be controllably moved to steer the camera or the same to them, or allowing them to pull it). One or more of projector to the lens. Or a beam-splitter arrangement 80 can 65 these socially-affiliated devices may then perform facial tem be used (FIG. 20). Here the body of a cell phone 81 incorpo plate matching that yields an identification of a formerly rates a lens 82, which provides a light to a beam-splitter 84. unrecognized face in the imagery captured by the original US 8,768,313 B2 51 52 user. Such arrangement takes a personal experience, and be redefined on the fly, as needed (e.g., an object may perform makes it a public experience. Moreover, the experience can SIFT processing in one state; FFT processing in another state; become a viral experience, with the key vector data shared— log-polar processing in a further state, etc.). essentially without bounds—to a great number of further (While many grid arrangements of logic devices are based USCS. 5 on “nearest neighbor’ interconnects, additional flexibility Selected Other Arrangements can be achieved by use of a “partial crossbar interconnect. In addition to the arrangements earlier detailed, another See, e.g., U.S. Pat. No. 5,448,496 (Quickturn Design Sys hardware arrangement suitable for use with the present tech tems).) nology uses the Mali-400 ARM graphics multiprocessor Also in the realm of hardware, certain embodiments of the architecture, which includes plural fragment processors that 10 present technology employ “extended depth offield imaging can be devoted to the different types of image processing systems (see, e.g., U.S. Pat. Nos. 7.218.448, 7,031,054 and tasks referenced in this document. 5,748,371). 
Such extended depth of field arrangements include a mask in the imaging path that modifies the optical transfer function of the system so as to be insensitive to the distance between the object and the imaging system. The image quality is then uniformly poor over the depth of field. Digital post-processing of the image compensates for the mask modifications—restoring image quality, but retaining the increased depth of field. Using such technology, the cell phone camera can capture imagery having both nearer and further subjects all in focus (i.e., with greater high frequency detail), without requiring longer exposures, as would normally be required. (Longer exposures exacerbate problems such as hand-jitter and moving subjects.) In the arrangements detailed here, shorter exposures allow higher quality imagery to be provided to image processing functions without enduring the temporal delay created by optical/mechanical focusing elements, or requiring input from the user as to which elements of the image should be in focus. This provides for a much more intuitive experience, as the user can simply point the imaging device at the desired target without worrying about focus or depth of field settings. Similarly, the image processing functions are able to leverage all the pixels included in the captured image/frame, as all are expected to be in focus. In addition, new metadata regarding identified objects, or groupings of pixels related to depth within the frame, can produce simple "depth map" information—setting the stage for 3D video capture and storage of video streams using emerging standards on transmission of depth information.

The standards group Khronos has issued OpenGL ES2.0, which defines hundreds of standardized graphics function calls for systems that include multiple CPUs and multiple GPUs (a direction in which cell phones are increasingly migrating). OpenGL ES2.0 attends to routing of different operations to different of the processing units—with such details being transparent to the application software. It thus provides a consistent software API usable with all manner of GPU/CPU hardware.

In accordance with another aspect of the present technology, the OpenGL ES2.0 standard is extended to provide a standardized graphics processing library not just across different CPU/GPU hardware, but also across different cloud processing hardware—again with such details being transparent to the calling software.

Increasingly, Java service requests (JSRs) have been defined to standardize certain Java-implemented tasks. JSRs increasingly are designed for efficient implementation on top of OpenGL ES2.0-class hardware.

In accordance with a still further aspect of the present technology, some or all of the image processing operations noted in this specification (facial recognition, SIFT processing, watermark detection, histogram processing, etc.) can be implemented as JSRs—providing standardized implementations that are suitable across diverse hardware platforms.

In addition to supporting cloud-based JSRs, the extended standards specification can also support the Query Router and Response Manager functionality detailed earlier—including both static and auction-based service providers.

Akin to OpenGL is OpenCV—a computer vision library available under an open source license, permitting coders to invoke a variety of functions without regard to the particular hardware that is utilized to perform same.

In some embodiments the cell phone may have the capability to perform a given operation locally, but may decide instead to have it performed by a cloud resource. The decision of whether to process locally or remotely can be based on "costs"—including bandwidth costs, external service provider costs, etc.
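One way such a local-versus-cloud decision might be sketched is as a small stored-rules cost function. This is a hypothetical illustration, not the patent's implementation; the weights, thresholds and parameter names below are all assumptions.

```python
# Minimal sketch of stored rules that weigh "costs" to choose among
# processing locally, processing remotely, or deferring. The factor
# weights and variable names are illustrative assumptions.

def choose_processing(battery_pct, signal_strength, data_mb, provider_fee):
    costs = {
        # Transmitting at low signal strength drains the battery quickly.
        "remote": data_mb * (2.0 if signal_strength < 0.3 else 0.5)
                  + provider_fee,
        # Local processing costs CPU energy; worse when the battery is low.
        "local": data_mb * (3.0 if battery_pct < 20 else 1.0),
        # Deferring is cheap in energy but carries a dissatisfaction penalty.
        "defer": 4.0,
    }
    return min(costs, key=costs.get)

# Low battery, weak signal, a large image: deferring wins.
print(choose_processing(battery_pct=15, signal_strength=0.1, data_mb=8,
                        provider_fee=1.0))
```

A production version would, as the text suggests, draw these weights from a rules store so outcomes can vary with the states of the relevant variables.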
Intangible costs, such as consumer (dis-)satisfaction caused by delaying processing, can also be weighed. For example, if the user is running low on battery power, and is at a location far from a cell tower (so that the cell phone runs its RF amplifier at maximum output when transmitting), then sending a large block of data for remote processing may consume a significant fraction of the battery's remaining life. In such case, the phone may decide to process the data locally, or to forward it for remote processing when the phone is closer to the cell site or the battery has been recharged. A set of stored rules can be applied to the relevant variables to establish a net "cost function" for different approaches (e.g., process locally, process remotely, defer processing), and these rules may indicate different outcomes depending on the states of these variables.

(An O'Reilly book, Learning OpenCV, documents the OpenCV library extensively.) A counterpart, NokiaCV, provides similar functionality specialized for the Symbian operating system (e.g., Nokia cell phones).

OpenCV provides support for a large variety of operations, including high-level tasks such as facial recognition, gesture recognition, motion tracking/understanding, segmentation, etc., as well as an extensive assortment of more atomic, elemental vision/image processing operations.

CMVision is another package of computer vision tools that can be employed in embodiments of the present technology—this package compiled by researchers at Carnegie Mellon University.

Still another hardware architecture makes use of a field programmable object array (FPOA) arrangement, in which hundreds of diverse 16-bit "objects" are arrayed in a gridded node fashion, with each being able to exchange data with neighboring devices through very high bandwidth channels. (The PicoChip devices referenced earlier are of this class.) The functionality of each can be reprogrammed, as with FPGAs. Again, different of the image processing tasks can be performed by different of the FPOA objects, and these tasks can be redefined on the fly.

An appealing "cloud" resource is the processing capability found at the edges of wireless networks. Cellular networks, for example, include tower stations that are, in large part, software-defined radios—employing processors to perform, digitally, some or all of the operations traditionally performed by analog transmitting and receiving radio circuits, such as mixers, filters, demodulators, etc. Even smaller cell stations, so-called "femtocells," typically have powerful signal processing hardware for such purposes. The PicoChip processors noted earlier, and other field programmable object arrays, are widely deployed in such applications.

Radio signal processing and image signal processing have many commonalities, e.g., employing FFT processing to convert sampled data to the frequency domain, applying various filtering operations, etc. Cell station equipment, including processors, is designed to meet peak consumer demands. This means that significant processing capability is often left unused.

Knowing of an anticipated demand—e.g., one commencing in 200 milliseconds—allows a cell site to take that demand into account as it serves other users. Consider a cell site having an excess (allocable) channel capacity of 15 Mbit/s, which normally allocates to a new video service user a channel of 10 Mbit/s. If the site knows that a cell camera user has requested a reservation for an 8 Mbit/s channel starting in 200 milliseconds, and a new video service user meanwhile requests service, the site may allocate the new video service user a channel of 7 Mbit/s, rather than the usual 10 Mbit/s.
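The allocation arithmetic in this example can be sketched as follows. The function and field names are illustrative assumptions; only the numbers (15, 8 and 10 Mbit/s) come from the example in the text.

```python
# Sketch of reservation-aware channel allocation: a new user is granted
# the normal channel size, reduced by whatever capacity is already
# reserved to start shortly. Names are illustrative, not the patent's.

def allocate_new_channel(excess_mbit, reserved_mbit, normal_grant=10):
    """Grant as much as normal, less what pending reservations require."""
    available = excess_mbit - reserved_mbit
    return min(normal_grant, available)

# 15 Mbit/s spare, 8 Mbit/s reserved to start in 200 ms: the new video
# user gets 7 Mbit/s up front, so no mid-session cutback is needed later.
print(allocate_new_channel(excess_mbit=15, reserved_mbit=8))  # 7
```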
By initially setting up the new video service user's channel at the slower bit rate, service impairments associated with cutting back bandwidth during an ongoing channel session are avoided. The capacity of the cell site is the same, but it is now allocated in a manner that reduces the need for reducing the bandwidth of existing channels, mid-transmission.

In another situation, the cell site may determine that it has excess capacity at present, but expects to be more heavily burdened in a half second. In this case it may use the present excess capacity to speed throughput to one or more video subscribers, e.g., those for whom it has collected several packets of video data in a buffer memory, ready for delivery. These video packets may be sent through the enlarged channel now, in anticipation that the video channel will be slowed in a half second. Again, this is practical because the cell site has useful information about future bandwidth demands.

The service reservation message sent from the cell phone may also include a priority indicator. This indicator can be used by the cell site to determine the relative importance of meeting the request on the stated terms, in case arbitration between conflicting service demands is required.

Such anticipatory service requests from cell phones can also allow the cell site to provide higher quality sustained service than would normally be allocated.

Cell sites are understood to employ statistical models of usage patterns, and to allocate bandwidth accordingly. The allocations are typically set conservatively, in anticipation of realistic worst-case usage scenarios, e.g., encompassing scenarios that occur 99.99% of the time. (Some theoretically possible scenarios are sufficiently improbable that they may be disregarded in bandwidth allocations. However, on the rare occasions when such improbable scenarios occur—as when thousands of subscribers sent cell phone picture messages from Washington D.C. during the Obama inauguration—some subscribers may simply not receive service.)

The statistical models on which site bandwidth allocations are based are understood to treat subscribers—in part—as unpredictable actors. Whether a particular subscriber requests service in the forthcoming seconds (and what particular service is requested) has a random aspect.

The larger the randomness in a statistical model, the larger the extremes tend to be. If reservations, or forecasts of future demands, are routinely submitted by, e.g., 15% of subscribers, then the behavior of those subscribers is no longer random. The worst-case peak bandwidth demand on a cell site does not involve 100% of the subscribers acting randomly, but only 85%. Actual reservation information can be employed for the other 15%. Hypothetical extremes in peak bandwidth usage are thus moderated.

In accordance with another aspect of the present technology, the spare radio signal processing capability at cellular tower stations (and other edges of wireless networks) is repurposed in connection with image (and/or audio or other) signal processing for consumer wireless devices. Since an FFT operation is the same whether processing sampled radio signals or image pixels, the repurposing is often straightforward: configuration data for the hardware processing cores needn't be changed much, if at all. And because 3G/4G networks are so fast, a processing task can be delegated quickly from a consumer device to a cell station processor, and the results returned with similar speed. In addition to the speed and computational muscle that such repurposing of cell station processors affords, another benefit is reducing the power consumption of the consumer devices.

Before sending image data for processing, a cell phone can quickly inquire of the cell tower station with which it is communicating, to confirm that it has unused capacity sufficient to undertake the intended image processing operation. This query can be sent by the packager/router of FIG. 10; the local/remote router of FIG. 10A; the query router and response manager of FIG. 7; the pipe manager 51 of FIG. 16; etc.

Alerting the cell tower/base station of forthcoming processing requests, and/or bandwidth requirements, allows the cell site to better allocate its processing and bandwidth resources in anticipation of meeting such needs.

Cell sites are at risk of becoming bottlenecked: undertaking service operations that exhaust their processing or bandwidth capacity. When this occurs, they must triage—unexpectedly throttling back the processing/bandwidth provided to one or more users, so others can be served. This sudden change in service is undesirable, since changing the parameters with which the channel was originally established (e.g., the bit rate at which video can be delivered) forces data services using the channel to reconfigure their respective parameters (e.g., requiring ESPN to provide a lower quality video feed). Renegotiating such details once the channel and services have been originally set up invariably causes glitches, e.g., video delivery stuttering, dropped syllables in phone calls, etc.

To avoid the need for these unpredictable bandwidth slowdowns, and resulting service impairments, cell sites tend to adopt a conservative strategy—allocating bandwidth/processing resources parsimoniously, in order to reserve capacity for possible peak demands. But this approach impairs the quality of service that might otherwise normally be provided—sacrificing typical service in anticipation of the unexpected.
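The moderating effect of reservations on worst-case planning, discussed above, can be illustrated with a small calculation. The numbers and names here are illustrative assumptions, not figures from the specification.

```python
# Sketch of how reservations shrink worst-case demand estimates: if 15%
# of subscribers declare their demand in advance, only the remaining 85%
# must be provisioned as unpredictable actors.

def worst_case_demand(n_subscribers, per_user_mbit, reserved_fraction,
                      reserved_total_mbit):
    random_users = n_subscribers * (1.0 - reserved_fraction)
    # Random actors are provisioned at the worst case; reserved ones at
    # exactly what they asked for.
    return random_users * per_user_mbit + reserved_total_mbit

all_random = worst_case_demand(1000, 1.0, 0.0, 0.0)
with_reservations = worst_case_demand(1000, 1.0, 0.15, 60.0)
print(all_random, with_reservations)  # the reserved case plans for less
```

With a lower planning peak, the site can hold smaller reserves, and therefore grant more generous present allocations, which is the point developed next.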
In accordance with this aspect of the present technology, a cell phone sends alerts to the cell tower station, specifying bandwidth or processing needs that it anticipates will be forthcoming. In effect, the cell phone asks to reserve a bit of future service capacity. The tower station still has a fixed capacity; however, it can now plan around known demands—e.g., that a particular user will need a bandwidth of 8 Mbit/s for 3 seconds, commencing in 200 milliseconds.

With lower peak usage scenarios, more generous allocations of present bandwidth can be granted to all subscribers. That is, if a portion of the user base sends alerts to the site reserving future capacity, then the site may predict that the realistic peak demand that may be forthcoming will still leave the site with unused capacity. In this case it may grant a camera cell phone user a 12 Mbit/s channel instead of the 8 Mbit/s channel stated in the reservation request, and/or may grant a video user a 15 Mbit/s channel instead of the normal 10 Mbit/s channel. Such usage forecasting can thus allow the site to grant higher quality services than would normally be the case, since bandwidth reserves need be held for a lesser number of unpredictable actors.

Anticipatory service requests can also be communicated from the cell phone (or the cell site) to other cloud processes that are expected to be involved in the requested services, allowing them to similarly allocate their resources anticipatorily. Such anticipatory service requests may also serve to alert the cloud process to pre-warm associated processing. Additional information may be provided from the cell phone, or elsewhere, for this purpose—such as encryption keys, or image dimensions (e.g., to configure a cloud FPOA to serve as an FFT processor for a 1024x768 image, to be processed in 16x16 tiles, and output coefficients for 32 spectral frequency bands), etc.

In turn, the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly. For example, the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that the data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering). Knowing, in advance, that the cloud process may request such further data can allow the cell phone to consider this possibility in its own operation, e.g., keeping processing modules configured in a certain filter manner longer than may otherwise be the case, reserving an interval of sensor time to possibly capture a replacement image, etc.

Anticipatory service requests (or the possibility of conditional service requests) generally relate to events that may commence in a few tens or hundreds of milliseconds—occasionally in a few single seconds. Situations in which the action will commence tens or hundreds of seconds in the future will be rare. However, while the period of advance warning may be short, significant advantages can be derived: if the randomness of the next second is reduced—each second—then system randomness can be reduced considerably. Moreover, the events to which the requests relate can, themselves, be of longer duration—such as transmission of a large image file, which may take ten seconds or more.

Regarding advance set-up (pre-warming), desirably any operation that takes more than a threshold interval of time to complete (e.g., a few hundred microseconds, a millisecond, ten microseconds, etc.—depending on implementation) should be prepped anticipatorily, if possible. (In some instances, of course, the anticipated service is never requested, in which case such preparation may be for naught.)

In another hardware arrangement, the cell phone processor may selectively activate a Peltier device or other thermoelectric cooler coupled to the image sensor, in circumstances when thermal image noise (Johnson noise) is a potential problem. For example, if a cell phone detects a low light condition, it may activate a cooler on the sensor to try and enhance the image signal-to-noise ratio. Or the image processing stages can examine captured imagery for artifacts associated with thermal noise, and if such artifacts exceed a threshold, then the cooling device can be activated. (One approach captures a patch of imagery, such as a 16x16 pixel region, twice in quick succession. Absent random factors, the two patches should be identical—perfectly correlated. The variance of the correlation from 1.0 is a measure of noise—presumably thermal noise.) A short interval after the cooling device is activated, a substitute image can be captured—the interval depending on the thermal response time of the cooler/sensor. Likewise, if cell phone video is captured, a cooler may be activated, since the increased switching activity by circuitry on the sensor increases its temperature, and thus its thermal noise. (Whether to activate a cooler can also be application dependent, e.g., the cooler may be activated when capturing imagery from which watermark data may be read, but not activated when capturing imagery from which barcode data may be read.)

As noted, packets in the FIG. 16 arrangement can convey a variety of instructions and data in both the header and the packet body. In a further arrangement a packet can additionally, or alternatively, contain a pointer to a cloud object, or to a record in a database. The cloud object/database record may contain information such as object properties, useful for object recognition (e.g., fingerprint or watermark properties for a particular object).

If the system has read a watermark, the packet may contain the watermark payload, and the header (or body) may contain one or more database references where that payload can be associated with related information. A watermark payload read from a business card may be looked up in one database; a watermark decoded from a photograph may be looked up in another database; etc. A system may apply multiple different watermark decoding algorithms to a single image (e.g., MediaSec, Digimarc ImageBridge, Civolution, etc.). Depending on which application performed a particular decoding operation, the resulting watermark payload may be sent off to a corresponding destination database. (Likewise with different barcodes, fingerprint algorithms, eigenface technologies, etc.) The destination database address can be included in the application, or in configuration data. (Commonly, the addressing is performed indirectly, with an intermediate data store containing the address of the ultimate database—permitting relocation of the database without changing each cell phone application.)

The system may perform an FFT on captured image data to obtain frequency domain information, and then feed that information to several watermark decoders operating in parallel—each applying a different decoding algorithm. When one of the applications extracts valid watermark data (e.g., indicated by ECC information computed from the payload), the data is sent to a database corresponding to that format/technology of watermark. Plural such database pointers can be included in a packet, and used conditionally—depending on which watermark decoding operation (or barcode reading operation, or fingerprint calculation, etc.) yields useful data.

Similarly, the system may send a facial image to an intermediary cloud service, in a packet containing an identifier of the user (but not containing the user's Apple iPhoto, or Picasa, or Facebook user name).
The intermediary cloud service can take the provided user identifier, and use it to access a database record from which the user's names on these other services are obtained. The intermediary cloud service can then route the facial image data to Apple's server with the user's iPhoto user name; to Picasa's service with the user's Google user name; and to Facebook's server with the user's Facebook user name. Those respective services can then perform facial recognition on the imagery, and return the names of persons identified from the user's iPhoto/Picasa/Facebook accounts (directly to the user, or through the intermediary service). The intermediary cloud service—which may serve large numbers of users—can keep informed of the current addresses for relevant servers (and alternate proximate servers, in case the user is away from home), rather than have each cell phone try to keep such data updated.

Facial recognition applications can be used not just to identify persons, but also to identify relationships between individuals depicted in imagery. For example, data maintained by iPhoto/Picasa/Facebook may contain not just facial recognition features, and associated names, but also terms indicating relationships between the named faces and the account owner (e.g., father, boyfriend, sibling, pet, roommate, etc.). Thus, instead of simply searching a user's image collection for, e.g., all pictures of "David Smith," the user's collection may also be searched for all pictures depicting "sibling."

The application software in which photos are reviewed can present differently colored frames around different recognized faces—in accordance with associated relationship data (e.g., blue for siblings, red for boyfriends, etc.).

In some arrangements, the user's system can access such information stored in accounts maintained by the user's network "friends." A face that may not be recognized by facial recognition data associated with the user's account at Picasa may be recognized by consulting Picasa facial recognition data associated with the account of the user's friend "David Smith." Relationship data indicated by David Smith's account can be similarly used to present, and organize, the user's photos. The earlier-unrecognized face may thus be labeled with indicia indicating the person is David Smith's roommate. This essentially remaps the relationship information (e.g., mapping "roommate," as indicated in David Smith's account, to "David Smith's roommate" in the user's account).

The embodiments detailed above were generally described in the context of a single network. However, plural networks may commonly be available to a user's phone (e.g., WiFi, Bluetooth, possibly different cellular networks, etc.). The user may choose between these alternatives, or the system may apply stored rules to automatically do so. In some instances, a service request may be issued (or results returned) across several networks in parallel.

Reference Platform Architecture

The hardware in cell phones was originally introduced for specific purposes. The microphone, for example, was used only for voice transmission over the cellular network—feeding an A/D converter that fed a modulator in the phone's radio transceiver. The camera was used only to capture snapshots. Etc.

As additional applications arose employing such hardware, each application needed to develop its own way to talk to the hardware. Diverse software stacks arose—each specialized so a particular application could interact with a particular piece of hardware. This poses an impediment to application development.

This problem compounds when cloud services and/or specialized processors are added to the mix.

To alleviate such difficulties, embodiments of the present technology can employ an intermediate software layer that provides a standard interface with which, and through which, hardware and software can interact. Such an arrangement is shown in FIG. 20A, with the intermediate software layer being labeled "Reference Platform."

In this diagram hardware elements are shown in dashed boxes, including processing hardware on the bottom, and peripherals on the left. The box "IC HW" is "intuitive computing hardware," and comprises the earlier-discussed hardware that supports the different processing of image-related data, such as modules 38 in FIG. 16, the configurable hardware of FIG. 6, etc. DSP is a general-purpose digital signal processor, which can be configured to perform specialized operations; CPU is the phone's primary processor; GPU is a graphics processor unit. OpenCL and OpenGL are APIs through which graphics processing services (performed on the CPU and/or GPU) can be invoked.

Different specialized technologies are in the middle, such as one or more digital watermark decoders (and/or encoders), barcode reading software, optical character recognition software, etc. Cloud services are shown on the right, and applications are on the top.

The reference platform establishes a standard interface through which different applications can interact with hardware, exchange information, and request services (e.g., by API calls). Similarly, the platform establishes a standard interface through which the different technologies can be accessed, and through which they can send and receive data to other of the system components. Likewise with the cloud services, for which the reference platform may also attend to details of identifying a service provider—whether by reverse auction, heuristics, etc. In cases where a service is available both from a technology in the cell phone, and from a remote service provider, the reference platform may also attend to weighing the costs and benefits of the different options, and deciding which should handle a particular service request.

By such arrangement, the different system components do not need to concern themselves with the details of other parts of the system. An application may call for the system to read text from an object in front of the cell phone. It needn't concern itself with the particular control parameters of the image sensor, nor the image format requirements of the OCR engine. An application may call for a read of the emotion of a person in front of the cell phone. A corresponding call is passed to whatever technology in the phone supports such functionality, and the results are returned in a standardized form. When an improved technology becomes available, it can be added to the phone, and through the reference platform the system takes advantage of its enhanced capabilities. Thus, growing/changing collections of sensors, and growing/evolving sets of service providers, can be set to the tasks of deriving meaning from input stimuli (audio as well as visual, e.g., speech recognition) through use of such an adaptable architecture.

Arasan Chip Systems, Inc. offers a Mobile Industry Processor Interface UniPro Software Stack—a layered, kernel-level stack that aims to simplify integration of certain technologies into cell phones. That arrangement may be extended to provide the functionality detailed above. (The Arasan protocol is focused primarily on transport layer issues, but involves layers down to hardware drivers as well. The Mobile Industry Processor Interface Alliance is a large industry group working to advance cell phone technologies.)

Leveraging Existing Image Collections, E.g., for Metadata

Collections of publicly-available imagery and other content are becoming more prevalent.
Flickr, YouTube, Photobucket (MySpace), Picasa, Zooomr, Facebook, Webshots and Google Images are just a few. Often, these resources can also serve as sources of metadata—either expressly identified as such, or inferred from data such as file names, descriptions, etc. Sometimes geo-location data is also available.

An illustrative embodiment according to one aspect of the present technology works as follows. A user captures a cell phone picture of an object, or scene—perhaps a desk telephone, as shown in FIG. 21. (The image may be acquired in other manners as well, such as transmitted from another user, or downloaded from a remote computer.)

As a preliminary operation, known image processing operations may be applied, e.g., to correct color or contrast, to perform ortho-normalization, etc., on the captured image. Known image object segmentation or classification techniques may also be used to identify an apparent subject region of the image, and isolate same for further processing.

The image data is then processed to determine characterizing features that are useful in pattern matching and recognition. Color, shape, and texture metrics are commonly used for this purpose. Images may also be grouped based on layout and eigenvectors (the latter being particularly popular for facial recognition). Many other technologies can of course be employed, as noted elsewhere in this specification.

(Uses of vector characterizations/classifications and other image/video/audio metrics in recognizing faces, imagery, video, audio and other patterns are well known and suited for use in connection with the present technology. See, e.g., patent publications 20060020630 and 20040243567 (Digimarc), 20070239756 and 20020037083 (Microsoft), 20070237364 (Fuji Photo Film), U.S. Pat. Nos. 7,359,889 and 6,990,453 (Shazam), publication 20050180635 (Corel), U.S. Pat. Nos. 6,430,306 and 6,681,032 and publication 20030059124 (L-1 Corp.), U.S. Pat. Nos. 7,194,752 and 7,174,293 (Iceberg), U.S. Pat. No. 7,130,466 (Cobion), U.S. Pat. No. 6,553,136 (Hewlett-Packard), and U.S. Pat. No. 6,430,307 (Matsushita), and the journal references cited at the end of this disclosure. When used in conjunction with recognition of entertainment content such as audio and video, such features are sometimes termed content "fingerprints" or "hashes.")

After feature metrics for the image are determined, a search is conducted through one or more publicly-accessible image repositories for images with similar metrics, thereby identifying apparently similar images. (As part of its image ingest process, Flickr and other such repositories may calculate eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification data on images at the time they are uploaded by users, and collect same in an index for public search.) The search may yield the collection of apparently similar telephone images found in Flickr, depicted in FIG. 22.

Metadata is then harvested from Flickr for each of these images, and the descriptive terms are parsed and ranked by frequency of occurrence. In the depicted set of images, for example, the descriptors harvested from such operation, and their incidence of occurrence, may be as follows:

Cisco (18)
Phone (10)
Telephone (7)
VOIP (7)
IP (5)
7941 (3)
Phones (3)
Technology (3)
7960 (2)
7920 (1)
7950 (1)
Best Buy (1)
Desk (1)
Ethernet (1)
IP-phone (1)
Office (1)
Pricey (1)
Sprint (1)
Telecommunications (1)
Uninett (1)
Work (1)

From this aggregated set of inferred metadata, it may be assumed that those terms with the highest count values (e.g., those terms occurring most frequently) are the terms that most accurately characterize the user's FIG. 21 image.

The inferred metadata can be augmented or enhanced, if desired, by known image recognition/classification techniques. Such technology seeks to provide automatic recognition of objects depicted in images. For example, by recognizing a TouchTone keypad layout, and a coiled cord, such a classifier may label the FIG. 21 image using the terms Telephone and Facsimile Machine.

If not already present in the inferred metadata, the terms returned by the image classifier can be added to the list and given a count value. (An arbitrary value, e.g., 2, may be used, or a value dependent on the classifier's reported confidence in the discerned identification can be employed.)

If the classifier yields one or more terms that are already present, the position of the term(s) in the list may be elevated. One way to elevate a term's position is by increasing its count value by a percentage (e.g., 30%). Another way is to increase its count value to one greater than the next-above term that is not discerned by the image classifier. (Since the classifier returned the term "Telephone" but not the term "Cisco," this latter approach could rank the term Telephone with a count value of "19"—one above Cisco.) A variety of other techniques for augmenting/enhancing the inferred metadata with that resulting from the image classifier are straightforward to implement.

A revised listing of metadata, resulting from the foregoing, may be as follows:

Telephone (19)
Cisco (18)
Phone (10)
VOIP (7)
IP (5)
7941 (3)
Phones (3)
Technology (3)
7960 (2)
Facsimile Machine (2)
7920 (1)
7950 (1)
Best Buy (1)
Desk (1)
Ethernet (1)
IP-phone (1)
Office (1)
Pricey (1)
Sprint (1)
Telecommunications (1)
Uninett (1)
Work (1)

The list of inferred metadata can be restricted to those terms that have the highest apparent reliability, e.g., count values.
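The harvesting, ranking, and classifier-elevation steps described above can be sketched as follows. This is an illustrative sketch: it implements the percentage-elevation variant (the 30% option), and the function names and default count are assumptions, not the patent's API.

```python
# Sketch of metadata inference: aggregate terms harvested from similar
# images, rank by frequency, then fold in image-classifier results by
# adding new terms (with an arbitrary count) or elevating existing ones.

from collections import Counter

def rank_terms(tag_lists):
    """Aggregate descriptive terms harvested from a set of similar images."""
    counts = Counter()
    for tags in tag_lists:
        counts.update(tags)
    return counts

def apply_classifier(counts, classifier_terms, default_count=2):
    """Elevate already-present classifier terms by 30%; add absent ones."""
    for term in classifier_terms:
        if term in counts:
            counts[term] = int(counts[term] * 1.3)
        else:
            counts[term] = default_count
    return counts.most_common()

harvested = Counter({"Cisco": 18, "Phone": 10, "Telephone": 7, "VOIP": 7})
print(apply_classifier(harvested, ["Telephone", "Facsimile Machine"]))
```

Note that the percentage method promotes Telephone from 7 to 9 here; the alternative one-above-the-next-term method described in the text would instead assign it 19, one above Cisco.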
A Subset of the list comprising, e.g., the top N terms, Technology (3) or the terms in the top Mth percentile of the ranked listing, 7960 (2) may be used. This subset can be associated with the FIG. 21 7920 (1) 55 image in a metadata repository for that image, as inferred 795.0 (1) Best Buy (1) metadata. Desk (1) In the present example, if N=4, the terms Telephone, Cisco, Ethernet (1) Phone and VoIP are associated with the FIG. 21 image. IP-phone (1) Once a list of metadata is assembled for the FIG. 21 image Office (1) Pricey (1) 60 (by the foregoing procedure, or others), a variety of opera Sprint (1) tions can be undertaken. Telecommunications (1) One option is to Submit the metadata, along with the cap Uninett (1) tured content or data derived from the captured content (e.g., Work (1) the FIG. 21 image, image feature data Such as eigenvectors, 65 color histograms, keypoint descriptors, FFTs, machine read From this aggregated set of inferred metadata, it may be able data decoded from the image, etc), to a service provider assumed that those terms with the highest count values (e.g., that acts on the Submitted data, and provides a response to the US 8,768,313 B2 61 62 user. Shazam, Snapnow (now LinkMe Mobile), ClusterMe image data in the database being searched). Eitherin response dia Labs, Snaptell (now part of Amazon's A9 search service), to such a failure, or proactively, data from similar images Mobot, Mobile Acuity, Nokia Point & Find, Kooaba, idée identified from Flickr may be submitted to the service pro TinEye, iVisits SeeScan, Evolution Robotics' ViPR, IQ vider as alternatives—hoping they might work better. 
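The count-merging procedure just described can be sketched in a few lines. This is an illustrative sketch, not code from the patent; the function name is hypothetical, it implements the "one above the next-ranked term" option, and the counts mirror the FIG. 21 example.

```python
# Illustrative sketch of folding image-classifier terms into the inferred
# metadata list (function name and structure are hypothetical; counts
# mirror the FIG. 21 example).

def merge_classifier_terms(counts, classifier_terms, default_count=2):
    """Add classifier-returned terms to a {term: count} list; elevate terms
    already present to one above the best term the classifier did not return."""
    merged = dict(counts)
    for term in classifier_terms:
        if term not in merged:
            merged[term] = default_count        # new term gets an arbitrary count
    for term in classifier_terms:
        # Counts of terms the classifier did NOT discern (e.g., Cisco's 18).
        others = [v for t, v in merged.items() if t not in classifier_terms]
        if term in counts and others and merged[term] <= max(others):
            merged[term] = max(others) + 1      # e.g., Telephone: 7 -> 19
    return sorted(merged.items(), key=lambda kv: -kv[1])

harvested = {"Cisco": 18, "Phone": 10, "Telephone": 7, "VOIP": 7, "IP": 5}
revised = merge_classifier_terms(harvested, ["Telephone", "Facsimile Machine"])
# revised begins [("Telephone", 19), ("Cisco", 18), ("Phone", 10), ...]
```

With the FIG. 21 counts, Telephone is elevated from 7 to 19 (one above Cisco's 18), and Facsimile Machine enters the list with the arbitrary count of 2, reproducing the revised listing above.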
Shazam, SnapNow (now LinkMe Mobile), ClusterMedia Labs, SnapTell (now part of Amazon's A9 search service), Mobot, Mobile Acuity, Nokia Point & Find, Kooaba, Idée's TinEye, iVisit's SeeScan, Evolution Robotics' ViPR, IQ Engines' oMoby, and Digimarc Mobile are a few of several commercially available services that capture media content and provide a corresponding response; others are detailed in the earlier-cited patent publications. By accompanying the content data with the metadata, the service provider can make a more informed judgment as to how it should respond to the user's submission.

The service provider—or the user's device—can submit the metadata descriptors to one or more other services, e.g., a web search engine such as Google, to obtain a richer set of auxiliary information that may help better discern/infer/intuit an appropriate response desired by the user. Or the information obtained from Google (or other such database resource) can be used to augment/refine the response delivered by the service provider to the user. (In some cases, the metadata—possibly accompanied by the auxiliary information received from Google—can allow the service provider to produce an appropriate response to the user, without even requiring the image data.)

In some cases, one or more images obtained from Flickr may be substituted for the user's image. This may be done, for example, if a Flickr image appears to be of higher quality (using sharpness, illumination histogram, or other measures), and if the image metrics are sufficiently similar. (Similarity can be judged by a distance measure appropriate to the metrics being used. One embodiment checks whether the distance measure is below a threshold. If several alternate images pass this screen, then the closest image is used.) Or substitution may be used in other circumstances. The substituted image can then be used instead of (or in addition to) the captured image in the arrangements detailed herein.

In one such arrangement, the substitute image data is submitted to the service provider. In another, data for several substitute images are submitted. In another, the original image data—together with one or more alternative sets of image data—are submitted. In the latter two cases, the service provider can use the redundancy to help reduce the chance of error—assuring an appropriate response is provided to the user. (Or the service provider can treat each submitted set of image data individually, and provide plural responses to the user. The client software on the cell phone can then assess the different responses, and pick between them (e.g., by a voting arrangement), or combine the responses, to help provide the user an enhanced response.)

Instead of substitution, one or more related public image(s) may be composited or merged with the user's cell phone image. The resulting hybrid image can then be used in the different contexts detailed in this disclosure.

A still further option is to use apparently-similar images gleaned from Flickr to inform enhancement of the user's image. Examples include color correction/matching, contrast correction, glare reduction, removing foreground/background objects, etc. By such arrangement, for example, such a system may discern that the FIG. 21 image has foreground components (apparently Post-It notes) on the telephone that should be masked or disregarded. The user's image data can be enhanced accordingly, and the enhanced image data used thereafter.

Relatedly, the user's image may suffer some impediment, e.g., depicting its subject from an odd perspective, or with poor lighting, etc. This impediment may cause the user's image not to be recognized by the service provider (i.e., the image data submitted by the user does not seem to match any image data in the database being searched). Either in response to such a failure, or proactively, data from similar images identified from Flickr may be submitted to the service provider as alternatives—hoping they might work better.

Another approach—one that opens up many further possibilities—is to search Flickr for one or more images with similar image metrics, and collect metadata as described herein (e.g., Telephone, Cisco, Phone, VoIP). Flickr is then searched a second time, based on metadata. Plural images with similar metadata can thereby be identified. Data for these further images (including images with a variety of different perspectives, different lighting, etc.) can then be submitted to the service provider—notwithstanding that they may "look" different than the user's cell phone image.

When doing metadata-based searches, identity of metadata may not be required. For example, in the second search of Flickr just-referenced, four terms of metadata may have been associated with the user's image: Telephone, Cisco, Phone and VoIP. A match may be regarded as an instance in which a subset (e.g., three) of these terms is found.

Another approach is to rank matches based on the rankings of shared metadata terms. An image tagged with Telephone and Cisco would thus be ranked as a better match than an image tagged with Phone and VoIP. One adaptive way to rank a "match" is to sum the counts for the metadata descriptors for the user's image (e.g., 19+18+10+7=54), and then tally the count values for shared terms in a Flickr image (e.g., 35, if the Flickr image is tagged with Cisco, Phone and VoIP). The ratio can then be computed (35/54) and compared to a threshold (e.g., 60%). In this case, a "match" is found. A variety of other adaptive matching techniques can be devised by the artisan.

The above examples searched Flickr for images based on similarity of image metrics, and optionally on similarity of textual (semantic) metadata. Geolocation data (e.g., GPS tags) can also be used to get a metadata toe-hold.

If the user captures an arty, abstract shot of the Eiffel tower from amid the metalwork, or from another unusual vantage point (e.g., FIG. 29), it may not be recognized—from image metrics—as the Eiffel tower. But GPS info captured with the image identifies the location of the image subject. Public databases (including Flickr) can be employed to retrieve textual metadata based on GPS descriptors. Inputting GPS descriptors for the photograph yields the textual descriptors Paris and Eiffel.

Google Images, or another database, can be queried with the terms Eiffel and Paris to retrieve other, perhaps more conventional, images of the Eiffel tower. One or more of those images can be submitted to the service provider to drive its process. (Alternatively, the GPS information from the user's image can be used to search Flickr for images from the same location, yielding imagery of the Eiffel Tower that can be submitted to the service provider.)

Although GPS is gaining in camera-metadata deployment, most imagery presently in Flickr and other public databases is missing geolocation info. But GPS info can be automatically propagated across a collection of imagery that shares visible features (by image metrics such as eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification techniques), or that has a metadata match.

To illustrate, if the user takes a cell phone picture of a city fountain, and the image is tagged with GPS information, it can be submitted to a process that identifies matching Flickr/Google images of that fountain on a feature-recognition basis. To each of those images the process can add GPS information from the user's image.

A second level of searching can also be employed. From the set of fountain images identified from the first search based on similarity of appearance, metadata can be harvested and ranked, as above.
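The threshold-based metadata matching reviewed above—summing the user image's descriptor counts and comparing the shared-term ratio against a threshold—can be sketched as follows. The figures mirror the 35/54 example and the 60% threshold from the text; the function name is hypothetical.

```python
# Illustrative sketch of the adaptive match scoring described above
# (figures mirror the 35/54 example; the 60% threshold is from the text,
# the function name is hypothetical).

def metadata_match(user_terms, candidate_tags, threshold=0.60):
    """Return (is_match, ratio): sum the user's descriptor counts, tally the
    counts of terms the candidate image shares, compare the ratio to a threshold."""
    total = sum(user_terms.values())                        # 19+18+10+7 = 54
    shared = sum(c for t, c in user_terms.items() if t in candidate_tags)
    ratio = shared / total
    return ratio >= threshold, ratio

user = {"Telephone": 19, "Cisco": 18, "Phone": 10, "VoIP": 7}
ok, ratio = metadata_match(user, {"Cisco", "Phone", "VoIP"})
# shared counts = 18+10+7 = 35; 35/54 is about 0.65 >= 0.60, so ok is True
```

Because shared terms are weighted by their counts, an image tagged with the highly-ranked Telephone and Cisco scores better than one tagged with the lower-ranked Phone and VoIP, as the text describes.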
Flickr can then be searched a second time, for images having metadata that matches within a specified threshold (e.g., as reviewed above). To those images, too, GPS information from the user's image can be added.

Alternatively, or in addition, a first set of images in Flickr/Google similar to the user's image of the fountain can be identified not by pattern matching, but by GPS-matching (or both). Metadata can be harvested and ranked from these GPS-matched images. Flickr can be searched a second time for a second set of images with similar metadata. To this second set of images, GPS information from the user's image can be added.

Another approach to geolocating imagery is by searching Flickr for images having similar image characteristics (e.g., gist, eigenvectors, color histograms, keypoint descriptors, FFTs, etc.), and assessing geolocation data in the identified images to infer the probable location of the original image. See, e.g., Hays, et al., IM2GPS: Estimating Geographic Information from a Single Image, Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008. Techniques detailed in the Hays paper are suited for use in conjunction with the present technology (including use of probability functions in quantizing the uncertainty of inferential techniques).

When geolocation data is captured by the camera, it is highly reliable. Also generally reliable is metadata (location or otherwise) that is authored by the proprietor of the image. However, when metadata descriptors (geolocation or semantic) are inferred or estimated, or authored by a stranger to the image, uncertainty and other issues arise. Desirably, such intrinsic uncertainty should be memorialized in some fashion so that later users thereof (human or machine) can take this uncertainty into account.

One approach is to segregate uncertain metadata from device-authored or creator-authored metadata. For example, different data structures can be used. Or different tags can be used to distinguish such classes of information. Or each metadata descriptor can have its own sub-metadata, indicating the author, creation date, and source of the data. The author or source field of the sub-metadata may have a data string indicating that the descriptor was inferred, estimated, deduced, etc., or such information may be a separate sub-metadata tag.

Each uncertain descriptor may be given a confidence metric or rank. This data may be determined by the public, either expressly or inferentially. An example is the case when a user sees a Flickr picture she believes to be from Yellowstone, and adds a "Yellowstone" location tag, together with a "95% confidence" tag (her estimation of certainty about the contributed location metadata). She may add an alternate location metatag, indicating "Montana," together with a corresponding 50% confidence tag. (The confidence tags needn't sum to 100%. Just one tag can be contributed, with a confidence less than 100%. Or several tags can be contributed—possibly overlapping, as in the case with Yellowstone and Montana.)

If several users contribute metadata of the same type to an image (e.g., location metadata), the combined contributions can be assessed to generate aggregate information. Such information may indicate, for example, that 5 of 6 users who contributed metadata tagged the image as Yellowstone, with an average 93% confidence; that 1 of 6 users tagged the image as Montana, with a 50% confidence; and 2 of 6 users tagged the image as Glacier National Park, with a 15% confidence; etc.

Inferential determination of metadata reliability can be performed either when express estimates made by contributors are not available, or routinely. An example of this is the FIG. 21 photo case, in which metadata occurrence counts are used to judge the relative merit of each item of metadata (e.g., Telephone=19 or 7, depending on the methodology used). Similar methods can be used to rank reliability when several metadata contributors offer descriptors for a given image.

Crowd-sourcing techniques are known to parcel image identification tasks to online workers, and collect the results. However, prior art arrangements are understood to seek simple, short-term consensus on identification. Better, it seems, is to quantify the diversity of opinion collected about image contents (and optionally its variation over time, and information about the sources relied-on), and use that richer data to enable automated systems to make more nuanced decisions about imagery, its value, its relevance, its use, etc.

To illustrate, known crowd-sourcing image identification techniques may identify the FIG. 35 image with the identifiers "soccer ball" and "dog." These are the consensus terms from one or several viewers. Disregarded, however, may be information about the long tail of alternative descriptors, e.g., summer, Labrador, football, tongue, afternoon, evening, morning, fescue, etc. Also disregarded may be demographic and other information about the persons (or processes) that served as metadata identifiers, or the circumstances of their assessments. A richer set of metadata may associate with each descriptor a set of sub-metadata detailing this further information.

The sub-metadata may indicate, for example, that the tag "football" was contributed by a 21 year old male in Brazil on Jun. 18, 2008. It may further indicate that the tags "afternoon," "evening" and "morning" were contributed by an automated image classifier at the University of Texas that made these judgments on Jul. 2, 2008, based, e.g., on the angle of illumination on the subjects. Those three descriptors may also have associated probabilities assigned by the classifier, e.g., 50% for afternoon, 30% for evening, and 20% for morning (each of these percentages may be stored as a sub-metatag). One or more of the metadata terms contributed by the classifier may have a further sub-tag pointing to an on-line glossary that aids in understanding the assigned terms. For example, such a sub-tag may give the URL of a computer resource that associates the term "afternoon" with a definition, or synonyms, indicating that the term means noon to 7 pm. The glossary may further indicate a probability density function, indicating that the mean time meant by "afternoon" is 3:30 pm, the median time is 4:15 pm, and the term has a Gaussian function of meaning spanning the noon-to-7 pm interval.

Expertise of the metadata contributors may also be reflected in sub-metadata. The term "fescue" may have sub-metadata indicating it was contributed by a 45 year old grass seed farmer in Oregon. An automated system can conclude that this metadata term was contributed by a person having unusual expertise in a relevant knowledge domain, and may therefore treat the descriptor as highly reliable (albeit maybe not highly relevant). This reliability determination can be added to the metadata collection, so that other reviewers of the metadata can benefit from the automated system's assessment.

Assessment of the contributor's expertise can also be self-made by the contributor. Or it can be made otherwise, e.g., by reputational rankings using collected third party assessments of the contributor's metadata contributions. (Such reputational rankings are known, e.g., from public assessments of sellers on EBay, and of book reviewers on Amazon.) Assessments may be field-specific, so a person may be judged (or self-judged) to be knowledgeable about grass types, but not about dog breeds. Again, all such information is desirably memorialized in sub-metatags (including sub-sub-metatags, when the information is about a sub-metatag).
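The aggregation of several users' contributed tags, per the Yellowstone example above, can be sketched as follows. The data layout and function name are hypothetical; the confidences echo the "5 of 6 users, average 93%" figures in the text.

```python
# A sketch of aggregating several users' contributed location tags, per the
# Yellowstone example above (data layout and function name are hypothetical).
from collections import defaultdict

def aggregate_tags(contributions):
    """contributions: (tag, confidence) pairs from different users.
    Returns {tag: (vote_count, average_confidence)}."""
    buckets = defaultdict(list)
    for tag, confidence in contributions:
        buckets[tag].append(confidence)
    return {tag: (len(c), sum(c) / len(c)) for tag, c in buckets.items()}

votes = [("Yellowstone", 0.95), ("Yellowstone", 0.90), ("Yellowstone", 0.95),
         ("Yellowstone", 0.92), ("Yellowstone", 0.93), ("Montana", 0.50)]
summary = aggregate_tags(votes)
# e.g., summary["Yellowstone"]: 5 of 6 users, ~93% average confidence
```

The resulting per-tag vote counts and average confidences could themselves be stored as sub-metatags, so later consumers can weigh the tags accordingly.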
More information about crowd-sourcing, including use of contributor expertise, etc., is found in Digimarc's published patent application 20070162761.

Returning to the case of geolocation descriptors (which may be numeric, e.g., latitude/longitude, or textual), an image may accumulate—over time—a lengthy catalog of contributed geographic descriptors. An automated system (e.g., a server at Flickr) may periodically review the contributed geotag information, and distill it to facilitate public use. For numeric information, the process can apply known clustering algorithms to identify clusters of similar coordinates, and average same to generate a mean location for each cluster. For example, a photo of a geyser may be tagged by some people with latitude/longitude coordinates in Yellowstone, and by others with latitude/longitude coordinates of Hells Gate Park in New Zealand. These coordinates thus form two distinct clusters that would be separately averaged. If 70% of the contributors placed the coordinates in Yellowstone, the distilled (averaged) value may be given a confidence of 70%. Outlier data can be maintained, but given a low probability commensurate with its outlier status. Such distillation of the data by a proprietor can be stored in metadata fields that are readable by the public, but not writable.

The same or other approach can be used with added textual metadata—e.g., it can be accumulated and ranked based on frequency of occurrence, to give a sense of relative confidence.

The technology detailed in this specification finds numerous applications in contexts involving watermarking, barcoding, fingerprinting, OCR-decoding, and other approaches for obtaining information from imagery. Consider again the FIG. 21 cell phone photo of a desk phone. Flickr can be searched based on image metrics to obtain a collection of subject-similar images (e.g., as detailed above). A data extraction process (e.g., watermark decoding, fingerprint calculation, barcode- or OCR-reading) can be applied to some or all of the resulting images, and information gleaned thereby can be added to the metadata for the FIG. 21 image, and/or submitted to a service provider with image data (either for the FIG. 21 image, and/or for related images).

From the collection of images found in the first search, text or GPS metadata can be harvested, and a second search can be conducted for similarly-tagged images. From the text tags Cisco and VoIP, for example, a search of Flickr may find a photo of the underside of the user's phone, with OCR-readable data—as shown in FIG. 36. Again, the extracted information can be added to the metadata for the FIG. 21 image, and/or submitted to a service provider to enhance the response it is able to provide to the user.

As just shown, a cell phone user may be given the ability to look around corners and under objects—by using one image as a portal to a large collection of related images.

User Interface

Referring to FIGS. 44 and 45A, cell phones and related portable devices 110 typically include a display 111 and a keypad 112. In addition to a numeric (or alphanumeric) keypad there is often a multi-function controller 114. One popular controller has a center button 118, and four surrounding buttons 116a, 116b, 116c and 116d (also shown in FIG. 37).

An illustrative usage model is as follows. A system responds to an image 128 (either optically captured or wirelessly received) by displaying a collection of related images to the user, on the cell phone display. For example, the user captures an image and submits it to a remote service. The service determines image metrics for the submitted image (possibly after pre-processing, as detailed above), and searches (e.g., Flickr) for visually similar images. These images are transmitted to the cell phone (e.g., by the service, or directly from Flickr), and they are buffered for display. The service can prompt the user, e.g., by instructions presented on the display, to repeatedly press the right-arrow button 116b on the four-way controller (or press-and-hold) to view a sequence of pattern-similar images (130, FIG. 45A). Each time the button is pressed, another one of the buffered apparently-similar images is displayed.

By techniques like those earlier described, or otherwise, the remote service can also search for images that are similar in geolocation to the submitted image. These too can be sent to and buffered at the cell phone. The instructions may advise that the user can press the left-arrow button 116d of the controller to review these GPS-similar images (132, FIG. 45A).

Similarly, the service can search for images that are similar in metadata to the submitted image (e.g., based on textual metadata inferred from other images, identified by pattern matching or GPS matching). Again, these images can be sent to the phone and buffered for immediate display. The instructions may advise that the user can press the up-arrow button 116a of the controller to view these metadata-similar images (134, FIG. 45A).

Thus, by pressing the right, left, and up buttons, the user can review images that are similar to the captured image in appearance, location, or metadata descriptors.

Whenever such review reveals a picture of particular interest, the user can press the down button 116c. This action identifies the currently-viewed picture to the service provider, which then can repeat the process with the currently-viewed picture as the base image. The process then repeats with the user-selected image as the base, and with button presses enabling review of images that are similar to that base image in appearance (116b), location (116d), or metadata (116a).

This process can continue indefinitely. At some point the user can press the center button 118 of the four-way controller. This action submits the then-displayed image to a service provider for further action (e.g., triggering a corresponding response, as disclosed, e.g., in earlier-cited documents). This action may involve a different service provider than the one that provided all the alternative imagery, or they can be the same. (In the latter case the finally-selected image need not be sent to the service provider, since that service provider knows all the images buffered by the cell phone, and may track which image is currently being displayed.)

The dimensions of information browsing just-detailed (similar-appearance images; similar-location images; similar-metadata images) can be different in other embodiments. Consider, for example, an embodiment that takes an image of a house as input (or latitude/longitude), and returns the following sequences of images: (a) the houses for sale nearest in location to the input-imaged house; (b) the houses for sale nearest in price to the input-imaged house; and (c) the houses for sale nearest in features (e.g., bedrooms/baths) to the input-imaged house. (The universe of houses displayed can be constrained, e.g., by zip-code, metropolitan area, school district, or other qualifier.)

Another example of this user interface technique is presentation of search results from EBay for auctions listing Xbox 360 game consoles. One dimension can be price (e.g., pushing button 116b yields a sequence of screens showing Xbox 360 auctions, starting with the lowest-priced ones); another can be the seller's geographical proximity to the user (closest to furthest, shown by pushing button 116d); another can be time until end of auction (shortest to longest, presented by pushing button 116a).
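The dimension-based browsing just described—each button stepping through the same result set sorted along a different dimension—can be sketched as follows. The listing fields and values are hypothetical; the button labels echo the controller of FIG. 44.

```python
# Illustrative sketch of dimension-based browsing (listing fields and
# values are hypothetical): each button is bound to the same result set,
# sorted along a different dimension.

def build_feeds(listings, user_location):
    """Map controller buttons to differently-sorted sequences of listings."""
    return {
        "116b": sorted(listings, key=lambda x: x["price"]),           # cheapest first
        "116d": sorted(listings, key=lambda x: abs(x["miles"] - user_location)),  # nearest seller
        "116a": sorted(listings, key=lambda x: x["ends_in_hours"]),   # ending soonest first
    }

auctions = [
    {"id": 1, "price": 150, "miles": 7, "ends_in_hours": 40},
    {"id": 2, "price": 120, "miles": 9, "ends_in_hours": 2},
    {"id": 3, "price": 200, "miles": 2, "ends_in_hours": 12},
]
feeds = build_feeds(auctions, user_location=0)
# feeds["116b"][0]["id"] == 2 (cheapest); feeds["116d"][0]["id"] == 3 (nearest)
```

Each button press then simply advances one position in its button's pre-sorted, buffered sequence.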
Pressing the middle button 118 can load the full web page of the auction being displayed.

A related example is a system that responds to a user-captured image of a car by identifying the car (using image features and associated database(s)), searching EBay and Craigslist for similar cars, and presenting the results on the screen. Pressing button 116b presents screens of information about cars offered for sale (e.g., including image, seller location, and price) based on similarity to the input image (same model year/same color first, and then nearest model years/colors), nationwide. Pressing button 116d yields such a sequence of screens, but limited to the user's state (or metropolitan region, or a 50 mile radius of the user's location, etc.). Pressing button 116a yields such a sequence of screens, again limited geographically, but this time presented in order of ascending price (rather than closest model year/color). Again, pressing the middle button loads the full web page (EBay or Craigslist) of the car last-displayed.

Another embodiment is an application that helps people recall names. A user sees a familiar person at a party, but can't remember his name. Surreptitiously the user snaps a picture of the person, and the image is forwarded to a remote service provider. The service provider extracts facial recognition parameters and searches social networking sites (e.g., FaceBook, MySpace, Linked-In), or a separate database containing facial recognition parameters for images on those sites, for similar-appearing faces. (The service may provide the user's sign-on credentials to the sites, allowing searching of information that is not otherwise publicly accessible.) Names and other information about similar-appearing persons located via the searching are returned to the user's cell phone to help refresh the user's memory.

Various UI procedures are contemplated. When data is returned from the remote service, the user may push button 116b to scroll thru matches in order of closest similarity—regardless of geography. Thumbnails of the matched individuals with associated name and other profile information can be displayed, or just full-screen images of the person can be presented with the name overlaid. When the familiar person is recognized, the user may press button 118 to load the full FaceBook/MySpace/Linked-In page for that person. Alternatively, instead of presenting images with names, just a textual list of names may be presented, e.g., all on a single screen—ordered by similarity of face-match; SMS text messaging can suffice for this last arrangement.

Pushing button 116d may scroll thru matches in order of closest similarity, of people who list their residence as within a certain geographical proximity (e.g., same metropolitan area, same state, same campus, etc.) of the user's present location or the user's reference location (e.g., home). Pushing button 116a may yield a similar display, but limited to persons who are "Friends" of the user within a social network (or who are Friends of Friends, or who are within another specified degree of separation of the user).

A related arrangement is a law enforcement tool in which an officer captures an image of a person and submits same to a database containing facial portrait/eigenvalue information from government driver license records and/or other sources. Pushing button 116b causes the screen to display a sequence of images/biographical dossiers about persons nationwide having the closest facial matches. Pushing button 116d causes the screen to display a similar sequence, but limited to persons within the officer's state. Button 116a yields such a sequence, but limited to persons within the metropolitan area in which the officer is working.

Instead of three dimensions of information browsing (buttons 116b, 116d, 116a, e.g., for similar-appearing images/similarly located images/similar metadata-tagged images), more or fewer dimensions can be employed. FIG. 45B shows browsing screens in just two dimensions. (Pressing the right button yields a first sequence 140 of information screens; pressing the left button yields a different sequence 142 of information screens.)

Instead of two or more distinct buttons, a single UI control can be employed to navigate in the available dimensions of information. A joystick is one such device. Another is a roller wheel (or scroll wheel). Portable device 110 of FIG. 44 has a roller wheel 124 on its side, which can be rolled up or rolled down. It can also be pressed in to make a selection (e.g., akin to buttons 116c or 118 of the earlier-discussed controller). Similar controls are available on many mice.

In most user interfaces, opposing buttons (e.g., left button 116d, and right button 116b) navigate the same dimension of information, just in opposite directions (e.g., forward/reverse). In the particular interface discussed above, it will be recognized that this is not the case (although in other implementations, it may be so). Pressing the right button 116b, and then pressing the left button 116d, does not return the system to its original state. Instead, pressing the right button gives, e.g., a first similar-appearing image, and pressing the left button gives the first similarly-located image.

Sometimes it is desirable to navigate through the same sequence of screens, but in reverse of the order just-reviewed. Various interface controls can be employed to do this.

One is a "Reverse" button. The device 110 in FIG. 44 includes a variety of buttons not yet discussed (e.g., buttons 120a-120f around the periphery of the controller 114). Any of these—if pressed—can serve to reverse the scrolling order. By pressing, e.g., button 120a, the scrolling (presentation) direction associated with nearby button 116b can be reversed. So if button 116b normally presents items in order of increasing cost, activation of button 120a can cause the function of button 116b to switch, e.g., to presenting items in order of decreasing cost. If, in reviewing screens resulting from use of button 116b, the user "overshoots" and wants to reverse direction, she can push button 120a, and then push button 116b again. The screen(s) earlier presented would then appear in reverse order—starting from the present screen.

Or, operation of such a button (e.g., 120a or 120f) can cause the opposite button 116d to scroll back thru the screens presented by activation of button 116b, in reverse order.

A textual or symbolic prompt can be overlaid on the display screen in all these embodiments—informing the user of the dimension of information that is being browsed, and the direction (e.g., browsing by cost: increasing).

In still other arrangements, a single button can perform multiple functions. For example, pressing button 116b can cause the system to start presenting a sequence of screens, e.g., showing pictures of houses for sale nearest the user's location—presenting each for 800 milliseconds (an interval set by preference data entered by the user). Pressing button 116b a second time can cause the system to stop the sequence—displaying a static screen of a house for sale. Pressing button 116b a third time can cause the system to present the sequence in reverse order, starting with the static screen and going backwards thru the screens earlier presented. Repeated operation of buttons 116a, 116b, etc., can operate likewise (but control different sequences of information, e.g., houses closest in price, and houses closest in features).
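The buffered-feed navigation described in this section—per-button image sequences, with a modifier that flips a button's scrolling direction—can be sketched as a small state machine. The class and method names are hypothetical, and the feed contents are placeholders.

```python
# Minimal sketch of the controller model described above (class, method
# names, and feed contents are hypothetical): each arrow button steps
# through its own buffered sequence; a "Reverse" modifier flips direction.

class ImageBrowser:
    def __init__(self, feeds):
        # feeds: {"right": [pattern-similar...], "left": [GPS-similar...],
        #         "up": [metadata-similar...]}
        self.feeds = feeds
        self.pos = {k: -1 for k in feeds}    # nothing shown yet
        self.step = {k: 1 for k in feeds}    # +1 forward, -1 reverse

    def press(self, button):
        """Advance (or back up) through the buffered feed for this button."""
        feed = self.feeds[button]
        self.pos[button] = (self.pos[button] + self.step[button]) % len(feed)
        return feed[self.pos[button]]

    def reverse(self, button):
        """Counterpart of the 'Reverse' button (e.g., 120a): flip direction."""
        self.step[button] = -self.step[button]

b = ImageBrowser({"right": ["A1", "A2", "A3"], "left": ["G1", "G2"], "up": ["M1"]})
b.press("right")   # shows "A1"
b.press("right")   # shows "A2"
b.reverse("right")
b.press("right")   # back to "A1"
```

Note that, as the text observes, "right" and "left" here browse different dimensions rather than opposite directions of one dimension; reversal is instead handled by the modifier.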
In arrangements in which the presented information stems from a process applied to a base image (e.g., a picture snapped by a user), this base image may be presented throughout the display—e.g., as a thumbnail in a corner of the display. Or a button on the device (e.g., 126a, or 120b) can be operated to immediately summon the base image back to the display.

Touch interfaces are gaining in popularity, such as in products available from Apple and Microsoft (detailed, e.g., in Apple's patent publications 20060026535, 20060026536, 20060250377, 20080211766, 20080158169, 20080158172, 20080204426, 20080174570, and Microsoft's patent publications 20060033701, 20070236470 and 20080001924). Such technologies can be employed to enhance and extend the just-reviewed user interface concepts—allowing greater degrees of flexibility and control. Each button press noted above can have a counterpart gesture in the vocabulary of the touch screen system.

For example, different touch-screen gestures can invoke display of the different types of image feeds just reviewed. A brushing gesture to the right, for example, may present a rightward-scrolling series of image frames 130 of imagery having similar visual content (with the initial speed of scrolling dependent on the speed of the user gesture, and with the scrolling speed decelerating—or not—over time). A brushing gesture to the left may present a similar leftward-scrolling display of imagery 132 having similar GPS information. A brushing gesture upward may present an upward-scrolling display of imagery 134 similar in metadata. At any point the user can tap one of the displayed images to make it the base image, with the process repeating.

Other gestures can invoke still other actions. One such action is displaying overhead imagery corresponding to the GPS location associated with a selected image. The imagery can be zoomed in/out with other gestures. The user can select for display photographic imagery, map data, data from different times of day or different dates/seasons, and/or various overlays (topographic, places of interest, and other data, as is known from Google Earth), etc. Icons or other graphics may be presented on the display depending on contents of particular imagery. One such arrangement is detailed in Digimarc's published application 20080300011. "Curbside" or "street-level" imagery—rather than overhead imagery—can also be displayed.

It will be recognized that certain embodiments of the present technology include a shared general structure. An initial set of data (e.g., an image, or metadata such as descriptors or geocode information, or image metrics such as eigenvalues) is presented. From this, a second set of data (e.g., images, or image metrics, or metadata) is obtained. From that second set of data, a third set of data is compiled (e.g., images with similar image metrics or similar metadata, or image metrics, or metadata). Items from the third set of data can be used as a result of the process, or the process may continue: from the first set, a second, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can be gathered. From that second type of information, a second set of information-similar images can be obtained. From that second set, a third, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can be gathered. From that third type of information, a third set of information-similar images can be obtained. Etc.

Thus, while the illustrated embodiments generally start with an image, and then proceed by reference to its image metrics, and so on, entirely different combinations of acts are also possible. The seed can be the payload from a product barcode. This can generate a first collection of images depicting the same barcode. This can lead to a set of common metadata. That can lead to a second collection of images based on that metadata. Image metrics may be computed from this second collection, and the most prevalent metrics can be used to search and identify a third collection of images. The images thus identified can be presented to the user using the arrangements noted above.

In some embodiments, the present technology may be regarded as employing an iterative, recursive process by which information about one set of images (a single image in many initial cases) is used to identify a second set of images, which may be used to identify a third set of images, etc. The function by which each set of images is related to the next relates to a particular class of image information, e.g., image metrics, semantic metadata, GPS, decoded info, etc.

In other contexts, the relation between one set of images and the next is a function not just of one class of information, but two or more. For example, a seed user image may be examined for both image metrics and GPS data. From these two classes of information a collection of images can be determined—images that are similar in both some aspect of visual appearance and location. Other pairings, triplets, etc., of relationships can naturally be employed in the determination of any of the successive sets of images.

Further Discussion

Some embodiments of the present technology analyze a consumer cell phone picture, and heuristically determine information about the picture's subject. For example, is it a person, place, or thing? From this high-level determination, the system can better formulate what type of response might be sought by the consumer—making operation more intuitive.

For example, if the subject of the photo is a person, the consumer might be interested in adding the depicted person as a FaceBook "friend." Or sending a text message to that person.
Or publishing an annotated version of the photo to a continue, e.g., by using the third set of data in determining 50 web page. Or simply learning who the person is. fourth data (e.g., a set of descriptive metadata can be com If the Subject is a place (e.g., Times Square), the consumer piled from the images of the third set). This can continue, e.g., might be interested in the local geography, maps, and nearby determining a fifth set of data from the fourth (e.g., identify attractions. ing a collection of images that have metadata terms from the If the subject is a thing (e.g., the Liberty Bell or a bottle of fourth data set). A sixth set of data can be obtained from the 55 beer), the consumer may be interested in information about fifth (e.g., identifying clusters of GPS data with which images the object (e.g., its history, others who use it), or in buying or in the fifth set are tagged), and so on. selling the object, etc. The sets of data can be images, or they can be other forms Based on the image type, an illustrative system/service can of data (e.g., image metrics, textual metadata, geolocation identify one or more actions that it expects the consumer will data, decoded OCR-, barcode-, watermark-data, etc). 60 find most appropriately responsive to the cell phone image. Any data can serve as the seed. The process can start with One or all of these can be undertaken, and cached on the image data, or with other information, such as image metrics, consumer's cell phone for review. For example, Scrolling a textual metadata (aka semantic metadata), geolocation infor thumbwheel on the side of the cell phone may present a mation (e.g., GPS coordinates), decoded OCR/barcode/wa succession of different screens—each with different informa termark data, etc. From a first type of information (image 65 tion responsive to the image subject. 
(Or a screen may be metrics, semantic metadata, GPS info, decoded info), a first presented that queries the consumer as to which of a few set of information-similar images can be obtained. From that possible actions is desired.) US 8,768,313 B2 71 72 In use, the system can monitor which of the available Other images may have metadata that is exclusively place actions is chosen by the consumer. The consumer's usage related. These images are likely place-centric, rather than history can be employed to refine a Bayesian model of the person-centric or thing-centric. consumers interests and desires, so that future responses can In between are images that have both place-related meta be better customized to the user. data, and other metadata. Various rules can be devised and These concepts will be clearer by example (aspects of utilized to assign the relative relevance of the image to place. which are depicted, e.g., in FIGS. 46 and 47). One rule looks at the number of metadata descriptors asso Processing a Set of Sample Images ciated with an image, and determines the fraction that is found Assume a tourist Snaps a photo of the Prometheus statue at in the glossary of place-related terms. This is one metric. 10 Another looks at where in the metadata the place-related Rockefeller Center in New York using a cell phone or other descriptors appear. If they appear in an image title, they are mobile device. Initially, it is just a bunch of pixels. What to likely more relevant than if they appear at the end of a long do? narrative description about the photograph. Placement of the Assume the image is geocoded with location information placement-related metadata is another metric. (e.g., latitude/longitude in XMP- or EXIF-metadata). 15 Consideration can also be given to the particularity of the From the geocode data, a search of Flickr can be under place-related descriptor. 
A descriptor “New York” or “USA taken for a first set of images—taken from the same (or may be less indicative that an image is place-centric than a nearby) location. Perhaps there are 5 or 500 images in this first more particular descriptor, such as “Rockefeller Center” or Set. “Grand Central Station.” This can yield a third metric. Metadata from this set of images is collected. The metadata A related, fourth metric considers the frequency of occur can be of various types. One is words/phrases from a title rence (or improbability) of a term—either just within the given to an image. Another is information in metatags collected metadata, or within a superset of that data. “RCA assigned to the image—usually by the photographer (e.g., Building is more relevant, from this standpoint, than “Rock naming the photo subject and certain attributes/keywords), efeller Center because it is used much less frequently. but additionally by the capture device (e.g., identifying the 25 These and other metrics can be combined to assign each camera model, the date/time of the photo, the location, etc). image in the set with a place score indicating its potential Another is words/phrases in a narrative description of the place-centric-ness. photo authored by the photographer. The combination can be a straight Sum of four factors, each Some metadata terms may be repeated across different ranging from 0 to 100. More likely, however, some metrics images. Descriptors common to two or more images can be 30 will be weighted more heavily. The following equation identified (clustered), and the most popular terms may be employing metrics M1, M2, M2 and M4 can be employed to ranked. (Such as listing is shown at “A” in FIG. 46A. Here, yield a score S, with the factors A, B, C, D and exponents W. and in other metadata listings, only partial results are given X, Y and Z determined experimentally, or by Bayesian tech for expository convenience.) 
niques: From the metadata, and from other analysis, it may be 35 possible to determine which images in the first set are likely person-centric, which are place-centric, and which are thing Person-Centric Processing centric. A different analysis can be employed to estimate the per Consider the metadata with which a set of 50 images may Son-centric-ness of each image in the set obtained from be tagged. Some of the terms relate to place. Some relate to 40 Flickr. persons depicted in the images. Some relate to things. As in the example just-given, a glossary of relevant terms Place-Centric Processing can be compiled—this time terms associated with people. In Terms that relate to place can be identified using various contrast to the place name glossary, the person name glossary techniques. One is to use a database with geographical infor can be global—rather than associated with a particular locale. mation to look-up location descriptors near a given geo 45 (However, different glossaries may be appropriate in different graphical position. Yahoo’s GeoPlanet service, for example, countries.) returns a hierarchy of descriptors such as “Rockefeller Cen Such a glossary can be compiled from various sources, ter,” “10024” (a zip code), “Midtown Manhattan.” “New including telephone directories, lists of most popular names, York,” “Manhattan,” “New York, and “United States, when and other reference works where names appear. The list may queried with the latitude/longitude of the Rockefeller Center. 50 start, "Aaron, Abigail, Adam, Addison, Adrian, Aidan, Aiden, The same service can return names of adjoining/sibling Alex, Alexa, Alexander, Alexandra, Alexis, Allison, Alyssa, neighborhoods/features on request, e.g., "10017.” “10020.” Amelia, Andrea, Andrew, Angel, Angelina, Anna, Anthony, “10036,” “Theater District,” “Carnegie Hall,” “Grand Central Antonio, Ariana, Arianna, Ashley, Aubrey, Audrey, Austin, Station,” “Museum of American Folk Art, etc., etc. Autumn, Ava, Avery . . 
.” Nearby street names can be harvested from a variety of 55 First names alone can be considered, or last names can be mapping programs, given a set of latitude/longitude coordi considered too. (Some names may be a place name or a person nates or other location info. name. Searching for adjoining first/last names and/or adjoin A glossary of nearby place-descriptors can be compiled in ing place names can help distinguish ambiguous cases. E.g., such manner. The metadata harvested from the set of Flickr Elizabeth Smith is a person; Elizabeth N.J. is a place.) images can then be analyzed, by reference to the glossary, to 60 Personal pronouns and the like can also be included in Such identify the terms that relate to place (e.g., that match terms in a glossary (e.g., he, she, him, her, his, our, her, I, me, myself. the glossary). we, they, them, mine, their). Nouns identifying people and Consideration then turns to use of these place-related meta personal relationships can also be included (e.g., uncle, sister, data in the reference set of images collected from Flickr. daughter, gramps, boss, student, employee, wedding, etc) Some images may have no place-related metadata. These 65 Adjectives and adverbs that are usually applied to people images are likely person-centric or thing-centric, rather than may also be included in the person-term glossary (e.g., happy, place-centric. boring, blonde, etc), as can the names of objects and attributes US 8,768,313 B2 73 74 that are usually associated with people (e.g., t-shirt, back forms. E.g., a score (in a 0-100 scale) can be increased by 10, pack, Sunglasses, tanned, etc.). Verbs associated with people or increased by 10%. Or increased by half the remaining can also be employed (e.g., Surfing, drinking). distance to 100, etc.) 
In this last group, as in some others, there are some terms Thus, for example, a photo tagged with “Elizabeth' meta that could also apply to thing-centric images (rather than data is more likely a person-centric photo if the facial recog person-centric). The term 'sunglasses' may appear in meta nition algorithm finds a face within the image than if no face data for an image depicting Sunglasses, alone; "happy may is found. appear in metadata for an image depicting a dog. There are (Conversely, the absence of any face in an image can be also some cases where a person-term may also be a place used as a “plus” factor to increase the confidence that the term (e.g., Boring, Oregon). In more Sophisticated embodi 10 image subject is of a different type, e.g., a place or a thing. Thus, an image tagged with Elizabeth as metadata, but lack ments, glossary terms can be associated with respective con ing any face, increases the likelihood that the image relates to fidence metrics, by which any results based on Such terms a place named Elizabeth, or a thing named Elizabeth—Such may be discounted or otherwise acknowledged to have dif as a pet.) ferent degrees of uncertainty.) 15 Still more confidence in the determination can be assumed AS before, if an image is not associated with any person if the facial recognition algorithm identifies a face as a female, related metadata, then the image can be adjudged likely not and the metadata includes a female name. Such an arrange person-centric. Conversely, if all of the metadata is person ment, of course, requires that the glossary—or other data related, the image is likely person-centric. structure—have data that associates genders with at least For other cases, metrics like those reviewed above can be SO alS. assessed and combined to yield a score indicating the relative (Still more Sophisticated arrangements can be imple person-centric-ness of each image, e.g., based on the number, mented. 
For example, the age of the depicted person(s) can be placement, particularity and/or frequency/improbability of estimated using automated techniques (e.g., as detailed in the person-related metadata associated with the image. U.S. Pat. No. 5,781,650, to Univ. of Central Florida). Names While analysis of metadata gives useful information about 25 found in the image metadata can also be processed to estimate whetheran image is person-centric, other techniques can also the age of the thus-named person(s). This can be done using be employed—either alternatively, or in conjunction with public domain information about the statistical distribution of metadata analysis. a name as a function of age (e.g., from published Social One technique is to analyze the image looking for continu Security Administration data, and web sites that detail most ous areas of skin-tone colors. Such features characterize 30 popular names from birth records). Thus, names Mildred and many features of person-centric images, but are less fre Gertrude may be associated with an age distribution that quently found in images of places and things. peaks at age 80, whereas Madison and Alexis may be associ A related technique is facial recognition. This science has ated with an age distribution that peaks at age 8. Finding advanced to the point where even inexpensive point-and statistically-likely correspondence between metadata name shoot digital cameras can quickly and reliably identify faces 35 and estimated person age can further increase the person within an image frame (e.g., to focus or expose the image centric score for an image. Statistically unlikely correspon based on Such subjects). dence can be used to decrease the person-centric score. (Esti (Face finding technology is detailed, e.g., in U.S. Pat. No. mated information about the age of a Subject in the 5,781,650 (Univ. of Central Florida), U.S. Pat. No. 6,633,655 consumer's image can also be used to tailor the intuited (Sharp), U.S. Pat. 
No. 6,597,801 (Hewlett-Packard) and U.S. 40 response(s), as may information about the Subject's gender.)) Pat. No. 6,430.306 (L-1 Corp.), and in Yang et al. Detecting Just as detection of a face in an image can be used as a Faces in Images: A Survey, IEEE Transactions on Pattern “plus” factor in a score based on metadata, the existence of Analysis and Machine Intelligence, Vol. 24, No. 1, January person-centric metadata can be used as a “plus’ factor to 2002, pp. 34-58, and Zhao, et al. Face Recognition: A Litera increase a person-centric score based on facial recognition ture Survey, ACM Computing Surveys, 2003, pp. 399-458. 45 data. Additional papers about facial recognition technologies are Of course, if no face is found in an image, this information noted in a bibliography at the end of the provisional specifi can be used to reduce a person-centric score for the image cation to which this application claims priority.) (perhaps down to Zero). Facial recognition algorithms can be applied to the set of Thing-Centric Processing reference images obtained from Flickr, to identify those that 50 A thing-centered image is the third type of image that may have evident faces, and identify the portions of the images be found in the set of images obtained from Flickr in the corresponding to the faces. present example. There are various techniques by which a Of course, many photos have faces depicted incidentally thing-centric score for an image can be determined. within the image frame. While all images having faces could One technique relies on metadata analysis, using principles be identified as person-centric, most embodiments employ 55 like those detailed above. A glossary of nouns can be com further processing to provide a more refined assessment. 
piled either from the universe of Flickr metadata or some One form of further processing is to determine the percent other corpus (e.g., WordNet), and ranked by frequency of age area of the image frame occupied by the identified face(s). occurrence. Nouns associated with places and persons can be The higher the percentage, the higher the likelihood that the removed from the glossary. The glossary can be used in the image is person-centric. This is another metric than can be 60 manners identified above to conduct analyses of the images used in determining an image's person-centric score. metadata, to yield a score for each. Another form of further processing is to look for the exist Another approach uses pattern matching to identify thing ence of (1) one or more faces in the image, together with (2) centric images—matching each against a library of known person-descriptors in the metadata associated with the image. thing-related images. In this case, the facial recognition data can be used as a “plus’ 65 Still another approach is based on earlier-determined factor to increase a person-centric score of an image based on scores for person-centric and place-centric. A thing-centric metadata or other analysis. (The “plus' can take various score may be assigned in inverse relationship to the other two US 8,768,313 B2 75 76 scores (i.e., if an image scores low for being person-centric, Subject of the cellphone image can be achieved (and a more and low for being place-centric, then it can be assigned a high accurate response can be intuited and provided to the con score for being thing-centric). Sumer). Such techniques may be combined, or used individually. In A fixed set of image assessment criteria can be applied to any event, a score is produced for each image—tending to 5 distinguish images in the three categories. However, the indicate whether the image is more- or less-likely to be thing detailed embodiment determines such criteria adaptively. In centric. 
particular, this embodiment examines the set of images and Further Processing of Sample Set of Images determines which image features/characteristics/metrics Data produced by the foregoing techniques can produce most reliably (1) group like-categorized images together 10 (similarity); and (2) distinguish differently-categorized three scores for each image in the set, indicating rough con images from each other (difference). Among the attributes fidence/probability/likelihood that the image is (1) person that may be measured and checked for similarity/difference centric, (2) place-centric, or (3) thing-centric. These scores behavior within the set of images are dominant color, color needn't add to 100% (although they may). Sometimes an diversity; color histogram; dominant texture; texture diver image may score high in two or more categories. In Such case 15 sity; texture histogram; edginess; wavelet-domain transform the image may be regarded as having multiple relevance, e.g., coefficient histograms, and dominant wavelet coefficients; as both depicting a person and a thing. frequency domain transfer coefficient histograms and domi The set of images downloaded from Flickr may next be nant frequency coefficients (which may be calculated in dif segregated into groups, e.g., A, B and C, depending on ferent color channels); eigenvalues; keypoint descriptors; whether identified as primarily person-centric, place-centric, geometric class probabilities; symmetry; percentage of image or thing-centric. However, since some images may have split area identified as facial; image autocorrelation; low-dimen probabilities (e.g., an image may have some indicia of being sional "gists’ of image; etc. (Combinations of Such metrics place-centric, and Some indicia of being person-centric), may be more reliable than the characteristics individually.) identifying an image wholly by its high score ignores useful One way to determine which metrics are most salient for information. 
Preferable is to calculate a weighted score for 25 these purposes is to compute a variety of different image the set of images—taking each image's respective scores in metrics for the reference images. If the results within a cat the three categories into account. egory of images for a particular metric are clustered (e.g., if A sample of images from Flickr—all taken near Rock for place-centric images, the color histogram results are clus efeller Center may suggest that 60% are place-centric, 25% tered around particular output values), and if images in other are person-centric, and 15% are thing-centric. 30 categories have few or no output values near that clustered This information gives useful insight into the tourist's cell result, then that metric would appear well Suited for use as an phone image—even without regard to the contents of the image assessment criteria. (Clustering is commonly per image itself (except its geocoding). That is, chances are good formed using an implementation of a k-means algorithm.) that the image is place-centric, with less likelihood it is per In the set of images from Rockefeller Center, the system Son-centric, and still less probability it is thing centric. (This 35 may determine that an edginess score of>40 is reliably asso ordering can be used to determine the order of Subsequent ciated with images that score high as place-centric; a facial steps in the process—allowing the system to more quickly area score of >15% is reliably associated with images that gives responses that are most likely to be appropriate.) score high as person-centric; and a color histogram that has a This type-assessment of the cell phone photo can be local peak in the gold tones—together with a frequency con used—alone—to help determine an automated action pro 40 tent for yellow that peaks at lower image frequencies, is vided to the touristin response to the image. 
However, further Somewhat associated with images that score high as thing processing can better assess the image's contents, and thereby centric. allow a more particularly-tailored action to be intuited. The analysis techniques found most useful in grouping/ Similarity Assessments and Metadata Weighting distinguishing the different categories of images can then be Within the set of co-located images collected from Flickr, 45 applied to the user's cellphone image. The results can then be images that are place-centric will tend to have a different analyzed for proximity—in a distance measure sense (e.g., appearance than images that are person-centric or thing-cen multi-dimensional space)—with the characterizing features tric, yet tend to have some similarity within the place-centric associated with different categories of images. (This is the group. Place-centric images may be characterized by Straight first time that the cellphone image has been processed in this lines (e.g., architectural edges). Or repetitive patterns (win 50 particular embodiment.) dows). Or large areas of uniform texture and similar color Using Such techniques, the cell phone image may score a near the top of the image (sky). 60 for thing-centric, a 15 for place-centric, and a 0 for person Images that are person-centric will also tend to have dif centric (on scale of 0-100). This is a second, better set of ferent appearances than the other two classes of image, yet scores that can be used to classify the cell phone image (the have common attributes within the person-centric class. For 55 first being the statistical distribution of co-located photos example, person-centric images will usually have faces— found in Flickr). generally characterized by an ovoid shape with two eyes and The similarity of the user's cell phone image may next be a nose, areas of flesh tones, etc. compared with individual images in the reference set. 
Simi Although thing-centric images are perhaps the most larity metrics identified earlier can be used, or different mea diverse, images from any given geography may tend to have 60 Sures can be applied. The time or processing devoted to this unifying attributes or features. Photos geocoded at a horse task can be apportioned across the three different image cat track will depict horses with some frequency; photos geo egories based on the just-determined scores. E.g., the process coded from Independence National Historical Park in Phila may spend no time judging similarity with reference images delphia will tend to depict the Liberty Bell regularly, etc. classed as 100% person-centric, but instead concentrate on By determining whether the cell phone image is more 65 judging similarity with reference images classed as thing- or similar to place-centric, or person-centric, or thing-centric place-centric (with more effort—e.g., four times as much images in the set of Flickr images, more confidence in the effort being applied to the former than the latter). A simi US 8,768,313 B2 77 78 larity score is generated for most of the images in the refer corresponding similarity measure. The results can then be ence set (excluding those that are assessed as 100% person combined to yield a set of metadata weighted in accordance centric). image similarity. Consideration then returns to metadata. Metadata from the Some of the metadata—often including some highly reference images are again assembled—this time weighted in 5 ranked terms—will be of relatively low value in determining accordance with each image's respective similarity to the cell image-appropriate responses for presentation to the con phone image. (The weighting can be linear or exponential.) Sumer. “New York.” “Manhattan are a few examples. 
Gen Since metadata from similar images is weighted more than erally more useful will be metadata descriptors that are rela metadata from dissimilar images, the resulting set of meta tively unusual. data is tailored to more likely correspond to the cell phone 10 A measure of “unusualness' can be computed by determin image. ing the frequency of different metadata terms within a rel From the resulting set, the top N (e.g., 3) metadata descrip evant corpus, such as Flickr image tags (globally, or within a tors may be used. Or descriptors that—on a weighted basis— geolocated region), or image tags by photographers from comprise an aggregate M% of the metadata set. whom the respective images were Submitted, or words in an In the example given, the thus-identified metadata may 15 encyclopedia, or in Google's index of the web, etc. The terms comprise “Rockefeller Center,” “Prometheus.” and “Skating in the weighted metadata list can be further weighted in rink.” with respective scores of 19, 12 and 5 (see “B” in FIG. accordance with their unusualness (i.e., a second weighting). 46B). The result of such Successive processing may yield the list With this weighted set of metadata, the system can begin of metadata shown at “D” in FIG. 46B (each shown with its determining what responses may be most appropriate for the respective score). This information (optionally in conjunction consumer. In the exemplary embodiment, however, the sys with a tag indicating the person/place/thing determination) tem continues by further refining its assessment of the cell allows responses to the consumer to be well-correlated with phone image. (The system may begin determining appropri the cellphone photo. ate responses while also undertaking the further processing.) 
It will be recognized that this set of inferred metadata for Processing a Second Set of Reference Images 25 the user's cell phone photo was compiled entirely by auto At this point the system is better informed about the cell mated processing of other images, obtained from public phone image. Not only is its location known; so is its likely Sources such as Flickr, in conjunction with other public type (thing-centric) and some of its most-probably-relevant resources (e.g., listings of names, places). The inferred meta metadata. This metadata can be used in obtaining a second set data can naturally be associated with the user's image. More of reference images from Flickr. 30 importantly for the present application, however, it can help a In the illustrative embodiment, Flickr is queried for images service provider decide how best to respond to submission of having the identified metadata. The query can be geographi the user's image. cally limited to the cell phone's geolocation, or a broader (or Determining Appropriate Responses for Consumer unlimited) geography may be searched. (Or the query may Referring to FIG. 50, the system just-described can be run twice, so that half of the images are co-located with the 35 viewed as one particular application of an "image juicer that cellphone image, and the others are remote, etc.) receives image data from a user, and applies different forms of The search may first look for images that are tagged with all processing so as to gather, compute, and/or infer information of the identified metadata. In this case, 60 images are found. that can be associated with the image. If more images are desired, Flickr may be searched for the As the information is discerned, it can be forwarded by a metadata terms in different pairings, or individually. (In these 40 router to different service providers. 
latter cases, the distribution of selected images may be chosen so that the metadata occurrence in the results corresponds to the respective scores of the different metadata terms, i.e., 19/12/5.)

Metadata from this second set of images can be harvested, clustered, and may be ranked ("C" in FIG. 46B). (Noise words ("and," "of," "or," etc.) can be eliminated. Words descriptive only of the camera or the type of photography may also be disregarded (e.g., "Nikon," "D80," "HDR," "black and white," etc.). Month names may also be removed.)

The analysis performed earlier—by which each image in the first set of images was classified as person-centric, place-centric or thing-centric—can be repeated on images in the second set of images. Appropriate image metrics for determining similarity/difference within and between classes of this second image set can be identified (or the earlier measures can be employed).

These providers may be arranged to handle different types of information (e.g., semantic descriptors, image texture data, keypoint descriptors, eigenvalues, color histograms, etc.) or to different classes of images (e.g., photo of friend, photo of a can of soda, etc.). Outputs from these service providers are sent to one or more devices (e.g., the user's cell phone) for presentation or later reference. The present discussion now considers how these service providers decide what responses may be appropriate for a given set of input information.

One approach is to establish a taxonomy of image subjects and corresponding responses. A tree structure can be used, with an image first being classed into one of a few high level groupings (e.g., person/place/thing), and then each group being divided into further subgroups. In use, an image is assessed through different branches of the tree until the limits of available information allow no further progress to be made.
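A walk of such a response taxonomy might be sketched as below. The node names and the dictionary representation are illustrative assumptions; FIG. 51 shows only part of one possible tree.

```python
# Hypothetical response taxonomy: each node maps a classification
# label either to a sub-tree or to a list of response screens.
TREE = {
    "thing": {
        "food": ["buy online", "nutrition info", "map of local stores"],
        "statue": ["artist info", "nearby attractions"],
    },
    "person": {
        "friend": ["post to social page", "start text message"],
        "stranger": ["attempt public face match"],
    },
}

def walk(tree, labels):
    """Descend the tree using classification labels until the limits of
    available information allow no further progress to be made."""
    node = tree
    for label in labels:
        if isinstance(node, dict) and label in node:
            node = node[label]
        else:
            break
    return node  # actions at the deepest node reached are then taken

screens = walk(TREE, ["thing", "food"])  # the three cached screens
```

If classification stalls partway down (e.g., only "person" is known), the walk simply stops at that interior node, mirroring the "limits of available information" behavior described.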
US 8,768,313 B2 79 80

Actions associated with the terminal leaf or node of the tree are then taken.

Part of a simple tree structure is shown in FIG. 51. (Each node spawns three branches, but this is for illustration only; more or fewer branches can of course be used.)

If the subject of the image is inferred to be an item of food (e.g., if the image is associated with food-related metadata), three different screens of information can be cached on the user's phone. One starts an online purchase of the depicted item at an online vendor. (The choice of vendor, and payment/shipping details, can be obtained from user profile data.) The second screen shows nutritional information about the product. The third presents a map of the local area—identifying stores that sell the depicted product. The user switches among these responses using a roller wheel 124 on the side of the phone (FIG. 44).

If the subject is inferred to be a photo of a family member

These measures are then applied, as before, to generate refined scores for the user's cell phone image, as being person-centric, place-centric, and thing-centric. By reference to the images of the second set, the cell phone image may score a 65 for thing-centric, 12 for place-centric, and 0 for person-centric. (These scores may be combined with the earlier-determined scores, e.g., by averaging, if desired.)

As before, similarity between the user's cell phone image and each image in the second set can be determined. Metadata from each image can then be weighted in accordance with the

The results from such a search can be clustered by other concepts. For example, some of the results may be clustered because they share the theme "art deco." Others may be clustered because they deal with corporate history of RCA and GE. Others may be clustered because they concern the works of the architect Raymond Hood.
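Grouping results under whichever salient concept terms they contain is one simple way to form such clusters. A minimal sketch follows; the result records and concept terms are invented for illustration, and real clustering would be fuzzier.

```python
from collections import defaultdict

# Hypothetical search results; each carries the salient terms found in it.
results = [
    {"title": "RCA and GE: a corporate history", "terms": {"RCA", "GE"}},
    {"title": "Raymond Hood's skyscrapers", "terms": {"Raymond Hood", "art deco"}},
    {"title": "Art deco in Manhattan", "terms": {"art deco"}},
    {"title": "Paul Manship's Prometheus", "terms": {"Paul Manship", "sculpture"}},
]

def cluster_by_concept(results):
    """Group result titles under each concept term they share; clusters can
    then be ordered by size for presentation on successive UI screens."""
    clusters = defaultdict(list)
    for r in results:
        for term in r["terms"]:
            clusters[term].append(r["title"])
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))

concept_screens = cluster_by_concept(results)
```

Ordering the clusters by size corresponds to the later suggestion that screen order be determined by the sizes of the information clusters.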
Others may be clustered as relating to 20th century American sculpture, or Paul Manship. Other concepts found to produce distinct clusters may include John Rockefeller, The Mitsubishi Group, Columbia University, Radio City Music Hall, The Rainbow Room Restaurant, etc.

Information from these clusters can be presented to the user on successive UI screens, e.g., after the screens on which prescribed information/actions are presented. The order of these screens can be determined by the sizes of the information clusters, or the keyword-determined relevance.

Still a further response is to present to the user a Google search screen pre-populated with the twice-weighted metadata as search terms. The user can then delete terms that aren't relevant to his/her interest, and add other terms, so as to quickly execute a web search leading to the information or action desired by the user.

or friend, one screen presented to the user gives the option of posting a copy of the photo to the user's FaceBook page, annotated with the person(s)'s likely name(s). (Determining the names of persons depicted in a photo can be done by submitting the photo to the user's account at Picasa. Picasa performs facial recognition operations on submitted user images, and correlates facial eigenvectors with individual names provided by the user, thereby compiling a user-specific database of facial recognition information for friends and others depicted in the user's prior images. Picasa's facial recognition is understood to be based on technology detailed in U.S. Pat. No. 6,356,659 to Google. Apple's iPhoto software and Facebook's Photo Finder software include similar facial recognition functionality.) Another screen starts a text message to the individual, with the addressing information having been obtained from the user's address book, indexed by the Picasa-determined identity.
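The correlate-eigenvectors-with-names step might be approximated as a nearest-neighbor lookup over a user-specific gallery. Everything here is an invented stand-in (the vectors, names, and threshold); Picasa's actual matching is not public at this level of detail.

```python
import math

# Hypothetical user-specific gallery: name -> facial feature vector
# (stand-ins for the eigenvectors a service like Picasa might compute).
gallery = {
    "Alice": [0.9, 0.1, 0.3],
    "Bob":   [0.2, 0.8, 0.5],
}

def identify(probe, gallery, threshold=0.5):
    """Return the gallery name whose feature vector lies nearest the probe
    (Euclidean distance), or None if no match is close enough."""
    best_name, best_dist = None, float("inf")
    for name, vec in gallery.items():
        dist = math.dist(probe, vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

match = identify([0.85, 0.15, 0.35], gallery)
```

An unmatched probe (no gallery vector within the threshold) returns None, which corresponds to the "stranger" branch discussed below.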
The user can pursue any or all of the presented options by switching between the associated screens.

If the subject appears to be a stranger (e.g., not recognized by Picasa), the system will have earlier undertaken an attempted recognition of the person using publicly available facial recognition information. (Such information can be extracted from photos of known persons. VideoSurf is one vendor with a database of facial recognition features for actors and other persons. L-1 Corp. maintains databases of driver's license photos and associated data which may—with appropriate safeguards—be employed for facial recognition purposes.) The screen(s) presented to the user can show reference photos of the persons matched (together with a "match" score), as well as dossiers of associated information compiled from the web and other databases.

In some embodiments, the system response may depend on people with whom the user has a "friend" relationship in a social network, or some other indicia of trust. For example, if little is known about user Ted, but there is a rich set of information available about Ted's friend Alice, that rich set of information may be employed in determining how to respond to Ted, in connection with a given content stimulus.

Similarly, if user Ted is a friend of user Alice, and Bob is a friend of Alice, then information relating to Bob may be used in determining an appropriate response to Ted.

The same principles can be employed even if Ted and Alice are strangers, provided there is another basis for implicit trust. While basic profile similarity is one possible basis, a better one is the sharing of an unusual attribute (or, better, several). Thus, for example, if both Ted and Alice share the traits of being fervent supporters of Dennis Kucinich for president,
and being devotees of pickled ginger, then information relating to one might be used in determining an appropriate response to present to the other.

The arrangements just-described provide powerful new functionality. However, the "intuiting" of the responses likely desired by the user relies largely on the system designers. They consider the different types of images that may be encountered, and dictate responses (or selections of responses) that they believe will best satisfy the users' likely desires.

In this respect the above-described arrangements are akin to early indexes of the web—such as Yahoo! Teams of humans generated taxonomies of information for which people might search, and then manually located web resources that could satisfy the different search requests.

Eventually the web overwhelmed such manual efforts at organization. Google's founders were among those that recognized that an untapped wealth of information about the web could be obtained from examining links between the pages, and actions of users in navigating these links. Understanding of the system thus came from data within the system, rather than from an external perspective.

In like fashion, manually crafted trees of image classifications/responses will probably someday be seen as an early stage in the development of image-responsive technologies. Eventually such approaches will be eclipsed by arrangements that rely on machine understanding derived from the system itself, and its use.

One such technique simply examines which responsive screen(s) are selected by users in particular contexts. As such usage patterns become evident, the most popular responses can be moved earlier in the sequence of screens presented to the user.

Likewise, if patterns become evident in use of the open-ended search query option, such action can become a standard

A further screen gives the user the option of sending a "Friend" invite to the recognized person on MySpace, or another social networking site where the recognized person is found to have a presence. A still further screen details the degree of separation between the user and the recognized person. (E.g., my brother David has a classmate Steve, who has a friend Matt, who has a friend Tom, who is the son of the depicted person.) Such relationships can be determined from association information published on social networking sites.

Of course, the responsive options contemplated for the different sub-groups of image subjects may meet most user desires, but some users will want something different. Thus, at least one alternative response to each image may be open-ended—allowing the user to navigate to different information, or specify a desired response—making use of whatever image/metadata processed information is available.

One such open-ended approach is to submit the twice-weighted metadata noted above (e.g., "D" in FIG. 46B) to a general purpose search engine. Google, per se, is not necessarily best for this function, because current Google searches require that all search terms be found in the results. Better is a search engine that does fuzzy searching, and is responsive to differently-weighted keywords—not all of which need be found. The results can indicate different seeming relevance, depending on which keywords are found, where they are found, etc. (A result including "Prometheus" but lacking "RCA Building" would be ranked more relevant than a result including the latter but lacking the former.)

U.S. Pat. No. 7,383,169 (Microsoft) details how dictionaries and other large works of language can be processed by NLP techniques to compile lexical knowledge bases that serve as formidable sources of such "common sense" information about the world.
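The weight-sensitive fuzzy ranking described a few paragraphs above (the Prometheus vs. RCA Building example) can be sketched as a simple scored keyword match. The documents and the substring-matching scorer are illustrative assumptions; a real engine would also weigh where terms appear.

```python
def fuzzy_rank(documents, weighted_terms):
    """Score each document by the summed weights of the query terms it
    contains; terms need not all be present (unlike a strict AND search)."""
    scored = []
    for doc in documents:
        text = doc.lower()
        score = sum(weight for term, weight in weighted_terms.items()
                    if term.lower() in text)
        scored.append((score, doc))
    return [doc for score, doc in sorted(scored, key=lambda sd: -sd[0])]

# Twice-weighted metadata, as in the "D" list:
terms = {"Prometheus": 19, "RCA Building": 12, "Skating rink": 5}
docs = [
    "History of the RCA Building",
    "Prometheus statue at the skating rink",
    "Prometheus in Greek myth",
]
ranked = fuzzy_rank(docs, terms)
```

As the text prescribes, a result containing "Prometheus" (weight 19) outranks one containing only "RCA Building" (weight 12).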
This common sense knowledge can be applied in the metadata processing detailed herein. (Wikipedia is another reference source that can serve as the basis for such a knowledge base. Our digital life log is yet another—one that yields insights unique to us as individuals.)

When applied to our digital life log, NLP techniques can reach nuanced understandings about our historical interests and actions—information that can be used to model (predict) our present interests and forthcoming actions. This understanding can be used to dynamically decide what information should be presented, or what action should be undertaken, responsive to a particular user capturing a particular image (or to other stimulus). Truly intuitive computing will then have arrived.

Other Comments

While the image/metadata processing detailed above takes many words to describe, it need not take much time to perform. Indeed, much of the processing of reference data,

response, and moved higher in the presentation queue.

The usage patterns can be tailored in various dimensions of context. Males between 40 and 60 years of age, in New York, may demonstrate interest in different responses following capture of a snapshot of a statue by a 20th century sculptor, than females between 13 and 16 years of age in Beijing. Most persons snapping a photo of a food processor in the weeks before Christmas may be interested in finding the cheapest online vendor of the product; most persons snapping a photo of the same object the week following Christmas may be interested in listing the item for sale on E-Bay or Craigslist. Etc. Desirably, usage patterns are tracked with as many demographic and other descriptors as possible, so as to be most predictive of user behavior.

More sophisticated techniques can also be applied, drawing from the rich sources of expressly- and inferentially-linked data sources now available.
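The context-tailored reordering of response screens by observed selections might look like the following sketch. The context tuple, screen names, and counting scheme are all illustrative assumptions.

```python
from collections import Counter

class ScreenOrderer:
    """Reorder response screens per context as selection counts accumulate;
    popular responses move earlier in the presentation sequence."""
    def __init__(self, default_order):
        self.default_order = list(default_order)
        self.counts = {}  # context tuple -> Counter of selected screens

    def record(self, context, screen):
        self.counts.setdefault(context, Counter())[screen] += 1

    def order(self, context):
        c = self.counts.get(context, Counter())
        # Stable sort: unselected screens keep their default relative order.
        return sorted(self.default_order, key=lambda s: -c[s])

orderer = ScreenOrderer(["buy online", "nutrition", "local stores"])
ctx = ("male", "40-60", "New York")   # hypothetical demographic context
for _ in range(3):
    orderer.record(ctx, "local stores")
orderer.record(ctx, "buy online")
```

A different context (say, a different demographic or a post-Christmas date) keeps its own counts, so its ordering can diverge, as in the food-processor example above.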
These include not only the web and personal profile information, but all manner of other digital data we touch and in which we leave traces, e.g., cell phone billing statements, credit card statements, shopping data from Amazon and EBay, Google search history, browsing history, cached web pages, cookies, email archives, phone message archives from Google Voice, travel reservations on Expedia and Orbitz, music collections on iTunes, cable television subscriptions, Netflix movie choices, GPS tracking information, social network data and activities, activities and postings on photo sites such as Flickr and Picasa and video sites such as YouTube, the times of day memorialized in these records, etc. (our "digital life log"). Moreover, this information is potentially available not just for the user, but also for the user's friends/family, for others having demographic similarities with the user, and ultimately everyone else (with appropriate anonymization and/or privacy safeguards).

compilation of glossaries, etc., can be done off-line—before any input image is presented to the system. Flickr, Yahoo! or other service providers can periodically compile and pre-process reference sets of data for various locales, to be quickly available when needed to respond to an image query.

In some embodiments, other processing activities will be started in parallel with those detailed. For example, if initial processing of the first set of reference images suggests that the snapped image is place-centric, the system can request likely useful information from other resources before processing of the user image is finished. To illustrate, the system may immediately request a street map of the nearby area, together with a satellite view, a street view, a mass transit map, etc. Likewise, a page of information about nearby restaurants can be compiled, together with another page detailing nearby movies and show-times, and a further page with a local weather forecast.
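Launching those speculative requests in parallel might be sketched as below. The fetch functions are invented stand-ins for the map, transit, and weather services; a real system would issue network requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for remote requests (names and return values are invented
# for illustration; real implementations would call remote services).
def fetch_street_map(loc):  return f"street map near {loc}"
def fetch_transit_map(loc): return f"transit map near {loc}"
def fetch_forecast(loc):    return f"weather forecast for {loc}"

def prefetch(loc):
    """Launch the likely-useful requests in parallel, without waiting for
    image processing to finish; results are cached for later display."""
    tasks = [fetch_street_map, fetch_transit_map, fetch_forecast]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, loc) for task in tasks]
        return [f.result() for f in futures]

pages = prefetch("40.7587N, 73.9787W")  # hypothetical geocode
```

Because only the geocode is needed, this prefetch can begin before any image processing occurs, as the next paragraph notes.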
These can all be sent to the user's phone and cached for later display (e.g., by scrolling a thumb wheel on the side of the phone).

These actions can likewise be undertaken before any image processing occurs—simply based on the geocode data accompanying the cell phone image.

While geocoding data accompanying the cell phone image was used in the arrangement particularly described, this is not necessary. Other embodiments can select sets of reference images based on other criteria, such as image similarity. (This may be determined by various metrics, as indicated above and also detailed below. Known image classification techniques can also be used to determine one of several classes of images into which the input image falls, so that similarly-classed images can then be retrieved.) Another criterion is the IP address from which the input image is uploaded. Other images uploaded from the same—or geographically-proximate—IP addresses, can be sampled to form the reference

The network of interrelationships between these data sources is smaller than the network of web links analyzed by Google, but is perhaps richer in the diversity and types of links. From it can be mined a wealth of inferences and insights, which can help inform what a particular user is likely to want done with a particular snapped image.

Artificial intelligence techniques can be applied to the data-mining task. One class of such techniques is natural language processing (NLP), a science that has made significant advancements recently.

One example is the Semantic Map compiled by Cognition Technologies, Inc., a database that can be used to analyze words in context, in order to discern their meaning. This functionality can be used, e.g., to resolve homonym ambiguity in analysis of image metadata (e.g., does "bow" refer to a part of a ship, a ribbon adornment, a performer's thank-you, or a complement to an arrow? Proximity to terms such as "Carnival cruise," "satin,"
"Carnegie Hall" or "hunting" can provide the likely answer). U.S. Pat. No. 5,794,050 (FRCD Corp.) details underlying technologies.

The understanding of meaning gained through NLP techniques can also be used to augment image metadata with other relevant descriptors—which can be used as additional metadata in the embodiments detailed herein. For example, a close-up image tagged with the descriptor "hibiscus stamens" can—through NLP techniques—be further tagged with the term "flower." (As of this writing, Flickr has 460 images tagged with "hibiscus" and "stamen," but omitting "flower.")

sets. Even in the absence of geocode data for the input image, the reference sets of imagery may nonetheless be compiled based on location. Location information for the input image can be inferred from various indirect techniques. A wireless service provider through which a cell phone image is relayed may identify the particular cell tower from which the tourist's transmission was received. (If the transmission originated through another wireless link, such as WiFi, its location may also be known.) The tourist may have used his credit card an hour earlier at a Manhattan hotel, allowing the system (with appropriate privacy safeguards) to infer that the picture was taken somewhere near Manhattan. Sometimes features depicted in an image are so iconic that a quick search for similar images in Flickr can locate the user (e.g., as being at the Eiffel Tower, or at the Statue of Liberty).

GeoPlanet was cited as one source of geographic information.

Some of the database searches can be iterative/recursive. For example, results from one database search can be combined with the original search inputs and used as inputs for a further search.

It will be recognized that much of the foregoing processing is fuzzy. Much of the data may be in terms of metrics that have
no absolute meaning, but are relevant only to the extent different from other metrics. Many such different probabilistic factors can be assessed and then combined—a statistical stew. Artisans will recognize that the particular implementation suitable for a given situation may be largely arbitrary. However, through experience and Bayesian techniques, more informed manners of weighting and using the different factors can be identified and eventually used.

If the Flickr archive is large enough, the first set of images in the arrangement detailed above may be selectively chosen to more likely be similar to the subject image. For example, Flickr can be searched for images taken at about the same time of day. Lighting conditions will be roughly similar, e.g., so that matching a night scene to a daylight scene is avoided, and shadow/shading conditions might be similar. Likewise, Flickr can be searched for images taken in the same season/month.

However, a number of other geoinformation databases can alternatively be used. GeoNames-dot-org is one. (It will be recognized that the "-dot-" convention, and omission of the usual http preamble, is used to prevent the reproduction of this text by the Patent Office from being indicated as a live hyperlink.) In addition to providing place names for a given latitude/longitude (at levels of neighborhood, city, state, country), and providing parent, child, and sibling information for geographic divisions, GeoNames' free data (available as a web service) also provides functions such as finding the nearest intersection, finding the nearest post office, finding the surface elevation, etc. Still another option is Google's GeoSearch API, which allows retrieval of and interaction with data from Google Earth and Google Maps.

It will be recognized that archives of aerial imagery are growing exponentially. Part of such imagery is from a
straight-down perspective, but off-axis the imagery increasingly becomes oblique. From two or more different oblique views of a location, a 3D model can be created. As the resolution of such imagery increases, sufficiently rich sets of data are available that—for some locations—a view of a scene as if taken from ground level may be synthesized. Such views can be matched with street level photos, and metadata from one can augment metadata for the other.

As shown in FIG. 47, the embodiment particularly described above made use of various resources, including Flickr, a database of person names, a word frequency database, etc. These are just a few of the many different information sources that might be employed in such arrangements. Other social networking sites, shopping sites (e.g., Amazon, EBay), weather and traffic sites, online thesauruses, caches of recently-visited web pages, browsing history, cookie collections, Google, other digital repositories (as detailed herein), etc., can all provide a wealth of additional information that can be applied to the intended tasks. Some of this data reveals information about the user's interests, habits and preferences—data that can be used to better infer the contents of the snapped picture, and to better tailor the intuited response(s).

Likewise, while FIG. 47 shows a few lines interconnecting the different items, these are illustrative only. Different interconnections can naturally be employed.

Issues such as seasonal disappearance of the ice skating rink at Rockefeller Center, and snow on a winter landscape, can thus be mitigated. Similarly, if the camera/phone is equipped with a magnetometer, inertial sensor, or other technology permitting its bearing (and/or azimuth/elevation) to be determined, then Flickr can be searched for shots with this degree of similarity too.

Moreover, the sets of reference images collected from Flickr desirably comprise images from many different sources (photographers)—so they don't tend towards use of the same metadata descriptors.

Images collected from Flickr may be screened for adequate metadata. For example, images with no metadata (except, perhaps, an arbitrary image number) may be removed from the reference set(s). Likewise, images with less than 2 (or 20) metadata terms, or without a narrative description, may be disregarded.

Flickr is often mentioned in this specification, but other collections of content can of course be used.
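The reference-set screening steps just discussed (same month and about the same hour, minimum tag count, narrative description present) might be combined as below. The record fields ("tags", "description", "taken") and the two-hour slack are illustrative assumptions.

```python
from datetime import datetime

def screen_reference_images(images, query_time, min_tags=2, hour_slack=2):
    """Screen candidate reference images: require adequate metadata and a
    narrative description, and keep only shots from the same month and at
    about the same hour, so lighting and seasonal content roughly match."""
    kept = []
    for img in images:
        tags = [t for t in img.get("tags", []) if not t.isdigit()]
        if len(tags) < min_tags or not img.get("description"):
            continue  # no metadata, or only an arbitrary image number
        taken = img["taken"]
        if taken.month != query_time.month:
            continue  # wrong season: e.g., no rink in summer shots
        if abs(taken.hour - query_time.hour) > hour_slack:
            continue  # avoid matching a night scene to a daylight scene
        kept.append(img)
    return kept

query_time = datetime(2009, 12, 20, 21, 0)  # a December night shot
candidates = [
    {"tags": ["rockefeller", "skating"], "description": "rink at dusk",
     "taken": datetime(2009, 12, 18, 22, 30)},   # kept
    {"tags": ["0412"], "description": "",
     "taken": datetime(2009, 12, 18, 22, 30)},   # dropped: no real metadata
    {"tags": ["rockefeller", "plaza"], "description": "summer stroll",
     "taken": datetime(2009, 7, 4, 21, 15)},     # dropped: wrong season
]
kept = screen_reference_images(candidates, query_time)
```

Further criteria, such as the license screening Flickr supports, could be added as one more test in the same loop.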
Images in Flickr commonly have specified license rights for each image. These include "all rights reserved" as well as a variety of Creative Commons licenses, through which the public can make use of the imagery on different terms. Systems detailed herein can limit their searches through Flickr for imagery meeting specified license criteria (e.g., disregard images marked "all rights reserved").

Other image collections are in some respects preferable. For example, the database at images.google-dot-com seems better at ranking images based on metadata-relevance than Flickr.

Flickr and Google maintain image archives that are publicly accessible. Many other image archives are private. The present technology finds application with both—including some hybrid contexts in which both public and proprietary image collections are used (e.g., Flickr is used to find an image based on a user image, and the Flickr image is submitted to a private database to find a match and determine a

The arrangements detailed in this specification are a particular few out of myriad that may be employed. Most embodiments will be different than the ones detailed. Some actions will be omitted, some will be performed in different orders, some will be performed in parallel rather than serially (and vice versa), some additional actions may be included, etc.

One additional action is to refine the just-detailed process by receiving user-related input, e.g., after the processing of the first set of Flickr images.
For example, the system identified "Rockefeller Center," "Prometheus," and "Skating rink" as relevant metadata to the user-snapped image. The system may query the user as to which of these terms is most relevant (or least relevant) to his/her particular interest. The further processing (e.g., further search, etc.) can be focused accordingly.

Within an image presented on a touch screen, the user may touch a region to indicate an object of particular relevance within the image frame. Image analysis and subsequent acts can then focus on the identified object.

corresponding response for the user).

Similarly, while reference was made to services such as Flickr for providing data (e.g., images and metadata), other sources can of course be used.

One alternative source is an ad hoc peer-to-peer (P2P) network. In one such P2P arrangement, there may optionally be a central index, with which peers can communicate in searching for desired content, and detailing the content they have available for sharing. The index may include metadata and metrics for images, together with pointers to the nodes at which the images themselves are stored.

In processing metadata, it is sometimes helpful to clean up the data prior to analysis, as referenced above. The metadata may also be examined for dominant language, and if not English (or other particular language of the implementation), the metadata and the associated image may be removed from consideration.

The peers may include cameras, PDAs, and other portable devices, from which image information may be available nearly instantly after it has been captured.

In the course of the methods detailed herein, certain relationships are discovered between imagery (e.g., similar geolocation; similar image metrics; similar metadata, etc.).
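Such discovered relationships can be cached symmetrically, so that a similarity found while processing one image is available "for free" when the other is later processed. A minimal sketch, assuming simple string identifiers (in practice the keys would be persistent identifiers):

```python
class RelationshipStore:
    """Cache discovered image-to-image relationships symmetrically, so a
    similarity found while processing Image A is later available when
    Image B is processed, without re-analysis."""
    def __init__(self):
        self.links = {}  # frozenset({id_a, id_b}) -> relationship label

    def add(self, id_a, id_b, relation):
        # A frozenset key makes the link direction-independent.
        self.links[frozenset((id_a, id_b))] = relation

    def related_to(self, image_id):
        """All images linked to image_id, with the relationship label."""
        return {(set(pair) - {image_id}).pop(): rel
                for pair, rel in self.links.items() if image_id in pair}

store = RelationshipStore()
store.add("imgA", "imgB", "similar color histogram")
```

These cached entries act as the "virtual links" between images that the following paragraphs describe.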
These data are generally reciprocal, so if the system discovers—during processing of Image A—that its color histogram is similar to that of Image B, then this information can be stored for later use. If a later process involves Image B, the earlier-stored information can be consulted to discover that Image A has a similar histogram—without analyzing Image B. Such relationships are akin to virtual links between the images.

For such relationship information to maintain its utility over time, it is desirable that the images be identified in a persistent manner. If a relationship is discovered while Image A is on a user's PDA, and Image B is on a desktop somewhere, a means should be provided to identify Image A even after it has been transferred to the user's MySpace account, and to track Image B after it has been archived to an anonymous computer in a cloud network. Images can be assigned Digital Object Identifiers (DOI) for this purpose.

While the earlier-detailed embodiment sought to identify the image subject as being one of a person/place/thing so that a correspondingly-different action can be taken, analysis/identification of the image within other classes can naturally be employed. A few examples of the countless other class/type groupings include animal/vegetable/mineral; golf/tennis/football/baseball; male/female; wedding-ring-detected/wedding-ring-not-detected; urban/rural; rainy/clear; day/night; child/adult; summer/autumn/winter/spring; car/truck; consumer product/non-consumer product; can/box/bag; natural/man-made; suitable for all ages/parental advisory for children 13 and below/parental advisory for children 17 and below/adult only; etc.

Sometimes different analysis engines may be applied to the user's image data. These engines can operate sequentially, or in parallel. For example, FIG. 48A shows an arrangement in
which—if an image is identified as person-centric—it is next referred to two other engines. One identifies the person as family, friend or stranger. The other identifies the person as child or adult. The latter two engines work in parallel, after the first has completed its work.

Sometimes engines can be employed without any certainty that they are applicable. For example, FIG. 48B shows engines performing family/friend/stranger and child/adult analyses—at the same time the person/place/thing engine is undertaking its analysis. If the latter engine determines the image is likely a place or thing, the results of the first two engines will likely not be used.

(Specialized online services can be used for certain types of image discrimination/identification. For example, one web site may provide an airplane recognition service: when an image of an aircraft is uploaded to the site, it returns an identification of the plane by make and model.

The International DOI Foundation has implemented the CNRI Handle System so that such resources can be resolved to their current location through the web site at doi-dot-org. Another alternative is for the images to be assigned and digitally watermarked with identifiers tracked by the Digimarc For Images service.

If several different repositories are being searched for imagery or other information, it is often desirable to adapt the query to the particular databases being used. For example, different facial recognition databases may use different facial recognition parameters. To search across multiple databases, technologies such as detailed in Digimarc's published patent applications 20040243567 and 20060020630 can be employed to ensure that each database is probed with an appropriately-tailored query.

Frequent reference has been made to images, but in many cases other information may be used in lieu of image information itself. In different applications image identifiers, characterizing eigenvectors, color histograms, keypoint
descriptors, FFTs, associated metadata, decoded barcode or watermark data, etc., may be used instead of imagery, per se (e.g., as a data proxy).

While the earlier example spoke of geocoding by latitude/longitude data, in other arrangements the cell phone/camera may provide location data in one or more other reference systems, such as Yahoo's GeoPlanet ID—the Where on Earth ID (WOEID).

Location metadata can be used for identifying other resources in addition to similarly-located imagery. Web pages, for example, can have geographical associations (e.g., a blog may concern the author's neighborhood; a restaurant's web page is associated with a particular physical address). The web service GeoURL-dot-org is a location-to-URL reverse directory that can be used to identify web sites associated with particular geographies.

(Such technology can follow teachings, e.g., of Sun, The Features Vector Research on Target Recognition of Airplane, JCIS-2008 Proceedings; and Tien, Using Invariants to Recognize Airplanes in Inverse Synthetic Aperture Radar Images, Optical Engineering, Vol. 42, No. 1, 2003.) The arrangements detailed herein can refer imagery that appears to be of aircraft to such a site, and use the returned identification information. Or all input imagery can be referred to such a site; most of the returned results will be ambiguous and will not be used.)

FIG. 49 shows that different analysis engines may provide their outputs to different response engines. Often the different analysis engines and response engines may be operated by different service providers. The outputs from these response engines can then be consolidated/coordinated for presentation to the consumer. (This consolidation may be performed
by the user's cell phone—assembling inputs from different data sources; or such task can be performed by a processor elsewhere.)

One example of the technology detailed herein is a home builder who takes a cell phone image of a drill that needs a spare part. The image is analyzed, the drill is identified by the system as a Black and Decker DR250B, and the user is provided various info/action options. These include reviewing photos of drills with similar appearance, reviewing photos of drills with similar descriptors/features, reviewing the user's manual for the drill, seeing a parts list for the drill, buying the drill new from Amazon or used from EBay, listing the builder's drill on EBay, buying parts for the drill, etc. The

GeoURL supports a variety of location tags, including their own ICBM meta tags, as well as GeoTags. Other systems that support geotagging include RDF, Geo microformat, and the GPSLongitude/GPSLatitude tags commonly used in XMP- and EXIF-camera metainformation. Flickr uses a syntax established by Geobloggers, e.g.:

geotagged
geo:lat=57.64911
geo:lon=10.40744
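Parsing the Geoblogger-style machine tags shown above is straightforward. A minimal sketch (the tag list mirrors the example; handling of malformed tags is a simplifying assumption):

```python
def parse_geotags(tags):
    """Extract latitude/longitude from Geoblogger-style machine tags of
    the form used on Flickr (e.g., "geo:lat=57.64911")."""
    coords = {}
    for tag in tags:
        if tag.startswith("geo:") and "=" in tag:
            key, value = tag[len("geo:"):].split("=", 1)
            coords[key] = float(value)
    return coords.get("lat"), coords.get("lon")

lat, lon = parse_geotags(["geotagged", "geo:lat=57.64911", "geo:lon=10.40744"])
```

The bare "geotagged" tag carries no coordinates itself; it simply flags that the geo:lat/geo:lon machine tags are present.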