US 20090259633 A1
(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2009/0259633 A1
     Bronstein et al.                  (43) Pub. Date: Oct. 15, 2009

(54) UNIVERSAL LOOKUP OF VIDEO-RELATED DATA

(75) Inventors: Alexander Bronstein, San Jose, CA (US); Michael Bronstein, Santa Clara, CA (US); Shlomo Selim Rakib, Cupertino, CA (US)

Correspondence Address: Stevens Law Group, 1754 Technology Drive, Suite #226, San Jose, CA 95110 (US)

(73) Assignee: Novafora, Inc., San Jose, CA (US)

(21) Appl. No.: 12/349,473

(22) Filed: Jan. 6, 2009

Related U.S. Application Data

(63) Continuation-in-part of application No. 12/349,469, filed on Jan. 6, 2009.

(60) Provisional application No. 61/045,278, filed on Apr. 15, 2008.

Publication Classification

(51) Int. Cl. G06F 17/30 (2006.01)

(52) U.S. Cl. ...... 707/3; 707/104.1; 707/E17.108; 707/E17.044

(57) ABSTRACT

A universal video-related lookup system and method receives a request for information associated with specific video content from a requesting device. The system and method identify a first video content identifier associated with the specific video content and retrieve first metadata associated with the specific video content based on the first video content identifier. Next, the system and method translate the first video content identifier into a second video content identifier associated with the specific video content and retrieve second metadata based on the second video content identifier. The first metadata and the second metadata are then provided to the requesting device.

[FIG. 1: Example environment 100. Video Source 1 (102), Video Source 2 (104), and Video Source 3 (106), and Metadata Source 1 (110), Metadata Source 2 (112), and Metadata Source 3 (114), coupled through Network 108 to Universal Video Lookup System 116.]

[FIG. 2: Universal Video Lookup System 116, including Communication Module 202, Processor 204, Video Analysis Module 206, Storage Device 208, Video DNA Module 210, Correspondence Module 212, Hash Calculation Module 214, and Advertisement Management Module 216.]

[FIG. 3: Flow diagram of procedure 300 for retrieving video content identifiers and metadata associated with specific video content (blocks 302-312).]

[FIG. 4: Flow diagram of procedure 400 for determining correspondence and alignment between two video sequences of the same video program (blocks 402-412).]

[FIG. 5: Flow diagram of procedure 500 for identifying, retrieving, and aligning subtitle information with a movie played from a DVD (blocks 502-514).]

[FIG. 6A and FIG. 6B: Spatial and temporal alignment of video data; context representation using video genomics.]

[FIG. 7: Formation of video DNA: temporal intervals, visual vocabulary, visual elements, and visual nucleotides (708, 714).]

[FIG. 8: Comparison between biological DNA and video DNA.]

[FIG. 9: Video DNA construction pipeline: Video data 900, Feature detection 1000, Feature locations 1010, Feature description 2000, Feature descriptors 2010, Feature pruning 3000, Subset of features 3010, Segmentation into temporal intervals 4000, Feature representation 5000, Visual atoms 4010, Visual atom aggregation 6000, Video DNA 6010.]

[FIG. 10: Dividing a video sequence into temporal intervals 1020, 1022, 1024, 1026.]

[FIG. 11: Frame-based feature detection: for each frame t of Video data 900, detect features (x_i, y_i, t) in frame t and increment t until all frames are processed, producing Feature locations 1010.]

[FIG. 12: Feature tracking: Feature descriptors 2010 and Feature locations 1010 feed Feature tracking 3100, producing Tracks 3110; Track pruning yields the Subset of features 3010.]

[FIG. 13: Track pruning: compute track duration, motion variables, and description variance from Feature descriptors 2010, Tracks 3110, and Feature locations 1010; apply a decision rule to obtain pruned tracks; feature selection yields the Subset of features 3010.]

[FIG. 14: Spatio-temporal correspondence: Video Data 900 and 901, Video DNA computation 1410, Video DNA 6010 and 6011, Temporal alignment 1420, Temporal correspondence 1425, Selection of temporally corresponding subunits of video 1430, Subsets 1435 and 1436, Spatial alignment 1440, Spatial correspondence 1445.]

[FIG. 15: Overview of video DNA generation: frame 1500 at time t, feature point 1502, track 1504; insignificant tracks are rejected (1506).]

[FIG. 16: Feature point after pruning 1600, local neighborhood 1602, and feature descriptor 1604, an N-dimensional vector such as (23, 14, 3, 17, ...).]

[FIG. 17: An actual feature descriptor 1700 assigned to its nearest representative feature descriptor 1702 within a library of K representative feature descriptors in the N-dimensional feature descriptor space.]

[FIG. 18: Temporal aggregation: a series of video frames spanning a temporal interval (1804, 1806, 1808) is aggregated into a visual nucleotide or "bag of features", a K-dimensional vector such as (0, 3, 2, 4, ..., 1).]

[FIG. 19: Reference video DNA of reference video media: visual nucleotides indexed by video frame number (e.g., frames 1-10 map to nucleotide 1, frames 11-20 to nucleotide 2, frames 21-30 to nucleotide 3), and client video DNA 1918 produced from client video 1914 by the video DNA creation process 1916.]

[FIG. 20: Video signature feature detection process.]

[FIG. 21: Video signature feature tracking and pruning: feature points and pruned feature points.]

[FIG. 22: Video signature feature description.]

[FIG. 23: Vector quantization process.]

[FIG. 24: Video DNA construction.]

[FIG. 25: Example system 2500 for processing video data: Video Data Source, Video Segmenter 2504, Video Processor 2506, Storage Device 2508, and Video Aggregator 2510, producing Video DNA.]

UNIVERSAL LOOKUP OF VIDEO-RELATED DATA

RELATED APPLICATIONS

[0001] This application claims the priority benefit of U.S. Provisional Patent Application No. 61/045,278, "Video Genomics: a framework for representation and matching of video content", filed Apr. 15, 2008, the disclosure of which is incorporated by reference herein. This application is also a Continuation-In-Part of, and claims the priority benefit of, U.S. patent application Ser. No. TBD, "Methods and systems for representation and matching of video content" (identified by Docket No. NOVA-00801), filed Jan. 6, 2009, the disclosure of which is incorporated by reference herein. This application is also related to U.S. patent application Ser. No. TBD, "Methods and systems for representation and matching of video content" (identified by Docket No. NOVA-00803), filed concurrently herewith.

BACKGROUND

[0002] The invention relates generally to systems and methods for identifying and correlating multiple video content identifiers associated with specific video content. Additionally, the described systems and methods aggregate metadata associated with specific video content from one or more metadata sources.

[0003] Specific video content in a video repository may have an associated identifier that uniquely refers to this video. Such an identifier is usually referred to as a globally unique identifier (GUID). Examples of video repositories in this context include a video hosting and distribution website such as NetFlix, YouTube, or Hulu, a collection of DVD media, a collection of media files, or a peer-to-peer network.

[0004] Typically, GUIDs are specific to each video repository. For example, videos on YouTube have an associated uniform resource locator (URL) that is unique to that online video source. Similarly, files in the BitTorrent peer-to-peer network have a hash value computed from their content and used as an identifier of the file. A DVD can be uniquely identified by a hash value produced from the media files recorded on the disc (commonly referred to as the DVDid).

[0005] In addition to repositories of video content, there exist multiple repositories of video-related information (also referred to as video-related metadata). A few examples include: Wikipedia, containing detailed descriptions of movie plots and characters; the Internet Movie Database (IMDB), containing lists of actors performing in movies; OpenSubtitles, containing subtitles in different languages; the DVDXML database, containing information about DVDs; etc.

[0006] Many of these metadata repositories are available and indexed using different types of identifiers. For example, DVD-related information (e.g., cover art, list of chapters, title, etc.) can be retrieved using a DVDid. Subtitles in the OpenSubtitles database are indexed by moviehash, an identifier used in the BitTorrent network. Other information sources also use a hash value associated with a movie or other video program as an index to video-related information. For certain online information services, a URL is used as an index for video-related information.

[0007] Although there are multiple ways to identify different types of video content, these identifiers are not interchangeable. Metadata repositories use some types of identifiers to access the data specific to the typical content source (e.g., subtitles are associated with moviehash, DVD information with DVDid, etc.). However, an identifier for one information source will not typically work to identify video content available from another information source.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1 shows an example environment in which the systems and methods discussed herein can be applied.

[0009] FIG. 2 shows an example universal video lookup system capable of implementing the procedures discussed herein.

[0010] FIG. 3 is a flow diagram showing an embodiment of a procedure for retrieving video content identifiers and metadata associated with specific video content.

[0011] FIG. 4 is a flow diagram showing an embodiment of a procedure for determining correspondence between multiple video sequences.

[0012] FIG. 5 is a flow diagram showing an embodiment of a procedure for identifying, retrieving, and aligning subtitle information with a movie played from a DVD.

[0013] FIG. 6A shows examples of spatial alignment of video data and temporal alignment of video data.

[0014] FIG. 6B shows an example context representation using video genomics.

[0015] FIG. 7 shows an example procedure for the formation of video DNA.

[0016] FIG. 8 shows an example comparison between biological DNA and video DNA.

[0017] FIG. 9 is a flow diagram showing an embodiment of a procedure for constructing video DNA.

[0018] FIG. 10 shows an example of dividing a video sequence into temporal intervals.

[0019] FIG. 11 is a flow diagram showing an embodiment of a procedure for frame-based feature detection.

[0020] FIG. 12 is a flow diagram showing an embodiment of a procedure for feature tracking to find consistent features.

[0021] FIG. 13 is a flow diagram showing an embodiment of a procedure for feature track pruning.

[0022] FIG. 14 is a flow diagram showing an embodiment of a procedure for finding spatio-temporal correspondence between two video DNA sequences.

[0023] FIG. 15 shows an example overview of the video DNA generation process.

[0024] FIG. 16 shows an example of how video features are processed during video DNA generation.

[0025] FIG. 17 shows an example of how video feature descriptors are binned into a standardized library (visual vocabulary) of feature descriptors.

[0026] FIG. 18 shows an example of how video is segmented into various short multiple-frame intervals or "snippets" during the video DNA creation process.

[0027] FIG. 19 shows an example of how a video can be indexed and described by its corresponding video DNA.

[0028] FIG. 20 illustrates an example of the video signature feature detection process.

[0029] FIG. 21 shows an example of the video signature feature tracking and pruning process.

[0030] FIG. 22 shows an example of video signature feature description.

[0031] FIG. 23 shows an example of a vector quantization process.

[0032] FIG. 24 shows an example of video DNA construction.

[0033] FIG. 25 shows an example system for processing video data as described herein.

[0034] Throughout the description, similar reference numbers may be used to identify similar elements.

DETAILED DESCRIPTION

[0035] The systems and methods described herein identify and correlate multiple video content identifiers associated with specific video content. Additionally, the described systems and methods aggregate metadata associated with specific video content from one or more metadata sources. For example, these systems and methods identify multiple versions of a particular video program, regardless of transcoding format, aspect ratio, commercial advertisements, altered program length, and so forth. The correlation of video content is performed in spatial and/or temporal coordinates.

[0036] FIG. 1 shows an example environment 100 in which the systems and methods discussed herein can be applied. Environment 100 includes multiple video sources 102, 104 and 106 coupled to a network 108. Video sources 102, 104 and 106 can be any type of video storage repository, peer-to-peer network, computer system, or other system capable of storing, retrieving, streaming, or otherwise providing video content. Example video sources include a repository of movies available for downloading or streaming, a peer-to-peer network that supports the exchange of video content, a personal computer system that contains video content, a DVD player, a Blu-ray Disc(TM) player, a digital video recorder, a game console, YouTube, NetFlix, BitTorrent, and the like. Example video content includes movies, television programs, home videos, computer-generated videos and portions of full-length movies or television programs.

[0037] Network 108 is a data communication network implemented using any communication protocol and any type of communication medium. In one embodiment, network 108 is the Internet. In other embodiments, network 108 is a combination of two or more networks coupled to one another. Network 108 may be accessed via a wired or wireless communication link.

[0038] Environment 100 also includes multiple metadata sources 110, 112 and 114 coupled to network 108. Metadata sources 110, 112 and 114 can be any type of metadata storage repository, computer system, or other system capable of storing, retrieving, or otherwise providing metadata related to video content. The video-related metadata includes information such as cover art for a DVD, subtitle information, video content title, actors in a video program, a narrative summary of the video content, an author/producer of the video content, viewer comments associated with video content, and the like. Example metadata sources include web-based services such as the dvdxml.com database, the opensubtitles.org database, the Internet Movie Database (IMDB), YouTube user comments, and the like. The dvdxml.com database contains information about DVD versions of movies, such as title and key actors. The opensubtitles.org database contains subtitle files in a variety of different languages for various movies.

[0039] Additionally, environment 100 includes a universal video lookup system 116 and a video device 118, both of which are coupled to network 108. As described herein, universal video lookup system 116 identifies and correlates multiple video content identifiers associated with specific video content. Additionally, universal video lookup system 116 aggregates metadata associated with specific video content from one or more metadata sources 110, 112 and 114. Additional details regarding universal video lookup system 116 components and operation are provided herein. Video device 118 is capable of receiving and processing video content, for example, to display on one or more display devices (not shown). Specific examples of video device 118 include a computer, set top box, satellite receiver, DVD player, Blu-ray Disc(TM) player, digital video recorder, game console and the like.

[0040] Although FIG. 1 shows three video sources 102, 104 and 106, and three metadata sources 110, 112 and 114, a particular environment 100 may include any number of video sources and any number of metadata sources. Additionally, a particular environment 100 may include any number of video devices 118, universal video lookup systems 116, and other devices or systems (not shown) coupled to one another through network 108.

[0041] Although the various components and systems shown in FIG. 1 are coupled to network 108, one or more components or systems can be coupled to universal video lookup system 116 via another network, communication link, and the like.

[0042] FIG. 2 shows an example of universal video lookup system 116 that is capable of implementing the procedures discussed herein. Universal video lookup system 116 includes a communication module 202, a processor 204, a video analysis module 206 and a storage device 208. Communication module 202 communicates data and other information between universal video lookup system 116 and other devices, such as video sources, metadata sources, video devices, and so forth. Processor 204 performs various operations necessary during the operation of universal video lookup system 116. For example, processor 204 is capable of performing several methods and procedures discussed herein to process video content identifiers and metadata associated with the video content. Video analysis module 206 performs various video processing and video analysis operations as discussed herein. For example, video analysis module 206 is capable of identifying content contained within the video data being displayed. Storage device 208 stores data and other information used during the operation of universal video lookup system 116. Storage device 208 may include one or more volatile and/or non-volatile memories. In a particular embodiment, storage device 208 includes a hard disk drive combined with volatile and non-volatile memory devices.

[0043] Universal video lookup system 116 also includes a video DNA module 210, a correspondence module 212, a hash calculation module 214, and an advertisement management module 216. Video DNA module 210 identifies objects within video content to find correspondence between different video sequences. Correspondence module 212 analyzes multiple video sequences to find spatial and/or temporal correspondence between two or more of the video sequences. Hash calculation module 214 calculates a hash function associated with video content or a segment of the video content. A hash function is an algorithm that converts a large amount of data (such as a media file) into a smaller "hash value". Hash values are often used as an index to a table or other collection of data. Advertisement management module 216 performs various advertisement-related functions, as discussed herein. For example, advertisement management module 216 is capable of selecting among multiple advertisements for insertion into video content based on various factors.
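To make the role of the hash calculation module concrete, the following is a minimal Python sketch of how a file-based identifier of the kind described above might be computed. The function name and the choice of hashing the file size together with the first and last megabyte of data are illustrative assumptions, not the actual algorithm used by any particular repository or by the moviehash scheme.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # hash 1 MB from the head and the tail of the file

def file_based_id(path):
    """Illustrative file-based hash: digest of file size plus first/last chunk.

    A hash like this identifies the same file across peers independently of
    its name, but changes whenever the file bytes change (re-encoding, edits).
    """
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))          # first chunk
        if size > CHUNK:
            f.seek(size - CHUNK)
            h.update(f.read(CHUNK))      # last chunk
    return h.hexdigest()

# Example: the resulting hex string can serve as one type of video content
# identifier (VCI) and as an index into a metadata table.
# print(file_based_id("movie.mp4"))
```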

[0044] Although not shown in FIG. 2, the components of universal video lookup system 116 communicate with one another via one or more communication links, such as buses, within the universal video lookup system 116. In particular embodiments, various modules shown in FIG. 2 (such as video analysis module 206, video DNA module 210, correspondence module 212, hash calculation module 214 and advertisement management module 216) represent computer-readable instructions that are executed, for example, by processor 204.

[0045] FIG. 3 is a flow diagram showing an embodiment of a procedure 300 for retrieving video content identifiers and metadata associated with specific video content. Initially, procedure 300 receives a request for information associated with specific video content (block 302). This request is received, for example, from video device 118 shown in FIG. 1. The requested information may include metadata associated with the video content or other versions of the video content. Procedure 300 continues by identifying a video content identifier associated with the specific video content (block 304). The request may include a video content identifier associated with the video content, a link to the video content, or a portion of the video content itself. If the request does not include a video content identifier, the procedure identifies an associated video content identifier based on the specific video content, as discussed below.

[0046] Procedure 300 continues by retrieving metadata associated with the specific video content based on the video content identifier (block 306). The retrieved metadata can be associated with the entire video content, associated with a specific time interval in the video content, or associated with a spatio-temporal object in the video content. This metadata can be retrieved from any number of metadata sources. For example, one metadata source provides subtitle information, another metadata source includes actor information, and a third metadata source includes reviews and user ratings of the associated video content. The procedure then translates the video content identifier associated with the specific video content into other video content identifiers that correspond to the previously identified video content identifier (block 308). The translation of the video content identifier into corresponding video content identifiers may have been previously performed and stored in a database, table, or other data structure for future retrieval. Various procedures can be utilized to translate the video content identifier into corresponding video content identifiers, as discussed herein. For example, the video content identifier can be translated into another video content identifier associated with the entire video content. Alternatively, the video content identifier can be translated into a second video content identifier that refers to a specific time interval within the video content. In another implementation, the video content identifier is translated into a second video content identifier that refers to a spatio-temporal object in the video content. An example table that stores pre-computed video content identifiers has a first column of video identifiers and a second column that contains video content identifiers associated with video content similar to (or identical to) video content associated with the video identifiers in the first column of the table.

[0047] The procedure continues by retrieving metadata associated with each of the other video content identifiers (block 310). Thus, metadata associated with various versions of the same video content is retrieved from any number of metadata sources. Finally, procedure 300 provides the metadata as well as information regarding the corresponding video content to the requesting device (block 312). This information allows the requesting device to display some or all of the metadata to a user of the requesting device. Additionally, the requesting device can display all available versions of the video content to the user, thereby allowing the user to select the desired version to display.
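As a concrete illustration of blocks 306-310, the sketch below models the pre-computed translation table and the per-source metadata lookups in Python. The table contents, source names, and function signatures are hypothetical; they stand in for whatever databases a real deployment of the described system would query.

```python
# Hypothetical pre-computed translation table: each video content identifier
# (VCI) maps to equivalent identifiers for the same content in other sources.
TRANSLATION_TABLE = {
    ("youtube_url", "http://example.org/watch?v=abc"): [
        ("dvdid", "DVD-1234"),
        ("moviehash", "8e245d9679d31e12"),
    ],
}

# Hypothetical per-source metadata stores, indexed by identifier type.
METADATA_SOURCES = {
    "dvdid":     {"DVD-1234": {"title": "Example Movie", "chapters": 24}},
    "moviehash": {"8e245d9679d31e12": {"subtitles": ["en", "es", "ru"]}},
}

def lookup(vci):
    """Blocks 306-310: retrieve metadata for the given VCI and for every
    corresponding VCI found in the translation table, then aggregate."""
    results = []
    for id_type, value in [vci] + TRANSLATION_TABLE.get(vci, []):
        metadata = METADATA_SOURCES.get(id_type, {}).get(value)
        if metadata is not None:
            results.append({"identifier": (id_type, value), "metadata": metadata})
    return results  # block 312: aggregated result returned to the requesting device

print(lookup(("youtube_url", "http://example.org/watch?v=abc")))
```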
[0048] FIG. 4 is a flow diagram showing an embodiment of a procedure 400 for determining correspondence between multiple video sequences associated with the same video program. The multiple video sequences are typically different versions of the same video program. The different versions may have different aspect ratios, different transcoding formats, or different broadcast versions (e.g., a full-length movie version without commercials and an edited version for television broadcast that includes commercials). Initially, procedure 400 identifies a first video sequence associated with a video program (block 402). The first video sequence may represent all or a portion of the video program. Next, the procedure identifies a second video sequence associated with the same video program (block 404). As mentioned above, the first and second video sequences are different versions of the same video program.

[0049] Procedure 400 continues by calculating a correspondence between the first video sequence and the second video sequence (block 406). This correspondence may include temporal coordinates and/or spatial coordinates. Various systems and methods are available to calculate the correspondence between the first and second video sequences. Calculating temporal correspondence between two video sequences is particularly important when the video sequences need to be synchronized in time. For example, if the subtitles contained in one video sequence are to be displayed with the video content of a second video sequence, the subtitles should be displayed at the correct time within the second video sequence. Calculating the temporal correspondence in this situation provides the appropriate synchronization.

[0050] Calculating spatial correspondence between two video sequences is particularly useful when identifying geometric information in the video sequences, such as identifying various objects in a scene. Different versions of a video program can have different resolutions, aspect ratios, and the like, such that a spatial correspondence is necessary to interchange between the two programs. For example, spatial correspondence provides for the interchange of metadata related to different types of content and different versions of the video.

[0051] The procedure of FIG. 4 continues by determining an alignment (temporally and/or spatially) of the first video sequence and the second video sequence (block 408). Alternate embodiments align the first video sequence and the second video sequence by aligning audio-based data segments.

[0052] The procedure then stores the calculated correspondence between the first video sequence and the second video sequence (block 410). The correspondence information is stored for later use in correlating the two video sequences without requiring recalculation of the correspondence. Finally, the procedure stores the information regarding the alignment of the first video sequence and the second video sequence (block 412). The alignment information is stored for future use in aligning the two video sequences without repeating the determination of the alignment of the video sequences.

[0053] The procedure of FIG. 4 determines correspondence between two particular video sequences associated with the same video program. Procedure 400 is repeated for each pair of video sequences, thereby pre-calculating the correspondence information and creating a database (or other data structure) containing the various correspondence information. The database of correspondence information is useful in accelerating operation of universal video lookup system 116 by avoiding calculation of correspondence information that is already contained in the database. This database of correspondence information may be contained within universal video lookup system 116 or accessible to the universal video lookup system (and other systems) via network 108 or other data communication link. In a particular embodiment, the pre-calculated correspondence information is offered as a data service to multiple systems and devices, such as video devices 118.

[0054] FIG. 5 is a flow diagram showing an embodiment of a procedure 500 for identifying, retrieving, and aligning subtitle information with a movie played from a DVD. In the example of FIG. 5, a request is received for specific subtitle information associated with a particular movie available on DVD (block 502). For example, a user may want to watch a movie on DVD, but wants subtitle information in a particular language, such as the Russian language. The DVD may include subtitle information in English and Spanish, but not Russian. Using the described universal video lookup system 116 and procedure 500, the user is able to watch the desired movie on DVD with Russian subtitles.

[0055] Procedure 500 continues by identifying a first DVD identifier associated with the DVD (block 504). For example, the first DVD identifier may be a DVDid or a hash value resulting from performing a hash function on a portion of the video content on the DVD. In particular embodiments, identifiers associated with video content, such as the content stored on a DVD, can be identified based on a file name associated with the video content, a file-based hash value, or a content-based hash value. An example of a file name identifier is a URL used with fixed video repositories where the associated video content is stored permanently and not changed. A file-based hash value is useful in peer-to-peer networks to identify the same file across multiple users independently of changes to the name of the file. One example of a file-based hash value is Moviehash. A content-based hash value, such as the video DNA discussed herein, analyzes the video content itself to identify the video. Thus, a content-based hash value is invariant to the file name, encoding process, processing of the video content, or editing of the video content.

[0056] After identifying the first DVD identifier, the procedure identifies a video sequence within the movie stored on the DVD (block 506). The identified video sequence may be a pre-determined portion of the movie, such as the first 30 seconds of the movie, or any other video sequence within the movie. Procedure 500 then identifies a second DVD identifier associated with the DVD based on the identified video sequence (block 508). The second DVD identifier is selected based on the additional information (e.g., video metadata) desired. In this example, the second DVD identifier is selected based on the type of indexing used to identify information in a source of subtitles. Thus, the second DVD identifier is used to find and retrieve Russian subtitle information in the subtitle source associated with the specific DVD selected by the user (block 510).

[0057] The procedure then identifies a correspondence between the DVD movie timeline and the movie timeline associated with the subtitle source (block 512). The subtitle information is then mapped to the DVD movie by aligning the two timelines (block 514). This correspondence and alignment is necessary to temporally synchronize the display of the subtitles with the appropriate movie content. In a particular embodiment of procedure 500, the correspondence between the DVD movie timeline and the movie timeline associated with the subtitle source is pre-computed and stored in a database or other data structure. In other embodiments, the correspondence information identified at block 512 is calculated when needed, then optionally stored for future reference.
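The timeline mapping of blocks 512-514 can be pictured with the short sketch below. It assumes, purely for illustration, that the temporal correspondence has already been reduced to a piecewise-linear mapping between the subtitle source timeline and the DVD timeline; real correspondences produced by the alignment procedures described herein may be denser or contain gaps.

```python
import bisect

# Hypothetical piecewise-linear correspondence: (t_subtitle_source, t_dvd) pairs
# in seconds, as produced by a temporal alignment of the two versions.
CORRESPONDENCE = [(0.0, 0.0), (600.0, 615.0), (3600.0, 3660.0)]

def to_dvd_time(t_src):
    """Map a timestamp on the subtitle source timeline onto the DVD timeline
    by interpolating between the surrounding correspondence points."""
    times = [p[0] for p in CORRESPONDENCE]
    i = max(1, min(bisect.bisect_left(times, t_src), len(times) - 1))
    (t0, d0), (t1, d1) = CORRESPONDENCE[i - 1], CORRESPONDENCE[i]
    alpha = (t_src - t0) / (t1 - t0)
    return d0 + alpha * (d1 - d0)

def retime_subtitles(subtitles):
    """Block 514: shift each (start, end, text) subtitle entry onto the DVD timeline."""
    return [(to_dvd_time(start), to_dvd_time(end), text)
            for start, end, text in subtitles]

print(retime_subtitles([(599.0, 601.5, "Privet!")]))
```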
[0058] A particular implementation of the described systems and methods includes both a metadata component and a correspondence component. The metadata component is associated with a video content identifier, such as a YouTube URL or a hash function. The metadata component is also associated with a spatio-temporal coordinate, such as (x, y, t) in the video sequence corresponding to this video content identifier. In the spatio-temporal coordinate (x, y, t), x and y correspond to the two spatial dimensions and t corresponds to the time dimension. A particular metadata component is denoted m(VCI; x, y, t), where "VCI" represents the video content identifier. For different types of VCIs, the metadata may be located in different metadata sources.

[0059] In certain situations, the metadata is sequence-level data (e.g., a movie title), so the metadata component m(VCI; x, y, t) = m(VCI). In this situation, the spatial and temporal data is not relevant to the movie title.

[0060] In other situations, the metadata is interval-level data (e.g., movie subtitles), so the metadata component m(VCI; x, y, t) = m(VCI; t). In this situation, the temporal alignment of the subtitle information with the video content is important, but the spatial data is not necessary.

[0061] Finally, metadata about objects contained in the video requires both spatial and temporal coordinates.

[0062] The correspondence component between two different video content identifiers is expressed as a function (x2, y2, t2) = c(VCI1, VCI2; x1, y1, t1), where (x1, y1, t1) and (x2, y2, t2) are spatio-temporal coordinates in the video sequences corresponding to VCI1 and VCI2, respectively. Correspondence can be established between video content from different sources having VCIs of different types (e.g., a YouTube video clip and a DVD) or between video content with VCIs of the same type, such as different editions of the same movie on DVD.

[0063] Universal video-related lookup is performed, given (VCI0, x0, y0, t0), in two stages. First, the system finds VCIs that have correspondence to VCI0, denoted in this example as c(VCI0, VCI1; x0, y0, t0), c(VCI0, VCI2; x0, y0, t0), ..., c(VCI0, VCIN; x0, y0, t0). These multiple correspondences translate the coordinates (x0, y0, t0) into (x1, y1, t1), (x2, y2, t2), ..., (xN, yN, tN). Next, the system retrieves the metadata m(VCI1; x1, y1, t1), m(VCI2; x2, y2, t2), ..., m(VCIN; xN, yN, tN). For a particular metadata component, the system may retrieve the entire metadata or a portion of the metadata. For example, if the metadata contains all information regarding a movie, the system may only want to display the title and summary information for the movie, so the system only retrieves that portion of the metadata.

[0064] As discussed herein, universal video-related lookup computes the correspondence between two video sequences. In certain situations, spatio-temporal correspondence is computed. In other situations, temporal correspondence or spatial correspondence is sufficient. When spatio-temporal correspondence is necessary, an embodiment of universal video lookup system 116 first computes the temporal correspondence between the two video sequences, then computes the spatial correspondence between the same two video sequences.
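The two-stage lookup of paragraphs [0062]-[0064] can be expressed compactly in code. The following Python sketch treats the correspondence components c and the metadata components m as plain dictionaries of functions; the identifiers and the simple time-offset correspondences are illustrative assumptions, not data from any actual repository.

```python
# Correspondence components c(VCI0, VCIk): functions mapping a spatio-temporal
# coordinate (x, y, t) in the VCI0 sequence to a coordinate in the VCIk sequence.
# Here they are simple offsets and scalings, purely for illustration.
CORRESPONDENCES = {
    ("dvd:1234", "youtube:abc"):  lambda x, y, t: (x, y, t - 42.0),
    ("dvd:1234", "torrent:8e24"): lambda x, y, t: (x * 0.75, y * 0.75, t),
}

# Metadata components m(VCI; x, y, t). The second entry is interval-level
# metadata keyed only by time, as in the subtitle example m(VCI; x, y, t) = m(VCI; t).
def subtitle_at(t):
    return "example subtitle line" if 0 <= t < 5400 else None

METADATA = {
    "youtube:abc":  lambda x, y, t: {"comments_near_t": int(t) // 60},
    "torrent:8e24": lambda x, y, t: {"subtitle": subtitle_at(t)},
}

def universal_lookup(vci0, x0, y0, t0):
    """Stage 1: translate (x0, y0, t0) through every known correspondence.
    Stage 2: retrieve metadata at the translated coordinates."""
    results = {}
    for (src, dst), corr in CORRESPONDENCES.items():
        if src != vci0:
            continue
        xk, yk, tk = corr(x0, y0, t0)                 # stage 1
        metadata_fn = METADATA.get(dst)
        if metadata_fn is not None:
            results[dst] = metadata_fn(xk, yk, tk)    # stage 2
    return results

print(universal_lookup("dvd:1234", 100.0, 50.0, 1800.0))
```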
In general, this problem is so computationally ter version as the reference). In a particular example, a DVD complex (complexity level NP-complete), as to be impracti version of a movie is used as the master and all other versions cal to compute. This is because without further simplification, (such as YouTube versions and BitTorrent versions) are the computing system will try to find matching between all aligned with the timeline of the DVD. the possible subsets of pixels between the first and the second 0067 Various systems and methods can identify, corre sequences, which is a very large number of operations. late, track, match, and align video frames and video 0073 However, the matching problem can be greatly sim sequences. A particular embodiment for performing these plified if the problem is split into two separate processes: types of functions is discussed below. Video data includes temporal matching and spatial matching. Here the problem of spatio-temporal data, containing two spatial dimensions and spatial matching is more complex because the video frames one temporal dimension (i.e., the two dimensional video are two dimensional, and thus a large number of two dimen images and the time sequence of the different video frames). sional comparisons must be made. In contrast, the one-di We distinguish between temporal and spatial correspondence mensional temporal matching problem, although still com of two different video frames. Temporal correspondence is plex, is enough simpler that one-dimensional (temporal) performed at the time granularity of the time between differ signals can be matched very efficiently using the video DNA ent video frames: the video sequences are regarded as one or video genomics dynamic programming methods discussed dimensional ordered sequences of frames, and the matching herein. produces a correspondence between the frames in the two 0074 FIG. 6A shows examples of spatial alignment of sequences. Spatial correspondence is performed at a Sub Video data and temporal alignment of video data. At a first frame granularity, finding matching between corresponding stage 600 of FIG. 6A, temporal matching is performed (this pixels or regions of pixels “things' within two frames in the step is discussed in more detail below). Temporal matching Sequences. produces the correspondence between the temporal coordi 0068. The correspondence and similarity problems are nate “t in a subset of the first video sequence and the tem intimately related, and usually computing one problem poral coordinate “t' in a subset of the second video sequence. allows one to infer that the other problem is also being com By performing temporal matching, weavoid the need to try to puted. For example, we can define the similarity as the perform two dimensional spatial matching between all the amount of corresponding parts of the video. Conversely, if we possible Subsets of pixels in the video sequences (essentially have a criterion of similarity between the different parts of the a three dimensional matching problem). Rather, the problem Video sequences, we can define a correspondence that maxi is reduced in size so that the spatial matching must now only mizes this part-wise similarity. be performed between the small subsets of temporally corre 0069. Here we want to distinguish between two types of sponding portions of the video sequences. In other words, for similarity: semantic and visual. 
“Visual similarity of two the spatial matching, a large 3D matching problem is turned objects implies that they “look similar', i.e., their pixel rep into a much smaller 2D matching problem between relatively resentation is similar. “Semantic' similarity implies that the small sets of 2D video frames. For example, instead of trying concepts represented by the two objects are similar. Semantic to match the “apple' series of pixels “thing from the entire similarity defines much wider equivalence classes than visual upper video sequence into a corresponding “apple' thing in similarity. For example, a truck and a Ferrari are visually the entire lower video sequence, now just the Small number of dissimilar, but semantically similar (both represent the con frames in “sequence A” and “sequence B' which are most cept of a vehicle). As a rule, visual similarity is easier to relevant are examined. US 2009/0259633 A1 Oct. 15, 2009

0075 Typically, one of the video sequences is a short 0082 Scale invariant feature transform (SIFT), query, and thus the size of the temporally corresponding described in D. G. Lowe. “Distinctive image features portions of the video sequences is Small, which greatly from Scale-invariant keypoints.” International Journal of reduces the problem of spatial matching, discussed below. At Computer Vision, 2004; a second stage 602 of FIG. 6A, spatial matching between the 0.083 Motion vectors obtained by decoding the video temporally corresponding video data is performed. Spatial Stream; matching produces the correspondence between the spatial 0084 Direction of spatio-temporal edges; coordinates (x, y) and (x, y) in the temporally matching 0085 Distribution of color; portions (e.g., frames) of the first and second sequences. I0086) Description of texture; 0076. In the described systems and methods, the matching 0.087 Coefficients of decomposition of the pixels in can be made more robust and invariant to distortions and edits Some known dictionary, e.g., of wavelets, curvelets, etc. of the video content. In particular, the temporal matching can 0088 Specific objects known a priori. be made to be invariant to temporal edits of the video I0089 Extending this idea to video data, we can abstract a sequences. Spatial matching can be made to be invariant to Video sequence into a three-dimensional structure of features spatial distortions and edits of the video sequences (for (two spatial dimensions formed by the various 2D images, example, the different aspect ratio of the apple, different and one time dimension formed by the various video frames). lighting, and the background of different fruits shown in FIG. This 3D structure can be used as the basic building blocks of 6A). a representation of the video sequence. 0077. It should be understood that the methods described 0090. As previously discussed, it can be extremely useful herein are normally carried out in a computer system contain to think about video analysis problems in biological terms, ing at least one processor (often a plurality of processors will and draw insight and inspiration from bioinformatics. Here, for example, it is useful to think of the features as "atoms', the be used), and memory (often megabytes or gigabytes of feature abstraction of the various video frames in a video as a memory will be used). Processors suitable for implementing “nucleotide', and the video itself as being like an ordered the methods of the present invention will often be either sequence of nucleotides, such as a large DNA or RNA mol general purpose processors, such as x86, MIPS, Power, ARM, ecule. or the like, or they may be dedicated image interpretation 0091. The spatial and the temporal dimensions in the processors, such as video processors, digital signal proces video sequence have different interpretations. Temporal sors, field programmable gate arrays, and the like. The meth dimension can be thoughofas ordering of the video data—we ods described herein may be programmed in a high level can say that one feature comes before another. If we divide the language, such as “C, C+”, java, Perl, Python, and the like, Video sequence into temporal intervals, we can consider it as programmed in a lower level assembly language, or even an ordered sequence of “video elements', each of which embedded directly into dedicated hardware. The results of contains a collection of features. 
As previously discussed, this analysis may be stored in either Volatile memory, such as here we consider the video data to be an ordered sequence of RAM, or in non-volatile memory such as flash memory, hard Smaller nucleotides, and we considera video signal to be also drives, CD, DVD, Blue-ray disks, and the like. composed of a string of “nucleotide-like' video subunits, 0078 Visual information (video images) can be repre called video DNA. sented by means of a small number of “points of interest'. 0092 Drawing upon inspiration from DNA sequence also called “features”. Typically, features are points that are analysis, the systems and methods can represent a video both easily detectable in the image in a way that is invariant to as three-, two- and one-dimensional signals. Considering the various image modifications. A “feature' in an image entire set of feature points, we have a thee-dimensional (Spa includes both the coordinates of the “point of interest as well tio-temporal) structure. Considering the sequence of tempo as a “descriptor which typically describes the local image ral intervals, we obtain a one-dimensional representation. Considering one frame in the sequence, we obtain a two content or environment around the “point of interest'. Fea dimensional representation. The same representation is used tures are often chosen for their ability to persist even if an to carry out the temporal and spatial matching stages. An image is rotated, presented with altered resolution, presented example two-stage matching approach follows. with different lighting, etc. 0093. At the first stage, a temporal representation of the 0079 A feature is often described as a vector of informa Video sequences is created. Each video sequence is divided tion associated with a spatio-temporal subset of the video. For into temporal intervals. Here a temporal interval is usually not example, a feature can be the 3D direction of a spatio-tem just a single video frame, but rather is often a series of at least poral edge, local direction of the motion field, color distribu several video frames (e.g., 3 to 30 frames) spanning a fraction tion, etc. Typically, local features provide a description of the of a second. Temporal intervals are discussed in greater detail object, and global features provide the context. For example, herein. an 'apple' object in a computer advertisement and an “apple' 0094 For each time interval, the actual video image is object in an image of various fruits may have the same local abstracted into a representation (also referred to herein as a features describing the object, but the global context will be visual nucleotide) containing just the key features in this different. interval. This series of features is then further abstracted and 0080 For example, local features may include: compressed by discarding the spatio-temporal coordinates of 0081 Harris comer detector and its variants, as the various features. For example, we just start counting dif described in C. Harris and M. Stephens, “A combined ferent types of features. In other words, we only keep track of comer and edge detector, Proceedings of the 4th Alvey the feature descriptors, and how many different types of fea Vision Conference, 1988: ture descriptors there are. US 2009/0259633 A1 Oct. 15, 2009

0095. Each time division of the video signal (which we will call a "nucleotide" in analogy to a biological nucleotide) is represented as an unordered collection or "bag" of features (or a bag of feature descriptors). Thus, if each feature is considered to be a "visual atom", the "bag of features" that represents a particular video time interval can be called a "nucleotide". The representations of the various video time intervals (visual nucleotides) are then arranged into an ordered "sequence" or map (video DNA). In this discussion, we will generally use the term "nucleotide" rather than "bag of features" because it helps guide thinking towards a useful bioinformatic approach to video analysis procedures.

0096. The video map/video DNAs corresponding to two video sequences can be aligned in much the same way that DNA sequences can be compared and aligned. In DNA sequence analysis, one of the central problems is trying to find the alignment which gives the best correspondence between subsets of the two DNA sequences by maximizing the similarity between the corresponding nucleotides and minimizing the gaps. In the systems and methods described herein, algorithms similar to those used in bioinformatics for DNA sequence alignment can be used for aligning two different video signals.

0097. After two portions of video media are matched by the first stage, additional image analysis can be done. For example, at the second stage, the spatial correspondence between temporally corresponding subsets of the video sequences can be found. That is, "things" (pixel groups) shown in a first video can be matched with "things" shown in a second video. More specifically, we can now look for spatial correspondence between the contents of two temporally corresponding video image frames.

0098. In this later second stage, we do not discard the spatio-temporal coordinates of the features. Rather, in this second stage each frame is represented as a two-dimensional structure of features, and we retain the feature coordinates. For this second-stage purpose of spatial matching of frames and comparing the contents of the video frames, more standard feature-based algorithms, previously used in the computer vision literature, can now be used.

0099. For object recognition, and other applications where object-based analysis is required, the "video genomics" approach offers significant advantages over prior art methods, including the following. First, the systems and methods described herein offer a higher discriminative power than standalone object descriptors. This discriminative power is due to the discriminative power of the object descriptors themselves as well as the temporal support, i.e., the time sequence of these descriptors. Although some existing methods teach that the best discrimination is obtained when a large number of precisely optimized features are used, we have found that this is not the case. Surprisingly, we have found that when the systems and methods described herein are compared on a head-to-head basis with prior art techniques, it turns out that the temporal support (i.e., the time order in which various feature groups appear) is more important for discriminative power than a very large number of different descriptors. For example, increases in accuracy in object description are usually desirable. The prior art "brute force" way to increase accuracy would be to simply use more and more features and feature descriptors, but since each feature and feature descriptor is computationally intensive to produce, this prior art "brute force" approach rapidly reaches a point of diminishing returns due to high computational overhead.

0100. However, we have found that an increase of accuracy of object description that would otherwise require a prior art increase of the visual vocabulary size by two orders of magnitude (increasing computational overhead by nearly two orders of magnitude as well) can be easily matched by the described systems and methods using a computationally less intense process. Using the systems and methods described herein, to improve accuracy we avoid increasing the number of feature descriptors, and instead improve accuracy by an increase in the time resolution of the analysis. This is done by simply adding two more "nucleotides" (i.e., using slightly smaller time divisions in the video analysis) to the "video DNA" sequences being compared. By avoiding a drastic increase in the number of features, the systems and methods can achieve high accuracy, yet can be much more efficient from a computational overhead standpoint.

0101. Prior art approaches, such as J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos", approached video as a collection of images and thus had to use feature "vocabularies" of very large size (up to millions of elements) in order to obtain high descriptive power. By contrast, the described use of temporal support gives equal or better results using much smaller feature vocabularies (hundreds or thousands of elements), with a corresponding large increase in computational efficiency.

0102. A second advantage is that for content-based retrieval applications, the described systems and methods allow retrieval of both an object of interest and the context in which the object appears. The temporal sequence can be considered as additional information describing the object, in addition to the description of the object itself.

0103. FIG. 6B shows an example of the same object (an apple 610) appearing in two different contexts: Fruits 612 and Computers 614. In the first case, the Apple object appears in a sequence with a Banana and a Strawberry, which places the object in the context of Fruits. In the second case, the Apple object appears in sequence with a Laptop and an iPhone, which places the object in the context of Computers. Here, the systems and methods are sophisticated enough to recognize these context differences. As a result, the video map/video DNA representation in these two cases will be different, despite the fact that the object itself is the same.

0104. By contrast, prior art approaches, such as Sivic and Zisserman, do not take into consideration the context of the video content, and thus are unable to distinguish between the two different instances of the apple object in the above example.

0105. A third advantage is that the described "video genomics" approach allows for performing partial comparison and matching of video sequences in many different ways. Just as methods from bioinformatics allow different DNA sequences to be compared, two different video DNA sequences can be matched despite having some dissimilar video frames (nucleotides), insertions or gaps. This is especially important when invariance to video alterations such as temporal editing is required, for example when the video DNAs of a movie and its version with inserted advertisements need to be matched correctly.
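By way of illustration only, and not as part of the disclosed embodiments, the following Python sketch shows one plausible way to realize the "bag of features" idea described above: the feature descriptors of each time interval are quantized against a visual vocabulary, counted into a histogram (a visual nucleotide), and the per-interval histograms are arranged into an ordered video DNA sequence. The function names, the vocabulary size and the descriptor dimensions are assumptions made for the example.

```python
import numpy as np

def visual_nucleotide(descriptors, vocabulary):
    """Quantize each descriptor to its nearest vocabulary element (a "visual atom")
    and count occurrences into a histogram (the "visual nucleotide")."""
    hist = np.zeros(len(vocabulary))
    for f in descriptors:
        atom = int(np.argmin(np.linalg.norm(vocabulary - f, axis=1)))
        hist[atom] += 1
    return hist

def video_dna(intervals, vocabulary):
    """Arrange per-interval nucleotides into an ordered sequence (the video DNA)."""
    return [visual_nucleotide(descs, vocabulary) for descs in intervals]

# Toy usage: three intervals of random 64-dimensional descriptors and a
# hypothetical 1000-element visual vocabulary.
rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(1000, 64))
intervals = [rng.normal(size=(150, 64)) for _ in range(3)]
dna = video_dna(intervals, vocabulary)
```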

0106. FIG. 7 presents a conceptual scheme of an example creation of the video map/video DNA representation of a video sequence. The procedure consists of the following stages. At a first stage 702, a local feature detector is used to detect points of interest in the video sequence. Suitable feature detectors include the Harris corner detector disclosed in C. Harris and M. Stephens, "A combined corner and edge detector", Alvey Vision Conference, 1988; the Kanade-Lucas algorithm, disclosed in B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision", 1981; or the SIFT scale-space based feature detector, disclosed in D. G. Lowe, "Distinctive image features from scale-invariant keypoints", IJCV, 2004.

0107. The points of interest can be tracked over multiple video frames to prune insignificant or temporally inconsistent (e.g., appearing for too short a time period) points. This will be discussed in more detail later. The remaining points are then described using a local feature descriptor, e.g., SIFT, based on a local distribution of gradient directions, or the Speeded Up Robust Features (SURF) algorithm, described in H. Bay, T. Tuytelaars and L. van Gool, "Speeded Up Robust Features", 2006. The descriptor is represented as a vector of values.

0108. The feature detection and description algorithms should be designed in such a way that they are robust or invariant to spatial distortions of the video sequence (e.g., change of resolution, compression noise, etc.). The spatio-temporal feature locations and the corresponding feature descriptors constitute the most basic representation level of the video sequence.

0109. At a second stage 704, the video sequence is segmented into temporal intervals 706 which often span multiple individual video frames (often 3 to 30 frames). Such segmentation can be done, for example, based on the feature tracking from the previous stage. It should be noted that the segmentation is ideally designed to be rather invariant to modifications of the video such as frame rate change. Another way is to use time intervals of fixed size with some time overlap. At a third stage 708, the features in each temporal interval are aggregated. As previously discussed, the spatio-temporal locations (feature coordinates) at this stage are not used. Rather, the information in the temporal interval is described using a "bag of features" approach 710.

0110. Here, similar to Sivic and Zisserman, all the feature descriptors are represented using a visual vocabulary (a collection of representative descriptors obtained, for example, by means of vector quantization). Each feature descriptor is replaced by the corresponding closest element in the visual vocabulary. As previously discussed, features represented in this way are also referred to herein as visual atoms. Continuing this analogy, the visual vocabulary can be thought of as a "periodic table" of visual elements.

0111. Unlike the prior art approach of Sivic and Zisserman, however, here we discard the spatial coordinates of the features, and instead represent the frequency of appearance of different visual atoms in the temporal interval as a histogram (group or vector), which is referred to as a "representation", "visual nucleotide", "nucleotide" and occasionally "bag of features" 710. Here a visual nucleotide 712 is essentially the "bag of features" created by discarding the spatial coordinates and just counting frequency of occurrence (this process is referred to as a "bag function" or "grouping function") that represents a certain number of video frames from the video. If a standardized set of visual elements is used to describe the contents of each "bag", then a visual nucleotide can be represented mathematically as a histogram or sparse vector. For example, if the "bag of features" describing several video images contains 3 cases of feature 1, 2 cases of feature 2, and 0 cases of feature 3, then the visual nucleotide or "bag" that describes these video images can be represented as the histogram or vector (3, 2, 0). In the illustrated example, the visual nucleotide is represented as the histogram or vector (0, 0, 0, 4, 0, 0, 0, 0, 0, 5, 0).

0112. The "bag of features" representation allows for invariance to spatial editing: if the video sequence is modified by, for example, overlaying pixels over the original frames, the new sequence will consist of a mixture of features (one part of old features belonging to the original video and another part of new features corresponding to the overlay). If the overlay is not very significant in size (i.e., most of the information in the frame belongs to the original video), it is possible to correctly match two visual nucleotides by requiring only a certain percentage of feature elements in the respective "bags" (i.e., sparse vectors) to coincide.

0113. Finally, all the visual nucleotides (or feature bags) are aggregated into an ordered sequence referred to as a video map or video DNA 714. Each representation (or visual nucleotide, "bag", histogram or sparse vector) can be thought of as a generalized letter over a potentially infinite alphabet, and thus the video DNA is a generalized text sequence.

0114. The temporal matching of two video sequences can be performed by matching the corresponding video DNAs using a variety of different algorithms. These can range from very simple "match/no match" algorithms, to bioinformatics-like "dot matrix" algorithms, to very sophisticated algorithms similar to those used in bioinformatics for matching of biological DNA sequences. Examples of some of these more complex bioinformatics algorithms include the Needleman-Wunsch algorithm, described in S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins", 1970; the Smith-Waterman algorithm, described in T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences", 1981; and heuristics such as the Basic Local Alignment Search Tool (BLAST), described in S. F. Altschul et al., "Basic Local Alignment Search Tool", 1990.

0115. Often, a suitable sequence matching algorithm will operate by defining a matching score (or distance) representing the quality of the match between two video sequences. The matching score comprises two main components: the similarity (or distance) between the nucleotides, and a gap penalty, expressing to the algorithm the criteria about how critical it is to try not to "tear" the sequences by introducing gaps.

0116. In order to do this, the distance between a nucleotide in a first video and a corresponding nucleotide in a second video must be determined by some mathematical process. That is, how similar is the "bag of features" from the first series of frames of one video to the "bag of features" from a second series of frames of a second video? This similarity value can be expressed as a matrix measuring how similar or dissimilar the two nucleotides are. In a simple example, it can be a Euclidean distance or correlation between the vectors (bags of features) representing each nucleotide. If one wishes to allow for partial similarity (which frequently occurs, particularly in cases where the visual nucleotides may contain different features due to spatial edits), a more complicated metric with weighting or rejection of outliers should be used. More complicated distances may also take into consideration the mutation probability between two nucleotides: two different nucleotides are more likely similar if they are likely to be a mutation of each other. As an example, consider a first video with a first sequence of video images, and a second video with the same first sequence of video images and a video overlay. Clearly many video features (atoms or elements) in the bag describing the first video will be similar to many video features in the bag describing the second video, and the "mutation" here is those video features that are different because of the video overlay.
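As a concrete, simplified illustration of the nucleotide comparisons just described, the following Python sketch (offered for illustration only; the names, the 80% threshold and the toy vectors are assumptions) computes a plain Euclidean distance between two nucleotide histograms, together with a relaxed partial-match test that tolerates features added by an overlay.

```python
import numpy as np

def euclidean_distance(h1, h2):
    """Simple nucleotide distance: Euclidean distance between two histograms."""
    return float(np.linalg.norm(h1 - h2))

def partial_match(h1, h2, fraction=0.8):
    """Relaxed criterion: declare a match when at least `fraction` of the atoms
    in the smaller bag also occur in the other bag, so an overlay that adds
    extra features to one nucleotide does not prevent the match."""
    common = np.minimum(h1, h2).sum()
    smaller = min(h1.sum(), h2.sum())
    return smaller > 0 and common / smaller >= fraction

h1 = np.array([3, 2, 0, 1])
h2 = np.array([3, 2, 5, 1])   # same content plus an overlay contributing atom type 3
print(euclidean_distance(h1, h2), partial_match(h1, h2))
```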

0117. The gap penalty is a function accounting for the introduction of gaps between the nucleotides of a sequence. If a linear penalty is used, it is simply given as the number of gaps multiplied by some pre-set constant. More complicated gap penalties may take into consideration the probability of appearance of a gap, e.g., according to the statistical distribution of advertisement positions and durations in the content.

0118. The following discussion identifies example similarities and differences between biological DNA and video DNA. Because the systems and methods discussed herein essentially transform the problem of matching corresponding portions of different video media into a problem that bears some resemblance to the problem of matching biological DNA sequences, some insight can be obtained by examining this analogy in more detail. Since the DNA sequence matching art is in a comparatively advanced state of development relative to the video matching art, the systems and methods have the unexpected result of showing how a number of advanced DNA bioinformatics methodology techniques can be unexpectedly applied to the very different field of matching video signals.

0119. As previously discussed, at the conceptual level there is a strong similarity between the structure of biological DNA and the described video DNA methods. A biological DNA is a sequence composed of nucleotides, the same way as video DNA is composed of visual nucleotides (bags of features from multiple video frames). A nucleotide in biology is a molecule composed of atoms from a periodic table, the same way as a visual nucleotide is a bag of features composed of visual atoms (i.e., features) from the visual vocabulary (usually a standardized pallet of different features).

0120. FIG. 8 graphically shows the reason for the name "video DNA" by showing the analogy between an abstracted video signal 800 and the structure of a biological DNA molecule and its constituents (nucleotides and atoms) 802. Despite the conceptual similarity, there are many specific differences between biological and video DNA. First, the size of the periodic table of atoms that appear in biological molecules is small, usually including only a few elements (e.g., Carbon, Hydrogen, Oxygen, Phosphorus, Nitrogen, etc.). In video DNA, the size of the visual vocabulary of features (atoms) is typically at least a few thousand and up to a few million visual elements (features). Second, the number of atoms in a typical nucleotide molecule is also relatively small (tens or hundreds), whereas the number of "visual atoms" (features) in a visual nucleotide (bag of features) is typically hundreds or thousands. Also, whereas in a biological nucleotide the spatial relationship between atoms is important, for a video nucleotide this relationship (i.e., the feature coordinates) between features is deemphasized or ignored.

0121. Third, the number of different nucleotides in biological DNA sequences is small, usually four ("A", "T", "G", "C") nucleotides in DNA sequences and twenty in protein sequences. By contrast, in video DNA, each visual nucleotide is a "bag of features" usually containing at least hundreds or thousands of different features, and which can be represented as a histogram or vector. Thus, if a set or pallet of, for example, 500 or 1000 standardized features is used as a standard video analysis option, each "bag of features" would be a histogram or vector composed of the coefficients of how many times each one of these 500 or 1000 standardized features appeared in the series of video frames described by the "nucleotide" or "bag of features", so the number of permutations of this bag, each of which can potentially represent a different video nucleotide, is huge.

0122. These factual differences make video DNA matching only similar in its spirit to biological sequence matching. In some aspects the video matching problem is more difficult, and in some respects it is easier. More specifically, the matching algorithms are different in the following aspects.

0123. First, in biological sequences, since the number of different nucleotides is small, the score of matching two nucleotides can be represented as a simple "match"/"don't match" result. That is, a biological nucleotide can be an "A", "T", "G" or "C", and there either is an "A" to "A" match, or there is not. By contrast, each nucleotide in video DNA is itself an array, histogram, vector or "bag of features" that often will have hundreds or thousands of different coefficients, and thus the matching operation is more complex. Thus, for video DNA, we need to use a more general concept of a "score function" or "distance function" between nucleotides. This score can be thought of as some kind of distance function between histograms or vectors. In other words, how far apart are any two different "bags of features"?

0124. Otherwise, many other concepts, such as homology scores, insertions, deletions, point mutations, and the like, have a remarkable resemblance between these two otherwise very different fields.

0125. In one embodiment, the video DNA of an input video sequence is computed as depicted in FIG. 9. The process of video DNA computation receives video data 900 and includes the following stages: feature detection 1000, feature description 2000, feature pruning 3000, feature representation 4000, segmentation into temporal intervals 5000 and visual atom aggregation 6000. The output of the process is a video DNA 6010. Some of the stages may be performed in different embodiments or not performed at all. The following description details different embodiments of the above stages of video DNA computation.

0126. As shown in FIG. 10, the video sequence is divided into a set of temporal (time) intervals. FIG. 10 shows that in one embodiment, the video time intervals 1020 are of fixed duration (e.g., 1 second) and non-overlapping. In another embodiment, time intervals 1022 have some overlap. Here each video nucleotide could be composed from as many video frames as are present in one second (or a subset of this), which depending upon the frame rate might be 10, 16, 24, 30 or 60 frames, or some subset of this.

0127. In another embodiment, the intervals are set at the locations of shot (scene) cuts or abrupt transitions in the content of two consecutive frames (identified by reference numeral 1024). It is possible to use the result of tracking to determine the shot cuts in the following way: at each frame, the number of tracks disappearing from the previous frame and new tracks appearing in the current frame is computed. If the number of disappearing tracks is above some threshold, and/or the number of new tracks is above some other threshold, the frame is regarded as a shot cut. If shot or scene cuts are used, a video nucleotide could be composed of as many video frames as are in the shot or scene, and this could be as high as hundreds or even thousands of video frames if the scene is very long. In another embodiment, the intervals are of constant duration and are resynchronized at each shot cut (identified by reference numeral 1026).
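The track-based shot-cut rule in the preceding paragraph maps directly to a few lines of code. The following Python sketch is illustrative only; the threshold values and the representation of per-frame track sets are assumptions, not values taken from the disclosure.

```python
def detect_shot_cuts(tracks_per_frame, disappear_thresh=20, appear_thresh=20):
    """Flag frame t as a shot cut when the number of tracks disappearing from
    frame t-1 and/or the number of new tracks appearing at frame t exceeds a
    threshold. `tracks_per_frame` is a list of sets of track identifiers."""
    cuts = []
    for t in range(1, len(tracks_per_frame)):
        prev, curr = tracks_per_frame[t - 1], tracks_per_frame[t]
        if len(prev - curr) > disappear_thresh or len(curr - prev) > appear_thresh:
            cuts.append(t)
    return cuts

# Toy usage: an abrupt change of the track population between frames 2 and 3.
frames = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {50, 51, 52}, {50, 51, 52, 53}]
print(detect_shot_cuts(frames, disappear_thresh=2, appear_thresh=2))
```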

0128. Feature detection (FIG. 9, 1000). A feature detector is operated on the video data 900, producing a set of N invariant feature point locations {(x_i, y_i, t_i)}, i = 1, ..., N (denoted by 1010 in FIG. 9), where x, y and t are the spatial and temporal coordinates of the feature point, respectively. Feature detection step 1000 is shown in more detail in FIG. 11, which shows one embodiment of this method. Feature detection 1000 is performed on a frame basis. For a frame at time t, a set of N_t features {(x_i, y_i, t)}, i = 1, ..., N_t, is located. Typical features have the form of two-dimensional edges or corners. Standard algorithms for invariant feature point detection described in the computer vision literature can be used. Such algorithms may include, for example, the Harris corner detector, the scale-invariant feature transform (SIFT), the Kanade-Lucas tracker, etc.

0129. Typical values of N_t range between tens and thousands. In particular embodiments, values of N_t = 100, 200, ..., 1000 are used. In another embodiment, the value of N_t is pre-set and is a result of the feature detection algorithm used. In another embodiment, the feature detection is performed on spatio-temporal data, producing a set {(x_i, y_i, t_i)}, i = 1, ..., N. Three-dimensional versions of standard feature detection algorithms may be used for this purpose.

0130. Feature description (FIG. 9, 2000). For each feature point detected at feature description stage 2000, a feature descriptor is computed, producing a set of feature descriptors {f_i}, i = 1, ..., N (denoted by 2010 in FIG. 9), corresponding to the feature points. A feature descriptor is a representation of the local video information in the neighborhood of the feature point. Many feature descriptors used in the computer vision literature (e.g., SIFT or SURF feature descriptors) compute a local histogram of directed edges around the feature point. Typically, a feature descriptor can be represented as a vector of dimension F, i.e., f_i is a vector in R^F. For example, for the SIFT feature descriptor F = 128, and for the SURF feature descriptor F = 64.

0131. In a particular embodiment, the feature descriptors are computed on a frame basis, meaning that they represent the pixels in the spatial neighborhood of a feature point within one frame. Standard feature descriptors such as SIFT or SURF can be used in this case. In another embodiment, the feature descriptors are spatio-temporal, meaning that they represent the pixels in the spatio-temporal neighborhood. A three-dimensional generalization of standard feature descriptors can be used in this case.

0132. Feature pruning (FIG. 9, step 3000). At this stage, among all the features, a subset 3010 of consistent features is found. In different embodiments, consistency may imply spatial consistency (i.e., that the feature point does not move abruptly and its position in nearby temporal locations is similar), temporal consistency (i.e., that a feature does not appear or disappear abruptly), or spatio-temporal consistency (a combination of the above).

0133. In one embodiment, tracking is performed for finding consistent features, as shown in FIG. 12. A feature tracking algorithm 3100 tries to find sets of features consistently present in a sufficiently large contiguous sequence of frames, thus removing spurious features detected in a single frame. Such spurious features are known to arise, for example, from specular reflections, and their removal improves the accuracy and discriminative power of the description of the visual content in a frame.

0134. In one embodiment, frame-based tracking is used. This type of tracking tries to find correspondence between two sets of features {(x_i, y_i, t)}, i = 1, ..., N_t, and {(x'_j, y'_j, t')}, j = 1, ..., N_t', in frames t and t', where usually t' = t + 1/fps, fps being the frame rate. In other embodiments, tracking is performed between multiple frames at the same time.

0135. The output of the tracker 3100 is a set of T tracks 3110, each track representing a trajectory of a feature through space-time. A track can be represented as a set of indices of feature points belonging to this track. In one of the embodiments, a track is a set of indices of the form T_k = {(i, t)}, t = t1, ..., t2, implying that a set of points {(x_i, y_i, t)} belongs to the track; t1 and t2 are the temporal beginning and end of the track, and t2 - t1 is its temporal duration. Determining the tracks may be based on feature similarity (i.e., the features belonging to the track have similar descriptors), motion (i.e., the locations of the feature points do not change significantly along the track), or both. Standard algorithms for feature tracking used in the computer vision literature can be used.

0136. The consistency of the resulting tracks is checked and track pruning 3200 is performed. In one embodiment, tracks of duration below some threshold are pruned. In another embodiment, tracks manifesting high variance of spatial coordinates (abrupt motions) are pruned. In another embodiment, tracks manifesting high variance of the feature descriptors of the feature points along them are pruned. The result of pruning is a subset T' of the tracks.

0137. In one of the embodiments, a set of features {(x_i, y_i, t)} and the corresponding descriptors {f_i} are computed at the beginning of a shot t, and the tracker is initialized to x_i(t) = x_i, y_i(t) = y_i; a Kalman filter is used to predict the feature locations x_i(t'), y_i(t') in the next frame t'. The set of features {(x'_j, y'_j, t')} with the corresponding descriptors {f'_j} is computed in the frame t' = t + dt. Each feature (x'_j, y'_j, f'_j) is matched against the subset of the features (x_i, y_i, f_i) lying in a circle with a certain radius centered at x_i(t'), y_i(t'), and the match with the closest descriptor is selected. When no good match is found for a contiguous sequence of frames, the track is terminated. Only features belonging to tracks of sufficient temporal duration are preserved.

0138. In one embodiment, the Kalman filter is used with a constant velocity model, and the estimated feature location covariance determines the search radius in the next frame.

0139. One of the embodiments of feature pruning based on tracking, previously shown in FIG. 12 (block 3200), is shown in more detail in FIG. 13. Inputting the feature locations 1010, the corresponding feature descriptors 2010 and the tracks of features 3110, for each track the track duration "d", motion variance "mv" and descriptor variance "dv" are computed. These values go through a set of thresholds and a decision rule, rejecting tracks with too small a duration and too large a variance. The result is a subset of features 3010 belonging to tracks that survived the pruning.

0140. One of the possible decision rules leaving the track is expressed as: (d >= th_d) AND (mv <= th_mv) AND (dv <= th_dv), where th_d is a duration threshold, th_mv is the motion variance threshold, and th_dv is the descriptor variance threshold.

0141. Feature representation (FIG. 9, block 4000): Returning to FIG. 9, block 4000 shows that the features on tracks remaining after pruning undergo representation using a visual vocabulary. The result of this stage is a set of visual atoms 4010. The visual vocabulary is a collection of K representative feature descriptors (visual elements), denoted here by {e_n}, n = 1, ..., K. The visual vocabulary can be pre-computed, for example, by collecting a large number of features in a set of representative video sequences and performing vector quantization on their descriptors. In different embodiments, values of K = 1000, 2000, 3000, ..., 1000000 are used.
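The track-pruning decision rule above can be expressed compactly in code. The following Python sketch is illustrative only; the threshold values and the particular way duration and variances are computed from a track are assumptions.

```python
import numpy as np

def keep_track(points, descriptors, th_d=5, th_mv=100.0, th_dv=0.5):
    """Keep a track only if its duration d is at least th_d frames, its motion
    variance mv is at most th_mv, and its descriptor variance dv is at most
    th_dv, mirroring (d >= th_d) AND (mv <= th_mv) AND (dv <= th_dv)."""
    d = len(points)                                            # duration in frames
    mv = float(np.var(np.asarray(points), axis=0).sum())      # motion variance
    dv = float(np.var(np.asarray(descriptors), axis=0).sum()) # descriptor variance
    return (d >= th_d) and (mv <= th_mv) and (dv <= th_dv)

def prune_tracks(tracks):
    """Return the subset of (points, descriptors) tracks that survive pruning."""
    return [trk for trk in tracks if keep_track(*trk)]
```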

0142. Each feature i is replaced by the number l_i of the element from the visual vocabulary which is the closest to the descriptor of feature i. In one of the embodiments, a nearest neighbor algorithm is used to find the representation of feature i:

l_i = argmin_{n = 1, ..., K} ||f_i - e_n||,

where ||.|| is a norm in the descriptor space. In another embodiment, an approximate nearest neighbor algorithm is used. As a result, feature i is represented as (x_i, y_i, t_i, l_i), referred to as a visual atom.

0143. In one embodiment, prior to representation of a feature in a visual vocabulary, a representative feature is found for each track. It can be obtained by taking a mean, median or majority vote of the descriptors of the features along a track. In one of the embodiments, non-discriminative features are pruned. A non-discriminative feature is a feature which is approximately equally distant from multiple visual atoms. Such features can be determined by considering the ratio between the distances from the first and second closest neighbors.

0144. Visual atom aggregation (6000): For each temporal interval computed at FIG. 9 block 5000, the visual atoms within it are aggregated into visual nucleotides. The resulting sequence of visual nucleotides (video DNA 6010) is the output of the process. A visual nucleotide s is created as a histogram with K bins (K being the visual vocabulary size), the nth bin counting the number of visual atoms of type n appearing in the time interval.

0145. In one embodiment, the histogram in the interval [t_s, t_e] is weighted by the temporal location of a visual atom within the interval, so that the nth bin is computed as

h_n = sum over the visual atoms of type n in the interval of w(t_i),

where w(t) is a weight function and h_n is the value of the nth bin in the histogram. In one embodiment, the weight is set to its maximum value in the center of the interval, decaying towards the interval edges, e.g., according to the Gaussian formula

w(t) = exp( -(t - (t_s + t_e)/2)^2 / (2*sigma^2) ).

In another embodiment, shot cuts within the interval [t_s, t_e] are detected, and w(t) is set to zero beyond the boundaries of the shot to which the center (t_s + t_e)/2 of the interval belongs.

0146. In a particular embodiment, the bins of the histogram are further weighted in order to reduce the influence of unreliable bins. For example, the weight of the nth bin is inversely proportional to the typical frequency of the visual atom of type n. This type of weighting is analogous to inverse document frequency (tf-idf) weighting in text search engines.

0147. In another embodiment, the weight of the nth bin is inversely proportional to the variance of the nth bin computed on representative content under typical mutations, and directly proportional to the variance of the nth bin on the same content.

0148. Once the video DNA has been computed for at least two video sequences, these different video sequences can then be matched (aligned) as to time, as described below. In one embodiment, the temporal correspondence between the query video DNA, represented as the sequence {q_i} of visual nucleotides, and a video DNA from the database, represented as the sequence {s_j} of visual nucleotides, is computed in the following way.

0149. In a matching between the two sequences, a nucleotide q_i is brought into correspondence either with a nucleotide s_j or with a gap between the nucleotides s_j and s_{j+1}; similarly, a nucleotide s_j is brought into correspondence either with a nucleotide q_i or with a gap between the nucleotides q_i and q_{i+1}. A matching between {q_i} and {s_j} can therefore be represented as a sequence of K correspondences {(i_k, j_k)}, k = 1, ..., K; a sequence of G gaps {(i_m, j_m, l_m)}, m = 1, ..., G, where (i, j, l) represents the gap of length l between the nucleotides s_j and s_{j+1} to which the sub-sequence {q_i, q_{i+1}, ..., q_{i+l}} corresponds; and a sequence of G' gaps {(i'_m, j'_m, l'_m)}, m = 1, ..., G', where (i', j', l') represents the gap of length l' between the nucleotides q_{i'} and q_{i'+1} to which the sub-sequence {s_{j'}, s_{j'+1}, ..., s_{j'+l'}} corresponds. A match is assigned a score according to the formula

S = Sum_{k=1..K} sigma(q_{i_k}, s_{j_k}) + Sum_{m=1..G} gamma(i_m, j_m, l_m) + Sum_{m=1..G'} gamma(i'_m, j'_m, l'_m),

where sigma(q, s) quantifies the score of the nucleotide q corresponding to the nucleotide s, and gamma(i, j, l) is the gap penalty.

0150. As previously discussed, many alternative algorithms may be used to compute the matching, ranging from simple to extremely complex. In one embodiment of the invention, the Needleman-Wunsch algorithm is used to find the matching by maximizing the total score S. In another embodiment, the Smith-Waterman algorithm is used. In yet another embodiment, the BLAST algorithm is used.

0151. In an alternate embodiment, the matching maximizing the total score S is done in the following way. In the first stage, good matches of a small fixed length W between the query and a sequence in the database are searched for. These good matches are known as seeds. In the second stage, an attempt is made to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. If a high-scoring ungapped alignment is found, the database sequence passes on to the third stage. In the third stage, a gapped alignment between the query sequence and the database sequence can be performed using the Smith-Waterman algorithm.

0152. In one embodiment of the invention, the gap penalty is linear, expressed by gamma(i, j, l) = alpha*l, where alpha is a parameter. In another embodiment, the gap penalty is affine, expressed by gamma(i, j, l) = beta + alpha*(l - 1), where beta is another parameter.
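To make the alignment machinery above concrete, the following Python sketch implements a minimal Needleman-Wunsch style dynamic program over two video DNA sequences with a pluggable nucleotide score function and a linear gap penalty. It is illustrative only and not the disclosed implementation: it returns just the total score, omits the traceback, and the inner-product score and the value of alpha are assumptions.

```python
import numpy as np

def inner_product_score(h1, h2):
    """A simple nucleotide score: inner product of the two histograms."""
    return float(np.dot(h1, h2))

def align_video_dna(q, s, score=inner_product_score, alpha=5.0):
    """Needleman-Wunsch style alignment of two video DNA sequences.
    q and s are lists of nucleotide histograms; each skipped nucleotide
    costs alpha (a linear gap penalty)."""
    M, N = len(q), len(s)
    D = np.zeros((M + 1, N + 1))
    D[1:, 0] = -alpha * np.arange(1, M + 1)
    D[0, 1:] = -alpha * np.arange(1, N + 1)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            D[i, j] = max(D[i - 1, j - 1] + score(q[i - 1], s[j - 1]),
                          D[i - 1, j] - alpha,   # gap in s
                          D[i, j - 1] - alpha)   # gap in q
    return D[M, N]   # total alignment score S
```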

0153. In an embodiment, the score function sigma(q_i, s_j) describes the similarity between the histogram h representing the nucleotide q_i and the histogram h' representing the nucleotide s_j. In another embodiment, the similarity is computed as the inner product <h, h'>. In alternate embodiments, the inner product is weighted by a vector of weights computed from training data to maximize the discriminative power of the score function. Alternatively, the score function sigma(q_i, s_j) is inversely proportional to the distance between the histogram h representing the nucleotide q_i and the histogram h' representing the nucleotide s_j. In other embodiments, the distance is computed as the L_p norm

||h - h'||_p = ( Sum_n (h_n - h'_n)^p )^(1/p).

0154. In a specific embodiment, the distance is the Kullback-Leibler divergence between the histograms. In other embodiments, the distance is the earth mover's distance between the histograms.

0155. In a particular implementation, the score function sigma(q_i, s_j) is proportional to the probability of a nucleotide s_j mutating into a nucleotide q_i by a spatial or temporal distortion applied to the underlying video sequence. This, in turn, can be expressed as the probability of the histogram h representing the nucleotide q_i being a mutation of the histogram h' representing the nucleotide s_j.

0156. In one example, the probability is estimated as

P(h | h') = Product_n P(h_n | h'_n),

where P(h_n | h'_n) is the probability that the nth bin of the histogram h' changes its value to h_n. The probabilities P(h_n | h'_n) are measured empirically on the training data, independently for each bin.

0157. In another example, Bayes' theorem is used to represent the score function sigma(q_i, s_j) as the probability

P(h' | h) = P(h | h') P(h') / P(h),

where P(h | h') is computed as explained previously, and P(h) and P(h') are expressed as

P(h) = Product_n P(h_n),  P(h') = Product_n P(h'_n),

where P(h_n) measures the probability of the nth bin of the histogram h assuming the value h_n, and is estimated empirically from the training data, independently for each bin.

0158. Often it is useful not only to find the overall frame or time alignment between two different videos, but also to find the alignment between a first "thing" (group of pixels) with one spatial alignment in one video and a second corresponding "thing" with a second spatial alignment in a second video. Alternatively, sometimes it is useful to compare videos that have been taken with different orientations and resolutions. For example, a user photographing a television screen using a handheld video taken with a cell phone may wish to determine exactly what television show or movie was being played. In both cases, it is useful to determine the spatial alignment between two different videos, as well as the time (frame number) alignment.

0159. In one embodiment of the present invention, the spatial correspondence between the visual nucleotide q_i, representing the temporal interval [t_s, t_e] in the query sequence, and the best matching visual nucleotide s_j, representing the temporal interval [t'_s, t'_e] in the database sequence, is computed in the following way.

0160. In this embodiment, a frame is picked out of the interval [t_s, t_e] and represented as a set of features {(x_i, y_i)} with the corresponding descriptors {f_i}. Another frame is picked out of the interval [t'_s, t'_e] and represented as a set of features {(x'_j, y'_j)} with the corresponding descriptors {f'_j}. A correspondence is found between the two sets in such a way that each f_i is matched to the closest f'_j. Insufficiently close matches are rejected. The corresponding points are denoted by {(x_k, y_k)}, {(x'_k, y'_k)}.

0161. Once this correspondence is found, a transformation T is found by minimizing

min_T Sum_k || T(x_k, y_k) - (x'_k, y'_k) ||.

0162. In one embodiment, the minimization is performed using a RANSAC (random sample consensus) algorithm. In another embodiment, the minimization is performed using the iteratively-reweighted least squares fitting algorithm. Often it will be useful to perform rotation, size, or distortion transformations.

0163. In one of the embodiments, the transformation T is of the form

T = [ cos(theta)   sin(theta)   u ]
    [ -sin(theta)  cos(theta)   v ]
    [ 0            0            1 ].

0164. In another embodiment, the transformation T is of the form

T = [ a*cos(theta)   a*sin(theta)   u ]
    [ -a*sin(theta)  a*cos(theta)   v ]
    [ 0              0              1 ].

0165. In another embodiment, the transformation T is of the form

T = [ a  b  u ]
    [ c  d  v ]
    [ 0  0  1 ].

0166. In another embodiment, the transformation T is a projective transformation.

0167. Finding of spatio-temporal correspondence between two sequences is depicted in FIG. 14. The process consists of the following stages:

0168. 1. Video DNA computation. Two sets of video data 900 and 901 are inputted into a video DNA computation stage 1410. Stage 1410 was shown in more detail in FIG. 9 as steps 1000, 2000, 3000 and 4000. This stage can be performed on-line, or pre-computed and stored.
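The histogram distances and scores discussed above are easy to state in code. The following Python sketch is illustrative only; the epsilon smoothing and the particular inverse-distance score are assumptions. It shows an L_p distance, a Kullback-Leibler divergence between normalized histograms, and a score that grows as the nucleotides get closer.

```python
import numpy as np

def lp_distance(h1, h2, p=2):
    """L_p norm of the difference between two nucleotide histograms."""
    return float(np.sum(np.abs(h1 - h2) ** p) ** (1.0 / p))

def kl_divergence(h1, h2, eps=1e-9):
    """Kullback-Leibler divergence between the histograms, normalized to
    probability distributions; eps avoids division by zero for empty bins."""
    p1 = (h1 + eps) / np.sum(h1 + eps)
    p2 = (h2 + eps) / np.sum(h2 + eps)
    return float(np.sum(p1 * np.log(p1 / p2)))

def inverse_distance_score(h1, h2):
    """A score inversely related to distance: larger when nucleotides are closer."""
    return 1.0 / (1.0 + lp_distance(h1, h2))
```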

0169. 2. Temporal matching. The resulting video DNAs 6010 and 6011 are inputted into a temporal alignment stage 1420, which computes a temporal correspondence 1425. The temporal correspondence is essentially a transformation from the temporal system of coordinates of the video data 900 to that of the video data 901.

0170. 3. Spatial matching. The temporal correspondence 1425 is used at stage 1430 for selection of temporally corresponding subsets of the video data 900 and 901. The selected subsets 1435 and 1436 of the video data 900 and 901, respectively, are inputted to a spatial alignment stage 1440, which computes a spatial correspondence 1445. The spatial correspondence is essentially a transformation from the spatial system of coordinates of the video data 900 to that of the video data 901.

0171. A particular example is discussed below, in which the video DNA of an input video sequence is computed as depicted in FIG. 9. The process of video DNA computation inputs video data 900 and includes the following stages: feature detection 1000, feature description 2000, feature pruning 3000, feature representation 4000, segmentation into temporal intervals 5000 and visual atom aggregation 6000. The output of the process is a video DNA 6010.

0172. Feature detection 1000: A SURF feature detector (described in "Speeded Up Robust Features", Proceedings of the 9th European Conference on Computer Vision, May 2006) is operated independently on each frame of the video sequence 900, producing a set of the N = 150 strongest invariant feature point locations (denoted by 1010 in FIG. 9) per each frame t.

0173. Feature description 2000: For each feature point detected at feature detection stage 1000, a 64-dimensional SURF feature descriptor is computed, as described in "Speeded Up Robust Features", Proceedings of the 9th European Conference on Computer Vision, May 2006.

0174. Feature pruning 3000: This is an optional step which is not performed in this example.

0175. Feature representation 4000: The features are represented in a visual vocabulary consisting of K = 1000 entries. The representative elements are computed using the approximate nearest neighbor algorithm described in S. Arya and D. M. Mount, "Approximate Nearest Neighbor Searching", Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 1993, 271-280. Only features whose distance to the nearest neighbor is below 90% of the distance to the second nearest neighbor are kept. The result of this stage is a set of visual atoms 4010.

0176. The visual vocabulary for the feature representation stage is pre-computed from a sequence of 750,000 feature descriptors obtained by applying the previously described stages to a set of assorted visual content serving as the training data. A k-means algorithm is used to quantize the training set into 1000 clusters. In order to alleviate the computational burden, the nearest neighbor search in the k-means algorithm is replaced by its approximate variant as described in S. Arya and D. M. Mount, "Approximate Nearest Neighbor Searching", Proc. 4th Ann. ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 1993, 271-280.

0177. Segmentation into temporal intervals 5000: The video sequence is divided into a set of fixed temporal intervals of fixed duration of 1 sec (see FIG. 10, reference numeral 1020).

0178. Visual atom aggregation 6000: For each temporal interval computed at stage 5000, the visual atoms within it are aggregated into visual nucleotides. The resulting sequence of visual nucleotides (video DNA 6010) is the output of the process. A visual nucleotide is created as a histogram with K = 1000 bins, the nth bin counting the number of visual atoms of type n appearing in the time interval.

0179. After the video DNA for two or more different videos is produced, the video DNA from these materials may then be checked for correspondence and matched as follows.

0180. Temporal matching (see FIG. 14, reference numeral 1420) can be performed using the SWAT (Smith-Waterman) algorithm with an affine gap penalty with the parameters alpha = 5 and beta = 3. The weighted score function used is the normalized weighted correlation

sigma(h, h') = ( Sum_{n=1..1000} w_n h_n h'_n ) / sqrt( ( Sum_{n=1..1000} w_n h_n^2 ) ( Sum_{n=1..1000} w_n h'_n^2 ) ).

0181. The weights w_n can be computed empirically. For that purpose, various training video sequences can be transformed using a set of random spatial and temporal deformations, including blurring, resolution, aspect ratio, and frame rate changes, and their video DNA can be computed. The variance of each bin in the visual nucleotides, as well as the variance of each bin in the corresponding visual nucleotides under the deformations, are estimated. For each bin n, the weight w_n is set to be the ratio between the latter two variances.

0182. Spatial matching (see FIG. 14, reference numeral 1440): The spatial alignment can be done between two corresponding 1 sec intervals of features representing the two sets of video data 900 and 901, where the correspondence is obtained from the previous temporal alignment stage 1420. For each feature in one interval, the corresponding feature in the other interval is found by minimizing the Euclidean distance between their respective descriptors. The output of the process is two sets of corresponding features {(x, y, t)}, {(x', y', t')}. Once the correspondence is found, a transformation of the form

T = [ a   b  u ]
    [ -b  c  v ]
    [ 0   0  1 ]

can be found between the corresponding sets using the RANSAC algorithm.

0183. Another way to view at least one aspect of the invention is that it is a method of spatio-temporal matching of digital video data that includes multiple temporally matching video frames. In this view, the method consists of the steps of: performing temporal matching on the digital video data that includes the plurality of temporally matching video frames to obtain a similarity matrix, where the matching represents each of the video frames using a representation that includes a matching score, a similarity component, and a gap penalty component, and the representation is operated upon using a local alignment algorithm (such as one based upon a bioinformatics matching algorithm, or other suitable algorithm); and performing spatial matching on the digital video data that includes the plurality of temporally matching video frames obtained using the similarity matrix. Here the step of performing spatial matching is substantially independent from the step of performing temporal matching.
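A small sketch of the weighted scoring and the empirical weight estimation described above follows. It is offered for illustration only: the normalization of the score and the direction of the variance ratio (variance across content divided by variance under deformations) are interpretations and assumptions, not verbatim from the disclosure.

```python
import numpy as np

def weighted_score(h1, h2, w):
    """Weighted, normalized correlation between two nucleotide histograms."""
    num = np.sum(w * h1 * h2)
    den = np.sqrt(np.sum(w * h1 ** 2) * np.sum(w * h2 ** 2))
    return float(num / den) if den > 0 else 0.0

def estimate_weights(clean, deformed, eps=1e-9):
    """Per-bin weights from paired training nucleotides: ratio of the variance
    across content to the variance introduced by the deformations, so bins that
    are stable under deformation but informative across content get large
    weights. `clean` and `deformed` are arrays of shape (num_examples, K)."""
    var_content = np.var(clean, axis=0)
    var_deform = np.var(clean - deformed, axis=0)
    return var_content / (var_deform + eps)
```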
0184. The above method could use a Needleman-Wunsch algorithm, a Smith-Waterman algorithm or a similar type of algorithm. The above method can also be implemented with a bioinformatics matching algorithm such as a basic local alignment search tool used to compare biological sequences, or a protein or nucleotide DNA sequencing-like algorithm.

0185. The above method may further include: performing local feature detection on the digital video data that includes the plurality of temporally matching video frames to detect points of interest; and using the points of interest to segment the digital video data that includes the plurality of temporally matching video frames into a plurality of temporal intervals; and wherein the steps of performing temporal matching and performing spatial matching operate upon the plurality of temporal intervals.

0186. In another aspect, the method may determine spatio-temporal correspondence between video data, and include steps such as: inputting the video data; representing the video data as ordered sequences of visual nucleotides; determining temporally corresponding subsets of video data by aligning sequences of visual nucleotides; computing spatial correspondence between temporally corresponding subsets of video data; and outputting the spatio-temporal correspondence between subsets of the video data.

0187. Types of input data: With respect to this other aspect, the video data may be a collection of video sequences, and can also be query video data and corpus video data, and can also comprise subsets of a single video sequence or modified subsets of a video sequence from the corpus video data. Still further, the spatio-temporal correspondence can be established between at least one of the subsets of at least one of the video sequences from the query video data and at least one of the subsets of at least one of the video sequences from the corpus video data. In a specific implementation, the spatio-temporal correspondence can be established between a subset of a video sequence from the query video data and a subset of a video sequence from the corpus video data.

0188. With respect to the query video data mentioned above, the query can contain modified subsets of the corpus video data, and the modification can be a combination of one or more of the following:
0189 frame rate change;
0190 spatial resolution change;
0191 non-uniform spatial scaling;
0192 histogram modification;
0193 cropping;
0194 overlay of new video content;
0195 temporal insertion of new video content.

0196. Nucleotide segmentation: In another variation, the described systems and methods can also have the video data segmented into temporal intervals, and one visual nucleotide can be computed for each interval.

0197. Interval duration: In another variation, the described systems and methods can also segment the video data into temporal intervals of constant duration or temporal intervals of variable duration. Temporal interval start and end times can also be computed according to the shot transitions in the video data. It is also noted that the temporal intervals may be non-overlapping or overlapping.

0198. Visual nucleotide computation: In another variation, the visual nucleotide (the term used, as mentioned previously, to describe the visual content in a temporal interval of the video data) can also be computed using the following steps:
0199 representing a temporal interval of the video data as a collection of visual atoms;
0200 constructing the nucleotide as a function of at least one of the visual atoms.

0201. With respect to this computation, the function may be a histogram of the appearance frequency of the features (visual atoms) in the temporal interval, or the function may be a weighted histogram of the appearance frequency of visual atoms in the temporal interval. If a weighted histogram, then the weight assigned to a visual atom can be a function of a combination of the following:
0202 the temporal location of the visual atom in the temporal interval;
0203 the spatial location of the visual atom in the temporal interval;
0204 the significance of the visual atom.

0205. Relative weight of different features or visual atoms in the nucleotide or "bag of features": In one implementation, the weight is constant over the interval (i.e., all features are treated the same). However, in other implementations the features may not all be treated equally. For example, in an alternative weighting scheme, the weight can be a Gaussian function with the maximum weight being inside the interval. The weight can also be set to a large value for the visual content belonging to the same shot as the center of the interval, and to a small value for the visual content belonging to different shots. Alternatively, the weight can be set to a large value for visual atoms located closer to the center of the frame, and to a small value for visual atoms located closer to the boundaries of the frame.

0206. Visual atom methods: As described previously, the visual atom describes the visual content of a local spatio-temporal region of the video data. In one implementation, representing a temporal interval of the video data as a collection of visual atoms can include the following steps:
0207 detecting a collection of invariant feature points in the temporal interval;
0208 computing a collection of descriptors of the local spatio-temporal region of the video data around each invariant feature point;
0209 removing a subset of invariant feature points and their descriptors;
0210 constructing a collection of visual atoms as a function of the remaining invariant feature point locations and descriptors.

0211. Feature detection methods: In addition to the feature detection methods previously described, the collection of invariant feature points in the temporal interval of the video data mentioned above may be computed using the Harris-Laplace corner detector, the affine-invariant Harris-Laplace corner detector, the spatio-temporal corner detector, or the MSER algorithm. If the MSER algorithm is used, it can be applied individually to a subset of frames in the video data or can be applied to a spatio-temporal subset of the video data. The descriptors of the invariant feature points mentioned above can also be SIFT descriptors, spatio-temporal SIFT descriptors, or SURF descriptors.
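The weighting options just listed (constant, Gaussian in time, shot-aware, center-of-frame) can all be expressed as a weight function applied when the nucleotide histogram is accumulated. The following Python sketch is illustrative only; the Gaussian width and the atom representation (time plus vocabulary index) are assumptions.

```python
import numpy as np

def gaussian_time_weight(t, t_start, t_end, sigma=0.25):
    """Weight largest near the interval center, decaying towards its edges."""
    center = 0.5 * (t_start + t_end)
    return float(np.exp(-((t - center) ** 2) / (2 * sigma ** 2)))

def weighted_nucleotide(atoms, t_start, t_end, K):
    """Weighted histogram of visual atoms; each atom is (time, vocab_index)."""
    hist = np.zeros(K)
    for t, idx in atoms:
        hist[idx] += gaussian_time_weight(t, t_start, t_end)
    return hist

# Toy usage: atoms near the middle of a 1-second interval dominate the bag.
atoms = [(0.1, 3), (0.5, 3), (0.9, 7)]
print(weighted_nucleotide(atoms, 0.0, 1.0, K=10))
```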

0212. Tracking methods: In some embodiments, computing the collection of descriptors mentioned above can include tracking of corresponding invariant feature points in the temporal interval of the video data, using methods such as:
0213 computing a single descriptor as a function of the descriptors of the invariant feature points belonging to a track;
0214 assigning the descriptor to all features belonging to the track.

0215. This computing of the function may be the average of the invariant feature point descriptors or the median of the invariant feature point descriptors.

0216. Feature pruning methods: In some embodiments, removing a subset of invariant feature points as mentioned above can include:
0217 tracking of corresponding invariant feature points in the temporal interval of the video data;
0218 assigning a quality metric for each track;
0219 removing the invariant feature points belonging to tracks whose quality metric value is below a predefined threshold.

0220. In some embodiments, the quality metric assigned for a track as mentioned above may be a function of a combination of the following:
0221 descriptor values of the invariant feature points belonging to the track;
0222 locations of the invariant feature points belonging to the track.

0223. The function may be proportional to the variance of the descriptor values or to the total variation of the invariant feature point locations.

0224. Visual atom construction: In some embodiments, constructing a collection of visual atoms mentioned above may also be performed by constructing a single visual atom for each of the remaining invariant feature points as a function of the invariant feature point descriptor. The function computation may include:
0225 receiving an invariant feature point descriptor as the input;
0226 finding a representative descriptor, from an ordered collection of representative descriptors, best matching the invariant feature point descriptor received as the input;
0227 outputting the index of the found representative descriptor.

0228. Finding a representative descriptor may be performed using a vector quantization algorithm or using an approximate nearest neighbor algorithm.

0229. Visual vocabulary methods: The ordered collection of representative feature descriptors (visual vocabulary) may be fixed and computed offline from training data, or may be adaptive and updated online from the input video data. In some cases, it will be useful to construct a standardized visual vocabulary that operates either universally over all video, or at least over large video domains, so as to facilitate standardization efforts for large video libraries and a large array of different video sources.

0230. Visual atom pruning methods: In some embodiments, constructing the collection of visual atoms mentioned above may be followed by removing a subset of visual atoms, and removing a subset of visual atoms may include:
0231 assigning a quality metric for each visual atom in the collection;
0232 removing the visual atoms whose quality metric value is below a predefined threshold.

0233. The threshold value may be fixed, or adapted to maintain a minimum number of visual atoms in the collection, or adapted to limit the maximum number of visual atoms in the collection. Further, assigning the quality metric may include:
0234 receiving a visual atom as the input;
0235 computing a vector of similarities of the visual atom to visual atoms in a collection of representative visual atoms;
0236 outputting the quality metric as a function of the vector of similarities. This function may be proportional to the largest value in the vector of similarities, proportional to the ratio between the largest value in the vector of similarities and the second-largest value in the vector of similarities, or a function of the largest value in the vector of similarities and the ratio between the largest value and the second-largest value in the vector of similarities.

0237. Sequence alignment methods: In some embodiments, the aligning of sequences of visual nucleotides mentioned above may include:
0238 receiving two sequences of visual nucleotides s = {s_1, ..., s_N} and q = {q_1, ..., q_M} as the input;
0239 receiving a score function sigma(s_i, q_j) and a gap penalty function gamma(i, j, n) as the parameters;
0240 finding the partial correspondence C = {(i_1, j_1), ..., (i_K, j_K)} and the collection of gaps G = {(l_1, m_1, n_1), ..., (l_L, m_L, n_L)} maximizing the functional

F(C, G) = Sum_{k=1..K} sigma(s_{i_k}, q_{j_k}) + Sum_{k=1..L} gamma(l_k, m_k, n_k);

0241 outputting the found partial correspondence C and the maximum value of the functional.

0242. Other alignment methods: As previously discussed, the maximization may be performed using the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, the BLAST algorithm, or may be performed in a hierarchical manner.

0243. Scoring methods: The score function mentioned above may be a combination of one or more functions of the form

s_i^T A q_j,

wherein A may be an identity matrix or a diagonal matrix.

0244. The score may also be proportional to the conditional probability P(q_j | s_i) of the nucleotide q_j being a mutation of the nucleotide s_i, and the mutation probability may be estimated empirically from training data.
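The ratio-based quality metric for visual atoms described above resembles the classical nearest-to-second-nearest test. The following Python sketch is illustrative only; the particular similarity definition (a decreasing function of descriptor distance) and the threshold value are assumptions.

```python
import numpy as np

def atom_quality(descriptor, representatives):
    """Quality metric from the vector of similarities of a visual atom to a
    collection of representative atoms: the ratio of the largest to the
    second-largest similarity, so ambiguous atoms score close to 1."""
    sims = 1.0 / (1.0 + np.linalg.norm(representatives - descriptor, axis=1))
    top = np.sort(sims)[-2:]          # [second-largest, largest]
    return float(top[1] / top[0])

def prune_atoms(descriptors, representatives, threshold=1.1):
    """Remove visual atoms whose quality metric falls below the threshold."""
    return [f for f in descriptors if atom_quality(f, representatives) >= threshold]
```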

0245. The score may also be proportional to the ratio of probabilities

P(q_j | s_i) / P(q_j),

and the mutation probability may be estimated empirically from training data.

0246. Distance-based scoring methods: Further, the score function may be inversely proportional to a distance function d(s_i, q_j), and the distance function may be a combination of at least one of the following:
0247 L1 distance;
0248 Mahalanobis distance;
0249 Kullback-Leibler divergence;
0250 Earth Mover's distance.

0251. Weighting schemes: In addition to the weighting schemes previously described, the diagonal elements of the matrix A may be proportional to

log(1 / E_i),

where E_i denotes the expected number of times that a visual atom i appears in a visual nucleotide. E_i may be estimated from training video data or from the input video data. And the diagonal elements of the matrix A may be proportional to

V'_i / V_i,

where V_i is the variance of the visual atom i appearing in mutated versions of the same visual nucleotide, and V'_i is the variance of the visual atom i appearing in any visual nucleotide. Further, V_i and V'_i may be estimated from training video data.

0252. Gap penalty methods: In some embodiments, the gap penalty can be a parametric function of the form gamma(i, j, n, theta), where i and j are the starting positions of the gap in the two sequences, n is the gap length, and theta are parameters. The parameters may be estimated empirically from the training data, and the training data may consist of examples of video sequences with inserted and deleted content. Further, the gap penalty may be a function of the form gamma(n) = a + b*n, where n is the gap length and a and b are parameters. Still further, the gap penalty may be a convex function, or inversely proportional to the probability of finding a gap of length n starting at positions i and j in the two sequences.

0253. Spatial correspondence methods: Methods of computing spatial correspondence may include:
0254 inputting temporally corresponding subsets of video data;
0255 providing feature points in subsets of video data;
0256 finding correspondence between feature points;
0257 finding correspondence between spatial coordinates.

0258. Temporally corresponding subsets of video data may be at least one pair of temporally corresponding frames. Further, finding correspondence between feature points may further include:
0259 inputting two sets of feature points;
0260 providing descriptors of feature points;
0261 matching descriptors.

0262. The feature points may be the same as used for video nucleotide computation, and the descriptors may be the same as used for video nucleotide computation.

0263. Also, finding correspondence between feature points may be performed using a RANSAC algorithm, or may consist of finding parameters of a model describing the transformation between two sets of feature points, wherein finding the parameters of a model may be performed by solving the following optimization problem:

min_theta Sum_k || T_theta(x_k, y_k) - (x'_k, y'_k) ||,

where {(x_k, y_k)} and {(x'_k, y'_k)} are two sets of feature points and T_theta is a parametric transformation between sets of points depending on parameters theta.

0264. The correspondence between spatial coordinates may be expressed as a map between the spatial system of coordinates (x, y) in one subset of video data and the spatial system of coordinates (x', y') in another subset of video data.

0265. Output methods: The output spatio-temporal correspondence between subsets of video data may be represented as a map between the spatio-temporal system of coordinates (x, y, t) in one subset and the spatio-temporal system of coordinates (x', y', t') in another subset.

0266. An example of the video DNA generation process is shown in FIG. 15. Here, a local feature detector is applied in a frame-wise manner to the various image frames of the video sequence (1500). This feature detector finds points of interest (1502), also referred to as "feature points", in the video sequence. As previously discussed, many different types of feature detectors may be used, including the Harris corner detector (C. Harris and M. Stephens, "A combined corner and edge detector", Alvey Vision Conference, 1988), the Kanade-Lucas algorithm (B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision", 1981), SIFT scale-space based feature detectors (D. G. Lowe, "Distinctive image features from scale-invariant keypoints", IJCV, 2004) and others. Generally, this feature detection algorithm is designed in such a way that the feature descriptors are robust or invariant to spatial distortions of the video sequence (e.g., change of resolution, compression noise, etc.). In order to reduce transient noise and focus on the most useful features, the features are often tracked over multiple frames (1504), and features that appear for too short a period are deleted or pruned (1506).

0267. The next stage of the video DNA generation process is shown in FIG. 16. Here FIG. 16 shows a detail of one video image frame, where the dots in the frame (1502) correspond to image features that have been detected. Here the feature points remaining after feature pruning (1600) are then described using a local feature descriptor. This feature descriptor generates a second type of vector that represents the local properties (local neighborhood) (1602) of the video frame around a feature point (1600). As previously discussed, many different algorithms can be used to describe the properties of the video image frame around a feature point. These algorithms can include a local histogram of edge directions, the scale invariant feature transform (SIFT), and the speeded up robust features (SURF) algorithm (H. Bay, T. Tuytelaars and L. van Gool, "Speeded Up Robust Features", 2006).

0268. Mathematically, this feature descriptor can be represented as a second type of vector that describes the local properties of the video image (1604) associated with each feature point. This second type of vector of values can correspond to many types of properties of the local neighborhood (1602) near the pruned feature point (1600). Some vector coefficients (1604) could correspond to the presence or absence of image edges at or near point (1600), others may correspond to the relative image brightness or color near point (1600), and so on. Thus a video DNA "nucleotide" or signature that describes a video "snippet" (short temporal series of video frames) contains two types of vectors: a first type of vector that tells how many different types of feature descriptors are in the snippet, and a second type of vector that is used to mathematically describe the properties of each of the individual feature descriptors.
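The model-fitting formulation in the spatial-correspondence discussion above is commonly solved with RANSAC. The following Python sketch is a generic illustration, not the disclosed implementation: it fits a four-parameter transform of the form (x, y) -> (a*x + b*y + u, -b*x + a*y + v) to matched point pairs inside a simple RANSAC loop. The iteration count, tolerance and transform parameterization are assumptions, and at least two correspondences are assumed.

```python
import numpy as np

def fit_transform(src, dst):
    """Least-squares fit of (x, y) -> (a*x + b*y + u, -b*x + a*y + v)."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x,  y, 1, 0]); b.append(xp)
        A.append([y, -x, 0, 1]); b.append(yp)
    a_, b_, u, v = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)[0]
    return np.array([[a_, b_, u], [-b_, a_, v], [0.0, 0.0, 1.0]])

def ransac_transform(src, dst, iters=200, tol=3.0, seed=0):
    """Keep the hypothesis (fit from 2 random correspondences) with the most
    inliers, then refit the transform on all inliers."""
    rng = np.random.default_rng(seed)
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    best_inliers = np.array([], dtype=int)
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)
        T = fit_transform(src[idx], dst[idx])
        proj = (T @ np.c_[src, np.ones(len(src))].T).T[:, :2]
        inliers = np.where(np.linalg.norm(proj - dst, axis=1) < tol)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_transform(src[best_inliers], dst[best_inliers])
```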
[0268] Mathematically, this feature descriptor can be represented as a second type of vector that describes the local properties of the video image (1604) associated with each feature point. This second type of vector of values can correspond to many types of properties of the local neighborhood (1602) near the pruned feature point (1600). Some vector coefficients (1604) could correspond to the presence or absence of image edges at or near point (1600), others may correspond to the relative image brightness or color near point (1600), and so on. Thus a video DNA "nucleotide" or signature that describes a video "snippet" (short temporal series of video frames) contains two types of vectors: a first type of vector that tells how many different types of feature descriptors are in the snippet, and a second type of vector that is used to mathematically describe the properties of each of the individual feature descriptors.

[0269] In order to create a standardized process that can enable many different videos to be easily compared, rather than using descriptors that are unique to each segment of video, it is often desirable to create a standardized library of descriptors that can be used for many different videos, and do a best fit to "map," "bin," or "assign" the descriptors from any given video into this standardized library or "vocabulary."

[0270] In FIG. 17, as previously discussed, the actual feature descriptors (1700) for the visual environment around each pruned feature point (FIG. 16, 1600) are then assigned to "bins" according to the "visual library" or "visual vocabulary," which is a pre-computed set of feature descriptor types. This visual vocabulary can be viewed as a standardized library of feature descriptors. Here, a finite set (usually around 1000 or more) of "ideal" representative feature descriptors is computed, and each "real" feature descriptor is assigned to whatever "ideal" feature descriptor in the "visual vocabulary" most closely matches the "real" feature descriptor. As a result, each "real" feature descriptor (1700) from a portion of the actual video is binned into (or is replaced by) the corresponding closest element in the visual vocabulary (1702), and only the index of the closest "ideal" or representative descriptor (i.e., an indication that this particular library descriptor was the nearest neighbor) is stored, rather than the real descriptor (1700) itself.

[0271] From a nomenclature standpoint, features represented this way will occasionally be referred to in this specification as "visual atoms." As a rough analogy, the visual vocabulary can be viewed as a "periodic table" of visual atoms or elements.
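A minimal sketch of this binning step is shown below, assuming Euclidean nearest-neighbor assignment and NumPy arrays for the descriptors and the pre-computed vocabulary; the function and argument names are illustrative.

import numpy as np

def quantize_descriptors(descriptors, vocabulary):
    """Assign each 'real' feature descriptor to its nearest 'ideal' descriptor
    in a pre-computed visual vocabulary, returning only the vocabulary indices.

    descriptors: (N, d) array of local feature descriptors for one frame or snippet.
    vocabulary:  (K, d) array of representative descriptors (e.g., K around 1000)."""
    D = np.asarray(descriptors, dtype=float)
    V = np.asarray(vocabulary, dtype=float)
    # Squared Euclidean distance from every descriptor to every vocabulary entry.
    d2 = ((D[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)     # index of the closest "visual atom" for each descriptor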
[0272] FIG. 18 gives additional details showing how the original video is segmented into multiple-frame intervals (temporal segmentation). In this stage, the video sequence is segmented into various time (temporal) intervals or snippets (1800), (1802), (1804), etc. These intervals can be of fixed size (e.g., every 10 frames represents one interval) or of variable size, and can be either overlapping or non-overlapping. Often it will be convenient to track features and segment the video into regions where the features remain relatively constant, which will often correspond to a particular cut or edit of a particular video scene. Such segmentation can be done, for example, based on the feature tracking from the previous stage. It should be noted that the segmentation is usually done automatically by a pre-determined algorithm.

[0273] Next, the now visual-vocabulary-binned visual feature descriptors (visual atoms) in each temporal interval are combined (aggregated) (1806). Here, the space and time coordinates of the features themselves (1808) are not used; rather, it is the sum total of the different types of feature descriptors present in the series of video frames (temporal interval) that is used. This process essentially ends up creating a histogram, vector, or "bag" of feature descriptors (1810) for each series of video frames. The frequency of appearance of the various binned feature descriptors (visual atoms) can be represented as a histogram or vector, and as used herein, this histogram or vector is occasionally referred to as a visual nucleotide.

[0274] This "bag of features" method of abstracting or indexing a video has a number of advantages. One advantage is that this method is robust, and can detect relationships between related videos even if one or both of the videos are altered by overlaying pixels over the original frames, spatially edited (e.g., cropped), changed to different resolutions or frame rates, and the like. For example, if one of the video sequences has been modified (e.g., by overlaying pixels over the original frames), the new video sequence will consist of a mixture of features (one set belonging to the original video and the other set belonging to the overlay). If the overlay is not very large (i.e., most of the information in the frame belongs to the original video), it is still possible to correctly match the two visual nucleotides from the two videos by adopting a relaxed matching criterion that determines that the nucleotides (histograms or vectors of features) match with less than 100% correspondence between the two.

[0275] FIG. 19 shows an example formation of the video DNA for a particular media. Here, the video DNA consists of an ordered array or "sequence" of the different "histograms," "vectors of feature descriptors," or "nucleotides" taken from the various time segments (snippets) (1800), (1802), (1804), etc. of the video. Either video, that is, either the original reference video intended for the metadata database on a server, or a client video which can be a copy of the original reference video, can be abstracted and indexed by this video DNA process, and generally the video DNA created from a reference video will be similar enough to the video DNA created from a client video that one video DNA can be used as an index or match to find a correspondence with the other video DNA.

[0276] This reference video DNA creates an index that allows another device, such as a client about to play a client copy of the reference video, to locate the portion of the video that the client is about to play in the reference or server video DNA database. As an example, a client about to play a client video (1914) can compute (1916) the video DNA of the client video by the same video DNA process and send the video DNA signature of this client video to the server or other device holding the reference video DNA; the position and nature of this series of video frames can then be determined by using the client video DNA as an index into the server or reference video DNA database. This index in turn can be used to retrieve metadata from the server database that corresponds to the portion of video that is being played on the client.
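The temporal aggregation just described can be sketched as follows, assuming fixed-size, non-overlapping intervals and per-frame lists of quantized atom indices (as produced, for example, by the binning sketch above); the function name and default interval length are illustrative.

import numpy as np

def build_video_dna(atom_indices_per_frame, vocab_size, frames_per_interval=10):
    """Aggregate quantized descriptors ("visual atoms") into one histogram
    ("visual nucleotide") per fixed-size temporal interval.

    atom_indices_per_frame: list of 1-D integer arrays, one per frame, holding vocabulary indices.
    Returns an array of shape (num_intervals, vocab_size), i.e. the video DNA sequence."""
    nucleotides = []
    for start in range(0, len(atom_indices_per_frame), frames_per_interval):
        interval = atom_indices_per_frame[start:start + frames_per_interval]
        hist = np.zeros(vocab_size)
        for frame_atoms in interval:
            hist += np.bincount(np.asarray(frame_atoms, dtype=int), minlength=vocab_size)
        nucleotides.append(hist)
    return np.vstack(nucleotides)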

[0277] As previously discussed, even when a relatively large array (i.e., hundreds or thousands) of different feature detection algorithms are used to analyze video images, not all image features will fit neatly into each different feature algorithm type. Some image feature descriptors will either not precisely fit into a specific feature descriptor algorithm, or else will have an ambiguous fit. To improve the overall fidelity of the video DNA process, it is often useful to use nearest neighbor algorithms to obtain the closest fit possible. In the nearest neighbor fit, the actual observed features (feature descriptors) are credited to the counter (bin) associated with the feature descriptor algorithm that most closely fits the observed feature descriptor.

[0278] The temporal matching of client-side and reference video DNAs can be performed using a variety of different algorithms. These algorithms can range from very simple "match/no match" algorithms, to bioinformatics-like "dot matrix" algorithms, to very sophisticated algorithms similar to those used in bioinformatics for matching of biological DNA sequences. Examples of some of these more complex bioinformatics algorithms include the Needleman-Wunsch algorithm, described in S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," 1970; the Smith-Waterman algorithm, described in T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," 1981; and heuristics such as the Basic Local Alignment Search Tool (BLAST), described in S. F. Altschul et al., "Basic Local Alignment Search Tool," 1990.

[0279] Often, a suitable sequence matching algorithm will operate by defining a matching score (or distance) representing the quality of the match between two video sequences. The matching score comprises two main components: similarity (or distance) between the nucleotides, and a gap penalty, expressing to the algorithm the criteria about how critical it is not to "tear" the sequences by introducing gaps.
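For concreteness, a small Needleman-Wunsch-style scoring sketch over two video DNA sequences is shown below. It assumes a caller-supplied nucleotide similarity function and a simple linear gap penalty; as noted above, many other similarity measures and gap penalty forms could be substituted.

import numpy as np

def needleman_wunsch_score(seq_a, seq_b, similarity, gap_penalty=1.0):
    """Global alignment score between two video DNA sequences (lists of nucleotides).

    similarity(a, b) should return a higher value for more similar nucleotides.
    A linear gap penalty (gamma(n) = gap_penalty * n) is used here for simplicity."""
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))
    H[:, 0] = -gap_penalty * np.arange(n + 1)
    H[0, :] = -gap_penalty * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                H[i - 1, j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]),  # match/mismatch
                H[i - 1, j] - gap_penalty,                                 # gap in seq_b
                H[i, j - 1] - gap_penalty,                                 # gap in seq_a
            )
    return H[n, m]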
[0280] In order to do this, the distance between a nucleotide in a first video and a corresponding nucleotide in a second video must be determined by some mathematical process. That is, how similar is the "bag of features" from the first series of frames of one video to the "bag of features" from a second series of frames from a second video? This similarity value can be expressed as a matrix measuring how similar or dissimilar the two nucleotides are. In a simple case, it can be a Euclidean distance or correlation between the vectors (bags of features) representing each nucleotide. If one wishes to allow for partial similarity (which frequently occurs, particularly in cases where the visual nucleotides may contain different features due to spatial edits), a more complicated metric with weighting or rejection of outliers can be used. More complicated distances may also take into consideration the mutation probability between two nucleotides: two different nucleotides are more likely similar if they are likely to be a mutation of each other. As an example, consider a first video with a first sequence of video images, and a second video with the same first sequence of video images plus a video overlay. Clearly, many video features (atoms, or elements) in the bag describing the first video will be similar to many video features in the bag describing the second video, and the "mutation" here is those video features that are different because of the video overlay.

[0281] The gap penalty is a function accounting for the introduction of gaps between the nucleotides of a sequence. If a linear penalty is used, it is simply given as the number of gaps multiplied by some pre-set constant. More complicated gap penalties may take into consideration the probability of appearance of a gap, e.g., according to the statistical distribution of advertisement positions and durations in the content.

[0282] Although the term "video DNA" gives a good descriptive overview of the described video signature method, it should be evident that matching the different video nucleotides can be more complex than matching biological nucleotides. A biological nucleotide is usually a simple "A," "T," "G," or "C," whereas a video DNA nucleotide is a more complex "bag of features" (bag of feature descriptors). Thus it is quite often the case that a given video nucleotide will never quite find a perfect match. Rather, the criterion for a "match" is usually going to be a close but not quite perfect match. Often, this match will be determined by a distance function, such as an L1 distance, the Mahalanobis distance, the Kullback-Leibler divergence, the Earth Mover's distance, or another function. That is, an example match occurs whenever the video nucleotide "distance" <= threshold.

[0283] A smaller match criterion is considered to be a more stringent match (i.e., fewer video DNA nucleotides or signatures will match with each other), and a larger match criterion is considered to be a less stringent match (i.e., more video DNA nucleotides or signatures will match with each other).
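A minimal sketch of such a thresholded ("relaxed") nucleotide match is given below, assuming normalized histograms and a Euclidean distance; the threshold value and the normalization are illustrative choices, and any of the distances listed above could be substituted.

import numpy as np

def nucleotides_match(nucleotide_a, nucleotide_b, threshold=0.5):
    """Relaxed match test between two visual nucleotides (bags of features).

    Uses a simple Euclidean distance between normalized histograms; two
    nucleotides "match" whenever the distance is at or below the threshold,
    so less-than-perfect overlap (e.g., due to an overlay) can still match."""
    a = np.asarray(nucleotide_a, dtype=float)
    b = np.asarray(nucleotide_b, dtype=float)
    a = a / max(a.sum(), 1e-9)
    b = b / max(b.sum(), 1e-9)
    return float(np.linalg.norm(a - b)) <= threshold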
[0284] Referring to FIGS. 20-24, a series of diagrams are shown to illustrate a process configured according to the systems and methods described herein. FIG. 20 illustrates an example of the video signature feature detection process. In this example, an input video (A), composed of a series of frames 2000 having a feature image 2004 and an area defined by x and y over a period of time, is used as input to a multi-scale feature detector 2006. The video signals s1, s2, s3 are subjected to a convolution with filters of different spatial width (B), producing a series of images with different feature scales of resolution. These different scale space images are then analyzed (for example by corner detection) at the different scales 1, 2, 3 in (C). The picture can then be described by a series of multiscale peaks (D) where certain features f1, f2, etc. in the frames (E) are identified.

[0285] FIG. 21 shows an example of the video signature feature tracking and pruning process. This is an optional stage, but if it is used, features may be tracked over multiple frames, and features that persist for enough frames (e.g., meet a preset criterion) are retained, while transient features that do not persist long enough to meet the criterion are rejected.

[0286] FIG. 22 shows an example of video signature feature description. The example of FIG. 22 illustrates how previously detected features can then be described. In general, the process works by again taking the input video 2200, and this time analyzing the video in the neighborhood (x, y, r) around each of the previously detected features (G). This feature description process can be done by a variety of different methods. In this example, a SIFT gradient of the image around the neighborhood of a feature point is computed (H), and from this gradient a histogram of gradient orientations in local regions for a fixed number of orientations is generated (I). This histogram is then parsed into a vector with elements (J), called a feature descriptor.

[0287] FIG. 23 shows an example of a vector quantization process that maps an image into a series of quantized feature descriptors. In this example, the video image, previously described as a feature descriptor vector (K) with an arbitrary feature descriptor vocabulary, is mapped onto a standardized d-dimensional feature descriptor vocabulary (L). This use of a standardized descriptor vocabulary enables a standardized scheme (M) that is capable of uniquely identifying video, regardless of source.
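As a rough illustration of the descriptor step in FIG. 22, the sketch below computes a toy histogram-of-gradient-orientations descriptor for a patch around a feature point. It is far simpler than a full SIFT implementation, and the bin count and normalization are assumptions made for this example.

import numpy as np

def orientation_histogram_descriptor(patch, num_bins=8):
    """Toy SIFT-like descriptor: histogram of gradient orientations over a
    local neighborhood around a feature point, weighted by gradient magnitude.

    patch: 2-D array of grayscale pixels around the feature point.
    Returns a length-num_bins vector (the "feature descriptor")."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                     # range (-pi, pi]
    bins = ((orientation + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=num_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist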

[0288] FIG. 24 shows an example of video DNA construction. In contrast to standard video analysis, which often analyzes video on a frame-by-frame basis, video DNA often combines or averages bags of features from multiple video frames to produce an overall "video nucleotide" for a time interval. An example of this is shown in FIG. 8. As previously discussed, the video data is analyzed and bags of features for particular frames are aggregated into k-dimensional histograms or vectors (N). These bags of features from neighboring video frames (e.g., frame 1, frame 2, frame 3) are then averaged (P), producing a representation of a multi-frame video time interval, often referred to herein as a "video nucleotide."

[0289] FIG. 25 shows an example system 2500 for processing video data as described herein. A video data source 2502 stores and/or generates video data. A video segmenter 2504 receives video data from video data source 2502 and segments the video data into temporal intervals. A video processor 2506 receives video data from video data source 2502 and performs various operations on the received video data. In this example, video processor 2506 detects feature locations within the video data, generates feature descriptors associated with the feature locations, and prunes the detected feature locations to generate a subset of feature locations. A video aggregator 2510 is coupled to video segmenter 2504 and video processor 2506. Video aggregator 2510 generates a video DNA associated with the video data. As discussed herein, the video DNA can include video data ordered as sequences of visual nucleotides.

[0290] A storage device 2508 is coupled to video segmenter 2504, video processor 2506, and video aggregator 2510, and stores various data used by those components. The data stored includes, for example, video data, frame data, feature data, feature descriptors, visual atoms, video DNA, algorithms, settings, thresholds, and the like. The components illustrated in FIG. 25 may be directly coupled to one another and/or coupled to one another via one or more intermediate devices, systems, components, networks, communication links, and the like.
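Mirroring the FIG. 25 pipeline at a very high level, the sketch below strings together per-frame feature description, vocabulary binning, and temporal aggregation into a video DNA. Here detect_and_describe is a hypothetical stand-in for whatever detector/descriptor an implementation uses (Harris plus SIFT, SURF, etc.), and quantize_descriptors and build_video_dna refer to the earlier sketches.

def compute_video_dna(frames, vocabulary, frames_per_interval=10):
    """End-to-end sketch: describe features per frame, bin them against the
    visual vocabulary, then aggregate per temporal interval into nucleotides.

    detect_and_describe() is a hypothetical helper, not defined by this disclosure."""
    atoms_per_frame = []
    for frame in frames:
        descriptors = detect_and_describe(frame)   # hypothetical detector/descriptor stage
        atoms_per_frame.append(quantize_descriptors(descriptors, vocabulary))
    return build_video_dna(atoms_per_frame, len(vocabulary), frames_per_interval)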
[0291] Embodiments of the systems and methods described herein facilitate identification and correlation of multiple video content identifiers associated with specific video content. Additionally, some embodiments may be used in conjunction with one or more conventional video processing and/or video display systems and methods. For example, one embodiment may be used as an improvement of existing video processing systems.

[0292] Although the components and modules illustrated herein are shown and described in a particular arrangement, the arrangement of components and modules may be altered to perform the identification and correlation of multiple video content identifiers in a different manner. In other embodiments, one or more additional components or modules may be added to the described systems, and one or more components or modules may be removed from the described systems. Alternate embodiments may combine two or more of the described components or modules into a single component or module.

[0293] Although specific embodiments of the invention have been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated. The scope of the invention is to be defined by the claims appended hereto and their equivalents.

1. A method comprising:
    receiving a request for information associated with specific video content from a requesting device;
    identifying a first video content identifier associated with the specific video content from a video content source;
    retrieving first metadata associated with the specific video content based on the first video content identifier;
    translating the first video content identifier into a second video content identifier;
    retrieving second metadata based on the second video content identifier; and
    providing the first metadata and the second metadata to the requesting device.

2. The method of claim 1, wherein the first video content identifier is associated with the entire video content.

3. The method of claim 1, wherein the first video content identifier is associated with a particular time interval within the video content.

4. The method of claim 1, wherein the first video content identifier is associated with a spatio-temporal object in the video content.

5. The method of claim 1, wherein the first video content identifier is a filename-based identifier associated with video content stored at a fixed location in a video repository.

6. The method of claim 1, wherein the first video content identifier is a file content-based identifier associated with a payload of the video file.

7. The method of claim 6, wherein the video content identifier is a hash value computed from the payload of the video file.

8. The method of claim 1, wherein the first video content identifier is a video content-based identifier associated with information obtained from frames of the video content.

9. The method of claim 1, wherein the video content source is a DVD and the first video content identifier is a DVD id.

10. The method of claim 1, wherein the video content is obtained from a file in a peer-to-peer network and the first video content identifier is a hash value associated with the file.

11. The method of claim 1, wherein the video content is obtained from a video file stored on a server and the first video content identifier is a URL identifying the server and the video file.

12. The method of claim 1, wherein translating the first video content identifier into a second video content identifier includes generating the second video content identifier such that the second video content identifier is associated with the entire video content.

13. The method of claim 1, wherein translating the first video content identifier into a second video content identifier includes generating the second video content identifier such that the second video content identifier is associated with a time interval in the video content.

14. The method of claim 13, wherein translating the first video content identifier into a second video content identifier includes finding a correspondence between temporal coordinates in video data associated with the first video content identifier and video data associated with the second video content identifier.

15. The method of claim 1, wherein translating the first video content identifier into a second video content identifier includes generating the second video content identifier such that the second video content identifier is associated with a spatio-temporal object in the video content.

16. The method of claim 15, wherein translating the first video content identifier into a second video content identifier includes finding a correspondence between spatial and temporal coordinates in video data associated with the first video content identifier and video data associated with the second video content identifier.

17. The method of claim 1, wherein the first metadata is untimed metadata associated with the entire video content.

18. The method of claim 1, wherein the first metadata is timed metadata associated with a time interval in the video content.

19. The method of claim 1, wherein the first metadata is spatio-temporal metadata associated with a spatio-temporal object in the video content.

20. The method of claim 1, wherein the first metadata is retrieved from a first metadata source and the second metadata is retrieved from a second metadata source.

21. The method of claim 1, wherein translating the first video content identifier into a second video content identifier includes performing the translation in response to each received request.

22. The method of claim 1, wherein translating the first video content identifier into a second video content identifier includes retrieving a previously generated second video content identifier.

23. The method of claim 1, wherein identifying a first video content identifier includes receiving a first video content identifier from the requesting device.

24. The method of claim 1, further comprising identifying a correspondence between a timeline associated with the specific video content and a timeline associated with the second metadata.

25. The method of claim 24, further comprising providing the identified correspondence to the requesting device.

26. A method comprising:
    identifying a first video sequence associated with a video program;
    identifying a second video sequence associated with the video program;
    calculating a correspondence between the first video sequence and the second video sequence;
    determining an alignment of the first video sequence and the second video sequence;
    storing the calculated correspondence between the first video sequence and the second video sequence; and
    storing information regarding the alignment of the first video sequence and the second video sequence.

27. The method of claim 26, wherein calculating a correspondence between the first video sequence and the second video sequence includes calculating a spatial correspondence between the first video sequence and the second video sequence.

28. The method of claim 26, wherein calculating a correspondence between the first video sequence and the second video sequence includes calculating a temporal correspondence between the first video sequence and the second video sequence.

29. The method of claim 26, wherein calculating a correspondence between the first video sequence and the second video sequence includes calculating a spatio-temporal correspondence between the first video sequence and the second video sequence.

30. The method of claim 26, wherein determining an alignment of the first video sequence and the second video sequence includes calculating a plurality of video-based descriptors associated with the first video sequence and the second video sequence.
31. A method comprising:
    receiving a request for metadata associated with a video program, wherein the request includes a first video content identifier associated with the video program;
    identifying a video sequence in the video program;
    identifying a second video content identifier associated with the video program based on the identified video sequence;
    retrieving the requested metadata from a metadata source based on the second video content identifier; and
    identifying a correspondence between a timeline associated with the video program and a timeline associated with the retrieved metadata.

32. The method of claim 31, wherein identifying a correspondence includes calculating a spatial correspondence between the video program and the retrieved metadata.

33. The method of claim 31, wherein identifying a correspondence includes calculating a temporal correspondence between the video program and the retrieved metadata.

34. The method of claim 31, wherein identifying a correspondence includes calculating a spatio-temporal correspondence between the video program and the retrieved metadata.

35. The method of claim 31, wherein identifying a correspondence includes retrieving a previously computed correspondence from a storage device.

36. The method of claim 31, wherein identifying a correspondence includes calculating an alignment between the video program and the retrieved metadata.

37. The method of claim 31, wherein the requested metadata is subtitle information associated with the video program.

38. The method of claim 31, wherein the requested metadata includes reviews associated with the video program.

39. An apparatus comprising:
    a communication module configured to receive a request for information associated with specific video content from a requesting device; and
    a processor coupled to the communication module and configured to identify a first video content identifier associated with the specific video content, the processor further configured to retrieve first metadata associated with the specific video content based on the first video content identifier and to translate the first video content identifier into a second video content identifier, wherein the processor is further configured to retrieve second metadata based on the second video content identifier and provide the first metadata and the second metadata to the requesting device.

40. The apparatus of claim 39, wherein the processor is further configured to find a correspondence between temporal coordinates in video data associated with the first video content identifier and video data associated with the second video content identifier.

41. The apparatus of claim 39, wherein the processor is further configured to find a correspondence between spatial coordinates and temporal coordinates in video data associated with the first video content identifier and video data associated with the second video content identifier.

42. A method comprising:
    inputting a first video identifier, wherein the first video identifier includes a content identifier and a first set of coordinates in a first video content;
    identifying a set of second video identifiers associated with a second video content, wherein the second video content is similar to the first video content;
    for each video identifier in the set of second video identifiers:
        identifying a set of coordinates in the second video content; and
        outputting the identified set of coordinates in the second video content to request metadata associated with the first video content.

43. The method of claim 42, wherein the first set of coordinates is a set of time points in a timeline associated with the first video content.

44. The method of claim 42, wherein the first set of coordinates is a set of time intervals in the timeline of the first video content.

45. The method of claim 42, wherein the first set of coordinates is a set of spatial coordinates associated with an object in the first video content.

46. The method of claim 45, wherein the set of spatial coordinates identify diagonal corners of a bounding box.

47. The method of claim 42, wherein the first set of coordinates is a set of spatio-temporal coordinates associated with an object in the first video content.

48. The method of claim 47, wherein the spatio-temporal coordinates are represented as a set of temporal intervals, wherein spatial coordinates associated with the object are provided for each temporal interval.

49. The method of claim 42, wherein identifying a set of second video identifiers includes retrieving the second video identifiers from a table including a first column of video identifiers and a second column of video identifiers associated with video content similar to video content associated with the video identifiers in the first column.

50. The method of claim 49, wherein the table is pre-computed.

51. The method of claim 49, wherein the table is a correspondence table.

52. The method of claim 49, wherein the table is computed by calculating a video content-based descriptor from the first video content and identifying other video content similar to the first video content by comparing the content-based descriptors to a database of content-based descriptors.

53. The method of claim 52, wherein the content-based descriptor is a visual nucleotide.

54. An apparatus comprising:
    a communication module configured to receive a first video identifier, wherein the first video identifier includes a content identifier and a first set of coordinates in a first video content, the communication module further configured to send a second video identifier and a corresponding second set of coordinates in a second video content; and
    a processor coupled to the communication module and configured to identify the second video identifier by retrieving the second video identifier from a table that includes a first column of video identifiers and a second column of video identifiers associated with video content similar to video content associated with the video identifiers in the first column.

* * * * *