
From video information retrieval to management

Olivier Aubert, Yannick Prié
LIRIS - FRE 2672 CNRS - Université Lyon 1
[email protected], [email protected]

Abstract

The digital revolution that has somehow taken place with the World-Wide Web took advantage of the availability and interoperability of tools for visualisation and manipulation of text-based data, as well as the satisfying pertinence of search engine results that makes them usable by non-expert users. If we are to undergo such a revolution in the audiovisual domain, a number of issues mainly related to the temporal nature of audiovisual documents have to be resolved. In this article, we expose our view of the current state of Audiovisual Information Systems, and suggest that they should be considered as video information management systems rather than video retrieval systems, in order to foster new uses of video information. We define the notion of hypervideo that can be used as an analysis framework for this issue. We then describe our implementation of hypervideos in the Advene framework, which is aimed at DVD material annotation and analysis. We eventually discuss how the notion of hypervideo puts video information retrieval in a somewhat different light.

1. Introduction

The success of the WWW has given access to a wide collection of documents, mainly textual ones¹. To us, accessing documents means: a/ the ability to visualise the documents; b/ the ability to search for documents (through search engines); c/ the ability to refer to documents (through hyperlinks); and d/ the ability to manipulate documents or parts of documents (by saving them, or by copy-pasting).

The worldwide success of the WWW does not only stem from its academic origin, where scientists needed to share research information. It is also to a large extent due to its dissemination to a wider audience. This dissemination has been effective because of the availability and interoperability of tools for visualisation and manipulation of text-based data (e.g. web browsers and text processors), as well as the satisfying pertinence of search engine results that makes them usable by non-expert users. The current semantic web efforts aim at improving this interoperability.

¹ From now on, we will refer to them as "text-based documents", in opposition to "audiovisual documents", mainly based on moving images and/or sounds.

While important research efforts are carried out to extend the four above capabilities to audiovisual documents, the same level of achievement has not yet been reached. Indeed, if some features are common to all types of documents (authorship, date of publication, etc.), the most important ones are often specific. For instance, images are not composed of such useful items as words are for texts, and the temporal nature of audiovisual documents induces original issues hardly translatable from textual documents. In this context, our team –working on the domains of document engineering, documentary information systems and annotation systems– is interested in the specificities of AudioVisual Information Systems (AVIS) and their emerging uses.

In this article, we expose in section 2 our view of the current state of AVIS, and suggest that they should be considered as video information management systems rather than video retrieval systems, in order to foster new uses of video information. The notion of hypervideo, which we propose in section 3, can be used as an analysis framework for this issue. Section 4 describes our implementation of hypervideos in the Advene framework, which is aimed at DVD material annotation and analysis. The last section is dedicated to discussing how the notion of hypervideo puts video information retrieval in a somewhat different light, and how the Advene project contributes to this evolution.

2. A view on Audiovisual Information Systems

AudioVisual Information Systems (AVIS) should provide means to search, retrieve and use audiovisual documents. They are still an active research domain [20, 7, 21], though some of them are already in wide use (Virage [24], used by CNN for instance). AVIS processing can be analysed using the following steps: indexing/retrieval, results selection, results exploitation. Indexing is the base upon which retrieval is built (CNN's video search proposes to retrieve video fragments based on the presence of keywords in text transcriptions). Once a (potentially large) set of results has been obtained, the user has to evaluate their relevance and select the most important ones (in CNN's video search, some context in the form of a keyframe and part of the transcription is provided). After the relevant fragments have been selected, they can be exploited. This step depends on the needs of the user, but most of the time it only consists in visualising the selected videos or fragments.

We will detail these steps and see that the last one is often overlooked and not well integrated with the others. We will then argue that indexing and retrieval should be integrated in the exploitation phase as a video processing cycle.

2.1. Video indexing and retrieval

Video retrieval is now recognised as a specific domain. For instance, the TREC conference series, which provides a test-bed for the comparison of information retrieval systems, has integrated a video corpus since 2001, and has made the video track an independent workshop and evaluation process called TRECVID [13] since 2003.

Most AVIS use video and associated descriptors to access specific video documents or fragments. The range of video descriptors is very large, from very low-level features that are extracted automatically to high-level indexing resulting in natural language descriptions. Indeed, while in textual documents indexes are existing and objective units (words, characters), no such existing indexes are available for audiovisual documents [2]. They have to be extracted or manually defined, a process producing textual indexes (symbolic descriptors, transcriptions, textual annotations, ...) on which more common retrieval techniques can be used.

Low-level features (colour or shape analysis, sound level analysis) generate information that has to be interpreted in a specific context (specialised domain) in order to be useful. Other automatic approaches (shot detection, closed-caption detection, voice processing for transcriptions) result in higher-level information (shots, texts, etc.). Eventually, manual indexing produces synthetic information (structured descriptions or texts).

2.2. Query results selection

Once the interesting information is located, it has to be visualised in order to select the most relevant items. When dealing with text documents, it is easy to quickly scan a document or a fragment in order to determine if it is suited to the user's need. On the contrary, audiovisual documents possess a temporality which makes them inherently not instantly accessible. Surrogates [10] have to be used in order to summarise the video information and quickly analyse results of queries on audiovisual documents.

The visualisation of these collections commonly uses documents illustrated by keyframes from the videos and extracts from bibliographical descriptions. For instance, Graham et al. [6] propose a static view of a video document using key frames extracted from the video, in order to quickly browse through a number of video fragments. The VAST system, by Mu and Marchionini [10], also provides multiple means to quickly visualise video data, either statically or dynamically (fast-forward, etc.).

2.3. Query results exploitation

Once the query results have been selected, they must eventually be exploited. Dealing with text documents, or fragments thereof, is common these days: hypertext has become mainstream, and the public daily uses text-based search engines and word processors for textual document generation, using cut-and-paste and hyperlinking features.

The situation is different for audiovisual documents. In current AVIS, results often consist in collections of video documents (as in the Internet Movie Archive [8]), or of video fragments (as in the CNN digital archive [24]). Their main use is often limited to a simple visualisation. More advanced uses, such as editing in order to produce a new video, need video editing tools that are not yet integrated with AVIS, although research is also active in this area [4, 22]. HyperHitchcock [19] proposes a means to create hypervideos featuring hyperlinks between video fragments. The creation does not however use an underlying retrieval engine, but is manually defined through an authoring software. The VideoZapper [3] approach allows to dynamically edit a video, based on the user's previous experience. The output of this system is then a new video, with selected fragments from a number of broadcast sources.

2.4. Importance of usage

Studies of information retrieval systems often overlook the importance of the usage of the retrieved information. It should be stressed that the intended usage determines the descriptors that are used for information retrieval, and even their interpretation. Moreover, for audiovisual documents, other uses than simple visualisation depend on the descriptors: they will often become indexes to specify links in the video, captions put on the video, etc. Video descriptors are thus determined by, and essential to, retrieval as well as visualisation in AVIS.

We claim that the main issue in AVIS is not video retrieval but video usage, which determines both retrieval and processing. Feedback from user studies, as done in [20], should guide the development of new systems. We must evolve from simple video retrieval using descriptors to building new documents from video and descriptors. Other authors share the same vision: Nack [12] also underlines the need for integrated video systems, from shooting to broadcasting and archival.

2.5. Proposal

As indexing and retrieval cannot be done without knowing the needs and intended usage, research in multimedia information retrieval should closely interact with the multimedia information visualisation domain, and gather information through user studies [20]. We need to build upon both video retrieval and authoring in order to produce effective AVIS that will be more widely used, therefore providing more feedback on the needs and usefulness of these systems. As Frank Nack said, "the current goal of multimedia research is to make multimedia information pervasively accessible and useable" [11]. This extended use needs appropriate tools and also a large video corpus.

Moreover, innovative uses of video information can develop when given opportunities, just as happened with text documents: hypertext brought for instance a new way of conceiving and using documents, which is now exploited by media, scientists, artists, etc. The notions of hyperlinks, of fragment addressing and reuse, should be employed in order to promote new uses of video material.

To address these issues, we propose to use the notion of hypervideo, which we will define more precisely in the next section. The following section will describe how the Advene project allows us to put hypervideos in practice.

3. Hypervideos

We first define an Annotated Audiovisual Document (AAD) as an audiovisual document augmented by an annotation structure, then a view as a way to visualise part of the AAD. A hypervideo is a specific view, using the annotation structure and giving access to the stream temporality.

Annotated Audiovisual Document. We call Annotated Audiovisual Document (AAD) an audiovisual document AVD augmented with an annotation structure S_A with spatiotemporal relations to the document (i.e. some elements from the annotation structure are linked to spatiotemporal fragments of the document). The annotation structure that enriches/augments the document is composed of annotations a_i linked to spatiotemporal fragments f_i, and of a structure S defining relations between the annotations:

    AAD = {AVD, S_A} = {AVD, {{a_i, f_i}, S}}

For instance (see figure 1), a 4-minute long movie stored as an MPEG2 file can be annotated with a hierarchic structure that describes the logical scene/shot structure (a movie is made of scenes, each having a title and containing shots). Shots are temporally located in the document. The structure also holds a description of the different characters and their relationships (here, Char1, who is present on the screen during the third shot and part of the fourth, sends to Char2 a letter, whose text is known and describes what happened during the first scene of the movie). The annotations a_i contain data about the shots and the characters, the associated fragments f_i are temporal fragments, and the structure S contains the remaining data (movie, scenes, relationships, text of the letter, etc.).

Figure 1: Example of Annotated Audiovisual Document.
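To fix ideas, the AAD of figure 1 could be represented by data structures along the following lines. This is a minimal Python sketch of the definition above; the class and field names, as well as the timestamp values, are ours, for illustration only, and are not part of the model.

    from dataclasses import dataclass

    @dataclass
    class Fragment:
        """A temporal fragment f_i of the audiovisual document (seconds)."""
        begin: float
        end: float

    @dataclass
    class Annotation:
        """An annotation a_i: a piece of data linked to a fragment f_i."""
        data: dict
        fragment: Fragment

    @dataclass
    class AAD:
        """Annotated Audiovisual Document: AAD = {AVD, {{a_i, f_i}, S}}."""
        avd: str            # reference to the audiovisual stream itself
        annotations: list   # the annotations a_i with their fragments f_i
        structure: dict     # S: scenes, characters, relationships, letter text...

    # Elements of the figure 1 example: one shot and one character annotation,
    # plus the relation "Char1 sends a letter to Char2" held in S.
    shot3 = Annotation({"type": "Shot", "num": 3}, Fragment(95.0, 110.0))
    char1 = Annotation({"type": "Character", "name": "Char1"}, Fragment(95.0, 125.0))
    aad = AAD(avd="movie.mpg",
              annotations=[shot3, char1],
              structure={"sendsLetter": ("Char1", "Char2")})

Note that the sketch mirrors the definition: the audiovisual document AVD is left untouched, and all added information lives in the annotation structure S_A.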
Note that the definition of the annotation structure remains very generic on purpose: it holds all the information linked to fragments, as well as their interrelations. The nature of the information itself is not specified, and can feature a temporal dimension (like an audio annotation for instance) or not.

View of an AAD. We call view of an Annotated Audiovisual Document any "way of visualising" it. A view is built 1/ from information taken from the document AVD, 2/ from the annotation structure S_A, and 3/ from its visualisation information. The visualisation information is always in part hard-coded in the visualisation tools/programs, and can also be specified by declarative means such as stylesheets.

Two criteria can be used to analyse views: first, whether the view offers access to the AAD temporal stream; second, the usage of information coming from either the annotation structure S_A, the audiovisual document AVD, or both. Table 1 presents the different possibilities, illustrated with examples based on the AAD described in figure 1.

Table 1: Five types of views for an Annotated Audiovisual Document

- AVD, no access to the AAD temporality (type A): display of a static summary of the movie, as a table of screenshots automatically extracted from the movie every 10 seconds; visualisation of the sound envelope of the movie.
- AVD, access to the AAD temporality (type B): simple visualisation of the movie in a basic video player, with regular controls (play, pause, start, fast forward, etc.).
- S_A, no access to the AAD temporality (type C): visualisation of the information structure movie/scenes/shots; visualisation of the characters' names and of the text of the letter.
- S_A, access to the AAD temporality: not applicable.
- AVD + S_A, no access to the AAD temporality (type D): display of a hierarchical table of contents of the movie, where each shot is represented by its first picture; visualisation of the letter's content with 3 illustrating screenshots taken from the 3 shots of the first scene (see the description link).
- AVD + S_A, access to the AAD temporality (type E): visualisation of a hypertext table of contents of the movie, with links to the corresponding fragments of the video; visualisation of the movie, captioned with the current shot number, with the possibility to navigate between shots; on-the-fly edition of the movie featuring only the shots where characters are present (shots 3, 4, 7).

The examples proposed in table 1 illustrate various ways of view generation using information taken from the AAD, but also what we called the visualisation-specific information of the tools. For instance, a standard video player will contain all necessary information in its source code (type B); a stylesheet can be enough to generate a table of contents from annotations stored in an XML file (type C). However, a specific tool is needed to extract screenshots from the movie if the view needs them (type D). More complex views combine visualisation of audiovisual information and display of information taken from the annotation structure.
Such views require new tools, for instance video players with enhanced capabilities such as the display of various kinds of information (textual, graphical) on the video screen, original navigation capabilities, etc. They are the views that we call hypervideos.

Hypervideo. We call hypervideo any view of an AAD that, on the one hand, uses the annotation structure S_A and the audiovisual document AVD, and on the other hand gives access to the temporal stream of the AVD.

For instance, elaborating from the previous example (MPEG2 movie plus hierarchic annotation structure), it is possible to visualise the stream in a standard video player, with classic navigation capabilities (play, pause, fast forward, etc.). This view can be called "simple navigation" (type B). It uses the information from the audiovisual stream only, as well as the information built in the video player code (which we will call visualisation-specific information). Another view may use the annotation structure to present it as a table of contents, each shot being represented by a key frame (for instance the first frame of each shot). This view (type D) uses the annotation structure, the audiovisual document (to extract pictures) and, as visualisation-specific information, a stylesheet that specifies how to present the information and a tool to extract keyframes from the document.

This latter view is however not a hypervideo, because it offers no access to the original document temporality. The addition of the possibility to click on each keyframe to play the corresponding video segment makes it a hypervideo (type E). Another example consists in using the annotation structure to enhance the original document while it is playing, through an instrumented video player. For instance, it can be a view constantly displaying over the video the title of the current sequence as well as the number of the shot, and a link to the next and previous sequences.

The notion of hypervideo that we propose here is compatible with, and extends, the definition given in other works [17, 4]. It tries to set a unifying framework to build hypervideos, but also to analyse existing ones. There are indeed a number of existing and sometimes widely used hypervideos. The most well-known example is DVD menus, which provide a view on the video chapters with the possibility to directly play the interesting part of the movie. They can be very simple –chapter titles linked to the movie's chapters– or more elaborate: they can display in parallel reduced-size excerpts of each chapter, in order to provide more information about them.
Moreover, subtitles are also supplemental information that can be displayed on the video, thus providing a new view of the movie. Analysing DVDs as hypervideos makes explicit the metadata constituted by chapter definitions and subtitles, and suggests the idea of their reuse.

Considering existing media as hypervideos offers a unifying analysis framework that allows the comparison of different experiments and makes possible the transmission of experience from one to another. For instance, the Hyperfilm [16] project proposes to enrich a video with links, either to other parts of the same video or to external web documents. It does not however consider that the metadata embedded inside its documents may be exploited in other ways. HyperCafe [17] also explores hyperlinks in videos and tries to express the specificities induced by the spatial and temporal nature of documents. Both systems are however the result of an interactive edit process, and not directly built from request results. The previously cited examples of HyperHitchcock [19] and video skimming [6] qualify as hypervideos according to our definition: they provide a hypertext visualisation of a video, allowing access to any identified fragment of the video. They are automatically generated, but do not provide the same interactivity when playing the video.

Thus some experience of hypervideo creation and interaction is already available, but its link to AVIS is not always explicit. The use of descriptors as indexes for new documents could be generalised, enriching manually edited hypervideos with additional data. The bridge between information retrieval and hypervideos is not yet achieved. That is the reason why we propose, in the Advene project, to build upon the previous experience in hypervideos and to take advantage of the structured metadata of audiovisual documents in order to integrate it.

4. Advene

The Advene project² aims at developing an open-source framework for hypervideo engineering that allows to 1/ annotate audiovisual documents, i.e. to associate information with specific fragments of a video; 2/ provide augmented visualisations of the video that use the annotation structure; 3/ exchange the annotations and their associated visualisation modes independently from the original video, as documentary units called packages. It should meet the four aspects of document access presented in the introduction: visualise audiovisual documents, search them, refer to them and manipulate them through dynamic selection and edit.

The basic principles of Advene are 1/ to accommodate various needs in its annotation structure as well as in its visualisation means³; 2/ to facilitate the development of video usage at a simple user level and foster new usages; 3/ to use a video corpus that is accessible and interesting: movies edited as DVDs.

The main application domains of the project directly depend on the chosen audiovisual corpus, namely movies available on DVDs. This makes it useful for language teachers using movies as pedagogical supports, moviegoers wishing to exchange movie analyses and discourses, etc. The genericity of the Advene model makes it usable in other domains such as the humanities (using video support).

The choice of dealing with DVD movies is preponderant in the Advene project, even though it imposes strong constraints due to the technical capabilities of DVDs. One of the roots of the success of the WWW lies in its widespread adoption by a great number of users, which led to the development of many tools to dig a vast, mainly text-based corpus. In order to observe similar developments in the audiovisual domain, we need a similarly large user base and corpus. But the rights management of audiovisual data is stronger and more strictly enforced. One of the solutions is to build an accessible corpus [20, 8] from either public documents or self-made documents. Such a corpus is however limited in its scope and cannot foster the interest of a large community. DVDs, on the other hand, constitute a large video corpus, disseminated worldwide, maybe with some variations depending on the location but globally homogeneous. In addition, it is available to a large user base. The criteria for a widespread usage are thus met: a large user base accessing a large, common corpus.
The condition is that every person owns a copy of the DVD she is interested in. Its enhanced use, through annotation and visualisation tools, is then an authorised, private use of the media.

The development of new usages of audiovisual material depends on the ability of the system to address the needs of a variety of persons. Hence, as we have seen in section 2, two major elements have been taken into account: the flexibility in the design of the descriptors used to enrich the audiovisual document, and the versatility of the proposed visualisation means.

² http://liris.cnrs.fr/advene/
³ "[there is a] wide variety of needs from users, and so the interface should have some kind of flexibility (like the agile views)." (James Turner [23])

4.1. The Advene model

Advene is based upon a model, with a focus on simplicity over completeness, that defines the available elements in the Advene framework. The model only specifies how to link a piece of data to a fragment of an audiovisual document. The handling of the data structure is then delegated to specific plugins for each type of data.

The basic element in the Advene model is the annotation. An annotation is simply a piece of data linked to a part of a document called a fragment. Annotations can be put in relation with one another by means of n-ary relations, which can also possess a content describing them.

Annotations and relations are not unstructured: their content type and their relation structure are identified and constrained by annotation types and relation types. An annotation type possesses a name and defines a content-type for its annotations in the form of a MIME type (text/plain, text/xml, audio/*, etc.). If the type is text/xml, it can be more precisely constrained by a structure description (e.g. DTD, XML Schema, ...). A relation type also possesses a name and defines a content-type for its instances. In addition, it specifies the number of participating annotations and their respective types.

Annotation types and relation types can be seen as the elementary elements of a specific analysis. Therefore, they can be grouped in a unit called a schema. A schema represents a certain point of view on the analysed document, and may be reused with other movies.

One of the most common examples is the shot/sequence analysis. It can be implemented with a schema called Cutting, which defines the following annotation types: Movie, Sequence and Shot. It also defines an isPartOf relation type that links an instance of Sequence to the Shots it contains. The Movie type defines a structured content-type that will hold information about the movie: director, release date, reference URL. The Sequence type also defines a structured content that will hold the sequence number and the sequence title. Eventually, the Shot type holds the following information: the shot number, and the description of the shot. Figure 2 sums up this schema.

Figure 2: The Cutting schema

Using this schema, it is possible to annotate a movie and use the information in various ways, as will be described in the following sections.

As we previously argued, an important aspect is to provide users with facilities to define their own visualisation means. The Advene model integrates a notion of view that defines a way to visualise elements from the model. The model does not impose a specific mechanism, but the prototype described in section 4.2 defines for the moment two different types of user-definable views, in addition to the GUI elements: a static one, using a standard web browser to visualise documents generated from the hypervideo data, and a dynamic one able to react to specific events during the video play. They will be described further.

The field of study of the Advene project being AVIS, the Advene model also integrates the notion of query, which allows to select elements from the model. The model only specifies that queries return a set of elements from the model, in order to accommodate multiple query models. Details are left to the implementation, which can propose multiple ways to represent and execute queries. The current implementation proposes a simple filtering system, which selects from a set the elements matching a condition. This method covers the basic needs of the current experiments, but this issue is still a work in progress.

All these elements (views, queries, schemas, annotation- and relation-types, annotations and relations) form a consistent set: views and queries depend on the schemas, and schemas are defined to achieve a specific goal. The elements are gathered in a single representation unit called a package.
An Advene package thus contains the description of a hypervideo: the schemas defining the annotation structure, the annotation structure itself, queries to manipulate it and views to visualise it. A package can be shared by users in order to be simply visualised or to be edited. As some schemas –and their associated views– may be reused (Cutting is very generic, for instance), it is possible for the creator of a new package to reference external packages so that she does not have to redefine them. Such a reference is called an import. The intended use of this functionality is to make it possible to have a repository of common, reusable schemas that would allow people (moviegoers, teachers, students, etc.) to share their analyses of movies.
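To make these notions concrete, here is a minimal sketch of how the Cutting schema, a few annotations and the simple filtering query system could be modelled. The Python names below are ours, chosen for illustration; they do not reproduce the actual Advene API.

    from dataclasses import dataclass

    @dataclass
    class AnnotationType:
        name: str        # e.g. "Shot"
        mime_type: str   # content-type of the annotations, e.g. "text/xml"

    @dataclass
    class Annotation:
        type: AnnotationType
        content: dict    # constrained by the annotation type
        begin: float     # temporal fragment, in seconds
        end: float

    # The Cutting schema groups three annotation types.
    movie_t = AnnotationType("Movie", "text/xml")        # director, release date, URL
    sequence_t = AnnotationType("Sequence", "text/xml")  # sequence number and title
    shot_t = AnnotationType("Shot", "text/xml")          # shot number and description

    # A package gathers schemas, annotations, queries and views.
    package = {
        "schemas": {"Cutting": [movie_t, sequence_t, shot_t]},
        "annotations": [
            Annotation(shot_t, {"num": 1, "description": "Opening titles"}, 0.0, 12.5),
            Annotation(shot_t, {"num": 2, "description": "The letter is written"}, 12.5, 47.0),
        ],
        "queries": {},
        "views": {},
    }

    # The filtering query model: select from a set the elements matching a condition.
    def query(elements, condition):
        return [e for e in elements if condition(e)]

    shots = query(package["annotations"], lambda a: a.type is shot_t)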

4.2. Visualisation

One of the main contributions of Advene is to integrate the issue of visualisation in the AVIS. The Advene model only specifies that a view has a content-type and generates data from the annotated document. In the Advene prototype, two types of views that users can specify themselves have been implemented, in addition to the programmed, ad-hoc GUI views (such as the timeline, etc.). They provide a base for the generation of hypervideos. We quickly describe them here; more details can be found in [1]. Figure 3 illustrates the exploitation of the different types of views: in the upper left-hand corner, a web browser displays a static, User-Time Based view generated from the package's data. The lower right-hand corner presents a dynamic, Stream-Time Based view, captioning the video with the content of selected annotations. In the lower left-hand corner is an ad-hoc view, presenting the different annotations on a timeline.

Figure 3: Multiple views of a single Advene package used in a language teaching context

Ad-hoc views. Ad-hoc views are GUI interfaces to the data. They cannot fully qualify as user-definable views, as they are defined by source code that only a programmer can modify. However, they can be parameterized to offer some variations in the display of the data. For instance, the transcription view displays annotations of the same type by concatenating their content in a text window. During the movie play, the currently active annotation is highlighted. This view proposes a number of parameters (the separator used to concatenate the annotations' contents, a toggle indicating whether to display timestamps or not, etc.).

User-Time Based Visualisation. The first kind of user-definable view, called User-Time Based Visualisation (noted UTBV), may be seen as a static view. It is in fact the definition of a hypertext document, whose temporality is the one of the user visualising it.

The User-Time Based Visualisation deals with presenting the available data (from the package and from the audiovisual document) as hypertext documents. The interaction with the user is done through a standard web browser that connects to a simple web server integrated in the Advene application. The natural support for it is (X)HTML and more generally XML, which can be displayed by any web browser and can easily be generated. We have chosen a technology issued from the Zope platform [9]: a template language called TAL (Template Attribute Language), coupled with an expression syntax for data access called TALES (Template Attribute Language Expression Syntax). The couple TAL/TALES presents a number of interesting properties: simplicity, availability of useful tools and libraries, validity of the XML templates, and model abstraction.

Its simplicity mainly stems from its template-based nature: a TAL template is a valid XML document where some tags feature specific attributes. It has been designed so that it does not interfere with existing XML tools, especially WYSIWYG editors. The user only has to design a template document and afterwards insert simple processing instructions such as text substitutions or basic loops.

The model abstraction is provided by the TALES component, which references elements in a data model using a path-like syntax that should be easily understandable by users. Moreover, the exposed structure represents the model itself (as seen through the library) rather than its XML representation. This makes it both more flexible and more intuitive.

Using the Cutting schema presented in figure 2, it is possible to easily generate a shot-by-shot table of contents of the movie, with links to the corresponding fragments of the video. Figure 4 presents the actual template code used. The application of this template to data conforming to the Cutting schema generates an HTML table, which can be displayed in a standard web browser, containing the following information for each shot: shot number, shot duration, and shot description with a link to the corresponding video fragment. The attributes defined in the tal: namespace are TAL processing instructions whose path-like values are TALES expressions, which allow to designate elements from the Advene model.

Figure 4: Template for the table of contents (the generated table has three columns: shot number, duration, and shot description)
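The template code of figure 4 is not legible in this copy of the paper. The fragment below sketches what such a template can look like: the tal: attributes are standard TAL constructs, but the TALES paths (here/annotationTypes/shot/annotations, a/fragment/duration, a/player_url) are our guesses at plausible model paths, not the actual code of the figure.

    <table>
      <tr><th>Shot number</th><th>Duration</th><th>Shot description</th></tr>
      <!-- repeat the row for each annotation of type Shot -->
      <tr tal:repeat="a here/annotationTypes/shot/annotations">
        <td tal:content="a/content/parsed/num">1</td>
        <td tal:content="a/fragment/duration">0:00:05</td>
        <td><a tal:attributes="href a/player_url"
               tal:content="a/content/parsed/description">description</a></td>
      </tr>
    </table>

Because a TAL template is a valid XML document, the placeholder values (1, 0:00:05, description) are what a WYSIWYG editor displays; they are replaced by model data when the template is applied.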

In the current Advene implementation, the HTML code of the templates must be hand-edited, or copy/pasted from existing examples. A WYSIWYG editor is currently being integrated into the application in order to make the design of simple templates more accessible.

Stream-Time Based Visualisation. In the second kind of user-definable view, Stream-Time Based Visualisation (noted STBV), the temporality of the resulting document is more related to the time of the audiovisual document than to the time of the user. It can be seen as a video augmented with additional capabilities. Two main approaches can be used to deal with dynamic presentation of content: scheduled or event-based [18]. SMIL, for instance, uses a hybrid approach, mixing determinate timings (scheduled) with indeterminate ones (event-based).

In the Advene prototype, we based the implementation on an event-based model, using the Event-Condition-Action (ECA) paradigm. The ECA model [14] is used in many applications, such as databases, or to define the filtering capabilities of mail client software.

Our use of ECA rules is not meant as a full-fledged composition language, but as a simple means to achieve goals expressed by the users, using an understandable formalism. With it, we can quickly imagine and design new visualisation modalities for hypervideos. For instance, using both annotations and relations, it is possible to generate on the fly an alternate cut of a movie: annotations define the sequences, and relations define their order. Our first experiments with English-language teachers showed us that a common functionality is to pause at the end of a sequence and possibly loop over it. Using the adequate rules, we have been able to quickly implement this behaviour.

A STBV is defined by a set of rules. Each rule can be triggered by the occurrence of an event, generated by the application and related to the activation of annotations during the movie play or to the changing state of the player (pause, play, etc.). When activated, the rule checks that some user-specified conditions are met, and executes actions accordingly. For instance, using the example presented in section 4.1, it is possible to design a view that allows to navigate between shots. Figure 5 presents the GUI that allows the user to define the corresponding rule: when the beginning of an annotation occurs, and if the type of the annotation is Shot, then caption the annotation with the shot number (extracted from the annotation's content) and display a popup dialog proposing to the user to skip to the next shot.

Figure 5: Example of Stream-Time Based Visualisation rule

No effort is made by the application to ensure the global validity of a set of rules, except for basic safeguards. A great deal of flexibility is given to the user, which may allow the definition of conflicting rules.

To implement such dynamic views, the Advene application integrates an enhanced video player, controlled through a CORBA bus, that offers standard and not-so-standard capabilities: navigation in the movie stream, screenshot capture, text captioning, display of SVG graphics on the screen, etc. These capabilities are then exposed to the user through a number of possible actions. The currently implemented actions allow to:

• control the standard functionalities of the player: start/stop the player, go to a given position, modify the volume, etc.

• control the extended functionalities of the player: take a screenshot, display a text caption, display SVG graphics over the video.

• interact with the user: log messages, display dialog popups with video navigation options, etc.

• control the application itself: activate another dynamic view.

The infrastructure of Advene makes it easy to propose and implement new types of actions, thus offering an excellent playground for testing new hypervideo interaction modalities.
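As an illustration of this mechanism, the shot-navigation rule described above could be expressed along the following lines. This is a Python sketch of the ECA principle only: the event name and the player methods (display_caption, popup) are modelled after the capabilities listed in the text, and do not reproduce the actual Advene rule engine.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        event: str            # e.g. "AnnotationBegin", "PlayerPause"
        condition: Callable   # user-specified condition, checked on the event
        actions: list         # actions executed when the condition holds

    def fire(rules, event, ctx):
        """Minimal ECA engine: run the actions of every matching rule."""
        for rule in rules:
            if rule.event == event and rule.condition(ctx):
                for action in rule.actions:
                    action(ctx)

    # The rule of figure 5: when an annotation of type Shot begins,
    # caption the video with the shot number and propose to skip ahead.
    shot_rule = Rule(
        event="AnnotationBegin",
        condition=lambda ctx: ctx["annotation"]["type"] == "Shot",
        actions=[
            lambda ctx: ctx["player"].display_caption(
                "Shot %s" % ctx["annotation"]["num"]),
            lambda ctx: ctx["player"].popup("Skip to next shot?"),
        ])

Conflicting rules are possible by construction: two rules may react to the same event with contradictory actions, which is why the application only provides basic safeguards rather than global validation.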
4.3. Discussion

The model proposed in Advene aims at investigating some of the issues related to the management of video in Audiovisual Information Systems, through the notion of hypervideo. Advene proposes a simple but flexible annotation scheme that should accommodate the needs of various categories of users. Moreover, visualisation of the data is integrated in the model. More importantly, users have the ability to define their own ways of visualising the data, either by defining HTML templates or by specifying dynamic rules.

Our first experiments with the Advene prototype gave rise to some interesting issues related to hypervideo management, from information retrieval to information visualisation. In the context of AVIS, a query is applied to an Annotated Audiovisual Document, represented as a package accompanying the audiovisual document. The status of the Advene package depends on its use. Holding annotations, queries and visualisation definitions, it can be seen as a document in itself. It can also be seen as a document generator, as each hypervideo generated from a package is a document.

As a document, it is meant to be shared and possibly modified. It is possible to create a repository holding Advene packages so that people interested in the same analysis of DVD movies can share their work, or use the same base (for instance, scene/shot packages of movies are of general interest). In such a repository, queries on the package database would result in either a collection of packages, or a new package importing data from the matching ones. A query result then becomes a new document. It can thus be reused and shared between users, and possibly augmented, bringing new possibilities into the management of video databases.

As a document generator, it features multiple views that can be extended to fit the needs of the different users, reusing the same data or adding more data. This possibility to adapt the visualisation and the data to the various needs of the users is essential to propose a viable system. The generated hypervideos are a representation of the query results. Their non-linear nature, thanks to the hyperlinks, makes them objects distinct from standard audiovisual documents.

Visualisation of hypervideos puts forward a number of issues, either technical or conceptual. On the technical side, the need for adequate video software with extended capabilities, as identified in section 3, may prevent a large use of hypervideo software. The Advene project brings a contribution by providing open-source software built upon a cross-platform open-source video player [5]. The same principle is applied for instance by the authors of the Annodex technology [15], who propose another approach for annotating audiovisual documents, where the audiovisual document and the annotation structure are merged in a unique stream. Moreover, annotating a video with audiovisual content raises the question of mixing multiple audiovisual streams. More conceptual issues deal with the relevance of video material in a multimedia context, how it is perceived by the user, and what new uses can be brought by hypervideo. These questions are also in the current and future scope of investigation of the Advene project.

5. Conclusion

In the context of AudioVisual Information Systems, we have identified the different phases that occur during their exploitation and underlined that one of their shortcomings lies in the low level of integration between the different phases. More precisely, we have argued that usage determines the retrieval phase as well as the exploitation phase. We have then defined the notion of hypervideo, which we propose as a unifying framework to analyse and build new interaction modalities with enriched video information. We have eventually presented the Advene project, which explicitly implements the notion of hypervideo and is meant to be accessible to a wide audience.

Indeed, the Advene model does not impose a specific indexing scheme, but rather proposes a flexible model able to hold various types of information, thus accommodating the various needs of the users. Advene aims at providing a platform for exploring new uses of video by allowing the users to specify their own views and data model for audiovisual information. Using DVDs, it offers access to a large corpus and covers the various aspects needed for audiovisual document exploitation: visualising, searching, linking/referring and editing.

While hypervideos exist, they are not necessarily meant as such. They are often intended as a visualisation means, without taking into account the additional information stored in their structure. By defining hypervideos as views upon an Annotated Audiovisual Document, we want to emphasise the importance of the additional data in the processing of video documents. Hypervideos are then more than edited versions of videos with supplemental hyperlinks: they become complex documents featuring multiple and extensible visualisation means. Moreover, in the context of AVIS, hypervideos can be generated by queries, as the display of the query results (selection phase), or as their intended exploitation (exploitation phase), using both the audiovisual material and the descriptors used for the search. In other terms, search descriptors are used not only to search documents but also to build them⁴; conversely, visualisation descriptors –i.e. descriptors primarily used for hypervideo document rendering– are also used as indexes.

We think that using the concept of hypervideo offers a unifying framework to analyse and extend AVIS. It stresses the importance of annotation data in the process of video retrieval and of results visualisation and exploitation. The Advene project allowed us to experiment with the hypervideo concept and to validate its applicability. We are currently experimenting with different categories of users in order to validate the generic applicability of hypervideos as well as their usability by non-expert users.

⁴ As a comparison, for text documents in the semantic web, indexes are used to search data as well as to construct result documents.
References

[1] O. Aubert, P.-A. Champin, and Y. Prié. The Advene model for hypervideo document engineering. Research Report RR-2004022, LIRIS, 2004. 19 pages.
[2] G. Auffret and Y. Prié. Managing full-indexed audiovisual documents: a new perspective for the humanities. Computers and the Humanities, special issue on Digital Images, 33(4):319–344, 1999.
[3] M. Boavida, S. Cabaço, and N. Correia. VideoZapper: a system for delivering personalized video content. Multimedia Tools and Applications Journal, 2004.
[4] T. Chambel and N. Guimaraes. Context perception in video-based hypermedia spaces. In Proceedings of the thirteenth conference on Hypertext and hypermedia, pages 85–94. ACM Press, 2002.
[5] H. Fallon, A. de Lattre, J. Bilien, A. Daoud, M. Gautier, and C. Stenac. VLC User Guide. VideoLAN Project, 2003.
[6] J. Graham, B. Erol, J. Hull, and D.-S. Lee. The video paper multimedia playback system. In Proceedings of the eleventh ACM international conference on Multimedia, November 2003.
[7] A. Hauptmann, R. Baron, W. Lin, M. Chen, M. Derthick, M. C. R. Jin, and R. Yan. Video classification and retrieval with the Informedia digital video library system. 2002.
[8] Internet Movie Archive, 2004. http://www.archive.org/movies/.
[9] A. Latteier and M. Pelletier. Zope Book. http://www.zope.org/, 2003.
[10] X. Mu and G. Marchionini. Enriched video semantic metadata: authorization, integration, and presentation. In ASIST 2003 Annual Meeting, 2003.
[11] F. Nack. Understanding Media Semantics. ACM Multimedia Tutorial, 2003. http://www.acm.org/sigmm/mm2003/t11.shtml.
[12] F. Nack and W. Putz. Designing annotation before it's needed. In Proceedings of the ninth ACM international conference on Multimedia, pages 251–260, Ottawa, Canada, 2001.
[13] National Institute of Standards and Technology. TREC video retrieval evaluation, 2004. http://www-nlpir.nist.gov/projects/trecvid/.
[14] N. W. Paton, editor. Active Rules in Database Systems. Springer, 1999.
[15] S. Pfeiffer, C. Parker, and C. Schremmer. Annodex: a simple architecture to enable hyperlinking, search and retrieval of time-continuous data on the web. In Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval, pages 87–93, 2003.
[16] M. Pollone. Hyperfilm: video hyper-media production. Technical report, Hyperfilm project, 2001. http://www.hyperfilm.it/.
[17] N. N. Sawhney, D. Balcom, and I. E. Smith. HyperCafe: narrative and aesthetic properties of hypervideo. In UK Conference on Hypertext, pages 1–10, 1996.
[18] P. Schmitz. Unifying scheduled time models with interactive event-based timing. Technical report, Microsoft Research, 2000.
[19] F. Shipman, A. Girgensohn, and L. Wilcox. Image annotation and video summarization: generation of interactive multi-level video summaries. In Proceedings of the eleventh ACM international conference on Multimedia, November 2003.
[20] L. Slaughter, G. Marchionini, and G. Geisler. Open video: a framework for a test collection. Journal of Network and Computer Applications, 3(23):219–245, 2000.
[21] A. Smeaton, H. Lee, and K. M. Donald. Experiences of creating four video library collections with the Físchlár system. Journal of Digital Libraries, 2004.
[22] T. Tran-Thuong and C. Roisin. Multimedia modeling using MPEG-7 for authoring multimedia integration. In Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval, 2003.
[23] Symposium on understanding video, 2002. http://ils.unc.edu/idl/video_symposium02/.
[24] Virage video systems, 2004. http://www.virage.com/.