Spring 2015 STRL Bulletin

NHK Science & Technology Research Laboratories ISSN 1345-4099

"Video File Transmission" - Automatic transmission of recorded video to a broadcasting station

no. 60 CONTENTS
"Video File Transmission" - Automatic transmission of recorded video to a broadcasting station ...... 1
Trends in Advanced Video Representation Techniques ...... 2
Video Analysis Techniques that Enhance Video Retrieval ...... 9
Challenge / R&D / Treatise / About us / NHK Technology

The Science & Technology Research Laboratories (STRL) has been working on improving the transmission environment so that news video materials can be gathered from locations that would normally present problems. One recent accomplishment is the construction of "video file transmission" technology, which is capable of delivering camera footage to a broadcasting station via a public network such as the Internet.

NHK has been working to enhance news reporting through the development of a skip-back recorder, which detects a tremor and automatically records footage from before and after an earthquake. Our new technology will also enable priority-based automatic transmission of weather-related information.

This technology records the video signal in storage media as file data divided into segments a few seconds long. The files are transmitted simultaneously and recombined after reception. Because files can be retransmitted on a different line, the transmission reaches the broadcasting station even if some of the individual transmissions are interrupted. This enables reliable, high-quality video transmission over interruption-prone lines such as mobile phone networks. A transmission device incorporating this technology and controlled from the broadcasting station can crop out video segments from the constantly updated stored video.

Because simultaneous transmission of video signals from multiple locations may use up the available bandwidth, the number of files sent at one time can be varied in order to control the transmission rate. For instance, a remote command from the broadcasting station can prioritize video footage from locations experiencing more severe tremors during an earthquake (Figure).

STRL will examine the practicality of automatic transmission of footage from remotely located cameras via mobile phone networks during natural disasters, including earthquakes, torrential rain, flooding, and volcanic eruptions. We will continue our studies on automatic monitoring at more locations in order to promptly deliver local video coverage during an emergency.

Figure: Video collection using video file transmission technology. (Remotely located cameras send footage to the reception and video file transmission system at the broadcasting station, where received video footage is used in news programs. Video signals are divided into small segments; an increased number of simultaneous transmissions from areas with larger tremors enables prioritized reception of important video data, while areas with smaller tremors make fewer simultaneous transmissions.)
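The transmission control described above lends itself to a simple sender loop: record the signal as short file segments, push batches of segments in parallel, and re-send any segment that fails on a different line, with the batch size acting as the station-commanded priority. The sketch below illustrates this idea under our own assumptions; the Segment and Link classes and all names are hypothetical stand-ins, not details of NHK's implementation.

```python
# Sketch of priority-controlled, segment-based video file transmission.
# All names are illustrative; the actual NHK system is not publicly
# specified at this level of detail.
import itertools
from dataclasses import dataclass

@dataclass
class Segment:
    index: int    # position in the recording (each segment is a few seconds)
    data: bytes

class Link:
    """Stand-in for one transmission line (e.g. one mobile network)."""
    def send(self, segment: Segment) -> bool:
        return True  # a real link would return False when interrupted

def transmit(segments, links, max_parallel):
    """Send segments over several links, re-sending failures on another line.

    `max_parallel` is the station-controlled priority: raised by remote
    command for cameras in areas with larger tremors, lowered elsewhere.
    """
    pending = list(segments)
    for link in itertools.cycle(links):
        if not pending:
            break
        batch, pending = pending[:max_parallel], pending[max_parallel:]
        failed = [seg for seg in batch if not link.send(seg)]
        pending = failed + pending  # retry failed segments on the next line

transmit([Segment(i, b"") for i in range(10)], [Link(), Link()], max_parallel=3)
```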

Trends in Advanced Video Representation Techniques

The recent application of computer vision and computer graphics (CG) has caused a dramatic expansion in the variety of video representations in TV programs and the like. This paper describes trends in representation techniques that use video analysis in program production. We also comment on issues affecting video production, present approaches to their resolution, and give examples of video representation applications that use real-time video analysis.

1. Introduction

A wide variety of video representations are incorporated in TV programs. In TV drama in particular, it is becoming commonplace to see highly realistic representations created by adding visual effects (VFX) to digital video footage during the processing and editing stage (called post-production, or "postpro").

VFX is used for a wide variety of purposes, from the striking, e.g., the depiction of scenes that are impossible in real life, to the inconspicuous, such as the removal of manmade structures that would not have existed during the times in which a period drama is set. However, while VFX satisfies various production needs, the content creator must invest a great deal of effort in using it. This labor-intensive aspect is a major issue because content creators also have to do all of the program production work they would normally do, and productions are constrained in terms of time and cost.

Under these circumstances, there has been growing recognition of the importance of process management in program production and of proper management of video materials. For example, new workflows have been proposed, such as the scene-linear workflow, in which the original data (RAW data) of the captured video is left untouched and saved together with related metadata, such as lens information acquired during filming. This video information is then handled in a consistent way, i.e., "linearly", within the same color space1)2). While such management makes it unlikely that video materials will be damaged or degraded, it also removes much of the subjectivity of the creator from the production environment. That is, while the aim is to make the work involved in VFX more efficient so that the creator can concentrate on purely creative activities, being creative requires technology that supports the complicated work inherent to each process.

This paper describes issues faced by broadcasters in video production and shows how analysis techniques such as computer vision*1 can support VFX. It also describes new applications of video analysis in program production.

2. Issues of video production faced by broadcasters

Program production essentially consists of research, planning, materials recording, post-production, and transmission. Image processing such as VFX adds steps to this process.

As shown in Table 1, the objective of materials management is not simply to collect raw video footage, but also to enable efficient handling of large volumes of footage by providing functions to list and search the available materials. At present, however, except for labels indicating temporal information, users must add their own labels to make the content of the video searchable. Given the constraints of production, it is impractical to manually add detailed labels to a huge amount of video. Moreover, once the desired video images have been retrieved, the job of adding information for processing that footage has only just begun.

The pre-visualization ("previz") step in this table creates a rough video for imagining what the finished video will be like3). Pre-visualizations are used for selecting lenses or for showing producers and performers, by way of example, what the finished video will look like. The CGs of pre-visualizations are simplified, but they still require substantial knowledge and effort to produce. Hence, the previz step is omitted in TV program productions when the preparation period before filming is too short. In such cases, although producers can use storyboards to convey their ideas about the finished video to the performers and staff, this form of expression does not facilitate intuitive comprehension the way previz does.

As regards the subject area extraction step listed in the table, it often happens that chroma key cannot be used on location. In such a case, it is necessary to create subject area information (matte generation, also called key generation) by hand. This task takes considerable time and requires experience and skill. It is especially difficult to perform in TV program productions, which have many time-related constraints.

How to obtain camera movement information is another issue. This information is used to composite CGs and live-action images. To ensure that the composited image appears natural, it is necessary to match the environment information of the live-action filming space with that of the CG space, as shown in Table 2. In particular, in order to make natural-appearing composited images, the movements of the camera and the lighting conditions during the live-action filming have to be recorded and used in making the CGs. However, acquiring this information places further constraints on the production. For instance, it may require special equipment and measurements to be made on the set.

*1 A technique that uses a computer to analyze images captured by a camera to elucidate the world of the subject and the relationship between that world and the camera.

2 Broadcast Technology No.60, Spring 2015 ● C NHK STRL Feature

Table 1: List of VFX tasks
- Materials management. Objective and details: version management, searching, etc., of materials. Typical technique: version management and searching of materials using a video database (labels attached manually).
- Pre-visualization (previz). Objective and details: a simple visualization imagining the finished video before filming starts. Typical technique: CG production using authoring tools (software for creating information such as the shape and movements of CG subjects).
- Post-visualization (postviz). Objective and details: a simple visualization of the finished video made directly after filming. Typical technique: provisional composite of simple CG and live-action video.
- Extraction and compositing of subject area (by video analysis). Typical technique: chroma key (filming against a background such as a blue screen, then using color information to separate the subject and background areas in the video images).
- Extraction and compositing of subject area (by hand). Typical technique: preparing a mask by a method such as rotoscoping (tracing the outline of the subject by hand from the captured video image).
- Camera tracking (acquisition of camera orientation information). Objective and details: composite video images in which the movement of the camera (live action) matches the movements of the CG. Typical technique: estimating camera movement by analyzing captured video images and by using sensors mounted on tripods or cranes that can measure the camera orientation.
- CG production. Objective and details: raw images for video compositing. Typical technique: preparation of textures (surface patterns), modeling, rendering, etc., using a CG authoring tool or paint tool.
- Color correction. Objective and details: adjustment of color and matching of color between cameras and between video materials. Typical technique: linear or non-linear scaling (expansion or compression processing) in a specific color space.

Table 2: Matching of environmental information in the production of naturally appearing composite video images
- Temporal matching: factors such as timestamps and shutter speeds of the composite materials.
- Spatial matching: factors such as the coordinate system, scaling, and camera parameters (lens state, position, and orientation) of the composite materials.
- Optical matching: factors such as lighting conditions and mutual reflections of the composite materials.

As the above has illustrated, there is a strong demand for technical support that frees content creators from having to do complicated and time-consuming production work that can sap their creativity.

3. Approaches to resolving the issues

The problems outlined in the previous section boil down to two main issues: how can the desired video footage be retrieved efficiently from a large volume of raw video, and how can helpful information (metadata) be obtained quickly and easily in each step of the image processing?

This section first deals with the question of how to make video production more efficient. Then, it overviews technologies for acquiring camera orientation information, subject area information, and lighting information that can lighten the burden of workers at the production site.

3.1 Making the video production workflow more efficient

At the start of a typical video production workflow, there is only the video footage that has been filmed; the metadata necessary for image processing is acquired afterwards by using various video analysis tools and authoring tools. At present, most of this work cannot be done automatically. Moreover, it usually requires experience and a great deal of effort and has to be accomplished within strict time constraints. The use of file-based content within a broadcast station can make the production environment more efficient, not just by facilitating centralized management and non-linear editing, but by providing new ways of making content.

To address the issues raised above, we have proposed a "video bank" that uses video analysis techniques and sensor technology to automatically assign metadata for image processing in addition to metadata for searching. A conceptual diagram is shown in Figure 1. The video bank performs video analysis around the clock, collecting new and recorded video and automatically assigning metadata to it. Thus, if a content creator searches the video bank for raw video footage by keyword, he or she can simultaneously acquire the metadata necessary for image processing.

Another issue is that if the captured video images contain insufficient clues for analysis, such as camera orientation information and lighting information, it becomes difficult to obtain enough metadata for image processing. To alleviate this problem, the video bank stores additional information gathered from sensors that operate in a way that does not interfere with filming.

We are working on increasing the accuracy and robustness of each technology that makes up the video bank, with the aim of putting this system to practical use.

Figure 1: Conceptual diagram of the video bank. (The video bank supports video production: automatic generation of metadata by video analysis of the video information and sensor information yields both metadata for searches, providing ways of searching a wide variety of video footage, and metadata for processing, providing the information necessary for image processing.)
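As a rough illustration of the concept in Figure 1, each clip in the video bank can be pictured as carrying two kinds of metadata side by side: labels for searching and measurements for image processing. The record layout below is purely illustrative; all field names are our own assumptions, not NHK's schema.

```python
# Illustrative record for one clip in the "video bank" of Figure 1.
# A keyword search returns clips together with the processing metadata
# (camera poses, lighting, mattes) that VFX work needs.
from dataclasses import dataclass, field

@dataclass
class VideoBankEntry:
    clip_id: str
    keywords: list[str]                                  # metadata for searches
    camera_poses: list[list[float]] = field(default_factory=list)  # per frame
    lighting: dict[str, float] = field(default_factory=dict)       # per light
    subject_masks: dict[int, bytes] = field(default_factory=dict)  # frame -> matte

def search(bank: list[VideoBankEntry], keyword: str) -> list[VideoBankEntry]:
    """Keyword search that also yields the processing metadata."""
    return [entry for entry in bank if keyword in entry.keywords]
```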

3.2 Camera orientation information acquisition technique

The camera orientation is necessary information for making natural-looking composites of live-action images and CGs. If the camera in the CG world does not move in the same way as the live-action camera, the resulting video appears unnatural. For example, if inaccurate camera orientation information is used in compositing, the CG may appear to slide or sway unnaturally in relation to the subject.

The current techniques for acquiring camera orientation information use physical sensors, which are subject to displacement inaccuracies. In contrast, the camera orientation can be determined more accurately by using video analysis on the captured images, so that CG objects can be composited with no displacement. However, conventional video analysis techniques have the disadvantage that the estimation depends on the filming environment and video content, and in some cases it has to be done manually.

With the aim of automating and increasing the accuracy of camera orientation acquisition, we have developed a technique that augments the video analysis with physical sensor information when the analytical estimate becomes unstable4). The estimation is based on bundle adjustment5). Bundle adjustment first extracts feature points, such as the corners of subjects, in each frame of the captured video image and then tracks them between frames.

Figure 2: Basic principle of camera orientation estimation. (Feature points observed on the image planes of frames 1-3 are known; the 3D positions of feature points 1-3 on the object and camera orientations 1-3 are unknown. Correspondence relationships link the feature points across frames, and each 3D point is re-projected onto the image plane; the re-projection error is the distance between the re-projected point and the observed feature point on the image plane.)


Figure 3: Results of detecting feature point mismatches. ((a) Filmed video image; (b) triangular patches (pink regions) that include mismatched points.)

Since these feature points have 2D coordinates, they can be placed on the image planes, as shown in Figure 2. On the other hand, since the 3D position of each feature point and the orientation of the camera are unknown, each is allocated a suitable initial value, and the 3D coordinates of the feature points are projected onto the image planes. The camera orientation is obtained by iterative processing that minimizes the sum of the errors (re-projection errors) between each projected point on the image plane and the corresponding observed feature point position. Bundle adjustment can also be used to estimate lens distortion and focal length.

A factor that dictates the accuracy and robustness of the estimate is the correspondence of the feature points between frames. Mismatches can be detected by applying a condition such that the topological relationships between feature points in one frame carry over to the next frame. The results of a mismatch detection are shown in Figure 3; each pink triangle in Figure 3(b) indicates a mismatched feature point.

The camera orientation cannot be accurately estimated when the feature points are distributed unevenly within the frame. We can use this unevenness, together with the re-projection error, to evaluate the accuracy of the estimate. If it is poor, we can instead measure the orientation by attaching a hybrid sensor6) to the camera. This sensor is equipped with a micro-electromechanical systems (MEMS) gyroscope and a sensor camera that is aimed at the floor.
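To make the iterative minimization concrete, the following is a toy bundle adjustment that jointly refines camera poses and 3D point positions by minimizing the summed re-projection error of Figure 2. It is a minimal sketch under our own assumptions, not the method of reference 4): it assumes a known focal length, a pinhole camera with the principal point at the origin, and that every feature point is observed in every frame.

```python
# Toy bundle adjustment: refine camera poses and 3D feature points by
# minimizing the re-projection error.  Real systems also estimate lens
# distortion and focal length; this sketch keeps the focal length fixed.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

F = 1000.0  # assumed focal length in pixels

def project(points3d, rvec, tvec):
    """Project 3D points into one pinhole camera (principal point at 0,0)."""
    cam = Rotation.from_rotvec(rvec).apply(points3d) + tvec
    return F * cam[:, :2] / cam[:, 2:3]

def residuals(params, n_cams, n_pts, observed):
    # Unpack rotation vectors, translations, and 3D points from one vector.
    rvecs = params[:n_cams * 3].reshape(n_cams, 3)
    tvecs = params[n_cams * 3:n_cams * 6].reshape(n_cams, 3)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    errs = [project(pts, rvecs[c], tvecs[c]) - observed[c] for c in range(n_cams)]
    return np.concatenate(errs).ravel()  # re-projection errors, all cameras

def bundle_adjust(init_params, n_cams, n_pts, observed):
    """`observed` has shape (n_cams, n_pts, 2); returns refined parameters."""
    return least_squares(residuals, init_params,
                         args=(n_cams, n_pts, observed)).x
```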

3.3 Area division and extraction technique

The area division and extraction technique extracts specific subject areas from the filmed video footage so that they can be composited with different background video. If a still image is the target, a software technique such as GrabCut7)*2 can be used to roughly distinguish the subject area to be extracted, and graph cut8)*3 can then be used to perform an optimal extraction. It uses a Gaussian mixture model*4 to determine the color distributions of the extracted subject area and of the other areas.

More complicated processing is necessary for moving images. In this case, an initial tracking value is manually assigned to the outline of the subject in one frame at the production site; the outline can then be traced through the frames and the subject areas extracted. Errors in tracking the outline of the subject have to be manually corrected. This sort of work requires an experienced operator, since the quality of the resulting composite depends on how the initial tracking values are allocated.

To address these issues, we have developed a technique that can automatically extract a specific subject area throughout a set of frames9). It turns moving images into voxels (video information in which pixel values are arranged in a 3D space formed by the two horizontal and vertical axes of 2D imagery and a time axis). The mean-shift method*5 is used to segment the images according to their color information. An overview of the procedure is shown in Figure 4. Subjects are tracked by determining whether or not pixels are adjacent in the voxel space. Several sets of area divisions with different granularities are prepared, so that the accuracy of the extraction can be tuned by dividing each frame into a suitable number of areas.

This technique thus consists of two phases: area division and subject area extraction. The first is automated and done beforehand. The area extraction processing is based on GrabCut, which is fast and interactive.
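For the still-image case, GrabCut is available off the shelf in OpenCV, so the rough-rectangle-to-matte step described above can be sketched directly. This illustrates only the generic GrabCut step, not NHK's voxel-based moving-image technique; the file name and rectangle are placeholders.

```python
# Still-image subject extraction with GrabCut (OpenCV): graph cuts plus
# Gaussian mixture models, as described in the text above.
import cv2
import numpy as np

def extract_subject(image_bgr, rough_rect):
    """Return a binary matte for the subject inside `rough_rect` (x, y, w, h)."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)   # background GMM parameters
    fgd_model = np.zeros((1, 65), np.float64)   # foreground GMM parameters
    cv2.grabCut(image_bgr, mask, rough_rect, bgd_model, fgd_model,
                5, cv2.GC_INIT_WITH_RECT)
    # Pixels labelled definite or probable foreground form the matte.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return np.where(fg, 255, 0).astype(np.uint8)

matte = extract_subject(cv2.imread("frame.png"), rough_rect=(50, 30, 400, 300))
```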

*2 A technique of separating areas using graph cuts. A Gaussian mixture model represents the color distributions of the subject area to be extracted and the background area.
*3 An optimization technique that minimizes an energy function.
*4 A model that is represented as a number of superimposed Gaussian distributions.
*5 An iterative technique for finding the set of camera parameters that most accurately predicts the locations of the observed points in images.

Figure 4: Technique of extracting subject areas from a moving image. (Raw video is stored in the video bank and undergoes time-space area division processing; the resulting area division information is manipulated by specifying the subject area to extract, and subject area extraction processing then produces the extraction results.)

Figure 5: Comparison of a real image and a CG drawn by using estimated lighting conditions; the subject is a white sphere. (Real image versus CG under lighting conditions 1, 2, and 3.)

3.4 Lighting information acquisition technique

Composite images can be made to look more natural by using lighting information gathered from the filming space in the CG rendering. A commonly used technique for doing so is to install a spherical mirror in the vicinity of where the CG will be composited, shoot it with a digital camera, and acquire the ambient lighting information over a wide dynamic range10). There are a number of problems with this approach, such as the difficulty of dealing with varying lighting conditions. It is also necessary to acquire high-dynamic-range information, which means that either an ordinary camera has to take a number of shots at different exposures, or a special camera capable of taking high-dynamic-range images has to be used.
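The multiple-exposure route mentioned above (the method of reference 10) can be sketched with OpenCV's photo module, which recovers the camera response curve and merges the exposures into one radiance map. The file names and exposure times below are placeholders.

```python
# Multiple exposures of a spherical mirror merged into one high-dynamic-range
# lighting environment (Debevec and Malik, reference 10), via OpenCV.
import cv2
import numpy as np

files = ["mirror_ball_1.jpg", "mirror_ball_2.jpg", "mirror_ball_3.jpg"]
times = np.array([1 / 250, 1 / 60, 1 / 15], dtype=np.float32)  # seconds
images = [cv2.imread(f) for f in files]

# Recover the camera response curve, then merge the exposures into a
# floating-point radiance map usable as an environment light in CG.
response = cv2.createCalibrateDebevec().process(images, times)
hdr = cv2.createMergeDebevec().process(images, times, response)
cv2.imwrite("lighting_environment.hdr", hdr)
```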


We have developed a technique that can indirectly estimate the color tone and intensity of each piece of studio lighting by analyzing video footage taken by a small, wide-angle sensor camera shooting the ceiling of the studio. Here, the effect of the lighting on the studio set is analyzed rather than the lighting itself11). This approach requires lighting information to be acquired in the studio beforehand, including the positions of the lights and images of objects in the studio lit by each individual light. Since the brightness being measured is not that of the direct lighting, these measurements can be made with an ordinary camera. This method thus has the virtue of creating natural-looking composite video with simple and inexpensive equipment. Figure 5 compares live-action video of a real white sphere filmed under the actual lighting with a CG white sphere rendered using lighting conditions estimated by this method.

4. Latest trends in video representation using real-time video analysis

Video analysis techniques combining various sensor technologies and software can handle video at 30 frames per second in real time. This section introduces examples of applications that take advantage of this real-time capability.

4.1 Virtual studio

A virtual studio is a video representation technique that puts a CG into live-action video in real time and makes it move in accordance with the movements of the camera. It is used in live broadcasts to show performers interacting with CGs or a virtual world, or to visualize complicated information in an easy-to-understand manner.

The current virtual studio technology uses special tripods, dollies (carts used for traveling shots), or cranes fitted with physical sensors to measure the camera orientation; handheld cameras cannot be used with it. Although a method exists that acquires video feature points and their 3D positions from the captured video image itself and then uses that information to estimate the orientation of the camera12), it does not have the accuracy required for TV program production.

We have developed a technique that accurately estimates the camera orientation by analyzing how the edges of the subjects move in relation to the shapes and textures of the studio set. The shape and texture information is acquired before the shoot, and ideally this enables the method to work in real time13). However, it is still subject to delays that sometimes cause the composite video to be more than three frames behind the captured video images and interfere with the synchronization of lip movements and speech. We intend to solve this problem by using the hybrid sensor developed for the video bank, and so make it possible to use handheld cameras in a virtual studio.

A hybrid sensor measures the camera's rotation by using a small gyroscope and its translation in the vertical direction by using a laser rangefinder. It estimates the translation in the horizontal direction by analyzing the video images of a sensor camera aimed at the floor (Figure 6), using clues such as the floor's pattern. This method can measure camera orientation information with little delay and has been used in live broadcasts by NHK.

4.2 Display application: "Augmented TV"

The Hybridcast platform can send information related to the program being broadcast on one screen, e.g., a TV set in a living room, to a second screen, e.g., a tablet computer held by the viewer. "Augmented TV"14) is a style of viewing content in which the TV and a second screen interact.

Figure 6: Overview of the hybrid sensor that measures camera orientation information. (A hybrid sensor comprising a gyroscope, a laser rangefinder, and a floor-facing sensor camera is mounted on the broadcast camera. The gyroscope gives the amount of rotation (Rx, Ry, Rz), the laser rangefinder gives the height (Ty), and video analysis of the sensor camera gives the amount of travel along the floor (Tx, Tz) in the 3D coordinate system.)
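Below is a minimal sketch of how the hybrid sensor's three kinds of measurement might be combined into a single camera pose in the coordinate system of Figure 6. The fusion shown is our own simplification; the actual sensor's filtering and calibration are not published at this level of detail.

```python
# Combining the hybrid sensor's readings into one camera pose:
# rotation (Rx, Ry, Rz) from the gyroscope, height Ty from the laser
# rangefinder, and floor-plane travel (Tx, Tz) from video analysis of
# the floor-facing sensor camera.
import numpy as np
from scipy.spatial.transform import Rotation

def camera_pose(gyro_rpy, height, floor_travel):
    """Build a 4x4 pose matrix; y is vertical, x-z is the floor plane."""
    tx, tz = floor_travel
    pose = np.eye(4)
    pose[:3, :3] = Rotation.from_euler("xyz", gyro_rpy).as_matrix()
    pose[:3, 3] = [tx, height, tz]
    return pose

pose = camera_pose(gyro_rpy=[0.01, 0.3, 0.0], height=1.6,
                   floor_travel=(0.2, -0.1))
```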

Augmented TV uses a handheld terminal to analyze images acquired by its built-in camera, synchronizes content on the TV and the handheld terminal, and acquires the relative orientation of the TV with respect to the terminal. By using this information, it is possible to create an augmented reality in which a character appears to jump from the TV screen to the second screen.

5. Conclusions

Video analysis technology will steadily become more sophisticated. We believe that the objective is not to automate all the creative activities in video representation, but to provide an environment that will enable producers to concentrate on their creative work. The video bank we have proposed on the basis of this concept makes it possible to access raw video footage quickly and effectively, expand the variety of video representations that are possible, and implement an efficient workflow. In the future, we will improve its elemental technologies and put it to use in practical program production.

We will continue to develop services and applications that make use of video analysis techniques, not limited to video representation and program production.

(Hideki Mitsumine)

References
1) Science and Technology Council, Academy of Motion Picture Arts and Sciences: "The Academy Color Encoding System (ACES) -- Idealized System," https://www.oscars.org/sites/default/files/acesoverview.pdf
2) "Open Source Color Management," http://opencolorio.org/
3) Mori, Ichikari, Shibata, Kimura, and Tamura: "Development of MR-PreViz System for On-site S3D Pre-Visualization of Real World Scenes," Transactions of the Virtual Reality Society of Japan, Vol. 17, No. 3, pp. 231-240 (2012)
4) Mitsumine, Muto, Fujii, and Kato: "A Robust Camera Tracking Method Considering Visual Effects," Annual Conference of the Institute of Image Electronics Engineers of Japan, R6-1 (2013)
5) Okatani: "Bundle Adjustment," Information Processing Society of Japan CVIM Research Materials, No. 37, pp. 1-16 (2009)
6) Kato, Muto, Mitsumine, Okamoto, A. Moro, Seki, and Mizukami: "Study of Detecting Method of Broadcasting Camera's Movement using MEMS Sensor and Image Processing Technology," Journal of the Robotics Society of Japan, Vol. 32, No. 7, pp. 2-11 (2014)
7) C. Rother, A. Blake and V. Kolmogorov: "'GrabCut': Interactive Foreground Extraction Using Iterated Graph Cuts," ACM Trans. Graphics, Vol. 23, No. 3, pp. 309-314 (2004)
8) Ishikawa: "Graph Cut," Information Processing Society of Japan CVIM Research Materials, No. 31, pp. 193-204 (2007)
9) Okubo and Mitsumine: "A Study of Spatio-temporal Segmentation Method Suitable for Interactive Object Extraction for Video Asset Processing and Management System 'Video Bank'," Vision Engineering Workshop ViEW 2013, OS1-O3 (2013)
10) P. E. Debevec and J. Malik: "Recovering High Dynamic Range Radiance Maps from Photographs," Proc. ACM SIGGRAPH 1997, pp. 369-378 (1997)
11) Morioka, Okubo, and Mitsumine: "Real-Time Estimation Method of Lighting Condition with Sensor Cameras for Image Compositing," Transactions of the 13th Forum on Information Technology, No. 3, I-006, pp. 169-170 (2014)
12) G. Klein and D. Murray: "Parallel Tracking and Mapping for Small AR Workspaces," Proc. IEEE/ACM ISMAR 2007, pp. 225-234 (2007)
13) H. Park, H. Mitsumine, and M. Fujii: "Adaptive Edge Detection for Robust Model-Based Camera Tracking," IEEE Trans. Consum. Electron., Vol. 57, No. 4, pp. 1465-1470 (2011)
14) Kawakita, Nakagawa, and Sato: "Augmented TV: an Augmented Reality System for TV Pictures Beyond the TV Screen," Transactions of the Virtual Reality Society of Japan, Vol. 19, No. 3, pp. 319-328 (2014)

Video Analysis Techniques that Enhance Video Retrieval

Search engines have made it possible for anyone to obtain information readily from the Internet. On the other hand, broadcasters are still looking for simple ways to search large quantities of video footage. Improvements in computer capabilities have made it possible to quickly find similar images amidst large quantities of video footage, and computers can recognize and search video content by themselves, albeit in a limited manner. Broadcasters are therefore migrating their video to files on computers and preparing environments that will have enhanced video searching capabilities. This paper describes research on video analysis techniques that not only enhance searches of programs by using text data such as titles and summaries, but also enable searches of shots within the video content. It also describes a powerful video search system for broadcasters that has been introduced on a trial basis at NHK.

Broadcasters use huge volumes of video footage every day and have recently started handling these resources as computer files. One advantage of file-based methods is that they make it possible to skip through the replay, thereby enabling efficient editing and browsing. However, it is still impossible to find video as easily as text information, even after the video has been converted into file data. As with text data, quickly finding the desired images amidst a large quantity of video footage requires information that describes the content of each video (metadata), such as when (the temporal position) something appears in the footage and what (the subject content) it is. However, it is difficult to manually create enough metadata for large quantities of video footage. If detailed metadata could be automatically assigned to such video data, it would become possible to easily search for and retrieve video footage. In this paper, we describe various video analysis techniques for automating the creation of metadata with the aim of enhancing video retrieval. In addition, we describe trials of an enhanced video search system at NHK.

1. Introduction

The increasing number of Web pages has led to the rapid development of Internet search techniques that target large-scale text data. These techniques have in turn fed the phenomenal expansion of Web content and its usefulness. Techniques for searching images and audio have recently appeared, and powerful systems for searching Web content are becoming indispensable. However, the video services on the Internet search in units of video clips (short videos lasting from several tens of seconds to several minutes). We cannot find any service with a search function that can answer a query for video showing a desired subject within a certain time interval.

2. Video analysis techniques that enhance searching

2.1 Issues with video searching

The program archives of NHK are managed and searched in program units. Text information, such as the title, performers' names, and program summary, attached when each program was stored in the database, is used in such searches. The stored program video footage is recycled as raw material for the production of new programs and Web content.

Figure 1: Division into video segments and assignment of video content information (metadata). (Example program: title "Darwin is here! 'Chasing the giant catfish of Lake Biwa!'"; summary "The giant Lake Biwa catfish lives in Lake Biwa, which is the largest lake in Japan. With a total length of 1.2 meters, it is twice the length of ordinary catfish. And it is five times as heavy, at ten kilos. Seen only in Lake Biwa and its surroundings, it is rare...." Along the time axis of the program video footage, one segment shows "catfish", "lake", and "sun", while another shows "flowers" and "mountain", enabling effective information retrieval for video recycling purposes.)

Program producers seeking to recycle video footage would like to have a simple means of searching video in units smaller than programs. They need an easy-to-use system that automatically creates metadata describing video segments and uses that metadata for searches (Figure 1).

In order to find video using the current search system, users must search through the titles of programs that are likely to include what they are looking for, and they must then spend time replaying those programs to find it within the footage. It is not feasible to manually assign detailed metadata to a huge quantity of archived video (the NHK archives hold over 700,000 items). Although subtitles have been added to many programs, they usually do not describe the video content, so subtitles are of limited value in searches.

Raw video footage does not have text information attached to it; it is managed by means such as tape numbers and simple notes about the video content. The length of raw footage ranges from ten to almost one thousand times the running time of the program. It would be a great waste of labor and money to attach metadata to all the raw video footage, especially since most of it is never used in broadcasts. Furthermore, vast quantities of video are captured in a short time when there is a disaster or other major event; a file-based system should be able to quickly and efficiently assign metadata to such video footage.

As shown above, broadcasters must spend considerable effort in a variety of situations to find the video footage they want from among large quantities of footage, and they are demanding enhancements to the current search functions. In the following sections, we describe video analysis techniques for easily searching video footage, such as temporal demarcation; techniques for generating searchable content descriptions (metadata) from video footage; and similar-image searches based on the composition of objects in images, which is difficult to express in words.

2.2 Structure of video footage

Video differs from still images as a medium in that it has a time axis, and hence it is not easy to list its contents. In long video footage, such as a TV program, it takes time to discover the temporal position of the information one is interested in. For that reason, a video search function for obtaining footage quickly should first divide the video into easy-to-handle time lengths and then determine what those segments show.

A program video can be represented at several granularities, i.e., program, scene, and shot (or cut), as determined by the semantic and temporal divisions shown in Figure 2. A program can thus be viewed as a unit of video content consisting of a number of scenes; it is the result of editing the raw footage. A scene, or situation, refers to a semantic video segment having the same depicted time or location, and it consists of one or more (usually several) shots. A shot is a segment of video that was filmed without a break. The boundary between one shot and another is called a cut point (excluding boundaries that have special effects such as a wipe*1, dissolve*2, or fade*3); a shot may also be called a cut. Frames are the still images that, at about 30 frames per second, make up the video.

*1 A scene transition technique where the screen "wipes" away, starting from one side of the image and moving to the other, to lead in to the next piece of video footage.
*2 A scene transition technique where the screen gradually switches to the next piece of video footage.
*3 A scene transition technique where the screen gradually changes from a color such as black to a state in which the video footage can be seen (or vice versa).

Figure 2: Structure of program video footage. (Along the time axis, the program video footage divides into scenes, the scenes into shots, and the shots into frames.)
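The granularity hierarchy of Figure 2 maps naturally onto a simple data model in which search metadata hangs off the shot, the unit the retrieval techniques below operate on. The field names in this sketch are illustrative assumptions, not NHK's schema.

```python
# The program / scene / shot / frame hierarchy of Figure 2 as a data model.
from dataclasses import dataclass, field

@dataclass
class Shot:                 # video filmed without a break (cut to cut)
    start_frame: int
    end_frame: int
    keywords: list[str] = field(default_factory=list)  # per-shot metadata

@dataclass
class Scene:                # semantic segment: same depicted time or place
    shots: list[Shot]

@dataclass
class Program:              # edited unit of video content
    title: str
    scenes: list[Scene]
```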


Unedited raw video footage, on the other hand, has none of the demarcations corresponding to programs or scenes, which are given meaning by editing, and the medium (such as a tape) used in the filming is the top unit. Depending on the content being filmed, the entire raw video footage of one tape could consist of a single shot.

(1) Division into shots

A video analysis method that detects the boundary points between shots can be used to retrieve individual shots. In modern recording devices where video is recorded as a file, the shot delimiters are the data delimiters of the file units, so they do not need to be detected again; with VTR footage, however, the shot boundary information is often missing.

A basic method of shot boundary detection is to evaluate the continuity of the video data from one frame to the next and take a point where the continuity breaks to be a cut point. Histogram comparison methods are used to evaluate continuity. These use differences in color histograms*4, such as RGB (red-green-blue) or HSV*5, and the processing often does not cover the entire screen but is instead done within each of the 16 or so blocks into which the screen is divided1). Histogram comparison methods cannot detect switches between shots that have a high degree of similarity, and they can make detection errors on fast-moving shots. For those reasons, there has been research into techniques such as handling a number of feature quantities in a multi-dimensional feature space2) or detecting compositional changes in the texture of the images3).

The international TRECVid*6 workshops running from 2001 to 2007 were a driving force in the development and evaluation of methods for detecting shot boundaries4). The methods resulting from TRECVid can accurately detect shot boundaries (cut switchovers).

We have developed a shot boundary detection method5) that emphasizes processing speed in order to enhance practicality. This method uses the sum of absolute RGB density differences between frames, which has a low processing cost, to discover candidate shot boundaries. It then uses the block-matching difference*7, which has a high processing cost but is very accurate, to measure the difference between adjacent frames, and it takes any boundary exceeding a certain threshold value to be a shot boundary. It is the fastest boundary detection method developed so far. We have increased its accuracy on broadcast video footage by adding functions such as one that calculates between-frame differences not just between adjacent frames but also a few frames before and after each boundary candidate, treating any amount of change below a certain threshold as not being a shot boundary; this prevents erroneous detections due to the flashes that often occur in raw news footage.

However, these methods are less than perfect at detecting gradual switchovers, such as wipes, dissolves, and fades, that are often used as dramatic techniques. In addition, the challenge remains of reducing erroneous detections in cases such as large subjects crossing the screen or rapidly blinking light sources.

(2) Division of shot contents (thumbnail extraction)

It is sometimes useful to divide up shots that run a long time. In addition, shot boundaries are only time data; they do not describe the content of the shot. To represent the content, an image (thumbnail) can be extracted from the shot. A frame from the opening of the shot is often used as the thumbnail, even though the main subject does not necessarily appear in the opening. In fact, there are cases where many subjects appear in sequence. In such cases, a number of thumbnail images have to be extracted from one shot in order to represent its content, for instance, frames at the opening, ending, and middle of the shot, or in a format that depends on changes in the content rather than on time. Using more thumbnail images to search through the video also reduces the chance of missing the subject while requiring far less processing than searching all of the frames.

The method we developed accumulates the frame-difference information used for determining shot boundaries and outputs a thumbnail image whenever the accumulated value exceeds a certain threshold. The number of thumbnails thus depends on the amount of change in the content: many thumbnails are output when there are large changes in the video footage, but only a single thumbnail is output when there is little change.

(3) Division into scenes

In a program video that is the result of editing, individual shots typically last from five to 30 seconds, and a shot can feel too short when verifying search results or when looking for video footage that can be recycled for the purpose at hand. In addition, it is often easier to handle video summarized in terms of scenes, formed by placing semantic links between shots having the same depicted location or time. However, scene boundaries are semantic boundaries, and they may differ depending on the content, the person doing the viewing, the purpose of recycling, and so on. For this reason, automatic scene division is a difficult challenge.

*4 A frequency distribution chart with color plotted on the horizontal axis and frequency plotted on the vertical axis.
*5 A color space consisting of the three components of hue, saturation, and value.
*6 Text REtrieval Conference Video Retrieval Evaluation: a competitive workshop on information retrieval sponsored by the US National Institute of Standards and Technology (NIST).
*7 A technique of representing differences between frames, wherein images are divided into small areas (blocks) and it is checked whether the differences between corresponding blocks of consecutive frames exceed a certain value.
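As a concrete illustration of the basic histogram comparison described under (1), the sketch below divides each frame into 16 blocks and declares a cut when the blocks' color histograms differ strongly on average. The block count, histogram bins, and threshold are arbitrary examples; NHK's production method additionally uses RGB frame differences and block matching, which are not reproduced here.

```python
# Block-wise colour-histogram comparison for cut detection.
import cv2
import numpy as np

BLOCKS, THRESH = 4, 0.5          # 4x4 = 16 blocks; example threshold

def block_histograms(frame):
    """HSV histogram of each of the 16 blocks the screen is divided into."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    hists = []
    for by in range(BLOCKS):
        for bx in range(BLOCKS):
            block = hsv[by * h // BLOCKS:(by + 1) * h // BLOCKS,
                        bx * w // BLOCKS:(bx + 1) * w // BLOCKS]
            hist = cv2.calcHist([block], [0, 1], None, [16, 4],
                                [0, 180, 0, 256])
            hists.append(cv2.normalize(hist, hist).flatten())
    return hists

def is_cut(prev_frame, cur_frame):
    """Declare a cut when the blocks' histograms differ strongly on average."""
    diffs = [cv2.compareHist(p, c, cv2.HISTCMP_BHATTACHARYYA)
             for p, c in zip(block_histograms(prev_frame),
                             block_histograms(cur_frame))]
    return float(np.mean(diffs)) > THRESH
```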

The current scene division techniques use comparatively superficial information, such as changes in color within the video or changes in the audio signal, to judge the continuity of content, and they often regard points where that continuity breaks as scene cuts7). We have researched scene division techniques based on the continuity of color information8)9). Another method is to analyze the subtitles synchronized with broadcast video footage as a document and compute the degree of cohesiveness of single words (a measure of how often the same word appears)10). Although this is useful for some purposes, it often cannot be used on broadcast programs, because words indicating the subjects are unlikely to appear throughout the entire program. To divide video into scenes that humans would recognize as such, we need to research scene division techniques that integrate components such as sound and text with the video footage, enabling not just signal processing of a single stream of media information but also reflection of the subject and its meanings.

2.3 Techniques of generating content description information (metadata)

The most common video search queries (search keywords) are typically the name of a subject, often a noun such as "Mount Fuji", "sea", or "bird". Search efficiency would be greatly improved if such subject names could be assigned automatically to video segments that have been divided into shot units as described above.

The metadata describing content varies greatly, depending not only on the content but also on the search situation and the user. Someone searching for a news item might query something like "press conference by the Prime Minister on XX day of YY month" or "Metropolitan Police Department building". On the other hand, someone interested in recycling video footage in programs might enter search keywords such as "Okinawa dolphin", whereas someone compiling a sports report might enter "Ichiro home-run". However, describing video footage in enough detail to produce relevant results for such queries is difficult even for humans, and the range that can be described by video analysis alone is rather limited. The current technology can only handle keywords consisting of general nouns such as "person" or "mountain" and is confined to indicating "what it is" for generic objects or "what is occurring" in very limited circumstances.

(1) Object recognition

Object recognition refers to an automated function in which a computer recognizes an object that is the subject of a video and outputs text (typically, the name of the subject). Object recognition has been taken up as a challenge by researchers since the advent of computers, but computers cannot yet recognize objects as easily as humans do.

In the 1990s, programs were developed that can determine the presence or absence of a subject by representing local features, such as the corners and edges of a subject within an image, as vectors and training a computer to recognize these vectors as representing the desired subjects. As shown in Figure 3, image features are expressed as a large number of feature quantities (feature vectors) that are digitizations of the colors and patterns of the image. This set of feature vectors is processed by the computer in order to find the conceptual boundaries that separate a set of positive examples, which include the subject, from a set of negative examples, which do not.

Figure 3: General object recognition using the bag-of-visual-words technique. (Training: feature points are extracted from the training data and gradient histograms calculated; clustering in the multi-dimensional gradient histogram space creates visual words, and an identifier (discriminant criterion) is learned from the feature vectors of "watch" and "not watch" examples. Identification: feature points of an input image are extracted, their gradient histograms substituted into visual words, an appearance-frequency feature vector is calculated, and the image is identified, e.g., as a wristwatch.)

The presence or absence of the subject in question is determined by which of the sets the unknown image's feature vectors belong to. This method works on different subjects by changing the training data that gives the correct answer.

The Scale-Invariant Feature Transform (SIFT) method11) and the Speeded Up Robust Features (SURF) method12) are examples of feature quantity extraction methods. Techniques such as Bag-of-Visual-Words (BoVW) that were subsequently developed can quickly check a large number of feature quantities and handle local feature quantities independently of position. The BoVW technique calculates a gradient histogram*8 of luminances from the area surrounding a feature point, such as an edge or corner, and takes clusters of such points in a multi-dimensional space (similar objects form a cluster) to be visual words*9. It then calculates feature vectors based on the appearance frequencies of those clusters, as shown in Figure 3, and uses a Support Vector Machine (SVM)*10 or the like to identify the features.

In 2012, a method called deep learning, which uses a neural network*11 with many layers, was applied to general object recognition. Besides greatly improving recognition accuracy13), it can automatically determine which feature quantities are valid. Recent advances in computer capabilities and the availability of large quantities of video footage and images on the Internet have made it possible to develop appropriate training methods for building neural networks with good performance. However, training a neural network still takes time, and since a large quantity of data is required and the parameter adjustment is complicated, it will likely be some time before deep learning is practical.

We are developing a technique that gives good recognition accuracy despite using less training data. It is especially useful in broadcasting, where the training time is short14). This technique resolves issues with the BoVW technique, wherein spatial position information within the video frame is not reflected in the feature vectors and the training data is expensive to make. It divides each frame image into areas of various sizes and then composes feature vectors reflecting spatial information about the subject by calculating the image features in each area. In the image feature calculation, subject features can be comprehended more accurately by considering global feature vectors that take account of larger areas, in addition to the local feature vectors obtained by the BoVW technique. Furthermore, by selecting effective feature quantities, it achieves highly accurate recognition even with a comparatively small quantity of training data. We have also devised a semi-supervised training*12 method that can efficiently yet accurately create training data by assigning labels to only part of the data15). Furthermore, we use search functions like those used on the Internet when collecting images for training16).

(2) Face detection

The faces of people in TV programs often convey important meanings. Finding situations within news videos, such as interviews and press conferences, in which faces are shown would also be useful for recycling video footage. In comparison with other objects, the variation among human faces viewed from the front is comparatively small, which makes them especially amenable to detection and recognition.

Face detection research17) has a long history. Early on, a face was depicted as an outline that a computer then matched against standard face patterns18). One widely used method19) uses AdaBoost*13, a statistical machine learning method*14. It extracts rectangular feature quantities represented by a group of black-and-white filters and configures strong trainers*15 by selecting, from the weak trainers*16 for each feature quantity, those with a high degree of importance. Face detection systems such as this are used for improving the photographic quality of commercial digital cameras, for authentication on smartphones and PCs, and for checking persons' identities at airports20). However, this approach works only when the subject directly faces the camera under controlled size and brightness conditions. It has difficulty detecting faces in video with varied lighting conditions and persons facing directions other than directly at the camera. It also has trouble accurately identifying individuals in the crowd scenes that appear in many TV programs.

We are conducting research into face detection, tracking, and identification techniques21) based on the method of reference 19). However, faces in video footage to be used in broadcasts, particularly in raw news material, may be partially obscured by masks, eyeglasses, and hats. Furthermore, faces may have extreme orientations. The detection results in such cases have many false negatives. In order to reduce their number, we modified an object recognition method so that it could train using close-up images of human faces, and we combined it with another face detection method already used in some applications22).

*8 A frequency distribution diagram with the gradient direction plotted along the horizontal axis and gradient strength along the vertical axis.
*9 A method of handling local feature quantities of an image as single words, through the application of a natural language processing method called Bag-of-Words, which handles a document as a set of words.
*10 A supervised machine training method for pattern identification (a training method for data that has been assigned labels) which creates linear classifiers that divide data into two classes.
*11 A mathematical model of brain functions that can be used to make a brain simulation on a computer.
*12 A machine training method that generates identifiers from incomplete training data (a mixture of labeled and unlabeled data).
*13 A machine training algorithm that adjusts for mistakes made by the previous identifier when it creates the next identifier.
*14 A machine training method based on statistical methods.
*15 A trainer that has a high identification accuracy as a result of combining weak trainers.
*16 A trainer that does not have a high identification accuracy, but is more accurate than random identification.
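The AdaBoost cascade approach of reference 19) is implemented in OpenCV's stock Haar cascades, so the basic face detection step can be sketched in a few lines. This is the generic detector, with the frontal-face limitation noted above; it is not NHK's modified method of references 21) and 22), and the file name is a placeholder.

```python
# Frontal face detection with OpenCV's Haar cascade (Viola-Jones style
# AdaBoost cascade of rectangular black-and-white filters).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return (x, y, w, h) boxes for frontal faces in one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

faces = detect_faces(cv2.imread("news_frame.png"))
```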

(3) Recognition of text information

If text appearing in video footage could be recognized readily, extremely effective metadata in the form of unique place names, personal names, and product names could be attached to broadcast video footage. Printed text recognition (optical character recognition, OCR) is a mature technology: characters can be recognized with a high degree of accuracy if documents are printed in black and white with a clear font, and sufficient accuracy can also be obtained for multi-level typesetting and for vertical and horizontal writing.

However, the same cannot be said for characters appearing in images (photographs) and video23). There are various reasons for this, such as the wide variety of typefaces, character colors, and background colors, which prevent characters from being neatly separated from the background, and the large effects of variations in the size, orientation, and slope of characters. In addition, normal scenes contain numerous edge features similar to those of characters, such as buildings and window frames, making it difficult to detect the character areas themselves. Finally, languages that use Chinese characters, such as Japanese, are extremely difficult targets for text recognition because they have many different characters and feature both vertical and horizontal writing.

The recent spread of smartphones and the advent of eyeglass-type smart devices have led to an increasing demand for character recognition techniques, and more and more researchers are being attracted by the challenges set by competitive workshops24) for developing techniques of recognizing character data within video footage.

(4) Detection of events

People can easily recognize events (incidents), such as goals in a soccer game, by observing short video segments. If a computer could detect such events as easily, it would lead to a variety of new uses of video.

A lot of research has gone into automatic identification of the movements of people and into detecting suspicious behavior in video footage25). Events in sports programs, such as home runs, doubles, and strikeouts in baseball, can now be extracted by virtue of the fact that they occur in comparatively similar shots26)27). It is also possible for computers to recognize motions such as jumps and shots at the hoop in basketball28), and free kicks and kick-offs by soccer players from video capturing a view of the entire pitch29).

However, unlike events captured under specific conditions such as those of sports, the identification of complex and diverse events in everyday situations is still a difficult challenge. There is a recognition method that limits events according to the type of subject30), and attempts are being made to recognize complex events by combining and correlating simpler events31). As evidence of the growing interest in such studies, the TRECVid Multimedia Event Detection task for extracting complicated events from video footage began in 201032).

Unlike still images such as photographs, video footage has movement information that can be used to describe events in greater detail. The detection and recognition of complex events will continue to be an important research topic for the foreseeable future.

2.4 Similar image search

There are times when a user wants to search for video, say, of a landscape that is similar to an image at hand, or has an image in mind and wants to focus on details, such as its composition, color, or pattern, that cannot easily be described in words. Similar-image search technology, which finds images similar to an example image, is a way to enable such non-verbal information to be used as a query.

A similar image search digitizes the colors and patterns within a given image and compares the mathematical distances between these numerical values and the stored digitized data of candidate images. Previous training of the system is unnecessary, and large numbers of images can be processed comparatively quickly by digitizing and storing features such as colors and patterns beforehand. A number of such systems are in use or nearing deployment, with functions33) similar to those of Google image search and other image search web sites34).

We are researching similar image search techniques that calculate the degree of similarity to an example image. For broadcast production purposes, we are interested more in similarity of overall structure than in exactly matching images. Our method compares the colors and textures of rather large 4x4 blocks, enabling searches with an emphasis on the overall composition.
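A minimal sketch of this kind of composition-oriented matching: each image is reduced to the mean color of each block in a 4x4 grid, and similarity is the distance between the resulting vectors. The block statistics and distance measure are simplified stand-ins for the color-and-texture features of reference 35).

```python
# Composition-oriented similarity using rather large 4x4 blocks.
import cv2
import numpy as np

def block_signature(image_bgr, blocks=4):
    """Mean colour of each block of a 4x4 grid, as one feature vector."""
    h, w = image_bgr.shape[:2]
    sig = [image_bgr[by * h // blocks:(by + 1) * h // blocks,
                     bx * w // blocks:(bx + 1) * w // blocks].mean(axis=(0, 1))
           for by in range(blocks) for bx in range(blocks)]
    return np.concatenate(sig)

def distance(img_a, img_b):
    """Smaller distance means more similar overall composition."""
    return np.linalg.norm(block_signature(img_a) - block_signature(img_b))

d = distance(cv2.imread("query.png"), cv2.imread("candidate.png"))
```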

Figure 4: Processing for when a subject in a similar image search straddles a number of block boundaries. A saliency map is used to specify the subject area, the central block is displaced into the subject area, and image features are compared between blocks while chasing the subject, making it possible to find images whose layout differs slightly.


To address the problem of the degree of similarity deteriorating when the main subject straddles the boundaries of a number of blocks, as shown in Figure 4, we have also implemented a gradual similar image search that moves the blocks within a certain range35). Experiments on this technique showed that over 70% of the images among the top-ten search results by degree of similarity were judged to be similar, which we consider to be sufficient accuracy for narrowing down video footage. This technique can also take a hand-drawn sketch as input, comparing its color and texture with those of images stored in the database36). The similar image search could be further developed to cover whole scenes by comparing the feature quantities of representative still images of a number of shots.
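The block-displacement idea of Figure 4 can be sketched in the same vein: each query block may also be matched against neighboring block positions in the candidate image, taking the smallest distance, so that a subject straddling block boundaries is not unduly penalized. The ±1-block search range below is an assumption; the article says only that blocks are moved "within a certain range."

```python
# A sketch of gradual similar-image search with block displacement,
# reusing the (grid*grid, dim) block features from the previous example.
# The +/-1 block search range is an illustrative assumption.
import numpy as np

def displaced_similarity(feats_q, feats_c, grid=4, max_shift=1):
    fq = feats_q.reshape(grid, grid, -1)
    fc = feats_c.reshape(grid, grid, -1)
    total = 0.0
    for y in range(grid):
        for x in range(grid):
            best = np.inf
            # compare against the same block and its displaced neighbors
            for dy in range(-max_shift, max_shift + 1):
                for dx in range(-max_shift, max_shift + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < grid and 0 <= nx < grid:
                        best = min(best,
                                   np.linalg.norm(fq[y, x] - fc[ny, nx]))
            total += best
    return total
```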

3. Search system that applies video analysis techniques
Video footage management in the past mainly involved searching through libraries of VTR tapes and small quantities of text data. NHK is moving away from this method and is converting video into file data. We have started experimental use of video search systems. Below, we describe two of those systems.

3.1 Earthquake disaster metadata supplementation system
During the Great East Japan Earthquake that occurred in 2011, large quantities of raw video were captured for news coverage. This video footage is not just valuable as raw broadcast material; it is very important for preventing and reducing the impact of future disasters, and there is an urgent need to organize and archive it. However, the work of organizing it is enormous, and the construction of a video database has been a challenge.

We have prototyped a metadata supplementation system that detects shot boundaries in raw video footage and reduces the work needed to assign subject metadata22)37). This system automatically assigns metadata by combining techniques such as video analysis and sound identification. With the aim of assigning accurate metadata efficiently, it provides automatic demarcation of footage into shot units and high-speed assignment of subject information. Users manually correct recognition errors and add semantic information. Its video search function uses the assigned metadata. The system is currently in trial operation at the Fukushima broadcasting station, where it has processed more than ten thousand VTR tapes in approximately three months, and materials found by using its video search functions have been used in program production. It works on a notebook PC, and we are studying how it can be used for other purposes. An overview of the processing is shown in Figure 5, and the system in use at the Fukushima office is shown in Figure 6.

Figure 6: The earthquake disaster metadata supplementation system in use at the Fukushima office

Figure 5: Overview of the processing of the earthquake disaster metadata supplementation system. Footage read from an LTO*1 drive is divided into shots and converted into proxy video and thumbnails, which are transferred by an FTP*2 tool; object recognition, face detection, and audio recognition techniques, together with caption data and filming notes, then extract subjects (experimentally: fire and flames, water, rubble, helicopter, ambulance, nuclear plant, aerial view, face close-up, interview, people's voices). Operators verify, correct, and add metadata with check tools, and the results are stored in the search database.
*1 Linear Tape-Open: a standard for magnetic tape for computers
*2 File Transfer Protocol

3.2 Validation of sophisticated archives search
Targeting NHK's archive of past programs, we have prototyped a search system that incorporates object recognition for extracting subject information and similar image searches for video footage. We plan to start verification experiments of the search functions in January 2015. The system provides the following search functions on the search screen shown in Figure 7:
- Search using subject metadata assigned to each shot by object recognition
- Search for shots that are similar to a designated shot or an uploaded image
- List of program contents by shot units or scene units
- List of shots in a program that have superimpositions (characters or diagrams overlaid on the screen)
- List of shots in a program that show faces
- Search results for shots that are similar to the currently selected shot
We will evaluate the search functions at broadcast production sites as well as the function that automatically assigns metadata to video archives.

Figure 7: Example of search screens of the NHK archives search system

4. Conclusions
This paper described trends in video analysis techniques and the circumstances under which we have developed a video footage search system that enables users to readily obtain the video footage they want. The increasing capabilities of computers have made it possible for broadcasters to handle large quantities of video footage by turning it into file data, and we have finally arrived at the era of automated video search. However, there are still many challenges to overcome before we truly have a search system that can easily find and retrieve video footage, and various research projects are currently under way (see Reference 38 for details).

The video analyses that we have described for enhancing video searches work at the current technological level, depending on the usage environment and the method of systemization. Even though it is not possible to automate the entire process, we can incorporate these techniques into support systems to reduce the work of broadcasters and workers in other fields that process video materials.

(Hideki Sumiyoshi)

References
1) A. Nagasaka and Y. Tanaka: "Automatic Video Indexing and Full-Video Search for Object Appearances," Proc. IFIP TC 2/WG 2.6 Second Working Conference on Visual Database Systems II, pp. 113-127 (1991)
2) Iwamoto and Yamada: "A Cut Detection Method for a Video Sequence based on Multi-Dimensional Feature Space Analysis," FIT, I-026 (2005)
3) Mochizuki, Tadenuma, and Yagi: "Cut Point Detection based on Variations of Fractal Features," IEICE General Conference, D11-134, p. 134 (2005)
4) A. Smeaton, P. Over, and A. Doherty: "Video Shot Boundary Detection: Seven Years of TRECVid Activity," Computer Vision and Image Understanding, Vol. 114, Issue 4, pp. 411-418 (2010)
5) Kawai, Sumiyoshi, and Yagi: "Fast Detection Method for Shot Boundary Including Gradual Transition Using Sequential Feature Calculation," IEICE Transactions on Information and Systems D, Vol. J91-D, No. 10, pp. 2529-2539 (2008)
6) Kawai, Sumiyoshi, Fujii, and Yagi: "Method of Correcting Video Fluctuations due to Flash Using Frame Interpolation," ITE Journal, Vol. 66, No. 11, pp. J444-J452 (2012)
7) Sou, Ogawa, and Haseyama: "Study into Increasing Accuracy of Scene Divisions by MCMC Method, Focusing on Video Structure," IEICE Technical Report, CAS Circuits and Systems, 110 (86), pp. 115-120 (2010)
8) Fukuda, Mochizuki, Sano, and Fujii: "Scene Sequence Generation of Program Video Based on Integrative Color List," Proceedings of the ITE Annual Convention, 23-8 (2012)
9) Mochizuki and Sano: "Video Scene Sequence Generation by Shot Integration based on Image Piece List," FIT, No. 3, H-004, pp. 101-102 (2013)
10) M. Hearst: "Multi-Paragraph Segmentation of Expository Text," Proc. 32nd Annual Meeting of the Association for Computational Linguistics (1994)
11) D. G. Lowe: "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 60 (2), pp. 91-110 (2004)


12) H. Bay, T. Tuytelaars, and L. Van Gool: "SURF: Speeded Up Robust Features," Proc. European Conference on Computer Vision, pp. 404-415 (2006)
13) A. Krizhevsky, I. Sutskever, and G. Hinton: "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25 (2012)
14) Kawai and Fujii: "Semantic Concept Detection based on Spatial Pyramid Matching and Semi-supervised Training," ITE Trans. Media Technology and Applications, Vol. 1, No. 2, pp. 190-198 (2013)
15) Kawai and Fujii: "A Video Retrieval System based on Interactive Learning Considering Closed Caption and Image Feature," Proceedings of the ITE Annual Convention, 23-7 (2012)
16) Kawai, Mochizuki, Sumiyoshi, and Sano: "NHK STRL at TRECVID 2013: Semantic Indexing," TREC Video Retrieval Evaluation (TRECVID 2013 Workshop) (2013)
17) Iwai, Lao, Yamaguchi, and Hirayama: "A Survey on Face Detection and Face Recognition," Information Processing Society of Japan, CVIM Research Materials, CVIM-149 (37) (2005)
18) Sakai, Nagao, Fujibayashi, and Kidode: "Line Extraction and Pattern Detection in a Photograph," Information Processing, Vol. 10, No. 3, pp. 132-142 (1969)
19) P. Viola and M. Jones: "Robust Real-time Face Detection," International Journal of Computer Vision (IJCV), 57 (2), pp. 137-154 (2004)
20) "Concerning the Implementation of Demonstration Experiments on New Automated Gates," http://www.moj.go.jp/nyuukokukanri/kouhou/nyuukokukanri04_00023.html
21) S. Clippingdale and M. Fujii: "Video Face Tracking and Recognition with Skin Region Extraction and Deformable Template Matching," International Journal of Multimedia Data Engineering and Management (IJMDEM), Vol. 3, No. 1, pp. 36-48 (2012)
22) Sumiyoshi, Kawai, Mochizuki, Sano, and Fujii: "Metadata Supplementation System for Earthquake Disaster Archives," Proceedings of the ITE Annual Convention, 6-1 (2012)
23) L. Neumann and J. Matas: "A Method for Text Localization and Recognition in Real-world Images," Asian Conference on Computer Vision (ACCV 2010), pp. 2067-2078 (2010)
24) D. Karatzas et al.: "ICDAR 2013 Robust Reading Competition," ICDAR 2013 (2013)
25) J. Fiscus et al.: "TRECVID 2009 Video Surveillance Event Detection Track," 2009 TREC Video Retrieval Evaluation Notebook Papers and Slides, http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.9.org.html
26) Mochizuki, Fujii, Yagi, and Shinoda: "Automatic Event Classification of Baseball Broadcast Video, Using Patterning of Scenes Focusing on Next Shot in Baseball and Discrete Hidden Markov Model," ITE Journal, Vol. 61, No. 8, pp. 1139-1149 (2007)
27) Yazaki, Misu, Nakata, Motoi, Kobayashi, Matsumoto, and Yagi: "Increasing the Accuracy of Sports Event Detection Using a Bayes Hidden Markov Model," IEICE Technical Report, Human Information Processing, HIP 109 (471), pp. 401-406 (2010)
28) M. Takahashi, M. Naemura, M. Fujii, and J. J. Little: "Recognizing Action in Broadcast Basketball Video on the Basis of Global and Local Pairwise Representation," Proc. IEEE International Symposium on Multimedia (ISM 2013), pp. 147-154 (2013)
29) Misu, Takahashi, Tadenuma, and Yagi: "Real-Time Event Detection based on Formation Analysis of Soccer Video," FIT, LI-003 (2005)
30) M. Mazloom, E. Gavves, K. E. A. van de Sande, and C. Snoek: "Searching Informative Concept Banks for Video Event Detection," Proc. 3rd ACM International Conference on Multimedia Retrieval (ICMR 2013), pp. 255-262 (2013)
31) Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann: "Complex Event Detection via Multi-Source Video Attributes," IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2013), pp. 2627-2633 (2013)
32) J. Fiscus, G. Sanders, D. Joy, and P. Over: "2013 TRECVID Workshop Multimedia Event Detection and Recounting Tasks," http://www-nlpir.nist.gov/projects/tvpubs/tv13.slides/tv13.med.mer.final.slides.pdf (2013)
33) Google: "Similar Images Graduates from Google Labs," http://googleblog.blogspot.com/2009/10/similar-images-graduates-from-google.html
34) amanaimages: http://amanaimages.com/
35) T. Mochizuki, H. Sumiyoshi, M. Sano, and M. Fujii: "Visual-based Image Retrieval by Block Reallocation Considering Object Region," Asian Conference on Pattern Recognition (ACPR 2013), PS2-03, pp. 371-375 (2013)
36) Mochizuki, Sumiyoshi, and Fujii: "Faster Image Retrieval by Query Image Drawing using Structural Template," FIT, No. 3, H-040, pp. 219-220 (2010)
37) Sumiyoshi, Kawai, Mochizuki, Clippingdale, and Sano: "Metadata Supplementation System for Earthquake Disaster Archives," Proceedings of the ITE Annual Convention, 14-2 (2013)
38) M. Haseyama, T. Ogawa, and N. Yagi: "A Review of Video Retrieval Based on Image and Video Semantic Understanding," ITE Trans. Media Technology and Applications, Vol. 1, No. 1, pp. 2-9 (2013)

Series: Element Technologies for Advanced Hybridcast
Hybridcast is a technology platform for integrated broadcast-broadband services. The first service, "NHK Hybridcast," was launched in September 2013. The Science & Technology Research Laboratories (STRL) has been conducting research and development of technologies for Hybridcast so that it will enable even more convenient and diverse services to be provided. This series of four articles features the typical element technologies and service models for advanced Hybridcast.

Overview of R&D for Advanced Hybridcast
Hisayuki Ohmata, Integrated Broadcast-Broadband Systems Research Division

Hybridcast provides viewers with a new kind of TV experience by integrating broadcast and broadband. Convenient services that provide online information as the TV program progresses are available on a TV and on companion devices such as smartphones and tablets.

STRL started research and development of Hybridcast in 2010 and has been contributing to related standardization activities at organizations such as IPTV Forum Japan*1. This article introduces our current research on advanced Hybridcast, which provides even more diverse services with higher functionality.

Technology for video streaming services available on various devices
NHK Hybridcast currently provides video-on-demand services such as video archives. STRL has conducted research on efficient video distribution technologies that can provide video content with one streaming format for various devices, including TVs, smartphones, and PCs. We are currently developing a video player for Hybridcast-ready TVs and video distribution systems compliant with MPEG-DASH*2, which is a standard video streaming technology for smartphones and PCs.

System technology enabling various providers to offer Hybridcast services
Diversity of services is one of the important elements for Hybridcast to become more popular. Therefore, we have been researching a system architecture called "non-broadcast-oriented managed application"*3 that enables third-party providers as well as broadcasters to offer Hybridcast services safely and securely.

Accurate synchronization technology for TVs and companion devices
Have you ever wanted to watch a live sports program from different angles at the same time? To realize such multi-view functionality, we have developed a video synchronization technology for multiple devices. This technology enables multiple video streams, i.e., broadcast content on a TV and broadband content on tablets, to be synchronized with frame-by-frame accuracy.

*1 IPTV Forum Japan: a Japanese domestic organization for the standardization of IPTV, in which NHK, commercial broadcasters, TV manufacturers, and telecommunications operators participate.
*2 MPEG-Dynamic Adaptive Streaming over HTTP: a video streaming standard for providing optimum video quality depending on network status.
*3 Non-broadcast-oriented managed application: a type of Hybridcast application that can be provided by third-party service providers as well as broadcasters.
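As a rough illustration of what frame-accurate synchronization involves, the sketch below (plain Python, not the Hybridcast API) assumes both devices can derive the broadcast playback position from a shared reference clock; the companion device then either nudges its playback rate or seeks whenever its drift exceeds one frame period. The player interface is hypothetical.

```python
# A conceptual sketch of companion-device synchronization, assuming a
# shared reference clock and a hypothetical player object exposing
# position(), seek(), and set_rate(). Not the actual Hybridcast API.
import time

FRAME_PERIOD = 1 / 29.97            # assumed frame rate of the content

class CompanionSync:
    def __init__(self, player, broadcast_epoch):
        self.player = player                     # hypothetical player
        self.broadcast_epoch = broadcast_epoch   # shared start time (s)

    def correct(self):
        target = time.time() - self.broadcast_epoch  # broadcast position now
        drift = self.player.position() - target      # + means we are ahead
        if abs(drift) > FRAME_PERIOD:
            self.player.seek(target)                 # large drift: jump
        else:
            self.player.set_rate(1.0 - 0.1 * drift)  # small drift: converge
```

In practice the reference would come from timestamps carried with the broadcast signal rather than wall-clock time, but the control loop has the same shape.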

Figure: Element Technologies for Advanced Hybridcast. Efficient video distribution to various types of devices, diverse applications provided safely and securely by service providers over the Internet, and accurate synchronization technology between broadcast content on TVs and broadband content on companion devices.

Automatic Radio Broadcasting System Using Speech Synthesis for Stock Market/Weather Reporting

Hiroyuki Segi, Human Interface Research Division

A speech synthesis system stores speech data segments (vocal recordings divided into segments), takes as input the text to be read out, and combines individual speech data segments to create the readout. If the required sound is not available, it substitutes speech data segments having the closest sound. This results in unnatural synthesized speech that does not meet broadcast quality standards. To solve this problem, the Science & Technology Research Laboratories (STRL) has devised speech synthesis technology that collects vocal data and synthesizes speech with quality equivalent to that of an actual announcer. This technology has been used in the "Stock Market Report" on the NHK Radio 2 channel since March 2010.

Expanded range of speech synthesis applications
Examinations of the use of speech synthesis in the "Weather Report" program on the NHK Radio 2 channel revealed that it would be possible to expand the range of application of this technology. Weather reports feature many standard terms associated with geography, numbers, and directions. Using this knowledge, we have produced standard sentence patterns (templates) for reporting weather-related content and constructed a speech database that contains a wide range of templates with speech data. By combining templates from the database, it is possible to produce high-quality synthesized weather reports.

Automatic radio broadcasting system based on speech synthesis
This new system automatically broadcasts two types of programs: stock market and weather reports. As shown in the figure, it analyzes stock prices and weather data received from external sources using two separate speech synthesizers and outputs the corresponding synthesized sounds as files. After one of the synthesizers produces a sound file, a sound transmitter automatically plays back the file for transmission at the scheduled program broadcast time.

The same kind of sound transmitter is used for both applications; it has a solid record in the stock market report application, which requires high reliability and stable transmission. The transmitter uses speech rate conversion* to vary the rate and pauses of playback of the synthesized speech file in order to make the playback fit the predetermined program length and to adjust its ending time during the broadcast.

This automatic radio broadcasting system for stock market and weather reports has been used for stock market reports since March 2014; operation tests for weather reports are ongoing. We will continue to examine other programs with the potential to incorporate speech synthesis.

* Speech rate conversion: a technology for adjusting the speech rate and pause lengths without degrading sound quality.
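The template-and-segment approach can be illustrated with a toy example: a weather-style template is filled with slot values, each phrase is mapped to a pre-recorded segment, and the closest available segment is substituted when an exact recording is missing (the fallback behavior described above). The template, segment library, and file names are all invented for illustration.

```python
# A toy illustration of template-based concatenative synthesis.
# All phrases and file names are invented; a real system would work
# with much finer segments and acoustic matching.
import difflib

SEGMENTS = {                       # phrase -> recorded audio segment
    "tokyo": "tokyo.wav",
    "sunny": "sunny.wav",
    "cloudy": "cloudy.wav",
    "tomorrow will be": "tomorrow_will_be.wav",
}

TEMPLATE = ["{place}", "tomorrow will be", "{weather}"]

def synthesize(place, weather):
    """Return the ordered list of audio segments to concatenate."""
    phrases = [p.format(place=place, weather=weather) for p in TEMPLATE]
    audio = []
    for phrase in phrases:
        if phrase in SEGMENTS:
            audio.append(SEGMENTS[phrase])
        else:
            # fall back to the closest-sounding available segment
            closest = difflib.get_close_matches(phrase, SEGMENTS,
                                                n=1, cutoff=0.0)[0]
            audio.append(SEGMENTS[closest])
    return audio

print(synthesize("tokyo", "sunny"))
# ['tokyo.wav', 'tomorrow_will_be.wav', 'sunny.wav']
```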

Figure: Overview of the stock market/weather report automatic broadcasting system. Stock data (e.g., "NHK ¥12,345, +¥123") and weather data (e.g., a typhoon report for the area south of the Sea of Okhotsk) received from external sources are converted into synthesized-speech sound files by separate speech synthesizers for the stock market and weather reports, and a sound transmitter plays the files back on NHK Radio 2.

Large-capacity Transmission Technology for 8K Super Hi-Vision Terrestrial Broadcasting

Takuya Shitomi, Advanced Transmission Systems Research Division

8K Super Hi-Vision (8K) terrestrial broadcasting will require, in addition to audio/video compression encoding technologies, a large-capacity transmission technology to enhance spectral efficiency. The Science & Technology Research Laboratories (STRL) has been working on dual-polarized MIMO*1, "ultra-multilevel" OFDM*2 modulation, and effective symbol length*3 expansion (Figure) as elements of this technology.

Current digital terrestrial broadcasting utilizes either horizontally or vertically polarized waves to transmit one Hi-Vision (HD) program over a single channel (6 MHz bandwidth). By contrast, dual-polarized MIMO technology uses both horizontally and vertically polarized waves to transmit different signals, effectively doubling the transmittable data capacity.

The OFDM signal for current digital terrestrial broadcasting transmits a maximum of 6 bits (64 signal points) on each of its 5,617 carriers. The incorporation of ultra-multilevel OFDM modulation technology will raise that maximum to 12 bits (4096 signal points). The amount of transmittable data is doubled because each carrier is capable of sending twice as many bits.

The effective symbol length of the current digital terrestrial broadcasting OFDM signal is approximately 1 ms. Expanding the effective symbol length while maintaining the same guard interval (GI)*4 as current digital terrestrial broadcasting reduces the GI ratio in a single transmitted symbol to 1/32 from the previous 1/8, resulting in an increase in transmission capacity.

In 2014, we set up an experimental transmission system and reception station based on this technology in Hitoyoshi City, Kumamoto, and, for the first time in the world, conducted terrestrial 8K transmissions over a long distance (27 km), which is the standard distance of current digital terrestrial broadcasting. We are now installing equipment at several reception points and beginning to make long-term propagation measurements.

With the goal of reducing the degradation of reception characteristics that accompanies the increase in the number of signal points in ultra-multilevel OFDM, we will continue to study ways of optimizing the signal point arrangement and error correcting code. The goal of delivering digital 8K broadcasting over terrestrial waves will be reached through channel sounding in urban areas and analysis of the collected data.

*1 Multiple-Input Multiple-Output: a transmission technology using multiple transmitting/receiving antennas.
*2 Orthogonal Frequency Division Multiplexing
*3 Effective symbol length: the time duration of the useful part of an OFDM signal (symbol) for data transmission.
*4 Guard Interval: a redundant section inserted to prevent inter-symbol interference.
*5 Digital Terrestrial Television Broadcasting
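The quoted numbers are enough to reproduce the "four times current DTTB" figure in rough form. The short calculation below multiplies the three gains (bits per carrier, polarizations, and the useful fraction of each symbol); it deliberately ignores error-correction overhead and other real-system factors, so it is an approximation rather than an official capacity figure.

```python
# Rough arithmetic behind the roughly fourfold capacity gain, using only
# the numbers quoted in the article; coding overhead is ignored.
bits_cur, bits_new = 6, 12          # 64QAM vs 4096QAM bits per carrier
pol_cur, pol_new = 1, 2             # single vs dual polarization (MIMO)
tu_cur, tu_new = 1.0, 4.0           # effective symbol length (ms)
gi = tu_cur / 8                     # guard interval kept the same (ms)

def useful_fraction(tu):            # share of each symbol carrying data
    return tu / (tu + gi)

gain = ((bits_new * pol_new * useful_fraction(tu_new)) /
        (bits_cur * pol_cur * useful_fraction(tu_cur)))
print(f"capacity gain: about {gain:.2f}x")   # about 4.36x
```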

Figure: Large-capacity transmission technology overview. The 8K test station (UHF ch. 46) feeds a dual-polarized transmission antenna and combines dual-polarized MIMO using two polarized waves (2x current DTTB*5), ultra-multilevel modulation with 4096QAM (2x current DTTB), and effective symbol length expansion to approx. 4 ms, for roughly four times the capacity of current digital terrestrial broadcasting, which uses a single polarized wave, 64QAM, and an effective symbol length of approx. 1 ms.

Low-Current-Density Spin-Transfer Switching in Gd22Fe78-MgO Magnetic Tunnel Junction
Journal of Applied Physics, Vol. 115, 203903.1-203903.3 (2014)
Hidekazu Kinjo, Kenji Machida, Koichi Matsui*, Ken-ichi Aoshima, Daisuke Kato, Kiyoshi Kuga, Hiroshi Kikuchi, and Naoki Shimidzu
*Tokyo Denki University

We have been investigating a spatial light modulator driven by spin-transfer switching (spin-SLM) for three-dimensional holography systems. In the present study, we fabricated perpendicular TMR light modulation devices with a Co-Fe/MgO/Co-Fe tunnel junction and a Gd-Fe light modulation layer. At a device size of about 560 nm × 560 nm, although the TMR ratio was 7.0%, which is a considerably low value, the magnetization of the light modulation layers switched at a low current density of 1.0×10⁶ A/cm². This low-current switching is mainly attributed to thermally assisted spin-transfer switching as a consequence of thermal magnetic behavior arising from Joule heating, because Gd-Fe alloys are temperature-sensitive materials.

Development of a Multilink 10 Gbit/sec Mapping Method and Interface Device for 120 Frames/sec Ultra High-Definition Television Signals
SMPTE Motion Imaging Journal, Vol. 123, No. 4, pp. 29-38, May/June (2014)
Takuji Soeno, Yukihiro Nishida, Takayuki Yamashita, Yuichi Kusakabe, Ryohei Funatsu, and Tomohiro Nakamura

We are researching a next-generation ultra high-definition television (UHDTV) broadcasting system. The video parameters of UHDTV systems with 120 frames/sec signals are specified in Recommendation ITU-R BT.2020. In this study, a new mapping method was developed to transmit various UHDTV signals, including 120 frames/sec signals. A prototype interface for connecting UHDTV video devices was also developed. The method transforms UHDTV signals into multilink 10 Gbit/sec streams; the number of 10 Gbit/sec streams differs according to the UHDTV format (frame frequency, pixel count, and sampling lattice). To realize a compact, low-power interface, we implemented the prototype using a parallel fiber-optic transceiver with a capacity of 10 Gbit/sec per channel. Finally, we verified the practicality and feasibility of the multilink 10 Gbit/sec mapping method and the prototype interface.
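As a back-of-the-envelope illustration of why the number of links depends on the format, the sketch below derives a raw bit rate from the frame size, frame rate, and bits per pixel, then divides by a nominal 10 Gbit/s link. The 20 bits/pixel figure (4:2:2 sampling at 10 bits) and the assumption that the full link rate is available for payload are our simplifications, not values from the SMPTE mapping specification.

```python
# A back-of-the-envelope estimate of how many 10 Gbit/s links a given
# UHDTV format needs; bit depth and usable payload are assumptions.
import math

def links_needed(h, v, frame_rate, bits_per_pixel, link_gbps=10.0):
    rate_gbps = h * v * frame_rate * bits_per_pixel / 1e9
    return math.ceil(rate_gbps / link_gbps), rate_gbps

for fps in (120, 60):   # 8K at 120 and 60 frames/s, 4:2:2 at 10 bits
    links, rate = links_needed(7680, 4320, fps, 20)
    print(f"{fps} fps: {rate:.1f} Gbit/s -> {links} x 10 Gbit/s links")
# 120 fps: 79.6 Gbit/s -> 8 x 10 Gbit/s links
# 60 fps: 39.8 Gbit/s -> 4 x 10 Gbit/s links
```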

Low-Voltage-Operation Avalanche Photodiode Based on N-Gallium Oxide/P-Crystalline Selenium Heterojunction
Applied Physics Letters, Vol. 104, No. 24, pp. 242101.1-242101.4 (2014)
Shigeyuki Imura, Kenji Kikuchi, Kazunori Miyakawa, Hiroshi Ohtake, and Misao Kubota

Because of the strong demand for high-definition and high-frame-rate video cameras, highly sensitive imaging devices are greatly desirable. We are developing high-sensitivity image sensors overlaid with photoconversion layers that use carrier multiplication at low applied voltage. In this study, we used crystalline selenium (c-Se), which has an extremely large absorption coefficient over the entire visible region, as a photoconversion layer and fabricated a test structure to measure its photoconversion characteristics. The measurements showed that the dark current was significantly reduced by using n-type wide-band-gap gallium oxide (Ga2O3) to suppress carrier injection from the external electrode; hence, we demonstrated avalanche multiplication in c-Se films exhibiting an extremely high external quantum efficiency (EQE) of over 100% for the first time. Furthermore, tin (Sn) doping of the Ga2O3 layer effectively increases the carrier concentration, allowing the depletion layer between Ga2O3 and c-Se to spread into the c-Se layer and resulting in a low-voltage-operation avalanche photodiode. These results are a major step toward highly sensitive image sensors, opening the door to the age of next-generation ultra high-definition imaging systems.

We are distributing various types of information on the Web. Shown here are just some examples. Please feel free to access them and send us your opinions.

STRL Bulletin
Multi-Viewpoint Robotic Camera System Achieving High-Quality "Time Slice"

A multi-viewpoint robotic camera system captures a moving subject from various directions. Using multiple robotic cameras, this system can produce "time slice" video that freezes time while moving the viewpoint around the subject. NHK has recently developed a new multi-viewpoint robotic camera system that enables more diversified productions. This system was used in a live broadcast of the NHK Trophy Grand Prix of Figure Skating held last November.

The new system increases the number of cameras from the previous system's 9 to 16, enabling "time slice" with a wider rotating viewing angle and smoother image switching. It marks a significant improvement in both operability and performance through downsizing of the robotic camera units, a reduction in the number of cables, and faster image processing.

The live broadcast demonstrated that the new system could render a skater's posture during the kicking action and in the air during jumps from different viewpoints, making such movements much easier for the viewer to understand. The trial also showed that operability could be improved by shortening the time needed to install and adjust the system in the field. Going forward, we plan to continue our research on multi-viewpoint robotic camera systems toward new forms of image rendering.

Spring 2015, no. 60
NHK Science & Technology Research Laboratories Bulletin
© NHK Science & Technology Research Laboratories

Address: 1-10-11, Kinuta, Setagaya-ku, Tokyo, 157-8510, Japan

Phone: +81 (0)3-5494-1125  Fax: +81 (0)3-5494-3125
http://www.nhk.or.jp/strl/english/index.html

Shooting with the multi-viewpoint robotic camera system

Editors

Instantaneous movement of figure skaters rendered by a "time slice" rotating viewpoint

Toru KURODA, publisher, Director of STRL
Toru IMAI, editor-in-chief
Kyoko KIMURA, editor
Kenichi MURAYAMA, editor
Arisa FUJII, editor
Narichika HAMAGUCHI, editor
Hiroyuki KANEKO, editor
Ayako SUMI, editor
Layout & Design: Yohko OHTA, Masami OHNISHI
DTP: Media-jin, Inc.

The fruits of another year of research at STRL will be put on show at the NHK Open House 2015, which runs from May 28 through May 31. This year's event focuses on the technology of 8K Super Hi-Vision (which is about to begin test transmission) and will include 26 demonstrations of our latest research and nine poster presentations. Until last year, each technology necessary for 8K broadcasting was exhibited on a separate booth; this year, we will showcase how 8K is finally coming to life as a complete broadcasting system, from the shooting of video footage to the reception of broadcasts in people's homes. Every year, this event attracts around 20,000 visitors, ranging from broadcasting experts to neighborhood families. We hope that you will stop by to tour the exhibits and discuss the future of broadcasting with us.
Visit us online for the latest news and updates: http://www.nhk.or.jp/strl/open2015/en/


Overview of NHK STRL
The role of the NHK Science and Technology Research Laboratories is to help build a richer broadcasting culture from a research and development viewpoint, both as Japan's only research organization dedicated to broadcasting technology and as part of Japan's public broadcaster. To this end, STRL is conducting a wide range of R&D, from basic technologies to practical applications, on next-generation broadcast media, universal broadcasting services, advanced program production technology, and devices and materials for use in broadcasting. We are using these technologies to enrich programming and are actively working on standardization, which is essential for the successful implementation of new services.

Research Activities

Research into services using the complementary features of broadcasting and communications
With the growth of broadband and the increasing speed and capacity of wireless infrastructure, we are working on the development of enhanced broadcasting services that use telecommunications.

Hybridcast
We are researching and developing a platform called Hybridcast that features new ways for viewers to enjoy TV. Hybridcast realizes new services that combine broadcasting, which can send information to a large number of people at once, with broadband, which can send information interactively and individually. Hybridcast will be able to provide services with synchronization of broadcast and broadband, linkage with terminals such as tablets, and social media. Hybridcast provides diverse information and services relating to TV broadcasts over the Internet: it gives viewers in-depth information during TV programs and lets them have fun sharing information with their acquaintances.

Technology for the use of "big data for broadcasting"
We are researching and developing a "Video Bank" that facilitates comprehensive searches and flexible usage of video assets for video production. It brings together a mechanism that collects useful metadata when video assets are first acquired and a technique that automatically generates metadata by analyzing the stored footage. We are also researching and developing database utilization techniques and service models that use program-related information and large-scale data sources, such as social networking services (SNS), as "big data for broadcasting."

8K Super Hi-Vision
We are researching a next-generation TV system called 8K Super Hi-Vision (8K) that conveys a sense of presence and reality in a very lifelike way. An 8K system combines an extremely high resolution 33-megapixel picture (7,680×4,320) with three-dimensional sound provided by a 22.2 multichannel sound system. At NHK, we are researching and developing not only 8K cameras, production equipment, and transmission and display devices, but also video/audio coding devices, error correction methods, and modulation schemes to enable 8K to be delivered to homes by satellite or terrestrial broadcasting.

Audio technology
We are working on the development and standardization of a three-dimensional 22.2 multichannel sound system (an upper layer with nine channels, a middle layer with ten channels, a lower layer with three channels, and two subwoofer channels), and we are researching transmission and playback technologies that can be used in the home.

Imaging technology
We are researching and developing various 8K cameras, including a high-sensitivity camera that can operate even at the low brightness levels inside a theater, a compact cube-shaped single-chip 8K camera with a head weighing just 2 kg, and three-chip 8K imaging equipment that uses 8K image sensors compatible with a high frame rate of 120 fps. We are also researching stacked organic imaging devices with the aim of producing cameras that are both smaller and higher in picture quality.

Transmission technology
To deliver 8K signals to homes, we are studying satellite broadcasting in the 12-GHz and 21-GHz bands, as well as terrestrial broadcasting and cable distribution. We are also researching and developing technologies such as wavelength division multiplexing for the delivery of video contributions to the studio by cable and uncompressed wireless transmission in the 120-GHz band.

Compression and coding techniques
We are researching efficient compression methods based on the HEVC (High Efficiency Video Coding) video compression standard, and we are developing coding techniques for the efficient delivery of 8K signals to households, including a new coding method called image restoration video coding that makes use of super-resolution techniques.

Recording technology
To make a single-chip 8K camera with a built-in recording device, we are developing a parallel solid-state memory, and for archiving purposes, we are researching holographic recording technologies characterized by high-density multiplexed recording and the ability to record and play back in "data page" units of around 1 Mbit.

Display technology
We are researching and developing liquid crystal and plasma displays that can show extremely high resolution imagery (approximately 33 megapixels). We are also conducting research aimed at making projectors compatible with the 8K color gamut and lightweight flexible sheet-type displays that can be rolled up and carried around.

About us

ORGANIZATION

NHK Science & Technology Research Laboratories

Planning & Coordination Division: research planning/management, public relations, international/domestic liaison, external collaborations, etc.
Patents Division: patent applications and administration, technology transfers, etc.
Integrated Broadcast-Broadband Systems Research Division: Hybridcast, security, production and utilization of metadata, content recommendation, etc.
Advanced Transmission Systems Research Division: satellite/terrestrial transmission technology, millimeter-wave and optical 8K contribution technology, multiplexing technology, IP transmission technology, etc.
Advanced Television Systems Research Division: 8K program production equipment, video coding for efficient transmission, highly realistic audio systems, etc.
Human Interface Research Division: speech recognition, advanced language processing such as simple Japanese and sign-language CG creation, transmission of tactile/haptic information, etc.
Three-Dimensional Image Research Division: spatial 3D video system technology (integral 3D, etc.), 3D display device technology, cognitive science and technology, etc.
Advanced Functional Devices Research Division: ultrahigh-resolution and ultrasensitive imaging devices, high-capacity and fast-write recording technology, sheet-type display technology, etc.
General Affairs Division: personnel, labor coordination, accounting, building management, etc.

Three-dimensional television for playing back spatial images
As the next step in television after 8K, we are researching and developing three-dimensional television for playing back spatial images, capable of displaying natural 3D images that can be viewed without special glasses. We are researching technology for capturing and displaying integral 3D images with high quality, as well as devices aimed at producing 3D displays by means of holography.

Integral 3D television
We are researching and developing integral 3D television technology that can capture and display images from various viewpoints by using a miniature lens array. The technology can display natural 3D images that can be viewed without special glasses and can accommodate changes in the viewer's position, not only horizontally but also vertically. A three-dimensional television for playing back spatial images has to handle a huge amount of information in order to reproduce depth details, and we are researching ways of increasing the number of pixels in the capture and display devices.

An image displayed on a miniature lens array, as seen from various viewpoints

Generating 3D content from multi-viewpoint video
We are using integral video technologies to capture integral 3D images of subjects that are difficult to capture with conventional optical methods, such as subjects that are far away or large. For this, we are developing a technology whereby multiple cameras are used to capture 3D information about the subject, which is then transformed into an integral 3D image.

Ultra-high-resolution spatial optical conversion element
We are researching an electro-holographic display as a means of 3D television capable of playing back spatial images. To display 3D images with a wide field of view, it is necessary to use a high-resolution spatial light modulator (SLM) consisting of pixels with higher resolution than has so far been achieved. We have devised a spin-transfer switching SLM (spin SLM) device with a pixel pitch of less than 1 μm and are working on its development.

Human-friendly Broadcasting Services
We are researching human-friendly broadcasting techniques to ensure that programs can be enjoyed by everyone, including people with disabilities, elderly persons, and children.

Computer graphics-based sign-language translation technology
We are researching automatic translation of Japanese into sign language so that weather reports can be conveyed in sign language even during emergencies when no sign-language interpreter is available. This technology synthesizes computer graphics-based sign language by using a dictionary database containing 20,000 Japanese terms paired with 3D motion data for the corresponding sign-language gestures, produced with TVML (TV Program Making Language).

Example of computer graphics-based sign-language translation

A real-time closed-captioning system based on speech recognition
We are researching speech recognition for making subtitles for live broadcasts in order to offer an expanded range of subtitled programs to the elderly and people with hearing disabilities.

Tactile/haptic presentation of broadcast data
As aids for people with visual disabilities, we are researching tactile and haptic means of conveying information that may be difficult to deliver in words, such as two-dimensional information (pictures, etc.) and three-dimensional information (sculptures, etc.).
