CH 7: FUNDAMENTALS OF CODING

Inter frame redundancy: An inter frame is a frame in a video compression stream which is expressed in terms of one or more neighboring frames. The "inter" part of the term refers to the use of inter frame prediction. This kind of prediction tries to take advantage of the temporal redundancy between neighboring frames, enabling higher compression rates.

The temporal encoding aspect of this system relies on the assumption that rigid body motion is responsible for the differences between two or more successive frames. The objective of the motion estimator is to estimate the rigid body motion between two frames. The motion estimator operates on all 16 x 16 image blocks of the current frame and generates the pixel displacement, or motion vector, for each block. The technique used to generate motion vectors is called block-matching motion estimation. The method uses the current frame I_k and the previous reconstructed frame f_{k-1} as input. Each block in the previous frame is assumed to have a displacement that can be found by searching for it in the current frame. The search is usually constrained to a reasonable neighborhood so as to minimize the complexity of the operation. Search matching is usually based on a minimum MSE or MAE criterion. When a match is found, the pixel displacement is used to encode the particular block. If a search does not meet a minimum MSE or MAE threshold criterion, the motion compensator will indicate that the current block is to be spatially encoded using the intraframe mode.

Motion estimation techniques, full search and fast search: Motion estimation (ME) is used extensively in video coders based on the MPEG-4 standards to remove interframe redundancy. Motion estimation is based on the block-matching method, which evaluates block mismatch by the sum of squared differences (SSD) measure. Winograd's Fourier transform is applied, and the redundancy of the overlapped-area computation among reference blocks is eliminated, in order to reduce the computational cost of the ME. When the block size is N x N and the number of reference blocks in a search window is the same as the current block, this method reduces the computational amount (additions and multiplications) by 58% relative to the straightforward approach for N = 8 and by 81% for N = 16, without degrading motion tracking capability. The proposed fast full-search ME method enables more accurate motion estimation than conventional fast ME methods, so it can be applied in video systems.

The popularity of video as a means of data representation and transmission is increasing; hence the requirements on the quality and size of video are growing. High visual quality of video is provided by coding. In the 1960s, motion estimation (ME) and compensation were proposed to improve the efficiency of video coding. The current frame is divided into non-overlapping blocks. For each block of the current frame, the most similar block within a limited search area of the reference frame is found. The criterion of similarity of the two blocks is called the metric of comparison of the two blocks. The position of the block for which an extremum of the metric is found determines the coordinates of the motion vector of the current block. The full search algorithm is the most accurate method of block ME, i.e. the proportion of true motion vectors found is the highest. The current block is compared to all candidate blocks within the restricted search area in order to find the best match, so this ME algorithm requires a lot of computing resources.
Therefore, many alternative fast motion estimation algorithms have been developed. In 1981, T. Koga and co-authors proposed the three-step search algorithm (TSS). The disadvantage of fast search methods is that they may find only a local extremum of the block-difference function. Consequently, the motion estimation accuracy can drop (in some sequences by as much as half) compared to the brute-force full search, and the visual quality of the video degrades as well.
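To make the block-matching procedure above concrete, the following is a minimal Python/NumPy sketch of full-search motion estimation for a single 16 x 16 block using the MAE (mean absolute error) criterion; the frame arrays, block position and search range are illustrative assumptions, not part of any standard.

```python
import numpy as np

def full_search(cur, ref, bx, by, N=16, R=8):
    """Full-search block matching for one NxN block.

    cur, ref : 2-D luminance arrays (current and reference frame)
    bx, by   : top-left corner of the current block
    R        : search range in pixels (+/- R around the block position)
    Returns the motion vector (dy, dx) and the best MAE value.
    """
    block = cur[by:by+N, bx:bx+N].astype(np.float64)
    best, best_mae = (0, 0), np.inf
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            y, x = by + dy, bx + dx
            # skip candidates that fall outside the reference frame
            if y < 0 or x < 0 or y + N > ref.shape[0] or x + N > ref.shape[1]:
                continue
            cand = ref[y:y+N, x:x+N].astype(np.float64)
            mae = np.mean(np.abs(block - cand))
            if mae < best_mae:
                best_mae, best = mae, (dy, dx)
    return best, best_mae

# toy usage with a synthetically shifted frame (illustrative only)
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))   # simulate a global shift of (2, 3) pixels
mv, err = full_search(cur, ref, bx=16, by=16)
print(mv, err)   # expect (-2, -3): this block's content came from 2 rows up, 3 columns left in ref
```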

The Criterion to Compare Blocks: The standards of video coding do not regulate the choice of the criterion (metric) for matching two blocks. One of the most popular metrics is the sum of squared differences (SSD):

SSD(i, j) = Σ_{y=0}^{Nh−1} Σ_{x=0}^{Nw−1} ( B(x, y) − S(x + i, y + j) )²

where i, j are the coordinates of the motion vector of the current block, i ∈ (−Vw/2; Vw/2), j ∈ (−Vh/2; Vh/2); Vw × Vh is the size of the area over which the upper-left corner of the candidate block may move on the reference frame; x, y are the coordinates within the current block B; Nw × Nh is the size of block B; S is the reference area of size Sw × Sh, where Sw = Nw + Vw and Sh = Nh + Vh; B and S are luminance components of images in the YUV color format. Within the search area of size Sw × Sh, the minimum value of the SSD criterion for the current block B determines the coordinates of its motion vector.

SSD can be calculated with fewer operations by decomposing it into three components:

SSD(i, j) = Σ_{y=0}^{Nh−1} Σ_{x=0}^{Nw−1} B²(x, y) − 2 Σ_{y=0}^{Nh−1} Σ_{x=0}^{Nw−1} B(x, y) S(x + i, y + j) + Σ_{y=0}^{Nh−1} Σ_{x=0}^{Nw−1} S²(x + i, y + j)

The first term does not depend on (i, j), the third term can be updated incrementally as the candidate position moves, and the middle cross-correlation term can be computed efficiently with fast transforms.
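As a quick numerical check of the SSD decomposition above, the short NumPy sketch below compares the direct computation with the three-term form for one candidate block position; the block contents are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
B = rng.integers(0, 256, (N, N)).astype(np.float64)   # current block
S = rng.integers(0, 256, (N, N)).astype(np.float64)   # candidate block at position (i, j)

# direct SSD
ssd_direct = np.sum((B - S) ** 2)

# decomposition: sum(B^2) - 2*sum(B*S) + sum(S^2)
ssd_decomp = np.sum(B * B) - 2.0 * np.sum(B * S) + np.sum(S * S)

print(ssd_direct, ssd_decomp)   # identical values
```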

We propose to replace the straightforward computation of this cross-correlation term by other fast transforms: the Winograd algorithm and the Fermat number-theoretic transform (NTT).

Backward motion estimation: The motion estimation that we have discussed in Section 20.3 and Section 20.4 is essentially backward motion estimation, since the current frame is considered the candidate frame and the reference frame on which the motion vectors are searched is a past frame; that is, the search is backward. Backward motion estimation leads to forward motion prediction. Backward motion estimation is illustrated in the figure below.

Forward motion estimation: It is just the opposite of backward motion estimation. Here, the search for motion vectors is carried out on a frame that appears later than the candidate frame in temporal ordering. In other words, the search is "forward". Forward motion estimation leads to backward motion prediction. Forward motion estimation is illustrated in Fig. 20.3.

It may appear that forward motion estimation is unusual, since one requires future frames to predict the candidate frame. However, this is not unusual, since the candidate frame for which the motion vector is being sought is not necessarily the current, that is, the most recent, frame. It is possible to store more than one frame and to use one of the past frames as a candidate frame, with another frame, appearing later in the temporal order, as its reference. Forward motion estimation (leading to backward prediction) is supported under the MPEG-1 and MPEG-2 standards, in addition to the conventional backward motion estimation. The standards also support bi-directional motion compensation, in which the candidate frame is predicted from a past reference frame as well as a future reference frame with respect to the candidate frame.

Frame classification: In video compression, a video frame is compressed using different algorithms. These different algorithms for video frames are called picture types or frame types, and they are I, P and B. The characteristics of the frame types are:
I-frame: I-frames are the least compressible but do not require other video frames to decode.
P-frame: It can use data from previous frames to decompress and is more compressible than an I-frame.
B-frame: It can use both previous and forward frames for data reference to get the highest amount of data compression.
An I-frame (Intra coded picture) is a complete image, like a JPG image file. A P-frame (Predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded; the encoder does not need to store the unchanging background pixels in the P-frame, which saves space. P-frames are also known as delta frames. A B-frame (Bidirectional predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
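As a simple illustration of the P-frame idea just described, here is a hedged NumPy sketch that encodes a frame as the residual from a motion-compensated previous frame and reconstructs it at the decoder; the frames and the single global motion vector are illustrative assumptions, not part of any standard.

```python
import numpy as np

rng = np.random.default_rng(2)
prev = rng.integers(0, 256, (64, 64)).astype(np.int16)   # reconstructed previous frame
true_mv = (0, 3)                                          # pretend the whole scene moved 3 px right
cur = np.roll(prev, shift=true_mv, axis=(0, 1))           # current frame

# encoder: motion-compensated prediction + residual
pred = np.roll(prev, shift=true_mv, axis=(0, 1))          # prediction using the motion vector
residual = cur - pred                                     # what actually gets coded in a P-frame

# decoder: rebuild the frame from the previous frame, the motion vector and the residual
recon = np.roll(prev, shift=true_mv, axis=(0, 1)) + residual
print(np.array_equal(recon, cur), int(np.abs(residual).sum()))  # True, 0 (prediction is perfect in this toy case)
```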

Picture/Frame: The terms picture and frame are used interchangeably. The term picture is the more general notion, as a picture can be either a frame or a field. A frame is a complete image, and a field is the set of odd-numbered or even-numbered scan lines composing a partial image. For example, an HD 1080 picture has 1080 lines of pixels. An odd field consists of pixel information for lines 1, 3, 5, ..., 1079. An even field has pixel information for lines 2, 4, 6, ..., 1080. When video is sent in interlaced scan format, each frame is sent in two fields: the field of odd-numbered lines followed by the field of even-numbered lines. A frame used as a reference for predicting other frames is called a reference frame.
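To make the frame/field relationship concrete, the following small NumPy sketch (illustrative only, not tied to any standard API) splits a progressive frame into its odd and even fields and weaves them back together at the receiver.

```python
import numpy as np

frame = np.arange(1, 9)[:, None] * np.ones((1, 4), dtype=int)  # 8-line toy "frame"; each row holds its line number

odd_field  = frame[0::2, :]   # lines 1, 3, 5, 7 (1-based line numbering)
even_field = frame[1::2, :]   # lines 2, 4, 6, 8

# interlaced transmission sends the two fields separately; the receiver re-interleaves them
rebuilt = np.empty_like(frame)
rebuilt[0::2, :] = odd_field
rebuilt[1::2, :] = even_field
print(np.array_equal(rebuilt, frame))   # True
```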

Frames encoded without information from other frames are called I-frames. Frames that use prediction from a single preceding reference frame are called P-frames. Frames that use prediction from an average of two reference frames, one preceding and one succeeding, are called B-frames.
Slices: A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. I-slices, P-slices and B-slices take the place of I, P and B frames.
Macroblock: It is a processing unit in image and video compression formats based on linear block transforms, typically the DCT. It consists of 16x16 samples, is subdivided into transform blocks, and may be further subdivided into prediction blocks.
Partitioning of a picture:
Slices:
•A picture is split into 1 or several slices
•Slices are self-contained
•Slices are a sequence of macroblocks
Macroblocks:
•Basic syntax and processing unit
•Contains 16x16 luma samples and 2 x 8x8 chroma samples
•Macroblocks within a slice depend on each other
•Macroblocks can be further partitioned
Elements of video encoding and decoding:

Video coding basic system

Encoder block diagram of typical block based hybrid coder

Discrete Cosine Transform- The DCT decomposes each input block into a series of waveforms, each with a specific spatial frequency, and outputs an 8x8 block of horizontal and vertical frequency coefficients.
Quantization- The quantization block uses psychovisual characteristics to eliminate the unimportant, high-frequency DCT coefficients.
Inverse Quantization- IQ computes the inverse quantization by multiplying the quantized DCT coefficients with the quantization table.
Inverse Discrete Cosine Transform- IDCT recomputes the original input block; errors are expected due to quantization.
Motion Estimation- ME uses a scheme with fewer search locations and fewer pixels to generate motion vectors indicating the directions of the moving images.
Motion Compensation- The MC block increases the compression ratio by removing the redundancies between frames.
Variable Length Coding (lossless)- VLC reduces the bit rate by sending shorter codes for common (run of zeros, non-zero level) pairs and longer codes for less common pairs.
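The following is a minimal sketch of the transform/quantization/inverse path described above, using SciPy's DCT and an illustrative flat quantization step rather than a real standard quantization table.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(3)
block = rng.integers(0, 256, (8, 8)).astype(np.float64)   # one 8x8 image block

Q = 16.0                                      # illustrative uniform quantizer step
coeff   = dctn(block, norm='ortho')           # forward 8x8 DCT
quant   = np.round(coeff / Q)                 # quantization (information is lost here)
dequant = quant * Q                           # inverse quantization
recon   = idctn(dequant, norm='ortho')        # inverse DCT

print(np.max(np.abs(recon - block)))          # small reconstruction error caused by quantization
```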

Decoder block diagram

Example of a wireless video codec (encoder and decoder) that includes pre-processing of the captured data to interface with the encoder and post-processing of the data to interface with the LCD panel. The video codec is compliant with the low bit rate codec for multimedia telephony defined by the Third Generation Partnership Project (3GPP). The baseline codec defined by 3GPP is H.263, and MPEG-4 Simple Visual Profile is defined as optional. The video codec implemented supports the following video formats:
1. SQCIF or 128 x 96 resolution
2. QCIF or 176 x 144 resolution at Simple Profile Level 1
3. CIF or 352 x 288 resolution at Simple Profile Level 2
4. 64 kbit/s for Simple Profile Level 1
5. 128 kbit/s for Simple Profile Level 2
Video CODEC Description: The video encoder implemented requires a YUV 4:2:0 non-interlaced video input and, therefore, pre-processing of the video input may be required depending on the application. For the video decoder, post-processing is needed to convert the decoded YUV 4:2:0 data to RGB for display.
Features:
1. Pre-processing: YUV 4:2:2 interlaced (from a camera, for example) to YUV 4:2:0 non-interlaced; only decimation and no filtering of the UV components.
2. Post-processing: YUV 4:2:0 to RGB conversion; display formats of 16-bit or 12-bit RGB; 0 to 90 degrees rotation for landscape and portrait displays.
3. MPEG-4 Simple Profile Level 0, Level 1 and Level 2 support.
4. H.263 and MPEG-4 decoder and encoder compliant.
5. MPEG-4 video decoder options: AC/DC prediction; Reversible Variable Length Coding (RVLC); Resynchronization Marker (RM); Data Partitioning (DP); error concealment (proprietary techniques); 4 Motion Vectors per macroblock (4MV); Unrestricted Motion Compensation; decoding of VOS layers.
6. MPEG-4 video encoder options: Reversible Variable Length Coding (RVLC); Resynchronization Marker (RM); Data Partitioning (DP); 4 Motion Vectors per Macroblock (4MV); Header Extension Codes; bit-rate target change during encoding; coding frame-rate change during encoding; insertion (or not) of the Visual Object Sequence start code.
7. Support for insertion of an I-frame during the encoding of a sequence.
8. Encoder Adaptive Intra Refresh (AIR) support.
9. Multi-codec support: multiple codecs running from the same code.
Video Architecture, Pixel Representation: Red, Green and Blue (RGB) are the primary colors for the computer display, and the color depth supported by the OMAP5910 is programmable up to 16 bits per pixel, RGB565 (5 bits for Red, 6 bits for Green and 5 bits for Blue). In consumer video such as DVD, camera, digital TV and others, the common color coding scheme is YCbCr, where Y is the luminance, Cb is the blue chrominance and Cr is the red chrominance. Human eyes are much more sensitive to the Y component of the video, and this enables sub-sampling of the chrominance components without the loss being noticeable to the human eye. The sub-sampling scheme is referred to as YCbCr 4:2:0, YCbCr 4:2:2 or YCbCr 4:4:4.
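To illustrate the post-processing step mentioned above, here is a hedged NumPy sketch of a YCbCr 4:2:0 to RGB conversion, assuming the common BT.601 full-range conversion coefficients; actual hardware may use different coefficients, fixed-point arithmetic and better chroma upsampling.

```python
import numpy as np

def yuv420_to_rgb(Y, Cb, Cr):
    """Y: HxW luma plane; Cb, Cr: (H/2)x(W/2) chroma planes (4:2:0 sampling)."""
    # upsample chroma to full resolution by simple pixel replication
    Cb_full = np.repeat(np.repeat(Cb, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
    Cr_full = np.repeat(np.repeat(Cr, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
    Yf = Y.astype(np.float64)

    # assumed BT.601 full-range conversion formulas
    R = Yf + 1.402 * Cr_full
    G = Yf - 0.344136 * Cb_full - 0.714136 * Cr_full
    B = Yf + 1.772 * Cb_full
    return np.clip(np.stack([R, G, B], axis=-1), 0, 255).astype(np.uint8)

# toy 4x4 frame in 4:2:0 layout
Y  = np.full((4, 4), 128, dtype=np.uint8)
Cb = np.full((2, 2), 128, dtype=np.uint8)
Cr = np.full((2, 2), 200, dtype=np.uint8)
print(yuv420_to_rgb(Y, Cb, Cr)[0, 0])   # a reddish pixel
```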

Video coding standards MPEG and H.26X: The Moving Picture Experts Group (MPEG) was established in 1988 in the framework of the Joint ISO/IEC Technical Committee (JTC 1) on Information Technology, with the mandate to develop standards for the coded representation of moving pictures, associated audio and their combination when used for storage and retrieval on digital storage media at bitrates up to about 1.5 Mbit/s. The standard was nicknamed MPEG-1 and was issued in 1992. The scope of the group was later extended to provide appropriate MPEG-2 video and associated audio compression algorithms for a wide range of audio-visual applications at substantially higher bitrates not successfully covered or envisaged by the MPEG-1 standard. Specifically, MPEG-2 was given the charter to provide video quality not lower than NTSC/PAL and up to CCIR 601 quality, with bitrates targeted between 2 and 10 Mbit/s. Emerging applications, such as digital cable TV distribution, networked database services via ATM, digital VTR applications, and satellite and terrestrial digital broadcasting distribution, were seen to benefit from the increased quality expected to result from the emerging MPEG-2 standard. The MPEG-2 standard was released in 1994. Table I summarizes the primary applications and quality requirements targeted by the MPEG-1 and MPEG-2 video standards, together with examples of typical video input parameters and compression ratios achieved.

The MPEG-1 and MPEG-2 video compression techniques developed and standardized by the MPEG group have developed into important and successful video coding standards worldwide, with an increasing number of MPEG-1 and MPEG-2 VLSI chip-sets and products becoming available on the market. One key factor for this success is the generic structure of the MPEG standards, supporting a wide range of applications and application-specific parameters [schaf, siko1]. To support the wide range of application profiles, a diversity of input parameters, including flexible picture size and frame rate, can be specified by the user. Another important factor is that the MPEG group only standardized the decoder structures and the bitstream formats. This allows a large degree of freedom for manufacturers to optimize the coding efficiency (in other words, the video quality at a given bit rate) by developing innovative encoder algorithms even after the standards were finalized.

MPEG-1 Standard (1991) (ISO/IEC 11172)
•Target bit-rate about 1.5 Mbps
•Typical image format CIF, no interlace
•Frame rate 24 ... 30 fps
•Main application: video storage for multimedia (e.g., on CD-ROM)
MPEG-2 Standard (1994) (ISO/IEC 13818)
•Extension for interlace, optimized for TV resolution (NTSC: 704 x 480 pixels)
•Image quality similar to NTSC, PAL, SECAM at 4-8 Mbps
•HDTV at 20 Mbps
MPEG-4 Standard (1999) (ISO/IEC 14496)
•Object based coding
•Wide range of applications, with choices of interactivity, scalability, error resilience, etc.

MPEG-1: coding of I-pictures
•I-pictures: intraframe coded
•8x8 DCT
•Arbitrary weighting matrix for coefficients
•Differential coding of DC coefficients
•Uniform quantization
•Zig-zag scan, run-level coding
•Entropy coding
•Unfortunately, not quite JPEG
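To illustrate the zig-zag scan and run-level coding steps listed above, the sketch below orders an 8x8 block of quantized coefficients along the standard zig-zag path and emits (run-of-zeros, level) pairs; the block contents are made up for illustration, and the entropy coding of the pairs is omitted.

```python
import numpy as np

def zigzag_indices(n=8):
    """Return the (row, col) visiting order of the standard zig-zag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level(block):
    """Run-level code the AC coefficients of a quantized 8x8 block."""
    scan = [block[r, c] for r, c in zigzag_indices()][1:]   # skip the DC coefficient
    pairs, run = [], 0
    for v in scan:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    pairs.append(('EOB',))                                  # end-of-block marker
    return pairs

blk = np.zeros((8, 8), dtype=int)
blk[0, 0], blk[0, 1], blk[1, 0], blk[2, 2] = 50, 3, -2, 1   # a sparse quantized block
print(run_level(blk))   # [(0, 3), (0, -2), (9, 1), ('EOB',)]
```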

MPEG-1: coding of P-pictures
•Motion-compensated prediction from an encoded I-picture or P-picture (DPCM)
•Half-pel accuracy of motion compensation, bilinear interpolation
•One displacement vector per macroblock
•Differential coding of displacement vectors
•Coding of prediction error with 8x8 DCT, uniform threshold quantization, zig-zag scan as in I-pictures

MPEG-1: coding of B-pictures
•Motion-compensated prediction from two consecutive P- or I-pictures, either:
 ‒only forward prediction (1 vector/macroblock), or
 ‒only backward prediction (1 vector/macroblock), or
 ‒average of forward and backward prediction = interpolation (2 vectors/macroblock)
•Half-pel accuracy of motion compensation, bilinear interpolation
•Coding of prediction error with 8x8 DCT, uniform quantization, zig-zag scan as in I-pictures
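The following hedged sketch illustrates the B-picture mode decision described above: for one macroblock it compares forward, backward and interpolated (averaged) prediction and keeps the one with the smallest error. The candidate blocks are random stand-ins; a real encoder would obtain them by motion-compensating the past and future reference pictures.

```python
import numpy as np

rng = np.random.default_rng(4)
cur    = rng.integers(0, 256, (16, 16)).astype(np.float64)   # current macroblock
fwd    = cur + rng.normal(0, 6, (16, 16))                    # forward prediction (from past reference)
bwd    = cur + rng.normal(0, 12, (16, 16))                   # backward prediction (from future reference)
interp = 0.5 * (fwd + bwd)                                   # interpolated prediction

candidates = {'forward': fwd, 'backward': bwd, 'interpolated': interp}
errors = {name: np.sum(np.abs(cur - p)) for name, p in candidates.items()}
mode = min(errors, key=errors.get)
print(mode, {k: round(v) for k, v in errors.items()})        # the mode with the least difference gets coded
```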

MPEG-4
•Supports highly interactive multimedia applications as well as traditional applications
•Advanced functionalities: interactivity, scalability, error resilience, ...
•Coding of natural and synthetic audio and video, as well as graphics
•Enables the multiplexing of audiovisual objects and their composition in a scene
MPEG-4: Scene with audiovisual objects

MPEG Family
◼ MPEG-1: similar to H.263 CIF in quality
◼ MPEG-2: higher quality: DVD, Digital TV, HDTV
◼ MPEG-4/H.264: more modern codec, aimed at lower bitrates; works well for HDTV too
MPEG-1 Compression
◼ MPEG: Motion Pictures Expert Group
◼ Finalized in 1991
◼ Optimized for video resolutions:
‒352x240 pixels at 30 fps (NTSC)
‒352x288 pixels at 25 fps (PAL/SECAM)
◼ Optimized for bit rates around 1-1.5 Mb/s
◼ Syntax allows up to 4095x4095 at 60 fps, but this is not commonly used
◼ Progressive scan only (not interlaced)
MPEG Frame Types
◼ Unlike H.261, each frame must be of one type. H.261 can mix intra- and inter-coded MBs in one frame.
◼ Three types in MPEG:
‒I-frames (like H.261 intra-coded frames)
‒P-frames ("predictive", like H.261 inter-coded frames)
‒B-frames ("bidirectional predictive")
MPEG I-frames
◼ Similar to JPEG, except:
‒Luminance and chrominance share quantization tables.
‒Quantization is adaptive (the table can change) for each macroblock.
◼ Unlike H.261, every n frames a full intra-coded frame is included.
‒Permits skipping: start decoding at the first I-frame following the point you skip to.
‒Permits fast scan: just play I-frames.
‒Permits playing backwards: decode the previous I-frame, decode the frames that depend on it, and play the decoded frames in reverse order.
◼ An I-frame and the successive frames up to the next I-frame (n frames) is known as a group of pictures (GOP).
MPEG P-Frames
◼ Similar to an entire frame of H.261 inter-coded blocks. Half-pixel accuracy in motion vectors (pixels are averaged if needed).
◼ May be coded from the previous I-frame or the previous P-frame.
B-frames
◼ Bidirectional predictive frames.
◼ Each macroblock contains two sets of motion vectors.
◼ Coded from one previous frame, one future frame, or a combination of both:
1. Do the motion vector search separately in the past reference frame and the future reference frame.
2. Compare: the difference from the past frame, the difference from the future frame, and the difference from the average of the past and future frames.
3. Encode the version with the least difference.
B-frame disadvantages
◼ Computational complexity: more motion search, and the need to decide whether or not to average.
◼ Increase in memory bandwidth: an extra picture buffer is needed, and frames must be stored and encoded or played back out of order.
◼ Delay: adds several frames of delay at the encoder waiting for the needed later frame; adds several frames of delay at the decoder holding the decoded I/P frame while decoding and playing the prior B-frames that depend on it.
B-frame advantage
◼ B-frames increase compression. Typically twice as many B-frames are used as I+P frames.
MPEG-2
◼ ISO/IEC standard in 1995
◼ Aimed at higher quality video
◼ Supports interlaced formats
◼ Many features, but has profiles which constrain common subsets of those features:
‒Main profile (MP): 2-15 Mb/s over broadcast channels (e.g. DVB-T) or storage media (e.g. DVD)
‒PAL quality: 4-6 Mb/s; NTSC quality: 3-5 Mb/s
MPEG-3
◼ Doesn't exist. It was aimed at HDTV and ended up being folded into MPEG-2.
MPEG-4
◼ ISO/IEC designation 'ISO/IEC 14496': 1999
◼ MPEG-4 Version 2: 2000
◼ Aimed at low bitrates (10 Kb/s)
◼ Can scale very high (1 Gb/s)
◼ Based around the concept of the composition of basic video objects into a scene

H.26X

H.261 Video Compression Standard

•First major video compression standard
•Targeted at 2-way video conferencing and ISDN networks that supported 40 Kbps to 2 Mbps
•Supported resolutions include CIF and QCIF
•Chrominance resolution subsampling 4:2:0
•Low complexity and low delay to support real-time communications
•Only I and P frames; no B frames
•Full-pixel accuracy motion estimation
•8x8 block-based DCT coding of the residual
•Fixed linear quantization across all AC coefficients of the DCT
•Run-length coding of quantized DCT coefficients followed by Huffman coding for DCT and motion information
•Loop filtering (a simple digital filter applied on block edges) applied to reference frames to reduce blocking artifacts
ISO/IEC MPEG-2 / ITU-T H.262
•Profiles defined for scalable video applications with scalable coding tools to allow multiple-layer video coding, including temporal, spatial, and SNR scalability, and data partitioning.
•MPEG-2 Main Profile supports single-layer coding (non-scalable) and is the one that is widely deployed.
•MPEG-2 non-scalable (single layer) profiles:
‒Simple profile: no B frames, for low-delay applications
‒Main profile: support for B frames; can also decode MPEG-1 video

•MPEG-2 scalable profiles:
‒SNR profile: adds enhancement layers for DCT coefficient refinement
‒Spatial profile: adds support for enhancement layers carrying the coded image at different spatial resolutions (sizes)
‒High profile: adds support for coding a 4:2:2 video signal and includes the scalability tools of the SNR and spatial profiles
ITU-T H.263: Main Features
Enhancements of the H.261 baseline algorithm:
•Half-pixel accuracy motion estimation and compensation.
•MVs differentially coded, with median MV prediction.
•8x8 discrete cosine transform and uniform quantization.
•Variable length coding of DCT coefficients and MVs.
Four optional modes:
•Unrestricted Motion Vector (UMV) mode: increased motion vector range with frame boundary extrapolation.
•Advanced Prediction (AP) mode: 4 MVs per macroblock; Overlapped Block Motion Compensation (OBMC).
•PB frame mode: bi-directional prediction.
•Arithmetic coding mode.
•About 3 to 4 dB PSNR improvement over H.261 at bit-rates less than or equal to 64 Kbit/s.
•30% saving in bit-rate compared to MPEG-1.
Design flexibility (things not specified by the standard):
•The H.263 standard inherently has the capability to adapt to varying input video content.
•Frame level: Intra, Inter or skipped.
•Macroblock (MB) level: Intra, Inter or un-coded; one MV or 4 MVs; quantizer parameter (QP) value. Constant QP gives almost constant quality at a variable bit-rate; varying QP gives variable quality while trying to achieve an almost constant bit-rate.
H.261:
1. Compression standard defined by the ITU-T for the provision of video telephony and videoconferencing services over ISDN.
2. 64 kbps; CIF (videoconferencing) or quarter CIF (QCIF) (video telephony) used.
3. Each frame divided into macroblocks of 16 x 16 pixels.
Only I- and P-frames are used, with three P-frames between each pair of I-frames. The start of each new encoded video frame is indicated by the picture start code.
H.263:
Defined by the ITU-T for use in video applications over wireless networks and the PSTN, e.g. video telephony, video conferencing, security surveillance, interactive game playing, and real-time applications over a modem (therefore 28.8 kbps - 56 kbps). Based on H.261, but H.261 gives poor picture quality below 64 kbps, so H.263 is more advanced. QCIF and sub-QCIF are used, with reduced horizontal resolution. Uses I-, P- and B-frames. Also, neighbouring pairs of P- and B-frames can be encoded as a single entity, a PB-frame, which reduces encoding overheads and increases the frame rate. Other mechanisms used: unrestricted motion vectors, error resilience, error tracking, independent segment decoding, and reference picture selection.
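H.263's median prediction of motion vectors, mentioned in the feature list above, can be sketched as follows; the three candidate predictors (left, above and above-right neighbours) and the numeric values are illustrative assumptions.

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Component-wise median of the three neighbouring motion vectors."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_left[0], mv_above[0], mv_above_right[0]),
            med(mv_left[1], mv_above[1], mv_above_right[1]))

# current macroblock MV and its three coded neighbours (illustrative values, half-pel units)
mv_cur = (5, -2)
pred = median_mv_predictor((4, -1), (6, -3), (1, 0))
mvd = (mv_cur[0] - pred[0], mv_cur[1] - pred[1])   # only the difference (MVD) is entropy coded
print(pred, mvd)                                   # (4, -1) (1, -1)
```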

CH 8: VIDEO SEGMENTATION

Temporal Segmentation: Segmentation is highly dependent on the model and the criteria for grouping pixels into regions. In motion segmentation, pixels are grouped together based on their similarity in motion. For any given application, the segmentation algorithm needs to find a balance between model complexity and analysis stability. An insufficient model will inevitably result in over-segmentation, while complicated models introduce more complexity and require more computation and constraints for stability. In image coding, the objective of segmentation is to exploit the spatial and temporal coherences in the video data by adequately identifying the coherent motion regions with simple motion models. Block-based video coders avoid the segmentation problem altogether by artificially imposing a regular array of blocks and applying motion coherence within these blocks. This model requires very small overhead in coding, but it does not accurately describe an image and does not fully exploit the coherences in the video data. Region-based approaches, which exploit the coherence of object motion by grouping similar motion regions into a single description, have shown improved performance over block-based coders.

In layered representation coding [14, 15], video data is decomposed into a set of overlapping layers. Each layer consists of: an intensity map describing the intensity profile of a coherent motion region over many frames; an alpha map describing its relationship with other layers; and a parametric motion map describing the motion of the region. The layered representation has the potential for achieving greater compression because each layer exploits both the spatial and temporal coherences of the video data. In addition, the representation is similar to those used in computer graphics, so it provides a convenient way to manipulate video data. Our goal in spatiotemporal segmentation is to identify the spatial and temporal coherences in video data and derive the layered representation for the image sequence.

Temporal coherence: Motion estimation provides the necessary information for locating corresponding regions in different frames. The new positions of each region can be predicted given the previously estimated motion for that region. Motion models are estimated within each of these predicted regions and an updated set of motion hypotheses is derived for the image. Alternatively, the motion models estimated from the previous segmentation can be used by the region classifier to directly determine the corresponding coherent motion regions. Thus, segmentation based on motion conveniently provides a way to track coherent motion regions. In addition, when the analysis is initialized with the segmentation results from the previous frame, computation is reduced and the robustness of estimation is increased.
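As a hedged illustration of grouping blocks by motion similarity, the sketch below clusters per-block motion vectors into two coherent motion regions with a simple k-means step (scikit-learn is assumed to be available); real layered-representation systems fit parametric (e.g. affine) motion models and refine the regions iteratively.

```python
import numpy as np
from sklearn.cluster import KMeans

# per-block motion vectors for an 8x8 grid of blocks (toy data):
# the left half of the scene moves right, the right half is static, plus estimation noise
rng = np.random.default_rng(5)
mv = np.zeros((8, 8, 2))
mv[:, :4, 0] = 4.0                                  # horizontal motion of 4 px in the left half
mv += rng.normal(0, 0.3, mv.shape)                  # estimation noise

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mv.reshape(-1, 2))
print(labels.reshape(8, 8))                         # two coherent motion regions emerge
```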

Temporal segmentation adds structure to the video by partitioning the video into chapters. This is a first step for video summarization methods, which should also enable fast browsing and indexing so that a user can quickly discover important activities or objects.

Shot boundary detection, hard cuts and soft cuts: The concept of temporal image sequence (video) segmentation is not a new one, as it dates back to the first days of motion pictures, well before the introduction of computers. Motion picture specialists perceptually segment their works into a hierarchy of partitions. A video (or film) is completely and disjointly segmented into a sequence of scenes, which are subsequently segmented into a sequence of shots. Scenes (also called story units) are a concept that is much older than motion pictures, ultimately originating in the theater. Traditionally, a scene is a continuous sequence that is temporally and spatially cohesive in the real world, but not necessarily cohesive in the projection of the real world on film. On the other hand, shots originate with the invention of motion cameras and are defined as the longest continuous sequence that originates from a single camera take, which is what the camera images in an uninterrupted run.

In general, the automatic segmentation of a video into scenes ranges from very difficult to intractable. On the other hand, video segmentation into shots is both exactly defined and characterized by distinctive features of the video stream itself. This is because video content within a shot tends to be continuous, due to the continuity of both the physical scene and the parameters (motion, zoom, focus) of the camera that images it. Therefore, in principle, the detection of a shot change between two adjacent frames simply requires computing an appropriate continuity or similarity metric. However, this simple concept has three major complications. The first, and most obvious, is defining a continuity metric for the video that is insensitive to gradual changes in camera parameters, lighting and physical scene content, easy to compute, and discriminating enough to be useful. The simplest way to do this is to extract one or more scalar or vector features from each frame and to define distance functions on the feature domain. Alternatively, the features themselves can be used either for clustering the frames into shots or for detecting shot transition patterns. The second complication is deciding which values of the continuity metric correspond to a shot change and which do not. This is not trivial, since the feature variation within certain shots can exceed the respective variation across shots. Decision methods for shot boundary detection include fixed thresholds, adaptive thresholds and statistical detection methods. The third complication, and the most difficult to handle, is the fact that not all shot changes are abrupt. Using motion picture terminology, changes between shots can belong to the following categories:

1. Cut. This is the classic abrupt change case, where one frame belongs to the disappearing shot and the next one to the appearing shot.
2. Dissolve. In this case, the last few frames of the disappearing shot temporally overlap with the first few frames of the appearing shot. During the overlap, the intensity of the disappearing shot decreases from normal to zero (fade out), while that of the appearing shot increases from zero to normal (fade in).
3. Fade. Here, first the disappearing shot fades out into a blank frame, and then the blank frame fades in into the appearing shot.
4. Wipe. This is actually a set of shot change techniques, where the appearing and disappearing shots coexist in different spatial regions of the intermediate video frames, and the region occupied by the former grows until it entirely replaces the latter.
5. Other transition types. There is a multitude of inventive special-effects techniques used in motion pictures. These are in general very rare and difficult to detect.

Shot-boundary detection is the first step towards scene extraction in video, which is useful for video content analysis and indexing. A shot in a video is a sequence of frames taken continuously by one camera. A common approach to detecting shot boundaries consists of computing the similarity between pairs of consecutive frames and marking the occurrence of a boundary wherever the similarity is lower than some threshold. The similarity is measured either globally, for example with a histogram, or locally within rectangular blocks. Previously, luminance/color, edges, texture and SIFT features have been used to represent individual frames.
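The following is a minimal sketch of the histogram-difference approach just described, assuming frames are available as grayscale NumPy arrays; the fixed threshold is illustrative, and real systems often use adaptive thresholds as noted earlier.

```python
import numpy as np

def hist_similarity(f1, f2, bins=32):
    """Histogram-intersection similarity between two grayscale frames (1.0 = identical distribution)."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 255), density=True)
    h2, _ = np.histogram(f2, bins=bins, range=(0, 255), density=True)
    return np.minimum(h1, h2).sum() / h1.sum()

def detect_cuts(frames, threshold=0.6):
    """Mark a hard cut between frame i and i+1 when the similarity drops below the threshold."""
    return [i for i in range(len(frames) - 1)
            if hist_similarity(frames[i], frames[i + 1]) < threshold]

# toy sequence: 5 dark frames followed by 5 bright frames -> one cut between indices 4 and 5
rng = np.random.default_rng(6)
dark   = [rng.integers(0, 80,   (48, 64)).astype(np.uint8) for _ in range(5)]
bright = [rng.integers(170, 255, (48, 64)).astype(np.uint8) for _ in range(5)]
print(detect_cuts(dark + bright))   # [4]
```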