The Structure of the MPEG-4 Video Coding Algorithm

The MPEG Video Group establishes so-called Test Models or Verification Models to develop image and video coding techniques in a collaborative effort. The MPEG-4 Video Verification Model describes a fully defined core video coding algorithm platform (encoder, decoder as well as bitstream syntax and semantics) for the development of the standard. As such the structure of the MPEG-4 Video Verification Model already gives some indication about the tools and algorithms that will be provided by the final MPEG-4 standard. The purpose of the text below is to outline the basic elements and structure of the MPEG-4 Video Verification Model under development.

In the January 1996 MPEG Video Group meeting in Munich, Germany, the first version of the official MPEG-4 Video Verification Model was defined. The VM has since then, by means of the Core Experiment process, iteratively progressed in each subsequent meeting and has been optimized with respect to coding efficiency and the provisions for new content-based functionalities and error robustness. As of January 1997 the MPEG-4 Video Verification Model supports the main features summarized below:

  • Standard Y:U:V luminance and chrominance intensity representation of regularly sampled pixels in 4:2:0 format. The intensity of each Y, U or V pixel is quantized into 8 bits. The image size and shape depend on the application.

  • Coding of multiple "Video Object Planes" (VOP's) as images of arbitrary shape to support many of the content-based functionalities. Thus the image sequence input for the MPEG-4 Video VM is in general considered to be of arbitrary shape - and the shape and location of a VOP within a reference window may vary over time. The coding of standard rectangular image input sequences is supported as a special case of the more general Video Object Plane approach.

  • Coding of shape and transparency information of each VOP by coding binary or gray scale alpha plane image sequences using a particularly optimized Modified Modified Reed Code method (MMMR or M4R method).

  • Support of Intra (I) coded VOP's as well as temporally predicted (P) and bi-directionally predicted (B) VOP's. Standard MPEG and H.263 I, P and B frames are supported as a special case.

  • Support of fixed and variable frame rates of the input VOP sequences of arbitrary or rectangular shape. The frame rate depends on the application.

  • 8x8 pel block-based and 16x16 pel Macroblock-based motion estimation and compensation of the pixel values within VOP's, including provisions for block-overlapping motion compensation.

  • Texture coding in I-, P- and B-VOP's using an 8x8 Discrete Cosine Transform or alternatively a Shape-Adaptive DCT (SADCT) adapted to regions of arbitrary shape, followed by MPEG-1/2 or H.261/3 like quantization and run-length coding.

  • Efficient prediction of DC- and AC-coefficients of the DCT in Intra coded VOP's.

  • Support for efficient static as well as dynamic SPRITE prediction of global motion from a VOP panoramic memory using 8 global motion parameters.

  • Temporal and spatial scalability for arbitrarily shaped VOP's.

  • Adaptive Macroblock slices as well as improved bit stuffing and motion markers for resynchronization in error prone environments.

  • Near backward compatibility with standard H.261/3 or MPEG-1 coding algorithms if the input image sequences are coded in a single layer using a single rectangular VOP structure.
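As a small worked example of the 4:2:0 format named in the first item above, the following sketch counts the samples per plane for one frame. The function name and the 352x288 test size are illustrative assumptions, not part of the Verification Model.

```python
def plane_sizes_420(width, height, bits_per_sample=8):
    """Sample counts of the Y, U and V planes in 4:2:0, plus bytes per frame.

    In 4:2:0 each chrominance plane is subsampled by two in both directions,
    so U and V each carry a quarter of the luminance samples.
    """
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even width and height")
    luma = width * height
    chroma = (width // 2) * (height // 2)
    total_bits = (luma + 2 * chroma) * bits_per_sample
    return {"Y": luma, "U": chroma, "V": chroma, "bytes": total_bits // 8}


# Hypothetical 352x288 rectangular input at 8 bits per sample.
sizes = plane_sizes_420(352, 288)
```

At 8 bits per sample the uncompressed frame therefore needs 1.5 bytes per luminance pel.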

1. Provisions for Content Based Functionalities - Decomposition into "Video Object Planes"

The MPEG-4 Video coding algorithm will eventually support all functionalities already provided by MPEG-1 and MPEG-2, including the provision to efficiently compress standard rectangular sized image sequences at varying levels of input formats, frame rates and bit rates as well as provisions for interlaced input sources.

Furthermore, at the heart of the so-called "content"-based MPEG-4 Video functionalities is the support for the separate encoding and decoding of content (i.e. physical objects in a scene). Within the context of MPEG-4 this functionality - the ability to identify and selectively decode and reconstruct video content of interest - is referred to as "Content-Based Scalability". This MPEG-4 feature provides the most elementary mechanism for interactivity with, and manipulation of, image or video content in the compressed domain without the need for further segmentation or transcoding at the receiver.

To enable the content based interactive functionalities envisioned, the MPEG-4 Video Verification Model introduces the concept of Video Object Planes (VOP's). It is assumed that each frame of an input video sequence is segmented into a number of arbitrarily shaped image regions (video object planes) - each of the regions may possibly cover particular image or video content of interest, i.e. describing physical objects or content within scenes. In contrast to the video source format used for the MPEG-1 and MPEG-2 standards, the video input to be coded by the MPEG-4 Verification Model is thus no longer considered a rectangular region. This concept is illustrated in the Figure below.


The coding of image sequences using MPEG-4 Video Object Planes (VOP's) enables basic content-based functionalities at the decoder. Each VOP specifies particular image sequence content and is coded into a separate VOL-layer (by coding contour, motion and texture information). Decoding of all VOP-layers reconstructs the original image sequence. Content can be reconstructed by separately decoding a single or a set of VOL-layers (content-based scalability/access in the compressed domain). This allows content-based manipulation at the decoder without the need for transcoding.

The input to be coded can be a VOP image region of arbitrary shape and the shape and location of the region can vary from frame to frame. Successive VOP's belonging to the same physical object in a scene are referred to as Video Objects (VO's) - a sequence of VOP's of possibly arbitrary shape and position. The shape, motion and texture information of the VOP's belonging to the same VO is encoded and transmitted or coded into a separate VOL (Video Object Layer). In addition, relevant information needed to identify each of the VOL's - and how the various VOL's are composed at the receiver to reconstruct the entire original sequence - is also included in the bitstream. This allows the separate decoding of each VOP and the required flexible manipulation of the video sequence. Notice that the video source input assumed for the VOL structure either already exists in terms of separate entities (e.g. is generated with chroma-key technology) or is generated by means of on-line or off-line segmentation algorithms.
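The composition step at the receiver can be pictured with a minimal sketch. It assumes already decoded VOP's, each held as a dictionary with a binary alpha plane, a texture plane and a shift relative to the reference window; this layout and the name `compose` are hypothetical and are not the bitstream syntax of the Verification Model.

```python
def compose(width, height, vols, selected, background=0):
    """Paste the selected decoded VOP's onto a reference window.

    vols maps a VOL name to a dict with keys "shift" (x, y), "alpha" and
    "texture". Only pels with nonzero alpha overwrite the canvas, so later
    entries in `selected` appear in front of earlier ones.
    """
    canvas = [[background] * width for _ in range(height)]
    for name in selected:
        vop = vols[name]
        ox, oy = vop["shift"]
        for j, row in enumerate(vop["alpha"]):
            for i, a in enumerate(row):
                if a:
                    canvas[oy + j][ox + i] = vop["texture"][j][i]
    return canvas


vols = {
    "bg": {"shift": (0, 0),
           "alpha": [[1] * 4 for _ in range(4)],
           "texture": [[1] * 4 for _ in range(4)]},
    "obj": {"shift": (1, 1),
            "alpha": [[1, 0], [1, 1]],
            "texture": [[9, 9], [9, 9]]},
}
full_scene = compose(4, 4, vols, ["bg", "obj"])   # all layers decoded
object_only = compose(4, 4, vols, ["obj"])        # content-based access
```

Decoding only `"obj"` reconstructs the object without touching the background layer, which is the content-based access described above.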

Notice that, if the original input image sequences are not decomposed into several VOL's of arbitrary shape, the coding structure simply degenerates into a single layer representation which supports conventional image sequences of rectangular shape. The MPEG-4 content-based approach can thus be seen as a logical extension of the conventional MPEG-1 and MPEG-2 coding approach towards image input sequences of arbitrary shape.

2. Coding of Shape, Motion and Texture Information for each VOP

The shape, motion and texture information for each VO is coded into a separate VOL-layer in order to support separate decoding of VO's. The MPEG-4 Video VM uses an identical algorithm to code the shape, motion and texture information in each of the layers. The shape information is, however, not transmitted if the input image sequence to be coded contains only standard images of rectangular size. In this case the MPEG-4 Video coding algorithm has a structure similar to the successful MPEG-1/2 or H.261 coding algorithms - suitable for applications which require high coding efficiency without the need for extended content based functionalities.

The MPEG-4 VM compression algorithm employed for coding each VOP image sequence (rectangular size or not) is based on the successful block-based hybrid DPCM/Transform coding technique already employed in the MPEG coding standards. The MPEG-4 coding algorithm encodes the first VOP in Intra-Frame VOP coding mode (I-VOP). Each subsequent frame is coded using Inter-frame VOP prediction (P-VOP's) - only data from the nearest previously coded VOP frame is used for prediction. In addition the coding of bi-directionally predicted VOP's (B-VOP's) is also supported.

Similar to the MPEG baseline coders, the MPEG-4 Verification Model algorithm processes the successive images of a VOP sequence block by block. Taking the example of arbitrarily shaped VOP's, after coding the VOP shape information, each color input VOP image in a VOP sequence is partitioned into non-overlapping "Macroblocks" as depicted in the following Figure.




A.) Illustration of an I-picture VOP (I-VOP) and P-picture VOP's (P-VOP's) in a video sequence. P-VOP's are coded using motion compensated prediction based on the nearest previous VOP frame. Each frame is divided into disjoint "Macroblocks" (MB).

B.) With each Macroblock (MB), information related to four luminance blocks (Y1, Y2, Y3, Y4) and two chrominance blocks (U, V) is coded. Each block contains 8x8 pels.

Each Macroblock contains blocks of data from both luminance and co-sited chrominance bands - four luminance blocks (Y1, Y2, Y3, Y4) and two chrominance blocks (U, V), each with size 8x8 pels. The basic diagram of the MPEG-4 VM hybrid DPCM/Transform encoder and decoder structure for processing single Y, U or V blocks and Macroblocks is depicted in the Figure below. The previously coded VOP frame N-1 is stored in a VOP frame store in both encoder and decoder. Motion compensation is performed on a block or Macroblock basis - only one motion vector is estimated between VOP frame N and VOP frame N-1 for a particular block or Macroblock to be encoded. The motion compensated prediction error is calculated by subtracting from each pel in a block or Macroblock belonging to the VOP frame N its motion shifted counterpart in the previous VOP frame N-1. An 8x8 Discrete Cosine Transform (DCT) is then applied to each of the 8x8 blocks contained in the block or Macroblock, followed by quantization (Q) of the DCT coefficients with subsequent run-length coding and entropy coding (VLC). A video buffer is needed to ensure that a constant target bit rate output is produced by the encoder. The quantization stepsize for the DCT-coefficients can be adjusted for each Macroblock in a VOP frame to achieve a given target bit rate and to avoid buffer overflow and underflow.
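The prediction, transform and quantization loop for a single 8x8 block can be sketched as follows. This is a simplified illustration only: it assumes a plain uniform quantizer with one stepsize `step`, takes the motion compensated prediction `pred` as given, and omits run-length coding, VLC and rate control. All function names are hypothetical.

```python
import math

N = 8  # block size of the DCT


def _scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# Orthonormal DCT-II basis: BASIS[k][n] = s(k) * cos((2n + 1) k pi / 2N)
BASIS = [[_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N)] for k in range(N)]


def dct_2d(block):
    """Forward 2-D DCT of an 8x8 block, computed directly from the definition."""
    return [[sum(BASIS[u][x] * BASIS[v][y] * block[x][y]
                 for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coef):
    """Inverse 2-D DCT matching dct_2d."""
    return [[sum(BASIS[u][x] * BASIS[v][y] * coef[u][v]
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def encode_block(cur, pred, step):
    """Prediction error -> DCT -> uniform quantization (integer levels)."""
    residual = [[cur[x][y] - pred[x][y] for y in range(N)] for x in range(N)]
    return [[round(c / step) for c in row] for row in dct_2d(residual)]


def decode_block(levels, pred, step):
    """Dequantize, inverse transform, add the prediction, clip to 8 bits."""
    residual = idct_2d([[lv * step for lv in row] for row in levels])
    return [[min(255, max(0, round(pred[x][y] + residual[x][y])))
             for y in range(N)] for x in range(N)]


# Synthetic test data: `pred` stands in for the motion shifted block of frame N-1.
cur = [[(16 * x + 5 * y) % 256 for y in range(N)] for x in range(N)]
pred = [[min(255, cur[x][y] + ((x + y) % 3)) for y in range(N)] for x in range(N)]
step = 2
levels = encode_block(cur, pred, step)
recon = decode_block(levels, pred, step)
max_err = max(abs(recon[x][y] - cur[x][y]) for x in range(N) for y in range(N))
```

A larger `step` yields more zero levels and fewer bits at the cost of a larger reconstruction error, which is the lever a rate control would adjust per Macroblock.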


Block diagram of the basic MPEG-4 VM hybrid DPCM/Transform encoder and decoder structure.

The decoder uses the reverse process to reproduce a Macroblock of VOP frame N at the receiver. After decoding the variable length words contained in the video decoder buffer the pixel values of the prediction error are reconstructed. The motion compensated pixels from the previous VOP frame N-1 contained in the VOP frame store are added to the prediction error to recover the particular Macroblock of frame N.

In general, the input images to be coded in each VOP layer are of arbitrary shape and the shape and location of the images vary over time with respect to a reference window. For coding shape, motion and texture information in arbitrarily shaped VOP's, the MPEG-4 Video Verification Model introduces the concept of a "VOP image window" together with a "shape-adaptive" Macroblock grid. All VOL layers to be coded for a given input video sequence are defined with reference to the reference window of constant size. An example of a VOP image window within a reference window and an example of a Macroblock grid for a particular VOP image are depicted below:


Example of an MPEG-4 VM Macroblock grid for a foreground VOP image. This Macroblock grid is used for alpha plane coding, motion estimation and compensation as well as for block-based DCT texture coding. A VOP window with a size of multiples of 16 pels in each image direction surrounds the foreground VOP of arbitrary shape and specifies the location of the Macroblocks, each of size 16x16 pels. This window is adjusted to coincide with the topmost and leftmost border of the VOP. A shift parameter is coded to indicate the location of the VOP window with respect to the borders of a reference window (original image borders).
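The construction of the VOP window described in this caption can be sketched as below, assuming the shape is given as a binary alpha plane (a list of rows, nonzero meaning inside the VOP). The function name and return layout are illustrative assumptions.

```python
MB = 16  # Macroblock size in pels


def vop_window(alpha):
    """Return (shift_x, shift_y, width, height) of the VOP window.

    The window starts at the topmost and leftmost opaque pel and its size is
    rounded up to a multiple of 16 pels in each direction.
    """
    rows = [y for y, row in enumerate(alpha) if any(row)]
    cols = [x for x in range(len(alpha[0])) if any(row[x] for row in alpha)]
    if not rows:
        raise ValueError("alpha plane contains no opaque pels")
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    width = -(-(right - left + 1) // MB) * MB    # ceiling to a multiple of 16
    height = -(-(bottom - top + 1) // MB) * MB
    return left, top, width, height


# Hypothetical 64x48 reference window with an object in rows 5..30, columns 10..40.
alpha = [[1 if 5 <= y <= 30 and 10 <= x <= 40 else 0 for x in range(64)]
         for y in range(48)]
window = vop_window(alpha)
```

In this sketch the window may extend past the right or bottom border of the reference window when the object lies close to that border; how such cases are handled is not covered here.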

The shape information of a VOP is coded prior to coding motion vectors based on the VOP image window Macroblock grid and is available to both encoder and decoder. In subsequent processing steps only the motion and texture information for the Macroblocks belonging to the VOP image are coded (which includes the standard Macroblocks as well as the contour Macroblocks indicated in the figure above).
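Since the shape is known to both encoder and decoder, each Macroblock of the grid can be classified before any motion or texture data is coded. A minimal sketch, assuming a binary alpha plane and the labels "standard", "contour" and "outside" (the last label is an assumption for Macroblocks that contain no VOP pels):

```python
MB = 16  # Macroblock size in pels


def classify_macroblocks(alpha, shift_x, shift_y, width, height):
    """Label every Macroblock of the VOP window by its opaque pel count."""
    def inside(x, y):
        return (0 <= y < len(alpha) and 0 <= x < len(alpha[0])
                and alpha[y][x] != 0)

    grid = []
    for my in range(height // MB):
        row = []
        for mx in range(width // MB):
            count = sum(inside(shift_x + mx * MB + i, shift_y + my * MB + j)
                        for j in range(MB) for i in range(MB))
            if count == 0:
                row.append("outside")    # nothing to code for this Macroblock
            elif count == MB * MB:
                row.append("standard")   # entirely inside the VOP
            else:
                row.append("contour")    # straddles the VOP border
        grid.append(row)
    return grid


# Hypothetical L-shaped object inside a 48x32 VOP window at shift (0, 0).
alpha = [[1 if (x < 16) or (y >= 16 and x < 40) else 0 for x in range(48)]
         for y in range(32)]
grid = classify_macroblocks(alpha, 0, 0, 48, 32)
```

Only the "standard" and "contour" Macroblocks carry motion and texture information in the subsequent steps.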

Shape Coding - Essentially two coding methods are supported by the MPEG-4 Video Verification Model for binary and gray scale shape information. The shape information is referred to as "alpha planes" in the context of the MPEG-4 VM. The techniques to be adopted for the standard will provide lossless coding of alpha-planes as well as the provision for lossy coding of shapes and transparency information, allowing the trade off between bit rate and the accuracy of shape representation. Furthermore it is foreseen to support both Intra shape coding as well as Inter shape coding functionalities employing motion compensated shape prediction - to allow both efficient random access operations as well as an efficient compression of shape and transparency information for diverse applications.

Motion Estimation and Compensation - The MPEG-4 VM employs block-based motion estimation and compensation techniques to efficiently exploit temporal redundancies of the video content in the separate VOP layers. In general, the motion estimation and compensation techniques used can be seen as an extension of the standard MPEG block matching techniques towards image sequences of arbitrary shape. However, a wealth of different motion prediction methods is also being investigated in the Core Experiment process.

To perform block based motion estimation and compensation between VOP's of varying location, size and shape, the shape-adaptive Macroblock (MB) grid approach for each VOP image is employed. A block-matching procedure is used for standard Macroblocks. The prediction error is coded together with the Macroblock motion vectors used for prediction. An advanced motion compensation mode is defined which supports block-overlapping motion compensation as with the ITU H.263 standard as well as the coding of motion vectors for 8x8 blocks. The definition of the motion estimation and compensation techniques is, however, modified at the borders of a VOP. An image padding technique is used for the reference VOP frame N-1, which is available to both encoder and decoder, to perform motion estimation and compensation. The VOP padding method can be seen as an extrapolation of pels outside of the VOP based on pels inside of the VOP. After padding the reference VOP in frame N-1 (as shown in the Figure below), a "polygon" matching technique is employed for motion estimation and compensation. A polygon defines the part of the contour Macroblock (or the 8x8 block for advanced motion compensation, respectively) which belongs to the active area inside of the VOP frame N to be coded and excludes the pels outside of this area. Thus, the pels not belonging to the active area in the VOP to be coded are essentially excluded from the motion estimation process.
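The idea of polygon matching can be illustrated with a masked block-matching search. The sketch assumes a full search over integer displacements, the sum of absolute differences (SAD) as matching criterion and an already padded reference frame; these choices and the function names are assumptions made for illustration, not the search strategy prescribed by the Verification Model.

```python
def masked_sad(cur, ref, alpha, bx, by, dx, dy, size=16):
    """SAD between a block of frame N and a shifted block of frame N-1,
    counting only pels that belong to the active VOP area (alpha != 0)."""
    total = 0
    for j in range(size):
        for i in range(size):
            if alpha[by + j][bx + i]:
                total += abs(cur[by + j][bx + i]
                             - ref[by + j + dy][bx + i + dx])
    return total


def polygon_match(cur, ref, alpha, bx, by, search=3, size=16):
    """Full search over [-search, search]; returns (dx, dy, cost)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= by + dy and by + dy + size <= h
                    and 0 <= bx + dx and bx + dx + size <= w):
                continue  # candidate block would leave the reference frame
            cost = masked_sad(cur, ref, alpha, bx, by, dx, dy, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def pattern(x, y):
    return (7 * x + 13 * y) % 251


# Synthetic data: frame N equals frame N-1 displaced by (2, 1) inside the VOP;
# pels outside the VOP are set to 0 and must not influence the match.
alpha = [[1 if x >= y else 0 for x in range(32)] for y in range(32)]
ref = [[pattern(x, y) for x in range(32)] for y in range(32)]
cur = [[pattern(x + 2, y + 1) if alpha[y][x] else 0 for x in range(32)]
       for y in range(32)]
mv = polygon_match(cur, ref, alpha, bx=8, by=8)
```

Because the pels outside the polygon are skipped, the arbitrary values stored there in frame N do not disturb the estimated vector.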


An image padding technique is employed for the purpose of contour block motion estimation and compensation. The aim of the padding procedure is to allow separate decoding and reconstruction of VOP's by extrapolating texture inside the VOP to regions outside the VOP. This allows block-based DCT coding of texture across a VOP border in INTRA VOP's as well. Furthermore the block based motion vector range for search and motion compensation in a VOP in frame N can be specified covering regions outside the VOP in frame N-1.
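One simple way to extrapolate texture from inside the VOP to the outside is to repeat the nearest opaque pel, first along each row and then along the columns. The sketch below shows this simplified scheme; it is an assumption chosen for illustration and not the exact padding procedure defined in the Verification Model.

```python
def pad_vop(texture, alpha):
    """Fill transparent pels by replicating nearby texture from inside the VOP.

    Pass 1: within every row that has opaque pels, each transparent pel takes
            the value of the nearest opaque pel in that row.
    Pass 2: rows without any opaque pel copy the nearest row filled in pass 1.
    """
    h, w = len(texture), len(texture[0])
    out = [row[:] for row in texture]
    filled_rows = []
    for y in range(h):
        opaque = [x for x in range(w) if alpha[y][x]]
        if not opaque:
            continue
        for x in range(w):
            if not alpha[y][x]:
                nearest = min(opaque, key=lambda ox: abs(ox - x))
                out[y][x] = texture[y][nearest]
        filled_rows.append(y)
    if not filled_rows:
        raise ValueError("alpha plane contains no opaque pels")
    for y in range(h):
        if y not in filled_rows:
            src = min(filled_rows, key=lambda fy: abs(fy - y))
            out[y] = out[src][:]
    return out


texture = [[0, 0, 0, 0],
           [0, 50, 70, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
alpha = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
padded = pad_vop(texture, alpha)
```

After padding, every pel of the block carries texture that continues the object smoothly, so a block-based search or DCT no longer sees an artificial edge at the VOP border.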

The MPEG-4 Video Verification Model supports the coding of both forward predicted (P) as well as bi-directionally (B) predicted VOP's (P-VOP and B-VOP). Motion vectors are predictively coded using standard MPEG-1/2 and H.263 VLC code tables including the provision for extended vector ranges. Notice that the coding of standard MPEG I-frames, P-frames and B-frames is still supported by the Verification Model - for the special case of image input sequences (VOP's) of rectangular shape (standard MPEG or H.261/3 definition of frames).

Texture Coding - The Intra VOP's as well as the residual errors after motion compensated prediction are coded using a DCT on 8x8 blocks similar to the MPEG and H.263 standards. Again, the adaptive VOP window Macroblock grid is employed for this purpose. For each Macroblock a maximum of four 8x8 Luminance blocks and two 8x8 Chrominance blocks are coded. Particular adaptation is required for the 8x8 blocks straddling the VOP borders. The image padding technique in the figure above is used to fill the Macroblock content outside of a VOP prior to applying the DCT in Intra-VOP's. For the coding of the motion compensated prediction error in P- or B-VOP's, the content of the pels outside of the active VOP area is set to 128. Alternatively a low complexity shape-adaptive DCT (SADCT) technique can be used to only encode the pixels belonging to the VOP - this results in higher quality at the same bit rate at a slightly increased implementation complexity. Scanning of the DCT coefficients followed by quantization and run-length coding of the coefficients is performed using techniques and VLC tables defined with the MPEG-1/2 and H.263 standards, including the provision for quantization matrices. An efficient prediction of the DC- and AC-coefficients of the DCT is performed for Intra coded VOP's.
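The scanning and run-length step can be sketched with the familiar zigzag scan and (run, level) pairs. This is a simplified illustration: the actual VLC tables, the special handling of Intra DC coefficients and the H.263 style "last" flag are left out, and the `"EOB"` marker is a stand-in chosen for this sketch.

```python
N = 8  # block size


def zigzag_order(n=N):
    """(row, col) positions in zigzag order, from low to high frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))


def run_length(levels):
    """Convert quantized levels into (zero_run, level) pairs plus an end marker."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(levels)):
        v = levels[r][c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs


# Hypothetical quantized block with three nonzero low-frequency levels.
levels = [[0] * N for _ in range(N)]
levels[0][0], levels[0][1], levels[2][0] = 12, 3, -1
pairs = run_length(levels)
```

Because quantization leaves most high-frequency levels at zero, the zigzag scan groups them at the end of the sequence where a single end marker replaces them.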

Multiplexing of Shape, Motion and Texture Information - Basically all "tools" (DCT, motion estimation and compensation, etc.) defined in the H.263 and MPEG-1 standards (and most of the ones defined for MPEG-2 Main Profile) are currently supported by the MPEG-4 Video Verification Model. The compressed alpha plane, motion vector and DCT bit words are multiplexed into a VOL layer bitstream by coding the shape information first, followed by motion and texture coding based on the H.263 and MPEG definitions.

The Verification Model defines two separate modes for multiplexing texture and motion information: A joint motion vector and DCT-coefficient coding procedure based on standard H.263-like Macroblock Type definitions is supported to achieve a high compression efficiency at very low bit rates. This guarantees that the performance of the VM at very low bit rates is at least identical to the H.263 standard. Alternatively, the separate coding of motion vectors and DCT-coefficients is also possible - to eventually incorporate new and more efficient motion or texture coding techniques separately into the Verification Model.

3. Coding Efficiency

Besides the provision for new content-based functionalities and error resilience and robustness, the coding of video with very high coding efficiency over a range of bit rates continues to be supported for the MPEG-4 standard. As indicated above, the MPEG-4 Video Verification Model allows the single object-layer (single VOP) coding approach as a special case. In this coding mode the single VOP input image sequence format may be rectangular (thus not segmented into several VOP's), and the MPEG-4 Video Verification Model coding algorithm can be made almost compatible with the ITU-H.263 or ISO-MPEG-1 standards. Most of the coding techniques used by the MPEG-2 standard at Main Profile are also supported. A number of motion compensation and texture coding techniques are being investigated in the Core Experiment process to further improve coding efficiency for a range of bit rates, including bit rates below 64 kbits/s.

4. Spatial and Temporal Scalability

An important goal of scaleable coding of video is to flexibly support receivers with different bandwidth or display capabilities or display requests to allow video database browsing and multiresolution playback of video content in multimedia environments. Another important purpose of scaleable coding is to provide a layered video bit stream which is amenable to prioritized transmission. The techniques adopted for the MPEG-4 Video Verification Model allow the "content-based" access or transmission of arbitrarily-shaped VOP's at various temporal or spatial resolutions - in contrast to the frame-based scalability approaches introduced for MPEG-2. Receivers that are either not capable of or not willing to reconstruct the full resolution arbitrarily shaped VOP's can decode subsets of the layered bit stream to display the arbitrarily shaped VOP's content/objects at lower spatial or temporal resolution or with lower quality.

Spatial Scalability - The figure below depicts the MPEG-4 general philosophy of a content-based VOP multiscale video coding scheme. Here three layers are provided, each layer supporting a VOP at different spatial resolution scales, i.e. a multiresolution representation can be achieved by downscaling the input video signal into a lower resolution video (downsampling spatially in our example). The downscaled version is encoded into a base layer bit stream with reduced bit rate. The upscaled reconstructed base layer video (upsampled spatially in our example) is used as a prediction for the coding of the original input video signal. The prediction error is encoded into an enhancement layer bit stream. If a receiver is either not capable of or not willing to display the full quality VOP's, downscaled VOP signals can be reconstructed by only decoding the lower layer bit streams. It is important to notice, however, that the display of the VOP at highest resolution with reduced quality is also possible by only decoding the lower bit rate base layer(s). Thus scaleable coding can be used to encode content-based video with a suitable bit rate allocated to each layer in order to meet specific bandwidth requirements of transmission channels or storage media. Browsing through video data bases and transmission of video over heterogeneous networks are applications expected to benefit from this functionality.
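The layering principle can be shown with two layers on a rectangular image. The sketch assumes 2x2 averaging for downscaling, pel replication for upscaling and a coarse quantizer standing in for the base layer codec; the enhancement layer stores the prediction error without loss. All of these choices and names are illustrative assumptions.

```python
def downsample2(img):
    """Halve the resolution by averaging 2x2 neighbourhoods (with rounding)."""
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1] + 2) // 4
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]


def upsample2(img):
    """Double the resolution by replicating every pel into a 2x2 square."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def encode_two_layers(frame, step=8):
    """Return (reconstructed base layer, enhancement layer prediction error)."""
    base = downsample2(frame)
    # Stand-in for lossy base layer coding: coarse uniform quantization.
    base_rec = [[(v // step) * step + step // 2 for v in row] for row in base]
    prediction = upsample2(base_rec)
    enhancement = [[frame[y][x] - prediction[y][x]
                    for x in range(len(frame[0]))]
                   for y in range(len(frame))]
    return base_rec, enhancement


def decode_full(base_rec, enhancement):
    """Upscaled base layer plus enhancement layer gives the full resolution."""
    prediction = upsample2(base_rec)
    return [[prediction[y][x] + enhancement[y][x]
             for x in range(len(enhancement[0]))]
            for y in range(len(enhancement))]


frame = [[(9 * x + 5 * y) % 256 for x in range(8)] for y in range(8)]
base_rec, enhancement = encode_two_layers(frame)
low_res = base_rec                          # receiver decoding the base layer only
full_res = decode_full(base_rec, enhancement)
```

A receiver that stops after the base layer obtains `low_res`, or `upsample2(low_res)` if it prefers full size at reduced quality, while adding the enhancement layer restores the input.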


Spatial scalability approach for arbitrarily shaped VOP's.

Temporal Scalability - This technique was developed with an aim similar to spatial scalability. Different frame rates can be supported with a layered bit stream. Layering is achieved by providing a temporal prediction for the enhancement layer based on coded video from the lower layers. Using the MPEG-4 "content-based" VOP temporal scalability approach it is possible to provide different display rates for different VOL's within the same video sequence (i.e. a foreground person of interest may be displayed with a higher frame rate compared to the remaining background or other objects).
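The effect of object-wise temporal scalability on the display can be sketched as follows. The example assumes a base layer holding every second frame and one enhancement layer holding the remaining frames, and a receiver that repeats the last decoded VOP of a VOL until a newer one is available; the layer split and the names are assumptions for illustration.

```python
def split_layers(num_frames, base_period=2):
    """Assign frame indices to the base and the enhancement layer."""
    base = [t for t in range(num_frames) if t % base_period == 0]
    enhancement = [t for t in range(num_frames) if t % base_period != 0]
    return base, enhancement


def display_schedule(num_frames, decode_enhancement):
    """For each VOL, list which decoded VOP is shown at every display instant.

    decode_enhancement maps a VOL name to True if its enhancement layer is
    decoded in addition to its base layer.
    """
    base, enhancement = split_layers(num_frames)
    schedule = {}
    for name, use_enhancement in decode_enhancement.items():
        decoded = sorted(base + enhancement) if use_enhancement else base
        # Hold the most recent decoded VOP until the next one arrives.
        schedule[name] = [max(d for d in decoded if d <= t)
                          for t in range(num_frames)]
    return schedule


schedule = display_schedule(6, {"foreground": True, "background": False})
```

Here the foreground VOL is refreshed at every display instant while the background VOL is refreshed at half that rate.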

5. Error Resilience - Error Robustness

A considerable effort has been made to investigate the robust storage and transmission of MPEG-4 video in error prone environments. To this end an adaptive Macroblock Slice technique similar to the one already provided with the MPEG-1 and MPEG-2 standards has been introduced into the MPEG-4 Video Verification Model. The technique provides resynchronization bit words for groups of Macroblocks and has been optimized in particular to achieve efficient robustness for low bit rate video under a variety of severe error conditions, e.g. for the transmission over mobile channels. In addition, improved bit stuffing as well as motion markers have been defined.

2002 - 2003 Amigo Software, All Right Reserved