The Structure of the MPEG-4 Video Coding Algorithm
The MPEG Video Group establishes so-called Test Models or Verification
Models to develop image and video coding techniques in a collaborative
effort. The MPEG-4 Video Verification Model describes a fully
defined core video coding algorithm platform (encoder, decoder
as well as bitstream syntax and semantics) for the development
of the standard. As such the structure of the MPEG-4 Video Verification
Model already gives some indication about the tools and algorithms
that will be provided by the final MPEG-4 standard. The purpose
of the text below is to outline the basic elements and structure
of the MPEG-4 Video Verification Model under development.
At the January 1996 MPEG Video Group meeting in Munich, Germany,
the first version of the official MPEG-4 Video Verification Model
was defined. The VM has since then, by means of the Core Experiment
process, iteratively progressed in each subsequent meeting and
has been optimized with respect to coding efficiency and the provisions
for new content-based functionalities and error robustness. As
of January 1997 the MPEG-4 Video Verification Model supports the
main features summarized below:
-
Standard Y:U:V luminance and chrominance intensity representation
of regularly sampled pixels in 4:2:0 format. The intensity of
each Y, U or V pixel is quantized into 8 bits. The image size and
shape depend on the application.
-
Coding of multiple "Video Object Planes" (VOP's) as
images of arbitrary shape to support many of the content-based
functionalities. Thus the image sequence input for the MPEG-4
Video VM is in general considered to be of arbitrary shape - and
the shape and location of a VOP within a reference window may
vary over time. The coding of standard rectangular image input
sequences is supported as a special case of the more general Video
Object Plane approach.
-
Coding of shape and transparency information of each VOP by coding
binary or gray scale alpha plane image sequences using a particularly
optimized Modified Modified Reed Code method (MMMR or M4R
method).
-
Support of Intra (I) coded VOP's as well as temporally predicted
(P) and bi-directionally predicted (B) VOP's. Standard MPEG and
H.263 I, P and B frames are supported as a special case.
-
Support of fixed and variable frame rates of the input VOP sequences
of arbitrary or rectangular shape. The frame rate depends on the
application.
-
8x8 pel block-based and 16x16 pel Macroblock-based
motion estimation and compensation of the pixel values within
VOP's, including provisions for block-overlapping motion compensation.
-
Texture coding in I, P and B-VOP's using an 8x8 Discrete Cosine
Transform or alternatively a Shape-Adaptive DCT (SADCT) adapted
to regions of arbitrary shape, followed by MPEG-1/2 or H.261/3-like
quantization and run-length coding.
-
Efficient prediction of DC- and AC-coefficients of the DCT in
Intra coded VOP's.
-
Support for efficient static as well as dynamic SPRITE prediction
of global motion from a VOP panoramic memory using 8 global motion
parameters.
-
Temporal and spatial scalability for arbitrarily shaped VOP's.
-
Adaptive Macroblock slices as well as improved bit stuffing and
motion markers for resynchronization in error prone environments.
-
Near backward compatibility with standard H.261/3 or MPEG-1
coding algorithms if the input image sequences are coded in a
single layer using a single rectangular VOP structure.
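As a small worked example of the 4:2:0 format named in the first item above, the following sketch counts the samples of one rectangular frame. The function name and the QCIF picture size are illustrative assumptions and are not taken from the Verification Model.

```python
def frame_size_420(width, height, bits_per_sample=8):
    """Return (luma_samples, chroma_samples_per_plane, total_bytes) for one 4:2:0 frame."""
    if width % 2 or height % 2:
        raise ValueError("4:2:0 sampling needs even width and height")
    luma = width * height
    # U and V are each subsampled by two horizontally and vertically.
    chroma = (width // 2) * (height // 2)
    total_bits = (luma + 2 * chroma) * bits_per_sample
    return luma, chroma, total_bits // 8


# QCIF (176x144) is used here only as a familiar rectangular example.
print(frame_size_420(176, 144))  # (25344, 6336, 38016)
```

With 8 bits per sample, a 4:2:0 frame therefore needs 1.5 bytes per luminance pel before compression.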
1. Provisions for Content Based Functionalities - Decomposition
into "Video Object Planes"
The MPEG-4 Video coding algorithm will eventually support all
functionalities already provided by MPEG-1 and MPEG-2, including
the provision to efficiently compress standard rectangular sized
image sequences at varying levels of input formats, frame rates
and bit rates as well as provisions for interlaced input sources.
Furthermore, at the heart of the so-called "content"-based
MPEG-4 Video functionalities is the support for the separate
encoding and decoding of content (i.e. physical objects in a scene).
Within the context of MPEG-4 this functionality - the ability
to identify and selectively decode and reconstruct video content
of interest - is referred to as "Content-Based Scalability".
This MPEG-4 feature provides the most elementary mechanism for
interaction with and manipulation of the content of images or video
in the compressed domain without the need for further segmentation
or transcoding at the receiver.
To enable the content based interactive functionalities envisioned,
the MPEG-4 Video Verification Model introduces the concept of
Video Object Planes (VOP's). It is assumed that each frame of
an input video sequence is segmented into a number of arbitrarily
shaped image regions (video object planes) - each of the regions
may possibly cover particular image or video content of interest,
i.e. describing physical objects or content within scenes. In
contrast to the video source format used for the MPEG-1 and MPEG-2
standards, the video input to be coded by the MPEG-4 Verification
Model is thus no longer considered a rectangular region. This
concept is illustrated in the Figure below.

The coding of image sequences using MPEG-4 Video
Object Planes (VOP's) enables basic content-based functionalities
at the decoder. Each VOP specifies particular image sequence content
and is coded into a separate VOL-layer (by coding contour, motion
and texture information). Decoding of all VOL-layers reconstructs
the original image sequence. Content can be reconstructed by separately
decoding a single VOL-layer or a set of VOL-layers (content-based scalability/access
in the compressed domain). This allows content-based manipulation
at the decoder without the need for transcoding.
The input to be coded can be a VOP image region of arbitrary shape
and the shape and location of the region can vary from frame to
frame. Successive VOP's belonging to the same physical object
in a scene are referred to as Video Objects (VO's) - a sequence
of VOP's of possibly arbitrary shape and position. The shape,
motion and texture information of the VOP's belonging to the same
VO is encoded and transmitted or coded into a separate VOL (Video
Object Layer). In addition, the information needed to identify
each of the VOL's, and to specify how the various VOL's are composed at
the receiver to reconstruct the entire original sequence, is also
included in the bitstream. This allows the separate decoding of
each VOP and the required flexible manipulation of the video sequence.
Notice that the video source input assumed for the VOL structure
either already exists in terms of separate entities (e.g. is generated
with chroma-key technology) or is generated by means of on-line
or off-line segmentation algorithms.
Notice that, if the original input image sequences are not decomposed
into several VOL's of arbitrary shape, the coding structure simply
degenerates into a single layer representation which supports
conventional image sequences of rectangular shape. The MPEG-4
content-based approach can thus be seen as a logical extension
of the conventional MPEG-1 and MPEG-2 coding approach towards
image input sequences of arbitrary shape.
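The VO/VOL/VOP relationship and the idea of content-based scalability can be pictured with a minimal data-structure sketch. All class and function names below are hypothetical, the shape is reduced to a binary mask, and no actual bitstream syntax is implied. The point is only that the receiver may compose all layers or any subset of them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VOP:
    """One Video Object Plane: an arbitrarily shaped region at one time instant."""
    x: int                   # horizontal offset inside the reference window
    y: int                   # vertical offset inside the reference window
    alpha: List[List[int]]   # binary shape mask, 1 = pel belongs to the object
    texture: List[List[int]] # pel values, same size as alpha


@dataclass
class VOL:
    """Video Object Layer: the VOP sequence of one Video Object."""
    name: str
    vops: List[VOP] = field(default_factory=list)


def compose(vols, frame_index, width, height, background=0):
    """Paste the selected VOLs into the reference window, in list order."""
    frame = [[background] * width for _ in range(height)]
    for vol in vols:
        vop = vol.vops[frame_index]
        for r, (mask_row, tex_row) in enumerate(zip(vop.alpha, vop.texture)):
            for c, (inside, value) in enumerate(zip(mask_row, tex_row)):
                if inside:
                    frame[vop.y + r][vop.x + c] = value
    return frame


# Two layers: a rectangular background and a small non-rectangular object.
scene_bg = VOL("background", [VOP(0, 0, [[1] * 4 for _ in range(3)],
                                  [[10] * 4 for _ in range(3)])])
person = VOL("person", [VOP(1, 1, [[1, 0], [1, 1]], [[200, 0], [200, 200]])])

full_scene = compose([scene_bg, person], 0, width=4, height=3)
object_only = compose([person], 0, width=4, height=3)  # decode one layer only
```

Decoding only `person` corresponds to accessing a single object without touching the other layer; passing a single rectangular VOL corresponds to the conventional single-layer case.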
2. Coding of Shape, Motion and Texture Information for each
VOP
The shape, motion and texture information
for each VO is coded into a separate VOL-layer in order to support
separate decoding of VO's. The MPEG-4 Video VM uses an identical
algorithm to code the shape, motion and texture information in
each of the layers. The shape information is, however, not transmitted
if the input image sequence to be coded contains only standard
images of rectangular size. In this case the MPEG-4 Video coding
algorithm has a structure similar to the successful MPEG-1/2 or
H.261 coding algorithms - suitable for applications which require
high coding efficiency without the need for extended content based
functionalities.
The MPEG-4 VM compression algorithm employed for coding each VOP
image sequence (rectangular size or not) is based on the successful
block-based hybrid DPCM/Transform coding technique already employed
in the MPEG coding standards. The MPEG-4 coding algorithm encodes
the first VOP in Intra-Frame VOP coding mode (I-VOP). Each subsequent
frame is coded using Inter-frame VOP prediction (P-VOP's) - only
data from the nearest previously coded VOP frame is used for prediction.
In addition, the coding of bi-directionally predicted VOP's (B-VOP's)
is supported.
Similar to the MPEG baseline coders, the MPEG-4 Verification Model
algorithm processes the successive images of a VOP sequence block by block.
Taking the example of arbitrarily shaped VOP's, after coding the
VOP shape information, each color input VOP image in a VOP sequence
is partitioned into non-overlapping "Macroblocks" as
depicted in the following Figure.

A.) Illustration of an I-picture VOP (I-VOP)
and P-picture VOP's (P-VOP's) in a video sequence. P-VOP's are
coded using motion compensated prediction based on the nearest
previous VOP frame. Each frame is divided into disjoint "Macroblocks"
(MB).
B.) With each Macroblock (MB), information related
to four luminance blocks (Y1, Y2, Y3, Y4) and two chrominance
blocks (U, V) is coded. Each block contains 8x8 pels.
Each Macroblock contains blocks of data from both luminance and
co-sited chrominance bands - four luminance blocks (Y1, Y2, Y3, Y4)
and two chrominance blocks (U, V), each with size 8x8 pels.
The basic diagram of the MPEG-4 VM hybrid DPCM/Transform
encoder and decoder structure for processing single Y, U or V blocks
and Macroblocks is depicted in the Figure below. The previously
coded VOP frame N-1 is stored in a VOP frame store in both
encoder and decoder. Motion compensation is performed on a block
or Macroblock basis - only one motion vector is estimated between
VOP frame N and VOP frame N-1 for a particular block
or Macroblock to be encoded. The motion compensated prediction
error is calculated by subtracting from each pel in a block or Macroblock
belonging to the VOP frame N its motion shifted counterpart
in the previous VOP frame N-1. An 8x8 Discrete Cosine
Transform (DCT) is then applied to each of the 8x8 blocks
contained in the block or Macroblock followed by quantization
(Q) of the DCT coefficients with subsequent run-length coding
and entropy coding (VLC). A video buffer is needed to ensure that
a constant target bit rate output is produced by the encoder.
The quantization stepsize for the DCT-coefficients can be adjusted
for each Macroblock in a VOP frame to achieve a given target bit
rate and to avoid buffer overflow and underflow.

Block diagram of the basic MPEG-4 VM hybrid
DPCM/Transform encoder and decoder structure.
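The Macroblock layout described above can be sketched as a function that cuts the six 8x8 blocks out of 4:2:0 planes. The function name is hypothetical, and the raster ordering of Y1 to Y4 (top-left, top-right, bottom-left, bottom-right) is an assumption, since the text does not state it.

```python
def split_macroblock(y_plane, u_plane, v_plane, mb_row, mb_col):
    """Return the six 8x8 blocks (Y1..Y4, U, V) of the Macroblock at (mb_row, mb_col)."""

    def block(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]

    y0, x0 = 16 * mb_row, 16 * mb_col   # luminance position of the Macroblock
    cy, cx = 8 * mb_row, 8 * mb_col     # co-sited position in the subsampled chroma planes
    return {
        "Y1": block(y_plane, y0, x0),
        "Y2": block(y_plane, y0, x0 + 8),
        "Y3": block(y_plane, y0 + 8, x0),
        "Y4": block(y_plane, y0 + 8, x0 + 8),
        "U": block(u_plane, cy, cx),
        "V": block(v_plane, cy, cx),
    }
```

A 16x16 luminance area thus maps onto one 8x8 area in each chrominance plane, which is why a Macroblock carries four luminance blocks but only one U and one V block.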
The decoder uses the reverse process to reproduce a Macroblock
of VOP frame N at the receiver. After decoding the variable
length words contained in the video decoder buffer the pixel values
of the prediction error are reconstructed. The motion compensated
pixels from the previous VOP frame N-1 contained in the
VOP frame store are added to the prediction error to recover the
particular Macroblock of frame N.
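The encoder and decoder steps described above can be sketched for a single 8x8 block as follows. This is a minimal illustration under stated assumptions: the prediction block is taken as given (motion estimation is omitted), a plain uniform quantizer with step `qstep` stands in for the MPEG/H.263-style quantizers, and run-length and VLC coding are left out. The function names are hypothetical.

```python
import math

N = 8


def _scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# Orthonormal DCT-II basis, BASIS[k][n]
BASIS = [[_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N)] for k in range(N)]


def dct2(block):
    """2-D DCT of an 8x8 block, computed as BASIS * block * BASIS^T."""
    tmp = [[sum(BASIS[k][n] * block[n][c] for n in range(N)) for c in range(N)]
           for k in range(N)]
    return [[sum(tmp[k][n] * BASIS[l][n] for n in range(N)) for l in range(N)]
            for k in range(N)]


def idct2(coef):
    """Inverse 2-D DCT, computed as BASIS^T * coef * BASIS."""
    tmp = [[sum(BASIS[k][r] * coef[k][c] for k in range(N)) for c in range(N)]
           for r in range(N)]
    return [[sum(tmp[r][l] * BASIS[l][c] for l in range(N)) for c in range(N)]
            for r in range(N)]


def encode_block(current, predicted, qstep):
    """Prediction error -> DCT -> uniform quantization; returns integer levels."""
    residual = [[current[r][c] - predicted[r][c] for c in range(N)] for r in range(N)]
    return [[int(round(v / qstep)) for v in row] for row in dct2(residual)]


def decode_block(levels, predicted, qstep):
    """Dequantize -> inverse DCT -> add the motion compensated prediction."""
    residual = idct2([[lv * qstep for lv in row] for row in levels])
    return [[min(255, max(0, int(round(predicted[r][c] + residual[r][c]))))
             for c in range(N)] for r in range(N)]
```

Because the decoder adds the same prediction that the encoder subtracted, the only loss in this sketch comes from the quantizer; a larger `qstep` lowers the number of nonzero levels at the price of a larger reconstruction error.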
In general, the input images to be coded in each VOP layer are
of arbitrary shape and the shape and location of the images vary
over time with respect to a reference window. For coding shape,
motion and texture information in arbitrarily shaped VOP's, the
MPEG-4 Video Verification Model introduces the concept of a "VOP
image window" together with a "shape-adaptive"
Macroblock grid. All VOL layers to be coded for a given input
video sequence are defined with reference to the reference window
of constant size. An example of a VOP image window within a reference
window and an example of a Macroblock grid for a particular VOP
image is depicted below:

Example of an MPEG-4 VM Macroblock grid for a foreground
VOP image. This Macroblock grid is used for alpha plane coding,
motion estimation and compensation as well as for block-based
DCT texture coding. A VOP window with a size of multiples
of 16 pels in each image direction surrounds the foreground VOP
of arbitrary shape and specifies the location of the Macroblocks,
each of size 16x16 pels. This window is adjusted to collocate
with the topmost and leftmost borders of the VOP. A shift parameter
is coded to indicate the location of the VOP window with respect
to the borders of a reference window (original image borders).
The shape information of a VOP is coded prior to coding motion
vectors based on the VOP image window Macroblock grid and is available
to both encoder and decoder. In subsequent processing steps only
the motion and texture information for the Macroblocks belonging
to the VOP image are coded (which includes the standard Macroblocks
as well as the contour Macroblocks indicated in the figure above).
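The VOP image window and its Macroblock grid can be sketched as below, assuming a binary alpha plane given as a list of rows. The function names and the labels "outside", "contour" and "standard" are illustrative. Pels of the window that fall beyond the reference window are treated as transparent, which is an assumption of this sketch.

```python
def vop_window(alpha):
    """Tightest window around the object, aligned to its top and left borders and
    extended to multiples of 16 pels. Returns (shift_x, shift_y, width, height)."""
    rows = [r for r, row in enumerate(alpha) if any(row)]
    cols = [c for c in range(len(alpha[0])) if any(row[c] for row in alpha)]
    if not rows:
        raise ValueError("alpha plane contains no object pels")
    top, left = rows[0], cols[0]
    height = -(-(rows[-1] - top + 1) // 16) * 16   # round up to a multiple of 16
    width = -(-(cols[-1] - left + 1) // 16) * 16
    return left, top, width, height


def classify_macroblocks(alpha, window):
    """Label each 16x16 Macroblock of the window by how much of it lies inside the VOP."""
    left, top, width, height = window

    def inside(r, c):
        return r < len(alpha) and c < len(alpha[0]) and alpha[r][c] == 1

    grid = []
    for mb_y in range(top, top + height, 16):
        row = []
        for mb_x in range(left, left + width, 16):
            count = sum(inside(r, c)
                        for r in range(mb_y, mb_y + 16)
                        for c in range(mb_x, mb_x + 16))
            row.append("outside" if count == 0
                       else "standard" if count == 256 else "contour")
        grid.append(row)
    return grid
```

The first two values returned by `vop_window` play the role of the shift parameter; only Macroblocks labelled "standard" or "contour" would carry motion and texture information.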
Shape Coding - Essentially two coding methods are supported
by the MPEG-4 Video Verification Model for binary and gray scale
shape information. The shape information is referred to as "alpha
planes" in the context of the MPEG-4 VM. The techniques to
be adopted for the standard will provide lossless coding of alpha-planes
as well as the provision for lossy coding of shapes and transparency
information, allowing the trade off between bit rate and the accuracy
of shape representation. Furthermore it is foreseen to support
both Intra shape coding as well as Inter shape coding functionalities
employing motion compensated shape prediction - to allow both
efficient random access operations as well as an efficient compression
of shape and transparency information for diverse applications.
Motion Estimation and Compensation - The MPEG-4 VM employs
block-based motion estimation and compensation techniques to efficiently
exploit temporal redundancies of the video content in the separate
VOP layers. In general, the motion estimation and compensation
techniques used can be seen as an extension of the standard MPEG
block matching techniques towards image sequences of arbitrary
shape. However, a wealth of different motion prediction methods
is also being investigated in the Core Experiment process.
To perform block based motion estimation and compensation between
VOP's of varying location, size and shape, the shape-adaptive
Macroblock (MB) grid approach for each VOP image is employed.
A block-matching procedure is used for standard Macroblocks. The
prediction error is coded together with the Macroblock motion
vectors used for prediction. An advanced motion compensation mode
is defined which supports block-overlapping motion compensation
as with the ITU H.263 standard as well as the coding of motion
vectors for 8x8 blocks. The definition of the motion estimation
and compensation techniques is, however, modified at the borders
of a VOP. An image padding technique is used for the reference
VOP frame N-1, which is available to both encoder and decoder,
to perform motion estimation and compensation. The VOP padding
method can be seen as an extrapolation of pels outside of the
VOP based on pels inside of the VOP. After padding the reference
VOP in frame N-1 (as shown in the Figure below), a "polygon"
matching technique is employed for motion estimation and compensation.
A polygon defines the part of the contour Macroblock (or the 8x8
block for advanced motion compensation, respectively) which belongs
to the active area inside of the VOP frame N to be coded
and excludes the pels outside of this area. Thus, the pels not
belonging to the active area in the VOP to be coded are essentially
excluded from the motion estimation process.
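The polygon matching idea, i.e. a block match whose error is accumulated only over pels inside the VOP, can be sketched as follows. The sum of absolute differences (SAD) criterion, the integer-pel full search and all names are assumptions made for illustration. The reference is assumed to be padded already.

```python
def masked_sad(cur, cur_alpha, ref, top, left, dy, dx, size=16):
    """SAD between a block of the current VOP and a displaced block of the padded
    reference, counting only pels with alpha == 1 in the current VOP."""
    total = 0
    for r in range(size):
        for c in range(size):
            if cur_alpha[top + r][left + c]:
                total += abs(cur[top + r][left + c] - ref[top + r + dy][left + c + dx])
    return total


def polygon_match(cur, cur_alpha, ref, top, left, search=2, size=16):
    """Full search over integer displacements; returns (dy, dx, cost) of the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame store.
            if not (0 <= top + dy and top + dy + size <= height
                    and 0 <= left + dx and left + dx + size <= width):
                continue
            cost = masked_sad(cur, cur_alpha, ref, top, left, dy, dx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    if best is None:
        raise ValueError("no candidate displacement inside the reference frame")
    return best[1], best[2], best[0]
```

For a standard Macroblock the mask is all ones and the procedure reduces to ordinary block matching; with `size=8` the same routine covers the 8x8 blocks of the advanced mode.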

An image padding technique is employed for the
purpose of contour block motion estimation and compensation. The
aim of the padding procedure is to allow separate decoding and
reconstruction of VOP's by extrapolating texture inside the VOP
to regions outside the VOP. This allows block-based DCT coding
of texture across a VOP border in INTRA VOP's as well. Furthermore
the block based motion vector range for search and motion compensation
in a VOP in frame N can be specified covering regions outside
the VOP in frame N-1.
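The extrapolation idea behind padding can be illustrated with a simplified two-pass scheme: each row is filled from the nearest object pels in that row, then rows without any object pel are filled from the nearest padded rows. This is a sketch of the principle only and is not claimed to reproduce the exact padding procedure of the Verification Model. The names and the averaging rule between two borders are assumptions.

```python
def pad_line(values, mask):
    """Fill positions where mask is false from the nearest positions where it is true.
    Returns None if the line holds no valid position."""
    inside = [i for i in range(len(values)) if mask[i]]
    if not inside:
        return None
    out = list(values)
    for i in range(len(values)):
        if mask[i]:
            continue
        left = max((j for j in inside if j < i), default=None)
        right = min((j for j in inside if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            out[i] = (values[left] + values[right]) // 2  # between two borders: average
    return out


def pad_vop(texture, alpha):
    """Extrapolate VOP texture to all pels of the window (horizontal, then vertical pass)."""
    height, width = len(texture), len(texture[0])
    rows = [pad_line(texture[r], alpha[r]) for r in range(height)]
    filled = [row is not None for row in rows]
    if not any(filled):
        raise ValueError("alpha plane contains no object pels")
    padded = [row if row is not None else [0] * width for row in rows]
    for c in range(width):
        column = pad_line([padded[r][c] for r in range(height)], filled)
        for r in range(height):
            padded[r][c] = column[r]
    return padded
```

After padding, every pel of the reference window holds a plausible value, so a motion vector may point partly outside the object without reading undefined data.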
The MPEG-4 Video Verification Model supports the coding of both
forward predicted (P) as well as bi-directionally (B) predicted
VOP's (P-VOP and B-VOP). Motion vectors are predictively coded
using standard MPEG-1/2 and H.263 VLC code tables including the
provision for extended vector ranges. Notice, that the coding
of standard MPEG I-frames, P-frames and B-frames is still supported
by the Verification Model - for the special case of image input
sequences (VOP's) of rectangular shape (standard MPEG or H.261/3
definition of frames).
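Predictive coding of motion vectors means that only the difference between a vector and a prediction formed from already coded neighbours is sent to the VLC stage. The sketch below uses the component-wise median of three neighbouring vectors, which is the H.263 predictor; that the Verification Model uses exactly this predictor is an assumption here, and the function names are hypothetical.

```python
def median_predictor(left, above, above_right):
    """Component-wise median of three candidate motion vectors (H.263 style)."""
    return tuple(sorted(components)[1] for components in zip(left, above, above_right))


def mv_difference(mv, left, above, above_right):
    """Vector difference that would be passed on to the VLC tables."""
    pred_x, pred_y = median_predictor(left, above, above_right)
    return mv[0] - pred_x, mv[1] - pred_y


print(mv_difference((5, 1), left=(2, 0), above=(4, -2), above_right=(3, 6)))  # (2, 1)
```

Since neighbouring Macroblocks tend to move alike, the differences cluster around zero and short code words suffice for most vectors.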
Texture Coding - The Intra VOP's as well as the residual
errors after motion compensated prediction are coded using a DCT
on 8x8 blocks similar to the MPEG and H.263 standards.
Again, the adaptive VOP window Macroblock grid is employed for
this purpose. For each Macroblock a maximum of four 8x8 Luminance
blocks and two 8x8 Chrominance blocks are coded. Particular adaptation
is required for the 8x8 blocks straddling the VOP borders. The
image padding technique in the figure above is used to fill the
Macroblock content outside of a VOP prior to applying the DCT
in Intra-VOP's. For the coding of the motion compensated prediction
error in P- or B-VOP's, the pels outside of the active
VOP area are set to 128. Alternatively a low complexity shape-adaptive
DCT (SADCT) technique can be used to encode only the pixels belonging
to the VOP - this results in higher quality at the same bit rate at
the cost of slightly increased implementation complexity. Scanning of the
DCT coefficients followed by quantization and run-length coding
of the coefficients is performed using techniques and VLC tables
defined with the MPEG-1/2 and H.263 standards, including the provision
for quantization matrices. An efficient prediction of the DC-
and AC-coefficients of the DCT is performed for Intra coded VOP's.
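The scanning and run-length step mentioned above can be sketched with the conventional zigzag scan and (run, level) pairs. The sketch scans all 64 quantized coefficients alike and drops trailing zeros, which is a simplification: the separate treatment of the Intra DC coefficient, the coefficient prediction and the VLC tables are omitted. The function names are hypothetical.

```python
def zigzag_order(n=8):
    """Block positions (row, col) in zigzag order, starting at the DC coefficient."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals; odd diagonals run downwards, even ones upwards.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def run_level_pairs(levels):
    """Convert quantized coefficients into (zero_run, level) pairs along the zigzag scan."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(levels)):
        if levels[r][c] == 0:
            run += 1
        else:
            pairs.append((run, levels[r][c]))
            run = 0
    return pairs  # zeros after the last nonzero level are implied by an end-of-block code


levels = [[0] * 8 for _ in range(8)]
levels[0][0], levels[0][1], levels[2][0] = 12, 3, -1
print(run_level_pairs(levels))  # [(0, 12), (0, 3), (1, -1)]
```

The zigzag scan visits low-frequency coefficients first, so the nonzero levels of a typical block are gathered at the start of the scan and the zero runs stay short.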
Multiplexing of Shape, Motion and Texture Information
- Basically all "tools" (DCT, motion estimation and
compensation, etc.) defined in the H.263 and MPEG-1 standards
(and most of the ones defined for MPEG-2 Main Profile) are currently
supported by the MPEG-4 Video Verification Model. The compressed
alpha plane, motion vector and DCT bit words are multiplexed into
a VOL layer bitstream by coding the shape information first, followed
by motion and texture coding based on the H.263 and MPEG definitions.
The Verification Model defines two separate modes for multiplexing
texture and motion information: A joint motion vector and DCT-coefficient
coding procedure based on standard H.263-like Macroblock Type
definitions is supported to achieve a high compression efficiency
at very low bit rates. This guarantees that the performance of
the VM at very low bit rates is at least equal to that of the H.263
standard. Alternatively, the separate coding of motion vectors
and DCT-coefficients is also possible - to eventually incorporate
new and more efficient motion or texture coding techniques separately
into the Verification Model.
3. Coding Efficiency
Besides the provision for new content-based functionalities and
error resilience and robustness, the coding of video with very
high coding efficiency over a range of bit rates continues to
be supported for the MPEG-4 standard. As indicated above, the
MPEG-4 Video Verification Model allows the single object-layer
(single VOP) coding approach as a special case. In this coding
mode the single VOP input image sequence format may be rectangular (thus not
segmented into several VOP's), and the MPEG-4 Video Verification
Model coding algorithm can be made almost compatible with the ITU H.263
or ISO MPEG-1 standards. Most of the coding techniques used by
the MPEG-2 standard at Main Profile are also supported. A number
of motion compensation and texture coding techniques are being
investigated in the Core Experiment process to further improve
coding efficiency for a range of bit rates, including bit rates
below 64 kbits/s.
4. Spatial and Temporal Scalability
An important goal of scalable coding of video is to flexibly
support receivers with different bandwidth or display capabilities
or display requests to allow video database browsing and multiresolution
playback of video content in multimedia environments. Another
important purpose of scalable coding is to provide a layered
video bit stream which is amenable to prioritized transmission.
The techniques adopted for the MPEG-4 Video Verification Model
allow the "content-based" access or transmission of
arbitrarily-shaped VOP's at various temporal or spatial resolutions
- in contrast to the frame-based scalability approaches introduced
for MPEG-2. Receivers that are either unable or unwilling to reconstruct
the full resolution arbitrarily shaped VOP's can decode subsets
of the layered bit stream to display the arbitrarily shaped VOP's
content/objects at lower spatial or temporal resolution or with
lower quality.
Spatial Scalability - The figure below depicts the MPEG-4
general philosophy of a content-based VOP multiscale video coding
scheme. Here three layers are provided, each layer supporting
a VOP at different spatial resolution scales, i.e. a multiresolution
representation can be achieved by downscaling the input video
signal into a lower resolution video (downsampling spatially in
our example). The downscaled version is encoded into a base layer
bit stream with reduced bit rate. The upscaled reconstructed base
layer video (upsampled spatially in our example) is used as a
prediction for the coding of the original input video signal.
The prediction error is encoded into an enhancement layer bit
stream. If a receiver is either unable or unwilling to display
the full quality VOP's, downscaled VOP signals can be reconstructed
by only decoding the lower layer bit streams. It is important
to notice, however, that the display of the VOP at highest resolution
with reduced quality is also possible by only decoding the lower
bit rate base layer(s). Thus scalable coding can be used to encode
content-based video with a suitable bit rate allocated to each
layer in order to meet specific bandwidth requirements of transmission
channels or storage media. Browsing through video databases and
transmission of video over heterogeneous networks are applications
expected to benefit from this functionality.

Spatial scalability approach for arbitrarily
shaped VOP's.
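The two-layer principle described above (downscale, code the base layer, upscale it as a prediction, code the prediction error as enhancement layer) can be sketched as follows. The sketch works on a rectangular array with even dimensions, uses 2x2 averaging and pel replication as stand-ins for the actual filters, and leaves out quantization and shape handling, so both layers are lossless here. All names are hypothetical.

```python
def downsample2(img):
    """Halve the resolution by averaging each 2x2 neighbourhood (with rounding)."""
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1] + 2) // 4
             for c in range(0, len(img[0]), 2)]
            for r in range(0, len(img), 2)]


def upsample2(img):
    """Double the resolution by repeating each pel horizontally and vertically."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode_layers(img):
    """Return (base layer, enhancement layer) for one image."""
    base = downsample2(img)
    prediction = upsample2(base)
    enhancement = [[img[r][c] - prediction[r][c] for c in range(len(img[0]))]
                   for r in range(len(img))]
    return base, enhancement


def decode_layers(base, enhancement=None):
    """Decode the base layer alone, or both layers for full resolution."""
    if enhancement is None:
        return base
    prediction = upsample2(base)
    return [[prediction[r][c] + enhancement[r][c] for c in range(len(prediction[0]))]
            for r in range(len(prediction))]
```

A receiver that stops after the base layer gets the half-resolution picture; one that also decodes the enhancement layer recovers the input exactly in this lossless sketch.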
Temporal Scalability - This technique was developed with
an aim similar to spatial scalability. Different frame rates can
be supported with a layered bit stream. Layering is achieved by
providing a temporal prediction for the enhancement layer based
on coded video from the lower layers. Using the MPEG-4 "content-based"
VOP temporal scalability approach it is possible to provide different
display rates for different VOL's within the same video sequence
(e.g. a foreground person of interest may be displayed with a
higher frame rate compared to the remaining background or other
objects).
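The effect of content-based temporal scalability on display rates can be sketched by splitting frame instants into a base and an enhancement layer per VOL. The regular subsampling pattern, the function names and the two example objects are assumptions made for illustration.

```python
def temporal_layers(num_frames, base_step):
    """Assign every base_step-th frame to the base layer, the rest to the enhancement layer."""
    base = [t for t in range(num_frames) if t % base_step == 0]
    enhancement = [t for t in range(num_frames) if t % base_step != 0]
    return base, enhancement


def frames_shown(num_frames, base_step, decode_enhancement):
    """Frame instants a receiver displays for one VOL."""
    base, enhancement = temporal_layers(num_frames, base_step)
    return sorted(base + enhancement) if decode_enhancement else base


# The background is decoded from the base layer only, the foreground from both layers.
schedule = {
    "background": frames_shown(8, 4, decode_enhancement=False),
    "foreground person": frames_shown(8, 4, decode_enhancement=True),
}
print(schedule)  # {'background': [0, 4], 'foreground person': [0, 1, 2, 3, 4, 5, 6, 7]}
```

In this example the foreground object is refreshed four times as often as the background within the same sequence.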
5. Error Resilience - Error Robustness
A considerable effort has been made to investigate the robust
storage and transmission of MPEG-4 video in error prone environments.
To this end an adaptive Macroblock Slice technique similar to
the one already provided with the MPEG-1 and MPEG-2 standards
has been introduced into the MPEG-4 Video Verification Model.
The technique provides resynchronization bit words for groups
of Macroblocks and has been optimized in particular to achieve
efficient robustness for low bit rate video under a variety of
severe error conditions, e.g. for transmission over mobile
channels. In addition, improved bit stuffing as well as motion markers have
been defined.
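One way to picture an adaptive slice structure is to start a new group of Macroblocks, preceded by a resynchronization word, whenever the current group would exceed a bit budget. The budget-driven rule, the function name and the bit counts below are illustrative assumptions; the text does not specify how the Verification Model chooses the slice boundaries.

```python
def group_into_slices(mb_bit_counts, target_bits):
    """Group consecutive Macroblocks so that each slice holds at most target_bits,
    except when a single Macroblock alone exceeds the budget.
    Returns a list of slices, each a list of Macroblock indices."""
    slices, current, used = [], [], 0
    for index, bits in enumerate(mb_bit_counts):
        if current and used + bits > target_bits:
            slices.append(current)   # a resynchronization word would precede the next slice
            current, used = [], 0
        current.append(index)
        used += bits
    if current:
        slices.append(current)
    return slices


print(group_into_slices([100, 300, 250, 50, 400, 20], target_bits=400))
# [[0, 1], [2, 3], [4], [5]]
```

Spacing the resynchronization words by bits rather than by Macroblock count keeps the amount of data lost per channel error roughly constant, since busy picture areas get more markers than quiet ones.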