There have been several articles in the past discussing the choice of video codec and arguments for and against (usually against) so-called Long GOP codecs. In this article I will be taking a closer look at GOP structures, discussing what Long GOP means and arguing that there is no one size fits all codec and that Long GOP, when used for the right content and application, is the only sensible choice.
In this article I will use the terms "bit rate" and "transmission" interchangeably with "file size" and "storage". I prefer bit rate because video has to be read from storage or a network, decoded into RGB values and sent to the display over some kind of bandwidth-limited cable or data bus.
If you only want to know what Long GOP means, here is the short version: video codecs use two types of frames, intra coded and inter coded. Long GOP is when a mixture of intra and inter coding is used. If a codec is not Long GOP then it is intra only. For most purposes outside of production, a mix of intra and inter coding is preferable.
Types of Frame
Intra coded frames do not depend on any other frame in the video. They are compressed using block transforms, such as the Discrete Cosine Transform (DCT), where each block of pixels is transformed into frequency-based coefficients. The coefficients are then quantized, which removes many of the higher frequency coefficients that correspond to fine detail and makes the image easier to compress. The human eye struggles to see very fine detail, so we can get away with losing some of this information without compromising too much on image quality. If too much detail is lost then quality will suffer. Intra frames can also be compressed by "intra prediction", where blocks are predicted from neighboring blocks within the same frame. This works well for areas that do not have a lot of detail.
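To make the transform and quantization step concrete, here is a minimal sketch in Python. It is not how any particular codec implements it: the 8x8 block size, the single uniform quantizer step of 16 and the test block are my own assumptions, and real codecs use integer transforms and per-frequency quantization tables.

```python
import math

N = 8  # block size (assumption; many DCT-based codecs use 8x8 blocks)

def dct_2d(block):
    """Orthonormal 2D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency
        for v in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out

def quantize(coeffs, step):
    """Uniform quantization: small coefficients round to zero."""
    return [[round(c / step) for c in row] for row in coeffs]

# A smooth gradient block, standing in for an area without fine detail.
block = [[16 * y + 2 * x for x in range(N)] for y in range(N)]
levels = quantize(dct_2d(block), step=16)
nonzero = sum(1 for row in levels for value in row if value != 0)
print(nonzero, "of", N * N, "coefficients are non-zero after quantization")
```

Only the non-zero levels need to be stored, which is where the saving comes from. Increasing the step removes more coefficients and, eventually, visible detail.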
Intra frames, also known as "I" frames, are sometimes described as key frames because they are suitable for navigation points in the video. In order to skip forwards or backwards the user only has to decode the I frames. For inter coded frames, the user must find the most recent I frame, decode it, decode any other reference frames, then decode all the inter frames up to the desired frame.
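The cost of seeking can be sketched with a simplified model. I am assuming a stream that contains only I and P frames, listed in decode order, where every P frame references the frame immediately before it; the function name is my own.

```python
def frames_to_decode(frame_types, target):
    """Count the frames that must be decoded to display frame `target`.

    Simplified model: walk back to the most recent I frame, then decode
    every frame from there up to and including the target.
    """
    start = target
    while start > 0 and frame_types[start] != "I":
        start -= 1
    return target - start + 1

stream = list("IPPPPIPPPP")
for target in (0, 4, 5, 7):
    print("frame", target, "needs", frames_to_decode(stream, target), "decodes")
```

Landing on an I frame costs a single decode, while the last frame of a GOP costs the whole GOP.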
Intra frames are vital for compression quality because they provide a higher quality version of the current picture to be used for motion compensation thus enabling better quality prediction and hence better compression.
Using I frames allows errors in the decoded video stream, caused by transmission or storage, to be cleared. Without I frames, errors can propagate into following frames. There is a special type of I frame called an IDR (Instantaneous Decoder Refresh) frame. When an IDR frame is encountered the decoder discards its stored reference frames, so no P or B frame that follows can reference any frame before it. This matters because in H.264 P and B frames can use multiple reference frames for prediction, not just the most recent I frame.
The drawback of I frames is that they are considerably larger than P or B frames. Furthermore, using I frames more frequently does not guarantee quality: I frames may themselves be heavily compressed, using DCT coding with coarse quantization and, in H.264 and HEVC, intra prediction.
Inter coded frames are formed by finding the motion from "reference" frames and transmitting vectors. The motion vectors are combined with the reference frame in a process called motion compensation to generate a predicted version of the frame. The error between the predicted frame and the actual frame, the "residual", is compressed using the same techniques as intra coding and that difference is transmitted. In the decoder the decompressed residual is added back into the motion compensated prediction.
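The encoder and decoder sides of this process can be sketched for a single block. This is an illustration under my own assumptions: a full search using the sum of absolute differences (SAD), one 4x4 block, a synthetic reference frame, and a residual that is kept lossless rather than transformed and quantized as a real codec would do.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def find_motion_vector(ref, cur_block, top, left, size, search):
    """Full search over +/- `search` pixels; keep the offset with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(get_block(ref, y, x, size), cur_block)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2]

SIZE = 8
# Synthetic reference frame; the current frame is the same content moved
# one pixel to the right, with one pixel changed so the residual is not empty.
ref = [[10 * y + x * x for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y][max(x - 1, 0)] for x in range(SIZE)] for y in range(SIZE)]
cur[3][3] += 5

top, left, bsize = 2, 2, 4
cur_block = get_block(cur, top, left, bsize)

# Encoder: find the vector, build the prediction, compute the residual.
dy, dx = find_motion_vector(ref, cur_block, top, left, bsize, search=2)
prediction = get_block(ref, top + dy, left + dx, bsize)
residual = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(cur_block, prediction)]

# Decoder: rebuild the prediction from the vector and add the residual back.
reconstructed = [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(prediction, residual)]
print("motion vector (dy, dx):", (dy, dx))
print("block recovered exactly:", reconstructed == cur_block)
```

The vector points to where the block's content sits in the reference frame, and the residual carries only what the prediction got wrong.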
Inter frames, known as "P" for predicted or "B" for bidirectional, are formed of motion vectors plus a "residual" difference image. P frames are forward predicted from the most recent I or P frame. B frames are predicted from both a previous and a future I or P frame. Residual images are strongly compressed, but when the intra frame could not be represented at high quality, residual images can make up for the loss and improve the overall quality.
B frames used to be frowned upon because the decoder requires a future reference frame resulting in out of order transmission of frames and therefore more memory in the decoder; and also because using two references is less resilient to errors. These days codec implementations can handle these requirements and as B frames tend to give better compression, they are used where possible.
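The out of order transmission can be illustrated with a short sketch. It assumes the classic arrangement in which each B frame references the nearest I or P frame on either side, so every B frame is held back until its future reference has been sent; the function name is my own.

```python
def coded_order(display_types):
    """Reorder frames from display order into a plausible transmission order.

    Each B frame waits until the next I or P frame (its future reference)
    has been emitted. Returns (display_index, frame_type) pairs.
    """
    out = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            out.append((index, frame_type))
            out.extend(pending_b)
            pending_b = []
    # Trailing B frames would reference the next GOP (an open GOP).
    out.extend(pending_b)
    return out

print(coded_order("IBBPBBP"))
```

The decoder has to buffer the reference frames until the B frames between them have been decoded and displayed, which is where the extra memory goes.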
The pattern of intra and inter coded frames is repeated in a GOP (Group of Pictures) and influences the usability of the coded video in terms of
- Error resilience
- Data rate (file size)
A typical GOP is written as IBBP and the part after the I is usually mirrored around the P frame and repeated until the next I frame. That means IBBP would represent IBBPBBI or IBBPBBPBBI, etc. GOP length is used to define how many frames there are in one GOP. Other patterns are used including IBBBP, IBP, IP and I. The GOP structure is repeated throughout the bit stream. The diagram below shows a typical IBBP structure; arrows indicate which frames are referenced by the P and B frames.
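The notation can be expanded mechanically. The sketch below is my own helper, not part of any codec API: it takes a GOP length and the number of B frames between reference frames and returns the frame types in display order.

```python
def gop_pattern(length, b_frames):
    """Frame types for one GOP in display order.

    The GOP starts with an I frame; after that every (b_frames + 1)-th
    frame is a P frame and the frames in between are B frames.
    """
    types = []
    for i in range(length):
        if i == 0:
            types.append("I")
        elif i % (b_frames + 1) == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)

print(gop_pattern(6, 2))   # the IBBP family, six frames per GOP
print(gop_pattern(9, 2))   # same family, nine frames per GOP
print(gop_pattern(4, 0))   # IP only
```

A pattern produced this way can end in B frames; in that case the last B frames are predicted from the I frame that starts the next GOP.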
Frame Type Selection
The frequency with which the different frame types are used within a GOP is a trade off determined by the final application for the video. Naturally, more I frames make navigation easier and keep quality high with good error resilience, but data rate will go through the roof. Using P and B frames results in better compression but one error in the bit stream could result in multiple frames being corrupted; also navigation will be much slower as more frames will need to be decoded if jumping forwards or backwards in time.
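The data rate side of this trade off can be put into rough numbers. The relative frame sizes below are purely illustrative assumptions of mine (a P frame at half the size of an I frame, a B frame at a quarter); real ratios depend heavily on the content and the encoder settings.

```python
# Illustrative assumption only: sizes relative to an I frame.
RELATIVE_SIZE = {"I": 1.0, "P": 0.5, "B": 0.25}

def average_relative_size(pattern, sizes=RELATIVE_SIZE):
    """Average frame size for a GOP pattern, relative to intra only coding."""
    return sum(sizes[frame_type] for frame_type in pattern) / len(pattern)

for pattern in ("I", "IP", "IBBP", "IBBPBBPBBPBB"):
    print(pattern, round(average_relative_size(pattern), 3))
```

Under these assumed ratios the saving flattens out as the GOP gets longer, while the cost of an error or a seek keeps growing with GOP length.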
Any time there is a shot cut, a new GOP should be started; otherwise the first frame of the new shot would be predicted from totally different content. If a cut falls in the middle of a GOP, the editing software will usually decode all the frames in the GOP and re-encode them with an I frame in a suitable position.
Video editors and cinematographers must consider the end-to-end workflow, including:
- The codecs that the camera supports and their customizable parameters
- The editing process, where the NLE is considerably easier to use when the video is in an intra only intermediate codec (such as Apple ProRes)
- The final mastered output, delivered either to a distributor who will compress it further or to the end user who will download it, stream it or play it back from a physical medium
Choosing the GOP Size
Using GOPs is not a new idea; the concept has been around since the H.261 standard in 1988. The GOP structure is sometimes baked into a codec implementation but generally the GOP is customizable; therefore it's not really true to say such-and-such codec is Long GOP but so-and-so codec is not. It's the implementation that defines it.
There doesn't seem to be a clear definition of what makes a GOP "long". Most sources define Long GOP as any GOP structure that includes P and B frames; in other words, a codec is either intra only or Long GOP. The maximum GOP size depends on the specifications of the decoder. The longer the GOP, the more memory is needed to store the frames needed to decode it. For MPEG-2 on DVD the maximum GOP size is 15 frames for PAL and 18 frames for NTSC frame rates, because DVD players were not expected to have large frame buffers. YouTube recommends IBBP with a GOP size of half the frame rate ("2 consecutive B frames. Closed GOP. GOP of half the frame rate"). x264, a library often used for H.264 encoding, implements a dynamic GOP size. For fast action sections of content a shorter GOP size is more appropriate, and for longer shots a longer GOP size is better.
Users may have come across intra only codecs when using MPEG IMX, AVC-Intra, ProRes or MJPEG. They would have encountered Long GOP when using AVCHD or AVCCAM. Typically the user only gets to choose the codec and the output bit rate. Some cameras let you choose the length of a GOP. Long GOP codecs are said to struggle with movement: if the bit rate is too low, the residual images become ineffective.
Usually the motion vectors and residual are not determined using information about the content type. For example, a static camera does not cause the encoder to change its motion estimation strategy, even after hundreds of frames have been coded. Most people encode video using a single set of parameters chosen at the start of the encode, even though the content might change drastically. Generally you can get away with this but it is far from optimal.
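As an illustration, the YouTube recommendation quoted above could be turned into an FFmpeg command line roughly as follows. This is a sketch under my assumptions rather than an official recipe: the file names are placeholders, the helper function is my own, and I am relying on FFmpeg's `-g` (GOP size), `-bf` (maximum consecutive B frames) and `-flags +cgop` (closed GOP) options together with the libx264 encoder.

```python
import shlex

def youtube_style_args(src, dst, frame_rate):
    """Build an ffmpeg argument list for IBBP, closed GOP, GOP = half the frame rate."""
    gop = max(1, round(frame_rate / 2))
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-g", str(gop),        # maximum GOP length in frames
        "-bf", "2",            # up to 2 consecutive B frames
        "-flags", "+cgop",     # closed GOPs
        dst,
    ]

# Placeholder file names; the command is printed, not executed.
print(shlex.join(youtube_style_args("input.mov", "output.mp4", 30)))
```

Note that `-g` sets a maximum: x264 may still start a new GOP earlier, for example at a scene change, which is the dynamic behavior mentioned above.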
Some codecs that can switch between intra and Long GOP also switch to 4:2:0 color subsampling because the manufacturers assume that Long GOP is only about maximizing compression. This would not be ideal for doing green screen work.
Finally, another factor in GOP design is the choice between open and closed GOPs. In an open GOP, B frames can reference frames from previous and future GOPs, whereas closed GOPs are completely self contained. Closed GOPs can result in extra P frames being added, because a closed GOP cannot end in a B frame. With open GOPs, editing a frame in one GOP can affect the prediction of a frame in another GOP; for this reason some editing software prefers closed GOPs. Another issue to be wary of is that some codecs make every I frame an IDR frame by default, and therefore all GOPs are closed.
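The extra P frame can be shown with a tiny sketch. Turning a trailing B frame into a P frame is only one simple way of closing a GOP, and I am assuming it here for illustration; encoders may restructure the end of the GOP differently.

```python
def close_gop(pattern):
    """Return a display-order GOP pattern that does not end in a B frame.

    A trailing B frame would need the next GOP's I frame as its future
    reference, so in this sketch it is simply coded as a P frame instead.
    """
    if pattern.endswith("B"):
        return pattern[:-1] + "P"
    return pattern

print(close_gop("IBBPBB"))  # open form ends in B frames
print(close_gop("IBBP"))    # already ends on a reference frame
```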
Using the ffprobe command from the FFmpeg tools with the -show_frames option, the following table was produced:
|Device|GOP structure|GOP length (frames)|
|---|---|---|
|Samsung Galaxy Tab 4|IP|30|
|GoPro HERO 3|IP|15|
|Canon 5D mark 2|IP|12|
Note that some cameras have quality settings that may affect the codec parameters and that while we believe the video samples are direct from camera, some may have been transcoded without our knowledge. Naturally, most cameras use a simple IP type GOP because they don't carry enough memory to store the out of order frames required when encoding B frames.
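A measurement like this can be reproduced along the following lines. This is a sketch of one possible approach, not the exact procedure used for the table: I am assuming ffprobe is installed, that `-show_entries frame=pict_type` with `-of csv=p=0` prints one picture type per line, and that counting from one I frame to the next is a good enough approximation of a GOP. The function names are my own.

```python
import subprocess

def parse_pict_types(ffprobe_csv):
    """Extract picture types (I, P, B) from ffprobe CSV output, one frame per line."""
    types = []
    for line in ffprobe_csv.splitlines():
        field = line.split(",")[0].strip()
        if field:
            types.append(field)
    return types

def split_gops(pict_types):
    """Group frame types into GOP strings, starting a new group at every I frame."""
    gops = []
    for frame_type in pict_types:
        if frame_type == "I" or not gops:
            gops.append(frame_type)
        else:
            gops[-1] += frame_type
    return gops

def probe_pict_types(path):
    """Run ffprobe on the first video stream of `path` (requires ffprobe on PATH)."""
    cmd = ["ffprobe", "-v", "error", "-select_streams", "v:0",
           "-show_frames", "-show_entries", "frame=pict_type",
           "-of", "csv=p=0", path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_pict_types(result.stdout)

# Hand-written sample standing in for real ffprobe output.
sample = "I\nP\nP\nI\nP\nP\nI\nP\n"
gops = split_gops(parse_pict_types(sample))
print(gops, [len(g) for g in gops])
```

For a real file, `split_gops(probe_pict_types("clip.mp4"))` would give the pattern and length of each GOP, where `clip.mp4` is a placeholder name. The last GOP is usually cut short by the end of the recording and should be ignored.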
Discussion and Last Words
The rate-distortion characteristics of modern codecs such as HEVC and H.264 mean that very good quality video can be obtained when using inter coded frames. If you want to achieve slightly more than modest compression, it's advisable to use inter coded frames.
- GOP means Group of Pictures, intra and inter coded frames usually in a repeating pattern.
- Long GOP means inter coded frames are used.
- (Long) GOPs are necessary for high compression via inter compression.
- Intra only coding is very high bit rate; use inter coding to reduce the bit rate but intra frames are required for navigation and error resilience.
- Proper use of GOP structure can result in better quality for higher compression.
If bit rate is not important or storage is unlimited, use intra frames. Forget that, use RAW - but you'll need very fast and large storage to make it easier to edit, copy and view.
The best type of codec depends on the end user, the type of content you are capturing, your editing computer power and your work flow.