Knowledge Base
jitterBufferDelay
The sum of the time, in seconds, each audio sample or video frame spends in the jitter buffer, from the time its first packet is received by the jitter buffer to the time it exits the buffer.
Description
Real number; in seconds
The purpose of the jitter buffer is to reassemble RTP packets into frames (in the case of video) and to provide smooth playout. The model described here assumes that the samples or frames are still compressed and have not yet been decoded.
It is the sum of the time, in seconds, each audio sample or video frame takes from the time its first packet is received by the jitter buffer (ingest timestamp) to the time it exits the jitter buffer (emit timestamp).
- In the case of audio, several samples belong to the same RTP packet, so they share the same ingest timestamp but have different jitter buffer emit timestamps
- In the case of video, a frame may be received over several RTP packets, so the ingest timestamp is when the earliest packet of the frame entered the jitter buffer and the emit timestamp is when the whole frame exits the jitter buffer
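The accumulation described above can be sketched as follows. This tracker is a hypothetical illustration of the bookkeeping, not a browser API; it assumes ingest and emit timestamps are given in seconds:

```javascript
// Hypothetical tracker illustrating how jitterBufferDelay accumulates.
// ingestTs: when the first packet of a sample/frame entered the buffer.
// emitTs: when the whole sample/frame exited the buffer (both in seconds).
class JitterBufferStats {
  constructor() {
    this.jitterBufferDelay = 0;        // total seconds spent in the buffer
    this.jitterBufferEmittedCount = 0; // samples/frames emitted so far
  }

  // Called once per emitted sample/frame, at the moment it leaves the buffer.
  onEmit(ingestTs, emitTs) {
    this.jitterBufferDelay += emitTs - ingestTs;
    this.jitterBufferEmittedCount += 1;
  }
}
```

For video, `ingestTs` would be the arrival time of the frame's earliest packet; for audio, every sample in one RTP packet would share the same `ingestTs` but get its own `onEmit` call.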
This metric increases as samples or frames exit the buffer, having completed their time in it (incrementing jitterBufferEmittedCount as they do). The average jitter buffer delay can be calculated by dividing jitterBufferDelay by jitterBufferEmittedCount.
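The averaging step can be sketched as below. The functions assume stats objects carrying the two fields named above, such as an inbound-rtp entry returned by `RTCPeerConnection.getStats()`; the interval variant, computed from two snapshots, is the common way to chart recent delay rather than the lifetime average:

```javascript
// Average delay per emitted sample/frame over the whole session, in seconds.
function averageJitterBufferDelay(stats) {
  if (!stats.jitterBufferEmittedCount) return 0; // nothing emitted yet
  return stats.jitterBufferDelay / stats.jitterBufferEmittedCount;
}

// Average delay over the interval between two stats snapshots, in seconds.
// Both counters are cumulative, so deltas isolate the interval.
function intervalAverageJitterBufferDelay(prev, curr) {
  const emitted = curr.jitterBufferEmittedCount - prev.jitterBufferEmittedCount;
  if (emitted <= 0) return 0; // nothing emitted during this interval
  return (curr.jitterBufferDelay - prev.jitterBufferDelay) / emitted;
}
```

In a real application the snapshots would come from polling `getStats()` on a timer and filtering for reports with `type === "inbound-rtp"`.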
See also
- inbound-rtp->jitter
- inbound-rtp->jitterBufferTargetDelay
- inbound-rtp->jitterBufferEmittedCount
- inbound-rtp->jitterBufferMinimumDelay
- WebRTC Statistics Specification

Notes
- An audio sample refers to having a sample in any channel of an audio track. If multiple audio channels are used, sample-based metrics do not increment at a higher rate: simultaneous samples in multiple channels count as a single sample
- Implementations strive to keep this delay as small as possible