Audio-to-video synchronization

Audio-to-video synchronization (also known as audio-video sync, audio/video sync, AV-sync, or lip sync; its absence is called lip sync error or lip flap) refers to the relative timing of the audio (sound) and video (image) parts of a program during creation, post-production (mixing), transmission, reception and playback processing. AV-sync can be an issue in television, videoconferencing, or film whenever sound and picture are linked by a timing-related cause and effect.

Lip sync errors are most commonly noticed by average viewers (i.e. persons not professionally involved in the broadcast television industry) during a close-up of the face of a performer such as a newscaster (known in the broadcast industry as a head shot). In home viewing of high-definition programs on a flat-panel TV, the sound most commonly leads the picture by a significant and often noticeable amount of time. This timing error (the lip sync error) can range from near zero up to several seconds. It typically varies slowly but significantly over the course of a television program, and frequently moves between noticeable and unnoticeable amounts several times per hour. In industry terminology, the lip sync error is expressed as the amount of time by which the audio departs from perfect synchronization with the video, where a positive number indicates that the audio leads the video and a negative number indicates that the audio lags the video.[1] This terminology and numeric convention are used in the professional broadcast industry, as evidenced by various professional papers,[2] standards such as ITU-R BT.1359-1, and the other references below.
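The sign convention can be illustrated with a minimal Python sketch. The function name and the idea of measuring when a matching sound and picture event each reach the viewer are assumptions made for illustration; they are not taken from any standard.

```python
def lip_sync_error_ms(audio_time_ms, video_time_ms):
    """Return the lip sync error in milliseconds.

    audio_time_ms: time at which the sound of an event is heard.
    video_time_ms: time at which the same event is seen.
    Positive result: audio leads video. Negative result: audio lags video.
    """
    return video_time_ms - audio_time_ms


# Sound heard at 1000 ms, picture seen at 1040 ms: audio leads by 40 ms.
lead = lip_sync_error_ms(1000, 1040)

# Sound heard at 1000 ms, picture seen at 900 ms: audio lags by 100 ms.
lag = lip_sync_error_ms(1000, 900)
```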

Digital or analog audio-video streams and video files usually contain some sort of explicit AV-sync timing, either in the form of interleaved video and audio data or by explicit relative time-stamping of the data. Processing must respect this relative timing, e.g. by stretching or interpolating between received data. If the processing does not respect the timing, the AV-sync error will increase whenever data is lost because of transmission errors or because of missing or mis-timed processing.
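As a rough illustration of why processing has to respect the timing data, the Python sketch below rebuilds an audio timeline from time-stamped packets and pads any lost packet with silence. The packet layout (a start index in samples plus a list of samples) and the function name are hypothetical. Real systems may interpolate or stretch audio rather than insert silence.

```python
def rebuild_audio_timeline(packets):
    """Place each packet at the position given by its timestamp.

    packets: list of (start_sample_index, samples), sorted by start index.
    Gaps left by lost packets are filled with silence (zeros) so that later
    audio keeps its original timing relative to the video.
    """
    timeline = []
    for start, samples in packets:
        gap = start - len(timeline)
        if gap > 0:
            timeline.extend([0] * gap)   # pad the hole left by lost data
        elif gap < 0:
            samples = samples[-gap:]     # drop samples that overlap earlier audio
        timeline.extend(samples)
    return timeline


# The packet that should have started at sample 2 was lost in transmission.
received = [(0, [1, 1]), (4, [2, 2])]
aligned = rebuild_audio_timeline(received)   # [1, 1, 0, 0, 2, 2]

# Ignoring the timestamps and simply concatenating the payloads would play
# the second packet two samples early, adding a permanent sync error.
naive = [s for _, samples in received for s in samples]   # [1, 1, 2, 2]
```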

Incorrectly synchronized

There are different ways in which the AV-sync can get incorrectly synchronized:

Examples of transmission (broadcasting), reception and playback that can get the AV-sync incorrectly synchronized:

Effect of no explicit AV-sync timing

When a digital or analog audio video stream does not have some sort of explicit AV-sync timing these effects will cause the stream to become out of sync:

Viewer experience of incorrectly synchronized AV-sync

An AV-sync error typically leaves a filmed or televised character moving his or her mouth when there is no spoken dialog to accompany it, hence the terms "lip flap" and "lip-sync error". The error can annoy the viewer, and may reduce the viewer's enjoyment of the program, decrease the program's effectiveness, or lead to a negative perception of the speaker.[6] The potential loss of effectiveness is of particular concern for product commercials and political candidates. Television industry standards organizations, such as the Advanced Television Systems Committee, have become involved in setting standards for audio-video sync errors.[4]

Because of these annoyances, AV-sync error is a concern to the television programming industry, including television stations, networks, advertisers and program production companies. Unfortunately, the advent of high-definition flat-panel display technologies (LCD, DLP and plasma), which can delay video more than audio, has moved the problem into the viewer's home and beyond control of the television programming industry alone. Consumer product companies now offer audio-delay adjustments to compensate for video-delay changes in TVs and A/V receivers, and several companies manufacture dedicated digital audio delays made exclusively for lip-sync error correction.
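The audio-delay adjustment mentioned above can be sketched as a simple delay line. This is only an illustration in Python: the class name, the default 48 kHz sample rate, and the per-sample processing are assumptions, not a description of any particular product.

```python
from collections import deque


class AudioDelay:
    """Delay an audio stream by a fixed time so that it lines up with a
    display that delays the picture."""

    def __init__(self, delay_ms, sample_rate_hz=48000):
        delay_samples = round(delay_ms * sample_rate_hz / 1000)
        # Pre-fill with silence: the first outputs are zeros until the
        # delayed input reaches the end of the line.
        self._line = deque([0.0] * delay_samples)

    def process(self, samples):
        """Push samples in and return the same number of delayed samples."""
        out = []
        for sample in samples:
            self._line.append(sample)
            out.append(self._line.popleft())
        return out


# Compensate for a display that delays the picture by 80 ms.
delay = AudioDelay(delay_ms=80)
silence = delay.process([1.0] * 3840)   # 80 ms at 48 kHz: all silence so far
```

A user-adjustable delay would rebuild the delay line when the setting changes; that step is omitted here.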

Recommendations

For television applications, the Advanced Television Systems Committee recommends that audio should lead video by no more than 15 milliseconds and should lag video by no more than 45 milliseconds.[4] However, the ITU performed strictly controlled tests with expert viewers and found that the threshold for detectability is -125 ms to +45 ms, i.e. audio lagging by up to 125 ms or leading by up to 45 ms.[1] For film, acceptable lip sync is considered to be no more than 22 milliseconds in either direction.[5][7]
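The limits quoted above can be collected into a small Python check. The constant and function names are made up for this sketch. The values are the ones given in the paragraph, using the convention that a positive error means the audio leads the video.

```python
# (maximum lag, maximum lead) in milliseconds; negative = audio lags video.
ATSC_LIMITS_MS = (-45.0, 15.0)      # ATSC recommendation for television
ITU_DETECT_MS = (-125.0, 45.0)      # ITU detectability thresholds
FILM_LIMITS_MS = (-22.0, 22.0)      # commonly quoted limit for film


def within_limits(error_ms, limits):
    """Return True if the lip sync error lies inside the given range."""
    max_lag, max_lead = limits
    return max_lag <= error_ms <= max_lead


# Audio leading by 30 ms exceeds the ATSC recommendation but is still
# below the ITU detectability threshold.
exceeds_atsc = not within_limits(30.0, ATSC_LIMITS_MS)
below_itu = within_limits(30.0, ITU_DETECT_MS)
```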

The Consumer Electronics Association has published a set of recommendations for how digital television receivers should implement A/V sync.[8]

SMPTE ST2064

SMPTE standard ST2064, published in 2015,[9] provides technology to reduce or eliminate lip-sync errors in digital television. The standard utilizes audio and video fingerprints taken from a television program. The fingerprints can be recovered and used to correct the accumulated lip-sync error. When fingerprints have been generated for a TV program, and the required technology is incorporated, the viewer's display device has the ability to continuously measure and correct lip-sync errors.[10][11]
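The general idea of measuring delay from fingerprints can be sketched in Python as follows. This is not the ST 2064 algorithm: the fingerprint values, the matching by mean absolute difference, and all names are assumptions chosen to keep the example short. The sketch finds how far a received fingerprint sequence is shifted relative to a reference sequence, then combines the video and audio delays into a lip-sync error.

```python
def best_lag(reference, received, max_lag):
    """Return the lag (in fingerprint samples) at which received[i + lag]
    best matches reference[i], searching -max_lag..max_lag."""
    best, best_cost = 0, float("inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(reference[i], received[i + lag])
                 for i in range(len(reference))
                 if 0 <= i + lag < len(received)]
        if not pairs:
            continue
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best, best_cost = lag, cost
    return best


def lip_sync_error_ms(video_delay_ms, audio_delay_ms):
    """Positive result: audio leads video (video was delayed more)."""
    return video_delay_ms - audio_delay_ms


# Hypothetical per-frame video fingerprints taken at the source ...
reference = [0, 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
# ... and the same program measured downstream, arriving two frames late.
received = [7, 7] + reference

video_lag_frames = best_lag(reference, received, max_lag=3)
video_delay_ms = video_lag_frames * 1000 / 25      # assuming 25 frames/s
error_ms = lip_sync_error_ms(video_delay_ms, audio_delay_ms=0.0)
```

In this example the audio delay is simply assumed to be zero; in practice it would be estimated the same way from audio fingerprints.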

Timestamps

Presentation time stamps (PTS) are embedded in MPEG transport streams to precisely signal when each audio and video segment is to be presented, to avoid AV-sync errors. However, these timestamps are often added after the video undergoes frame synchronization, format conversion and preprocessing.[12][13][14][15]
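A short Python sketch of PTS arithmetic follows. It relies on two properties of MPEG-2 systems: PTS values count ticks of a 90 kHz clock, and they are 33 bits wide, so they wrap around. The function names are made up for this example, and comparing an audio PTS with a video PTS for the same moment is a simplification of what a real decoder does.

```python
PTS_CLOCK_HZ = 90_000        # PTS values count ticks of a 90 kHz clock
PTS_MODULUS = 1 << 33        # PTS fields are 33 bits wide and wrap around


def pts_diff(a, b):
    """Return a - b in ticks, treating the values as 33-bit wrapping counters."""
    diff = (a - b) % PTS_MODULUS
    if diff >= PTS_MODULUS // 2:
        diff -= PTS_MODULUS
    return diff


def ticks_to_ms(ticks):
    """Convert 90 kHz ticks to milliseconds."""
    return ticks * 1000 / PTS_CLOCK_HZ


# How far ahead of the current video frame is the audio being played?
video_pts = 900_000
audio_pts = 903_600
offset_ms = ticks_to_ms(pts_diff(audio_pts, video_pts))   # 40.0 ms

# The difference is still correct across a wrap of the 33-bit counter.
wrapped = pts_diff(100, PTS_MODULUS - 800)                # 900 ticks
```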

The Real-time Transport Protocol clocks media using origination timestamps on an arbitrary timeline. A real-time clock, such as one delivered by the Network Time Protocol and described in the Session Description Protocol[16] associated with the media, may be used to syntonize media. A server may then be used for final synchronization to remove any residual offset.[17]
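Mapping each stream's RTP timestamps onto a shared clock can be sketched in Python as below. The sketch assumes that, for each stream, a receiver knows one reference pair of an RTP timestamp and the wall-clock time it corresponds to (for example from an RTCP sender report, as defined in RFC 3550), plus the stream's RTP clock rate. The function name and the sample values are hypothetical.

```python
RTP_MODULUS = 1 << 32        # RTP timestamps are 32-bit wrapping counters


def rtp_to_wallclock(rtp_ts, ref_rtp_ts, ref_wallclock_s, clock_rate_hz):
    """Map an RTP timestamp to wall-clock seconds using a reference pair
    (ref_rtp_ts, ref_wallclock_s) for the same stream."""
    delta = (rtp_ts - ref_rtp_ts) % RTP_MODULUS
    if delta >= RTP_MODULUS // 2:
        delta -= RTP_MODULUS          # timestamp lies before the reference
    return ref_wallclock_s + delta / clock_rate_hz


# Video on a 90 kHz RTP clock and audio on a 48 kHz RTP clock each start
# from an arbitrary timestamp, but both references map to the same instant.
video_time = rtp_to_wallclock(91_000, ref_rtp_ts=1_000,
                              ref_wallclock_s=100.0, clock_rate_hz=90_000)
audio_time = rtp_to_wallclock(53_000, ref_rtp_ts=5_000,
                              ref_wallclock_s=100.0, clock_rate_hz=48_000)

# Once both are on the shared timeline, the media can be compared directly.
skew_s = video_time - audio_time
```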

References

  1. "ITU-R BT.1359-1, Relative Timing of Sound and Vision for Broadcasting" (PDF). ITU. 1998. Retrieved 30 May 2015.
  2. Patrick Waddell; Graham Jones; Adam Goldberg. "Audio/Video Standards and Solutions A Status Report" (PDF). ATSC. Retrieved 4 April 2012.
  3. RFC 3550
  4. IS-191: Relative Timing of Sound and Vision for Broadcast Operations, ATSC, 2003-06-26, archived from the original on 2011-07-27
  5. "The relative timing of the sound and vision components of a television signal" (PDF).
  6. Byron Reeves; David Voelker (October 1993). "Effects of Audio-Video Asynchrony on Viewer's Memory, Evaluation of Content and Detection Ability" (PDF). Archived (PDF) from the original on 2 October 2008. Retrieved 2008-10-19.
  7. Sara Kudrle; et al. (July 2011). "Fingerprinting for Solving A/V Synchronization Issues within Broadcast Environments". Motion Imaging Journal. SMPTE. Appropriate A/V sync limits have been established and the range that is considered acceptable for film is +/- 22 ms. The range for video, according to the ATSC, is up to 15 ms lead time and about 45 ms lag time
  8. Consumer Electronics Association. "CEA-CEB20 R-2013: A/V Synchronization Processing Recommended Practice". Archived from the original on 2015-05-30.
  9. ST 2064:2015 - SMPTE Standard - Audio to Video Synchronization Measurement, SMPTE, 2015
  10. SMPTE Standards Update: The Lip-Sync Challenge, SMPTE, 10 December 2013
  11. SMPTE Standards Update: The Lip-Sync Challenge (PDF), SMPTE, 10 December 2013
  12. MPEG-2 Systems FAQ: 19. Where are the PTSs and DTSs inserted?
  13. MPlayer-G2-dev: mpeg container's timing (PTS values)
  14. birds-eye.net: DTS - Decode Time Stamp
  15. svcd2dvd.com: Perfect AV Sync: Preparation is key...
  16. RFC 7273
  17. RFC 7272

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.