Video Rewrite: Driving Visual Speech with Audio

Christoph Bregler, Michele Covell, Malcolm Slaney
Interval Research Corporation, 1801-C Page Mill Road, Palo Alto, CA 94304


Abstract

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to a new soundtrack.

Video Rewrite automatically labels the phonemes in the training data and in the new audio track. Video Rewrite reorders the mouth images in the training footage to match the phoneme sequence of the new audio track. When particular phonemes are unavailable in the training footage, Video Rewrite selects the closest approximations. The resulting sequence of mouth images is stitched into the background footage. This stitching process automatically corrects for differences in head position and orientation between the mouth images and the background footage.
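
As a rough illustration of this selection step, the sketch below picks, for each three-phoneme window of the new audio track, the closest triphone available in a training database. The database layout, the viseme grouping, and the cost function are our own illustrative assumptions, not necessarily the metric used in the paper.

    # Illustrative sketch of phoneme-driven clip selection (assumed data layout).
    VISEME_CLASS = {            # assumed grouping of phonemes with similar lip shapes
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "iy": "spread", "eh": "mid", "aa": "open", "ow": "round", "uw": "round",
    }

    def triphones(phones):
        """Slide a three-phoneme window over the new track's phoneme labels."""
        padded = ["sil"] + list(phones) + ["sil"]
        return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

    def phone_cost(a, b):
        """Zero for an exact match, small within a viseme class, large otherwise."""
        if a == b:
            return 0.0
        if VISEME_CLASS.get(a) == VISEME_CLASS.get(b):
            return 0.25
        return 1.0

    def select_mouth_clips(new_phones, database):
        """database maps training triphones to mouth-image clips; pick the
        closest available triphone for each window of the new phoneme string."""
        chosen = []
        for target in triphones(new_phones):
            best = min(database, key=lambda cand: sum(phone_cost(t, c)
                                                      for t, c in zip(target, cand)))
            chosen.append(database[best])
        return chosen

For example, when no /m/ triphone exists in the database, this fallback prefers a /p/ or /b/ triphone, because those phonemes share the same (assumed) bilabial lip shape.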

Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage.
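
The final sequence is assembled from many short mouth-image segments. As a simplified illustration of how neighboring segments can be combined, the sketch below cross-fades their overlapping frames; the fixed overlap length and linear weights are assumptions, and a full morph would also warp along the tracked lip points rather than only blending pixels.

    import numpy as np

    def crossfade_segments(seg_a, seg_b, overlap=4):
        """seg_a, seg_b: lists of spatially aligned mouth images (H x W x 3 arrays).
        Returns one sequence with a linear cross-fade over the overlapping frames."""
        out = list(seg_a[:-overlap])
        for i in range(overlap):
            alpha = (i + 1) / (overlap + 1)     # blend weight ramps from seg_a to seg_b
            blend = (1 - alpha) * seg_a[-overlap + i] + alpha * seg_b[i]
            out.append(blend.astype(seg_a[0].dtype))
        out.extend(seg_b[overlap:])
        return out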

Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.

Examples of our Animation Results

Video Rewrite automatically generated all the movies on this page. These animations are available as QuickTime movies, compressed with the Cinepak video codec for best playback.

Introducing Video Rewrite

This is Video Rewrite. (471k)
My face is Ellen's. (391k)
My voice is Michele's. (392k)
Ellen did not say these words. (418k)
Michele did. (386k)
The lip sync is synthetic. (555k)
We'll explain how we did this. (780k)

Re-writing the past

Video Rewrite gives lip-synced videos. (569k)
I never met Forrest Gump. (354k)
I did not inhale. (292k)
Read my lips! (329k)

Reading fairy tales

Say to him that we wish to have a cottage. (507k)
All was furnished in the best of everything. (481k)
Instead, it was quite green and yellow. (494k)
There was a teenie hall and a beautiful sitting room. (601k)


Description of data accompanying our SIGGRAPH 97 paper:

Copies of our paper are available in PostScript (5.6m) and Adobe PDF (179k).

This directory also contains several animations that demonstrate the quality of the reconstructions from the Video Rewrite process. All of the results here are synthesized from a video database of the subject speaking and from new audio. Video Rewrite automatically rearranges the mouth and chin images and stitches them into a background video. The resulting video shows the subject mouthing words that she never said.

We did most of our development with one subject, Ellen, using eight minutes of video. An example of Ellen's training data is e_train.mov (771k). We also worked with one minute of public-domain footage of John F. Kennedy talking about the Cuban missile crisis. An example of the JFK footage is jfk_train.mov (276k).

Normally, Video Rewrite explicitly aligns the lip images with each other and with the face in the background video. To illustrate why this is necessary, we include a movie without lip registration. The movie e_noreg.mov (518k) shows the lip images correctly ordered to match a new audio track but without spatial registration.

Video Rewrite uses a global (affine) transform to register the lips with the background. This process is demonstrated in the movie e_affine.mov (4.1m), which shows the coordinate frames of the lip images mapped onto the moving background sequence. Video Rewrite warps the lip images to the poses in the background sequence. We show the full frame of each lip image, superimposed on the background sequence using the affine transform.
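
A minimal sketch of this registration step follows, assuming we already have a few tracked mouth and jaw points in both the lip image and the background frame. The point arrays and the OpenCV warp call are illustrative stand-ins; the actual system's tracking and compositing involve more than this.

    import numpy as np
    import cv2   # used here only for the final image warp

    def fit_affine(src_pts, dst_pts):
        """Least-squares 2x3 affine transform mapping src_pts onto dst_pts (N x 2 arrays)."""
        n = len(src_pts)
        A = np.hstack([np.asarray(src_pts, float), np.ones((n, 1))])   # rows are [x, y, 1]
        M, _, _, _ = np.linalg.lstsq(A, np.asarray(dst_pts, float), rcond=None)
        return M.T                                                     # 2 x 3 matrix

    def paste_lips(lip_img, bg_frame, lip_pts, bg_pts):
        """Warp the lip image into the pose of the background frame."""
        M = fit_affine(lip_pts, bg_pts)
        h, w = bg_frame.shape[:2]
        return cv2.warpAffine(lip_img, M, (w, h))

Blending the warped lip region back into the background frame (e.g., with a soft mask around the mouth and chin) would complete the stitching step described above.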

Finally, we examine how our performance degrades with smaller databases. To do this, we cut the training data for Ellen's reanimation to one-eighth its original size (1 minute of data). Two movies show our animation results from two disjoint 1-minute video databases. The two movies, e_parta.mov (486k) and e_partb.mov (486k), show different articulation artifacts because the two reduced video databases are missing different example sequences.