Copyright 1997. Interval Research Corporation.
The audio samples provided here were created as described in our technical report, IRC-TR 1997-061, and as summarized in Covell, Withgott, Slaney, "Mach1: Nonuniform Time-Scale Modification of Speech," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, May 12-15, 1998.
We propose a new approach to nonuniform time compression, called Mach1, designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This technical report describes the Mach1 technique and our listener-test results. Audio examples are given below.
The audio examples here were compressed by Mach1 and by a linear technique, both to the same overall compression rate. Mach1 compression was performed "open-loop" (that is, without a feedback loop to enforce a specific global compression rate). The compression rate achieved by Mach1 was measured after compression, and the same utterance was then recompressed using linear compression to that same overall rate. A more detailed description of our method is provided in Section 3 of our technical report. The examples given here are sorted by discourse type (short dialog, long dialog, or monolog) and by compression rate. The final speaking rate, in words per minute (wpm), is also given.
Linear compression | Mach1 compression | Compression rate (x faster than real time) | Speaking rate (wpm) |
Short dialogs | |||
LC_S18_2 | M1_S18_2 | 3.97 | 481 |
LC_S09_3 | M1_S09_3 | 3.95 | 490 |
LC_S21_1 | M1_S21_1 | 3.66 | 521 |
LC_S04_1 | M1_S04_1 | 3.59 | 495 |
LC_S19_2 | M1_S19_2 | 3.48 | 572 |
LC_S09_1 | M1_S09_1 | 3.40 | 450 |
LC_S22_1 | M1_S22_1 | 3.35 | 472 |
LC_S10_1 | M1_S10_1 | 2.96 | 546 |
Long dialogs | |||
LC_L21 | M1_L21 | 2.94 | 591 |
LC_L29 | M1_L29 | 2.87 | 545 |
LC_L09 | M1_L09 | 2.73 | 566 |
LC_L37 | M1_L37 | 2.65 | 572 |
LC_L05 | M1_L05 | 2.61 | 551 |
Monologs | |||
LC_M09 | M1_M09 | 2.86 | 544 |
LC_M05 | M1_M05 | 2.80 | 430 |
LC_M25 | M1_M25 | 2.77 | 464 |
LC_M13 | M1_M13 | 2.56 | 391 |
The examples provided in this table are based on audio from the compact disks in the Kaplan TOEFL review materials. See: M. Rymniak, G. Kurlandski, et al., 1997. The Essential Review: TOEFL (Test of English as a Foreign Language), Kaplan Educational Centers and Simon & Schuster, New York. We thank Kaplan Educational Centers and Simon & Schuster for providing us with permission to use these excerpts in this manner.
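The rate-matching procedure used for these examples (compress with Mach1 open-loop, measure the achieved rate, then linearly compress the same utterance to that rate) reduces to simple arithmetic. The Python sketch below illustrates that arithmetic only. The function names, durations, and word count are hypothetical values chosen for illustration; they are not taken from the technical report or from any sample in the table, and the sketch does not implement Mach1 itself.

```python
def compression_rate(original_seconds: float, compressed_seconds: float) -> float:
    """Overall compression rate, in multiples of real time."""
    if original_seconds <= 0 or compressed_seconds <= 0:
        raise ValueError("durations must be positive")
    return original_seconds / compressed_seconds


def linear_target_seconds(original_seconds: float, rate: float) -> float:
    """Duration a linear compressor must produce to match a given overall rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return original_seconds / rate


def speaking_rate_wpm(word_count: int, duration_seconds: float) -> float:
    """Speaking rate in words per minute for a clip of the given duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return word_count * 60.0 / duration_seconds


# Hypothetical utterance: 80 words spoken in 30 s, which an open-loop
# nonuniform compressor happened to shorten to 10 s.
original_seconds = 30.0
mach1_seconds = 10.0
word_count = 80

# Step 1: measure the rate the open-loop pass actually achieved.
rate = compression_rate(original_seconds, mach1_seconds)

# Step 2: ask the linear compressor for the same overall rate.
linear_seconds = linear_target_seconds(original_seconds, rate)

# Step 3: both versions now share the same final speaking rate.
final_wpm = speaking_rate_wpm(word_count, linear_seconds)

print(f"rate = {rate:.2f}x, linear duration = {linear_seconds:.1f} s, "
      f"final rate = {final_wpm:.0f} wpm")
```

Because the linear pass is driven by the measured rate rather than by a preset target, the two versions of each utterance have equal duration, so any difference in comprehension cannot be attributed to a difference in overall speed.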
Voice mail makes it easy and attractive to leave impromptu messages. In contrast, listening to voice mail messages is often painful. While we can time compress the messages, current techniques typically are viable only up to 2 times real time. More specifically, human comprehension of linearly time-compressed speech typically degrades at compression rates around 2.0 to 2.5 times real time. These degradations are not due to the speech rate per se: Comprehension of linearly compressed speech often breaks down above 225 to 270 wpm, which is well below the rates at which long passages of natural speech are comprehensible.
Instead, the incomprehensibility of linearly time-compressed speech is due to its unnatural timing. Our new nonuniform time-compression technique, called Mach1, compresses the components of an utterance so that the result closely resembles the natural timing of fast speech. The resulting compressed speech remains comprehensible at much higher rates: 2.56 to 4.15 times real time, or 390 to 673 wpm.
Mach1 offers statistically significant improvements in comprehensibility over linear time compression: At compression rates between 2.5 and 4.2 times real time, comprehension of Mach1-compressed speech is 17 percentage points better than that of linearly compressed speech. This difference in comprehension increased with increasing compression rate. Short dialogs showed the greatest improvement in comprehension: These improvements averaged 23 percentage points and ranged as high as 31 percentage points for naive listeners. The comprehension improvements were smaller for the longer clips: 10 percentage points for monologs and 5 percentage points for long dialogs.
This research is the first to maintain comprehensibility with time-scale modification at such high compression rates. It is also the first report of statistically significant improvements in comprehensibility over linear time compression.
Copies of our technical report are available in HTML, PostScript (583k), and Adobe PDF (117k).