Institute for Response-Genetics (e.V.), University of Zurich

Representing Speech Characteristics

Human speech is greatly influenced by the affective state of the speaker, such as sadness, happiness, fear, anger, aggression, lack of energy, or drowsiness. An attentive listener therefore learns a great deal about the affective state of a conversation partner with little effort, and without the subject having to be raised explicitly. For this reason, psychiatrists routinely monitor the speaking behavior and voice sound characteristics of their patients for diagnostic purposes and as sensitive indicators of clinical change.

Speaking Behavior and Voice Sound Characteristics

Speech characteristics can be roughly described by a few major features: speech flow, loudness, intonation, and intensity of overtones. Speech flow describes the speed at which utterances are produced as well as the number and duration of temporary breaks in speaking. Loudness reflects the amount of energy associated with the articulation of utterances and, when regarded as a time-varying quantity, the speaker's dynamic expressiveness. Intonation is the manner of producing utterances with respect to rise and fall in pitch, and produces tonal shifts above and below the speaker's mean vocal pitch. Overtones are the higher tones that faintly accompany a fundamental tone and are thus responsible for the tonal diversity of sounds.

Analysis of the Nonverbal Content of Human Speech

First, the individual speech recordings are screened for intervals without signal. These intervals are then used to determine the thresholds for background noise, allowing for a certain "guard" zone. Based on these thresholds, the time series are subdivided into pauses and utterances ("segmentation"), with pauses of less than 250 ms duration being skipped. In a second step, spectra are calculated over 1-second epochs by means of a Discrete Fourier Transform (DFT). Only "pure" utterances enter the spectral analyses, the pauses having been eliminated beforehand. Finally, we approximate the shape of the F0 distribution curve ("F0" designates the mean vocal pitch of a speaker) by a 2nd-degree polynomial and use the distance between the symmetrical -6 dB points as a measure of the "F0-variability" (intonation). The height/width ratio of the 2nd-degree polynomial serves as a measure of the "F0-narrowness" (monotony). The frequency resolution of the DFTs is a quarter tone over 7 octaves (55-7040 Hz).
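
The steps described above can be illustrated with a minimal sketch in Python. It is not the Institute's implementation, and every function name and parameter in it is hypothetical. It makes three assumptions that the text does not confirm: the noise threshold is the peak amplitude of a silent interval multiplied by a guard factor, "quarter tone" means 24 equal steps per octave, and the F0 distribution curve is expressed in dB, so that the -6 dB points of a fitted parabola y = a*x^2 + b*x + c lie at a distance of 2*sqrt(6/|a|) from each other.

```python
import math


def noise_threshold(silence, guard=2.0):
    """Assumed form of the threshold: peak noise amplitude times a guard factor."""
    return guard * max(abs(x) for x in silence)


def segment(samples, rate, threshold, min_pause_s=0.25):
    """Return utterances as (start, end) sample index pairs, end exclusive.

    A sample is silent when its magnitude does not exceed the threshold.
    Silent runs shorter than min_pause_s are skipped, i.e. they stay
    inside the surrounding utterance.
    """
    min_pause = int(round(min_pause_s * rate))
    utterances = []
    start = None      # start index of the utterance currently open
    silent_run = 0    # length of the current run of silent samples
    for i, x in enumerate(samples):
        if abs(x) > threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                # The pause is long enough: close the utterance where it began.
                utterances.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        utterances.append((start, len(samples) - silent_run))
    return utterances


def quartertone_grid(f_low=55.0, octaves=7, steps_per_octave=24):
    """Frequencies spaced a quarter tone apart, from f_low up to f_low * 2**octaves."""
    count = octaves * steps_per_octave + 1
    return [f_low * 2.0 ** (k / steps_per_octave) for k in range(count)]


def width_at_minus_6db(a):
    """Distance between the symmetrical -6 dB points of y = a*x**2 + b*x + c.

    Assumes y is in dB and a < 0, so the parabola has a maximum.
    """
    if a >= 0:
        raise ValueError("the parabola must open downwards (a < 0)")
    return 2.0 * math.sqrt(6.0 / -a)


# Synthetic signal at 1000 samples per second: a 100 ms pause is skipped,
# a 300 ms pause separates two utterances.
signal = [1.0] * 100 + [0.0] * 100 + [1.0] * 100 + [0.0] * 300 + [1.0] * 100
threshold = noise_threshold([0.01, -0.02, 0.015])
utterances = segment(signal, rate=1000, threshold=threshold)
grid = quartertone_grid()
```

With these assumptions the grid runs from 55 Hz to 7040 Hz, which matches the 7 octaves stated above, since 55 * 2^7 = 7040.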

OPTIMI: Early Detection & Prevention

Institute for Response-Genetics, University of Zurich

Head: Prof. Dr. Hans H. Stassen

Partners:
Everis, Spain
ETH, Switzerland
UZH, Switzerland
Freiburg, Germany
MA Systems, UK
Bristol, UK
Xiwrite, Italy
Ultrasis, UK
Jaume, Spain
Valencia, Spain
Lanzhou, China

EU-Grant (FP7):
248544
