OPTIMI: Early Detection and Prevention of Depression

Institute for Response-Genetics, Department of Psychiatry (KPPP)

Psychiatric Hospital, University of Zurich


Everis, Spain
ETH, Switzerland
UZH, Switzerland
Freiburg, Germany
MA Systems, UK
Bristol, UK
Xiwrite, Italy
Ultrasis, UK
Jaume, Spain
Valencia, Spain
Lanzhou, China


EU-Grant (FP7):

Representing Speech Characteristics

Affective State and Voice

Human speech is greatly influenced by the affective state of the speaker, such as sadness, happiness, fear, anger, aggression, lack of energy, or drowsiness. An attentive listener therefore learns a lot about the affective state of a conversation partner with little effort, and without the partner having to mention it explicitly. Consequently, psychiatrists routinely monitor the speaking behaviour and voice sound characteristics of their patients for diagnostic purposes and as sensitive indicators of clinical change.

Speaking Behavior and Voice Sound Characteristics

Speech characteristics can be roughly described by a few major features: speech flow, loudness, intonation and intensity of overtones. Speech flow describes the speed at which utterances are produced as well as the number and duration of temporary breaks in speaking. Loudness reflects the amount of energy associated with the articulation of utterances and, when regarded as a time-varying quantity, the speaker's dynamic expressiveness. Intonation is the manner of producing utterances with respect to rise and fall in pitch, and leads to tonal shifts in either direction of the speaker's mean vocal pitch. Overtones are the higher tones which faintly accompany a fundamental tone, thus being responsible for the tonal diversity of sounds.
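As an illustration of loudness regarded as a time-varying quantity, the following minimal sketch computes a short-time level curve from raw audio samples. It is not the project's implementation: the function name `frame_levels_db`, the 10 ms frame length, the RMS-in-decibel measure, and the -120 dB floor are all assumptions made for this example.

```python
import math

def frame_levels_db(samples, sample_rate, frame_ms=10.0, floor_db=-120.0):
    """Return the RMS level of each non-overlapping frame in dB re full scale (1.0).

    Assumption: samples are floats in the range -1.0..1.0.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        if mean_square > 0.0:
            # 10*log10 of the mean square equals 20*log10 of the RMS value.
            levels.append(max(floor_db, 10.0 * math.log10(mean_square)))
        else:
            levels.append(floor_db)  # digital silence: clamp to the floor
    return levels

# Usage: one second of a full-scale 1 kHz tone followed by one second of silence.
rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / rate) for n in range(rate)]
levels = frame_levels_db(tone + [0.0] * rate, rate)
```

The spread of such a level curve over time could serve as a crude proxy for dynamic expressiveness, although the source does not state which measure the project actually uses.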

Analysis of the Nonverbal Content of Human Speech

First, the individual speech recordings are screened for intervals without signal. These intervals are used to determine the thresholds for background noise, allowing for a certain "guard" zone. Based on these thresholds, the time series are subdivided into pauses and utterances ("segmentation"), with pauses shorter than 250 ms being skipped. In a second step, spectra are calculated for 1-second epochs by means of a discrete Fourier transform (DFT); the spectral analyses use "pure" utterances, from which the pauses have been eliminated. Finally, we approximate the shape of the F0 distribution curve ("F0" designates the mean vocal pitch of a speaker) by a 2nd-degree polynomial and use the distance between the symmetrical -6 dB points as a measure of the "F0-variability" (intonation). The ratio height/width of the 2nd-degree polynomial serves as a measure of the "F0-narrowness" (monotony). The frequency resolution of the DFTs is a quartertone over 7 octaves (55-7040 Hz).
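The steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's code: all function names are hypothetical, the 6 dB guard zone and 10 ms frames are made-up values, "skipped" pauses are read as being bridged into the surrounding utterance, and the F0 distribution is assumed to be given in dB over quartertone steps.

```python
import math

def quartertone_grid(f_low=55.0, octaves=7, steps_per_octave=24):
    """Frequencies a quartertone apart (24 per octave), e.g. 55 Hz to 7040 Hz."""
    n = octaves * steps_per_octave
    return [f_low * 2.0 ** (k / steps_per_octave) for k in range(n + 1)]

def noise_threshold(noise_levels_db, guard_db=6.0):
    """Threshold = loudest background-noise frame plus an assumed guard zone."""
    return max(noise_levels_db) + guard_db

def segment(levels_db, threshold_db, frame_ms=10.0, min_pause_ms=250.0):
    """Return utterances as (start, end) frame indices, end exclusive.

    Pauses shorter than min_pause_ms do not split an utterance.
    """
    min_pause_frames = math.ceil(min_pause_ms / frame_ms)
    utterances, start, silent_run = [], None, 0
    for i, level in enumerate(levels_db):
        if level > threshold_db:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the utterance where it began.
                utterances.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        utterances.append((start, len(levels_db) - silent_run))
    return utterances

def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]  # s[p] = sum of x**p
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = _det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule: replace one column by the rhs
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(_det3(mc) / d)
    return tuple(coeffs)

def width_at_drop(a, drop_db=6.0):
    """Distance between the two points lying drop_db below the parabola's peak."""
    if a >= 0.0:
        raise ValueError("parabola must open downwards")
    return 2.0 * math.sqrt(drop_db / -a)

# Usage with synthetic data: a distribution peaking at quartertone step 10.
steps = list(range(21))
dist_db = [40.0 - 0.5 * (x - 10) ** 2 for x in steps]
a, b, c = fit_parabola(steps, dist_db)
f0_variability = width_at_drop(a)  # width in quartertone steps
```

For a parabola the -6 dB width depends only on the curvature coefficient `a`, which is why `width_at_drop` has a closed form. The height/width ratio for "F0-narrowness" is left out because the source does not specify how the height is measured.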

Figure (formant analysis): Voice sound characteristics ("timbre") of a male speaker as quantified through spectral analyses. Spectral intensities are plotted along the y-axis on log-proportional scales as a function of frequency (x-axis: 7 octaves covering the frequency range of 64-8192 Hz).
Mean vocal pitch in females lies 1 octave above that of male speakers.
Depression significantly reduces the dynamic expressiveness of human voices, thus greatly reducing inter-individual differences. As a direct consequence, the patients' voices become more similar to each other ("depressive voice"). Voices regain their distinct individuality during recovery.