Mobile Sensing IX: Prosodic Sensing
Spring 2015, Petteri Nurmi, 21.4.2015

Learning Objectives
- Understand the basics of voice prosody and why it is meaningful input for mobile sensing applications
- Gain a basic understanding of the source-filter theory of sound production
- Voiced and unvoiced speech, and how to detect them
- Fundamental frequency: what is it, and how can it be extracted from speech?
- What is cepstral analysis, and why is it important?
- What is the spectral envelope, and how can it be extracted?
- Which other prosodic features are of interest?

Prosodic Sensing
- Recall that prosody characterizes the way a person speaks
  - So-called paralinguistic cues
- Prosodic sensing deals with extracting prosody information from (audio) measurements
- Widely studied in speech signal processing, but most work has assumed wearable or infrastructure sensors that are close to the audio source
- In mobile contexts, microphones are often further away, possibly obstructed, and sensitive to noise and to differences in frequency response
  - Noise and distance from the microphone are particularly problematic for aperiodic portions of speech

Why does prosody matter?
- Personality
  - Extroversion and introversion correlate with variations in speech rate and pitch contour
- Emotion
  - Intraperson variations in pitch contour reflect changes in emotional state
- Fluency
  - Extent of higher-order harmonics characterizes the fluency of non-native speakers
- Linguistic markers
  - Sociolinguistic classes (working class, blue collar), pragmatic linguistics (irony, joy, sarcasm)

Prosody and Personality
- Personality refers to characteristics that determine how humans think, feel, and act in situations
- Dominant personality theories characterize personality in terms of traits
  - Big-5: Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness
- Prosodic features correlate with personality traits to varying degrees
  - Extraversion has the highest correlation (changes in voice intensity, loudness, speech rate)
  - Other traits are much more difficult to identify from speech

Prosody and Emotion
- Emotional prosody relates changes in an individual's prosodic patterns to different emotions
  - This contrasts with extroversion, which examines differences between people
- Anger, joy / happiness, and sadness in particular can be identified from prosodic changes
  - Pitch variations, intensity, energy contour, speech rate
- Several application areas
  - Automotive scenarios
  - Affective speech agents
  - Speech production

Speech Production
- Speech production refers to the process by which spoken words are produced
- Normal speech is generated through pulmonary pressure provided by the lungs
- Sound is generated through phonation in the glottis in the larynx
- The vocal tract modifies the signal, forming vowels and consonants
- Tongue, lips, and palate are used in combination with the vocal tract to shape the signal
Source: http://en.wikipedia.org/wiki/speech_production

Source-Filter Model of Speech
- Source-filter theory models speech production as a two-stage process
  - The vocal tract operates as a filter on a sound generated by a sound source (glottis)
  - Source and filter are assumed to be independent of each other
- The output signal can then be written as a combination of the source and the filter:
  - Convolution in the time domain: s(t) = e(t) * h(t)
  - Multiplication in the frequency domain: S(ω) = E(ω)H(ω)
- Foundation for most prosodic analysis techniques
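The equivalence of the two forms can be checked numerically. The sketch below is only an illustration: the excitation `e`, the filter response `h`, and the helper names `dft` and `convolve` are made up for this example and do not come from the slides.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(N^2)); fine for short illustrative signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def convolve(e, h):
    """Linear convolution s = e * h in the time domain."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, ev in enumerate(e):
        for j, hv in enumerate(h):
            s[i + j] += ev * hv
    return s

# Hypothetical source: a short impulse train standing in for glottal pulses.
e = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
# Hypothetical filter: a short decaying response standing in for the vocal tract.
h = [1.0, 0.5, 0.25]

s = convolve(e, h)
n = len(s)
# Zero-pad both inputs to the output length so the DFT product matches linear convolution.
E = dft(e + [0.0] * (n - len(e)))
H = dft(h + [0.0] * (n - len(h)))
S = dft(s)
max_error = max(abs(S[k] - E[k] * H[k]) for k in range(n))
```

Here `max_error` should be at the level of floating-point rounding, which is the numerical counterpart of S(ω) = E(ω)H(ω).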

Voiced and Unvoiced Sounds
- Speech consists of two kinds of sounds:
- Voiced: all vowels, nasals, and selected other consonants
  - Periodic process where the lungs build up air pressure against the glottis, which flaps open and closes again
  - The period of the process determines the pitch of the voice
- Unvoiced: everything else (p, s, sh, ...)
  - Air pressure keeps the glottis open, and the sound is shaped by the configuration of the vocal tract
  - Lips, tongue, etc. influence the final voice
  - Voice is not periodic
- Classifying speech into voiced and unvoiced segments is typically the first step of prosodic analysis
  - Sequences are then grouped together to separate speech from silence segments

Speech Detection
- Periodic signals (voiced) preserve their characteristics across noise and distance
  - Audio energy is highest on the periodic components
  - Energy is concentrated on a few frequencies (harmonics)
- Speech detection typically operates by combining detection of voiced segments with a temporal model
- Energy and spectral entropy are the most widely used measures for voicing detection
  - Adaptive thresholding techniques, such as Sound of Silence, can also be applied
- Periodicity-related features are also widely used
  - E.g., the number of autocorrelation peaks and their magnitude
- Speech is detected by grouping voiced segments within close proximity of each other into utterances
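A minimal sketch of the two voicing measures named above, per frame. The function names, the fixed thresholds, and the test signals are assumptions made for illustration; the slides mention adaptive thresholding, which is not implemented here.

```python
import cmath
import math
import random

def log_energy(frame, eps=1e-12):
    """Log of the frame energy; eps avoids log(0) on silent frames."""
    return math.log(sum(v * v for v in frame) + eps)

def spectral_entropy(frame):
    """Shannon entropy (bits) of the normalised power spectrum over bins 0..N/2."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def is_voiced(frame, energy_threshold, entropy_threshold):
    """Hypothetical fixed-threshold rule: high energy and low spectral entropy."""
    return (log_energy(frame) > energy_threshold
            and spectral_entropy(frame) < entropy_threshold)

random.seed(0)
n = 128
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # energy on one frequency
noise = [random.gauss(0.0, 1.0) for _ in range(n)]            # energy spread over all bins
tone_entropy = spectral_entropy(tone)
noise_entropy = spectral_entropy(noise)
```

The periodic tone concentrates its energy on a single bin and so has near-zero entropy, while the noise frame spreads energy across the spectrum and has high entropy.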

Speech Detection: Example
[Figure: panels showing the signal, its energy, its spectral entropy, and the resulting speech / non-speech decision]

Speech Detection: Vowel Onset Points
- Onset refers to the first part of a syllable
- Vowel onsets correspond to starting points of voiced speaking segments, and thus to speech
- Vowel onsets are also important for determining certain prosodic characteristics (e.g., intonation)

Prosodic Features: Speech Rate
- Spontaneous speech is characterized by bursty production sequences
- Articulation rate: the speed at which the speaker is producing phonemes
  - The rate of voiced segments has been shown to correlate strongly with phoneme rate
  - Articulation rate can be estimated simply by calculating the ratio of voiced segments within each speech segment
- Production rate: the speed at which the speaker moves from one burst to another
  - Characterized by the gap distribution between successive speech segments
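The two rate measures can be sketched from a per-frame voicing decision. The grouping rule (merge voiced frames separated by at most `max_gap` unvoiced frames), the function names, and the example flags are assumptions for illustration only.

```python
def group_utterances(voiced, max_gap):
    """Merge voiced frames separated by at most max_gap unvoiced frames.

    Returns a list of (start, end) frame indices, both inclusive.
    """
    segments = []
    start = None
    last_voiced = None
    for i, flag in enumerate(voiced):
        if not flag:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 > max_gap:
            segments.append((start, last_voiced))
            start = i
        last_voiced = i
    if start is not None:
        segments.append((start, last_voiced))
    return segments

def articulation_rates(voiced, segments):
    """Ratio of voiced frames within each speech segment."""
    return [sum(voiced[s:e + 1]) / (e - s + 1) for s, e in segments]

def segment_gaps(segments):
    """Number of frames between successive speech segments (production-rate proxy)."""
    return [b[0] - a[1] - 1 for a, b in zip(segments, segments[1:])]

# Hypothetical voicing decisions, one flag per frame (1 = voiced).
voiced = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
segments = group_utterances(voiced, max_gap=1)
rates = articulation_rates(voiced, segments)
gaps = segment_gaps(segments)
```

With these flags the frames fall into two speech segments, and the list of gaps is the raw material for the gap distribution mentioned above.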

Fundamental Frequency F0
- Voiced sounds have periodic, repeatable, and identifiable patterns (or cycles)
  - The duration τ of each period is called the pitch period length or (duration of the) glottal pulse
- Fundamental frequency
  - Inverse of the glottal pulse duration: F0 = 1 / τ
  - Frequency of vocal fold vibration
  - Measure of the highness/lowness of a voice
- Human voice range is within 50-300 Hz
  - Typical male: 85-180 Hz, typical female: 165-255 Hz
  - Children and infants have higher frequencies
- Pitch: perceived tone frequency of a sound
  - Not the same as F0, but the terms are used interchangeably

Pitch and Energy Contour
- Pitch contour refers to a function or curve that tracks (perceived) pitch over time
  - The extent and nature of the variations are key characteristics in voice prosody
  - Reflects tone, intonation, stress, and other natural means of modifying speech patterns
- Energy contour refers to a function or curve that tracks variations in the energy (intensity) of speech
  - Energy is a measure of loudness and hence an important determinant of many social behaviours
  - Maximal energy and variations in the contour are its most important characteristics

Example
[Figure: pitch contour and energy contour of a speech sample]

Prosodic Pitch Features
- Most prosodic sensing applications look at the dynamics of the F0 and (log-)energy contours
- Generally, any standard statistical feature related to change in F0 or energy can be used
  - The most common ones relate to mean, standard deviation, duration, and difference in values
- Some features are closely related to characteristics of speech
  - Intonation: distance of the F0 peak from the nearest vowel onset point
  - Stress: variation of F0 around the vowel onset point, change in log energy
  - Fluency: regularity of autocorrelation peaks

Voice Quality Features
- Formant frequencies (higher-order harmonics) are closely associated with voice quality perception
  - Correlated with several behavioural and cognitive factors, including fluency, hesitation, and sadness
- Typical measures include:
  - harmonics-to-noise ratios within a sentence
  - variations in the difference between formant frequencies
  - energy band of formant frequencies
- Recall that harmonics are integer multiples of F0
  - Hence these features can be extracted once F0 is known

Fundamental Frequency Estimation
- Estimating the fundamental frequency F0 is essential for extracting most prosodic features
  - Typically performed using a pitch tracking algorithm that is constrained to the frequency range of the human voice
- The basic idea in F0 estimation is to identify the dominant frequency peak in a voice
  - Can be performed in the time, frequency, or cepstral domain
- Methods generally assume that a single voice source is active at a time and that the microphone is close to the source
  - Turn-taking helps make this assumption valid in practical situations
  - Multipitch tracking algorithms developed for music can be used in more complex environments

Autocorrelation-based estimation
- The most popular method for pitch estimation is to use the autocorrelation function (ACF)
- During voiced segments, peaks in the autocorrelation occur at integer multiples of the pitch period
  - Under the assumption that only a single audio source is active
  - Identifying the dominant cycle can thus be used to determine pitch
- The signal (and particularly the pitch) varies over time → analysis is performed using short time windows
- A modified autocorrelation function is usually considered: tapered / modified autocorrelation
  - Decays as a function of time
  - Less sensitive to changes in the signal and to aperiodic noise spikes

Tapered autocorrelation - example
[Figure: speech signal and its tapered autocorrelation, with harmonics marked]

ACF-based F0 Estimation
- F0 can be estimated by identifying the dominant peak in the (tapered) ACF
  - The peak lag is converted into Hz using Fs / L, where Fs is the sampling rate of the signal and L is the peak lag
- The search space is constrained to a suitable range (50-300 Hz) to ensure the estimate corresponds to voice
  - By definition, the normalized ACF equals 1 at lag 0, so lag 0 must be excluded
  - Noise and unvoiced segments can cause peaks in the 500 Hz range
- Limitations
  - Autocorrelation overfits to peaks in amplitude → unvoiced segments and formants can cause errors
  - At least two F0 cycles need to be observed → sensitive to window size
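A minimal sketch of ACF-based estimation under the constraints listed above. It uses the unnormalised sum over the available samples, which shrinks linearly with lag; treating this as the "tapered" autocorrelation of the slides is an assumption, as are the function name and the synthetic test tone.

```python
import math

def acf_f0(frame, fs, fmin=50.0, fmax=300.0):
    """Estimate F0 (Hz) from the dominant autocorrelation peak within [fmin, fmax]."""
    n = len(frame)
    min_lag = int(math.ceil(fs / fmax))  # shortest period considered
    max_lag = int(fs / fmin)             # longest period considered
    if n < 2 * max_lag:
        raise ValueError("frame must cover at least two periods of fmin")

    def acf(lag):
        # Sum over the samples that overlap at this lag (fewer terms at larger lags).
        return sum(frame[t] * frame[t + lag] for t in range(n - lag))

    r0 = acf(0)
    if r0 == 0.0:
        return None  # silent frame, no estimate
    # Normalise by r0 so that the ACF is 1 at lag 0; lag 0 itself is outside the search range.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf(lag) / r0)
    return fs / best_lag

# Synthetic 200 Hz tone sampled at 8 kHz (a period of 40 samples), 50 ms frame.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * t / fs) for t in range(400)]
f0 = acf_f0(frame, fs)
```

On real speech the estimator would be applied only to frames classified as voiced, since the limitations listed above make the peak unreliable elsewhere.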

F0 Estimation: ACF Example
[Figure: example of ACF-based F0 estimation]

Extensions: YIN Estimator
- Extension of the ACF estimator that significantly improves the robustness of pitch estimation
- Instead of using the ACF, estimates F0 by identifying minima in a squared difference function
[Figure: squared difference function, with the dip corresponding to F0 marked]

Extensions: YIN Estimator (continued)
- Further reductions in pitch error can be achieved by:
- Normalizing the difference with a cumulative mean
  - Raises the values at harmonics and at the first lags, making F0 the dominant dip in the function
- Two-tier threshold: pick the smallest lag whose value is below a global threshold, or the globally smallest value if no such lag is found
  - Reduces octave errors, i.e., situations where the pitch tracker settles on too high a value
- Parabolic interpolation
  - Values around each local minimum are fitted with a parabola, which reduces gross overestimates
- Local search
  - The search for the dip is restricted within time intervals to ensure smooth overall estimates
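The first three refinements can be sketched as follows. This is a simplified illustration rather than the reference YIN implementation: the local-search step is omitted, the parabola is fitted to the normalised function, and the function name, the threshold of 0.1, and the test tone are assumptions.

```python
import math

def yin_f0(frame, fs, fmin=50.0, fmax=300.0, threshold=0.1):
    """Simplified YIN-style F0 estimate (Hz) for a single frame."""
    max_lag = int(fs / fmin)
    min_lag = max(2, int(fs / fmax))
    window = len(frame) - max_lag  # number of samples compared at every lag
    if window <= 0:
        raise ValueError("frame too short for the requested fmin")

    # Squared difference function d(tau).
    d = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        d[tau] = sum((frame[t] - frame[t + tau]) ** 2 for t in range(window))

    # Cumulative mean normalised difference: d(tau) divided by the mean of d(1..tau).
    cmnd = [1.0] * (max_lag + 1)
    running = 0.0
    for tau in range(1, max_lag + 1):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0.0 else 1.0

    # Two-tier choice: first dip below the threshold, else the global minimum.
    best = None
    for tau in range(min_lag, max_lag + 1):
        if cmnd[tau] < threshold:
            while tau + 1 <= max_lag and cmnd[tau + 1] < cmnd[tau]:
                tau += 1  # walk down to the bottom of this dip
            best = tau
            break
    if best is None:
        best = min(range(min_lag, max_lag + 1), key=lambda k: cmnd[k])

    # Parabolic interpolation around the chosen minimum.
    refined = float(best)
    if best - 1 >= 1 and best + 1 <= max_lag:
        a, b, c = cmnd[best - 1], cmnd[best], cmnd[best + 1]
        denom = a - 2.0 * b + c
        if denom != 0.0:
            refined = best + 0.5 * (a - c) / denom
    return fs / refined

# Synthetic 200 Hz tone sampled at 8 kHz, 50 ms frame.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * t / fs) for t in range(400)]
f0 = yin_f0(frame, fs)
```

Because the difference function is zero at lag 0, the cumulative-mean normalisation is what keeps small lags from being picked as the minimum.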

Cepstral Analysis
- Cepstral analysis is concerned with separating the input excitation from the system response
  - Operates on a cepstrum representation of the signal
- Recall that a speech sequence can be represented as a convolution of the excitation and vocal tract sequences
  - In the frequency domain: S(ω) = E(ω)H(ω)
  - Hence, for the log magnitudes: log |S(ω)| = log |E(ω)| + log |H(ω)|
- Separation can then be performed by taking the inverse Fourier transform of the log magnitude
  - Formally: c(τ) = F^{-1}{ log |S(ω)| }, where τ is the quefrency

Cepstral Analysis
[Figure: a signal and its cepstrum in the quefrency domain]

Cepstral Analysis
[Figure: cepstra of an unvoiced frame (flat cepstrum) and a voiced frame (harmonics identifiable)]

Cepstral Analysis: Liftering
- Liftering refers to the process of separating the spectral envelope from the excitation
  - Liftering = filtering the cepstrum
- Low-pass filtering the cepstrum returns the transfer function, i.e., the spectral envelope
- High-pass filtering the cepstrum returns the excitation
  - Peaks in the excitation (rahmonic peaks) can be used for determining the pitch of the voice
[Figure: liftering function applied to a cepstrum, with the rahmonic peak marked]
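A sketch of low-pass liftering on a synthetic frame. Everything here is an illustrative assumption: the frame is a decaying pulse train (period 40 samples) passed through a one-pole filter with coefficient 0.7, the cutoff of 20 quefrency samples is arbitrary, and the naive `dft` helper stands in for a proper FFT.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT / inverse DFT (O(N^2)); adequate for one short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * cmath.pi * j / n) for j in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + eps) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

def spectral_envelope(cepstrum, cutoff):
    """Low-pass lifter: keep quefrencies below cutoff (and their mirror image),
    then transform back to obtain the log-magnitude spectral envelope."""
    n = len(cepstrum)
    liftered = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered)]

# Synthetic "voiced" frame: pulses every 40 samples, each shaped by a one-pole filter.
n, period, pulse_decay, pole = 512, 40, 0.5, 0.7
frame = [0.0] * n
for m, start in enumerate(range(0, n, period)):
    for t in range(start, n):
        frame[t] += pulse_decay ** m * pole ** (t - start)

envelope = spectral_envelope(real_cepstrum(frame), cutoff=20)
# Log magnitude response of the one-pole filter, for comparison.
true_envelope = [-math.log(abs(1 - pole * cmath.exp(-2j * cmath.pi * k / n)))
                 for k in range(n)]
max_deviation = max(abs(a - b) for a, b in zip(envelope, true_envelope))
```

In this constructed case the excitation only contributes at quefrencies that are multiples of the pulse period, so the low-pass lifter recovers the filter response closely; real speech separates less cleanly.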

Cepstral Analysis: Liftering Example
[Figure: cepstrum; FFT of the signal; FFT of the high-pass filtered cepstrum; FFT of the low-pass filtered cepstrum (= spectral envelope)]

Cepstrum Analysis: F0 Estimation
- Peaks in the cepstrum can be used to estimate F0 using an approach analogous to autocorrelation
  1. Construct the cepstrum of the input signal
  2. Lifter the signal: bandpass filter the cepstrum to focus on the frequency range of the human voice (50-300 Hz)
  3. Find the maximal peak in the liftered signal
  4. Convert the peak location into a frequency
- Practical issues
  - Sensitive to the frame (window) size used in the analysis
  - The signal should be filtered before analysis to reduce noise
  - Voicing detection should be used to restrict the analysis to voiced frames
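The four steps can be sketched as below. The synthetic frame (a decaying pulse train with a 40-sample period at an assumed 8 kHz sampling rate), the helper names, and the naive DFT are illustrative assumptions; the bandpass lifter is realised simply by restricting the peak search to the quefrencies that correspond to 50-300 Hz.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT / inverse DFT (O(N^2)); adequate for one short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * cmath.pi * j / n) for j in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, eps=1e-12):
    """Step 1: inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + eps) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

def cepstral_f0(frame, fs, fmin=50.0, fmax=300.0):
    """Steps 2-4: search the voice quefrency band for the peak and convert to Hz."""
    cepstrum = real_cepstrum(frame)
    min_q = int(math.ceil(fs / fmax))
    max_q = int(fs / fmin)
    if max_q >= len(frame) // 2:
        raise ValueError("frame too short for the requested fmin")
    peak_q = max(range(min_q, max_q + 1), key=lambda q: cepstrum[q])
    return fs / peak_q

# Synthetic "voiced" frame: pulses every 40 samples (200 Hz at 8 kHz),
# each shaped by a one-pole filter standing in for the vocal tract.
fs, n, period, pulse_decay, pole = 8000, 512, 40, 0.5, 0.7
frame = [0.0] * n
for m, start in enumerate(range(0, n, period)):
    for t in range(start, n):
        frame[t] += pulse_decay ** m * pole ** (t - start)

f0 = cepstral_f0(frame, fs)
```

The filter contributes only at low quefrencies, so the first rahmonic at 40 samples dominates the search band in this example.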

Cepstrum Analysis: F0 Estimation - Example
[Figure: example of cepstrum-based F0 estimation]

Other methods for F0 estimation
- Zero-crossings
  - The distance between zero-crossing points can be used to identify the signal period (and hence F0)
- Harmonic Product Spectrum
  - Measures the frequencies of the harmonic components; F0 is determined as their greatest common divisor
- Maximum likelihood (template-based)
  - The audio frame is correlated in the frequency domain with all possible windowed impulses
- Many other techniques as well
  - Super-resolution pitch determination
  - Perceptual pitch modelling
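As an illustration of the simplest of these methods, the sketch below estimates F0 from the mean distance between upward zero crossings. The function name and the clean synthetic tone are assumptions; on noisy or harmonically rich speech this estimator is far less reliable.

```python
import math

def zero_crossing_f0(frame, fs):
    """Estimate F0 (Hz) from the mean spacing of negative-to-positive zero crossings."""
    crossings = [t for t in range(1, len(frame))
                 if frame[t - 1] < 0.0 <= frame[t]]
    if len(crossings) < 2:
        return None  # not enough crossings to measure a period
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return fs / mean_period

# Synthetic 200 Hz tone sampled at 8 kHz, with an arbitrary phase offset.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * t / fs + 0.3) for t in range(400)]
f0 = zero_crossing_f0(frame, fs)
```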

Conversational Dynamics
- Thus far we have implicitly assumed that only a single voice is present
- In reality, prosody extraction needs to be performed during conversations with multiple people
- Speaker diarization
  - The process of identifying speakers and their speaking segments, i.e., who spoke when?
  - Prerequisite for prosodic sensing when multiple speakers are present
  - Relies on so-called turn-taking behaviors

Conversational Dynamics: Turn-Taking
- Refers to the process by which people in a conversation decide who speaks next
  - In terms of prosodic modelling, causes additional pauses in the speech that need to be considered
  - In the context of speech over telephony, simply causes periods of silence
- Turn-exchange points refer to parts of the discourse where the speaker can change
  - Can be identified (at least to some extent) using prosodic analysis (e.g., decreasing pitch)
- During pure turn-taking only a single source is active
  - The methods described in these slides are directly applicable
  - However, segments sometimes overlap → additional techniques are required

Overview of Speaker Diarization
[Figure: pipeline of speech detection, segmentation, and clustering]
- The general process is very similar to other audio and speech processing
  - Preprocessing is performed to reduce noise
  - Speech detection is performed analogously
  - Segmentation and clustering are required as additional steps to detect the audio tracks of individual speakers
- Speaker diarization systems:
  - Bottom-up: operate from low-level processing to clusters and analysis of streams
  - Top-down: model the whole audio as a single-speaker segment and progressively add new speakers until the best fit is found

Speaker Diarization: Segmentation
- Focuses on splitting the audio into speaker-homogeneous segments, or on detecting changes in speaker turns
- The classical approach is to segment using hypothesis testing
  - The basic idea is to compare how well (a portion of) the audio is fitted by a single distribution against how well it is fitted by two (or more) distributions
- Several possible metrics:
  - Bayesian Information Criterion (BIC)
  - Generalized Likelihood Ratio (GLR)
  - Kullback-Leibler divergence
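A toy sketch of the hypothesis-testing idea with the BIC metric. It assumes a single scalar feature per frame modelled by one Gaussian versus two Gaussians; real systems use multivariate features. The function names, the penalty weight of 1, and the synthetic feature values are illustrative assumptions.

```python
import math

def _log_variance(values, floor=1e-8):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.log(max(variance, floor))

def delta_bic(values, split, penalty_weight=1.0):
    """Positive values favour two Gaussians (a change at `split`) over one."""
    n = len(values)
    left, right = values[:split], values[split:]
    gain = 0.5 * (n * _log_variance(values)
                  - len(left) * _log_variance(left)
                  - len(right) * _log_variance(right))
    # The two-model hypothesis has two extra parameters (one mean, one variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty

def best_change_point(values, min_size=2):
    """Return the split with the largest delta BIC, together with its score."""
    candidates = range(min_size, len(values) - min_size + 1)
    split = max(candidates, key=lambda i: delta_bic(values, i))
    return split, delta_bic(values, split)

# Hypothetical scalar feature: two "speakers" with clearly different means.
speaker_a = [-1.0, 1.0] * 10
speaker_b = [9.0, 11.0] * 10
split, score = best_change_point(speaker_a + speaker_b)
# A homogeneous stream should not trigger a change.
_, homogeneous_score = best_change_point([-1.0, 1.0] * 20)
```

A speaker change is declared only when the best score is positive; the penalty term is what prevents spurious splits in homogeneous audio.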

Speaker Diarization: Clustering
- The clustering stage of speaker diarization groups similar segments together
  - Essentially determines which of the segments result from the same speaker
- Accuracy depends on segmentation → resegmentation is often performed based on the clustering
  - Overlapping segments in particular cause problems
- State-of-the-art systems perform segmentation and clustering in tandem
  - So-called one-step clustering and segmentation

Speaker Diarization: Additional Topics
- Measurements often need to be combined from multiple microphones (mobile devices)
  - Audio composition techniques are required to find the optimal combination
  - Traditional diarization assumes that microphone locations are known; in mobile contexts this is not the case
  - Time-delay information needs to be used to estimate the locations of devices
- Dealing with overlap
  - Multi-pitch tracking is required for identifying different speakers in the audio
  - More complex speaker models that allow for covariance structure are required

Summary
- Prosodic sensing focuses on determining features that characterize the way humans speak
  - The most common features relate to variations in pitch patterns or relative rates of voiced segments in speech
- Speech production models provide the mathematical basis for prosody extraction
  - The source-filter model is especially important
- Speech consists of three parts: voiced segments, unvoiced segments, and silence
  - Segmentation is essential for prosodic extraction

Prosodic Sensing Process: Summary
- Prosodic sensing operates along the following pipeline:
- Preprocessing: frame construction (with overlap), windowing, noise removal
- Speech detection and unvoiced/voiced segment detection
  - A simple solution is to use spectral entropy and log-energy
- F0 estimation
  - Applied only to voiced segments
  - Autocorrelation and variants, cepstral analysis, etc.
- Feature extraction
  - Rate-related features
  - Statistical features: time, frequency, and quefrency domains
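The preprocessing stage of the pipeline can be sketched as overlapping frames multiplied by a window. The choice of a Hamming window, the function names, and the toy signal are assumptions for illustration; noise removal is not shown.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, frame_len, hop):
    """Split the signal into overlapping frames (hop < frame_len) and window each one.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames

# Toy signal: 10 samples, frames of 4 samples with 50% overlap.
signal = [1.0] * 10
frames = windowed_frames(signal, frame_len=4, hop=2)
```

Each resulting frame would then be passed to voicing detection and, if voiced, to one of the F0 estimators.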

References
- Basu, S., Conversational scene analysis, Massachusetts Institute of Technology, 2002
- Rabiner, L., On the use of autocorrelation analysis for pitch detection, IEEE Transactions on Acoustics, Speech and Signal Processing, 1977, 25, 24-33
- de Cheveigné, A. & Kawahara, H., YIN, a fundamental frequency estimator for speech and music, The Journal of the Acoustical Society of America, 2002, 111, 1917-1930
- Wyatt, D.; Choudhury, T.; Bilmes, J. & Kitts, J. A., Inferring Colocation and Conversation Networks from Privacy-Sensitive Audio with Implications for Computational Social Science, ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2, 7:1-7:41
- Mary, L. & Yegnanarayana, B., Extraction and representation of prosodic features for language and speaker recognition, Speech Communication, 2008, 50, 782-796
- Scherer, K. R. & Giles, H. (eds.), Social Markers in Speech, Cambridge University Press, 1980, 147-209
- Scherer, K. R., Personality inference from voice quality: the loud voice of extroversion, European Journal of Social Psychology, 1978, 8, 467-487
- Miró, X. A.; Bozonnet, S.; Evans, N. W. D.; Fredouille, C.; Friedland, G. & Vinyals, O., Speaker Diarization: A Review of Recent Research, IEEE Transactions on Audio, Speech & Language Processing, 2012, 20, 356-370