Speech and Language Understanding (SLU) 5 ECTS Exam 2021-08-02

Meta Information

Subject: Speech and Language Understanding (SLU) 5 ECTS, Summer semester
Date: 2021-08-02
Type of Exam: oral via Zoom
Examiner: Prof. Andreas Maier
Grade: 1,7

Evaluation

- Atmosphere was quite relaxed
- Two additional doctoral candidates attended to learn how oral exams are conducted, but they didn't ask any questions
- I had to show a 360-degree view of the room and be alone in it
- I was allowed to have paper and sketch things onto it

Exam

* Summarize the lecture

I started with Human Speech Production: a sketch of speech production (Source, Modulator, Vocal Tract) with explanation, fundamental frequency + articulators –> Fant's Source-Filter Model

* How do we perceive sounds?

Description of the ear: hearing range of roughly 16 Hz - 20 kHz, outer ear, ear drum, inner ear: cochlea. Frequency perception through a fluid-filled channel lined with hairs that vibrate and produce nerve signals

* How are different frequencies perceived?

The channel narrows and the hairs are placed at fixed steps: the excitation point on the *basilar membrane* –> frequency coding

* How about phonetics and phonemes, can you tell me more about that?

We have different articulators, which we can use to create different sounds. Phonemes fall into many categories: Vowels, Fricatives, Plosives, Nasals, Laterals…

* What are vowels? How would you identify them in spectrogram?

Voiced sounds, fundamental frequency F0 + harmonics. Different vowels can be distinguished by their formants (vocal tract resonances)

* How about plosives?

For example the letters p, t, k: pulse-like (spread over the whole frequency spectrum, really short)

* And fricatives?

High frequency noise, examples: f, s, unvoiced

(From here on I'm not sure about the order of the questions, nor do I remember their exact wording)

* How to compute MFCCs?

We have a signal, which is air pressure over time. That is not really useful on its own, so we compute the STFT spectrogram. I mentioned windowing and was then asked…

* Why would you use windowing at all?

Computing one DFT for the whole signal is not useful, since we lose temporal information.

* But why is it important for speech recognition?

Phoneme characterization depends on time resolution, for example for plosives

* So we cut just part of the signal out and do DFT on that?

Not optimal, since the rectangular window introduces artefacts.

* That's right. Multiplication in frequency domain is equivalent to?

…, (Convolution in time domain)
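
As a side note on the rectangular-window artefacts: cutting a segment out of the signal multiplies it by a rectangular window in time, so the window's own spectrum gets smeared over the signal's spectrum. The sketch below is only an illustration under my own assumptions (NumPy, a 400-sample window, the helper name `peak_sidelobe_db` is made up). It compares how strong the highest side lobe of a rectangular and a Hann window is.

```python
import numpy as np

def peak_sidelobe_db(window: np.ndarray, pad: int = 16) -> float:
    """Highest side-lobe level of a window's spectrum, in dB relative to the main lobe."""
    n = len(window)
    # Zero-pad so the shape of the window spectrum is sampled finely
    spectrum = np.abs(np.fft.rfft(window, n * pad))
    spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
    # Walk down the main lobe until the magnitude starts rising again
    i = 1
    while i < len(spectrum_db) - 1 and spectrum_db[i] < spectrum_db[i - 1]:
        i += 1
    # Everything from here on belongs to the side lobes
    return float(spectrum_db[i:].max())

rect_db = peak_sidelobe_db(np.ones(400))      # rectangular window
hann_db = peak_sidelobe_db(np.hanning(400))   # Hann window
print(f"rectangular: {rect_db:.1f} dB, Hann: {hann_db:.1f} dB")
```

The lower the side lobes, the less energy leaks into neighbouring frequency bins. That is the usual argument for a tapered window, at the cost of a wider main lobe.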

* How big should the windows be?

First I said 10 ms is typical, then corrected myself to a 25 ms window size and a 10 ms hop size.

* So are phoneme durations longer or shorter than our windows?

Again, my first answer was wrong.

* Think about phonemes like plosives and what they look like.

We need shorter windows to identify the correct location of the plosives. (Correct answer)
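
To make the numbers concrete, here is a minimal sketch of the windowed STFT with the 25 ms window and 10 ms hop from above. The 16 kHz sample rate, the Hamming window, the FFT size of 512 and the function name `power_spectrogram` are my own assumptions, not something fixed in the exam.

```python
import numpy as np

def power_spectrogram(signal, sample_rate=16000, win_ms=25.0, hop_ms=10.0, n_fft=512):
    """Split the signal into overlapping windowed frames and return |DFT|^2 per frame."""
    signal = np.asarray(signal, dtype=float)
    win_len = int(round(sample_rate * win_ms / 1000))   # 400 samples at 16 kHz
    hop_len = int(round(sample_rate * hop_ms / 1000))   # 160 samples at 16 kHz
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - win_len) // hop_len
    # Row t holds the sample indices of frame t
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win_len)          # taper each frame
    # One DFT per frame, zero-padded to n_fft; shape (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

# One second of a 1 kHz tone
t = np.arange(16000) / 16000
spec = power_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)
```

Each row is one time step, so the temporal information that a single DFT over the whole signal would lose is kept along the first axis.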

* So you have STFT Spectrum, what next?

I more or less explained Mel filterbanks and that they are perceptually motivated, but the Prof. was not really happy with my answer; it was not what he wanted to hear.

* What is cepstrum?

Spectrum of the spectrum, apply inverse Fourier transform. (Professor not ok with that)

I recapped the whole MFCC procedure: STFT –> Spectrogram –> Mel Filterbanks –> Log (compression of energy scale) –> Cosine transform (Source Filter separation) –> Mel-Cepstrum Coefficients + Energy + Deltas
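
The recap can be sketched in code roughly as follows, starting from a power spectrogram (one row per frame). This is a sketch under assumptions and not the lecture's reference implementation: the mel formula `2595 * log10(1 + f / 700)`, the 26 triangular filters, the 13 coefficients and all function names are common choices that I picked. Energy and deltas are left out.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (assumed here)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct2(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def mfcc_from_power(power_spec, sample_rate=16000, n_filters=26, n_ceps=13):
    """Mel filterbank -> log -> cosine transform, applied to each frame."""
    power_spec = np.asarray(power_spec, dtype=float)
    n_fft = 2 * (power_spec.shape[1] - 1)
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    mel_energies = power_spec @ fbank.T          # (n_frames, n_filters)
    log_mel = np.log(mel_energies + 1e-10)       # compress the energy scale
    return dct2(log_mel, n_ceps)                 # (n_frames, n_ceps)

# Usage with random stand-in data: 5 frames, 257 frequency bins (n_fft = 512)
rng = np.random.default_rng(0)
mfcc = mfcc_from_power(rng.random((5, 257)))
print(mfcc.shape)
```

Keeping only the first few cosine-transform coefficients retains the smooth spectral envelope (the filter part) and drops the fine harmonic structure (the source part).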

* (About 20 mins in, topic switch) Please explain HMM

First I explained it informally (hidden states and sequences of observations), then gave the formal definition.

* What is Viterbi used for?

Efficient algorithm for most likely state sequence. Problem 2 in HMMs

* How would you implement that?

I stated the problem definition with: Initialization, then computing forward probabilities and finally backtracking

* Yes, that's the problem definition, but how would you implement that?

… (Some sketch was expected but I don't know what)
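
For reference, one possible way to implement the three steps (initialization, recursion, backtracking) is sketched below. This is not necessarily the sketch the examiner was looking for. The variable names (`pi`, `A`, `B`, `delta`, `psi`) and the use of log probabilities are my own choices, and the toy model in the usage part is the well-known healthy/fever textbook example, not something from the exam.

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Most likely state sequence of an HMM.

    pi: (S,) initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B:  (S, V) emissions,   B[j, o] = P(observation o | state j)
    observations: list of observation indices
    """
    with np.errstate(divide="ignore"):            # log(0) = -inf is fine here
        log_pi, log_A, log_B = (np.log(np.asarray(x, dtype=float)) for x in (pi, A, B))
    T, S = len(observations), len(log_pi)
    delta = np.empty((T, S))                      # best log score of a path ending in each state
    psi = np.zeros((T, S), dtype=int)             # best predecessor of each state

    # Initialization
    delta[0] = log_pi + log_B[:, observations[0]]
    # Recursion: like the forward algorithm, but with max instead of sum
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, observations[t]]
    # Backtracking from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[-1].max())

# Toy model: states 0 = healthy, 1 = fever; observations 0 = normal, 1 = cold, 2 = dizzy
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, log_prob = viterbi(pi, A, B, [0, 1, 2])
print(path, np.exp(log_prob))
```

The table `delta` is filled column by column in time, which is what makes the algorithm efficient: O(T * S^2) instead of enumerating all S^T state sequences.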

* All right, let's move to Language Models, we learned about several of them, choose your favorite

N-grams: I explained what that means; utterances are often context dependent

* What are the problems with high-order N-grams?

If a sequence is not represented in the training data, then its probability is 0. I touched on Jeffrey Smoothing and the problem with that, as well as on using backoff strategies.
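
The zero-probability problem and the effect of smoothing can be illustrated with a tiny bigram model. The sketch uses plain additive (add-k) smoothing; whether this matches the lecture's definition of Jeffrey Smoothing exactly is an assumption on my part, and the function names and toy sentences are made up. Backoff is not shown.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams; sentences are whitespace-tokenized strings."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])                  # tokens that can be predicted
        history_counts.update(tokens[:-1])        # tokens that can act as history
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab, k=0.0):
    """P(word | prev) with add-k smoothing; k = 0 is the maximum-likelihood estimate."""
    denom = history_counts[prev] + k * len(vocab)
    if denom == 0:
        return 0.0
    return (bigram_counts[(prev, word)] + k) / denom

model = train_bigram(["the cat sat", "the dog sat"])
p_seen = bigram_prob("the", "cat", *model)               # seen bigram
p_unseen = bigram_prob("the", "sat", *model)             # unseen bigram, unsmoothed
p_smoothed = bigram_prob("the", "sat", *model, k=0.5)    # unseen bigram, smoothed
print(p_seen, p_unseen, p_smoothed)
```

Smoothing removes the zeros, but it hands the same probability mass to every unseen bigram, plausible or not. That is one reason backoff to lower-order N-grams is used as well.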

* Last Question about Categorisation

Similar words are grouped together to reduce the number of possible word sequences. Categories are not unique (e.g. the German word "Essen", which can be the city or "food"), so the category is treated as a hidden state in an HMM.