====== Speech and Language Understanding (SLU) 5 ECTS Prüfung 2021-08-02 ======

===== Meta Information =====

  * Subject: Speech and Language Understanding (SLU), 5 ECTS, summer semester
  * Date: 2021-08-02
  * Type of exam: oral, via Zoom
  * Examiner: Prof. Andreas Maier
  * Grade: 1.7
  * Evaluation
    * The atmosphere was quite relaxed
    * Two additional doctoral students were there to learn how oral exams are conducted, but they did not ask any questions
    * I had to show the room 360 degrees and had to be alone
    * I was allowed to have paper and sketch things on it

===== Exam =====

  * Summarize the lecture
> I started with human speech production: a sketch of speech production (source, modulator, vocal tract) and an explanation, fundamental frequency + articulators --> Fant's source-filter model
  * How do we perceive sounds?
> Description of the ear: 16 Hz - 20 kHz, outer ear, eardrum, inner ear: cochlea; frequency perception through a fluid-filled channel with hair cells that swing and produce nervous signals
  * How are different frequencies perceived?
> The channel narrows and the hair cells sit at fixed positions, so the excitation point on the *basilar membrane* encodes the frequency --> frequency coding
  * How about phonetics and phonemes, can you tell me more about that?
> We have different articulators, which we can use to create different sounds. Phonemes fall into many categories: vowels, fricatives, plosives, nasals, laterals, ...
  * What are vowels? How would you identify them in a spectrogram?
> Voiced sounds, fundamental frequency F0 + harmonics; different vowels can be distinguished by their formants (vocal tract resonances)
  * How about plosives?
> For example the letters p, t, k: pulse-like (spread over the whole frequency spectrum, really short)
  * And fricatives?
> High-frequency noise, unvoiced, examples: f, s (From here on I am not sure about the order of the questions, nor do I remember the exact wording)
  * How do you compute MFCCs?
> We have a signal, which is air pressure over time. Not really useful on its own, so compute the STFT spectrogram. I mentioned windowing and was then asked...
  * Why would you use windowing at all?
> Computing one DFT for the whole signal is not useful, since we lose all temporal information.
  * But why is it important for speech recognition?
> Characterizing phonemes depends on the time resolution, for example plosives.
  * So we just cut a part of the signal out and do a DFT on that?
> Not optimal, since the rectangular window causes artefacts.
  * That's right. Multiplication in the frequency domain is equivalent to ... ?
> ..., (convolution in the time domain)
  * How big should the windows be?
> First said 10 ms is typical, then corrected to a 25 ms window size with a 10 ms hop size.
  * So are phoneme durations longer or shorter than our windows?
> Again, my first answer was wrong.
  * Think about phonemes like plosives and how they look.
> We need shorter windows to locate the plosives correctly. (Correct answer)
  * So you have the STFT spectrum, what next?
> Kind of explained Mel filterbanks and that they are perceptually motivated, but the Prof. was not really happy; it was not what he wanted to hear.
  * What is the cepstrum?
> Spectrum of the spectrum, apply an inverse Fourier transform. (Professor not OK with that)
> I recapped the whole MFCC procedure: STFT --> spectrogram --> Mel filterbanks --> log (compression of the energy scale) --> cosine transform (source-filter separation) --> Mel-cepstrum coefficients + energy + deltas (a sketch of this pipeline is given below)
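Not something I had to show in the exam, but since the whole pipeline came up, here is a minimal numpy sketch of the MFCC computation as recapped above. The 512-point FFT, 24 Mel bands and 13 coefficients are illustrative values of my own, not from the lecture, and the energy feature and deltas are omitted.

<code python>
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_mels=24, n_ceps=13):
    """STFT -> Mel filterbank -> log -> DCT, as recapped above (illustrative parameters)."""
    win_len, hop_len = int(win * sr), int(hop * sr)
    n_fft = 512

    # Frame the signal (25 ms windows, 10 ms hop) and apply a Hamming window
    # to reduce the artefacts of a rectangular window.
    n_frames = 1 + (len(signal) - win_len) // hop_len
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win_len)

    # Power spectrogram (squared STFT magnitude).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular Mel filterbank: filters equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energies = power @ fbank.T

    # Log compresses the energy scale; the cosine transform (DCT) decorrelates
    # the channels and separates source and filter; keep the first coefficients.
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Usage with a placeholder 1-second signal: returns an (n_frames, 13) array.
# feats = mfcc(np.random.randn(16000))
</code>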
  * (About 20 minutes in, topic switch) Please explain HMMs
> First explained them informally, hidden states and observations of sequences, then gave the formal definition.
  * What is Viterbi used for?
> An efficient algorithm for finding the most likely state sequence, i.e. Problem 2 in HMMs.
  * How would you implement that?
> I stated the problem definition: initialization, then computing the forward probabilities, and finally backtracking.
  * Yes, that's the problem definition, but how would you implement that?
> ... (Some sketch was expected, but I did not know what; my guess at what it could look like is appended at the end of this protocol)
  * Alright, let's move to language models. We learned about several of them, choose your favorite.
> N-grams, what they mean, and that utterances are often context dependent.
  * What are the problems with high-order N-grams?
> If a sequence is not represented in the training data, then its probability is 0. Touched on Jeffreys smoothing and its problems, as well as on backoff strategies (see the small example at the end of this protocol).
  * Last question, about categorization
> Similar words are grouped together to reduce the number of possible word sequences. Categories are not unique (e.g. "Essen"), so treat the category as a hidden state in an HMM.
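Regarding the Viterbi implementation question: I still do not know exactly what sketch was expected, but something along these lines would probably have worked, i.e. initialization, recursion with a max over predecessors, and backtracking. This is a minimal numpy version with a made-up toy model; variable names are my own, not from the lecture.

<code python>
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence of an HMM (pi: initial probabilities,
    A: transition matrix, B: emission matrix, obs: observation indices)."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))            # best log-probability ending in state j at time t
    psi = np.zeros((T, n_states), dtype=int)   # backpointers to the best predecessor

    # Work in log space to avoid underflow for long sequences.
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    # Initialization
    delta[0] = log_pi + log_B[:, obs[0]]

    # Recursion: for each time step keep only the best predecessor per state.
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_B[:, obs[t]]

    # Backtracking: follow the stored backpointers from the best final state.
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1][states[t + 1]]
    return states

# Toy example: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 0, 1]))
</code>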
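And for the language-model part, a tiny illustration of the zero-probability problem of maximum-likelihood N-grams, with add-1/2 ("Jeffreys") smoothing and a simple backoff to unigrams. The toy corpus and the backoff weight are made up for illustration.

<code python>
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)   # vocabulary size
N = len(corpus)     # number of tokens

def p_bigram_mle(w, prev):
    """Maximum-likelihood bigram probability: 0 for unseen word pairs."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_bigram_jeffreys(w, prev):
    """One common additive smoothing: add 1/2 to every count over the vocabulary."""
    return (bigrams[(prev, w)] + 0.5) / (unigrams[prev] + 0.5 * V)

def p_bigram_backoff(w, prev, alpha=0.4):
    """Back off to the unigram probability when the bigram was never seen."""
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / unigrams[prev]
    return alpha * unigrams[w] / N

print(p_bigram_mle("dog", "the"))        # seen bigram
print(p_bigram_mle("cat", "dog"))        # unseen -> 0, the N-gram problem
print(p_bigram_jeffreys("cat", "dog"))   # smoothing gives a small nonzero value
print(p_bigram_backoff("cat", "dog"))    # backoff uses the unigram probability instead
</code>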