Speech and Language Understanding (SLU) 5 ECTS Exam 2021-08-02
Meta Information
- Subject: Speech and Language Understanding (SLU) 5 ECTS, Summer semester
- Date: 2021-08-02
- Type of Exam: oral via Zoom
- Examiner: Prof. Andreas Maier
- Grade: 1,7
- Evaluation
- Atmosphere was quite relaxed
- Two additional doctoral students were present to learn how oral exams are conducted, but they didn't ask any questions
- I had to show the room in a full 360-degree pan and had to be alone
- I was allowed to have paper and sketch things onto it
Exam
* Summarize the lecture
I started with human speech production: a sketch of speech production (source, modulator, vocal tract) with explanation, fundamental frequency + articulators –> Fant's source-filter model
* How do we perceive sounds?
Description of the ear: roughly 16 Hz - 20 kHz range, outer ear, eardrum, inner ear: cochlea; frequency is perceived through a fluid-filled channel with hair cells that oscillate and produce nerve signals
* How are different frequencies perceived?
The channel narrows and the hair cells sit at fixed positions, so each frequency has its own excitation point on the *basilar membrane* –> frequency coding
* How about phonetics and phonemes, can you tell me more about that?
We have different articulators, which we can use to create different sounds. Phonemes fall into many categories: vowels, fricatives, plosives, nasals, laterals…
* What are vowels? How would you identify them in spectrogram?
Voiced sounds, F0 (fundamental frequency) + harmonics; different vowels can be distinguished by their formants (vocal tract resonances)
* How about plosives?
For example p, t, k; pulse-like (energy across the whole frequency spectrum, very short)
* And fricatives?
High frequency noise, examples: f, s, unvoiced
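To connect these phoneme classes to actual signals, here is a small numpy sketch of what toy versions could look like; all numbers (sampling rate, F0, durations) are my own assumptions, not values from the lecture:

```python
import numpy as np

fs = 16000                       # assumed sampling rate in Hz
t = np.arange(0, 0.5, 1 / fs)    # half a second of "speech"
f0 = 120.0                       # assumed fundamental frequency

# toy "vowel": voiced, F0 plus harmonics; in a real vowel the vocal
# tract resonances (formants) shape the amplitudes of these harmonics
vowel = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 20))

# toy "fricative": unvoiced, broadband noise (high-frequency energy)
fricative = np.random.randn(len(t))

# toy "plosive": a very short, pulse-like burst across all frequencies
plosive = np.zeros(len(t))
plosive[:40] = np.random.randn(40)
```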
(From here on I'm not sure about the order of the questions, nor do I remember the exact wording.)
* How to compute MFCCs?
We have a signal, which is air pressure over time. That alone is not very useful, so we compute the STFT spectrogram. I mentioned windowing and was then asked…
* Why would you use windowing at all?
Computing one DFT for the whole signal is not useful, since we lose temporal information.
* But why is it important for speech recognition?
Phoneme characterization depends on the time resolution, for example for plosives
* So we cut just part of the signal out and do DFT on that?
Not optimal, since the rectangular window causes artifacts (spectral leakage).
* That's right. Multiplication in the frequency domain is equivalent to …?
…, (Convolution in time domain)
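Windowing is exactly this duality in action: multiplying the signal with a window in the time domain convolves its spectrum with the window's spectrum. A minimal numpy sketch (toy sinusoid, assumed 16 kHz sampling rate) shows the leakage caused by a plain rectangular cut versus a Hamming window:

```python
import numpy as np

fs = 16000                          # assumed sampling rate
t = np.arange(0, 0.025, 1 / fs)     # one 25 ms frame (400 samples)
x = np.sin(2 * np.pi * 450 * t)     # toy sinusoid, not aligned to a DFT bin

rect = np.abs(np.fft.rfft(x))                       # rectangular window: plain cut-out
hamm = np.abs(np.fft.rfft(x * np.hamming(len(x))))  # tapered frame edges

# energy far away from 450 Hz is spectral leakage; the Hamming-windowed
# frame shows much less of it than the rectangular cut
far = slice(50, None)               # bins above ~2 kHz
print(rect[far].max(), hamm[far].max())
```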
* How big should the windows be?
First I said 10 ms is typical, then corrected that to a 25 ms window size with a 10 ms hop size.
* So are phoneme durations longer or shorter than our windows?
Again I first gave a wrong answer.
* Think about phonemes like plosives and what they look like.
We need short windows to locate the plosives correctly in time. (Correct answer)
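With the values mentioned above (25 ms window, 10 ms hop) and an assumed sampling rate of 16 kHz, the frame geometry works out as follows:

```python
fs = 16000                     # assumed sampling rate in Hz
win = int(0.025 * fs)          # 25 ms window  -> 400 samples
hop = int(0.010 * fs)          # 10 ms hop     -> 160 samples

n_samples = fs                 # a one-second utterance
n_frames = 1 + (n_samples - win) // hop
print(win, hop, n_frames)      # 400 160 98 -> roughly 100 frames per second
```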
* So you have STFT Spectrum, what next?
I roughly explained Mel filterbanks and that they are perceptually motivated, but that was not quite what the professor wanted to hear.
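The perceptual motivation is the mel scale: filter centers are spaced linearly in mel, hence densely at low and sparsely at high frequencies. A commonly used conversion formula (one of several variants, not necessarily the lecture's exact constants):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# equally spaced points on the mel axis give triangular filter centers
# that are dense at low and sparse at high frequencies
centers = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26))
print(centers.astype(int))
```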
* What is cepstrum?
Spectrum of the spectrum, apply the inverse Fourier transform. (The professor was not satisfied with that.)
I recapped the whole MFCC procedure: STFT –> spectrogram –> Mel filterbanks –> log (compression of the energy scale) –> cosine transform (source-filter separation) –> Mel-cepstrum coefficients + energy + deltas
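A compact sketch of that pipeline using only numpy/scipy; frame length, hop size, FFT size, filter count and sampling rate are my own assumed values, and the deltas are omitted for brevity:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs):
    # triangular filters with centers equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(signal, fs=16000, win=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    frames = np.array([signal[s:s + win] * np.hamming(win)
                       for s in range(0, len(signal) - win, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # STFT power spectrogram
    mel_spec = power @ mel_filterbank(n_filters, n_fft, fs).T    # Mel filterbanks
    log_mel = np.log(mel_spec + 1e-10)                           # log: compress energy scale
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT: source-filter separation

# toy usage on one second of noise
print(mfcc(np.random.randn(16000)).shape)    # (number of frames, 13)
```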
* (About 20 mins in, topic switch) Please explain HMM
First explained it informally (hidden states producing observation sequences), then gave the formal definition.
* What is Viterbi used for?
An efficient algorithm for finding the most likely state sequence; Problem 2 of HMMs
* How would you implement that?
I stated the problem definition: initialization, then computing the forward probabilities, and finally backtracking
* Yes, that's the problem definition, but how would you implement it?
… (Some sketch was expected, but I don't know what exactly.)
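Presumably a sketch along these lines was expected; a minimal implementation with made-up toy parameters:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for an observation sequence.
    pi: (N,) initial probabilities, A: (N, N) transition probabilities,
    B: (N, M) emission probabilities, obs: list of observation indices."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path probability ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # backpointers

    delta[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # recursion
        trans = delta[t - 1, :, None] * A          # (from state, to state)
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]

    path = [int(delta[-1].argmax())]               # backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# toy example: 2 states, 3 observation symbols (all numbers made up)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2]))    # most likely state sequence, here [0, 0, 1]
```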
* Alright, let's move on to language models. We learned several of them, choose your favorite.
N-grams: what they mean, and that utterances are often context dependent
* What are the problems with high-order N-grams?
If a sequence is not represented in the training data, its probability is 0. Touched on Jeffreys smoothing and its problems, as well as backoff strategies.
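A toy sketch of additive smoothing for bigrams (the exact smoothing variant and constant used in the lecture may differ):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()   # toy training data
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w, lam=0.5):
    # additive smoothing: unseen bigrams no longer get probability 0
    return (bigrams[(w_prev, w)] + lam) / (unigrams[w_prev] + lam * len(vocab))

print(p_bigram("the", "cat"))   # seen bigram, relatively high probability
print(p_bigram("cat", "mat"))   # unseen bigram, small but non-zero
```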
* Last question, about categorization
Similar words are grouped together to reduce the number of possible word sequences. Category membership is not unique (e.g. "Essen"), so the category is treated as a hidden state in an HMM.
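A hedged sketch of the class-based idea; the probabilities are made up and only illustrate that an ambiguous word like "Essen" (food vs. the city) makes the class sequence a hidden state sequence, which can then be decoded with Viterbi like any other HMM:

```python
# class-based bigram: P(w_i | w_{i-1}) is approximated by
#   P(w_i | c_i) * P(c_i | c_{i-1}),  where c_i is the category of word i
p_word_given_class = {("Essen", "FOOD"): 0.2, ("Essen", "CITY"): 0.05}   # made-up values
p_class_given_class = {("VERB", "FOOD"): 0.3, ("PREP", "CITY"): 0.4}     # made-up values

def p(word, cls, prev_cls):
    return (p_word_given_class.get((word, cls), 0.0)
            * p_class_given_class.get((prev_cls, cls), 0.0))

# "Essen" as food after a verb vs. as the city after a preposition
print(p("Essen", "FOOD", "VERB"), p("Essen", "CITY", "PREP"))
```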