====== Speech and Language Understanding (SLU) 5 ECTS Prüfung 2021-08-02 ======

===== Meta Information =====

  * Subject: Speech and Language Understanding (SLU), 5 ECTS, summer semester
  * Date: 2021-08-02
  * Type of exam: oral, via Zoom
  * Examiner: Prof. Andreas Maier
  * Grade: 1.7
  * Evaluation
    * The atmosphere was quite relaxed
    * Two additional doctoral students were there to learn how oral exams are conducted, but they did not ask any questions
    * I had to show the room 360 degrees and had to be alone
    * I was allowed to have paper and sketch things on it

===== Exam =====

  * Summarize the lecture
> I started with human speech production: a sketch of speech production (source, modulator, vocal tract) and an explanation, fundamental frequency + articulators --> Fant's source-filter model
  * How do we perceive sounds?
> Description of the ear: 16 Hz - 20 kHz, outer ear, eardrum, inner ear: cochlea; frequency perception through a fluid-filled channel with hair cells that swing and produce nervous signals
  * How are different frequencies perceived?
> The channel narrows and the hair cells sit at fixed positions, so the excitation point on the *basilar membrane* encodes the frequency --> frequency coding
  * How about phonetics and phonemes, can you tell me more about that?
> We have different articulators, which we can use to create different sounds. Phonemes fall into many categories: vowels, fricatives, plosives, nasals, laterals, ...
  * What are vowels? How would you identify them in a spectrogram?
> Voiced sounds, fundamental frequency F0 + harmonics; different vowels can be distinguished by their formants (vocal tract resonances)
  * How about plosives?
> For example the letters p, t, k: pulse-like (spread over the whole frequency spectrum, really short)
  * And fricatives?
> High-frequency noise, unvoiced, examples: f, s (From here on I am not sure about the order of the questions, nor do I remember the exact wording)
  * How do you compute MFCCs?
> We have a signal, which is air pressure over time. Not really useful on its own, so compute the STFT spectrogram. I mentioned windowing and was then asked...
  * Why would you use windowing at all?
> Computing one DFT for the whole signal is not useful, since we lose all temporal information.
  * But why is it important for speech recognition?
> Characterizing phonemes depends on the time resolution, for example plosives.
  * So we just cut a part of the signal out and do a DFT on that?
> Not optimal, since the rectangular window causes artefacts.
  * That's right. Multiplication in the frequency domain is equivalent to ... ?
> ..., (convolution in the time domain)
  * How big should the windows be?
> First said 10 ms is typical, then corrected to a 25 ms window size with a 10 ms hop size.
  * So are phoneme durations longer or shorter than our windows?
> Again, my first answer was wrong.
  * Think about phonemes like plosives and how they look.
> We need shorter windows to locate the plosives correctly. (Correct answer)
  * So you have the STFT spectrum, what next?
> Kind of explained Mel filterbanks and that they are perceptually motivated, but the Prof. was not really happy; it was not what he wanted to hear.
  * What is the cepstrum?
> Spectrum of the spectrum, apply an inverse Fourier transform. (Professor not OK with that)
> I recapped the whole MFCC procedure: STFT --> spectrogram --> Mel filterbanks --> log (compression of the energy scale) --> cosine transform (source-filter separation) --> Mel-cepstrum coefficients + energy + deltas (a sketch of this pipeline is given below)
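Not something I had to show in the exam, but since the whole pipeline came up, here is a minimal numpy sketch of the MFCC computation as recapped above. The 512-point FFT, 24 Mel bands and 13 coefficients are illustrative values of my own, not from the lecture, and the energy feature and deltas are omitted.

<code python>
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, win=0.025, hop=0.010, n_mels=24, n_ceps=13):
    """STFT -> Mel filterbank -> log -> DCT, as recapped above (illustrative parameters)."""
    win_len, hop_len = int(win * sr), int(hop * sr)
    n_fft = 512

    # Frame the signal (25 ms windows, 10 ms hop) and apply a Hamming window
    # to reduce the artefacts of a rectangular window.
    n_frames = 1 + (len(signal) - win_len) // hop_len
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win_len)

    # Power spectrogram (squared STFT magnitude).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular Mel filterbank: filters equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energies = power @ fbank.T

    # Log compresses the energy scale; the cosine transform (DCT) decorrelates
    # the channels and separates source and filter; keep the first coefficients.
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Usage with a placeholder 1-second signal: returns an (n_frames, 13) array.
# feats = mfcc(np.random.randn(16000))
</code>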
  * (About 20 minutes in, topic switch) Please explain HMMs
> First explained them informally, hidden states and observations of sequences, then gave the formal definition.
  * What is Viterbi used for?
> An efficient algorithm for finding the most likely state sequence, i.e. Problem 2 in HMMs.
  * How would you implement that?
> I stated the problem definition: initialization, then computing the forward probabilities, and finally backtracking.
  * Yes, that's the problem definition, but how would you implement that?
> ... (Some sketch was expected, but I did not know what; my guess at what it could look like is appended at the end of this protocol)
  * Alright, let's move to language models. We learned about several of them, choose your favorite.
> N-grams, what they mean, and that utterances are often context dependent.
  * What are the problems with high-order N-grams?
> If a sequence is not represented in the training data, then its probability is 0. Touched on Jeffreys smoothing and its problems, as well as on backoff strategies (see the small example at the end of this protocol).
  * Last question, about categorization
> Similar words are grouped together to reduce the number of possible word sequences. Categories are not unique (e.g. "Essen"), so treat the category as a hidden state in an HMM.
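Regarding the Viterbi implementation question: I still do not know exactly what sketch was expected, but something along these lines would probably have worked, i.e. initialization, recursion with a max over predecessors, and backtracking. This is a minimal numpy version with a made-up toy model; variable names are my own, not from the lecture.

<code python>
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence of an HMM (pi: initial probabilities,
    A: transition matrix, B: emission matrix, obs: observation indices)."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))            # best log-probability ending in state j at time t
    psi = np.zeros((T, n_states), dtype=int)   # backpointers to the best predecessor

    # Work in log space to avoid underflow for long sequences.
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    # Initialization
    delta[0] = log_pi + log_B[:, obs[0]]

    # Recursion: for each time step keep only the best predecessor per state.
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_B[:, obs[t]]

    # Backtracking: follow the stored backpointers from the best final state.
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1][states[t + 1]]
    return states

# Toy example: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 0, 1]))
</code>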
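And for the language-model part, a tiny illustration of the zero-probability problem of maximum-likelihood N-grams, with add-1/2 ("Jeffreys") smoothing and a simple backoff to unigrams. The toy corpus and the backoff weight are made up for illustration.

<code python>
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)   # vocabulary size
N = len(corpus)     # number of tokens

def p_bigram_mle(w, prev):
    """Maximum-likelihood bigram probability: 0 for unseen word pairs."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_bigram_jeffreys(w, prev):
    """One common additive smoothing: add 1/2 to every count over the vocabulary."""
    return (bigrams[(prev, w)] + 0.5) / (unigrams[prev] + 0.5 * V)

def p_bigram_backoff(w, prev, alpha=0.4):
    """Back off to the unigram probability when the bigram was never seen."""
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / unigrams[prev]
    return alpha * unigrams[w] / N

print(p_bigram_mle("dog", "the"))        # seen bigram
print(p_bigram_mle("cat", "dog"))        # unseen -> 0, the N-gram problem
print(p_bigram_jeffreys("cat", "dog"))   # smoothing gives a small nonzero value
print(p_bigram_backoff("cat", "dog"))    # backoff uses the unigram probability instead
</code>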