In its first layer, the neural network converts the input into a mel spectrogram. Convolutional and LSTM layers then extract features from the signal. In addition to the commands, the network distinguishes the categories “unknown word” and “background noise”.
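The first step described above, turning audio into a mel spectrogram, can be sketched with NumPy as follows. This is a minimal illustration, not the project's actual code: the sample rate (16 kHz), FFT size (512), hop length (160) and number of mel bands (40) are assumed values, and the command names in `LABELS` are hypothetical placeholders. Only the categories “unknown word” and “background noise” come from the text. The convolutional and LSTM layers themselves are not shown.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 variant)."""
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Log-mel spectrogram of a 1-D signal, shape (n_frames, n_mels).

    All default parameters are assumptions chosen for illustration.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)  # small offset avoids log(0) on silent frames


# Hypothetical label set: the command names are placeholders.
LABELS = ["command_a", "command_b", "unknown word", "background noise"]


def decode(probabilities):
    """Map class probabilities to a command, or None for a non-command category."""
    label = LABELS[int(np.argmax(probabilities))]
    return None if label in ("unknown word", "background noise") else label
```

With these assumed settings, one second of audio yields 97 frames of 40 mel bands each. Treating “unknown word” and “background noise” as ordinary output classes lets the device ignore such input instead of forcing it onto the nearest command.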
Overall, voice commands are recognized reliably and quickly despite the weak hardware. Feedback is returned to the operator via an LED ring and voice output. One remaining problem is the microphone, whose high background noise level limits the range. Although a very large amount of training data was recorded and generated, many more recordings of different people are required to optimize recognition. For comparison, similar open-source projects work with about 100,000 samples.
Maximilian Thiel, Zeynep Aydeniz, Michael Kleiner, Maximilian Spiegel, Daniel Zettler
Prof. Dr. Rieck, Kempten University (Project management)
Summer semester 2019, Faculty of Mechanical Engineering