How to Train and Deploy a Speech Model on Arducam Pico4ML TinyML Dev Kit (RP2040) Using Edge Impulse
This article will teach you how to train and deploy a Machine Learning model that detects the words Yes and No on a Raspberry Pi Pico, a development board built around the RP2040 microcontroller, which has a dual-core Arm Cortex-M0+ processor.
The specific board that I will be using is the Arducam Pico4ML with BLE (Bluetooth Low Energy). This board extends the Raspberry Pi Pico by adding a microphone, camera, and screen to the board. This makes it easy for anyone unfamiliar with soldering to get started with the development board. This board also has an inertial measurement unit (IMU).
Because of the range of knowledge required to go from training a speech model all the way to deploying an embedded application, I will be utilizing the Edge Impulse platform for training the model. This approach will skip the TinyML learning curve for now.
Edge Impulse reduces the task to the following steps:
Designing an impulse (transforming data and training a model)
Deploying the model
I already wrote an article about setting up the Arducam Pico4ML (Raspberry Pi Pico) for development on macOS. Please refer to that article if your board and SDK are not yet configured.
First, open a new browser window and navigate to https://www.edgeimpulse.com/
If you don't have an account, choose Sign Up; otherwise, choose Login.
The process of signing up includes creating a new project.
Training Data Set
Every ML task requires data. For this article, we will be making use of the Speech Commands dataset, which was released to the public by Google AI. This dataset contains voice recordings in .wav format, which is the format we need the audio to be in.
Proceed with the following steps to get your training data into your project in Edge Impulse.
Download the dataset and expand it.
Return to Edge Impulse and choose "Data acquisition" on the menu to the left.
Click on Upload Data
Please use the following information to fill in the form:
Select files: click on the button, browse to where you stored your files, and choose some files (either yes or no). You can select multiple files at the same time.
Upload into category: to keep things simple, choose the option to automatically split between training and testing.
Label: choose Enter label and provide a label, either Yes or No depending on the files that you are uploading
You will need to repeat this process for the next class. This means that if you uploaded files for Yes, repeat the same process for No.
It is important to avoid using a binary classifier. A binary classifier only knows the words Yes and No, and for every sound it hears it reports the probability of each. Now, what happens if neither of those words was spoken? A binary classifier would still be forced to choose one of them.
To avoid this, also upload samples of Unknown and Noise. This will train a classifier that identifies four classes.
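To see why a two-class model is forced to choose, it helps to look at the softmax step that typically ends a classification network. The following is a minimal sketch, not code generated by Edge Impulse; the function name and the example values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Converts raw network outputs (logits) into probabilities that sum to 1.
// Illustrative only: the name and signature are assumptions, not an
// Edge Impulse or TFLite Micro API.
std::vector<float> Softmax(const std::vector<float>& logits) {
  assert(!logits.empty());
  // Subtract the largest logit first so that exp() cannot overflow.
  float max_logit = logits[0];
  for (float v : logits) max_logit = std::fmax(max_logit, v);

  std::vector<float> probs(logits.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  // Normalize so the probabilities always add up to 1.
  for (float& p : probs) p /= sum;
  return probs;
}
```

With only two classes, equal logits such as {0.1, 0.1} still produce 0.5 for Yes and 0.5 for No, because the whole probability mass has to land on those two words. With four classes, Unknown and Noise can absorb that mass instead.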
When you are done, the screen will look similar to the following.
The middle of the screen has a table with collected data. There are four icons that let you do different things such as filter what is shown on the table, upload additional data, select files, or show a detailed view.
You can pick a file from the table and it will show up on the right side, in the section titled "Raw Data". The image below shows an example of filtering and reviewing the raw data.
Note that the sound files are called raw data. That is because they can't be used to train a model. First, they need to be processed. All of this happens during the Impulse Design.
Click on Impulse Design on the left navigation menu. That will take you to the impulse design screen with only the red component (Time Series Data) populated. This is automatically linked to your training data. More importantly, the input pipeline is automatically created for you.
You might want to maintain the default suggestions on the red block. Here is a quick explanation of what is going on:
Window size: 1000ms. This treats the sound in chunks of 1 second.
Window increase: 100ms. This is the step size, the amount by which the window slides forward each time (0.1 seconds). The window size and increase are useful for the next step, the MFCC.
Frequency (Hz): 16000. This processes the audio at a sample rate of 16kHz, which is a common rate for speech and matches the Speech Commands recordings.
Zero-pad data: Set to true (checked). If this is unchecked then any samples that are less than 1 second in length will be discarded.
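The settings above translate into simple sample arithmetic. The sketch below shows one way to reason about them; the function names are hypothetical and Edge Impulse's own implementation may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of audio samples covered by a duration at a given sample rate.
constexpr std::size_t SamplesFor(std::size_t millis,
                                 std::size_t sample_rate_hz) {
  return millis * sample_rate_hz / 1000;
}

// Appends zeros to a clip that is shorter than one window, so that it can
// still be used instead of being discarded.
std::vector<float> ZeroPad(std::vector<float> clip,
                           std::size_t window_samples) {
  if (clip.size() < window_samples) clip.resize(window_samples, 0.0f);
  return clip;
}

// Counts the windows produced when a window of `window_samples` slides
// across the clip in steps of `stride_samples`.
std::size_t CountWindows(std::size_t clip_samples, std::size_t window_samples,
                         std::size_t stride_samples) {
  assert(stride_samples > 0);
  if (clip_samples < window_samples) return 0;  // too short without padding
  return (clip_samples - window_samples) / stride_samples + 1;
}
```

At 16kHz a 1000ms window holds 16000 samples and a 100ms increase is 1600 samples, so a 2-second recording would yield 11 overlapping windows under these assumptions.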
Click on "Add a processing block". You will be presented with a list of options as shown in the following image.
You will see that there are a number of options that suit various applications. We are working with human voices so the right choice is Audio (MFCC). Click on the Add button for this choice.
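MFCC stands for Mel Frequency Cepstral Coefficients. Without getting technical, the MFCC block performs feature extraction: it turns each window of audio into a compact set of numbers that describes the sound on a frequency scale modeled on human hearing. These numbers are what the neural network is trained on.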
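The "Mel" part refers to the mel scale, which stretches low frequencies and compresses high ones. The sketch below uses one widely used conversion formula; it is only one ingredient of an MFCC pipeline (which also involves a spectrum, a filterbank, a logarithm, and a cosine transform), and it is an assumption that Edge Impulse uses this exact variant.

```cpp
#include <cassert>
#include <cmath>

// Converts a frequency in Hz to mels using a common formula:
//   mel = 2595 * log10(1 + hz / 700)
double HzToMel(double hz) {
  assert(hz >= 0.0);
  return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// Inverse conversion, from mels back to Hz.
double MelToHz(double mel) {
  return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}
```

Under this formula the step from 0 to 1000 Hz spans far more mels than the step from 7000 to 8000 Hz, which mirrors the fact that most of what distinguishes spoken words sits in the lower frequencies.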
Next, back on the impulse design screen, click on "Add a learning block". You will be presented with two options as shown in the following image.
Click on the Add button next to classification. The system will generate a network architecture to get you started. At this point, you should see the appropriate number of features in the green output features box.
Click on Save Impulse.
Next, go back to the navigation menu and click on MFCC to inspect the MFCC block. Click on Generate features.
Afterward, you will see the MFCC parameters, which you can adjust if you would like. For now, just leave the defaults.
Next, click on "NN Classifier" on the navigation menu, still within Impulse Design. You should see the default network architecture screen.
You will see from the image above that the network is represented as a bunch of layers. If you are familiar with neural networks then know that the system lets you edit a number of features. The view above is a simple one. There is an expert view that lets you edit the code as shown below.
You may edit the network if you so desire.
Click on "Start training" when you are ready, and wait for a while for the training to complete. You can set options for how long you want to train, but just use the defaults for starters.
When the training is complete you will see the confusion matrix as well as a feature explorer that shows how the training data is clustered. Each item in the set is color-coded so you can get a sense of what is affecting your model quality. Don't worry if none of it makes sense at the moment.
The confusion matrix shows the model performance on the validation dataset. For each class, you can see how many items were correctly classified and how many were wrongly classified.
The diagonal with the green boxes shows the percentage of items in each class that were correctly classified. For example, No was correctly classified 84.6% of the time.
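The percentages on the diagonal can be reproduced from raw counts. The following sketch assumes that rows hold the true class and columns hold the predicted class; the type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumClasses = 4;

// counts[t][p] = number of samples whose true class is t and whose
// predicted class is p (an assumed layout for this illustration).
using ConfusionMatrix = std::array<std::array<int, kNumClasses>, kNumClasses>;

// Percentage of the samples of `true_class` that were classified correctly,
// which is the value on the diagonal of that row.
double PercentCorrect(const ConfusionMatrix& counts, std::size_t true_class) {
  assert(true_class < kNumClasses);
  int row_total = 0;
  for (int c : counts[true_class]) row_total += c;
  if (row_total == 0) return 0.0;  // no samples of this class
  return 100.0 * counts[true_class][true_class] / row_total;
}
```

As a made-up example, a class with 52 validation samples of which 44 were predicted correctly would show roughly 84.6% on the diagonal.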
At the bottom of the screen is a section for on-device performance.
This section is extremely important in understanding how much space the model will take on the microcontroller.
Follow the steps below to create a deployment folder.
Click on Deployment in the navigation menu to the left of the Edge Impulse screen.
Choose C++ library under the Create library section.
Scroll down and optionally select an optimization.
Click on Build.
You will get a compressed file when the build process completes. You will need to download this file to see the contents.
Before we transplant these files into our embedded application let's take a look at the structure of one embedded application.
Hello World TFLite Micro
The "Hello World" example that gets bundled with the Arducam Pico4ML features an ML model that you can use for inference using TFLite Micro (TinyML). To get started, navigate to the examples folder and open up the hello_world application folder. The file you are interested in is the main_functions.cpp file.
The code includes some files that are specific to TFLite from lines 21 through to 25. There are three other lines to pay attention to.
Line 18: Constants
This file defines the information that is used for quantization. Quantization makes it possible to export a smaller model by converting floating-point numbers into integers. In order to carry out inference correctly, it's important to know the range of numbers on which the model was trained.
Line 19: Model
This file contains the model definition. TensorFlow Lite stores a model as a FlatBuffer, and for a microcontroller that FlatBuffer is embedded in the source code as a C array. There are two entries here: an array of bytes that contains the serialized model, including its structure and weights (g_model), and an integer (g_model_len) that contains the number of bytes in the array.
Line 20: Output Handler
This file processes the result of the inference. Recall that you are dealing with an ML model and for that reason, you will need to do something after the model is done running inference. The output handler receives the values produced by the inference step and reports them, for example by printing them or showing them on the screen.
Back in the main_functions file, let's take a look at a few more lines.
Line 37: Tensor Arena
The tensor arena is a block of memory that is reserved up front for everything TFLite Micro needs at run time, including the input, output, and intermediate tensors.
Line 64: Micro Interpreter
The interpreter does the actual work of model inference. It exposes pointers to the input and output tensors. You can see this happen on lines 76 and 77.
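To make the idea of a tensor arena more concrete, here is a deliberately simplified arena that hands out pieces of one statically reserved buffer. It is an illustration of the concept only, not TFLite Micro's allocator; the class name, the alignment, and the arena size are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A simplified fixed-size arena: memory is reserved once and handed out in
// pieces, so no heap allocation happens while the model runs.
class SimpleArena {
 public:
  SimpleArena(std::uint8_t* buffer, std::size_t size)
      : buffer_(buffer), size_(size), used_(0) {}

  // Returns `bytes` of memory whose offset is a multiple of `alignment`,
  // or nullptr if the arena is too small for the request.
  void* Allocate(std::size_t bytes, std::size_t alignment = 16) {
    assert(alignment > 0);
    std::size_t start = (used_ + alignment - 1) / alignment * alignment;
    if (start > size_ || bytes > size_ - start) return nullptr;
    used_ = start + bytes;
    return buffer_ + start;
  }

  std::size_t used() const { return used_; }

 private:
  std::uint8_t* buffer_;
  std::size_t size_;
  std::size_t used_;
};

// Hypothetical size; the real examples choose a value that suits their model.
constexpr std::size_t kArenaSize = 2000;
alignas(16) std::uint8_t arena_storage[kArenaSize];
```

The practical consequence is the same as with the real tensor arena: if the reserved buffer is too small for the model's tensors, allocation fails and the array has to be enlarged before the application can run.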
The actual model inference takes place in the loop function.
Within the loop function, the input is quantized on line 94 and written to the input tensor on line 96. The inference happens on line 99. The quantized output is obtained on line 107.
That output is then converted to floating-point on line 109.
The output is then handled on line 113.
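The quantize and dequantize steps in the loop follow TensorFlow Lite's scheme, in which a real value equals scale * (quantized value - zero point). The sketch below shows that arithmetic on its own. The struct, the function names, and the parameter values are hypothetical; in a real application the scale and zero point are read from the model's input and output tensors, and the example code may truncate where this sketch rounds.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Parameters that relate real values to 8-bit integers:
//   real_value = scale * (quantized_value - zero_point)
struct QuantParams {
  float scale;
  int zero_point;
};

// Converts a floating-point value to int8, clamping to the int8 range.
std::int8_t Quantize(float value, const QuantParams& q) {
  assert(q.scale > 0.0f);
  long v = std::lround(value / q.scale) + q.zero_point;
  if (v < -128) v = -128;
  if (v > 127) v = 127;
  return static_cast<std::int8_t>(v);
}

// Converts an int8 value back to floating point.
float Dequantize(std::int8_t value, const QuantParams& q) {
  return (value - q.zero_point) * q.scale;
}
```

Because only 256 integer values are available, a value outside the range the model was trained on is clamped, which is why the range matters for correct inference.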
With this knowledge, you are now ready to transplant your model. There is one more thing, though: we need to capture the sound from the microphone.
Transplant Into Micro Speech
There is another example for the Pico4ML called micro_speech. This example adds two more things to what we have seen previously:
audio_provider: this takes care of audio capture
LCD_st7735: this provides facilities for writing to the on-board LCD screen.
You should have already built the sample code for Pico4ML and flashed your MCU.
There is a folder called micro_features within micro_speech. This contains code that is specific to the model.
Open the file called micro_model_settings.cpp and observe the array called kCategoryLabels. It shows that this model was originally trained to classify four words in this sequence:
Your model might have the words you are identifying in a different sequence.
Check inside the archive you downloaded from Edge Impulse.
Open the folder called model-parameters.
Open the file called model_variables.h. This file will contain the words that your model is trained to classify, along with the sequence.
Note that the format of this file is completely different from that of the Pico4ML examples.
What you are interested in is the content of the variable on line 28. In my example, the sequence is as shown below:
You may proceed to replace the content of kCategoryLabels with the content of ei_classifier_inferencing_categories. If your model does not have the same number of labels as the original, be sure to update kCategoryCount in the header file.
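The order of the labels matters because the model only outputs a list of scores, and the label array is what gives each position a name. The sketch below is a standalone illustration with a hypothetical label order and a hypothetical helper function; copy the real order from your own model_variables.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical label order, for illustration only.
constexpr int kCategoryCount = 4;
const char* const kCategoryLabels[] = {"no", "noise", "unknown", "yes"};

// Fails at compile time if the count and the label list get out of sync.
static_assert(sizeof(kCategoryLabels) / sizeof(kCategoryLabels[0]) ==
                  kCategoryCount,
              "kCategoryCount must match the number of labels");

// Returns the label at the position of the highest score.
const char* TopLabel(const float* scores, int count) {
  assert(scores != nullptr && count == kCategoryCount);
  int best = 0;
  for (int i = 1; i < count; ++i) {
    if (scores[i] > scores[best]) best = i;
  }
  return kCategoryLabels[best];
}
```

If the labels were listed in a different order from the one used in training, the same scores would be reported under the wrong word, even though the model itself is working correctly.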
Next, within the micro_features folder in the Pico4ML examples, open the file model.cpp. Scroll to the bottom and observe the value of g_model_len.
Next, go back to the archive which you downloaded from Edge Impulse.
Navigate to the folder called tflite-model.
Open the file called tflite-trained.h
Scroll to the bottom of the file.
My example code shows that the model length, trained_tflite_len, is 8040 bytes. This is smaller than the value of 18712 contained in g_model_len in the model.cpp (in the Pico4ML examples). You need to do two things here.
Copy the value of trained_tflite_len (in tflite-trained.h) and use it to replace the value of g_model_len (in model.cpp)
Copy the value of trained_tflite (in tflite-trained.h) and use it to replace the value of g_model (in model.cpp)
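After these two replacements, the array and its length must agree with each other. The following sketch shows the general shape of such a file with a few placeholder bytes standing in for the real model, plus a small, hypothetical sanity check. It relies on the fact that a TensorFlow Lite FlatBuffer carries the file identifier "TFL3" in bytes 4 to 7; the alignment value and the helper name are assumptions.

```cpp
#include <cassert>
#include <cstring>

// Placeholder bytes only: a real model is thousands of bytes long and must
// be copied from the file generated for your own project.
alignas(8) const unsigned char g_model[] = {
    0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
};

// Length of the array in bytes. Deriving it with sizeof is one way to keep
// the two values from drifting apart after a copy and paste.
const int g_model_len = sizeof(g_model);

// Returns true if the buffer is long enough and carries the TensorFlow Lite
// FlatBuffer identifier "TFL3" at byte offset 4.
bool HasTfliteIdentifier(const unsigned char* data, int len) {
  assert(data != nullptr);
  return len >= 8 && std::memcmp(data + 4, "TFL3", 4) == 0;
}
```

A check like this only catches gross mistakes, such as pasting the wrong array or truncating it at the start; it does not validate the rest of the model.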
The beautiful thing about my test model is that it is smaller than the model that shipped with the examples, so I expect the tensor arena size to be sufficient.
Go ahead and build the code (don't forget to save your changes). If you didn't make any mistakes then the code should build successfully.
If that worked, you may proceed to flash your firmware onto the microcontroller.
You may take a look at the audio provider to get a better sense of how the audio capture works, and at the command responder to get a sense of how the output is written to the screen. You may also edit what gets written to the LCD screen.