---
title: ChainsawDetector
emoji: 🌳
colorFrom: gray
colorTo: green
sdk: docker
pinned: false
---
# ChainsawDetector

## Model Description
This model is proposed as a submission to the Frugal AI Challenge 2024, specifically for the audio binary classification task: detecting chainsaw sounds among environmental noise.
## Intended Use
- Primary intended uses: Non-commercial chainsaw detection from audio recordings
- Primary intended users: Researchers and developers participating in the Frugal AI Challenge
- Out-of-scope use cases: Not intended for production use or real-world classification tasks
## Training

### Data
- The model was mainly trained on the rfcx/frugalai dataset:
  - License: CC BY-NC 4.0
  - Size: 35.3k audio samples of 3 seconds max
  - 2 classes (chainsaw or environment)
  - Validation set (15.1k samples) and final test set are provided by the same source
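A minimal sketch of loading this dataset with the Hugging Face datasets library; the split and field names below are assumptions, not documented facts from this card:

```python
from datasets import load_dataset

# Load the challenge dataset from the Hub (split/field names are assumptions)
ds = load_dataset("rfcx/frugalai")
sample = ds["train"][0]
print(sample["audio"]["sampling_rate"], sample["label"])
```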
To improve performance, additional datasets have been explored:
- Diverse audio recordings fetched from freesound:
  - Various open licenses (Creative Commons Attribution, Non-commercial); see datasources/ for complete attributions
  - Chainsaw and environmental noise (forest, rain)
  - After curation, 2425 chainsaw and 2646 environment samples of 3 seconds
- A second, smaller dataset of environmental sounds:
  - License: CC BY-NC 3.0
  - Initially an environmental sound dataset of 50 classes
  - Selected only chainsaw (as class 0), plus crickets, birds, and wind (as class 1)
  - Only 20 samples of 5 seconds for each class
  - After mixing and cropping, 240 samples for class 0 and 240 for class 1 (a derisory amount, but it was interesting to process)
### Labels
- Chainsaw
- Environment
### Preprocessing

The initial raw audio arrays are first downsampled to 4 kHz.
Indeed, according to [1], "chainsaw harmonics are visible only up to 1kHz". They can extend higher, but in practice are "often masked by background noise", so it was decided to keep frequencies up to 2 kHz to feed the model. Since the next step (the Fourier transform) requires a sampling rate of at least twice the maximum frequency to keep (the Nyquist–Shannon sampling theorem), downsampling to 4 kHz makes sense. A low-pass filter is applied before downsampling to avoid aliasing.
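A minimal sketch of this step, assuming librosa (the actual code may use a different resampler); librosa's resampler applies the anti-aliasing low-pass filter internally:

```python
import librosa

TARGET_SR = 4000  # keeps content below the 2 kHz Nyquist limit

def downsample(waveform, orig_sr):
    # resample() low-pass filters before decimating, avoiding aliasing
    return librosa.resample(waveform, orig_sr=orig_sr, target_sr=TARGET_SR)
```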
This has two major advantages: it reduces the amount of "useless" data to process (in the sense that it contains no valuable information for identifying chainsaws), leading to faster processing and convergence; and it filters out an important part of possible noise (many high-frequency bird songs are present in the recordings).

The Short-Time Fourier Transform (STFT) is used to extract a spectrogram.
An n_fft parameter of 1024, a window length of about 0.25 seconds, and a hop length close to 0.05 s lead to a spectrogram of size (513, 60) in the frequency and time dimensions respectively for the 4 kHz, 3 s inputs. These quite wide windows (compared to speech processing, for example) roughly summarize the information without producing too much data (remember, frugality). More generally in this work, most decisions favored simplicity while still allowing decent performance.

Per-Channel Energy Normalization (PCEN) [2] is then applied. It is described as "a computationally efficient frontend for robust detection and classification of acoustic events", which is exactly what is needed here. It should bring better performance than traditional MFCCs in this case.
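A sketch of the feature extraction under these settings, again assuming librosa; the PCEN parameters below are library defaults, not necessarily the values used in training:

```python
import numpy as np
import librosa

SR, N_FFT, HOP = 4000, 1024, 200  # ~0.25 s windows, ~0.05 s hop

def extract_features(waveform):
    # Magnitude STFT: (513, ~60) for a 3 s clip at 4 kHz
    spec = np.abs(librosa.stft(waveform, n_fft=N_FFT, hop_length=HOP))
    # PCEN; the 2**31 rescaling is librosa's convention for float input
    return librosa.pcen(spec * (2**31), sr=SR, hop_length=HOP)
```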
The spectrograms are split along the time dimension into 6 chunks of time-length 10 (about half a second each, with context). These (513, 10)-sized chunks are given sequentially to the model. The idea is that a signal of any length can be chunked timewise and thus processed by the model. In this demonstration, only 3 s signals are used so they can be processed in batches, but in a real-world application this model architecture can process continuous real-time audio recordings.
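A minimal sketch of the chunking (the function name is illustrative):

```python
def chunk_spectrogram(spec, chunk_len=10):
    # spec: (513, T) PCEN spectrogram -> list of (513, 10) chunks,
    # fed one after another to the recurrent model
    n_chunks = spec.shape[1] // chunk_len
    return [spec[:, i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
```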
### Augmentation

During training only, different data augmentation techniques were applied (sketched below):
- Small time shifts: the spectrograms are randomly padded on the right.
- Random rectangular masks are applied to hide parts of the input data.
- A reasonable amount of Gaussian noise is added so the model will not memorize individual data points.
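A sketch of how these augmentations could be applied to a (513, 60) spectrogram; all names and magnitudes here are assumptions, not the repository's actual values:

```python
import numpy as np

rng = np.random.default_rng()

def augment(spec, max_shift=5, n_masks=2, noise_std=0.05):
    f_bins, t_bins = spec.shape
    # Small time shift: pad on the right, then crop back to the original
    # length (the exact cropping policy is an assumption)
    shift = rng.integers(0, max_shift + 1)
    spec = np.pad(spec, ((0, 0), (0, shift)))[:, -t_bins:]
    # Random rectangular masks hiding parts of the input
    for _ in range(n_masks):
        f0 = rng.integers(0, f_bins - 32)
        t0 = rng.integers(0, t_bins - 8)
        spec[f0:f0 + rng.integers(8, 32), t0:t0 + rng.integers(2, 8)] = 0.0
    # Moderate Gaussian noise so exact values cannot be memorized
    return spec + rng.normal(0.0, noise_std, spec.shape)
```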
### Details

The final model was trained for 20 epochs with a OneCycle learning-rate schedule peaking at 0.005.
A batch size of 16 was chosen to keep performance sufficient, as the model was trained entirely on a CPU (Intel® Core™ i5-1135G7)!
With AdamW as the optimizer, the first half of training ran in float32 with automatic mixed precision, and the last part fully in bfloat16.
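A hedged sketch of this setup, assuming PyTorch (the framework is not stated here, but the vocabulary suggests it); the placeholder model and synthetic loader only make the snippet self-contained:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder for the conv + LSTM network below
train_loader = [(torch.randn(16, 8), torch.randint(0, 2, (16, 1)).float())] * 100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-3, epochs=20, steps_per_epoch=len(train_loader))
criterion = nn.BCEWithLogitsLoss()

for epoch in range(20):
    for x, y in train_loader:
        optimizer.zero_grad()
        # First training phase: float32 weights under autocast,
        # which uses bfloat16 on CPU
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()
```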
The model converges quickly, although erratically due to data augmentation, then slowly improves from 90% accuracy until plateauing at 93%.
Training code can be found in the training/ directory (not used for inference; provided for information only).
## Performance

### Metrics
- Accuracy: 93.02% on the test split
- Precision: 93.51%
- Recall: 89.97%
- F-score: 91.71% (see the consistency check below)
- Environmental impact:
  - Emissions tracked: 5.19 gCO2eq
  - Energy consumption tracked: 14.07 Wh
- Mistakes:
  - False positives represent 38.35% of mistakes
  - False negatives represent 61.65% of mistakes
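As a quick consistency check, the F-score follows from the reported precision and recall:

```python
p, r = 0.9351, 0.8997
f1 = 2 * p * r / (p + r)
print(f"{f1:.4f}")  # 0.9171, matching the reported 91.71%
```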
The model tends to predict class 1 (environment) more often.
Possible explanations are:
- This class is slightly more represented in the training dataset.
- It corresponds to the default class (the LSTM initial state is biased towards it).
- Technically, every audio sample contains environmental noise; chainsaw occurrences sit on top of it.

Overall, this is not a bad thing, as false alarms can waste time in a real-world situation.
## Model Architecture

The model takes a sequence of chunks (as described above) as input and produces a single decision (0 or 1).
Three convolutional layers (with some max pooling) reduce each input chunk to a 2D tensor of 8 points across 16 channels; a fourth convolution then shrinks the channels, producing a vector of length 8. The first convolution is 2D, but the next ones are 1D, as the time dimension is quickly reduced to one. The vector thus summarizes the frequencies into 8 values.
These values are passed to an LSTM, which also receives an initial state of ones (environment by default).
Each chunk is then processed the same way: the same convolutional kernels, then the persistent LSTM updates its state.
At the end of the signal (here, after 6 chunks), a final dense layer takes the last LSTM state (8 values) and outputs a raw score. (ReLU activations are used in the hidden states, as well as after the convolutions, because they are efficient to compute.)
After a sigmoid activation and simple thresholding (1 if above 0.5, else 0), the final decision is produced.
In total, 1798 parameters are used.
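A hedged PyTorch sketch of an architecture matching this description. Kernel sizes, pooling, and channel widths are assumptions where the text leaves them open, so the parameter count will not land exactly on 1798:

```python
import torch
from torch import nn

class ChainsawDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv 1 is 2D and collapses the 10-frame time axis of each chunk;
        # convs 2-3 are 1D over frequency, ending at 16 channels x 8 points
        self.conv2d = nn.Conv2d(1, 4, kernel_size=(7, 10))
        self.conv1d = nn.Sequential(
            nn.Conv1d(4, 8, kernel_size=7), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(8, 16, kernel_size=7), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveMaxPool1d(8),          # -> (16, 8)
            nn.Conv1d(16, 1, kernel_size=1),  # conv 4 shrinks channels -> (1, 8)
            nn.ReLU(),
        )
        self.lstm = nn.LSTMCell(input_size=8, hidden_size=8)
        self.head = nn.Linear(8, 1)

    def forward(self, chunks):
        # chunks: (batch, n_chunks, 513, 10)
        b, n = chunks.shape[:2]
        # Initial state of ones biases the default towards "environment"
        h = torch.ones(b, 8, device=chunks.device)
        c = torch.ones(b, 8, device=chunks.device)
        for t in range(n):
            x = torch.relu(self.conv2d(chunks[:, t].unsqueeze(1)))  # (b, 4, F, 1)
            x = self.conv1d(x.squeeze(-1)).squeeze(1)               # (b, 8)
            h, c = self.lstm(x, (h, c))
        return torch.sigmoid(self.head(h)).squeeze(-1)  # P(class 1: environment)

model = ChainsawDetector()
probs = model(torch.randn(2, 6, 513, 10))  # two 3 s clips of 6 chunks each
decisions = (probs > 0.5).long()           # 1 = environment, 0 = chainsaw
```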
## Environmental Impact
Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference
- Energy consumption during inference
This tracking helps establish a baseline for the environmental impact of model deployment and inference.
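A minimal sketch of such tracking with CodeCarbon, wrapping an arbitrary inference workload (the stub function stands in for the real loop):

```python
from codecarbon import EmissionsTracker

def run_inference():
    pass  # stand-in for the real inference loop

tracker = EmissionsTracker()
tracker.start()
run_inference()
emissions_kg = tracker.stop()  # returns emissions in kgCO2eq
print(f"{emissions_kg * 1000:.2f} gCO2eq")
```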
## Limitations

- Not much time was spent on hyperparameter optimization; only the learning rate and a few different layer configurations (kernel sizes, number of layers) were explored. The main reason is that HPO is expensive in time and compute, but there are certainly improvements to be found if it is considered in more detail.
- There are trainable implementations of PCEN, which could be interesting but use more weights.
- More, and more diverse, data could be used to help the model distinguish chainsaws from any other kind of noise possible in a wild forest (and there are apparently a lot).
## Ethical Considerations

- Environmental impact is tracked to promote awareness of AI's carbon footprint.
- Advice from [3] was applied to reduce the model's size while keeping good performance.
- Illegal deforestation is bad. So is the legal kind, though.
## References
[1] N. Stefanakis, K. Psaroulakis, N. Simou and C. Astaras, "An Open-Access System for Long-Range Chainsaw Sound Detection", 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 2022, pp. 264-268, doi: 10.23919/EUSIPCO55093.2022.9909629.
[2] V. Lostanlen et al., "Per-Channel Energy Normalization: Why and How", in IEEE Signal Processing Letters, vol. 26, no. 1, pp. 39-43, Jan. 2019, doi: 10.1109/LSP.2018.2878620.
[3] G. Menghani, "Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better", ACM Computing Surveys, vol. 55, no. 12, pp. 1-37, 2023, doi: 10.1145/3578938.