
Exploring Hugging Face: Audio Classification

Audio Classification Using Models From Hugging Face

Photo by Kelly Sikkema on Unsplash

The audio classification task in Hugging Face involves categorizing audio data into predefined classes or labels.

Audio files are converted into a format (such as waveforms or spectrograms) that the chosen model can process.

A waveform is a visual representation of an audio signal’s amplitude over time. It shows how the amplitude of the sound wave changes. In audio processing, waveforms are essential for analyzing the characteristics of the sound, such as its loudness, pitch, and duration.

import librosa

audio_path = 'speech.wav'
waveform, sample_rate = librosa.load(audio_path, sr=None)

We can use the librosa package for this transformation. The load function from the librosa library reads the audio file specified by audio_path.

waveform is a NumPy array that represents the audio signal’s amplitude over time. It’s a sequence of floating-point numbers that represent the sound wave.

sample_rate is the number of samples of audio carried per second, measured in Hz (hertz). It defines the number of data points used to represent each second of audio. The sr parameter specifies the sample rate. By setting sr=None, we tell librosa to use the original sample rate of the audio file, which means it will not resample the audio and will preserve its original quality.
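To make this concrete: since sample_rate counts samples per second, the clip’s duration is the number of samples divided by the sample rate, and sample i occurs at time i / sample_rate. A small sketch with a hypothetical 2-second, 16 kHz clip (the numbers are illustrative, not from the file used in this article):

```python
import numpy as np

# A hypothetical 2-second clip at 16 kHz stands in for the loaded file.
sample_rate = 16000
waveform = np.zeros(2 * sample_rate, dtype=np.float32)

# Duration in seconds = number of samples / samples per second.
duration = len(waveform) / sample_rate

# Sample i occurs at time i / sample_rate seconds.
time_axis = np.arange(len(waveform)) / sample_rate

print(duration)  # 2.0
```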

import matplotlib.pyplot as plt

time_axis = librosa.times_like(waveform, sr=sample_rate)

plt.figure(figsize=(10, 4))
plt.plot(time_axis, waveform)
plt.title('Waveform of Audio')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.show()
Waveform. Image by the author.

Now, let’s use the MIT/ast-finetuned-audioset-10-10-0.4593 model from Hugging Face.

from transformers import pipeline

pipe = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")

results = pipe({"raw": waveform, "sampling_rate": sample_rate})

print(results)

"""
[{'score': 0.7925717830657959, 'label': 'Speech'},
{'score': 0.03275119513273239, 'label': 'Speech synthesizer'},
{'score': 0.02389572374522686, 'label': 'Narration, monologue'},
{'score': 0.019056597724556923, 'label': 'Sound effect'},
{'score': 0.01026979461312294, 'label': 'Female speech, woman speaking'}]
"""

The model assigns the highest score to “Speech,” indicating that it believes the audio is most likely to be speech. The other labels are the model’s next best guesses, with considerably lower confidence scores.
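Working with the returned list is plain Python. For instance, to pull out the top prediction (using the scores printed above; max is used rather than assuming the list is sorted):

```python
# `results` as returned by the pipeline, copied from the output above.
results = [
    {'score': 0.7925717830657959, 'label': 'Speech'},
    {'score': 0.03275119513273239, 'label': 'Speech synthesizer'},
    {'score': 0.02389572374522686, 'label': 'Narration, monologue'},
    {'score': 0.019056597724556923, 'label': 'Sound effect'},
    {'score': 0.01026979461312294, 'label': 'Female speech, woman speaking'},
]

# Take the entry with the highest score rather than relying on ordering.
top = max(results, key=lambda r: r['score'])
print(top['label'])  # Speech
```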

Let’s use another model:


pipe = pipeline("audio-classification", model="superb/wav2vec2-base-superb-sid")

results = pipe({"raw": waveform, "sampling_rate": sample_rate})

print(results)

"""
[{'score': 0.47217562794685364, 'label': 'id10652'},
{'score': 0.23792167007923126, 'label': 'id10335'},
{'score': 0.10524415224790573, 'label': 'id10856'},
{'score': 0.08934732526540756, 'label': 'id10651'},
{'score': 0.022524842992424965, 'label': 'id10396'}]
"""

Each label corresponds to a unique speaker identifier (ID). The score represents the model’s confidence that the audio segment belongs to the speaker associated with that ID. You can find the ID mappings in the documentation of this model.

Another model predicts the emotion in the audio file:


pipe = pipeline("audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")

results = pipe({"raw": waveform, "sampling_rate": sample_rate})

print(results)

"""
[{'score': 0.13225293159484863, 'label': 'disgust'},
{'score': 0.12851978838443756, 'label': 'neutral'},
{'score': 0.12753769755363464, 'label': 'calm'},
{'score': 0.1254863142967224, 'label': 'angry'},
{'score': 0.12439820170402527, 'label': 'fearful'}]
"""

Sources

https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593

https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english

https://huggingface.co/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition

https://huggingface.co/superb/wav2vec2-base-superb-sid

via Artificial Intelligence on Medium https://ift.tt/FwUSseI

March 17, 2024 at 12:48AM
