Detecting Speech and Music in Audio Content

Iroro Orife, Chih-Wei Wu and Yun-Ning (Amy) Hung


When you enjoy the latest season of Stranger Things or Casa de Papel (Money Heist), have you ever wondered about the secrets to fantastic story-telling, besides the stunning visual presentation? From the violin melody accompanying a pivotal scene to the soaring orchestral arrangement and thunderous sound-effects propelling an edge-of-your-seat action sequence, the various components of the audio soundtrack combine to evoke the very essence of story-telling. To uncover the magic of audio soundtracks and further improve the sonic experience, we need a way to systematically examine the interaction of these components, typically categorized as dialogue, music and effects.

In this blog post, we'll introduce speech and music detection as an enabling technology for a variety of audio applications in Film & TV, as well as our speech and music activity detection (SMAD) system, which we recently published as a journal article in EURASIP Journal on Audio, Speech, and Music Processing.

Like semantic segmentation for audio, SMAD separately tracks the amount of speech and music in each frame of an audio file and is useful in content understanding tasks during the audio production and delivery lifecycle. The detailed temporal metadata SMAD provides about speech and music regions in a polyphonic audio mixture is a first step for structural audio segmentation, indexing and pre-processing audio for the following downstream tasks. Let's look at a few applications.

Practical use cases for speech & music activity

Audio dataset preparation

Speech & music activity is an important preprocessing step to prepare corpora for training. SMAD classifies & segments long-form audio for use in large corpora, such as

From “Audio Signal Classification” by David Gerhard

Dialogue analysis & processing

  • During encoding at Netflix, speech-gated loudness is computed for every audio master track and used for loudness normalization. Speech-activity metadata is thus a central part of accurate catalog-wide loudness management and an improved audio volume experience for Netflix members.
  • Similarly, algorithms for dialogue intelligibility, spoken-language-identification and speech-transcription are only applied to audio regions where there is measured speech.

Music information retrieval

  • There are a few studio use cases where music activity metadata is important, including quality-control (QC) and at-scale multimedia content analysis and tagging.
  • There are also inter-domain tasks like singer-identification and song lyrics transcription, which do not fit neatly into either speech or classical MIR tasks, but are useful for annotating musical passages with lyrics in closed captions and subtitles.
  • Conversely, where neither speech nor music activity is present, such audio regions are estimated to have content classified as noisy, environmental or sound-effects.
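
The region types described above can be derived directly from per-frame speech and music decisions. The following is a minimal sketch, not the SMAD implementation: it assumes one boolean flag per frame for each class and a frame rate of 5 frames per second, and the function and label names are illustrative.

```python
from itertools import groupby

FRAME_RATE_HZ = 5.0  # assumed frame rate; adjust to the detector in use


def label_frame(speech, music):
    """Map a pair of per-frame decisions to a region label (names are illustrative)."""
    if speech and music:
        return "speech+music"
    if speech:
        return "speech"
    if music:
        return "music"
    # Neither class active: noisy, environmental or sound-effects content.
    return "other"


def frames_to_segments(speech, music, frame_rate_hz=FRAME_RATE_HZ):
    """Merge runs of identical frame labels into (start_s, end_s, label) tuples."""
    labels = [label_frame(s, m) for s, m in zip(speech, music)]
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((start / frame_rate_hz, (start + length) / frame_rate_hz, label))
        start += length
    return segments
```

Downstream tools can then restrict themselves to the regions they care about, for example running transcription only on segments whose label contains "speech".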

Localization & Dubbing

Finally, there are post-production tasks which take advantage of accurate speech segmentation at the spoken utterance or sentence level, ahead of translation and dub-script generation. Likewise, authoring accessibility features like Audio Description (AD) involves music and speech segmentation. The AD narration is typically mixed in so as not to overlap with the primary dialogue, while music lyrics strongly tied to the plot of the story are sometimes referenced by AD creators, especially for translated AD.

A voice actor in the studio

Our Approach to Speech and Music Activity Detection

Although the application of deep learning methods has improved audio classification systems in recent years, this data-driven approach for SMAD requires large amounts of audio source material with audio-frame level speech and music activity labels. The collection of such fine-resolution labels is costly and labor intensive, and audio content often cannot be publicly shared due to copyright limitations. We address the challenge from a different angle.

Content, genre and languages

Instead of augmenting or synthesizing training data, we sample the large-scale data available in the Netflix catalog with noisy labels. In contrast to clean labels, which indicate precise start and end times for each speech/music region, noisy labels only provide approximate timing, which may impact SMAD classification performance. Nevertheless, noisy labels allow us to increase the scale of the dataset with minimal manual effort and potentially generalize better across different types of content.

Our dataset, which we introduced as TVSM (TV Speech and Music) in our publication, has a total of 1608 hours of professionally recorded and produced audio. TVSM is significantly larger than other SMAD datasets and contains both speech and music labels at the frame level. TVSM also contains overlapping music and speech labels, and both classes have a similar total duration.

Training examples were produced between 2016 and 2019, in 13 countries, with 60% of the titles originating in the USA. Content duration ranged from 10 minutes to over 1 hour, across the various genres listed below.

The dataset contains audio tracks in three different languages, namely English, Spanish, and Japanese. The language distribution is shown in the figure below. The name of the episode/TV show for each sample remains unpublished. However, each sample has both a show-ID and a season-ID to help identify the relationship between the samples. For instance, two samples from different seasons of the same show would share the same show ID and have different season IDs.

What constitutes music or speech?

To evaluate and benchmark our dataset, we manually labeled 20 audio tracks from various TV shows which do not overlap with our training data. One of the fundamental issues encountered during the annotation of our manually-labeled TVSM-test set was the definition of music and speech. The heavy usage of ambient sounds and sound effects blurs the boundaries between active music regions and non-music. Similarly, switches between conversational speech and singing voices in certain TV genres obscure where speech starts and music stops. Furthermore, must these two classes be mutually exclusive? To ensure label quality and consistency, and to avoid ambiguity, we converged on the following guidelines for differentiating music and speech:

  • Any music that is perceivable by the annotator at a comfortable playback volume should be annotated.
  • Since sung lyrics are often included in closed-captions or subtitles, human singing voices should all be annotated as both speech and music.
  • Ambient sound or sound effects without apparent melodic contours should not be annotated as music. Traditional phone bell, ringing, or buzzing without apparent melodic contours should not be annotated as music.
  • Filled pauses (uh, um, ah, er), backchannels (mhm, uh-huh), sighing, and screaming should not be annotated as speech.
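
The guidelines above amount to a multi-label scheme in which speech and music are not mutually exclusive. A small sketch of how they could be encoded as per-frame targets follows; the category names are hypothetical and were invented for this illustration, not taken from the TVSM annotation tooling.

```python
# Hypothetical annotation categories mapped to (speech, music) targets.
GUIDELINE_TARGETS = {
    "conversational_speech": (1, 0),
    "instrumental_music": (0, 1),
    "singing_voice": (1, 1),                # sung lyrics count as both classes
    "ambient_or_sound_effect": (0, 0),      # no apparent melodic contour
    "phone_ring_or_buzz": (0, 0),           # no apparent melodic contour
    "filled_pause_or_backchannel": (0, 0),  # uh, um, mhm, uh-huh
    "sigh_or_scream": (0, 0),
}


def frame_targets(active_categories):
    """Combine all categories active in one frame into a (speech, music) pair."""
    speech = 0
    music = 0
    for category in active_categories:
        s, m = GUIDELINE_TARGETS[category]
        speech = max(speech, s)
        music = max(music, m)
    return speech, music
```

Dialogue over a score, for example, yields `frame_targets({"conversational_speech", "instrumental_music"}) == (1, 1)`, the same target as a singing voice.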

Audio format and preprocessing

All audio files were originally delivered from the post-production studios in the standard 5.1 surround format at a 48 kHz sampling rate. We first normalize all files to an average loudness of −27 LKFS ± 2 LU dialog-gated, then downsample to 16 kHz before creating an ITU downmix.
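
As an illustration of the downmix step only, here is a minimal NumPy sketch. The post does not state the coefficients or the target layout, so this assumes a stereo fold-down with the center and surround channels attenuated by about 3 dB and the LFE channel dropped, as commonly attributed to ITU-R BS.775. The channel order (L, R, C, LFE, Ls, Rs) is also an assumption. Loudness normalization and resampling are left out because they need dedicated tooling.

```python
import numpy as np

# Assumed channel order of the 5.1 input: L, R, C, LFE, Ls, Rs.
ATT = 1.0 / np.sqrt(2.0)  # about -3 dB, applied to center and surrounds


def itu_downmix_stereo(x):
    """Fold a (num_samples, 6) array of 5.1 audio down to (num_samples, 2)."""
    x = np.asarray(x, dtype=float)
    if x.ndim != 2 or x.shape[1] != 6:
        raise ValueError("expected an array of shape (num_samples, 6)")
    l, r, c, _lfe, ls, rs = x.T  # LFE is discarded in this sketch
    left = l + ATT * c + ATT * ls
    right = r + ATT * c + ATT * rs
    return np.stack([left, right], axis=1)
```

The summed channels can exceed the input peak level, so a real pipeline would typically check for clipping or apply a gain after the fold-down.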

Model Architecture

Our modeling choices take advantage of both convolutional and recurrent architectures, which are known to work well on audio sequence classification tasks and are well supported by previous investigations. We adapted the SOTA convolutional recurrent neural network (CRNN) architecture to accommodate our requirements for input/output dimensionality and model complexity. The best model was a CRNN with three convolutional layers, followed by two bi-directional recurrent layers and one fully connected layer. The model has 832k trainable parameters and emits frame-level predictions for both speech and music with a temporal resolution of 5 frames per second.

For training, we leveraged our large and diverse catalog dataset with noisy labels, introduced above. Applying a random sampling strategy, each training sample is a 20 second segment obtained by randomly selecting an audio file and a corresponding starting timecode offset on the fly. All models in our experiments were trained by minimizing binary cross-entropy (BCE) loss.
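
The sampling strategy and loss described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the training code from the repository: the file identifiers and durations are hypothetical, files are chosen uniformly, and the loss is averaged over all frames and both classes.

```python
import math
import random

SEGMENT_SECONDS = 20.0


def sample_segment(durations, rng, segment_seconds=SEGMENT_SECONDS):
    """Pick a random file and a random start offset so the segment fits.

    durations: dict mapping a (hypothetical) file id to its length in seconds.
    Returns (file_id, start_s, end_s).
    """
    eligible = [f for f, d in durations.items() if d >= segment_seconds]
    if not eligible:
        raise ValueError("no file is long enough for one segment")
    file_id = rng.choice(eligible)
    start = rng.uniform(0.0, durations[file_id] - segment_seconds)
    return file_id, start, start + segment_seconds


def bce_loss(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over frames and classes.

    targets, probs: sequences of per-frame (speech, music) pairs.
    """
    total = 0.0
    count = 0
    for t_frame, p_frame in zip(targets, probs):
        for t, p in zip(t_frame, p_frame):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

Because speech and music get independent sigmoid-style outputs, BCE is applied per class, which lets a frame be labeled as both at once.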


In order to understand the influence of different variables in our experimental setup, e.g. model architecture, training data or input representation variants like log-Mel spectrogram versus per-channel energy normalization (PCEN), we set up an extensive ablation study, which we encourage the reader to explore fully in our EURASIP journal article.
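
For readers unfamiliar with PCEN, the sketch below applies its standard formulation to the energy sequence of a single frequency band: a first-order smoother tracks the recent energy, which then drives an adaptive gain before root compression. The parameter defaults are commonly cited values and are assumptions here, not the settings used in our experiments. Initializing the smoother with the first frame is likewise just one common choice.

```python
def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization for one band.

    smoothed[t] = (1 - s) * smoothed[t - 1] + s * energies[t]
    out[t]      = (energies[t] / (eps + smoothed[t]) ** alpha + delta) ** r - delta ** r
    """
    out = []
    smoothed = None
    for e in energies:
        # Initialize the smoother with the first frame (an assumed convention).
        smoothed = e if smoothed is None else (1.0 - s) * smoothed + s * e
        out.append((e / (eps + smoothed) ** alpha + delta) ** r - delta ** r)
    return out
```

Unlike a fixed log compression, the gain adapts to the recent level of each band, so a sudden onset after a quiet stretch produces a much larger output than a steady signal of the same energy.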

For each experiment, we reported the class-wise F-score and error rate with a segment size of 10ms. The error rate is the sum of the deletion rate (false negatives) and the insertion rate (false positives). Since a binary decision must be made for music and speech to calculate the F-score, a threshold of 0.5 was used to quantize the continuous output of the speech and music activity functions.
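
These metrics can be computed for one class as in the sketch below. It assumes the reference labels and model activations are already aligned on the same segment grid. Both rates are normalized by the number of reference-active segments, following the usual sound event detection convention, and an activation exactly at the threshold counts as active. Those two choices are assumptions of this sketch.

```python
def binarize(activations, threshold=0.5):
    """Quantize continuous activations; values at the threshold count as active."""
    return [1 if a >= threshold else 0 for a in activations]


def classwise_metrics(reference, activations, threshold=0.5):
    """Return F-score, deletion rate, insertion rate and error rate for one class."""
    predicted = binarize(activations, threshold)
    pairs = list(zip(reference, predicted))
    tp = sum(1 for ref, hyp in pairs if ref == 1 and hyp == 1)
    fp = sum(1 for ref, hyp in pairs if ref == 0 and hyp == 1)
    fn = sum(1 for ref, hyp in pairs if ref == 1 and hyp == 0)
    n_ref = tp + fn  # number of reference-active segments

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    deletion_rate = fn / n_ref if n_ref else 0.0
    insertion_rate = fp / n_ref if n_ref else 0.0
    return {
        "f_score": f_score,
        "deletion_rate": deletion_rate,
        "insertion_rate": insertion_rate,
        "error_rate": deletion_rate + insertion_rate,
    }
```

The function is called once for speech and once for music, since the two classes are scored independently.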


We evaluated our models on four open datasets comprising audio data from TV programs, YouTube clips and various content such as concerts, radio broadcasts, and low-fidelity folk music. The excellent performance of our models demonstrates the importance of building a robust system that detects overlapping speech and music, and supports our assumption that a large but noisy-labeled real-world dataset can serve as a viable solution for SMAD.


At Netflix, tasks throughout the content production and delivery lifecycle are most often concerned with one part of the soundtrack. Tasks that operate on just dialogue, music or effects are performed hundreds of times a day, by teams around the globe, in dozens of different audio languages. So investments in algorithmically-assisted tools for automatic audio content understanding, like SMAD, can yield substantial productivity returns at scale while minimizing tedium.

Additional Resources

We have made audio features and labels available via Zenodo. There is also a GitHub repository with the following audio tools:

  • Python code for data pre-processing, including scripts for 5.1 downmixing, Mel spectrogram generation, MFCC generation, VGGish feature generation, and the PCEN implementation.
  • Python code for reproducing all experiments, including scripts for data loaders, model implementations, and training and evaluation pipelines.
  • Pre-trained models for each conducted experiment.
  • Prediction outputs for all audio in the evaluation datasets.

Special thanks to the entire Audio Algorithms team, as well as Amir Ziai, Anna Pulido, and Angie Pollema.

via Stories by Netflix Technology Blog on Medium https://ift.tt/Ndy2Uqu

November 13, 2023 at 06:15PM
