When a speech-emotion model is really just recognizing the actors

An accuracy that was answering an easier question

I had a speech-emotion model I was proud of. The first thing I found when I rebuilt it was that a good part of its accuracy was the model recognizing the 24 RAVDESS actors, not reading their emotions. It had been graded on the same people it trained on, and once I stopped letting it do that, a chunk of the accuracy went with it.

This is the honest rebuild: a speaker-independent pipeline that fuses voice and face, with every number measured on people the model has never heard or seen. The code is on GitHub and there is a live demo you can talk to. The one number I will defend by the end of this post is 78.8%, and I will show you exactly why it is worth more than that inflated score was.

The whole system in one picture

Before any of the details, here is the entire system. Two models, trained separately, that only meet at the very end when their probabilities are combined. Keep this picture in mind; every section below zooms into one piece of it.

The full pipeline. The audio branch carries most of the signal; the face is a separate, weaker model that is added only at the probabilities.

The reason it is built as two separate models, rather than one network that swallows both, is the most interesting engineering decision in the project, and I will get to why the obvious alternative fails.

Why RAVDESS is a trap: 24 actors, two sentences

RAVDESS is a standard benchmark for emotion from speech. 24 professional actors each speak the same two sentences, "Kids are talking by the door" and "Dogs are sitting by the door", in eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, surprised. That is 1440 short clips in total.

Read that again. Twenty-four people, two sentences. That tiny, repetitive structure is what makes it a trap. There is so little variety that a model can do well by memorizing voices instead of learning emotion, and the standard way people split the data lets it do exactly that.

Split by clip and you measure the wrong thing

If you shuffle all 1440 clips and take a random fifth as your test set, the same actor saying the same sentence in the same emotion ends up on both sides, separated only by which of the two takes it was. The model never has to learn what anger sounds like in general. It only has to remember what actor 14 sounds like angry, because it already met actor 14 being angry in training.
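
To make the leak concrete, here is a minimal simulation that uses no audio at all. It enumerates the 1440 clip identities from the published RAVDESS layout (neutral is recorded at one intensity, the other seven emotions at two), takes a random fifth as the test set, and counts how many test clips have their twin take sitting in training. The function names, the tuple layout and the seed are illustrative assumptions, not code from the project.

```python
import random
from itertools import product

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def enumerate_clips():
    """List every RAVDESS speech clip as (actor, emotion, intensity, statement, take)."""
    clips = []
    for actor, emotion, statement, take in product(range(1, 25), EMOTIONS, (1, 2), (1, 2)):
        # Neutral exists at normal intensity only; the rest have normal and strong.
        intensities = (1,) if emotion == "neutral" else (1, 2)
        for intensity in intensities:
            clips.append((actor, emotion, intensity, statement, take))
    return clips

def twin(clip):
    """Same actor, emotion, intensity and sentence; only the take differs."""
    actor, emotion, intensity, statement, take = clip
    return (actor, emotion, intensity, statement, 3 - take)

def random_split_leak(seed=0, test_fraction=0.2):
    """Shuffle by clip, then measure how much of the test set leaks from training."""
    clips = enumerate_clips()
    random.Random(seed).shuffle(clips)
    n_test = int(len(clips) * test_fraction)
    test, train = clips[:n_test], set(clips[n_test:])
    shared_actors = {c[0] for c in test} & {c[0] for c in train}
    twins_in_train = sum(twin(c) in train for c in test)
    return len(clips), len(shared_actors), twins_in_train / len(test)

total, shared_actors, twin_fraction = random_split_leak()
print(total, shared_actors, round(twin_fraction, 2))
```

In expectation about four in five test clips have their twin in training under a split like this, and essentially every actor appears on both sides.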

The same actor on both sides of a random split is the leak. Splitting by actor closes it, and the score drops 13 points.

I did not want to argue about this in the abstract, so I ran a controlled experiment: the same model, the same training code, the same everything, changing only the split.

A bar chart comparing the same audio model: about 78% on a random split versus 64.9% on a speaker-independent split. — Thirteen points from nothing but the partition. The high number is not a better model, only an easier test, and a random split like this is where almost every inflated RAVDESS score comes from.

Proving the split is clean, mechanically

The fix is to split by actor. I use six-fold cross-validation where each fold tests on four actors who appear in no training or validation data, balanced by gender, and four more actors are held out of each training set for early stopping. No actor is ever on two sides of a fold.
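
One way folds like these could be built is sketched below. It is a hypothetical stand-in, not the project's fold builder: the `Fold` class, the function name and the choice to use the next group of four as the validation actors are all assumptions. The only outside fact it relies on is the RAVDESS convention that odd-numbered actors are male and even-numbered actors are female.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fold:
    train_actors: frozenset
    val_actors: frozenset
    test_actors: frozenset

def sketch_speaker_independent_folds(n_folds=6):
    """Hypothetical fold builder: 4 test actors and 4 validation actors per fold."""
    males = list(range(1, 25, 2))       # RAVDESS: odd actor ids are male
    females = list(range(2, 25, 2))     # RAVDESS: even actor ids are female
    per_gender = len(males) // n_folds  # 2 men and 2 women per group of four
    groups = [
        frozenset(males[i * per_gender:(i + 1) * per_gender]
                  + females[i * per_gender:(i + 1) * per_gender])
        for i in range(n_folds)
    ]
    everyone = frozenset(range(1, 25))
    folds = []
    for i, test in enumerate(groups):
        # Assumption: the next group of four is held out for early stopping.
        val = groups[(i + 1) % n_folds]
        folds.append(Fold(everyone - test - val, val, test))
    return folds

folds = sketch_speaker_independent_folds()
```

Every actor lands in exactly one test group, so the six test sets together cover all 24 people exactly once.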

Because the entire result rests on this, I did not want to trust myself to get it right by hand. A unit test fails the build if any actor leaks across any fold:

def test_no_actor_leaks_across_splits():
    for f in make_speaker_independent_folds(Config()):
        train, val, test = set(f.train_actors), set(f.val_actors), set(f.test_actors)
        # No actor may sit on two sides of the same fold.
        assert not (train & test)
        assert not (val & test)
        assert not (train & val)
        # Together the three groups must cover all 24 actors.
        assert train | val | test == set(range(1, 25))

Is 64.9% just a bad model?

That was my first worry. Watching the honest score land at 64.9%, after the leaky split had flattered the same model to 78%, feels like failure. It is not. It lands almost exactly on the peer-reviewed, genuinely speaker-independent baseline: EmoBox (Interspeech 2024) reports 66.2% for a HuBERT-base model on RAVDESS under a comparable protocol. My 64.9% with the same encoder sits about a point below that, which is evidence that the pipeline is honest, not broken. The 90s are the mirage; the 60s are the real floor for a base-size model. Now the job is to raise that floor without cheating.

The audio model: where emotion actually lives

The audio branch is a self-supervised speech encoder (WavLM-large) with a small head on top. Two choices in that head matter specifically for emotion, and they are worth understanding because they are where most of the honest gain comes from.

The encoder stays frozen. All the learning happens in a small head that decides which layers to listen to and how to pool them.

Learnable layer weighting. Instead of using only the encoder's final layer, the head learns a softmax-weighted sum over all 25 layers. Emotion is carried in the middle layers; the final layer has drifted toward the actual words being spoken, which is what the encoder was pretrained to predict.
Attentive statistics pooling. The head pools both the attention-weighted mean and the standard deviation over time. Affect lives in how much the tone varies, and a plain average throws that variation away.
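
The pooling arithmetic can be sketched without any framework. This is a dependency-free illustration, not the project's head: in a real model the per-frame scores would come from a small learned layer and the whole thing would be tensor operations, whereas here the scores are passed in directly and the function name is hypothetical.

```python
import math

def attentive_stats_pool(frames, scores):
    """Pool T frames of H features into 2*H numbers.

    frames: list of T lists, each with H floats.
    scores: list of T unnormalized attention scores (assumed given here).
    Returns the attention-weighted mean followed by the weighted std.
    """
    # Softmax over time turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    n_features = len(frames[0])
    mean = [sum(w * f[h] for w, f in zip(weights, frames))
            for h in range(n_features)]
    # Weighted variance around the weighted mean, per feature.
    var = [sum(w * (f[h] - mean[h]) ** 2 for w, f in zip(weights, frames))
           for h in range(n_features)]
    std = [math.sqrt(v + 1e-8) for v in var]  # small epsilon keeps sqrt stable
    return mean + std

# Two toy one-feature signals with the same average but different variation.
flat = attentive_stats_pool([[1.0], [1.0], [1.0], [1.0]], [0.0] * 4)
varied = attentive_stats_pool([[0.0], [2.0], [0.0], [2.0]], [0.0] * 4)
```

Both toy signals pool to the same mean, so a plain average cannot tell them apart; only the standard-deviation half of the output separates them.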

The weighted-layer sum is a handful of lines, and the learned weights end up favoring the middle of the network, exactly as the intuition predicts:

stack = torch.stack(out.hidden_states, dim=0)   # (25, B, T, H)
w = torch.softmax(self.layer_weights, dim=0)    # one learnable weight per layer
hidden = (w.view(-1, 1, 1, 1) * stack).sum(0)   # weighted blend of all layers

And it does not just match the intuition in theory. These are the actual weights the head learned, read straight out of the trained checkpoint:

A bar chart of the head's learned softmax weight for each of WavLM's 25 layers. The weights rise from the input layers, peak in the middle of the network around layer 11, then fall below the uniform baseline for the top transformer layers. — Not a diagram I drew; these are the weights the head actually learned, pulled from the checkpoint. It leans on the middle layers, where prosody lives, and pushes the top layers, which have drifted toward the words being spoken, below the uniform line.

When fine-tuning loses: frozen features win on small data

Here is the result that surprised me most. The obvious move is to fine-tune the big encoder on the task. I tried it, and WavLM-large fine-tuned scored 67.6%, which is below a simple frozen-feature baseline. A 300-million-parameter model fine-tuned on 1440 clips overfits, and the speaker-independent folds are exactly where that overfitting shows. So I froze the encoder entirely and put all of the learning into the small head described above. That gave 70.3%.

Two panels of validation macro-F1 against training epoch, each showing six thin per-fold lines and a bold mean line, with the held-out test accuracy boxed in each. The frozen WavLM panel settles at 70.3% test; the full fine-tune panel reaches 67.6%. — Validation macro-F1 through training, every fold drawn. The frozen probe climbs steadily and lands at 70.3% on held-out actors; fully fine-tuning the backbone on so few clips tops out lower, at 67.6%. Freezing is not only simpler here, it wins.

The face has the same trap, in a new costume

RAVDESS is audio-visual, so the natural next step is to add the speaker's face. My first attempt made things worse, and the reason was the same lesson wearing a different costume. I had encoded each face with a standard image network trained on ImageNet, and I tested it the same way I tested the audio.

Same test as the audio. ImageNet face features memorize who the person is and collapse on new faces; expression features transfer.

The ImageNet features scored 89% when the same faces leaked across the split but only 35% on faces the model had never seen. They were memorizing identity, not reading expressions, which is the visual version of the speaker leak. Swapping to a model trained on facial expressions shrank that gap from 54 points to 32, lifting new-face accuracy to 58%. Now the face carried something real and transferable, worth fusing.

Why the obvious fusion fails, and the one that works

With a 70% audio model and a 58% face model, fusing them should be easy. It is not. The obvious approach, one network trained on both streams at once, scored 43 to 47%, below either model alone. That is a known failure mode called modality competition: with so few clips and identities, the optimizer leans on whichever stream is easier to fit on the training data, which here was the face with its leftover identity signal, and the joint model transfers worse than just trusting the audio.

Left, naive joint fusion: audio and face features flow into one learned fusion head, the optimizer leans on the easier stream, and it collapses to 43-47% from modality competition. Right, calibrated late fusion: separate audio and face models each produce probabilities, each temperature-scaled on validation, then a weighted average, reaching 78.8%, with a w slider noting that w=1 keeps audio-only in the grid so the blend can never do worse than audio. — Training the two streams together lets the weaker one drag the model down. Combining only their probabilities, with a validation-tuned weight, cannot.

The fix is decision-level fusion, the rule that won the older audio-video emotion challenges for exactly this reason. Keep the two models separate, let each be its best, and combine only their class probabilities. Two details make the average safe instead of harmful: calibrate each model with temperature scaling so a confidently-wrong face cannot shout down a quiet-but-right voice, and choose the mixing weight on held-out validation, where audio-only (weight one) is always one of the options.

Audio logits and face logits each divided by a temperature fit on validation and floored at 1.0 so it can only soften, shown as an overconfident bar distribution softening. The two are combined as w times audio plus (1 minus w) times face, with the weight swept on validation only and w=1 marked as the audio floor. A note states temperatures and weight never see the test actors. — The temperatures and the weight are fit per fold on the training and validation actors only, never on the test actors, so the gain cannot smuggle the leak back in.

Choosing that weight on validation, never on the test set, is what separates a real gain from a fake one. Because weight one means audio only, the search has a built-in floor: on the validation actors, the chosen blend can never score below the audio model alone.
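
The whole recipe fits in a few small functions. The sketch below is written under stated assumptions rather than taken from the project: the temperature is chosen by a simple grid search over values of 1.0 and above that minimizes validation negative log-likelihood, the weight is chosen by validation accuracy, and ties go to the larger audio weight. The post does not say which optimizer or selection metric the real pipeline uses, and the toy logits at the bottom are invented purely to exercise the code.

```python
import math

def softmax(logits, temperature=1.0):
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    return -sum(math.log(softmax(row, temperature)[y] + 1e-12)
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Assumed method: grid search over T >= 1.0, so scaling can only soften."""
    grid = grid or [1.0 + 0.1 * i for i in range(41)]  # 1.0 up to 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def fuse(audio_logits, face_logits, t_audio, t_face, w):
    """w * calibrated audio probabilities + (1 - w) * calibrated face probabilities."""
    pa = softmax(audio_logits, t_audio)
    pf = softmax(face_logits, t_face)
    return [w * a + (1 - w) * f for a, f in zip(pa, pf)]

def accuracy(audio_rows, face_rows, labels, t_audio, t_face, w):
    hits = 0
    for a, f, y in zip(audio_rows, face_rows, labels):
        probs = fuse(a, f, t_audio, t_face, w)
        hits += probs.index(max(probs)) == y
    return hits / len(labels)

def pick_weight(audio_rows, face_rows, labels, t_audio, t_face, steps=20):
    """Sweep w on validation only; w = 1.0 (audio only) is the first candidate."""
    grid = [i / steps for i in range(steps, -1, -1)]  # 1.0 down to 0.0
    return max(grid, key=lambda w: accuracy(audio_rows, face_rows, labels,
                                            t_audio, t_face, w))

# Toy validation logits for three classes (invented for illustration).
val_audio = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
val_face = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
val_labels = [0, 1, 2, 1]

t_audio = fit_temperature(val_audio, val_labels)
t_face = fit_temperature(val_face, val_labels)
w = pick_weight(val_audio, val_face, val_labels, t_audio, t_face)
```

Because the weight grid contains 1.0, the selected blend matches or beats audio alone on the data it was selected on. That is a statement about validation; the held-out test actors are only scored afterwards with the frozen temperatures and weight.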

Two panels of the real per-fold fusion settings. Left, the audio and visual temperatures for each of the six folds, every bar above the T=1 line, so both streams are softened. Right, the audio versus visual weight per fold, audio between 0.50 and 0.55. — What the calibration actually chose, fold by fold. Every temperature lands above one, so both networks were overconfident and got softened; the validation search settles audio at 0.50 to 0.55 of the blend. None of these values ever sees a test actor.

The result: fold by fold, emotion by emotion

Put the whole arc on one axis. Every number here is speaker-independent, the mean across six actor-disjoint folds, except the one leaky random-split bar that is kept for contrast.

A horizontal bar chart, every result measured in this post. Audio-visual fusion 78.8% and the same HuBERT on a leaky random split 78.0% (red) sit at the top, then WavLM frozen probe 70.3%, WavLM fine-tuned 67.6% (gray), HuBERT-base 64.9%, and facial expression only 58.1%. — Every result in this post on one axis, drawn straight from the saved runs. The honest progression is emerald; the fine-tuned model that overfit is gray; the leaky split is red, and notice it scores as high as the real best system. That is the whole point: leakage can fake a great number.

The means above hide nothing, because here is every fold behind them. The spread is real, and reporting it honestly is part of the point.

A dot plot with one dot per fold for four models. HuBERT-base clusters near 64.9%, WavLM fine-tuned near 67.6%, WavLM frozen near 70.3% with one weak fold near 57%, and audio-visual fusion near 78.8% with one fold reaching 87.5%. A dashed line marks the 1/8 chance level. — The same means, with every actor-disjoint fold plotted. Frozen WavLM has one genuinely hard fold near 57%, which is exactly why I quote a standard deviation and not just an average.

Calibrated late fusion reaches 78.8%, a real 8.5 points over the 70.3% audio model, with no leak. The face is not magic; it is a modest, honest lift that the simple combination rule is able to keep. Per emotion, the model is confident where the voice is unambiguous and hesitant where people hesitate too.

A horizontal bar chart of per-emotion F1 for the fused model, sorted from highest to lowest: happy 0.87, angry 0.84, calm 0.84, disgust 0.81, surprised 0.80, neutral 0.75, fearful 0.74, sad 0.63. — Where the fused model is strong and where it struggles. Happy, angry and calm are clean; sad is the hardest at 0.63.

A row-normalized confusion matrix for the fused speaker-independent model across the eight emotions, brightest on the diagonal, with sad most often confused with the other low-arousal emotions calm and neutral, and calm leaking into neutral. — The same story as a matrix, which shows where it slips, not just how often. Sad is most often mistaken for calm or neutral, and calm itself leaks into neutral. These low-arousal, quiet emotions sound alike, and people mix them up from audio too.
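
For readers who want to reproduce this kind of breakdown on their own predictions, here is a minimal sketch of how a row-normalized confusion matrix and per-class F1 are computed from two label lists. The function names are hypothetical and the twelve labels at the bottom are toy values chosen to exercise the code, not results from the model.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def row_normalize(matrix):
    """Divide each row by its total so it reads as 'where did this emotion go'."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]

def per_class_f1(matrix):
    """F1 = 2*TP / (2*TP + FP + FN), computed one class at a time."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # true k predicted as something else
        fp = sum(row[k] for row in matrix) - tp   # something else predicted as k
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels only: three low-arousal classes, four clips each.
classes = ["sad", "calm", "neutral"]
y_true = ["sad"] * 4 + ["calm"] * 4 + ["neutral"] * 4
y_pred = ["sad", "sad", "calm", "neutral",
          "calm", "calm", "calm", "neutral",
          "neutral", "neutral", "neutral", "neutral"]

cm = confusion_matrix(y_true, y_pred, classes)
f1 = per_class_f1(cm)
macro_f1 = sum(f1) / len(f1)
```

Macro-F1 is the unweighted mean of the per-class scores, so a weak class such as sad pulls it down as much as any other class regardless of how many clips it has.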

What it does not do

A model is only as honest as its limitations, so here they are plainly:

1440 clips and only 24 actors. The speaker-independent numbers carry a real standard deviation across folds, from about 2 points for the steadier models to 6 for the noisiest, as the per-fold plot above shows, and no amount of cleverness changes that the dataset is small.
Acted, frontal, clean. RAVDESS emotions are performed in a studio. These numbers do not transfer directly to spontaneous, in-the-wild speech and faces.
One corpus. Cross-dataset generalization is a separate, harder question I did not test here.
The hosted demo runs the audio branch only. The face and fusion pipeline needs video frames and more compute than a free CPU Space, so the live demo predicts from voice alone at 70.3%.

What the rebuild was really about

In the end this taught me less about emotion and more about measurement. The first version was not dishonest on purpose; it just answered an easier question than the one I thought I was asking. "Can it read emotion in actors it already knows?" is a very different, much easier question than "Can it read emotion in a person it has never met?" The second one is the one that matters, and it is the one worth reporting even when the number is smaller.

So the number I stand behind is 78.8%, speaker-independent, with everything above to back it up. You can try the audio model yourself just below: record or upload a few seconds of speech and watch it predict, on a voice it has never heard.