Cross-modal attention model that combines HuBERT audio and EfficientNet visual encoders through bidirectional fusion for 8-class emotion classification.
Single-modality emotion detection misses context. Audio tone and facial expressions together reveal more than either alone.
Cross-modal attention model fusing HuBERT audio and EfficientNet visual encoders with bidirectional attention and learnable modality weights.
8-class emotion classification, with a demo deployed on Hugging Face.
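A minimal sketch of the fusion idea described above, written in dependency-free Python for readability. It is not the project's actual code: the function names, the feature dimension, the mean pooling, and the softmax over two modality logits are all assumptions made for illustration, and random vectors stand in for HuBERT and EfficientNet features. A real model would also use learned query/key/value projections and a trained classifier head.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Scaled dot-product attention: each query vector attends over `context`.

    Assumes both sequences already share one feature dimension.
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return attended


def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def fuse_and_classify(audio, visual, modality_logits, class_weights):
    """Bidirectional cross-attention, weighted fusion, then a linear classifier."""
    audio_ctx = mean_pool(cross_attend(audio, visual))   # audio attends to visual
    visual_ctx = mean_pool(cross_attend(visual, audio))  # visual attends to audio
    # Assumed form of the "learnable modality weights": a softmax over two scalars.
    w_audio, w_visual = softmax(modality_logits)
    fused = [w_audio * a + w_visual * v for a, v in zip(audio_ctx, visual_ctx)]
    logits = [dot(row, fused) for row in class_weights]
    return softmax(logits), (w_audio, w_visual)


# Hypothetical sizes and random stand-ins for encoder outputs.
DIM, NUM_CLASSES = 16, 8
rng = random.Random(0)
audio = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]   # 5 audio frames
visual = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]  # 3 video frames
class_weights = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]

probs, (w_audio, w_visual) = fuse_and_classify(audio, visual, [0.0, 0.0], class_weights)
```

With both modality logits at zero the two streams are weighted equally; during training those two scalars would be updated along with the rest of the model, letting it lean on whichever modality is more informative.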