Discussion about this post

Neural Foundry

The SAM Audio extension is clever, but the real unlock here is PE-AV's contrastive training on 100M videos. Extending prompt-based segmentation from pixels to waveforms feels logical in hindsight, but the real-time factor (RTF) under 1.0 is what makes this production-ready. I'm curious how well it handles polyphonic music separation versus isolated speech, since contrastive learning on YouTube-scale data probably skews toward clearer audio samples. The benchmark and judge models are a good move though, especially for comparing against domain-specific tools.
