This application integrates Meta's (formerly Facebook) state-of-the-art music source separation model, Hybrid Transformer Demucs (htdemucs), giving you studio-grade stem separation.
Unlike traditional EQ filtering, AI models can truly "understand" the different instruments in a mix and cleanly isolate them into separate stems: vocals, drums, bass, and other accompaniment.
We've curated three powerful pre-trained models to suit different needs (a short code sketch of how they are loaded follows the list):
This is the standard Demucs V4 model, balancing separation quality with processing speed. It is suitable for roughly 90% of use cases, especially creating karaoke backing tracks.
The fine-tuned version is specifically optimized to preserve vocal detail. If you are extracting dry vocals for AI voice-cover training, this model typically retains more high-frequency nuance.
This is the top-tier high-quality model based on the MDX-Net architecture. It delivers industry-leading separation purity, ideal for users with the most demanding audio-quality requirements.
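To make the workflow concrete, here is a minimal sketch of how these checkpoints can be loaded with the open-source demucs Python package directly. This illustrates the underlying library, not this application's exact internal code; song.mp3 is a placeholder file name.

```python
import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model
from demucs.audio import AudioFile, save_audio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pick one of the pretrained checkpoints: "htdemucs", "htdemucs_ft", ...
model = get_model("htdemucs")
model.to(device).eval()

# Read the mix at the model's expected sample rate and channel count.
wav = AudioFile("song.mp3").read(
    streams=0, samplerate=model.samplerate, channels=model.audio_channels
)
ref = wav.mean(0)
wav = (wav - ref.mean()) / ref.std()  # normalize, as the demucs CLI does

with torch.no_grad():
    sources = apply_model(model, wav[None], device=device)[0]
sources = sources * ref.std() + ref.mean()

# model.sources is ["drums", "bass", "other", "vocals"] for htdemucs.
for name, source in zip(model.sources, sources):
    save_audio(source.cpu(), f"{name}.wav", samplerate=model.samplerate)
```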
This is a common and insightful question! The primary job of "vocal separation" models like Demucs is to separate the human voice from the background instrumental (BGM).
However, vocals extracted from pop songs or videos often still carry heavy spatial reverb, echo, or chorus. This is because mixing engineers deliberately add these "spatial effects" during studio recording or post-production to make the voice sound fuller and more polished.
When you check "☑️ Enable De-Reverb Filter" and listen to the resulting _Vocal_Dry.wav, you may find the voice sounds thin or slightly robotic. This is normal: the "studio magic" has been stripped away, and the spatial effects that made the voice sound full are gone, leaving only the raw dry signal.
To train a flawless S-tier voice-conversion or cloning model, the best source material is almost never pop songs (CDs), but rather clean, dedicated voice recordings. These materials naturally lack complex background music and are captured in relatively dry rooms, so after a standard denoise or de-reverb pass, the output is a near-perfect "zero-flaw dry vocal."
💡 Tip: Processed files are automatically saved in the Outputs/Vocals folder.
A: Deep learning models require massive matrix computation. On an NVIDIA graphics card (GPU), processing is typically 10-50x faster than on a CPU; with only a CPU, expect a noticeably longer wait.
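If you want to confirm whether the GPU path is active, a quick check in the app's Python environment (assuming PyTorch is installed, which Demucs requires) looks like this:

```python
import torch

# True means CUDA is available and separation can run on the GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4060"
```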
A: The first time you use a new model, the program must download its weights (several hundred MB) from the cloud. Subsequent runs load the cached weights and start much faster.
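For reference, demucs downloads its checkpoints through torch.hub, which caches them locally so later runs skip the download. You can print the cache location like this (the exact path varies by OS and environment variables):

```python
import torch

# Downloaded weights typically land under <hub dir>/checkpoints.
print(torch.hub.get_dir())
```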
A: This means your graphics card ran out of VRAM. Don't worry: this application has an auto-fallback mechanism. Upon detecting low memory, it automatically switches to CPU mode to finish the task.
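As a rough sketch of what such a fallback pattern can look like in PyTorch (this illustrates the general idea, not this application's exact code; separate_fn is a hypothetical placeholder for the actual separation call):

```python
import torch

def separate_with_fallback(separate_fn, *args, **kwargs):
    """Try the GPU first; fall back to CPU on a CUDA out-of-memory error."""
    try:
        return separate_fn(*args, device="cuda", **kwargs)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release whatever the failed attempt allocated
        return separate_fn(*args, device="cpu", **kwargs)
```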
A: Yes! Starting from V2.3, this application natively integrates UVR5's top-tier VR-architecture de-reverb model. Simply check "☑️ Enable De-Reverb Filter (Designed for Model Training)" in the interface. After isolating the vocals, the system automatically runs a second AI neural-network pass in the background to strip the reverb, outputting a flawless _Vocal_Dry.wav. You can drop this file directly into RVC or GPT-SoVITS for training, with no external software required!
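Conceptually the feature chains two models: Demucs isolates the vocal stem, then a second network strips the reverb. The sketch below shows only the flow; run_vr_dereverb is a hypothetical placeholder for the UVR5 VR-architecture pass, and the _Vocal_Dry naming mirrors the file described above.

```python
from pathlib import Path

def run_vr_dereverb(vocals_in: Path, dry_out: Path) -> None:
    """Hypothetical placeholder: the real app runs the VR-architecture
    de-reverb model here as a second neural-network pass."""
    raise NotImplementedError("wire in the actual de-reverb model")

def make_training_vocal(vocals_wav: Path) -> Path:
    # Pass 2 of the pipeline: strip reverb/echo from the isolated vocal stem.
    dry = vocals_wav.with_name(vocals_wav.stem + "_Vocal_Dry.wav")
    run_vr_dereverb(vocals_wav, dry)
    return dry  # drop this file straight into RVC or GPT-SoVITS
```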
This feature is built on the following open-source technologies, ensuring maximum compatibility and performance:

- Demucs (Hybrid Transformer Demucs) by Meta AI
- UVR5 (Ultimate Vocal Remover) VR-architecture de-reverb models
- PyTorch for GPU/CPU inference