Voice Cloning (GPT-SoVITS) is a powerful AI speech-generation technology. It needs only a clean 3-to-10-second sample of a human voice to instantly mimic the speaker's timbre, tone, and inflection, making them say any text you specify.
This system supports cross-lingual pronunciation. You can even use a Chinese reference voice to make the AI speak fluent English or Japanese! This technology is perfect for YouTube video dubbing, audiobook reading, or creating your own personal AI voice assistant.
Since GPT-SoVITS is a massive core engine, we must start it before we can begin working:
If you don't have specific needs, you can leave this completely blank, and the system will automatically use the pre-trained "Default Official Model". If you have downloaded models from the internet or trained your own, you can load them here:
This is a core concept that confuses many beginners. Imagine GPT-SoVITS as a singer:
👉 Conclusion: If you want Jay Chou to speak, BOTH the model and reference audio are indispensable! You must load Jay Chou's .ckpt and .pth, while simultaneously providing a 3~10 second clean human voice clip of Jay Chou and its text as a reference.
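Under the hood, loading a model and picking a reference clip boils down to pairing a few fields in one request. The sketch below is purely illustrative: the parameter names follow common GPT-SoVITS API conventions but should be treated as assumptions here, and the toolbox normally assembles all of this for you.

```python
# Illustrative sketch: pairing the "singer" (model) with the "sheet music"
# (reference clip + transcript) and the new text to speak. Parameter names
# are assumed GPT-SoVITS API conventions, not guaranteed.

def build_tts_request(ref_wav: str, ref_text: str, ref_lang: str,
                      target_text: str, target_lang: str) -> dict:
    """Pair a reference clip (voice identity) with new target text."""
    return {
        "refer_wav_path": ref_wav,    # 3-10 s clean clip of the speaker
        "prompt_text": ref_text,      # exact transcript of that clip
        "prompt_language": ref_lang,
        "text": target_text,          # the new content the voice will speak
        "text_language": target_lang,
    }
```

Note that the reference transcript must match the clip word for word; if either half of the pair is missing, the voice identity cannot be established.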
This is the key to initiating generation:
This is the new content you want the AI to speak. The system offers two modes:
Import an `.srt` subtitle file, and the system will feed the text to the AI sentence by sentence to generate speech, then paste each clip precisely onto the timeline dictated by the subtitles.
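To make the timeline pasting concrete, here is a minimal sketch of how `.srt` cues could be parsed into millisecond offsets. The helper names are invented for illustration; this is not the toolbox's actual code.

```python
import re

# Hypothetical sketch of the subtitle-driven flow: each SRT timestamp is
# converted to a millisecond offset so a generated clip can be pasted at
# the exact position the subtitle dictates.

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def srt_time_to_ms(stamp: str) -> int:
    """Convert an SRT timestamp like '00:01:02,500' to milliseconds."""
    h, m, s, ms = map(int, SRT_TIME.match(stamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(text: str) -> list[tuple[int, int, str]]:
    """Return (start_ms, end_ms, subtitle_text) for each cue."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        start, end = lines[1].split(" --> ")
        cues.append((srt_time_to_ms(start), srt_time_to_ms(end),
                     " ".join(lines[2:])))
    return cues
```

Each `(start_ms, end_ms, text)` triple would then drive one synthesis call, with the result placed at `start_ms` on the output timeline.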
Finally, it merges them into a complete, long audio track (WAV).

Many users initially feel the AI sounds "stiff" or "robotic". This is usually because the input text lacks punctuation. GPT-SoVITS relies heavily on punctuation to determine rhythm and emotion:
- `,` (Comma): Creates a short breath pause, making long sentences sound unhurried.
- `.` (Period): Lowers the tone, creating a complete end-of-sentence pause.
- `!` (Exclamation): Raises the pitch, adding excited, intense, or emphatic emotion.
- `?` (Question mark): Raises the tone at the end of the sentence, creating an inquiring intonation.
- `...` (Ellipsis): Creates a longer "thoughtful" or "hesitant" pause, perfect for adding dramatic flair.

❌ Bad Example (No punctuation):
This is an amazing technology it can change the future dubbing industry we will see
→ Result: The AI rushes through it in one breath like a bullet train, lacking emotion.
✅ Good Example (With punctuation):
This is... an amazing technology! It can change the future dubbing industry, we, will see.
→ Result: The AI pauses to build anticipation after "This is", raises its volume at "technology!", and slows down between the commas at the end, sounding highly persuasive and human-like.
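As a rough illustration of how punctuation-aware splitting works, the toy sketch below cuts text into chunks after sentence-final marks. GPT-SoVITS's real splitter is more sophisticated; this only models the idea that punctuation decides where the AI breathes.

```python
import re

# Toy sketch of punctuation-based auto-splitting, assuming the splitter
# simply cuts after sentence-final marks (. ! ? and the ellipsis).
# This is an illustration, not GPT-SoVITS's actual algorithm.

SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")

def auto_split(text: str) -> list[str]:
    """Split text into sentence-sized chunks for one-at-a-time synthesis."""
    return [chunk.strip() for chunk in SENTENCE_END.split(text) if chunk.strip()]
```

Run on the good example above, this yields separate chunks at "This is...", "technology!", and the final sentence, which is exactly where the AI gets to pause and reset its intonation. Text with no punctuation comes back as one giant chunk, which is why it gets read in a single rushed breath.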
Although you can use the "Reference Audio" section above for quick Few-Shot (3-second mimic) generation, this usually captures only about 60% of the speaker's essence and is best suited for quick entertainment. To achieve near-lossless similarity, you must perform Fine-Tuning for that character to produce dedicated `.ckpt` and `.pth` models.
Currently, the "Video Toolbox" focuses on providing the lightest, most stable, and cleanest "Inference" interface. We do not integrate the massive training interface into the main program. Here are the paths to acquire custom models:
You can easily train using the official GPT-SoVITS WebUI package you originally downloaded:
1. Open the GPT-SoVITS folder on your computer and click `go-webui.bat` (or the corresponding launcher).
2. After training, you will find the `.ckpt` and `.pth` files belonging to that character in the `GPT_weights` and `SoVITS_weights` folders.
3. Copy those `.ckpt` and `.pth` files into the toolbox's model folders (e.g. the `models\SoVITS` folder). Afterwards, you can load them with one click in the "Custom Models" section of the interface!
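The copy step above can be sketched in a few lines of Python. The `models\GPT` destination for `.ckpt` files is an assumption (the text only names `models\SoVITS`), so adjust both paths to match your actual install.

```python
import shutil
from pathlib import Path

# Illustrative sketch of installing a trained character. The destination
# folder names ("models/GPT" and "models/SoVITS") are assumptions based on
# the folders mentioned in this guide; verify them against your install.

def install_character(ckpt: Path, pth: Path, toolbox_root: Path) -> None:
    """Copy a trained .ckpt/.pth pair into the toolbox's model folders."""
    gpt_dir = toolbox_root / "models" / "GPT"       # assumed .ckpt home
    sovits_dir = toolbox_root / "models" / "SoVITS"  # named in the guide
    gpt_dir.mkdir(parents=True, exist_ok=True)
    sovits_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ckpt, gpt_dir / ckpt.name)
    shutil.copy2(pth, sovits_dir / pth.name)
```

Copying (rather than moving) keeps the originals in `GPT_weights`/`SoVITS_weights` intact as a backup.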
The two sliders at the bottom of the interface are magic wands to control the AI's creativity. Please only adjust them when you encounter weird noises:
| Parameter Name | Numerical Meaning | Recommended Scenario |
|---|---|---|
| **Auto Split (Punctuation)**<br>Default: ON | Automatically splits long sentences into chunks based on punctuation. | Strongly recommended to keep ON. This fixes the "3-second cutoff" issue and prevents AI hallucinations on long texts (100+ words). |
| **Temperature (Temp)**<br>Default: 0.8 | Controls the "richness and irregularity" of the output. Higher values: more emotion, but prone to slurring or noise. Lower values: flatter and more robotic, but clearest pronunciation. | 0.8 is the sweet spot for stability and quality. Slightly increase (1.00 ~ 1.10) if the voice is too flat; decrease (0.60) if you hear weird noises. |
| **Stability (Top_P)**<br>Default: 0.8 | Determines how "conservative" the AI is when predicting the next token. Works together with Temp: lower is more rigid; higher is more expressive. | Usually keep at 0.8 to match the default Temp for the highest generation success rate. |
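To see how the two sliders interact, here is a toy nucleus-sampling sketch: temperature reshapes the probability distribution, then Top_P trims it to the most likely tokens. GPT-SoVITS's real decoder is far more involved; this only illustrates why a low Temp sounds rigid and a high Temp risks noise.

```python
import math
import random

# Toy sketch of temperature + top_p (nucleus) sampling over "token" logits.
# Not GPT-SoVITS's actual decoder -- just the standard mechanism the two
# sliders control.

def sample(logits, temperature=0.8, top_p=0.8, rng=random):
    # 1. Temperature rescales logits: <1 sharpens the distribution
    #    (safer, flatter speech), >1 flattens it (more expressive,
    #    more error-prone).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # 2. Top_P keeps the smallest set of tokens whose cumulative
    #    probability reaches top_p, discarding the unlikely tail
    #    that tends to produce weird noises.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 3. Draw from the renormalized nucleus.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very low temperature, the top token dominates and the output becomes nearly deterministic (robotic); raising the temperature spreads probability onto riskier tokens, which Top_P then has to rein in.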
A: Yes, this is completely normal! These are not garbage files generated by the system; the folder is large because it contains a complete execution environment and AI brain:
Solution: … `python.exe`, and try again.

Cause & Solution: This is called AI Hallucination. It usually means your Reference Audio's background isn't clean. Be sure to use the "Recording Assistant"'s noise-reduction feature to ensure you are feeding the AI a very pure human voice. Also, moderately lowering Temp can effectively reduce weird noises at the end of a clip.
A: This is completely normal and expected! This is just the main program "pinging" the API server to check if it has finished loading. Since the API root directory doesn't have a viewable page, the server returns a 400 error. As long as your main interface shows "✅ Connected", everything is working perfectly.
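The "ping" logic can be sketched like this: any HTTP response at all, even a 400, proves the API server is alive; only a connection failure means it is still loading. The URL below is an assumed default, not guaranteed for your install.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Sketch of the readiness check described above. A 400 from the API root
# still counts as "connected" -- the server answered, it just has no page
# to show. The port is an assumed default.

def api_is_up(url: str = "http://127.0.0.1:9880/") -> bool:
    try:
        with urlopen(url, timeout=2):
            return True       # 2xx/3xx: server answered
    except HTTPError:
        return True           # 400 etc.: server answered -- still up!
    except URLError:
        return False          # connection refused: not loaded yet
```

This is why the log shows a 400 while the main interface still reports "✅ Connected": the two are consistent, not contradictory.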
A: No, it is still running on your GPU! To ensure compatibility with the latest generation GPUs (like the RTX 40/50 series), we forced full-precision (FP32) mode to prevent crashes caused by specific instruction set incompatibilities. It is still performing the heavy lifting on your graphics card, which is significantly faster than using the CPU.