RVC Voice Timbre Conversion

RVC Inference Screenshot

01. Overview

RVC (Retrieval-based Voice Conversion) is an AI voice conversion technology. Unlike voice cloning, which turns input text into speech, RVC takes an existing audio clip as input. The AI keeps the original speaking or singing intonation, emotion, and rhythm, and replaces only the timbre with that of the model you select.

This technology is most commonly used to create "AI Covers", such as having a famous singer sing someone else's song, or to hide one's real voice for live streaming and video dubbing.

02. Interface Sections

1 Input Audio

This is your original sound source to be converted:

2 Model Selection

Load the voice of the person you want to "become" here:

03. Detailed Conversion Parameters (Core Must-Learn)

RVC's greatest strength is its flexible parameters. Adjusting them properly can fix voice cracks and audio dropouts, and can make cross-gender covers sound seamless.

Pitch Shift (Pitch)
Function: Changes the fundamental frequency (pitch) of the input sound, measured in semitones.
Usage: To convert a male voice with a female model, set Pitch to +12 (up one octave). For female to male, set it to -12. Keep it at 0 for same-gender conversion. If you need to change the song's key to fit the model's vocal range, you can also fine-tune with values such as `+1` or `-2`.
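To make the semitone values concrete, the sketch below converts a semitone shift into a frequency ratio using the standard equal-temperament relation (ratio = 2^(n/12)). The function names are hypothetical and are not part of RVC; this only illustrates why +12 doubles the pitch and -12 halves it.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz: float, semitones: float) -> float:
    """Apply a semitone shift to a fundamental frequency given in Hz."""
    return f0_hz * semitones_to_ratio(semitones)


print(semitones_to_ratio(12))    # 2.0  (one octave up)
print(semitones_to_ratio(-12))   # 0.5  (one octave down)
print(shift_f0(110.0, 12))       # 220.0
```

A small value such as `+1` corresponds to a ratio of about 1.06, which is why it works as a fine-tuning step rather than a register change.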
F0 Prediction Algorithm
Function: The method the AI uses to track the pitch curve of your input audio.
Options:
  • rmvpe (Recommended): Currently the strongest and most stable algorithm. Fast, noise-resistant, and moderate in GPU usage. The default first choice.
  • fcpe (Recommended): A newer, powerful algorithm. More accurate than rmvpe on large pitch jumps, vocal fry, and unusual singing styles, but slower. Try this if rmvpe struggles.
  • crepe: A veteran algorithm. Highly accurate, but slow and heavy on GPU resources.
  • pm: The fastest option, but with the worst sound quality and prone to voice cracks. Suitable for very low-end computers.
  • harvest: Tracks low pitches more accurately, but tends to crack on high pitches.
Index Rate
Function: Controls how strongly the voice characteristics lean towards the model (range 0 to 1). Requires a .index file to take effect.
Usage:
  • Higher values: Articulation and tone sound more like the model itself. If the original audio quality is poor, an excessively high index rate can produce buzzing artifacts.
  • Lower values: More traces of your own original voice remain.
Recommendation: Default 0.75. If odd noises appear, lower it to 0.3 ~ 0.5.
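One way to picture the index rate is as a weighted average between features taken from your input and features retrieved from the model's .index file. The sketch below assumes a plain linear blend, which is our understanding of how the upstream RVC pipeline mixes them; the function name and the toy feature values are hypothetical.

```python
def blend_features(source, retrieved, index_rate):
    """Linearly blend source features with retrieved (model) features.

    index_rate = 1.0 uses only the retrieved features;
    index_rate = 0.0 keeps the source features unchanged.
    Assumption: a simple per-value linear interpolation.
    """
    if not 0.0 <= index_rate <= 1.0:
        raise ValueError("index_rate must be between 0 and 1")
    return [index_rate * r + (1.0 - index_rate) * s
            for s, r in zip(source, retrieved)]


source = [0.0, 1.0]      # toy features from the input voice
retrieved = [1.0, 0.0]   # toy features retrieved from the index
print(blend_features(source, retrieved, 0.75))  # [0.75, 0.25]
```

Under this assumption, lowering the rate to 0.3 ~ 0.5 simply gives your own features more weight, which is why it can reduce artifacts when the retrieved features fit poorly.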
Filter Radius
Function: When the pitch (F0) trajectory fluctuates wildly (e.g., prediction errors caused by breathiness or voice cracks), this value is used for median filtering to smooth it. It takes effect only at 3 or greater.
Usage: If the converted voice suddenly goes off-key or pops in certain sections, increase this value to soften the abrupt change.
Recommendation: Default 3. Increase it if you encounter breathiness or popping.
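The idea behind median filtering can be sketched in a few lines. This is a generic illustration, not RVC's actual code: the function name is hypothetical, and treating the value as a half-width in frames (with the window clipped at the edges) is an assumption made for the example.

```python
from statistics import median


def median_smooth(f0, radius):
    """Median-filter a pitch contour.

    Each frame is replaced by the median of the frames within `radius`
    positions on either side (the window is clipped at the edges).
    Assumption: `radius` is a half-width in frames, for illustration only.
    """
    if radius < 1:
        return list(f0)
    smoothed = []
    for i in range(len(f0)):
        lo = max(0, i - radius)
        hi = min(len(f0), i + radius + 1)
        smoothed.append(median(f0[lo:hi]))
    return smoothed


# A contour around 220 Hz with one spurious octave jump at frame 2.
contour = [220.0, 221.0, 440.0, 222.0, 223.0]
print(median_smooth(contour, 1))
```

The isolated 440 Hz frame is replaced by a neighbouring value, which is the kind of sudden jump that tends to be heard as a pop or an off-key note. A wider window smooths more aggressively, at the cost of flattening genuine fast pitch movement.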
RMS Mix Rate
Function: Determines how closely the volume changes of the output follow the volume of the original input.
Usage: The default of 0.25 means that only a quarter of the output's volume fluctuation follows your input, and three-quarters is determined by the model. If you check this option (or raise the value closer to 1), the loud and quiet passages of the output will more closely match the levels of your original recording.
Recommendation: Check it if you want obvious emotional dynamics (a large difference between loud and soft). Uncheck it (set it to 0) if you want the volume of every sentence to be very even.
💡 Magic Formula: Winning Tips for Male/Female Conversion
If you use a male voice to sing a female singer's song (male to female), first try setting Pitch Shift to +12. If it still sounds odd, the usual reason is that the male singer simply could not hit those high notes. In that case, we strongly recommend that you **make good use of our "Source Key Shift" feature in the first section**. Lower your original vocal by 3 to 6 semitones (set it to -3 to -6) and keep Pitch at +12. Feeding the lowered vocal to RVC usually gives far more natural results.
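The arithmetic of this tip can be sketched as follows, assuming the two shifts simply add up in semitones (Source Key Shift is applied to the input first, then Pitch is applied during conversion). The function name is hypothetical.

```python
def net_shift(source_key_shift: int, pitch: int) -> int:
    """Total semitone shift relative to the original recording.

    Assumption: the Source Key Shift and the RVC Pitch setting stack additively.
    """
    return source_key_shift + pitch


print(net_shift(-3, 12))  # 9
print(net_shift(-6, 12))  # 6
```

Under that assumption, the converted vocal ends up 6 to 9 semitones above the original rather than a full octave, which keeps it within a range the model can sing comfortably.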

04. Troubleshooting

Q: Why do I suddenly get a "Cannot find model file" or "Interrupted" error during conversion?

Solution: Make sure that the filenames of your .pth and .index files, and the names of the folders that contain them, do not include any spaces, Chinese characters, Japanese characters, or special symbols.
RVC's underlying Python code is very strict about path parsing. Rename folders and models so that they use only English letters and digits.
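If you want to check a path before loading it, a small script like the one below can point out the offending folder or file names. This is a hypothetical helper, not part of RVC. It follows the rule above strictly: only English letters, digits, and the dot needed for the file extension are accepted, and a Windows drive prefix such as `C:` is skipped.

```python
import re

SAFE_PART = re.compile(r"[A-Za-z0-9.]+")   # letters, digits, and dots only
DRIVE = re.compile(r"[A-Za-z]:")           # Windows drive prefix, e.g. "C:"


def unsafe_path_parts(path: str) -> list[str]:
    """Return the folder and file names in `path` that break the naming rule."""
    parts = [p for p in re.split(r"[\\/]", path) if p]
    return [p for p in parts
            if not (SAFE_PART.fullmatch(p) or DRIVE.fullmatch(p))]


print(unsafe_path_parts("C:/models/my voice/歌手.pth"))    # ['my voice', '歌手.pth']
print(unsafe_path_parts("models/singer01/singer01.pth"))  # []
```

An empty list means every part of the path follows the rule. Any name that is returned should be renamed before you load the model.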

Q: I checked "Auto-Separate Vocals and Mix", why is the output only accompaniment without vocals, or the program freezes and won't run?

Solution: This indicates the underlying Demucs vocal separation module failed.
The cause is likely insufficient graphics card memory (VRAM) to load these two resource-heavy AI processes simultaneously. It's recommended to first go to the "Vocal Separation" menu, manually extract a pure vocal file, then bring that file here for voice conversion, and finally use video editing software to layer it with the accompaniment yourself.

Q: Why does the converted voice sound congested, as if it has a cold, or like an alien talking?

Solution: This happens when RVC tracks the pitch of the input incorrectly. The most effective fix is to change the F0 Prediction Algorithm.
If you originally used rmvpe, switch to fcpe or crepe and run the conversion again. This usually brings a large improvement. Also check whether your Pitch Shift is set in the wrong direction.