AI Subtitle Generator & Auto-Sync Engine

AI Subtitle Generation Screenshot

01. Overview

The Subtitle Generation Module uses OpenAI's open-source Whisper models (via the faster-whisper implementation) as its recognition core. It automatically transcribes all major video and audio formats (MP4, MKV, MP3, M4A, WAV, FLAC, AAC, OGG) into accurate, time-stamped subtitles (SRT/VTT).

Beyond basic subtitle generation, the system also integrates Auto-Translation, a Bilingual Subtitle layout, and a one-click Auto Burn-in feature, so you can produce finished, subtitled videos without opening any other editing software.
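As a sketch of the final serialization step, the time-stamped segments produced by recognition can be rendered as SRT like this (the `(start, end, text)` segment shape and helper names are illustrative, not the tool's actual internals):

```python
# Illustrative sketch: turning (start_sec, end_sec, text) segments into SRT.
# The segment structure here is an assumption, not the tool's real data model.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render [(start_sec, end_sec, text), ...] as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

VTT output differs mainly in using a `.` millisecond separator and a `WEBVTT` header, which is why both formats can come from the same segment data.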

02. AI Core Parameters Explained

🧠 AI Model Selection

We provide four different tiers of models. You can choose based on your PC's performance and accuracy needs:

🗣️ Video Language

Specify the language spoken in the video. Auto detection is usually accurate, but if the video has very sparse dialogue or loud background music, manually specifying the language (e.g., English) can significantly improve accuracy.

⚙️ Processing Mode

🛡️ VAD (Voice Activity Detection) Filter

This is a highly important feature. If enabled, the AI first analyzes where human speech is present before transcribing.
Purpose: Prevents the AI from hallucinating subtitles (such as meaningless symbols or repeated words) during purely musical or silent segments.
Recommendation: Enable it manually as the situation requires. It is turned off by default in the latest version, because VAD can accidentally drop soft-spoken or whispered dialogue.
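To illustrate the idea behind VAD, here is a toy energy-based detector (a conceptual sketch only; the real filter uses a far more robust model, and the frame size and threshold below are arbitrary assumptions):

```python
# Toy energy-based VAD: mark a frame as "speech" when its RMS energy
# exceeds a threshold. Real VADs are far more robust; the frame length
# and threshold here are illustrative assumptions.
import math

def frame_rms(samples):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(samples, frame_len=160, threshold=0.02):
    """Return (start_frame, end_frame) ranges judged to contain voice."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        flags.append(frame_rms(samples[i:i + frame_len]) > threshold)
    # Collapse consecutive speech frames into ranges.
    ranges, start = [], None
    for idx, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = idx
        elif not is_speech and start is not None:
            ranges.append((start, idx))
            start = None
    if start is not None:
        ranges.append((start, len(flags)))
    return ranges
```

This also shows why VAD can delete whispered dialogue: very quiet speech may fall below whatever threshold the detector uses.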

👥 Speaker Diarization

Uses pyannote for speaker diarization, automatically recognizing how many people are speaking in the video and prepending speaker tags (e.g., [SPEAKER_00]:, [SPEAKER_01]:) to each subtitle line.

⏳ Performance Tip & Time Estimation: When enabled, the transcription time will increase significantly (potentially taking 2-3x longer). This is because, besides Whisper's text recognition, the system must perform Pyannote's deep voiceprint feature extraction, and finally cross-reference and merge the "Text" and "Who said it" data on the timeline. If the video is long, please be patient; the system's progress percentage will continually update.
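The final cross-reference step can be sketched as assigning each transcribed segment the speaker whose diarization turn overlaps it the most on the timeline (the data shapes and function names are illustrative assumptions, not the tool's internals):

```python
# Illustrative merge of transcript segments with diarization turns:
# each subtitle gets the speaker whose turn overlaps it the most.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def tag_speakers(segments, turns):
    """segments: [(start, end, text)]; turns: [(start, end, speaker_label)].
    Returns [(start, end, "[SPEAKER]: text")]."""
    tagged = []
    for seg_start, seg_end, text in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg_start, seg_end, t[0], t[1]),
            default=None,
        )
        if best and overlap(seg_start, seg_end, best[0], best[1]) > 0:
            label = best[2]
        else:
            label = "SPEAKER_??"  # no diarization turn covers this segment
        tagged.append((seg_start, seg_end, f"[{label}]: {text}"))
    return tagged
```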

🔑 First Time Use & Authorization Check: The system will pop up a window asking for a Hugging Face Access Token (starting with hf_). This is a free service. If you encounter authorization errors, please follow these steps to verify:

  1. Go to the pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 model pages.
  2. Ensure you are logged into your Hugging Face account.
  3. If you haven't agreed to the terms, an agreement and form will appear. Fill it out and click "Agree and access repository".
  4. If the page opens directly to the model description and the "Files and versions" list, you have already been authorized successfully!
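Before starting the diarization flow, a cheap local sanity check on the token string can catch copy-paste mistakes (this only checks the hf_ prefix mentioned above; it cannot verify the token against Hugging Face or confirm the gated models were accepted):

```python
def looks_like_hf_token(token: str) -> bool:
    """Local plausibility check only: Hugging Face access tokens start
    with 'hf_'. This cannot confirm the token is actually valid."""
    token = token.strip()
    return token.startswith("hf_") and len(token) > 3 and " " not in token
```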

03. Subtitle Output & Style

📄 Output Formats

🎨 Independent Style Settings (Exclusive Lossless ASS Dual Engine)

Our subtitle generation core utilizes a powerful Dual-Layer Overlay Dynamic Calculation Engine. It not only guarantees compatibility with all video resolutions but also allows the "Primary Subtitle" and "Secondary Subtitle (Original Text)" to have completely independent visual designs without interfering with each other!

On the right side of the interface, we provide an intuitive Tabview:

Within each tab, you can independently configure the following parameters, and the settings will sync in real-time to the Preview above:
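Conceptually, "completely independent visual designs" maps to the two subtitle layers each getting their own `Style:` line in the generated ASS script, so changing one never touches the other. A minimal sketch (the fonts, sizes, and colour values are placeholder assumptions, not the tool's defaults):

```python
# Sketch: emitting two independent ASS styles. The field order must match
# the Format line of the [V4+ Styles] section; everything not passed as an
# argument below is a fixed placeholder default for this sketch.

ASS_FORMAT = ("Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, "
              "OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, "
              "ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, "
              "Alignment, MarginL, MarginR, MarginV, Encoding")

def ass_style(name, font, size, primary_colour):
    """One ASS 'Style:' line; colours use the &HAABBGGRR convention."""
    return (f"Style: {name},{font},{size},{primary_colour},&H000000FF,"
            "&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,1,2,10,10,20,1")

def styles_section(primary, secondary):
    """Primary and secondary subtitles each get their own Style line,
    so their settings are fully independent."""
    return "\n".join(["[V4+ Styles]", ASS_FORMAT, primary, secondary])
```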

04. Troubleshooting

Q: Receiving "VRAM OOM" or "System RAM OOM" errors?

Cause: Insufficient PC memory to load the large model.
Solution: Downgrade the model to medium or large-v2 (Good). The difference in results is usually minor, and the smaller model lets the program run smoothly.
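The downgrade logic can be pictured as keying the model tier to available memory (the GB thresholds here are rough assumptions for illustration, not the tool's actual rules):

```python
def pick_model(free_vram_gb: float) -> str:
    """Illustrative tier selection: fall back to smaller Whisper models
    when memory is tight. The GB thresholds are assumptions."""
    if free_vram_gb >= 10:
        return "large-v3"
    if free_vram_gb >= 6:
        return "large-v2"
    if free_vram_gb >= 4:
        return "medium"
    return "small"
```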

Q: The burned-in video has no sound?

Solution: This issue has been fixed in the latest version. The system now forces extraction of the original audio track and transcodes it to AAC format, ensuring audio is fully preserved.
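For reference, a generic ffmpeg recipe that burns in subtitles while transcoding audio to AAC looks like the following (this mirrors the behavior described above but is not the tool's exact internal command; the paths and bitrate are placeholders):

```python
# Sketch: build a generic ffmpeg argument list that burns a subtitle file
# into the video and transcodes the audio to AAC so it is preserved.

def burn_in_command(video, subs, output, audio_bitrate="192k"):
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"subtitles={subs}",  # render subtitles into the video frames
        "-c:v", "libx264",           # re-encode video (required for burn-in)
        "-c:a", "aac",               # transcode audio to AAC
        "-b:a", audio_bitrate,
        output,
    ]
```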

Q: What are Bilingual Subtitles?

A: This is an incredible tool for learning foreign languages! When checked, the system simultaneously displays both the translated (target-language) text and the original text, formatted automatically (translation on top, original on bottom).
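The bilingual layout amounts to stacking the translated line above the original in each cue; a minimal sketch, assuming both tracks share identical timing and a simple `(start, end, text)` cue shape:

```python
def bilingual_cue(translated: str, original: str) -> str:
    """Stack translated text above the original, as the bilingual mode does."""
    return f"{translated}\n{original}"

def merge_bilingual(translated_segments, original_segments):
    """Pair up cues by index; assumes both tracks share the same timing."""
    return [
        (start, end, bilingual_cue(t_text, o_text))
        for (start, end, t_text), (_, _, o_text)
        in zip(translated_segments, original_segments)
    ]
```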