The motivation for this project was twofold: (1) I'm trying to learn Spanish, so this is a fun way to translate and learn from any video, and (2) I wanted to get back into Python development with a focus on some form of AI/machine learning.
The project's initial scope has been deliberately limited to reach a quick MVP: a working app that can be self-hosted and used as a tool right away.
Expect bugs & beware of gremlins!
Btw: the name "open whisperer" is a play on the main open source project that drives this one (OpenAI's Whisper). I just chose the name to get started and have kept it until now; it is not intended to infringe on any copyrights or trademarks held by OpenAI.
For the devs learning to code (I mean, we're all learning, but...), this is a mono-repo; if you're not familiar with this type of repository, I came across a nice resource that explains the motivation behind it. It's the most extensive and helpful explanation I've found.
Check it out here: https://monorepo.tools/
To get a local copy up and running, follow these simple steps.
Note: you will need about 14 GB of free disk space for all the AI language models.
- Docker (recommended)
or
- node@^22.14.0
- python@^3.11
- yarn@3.8.7
- Install Docker: https://www.docker.com/get-started
- Create .env files (use the example files in both apps/python-server and apps/web-ui)
cp apps/web-ui/.example.env.production.local apps/web-ui/.env.production.local # used by docker
cp apps/web-ui/.example.env.production.local apps/web-ui/.env.development.local # used when running locally in dev (i.e.: npm run dev)
cp apps/python-server/.example.env apps/python-server/.env # used by docker
- Run the containers
docker compose up --build -d
- Verify the containers are running
docker ps
- Access the Web UI at http://localhost:3567
- Enjoy
A great deal of effort went into making sure it runs without issue on Docker; just try it and report back if you run into any problems.
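If you'd rather script that check than open a browser, here's a minimal sanity-check sketch in Python; the only assumption is the default port 3567 used above.

```python
# Quick check that the Web UI answers on the default port (3567).
from urllib.request import urlopen

with urlopen("http://localhost:3567", timeout=5) as resp:
    print("Web UI reachable, HTTP status:", resp.status)
```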
// TODO: write manual installation instructions. In the meantime, do use a virtual environment (venv)
- Activate the virtual environment
# Windows (cmd/PowerShell)
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
The app features a simple, easy-to-use interface that lets you choose which languages to translate from and to.
Of course, no AI model is 100% accurate, so don't rely on this program where perfectly accurate transcripts or translations are required.
Usage is self-explanatory: each button highlights green when you're ready to move to the next step.
The demo version has upload limits and may include other restrictions and/or scanning of media to comply with some local laws; if you need higher limits, it is recommended to use the self-hosted option via Docker.
This is a very rough outline of how I may go about adding new features; the roadmap roughly follows the current order of importance for my use case.
Please do not depend on this project; it is not stable and development may be sporadic.
Feel free to clone the project and make your own changes.
PRs are always welcome, and I'll be happy to merge any that make sense with the general direction of the project.
- Setup pipeline (extract audio, transcriptions, translation, muxing) and working app
- Docker image for self-hosted option
- Easily maintainable and well-structured mono repo
- Ability to edit the transcript before applying it to video
- Allow applying source language subtitles to video
- Show list of previously generated .srt files to quickly reuse and/or download
- Sync video & transcript (when the user clicks a point in the video, both should stay in sync)
- Add wavesurfer audio visualizer to show events & subtitle timeline
- Speaker diarization (recognize how many speakers there are, when each one speaks, and each speaker's gender); see the sketch after this list
- Event-based status reporting with background tasks
- Add voice cloning to overdub videos in translated language (support different accents, gender)
- Support different style "templates" for subtitle styles
- Edit placement of subtitles
- Detect duplicate video sources
- Support uploading audio only and generating transcript w/ option to output karaoke style blank video
- Take advantage of hardware acceleration
- More advanced cropping/slicing and basic video editing (to cut dead space)
- Support concurrent uploads (multiple videos/audio at the same time w/ status reporting)
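Speaker diarization isn't implemented yet, but the sketch below shows roughly what that step could look like with pyannote-audio. The pretrained pipeline name, the Hugging Face token, and the audio path are assumptions, and it only covers who spoke when (not gender).

```python
# Hypothetical diarization step using pyannote-audio (not part of the app yet).
from pyannote.audio import Pipeline

# Assumes a Hugging Face access token that can download the pretrained pipeline.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

diarization = pipeline("audio.wav")  # placeholder audio path

# Each turn tells you which speaker talked and when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```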
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feat/amazing-feature-i-want-to-add)
- Commit your Changes (git commit -m 'Add some amazing-feature-i-want-to-add')
- Push to the Branch (git push origin feat/amazing-feature-i-want-to-add)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Othneil Drew - LinkedIn @othneildrew - codeguydrew@gmail.com
Project Link: https://github.com/othneildrew/open-whisperer
Website: https://othneildrew.com
After much research, I've come across this amazing set of tools/projects that made this one possible.
Big shout-out to these amazing resources! Some aren't used yet, but they are more than likely what I will use to implement other features on the roadmap.
| Task | Tool | Notes |
|---|---|---|
| Audio Extraction | ffmpeg | Industry standard |
| Language Detection | Whisper | Detects and transcribes; use faster-whisper for speed |
| Multi-Speaker Diarization | pyannote-audio | Best diarization tool (offline support with Hugging Face model download) |
| Translation | argos-translate | Offline translation; install language pairs |
| Voice Synthesis (TTS) | Tortoise TTS, Coqui TTS | High quality, supports speaker cloning too |
| Subtitle Handling | ffmpeg, srt, autosub, or custom logic | SRT file generation and muxing |
| Muxing | ffmpeg | Add subtitles or TTS audio back to the video |
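To make the table concrete, here's a minimal end-to-end sketch of how these tools can chain together. This is not the project's actual implementation: it assumes openai-whisper, argostranslate, and srt are installed and the ffmpeg binary is on PATH, and the file names, model size, and target language are placeholders.

```python
import subprocess
from datetime import timedelta

import argostranslate.package
import argostranslate.translate
import srt
import whisper

VIDEO_IN = "input.mp4"       # placeholder paths
AUDIO_WAV = "audio.wav"
SRT_OUT = "subtitles.srt"
VIDEO_OUT = "output.mp4"
TARGET_LANG = "es"           # placeholder target language

# 1. Audio extraction (ffmpeg): mono 16 kHz WAV
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO_IN, "-vn", "-ac", "1", "-ar", "16000", AUDIO_WAV],
    check=True,
)

# 2. Language detection + transcription (Whisper)
model = whisper.load_model("small")
result = model.transcribe(AUDIO_WAV)
source_lang = result["language"]

# 3. Translation (argos-translate): install the language pair once, then translate offline
#    (assumes the detected source -> target pair exists in the package index)
argostranslate.package.update_package_index()
packages = argostranslate.package.get_available_packages()
pair = next(p for p in packages if p.from_code == source_lang and p.to_code == TARGET_LANG)
argostranslate.package.install_from_path(pair.download())

# 4. Subtitle handling (srt): one cue per Whisper segment
subtitles = []
for i, seg in enumerate(result["segments"], start=1):
    translated = argostranslate.translate.translate(seg["text"].strip(), source_lang, TARGET_LANG)
    subtitles.append(
        srt.Subtitle(
            index=i,
            start=timedelta(seconds=seg["start"]),
            end=timedelta(seconds=seg["end"]),
            content=translated,
        )
    )
with open(SRT_OUT, "w", encoding="utf-8") as f:
    f.write(srt.compose(subtitles))

# 5. Muxing (ffmpeg): add the subtitles back as a soft subtitle track
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO_IN, "-i", SRT_OUT,
     "-map", "0", "-map", "1", "-c", "copy", "-c:s", "mov_text", VIDEO_OUT],
    check=True,
)
```

The roadmap items above layer extra features (editing the transcript, reusing generated .srt files, diarization, TTS overdubbing) on top of this basic flow.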