
Open Whisperer

AI Video Translator and Subtitler

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

[Screenshot: Open Whisperer showcase]

This project had two motivations: (1) I'm trying to learn Spanish, so this is a fun way to translate and learn from any video, and (2) I wanted to get back into Python development with a focus on some form of AI/machine learning.

The project's initial scope was deliberately kept narrow to reach an MVP quickly and ship a working app that can be self-hosted and used as a tool right away.

Expect bugs & beware of gremlins!

Btw: The name "open whisperer" is a play on the main open source project that powers this one (OpenAI's Whisper). I chose the name just to get started and have kept it until now; it is not intended to infringe on any copyrights or trademarks held by OpenAI.

For the devs learning to code (I mean, we're all learning, but...): this is a monorepo. If you're not familiar with this type of app, I came across a nice resource that explains the motivation behind it; it's the most extensive and helpful I've found.

Check it out here: https://monorepo.tools/

(back to top)

Built With

ffmpeg-python · OpenAI Whisper · argostranslate · Next.js

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Note: You will need about 14 GB of available space for all the AI language models.

Prerequisites

  1. Docker (recommended)

or, for a manual install:

  1. node@^22.14.0
  2. python@^3.11
  3. yarn@3.8.7

Docker Installation

  1. Install Docker: https://www.docker.com/get-started
  2. Create .env files (use the examples in both apps/python-server and apps/web-ui)
    cp apps/web-ui/.example.env.production.local apps/web-ui/.env.production.local # used by docker
    cp apps/web-ui/.example.env.production.local apps/web-ui/.env.development.local # used when running locally in dev (i.e.: npm run dev)
    cp apps/python-server/.example.env apps/python-server/.env # used by docker
  3. Run the containers
    docker compose up --build -d
  4. Verify the container is running
    docker ps
  5. Access the web UI at http://localhost:3567
  6. Enjoy

Manual Installation

A great deal of effort went into making sure it runs without issue on Docker; just try it and report back if you run into any problems.

// TODO: write manual installation instructions. In the meantime, do use venv

  1. Create and activate the virtual environment
# Create the venv (once)
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

(back to top)

Usage

The app features a simple, easy-to-use interface that lets you choose the languages to translate from and to.

Of course, no AI model is 100% accurate, so don't rely on this program where perfectly accurate transcripts or translations are required.

The usage is self-explanatory and each button highlights green when you're ready to move to the next step.

The demo version has upload limits and may include other restrictions and/or media scanning to comply with some local laws; if you need higher limits, use the self-hosted Docker option.

Roadmap

This is a very rough outline of how I may go about adding new features; the roadmap roughly follows the current order of importance for my use case.

Please, do not depend on this project as it is not stable and development may be sporadic.

Feel free to clone the project and make your own changes.

PRs are always welcome, and I'll be happy to merge any that make sense with the general direction of the project.

MVP

  • Set up the pipeline (extract audio, transcription, translation, muxing) and a working app (see the sketch after this list)
  • Docker image for self hosted option
  • Easily maintainable and well-structured monorepo
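
To make the MVP pipeline above concrete, here is a minimal, hypothetical sketch of the four stages wired together with the libraries from the Built With section. The file names, the Whisper model size, and the en→es language pair are placeholders I picked for illustration; this is not the project's actual implementation.

from datetime import timedelta

import ffmpeg                    # ffmpeg-python bindings
import srt                       # SRT composition helpers
import whisper                   # OpenAI Whisper
import argostranslate.package
import argostranslate.translate

VIDEO, AUDIO, SUBS, OUT = "input.mp4", "audio.wav", "subs.srt", "output.mp4"

# 1. Extract a mono 16 kHz audio track (the format Whisper expects)
ffmpeg.input(VIDEO).output(AUDIO, ac=1, ar=16000).overwrite_output().run()

# 2. Transcribe; Whisper also auto-detects the source language
model = whisper.load_model("base")  # placeholder model size
result = model.transcribe(AUDIO)

# 3. Install the language pair once, then translate each segment offline
argostranslate.package.update_package_index()
pkg = next(p for p in argostranslate.package.get_available_packages()
           if p.from_code == "en" and p.to_code == "es")
argostranslate.package.install_from_path(pkg.download())

subtitles = [
    srt.Subtitle(
        index=i,
        start=timedelta(seconds=seg["start"]),
        end=timedelta(seconds=seg["end"]),
        content=argostranslate.translate.translate(seg["text"].strip(), "en", "es"),
    )
    for i, seg in enumerate(result["segments"], start=1)
]
with open(SUBS, "w", encoding="utf-8") as f:
    f.write(srt.compose(subtitles))

# 4. Mux: burn the translated subtitles back into the video
ffmpeg.input(VIDEO).output(OUT, vf=f"subtitles={SUBS}").overwrite_output().run()

Each stage hands a plain file to the next, which is also what should make it straightforward to split the steps into background tasks with status reporting later on the roadmap.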

V1

  • Ability to edit the transcript before applying it to video
  • Allow applying source language subtitles to video
  • Show list of previously generated .srt files to quickly reuse and/or download
  • Sync video & transcript (when the user clicks the video, both should stay in sync)
  • Add wavesurfer audio visualizer to show events & subtitle timeline
  • Speaker diarization (recognize how many speakers spoke, when they spoke, and each speaker's gender)
  • Event-based status reporting with background tasks

V2

  • Add voice cloning to overdub videos in the translated language (support different accents, genders)
  • Support different "templates" for subtitle styles
  • Edit placement of subtitles
  • Detect duplicate video sources
  • Support uploading audio only and generating a transcript, with an option to output a karaoke-style blank video
  • Take advantage of hardware acceleration

Way into the future

  • More advanced cropping/slicing and basic video editing (to cut dead space)
  • Support concurrent uploads (multiple videos/audio at the same time w/ status reporting)

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/amazing-feature-i-want-to-add)
  3. Commit your Changes (git commit -m 'Add some amazing-feature-i-want-to-add')
  4. Push to the Branch (git push origin feat/amazing-feature-i-want-to-add)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Othneil Drew - LinkedIn @othneildrew - codeguydrew@gmail.com

Project Link: https://github.com/othneildrew/open-whisperer

Website: https://othneildrew.com

(back to top)

Acknowledgments

After much research, I came across this amazing set of tools/projects that made this one possible.

Big shout out to these amazing resources! Some aren't used yet, but these are more than likely what I'll use to implement other features on the roadmap.

| Task | Tool | Notes |
| --- | --- | --- |
| Audio Extraction | ffmpeg | Industry standard |
| Language Detection | Whisper | Detects and transcribes; use faster-whisper for speed |
| Multi-Speaker Diarization | pyannote-audio | Best diarization tool (offline support with Hugging Face model download) |
| Translation | argos-translate | Offline translation; install language pairs |
| Voice Synthesis (TTS) | Tortoise TTS, Coqui TTS | High quality; supports speaker cloning too |
| Subtitle Handling | ffmpeg, srt, autosub, or custom logic | SRT file generation and muxing |
| Muxing | ffmpeg | Add subtitles or TTS audio back to the video |
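
As a taste of how the diarization row might look in practice, here is a hypothetical sketch using pyannote-audio (it is not wired into the app yet); the pretrained pipeline id and the Hugging Face token handling are assumptions on my part.

from pyannote.audio import Pipeline

# Hypothetical sketch only: diarization is a roadmap item, not part of the app yet.
# Downloading the pretrained pipeline requires a Hugging Face access token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # assumed pipeline id
    use_auth_token="YOUR_HF_TOKEN",      # placeholder token
)
diarization = pipeline("audio.wav")

# Each turn says who spoke and when; those spans map naturally onto SRT segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")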

(back to top)