examples : add stereo to mono conversion in read_audio_data #3266

Merged: 1 commit, Jun 18, 2025

danbev (Member) commented Jun 18, 2025

This commit adds a conversion from stereo to mono in the read_audio_data function of common-whisper.cpp.

The motivation for this change is that, prior to commit 7d3da68 ("examples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759)"), there was a step that read stereo int16 data into pcm16 (448512 samples), then converted it to mono (224256 samples), and then also converted to stereo in pcmf32s.

The middle step seems to have been missed when the code was rewritten to use miniaudio, which caused issues when transcribing stereo audio files.
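As a rough illustration of the missing middle step, here is a minimal sketch of a stereo-to-mono downmix. It is not the code from this commit: `DownmixResult` and `downmix_stereo` are hypothetical names, and it assumes the decoder returns interleaved float samples (L0, R0, L1, R1, ...) and that averaging the two channels is an acceptable way to mix them.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper type, not taken from common-whisper.cpp: holds the
// mono mix plus the two de-interleaved channels.
struct DownmixResult {
    std::vector<float> mono;                  // one sample per frame
    std::vector<std::vector<float>> channels; // channels[0] = left, channels[1] = right
};

// Downmix interleaved stereo samples (L0, R0, L1, R1, ...) to mono by
// averaging each left/right pair, and keep the separate channels as well.
inline DownmixResult downmix_stereo(const std::vector<float> & interleaved) {
    if (interleaved.size() % 2 != 0) {
        throw std::invalid_argument("interleaved stereo data must have an even sample count");
    }

    const std::size_t n_frames = interleaved.size() / 2;

    DownmixResult out;
    out.mono.resize(n_frames);
    out.channels.assign(2, std::vector<float>(n_frames));

    for (std::size_t i = 0; i < n_frames; ++i) {
        const float left  = interleaved[2*i];
        const float right = interleaved[2*i + 1];

        out.mono[i]        = 0.5f * (left + right); // average of both channels
        out.channels[0][i] = left;
        out.channels[1][i] = right;
    }

    return out;
}
```

With this sketch, an input of 448512 interleaved samples yields 224256 mono samples, which matches the sample counts quoted above.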

For example, using the audio sample from the linked issue, the current output is:

[00:00:00.000 --> 00:00:03.000]  (speaker 1) Sous-titres réalisés para la communauté d'Amara.org

And with the change in this commit the output is:

[00:00:00.000 --> 00:00:01.500]  (speaker 1) *sonnerie de téléphone*
[00:00:01.500 --> 00:00:07.000]  (speaker 1) Salut jeune homme !
[00:00:07.000 --> 00:00:08.500]  (speaker 0) C'est vrai que je te dérange ?
[00:00:08.500 --> 00:00:10.500]  (speaker 1) Ah pas du tout, pas du tout, pas du tout !
[00:00:10.500 --> 00:00:12.500]  (speaker 1) J'étais en train de...
[00:00:12.500 --> 00:00:14.500]  (speaker 1) de préparer un courrier

Resolves: #3092


Notes/writeup: diarize-issue

@danbev danbev requested a review from ggerganov June 18, 2025 14:20
danbev (Member, Author) commented Jun 18, 2025

This might also be related to #3263 though I've not looked into it specifically.

@danbev danbev merged commit ecb8f3c into ggml-org:master Jun 18, 2025
52 of 54 checks passed

Successfully merging this pull request may close these issues.

--diarize flag no longer works with stereo input in latest release