Discussions section in repo #282
-
Is there an opportunity to add a Discussions section? It's not always a bug or a feature request; sometimes it's just something interesting. I'm building realtime speech-to-text with a microphone using this library and found an interesting thing: when I try to feed exactly 16k samples into the processor, I get an error. If I add 160 more samples, the error disappears. @sandrohanea, could you please comment on whether it might be related to a missing header which can be obligatory? Because I use a stream, obviously no header is provided. Attaching the repro draft, which uses the mic:

```csharp
using System.Diagnostics;
using System.Text;
using System.Threading.Channels;
using NAudio.CoreAudioApi;
using NAudio.Wave;
using Whisper.net;
using Whisper.net.Ggml;
using Whisper.net.Logger;

namespace Whisper_Realtime;

internal static class Program
{
    private static async Task Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.InputEncoding = Encoding.UTF8;

        var ggmlType = GgmlType.Tiny;
        var modelFileName = $"ggml-{ggmlType.ToString().ToLower()}.bin";
        if (!File.Exists(modelFileName))
        {
            await DownloadModel(modelFileName, ggmlType);
        }

        LogProvider.Instance.OnLog += (level, message) =>
        {
            Console.Write($"{level}: {message}");
        };

        using var whisperFactory = WhisperFactory.FromPath(modelFileName);
        var builder = whisperFactory.CreateBuilder()
            .WithPrompt("To jest duzy dom. Novy sklep.")
            .WithNoSpeechThreshold(0.8f)
            .WithLanguage("pl");
        await using var processor = builder.Build();

        var channel = Channel.CreateUnbounded<short[]>(new UnboundedChannelOptions
        {
            SingleReader = true,
            SingleWriter = true
        });

        var audioCaptureThread = new Thread(() => CaptureAudio(channel))
        {
            IsBackground = true
        };
        audioCaptureThread.Start();

        var reader = channel.Reader;
        var stopwatch = Stopwatch.StartNew();
        var sampleBuffer = new float[16000 * 60];
        var bufferSize = 0;
        var second = 1;
        while (await reader.WaitToReadAsync() && second < 16)
        {
            var samples = await reader.ReadAsync();
            var floatSamples = CastShortToFloat(samples);
            var targetSize = 16000 * second;
            AddToBuffer(sampleBuffer, floatSamples, ref bufferSize);
            if (bufferSize >= targetSize)
            {
                stopwatch.Restart();
                await foreach (var result in processor.ProcessAsync(sampleBuffer.AsMemory(0, targetSize)))
                {
                    if (result.Text.StartsWith(" [")) continue;
                    Console.WriteLine($" {result.Start}->{result.End}: {result.Text} => with probability: {result.Probability}");
                }
                Console.WriteLine($"Seconds sent: {second:000} Buffer size: {bufferSize:000000} Spent: {stopwatch.Elapsed.TotalMilliseconds} ms.");
                second++;
            }
        }
    }

    private static void CaptureAudio(Channel<short[]> channel)
    {
        var writer = channel.Writer;
        var waveIn = new WaveInEvent
        {
            DeviceNumber = 0,
            WaveFormat = new WaveFormat(16000, 16, 1) // 16 kHz, 16-bit, mono
        };
        waveIn.DataAvailable += (sender, args) =>
        {
            Console.WriteLine($"Bytes Recorded: {args.BytesRecorded}");
            var buffer = new short[args.BytesRecorded / 2];
            Buffer.BlockCopy(args.Buffer, 0, buffer, 0, args.BytesRecorded);
            writer.TryWrite(buffer);
        };
        waveIn.RecordingStopped += (_, _) =>
        {
            writer.Complete();
            waveIn.Dispose();
        };
        waveIn.StartRecording();
        Console.WriteLine("Press any key to stop recording...");
        Console.ReadKey();
        waveIn.StopRecording();
    }

    private static float[] CastShortToFloat(short[] samples)
    {
        // Normalize 16-bit PCM to the [-1, 1) float range expected by the processor.
        var floatSamples = new float[samples.Length];
        for (var i = 0; i < samples.Length; i++)
        {
            floatSamples[i] = samples[i] / 32768.0f;
        }
        return floatSamples;
    }

    private static void AddToBuffer(float[] destination, float[] source, ref int count)
    {
        Array.Copy(source, 0, destination, count, source.Length);
        count += source.Length;
    }

    private static async Task DownloadModel(string fileName, GgmlType ggmlType)
    {
        Console.WriteLine($"Downloading Model {fileName}");
        await using var modelStream = await WhisperGgmlDownloader.GetGgmlModelAsync(ggmlType);
        await using var fileWriter = File.OpenWrite(fileName);
        await modelStream.CopyToAsync(fileWriter);
    }
}
```
Replies: 2 comments
-
Hey @AncientLust,
Created the discussion page and converted this one to a discussion as well. Thanks for the suggestion!
Now, trying to answer your question as well: indeed, you'll need at least 1000 ms of audio to perform the inference:
https://github.com/ggerganov/whisper.cpp/blob/8c6a9b8bb6a0273cc0b5915903ca1ff9206c6285/src/whisper.cpp#L5375C5-L5375C39
It seems (based on the logs) that you're short by 10 ms. Indeed, as you're sending a Memory, no header is required, so a missing header shouldn't be the cause of the missing 10 ms. It seems that 16k frames of audio only produce 990 ms of mel spectrogram in the whisper.cpp library (one missing mel sample):
https://github.com/ggerganov/whisper.cpp/blob/8c6a9b8bb6a0273…
Just adding 100 frames before calling the processor should fix it. Unfortunately, I don't have an exact ETA for the new library (as I work on it only in my free time and on weekends), but I will announce it here as well once it's available.
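The explanation above suggests a simple client-side workaround: pad the capture buffer with a little silence so whisper.cpp sees slightly more than 1000 ms of audio. A minimal sketch, assuming 16 kHz sample rate; the helper name is illustrative (not part of Whisper.net's API), and the 160 extra samples match the figure from the question above:

```csharp
using System;

// Exactly 1000 ms at 16 kHz — per the reply above this yields only 990 ms of mel.
var exact = new float[16000];

// Pad to 16160 samples (+160 samples = one extra 10 ms hop) before ProcessAsync.
var padded = PadToMinimumSamples(exact, 16160);
Console.WriteLine(padded.Length); // 16160

static float[] PadToMinimumSamples(float[] samples, int minSamples)
{
    // Longer buffers pass through unchanged.
    if (samples.Length >= minSamples) return samples;

    // New float arrays are zero-filled, so the tail is silence.
    var result = new float[minSamples];
    Array.Copy(samples, result, samples.Length);
    return result;
}
```

One trade-off to note: zero padding is silence, so with an aggressive no-speech threshold the trailing pad could slightly influence segment detection; appending a few real captured samples (as the question did) avoids that.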
-
Hey @AncientLust, I’m excited to share that the new library, EchoSharp, is now available (still in its early stages): https://github.com/sandrohanea/echosharp/. It’s designed to leverage Whisper.net as well as other Speech-to-Text components and VAD modules for near-real-time audio processing. I’d greatly appreciate it if you could take some time to try it out and share any early feedback — it would mean a lot! Thank you!