OpenAI’s whisper module might change the game for the speech-to-text (STT) industry

Another AI tool by OpenAI? And it is free?

Apr 27, 2023

Yes. Another AI tool by OpenAI. This time it is free as long as you know how to code, at least a bit.

No, I am not talking about ChatGPT or DALL-E.

When OpenAI launched its ChatGPT API in March 2023, it also released an API for Whisper, its speech-to-text model, which had been available as open source since September 2022. Interestingly, not many people talked about it.

Speech-to-text (STT) technology has been around for some time, and companies like Google, Microsoft, and IBM already have their own STT services. OpenAI has recently entered the market with its whisper module, which can transcribe video or audio files in over 90 different languages. It comes in two versions: (1) a paid version through the OpenAI API ($0.006 per minute of audio), and (2) an open-source version that is free to run yourself.
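To put the API price in perspective, here is a small back-of-the-envelope sketch. It assumes the cost scales linearly with audio length at the $0.006/minute rate quoted above; the function name is my own, and actual billing details may differ.

```python
from decimal import Decimal

# Price quoted above for the hosted Whisper API (USD per minute of audio).
PRICE_PER_MINUTE = Decimal("0.006")


def estimate_api_cost(duration_seconds: float) -> Decimal:
    """Estimate the API cost in USD, assuming cost is proportional to duration."""
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.0001"))


# A one-hour video: 60 minutes * $0.006 = $0.36
print(estimate_api_cost(3600))
```

Under that assumption, an hour of audio comes to roughly 36 cents through the API, and nothing beyond your own hardware and electricity if you run the open-source version.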

Embarking on my STT Journey

I began my STT journey when I wanted to write a blog article about a video that a friend and I produced for a YouTube channel. The first step was to transcribe the video. However, when I tried using some of the existing speech-to-text services available in the market, I found them to be either too costly or inefficient.

Engaging freelancers on platforms like Upwork or Fiverr can reduce expenses, but the process can be tedious, requiring constant communication for each video. Some subscription services may lower the costs, but I already pay for subscriptions like Netflix and a gym membership, and I did not want to sign up for more. Hence, I decided to develop my own transcription service to save both time and money. After all, I am a tech bro with coding skills! Right?

Initially, I tried IBM’s transcription service, but it was not user-friendly, and I ran into technical difficulties that remained unresolved even after I reached out to their customer support. My next attempt was with Google’s speech_v1 transcription service, but it required me to upload everything to Google Cloud Storage to transcribe an audio file. While this worked well for large files, it wasn’t ideal for my needs, as I was looking for a service that could run locally. After further research, I discovered OpenAI’s whisper module, which is open-source and can be downloaded and executed on your local machine. That was precisely what I was looking for.
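For the curious, running whisper locally can look roughly like the sketch below. It is a minimal illustration, not the code behind my service: it assumes the open-source `openai-whisper` package and `ffmpeg` are installed, and the `transcribe` wrapper and the file name are hypothetical. The `model` parameter exists only so the function can be tried out with a stand-in object without downloading a model.

```python
def transcribe(path, model=None, model_name="base"):
    """Transcribe a local audio/video file and return Whisper's result dict."""
    if model is None:
        # Imported lazily so this file loads even where whisper is not installed.
        import whisper  # pip install openai-whisper (ffmpeg must be on PATH)

        model = whisper.load_model(model_name)
    # The result dict includes "text" (full transcript) and "segments"
    # (timestamped chunks).
    return model.transcribe(path)


# Usage (hypothetical file name):
#   result = transcribe("my_video.mp4")
#   print(result["text"])
```

Larger models such as "small" or "medium" are generally more accurate but slower, so "base" is just a convenient starting point here.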

Over a few hours on weekends, using OpenAI’s whisper module, I built the backend of my own STT service on my laptop, achieving high accuracy at no cost. At that moment, I had an idea: Why not make this service available to people worldwide who face the same challenges I did?

So, I went a step further and built a web-based platform (https://totext.ai) where you can upload your own files and transcribe them. The platform is not limited to media files. It also supports YouTube videos: you simply copy and paste the link into the web user interface.

Real-World Examples

For my demonstration, I used both a video file that a friend and I made for a YouTube channel and a video of President Biden that was available on YouTube.

The animated image displayed above illustrates the process of transcribing a .mp4 video. Uploading a media file can be done with a single click of a button, and transcription can begin immediately after pressing the “Run” button.

The second animated image shows the process of converting Biden’s speech into text. You can simply copy the YouTube URL of the video, paste it into the text field under the “YouTube Video” tab, and start the transcription by clicking the “Run” button.

For both examples, the output consists of a transcript, captions in SRT and VTT formats, and a summary. The whisper module generated the transcript and captions, while the application’s backend used the GPT-3.5 API to produce the summary.

It is important to note that the whisper module is better suited to offline transcription than to real-time streaming transcription, at least for the time being.

Once you have transcribed your audio or video files, you can use ChatGPT’s interface to perform various tasks on the transcript. For example, you can copy and paste the text into ChatGPT and ask it to summarize the transcript, translate it into a different language, or even generate a blog post.

What Now?

I believe OpenAI is going to revolutionize the STT industry by giving people with coding knowledge a way to avoid the high costs of current STT services. For those without technical skills, I anticipate that more affordable platforms will soon be available. Personally, I have already built such a platform, and I hope it will serve non-technical people at the lowest possible cost.

The saying “It is not the AI that will replace you, it is the people who use AI effectively” has become quite popular recently. However, I would like to rephrase it: “It is not the AI that will replace your business, it is the businesses that use AI effectively”.

Let me know what you think about this new tech. Do you think it will change the STT industry? I would love to hear your opinions.

I Love Technology Newsletter
