Simple hacks for reducing audio transcription cost.

I recently concluded a subcontracted project for one of the top real estate builders in the Middle East.

Requirement

With the advent of viral real estate listings on social media comes a barrage of call center inquiries about the projects and their details. The UAE, in particular, receives calls from across the globe.

The client was already using an AI SaaS product that would take call recordings and extract:

Entities:
- Name
- Nationality
- Property, builder, or area of interest
Budget
Purchase timeline

The system also tagged phone numbers and other metadata, automatically raising CRM tickets. These tickets were routed to agents over WhatsApp within seconds, along with a dossier containing extracted details, a transcript, and a playable audio file link. This enabled agents to exercise judgment and prioritize callbacks effectively.
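To make the shape of that dossier concrete, here is a minimal sketch of a record holding the extracted fields. The class name, field names, and sample values are illustrative assumptions, not the schema used in the project.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CallDossier:
    """Hypothetical record for one processed call (names are illustrative)."""

    phone_number: str
    transcript: str
    audio_url: str
    # Extracted fields are optional: a caller may never mention them.
    name: Optional[str] = None
    nationality: Optional[str] = None
    interest: Optional[str] = None  # property, builder, or area of interest
    budget: Optional[str] = None
    purchase_timeline: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the dossier, e.g. for a CRM ticket payload."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Placeholder values only; none of these come from real calls.
dossier = CallDossier(
    phone_number="+0000000000",
    transcript="...",
    audio_url="https://example.com/call.flac",
    nationality="placeholder",
)
print(dossier.to_json())
```

Keeping every extracted field optional lets a ticket still be raised when the model finds only part of the information.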

The pricing with the SaaS provider was a fixed cost of 7 AED per call. My task was to replicate this functionality at a lower cost, while ensuring the client’s in-house tech team could take over and fully own the resulting IP.

Audio Processing

Using ffmpeg, we processed each audio file to:

Standardize the format to FLAC
Reduce the bitrate to 128 kbps
Speed up the playback to 1.25x–1.5x

Running ffprobe on the optimized files showed that we reduced audio file sizes by approximately 60% on average. This resulted in downstream savings in both storage and processing costs.
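The steps above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the exact commands from the project. FLAC is a lossless codec, so the sketch does not pass a target bitrate; it assumes the bitrate drop comes from downmixing to mono and resampling to 16 kHz, which are assumptions here rather than settings from the project. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, tempo: float = 1.25) -> list[str]:
    """Build an ffmpeg command that re-encodes `src` to sped-up FLAC."""
    # atempo has historically accepted 0.5 to 2.0 per filter instance.
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("tempo must be between 0.5 and 2.0")
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ac", "1",          # assumption: mono is enough for speech
        "-ar", "16000",      # assumption: 16 kHz sample rate
        "-filter:a", f"atempo={tempo}",  # speed up without changing pitch
        "-c:a", "flac",
        str(dst),
    ]


def build_ffprobe_cmd(path: Path) -> list[str]:
    """Build an ffprobe command that reports size, bitrate, and duration as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=size,bit_rate,duration",
        "-of", "json",
        str(path),
    ]


def size_reduction_pct(original_bytes: int, optimized_bytes: int) -> float:
    """Percentage by which the optimized file is smaller than the original."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100.0 * (1 - optimized_bytes / original_bytes)


def transcode(src: Path, dst: Path, tempo: float = 1.25) -> None:
    """Run ffmpeg; requires the ffmpeg binary on PATH."""
    subprocess.run(build_ffmpeg_cmd(src, dst, tempo), check=True)
```

Separating command construction from execution keeps the settings easy to unit test without needing ffmpeg installed.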

Audio-to-Text Conversion

After experimenting with optimized versions of OpenAI Whisper and various Hugging Face models, I ultimately chose the Google Speech-to-Text API for its reliability. One key reason was the need to handle multilingual conversations: many South Asian callers switched between their native language and English, e.g. Hindi/Urdu -> English.

Google’s API allowed us to specify a primary language along with up to three alternative languages.

Another useful trick was hosting audio files in Google Cloud Storage. This appeared to reduce transcription time; my hunch is that Google avoids copying files already hosted within its cloud infrastructure (as opposed to fetching from S3 or a local server).
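As an illustration of the language settings and the Cloud Storage reference, here is a sketch of a JSON request body for the Speech-to-Text v1 REST API. The field names follow the v1 `RecognitionConfig` as I understand it and should be checked against the current reference. The bucket path, language codes, and function name are assumptions for the example.

```python
import json
from typing import Sequence


def build_recognize_request(
    gcs_uri: str,
    primary_language: str,
    alternative_languages: Sequence[str],
) -> dict:
    """Build a request body that points at audio already stored in GCS."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI for audio hosted in Cloud Storage")
    return {
        "config": {
            "encoding": "FLAC",
            "languageCode": primary_language,
            # Candidate languages the caller may switch into.
            "alternativeLanguageCodes": list(alternative_languages),
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and language choices, for illustration only.
request_body = build_recognize_request(
    "gs://example-bucket/calls/call-0001.flac",
    primary_language="en-IN",
    alternative_languages=["hi-IN", "ur-PK"],
)
print(json.dumps(request_body, indent=2))
```

Passing a `gs://` URI instead of inline audio bytes means the service reads the file from storage itself, which matches the hosting trick described above.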

I handed over a Dockerized version of the pipeline that micro-batched audio files in 3-minute windows and ran a pool of 20 concurrent workers. ffmpeg was the most CPU-intensive part of the process. The client now runs the entire pipeline on Kubernetes, where the number of messages in Redis Streams dynamically determines the size of the worker pool.
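The exact scaling rule is not described here, so the following is only a sketch of one common approach: derive the desired worker count from the stream backlog and clamp it to a range. The per-worker target, the bounds, and the function name are assumptions.

```python
import math


def desired_workers(
    backlog: int,
    msgs_per_worker: int = 5,   # assumption: target backlog per worker
    min_workers: int = 1,       # assumption: keep one worker warm
    max_workers: int = 20,      # assumption: cap mirrors the 20-worker pool
) -> int:
    """Map a stream backlog to a worker count within [min_workers, max_workers]."""
    if msgs_per_worker <= 0:
        raise ValueError("msgs_per_worker must be positive")
    if backlog <= 0:
        return min_workers
    wanted = math.ceil(backlog / msgs_per_worker)
    return max(min_workers, min(max_workers, wanted))


# The backlog could come from the Redis XLEN command on the stream,
# or from the pending count of the consumer group.
for backlog in (0, 12, 1000):
    print(backlog, desired_workers(backlog))
```

Clamping to a maximum keeps a sudden burst of calls from requesting more CPU than the cluster has reserved, which matters because the ffmpeg step is CPU-bound.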

Their cost is now the reserved instance plus Google transcription plus Gemini 2.5 Pro usage. When spread across the number of calls, they saved approximately 92% in June alone.
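The arithmetic behind that figure can be sketched as below. Only the 7 AED per-call price and the roughly 92% saving come from this post; the function names are hypothetical, and the actual monthly cost components are not given here, so none are filled in.

```python
SAAS_COST_PER_CALL_AED = 7.0   # fixed SaaS price stated in this post
REPORTED_SAVING = 0.92         # approximate saving reported for June


def cost_per_call(
    instance_cost: float,
    transcription_cost: float,
    llm_cost: float,
    calls: int,
) -> float:
    """Spread the three monthly cost components across the number of calls."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return (instance_cost + transcription_cost + llm_cost) / calls


def saving_fraction(new_cost: float, old_cost: float = SAAS_COST_PER_CALL_AED) -> float:
    """Fraction saved per call relative to the old per-call price."""
    return 1 - new_cost / old_cost


# Per-call cost implied by the reported saving: 7 * (1 - 0.92).
implied_cost = SAAS_COST_PER_CALL_AED * (1 - REPORTED_SAVING)
print(f"{implied_cost:.2f} AED per call")  # prints "0.56 AED per call"
```

In other words, a 92% saving against 7 AED corresponds to roughly 0.56 AED per call all-in.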

Tech stack:

Redis Streams for Pub/Sub
Python
- DSPy for interaction with LLMs
OpenRouter
ffmpeg
Google
- Cloud Storage
- Gemini 2.5 Pro
- Speech-to-Text