Your users want to talk to your app. They want it to talk back. They want it to understand them in any language. Azure Speech makes all of this possible, and you do not need an AI background to use it.
This talk is for application developers who want to add speech capabilities to .NET or Python applications using Azure.
Azure Speech in Foundry Tools is a cloud service in the Microsoft Foundry suite that handles all things speech:
One service, one subscription key, and SDKs for .NET, Python, Java, C++, and more.
Every operation follows the same three-step pattern: create a SpeechConfig with your key and region, attach an AudioConfig (microphone, file, or speaker), and create the recognizer or synthesizer that performs the operation.
Once you learn this pattern for one feature, every other feature works the same way.
.NET
// NuGet package: Microsoft.CognitiveServices.Speech
var config = SpeechConfig.FromSubscription(key, region);
Python
# pip install azure-cognitiveservices-speech
config = speechsdk.SpeechConfig(subscription=key, region=region)
Both read AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from environment variables.
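One way to wire up those environment variables in Python is a small helper that fails fast when either is missing, rather than letting the SDK fail later with a cryptic authentication error. This is a sketch; load_speech_settings is a hypothetical name, not part of the SDK:

```python
import os

def load_speech_settings():
    """Read the Speech key and region from environment variables,
    raising a clear error if either is unset."""
    key = os.getenv("AZURE_SPEECH_KEY")
    region = os.getenv("AZURE_SPEECH_REGION")
    missing = [name for name, value in
               [("AZURE_SPEECH_KEY", key), ("AZURE_SPEECH_REGION", region)]
               if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return key, region
```

The returned pair feeds straight into SpeechConfig in either language.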
Turn audio into text. Think voice commands, meeting transcription, or captioning for accessibility.
// C# -- the Python version follows the same pattern
using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new SpeechRecognizer(config, audioConfig);
var result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.RecognizedSpeech)
    Console.WriteLine($"Recognized: {result.Text}");
Swap FromDefaultMicrophoneInput() for FromWavFileInput("audio.wav") to transcribe a file.
Speech to text from the microphone and from a WAV file.
# .NET CLI
dotnet run -- stt
dotnet run -- stt --file sample.wav
# Python CLI
uv run python cli.py stt
uv run python cli.py stt --file sample.wav
# Streamlit dashboard
uv run streamlit run app.py
Give your app a voice. Use cases include IVR systems, accessibility readers, notification audio, and in-app narration.
// C# -- the Python version follows the same pattern
config.SpeechSynthesisVoiceName = "en-US-JennyNeural";
using var synthesizer = new SpeechSynthesizer(config, audioConfig);
await synthesizer.SpeakTextAsync("Hello, Computer!");
Use AudioConfig.FromDefaultSpeakerOutput() for live playback, FromWavFileOutput() to save a file, or pass null to get raw audio bytes for a web app.
SSML (Speech Synthesis Markup Language) gives you fine-grained control over how text is spoken:
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="medium" pitch="high">
      Hello, welcome to our presentation!
    </prosody>
  </voice>
</speak>
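If the spoken text comes from users, it is worth building the SSML programmatically and escaping it, since a stray & or < would produce invalid XML. A minimal sketch in Python; build_ssml is a hypothetical helper, and the result would go to the synthesizer's SSML method (speak_ssml_async in the Python SDK):

```python
from xml.sax.saxutils import escape

def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="high"):
    """Build an SSML document like the one above, escaping the text
    so characters such as & and < cannot break the XML."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        '</voice></speak>'
    )
```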
Text to speech through the speaker, saved to a file, and in the Streamlit dashboard with voice selection.
# .NET CLI
dotnet run -- tts --text "Hello, Computer!"
dotnet run -- tts --text "Hello!" --output hello.wav
dotnet run -- voices
# Python CLI
uv run python cli.py tts --text "Hello, Computer!"
# Streamlit dashboard -- Text to Speech page
Translate, in real time, spoken audio from one language to another. The output can be text or synthesized speech.
This is not speech-to-text followed by machine translation. The service is optimized for spoken language, handling conversational nuances that text translators miss.
Imagine a customer support app that handles calls in Mandarin and English without a bilingual agent.
Translation uses TranslationRecognizer instead of SpeechRecognizer. Otherwise, it's the same three-step pattern with a different class:
Set the recognition language (e.g., zh-CN for Mandarin) and add one or more target languages (e.g., en for English).
Real-time speech translation between languages.
Azure Speech can score how well someone speaks, not just what they say.
Where developers use this today:
The pronunciation assessment feature evaluates accuracy, fluency, completeness, and prosody, returning a score for each.
Pronunciation assessment scoring on spoken English.
Speech services become much more powerful when combined with a language model. Instead of just transcribing or speaking, your app can understand and respond.
The pattern is straightforward: speech to text turns the user's voice into a prompt, a language model generates a response, and text to speech reads it back.
Microsoft Foundry hosts language models (the GPT family, among others) behind an OpenAI-compatible API. The standard openai Python package works directly:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url=f"{endpoint}/openai/v1/",
)
response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": transcribed_text},
    ],
)
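The full voice chat loop then just composes the pieces. A sketch of one turn, with the speech and model steps injected as callables (the names here are hypothetical; in the real app they would wrap the recognizer, the OpenAI client shown above, and the synthesizer):

```python
def voice_chat_turn(recognize, ask_model, speak):
    """One turn of the voice chat loop: listen, think, answer aloud.

    recognize() -> str       transcribed user speech (speech to text)
    ask_model(text) -> str   language model response
    speak(text) -> None      synthesized playback (text to speech)
    """
    user_text = recognize()
    reply = ask_model(user_text)
    speak(reply)
    return user_text, reply

# Stub wiring for illustration only; real callables would use the
# Speech SDK and the chat completions call shown earlier.
if __name__ == "__main__":
    spoken = []
    voice_chat_turn(
        recognize=lambda: "What time is it?",
        ask_model=lambda text: f"You asked: {text}",
        speak=spoken.append,
    )
```

Keeping the three steps behind simple callables also makes the loop easy to unit test without a microphone or an API key.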
A voice chat loop: speak into the microphone, get an AI response read back to you.
# Streamlit dashboard -- Chat with AI page
uv run streamlit run app.py
Now that we have seen what Azure Speech can do, how does it compare to OpenAI's Whisper?
Whisper is an open-source speech recognition model. It handles one thing well: speech-to-text. Azure Speech is a full platform.
Choose Whisper when you need cost-effective transcription, want to run models locally, or only need speech-to-text.
Choose Azure Speech when you need real-time streaming, text-to-speech, translation, custom voices, or speech analysis.
You can also run Whisper inside Azure via the Azure OpenAI Service, so the two are not mutually exclusive.
Azure Speech uses a pay-as-you-go model with a generous free tier:
Free Tier (F0) -- 5 hours STT, 0.5M characters TTS, 5 hours translation per month.
Standard Tier (S0) -- pay-as-you-go rates per hour of audio and per million characters once you outgrow the free tier.
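To get a feel for the TTS allowance, here is some back-of-envelope arithmetic. The speaking rate and word length are assumptions for illustration, not Azure figures:

```python
# Rough estimate of how much narration 0.5M free TTS characters covers.
# Assumptions (not Azure figures): ~150 spoken words/minute, ~6 chars/word.
FREE_TTS_CHARS = 500_000
CHARS_PER_MINUTE = 150 * 6  # 900

free_tts_minutes = FREE_TTS_CHARS / CHARS_PER_MINUTE
print(f"~{free_tts_minutes:.0f} minutes "
      f"(~{free_tts_minutes / 60:.1f} hours) of speech per month")
```

Under those assumptions, the free tier covers roughly nine hours of synthesized speech per month, which is plenty for prototyping.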
Azure Speech handles speech-to-text, text-to-speech, translation, pronunciation scoring, and custom voices -- all through a single service.
You can start with the free tier today. Add the NuGet package or pip install the Python SDK, create a Speech resource in Azure, and start building.
To learn more, go here:
https://csmore.info/on/speech
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact