Hello, Computer

An Introduction to Azure Speech

Kevin Feasel (@feaselkl)
https://csmore.info/on/speech

Who Am I? What Am I Doing Here?

Motivation

Your users want to talk to your app. They want it to talk back. They want it to understand them in any language. Azure Speech makes all of this possible, and you do not need an AI background to use it.

This talk is for application developers who want to add speech capabilities to .NET or Python applications using Azure.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

What is Azure Speech?

Azure Speech is a cloud service in the Microsoft Foundry suite that handles all things speech:

  • Speech-to-text and text-to-speech
  • Real-time speech translation across languages
  • Pronunciation assessment and speech analysis
  • Custom neural voice creation

One service, one subscription key, and SDKs for .NET, Python, Java, C++, and more.

The SDK Pattern

Every operation follows the same three-step pattern:

  1. Create a Speech Configuration with your key and region
  2. Create an Audio Configuration for input or output
  3. Create a Recognizer or Synthesizer and call it

Once you learn this pattern for one feature, every other feature works the same way.

Setting Up the SDK

.NET

// NuGet package:
// Microsoft.CognitiveServices.Speech

var config = SpeechConfig
  .FromSubscription(key, region);

Python

# pip install
# azure-cognitiveservices-speech
import azure.cognitiveservices.speech \
    as speechsdk

config = speechsdk.SpeechConfig(
    subscription=key,
    region=region
)

Both read AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from environment variables.
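Reading those variables takes only a few lines. A minimal Python sketch; the helper names are ours, not part of the SDK, and the second function assumes the azure-cognitiveservices-speech package is installed:

```python
import os

def read_speech_env():
    """Read the Speech key and region from environment variables."""
    key = os.environ["AZURE_SPEECH_KEY"]
    region = os.environ["AZURE_SPEECH_REGION"]
    return key, region

def build_config():
    """Build a SpeechConfig from the environment (requires the Speech SDK)."""
    import azure.cognitiveservices.speech as speechsdk
    key, region = read_speech_env()
    return speechsdk.SpeechConfig(subscription=key, region=region)
```

Failing fast when the variables are missing (KeyError) is usually better than passing None to the SDK and getting a confusing error later.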

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Speech to Text

Turn audio into text. Think voice commands, meeting transcription, or captioning for accessibility.

  • Real-time transcription from a microphone or audio stream
  • Batch transcription of pre-recorded audio files
  • Support for 100+ languages and regional dialects
  • Speaker diarization: figure out who said what in a multi-speaker recording

Speech to Text in Code

// C# -- the Python version follows the same pattern
using var audioConfig =
    AudioConfig.FromDefaultMicrophoneInput();
using var recognizer =
    new SpeechRecognizer(config, audioConfig);

var result = await recognizer.RecognizeOnceAsync();

if (result.Reason == ResultReason.RecognizedSpeech)
    Console.WriteLine($"Recognized: {result.Text}");

Swap FromDefaultMicrophoneInput() for FromWavFileInput("audio.wav") to transcribe a file.
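For reference, a Python sketch of the same file-based flow. The transcribe_file helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def transcribe_file(config, wav_path):
    """Transcribe a single utterance from a WAV file.

    config is a speechsdk.SpeechConfig; returns the recognized text,
    or None if no speech was recognized.
    """
    import azure.cognitiveservices.speech as speechsdk

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=config, audio_config=audio_config)

    # recognize_once() blocks until one utterance (or silence) ends.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None
```

As in the C# version, swapping the AudioConfig is all it takes to move between microphone and file input.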

Demo Time

Speech to text from the microphone and from a WAV file.

# .NET CLI
dotnet run -- stt
dotnet run -- stt --file sample.wav

# Python CLI
uv run python cli.py stt
uv run python cli.py stt --file sample.wav

# Streamlit dashboard
uv run streamlit run app.py

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Text to Speech

Give your app a voice. Use cases include IVR systems, accessibility readers, notification audio, and in-app narration.

  • Hundreds of AI-generated neural voices across many languages
  • Prebuilt, multilingual, and custom voice options
  • Multiple audio output formats (WAV, MP3, OGG, and more)
  • SSML: an XML markup language for fine-grained control over how text is spoken

Text to Speech in Code

// C# -- the Python version follows the same pattern
config.SpeechSynthesisVoiceName =
    "en-US-JennyNeural";

using var synthesizer =
    new SpeechSynthesizer(config, audioConfig);

await synthesizer.SpeakTextAsync("Hello, Computer!");

Use AudioConfig.FromDefaultSpeakerOutput() for live playback, FromWavFileOutput() to save a file, or pass null to get raw audio bytes for a web app.
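A Python sketch covering both output modes. The speak_text helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def speak_text(config, text, output_path=None):
    """Synthesize text to the default speaker, or to a WAV file if a path is given.

    config is a speechsdk.SpeechConfig; returns the synthesis result.
    """
    import azure.cognitiveservices.speech as speechsdk

    if output_path:
        audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)
    else:
        audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

    config.speech_synthesis_voice_name = "en-US-JennyNeural"
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=config, audio_config=audio_config)

    # speak_text_async returns a future; .get() waits for playback/write to finish.
    return synthesizer.speak_text_async(text).get()
```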

Using SSML

SSML (Speech Synthesis Markup Language) gives you fine-grained control over how text is spoken:

<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="medium" pitch="high">
      Hello, welcome to our presentation!
    </prosody>
  </voice>
</speak>
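In the Python SDK, SSML goes through speak_ssml_async rather than speak_text_async. A sketch with a small string-building helper; build_ssml and speak_ssml are our names, and the SDK call assumes the azure-cognitiveservices-speech package:

```python
def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="high"):
    """Wrap plain text in a minimal SSML document."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        '</voice></speak>'
    )

def speak_ssml(config, text):
    """Synthesize SSML through the default speaker (requires the Speech SDK)."""
    import azure.cognitiveservices.speech as speechsdk
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    return synthesizer.speak_ssml_async(build_ssml(text)).get()
```

Note that the voice is named inside the SSML itself, so any voice name set on the config is ignored for SSML requests.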

Demo Time

Text to speech through the speaker, saved to a file, and in the Streamlit dashboard with voice selection.

# .NET CLI
dotnet run -- tts --text "Hello, Computer!"
dotnet run -- tts --text "Hello!" --output hello.wav
dotnet run -- voices

# Python CLI
uv run python cli.py tts --text "Hello, Computer!"

# Streamlit dashboard -- Text to Speech page

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Multi-Lingual Speech Translation

Translate spoken audio from one language to another in real time. The output can be text or synthesized speech.

This is not speech-to-text followed by machine translation. The service is optimized for spoken language, handling conversational nuances that text translators miss.

Translation in Action: Chinese and English

Imagine a customer support app that handles calls in Mandarin and English without a bilingual agent.

  • Translate spoken Mandarin to English text in real time
  • Translate spoken English to Mandarin text
  • Combine with text-to-speech for full speech-to-speech translation
  • Works with dozens of other language pairs as well

Setting Up Translation

Translation uses TranslationRecognizer instead of SpeechRecognizer. Otherwise, it is the familiar pattern with a translation-specific configuration:

  1. Create a Translation Configuration with your subscription key
  2. Set the speech recognition language (e.g., zh-CN for Mandarin)
  3. Add target languages (e.g., en for English)
  4. Create a Translation Recognizer and call it
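The four steps above, sketched in Python. The translate_once helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def translate_once(key, region, source="zh-CN", targets=("en",)):
    """Recognize one utterance from the microphone and translate it."""
    import azure.cognitiveservices.speech as speechsdk

    # Step 1: translation-specific configuration
    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    # Step 2: the language being spoken
    config.speech_recognition_language = source
    # Step 3: one or more target languages
    for lang in targets:
        config.add_target_language(lang)
    # Step 4: create the recognizer and call it
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config)
    result = recognizer.recognize_once()

    # result.translations maps target language codes to translated text
    return dict(result.translations)
```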

Demo Time

Real-time speech translation between languages.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Analyze Speech

Azure Speech can score how well someone speaks, not just what they say.

Where developers use this today:

  • Language learning apps that give pronunciation feedback
  • Speech therapy tools that track patient progress
  • Call center QA that scores agent communication
  • Presentation coaching that evaluates delivery

Pronunciation Assessment

The pronunciation assessment feature evaluates:

  • Accuracy -- How closely phonemes match a native speaker
  • Fluency -- How natural the speech flow is
  • Completeness -- How much of the expected text was spoken
  • Prosody -- The rhythm, stress, and intonation of speech (do you sound natural or robotic?)
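Pronunciation assessment piggybacks on the regular speech-to-text recognizer: you attach an assessment configuration and then read the scores off the result. A sketch; the assess_pronunciation helper is ours, it assumes the azure-cognitiveservices-speech package, and prosody scoring requires a recent SDK version:

```python
def assess_pronunciation(config, reference_text, wav_path):
    """Score a recording of someone reading reference_text aloud."""
    import azure.cognitiveservices.speech as speechsdk

    pa_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
    pa_config.enable_prosody_assessment()

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=config, audio_config=audio_config)
    pa_config.apply_to(recognizer)  # attach assessment to the recognizer

    result = recognizer.recognize_once()
    scores = speechsdk.PronunciationAssessmentResult(result)
    return {
        "accuracy": scores.accuracy_score,
        "fluency": scores.fluency_score,
        "completeness": scores.completeness_score,
        "prosody": scores.prosody_score,
    }
```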

Demo Time

Pronunciation assessment scoring on spoken English.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Speech + Language Models

Speech services become much more powerful when combined with a language model. Instead of just transcribing or speaking, your app can understand and respond.

The pattern is straightforward:

  1. Listen -- capture audio and transcribe with Speech-to-Text
  2. Think -- send the transcription to a language model via Microsoft Foundry
  3. Speak -- convert the model's response to audio with Text-to-Speech
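The loop itself is just plumbing. A sketch of the orchestration, with the three stages passed in as functions so any speech-to-text, model, and text-to-speech implementation can plug in (voice_chat_turn is our name, not an SDK call):

```python
def voice_chat_turn(listen, think, speak):
    """One turn of a voice chat: transcribe, generate a reply, speak it.

    listen() -> str, think(str) -> str, speak(str) -> None.
    Returns (user_text, reply) so the caller can log the conversation.
    """
    user_text = listen()     # 1. Listen: speech-to-text
    reply = think(user_text) # 2. Think: language model
    speak(reply)             # 3. Speak: text-to-speech
    return user_text, reply
```

Keeping the stages as plain functions also makes the loop easy to test with fakes before wiring in the real services.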

Connecting to Microsoft Foundry

Microsoft Foundry hosts language models (GPT-5.4, etc.) behind an OpenAI-compatible API. The standard openai Python package works directly:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url=f"{endpoint}/openai/v1/",
)

response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": transcribed_text},
    ],
)

Demo Time

A voice chat loop: speak into the microphone, get an AI response read back to you.

# Streamlit dashboard -- Chat with AI page
uv run streamlit run app.py

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Whisper vs Azure Speech

Now that we have seen what Azure Speech can do, how does it compare to OpenAI's Whisper?

Whisper is an open-source speech recognition model. It handles one thing well: speech-to-text. Azure Speech is a full platform.

When to Choose Which

Choose Whisper when you need cost-effective transcription, want to run models locally, or only need speech-to-text.

Choose Azure Speech when you need real-time streaming, text-to-speech, translation, custom voices, or speech analysis.

You can also run Whisper inside Azure via the Azure OpenAI Service, so the two are not mutually exclusive.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Pricing

Azure Speech uses a pay-as-you-go model with a generous free tier:

Free Tier (F0) -- 5 hours STT, 0.5M characters TTS, 5 hours translation per month.

Standard Tier (S0)

  • Speech-to-text: ~$1 per audio hour
  • Text-to-speech: ~$16 per 1M characters (neural)
  • Speech translation: ~$2.50 per audio hour
  • Custom Neural Voice: training and hosting costs vary (but will be expensive)
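A back-of-the-envelope estimate using the approximate S0 rates above (rates are illustrative; check current Azure pricing before budgeting):

```python
STT_PER_HOUR = 1.00        # ~$1 per audio hour
TTS_PER_MILLION = 16.00    # ~$16 per 1M neural characters
TRANSLATE_PER_HOUR = 2.50  # ~$2.50 per audio hour

def monthly_cost(stt_hours, tts_chars, translate_hours):
    """Rough monthly S0 cost in USD for the three core features."""
    return (stt_hours * STT_PER_HOUR
            + (tts_chars / 1_000_000) * TTS_PER_MILLION
            + translate_hours * TRANSLATE_PER_HOUR)

# 50 hours of transcription, 2M characters spoken, 10 hours translated:
# 50 * 1 + 2 * 16 + 10 * 2.50 = $107 per month
```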

Cost Optimization Tips

  • Use the free tier for development and testing
  • Batch processing can be more cost-effective than real-time
  • Choose the right voice tier -- prebuilt voices are cheaper than custom
  • Cache synthesized audio when the same text is spoken repeatedly
  • Monitor usage with Azure Cost Management
  • Compare with Whisper pricing for transcription-only workloads

Wrapping Up

Azure Speech handles speech-to-text, text-to-speech, translation, pronunciation scoring, and custom voices -- all through a single service.

You can start with the free tier today. Add the NuGet package or pip install the Python SDK, create a Speech resource in Azure, and start building.

Wrapping Up

To learn more, go here:
https://csmore.info/on/speech

And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact