Hello, Computer

An Introduction to Azure Speech

Kevin Feasel (@feaselkl)
https://csmore.info/on/speech

Who Am I? What Am I Doing Here?

Motivation

Your users want to talk to your app. They want it to talk back. They want it to understand them in any language. Azure Speech makes all of this possible, and you do not need an AI background to use it.

This talk is for application developers who want to add speech capabilities to .NET or Python applications using Azure.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

What is Azure Speech?

Azure Speech is a cloud service in the Microsoft Foundry suite that handles all things speech:

  • Speech-to-text and text-to-speech
  • Real-time speech translation across languages
  • Pronunciation assessment and speech analysis
  • Custom neural voice creation

One service, one subscription key, and SDKs for .NET, Python, Java, C++, and more.

The SDK Pattern

Every operation follows the same three-step pattern:

  1. Create a Speech Configuration with your key and region
  2. Create an Audio Configuration for input or output
  3. Create a Recognizer or Synthesizer and call it

Once you learn this pattern for one feature, every other feature works the same way.

Setting Up the SDK

.NET

// NuGet package:
// Microsoft.CognitiveServices.Speech

var config = SpeechConfig
  .FromSubscription(key, region);

Python

# pip install
# azure-cognitiveservices-speech
import azure.cognitiveservices.speech \
    as speechsdk

config = speechsdk.SpeechConfig(
    subscription=key,
    region=region
)

Both read AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from environment variables.
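Reading those variables takes only a few lines. A minimal Python sketch; the helper names are ours, not part of the SDK, and the second function assumes the azure-cognitiveservices-speech package is installed:

```python
import os

def read_speech_env():
    """Read the Speech key and region from environment variables."""
    key = os.environ["AZURE_SPEECH_KEY"]
    region = os.environ["AZURE_SPEECH_REGION"]
    return key, region

def build_config():
    """Build a SpeechConfig from the environment (requires the Speech SDK)."""
    import azure.cognitiveservices.speech as speechsdk
    key, region = read_speech_env()
    return speechsdk.SpeechConfig(subscription=key, region=region)
```

Failing fast when the variables are missing (KeyError) is usually better than passing None to the SDK and getting a confusing error later.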

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Speech to Text

Turn audio into text. Think voice commands, meeting transcription, or captioning for accessibility.

  • Real-time transcription from a microphone or audio stream
  • Batch transcription of pre-recorded audio files
  • Support for 100+ languages and regional dialects
  • Speaker diarization: figure out who said what in a multi-speaker recording

Speech to Text in Code

// C# -- the Python version follows the same pattern
using var audioConfig =
    AudioConfig.FromDefaultMicrophoneInput();
using var recognizer =
    new SpeechRecognizer(config, audioConfig);

var result = await recognizer.RecognizeOnceAsync();

if (result.Reason == ResultReason.RecognizedSpeech)
    Console.WriteLine($"Recognized: {result.Text}");

Swap FromDefaultMicrophoneInput() for FromWavFileInput("audio.wav") to transcribe a file.
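For reference, a Python sketch of the same file-based flow. The transcribe_file helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def transcribe_file(config, wav_path):
    """Transcribe a single utterance from a WAV file.

    config is a speechsdk.SpeechConfig; returns the recognized text,
    or None if no speech was recognized.
    """
    import azure.cognitiveservices.speech as speechsdk

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=config, audio_config=audio_config)

    # recognize_once() blocks until one utterance (or silence) ends.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None
```

As in the C# version, swapping the AudioConfig is all it takes to move between microphone and file input.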

Demo Time

Speech to text from the microphone and from a WAV file.

# .NET CLI
dotnet run -- stt
dotnet run -- stt --file sample.wav

# Python CLI
uv run python cli.py stt
uv run python cli.py stt --file sample.wav

# Streamlit dashboard
uv run streamlit run app.py

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Text to Speech

Give your app a voice. Use cases include IVR systems, accessibility readers, notification audio, and in-app narration.

  • Hundreds of AI-generated neural voices across many languages
  • Prebuilt, multilingual, and custom voice options
  • Multiple audio output formats (WAV, MP3, OGG, and more)
  • SSML: an XML markup language for fine-grained control over how text is spoken

Text to Speech in Code

// C# -- the Python version follows the same pattern
config.SpeechSynthesisVoiceName =
    "en-US-JennyNeural";

using var synthesizer =
    new SpeechSynthesizer(config, audioConfig);

await synthesizer.SpeakTextAsync("Hello, Computer!");

Use AudioConfig.FromDefaultSpeakerOutput() for live playback, FromWavFileOutput() to save a file, or pass null to get raw audio bytes for a web app.
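A Python sketch covering both output modes. The speak_text helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def speak_text(config, text, output_path=None):
    """Synthesize text to the default speaker, or to a WAV file if a path is given.

    config is a speechsdk.SpeechConfig; returns the synthesis result.
    """
    import azure.cognitiveservices.speech as speechsdk

    if output_path:
        audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)
    else:
        audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

    config.speech_synthesis_voice_name = "en-US-JennyNeural"
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=config, audio_config=audio_config)

    # speak_text_async returns a future; .get() waits for playback/write to finish.
    return synthesizer.speak_text_async(text).get()
```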

Using SSML

SSML (Speech Synthesis Markup Language) gives you fine-grained control over how text is spoken:

<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="medium" pitch="high">
      Hello, welcome to our presentation!
    </prosody>
  </voice>
</speak>
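In the Python SDK, SSML goes through speak_ssml_async rather than speak_text_async. A sketch with a small string-building helper; build_ssml and speak_ssml are our names, and the SDK call assumes the azure-cognitiveservices-speech package:

```python
def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="high"):
    """Wrap plain text in a minimal SSML document."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        '</voice></speak>'
    )

def speak_ssml(config, text):
    """Synthesize SSML through the default speaker (requires the Speech SDK)."""
    import azure.cognitiveservices.speech as speechsdk
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    return synthesizer.speak_ssml_async(build_ssml(text)).get()
```

Note that the voice is named inside the SSML itself, so any voice name set on the config is ignored for SSML requests.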

Demo Time

Text to speech through the speaker, saved to a file, and in the Streamlit dashboard with voice selection.

# .NET CLI
dotnet run -- tts --text "Hello, Computer!"
dotnet run -- tts --text "Hello!" --output hello.wav
dotnet run -- voices

# Python CLI
uv run python cli.py tts --text "Hello, Computer!"

# Streamlit dashboard -- Text to Speech page

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Multi-Lingual Speech Translation

Translate spoken audio from one language to another in real time. The output can be text or synthesized speech.

This is not speech-to-text followed by machine translation. The service is optimized for spoken language, handling conversational nuances that text translators miss.

Translation in Action: Chinese and English

Imagine a customer support app that handles calls in Mandarin and English without a bilingual agent.

  • Translate spoken Mandarin to English text in real time
  • Translate spoken English to Mandarin text
  • Combine with text-to-speech for full speech-to-speech translation
  • Works with dozens of other language pairs as well

Setting Up Translation

Translation uses TranslationRecognizer instead of SpeechRecognizer. Otherwise, it is the familiar pattern with a translation-specific configuration:

  1. Create a Translation Configuration with your subscription key
  2. Set the speech recognition language (e.g., zh-CN for Mandarin)
  3. Add target languages (e.g., en for English)
  4. Create a Translation Recognizer and call it
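The four steps above, sketched in Python. The translate_once helper is ours, and it assumes the azure-cognitiveservices-speech package:

```python
def translate_once(key, region, source="zh-CN", targets=("en",)):
    """Recognize one utterance from the microphone and translate it."""
    import azure.cognitiveservices.speech as speechsdk

    # Step 1: translation-specific configuration
    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    # Step 2: the language being spoken
    config.speech_recognition_language = source
    # Step 3: one or more target languages
    for lang in targets:
        config.add_target_language(lang)
    # Step 4: create the recognizer and call it
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config)
    result = recognizer.recognize_once()

    # result.translations maps target language codes to translated text
    return dict(result.translations)
```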

Demo Time

Real-time speech translation between languages.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Analyze Speech

Azure Speech can score how well someone speaks, not just what they say.

Where developers use this today:

  • Language learning apps that give pronunciation feedback
  • Speech therapy tools that track patient progress
  • Call center QA that scores agent communication
  • Presentation coaching that evaluates delivery

Pronunciation Assessment

The pronunciation assessment feature evaluates:

  • Accuracy -- How closely phonemes match a native speaker
  • Fluency -- How natural the speech flow is
  • Completeness -- How much of the expected text was spoken
  • Prosody -- The rhythm, stress, and intonation of speech (do you sound natural or robotic?)
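Pronunciation assessment piggybacks on the regular speech-to-text recognizer: you attach an assessment configuration and then read the scores off the result. A sketch; the assess_pronunciation helper is ours, it assumes the azure-cognitiveservices-speech package, and prosody scoring requires a recent SDK version:

```python
def assess_pronunciation(config, reference_text, wav_path):
    """Score a recording of someone reading reference_text aloud."""
    import azure.cognitiveservices.speech as speechsdk

    pa_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
    pa_config.enable_prosody_assessment()

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=config, audio_config=audio_config)
    pa_config.apply_to(recognizer)  # attach assessment to the recognizer

    result = recognizer.recognize_once()
    scores = speechsdk.PronunciationAssessmentResult(result)
    return {
        "accuracy": scores.accuracy_score,
        "fluency": scores.fluency_score,
        "completeness": scores.completeness_score,
        "prosody": scores.prosody_score,
    }
```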

Demo Time

Pronunciation assessment scoring on spoken English.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Speech + Language Models

Speech services become much more powerful when combined with a language model. Instead of just transcribing or speaking, your app can understand and respond.

The pattern is straightforward:

  1. Listen -- capture audio and transcribe with Speech-to-Text
  2. Think -- send the transcription to a language model via Microsoft Foundry
  3. Speak -- convert the model's response to audio with Text-to-Speech
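The loop itself is just plumbing. A sketch of the orchestration, with the three stages passed in as functions so any speech-to-text, model, and text-to-speech implementation can plug in (voice_chat_turn is our name, not an SDK call):

```python
def voice_chat_turn(listen, think, speak):
    """One turn of a voice chat: transcribe, generate a reply, speak it.

    listen() -> str, think(str) -> str, speak(str) -> None.
    Returns (user_text, reply) so the caller can log the conversation.
    """
    user_text = listen()     # 1. Listen: speech-to-text
    reply = think(user_text) # 2. Think: language model
    speak(reply)             # 3. Speak: text-to-speech
    return user_text, reply
```

Keeping the stages as plain functions also makes the loop easy to test with fakes before wiring in the real services.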

Connecting to Microsoft Foundry

Microsoft Foundry hosts language models (GPT-5.4, etc.) behind an OpenAI-compatible API. The standard openai Python package works directly:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url=f"{endpoint}/openai/v1/",
)

response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": transcribed_text},
    ],
)

Demo Time

A voice chat loop: speak into the microphone, get an AI response read back to you.

# Streamlit dashboard -- Chat with AI page
uv run streamlit run app.py

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Whisper vs Azure Speech

Now that we have seen what Azure Speech can do, how does it compare to OpenAI's Whisper?

Whisper is an open-source speech recognition model. It handles one thing well: speech-to-text. Azure Speech is a full platform.

When to Choose Which

Choose Whisper when you need cost-effective transcription, want to run models locally, or only need speech-to-text.

Choose Azure Speech when you need real-time streaming, text-to-speech, translation, custom voices, or speech analysis.

You can also run Whisper inside Azure via the Azure OpenAI Service, so the two are not mutually exclusive.

Agenda

  1. Getting Started
  2. Speech to Text
  3. Text to Speech
  4. Multi-Lingual Speech Translation
  5. Analyze Speech
  6. Speech + Language Models
  7. Whisper vs Azure Speech
  8. Pricing

Pricing

Azure Speech uses a pay-as-you-go model with a generous free tier:

Free Tier (F0) -- 5 hours STT, 0.5M characters TTS, 5 hours translation per month.

Standard Tier (S0)

  • Speech-to-text: ~$1 per audio hour
  • Text-to-speech: ~$16 per 1M characters (neural)
  • Speech translation: ~$2.50 per audio hour
  • Custom Neural Voice: training and hosting costs vary (but will be expensive)
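A back-of-the-envelope estimate using the approximate S0 rates above (rates are illustrative; check current Azure pricing before budgeting):

```python
STT_PER_HOUR = 1.00        # ~$1 per audio hour
TTS_PER_MILLION = 16.00    # ~$16 per 1M neural characters
TRANSLATE_PER_HOUR = 2.50  # ~$2.50 per audio hour

def monthly_cost(stt_hours, tts_chars, translate_hours):
    """Rough monthly S0 cost in USD for the three core features."""
    return (stt_hours * STT_PER_HOUR
            + (tts_chars / 1_000_000) * TTS_PER_MILLION
            + translate_hours * TRANSLATE_PER_HOUR)

# 50 hours of transcription, 2M characters spoken, 10 hours translated:
# 50 * 1 + 2 * 16 + 10 * 2.50 = $107 per month
```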

Cost Optimization Tips

  • Use the free tier for development and testing
  • Batch processing can be more cost-effective than real-time
  • Choose the right voice tier -- prebuilt voices are cheaper than custom
  • Cache synthesized audio when the same text is spoken repeatedly
  • Monitor usage with Azure Cost Management
  • Compare with Whisper pricing for transcription-only workloads

Wrapping Up

Azure Speech handles speech-to-text, text-to-speech, translation, pronunciation scoring, and custom voices -- all through a single service.

You can start with the free tier today. Add the NuGet package or pip install the Python SDK, create a Speech resource in Azure, and start building.

Wrapping Up

To learn more, go here:
https://csmore.info/on/speech

And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact