Kafka For .NET Developers

Kevin Feasel (@feaselkl)

http://CSmore.info/on/kafka

Who Am I? What Am I Doing Here?

Catallaxy Services
@feaselkl
Curated SQL
We Speak Linux

Kafka

Apache Kafka is a message broker commonly used alongside the Hadoop ecosystem. It receives messages from producers and sends messages to consumers. Everything in Kafka is distributed.

Why Use Kafka?

Suppose we have two applications which want to communicate. We connect them directly.

This works great at low scale: it's easy to understand, easy to work with, and has fewer moving parts to break. But it hits scale limitations.

Why Use Kafka?

We then expand out.

Easy to expand this way as long as you don't overwhelm the DB. Eventually you will.

Why Use Kafka?

We then expand out. Again.

This takes some effort: we need to manage connection strings and write to the correct database. But it's doable and expands indefinitely.

Why Use Kafka?

But what happens when a consumer (database) goes down?

Producers (app servers) either hold messages or fail. Neither option is great.

Why Use Kafka?

Enter brokers. Brokers take messages from producers and feed messages to consumers.

Consumer down? Broker holds messages & producers don't care. Producer down? Consumers don't care. Brokers deal with the jumble of connections and help with scale-out.

Motivation

Today's talk will focus on using Kafka to ingest, enrich, and consume data. We will build .NET applications in Windows to talk to a Kafka cluster on Linux.

Our data source is flight data. I’d like to ask a few questions, with answers split out by destination state:

  1. How many flights did we have in 2008?
  2. How many flights' arrivals were delayed?
  3. How many minutes of arrival delay did we have?
  4. Given a flight with a delay, how long can we expect it to be?

Agenda

  1. Kafka Concepts
  2. Producer App
  3. Enricher App
  4. Consumer App
  5. Performance

Kafka Concepts

  • Most message brokers act as queues.

Kafka Concepts

  • Kafka is a log, not a queue. Multiple consumers may read the same message and a consumer may re-read messages. Think microservices and replaying data.

Kafka Concepts

  • Brokers foster communication between producers and consumers. They store the produced messages and keep track of what consumers have read.

Kafka Concepts

  • Topics are categories or feeds to which messages get published.
  • Topics are broken up into partitions. Partitions are ordered, immutable sequences of records.

Kafka Concepts

  • Producers push messages to Kafka.

Kafka Concepts

  • Consumers read messages from topics.

Kafka Concepts

  • Consumers enlist in consumer groups. Consumer groups act as "logical subscribers" and Kafka distributes load to consumers in a group.

Kafka Concepts

  • Items in partitions are immutable. You do not modify existing records, but you can append new ones.

Kafka Concepts

  • Consumers should know where they left off. Kafka assists by storing consumer group-specific last-read pointer values per topic and partition.
  • Kafka retains messages for a certain (configurable) amount of time, after which point they drop off.
  • Kafka can also delete the oldest messages once you reach a certain (configurable) amount of disk space.
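Both behaviors are controlled by broker-side settings. As an illustrative server.properties fragment (the values are examples, not recommendations):

```
# Keep messages for 7 days...
log.retention.hours=168
# ...or until a partition reaches ~1 GB on disk, whichever comes first.
log.retention.bytes=1073741824
```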

The Competition

  • MSMQ and Service Broker: queues in Microsoftland
  • Amazon Kinesis and Azure Event Hubs: Kafka as a Service
  • RabbitMQ: complex routing & guaranteed reliability
  • Celery: distributed queue built for Python
  • Queues.io lists dozens of queues and brokers

Agenda

  1. Kafka Concepts
  2. Producer App
  3. Enricher App
  4. Consumer App
  5. Performance

Producer App

Our first application reads data from a CSV and pushes messages onto a topic.

This application will not try to understand the messages; it simply takes data and pushes it to a topic.
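As a sketch of the shape this takes, here is a minimal producer using Confluent.Kafka's builder API. The broker address, topic name, and file path are placeholders, and the demo's actual code may differ:

```csharp
using System;
using System.IO;
using Confluent.Kafka;

class ProducerSketch
{
    static void Main()
    {
        var config = new ProducerConfig { BootstrapServers = "localhost:9092" };

        using var producer = new ProducerBuilder<Null, string>(config).Build();

        // Push each CSV line onto the topic as-is; no parsing happens here.
        foreach (var line in File.ReadLines("2008.csv"))
        {
            producer.Produce("flights", new Message<Null, string> { Value = line });
        }

        // Wait for any in-flight messages before exiting.
        producer.Flush(TimeSpan.FromSeconds(10));
    }
}
```

String values here are serialized as UTF-8 byte arrays under the hood before going over the wire.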

Sidebar

I chose Confluent's .NET Kafka library (née RDKafka-dotnet).

There are several libraries available, each with its own benefits and drawbacks. This library serves up messages in an event-based model and has official support from Confluent, so it's the one to use.

Producer App

Demo Time

Producer App

Takeaways

  • Data encoded as UTF-8 byte arrays and passed to Kafka
  • Can get 60K records/sec
  • Consumers can read while producers are pushing

Agenda

  1. Kafka Concepts
  2. Producer App
  3. Enricher App
  4. Consumer App
  5. Performance

Enricher App

Our second application reads data from one topic and pushes messages onto a different topic.

This application provides structure to our data and will be the largest application.

Enricher App

Enrichment opportunities:

  1. Convert "NA" values to appropriate values: either a default value or None (not NULL!).
  2. Perform lookups against airports given an airport code.
  3. Convert the input CSV record into a structured type (similar to a class).
  4. Output results as JSON for later consumers.
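A sketch of enrichment steps 1, 3, and 4 above. The column positions (0 = Year, 14 = ArrDelay, 18th column = Dest) and field names are assumptions based on the 2008 flight data layout, not the demo's actual code:

```csharp
using System.Text.Json;

public record Flight(int Year, string Dest, int? ArrDelay);

public static class Enricher
{
    // "NA" becomes an absent value (null here), not a magic number.
    static int? ParseNullableInt(string field) =>
        field == "NA" ? (int?)null : int.Parse(field);

    public static string Enrich(string csvLine)
    {
        var f = csvLine.Split(',');

        var flight = new Flight(
            Year: int.Parse(f[0]),
            Dest: f[17],
            ArrDelay: ParseNullableInt(f[14]));

        // Serialize as JSON for later consumers.
        return JsonSerializer.Serialize(flight);
    }
}
```

The airport lookup (step 2) is omitted here; it would be cached or batched rather than issued as one SQL query per message.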

Enricher App

Demo Time

Enricher App

Takeaways

  • Plan external access to maximize throughput: putting the SQL query in the wrong place might result in 7 million SQL queries!
  • Can get 20K records/sec
  • The code starts reading from the beginning of the topic, but it doesn't need to

Agenda

  1. Kafka Concepts
  2. Producer App
  3. Enricher App
  4. Consumer App
  5. Performance

Consumer App

Our third application reads data from the enriched topic, aggregates, and periodically writes results to SQL Server.
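A rough sketch of that loop, with the group ID, topic name, and batch size as placeholder values; the JSON deserialization and SQL write are elided:

```csharp
using Confluent.Kafka;
using System.Collections.Generic;

class ConsumerSketch
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",
            GroupId = "flight-aggregator",
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("enriched-flights");

        // Running totals keyed by destination state.
        var totals = new Dictionary<string, (long Flights, long DelayMinutes)>();
        var sinceLastWrite = 0;

        while (true)
        {
            var result = consumer.Consume();
            // ...deserialize result.Message.Value and update totals...

            if (++sinceLastWrite >= 10_000)
            {
                // ...periodically write totals to SQL Server...
                sinceLastWrite = 0;
            }
        }
    }
}
```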

Consumer App

Demo Time

Consumer App

Takeaways

  • Memory usage is nominal because we aggregate into a dictionary, and this setup allows us to run the consumer in parallel.
  • In practice, you probably want to include timer-based as well as record-based writes to SQL.
  • Can read up to 41K records/sec
  • Try not to fly through Newark

Agenda

  1. Kafka Concepts
  2. Producer App
  3. Enricher App
  4. Consumer App
  5. Performance

Performance

Basic tips:

  • Maximize your network bandwidth! Your fibre channel will push a lot more messages than my travel router.
  • Compress your data. Compression works best with high-throughput scenarios, so test first.
  • Minimize message size. This reduces network cost.
  • Buffer messages in your code using tools like System.Collections.Concurrent.BlockingCollection
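The last tip might look like this: a bounded BlockingCollection decouples reading the file from producing to Kafka, so a slow broker backpressures the reader instead of exhausting memory. File path, topic, and capacity are illustrative:

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using Confluent.Kafka;

var config = new ProducerConfig { BootstrapServers = "localhost:9092" };
using var producer = new ProducerBuilder<Null, string>(config).Build();

// Bounded buffer: Add() blocks when it's full.
var buffer = new BlockingCollection<string>(boundedCapacity: 10_000);

// Reader task fills the buffer from disk.
var reader = Task.Run(() =>
{
    foreach (var line in File.ReadLines("2008.csv"))
        buffer.Add(line);
    buffer.CompleteAdding();
});

// Main thread drains the buffer into Kafka.
foreach (var line in buffer.GetConsumingEnumerable())
    producer.Produce("flights", new Message<Null, string> { Value = line });

reader.Wait();
```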

Performance

Throughput Versus Latency

Minimize latency when you want the most responsive consumers but don't need to maximize the number of messages flowing.

Performance

Throughput Versus Latency

Maximize throughput when you want to push as many messages as possible. This is better for bulk loading operations.

Performance

Throughput Versus Latency

Consumer config: fetch.wait.max.ms, fetch.min.bytes

Producer config: batch.num.messages, queue.buffering.max.ms
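In Confluent.Kafka these surface as strongly typed config properties; a throughput-leaning sketch, with illustrative rather than tuned values (note that `queue.buffering.max.ms` appears as `LingerMs`):

```csharp
using Confluent.Kafka;

// Consumer: wait longer so each fetch returns a bigger batch.
var consumerConfig = new ConsumerConfig
{
    FetchWaitMaxMs = 500,     // fetch.wait.max.ms
    FetchMinBytes = 65_536    // fetch.min.bytes
};

// Producer: batch more messages per request before sending.
var producerConfig = new ProducerConfig
{
    BatchNumMessages = 10_000,  // batch.num.messages
    LingerMs = 100              // queue.buffering.max.ms in librdkafka terms
};
```

Flip the values the other way (small fetch sizes, low linger) when minimizing latency matters more than throughput.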

Performance

More, More, More

Kafka is a horizontally distributed system, so when in doubt, add more:

  • More brokers will help accept messages from producers faster, especially if current brokers are experiencing high CPU or I/O.
  • More consumers in a group will process messages more quickly.
  • You must have at least as many partitions as consumers in a group! Otherwise, consumers may sit idle.

Wrapping Up

Apache Kafka is a powerful message broker. There is a small learning curve associated with Kafka, but this is a technology well worth learning.

To learn more, go here: http://CSmore.info/on/kafka

And for help, contact me: feasel@catallaxyservices.com | @feaselkl