The Curated Data Platform

Last Revision: March 2026

Kevin Feasel (@feaselkl)
http://CSmore.info/on/cdp

Who Am I? What Am I Doing Here?

Which data platform is right for me?

DB-Engines keeps track of over 400 different data platform technologies. These include relational databases, data warehouses, document databases, key-value stores, search engines, time-series databases, graph databases, and more.

Motivation

My goals in this talk:

  • Discuss when different data storage types make sense.
  • Provide a quick overview of each data storage technology, including use cases and key movers.
  • Cover relevant cloud options in AWS and Azure.

A Brief Warning

This talk covers data platform technologies broadly and does not spend much time comparing the merits of individual products against one another.

Oftentimes, "the platform you have" is a perfectly reasonable answer to "Which platform should I choose?" My goal for today is to help you understand how (and when!) to use these platforms.

A(nother) Brief Warning

I have specific biases. I've worked primarily in the Microsoft data platform space, so most of my personal experience is in that stack, as well as offerings in AWS and Azure. I have no GCP experience.

Aside from that, I have a bias for open-source technologies over commercial platforms.

I will try to make it clear in this talk when I'm being biased.

Agenda

  1. An Overview
  2. Relational Databases
  3. Data Warehousing (Classic and Modern)
  4. Document Databases
  5. Key-Value Stores
  6. Graph Databases
  7. Time Series Databases
  8. Vector Databases
  9. Log Storage

An Example System in Place

Thoughts on the System in Place

Many companies have one database platform, plus Excel (or Google Sheets, etc.). If that's good enough for your company, great! But imagine some sample complaints that you might hear in your own jobs.

  • Storing all of our financial data in Excel spreadsheets is clunky.
  • Product searches take too long on our website.
  • Customers experience slowness making orders.
  • No support for the data science team.
  • Log review is painful for IT.

Other data platform technologies may mitigate these pain points--while introducing new pain points along the way.

Regarding Multiple Systems

The Upshot

Like power tools, data platform technologies have their specific use cases. Some of them are more versatile than others, but if you pick up the wrong tool for the job, you may struggle to get it done.

Over the rest of this session, I'll help you understand how to select the right tools for the job.

Relational Databases

Key Requirements

  • Data MUST be correct. Eventual consistency and even a few missed records won't work for us.
  • We need to handle updates in real time, seeing the most recent information as soon as we save it.
  • Performance is less critical than correctness, but still an important factor.

Key Technologies

  • Relational database (OLTP -- Online Transaction Processing).

Why OLTP?

  • A non-distributed relational database fits because the data must be correct for everybody, and ACID compliance helps us considerably.
  • Performance will generally be good, though analysts far from the data center may need to deal with slower queries.
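To make the ACID point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any OLTP database. The account table, the transfer function, and the balances are hypothetical and exist only for illustration. The credit runs first on purpose, so that a failed debit shows the whole transaction rolling back.

```python
import sqlite3

def transfer(conn, from_id, to_id, amount):
    """Move money between two accounts atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well: both changes happen or neither does.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)         # succeeds
rejected = transfer(conn, 1, 2, 500)  # would overdraw account 1, so it fails
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the rejected transfer, account 2 does not keep the 500 it was briefly credited, which is the "correct for everybody" guarantee in miniature.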

Recent Developments: OLTP

  • PostgreSQL 18 (September 2025): Async I/O subsystem (up to 3× faster reads), virtual generated columns, native uuidv7(), OAuth 2.0.
  • MySQL 8.0 EOL: Support ends April 30, 2026 — upgrade to MySQL 8.4+ or 9.x.
  • SQL Server 2025: Native vector data type, DiskANN indexes, native JSON (up to 2 GB/row), built-in regex.
  • Azure HorizonDB (preview): New PostgreSQL-compatible managed service competing with AWS Aurora and Google AlloyDB.

Key Players: OLTP

Data Warehousing (Classic and Modern)

Key Requirements

  • Data MUST be correct. We need business users to be able to trust our data.
  • Non-IT staff should be able to access systems, ideally within Excel.
  • It's okay for some reports to update nightly rather than real-time.
  • Performance is less critical than correctness, but still an important factor.

Key Technologies

  • Relational database (OLAP -- Online Analytical Processing) for connectivity to Excel and reviewing results.

Why OLAP?

  • Specifically, the Kimball model for warehousing.
  • The data must be correct but does not need to be "real-time." We can use an ETL process to populate the warehouse.
  • We can distribute data marts geographically to meet the performance needs of analysts while maintaining a central data warehouse to store the full set of data.
  • Microsoft and other major players build their BI tools (Excel, Power Query, etc.) to work best with Kimball-style warehouses.
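As a rough illustration of a Kimball-style star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs the kind of aggregate query a BI tool would issue. All table names, column names, and rows are made up for the example; a real warehouse would be loaded by an ETL process rather than by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20260101, "2026-01-01", "2026-01"),
    (20260215, "2026-02-15", "2026-02"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Widget", "Hardware"),
    (2, "Gadget", "Hardware"),
    (3, "Manual", "Books"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20260101, 1, 2, 20.0),
    (20260101, 3, 1, 15.0),
    (20260215, 2, 1, 30.0),
    (20260215, 1, 1, 10.0),
])

# Slice the facts by attributes that live on the dimensions.
sales_by_month_category = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()
```

The shape is the point here: narrow fact rows, descriptive dimensions, and queries that group by dimension attributes.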

Key Players: OLAP

OLTP + OLAP

Relational databases can serve as either OLTP or OLAP--these are database designs rather than distinct technologies.

There are also technologies dedicated to extending beyond relational OLAP, such as SQL Server Analysis Services and Oracle Essbase.

Reference Architecture


Data Warehouse reference architecture, Wikimedia

Modern Data Warehousing

Apache Hadoop turned the data warehousing and analytics world upside-down. Although it is now a legacy platform, some of its progeny live on in the form of distributed storage in a data lake, the Apache Spark platform, and Apache Kafka.

The Data Lake

The Hadoop Distributed File System (HDFS) opened up the possibility of massive, distributed data storage. Cloud storage platforms like Amazon's S3 and Azure Blob Storage made this a practical reality. This data includes multi-structured and unstructured data, which typically would not fit well in a classic data warehouse.

The data lake provides a central location for historical storage of a broad array of company data for the purpose of data science and machine learning activities.

The Data Lakehouse

Databricks coined the term Lakehouse to represent the combination of data warehouse and data lake in one managed area.

Since then, we've seen platforms like Databricks, Snowflake, and Microsoft Fabric move quickly in this space.

  • Microsoft Fabric: 28,000+ organizations including 80% of Fortune 500 as of 2025.
  • Snowflake: $3.5B FY2025 revenue (30% growth).
  • Databricks: Entered DB-Engines Top 10.

Key Players: Modern DW

Reference Architecture


Modern Data Warehouse Architecture

Document Databases

Key Requirements

  • Performance is critical. If you work for a global company, you may need fast response times across the globe.
  • Consistency is not critical. Some kinds of product data can be out of date or show different results between regions for a minute or two.
  • For a system like a product catalog, we may still want a single source of truth for product data, including quantity on hand, price, etc.

Key Technologies

  • Document database for "republishing" OLTP data and maximizing performance.
  • (Optional) Relational database (OLTP) to act as the single source of truth.

What is a Document DB?

  • Key-value store
  • The value is a complex document, often JSON (or JSON-like)
  • The value may include nested objects: Product has Images, PriceChanges, and StoreAvailability as well as attributes like Price, Title, and Brand
  • Data retrieval is typically one record at a time, but allows for scans of data
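The bullets above can be sketched with a plain Python dictionary standing in for the document store. This is an illustration of the access pattern only, not any product's API: the put, get, and scan helpers, the key format, and the product values are all hypothetical. The nested fields follow the Product example above.

```python
import json

catalog = {}  # key -> serialized JSON document

def put(key, doc):
    """Store the whole document under one key."""
    catalog[key] = json.dumps(doc)

def get(key):
    """Fetch one document by key: the common, fast access path."""
    raw = catalog.get(key)
    return json.loads(raw) if raw is not None else None

def scan(predicate):
    """Inspect every document: possible, but far more expensive than get."""
    return [k for k, raw in catalog.items() if predicate(json.loads(raw))]

put("sku-1001", {
    "Title": "Trail Backpack",
    "Brand": "Contoso",
    "Price": 79.99,
    "Images": ["front.jpg", "side.jpg"],
    "PriceChanges": [{"Date": "2026-01-15", "Price": 89.99}],
    "StoreAvailability": {"Raleigh": 4, "Durham": 0},
})
put("sku-1002", {
    "Title": "Camp Stove",
    "Brand": "Fabrikam",
    "Price": 45.00,
    "Images": [],
    "PriceChanges": [],
    "StoreAvailability": {"Raleigh": 1},
})

product = get("sku-1001")
contoso_keys = scan(lambda doc: doc["Brand"] == "Contoso")
```

Everything a product page needs comes back from a single get, with no joins, which is where the read performance comes from.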

Recent Developments: Document DBs

  • MongoDB acquired Voyage AI (February 2025) to improve embedding models and reduce LLM hallucinations. MongoDB Atlas now represents 75% of MongoDB revenue (up from 66%).
  • Amazon DynamoDB market mindshare declined significantly year-over-year (10.6%, down from 18.6%) as customers consolidate onto lakehouse platforms.

Key Players: Document DBs

Reference Architecture


Cosmos DB Use Case: Retail and marketing

Key-Value Stores

Key Requirements

  • Performance is critical. Milliseconds are money.
  • Data is typically pretty stable, with occasional updates and many reads of a data point between updates.
  • Consistency is important, but occasionally reading stale data is okay.

Key Technologies

  • In-memory key-value caching for fast lookups.
  • Simple storage for static content.
  • Relational database (OLTP) to act as the single source of truth.
  • (Optional) Document database for "republishing" OLTP data and maximizing performance.
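One common way to combine a cache with a source of truth is the cache-aside pattern with a time-to-live. The sketch below uses an in-process dictionary in place of a real cache such as Redis or Valkey, and a plain dictionary in place of the OLTP database. The TTLCache class, the 60-second TTL, and the prices are assumptions made for the example.

```python
import time

class TTLCache:
    """Tiny in-memory cache-aside helper with a per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key, load):
        entry = self._items.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit, possibly stale
        value = load(key)          # miss or expired: go to the source of truth
        self._items[key] = (value, now + self.ttl)
        return value

source_of_truth = {"sku-1001": 19.99}  # stand-in for the OLTP database
loads = []                             # records every trip to the database

def load_price(key):
    loads.append(key)
    return source_of_truth[key]

fake_now = [0.0]  # controllable clock so the example runs instantly
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("sku-1001", load_price)  # miss: loads from the database
source_of_truth["sku-1001"] = 17.99        # price changes at the source
stale = cache.get("sku-1001", load_price)  # hit: still the old price
fake_now[0] = 61.0                         # the TTL elapses
fresh = cache.get("sku-1001", load_price)  # expired: reloads the new price
```

The stale read in the middle is the trade-off named in the requirements: the TTL bounds how long out-of-date data can be served.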

A Note on Redis Licensing

In March 2024, Redis switched from the BSD license to a source-available dual license (RSALv2/SSPLv1), preventing cloud providers from offering it as a managed service without approval.

This triggered the creation of Valkey, a fully open-source (BSD-licensed) Redis fork that organizations like AWS, Google Cloud, Oracle, Ericsson, and Snap support. Valkey is now the de facto open-source alternative.

Redis added AGPLv3 as a third option in May 2025, but licensing uncertainty in the community remains. Consider Valkey for new open-source deployments.

Key Players: Key-Value Caches

Reference Architecture

Scalable web application

Graph Databases

Graph databases have a niche in the analytics space. They combine nodes (which represent entities) and edges (which represent connections between entities).

Key Features of Graph Databases

  • Path calculation (especially with weights, such as distance between cities)
  • Fraud detection via link analysis: observe the links between known fraudulent entities and non-marked entities.
  • Modeling fluid relationships between entities.
  • Laying out network maps and other complex topologies.
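As an illustration of weighted path calculation, here is a small sketch of Dijkstra's algorithm over an undirected graph of cities. This is a generic textbook approach, not how any particular graph database implements pathfinding, and the distances are invented for the example.

```python
import heapq

def shortest_path(edges, start, goal):
    """Return (total_weight, path) between two nodes, or (None, [])."""
    graph = {}
    for a, b, weight in edges:  # treat every edge as two-way
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))

    best = {start: 0}     # cheapest known distance to each node
    previous = {}         # node -> node we arrived from
    queue = [(0, start)]  # min-heap ordered by distance
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue      # outdated queue entry
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))

    if goal not in best:
        return None, []
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return best[goal], path[::-1]

# Hypothetical distances between cities, for illustration only.
roads = [
    ("Raleigh", "Durham", 25),
    ("Durham", "Greensboro", 55),
    ("Raleigh", "Greensboro", 90),
    ("Greensboro", "Charlotte", 95),
    ("Raleigh", "Charlotte", 190),
]
distance, route = shortest_path(roads, "Raleigh", "Charlotte")
```

With these numbers the three-hop route through Durham and Greensboro beats the direct edge, which is exactly the kind of answer weighted path queries exist to find.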

The Problem with Graph Databases

The biggest problem with graph databases is that you can do the same things with relational databases, but with only one concept (the relation) versus two (nodes and edges).
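To show what "the same things with relational databases" can look like, the sketch below stores edges in a single relation and walks them with a recursive common table expression in SQLite. The link table, the entity names, and the five-hop limit are hypothetical choices for the example; the scenario loosely mirrors link analysis from a known fraudulent account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (src TEXT, dst TEXT)")  # one relation = edges
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("acct-A", "card-1"),
    ("card-1", "acct-B"),
    ("acct-B", "phone-9"),
    ("acct-C", "card-2"),  # not connected to acct-A
])

# Start from one entity and repeatedly follow outgoing links.
# The hop limit keeps the recursion finite even if the data has cycles.
connected = conn.execute("""
    WITH RECURSIVE reachable(entity, hops) AS (
        SELECT ?, 0
        UNION
        SELECT l.dst, r.hops + 1
        FROM reachable r
        JOIN link l ON l.src = r.entity
        WHERE r.hops < 5
    )
    SELECT entity, MIN(hops) AS hops
    FROM reachable
    GROUP BY entity
    ORDER BY hops, entity
""", ("acct-A",)).fetchall()
```

It works, though the query is noticeably more verbose than a typical graph traversal, and that verbosity is part of the trade-off.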

The second-biggest problem with graph databases is that there is no common graph language like SQL or common implementation specs between products.

Key Players: Graph Databases

Accept No Substitutes

If you do go with a graph database, Neo4j is the only one I can heartily recommend. Most other players tend to fade away with little market share.

Some products (e.g., SQL Server) offer limited graph capabilities, but they are not as mature or feature-rich as dedicated graph databases.

AWS Neptune is fine. Avoid Gremlin for Cosmos DB. Graph in Fabric is in early preview and not worth considering for production use at this time.

Time Series Databases

Key Requirements

  • Data has a time element and we care about analyzing relevant data over time.
  • Data ingestion rates are very high, perhaps as high as millions of data points per second.
  • Most reports and dashboards need to aggregate and downsample data, showing trends over time periods (e.g., hourly, daily, weekly, monthly).

Key Features of Time Series Databases

  • Time is a first-class citizen--we index based on timestamp.
  • Specialized compression algorithms compact data points more efficiently than generic databases.
  • Automated retention policies delete older data.
  • Automated downsampling rolls up data points to reduce disk space requirements.
  • Query syntax enhancements focus on time series questions such as moving average, rate calculations, anomaly detection, and interval calculation.
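Three of the features above (downsampling, moving averages, and retention) can be sketched in plain Python. Time series databases do this work natively and far more efficiently; the function names, the 20-minute sample interval, and the values below are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def downsample_hourly(points):
    """Average (timestamp, value) points into one point per hour."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return [(hour, sum(vals) / len(vals))
            for hour, vals in sorted(buckets.items())]

def moving_average(values, window):
    """Simple moving average over a fixed-size window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def apply_retention(points, now, keep):
    """Drop points older than the retention period."""
    cutoff = now - keep
    return [(ts, value) for ts, value in points if ts >= cutoff]

start = datetime(2026, 3, 1, 0, 0, tzinfo=timezone.utc)
# One reading every 20 minutes for three hours: 10.0, 20.0, ..., 90.0
points = [(start + timedelta(minutes=20 * i), 10.0 * (i + 1))
          for i in range(9)]

hourly = downsample_hourly(points)
hourly_values = [value for _, value in hourly]
smoothed = moving_average(hourly_values, window=2)
recent = apply_retention(points, now=start + timedelta(hours=3),
                         keep=timedelta(hours=1))
```

Nine raw points collapse to three hourly averages, and the retention pass keeps only the last hour of raw data: the same disk-space trade the bullets describe.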

Recent Developments: Time Series

  • InfluxDB 3.0 (GA: April 2025): Complete rewrite in Rust, with a foundation of Apache Arrow, DataFusion, and Parquet. Delivers 10–20× compression vs. 2–3× in InfluxDB 2.x. Adds native SQL support alongside InfluxQL, a fundamental architectural change.
  • TimescaleDB 2.25 (January 2026): Full PostgreSQL 18 support, columnar index scan for accelerated aggregates, UUIDv7 compression (30% storage savings, 2× faster queries).

Key Players: Time Series DBs

Vector Databases

Key Requirements

  • We want to send our data to a large language model for further analysis.
  • We want to perform semantic search on our data.
  • Our service needs to determine which images or videos are visually similar, even if they don't have identical metadata.

Key Features of Vector Databases

  • Converts data into high-dimensional vectors, capturing semantic meaning.
  • Search is typically of the "Approximate Nearest Neighbor" variety, finding the most similar vectors even if they aren't perfect matches.
  • Specialized indexing works to accelerate similarity search.
  • Can include traditional metadata filtering to assist with comparison.

Vectors and Embeddings

Embeddings are a way of representing data in a high-dimensional space, where similar items are closer together. This is useful for tasks like semantic search, recommendation systems, and image recognition.

Example:

Vector Similarity
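A minimal sketch of similarity search, assuming tiny hand-made three-dimensional vectors in place of real embeddings (which come from a model and typically have hundreds or thousands of dimensions). It does an exact brute-force comparison using cosine similarity; vector databases use approximate nearest neighbor indexes to avoid scanning every vector. The item names, vectors, and categories are invented.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items, k=2, category=None):
    """Return ids of the k most similar items, optionally filtered."""
    candidates = [item for item in items
                  if category is None or item["category"] == category]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query, item["vector"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:k]]

# Hypothetical embeddings: similar concepts point in similar directions.
items = [
    {"id": "cat",    "vector": [0.9, 0.1, 0.0], "category": "animal"},
    {"id": "kitten", "vector": [0.8, 0.2, 0.1], "category": "animal"},
    {"id": "truck",  "vector": [0.0, 0.1, 0.9], "category": "vehicle"},
    {"id": "car",    "vector": [0.1, 0.0, 0.8], "category": "vehicle"},
]

query = [1.0, 0.1, 0.0]  # pretend this is the embedding of "feline"
top_two = nearest(query, items, k=2)
top_vehicle = nearest(query, items, k=1, category="vehicle")
```

The category argument plays the role of the metadata filtering mentioned earlier: it narrows the candidates before the similarity ranking.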

The Shifting Vector DB Landscape

Traditional relational databases are rapidly integrating native vector capabilities, narrowing the gap with purpose-built vector databases:

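  • PostgreSQL + pgvector: The pgvector extension adds a vector data type with SQL distance operators and HNSW and IVFFlat indexes. Some published benchmarks show it competitive with purpose-built vector databases like Qdrant and Pinecone, though results depend heavily on workload.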
  • SQL Server 2025: Native vector type + DiskANN indexes.
  • Oracle AI Database 26ai: Native vector search and autonomous AI lakehouse.

Before adopting a purpose-built vector database, evaluate whether your existing relational database can meet your scale requirements.

Key Players: Vector DBs

Log Storage

Key Requirements

  • Need a central source for logging across multiple services.
  • Sometimes logs will follow a specific format, but no guarantee all records have the same shape.
  • Queries are often "What happened at this time?" or "What errors do we see?"
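The requirements above can be sketched with JSON log lines from several services, where each record has a few shared fields and some of its own. This is a plain-Python illustration of the two query shapes, not how Elasticsearch or any logging service works internally; the field names and log lines are hypothetical.

```python
import json
from datetime import datetime

# Records share ts/service/level/msg, but extra fields vary by service.
raw_lines = [
    '{"ts": "2026-03-01T10:00:00", "service": "web", "level": "INFO", '
    '"msg": "GET /products"}',
    '{"ts": "2026-03-01T10:00:05", "service": "orders", "level": "ERROR", '
    '"msg": "payment timeout", "order_id": 42}',
    '{"ts": "2026-03-01T10:07:00", "service": "web", "level": "ERROR", '
    '"msg": "HTTP 500", "path": "/cart"}',
]
events = [json.loads(line) for line in raw_lines]

def between(events, start, end):
    """What happened at this time? (start inclusive, end exclusive)"""
    return [e for e in events
            if start <= datetime.fromisoformat(e["ts"]) < end]

def errors(events):
    """What errors do we see? Tolerates records with no level field."""
    return [e for e in events if e.get("level") == "ERROR"]

first_minute = between(
    events,
    datetime(2026, 3, 1, 10, 0, 0),
    datetime(2026, 3, 1, 10, 1, 0),
)
error_services = [e["service"] for e in errors(events)]
```

Because the records are schemaless, the queries rely only on the handful of fields every service agrees to send, which is usually the first convention worth settling when centralizing logs.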

Key Technologies

  • The ELK Stack as a pattern
    • Log storage: Elasticsearch
    • Log shipping and event handling: Logstash
    • Log querying and visualization: Kibana
  • Standalone logging services

Roll Your Own or Purchase?

There are full-service logging solutions, such as Splunk (acquired by Cisco for ~$28B in March 2024), Datadog, Loggly, and SumoLogic. These products perform quite well and tend to be accessible for developers and administrators. The downside is that they tend to be quite expensive.

On the other side, open-source products can be quite powerful when used correctly, but the learning curve tends to be much steeper.

OpenTelemetry is rapidly becoming the standard for telemetry collection, with Database Semantic Conventions now stable and CNCF Graduated status expected in 2026. The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) now provides full observability in a single Docker container.

Key Players: Logging

Reference Architecture

The Complete Guide to the ELK Stack

Wrapping Up

This has been a look at the data platform space as it stands. This is a fast-changing field with interesting competitors entering and leaving the market regularly.

To learn more, go here:
https://CSmore.info/on/cdp


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact