Catallaxy Services | Big Data, Small Data, and Everything In Between

Who Am I? What Am I Doing Here?

	Catallaxy Services	@feaselkl
	Curated SQL
	We Speak Linux

What Are We Talking About Here?

As data size expands, numerous products have entered the data storage market to solve particular pain points. This talk will cover, at a high level, many of the data storage technologies currently available on the market.

Motivation

The expansion of data sets and the increased expectations of businesses for analysis and modeling of data have led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use--before somebody else puts them to sub-optimal use.

Definitions: Big Data

When you have too much data to fit into Excel.

Definitions: Big Data

Big Data is built around four major dimensions:

Volume - sheer quantity of data
Variety - data in different formats, different media types, and different structures interacting
Velocity - rate at which data points are collected
Veracity - accuracy of data

Definitions: Small Data

Data sets small enough for human comprehension.
Data sets small enough to fit into a single machine's memory (e.g., R or Redis data)

Definitions: Medium Data

Data sets too large to fit on a single machine but not large enough to require a massive cluster.

SparkR (R but able to use a Spark cluster's memory) is a good example of a product which thrives in the Medium Data space.

Architecture Overview

Stress Points:

Response time (especially global)
Ease of use for reporting and analysis
Cost of scaling up hardware
Semi-structured or unstructured data
Extreme write load

Architecture Overview

For each technology, we will:

Give a quick explanation of the technology
Give a quick overview of popular products in the field
Discuss the pros and cons of this technology
Describe some of the best uses for this technology

Technologies

Relational Database
Multidimensional Database
Hadoop Cluster
Columnstore Database
In-Memory Cache
Key-Value Database
Document Database
Graph Database
Full-Text Search Engine
Message Queue System
Stream Processing System
World of Azure
Consumers

Relational Database

Quick Explanation

Relational databases are built on set theory, the branch of mathematics that deals with collections of things.
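
To make the set theory connection concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (customers, suppliers), the sample rows, and the helper set_operations are all hypothetical and exist only for illustration. It shows SQL's UNION, INTERSECT, and EXCEPT operators behaving like the set operations of the same names.

```python
import sqlite3


def set_operations():
    """Run union, intersection, and difference over two tiny tables."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.executescript(
        """
        CREATE TABLE customers (name TEXT PRIMARY KEY);
        CREATE TABLE suppliers (name TEXT PRIMARY KEY);
        INSERT INTO customers VALUES ('Acme'), ('Globex'), ('Initech');
        INSERT INTO suppliers VALUES ('Globex'), ('Initech'), ('Umbrella');
        """
    )
    queries = {
        # Everyone who appears in either table
        "union": "SELECT name FROM customers UNION SELECT name FROM suppliers ORDER BY name",
        # Only those who appear in both tables
        "intersect": "SELECT name FROM customers INTERSECT SELECT name FROM suppliers ORDER BY name",
        # Customers who are not also suppliers
        "except": "SELECT name FROM customers EXCEPT SELECT name FROM suppliers ORDER BY name",
    }
    results = {
        label: [row[0] for row in conn.execute(sql)]
        for label, sql in queries.items()
    }
    conn.close()
    return results


if __name__ == "__main__":
    for label, names in set_operations().items():
        print(label, names)
```

Each table acts as a set of rows, and each query returns a new set, which is the idea that joins, filters, and aggregates build on.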

Relational Database

Key Players

Commercial

Open Source

Relational Database

Product Advantages

ACID compliant
Pessimistic or optimistic concurrency available
Fully-featured DSLs (T-SQL, PL/SQL, etc.)
Excellent tooling
Great community support
Huge institutional acceptance
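
As a rough illustration of what "ACID compliant" buys you, the sketch below uses Python's built-in sqlite3 module to move money between two accounts. The accounts table, the names alice and bob, and the helpers make_bank, transfer, and balances are hypothetical. It demonstrates only atomicity and consistency: if any statement in the transfer violates a constraint, the whole transaction rolls back.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two sample accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", 100), ("bob", 50)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; return True on commit, False on rollback."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            # Credit first so a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    bank = make_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))
    print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The second transfer fails on the debit, and the credit that already ran is rolled back with it, so the data never shows a half-finished transfer.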

Relational Database

Product Drawbacks

S-shaped learning curve
Scaling model is typically "Up" rather than "Out"
Commercial products are expensive

Relational Database

Best Uses

Backbone of a business application
Financial applications which require consistency
"You're fired if this is wrong" data
