Abstract

On a given project, data scientists can spend upwards of 80% of their time preparing, cleaning, and correcting data. In this session, we will look at different data cleansing and preparation techniques using both SQL Server and R. We will investigate the concept of tidy data and see how we can use tools in both languages to simplify research and analysis of a small but realistic data set.


Slides

The slides are available in HTML5 format. All modern browsers, including those on tablets and phones, should be able to navigate the slides successfully.

The slides are licensed under Creative Commons Attribution-ShareAlike.


Demo Code

The demonstration code is available on my GitHub repository. It includes all of the SQL and R code, the data sources used in the demos, and a notebook for tidyr.

The source code is licensed under the terms offered by the GPL.


Links And Further Information

Data Cleansing With SQL Server

Data Quality Services

Although I do not go into Data Quality Services in my talk, I consider it an important next step for promoting higher-quality data for analysis.

Data Cleansing With R

Papers