Catallaxy Services | Data Cleansing with SQL and R

ABSTRACT

On a given project, data scientists can spend upwards of 80% of their time preparing, cleaning, and correcting data. In this session, we will look at different data cleansing and preparation techniques using both SQL Server and R. We will investigate the concept of tidy data and see how we can use tools in both languages to simplify research and analysis of a small but realistic data set.

ADDITIONAL MEDIA

On August 16, 2017, I gave a version of this talk at NDC Sydney. The recording is available on the NDC YouTube channel.

SLIDES

Click here to access the slides for this presentation.

The slides are licensed under Creative Commons Attribution-ShareAlike.

DEMO CODE

Click here to access demo code for this presentation. This includes all of the SQL and R code, as well as the data sources used in the demos. It also includes a notebook for tidyr.

The source code is licensed under the terms offered by the GPL.

LINKS & FURTHER INFO

Data Cleansing With SQL Server

Ginger Grant warns against using ISDATE.
Aaron Bertrand contrasts ISNULL versus COALESCE.
Biz Nigatu compares TRY_CAST, TRY_CONVERT, and TRY_PARSE, all of which became available with SQL Server 2012.
Jeremy Kadlec gives his thoughts on data cleansing techniques.
Dinesh Asanka shows ways to perform data cleansing in SQL Server Integration Services.
This video is the single best explanation I have found of Boyce-Codd Normal Form, and it includes a practical method for determining how to normalize a set of attributes. If you aren't familiar with functional dependencies, check out the previous video as well. I have my own take on this method in blog form.
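The TRY_CAST, TRY_CONVERT, and TRY_PARSE functions mentioned above return NULL instead of raising an error when a conversion fails, and COALESCE returns its first non-NULL argument. As a rough illustration of how those two ideas combine when cleaning a column, here is a minimal sketch in Python rather than T-SQL. The helper names (try_cast, coalesce, parse_iso_date) and the sample rows are hypothetical and are not taken from the talk's demo code.

```python
from datetime import date, datetime
from typing import Any, Callable, Optional


def try_cast(value: Any, converter: Callable[[Any], Any]) -> Optional[Any]:
    """Return converter(value), or None when the conversion fails.

    Loosely mirrors the behavior of TRY_CAST, which yields NULL on failure.
    """
    if value is None:
        return None
    try:
        return converter(value)
    except (ValueError, TypeError):
        return None


def coalesce(*values: Any) -> Optional[Any]:
    """Return the first argument that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def parse_iso_date(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError for anything else."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Made-up rows with the kinds of problems a raw extract often contains.
raw_rows = [
    {"id": "1", "amount": "19.99", "sold_on": "2017-08-16"},
    {"id": "2", "amount": "N/A", "sold_on": "16/08/2017"},
    {"id": "3", "amount": None, "sold_on": "2017-02-30"},
]

cleaned = [
    {
        "id": try_cast(row["id"], int),
        # Defaulting to 0.0 is only for illustration; whether a default
        # is appropriate depends on the analysis.
        "amount": coalesce(try_cast(row["amount"], float), 0.0),
        "sold_on": try_cast(row["sold_on"], parse_iso_date),
    }
    for row in raw_rows
]

for row in cleaned:
    print(row)
```

Bad values become None rather than stopping the whole load, which makes it possible to count and inspect the failures afterward instead of fixing them one error at a time.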

Data Quality Services

Although I do not go into Data Quality Services in my talk, I consider it an important next step for promoting higher-quality data for analysis.

Steve Simon introduces Data Quality Services.
This MSDN article goes into detail on Data Quality Services.
Feodor Georgiev also looks at DQS.

Data Cleansing With R

Hadley Wickham has a great vignette on what tidy data means.
Hadley Wickham also introduces the tidyverse, a set of helpful packages built around the concept of tidying data.
Gerald Belton shows how to use dplyr.
Stian Haklev gives some advice for people relatively new to data wrangling with R.
Manish at Analytics Vidhya shows some helpful tools for data cleansing in R.
Removing outliers and duplicates using R.
Mark van der Loo has slides from an advanced data cleansing talk.
Ryan Wade has a more advanced example of data cleansing, turning an ugly report into a usable data set.
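For readers who have not seen tidy data before, the core idea in the links above is that each variable gets its own column and each observation gets its own row. A common fix is reshaping a "wide" table, where values of a variable (such as the year) are stored as column headers, into a "long" one. In R this is what tidyr's gather and pivot_longer functions do. The sketch below shows the same reshaping in plain Python purely for illustration; the to_long helper, its parameter names, and the sample data are all hypothetical.

```python
from typing import Any, Dict, List


def to_long(
    rows: List[Dict[str, Any]],
    id_columns: List[str],
    names_to: str,
    values_to: str,
) -> List[Dict[str, Any]]:
    """Reshape wide rows into long rows.

    Every column not listed in id_columns becomes one output row, with the
    old column name stored under names_to and its value under values_to.
    """
    long_rows = []
    for row in rows:
        ids = {column: row[column] for column in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            long_rows.append({**ids, names_to: column, values_to: value})
    return long_rows


# Made-up wide data: the year is a variable, but it is hiding in the headers.
wide = [
    {"site": "A", "2015": 10, "2016": 12},
    {"site": "B", "2015": 7, "2016": 9},
]

tidy = to_long(wide, id_columns=["site"], names_to="year", values_to="visits")

for row in tidy:
    print(row)
```

Once the data is in this shape, filtering, grouping, and plotting by year no longer require knowing the column names in advance.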

Papers

Edwin de Jonge and Mark van der Loo have written an introduction to data cleaning with R.
Erhard Rahm and Hong Hai Do have an academic paper categorizing data cleansing problem subtypes.
Joseph Hellerstein thinks about data cleaning in large databases using univariate and multivariate analysis.