On a given project, data scientists can spend upwards of 80% of their time preparing, cleaning, and correcting data. In this session, we will look at different data cleansing and preparation techniques using both SQL Server and R. We will investigate the concept of tidy data and see how we can use tools in both languages to simplify research and analysis of a small but realistic data set.
The slides are available in HTML5 format. All modern browsers, including those on tablets and phones, should be able to navigate the slides successfully.
The slides are licensed under Creative Commons Attribution-ShareAlike.
The demonstration code is available on my GitHub repository. The repository includes all of the SQL and R code, the data sources used in the demos, and a notebook for tidyr.
The source code is licensed under the terms offered by the GPL.
Although I do not go into Data Quality Services in my talk, I consider it an important next step for promoting higher-quality data for analysis.
Catallaxy Services © 2017.