Launching a Data Science Project: Cleaning is Half the Battle

ABSTRACT

There's an old adage in software development: Garbage In, Garbage Out. This adage certainly applies to data science projects: if you simply throw raw data at models, you will end up with garbage results. In this session, we will build an understanding of just what it takes to implement a data science project whose results are not garbage. We will use the Microsoft Team Data Science Process as our model for project implementation, learning what each step of the process entails. To motivate this walkthrough, we will see what we can learn from a survey of data professionals' salaries.

ADDITIONAL MEDIA

I presented a version of this talk for DataPlatformGeeks. You can watch the recording on their YouTube channel.

DEMO CODE

Click here to access demo code for this presentation. This includes a Jupyter notebook which walks through our example.

The source code is licensed under the terms offered by the GPL.