Nevertheless, our job is to figure something out.
To deal with these considerations, we build and implement processes. These processes give us structure and help us navigate some of the difficulties in managing data.
In today's talk, we will look at one particular process: the Microsoft Team Data Science Process.
We will use the Team Data Science Process to uncover interesting information from the 2021 Data Professionals Salary Survey.
At each step of the way, we will walk through implementation details and achieve a better understanding of data professional salaries.
Somebody higher up (typically on the business side of the house) wants a data science project done. Your mission is to figure out as much about that person's vision as possible.
Your domain knowledge will be critical to success, but you should still find the champion(s) of your project and understand best their vision of success.
When interviewing your project champion or people on the business side, listen for the following types of questions:
These questions help you determine which algorithms to use.
Your goal is to nail down a specific problem, narrowing the scope to something achievable.
Once you have an interesting question, start looking for data. Some of this data will be in-house and could be in databases, Excel, flat files, accessible through APIs, or even in reports.
Your champion will hopefully be able to point you in the right direction, but this part of the process requires legwork.
Once you have a compendium of data sources, you will want to build a data dictionary, which helps you explain what different pieces of data actually mean. A data dictionary typically tells you:
Another thing to determine is, where does your end data need to be? Will there be a different engineering team expecting to call a microservice API? Will you get a set of files each day and dump results into a warehouse?
This is generally a technical problem that an Engineering group will architect, although your champion might have insight here depending upon how the business side will need to use your results.
We work for Data Platform Specialists, a company dedicated to providing DBAs and other data platform professionals with valuable market knowledge. We have come into possession of a survey of data professionals and want to build insights that we can share with our client base.
Questions we can ask given our domain knowledge:
Narrowing this down with our champion and other stakeholders, we can get to the following question which we will endeavour to answer:
How much money should we expect a data professional will make?
As we review the survey data, we can start to see different shapes of data and begin to build a data dictionary.
For example, TelecommuteDaysPerWeek has six options: less than 1, 1, 2, 3, 4, 5 or more. These are ordered options, meaning the numbers inherently matter.
By contrast, number of hours worked per week is an integer ranging from 5 to 200 (?!).
Our company wants to build a small website and allow people who have built profiles to get an estimate of how much they could be making in different roles. Our job is to build a microservice API which returns a dollar amount based on inputs.
Data processing is made up of a few different activities:
Data gathering will likely be an iterative process; as you flesh out your models and gain a better understanding of the problem, you will likely search several times for data from sources like:
In this example, we will stick to just the data professional survey. But if you want to take this further, a few additional data sources could be:
Data cleansing is half the battle. That's the title, after all!
Most of your time, you'll be a data plumber.
General estimates that you will hear from data scientists is that they spend approximately 80% of their time cleaning data. If anything, this is an underestimation--based on my experiences, that number might be closer to 90%.
Simply getting the data is a start, but there's a long journey ahead.
After grabbing relevant-looking data sets, you will want to join them together to gain insight from the mashup of different data sets. Common join activities include:
You will quickly find problems with your data sets, including (but not limited to):
Mislabeled data: when the label is incorrect. Ex: on the data professional survey, a person who earns 50K per year accidentally typing 500K per year. If you can fix the label, this data becomes useful. If you cannot, it may throw off your analysis.
Mismatched data: data joined together which should not have been.
Incorrect data: when data other than the label is incorrect. Ex: a person works 200 hours per week?
People don't always fill out the entirety of every form. When we're missing important data, we have a few options available to us:
None of these options is perfect, but the last three can help salvage incomplete records.
If your data source does not have duplicate protection in place, you might end up with multiple entries representing the same thing. Sometimes it's easy to catch those: you might be able to use a DISTINCT clause to remove duplicates. Other times, you will have to dig further.
Independent systems may end up with inconsistent data due to reasons like typos, transcription errors, etc. Sometimes subtle differences in data sets can lead to differing results. Potential solutions include:
Data stored in flat files or textual format can end up misshapen--some rows may not have enough delimiters (or maybe too many), there could be newlines in the middle of a record, or the file cuts off in the middle of a record.
This is a problem with flat files and certain semi-structured data formats. It is not a problem with relational databases, where data shape is enforced.
There are several techniques we can use to reshape data to make it easier to analyze:
The most common technique for data analysis at this point is Exploratory Data Analysis (EDA). Examples of EDA techniques include:
The cardinality of a feature is the number of unique values.
The five-number summary of a feature tells us the minimum value, value at the 25% mark, value at the 50% mark, value at the 75% mark, and maximum value. The mean technically is not part of the five-number summary, but can be useful.
Box plots show us the five-number summary by group.
Histograms show us the spread of a single variable.
Here we have two comparisons, depth vs table and x vs y. Depth and table are mildly negatively correlated; this isn't a problem. The x and y variables, however, are very highly correlated. Some algorithms may exhibit problems if you include both x and y in your analysis.
Modeling has five major steps:
Feature engineering involves creating relevant features from raw data. Examples include:
We use feature selection to winnow down the available set of features. There are several reasons to do this:
We take some percentage of our total data and designate it for training / validation, and the remainder is for evaluation. There are no hard rules on percentages, but typically, we reserve 70-80% for training.
There are four major branches of algorithms:
Supervised learning models require known answers (labels). We train a model to map input data to those labels in order to have the model predict the correct answer for unlabeled records. Major classes of supervised learning models and their pertinent driving questions include:
With unsupervised learning, we do not know the answers beforehand and try to derive answers. We can use unsupervised learning to drive toward a supervised problem by giving data analysts insight into the nature of the problem. Major classes of unsupervised learning models and their pertinent driving questions include:
This is a subset of supervised learning, but with the popularity of neural networks, has come into its own. We use heuristics to guesstimate labels and train the model that way. An example of self-supervised learning is predicting the next word in a document based on previous words.
Self-supervised learning typically happens with neural networks.
Reinforcement learning is where we train an agent to observe its environment and use those environmental clues to make a decision.
Once you understand the nature of the problem, you can choose among viable algorithms based on the following major trade-offs:
Once you have an algorithm, features, and labels (if supervised), you can train the algorithm. Training a model is solving a system of equations, minimizing a loss function.
Instead of using up all of our data for training, we typically want to perform some level of validation within our training data set to ensure that we are on the right track and are not overfitting.
Overfitting happens when a model latches on to the particulars of a data set, leaving it unable to generalize to new data. To test for overfitting, test your model against unseen data. If there is a big dropoff in model accuracy between training and testing data, you are likely overfitting.
Cross-validation is a technique where we slice and dice the training data, training our model with different subsets of the total data. The purpose here is to find a model which is fairly robust to the particulars of a subset of training data, thereby reducing the risk of overfitting.
Most models have hyperparameters. For neural networks, the number of training epochs is a hyperparameter. For random forests, hyperparameters include things like the size of each decision tree and the number of trees.
We tune hyperparameters using our validation data set.
Model evaluation happens when we send new data to the model that it has not seen during the training and validation process. We have to be careful not to let any information leak into the training data, meaning that we should never feed aggregates of training + evaluation data to a model.
If we fail to safeguard this data, we can end up overfitting our model to the test data, leaving it less suited for the real-world data outside of our sample.
You can also build a fitness function to evaluate certain types. Genetic algorithms are a common tool for this.
Back in the day, one team would build a solution in an analytics language (e.g., R) but you would not go to production with that. Instead, an implementation team would rewrite your model in C++ or some other fast language. Those days of research versus implementation teams using completely different languages are now (mostly) gone.
Welcome to the era of the microservice: many small services dedicated to providing a single answer to a single problem. Most microservices are exposed via web calls, although other forms of interoperation are still possible. The big benefit to web calls is that I can write my service in R, you can call it in Python, and then someone can call your service from .NET.
Once you have a model ready to go, there are tools which make it relatively easy to deploy scalable predictive services. For example, DeployR:
You can also build your own services. Stacks that I've put into production include:
With a microservices architecture, you're trying to plug in these new APIs while not forcing everybody else to change their skills.
SQL Server Machine Learning Services, in particular, is great when the input data is already stored in SQL Server. With certain types of models, you can make "real-time" predictions. Our scenario was complex enough that we pre-trained a large number of models and stored the results in SQL Server for later prediction.
Another option is notebooks, which help you record your work for subsequent review. Jupyter (Julia + Python + R) and Apache Zeppelin are two great examples of notebooks.
After go-live, the job is not complete.
It is important to keep checking the efficacy of models. Model shift happens, where a model might have been good at one point in time but becomes progressively worse over time as circumstances change.
You may also find out that your training/testing data was not truly indicative of real-world data.
Occasionally, you will want to take new production data and retrain the model, so keep track of your model's predictions and actual outcomes so you can determine the model's efficacy.
Depending upon your choice of algorithm, you might be able to update the existing model with the latest information, feeding those efficacy results back into the model.
Some algorithms, however, require you to retrain from scratch.
Regardless of model efficacy, you will want to confer with those stakeholders and ensure that your model fits their needs. Then, repeat the process.
In a production scenario, you will start with a Minimum Viable Product (MVP) to gauge interest. From there, you can expand the base of predictions, increase accuracy, make the service faster, or add new functionality.
A data plumber's work is never done.
Over the course of today's talk, we have covered the Microsoft Team Data Science Process as well as a pragmatic implementation of a data science workflow. Following a stable process allows you to maximize the chances of developing a high-quality and effective model.
To learn more, go here:
And for help, contact me:
firstname.lastname@example.org | @feaselkl
Catallaxy Services consulting: