Data Science: The Process

Jul 7

Data science is a process. I found this illustration that lays out the steps and shows which roles are typically involved in each. It is very useful when trying to grok the big picture of data science and how the steps fit together.

Collection

No surprise here: you need data to do data science. How do we go about collecting our data? There are lots of different ways to capture data, including:

Querying an SQL database
Pulling from an Excel file
Scraping a website
Etc.
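
To make the first two options concrete, here is a minimal sketch using only Python's standard library. The `sales` table, its columns, and the CSV text are all made up for illustration; an in-memory SQLite database stands in for a real SQL server, and a CSV string stands in for a sheet exported from Excel.

```python
import csv
import io
import sqlite3

# Hypothetical data: an in-memory SQLite table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("east", 95.25)],
)
# Query the table the same way you would query a production database.
sql_rows = conn.execute(
    "SELECT region, amount FROM sales ORDER BY region"
).fetchall()
conn.close()

# Hypothetical CSV text standing in for a file exported from Excel.
csv_text = "region,amount\nwest,60.0\nnorth,110.0\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

print(sql_rows)
print(csv_rows)
```

Note that the CSV values come back as text, which is exactly the kind of thing the next step has to deal with.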

Cleaning (AKA scrubbing, munging, etc.)

Now that you have some data to play with, you need to make sure you can use it to answer your question. For example, do you have any duplicated rows? What about missing data points? Are your numerical fields stored as numbers or text?
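
Those three checks can be sketched in a few lines of plain Python. The rows and the `clean` helper below are hypothetical, and dropping rows with a missing value is only one possible policy (filling in a default is another).

```python
# Hypothetical raw rows, e.g. as read from a CSV file (every value is text).
raw_rows = [
    {"region": "north", "amount": "120.0"},
    {"region": "north", "amount": "120.0"},  # exact duplicate
    {"region": "south", "amount": ""},       # missing value
    {"region": "east", "amount": "95.25"},
]


def clean(rows):
    """Drop duplicate rows and rows with no amount; convert amount to float."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["region"], row["amount"])
        if key in seen:
            continue  # skip duplicated rows
        seen.add(key)
        if row["amount"].strip() == "":
            continue  # skip rows with a missing data point
        # Store the numerical field as a number rather than text.
        cleaned.append({"region": row["region"], "amount": float(row["amount"])})
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```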

Exploratory Data Analysis (EDA)

With your data now clean, you can begin to explore the data set to get familiar with it. Typically this includes plotting the variables/values so you can get a sense of the range, magnitude, and general shape of the data.
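
As a rough illustration, the sketch below computes summary statistics and prints a crude text histogram for one numeric variable. The values and the bin width of 5 are invented for the example; in practice you would more likely reach for a plotting library.

```python
import statistics

# Hypothetical cleaned values for a single numeric variable.
amounts = [12.0, 15.5, 14.0, 30.0, 13.5, 16.0, 14.5]

# Range, magnitude, and spread at a glance.
summary = {
    "count": len(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "stdev": statistics.stdev(amounts),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# General shape: bucket values into bins of width 5 and draw one '#' per value.
bin_width = 5
bins = {}
for value in amounts:
    start = int(value // bin_width) * bin_width
    bins[start] = bins.get(start, 0) + 1
for start in sorted(bins):
    print(f"{start:>3}-{start + bin_width:<3} {'#' * bins[start]}")
```

Even this small example shows why plotting helps: the single value near 30 stands apart from the rest and pulls the mean above the median.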

Model Building

The goal of model-building is to create a numerical model that can (hopefully) predict the value of your target metric, given one or more inputs.
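
One of the simplest possible models is a straight line fitted by ordinary least squares. The sketch below is just that, with a single input; the `fit_line` helper and the ad-spend and sales numbers are hypothetical, and real projects would normally use a modeling library and hold out data for evaluation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: one input (ad spend) and the target (sales).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted model to predict the target for an input it has not seen.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.3f} intercept={intercept:.3f} prediction={predicted:.3f}")
```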

Model Deployment

Now that you have a trained model, you can start throwing new inputs at it to see what it spits out. If you've done your job well when building the model, you should now be able to turn the model loose on the real world to see what it can predict.
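
A bare-bones way to picture deployment: save the trained parameters, load them somewhere else, and score new inputs. The sketch below assumes a simple linear model whose parameters (made up here) are stored as JSON; the `predict` helper is hypothetical, and a real deployment would add things like input validation and monitoring.

```python
import json

# Hypothetical parameters produced by an earlier training step.
trained_model = {"slope": 2.0, "intercept": 1.0}

# Serialize the model so a separate script or service can load it later.
saved = json.dumps(trained_model)


def predict(model_json, x):
    """Load linear-model parameters from JSON and score one new input."""
    params = json.loads(model_json)
    return params["slope"] * x + params["intercept"]


# New inputs the model never saw during training.
new_inputs = [0.0, 2.5, 10.0]
predictions = [predict(saved, x) for x in new_inputs]
print(predictions)
```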

Rich McMullen