I plan to start a series of blog posts on predictive analytics, as there is increasing demand for applying machine learning techniques to analyze large amounts of raw data. This set of techniques has been very useful to me, and I think it should be useful to other people as well. I will also go through some coding examples in R, a statistical programming language that is very useful for performing predictive analytics tasks. In case you are not familiar with R, here is a very useful link to get some familiarity with it.
Predictive analytics is a specialized set of data processing techniques focused on predicting future outcomes by analyzing previously collected data. The processing cycle typically involves two phases:
- Training phase: Learn a model from training data
- Predicting phase: Deploy the model to production and use that to predict the unknown or future outcome
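To make the two phases concrete, here is a minimal sketch in R using the built-in iris data set. The choice of model (lm()) and the random train/test split are my own illustrative assumptions, not a prescribed recipe:

```r
# Training phase: learn a model from training data
set.seed(42)
train_idx <- sample(nrow(iris), 100)   # hold out 50 rows for prediction
train <- iris[train_idx, ]
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train)

# Predicting phase: apply the learned model to unseen data
test <- iris[-train_idx, ]
predictions <- predict(model, newdata = test)
head(predictions)
```

In a real deployment the "predicting phase" would run in production against newly arriving data, but the shape of the code is the same: fit once, then call predict() on new rows.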
The whole training lifecycle involves the following steps.
Determine the input and output
At this step, we define the output (what we predict) as well as the input data (what we use). If the output is a continuous numeric quantity, we call the exercise a “regression”. If the output is a discrete category, we call it a “classification”. Similarly, the input can be a number, a category, or a mix of both.
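As a quick illustration of the two flavors in R, a regression model predicts a numeric column while a classification model predicts a category. The specific choices below (lm() for regression, a logistic glm() on a two-species subset of iris for classification) are assumptions for the sake of the sketch:

```r
# Regression: the output is a continuous numeric quantity
reg <- lm(Petal.Length ~ Petal.Width, data = iris)

# Classification: the output is a discrete category.
# glm() with family = binomial handles two classes, so drop "setosa".
two_species <- droplevels(subset(iris, Species != "setosa"))
clf <- glm(Species ~ Petal.Length + Petal.Width,
           data = two_species, family = binomial)
```

Multi-class classifiers (e.g. decision trees) follow the same pattern; the binary logistic model is just the simplest case available in base R.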
Determining the ultimate output is largely a business requirement and is usually well-defined (e.g. predicting the quarterly revenue). However, there are many intermediate outputs that are related (in fact, they may be inputs) to the final output. In my experience, determining this set of intermediate outputs usually goes through a back-tracking exploratory process as follows.
- Start at the final output that we aim to predict (e.g. quarterly revenue).
- Identify all input variables that are useful for predicting the output, and look into the sources for this input data. If there is no data source corresponding to an input variable, that variable becomes a candidate intermediate output.
- Repeat this process for each intermediate output. Effectively, we build multiple layers of predictive analytics that move from raw input data to intermediate outputs and eventually to the final output.
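The layering above can be sketched as two chained models. The data here is entirely synthetic and the variable names (ad spend, unit sales, revenue) are hypothetical, so treat this only as the shape of the approach:

```r
set.seed(1)
ad_spend <- runif(100, 0, 10)                    # raw input (synthetic)
unit_sales <- 3 * ad_spend + rnorm(100)          # intermediate output
revenue <- 20 * unit_sales + rnorm(100, sd = 5)  # final output

# Layer 1: predict the intermediate output from the raw input
m_sales <- lm(unit_sales ~ ad_spend)

# Layer 2: predict the final output from the intermediate prediction
predicted_sales <- fitted(m_sales)
m_revenue <- lm(revenue ~ predicted_sales)
```

Each layer is trained and evaluated like any other model; the only difference is that its prediction becomes an input variable for the layer above it.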
At this step, we determine how to extract useful input information from the raw data that will be influential to the output. This is an exploratory exercise guided by domain experts. Finally, a set of input features (derived from the raw input data) is defined.
Visualizing existing data is a very useful way to come up with ideas about which features should be included. The “dataframe” in R is a common structure where data samples are organized in tabular form, and we’ll be using a dataframe that comes with base R. Specifically, the dataframe “iris” records different species of iris flowers and their various length measurements.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
Single Variable Frequency Plot
For numeric data, it is good to get some idea of the frequency distribution. A histogram, together with a smoothed density plot, gives a good picture.
> # Plot the histogram
> hist(iris$Sepal.Length, breaks=10, prob=T)
> # Plot the density curve
> lines(density(iris$Sepal.Length))
For category data, a bar plot is a good choice.
> categories <- table(iris$Species)
> barplot(categories, col=c('red', 'green', 'blue'))