# Predictive Analytics: Overview and Data Visualization

http://horicky.blogspot.com/2012/05/predictive-analytics-overview-and-data.html?m=1

I plan to start a series of blog posts on predictive analytics, as there is increasing demand for applying machine learning techniques to analyze large amounts of raw data.  This set of techniques is very useful to me, and I think it should be useful to other people as well.  I will also go through some coding examples in R, a statistical programming language that is very useful for predictive analytics tasks.  In case you are not familiar with R, here is a very useful link to get some familiarity with R.

Predictive analytics is a specialized set of data processing techniques focused on predicting future outcomes by analyzing previously collected data.  The processing cycle typically involves two phases:

1. Training phase: Learn a model from training data
2. Predicting phase: Deploy the model to production and use it to predict unknown or future outcomes
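As a minimal sketch of these two phases, the snippet below uses the `iris` data frame that ships with R. The 100/50 split, the choice of `lm`, and the variable names are illustrative assumptions, not part of any production setup.

```r
# Hypothetical illustration using the built-in iris data frame
set.seed(42)                              # make the random split reproducible
train_idx <- sample(nrow(iris), 100)      # 100 rows used for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]               # remaining 50 rows stand in for "unknown" data

# Training phase: learn a model from the training data
model <- lm(Petal.Width ~ Petal.Length, data = train)

# Predicting phase: apply the learned model to inputs it has not seen
predicted <- predict(model, newdata = test)

# Compare predictions with the held-back true values (root mean squared error)
rmse <- sqrt(mean((predicted - test$Petal.Width)^2))
print(rmse)
```

In a real deployment the second phase would run on data whose outcome is not yet known, so a held-back test set like this one is only a way to estimate how well the model might do.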

The whole lifecycle of training involves the following steps.

## Determine the input and output

At this step, we define the output (what we predict) as well as the input data (what we use) in this exercise.  If the output is a continuous numeric quantity, we call the exercise a “regression”.  If the output is a discrete category, we call it a “classification”.  Similarly, an input can be a number, a category, or a mix of both.
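To make the distinction concrete, here is a small sketch that names the kind of exercise from the type of the output column. The helper `task_type` is hypothetical and written only for this illustration.

```r
# Hypothetical helper: name the kind of exercise from the type of the output
task_type <- function(output) {
  if (is.factor(output) || is.character(output) || is.logical(output)) {
    "classification"      # discrete category
  } else if (is.numeric(output)) {
    "regression"          # continuous numeric quantity
  } else {
    stop("unsupported output type")
  }
}

task_type(iris$Petal.Width)   # numeric column
task_type(iris$Species)       # factor column
```

A type check like this is only a first guess: a numeric column that holds category codes (for example 1, 2, 3 for three product lines) is still a classification output and should be converted to a factor first.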

Determining the ultimate output is largely a business requirement and is usually well defined (e.g. predicting the quarterly revenue).  However, there are many intermediate outputs that are related to the final output (in fact, they are inputs to it).  In my experience, determining this set of intermediate outputs usually goes through a back-tracking exploratory process, as follows.

• Start at the final output that we aim to predict (e.g. quarterly revenue).
• Identify all input variables that are useful for predicting the output, and look into the source of each input’s data.  If there is no data source corresponding to an input variable, that input variable becomes a candidate intermediate output.
• Repeat this process to learn about these intermediate outputs.  Effectively, we build multiple layers of predictive analytics so that we can move from raw input data to intermediate outputs and eventually to the final output.
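The layering idea can be sketched with two chained models. The example below pretends, purely for illustration, that `Petal.Length` has no direct data source at prediction time and therefore becomes an intermediate output. The split, the variable roles, and the use of `lm` are all assumptions.

```r
# Illustrative assumption: Petal.Length is not observable at prediction time,
# so it is treated as an intermediate output.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
new   <- iris[-idx, c("Sepal.Length", "Sepal.Width")]   # raw inputs only

# Layer 1: raw inputs -> intermediate output
layer1 <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = train)
# Layer 2: intermediate output -> final output
layer2 <- lm(Petal.Width ~ Petal.Length, data = train)

# Chain the layers: predict the intermediate value, then feed it forward
intermediate <- data.frame(Petal.Length = predict(layer1, newdata = new))
final <- predict(layer2, newdata = intermediate)
head(final)
```

Errors made in the first layer carry into the second, so each intermediate output is worth validating on its own before relying on the chain.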

## Feature engineering

At this step, we determine how to extract useful input information from the raw data that will be influential to the output.  This is an exploratory exercise guided by domain experts.  Finally, a set of input features (derived from the raw input data) is defined.

Visualizing existing data is a very useful way to come up with ideas about which features should be included.  A “dataframe” in R is a common way to organize data samples in a tabular structure.  We’ll be using some dataframes that come with R.  Specifically, the dataframe “iris” contains sepal and petal length and width measurements for different species of iris.

``````
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
``````
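As a small sketch of the feature engineering step described above, new input features can be derived from the raw columns of `iris`. The two features below (`Petal.Area` and `Sepal.Ratio`) are hypothetical choices made for illustration, not features recommended by a domain expert.

```r
# Copy the built-in data frame so the original stays untouched
features <- iris

# Hypothetical derived features (illustrative choices only)
features$Petal.Area  <- features$Petal.Length * features$Petal.Width
features$Sepal.Ratio <- features$Sepal.Length / features$Sepal.Width

# Check how each candidate feature varies across the output categories
aggregate(cbind(Petal.Area, Sepal.Ratio) ~ Species, data = features, FUN = mean)
```

A derived feature whose average differs clearly between species is a candidate worth keeping; one that looks the same for every species probably adds little.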

### Single Variable Frequency Plot

For numeric data, it is good to get some idea of the frequency distribution.  A histogram, together with a smoother density plot, gives a good picture.

``````
> # Plot the histogram
> hist(iris$Sepal.Length, breaks=10, prob=TRUE)
> # Plot the density curve
> lines(density(iris$Sepal.Length))
``````
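The `breaks=10` argument in the histogram call above is only a suggestion: R rounds the bin boundaries to "pretty" values, so the actual number of bins can differ. A quick way to see what was chosen, sketched here with an arbitrary variable name `h`, is to compute the histogram without drawing it.

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, breaks = 10, plot = FALSE)

h$breaks        # the bin boundaries R actually chose
h$counts        # number of observations in each bin
sum(h$counts)   # every observation falls into exactly one bin
```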

For categorical data, a bar plot is a good choice.

``````
> categories <- table(iris$Species)
> barplot(categories, col=c('red', 'green', 'blue'))
``````

# Here’s how often small asteroids enter Earth’s atmosphere!

16 Nov 2014

NASA’s Near Earth Object (NEO) Program released this new map on November 14, 2014. It shows what more and more people are beginning to realize: that small asteroids enter Earth’s atmosphere frequently. Indeed, we on Earth are located in what astronomers at Armagh Observatory in Northern Ireland have called the cosmic shooting gallery. What happens to these asteroids, and why didn’t we know this before? Many of the impacts are seen and reported as fireballs. Nearly all burn up in Earth’s atmosphere; that is, our atmosphere does its job in protecting us from what would otherwise be impacts to Earth itself. The notable exception was the Chelyabinsk meteor on February 15, 2013. It was the largest asteroid to hit Earth in this period (about 20 meters in size before it hit the Earth). The Chelyabinsk meteor broke windows in some 7,200 buildings in six Russian cities and injured at least 1,500 people, mostly from shattered glass.

The new map above shows that atmospheric impacts by small asteroids are randomly distributed around the globe. The map is a visualization of data gathered by U.S. government sensors from 1994 to 2013. The data indicate that small asteroids struck Earth’s atmosphere – resulting in what astronomers call a bolide (a fireball, or bright meteor) – on 556 separate occasions in a 20-year period. Almost all asteroids of this size disintegrate in the atmosphere and are usually harmless. NASA says:

The new data could help scientists better refine estimates of the distribution of the sizes of NEOs [near Earth objects] including larger ones that could pose a danger to Earth.

On this map, the size of each dot is proportional to the optical radiated energy of the impact event, measured in billions of Joules (GJ). NASA says:

An approximate conversion between the measured optical radiant energy and the total impact energy can be made using an empirical relationship provided by Peter Brown and colleagues in 2002. For example the smallest dot on the map represents 1 billion Joules (1 GJ) of optical radiant energy, or when expressed in terms of a total impact energy the equivalent of about 5 tons of TNT explosives. Likewise, the dots representing 100, 10,000 and 1,000,000 Giga Joules of optical radiant energies correspond to impact energies of about 300 tons, 18,000 tons and one million tons of TNT explosives respectively.
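The four value pairs quoted above lie close to a straight line on a log-log scale, which suggests a power law. The sketch below fits one to those four points only. It is a rough reconstruction from the quoted numbers, not the published relationship itself, and `estimate_impact_tons` is a hypothetical helper.

```r
# The four (optical energy, impact energy) pairs quoted above
optical_gj  <- c(1, 100, 10000, 1000000)   # optical radiant energy, GJ
impact_tons <- c(5, 300, 18000, 1000000)   # total impact energy, tons of TNT

# Fit log10(impact) = a + b * log10(optical), i.e. impact = 10^a * optical^b
fit <- lm(log10(impact_tons) ~ log10(optical_gj))
a <- coef(fit)[[1]]
b <- coef(fit)[[2]]

# Hypothetical helper built from the fitted coefficients
estimate_impact_tons <- function(gj) 10^a * gj^b

b                            # fitted exponent
estimate_impact_tons(1000)   # interpolated estimate for a 1,000 GJ event
```

The fitted exponent comes out a little below 1, meaning the estimated impact energy grows slightly less than proportionally with the measured optical energy. With only four rounded data points, the result is good for a ballpark figure and nothing more.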

NASA adds that finding and characterizing hazardous asteroids to protect our home planet is a high priority for the space agency. It is one of the reasons that NASA is now spending 10 times more than it was five years ago on asteroid detection, characterization and mitigation.

In addition, NASA says it has:

… aggressively developed strategies and plans with its partners in the U.S. and abroad to detect, track and characterize Near Earth Objects. These activities also will help identify NEOs that might pose a risk of Earth impact, and further help inform developing options for planetary defense.

We can only hope so. If you like, you can help participate in the hunt for potentially hazardous NEOs through the Asteroid Grand Challenge, which, NASA says:

… aims to create a plan to find all asteroid threats to human populations and know what to do about them.

NASA is also pursuing an Asteroid Redirect Mission (ARM) which will identify, redirect and send astronauts to explore an asteroid. Among its many exploration goals, the mission could demonstrate basic planetary defense techniques for asteroid deflection.

# Google Does a Map Story (Jane Goodall’s Chimpanzee Research)

Several months ago Google did a MOOC built around Google Maps.  Check out the following URL for a look at a very impressive Map Story: