
## Time based heatmaps in R


### Tutorial Scenario

In this tutorial, we are going to be looking at heatmaps of Seattle 911 calls by various time periods and by type of incident.  This awesome dataset is available as part of the data.gov open data project.

### Steps

The code below walks through 6 main steps:

1. Install and import packages
2. Set variables and import data
3. Transform the data
4. Create summary table
5. Create heatmap
6. Celebrate

### Code

```r
#################### Import and Install Packages ####################

install.packages("plyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")

library(plyr)
library(lubridate)
library(ggplot2)
library(dplyr)
```

```r
#################### Set Variables and Import Data ####################

col1 = "#d8e1cf"
col2 = "#438484"

# the import of the `incidents` data frame is not shown in this excerpt
attach(incidents)
str(incidents)
```

```r
#################### Transform ####################

# Convert dates using lubridate
incidents$ymd <- mdy_hms(Event.Clearance.Date)
incidents$month <- month(incidents$ymd, label = TRUE)
incidents$year <- year(incidents$ymd)
incidents$wday <- wday(incidents$ymd, label = TRUE)
incidents$hour <- hour(incidents$ymd)

attach(incidents)
```

```r
#################### Heatmap Incidents Per Hour ####################

# create summary table for heatmap - Day/Hour specific
dayHour <- ddply(incidents, c("hour", "wday"), summarise,
                 N = length(ymd))

# reverse day ordering so the heatmap reads top to bottom
dayHour$wday <- factor(dayHour$wday, levels = rev(levels(dayHour$wday)))
attach(dayHour)

# overall summary
ggplot(dayHour, aes(hour, wday)) +
  geom_tile(aes(fill = N), colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  theme_bw() + theme_minimal() +
  labs(title = "Heatmap of Seattle Incidents by Day of Week and Hour",
       x = "Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
```

```r
#################### Heatmap Incidents Year and Month ####################

# create summary table for heatmap - Month/Year specific
yearMonth <- ddply(incidents, c("year", "month"), summarise,
                   N = length(ymd))

yearMonth$month <- factor(yearMonth$month, levels = rev(levels(yearMonth$month)))
attach(yearMonth)

# overall summary
ggplot(yearMonth, aes(year, month)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Year and Month",
       x = "Year", y = "Month") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
```

```r
#################### Heatmap Incidents Per Hour by Incident Group ####################

# create summary table for heatmap - Group specific
groupSummary <- ddply(incidents, c("Event.Clearance.Group", "hour"), summarise,
                      N = length(ymd))

# overall summary
ggplot(groupSummary, aes(hour, Event.Clearance.Group)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Event and Hour",
       x = "Hour", y = "Event") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
```

Please see here for the full tutorial and steps

## Foolproof R package Install


The number of R packages, and the cool new tricks that come with them, continues to grow every month.  To understand the current state of R packages on CRAN, I ran some code provided by Gergely Daróczi on Github.  As of today, almost 14,000 R packages have been published on CRAN, and the rate of publishing appears to be growing at an almost exponential trend.  Additionally, there are even more packages available from sources like Github, Bioconductor, Bitbucket and more.

## Approaching package issues systematically

Given the number of tools the average data person uses daily, we need to turn as many hurdles into “easy tasks” as possible.  To help other poor souls that don’t want to think too hard when struggling to install R packages referenced in tutorials or other media, I’ve put together a simple flow chart.  The basic troubleshooting steps can be followed in the flow chart; additional detailed instructions and links can be found below the image.

## Additional instructions for Package Install troubleshooting flow chart

Is the package available on CRAN?

• Unfortunately CRAN does not have a search but you can usually find the package by googling  “CRAN R <package name>”

Do you have the right version of base R?

• To identify your R version, execute the command `version` and the output will indicate your installed base R version (for example, 3.4.3)
• To install a new version of R, visit their download page
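As a quick sketch of the version check described above, either of the base-R commands below can be run in any R console:

```r
# print the full version string of the installed base R
R.version.string

# or inspect individual components of the `version` list
version$major
version$minor
```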

Did the install work?
• If the install worked, you will get a message along the lines of “The downloaded binary packages are in <filepath> “

Install via R-Studio package interface

• This is a very handy tip which prevents silly typos.  The tip was given by Albert Kim in reply to my #rstats tweet.  He documents this in his awesome book: Modern Dive An Introduction to Statistical and Data Sciences via R.

Locate the package repo and install via devtools

• Typically the easiest way to locate the package repo is by googling “r package <package name>”.  In the case of the emo R package I found it here: https://github.com/hadley/emo
• Install the package from the repo via devtools.  This simply involves installing and loading the devtools package and then executing the appropriate “install_” command from the docs.  In the case of the emo package, the following code will work.
```r
install.packages("devtools")
library(devtools)
install_github("hadley/emo")

# or, without loading the library first:
devtools::install_github("hadley/emo")
```

## Thank you

Thank you for taking the time to read this guide.  I certainly hope that it will help people spend less time thinking about package install debugging and leave more time for fun data analysis and exploration.  Please feel free to let me know your thoughts in the comments or on twitter.  Thanks!

Original Post can be found here.

## Predictive Analytics Path to Mainstream Adoption


Hold on to your hats data scientists, you’re in for another wild ride.   A few months ago, our beloved field of predictive analytics was taken down a peg by the 2017 Hype Cycle for Analytics and Business Intelligence.    In the latest report, predictive analytics moved from the “Peak of Inflated Expectations” to the “Trough of Disillusionment”.  Don’t despair, this is a good thing!   The transition means that the silver bullet seekers are likely moving on to the next craze and the technology is moving one step closer to long term productive use. Gartner estimates approximately 2-5 years to mainstream adoption.

## PHASES

Outlined below from Wikipedia, the phases of hype cycle include:

Technology Trigger – A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven.

Peak of Inflated Expectations – Early publicity produces a number of success stories—often accompanied by scores of failures. Some companies take action; most don’t.

Trough of Disillusionment – Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investment continues only if the surviving providers improve their products to the satisfaction of early adopters.

Slope of Enlightenment – More instances of how the technology can benefit the enterprise start to crystallize and become more widely understood. Second- and third-generation products appear from technology providers. More enterprises fund pilots; conservative companies remain cautious.

Plateau of Productivity – Mainstream adoption starts to take off. Criteria for assessing provider viability are more clearly defined. The technology’s broad market applicability and relevance are clearly paying off.

In the past few years, predictive analytics has been steadily moving along the curve to popularization.  As someone in the field, I’ve been both excited and nervous about the changes.  I’m thrilled to have more fellow interested colleagues and business support.  However, I get nervous when folks jump on the trend and try to apply predictive analytics blindly as a way to automate the solution of any problem with data.

## WHAT IS PREDICTIVE ANALYTICS ANYWAY?

Simply put, predictive analytics is the process of identifying patterns in the data you have in order to estimate values for the data you do not have.

The business problem frequently involves using past data to predict future data.  For example, a company may use last year’s customer data to build a model which will predict which customers have a high potential to leave.  Customer attribute data such as demographics, spend and engagement are analyzed using statistical techniques to create a predictive model.  After the predictive model is created, it is capable of taking in the same type of customer information (demographics, spend and engagement) for a new data set and estimating, for each customer in that new data set, the probability of leaving.  With this type of information, a company can flag and reach out to potentially defecting customers before they leave.
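As a concrete, entirely hypothetical sketch of this churn scenario, the snippet below fits a logistic regression on simulated customer data and then scores a "new" customer; every column name, coefficient and threshold is invented for illustration:

```r
set.seed(42)
# simulate "last year's" customer data
customers <- data.frame(
  spend      = runif(500, 0, 100),       # monthly spend in dollars
  engagement = runif(500, 0, 1),         # engagement score
  tenure     = sample(1:60, 500, TRUE)   # months as a customer
)
# simulate churn: low spend and low engagement raise the odds of leaving
logit <- -1 + 0.5 * (customers$spend < 20) + 1.5 * (customers$engagement < 0.3)
customers$churned <- rbinom(500, 1, plogis(logit))

# build the predictive model on past data
model <- glm(churned ~ spend + engagement + tenure,
             data = customers, family = binomial)

# score a new customer: estimated probability of leaving
new_customer <- data.frame(spend = 10, engagement = 0.2, tenure = 3)
predict(model, new_customer, type = "response")
```

A customer whose predicted probability is high could then be flagged for a retention offer.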

## STEPS TO PREDICTIVE ANALYTICS MODELLING

Tackling a predictive analytics problem requires more than simply throwing the data into modelling software and running off with your winning lottery ticket.  Although the actual creation of models can be that easy, making them effective requires an in-depth review of the problem and available data.  Often so much is learned in the pre-modelling phases that the predictive modelling strategy changes shape over the course of the project.

While predictive analytics projects can be quite detailed and complex, the high level tasks are straightforward.  Each predictive modeller will have their own flavor of a process.  Below, I’ve outlined what I believe to be the six major phases to creating an effective predictive model.

Define the Problem – Before you even get started, you need to understand the problem thoroughly.  What are you being asked to do?  What is the motivation?  Understand as much about the problem landscape as possible including business models, product function, and more.  Speak to subject matter experts.  Speak to those who have attempted this problem before and learn from them.  Soak in as much information as possible.

Set Up –  Identify and access the data sources that you will analyze.  Set up the tools that you will use to bring in the data and perform the analysis.

Exploratory Data Analysis (EDA) – Now it’s time to have fun!  This is where you get to dig in and start unravelling the mystery of what is going on.  It’s called “Exploratory Data Analysis” because this is where you explore your data set.  Evaluate all pieces of information you are given to understand their construction, population, quality and relationship with other pieces of information.  For example; if “Average Monthly Spend” is a customer attribute in your data set, you will want to explore the following angles:

• How many missing values are in this attribute?
• What is the distribution?  For example, you might notice that most customers spend \$10 or less.
• Are there any outlier values which will throw off the ability to detect patterns?  For example, are most customers spending between \$5-\$50 dollars with a few customers spending \$1000+ dollars?
• What relationship does this attribute have with other attributes?  For example, is the total spend highly correlated to zip code?
• And more!  You will find yourself being amazed at the stories that unfold during this phase.

Transform – Before you create a model, you need to perform some minor tweaks on your data set to maximize model performance.  The EDA from the previous step will indicate what you need to do.  For example, if there are some pieces of information with missing values, you need to decide what to do with them.  Do you want to replace the missing value with the overall average, replace it with a calculated value or remove that customer from the set altogether?  You may need to decide how you handle outliers.  Are you going to cap the values at a certain hard value, or cap at a certain percentile?  Will you separate the data with high and low outliers for their own special model?  There are many transformations that can be done to help your model more accurately reflect the intended audience and give you higher performance.
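As a small hypothetical sketch of two of the transforms described above, mean imputation for missing values and capping outliers at a percentile; the data frame and column name are invented for illustration:

```r
df <- data.frame(avg_monthly_spend = c(5, 8, 12, NA, 7, 1000, NA, 9))

# replace missing values with the overall average
m <- mean(df$avg_monthly_spend, na.rm = TRUE)
df$avg_monthly_spend[is.na(df$avg_monthly_spend)] <- m

# cap outliers at the 95th percentile
cap <- quantile(df$avg_monthly_spend, 0.95)
df$avg_monthly_spend <- pmin(df$avg_monthly_spend, cap)
```

Which strategy to use (mean vs. calculated value, hard cap vs. percentile cap) is exactly the judgment call the EDA phase informs.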

Predictive Modelling – Time to create models!  There are so many cool techniques out there to try, it’s good to get creative.   Depending on what you are trying to predict, one type of model might be more fitting than another.  Logistic regression can be great for classification type problems such as predicting if an insurance customer will make a claim (Yes or No).   Linear regression can be a great choice for numerical predictions such as predicting how much an insurance claim from a particular customer is likely to cost you (\$0 to \$X dollars).  Also, you will want to play with which variables you will include as input for each type of model.

Assess and Implement – Now that you’ve created all of your candidate models, it’s time to pick one.  Start by reviewing basic performance stats about the models.  You’ll want to look for accuracy in a way that is applicable to the model and the problem.  For classification problems you might want to look at a confusion matrix, which essentially shows how many classifications you got right and wrong when using the model on the training data.  For numerical predictions you will want to look at the mean squared error.  Loosely put, this allows you to understand the relative average amount you are off on predictions.  Beyond overall performance metrics, you will also want to consider the practicality of your model.  A good way of doing this is to consider the bias-variance trade-off.  A model that is biased is considered to have potentially oversimplified the solution and may not give the most accurate predictions.  However, it is usually very intuitive and provides relatively stable performance with fluctuations in the data set.  If your model is too complex, it can perform very well on the data set it trained on, but it might be too customized to the data it has seen before.  This means that it might not handle variance in the data very well.  Additionally, these models are often so complex that they are incredibly difficult, if not impossible, to interpret.
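The two performance checks mentioned above take only a couple of lines; the actual and predicted values below are made up for illustration:

```r
# classification: confusion matrix of actual vs. predicted labels
actual    <- c("Yes", "No", "Yes", "No", "Yes", "No")
predicted <- c("Yes", "No", "No",  "No", "Yes", "Yes")
table(actual, predicted)

# numerical prediction: mean squared error
y     <- c(10, 20, 30, 40)
y_hat <- c(12, 18, 33, 37)
mse <- mean((y - y_hat)^2)
mse  # 6.5: the average squared distance between prediction and truth
```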

At the end of this step you will look at all you know about your models and choose the one you think is best.  You will then use the model to predict new data.

## I’M DOING A SERIES!

The above steps can be a lot to take in!   If you’re like me, you are probably looking for a concrete example that you can follow along and try for yourself.  As such, I’m in the midst of shaping a tutorial for y’all using some of my favorite technologies (spoiler alert: it uses R, R Studio and Data Science Experience).  The only hitch is that it ends up being far too much material to cover in one blog.  To remedy this, I decided to give the topic the respect it deserves and create a multi-part tutorial.  I’m going to break the tutorial down into 4 parts!  FOUR (1, 2, 3, 4)!  I’m aiming to release them at a cadence of one every week or two.  They will include a write-up of the steps and code that you can execute in your own environment.

I hope this blog has been beneficial to you.  Please feel free to leave your comments below.  Thanks for reading and stay tuned!

Original blog post can be found here

Written by Laura Ellis

## Analysis on Google’s Best Apps of 2017 List


## STEP 1: BACKGROUND

As we put 2017 to a close, “Best of 2017” lists are being released.  I had a look at Google’s “Best Apps of 2017” list and I like their arrangement of picking the top 5 apps by categories such as “Best Social”, “Most Innovative” etc.  However, I found myself wishing I could dive deeper.  I wanted to examine which factors contribute to the placement of an app on these lists.  A natural way of doing this would be to download the data and start analyzing.  Simple, right?  Think again; this is the classic problem of information on the web.  There absolutely is an abundance of information on the internet, but it’s only consumable in the way that the website wishes to serve it up.  For example, in the best apps list, I can see every app, their category, their total downloads, ratings etc.  All the information I need is available to me, but it’s not in the format that I need to facilitate additional analysis and insights.  It’s meant for the user to browse and read right on the website.  The website controls the narrative.

## Strategy

So what do we do?  The information we want is available to us, but it’s not in the form we need.  Luckily for us, many people much smarter than I have solved this problem.  The concept is called screen scraping and it’s a technique used to automate copying data off of websites.  For data wranglers, there are a number of libraries and packages that have been developed to make screen scraping relatively straightforward.  In Python, the package Beautiful Soup has a large following.  In R, the package rvest has been getting a lot of traction.

Since I’m more proficient with R, we will use the rvest package to scrape data from the Google Best Apps of  2017 website and store it in a data frame.  We will then use a variety of R packages to analyze the data set further.

## STEP 2: GATHER THE DATA

R packages contain a grouping of R data functions and code that can be used to perform your analysis. We need to install and load them in our environment so that we can call upon them later.  As per the previous tutorial, enter the following code into a new cell, highlight the cell and hit the “run cell”  button.

```r
# install packages - do this one time
install.packages("rvest")
install.packages("plyr")
install.packages("alluvial")
install.packages("ggplot2")
install.packages("plotrix")
install.packages("treemap")
install.packages("plotly")

# Load the relevant libraries - do this every time
library(rvest)
library(plyr)
library(alluvial)
library(ggplot2)
library(plotrix)
library(treemap)
library(plotly)
```

### Call the rvest library to create a screen scraping function for the Google “Best of” website

Screen scraping is a very in-depth topic and it can get incredibly complicated depending on how the page you would like to scrape is formatted.  Luckily, the page we are trying to scrape allows the data objects we want to be referenced relatively easily.  Since there are 6 pages in total that we would like to scrape (all with the same format), I made a function that we can call for every page.  Creating a function helps to prevent repeated blocks of code.  This function takes two input parameters: the website URL that you want the data from and the category name you would like to assign to this data set.  It then retrieves the app’s title, rating count, download count, content rating (mature, teen etc.), write-up (description) and assigns the category you provided.  Finally, it returns a data frame to you with all of the information nicely packed up.

```r
###########  CREATE Function for Google Screen Scrape ################
scrapeGoogleReviews <- function(url, categoryName) {
  # read the html of the desired website to be scraped
  webpage <- read_html(url)

  df <- data.frame(
    app_title = html_text(html_nodes(webpage, '.id-app-title')),
    rating_count = html_text(html_nodes(webpage, '.rating-count')),
    download_count = html_text(html_nodes(webpage, '.download-count')),  # CSS selector assumed
    content_rating = html_text(html_nodes(webpage, '.content-rating-title')),
    write_up = html_text(html_nodes(webpage, '.editorial-snippet')),
    category = categoryName)

  return(df)
}
```

### Call the function just created (scrapeGoogleReviews) for every “Best Apps” list we want

Each web page hosts its own “Best of 2017 App” list, such as “Best Social”, “Most Innovative”, “Best for Kids” etc.  The function is called for each web page and the results are placed in their own data frame.  We then put all the data frames together into one combined data frame with the rbind command.

```r
###########  CALL Function for Screen Scrape ################
# one call per "Best of" list; the original page URLs are omitted here, e.g.:
# df1 <- scrapeGoogleReviews("<url of list 1>", "Winner")
# ...
# df8 <- scrapeGoogleReviews("<url of list 8>", "Most Popular")

# Combine all of the data frames
fulldf <- rbind(df1, df2, df3, df4, df5, df6, df7, df8)

# Peek at the data frame
head(fulldf)
```

## STEP 3: FORMAT YOUR DATA

### Convert to numeric

The downfall to screen scraping is that we often have to reformat the data to suit our needs.  For example, in our data frame we have the two numeric variables: download_count and rating_count.   As much as they look like numbers in the preview above, they are actually text with some pesky commas included in them that make conversion to numeric slightly more complicated.  Below we create two new columns with the numeric version of these variables.  The conversion is performed by first removing any non-numeric values with the gsub function and then converting to numeric with the as.numeric function.

```r
###########  Extra formatting ################

# Remove commas and convert to numeric
fulldf$rating_count_numeric <- as.numeric(gsub("[^0-9]", "", fulldf$rating_count))
fulldf$download_count_numeric <- as.numeric(gsub("[^0-9]", "", fulldf$download_count))
attach(fulldf)
```

### Create helper variables for easier analysis and visualization

There are some things, just off the bat, that I know we will want for visualization.  To start, it would be nice to have the percent of overall downloads for each app within the data set.

```r
# percent of overall downloads for each app (calculation assumed; the
# percentDownloadApp column is referenced by the pie chart below)
fulldf$percentDownloadApp <- download_count_numeric / sum(download_count_numeric) * 100
attach(fulldf)
```

Next, we want to bin our download and rating totals. “Binning” is a way of grouping values within a particular range into the same group.  This is an easy way for us to be able to look at volumes more easily.

```r
# Binning by download totals (column name assumed, mirroring the ratings bin)
breaks <- c(0, 10000, 1000000, 10000000, 100000000)
fulldf$download_total_ranking = findInterval(download_count_numeric, breaks)

# Binning by rating totals
breaks2 <- c(10, 100, 1000, 100000, 10000000)
fulldf$rating_total_ranking = findInterval(rating_count_numeric, breaks2)
attach(fulldf)

# peek at the data
head(fulldf)
```

## STEP 4: ANALYZE YOUR DATA

### Visualize app data

Create a pie chart showing the top downloaded apps within the data set.  We use the percentDownloadApp variable created in step 3 and the plot_ly function to create the pie chart.  Given that the majority of the data set has less than 1% of the downloads, we also only include apps with greater than 1% with the ifelse function.

```r
###########  Visualize the Data ################

# pie chart of apps with more than 1% of downloads
# (the plot_ly call is assumed; the layout options are from the original)
plot_ly(fulldf,
        labels = ~app_title,
        values = ~ifelse(percentDownloadApp > 1, percentDownloadApp, 0),
        type = "pie") %>%
  layout(title = "Top Downloaded Apps",
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
```

Create a treemap to represent all app download volumes – Treemaps are perfect for visualizing volumes in a creative way.  The treemap function offers the added benefit of not including titles for categories with minimal data.

```r
# Treemap without category
treemap(fulldf,
        index = c("app_title"),
        vSize = "download_count_numeric",  # size variable assumed
        type = "index",
        palette = "Blues",
        title = "App Download Volumes",    # title text assumed
        fontsize.title = 14
)
```

Create a classic bar chart to represent all app downloads – We are using ggplot to pull a pretty bar chart.

```r
# the base layer `g` is assumed; the original did not show its definition
g <- ggplot(fulldf, aes(x = app_title, y = download_count_numeric))

g + geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       subtitle = "Applications") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
```

Create a bubble chart showing the top downloaded apps and the number of ratings they received.    We use the ggplot function again to create this bubble plot.  The size of the bubbles represent the volume of downloads, the color represents the number of ratings.  Given the disparity of download volumes, we only look at the top apps.  We filter the data set to only include apps that have greater than 1,000,000 downloads.

```r
ggplot(data = fulldf[download_count_numeric > 1000000, ],
       mapping = aes(x = category, y = rating_count_numeric)) +
  geom_point(aes(size = download_count_numeric,   # bubble size mapping assumed
                 colour = rating_count_numeric)) +
  geom_text(aes(label = app_title), size = 2.5, vjust = -2)  # vjust moves text up
```

### Visualize category summary data

Before we do any visualizations, we need to create the summary data frame for categories.  Summary data frames are just tables which have rolled up aggregate information like average, sum, count etc.  They are similar to pivot tables in Excel.  We use the ddply function to easily create a table with summary stats for categories.

```r
###########  Visualize Category Stats ################

# Some numerical summaries
statCatTable <- ddply(fulldf, c("category"), summarise,
                      N    = length(app_title),
                      sumOfRatingsCompleted = sum(rating_count_numeric),
                      minRatingsCompleted = min(rating_count_numeric),
                      maxRatingsCompleted = max(rating_count_numeric),
                      avgRatingsCompleted = mean(rating_count_numeric),
                      sdRatings = sd(rating_count_numeric),
                      sumDownloads = sum(download_count_numeric)  # assumed; used below
)

# percent of downloads which give ratings (calculation assumed;
# percentRatingPerDownload is referenced by the bar charts below)
statCatTable$percentRatingPerDownload <-
  statCatTable$sumOfRatingsCompleted / statCatTable$sumDownloads * 100

statCatTable
attach(statCatTable)

# peek at the table
head(statCatTable)
```

#peek at the table

Create a bar chart displaying the percent of downloads which give ratings –  This is an important stat because it can show user engagement.  We employ the trusty ggplot function and the newly created variable percentRatingPerDownload that we added to our summary data frame above.

```r
# Bar chart (the ggplot base layer is assumed)
ggplot(statCatTable, aes(x = category, y = percentRatingPerDownload, fill = category)) +
  geom_bar(width = 1, stat = "identity")
```

Create a circular pie chart to show the same information in a different way – Use this chart with caution as it can be misleading at a quick glance.  In this example it could look like “Most Popular” has the highest value.  When you inspect further, it’s clear that “Most Entertaining” has the most complete circle and therefore the highest value.

```r
# circular - use with caution b/c often the middle is visually smallest
# (the ggplot base layer is assumed)
ggplot(statCatTable, aes(x = category, y = percentRatingPerDownload, fill = category)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar(theta = "y")
```

Create a radial pie chart as a final alternative – While the radial pie may not be as visually appealing as the circular pie, I think it presents a more obvious interpretation of the data.  We use the radial.pie function to make this chart.

Create a color coded dot plot to show the relative download volumes by category – Dot plots are a great alternative to bar charts.  They are even more powerful when you can group and color code them by category.  We use the dotchart function.

```r
# Color coded dot plot
# pick r colors - http://data.library.virginia.edu/setting-up-color-palettes-in-r/
x <- fulldf
x$color[fulldf$category == 'Winner'] <- "#1B9E77"
x$color[fulldf$category == 'Most Innovative'] <- "#D95F02"
x$color[fulldf$category == 'Best for Kids'] <- "#7570B3"
x$color[fulldf$category == 'Best Social'] <- "#E7298A"
x$color[fulldf$category == 'Daily Helper'] <- "#66A61E"
x$color[fulldf$category == 'Hidden Gem'] <- "#E6AB02"
x$color[fulldf$category == 'Most Entertaining'] <- "#A6761D"
x$color[fulldf$category == 'Most Popular'] <- "#666666"

# the dotchart call is assumed to match the description above
dotchart(x$download_count_numeric,
         labels = x$app_title,
         groups = factor(x$category),
         color = x$color,
         cex = 0.6,
         pch = 19)
```

### Visualize category and content summary data

Create summary data – Now we want to look at the combined category, content and binned download/ratings stats to see what types of users there are out there.  As previously, we are going to use the ddply function to create the summary.

```r
###########  Visualize Content + Category Stats ################

# grouping variables assumed from the alluvial charts below
catsum <- ddply(fulldf, c("category", "content_rating", "download_total_ranking"), summarise,
                N    = length(app_title),
                sumRatings = sum(rating_count_numeric)
)
attach(catsum)

# peek at the table
head(catsum)
```

Create a data flow chart to see what is driving a high number of downloads – We use the alluvial function on our summary table.

```r
alluvial(catsum[c(1:2, 4)], freq = catsum$N,
         cex = 0.6
)
```

Create another data flow chart but with a focus on teens – From above, we can see that teens are a huge contributor to high downloads of an app.  So, we pull the same chart and this time highlight the teens data flow.

```r
alluvial(catsum[c(1:2, 4)], freq = catsum$N,
         col = ifelse(catsum$content_rating == 'Teen', "blue", "grey"),
         border = ifelse(catsum$content_rating == 'Teen', "blue", "grey"),
         cex = 0.6
)
```