Select Page

## Time based heatmaps in R

Time based heatmaps in R

### Tutorial Scenario

In this tutorial, we are going to be looking at heatmaps of Seattle 911 calls by various time periods and by type of incident.  This awesome dataset is available as part of the data.gov open data project.

Steps

The code below walks through 6 main steps:

4. Create summary table
5. Create heatmap
6. Celebrate

Code

#################### Import and Install Packages ####################

install.packages(“plyr”)

install.packages(“lubridate”)

install.packages(“ggplot2”)

install.packages(“dplyr”)

library(plyr)

library(lubridate)

library(ggplot2)

library(dplyr)

#################### Set Variables and Import Data ####################

col1 = “#d8e1cf”

col2 = “#438484”

attach(incidents)

str(incidents)

#################### Transform ####################

#Convert dates using lubridate

incidents\$ymd <-mdy_hms(Event.Clearance.Date)

incidents\$month <- month(incidents\$ymd, label = TRUE)

incidents\$year <- year(incidents\$ymd)

incidents\$wday <- wday(incidents\$ymd, label = TRUE)

incidents\$hour <- hour(incidents\$ymd)

attach(incidents)

#################### Heatmap Incidents Per Hour ####################

#create summary table for heatmap – Day/Hour Specific

dayHour <- ddply(incidents, c( “hour”, “wday”), summarise,

N = length(ymd)

)

dayHour\$wday <- factor(dayHour\$wday, levels=rev(levels(dayHour\$wday)))

attach(dayHour)

#overall summary

ggplot(dayHour, aes(hour, wday)) + geom_tile(aes(fill = N),colour = “white”, na.rm = TRUE) +

scale_fill_gradient(low = col1, high = col2) +

guides(fill=guide_legend(title=”Total Incidents”)) +

theme_bw() + theme_minimal() +

labs(title = “Histogram of Seattle Incidents by Day of Week and Hour”,

x = “Incidents Per Hour”, y = “Day of Week”) +

theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#################### Heatmap Incidents Year and Month ####################

#create summary table for heatmap – Month/Year Specific

yearMonth <- ddply(incidents, c( “year”, “month” ), summarise,

N = length(ymd)

)

yearMonth\$month <- factor(summaryGroup\$month, levels=rev(levels(summaryGroup\$month)))

attach(yearMonth)

#overall summary

ggplot(yearMonth, aes(year, month)) + geom_tile(aes(fill = N),colour = “white”) +

scale_fill_gradient(low = col1, high = col2) +

guides(fill=guide_legend(title=”Total Incidents”)) +

labs(title = “Histogram of Seattle Incidents by Year and Month”,

x = “Month”, y = “Year”) +

theme_bw() + theme_minimal() +

theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#################### Heatmap Incidents Per Hour by Incident Group ####################

#create summary table for heatmap – Group Specific

groupSummary <- ddply(incidents, c( “Event.Clearance.Group”, “hour”), summarise,

N = length(ymd)

)

#overall summary

ggplot(groupSummary, aes( hour,Event.Clearance.Group)) + geom_tile(aes(fill = N),colour = “white”) +

scale_fill_gradient(low = col1, high = col2) +

guides(fill=guide_legend(title=”Total Incidents”)) +

labs(title = “Histogram of Seattle Incidents by Event and Hour”,

x = “Hour”, y = “Event”) +

theme_bw() + theme_minimal() +

theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Please see here for the full tutorial and steps

## ETL vs ELT: Considering the Advancement of Data Warehouses

ETL vs ELT: Considering the Advancement of Data Warehouses

ETL stands for Extract, Transform, Load. It has been a traditional way to manage analytics pipelines for decades. With the advent of modern cloud-based data warehouses, such as BigQuery or Redshift, the traditional concept of ETL is changing towards ELT – when you’re running transformations right in the data warehouse. Let’s see why it’s happening, what it means to have ETL vs ELT, and what we can expect in the future.

# ETL is hard and outdated

ETL arose to solve a problem of providing businesses with clean and ready-to-analyze data. We remove dirty and irrelevant data and transform, enrich, and reshape the rest. The example of this could be sessionization – the process of creating sessions out of raw pageviews and users’ events.

ETL is complicated, especially the transformation part. It requires at least several months for a small-sized (less than 500 employees) company to get up and running. Once you have the initial transform jobs implemented, never-ending changes and updates will begin, because data always evolves with business.

The other problem of ETL is that during the transformation we reshape data into some specific form. This form usually lacks some data’s resolution and does not include data which is useless for that time or for that particular task. Often, “useless” data becomes “useful.” For example, if business users request daily data instead of weekly, then you will have to fix your transformation process, reshape data, and reload it. That would take a few weeks more.

The more data you have, the longer the transformation process is.

The transformation rules are very complex, and even if you have only a few terabytes of data, loading time can take hours. Given the time it takes to transform and load a big dataset, your end users will almost never get fresh data. “Today” will mean today last week, and “yesterday” – a week ago yesterday. Sometimes it takes several weeks or months to get a new update to rollup.

To summarize some of the cons of ETL:

• ETL is expensive to implement, especially for small and medium businesses.
• ETL is expensive to maintain.
• ETL is time consuming. Users have to wait for transformation to be finished.

# Why have we been doing it for decades?

A prior generation of data warehouses weren’t able to work with the size and complexity of raw data. So, we had to transform data before loading and querying it.

The latest advances in database technologies made warehouses much faster and cheaper. Storage is getting more accessible all the time, some data warehouses even separate pricing for storage and computation. For example, BigQuery storage pricing is quite cheap and you can just store all your raw data there.

The most important changes in recent years to data warehouses, which made the comparison of ETL vs ELT possible:

• Optimized for analytical operations. Modern analytical warehouses tend to be columnar and optimized for aggregating and processing huge datasets.
• Cheap storage. No worries about what to store, you can just dump all your raw data into a warehouse.
• Cloud based. It scales infinitely and on-demand, so you can get the performance you need the moment you need it.

# ETL vs ELT: running transformations in a data warehouse

What exactly happens when we switch “L” and “T”? With new, fast data warehouses some of the transformation can be done at query time. But there are still a lot of cases where it would take quite a long time to perform huge calculations. So instead of doing these transformations at query time you can perform them in the warehouse, but in the background, after loading data.

Once raw data is loaded into a warehouse, heavy transformations can be performed. It makes sense to have both real-time and background transformations in the BI platform. Users consume and operate on the business definitions level when querying data, and BI is either performing transformation on-the-fly or querying data already transformed in the background.

This approach gives flexibility and agility for development of a transformation layer.

Software engineers nowadays deploy several times a day and praise continuous delivery. The same principle should be adopted for how we approach a transformation. If metric definition changes or some new data is required, one can easily make this changes in hours, not weeks or months. It is especially valuable for fast-growing startups, where changes happen daily and data teams have to be flexible and agile to keep up with product development and business needs.

As data warehouses advance more and more, I’m sure we will see how query time transformations will entirely replace background transformations. Before that happens, we can run some transformations in the background with ELT. Since they are already SQL based and run in the data warehouses, the final switch would be easy and painless.

## How IOT is changing our world?

How IOT is changing our world?

The Internet of Things connected all of kinds intelligent devices, such as mobile devices, sensors, machines or vehicles with each other and with the cloud computing. Analysis of the IoT data offers many opportunities for companies, as they can make faster decisions, optimize business processes or develop new applications and even restructure business models. This has the enormous potential of Internet of Things that is virtually present in every industry, including energy, retail, healthcare, financial services, transportation and manufacturing.

The Internet of Things has changed the dimensions of traditional Business IT. To tap the potential need for a highly scalable and reliable IT Infrastructure, they should be on standardized components and open protocols and include the three layer Devices, Controllers and Data Center or the Cloud solutions.

The growth is a positive sign for all industry, but however, we cannot ignore the sheer size and the public nature of the Internet of Things, which will bring great challenges. Network and system architects need to optimize the IT infrastructure to meet the higher requirements of the IoT in terms of scalability, reliability and security. IoT based applications and automated business processes make high demands on the availability of the system. Many intelligent systems are used for mission critical applications where system downtime leads to decreased productivity.

Intelligent IT solutions such as Red Hat technologies are based on things to meet the requirements of IoT based systems for scalability, reliability and security. The solutions are based on a hierarchical model with device layer (Edge Nodes), control layer (controller gateway) and Data Center Tier or Cloud layer. Here come standardized protocols and components for use.

The device layer involving a variety of intelligent terminals that includes mobile devices, wearable gadgets, sensors, control and regulation devices, autonomous machines and appliances, etc… The communication between the devices and the control or checkpoints is based on standard network protocols, either wired or wireless. The forwarding of the raw data and the exchange of control information is based on open messaging standards.

## Decision-driven before data-driven : A roadmap for data-driven organizations

Decision-driven before data-driven : A roadmap for data-driven organizations

A MANAGER MUST MAKE DECISIONS

Some thirty years back, when I started my career as a management-trainee with an Aerospace company; the 18 months long training program included several courses in management.

I still remember the very first class-room session as if it was yesterday. A distinguished-looking retired professor from IIM Calcutta was introduced by the principal of the Staff-college, and as he addressed the class he declared “I am a Jew! …hope none of you have a problem?” (This used to be the time when the Indian Government very strongly identified with the Palestinian cause).

On assuring himself that we have no problems whatsoever, he proceeded to ask the next question. “Who is a good manager?” … There were a few of us bold enough to try a response. The professor faithfully wrote down everything we said on the board… A wide variety of juvenile definitions ranging from “someone who always gets work done” to “someone who gets work done more efficiently.” … all to be expected from a set of green-behind-the-ears freshers.

“How many of you have heard of Peter Drucker?” …

Fortunately, a good many of us did. (I came to know much later that Peter Drucker was also of Jewish descent. His ancestors were Jewish, but his parents had converted to Lutheranism)

“Well, glad you do… Peter Drucker defines a good manager as someone who takes three good decisions out of ten”.

Not surprisingly, we were all completely lost… It made no sense… Just 30%? …One fails to clear one’s exam at 30%…

The Professor then went on to relate how he asked the very same question when he met Peter Drucker in a seminar. Apparently, Peter Drucker told him – “A manager must take decisions, even if only three out of ten turn-out to be good decisions”.

As they say, that is one premise that I have kept unchallenged ever since. A manager must take decisions. Must take decisions in-time, and as many good decisions as he could.

Overtime, I joined a business school to do my MBA, but I guess Peter Drucker was passé by then, I don’t remember any professor specifically talking about the importance of taking right decisions in right time. I have been through eight different companies in multiple roles ever-since, set-up and scaled businesses, consulted for several companies across industry sectors for over two decades, set-up multiple ODCs and shared-services for customers….

But I doubt if I ever paused to think if I was taking right decisions in right time? Ever?

DECISION-DRIVEN BEFORE DATA-DRIVEN

However, all that changed when I was trying to create a business-case for setting up an Analytics Lab in my last assignment – a kind of internal Centre-of-Excellence for Analytics. After a gruelling tale of trying to take advise and help from different consulting firms and searching through all the published sources, we discovered there was little to nothing as a process for creating an organization-wide roadmap for analytics. Given the near-complete absence of any published sources (three notable exceptions including one from Bain consulting will be explained in the next article), we were forced to come-out with a process of our own.

The most disturbing discovery was – that an organization necessarily needs to be Decision-driven before it is Data-driven.

We have noticed the decision-making process across organizations (not just ours) was informal, and more often-than-not without an audit-trail. No Manager seems to list the decisions that he takes – let alone record the reasons. The only notable exceptions are related to the capital-expenditure, where most organizations have a set process and an audit-trail mandated for decisions beyond a certain threshold value.

However, there are hundreds of operational decisions that are repetitive in nature, usually made by an every-day manager – that cumulatively account for a much-higher value than all capital-expenditure related decisions taken in the company. For example: the quantum of buffer-stock by each SKU may be individually a very small decision, but by annual-cumulative value across SKU’s, it may be the-most-important decision that significantly influences the profitability and performance of the company.

A Decision-driven organization is expected to be process-driven with documented GSOPs (Global Standard Operating Procedures) for each of the decisions recognized as important – Specifically for those 10% of the decisions which influence 90% of the business outcomes.

In our experience, we are yet to come across any organization which meticulously identifies critical decisions, and creates GSOPs for each one of them.

Even more surprisingly, a good many of them have a budget for big-data and Analytics, and all of them aspire to be data-driven….

——————————————————————————————————————————————————

## Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

## The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

• The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
• The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
• The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.

The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 `import` `numpy as np` `import` `sys`   `### Interface` `class` `Environment(``object``):`   `def` `reset(``self``):` `raise` `NotImplementedError(``'Inheriting classes must override reset.'``)`   `def` `actions(``self``):` `raise` `NotImplementedError(``'Inheriting classes must override actions.'``)`   `def` `step(``self``):` `raise` `NotImplementedError(``'Inheriting classes must override step'``)`   `class` `ActionSpace(``object``):`   `def` `__init__(``self``, actions):` `self``.actions` `=` `actions` `self``.n` `=` `len``(actions)`   `### BanditEnv Environment`   `class` `BanditEnv(Environment):`   `def` `__init__(``self``, num_actions` `=` `10``, distribution` `=` `"bernoulli"``, evaluation_seed``=``"387"``):` `super``(BanditEnv,` `self``).__init__()`   `self``.action_space` `=` `ActionSpace(``range``(num_actions))` `self``.distribution` `=` `distribution`   `np.random.seed(evaluation_seed)`   `self``.reward_parameters` `=` `None` `if` `distribution` `=``=` `"bernoulli"``:` `self``.reward_parameters` `=` `np.random.rand(num_actions)` `elif` `distribution` `=``=` `"normal"``:` `self``.reward_parameters` `=` `(np.random.randn(num_actions),` `np.random.rand(num_actions))` `elif` `distribution` `=``=` `"heavy-tail"``:` `self``.reward_parameters` `=` `np.random.rand(num_actions)` `else``:` `print``(``"Please use a supported reward distribution"``)` `#, flush = True)` `sys.exit(``0``)`   `if` `distribution !``=` `"normal"``:` `self``.optimal_arm` `=` `np.argmax(``self``.reward_parameters)` `else``:` `self``.optimal_arm` `=` `np.argmax(``self``.reward_parameters[``0``])`   `def` `reset(``self``):` `self``.is_reset` `=` `True` `return` `None`   `def` `compute_gap(``self``, action):` `if` `self``.distribution !``=` `"normal"``:` `gap` `=` `np.absolute(``self``.reward_parameters[``self``.optimal_arm]` `-` `self``.reward_parameters[action])` `else``:` `gap` `=` `np.absolute(``self``.reward_parameters[``0``][``self``.optimal_arm]` `-` `self``.reward_parameters[``0``][action])` `return` `gap`   `def` `step(``self``, action):` `self``.is_reset` `=` `False`   `valid_action` `=` `True` `if` `(action` `is` `None` `or` `action` `=` `self``.action_space.n):` `print``(``"Algorithm chose an invalid action; reset reward to -inf"``)``#, flush = True)` `reward` `=` `float``(``"-inf"``)` `gap` `=` `float``(``"inf"``)` `valid_action` `=` `False`   `if` `self``.distribution` `=``=` `"bernoulli"``:` `if` `valid_action:` `reward` `=` `np.random.binomial(``1``,` `self``.reward_parameters[action])` `gap` `=` `self``.reward_parameters[``self``.optimal_arm]` `-` `self``.reward_parameters[action]` `elif` `self``.distribution` `=``=` `"normal"``:` `if` `valid_action:` `reward` `=` `self``.reward_parameters[``0``][action]` `+` `self``.reward_parameters[``1``][action]` `*` `np.random.randn()` `gap` `=` `self``.reward_parameters[``0``][``self``.optimal_arm]` `-` `self``.reward_parameters[``0``][action]` `elif` `self``.distribution` `=``=` `"heavy-tail"``:` `if` `valid_action:` `reward` `=` `self``.reward_parameters[action]` `+` `np.random.standard_cauchy()` `gap` `=` `self``.reward_parameters[``self``.optimal_arm]` `-` `self``.reward_parameters[action]` `#HACK to compute expected gap` `else``:` `print``(``"Please use a supported reward distribution"``)``#, flush = True)` `sys.exit(``0``)`   `return``(``None``, reward,` `self``.is_reset, '')`   `#Policy interface` `class` `Policy:` `#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]` `def` `__init__(``self``, num_actions):` `self``.num_actions` `=` `num_actions`   `def` `act(``self``):` `pass`   `def` `feedback(``self``, action, reward):` `pass`

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.

• The next figure shows the reward distributions for a 5-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.

## Round-Robin

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets.
3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).

ε = 0

ε = 0.05

ε = 0.1

ε = 0.15

ε = 1.0

## Optimistic Greedy

R = 0

R = 1

R = 3

R = 5

R = 10000

The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.

>

## American Community Survey data in R

American Community Survey data in R

I met up with an old grad school friend a few weeks back. He’s an accomplished professor of demography, basically the statistical study of populations, particularly surrounding births, deaths, and fertility. Where he once traveled extensively to conduct demographic surveys in remote parts of the world, my friend’s now content to work with available U.S. census data to develop models and test hypotheses. Like me, he’s a stats/techie guy accomplished in statistical packages such as SPSS, SAS, and Stata. He notes he’s played around with R but was warned early on it couldn’t scale for his demanding needs. We then morphed to an engaging discussion about open source and the evolution of statistical software. My sense was that he was wary of the R nay-Sayers, but found his statistical comfort zone hard to abandon.

Alas, though perhaps not appreciated in many statistical settings, the R of 2018 is quite a step up over the R of 2008, driven significantly by contributions such as tidyverse and data.table from its extensive open source ecosystem. Indeed, the R code I write today would be unrecognizable to a core R stickler of even 5 years ago. And as for R’s very real RAM limitation, the 64 GB RAM, 2 TB SSD computer I purchased 2 years ago for \$2500 has handled 95% of the challenges I’ve thrown at it with aplomb. The belligerent 5% I take to the Cloud and Spark.

As our discussion progressed, we opted to “challenge” R with the type of work he now does in his research. We settled on the 5 year (2012-2016) ACS Public Use Microdata Sample files PUMS as our data source. The PUMS “are a sample of the actual responses to the American Community Survey and include most population and housing characteristics. These files provide users with the flexibility to prepare customized tabulations and can be used for detailed research and analysis. Files have been edited to protect the confidentiality of all individuals and of all individual households. The smallest geographic unit that is identified within the PUMS is the Public Use Microdata Area (PUMA).”. Tellingly, there’re freely-available R packages demography and censusapi that make life a lot easier for R demographers.

The data we used consists of a series of household and population files. Each household has one or more persons (population), while each person is for one and only one household. It turns out that there are about 7.4M households consisting of 15.7M persons represented in this sample. A common key, “serialno”, connects the data.tables. Our task was to load both household and population data, with the mandate of joining the two to build topic-oriented data sets on demand. The raw household and population data are downloaded from zip files consisting of 4 CVS’s each.

The development environment consists of Jupyter Notebook with a Microsoft R kernel. R’s tidyverse, data.table, and fst packages drive the data build. What follows is the implementation code.