Introduction to ggplot2 — the grammar

There are several thousand languages in the world, and all of them are defined and explained by some set of rules. This set of rules is called a grammar, and with its help individual words (such as nouns and verbs) are combined into correct and meaningful sentences. A similar methodology can be applied to creating graphics. If we imagine that each graph is made up of basic parts, or components, we can define a basic set of rules for defining and combining those components into correct and meaningful visualizations (graphs). This set of rules is called the grammar of graphics, and in this article we’ll explain the methodology and syntax of one of the most famous graphics packages in R: ggplot2.

Graph as a composition of its individual components

The main idea behind the grammar of graphics is that every plot can be built from the same few components. Those components are:

  • data
  • aesthetics
  • geometrical shapes
  • statistical transformations
  • scaling
  • faceting
  • themes

Each component has its own set of rules and specific syntax (the so-called grammar of components), and together they form a single entity called a graph.

In the following sections we will introduce the rules on two levels:

  • the rules and ggplot2 syntax at the component level
  • the rules and ggplot2 syntax at the graph level

Set of rules on component level

The first step in mastering the grammar is understanding the individual components and the rules needed to define and control them. Below is a short explanation of each component.

Data

Figure: A data frame object as the input data source

Description: This component defines the input data set that will be used in the visualization.

Syntax: the data set name is passed to the ggplot() function. ggplot() initializes a new graph, and the data set is normally its first argument:

    # graph initialization and data source
    ggplot(data_frame_name)

Set of rules: ggplot2 expects the data to be stored in a tidy data frame, the most common R object for tabular data. You can think of a data frame as a table with variables in the columns and observations in the rows. Other objects, such as matrices or lists, should be converted to a data frame before they are passed to ggplot2.

Aesthetics

Figure: Mapping x and y to the age and amount variables; a third variable, gender, is shown through color

Description: This component represents the mapping between variables and visual properties such as axes, size, shape and color. What will the axes of my plot represent? Besides the variables on the axes, do we want to see any additional information? Through this component two variables are mapped to the horizontal and vertical axes. Additional information (variables) can be added through color, shape and size.

Syntax: aesthetics are defined inside the aes() function.

Set of rules: aes() can be defined inside ggplot() or inside other components such as geometrical shapes and statistics. If aes() is defined inside ggplot(), its definition is shared by all components (for example, the x and y axes will be the same for all geometrical shapes on the graph). Otherwise, its definition applies only to the specific component.

    # aesthetics common to all components (points and text);
    # geom_text() additionally needs a label aesthetic, here col1
    ggplot(df1, aes(x = col1, y = col2)) + geom_point() + geom_text(aes(label = col1))
    # aesthetics specified only for the points
    ggplot(df1) + geom_point(aes(x = col1, y = col2))

Geometrical shapes

Description: Plot type definition. More precisely, this component defines how our observations (points) will be displayed on the graph. There are many different types (bar chart, histogram, scatter plot, box plot, ...) and each type is designed for a specific number and type of variables.

Syntax: function names start with geom_. Commonly used shapes include geom_point() (scatter plot), geom_line() (line plot), geom_bar() (bar chart), geom_histogram() (histogram), geom_boxplot() (box plot) and geom_text() (text labels).

Set of rules:

  • Each geom shape can be defined with its own data set and aesthetics that are valid only for that shape. In that case the data frame and aes() are defined inside the geom_*() function.
  • Each shape has its own specific aesthetics and arguments. For example, hjust and vjust are typical for text shapes such as geom_text(), and linetype applies to shapes that draw lines.
  • Geometrical shapes can be combined, which means that each graph can have one or more geom shapes. For example, it is sometimes useful to show a bar plot and a line plot, or a scatter plot and a line plot, on one graph.

    # two geom shapes, geom_line and geom_point, used on one graph
    ggplot(GOT, aes(x = Episode, y = Number_of_viewers, colour = Season, group = Season)) +
      geom_line() +
      geom_point()

Statistical transformations

Figure: A famous statistical transformation, smoothing

Description: This component is used to transform the data (summarize it in some manner) before visualization. Many of these transformations are used behind the scenes when geometrical shapes are created. Often we don’t define them directly; ggplot2 does that for us.

Syntax: the syntax depends on the transformation used. Frequently used statistics include stat_smooth() (smoothing), stat_summary() (summaries such as the mean), stat_count() (counts for bar charts) and stat_bin() (binning for histograms).

Set of rules:

  • Statistical components create additional variables (usually aggregate values or similar). To visualize those data we need to use some geom_*() function. Otherwise the newly created variables will not be visible on the screen.
  • There are two ways to use statistical functions. The first is to use a stat_*() function and define the geom shape as an argument inside that function (the geom argument). The second is to use a geom_*() function and define the statistical transformation as an argument inside that function (the stat argument). Here is an example:

    # define the stat_*() function and pass the geom as an argument
    # (fun replaces fun.y, which is deprecated since ggplot2 3.3.0)
    ggplot(input_data, aes(col1, col2)) + geom_point() + stat_summary(geom = "point", fun = "mean", colour = "red")
    # define the geom_*() function and pass the stat as an argument
    ggplot(input_data, aes(col1, col2)) + geom_point() + geom_point(stat = "summary", fun = "mean", colour = "red")

Scaling

Figure: Controlling the colors with scaling

Description: With aesthetics we define what we want to see on the graph, and with scaling we define how we want to see those aesthetics. We can control colors, sizes, shapes and positions. Scales also provide the tools that let us read the plot: the axes and legends (we can customize axis titles, labels and their positions). ggplot2 automatically creates a default scale for each aesthetic that we define. However, if we want to customize the scales, we can modify each scale component ourselves.

Syntax: scale function names start with scale_, followed by the aesthetic name and then the scale type, for example scale_x_continuous() or scale_colour_manual().

Figure: Basic scaling syntax

Scales exist for the different types of aesthetics: for example, scale_x_continuous() and scale_y_discrete() for positions, scale_colour_manual() for colors, scale_size_continuous() for sizes and scale_shape_manual() for shapes.

Set of rules:

  • There are no specific rules; only the appropriate function name needs to be chosen. The scaling syntax is a little more complex because for each scale we need to know the aesthetic name (x, y, color, size, shape), the type of the variable (continuous, discrete) and the arguments specific to each scale function. Keep in mind that you will use these functions only when you are not satisfied with the default scaling created by ggplot2.

Faceting

Figure: Sub-plotting a histogram

Description: With faceting we divide the data into subsets by some discrete variable and display the same type of graph for each subset.

Syntax: the facet_wrap() or facet_grid() function is used for displaying subsets of data.

    # faceting: sub-plotting by the col3 variable
    ggplot(data_set, aes(col1, col2)) +
      geom_point() +
      facet_wrap(~col3)

Set of rules:

  • Faceting can be combined with any geom shape; there is no restriction at all. The main idea of faceting is that once you have made a graph, you can easily split the data (by some criterion) and display the resulting sub-graphs side by side.

Themes

Figure: Changing the background color of the plot

Description: With themes it is possible to control the non-data elements of the graph. With this component we don’t change the type of graph, the scaling definition or the aesthetics used. Instead, we change things like fonts, ticks, panel strips and background colors.

Syntax: There are several predefined themes, for example theme_grey() (the default), theme_bw(), theme_minimal(), theme_classic() and theme_void().

Each of these themes changes all theme elements to values that are designed to work together harmoniously (the complete theme is changed, not just individual elements). However, if we want to change individual elements (for example just the background color or just the font of the title) we can use the theme() function and specify the exact element we want to change.

Figure: Theme and element function

Set of rules: Each theme element (controlled via theme() arguments) is associated with an element function that describes the visual properties of that element. For example, to set the background color you define the fill argument inside element_rect(). To change the appearance of axis labels you define their text properties inside element_text(). Most arguments of theme() are defined with the help of one of these element_*() functions.

There are four basic element functions, element_text(), element_line(), element_rect() and element_blank(), and each is used in combination with specific theme arguments.

Here is an example of how we combine arguments with element functions:

    ggplot(data_set_name, aes(col1, col2)) +
      geom_point() +
      theme_bw() +
      # panel.background is used with element_rect()
      theme(panel.background = element_rect(fill = "white", colour = "grey"))

Usually you’ll use predefined themes, but it is useful to know that you can change each individual element with the theme() function.

With that, we have explained the basic rules for each component of the graph. The next question is: “How are these components combined into one single entity called a graph?”

Set of rules on graph level

Figure: Combining the components

After defining each component separately, we need to combine them into a proper and meaningful composition called a graph.

Basic set of rules for combining:

  • Each new graph is initialized with the ggplot() function.
  • ggplot() is used to declare the input data frame and to specify the set of plot aesthetics intended to be common to all geometrical shapes used on the graph.
  • Any component used in building the graph is added with the ‘+’ sign.
  • Each component has its own function name and arguments that relate only to that component.
  • We can combine different components; we are not limited to certain combinations.
  • Each component uses the input data frame and aesthetics defined inside ggplot() unless stated otherwise.
  • Aesthetics can be defined inside ggplot() or inside any geometrical shape. If defined inside ggplot() they are common to all shapes. Otherwise, they apply to one specific component or shape.
  • Each component has its own arguments, rules and syntax. Some arguments are meaningful only for particular components. For example, hjust and vjust are typical for geom_text() and are not used with most other shapes.
  • A stat_*() component needs to be combined with a geom_*() component. The reason is that statistical transformations only create new variables. For them to be visible on the screen we must define the corresponding geom_*() type that will visualize the new data.

Pseudo code is presented below:

    ggplot(data_frame_name, aes()) +
    component_for_geom1_*() +
    component_for_geom2_*() +
    # optional components
    component_for_scaling_*() +
    component_for_faceting_*() +
    component_for_themes_*() + ...

Finally, here is one real example:

Figure: ggplot2, sub-plotting bar charts

The result is a graph that looks like this:

Figure: ggplot2, faceting bar charts

Summary

In this article we showed how ggplot2 relies on the grammar of graphics. It may seem complex at the beginning because there are a lot of rules and topics to master. First you need to understand each component separately: its meaning, syntax and rules. After that, you need to learn how to properly combine those components into a single entity called a graph. There is a lot of theory behind the scenes, but once you have mastered it you can control and modify anything you like on your plot, so that nothing is left to chance. After mastering the grammar, the distance from mind to “paper” becomes really short: almost every idea you have can be accurately transposed to the screen.

Best practices of orchestrating Python and R code in ML projects

Today, data scientists are generally divided between two languages: some prefer R, some prefer Python. I will not try to explain in this article which one is better. Instead, I will try to answer a question: what is the best way to integrate both languages in one data science project, and what are the best practices? Besides git and shell scripting, additional tools have been developed to facilitate the development of predictive models in multi-language environments. For fast data exchange between R and Python, let’s use the binary data file format Feather. Another language-agnostic tool, DVC, can make the research reproducible, so let’s use DVC to orchestrate R and Python code instead of regular shell scripts.

Machine learning with R and Python

Both R and Python have powerful libraries and packages for predictive modeling. Algorithms used for classification or regression are usually implemented in both languages, and some data scientists use R while others prefer Python. In the previous tutorial we used a target variable with binary output, and logistic regression was used as the training algorithm. Another algorithm that could be used for this prediction is the popular Random Forest, which is implemented in both programming languages. For performance reasons it was decided to implement the Random Forest classifier in Python (it showed better performance than the random forest package in R).

R example used for DVC demo

We will add some Python code and explain how Feather and DVC can simplify the development process in this combined environment.

Let’s briefly recall the R code from the previous tutorial.

The input data are Stack Overflow posts in an XML file. Predictive variables are created from the post texts: the relative importance (tf-idf) of words among all available posts is calculated. The target is predicted from the tf-idf matrices with lasso logistic regression for binary output. AUC is calculated on the test set and used as the evaluation metric.
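
To make the tf-idf weighting concrete, here is a minimal Python sketch under stated assumptions: the three toy posts, the whitespace tokenizer and the plain tf * log(N / df) formula are illustrative choices, not the code or the exact weighting used in the featurization job.

```python
import math
from collections import Counter

# Toy corpus (hypothetical posts, not taken from the Stack Overflow data set)
posts = [
    "how to merge data frames in r",
    "how to plot data in python",
    "python list comprehension",
]

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(posts)
for post, w in zip(posts, weights):
    # show the highest-weighted word of each post
    print(post, "->", max(w, key=w.get))
```

Words that occur in few posts (such as "merge") receive a higher weight than words shared by several posts (such as "how"), which is what makes tf-idf useful as a predictive feature.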

Instead of using logistic regression in R, we will write Python jobs that use a random forest as the training model. train_model.R and evaluate.R will be replaced with corresponding Python jobs.

The R and Python code is available on GitHub. Let’s download the necessary jobs by cloning the repository:

mkdir R_DVC_GITHUB_CODE
cd R_DVC_GITHUB_CODE
git clone https://github.com/Zoldin/R_AND_DVC

The dependency graph of this data science project looks like this:

Figure: R (marked red) and Python (marked pink) jobs in one project

Now let’s see how the process flow can be sped up and simplified with the Feather API and the reproducibility provided by data version control.

Feather API

The Feather API is designed to improve the interchange of data and metadata between R and Python. It provides fast import and export of data frames in both environments and preserves metadata, which is an improvement over data exchange via csv/txt files. In our example a Python job will read a binary input file that was produced in R with the Feather API.
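
The metadata problem that Feather addresses can be illustrated without Feather itself. The following Python sketch uses only the standard library and a made-up two-row table (the column names and values are assumptions for illustration): after a round trip through CSV every value comes back as a string, so the reader has to re-declare the column types, whereas a binary format such as Feather stores them together with the data.

```python
import csv
import io

# Hypothetical rows with an integer, a float and a boolean column
rows = [
    {"post_id": 1, "score": 0.5, "is_r_question": True},
    {"post_id": 2, "score": 1.25, "is_r_question": False},
]

# Write the rows to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read them back: CSV carries no type information
buffer.seek(0)
restored = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
restored_types = {k: type(v).__name__ for k, v in restored[0].items()}
print(original_types)
print(restored_types)
```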

Let’s install the Feather library in both environments.

For Python 3 in a Linux environment you can use the command line and pip3:

    sudo pip3 install feather-format

For R it is necessary to install the feather package:

    install.packages("feather")

After successful installation we can use Feather for data exchange.

Below is the R syntax for exporting data frames with Feather (featurization.R):

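    library(feather)

    # assumption: the script reads its command-line arguments like this
    args <- commandArgs(trailingOnly = TRUE)

    # export the train and test tf-idf data frames as Feather files
    write_feather(dtm_train_tfidf, args[3])
    write_feather(dtm_test_tfidf, args[4])
    print("Two data frame were created with Feather - one for train and one for test data set")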

Python syntax for reading the Feather binary input file (train_model_python.py):

    import sys

    import feather as ft

    # path of the Feather file produced by the R job
    input_path = sys.argv[1]
    df = ft.read_dataframe(input_path)

Dependency graph with R and Python combined

The next question is: why do we need DVC, and why not just use shell scripting? DVC automatically derives the dependencies between the steps and builds the dependency graph (DAG) transparently to the user. The graph is used to reproduce only those parts of the pipeline that were affected by recent changes, so we don’t have to keep track of which steps need to be repeated after each change.
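
The idea behind the dependency graph can be sketched in a few lines of Python. This is a simplified illustration, not DVC’s implementation: the `steps` mapping and the `stale_outputs` helper are hypothetical names, and the file names are the ones used in the pipeline of this article. Given one changed file, the helper walks the graph and returns every output that has to be rebuilt, in a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# output file -> files it is built from (hypothetical model of the pipeline)
steps = {
    "data/Posts.csv": ["data/Posts.xml"],
    "data/train_post.csv": ["data/Posts.csv"],
    "data/test_post.csv": ["data/Posts.csv"],
    "data/matrix_train.feather": ["data/train_post.csv", "data/test_post.csv"],
    "data/matrix_test.feather": ["data/train_post.csv", "data/test_post.csv"],
    "data/model.p": ["data/matrix_train.feather"],
    "data/evaluation_python.txt": ["data/model.p", "data/matrix_test.feather"],
}

def stale_outputs(changed, graph):
    """Return the outputs that depend, directly or indirectly, on `changed`,
    ordered so that every output comes after its own inputs."""
    stale = set()
    frontier = {changed}
    while frontier:
        # outputs not yet marked that use at least one file from the frontier
        newly_stale = {out for out, deps in graph.items()
                       if out not in stale and frontier.intersection(deps)}
        stale |= newly_stale
        frontier = newly_stale
    order = TopologicalSorter(graph).static_order()
    return [node for node in order if node in stale]

for output in stale_outputs("data/train_post.csv", steps):
    print("re-run step that produces", output)
```

Changing only the training script would correspond to marking data/model.p as stale, in which case the evaluation file is the only downstream output that needs to be rebuilt.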

First, with the ‘dvc run’ command we execute all jobs related to our model development. In this phase DVC records the dependencies that will be used in the reproducibility phase:

$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/
$ dvc run tar zxf data/Posts.xml.tgz -C data/
$ dvc run Rscript code/parsingxml.R data/Posts.xml data/Posts.csv
$ dvc run Rscript code/train_test_spliting.R data/Posts.csv 0.33 20170426 data/train_post.csv data/test_post.csv
$ dvc run Rscript code/featurization.R data/train_post.csv data/test_post.csv data/matrix_train.feather data/matrix_test.feather
$ dvc run python3 code/train_model_python.py data/matrix_train.feather 20170426 data/model.p
$ dvc run python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.feather data/evaluation_python.txt

After these commands the jobs have been executed and included in the DAG. The result (the AUC metric) is written to the evaluation_python.txt file:

$ cat data/evaluation_python.txt
AUC: 0.741432

It is possible to improve this result with the random forest algorithm.

We can increase the number of trees in the random forest classifier from 100 to 500:

clf = RandomForestClassifier(n_estimators=500, n_jobs=2, random_state=seed)
clf.fit(x, labels)

After the changes in train_model_python.py are committed, the dvc repro command re-executes all jobs needed to reproduce evaluation_python.txt. We don’t need to worry about which jobs to run or in which order.

$ git add .
$ git commit
[master a65f346] Random forest clasiffier — more trees added
1 file changed, 1 insertion(+), 1 deletion(-)
$ dvc repro data/evaluation_python.txt
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p

Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.

Besides code versioning, DVC also takes care of data versioning. For example, if we change the data sets train_post.csv and test_post.csv (by using a different splitting ratio), DVC will know that the data sets have changed, and dvc repro will re-execute all jobs needed for evaluation_python.txt.

$ dvc run Rscript code/train_test_spliting.R data/Posts.csv 0.15 20170426 data/train_post.csv data/test_post.csv

The jobs that are re-executed appear in the output of dvc repro:

$ dvc repro data/evaluation_python.txt
Reproducing run command for data item data/matrix_train.txt. Args: Rscript --vanilla code/featurization.R data/train_post.csv data/test_post.csv data/matrix_train.txt data/matrix_test.txt
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.
$ cat data/evaluation_python.txt
AUC: 0.793145

The new AUC is 0.793145, which is an improvement over the previous iteration.
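
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The Python sketch below computes it that way on made-up labels and scores (assumptions for illustration only; the values are unrelated to the results of this pipeline). In practice a library routine would normally be used.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels (1 = positive class) and classifier scores
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
print(auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a move from roughly 0.74 to 0.79 means the model orders positive and negative posts correctly more often.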

Summary

Data science projects often combine R and Python programming. Besides git and shell scripting, additional tools have been developed to facilitate the development of predictive models in multi-language environments. Using a data version control system for reproducibility and Feather for data interoperability helps you orchestrate R and Python code in a single environment.

