Want to become a Data Scientist? Try Feynman Technique

Many blogs and articles have been written on how to become a Data Scientist. The list normally goes like this:

  • Study descriptive statistics, hypothesis testing, probability
  • Learn the types of machine learning algorithms – supervised and unsupervised
  • Learn Python, R, SAS, SQL
  • Apply machine learning techniques using Python, R, SAS
  • Learn Data Visualization

While there is nothing wrong with the path illustrated above, it is not sufficient to become an effective data scientist. Now you might ask WHY? Before I answer that, I want to talk about the ‘Feynman Technique’.

Why is the technique called ‘Feynman Technique’?

The technique is named after the great theoretical physicist Richard Feynman. He was nicknamed ‘The Great Explainer’ for his remarkable skill of explaining even the most complex scientific topics in plain, layman-friendly language.

The Feynman Technique:

Step 1: Narrow down on a topic which you find difficult to grasp. Learn about the topic.

Step 2: Explain the topic as though you are teaching it to someone, in very simple terms.

Step 3: Work through examples or demonstrate how it works

Step 4: Assess your knowledge of the topic; if some concepts are still unclear, learn more about them and repeat steps 2–4.

In the process you will have developed a better understanding of the topic than you started with. That is the magic of the ‘Feynman Technique’.

Become a ‘Great Explainer’ to become a Great Data Scientist

The Data Science domain requires constant learning, and some of the concepts can be hard to comprehend. The Feynman Technique can help you understand topics you once thought were incredibly difficult.

The need to explain to bosses, clients or VCs

The Analytics industry will survive only if the key decision makers see value in it. The decision makers are:

Your boss – if you work in an in-house analytics setup

Clients – if you are in the analytics consulting / services business

VCs – if you are seeking investment for your ‘AI start-up’

More often than not, your boss, client or VCs may not have an analytics background or a deep understanding of the latest analytics topics. The onus is upon you to explain analytics concepts in as simple a language as possible so that they see value in your proposition.

So, bottom line: practice your Feynman Technique, lest you face the same ordeal Dilbert faces every day with his boss.

How I became a Data Scientist

During my MBA course, I was the only person with a statistics background, and I always felt that my understanding of a statistics concept got better as I explained it to my friends. Their confirmation of having learnt the concept easily gave me encouragement, and an added responsibility to learn the concepts thoroughly myself so that I did not teach them wrong.

The confidence of having learnt something thoroughly allowed me to get into the Data Science field. Even now I follow the Feynman Technique to get a better grasp of topics which initially seem incomprehensible.

Feynman Technique in Practice – Write articles

Well, I must confess: I wrote my first article, “Recommender Engine”, in order to develop a better understanding of how recommender systems work. While I don’t claim expertise in recommender systems, I can surely say I learnt something intuitively.

Similarly, in my last article, “How to dockerize an R Shiny app”, I tried to explain Docker through Legos!

Feynman Technique – A remedy for impostor syndrome

As the Data Science field has become lucrative, many want to break into it. Those who succeed in gaining entry (without a stats/math background) are sometimes left with impostor syndrome. As Feynman himself put it, “you are the easiest person to fool”. The only way to get over impostor syndrome is to develop a genuinely strong understanding of the various Data Science topics, and what better way to understand topics deeply than the Feynman Technique.

Remember: Become a Great Explainer to become a Great Data Scientist!!

If you liked my article, give it a like, and feel free to comment below with your opinions.



How to host an R Shiny App on AWS cloud in 7 simple steps

What is an R Shiny App?

An R Shiny app is an interactive web application. It has two components: a user interface object (UI.R) and a server function (Server.R). The two components are passed as arguments to the shinyApp function, which creates a Shiny app object. For more info on how to build Shiny apps, please refer to this link.

Step 1: Create an EC2 instance

Log in to your AWS account and click on EC2 under the ‘Compute’ header, or under ‘Recently visited services’.

Click on Launch Instance.

Choose a machine image of your choice; here I have chosen Ubuntu Server 16.04 LTS (HVM).

Choose an instance type; one can start out with t2.micro, t2.small or t2.medium instances. For larger apps one can use t2.large and beyond.

Then click on Launch Instance; you will be directed to the review page.

Click on Edit security groups; you will be directed to the Configure Security Group page.

In the SSH row, change the source to ‘My IP’.

Click on Add Rule; a Custom TCP rule will be added. Under ‘Port Range’ enter 3838. This is the port the R Shiny server listens on.

Click ‘Review and Launch’ and then click ‘Launch’. A dialogue box will appear that helps in creating a private key, which will enable us to SSH into the EC2 instance. Give the key a name and click ‘Download Key Pair’. You will get a .pem file; save it securely.

Press Launch Instances.

If the instance is created successfully, the instance state will show ‘running’.

Copy the address under Public DNS (IPv4); this will form the basis of our URL for hosting the R Shiny app later.

Step 2: Access the EC2 instance via SSH using PuTTY (Windows)

Download PuTTY. After installing it, convert the .pem file into a .ppk file.

To convert the .pem file to .ppk, type puttygen in the Windows start menu and click on ‘PuTTYgen’.

Click on the File menu and then ‘Load private key’.

Navigate to the folder where you saved the .pem file and select it; the .pem file will be imported.

Now click ‘Save private key’, give the key a name and save it in your desired location.

Open PuTTY and in the Host Name box enter the IP of the EC2 instance, i.e. the IPv4 Public IP (54.148.189.55 in this example).

Next navigate to ‘Auth’ in the left-hand panel and browse for the .ppk key that you saved earlier.

After selecting the .ppk key, click Open. If your key is valid, you will get a command prompt window. Enter your login credentials and press Enter.

Step 3: Install WinSCP to transfer files between your local machine and the EC2 instance

Enter the IP address of the EC2 instance in the Host Name box, then click on ‘Advanced’.

In the left-hand panel, under SSH, click on ‘Authentication’ and enter the private key.

After entering the private key, click OK, then click ‘Login’.

A security dialog box will appear; just click ‘Yes’.

You should now be connected and able to transfer files to and from the instance.

Step 4: Install R base and Shiny Server on the EC2 instance

The first prerequisite to run an R Shiny app is to install r-base, Shiny Server, the shiny package and associated packages.

The first step is to switch to the root user and install them from there. If you are logged in to EC2 as a non-root user, you will have your own library path, and r-base, the R packages and Shiny Server may not get installed system-wide. To install them system-wide, switch to root and install the above.

Steps to go to root:

In the prompt type the below

sudo -i

The prompt should then change to a # symbol.

Now run the following commands

sudo apt-get update

sudo apt-get install r-base

sudo apt-get install r-base-dev

The below command installs the R shiny package:

sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""

The below commands download and install Shiny Server:

wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.4.4.807-amd64.deb

sudo dpkg -i shiny-server-1.4.4.807-amd64.deb

Step 5: Transfer the R shiny components

After executing the above steps, a directory (folder) named ‘shiny-server’ will have been created at the path /srv/shiny-server.

The next step is to create a folder inside the shiny-server directory where we can place our R Shiny app components (UI.R, Server.R, configuration file, R workspace, data files or R programs).

At first we may not be able to create a folder inside the shiny-server folder; to do so, execute the below commands first:

sudo chmod 777 /srv/shiny-server

sudo mkdir /srv/shiny-server/myapp

In the above commands I have created a folder ‘myapp’ in which to place all the R Shiny app components.

Step 6: Use WinSCP to transfer the R Shiny app components from your local machine to the EC2 instance

Now copy the R Shiny app components from your local machine to the EC2 instance under the path /srv/shiny-server/myapp/.

One important thing to take care of is configuring shiny-server.conf.

The shiny-server.conf is available in the location /etc/shiny-server/

Again you may not be able to access the shiny-server directory under /etc/.

Hence run the below command

sudo chmod 777 /etc/shiny-server

After executing the above command, you can copy the configuration file to your local system, edit it if needed, and transfer the edited file back to /etc/shiny-server.

The configuration to check is as follows; please note that words after # are comments.
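
For reference, a stock shiny-server.conf (as shipped with Shiny Server 1.4.x) looks roughly like the sketch below; the key settings to verify are the port (3838, which we opened in the security group) and site_dir, since every folder placed under site_dir, such as myapp, is served as an app:

run_as shiny; # run applications as the 'shiny' system user

server {
  listen 3838; # the port we opened in the security group

  location / {
    site_dir /srv/shiny-server; # every folder here becomes an app, e.g. /myapp
    log_dir /var/log/shiny-server; # Shiny app logs are written here
    directory_index on; # visiting the base URL lists the available apps
  }
}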

With this done, the Shiny components sit in the path /srv/shiny-server/myapp.

Step 7: Hosting the app

Now the final step. In the Amazon console, go to your running EC2 instance and copy the Public DNS (IPv4), e.g. ec2-34-215-115-68.us-west-2.compute.amazonaws.com.

Paste this in the browser, suffix it with :3838/myapp, e.g. http://ec2-34-215-115-68.us-west-2.compute.amazonaws.com:3838/myapp, and press Enter.

Your R shiny app is hosted successfully!!!!

Important Notes to consider:

· It is important to copy the R workspace into the folder created (i.e. myapp), as it will contain the R objects and data files. In the above example we are using a simple Shiny app (a ‘hello world’ equivalent), hence we do not have any R workspace or data files.

· Sometimes the app may not load for some idiopathic reason, or the screen may get ‘greyed out’. Please refresh and try again.

Sources:

Running R on AWS

Shiny server troubleshoots



Recommender Engine – Under The Hood

Many of us are bombarded with various recommendations in our day-to-day life, be it on e-commerce sites or social media sites. Some of the recommendations look relevant, but some create a range of emotions in people, varying from confusion to anger.

There are basically two types of recommender systems: content-based and collaborative filtering. Both have their pros and cons depending upon the context in which you want to use them.

Content based: In content-based recommender systems, keywords or properties of the items are taken into consideration while recommending an item to a user. So, in a nutshell, it is like recommending similar items. Imagine you are reading a book on data visualization and want to look for other books on the same topic. In this scenario, a content-based recommender system would be apt.

Collaborative Filtering: To drive home the point, consider this example: customer A has bought books x, y and z, and customer B has bought books y and z. Collaborative filtering would recommend book x to customer B. This is both the advantage and the disadvantage of collaborative filtering: it does not matter that book x is a nonfiction book while the liking of customer B is strictly fiction, so the relevancy of the recommendation may or may not be right. Many companies nonetheless use this technique since it allows them to cross-sell products.
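
To make the idea concrete, here is a toy sketch in Python (the purchase data and the recommend_cf helper are hypothetical, purely for illustration):

purchases = {
    "A": {"x", "y", "z"},
    "B": {"y", "z"},
}

def recommend_cf(target, purchases):
    # recommend items bought by customers whose purchases overlap with the target's
    recs = set()
    for customer, items in purchases.items():
        if customer != target and items & purchases[target]:
            recs |= items - purchases[target]
    return recs

print(recommend_cf("B", purchases))  # {'x'}: book x is recommended to customer B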

Developing a Content Based Book Recommender System — Theory

Imagine you have a collection of data science books in your library, and let’s say your friend has read a book on neural networks and wants to read another book on the same topic to build up his/her knowledge of the subject. The best way is to implement a simple content-based recommender system.

We will look at three important concepts here which go into building this content based recommender system.

  • Vectors
  • TF-IDF
  • Cosine Similarity

Vectors

The fundamental idea is to convert texts or words into vectors and represent them in a vector space model. This idea is beautiful, and in essence this very idea of vectors is what is making the rapid strides in machine learning and AI possible. In fact, Geoffrey Hinton (the “Father of Deep Learning”), in an MIT Technology Review article, acknowledged that the AI institute at Toronto has been named the “Vector Institute” owing to the beautiful properties of vectors that have helped the field of deep learning and other variants of neural nets.

TF — IDF

TF-IDF stands for Term Frequency – Inverse Document Frequency. TF-IDF helps in evaluating the importance of a word in a document.

TF — Term Frequency

In order to ascertain how frequently a term/word appears in a document, and to represent the document in vector form, let’s break the process down into the following steps.

Step 1: Create a dictionary of words (also known as a bag of words) present in the whole document space. We ignore common words, also called stop words (e.g. the, of, a, an, is), since they appear in almost every document and will not help us in our goal of choosing important words.

In the current example I have used the file ‘test1.csv’, which contains the titles of 50 books. But to drive home the point, consider just 3 book titles (documents) as making up the whole document space. So B1 is one document, and B2 and B3 are the other documents. Together B1, B2 and B3 make up the document space.

B1 — Recommender Systems

B2 — The Elements of Statistical Learning

B3 — Recommender Systems — Advanced

Now creating an index of these words (stop words ignored)

1. Recommender 2. Systems 3. Elements 4. Statistical 5. Learning 6. Advanced

Step 2: Forming the vector. Using the index above, each title can be represented as a vector of term counts: B1 = (1, 1, 0, 0, 0, 0), B2 = (0, 0, 1, 1, 1, 0) and B3 = (1, 1, 0, 0, 0, 1).

The term frequency helps us identify how many times a term or word appears in a document, but there is an inherent problem: TF gives more importance to frequently occurring words while ignoring the importance of rare ones, even though rare words carry more signal. This problem is resolved by IDF.

Also, a word/term may occur more frequently in longer documents than in shorter ones; hence the term frequency is normalized:

TFn = (Number of times term t appears in a document) / (Total number of terms in the document), where n stands for normalized. For example, TFn(‘Recommender’, B1) = 1/2 = 0.5.

IDF (Inverse Document Frequency):

In some variations of the IDF definition, 1 is added to the denominator to avoid division by zero when no document contains the term.

Basically a simple definition would be:

IDF = ln (Total number of documents / Number of documents with term t in it)

Now let’s take an example from our own dictionary or bag of words and calculate the IDFs

We had 6 terms or words which are as follows

1. Recommender 2. Systems 3. Elements 4. Statistical 5. Learning 6. Advanced

and our documents were :

B1 — Recommender Systems

B2 — The Elements of Statistical Learning

B3 — Recommender Systems — Advanced

Now IDF(w1) = ln(3/2); IDF(w2) = ln(3/2); IDF(w3) = ln(3/1); IDF(w4) = ln(3/1); IDF(w5) = ln(3/1); IDF(w6) = ln(3/1)

(note: the natural logarithm is taken, and w1…w6 denote the six words/terms above)

We thus get the IDF vector:

= (0.4055, 0.4055, 1.0986, 1.0986, 1.0986, 1.0986)

TF-IDF Weight:

Now the final step is to get the TF-IDF weight. Each document’s normalized TF vector is multiplied element-wise by the IDF vector, and these rows together form the TF-IDF matrix.

Then TF-IDF weight is represented as:

TF-IDF Weight = TF (t,d) * IDF(t,D)
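
To make the computation concrete, here is a minimal pure-Python sketch of the formula above applied to our three titles (the tf and idf helpers are mine; the vocabulary is sorted alphabetically, so the column order differs from the article’s index):

import math

docs = ["Recommender Systems",
        "The Elements of Statistical Learning",
        "Recommender Systems Advanced"]
stop_words = {"the", "of"}

# tokenize each title, dropping stop words
tokens = [[w.lower() for w in d.split() if w.lower() not in stop_words] for d in docs]
vocab = sorted({w for doc in tokens for w in doc})

def tf(term, doc):
    # normalized term frequency: count / total terms in the document
    return doc.count(term) / len(doc)

def idf(term):
    # ln(total number of documents / number of documents containing the term)
    df = sum(term in doc for doc in tokens)
    return math.log(len(tokens) / df)

# TF-IDF weight = TF(t, d) * IDF(t, D), one row per document
tfidf = [[round(tf(t, doc) * idf(t), 4) for t in vocab] for doc in tokens]
for title, row in zip(docs, tfidf):
    print(title, row)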

This is the kind of matrix we get by executing the Python code below (scikit-learn’s implementation additionally applies a smoothed IDF and L2 normalization by default, so the exact values differ slightly):

tfidf_matrix = tf.fit_transform(ds['Book Title'])

Cosine Similarity:

Well, cosine similarity is a measure of similarity between two non-zero vectors:

cosine similarity = (A · B) / (||A|| ||B||)

One of the beautiful things about vector representations is that we can now see how closely related two sentences are based on the angle their respective vectors make.

The cosine value ranges from -1 to 1.

So if two vectors make an angle of 0°, the cosine value is 1, which in turn means that the sentences are closely related to each other.

If the two vectors are orthogonal (cos 90° = 0), the sentences are essentially unrelated.
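
As a quick illustration, here is the cosine formula applied to the raw term-count vectors for B1, B2 and B3, using the dictionary order from earlier (Recommender, Systems, Elements, Statistical, Learning, Advanced):

import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

b1 = np.array([1, 1, 0, 0, 0, 0])  # "Recommender Systems"
b2 = np.array([0, 0, 1, 1, 1, 0])  # "The Elements of Statistical Learning"
b3 = np.array([1, 1, 0, 0, 0, 1])  # "Recommender Systems Advanced"

print(cosine(b1, b3))  # ~0.816: closely related titles
print(cosine(b1, b2))  # 0.0: no shared terms, orthogonal vectors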

Developing a Content Based Book Recommender System — Implementation

Below I have written a few lines of Python code to implement a simple content-based book recommender system. I have added comments (words after #) to make clear what each line of code is doing.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ds = pd.read_csv("test1.csv") # you can plug in your own list of products, movies or books here as a csv file

# ngram_range=(1, 3) encompasses unigrams, bigrams and trigrams:
# for the sentence "The ball fell" these are
# the, ball, fell, the ball, ball fell, the ball fell
# min_df=1 keeps every term (newer scikit-learn releases reject min_df=0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)

results = {} # dictionary to store the results in the format ID: [(score, item_id), ...]
for idx, row in ds.iterrows(): # iterate through all the rows
    # argsort sorts the similarities in ascending order; [:-5:-1] then picks the
    # four indices with the highest similarity, the first of which is the title
    # itself. 0 means no similarity and 1 means perfect similarity.
    similar_indices = cosine_similarities[idx].argsort()[:-5:-1] # change [:-5:-1] as per your needs
    similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
    results[row['ID']] = similar_items[1:] # drop the title itself

def item(id):
    # returns the Book Title matching the id; ds.loc gives a dataframe,
    # which we convert to a list and take the first element
    return ds.loc[ds['ID'] == id]['Book Title'].tolist()[0]

def recommend(id, num):
    if num == 0:
        print("Unable to recommend any book as you have not chosen the number of books to be recommended")
    elif num == 1:
        print("Recommending " + str(num) + " book similar to " + item(id))
    else:
        print("Recommending " + str(num) + " books similar to " + item(id))
    print("----------------------------------------------------------")
    recs = results[id][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# the first argument is the id of the book; the second is the number of books you want recommended
recommend(5, 2)

The Output

Recommending 2 books similar to The Elements of Statistical Learning 
----------------------------------------------------------
You may also like to read: An introduction to Statistical Learning (score:0.389869522721)
You may also like to read: Statistical Distributions (score:0.13171009673)

The list of IDs and book titles in test1.csv is as below (first 20 of 50 rows shown):

ID,Book Title
1,Probabilistic Graphical Models
2,Bayesian Data Analysis
3,Doing data science
4,Pattern Recognition and Machine Learning
5,The Elements of Statistical Learning
6,An introduction to Statistical Learning
7,Python Machine Learning
8,Natural Language Processing with Python
9,Statistical Distributions
10,Monte Carlo Statistical Methods
11,Machine Learning: A Probabilistic Perspective
12,Neural Network Design
13,Matrix methods in Data Mining and Pattern recognition
14,Statistical Power Analysis
15,Probability Theory The Logic of Science
16,Introduction to Probability
17,Statistical methods for recommender systems
18,Entropy and Information theory
19,Clever Algorithms: Nature-Inspired Programming Recipes
20,"Precision: Principles, Practices and Solutions for the Internet of Things"

Now that you have read this article you may also like to read……..(well never mind 😉 )

The same article is also available at the following links:

How To Build a Simple Content Based Book Recommender System

Medium: Recommender Engine – Under The Hood

Sources:

TF-IDF

Kaggle

Vector space model

Deeplearning4j

Sklearn Cosine Similarity

Sklearn TFidfvectorizer

Mark Needham Blog

MIT Technology Review

