Top Data Science Resources on the Internet Right Now

I have been meaning to create this list for a while now. Many people on Quora ask me how I got started in the data science field, so I wanted to create this reference.

To be frank, when I first started learning, it all looked very utopian and out of this world. The Andrew Ng course felt like black magic, and it still doesn’t cease to amaze me. After all, we are predicting the future. Take the case of Nate Silver: what else can you call his success if not black magic?

But it is not magic. What follows is one path an aspiring data scientist could take to become self-trained. Follow it in order. I have tried to include everything that comes to mind. So here goes:

1. Stat 110: Introduction to Probability: Joe Blitzstein – Harvard University

The one stat course you gotta take. If not for the content, then for Prof. Blitzstein’s sense of humor. I took this course to improve my understanding of probability distributions and statistics, but it taught me a lot more than that. Apart from learning to think conditionally, it also taught me how to explain difficult concepts with a story.

This was a hard class but most definitely fun. The focus was not only on getting through mathematical proofs but also on understanding the intuition behind them and how that intuition can help in deriving them more easily. Sometimes the same proof was done in different ways to make a concept easier to learn.

One of the things I liked most about this course is its focus on concrete examples while explaining abstract concepts. The inclusion of the Gambler’s Ruin Problem, the Matching Problem, the Birthday Problem, Monty Hall, Simpson’s Paradox, the St. Petersburg Paradox and so on made this course much more exciting than a normal statistics course.

It will help you understand discrete (Bernoulli, Binomial, Hypergeometric, Geometric, Negative Binomial, First Success, Poisson) and continuous (Uniform, Normal, Exponential, Beta, Gamma) distributions and the stories behind them. These were something I was always afraid of.

He has also published a textbook based on this course, which is a great text.

2. CS109: Data Science – Harvard University

Again by Professor Blitzstein. Again an awesome course. Watch it after Stat 110, as you will understand everything much better with a thorough grounding in Stat 110 concepts. You will learn about Python libraries for data science such as NumPy and pandas, along with a thorough, intuitive grounding in various machine learning algorithms. Course description from the website:

Learning from data in order to gain useful predictions and insights. This course introduces methods for five key facets of an investigation: data wrangling, cleaning, and sampling to get a suitable data set; data management to be able to access big data quickly and reliably; exploratory data analysis to generate hypotheses and intuition; prediction based on statistical methods such as regression and classification; and communication of results through visualization, stories, and interpretable summaries.

3. CS229: Andrew Ng

After doing the two courses above, you will gain the status of what I would like to call a “Beginner”. Congrats! You know stuff, and you know how to implement stuff. Yet you do not fully understand all the math and grind that goes on behind it all.

Here comes the game-changer machine learning course. It covers the math behind many of the machine learning algorithms. I rate it as the one course you gotta take, as it motivated me to get into this field, and Andrew Ng is a great instructor. It was also the first course that I took.

Andrew Ng also recently released a new book. You can get the draft chapters by subscribing on his website here.

You are done with the three musketeers of the trade. You know Python, you understand statistics, and you have gotten a taste of the math behind ML approaches. Now it is time for the new kid on the block: D’Artagnan. This kid has skills. While the three musketeers are masters of their trade, this guy brings qualities that add a new freshness to our data science journey. Here comes Big Data for you.

4. Intro to Hadoop & MapReduce – Udacity

Let us first focus on the literal elephant in the room: Hadoop. This is a short and easy course. It teaches the fundamentals of Hadoop Streaming with Python and is taught by Cloudera on Udacity. I am doing much more advanced stuff with Python and MapReduce now, but this is one of the courses that laid the foundation.

Once you are through this course, you will have gained a basic understanding of the concepts and installed a Hadoop VM on your own machine. You will also have solved the basic word count problem. Read this amazing blog post from Michael Noll: Writing An Hadoop MapReduce Program In Python – Michael G. Noll. Just read the basic MapReduce code. Don’t use iterators and generators yet. This has been a starting point for many of us Hadoop developers.
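For reference, the word count job in the Hadoop Streaming style can be sketched roughly as below. This is a minimal sketch and not the code from the Udacity course or from Michael Noll’s post: the names `map_line`, `reduce_sorted` and `run_local` are made up for illustration, and `run_local` only imitates Hadoop’s sort-and-shuffle step on one machine. On a real cluster the mapper and the reducer would be two separate scripts reading lines from standard input.

```python
# Sketch of word count in the Hadoop Streaming style (hypothetical names).
# Streaming mappers and reducers exchange tab-separated "key\tvalue" lines.


def map_line(line):
    """Mapper: emit "word\t1" for every word on one input line."""
    pairs = []
    for word in line.strip().lower().split():
        pairs.append(word + "\t1")
    return pairs


def reduce_sorted(sorted_pairs):
    """Reducer: sum the counts of consecutive lines that share a key.

    Hadoop sorts the mapper output by key before the reducer sees it,
    so all lines for one word arrive next to each other.
    """
    results = []
    current_word = None
    current_count = 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current_word:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word = word
            current_count = 0
        current_count += int(count)
    if current_word is not None:
        results.append((current_word, current_count))
    return results


def run_local(lines):
    """Imitate map -> sort/shuffle -> reduce locally (assumes tiny input)."""
    mapped = []
    for line in lines:
        mapped.extend(map_line(line))
    return reduce_sorted(sorted(mapped))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    for word, count in run_local(sample):
        print(word + "\t" + str(count))
```

The reducer deliberately avoids iterators and generators and relies only on the input being sorted by key, which is the one guarantee Hadoop Streaming gives it.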

Now try to solve these two problems from the CS109 Harvard course from 2013:

A. First, grab the file word_list.txt from here. This contains a list of six-letter words. To keep things simple, all of the words consist of lower-case letters only. Write a MapReduce job that finds all anagrams in word_list.txt.

B. For the next problem, download the file baseball_friends.csv. Each row of this csv file contains the following:

  • A person’s name
  • The team that person is rooting for — either “Cardinals” or “Red Sox”
  • A list of that person’s friends, which could have arbitrary length

For example: The first line tells us that Aaden is a Red Sox fan and he has 65 friends, who are all listed here. For this problem, it’s safe to assume that all of the names are unique and that the friendship structure is symmetric (i.e. if Alannah shows up in Aaden’s friends list, then Aaden will show up in Alannah’s friends list). Write a MapReduce job that lists each person’s name, their favorite team, the number of Red Sox fans they are friends with, and the number of Cardinals fans they are friends with.

Try to do this yourself. Don’t use the mrjob (pronounced Mr. Job) approach that they use in the CS109 2013 class. Use the proper Hadoop Streaming approach taught in the Udacity class, as it is much more customizable in the long run.

Once you are done with these, you can safely call yourself someone who can “think in MapReduce”, as people like to put it. Try to do group-bys, filters and joins using Hadoop. You can read up on some good tricks on my blog:
Hadoop Mapreduce Streaming Tricks and Techniques

If you are someone who likes learning from a book you can get:

5. Spark – In-memory Big Data tool

Now comes the next part of your learning process. This should be undertaken after a little bit of experience with Hadoop. Spark will provide you with the speed and tools that Hadoop couldn’t.

Spark is now used for data preparation as well as machine learning. I would encourage you to take a look at the series of courses on edX provided by Berkeley instructors. The series delivers on what it promises: it teaches Spark. Total beginners will have difficulty following along, as the material progresses very fast. That said, anyone with a decent understanding of how big data works will be OK.

Data Science and Engineering with Apache® Spark™

I have written a little about basic data processing with Spark here. Take a look: Learning Spark using Python: Basics and Applications

Also take a look at some of the projects I did as part of the courses, on GitHub.

If you would like a book to read:

If you don’t go through the courses, try using Spark to solve the same two problems that you solved with Hadoop above. Otherwise, the problem sets in the courses are more than enough.

6. Understand Linux Shell:

The shell is a big friend to data scientists. It allows you to do simple data-related tasks in the terminal itself. I can’t overemphasize how much time the shell saves me every day.

Read these tutorials of mine to get started:
Shell Basics every Data Scientist Should know – Part I
Shell Basics every Data Scientist Should know – Part II (AWK)

If you would like a course you can go for this course on edX.

If you want a book, go for:

Congrats, you are a “Hacker” now. You have all the main tools in your belt to be a data scientist. On to more advanced topics. From here, what you learn depends on you. You may want to take a totally different approach from the one I took. There is no particular order. “All roads lead to Rome” as long as you keep running.

7. Learn Statistical Inference and Bayesian Statistics

I took the previous version of the specialization, which was a single course taught by Mine Çetinkaya-Rundel. She is a great instructor and explains the fundamentals of statistical inference nicely. A must-take course. You will learn about hypothesis testing, confidence intervals, and statistical inference methods for numerical and categorical data. You can also use these books:

8. Deep Learning

Intro – Making neural nets uncool again. An awesome Deep learning class from Kaggle Master Jeremy Howard. Entertaining and enlightening at the same time.

Advanced – A series of notes from the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition.

Bonus – A free online book by Michael Nielsen.

Advanced Math Book – A math intensive book by Yoshua Bengio & Ian Goodfellow

9. Algorithms, Graph Algorithms, Recommendation Systems, Pagerank and More

This course used to be on Coursera, but now only the video links on YouTube are available. You can learn from this book too:

Apart from that, if you want to learn about Python and the basic intricacies of the language, you can take the Computer Science Mini Specialization from Rice University too. This is a series of 6 short but good courses. I worked through these courses because data science requires you to do a lot of programming, and the best way to learn programming is by doing programming. The lectures are good, but the problems and assignments are awesome. If you work through this, you will learn object-oriented programming, graph algorithms and games in Python. Pretty cool stuff.

10. Advanced Maths:

I can’t stress the importance of math enough. Here are a few awesome resources that you can go for.

Linear Algebra by Gilbert Strang – A great class by a great teacher. I would definitely recommend this class to anyone who wants to learn linear algebra.

Multivariate Calculus – MIT OCW

Convex Optimization – a MOOC on optimization from Stanford, by Stephen Boyd, an authority on the subject.

The machine learning field is evolving, and new advances are made every day. That’s why I didn’t add a third tier. The most I can call myself is a “Hacker”, and my learning continues. Hope yours does too.

Hope you like this list. Please provide your inputs in comments on more learning resources as you see fit.

Till then. Ciao!!!


How to think like a Data Scientist

A data scientist needs to be critical and always on the lookout for things that others miss. So here is some advice that you can apply in your day-to-day data science work to get better at it:

1. Beware of the Clean Data Syndrome

You need to ask yourself questions even before you start working on the data. **Does this data make sense?** Falsely assuming that the data is clean could lead you towards wrong hypotheses. Apart from that, you can discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50% of its values missing, you might think about not using the column. Or you may suspect that one of the data collection instruments has an error.

Or let’s say you have a Male vs Female distribution of 90:10 in a women’s cosmetics business. You may assume the data is clean and show the results as they are, or you can use common sense and ask whether the labels are switched.
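Both checks are easy to automate before any modelling starts. The snippet below is a minimal sketch on made-up data: the `customers` rows, the column names and the helper functions are all hypothetical, and the 50% cutoff is just the number used in the example above.

```python
# Sketch of two pre-modelling sanity checks (hypothetical data and names).


def missing_fraction(rows, column):
    """Fraction of rows where `column` is None or an empty string."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def category_shares(rows, column):
    """Share of each non-missing category in `column`."""
    counts = {}
    for row in rows:
        value = row.get(column)
        if value not in (None, ""):
            counts[value] = counts.get(value, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


# Made-up customer table: 90% labelled "M", and age is mostly missing.
customers = [{"gender": "M", "age": None}] * 9 + [{"gender": "F", "age": 31}]

for col in ("gender", "age"):
    frac = missing_fraction(customers, col)
    if frac > 0.5:  # cutoff taken from the example in the text
        print(f"{col}: {frac:.0%} missing, investigate before using it")

# A 90:10 split in favour of "M" is a prompt to ask whether labels are switched.
print(category_shares(customers, "gender"))
```

Neither check proves anything by itself; they only tell you where to start asking questions.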

2. Manage Outliers wisely

Outliers can help you understand more about extreme users, such as the people who are using your website or product 24 hours a day. But including them while building models can skew the models a lot.
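One way to act on this is to set the outliers aside for separate study rather than deleting them. The sketch below uses the common 1.5 × IQR rule, which is my choice for illustration and only one of several reasonable rules; the usage numbers and the function name `split_outliers` are made up.

```python
import statistics


def split_outliers(values, k=1.5):
    """Split values into (typical, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return typical, outliers


# Hypothetical hours of product usage per day for nine users.
hours_per_day = [0.5, 1.0, 1.2, 0.8, 1.5, 2.0, 0.7, 1.1, 24.0]

typical, outliers = split_outliers(hours_per_day)
print("mean with outliers:   ", round(statistics.mean(hours_per_day), 2))
print("mean without outliers:", round(statistics.mean(typical), 2))
print("users to study separately:", outliers)
```

The single 24-hour user drags the mean far above what a typical user does, yet that same user is exactly the one worth interviewing.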

3. Keep an eye out for the Abnormal

Be on the lookout for anything out of the ordinary. If you find something, you may have struck gold.

For example, Flickr started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did they pivot.

Another example: Fab.com started out as Fabulis.com, a site to help gay men meet people. One of the site’s popular features was the “Gay deal of the Day”. One day the deal was for hamburgers, and half of the buyers were women. This made the team realize that there was a market for selling goods to women. So Fabulis pivoted to Fab, a flash sale site for designer products.

4. Start focusing on the right metrics

  • Beware of vanity metrics. For example, the number of active users by itself doesn’t tell you much. I would rather say “5% MoM increase in active users” than “10000 active users”. Even that can be a vanity metric, as the active user count tends to rise as long as the user base keeps growing. I would rather track the percentage of users that are active to know how my product is performing.
  • Try to find a metric that ties in with the business goal. For example, average sales per user for a particular month.
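The difference between these metrics shows up as soon as you compute them side by side. The figures below are invented purely for illustration, and “sales per user” is taken here to mean sales per active user, which is an assumption.

```python
# Hypothetical monthly figures for one product.
months = [
    {"month": "Jan", "registered": 50000, "active": 10000, "sales": 20000.0},
    {"month": "Feb", "registered": 60000, "active": 10500, "sales": 21500.0},
]


def mom_growth(previous, current):
    """Month-over-month growth as a fraction (0.05 means 5%)."""
    return (current - previous) / previous


def active_share(month):
    """Share of registered users who were active that month."""
    return month["active"] / month["registered"]


def sales_per_active_user(month):
    """Average sales per active user (assumed definition)."""
    return month["sales"] / month["active"]


jan, feb = months
print(f"active users MoM: {mom_growth(jan['active'], feb['active']):+.1%}")
print(f"share active:     {active_share(jan):.1%} -> {active_share(feb):.1%}")
print(f"sales per user:   {sales_per_active_user(jan):.2f} -> "
      f"{sales_per_active_user(feb):.2f}")
```

In this made-up case active users grow 5% month over month while the share of users who are active falls, which is the situation the vanity metric hides.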

5. Statistics may lie too

Be critical of everything that gets quoted to you. Statistics has been used to lie in advertisements, in workplaces and a lot of other marketing venues in the past. People will do anything to get sales or promotions.

For example: Do you believe Colgate’s claim that 80% of dentists recommend their toothpaste?

This statistic seems pretty good at first. It turns out that at the time of surveying the dentists, they could choose several brands — not just one. So other brands could be just as popular as Colgate.

Another example: “99 percent accurate” doesn’t mean shit. Ask me to create a cancer prediction model and I could give you a 99 percent accurate model in a single line of code. How? Just predict “No Cancer” for everyone. I will be right maybe more than 99% of the time, as cancer is a pretty rare disease. Yet I have achieved nothing.
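Here is that “model” spelled out, as a minimal sketch on made-up labels. The 0.5% prevalence is an assumption chosen only to make the arithmetic obvious; recall (the share of actual cancer cases the model catches) is added to show what accuracy hides.

```python
# Hypothetical labels: 1 = cancer, 0 = no cancer (5 cases in 1000 patients).
labels = [1] * 5 + [0] * 995


def predict_no_cancer(patient):
    """The one-line 'model': ignore the patient and always say no cancer."""
    return 0


predictions = [predict_no_cancer(patient) for patient in range(len(labels))]

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)

true_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
actual_positives = sum(labels)
recall = true_positives / actual_positives

print(f"accuracy: {accuracy:.1%}")  # share of all predictions that are right
print(f"recall:   {recall:.1%}")    # share of real cancer cases that were caught
```

The accuracy looks excellent while the recall shows that not a single sick patient was found.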

6. Understand how probability works

It happened during the summer of 1913 in a casino in Monaco. Gamblers watched in amazement as the casino’s roulette wheel landed on black 26 times in a row. Since red and black are almost equally likely on any single spin, they were certain that red was “due” and kept betting against black. But every spin is independent of the ones before it, so the streak did nothing to change the odds. It was a field day for the casino. A perfect example of the Gambler’s fallacy, aka the Monte Carlo fallacy.

And this happens in real life. People tend to avoid long strings of the same answer, sometimes sacrificing accuracy of judgment for the sake of a pattern of decisions that looks fairer or more probable.

For example, an admissions officer may reject the next application if he has approved three applications in a row, even if the application should have been accepted on merit.
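You can convince yourself that streaks carry no information with a quick simulation. This is a sketch under a simplifying assumption: the wheel has only red and black, each with probability one half, so the green zero of a real roulette wheel is ignored. The function name and the streak length of 5 are arbitrary choices.

```python
import random


def black_rate_after_streak(spins, streak_length):
    """Fraction of spins that are black right after a run of blacks
    at least `streak_length` long."""
    run = 0          # length of the current run of blacks
    followers = 0    # spins that came right after a long enough run
    blacks = 0       # how many of those spins were black
    for spin in spins:
        if run >= streak_length:
            followers += 1
            if spin == "B":
                blacks += 1
        run = run + 1 if spin == "B" else 0
    return blacks / followers if followers else float("nan")


rng = random.Random(0)  # fixed seed so the run is repeatable
spins = [rng.choice("BR") for _ in range(200_000)]

overall = spins.count("B") / len(spins)
after_streak = black_rate_after_streak(spins, 5)

print(f"black overall:          {overall:.3f}")
print(f"black after 5 in a row: {after_streak:.3f}")
print(f"chance of 26 blacks in a row on a fair wheel: {0.5 ** 26:.2e}")
```

Both rates come out close to one half: a long run of black is rare before it happens, but once it has happened it says nothing about the next spin.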

7. Correlation Does Not Equal Causation

The Holy Grail of a data scientist’s toolbox: to see something for what it is. Just because two variables move in tandem doesn’t necessarily mean that one causes the other. There have been hilarious examples of this in the past. Some of my favorites are:

  •  Looking at fire department data, you infer that the more firemen are sent to a fire, the more damage is done. In reality, bigger fires cause both.
  •  When investigating the cause of crime in New York City in the 80s, an academic found a strong correlation between the amount of serious crime committed and the amount of ice cream sold by street vendors! Obviously, there was an unobserved variable causing both. Summers are when crime is at its highest and when the most ice cream is sold. So ice cream sales don’t cause crime, nor does crime increase ice cream sales.
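The ice cream example can be reproduced with simulated data. In the sketch below temperature drives both series and neither affects the other; every number is made up, and the partial correlation formula is used as one simple way to “hold temperature fixed”, not as the only way to handle a confounder.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable
temperature = [rng.uniform(0, 35) for _ in range(5000)]
# Both series depend on temperature plus independent noise, not on each other.
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
crime = [1.5 * t + rng.gauss(0, 10) for t in temperature]

print(f"corr(ice cream, crime):               {pearson(ice_cream, crime):.2f}")
print(f"same, holding temperature fixed: {partial_corr(ice_cream, crime, temperature):.2f}")
```

The raw correlation is strong, but it all but disappears once temperature is accounted for, which is what you would expect when a third variable causes both.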

8. More data may help

Sometimes getting extra data may work wonders. You might be able to model the real world more closely by looking at the problem from all angles. Look for extra data sources.

For example, Crime data in a city might help banks provide a better credit line to a person living in a troubled neighborhood and in turn increase the bottom line.

For the original article click here.

