Weekly Digest, April 2

Monday newsletter published by Data Science Central. Previous editions can be found here.  The contribution flagged with a + is our selection for the picture of the week.

Announcements
  • Earn our online Business Analytics Graduate Certificate in one year. Learn data analytics to support business decisions. Learn data analysis activities, including descriptive (what happened), diagnostic (why it happened), predictive (what will happen), and prescriptive (what should happen). Learn more . . . Submit your application and start in fall.

  • Register today for the Wharton Customer Analytics Conference, May 9th-10th, 2018, in Philadelphia, PA. This conference is dedicated to real-world applications of customer analytics that exemplify a balance of high-level rigor and business know-how. Technical workshops are available to attendees on May 9th, in addition to the full-day event on May 10th.

Featured Resources and Technical Contributions

Featured Articles

Forum Questions

Picture of the Week

Source for picture: contribution marked with a +

From our Sponsors


Stochastic Processes and New Tests of Randomness – Application to Cool Number Theory Problem

This article is intended for practitioners who are not necessarily statisticians or statistically savvy. The mathematical level is kept as simple as possible, yet I present an original, simple approach to testing for randomness, with an interesting application to illustrate the methodology. This material is rarely discussed in textbooks or classrooms (even for statistics students), and it offers a fresh perspective and out-of-the-box tools that are useful in many contexts, as an addition or alternative to traditional tests that are widely used. This article is written as a tutorial, but it also features an interesting research result in the last section.

1. Context

Let us assume that you are dealing with a time series with discrete time increments (for instance, daily observations), as opposed to a time-continuous process. The approach here is to adapt techniques used for time-continuous processes to time-discrete processes. More specifically (for those familiar with stochastic processes), we are dealing here with discrete Poisson processes. The main questions that we want to answer are: Are some events occurring randomly, or is there a mechanism that makes them occur non-randomly? And what is the gap distribution between two successive events of the same type?

In a time-continuous setting (Poisson process) the distribution in question is modeled by the exponential distribution. In the discrete case investigated here, the discrete Poisson process turns out to be a Markov chain, and we are dealing with geometric, rather than exponential distributions. Let us illustrate this with an example.

Example

The digits of the square root of two, SQRT(2), are believed to be distributed as if they were occurring randomly. Each of the 10 digits 0, 1, … , 9 appears with a frequency of 10% based on observations, and at any position in the decimal expansion of SQRT(2), the next digit does not seem to depend on the values of the previous digits (in short, its value is unpredictable). An event in this context is defined, for example, as a digit being equal to (say) 3. The next event is the first time when we find a subsequent digit also equal to 3. The gap (or time elapsed) between two occurrences of the same digit is the main metric that we are interested in, and it is denoted as G. If the digits were distributed just like random numbers, the distribution of the gap G between two occurrences of the same digit would be geometric, that is,

P(G = g) = p (1 - p)^(g - 1), for g = 1, 2, …

with p = 1 / 10 in this case, as each of the 10 digits (0, 1, …, 9) seems, based on observations, to have a frequency of 10%. Note that the average gap is then E(G) = 1 / p = 10. We will show that this is indeed the case: in other words, in our example, the gap G is very well approximated by a geometric distribution of parameter p = 1 / 10, based on an analysis of the first 10 million digits of SQRT(2).

What else should I look for, and how to proceed?

Studying the distribution of gaps can reveal patterns that standard tests might fail to catch. Another statistic worth studying is the maximum gap (see this article); this is sometimes referred to as extreme events or outlier analysis. Also, in our example above, it is worth studying gaps between groups of digits, not just single digits: for instance, how frequently the “word” 234567 repeats itself in the sequence of digits, and what the distribution of the gap for that word looks like. For any word consisting of 6 digits, p = 1 / 1,000,000. In our case, the data set only has 10 million digits, so you can expect to find 234567 only about 10 times on average, far too few occurrences to estimate the gap distribution of that word reliably. Shorter words make more sense. This and other issues are discussed in the next section.

2. Methodology

The first step is to estimate the probability p associated with the model, that is, the probability for a specific event to occur at any time. It can easily be estimated from your data set, and in general you get a different p for each type of event. Then you need an algorithm to compute the empirical (observed) distribution of gaps between two successive occurrences of the same event. In our example, we have 10 types of events, each associated with the occurrence of one of the 10 digits 0, 1, …, 9 in the decimal representation of SQRT(2). The gap computation can be efficiently performed as follows:

Algorithm to compute the observed gap distribution

Do a loop over all your observations (in our case, the first 10 million digits of SQRT(2), stored in a file; each of these 10 million digits is one observation). Within the loop, at each iteration t, do:

  • Let E be the event showing up in the data set at iteration t (for instance, the occurrence of digit 3). Retrieve its last occurrence, stored in an array, say LastOccurrences[E].
  • Compute the gap G as G = t – LastOccurrences[E]
  • Update the LastOccurrences table as follows: LastOccurrences[E] = t
  • Update the gap distribution table, denoted as GapTable (a two-dimensional array or, better, a hash table), as follows: GapTable[E, G]++

Once you have completed the loop, all the information that you need is stored in the GapTable summary table.
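
Here is a minimal Python sketch of this algorithm (a sketch only: the function and variable names are mine, and the digits are assumed to be available as a sequence of integers):

    from collections import defaultdict

    def gap_distribution(digits):
        # Observed gap distribution, following the algorithm above:
        # gap_table[(e, g)] counts the gaps of length g between successive
        # occurrences of event (digit) e.
        last_occurrences = {}               # LastOccurrences[E]
        gap_table = defaultdict(int)        # GapTable[E, G]
        for t, e in enumerate(digits):
            if e in last_occurrences:       # skip the very first occurrence
                g = t - last_occurrences[e]     # G = t - LastOccurrences[E]
                gap_table[(e, g)] += 1          # GapTable[E, G]++
            last_occurrences[e] = t             # LastOccurrences[E] = t
        return gap_table

For the SQRT(2) example, digits would be the list of 10 million decimals read from the published table.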

Statistical testing

If some events occur randomly, the theoretical distribution of the gap for these events is known to be geometric, see the formula in the first section. So you must test whether the empirical gap distribution (computed with the above algorithm) is statistically different from the theoretical geometric distribution of parameter p (remember that each type of event may have a different p). If it is statistically different, then the assumption of randomness should be discarded: you’ve found some patterns. This work is typically done using a Kolmogorov–Smirnov test (or, since the gap is a discrete variable, a chi-squared goodness-of-fit test). If you are not a statistician but instead a BI analyst or engineer, other techniques can be used instead, and are illustrated in the last section:

  • You can simulate events that are perfectly randomly distributed, and compare the gap distribution obtained in your simulations with the one computed on your observations; see the sketch after this list. See here how to do it, especially the last comment, which features an efficient way to do it. This Monte Carlo simulation approach will appeal to operations research analysts.
  • In Excel, plot the gap distribution computed on your observations (one for each type of event), add a trendline, and optionally display the trendline equation and its R-squared. When choosing a trendline (model fitting) in Excel, you must select the Exponential one. This is what we did (see next section), and the good news is that, despite the very limited selection of models that Excel offers, Exponential is one of them. You can actually test other trendlines in Excel (polynomial, linear, power, or logarithmic) and you will see that, by far, Exponential offers the best fit, provided your events are really randomly distributed.
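
Below is a minimal sketch of the simulation approach, in Python with NumPy (a sketch under my own conventions: the truncation at gaps of length 200 is arbitrary, and gap_distribution is the function sketched earlier):

    import numpy as np

    def tv_distance(gap_table, digit, p, max_gap=200):
        # Total variation distance between the empirical gap distribution
        # for one digit and the geometric model P(G = g) = p * (1-p)**(g-1).
        counts = np.array([gap_table.get((digit, g), 0)
                           for g in range(1, max_gap + 1)], dtype=float)
        empirical = counts / counts.sum()
        g = np.arange(1, max_gap + 1)
        theoretical = p * (1.0 - p) ** (g - 1)
        return 0.5 * np.abs(empirical - theoretical).sum()

    # Benchmark: distances measured on digits that are random by construction.
    rng = np.random.default_rng(seed=42)
    simulated = [tv_distance(gap_distribution(rng.integers(0, 10, 10**6)), 3, 0.1)
                 for _ in range(20)]

    # observed = tv_distance(gap_distribution(sqrt2_digits), 3, 0.1)
    # Randomness is suspect if observed lies far above max(simulated).

Here sqrt2_digits is a hypothetical array holding the observed digits; the idea is simply to check whether the distance measured on your data is in line with the distances produced by pure noise.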

Further advice

If you have collected a large number of observations (say 10 million), you can do the testing on samples of increasing sizes (1,000, 10,000, 100,000 consecutive observations, and so on) to see how fast the empirical distribution converges (or not) to the theoretical geometric distribution; see the sketch below. You can also compare the behavior across samples (cross-validation) or across types of events (variance analysis). If your data set is too small (say, 100 data points) or your events too rare (p less than 1%), consider increasing the size of your data set if possible.
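
A small sketch of such a convergence check, reusing the functions from the sketches above (sqrt2_digits is again a hypothetical array holding the observed digits):

    # Convergence check on nested samples of increasing size.
    for size in (10**3, 10**4, 10**5, 10**6):
        d = tv_distance(gap_distribution(sqrt2_digits[:size]), 3, 0.1)
        print(size, round(d, 4))
    # If the digits behave randomly, the distance to the geometric model
    # should shrink roughly like 1 / SQRT(size).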

Even with big data, if you are testing a large number of rare events (in our case, tons of long “words” such as 234567, rather than single digits, in the decimal representation of SQRT(2)), expect many tests to result in false negatives (failure to detect true randomness). You can even compute the probability for this to happen, assuming all your events are perfectly randomly distributed: with m independent tests each run at the 1% level, the chance of at least one spurious rejection is 1 - 0.99^m, already above 63% for m = 100. This is known as the curse of big data.

3. Application to a Number Theory Problem

Here, we further discuss the example used throughout this article to illustrate the concepts. Mathematical constants (and indeed the immense majority of all numbers) are thought to have their digits distributed as if they were randomly generated, see here for details.

Many tests have been performed on many well-known constants (see here), and none of them was able to identify any departure from randomness. The gap test illustrated here is less well known, and when applied to SQRT(2), it was also unable to find any departure from randomness. In fact, the fit with a random distribution, as shown in the figure below, is almost perfect.

There is a simple formula to compute any digit of SQRT(2) separately (see here), but it is not practical. Instead, we used a table of 10 million digits published here by NASA. The source claims that digits beyond the first five million have not been double-checked, so we only used the first 5 million digits. The summary gap table, methodological details, and the above picture can be found in my spreadsheet. You can download it here.

The above chart shows a perfect fit between the observed distribution of gap lengths (averaged across the 10 digits 0, 1, …, 9) between successive occurrences of the same digit in the first 5 million decimals of SQRT(2), and the geometric distribution model, using the Exponential trendline in Excel.

I also explored the last 2 million decimals available in the NASA table, and despite the fact that they have not been double-checked, they also display the exact same random behavior. Maybe these decimals are all wrong but the mechanism that generates them preserves randomness, or maybe all or most of them are correct.

A potential application is to insert digits that appear to be randomly generated (like white noise; the digits of SQRT(2) seem to fit the bill) into documents, at random positions that only the recipient can reconstruct, perhaps three or four random digits on average for each real character in the original document, before encrypting it, to increase security. This is a bit like steganography. Encoding the same document a second time would result in a different kind of white noise added to the original document, peppered randomly, each time differently: with a different intensity, and at different locations each time. This would make the task of hackers more complicated.
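
Here is a toy Python sketch of one possible reading of this scheme (an illustration only, not a secure construction; the key-seeded generator and the helper names are my own assumptions):

    import random

    def pepper(message, noise_digits, key):
        # Insert noise digits at pseudo-random positions that a recipient
        # sharing `key` can reproduce; a fresh key gives different noise.
        rng = random.Random(key)
        out, i = [], 0
        for ch in message:
            for _ in range(rng.randint(2, 5)):   # 2 to 5 noise digits per
                out.append(noise_digits[i % len(noise_digits)])  # character,
                i += 1                                           # 3.5 on average
            out.append(ch)
        return "".join(out)

    def unpepper(peppered, key):
        # Replay the same pseudo-random sequence to skip the noise digits.
        rng = random.Random(key)
        out, j = [], 0
        while j < len(peppered):
            j += rng.randint(2, 5)               # skip the noise run
            out.append(peppered[j])              # keep the real character
            j += 1
        return "".join(out)

    # Round trip, using some digits of SQRT(2) as the noise source:
    # unpepper(pepper("hello", "14142135623730950488", 2018), 2018) == "hello"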

Conclusion

Finally, this is an example where intuition can be wrong, and why you need data science. When I looked at the first few hundred digits of SQRT(2) (see picture below), they looked to me like anything but random. There were too many 99's and too few 37's (among other things), according to my intuition and visual inspection (you may call it gut feeling). It turns out that I was wrong. Look at the first few hundred digits below; chances are that your intuition will also mislead you into thinking that there are some patterns. This can be explained by the fact that patterns such as 99 are easily detected by the human brain and do stand out visually, yet in this case they do occur with the right frequency if you use analytic tools to analyze the digits.

First few hundred digits of SQRT(2). Do you see any pattern?

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.


First Doctorship in Data Science

You may even call it a post-doctorship, as the level is beyond the traditional PhD degree. It is not a degree, and it does not compete with university programs; rather, it is akin to a fellowship or apprenticeship in which you learn to do state-of-the-art applied research, discover ground-breaking results or applications, and translate your discoveries into seminal material suitable for a broad audience. It is intended for professionals with substantial experience, perhaps people who already have a PhD in a different field. It is mentored by well-connected, world-class scientists (not necessarily affiliated with a university) with broad expertise across many environments. The focus is on real-world problems and applications, to help you get a high-level position in the industry or as an independent researcher.

The idea to create such a program stems from the fact that a PhD degree is sometimes perceived negatively when looking for an Industry job, and some candidates hide it in their resume. This is because a PhD degree is supposed to prepare you for Academic research, but due to the large supply of PhD’s and a shrinking job market for tenured positions in Academia, many are left in precarious situations.

Source for picture: Women in data science

This doctorship is designed to make you a leading scientist in the Industry: an executive data scientist or chief scientist, for instance. It does not involve writing and defending a thesis, nor publishing esoteric papers in scientific journals read by few; instead, you quickly disseminate your research, explained in simple English, to a broad audience of practitioners, possibly to attract VC funding if you want to create a start-up out of it. The standards are by no means inferior to those of traditional PhD programs; they are just very different. The length of this “mentorship” could be as short as two years, and it could be carried out part-time, remotely, while holding a full-time job at the same time.

Features of the Proposed Doctorship

  • Short duration (2 years)
  • Publication in niche media outlets (like DSC), not in scientific journals
  • No teaching load
  • Done remotely, part-time
  • Cross-disciplinary research
  • Not attached to a university program
  • No thesis, no defense
  • Focus on Industry problems
  • Research not influenced by grant money or politics
  • Candidate gets an apprenticeship in the corporate world, related to her research
  • Not a degree
  • Not meant as a substitute for PhD programs

Example of Research 

I have one example in mind; it is what I would offer if I could find the time to start such an endeavor. The research in question is at the intersection of data science, dynamical systems, stochastic processes, computer science, and number theory, with applications to cryptography, Fintech, Blockchain, security, and high-performance / high-precision computing. Side projects could include the design of continuous random number generators, replacing standard statistical tests and p-values with better tools, or proving that the digits of some mathematical constants are randomly distributed (this would be a fundamental result). See here for details. I believe this research could lead to ground-breaking discoveries and nice applications.

I don’t even have a PhD in data science; such PhD’s did not exist back then. Instead, my PhD was in computational statistics. Do I qualify to run such a program? Of course, it is not for me to answer that question. You can check my career path here and judge for yourself.

The problem is, how many qualified experts are willing to take on the challenge of offering this type of mentorship? And how many professionals are willing to join such a program?

Related article

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.


Weekly Digest, March 26

Monday newsletter published by Data Science Central. Previous editions can be found here.  The contribution flagged with a + is our selection for the picture of the week.

Featured Resources and Technical Contributions

Featured Articles

Picture of the Week

Source for picture: contribution marked with a +


Two Beautiful Mathematical Results

These results are relatively easy to prove (first year college-level calculus needed) and could be a good test to refresh your math skills. We posted another simple one (with a probabilistic / number theory flavor) a while back, see here.

Prove the following two results:

(1)  1 - 1/4 + 1/7 - 1/10 + 1/13 - … = (log 2) / 3 + Pi / (3 SQRT(3)),

(2)  1 + 1/4 + 1/7 + … + 1/(3n+1) = (log n) / 3 + g / 3 + (log 3) / 2 + Pi / (6 SQRT(3)) + o(1), as n tends to infinity,

where g = 0.5772156649… is the Euler–Mascheroni constant.

Solution:

Let us introduce the function f(x), defined as follows:

f(x) = x + x^4 / 4 + x^7 / 7 + x^10 / 10 + …, that is, the sum of x^(3k+1) / (3k+1) over k = 0, 1, 2, …

The answer to the first question is simply –f(-1). How do we compute it? Here is the key: differentiating term by term gives f’(x) = 1 + x^3 + x^6 + … = 1 / (1 - x^3) for |x| < 1, so f(x) is the integral of 1 / (1 - t^3) for t between 0 and x.

An exact solution is available for this integral (I computed similar integrals as exercises during my last year of high school), and the result can be found here. This WolframAlpha tool allows you to perform the cumbersome but simple, mechanical, boring computations automatically. The result involves logarithms and Arctan( (2x+1) / SQRT(3) ), which has known values (a fraction of Pi) if x = 0, 1, or -1.
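
For reference, here is the worked form of this integral, obtained by hand from the partial fraction decomposition 1 / (1 - t^3) = (1/3) / (1 - t) + (t + 2) / (3 (1 + t + t^2)); it matches the logarithm and Arctan terms mentioned above:

    \int_0^x \frac{dt}{1-t^3}
      = -\frac{1}{3}\log(1-x) + \frac{1}{6}\log\left(1+x+x^2\right)
        + \frac{1}{\sqrt{3}}\left(\arctan\frac{2x+1}{\sqrt{3}} - \frac{\pi}{6}\right)

Setting x = -1 yields -f(-1) = (log 2) / 3 + Pi / (3 SQRT(3)) ≈ 0.8356, the value of the alternating series in the first question.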

To answer the second question, one can use generalized harmonic numbers H(x) with x = 1/3 (see details here), together with the well-known asymptotic expansion of the diverging harmonic series (this is where the Euler–Mascheroni constant appears). The mechanics behind this are as follows. Consider A = B + C, with

A = 1 + 1/4 + 1/7 + … + 1/(3n+1),   B = (1/3) (1 + 1/2 + 1/3 + … + 1/n).

Both A and B diverge, but C converges. We are interested in an asymptotic expansion, here, for A. The asymptotic expansion for B is well known: it involves the Euler–Mascheroni constant, and (log n) / 3. The computation for C is linked to the function f(x) as x tends to 1, and brings in the other numbers in the final result: log(3), and Pi / SQRT(3).
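
In symbols, under the above reconstruction of A and B:

    B = \frac{1}{3}\sum_{k=1}^{n}\frac{1}{k}
      = \frac{\log n}{3} + \frac{\gamma}{3} + O\!\left(\frac{1}{n}\right),
    \qquad
    C = A - B = 1 - \sum_{k=1}^{n}\frac{1}{3k(3k+1)}
      \to \frac{\log 3}{2} + \frac{\pi}{6\sqrt{3}}.

Adding the two limits gives the constant term g/3 + (log 3)/2 + Pi/(6 SQRT(3)) in the second result.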

Note that if we define h(x) as

then C = h(1), and h(x) can be computed using a technique and an integral very similar to those used for f(x).

Generalization:

The idea is to decompose a function g(x) that has a Taylor series into three components. It generalizes the well-known decomposition into two components (the even and the odd parts of g). It goes as follows:

g(x) = g_0(x) + g_1(x) + g_2(x),  with g_j(x) the sum of the terms a_(3k+j) x^(3k+j) over k = 0, 1, 2, … (j = 0, 1, 2),

where the a’s are the Taylor coefficients.
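
Each component can be extracted from g itself. This is the classic series multisection via cube roots of unity (presumably what the original formula expressed; stated here as a reconstruction):

    g_j(x) = \sum_{k=0}^{\infty} a_{3k+j}\, x^{3k+j}
           = \frac{1}{3}\sum_{m=0}^{2} \omega^{-jm}\, g(\omega^m x),
    \qquad \omega = e^{2\pi i/3}, \quad j = 0, 1, 2.

With two components, the analogous formulas are (g(x) + g(-x)) / 2 and (g(x) - g(-x)) / 2, the even and odd parts. In our problem, f(x) is precisely the j = 1 component of -log(1 - x).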

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.


Weekly Digest, March 19

Monday newsletter published by Data Science Central. Previous editions can be found here.  The contribution flagged with a + is our selection for the picture of the week.

Announcement

  • IBM predicts that the demand for data scientists will soar 28% by 2020. Take advantage of this opportunity and enhance your skill set with an online data analytics degree. Learn to collect, analyze, and model data, design and use databases, and apply analytics to high volume data problems.
    Learn more . . . Submit your application and start in fall.

Featured Resources and Technical Contributions

Featured Articles

Forum Questions

Picture of the Week

Source for picture: contribution marked with a +

From our Sponsors
