Analysing raw data means finding and understanding complex patterns in it. We all have a toolbox of techniques and methodologies; the more tools we have, the better we are at the job of analysis. Some of these tools, such as data mining, are well known. This talk covers some less well-known techniques that are just as directly applicable to this kind of analytics: Monte Carlo simulations (MCS), Nyquist’s Theorem, Benford’s Law and Simpson’s paradox.

It turns out that some problems are very, very difficult to solve by standard mathematics. Take the random walk problem. If you stand in a desert and take each step in a random direction, how far are you from your origin (on average) after n steps? Given that Albert Einstein worked on the problem, the mathematical proof is obviously non-trivial (unlike the answer, which is simply the square root of n, measured in step lengths as a root-mean-square average). That self-same problem can be solved using a Monte Carlo simulation, which can be written in just nine lines of code.

I fully understand that you are unlikely to want to solve this particular problem (for a start, it’s already been solved and, in addition, who would really walk like that in a desert?). But the point is that many problems, from understanding web page usage to the behaviour of users of a video-on-demand system, are horribly difficult to solve mathematically and ludicrously easy (in relative terms) to solve using a Monte Carlo simulation.

Nyquist’s Theorem can tell us, for example, how often to sample smart electricity meters, and much more. Benford’s Law is fabulously powerful for detecting fraud, among other things.

This session will NOT include any heavy (or, indeed, light!) mathematics; instead, it will focus on how understanding these techniques can help you understand complex data patterns in your day-to-day work. It is a shortened version of a pre-conference, day-long session that I gave at PASS earlier this year.
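
To give a flavour of the random walk example, here is a minimal sketch of such a simulation in Python. It is not the speaker's own nine-line version; the function name rms_distance, its parameters and the fixed seed are all assumptions made for illustration. It assumes unit-length steps on a flat plane and reports the root-mean-square distance from the origin, which is the sense of "average" for which the square root of n result holds.

```python
import math
import random


def rms_distance(n_steps, n_walks=5000, seed=0):
    """Estimate the root-mean-square distance from the origin
    after n_steps unit-length steps in random directions."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    total_squared = 0.0
    for _ in range(n_walks):
        x = y = 0.0
        for _ in range(n_steps):
            # Pick a direction uniformly at random and take one unit step.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            x += math.cos(angle)
            y += math.sin(angle)
        total_squared += x * x + y * y
    # Average the squared distances over all walks, then take the root.
    return math.sqrt(total_squared / n_walks)


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, round(rms_distance(n), 2), "vs sqrt(n) =", math.sqrt(n))
```

With enough simulated walks, the estimate for each n should settle close to the square root of n; more walks tighten the estimate at the cost of a longer run.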
Presented by Mark Whitehorn at SQLBits XIV