It would be great if resampling techniques (bootstrapping, cross-validation, jackknife, etc.) were better known among programmers. They are incredibly powerful (in the common-sense meaning of the word) statistical techniques that fit very well into the worldview of programmers, avoid many of the pitfalls and assumptions of parametric statistics, and work well in the "big data" world. I wish they were the techniques I had learned at university: it would have put tools like real hypothesis tests in my day-to-day toolkit a decade sooner.
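To show how little machinery the bootstrap needs, here is a minimal sketch (plain NumPy, made-up data): resample with replacement, recompute the statistic, and read a confidence interval straight off the resampled distribution, no normality assumption required.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(loc=10.0, scale=2.0, size=200)   # e.g. latencies under config A
b = rng.normal(loc=10.5, scale=2.0, size=200)   # e.g. latencies under config B

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)  # resample with replacement
        rb = rng.choice(b, size=b.size, replace=True)
        diffs[i] = rb.mean() - ra.mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for the mean difference: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes 0, that's your "real hypothesis test".
```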
Also Monte Carlo. All statistics problems can be framed as a data-generating process encoded by a probabilistic program (stochastic lambda calculus) plus an inference method like MCMC. See https://probmods.org.
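A toy sketch of that framing (all names and data here are illustrative): the "probabilistic program" is just the likelihood of a biased coin, and inference is a bare-bones Metropolis (MCMC) sampler over the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = rng.random(100) < 0.7          # pretend we saw 100 flips of a ~0.7 coin
heads, n = observed.sum(), observed.size

def log_posterior(p):
    # Uniform prior on (0, 1) plus the Bernoulli likelihood of the observed flips.
    if not 0.0 < p < 1.0:
        return -np.inf
    return heads * np.log(p) + (n - heads) * np.log(1.0 - p)

def metropolis(n_samples=20_000, step=0.05):
    p, samples = 0.5, []
    for _ in range(n_samples):
        proposal = p + rng.normal(0, step)          # random-walk proposal
        if np.log(rng.random()) < log_posterior(proposal) - log_posterior(p):
            p = proposal                            # accept
        samples.append(p)
    return np.array(samples[5_000:])                # drop burn-in

samples = metropolis()
print(f"posterior mean bias: {samples.mean():.2f}")  # should land near 0.7
```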
As a programmer, it is very pleasant to understand the duality between lambda calculus and statistics. It really fits our worldview.
Maybe it's just wishful thinking, but I think it's catching on. My first exposure (or at least the one that stuck) was seeing seaborn's lineplot calculate and plot a bootstrap confidence interval without me even asking, wondering "huh, how did it do that?", and doing the research.
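For anyone curious, the behaviour is easy to reproduce. A rough sketch with synthetic data (assuming seaborn >= 0.12, where the default errorbar is a bootstrapped 95% CI):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.tile(np.arange(20), 30)                  # 30 noisy "subjects" per x value
y = np.sin(x / 3) + rng.normal(0, 0.4, x.size)  # noisy signal
df = pd.DataFrame({"x": x, "y": y})

# The shaded band around the mean line is a bootstrapped 95% CI.
sns.lineplot(data=df, x="x", y="y", errorbar=("ci", 95), n_boot=1000)
plt.show()
```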
I like this because I think there’s a common misconception that a lot of the ideas and algorithms we use today are newer than they really are. We work with a lot of algorithms that were actually invented and characterized in the 50s and 60s, but became academic curiosities because you would need a computer with hundreds of millions of words of storage, or hundreds of MIPS, to apply them to nontrivial problems. It seems like in the 2000s people in several fields finally realized that we had that and more, and had the idea to raid the literature for that stuff. This is an example of something similar, just further back.
Yeah, I like the Fisher-Yates shuffle, which was originally a paper-and-pencil algorithm from the 1930s.
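The modern in-place form (the Durstenfeld variant) is only a few lines; a quick sketch in Python, which is essentially what random.shuffle does:

```python
import random

def fisher_yates_shuffle(items):
    """Uniformly shuffle a list in place in O(n)."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)                # pick from the not-yet-fixed prefix
        items[i], items[j] = items[j], items[i]

deck = list(range(10))
fisher_yates_shuffle(deck)
print(deck)
```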
Energy has always been the major constraint in computing, now fiercely joined by communication latency and bandwidth. I’m excited for what the future will bring in terms of both. Nice article!
Latency remains the biggest problem, and will only get bigger. For example, since the 1980s hard drive capacity has gone up ~2,000,000x. Bandwidth per device has gone up ~250x. Latency has only improved around 14x. Same story across networks, memories, and pretty much everything else you can think of. There have been a couple major shifts (e.g. SSDs replacing spinning media), but even they haven’t changed the trends.