Sometimes rough data can take the form of outliers emerging from the natural course of the human experience—our diversity of cultures, histories, economies, belief systems, etc.
When humanity coexists in our cosmopolitan era, smoothing data threatens to erase this texture, elide minority populations, and pathologize nonconforming behaviors. In this sense, rough data is the pebble in the shoe of our globalized information age. But experiences and events that don’t fit neatly into normal distributions and bell curves aren’t limited to being the minor exception or the statistical irritant. No. They can be positively enormous: the asteroid that wipes out the dinosaurs, the mortgage-backed securities that sink the global economy, or the disinformation avalanche that’s disrupting our democracy. In these cases, the one-offs are so overwhelming that they render the rest of the data in the set insignificant.
We can learn a lot by looking at the history of data science. This history is punctuated by recurring conflicts over what data get smoothed, what outliers can be thrown out as noise, and what outliers redefine the signal itself. What I’ve come to realize is that not all data sets are the same, and thus the consequences of smoothing differ in significance, depending on context.
Nassim Nicholas Taleb’s book The Black Swan: The Impact of the Highly Improbable offers a useful rubric for understanding the stakes of outliers as they exist in two antithetical models of distribution. He illustrates the distinction with two imaginary statistical countries: Mediocristan and Extremistan.
Data in Mediocristan are distributed normally; they’re Gaussian, a bell curve, and basically human in scale. For example, imagine the distribution of human weight and height. Some people are twice as tall as others, and the heaviest might weigh seven or eight times what the lightest does, but if you gather up a thousand people, they’ll cluster towards average weight and height, and the measure of any single outlier isn’t going to significantly change the aggregate data of the thousand.
Extremistan, however, is defined by the impact that a single outlier can have on an entire data set. If you were to gather a random thousand people and measure their net worth, they might well distribute along a bell curve too, within, say, a two-hundred-fold range (the poorest at $15K and the richest at $3 million, a range that might somehow pass as normal in today’s America). But if you added Bill Gates to the data set, his wealth would render the entirety of the rest of the data set inconsequential, almost immeasurably small next to his tens of billions. His isn’t one data point among many; rather, it renders all other data points virtually nonexistent.
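To make the contrast concrete, here is a minimal Python sketch, not anything taken from Taleb’s book. It draws a thousand hypothetical heights and a thousand hypothetical net worths, then adds one extreme value to each set. Every number in it is an assumption chosen for illustration: heights are Gaussian around 170 cm, net worths are spread evenly between the $15K and $3 million mentioned above, the extreme height is 250 cm, and "tens of billions" is set at $50 billion. The helper names `outlier_share` and `mean_shift` are made up for this example.

```python
import random
import statistics


def outlier_share(values, outlier):
    """Fraction of the combined total that comes from the single outlier."""
    return outlier / (sum(values) + outlier)


def mean_shift(values, outlier):
    """Ratio of the mean after adding the outlier to the mean before."""
    before = statistics.fmean(values)
    after = statistics.fmean(list(values) + [outlier])
    return after / before


rng = random.Random(0)  # fixed seed so repeated runs draw the same sample

# Mediocristan: a thousand heights in cm (assumed mean 170, std dev 10).
heights = [rng.gauss(170, 10) for _ in range(1000)]
extreme_height = 250.0  # assumed: an extraordinarily tall person

# Extremistan: a thousand net worths in dollars, within the essay's range.
# A uniform spread is an assumption made only to keep the sketch simple.
net_worths = [rng.uniform(15_000, 3_000_000) for _ in range(1000)]
extreme_wealth = 50e9  # assumed stand-in for "tens of billions"

height_share = outlier_share(heights, extreme_height)
wealth_share = outlier_share(net_worths, extreme_wealth)
height_shift = mean_shift(heights, extreme_height)
wealth_shift = mean_shift(net_worths, extreme_wealth)

print(f"Tallest person's share of total height: {height_share:.2%}")
print(f"Richest person's share of total wealth: {wealth_share:.2%}")
print(f"Mean height multiplies by {height_shift:.4f}")
print(f"Mean net worth multiplies by {wealth_shift:.1f}")
```

Under these assumptions the extreme height accounts for well under one percent of the total and barely moves the average, while the single fortune accounts for more than nine-tenths of all the wealth in the room and multiplies the average many times over. A different set of assumed figures would change the exact numbers, but not the lopsidedness.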
The difference between data points in Mediocristan and Extremistan isn’t easily intuited. Our minds evolved to think and sense on a human scale. That cheetah might run three times as fast as we do, but we’re not going to get chased by anything that runs a million times faster than we do. My mile splits (and those of every human runner on earth) reside in Mediocristan. When they get smoothed, we don’t lose a lot in our fidelity to reality. A parent might reasonably ignore the possibility that their child will grow to be seven feet tall, and I have no chance of running a four-minute mile. Emotionally, I’m much more likely to feel upset by having a hundred thousand dollars while my neighbor has two hundred thousand than I am to be upset that Bill Gates has billions. This is because we experience human emotions like jealousy in Mediocristan.
But in our era, Taleb shows how reality can be turned on its head by the dynamics of Extremistan. He writes:
So while weight, height, and calorie consumption are from Mediocristan, wealth is not. Almost all social matters are from Extremistan. Another way to say it is that social quantities are informational, not physical: you cannot touch them. (p. 33)
Extremistan is characterized by the velocity and scale that are celebrated in Silicon Valley business models and pro-globalization evangelism. The rest of us, however, can’t be as bullish about the way information can defy gravity. One of the consequences of being informational is that social dynamics, viral campaigns, political propaganda, and unconscious irrationality can flow without expending energy. Information surfs social networks, and swarms cluster around certain pieces of information and run with them, unchecked by the gravitational force of reality or counter-argument. In Extremistan, everything is scalable, including Russian propaganda and political mischief by disgruntled young white men (trolls, hackers, and gamers) whose impact had previously been limited by the fact that they’re living in their parents’ basement. Surfing the information superhighway and amplified by their canny gaming of information systems, only a few of them are needed to hijack our cognition, especially when our cognition was designed to respond to very different threats.
Smooth data models all predicted a Clinton victory. Ignoring the possibility of roughness denies our humanity. Indeed, our worst impulses—tribalist chauvinisms and racisms, us-vs-them thinking, paranoia and suspicion, etc.—sit next to our highest human ambitions—breakthrough creative works, caring for others, fighting against wrongs—as outliers that can go viral, capturing our imagination and shaping our culture, for better and worse. Neither might look normal, and smoothing data can blind us to both, as it blinds us to the near certainty that whatever is going to happen next, we’re not going to see it coming.