Vaccines and bad reasoning

Author, January 2014

There seems to be no shortage of semi-celebrities willing to spread misinformation and bad advice about vaccines. A recent one is “Food Babe” Vani Hari. While she is not as totally scientifically illiterate as Jenny McCarthy, her information and reasoning are still bad and dangerous. I see three primary flaws in the reasoning of anti-vaxxers: epistemological, mathematical, and ethical. Epistemology refers to methods of determining what's true about the world—actions should be based on good information. Mathematics is necessary to evaluate probabilities of events that contain some randomness, as most things in life do, and to reason about the likely outcomes of our decisions. Ethical philosophy helps us clarify what values are important to us, and how to act on them.

Epistemology

Anti-vaxxers continue to spread information which is simply false, and Hari is no exception. They get even the simplest, most easily verified facts like the ingredients of vaccines wrong, and never issue retractions when their errors are pointed out to them, because they often base their facts on their cause, rather than choosing their cause based on the facts. This weakness is common to all of us: it's called motivated reasoning, and is part of how all human minds work. Fighting it is difficult, but not impossible. Science itself is really nothing more than methods and practices designed to find the truth in spite of how our brains constantly mislead us. Sensory misperceptions, faulty memory, prejudices, loyalties, and all kinds of other cognitive biases make it hard to find how reality works underneath our perceptions, but science has had a remarkable track record of success finding even those truths that fly in the face of common sense or personal experience. Nothing else even comes close.

One of the most important methods of science is independent replication. It's an important principle that no matter how well-respected a scientist is, how carefully she designs a study, how well she argues her case, the results of one study are meaningless. A fact only starts to become scientific when other scientists from different schools, different countries—and different cognitive biases—try their hardest to prove the first scientist wrong and fail. Only after many decades of many studies from different points of view does a scientific consensus emerge, and even then there will be a few stray detractors. Replication even catches outright fraud, like Piltdown Man and the Andrew Wakefield studies. Even otherwise good science reporters get this wrong all the time: they report on the results of a single study as if it is a “discovery” or a “new result”. A study is just a study, and in fact most studies—even the ones done by very good scientists—turn out to be wrong in some significant way, simply because most studies are at the edges of our understanding. It's not single scientists or single results that are important, it is the process of science as a whole, and that process reaches consensus understandings that teach us amazing things about the world and make all of our lives better.

In addition to important methods like replication, controlled studies, peer review, and others, science also has one critically important feature that other “ways of knowing” don't: it can change. Reporters often get this one wrong too, reporting on some dissenting scientist as a “maverick” fighting an entrenched orthodoxy. Most dissenters are simply wrong, but occasionally one really does discover something new and revolutionary. The scientific process assures that the only arbiter of who is right is reality itself, as judged by experiment. There is no sacred text, no infallible authority. Einstein's general relativity became accepted not because Einstein himself was popular or respected, or because his papers were convincing, but because hundreds of experiments by other scientists have confirmed it. Every time you use your GPS, you're testing general relativity yourself. Even the methods of science change. Study methods get better, communication improves, scientists debate. Contrast this to things like acupuncture or ayurvedic medicine, based on texts thousands of years old, before we even knew about bacteria or viruses or genetics or biochemistry.

It's also common to place too much emphasis on the source of information. Yes, some sources are more reliable than others, but facts are ultimately judged by reality itself, not who espouses them. Some people think that because vaccines are big business, any information that supports them is sponsored by corporations and engineered for sales; or that the government's promotion of them is for the purposes of power and influence. Or they may think that because someone like Dr. Oz is friendly and caring that his advice is worth taking, even though he routinely spouts unscientific nonsense and real science with equal confidence. I sympathize with distrust of government and big pharma. They both have done genuinely despicable things. I personally believe drugs should not be patentable, and that there should be more competition in their manufacture. But none of these considerations affect the facts, the bulk of which come from studies by individual doctors, universities, and governments around the world, whose overwhelming results are simply beyond the capacity of even these giants to manipulate. Sure, drug companies can and do influence studies to make drugs look more effective or safer than they are, but they can't possibly fake numbers like over 750,000 US measles cases in 1958, about 20,000 by 1968 after the vaccine was introduced in 1963, and varying but similarly small numbers ever since. Numbers like that, repeated over many diseases for which we have similar numbers, are simply beyond politics. Anti-vaxxers are quick to note—correctly—that things like sanitation and refrigeration had an even greater impact on disease before vaccines came along, but that does not diminish the proven impact of vaccines. Denying the millions of people saved by vaccines is like Holocaust denial in reverse.

Iron lung ward during Boston epidemic. March of Dimes photo.

So it's important to take as your facts about vaccines the general scientific consensus. Not what any single study says, not what any single expert says no matter how credible, not what you want to be true, not what's popular in your culture, not what aligns with your political or religious beliefs, not what a trusted friend tells you—not even what I tell you—but what the totality of the greatest intellectual enterprise in human history currently agrees to be our best understanding. The current consensus understanding about vaccines is not at all controversial, despite what other pundits may tell you. The facts are clear: vaccines work. Being vaccinated greatly reduces your chances of getting and spreading the disease for which you are vaccinated. No vaccine is 100% effective; some are more effective than others, but all of them work pretty well. The best examples are smallpox and polio. Smallpox killed an estimated 300-500 million people in the 20th century alone, and is now extinct in the wild. The last naturally occurring case of the deadliest form of what was once the scourge of humanity was in October 1975, in a two-year-old Bangladeshi girl, Rahima Banu. Polio used to fill hospital wards with children in iron lungs, and is now eliminated everywhere in the world except a few places that fight vaccination programs for religious reasons. Probably the least effective is the annual flu vaccine. Flu is a difficult target, and our aim varies. But even this vaccine has a risk-reward function that favors getting vaccinated for most people (there are exceptions your doctor will tell you about). And flu is not just a nuisance—it really kills people. The Food Babe's advice to “encounter the flu naturally” is uneducated, dangerous, and irresponsible. People encountered smallpox and polio naturally too; that's why we invented vaccines.

Math

Even those who accept the scientific consensus often make simple errors of mathematics in evaluating the likely outcomes of their choices. The primary mistake is to rely on simple black-and-white thinking instead of probabilities. The world is a complex place, and the most complex parts of it are biological things like humans and viruses. Simple statements like “A causes B” or “A prevents B” are generally not an accurate picture of how things happen in biology. Randomness plays a big part. Many people who never smoke get lung cancer, and many people who smoke all their lives don't. That doesn't mean there can't be a causal relationship. There's a very strong one: smokers are 10 to 20 times more likely to get lung cancer (men more likely than women, for reasons still unknown).

Some people who get vaccinated still get sick. And many people who don't get vaccinated live long, healthy lives. That doesn't mean there's no causal link. Most vaccines are a slam dunk: The introduction of the polio, measles, mumps, and rubella vaccines reduced the total cases of those diseases by over 99%. Pertussis (whooping cough) was reduced by over 95%. Because its vaccine is somewhat less effective, pertussis depends more on herd immunity, which is why recent outbreaks of whooping cough have arisen in communities with low vaccination rates. Some are far less effective, but still useful: The yearly flu vaccine reduces flu rates by about 60%.

People also don't properly compare the risks of multiple choices. For example, many people at the blackjack table refuse to hit a 16 when the dealer shows a 10, because the risk of busting is high. They are correct about that: hitting will cause you to lose 53% of the time. But standing will cause you to lose 54% of the time. Hitting is risky, but standing is worse. You hit not because you want to win, but because you want to lose less.

Vaccines can have real side effects. These too are variable, random, and unpredictable. Most are mild (pain at the injection site, fever, rash), but a few are indeed serious. That's why the US has a fund to compensate people injured by vaccines. The risks are real, but just saying that vaccines are “unsafe” because there are risks isn't telling the whole story. There are two choices, and neither is “safe”. Reality, which is under no obligation to make life fair, simply doesn't offer any safe option. You have to choose one risk or the other: hit or stand, either vaccinate and risk side effects or don't vaccinate and risk disease. Making that choice requires real numbers and real judgment about the seriousness of the consequences. The MMR vaccine, for example, has caused anaphylaxis (a serious allergic reaction) in patients because it can contain egg proteins and gelatin. But the numbers are tiny: most doctors have never seen a single case, just as most have never seen a case of measles or mumps since vaccines began.

The final risk/reward calculation involves a branch of math called decision theory. This is a pretty simple case: the tiny odds of the two bad outcomes (disease, vaccine reaction) have to be multiplied by the “badness” of each outcome to get what's called the expected value of each choice. This math pretty heavily favors choice one: vaccinate. The downside of disease is much worse than the downside of a reaction to the vaccine, and the odds of both are tiny.
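
To make that arithmetic concrete, here is a minimal sketch of the calculation in Python. Every number in it is a made-up placeholder chosen only to show the shape of the computation; real probabilities and real measures of harm have to come from your doctor and the medical literature, not from me.

# Sketch of the expected-value comparison described above. All numbers are
# hypothetical placeholders, NOT real medical statistics.

p_disease_if_unvaccinated = 0.01      # hypothetical probability
p_serious_vaccine_reaction = 0.00001  # hypothetical probability
badness_of_disease = 1000.0           # hypothetical "badness" score
badness_of_reaction = 500.0           # hypothetical "badness" score

expected_cost_vaccinate = p_serious_vaccine_reaction * badness_of_reaction
expected_cost_decline = p_disease_if_unvaccinated * badness_of_disease

print("Expected cost if you vaccinate:", expected_cost_vaccinate)
print("Expected cost if you decline:  ", expected_cost_decline)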

Ethics

Even if you accept the facts and the math, there are still people who decline to vaccinate for what they believe are ethical reasons. I value freedom, and I'm quite sympathetic to those who claim the right not to vaccinate themselves even if I disagree with their reasoning. But even dedicated libertarians acknowledge an important limit to freedom: your freedom to swing your fist ends at my nose. When your choices affect others, I have a right to object and a duty to speak up. We are of course outraged when a Muslim family kills their daughter for being raped to save the family “honor”. I see no moral difference between that and Christian Scientist parents who kill their daughter by denying necessary life-saving medical care. Both can and have been put in jail, and rightly so. I am not suggesting that failure to vaccinate your children rises to that level, but the issue is similar. Simply crying “freedom” is not sufficient moral justification for endangering the lives of your children and those they encounter. I do believe it is entirely appropriate for schools, for example, to refuse to enroll unvaccinated children. They too have the freedom, and perhaps even the duty, to protect other students from your choices.

Some people have the feeling that not acting has fewer moral implications than acting. A classic example is the trolley problem: you see a train racing out of control toward five people some evildoer has tied to the track, and who are too far away for you to warn. A switch near you will divert the train onto a side track where a single person is in the same predicament. Do you throw the switch? Even though throwing the switch will save lives, somehow not throwing the switch seems less blameworthy to many. Likewise, they believe that if they vaccinate their children, who then suffer side effects, that is somehow more blameworthy than if they decline to vaccinate and the children get the disease. But the choice is conscious either way: avoiding the “active” choice is not being conservative, it is moral cowardice.

I can't make the choice for you. But the picture above should make it clear what my choice is, and I urge you to use the right tools in making yours.

Who owns culture?

At the recent Webby awards, Steve Wilhite, the primary creator of the GIF image format, showed gratitude for his award by berating the public for not pronouncing “GIF” the way he does. A pretty silly thing to be upset about, as is my being upset with his pique. But it is one more example of a larger problem: who owns and controls our culture?

Culture—things like language, music, art, food, social customs and rituals—is a creation of the human mind, and often individual pieces of it are created by individual people. Those people certainly deserve credit when their creations become popular. Our government even has laws like copyright intended to encourage the creation of certain kinds of culture by giving their creators a limited commercial monopoly on them. But many people mistakenly take that as support for the idea that creators "own" their creations in some way, and have an absolute right to control how they are used.

A culture cannot possibly grow with such a crippling restriction. Thankfully, many things like mathematical and scientific discoveries, food recipes, and athletic techniques are not subject to such monopolies or our culture would grind to a halt completely. Culture depends on a thriving public domain. The “public domain” is the art, music, literature of a culture that is not owned or controlled, like the music of Bach and Mozart, the works of Shakespeare and Dickens, the art of Michelangelo, the inventions of Archimedes. These artists lived in a world where copyright and patent did not exist, but even after these were created, their limited terms ensured that eventually all works would pass into the public domain and enrich our culture by allowing homage, parody, remixing, and other creative uses that the original creator might never have imagined.

Patents are still, thankfully, limited, so technology can still progress by building on the past. But Congress has over the years extended the term of copyright to ludicrous lengths, passing a new copyright law coincidentally every time Walt Disney's “Steamboat Willie” cartoon of 1928 is in danger of slipping into the public domain. Walt Disney himself died in 1966, so copyrights aren't really encouraging him to create more—at this point they're basically corporate welfare for Disney, a company that has made much of its fortune exploiting fairy tales and other public-domain stories. Woe unto the artist who wants to use the image of Mickey Mouse in any way not approved of by Disney: lawyers will descend, ensuring that no one is allowed to enrich our culture in this way unless they also enrich Bob Iger and company.

What could be more a part of American culture than singing “Happy Birthday”? Well, if you do that in a restaurant, or in a movie, be prepared for Time Warner to vigorously defend the rights of its creators, Patty and Mildred Hill, who wrote it in 1893 and are both long dead. There is considerable legal doubt as to whether this copyright claim is legally sound, but there is no question that people have been and continue to be sued over it.

But back to GIF. Wilhite created the original format in 1987 as a way to transmit images over the CompuServe network. The original version wasn't quite up to the job, so a group of graphics programmers on CompuServe (including me, CIS 73407,2030) convened to update it. (For a real walk down memory lane, check out this paper I wrote at the time. It's a detailed explanation of a graphics technique written in plain text, before GIF, before the Web). We produced a specification for GIF 89a, which is what has been used since. After the specification was complete, the powers that be at CompuServe decided to add a paragraph declaring that the acronym should be pronounced with a soft G, thereby confusing pictures with peanut butter. Even at the time, this was a bone of contention. I and many other people who had already been using GIF for some time had always pronounced it with the hard G—after all, it stands for “graphics”. But our objections were ignored.

That's fine—I don't really care how you pronounce it. But I do care that CompuServe, and Wilhite, think they have some right to tell you and me how we should pronounce it. It's a word. It's part of our language. It's in many dictionaries now. And all of those dictionaries—absolutely correctly—include both pronunciations. Because that's how people pronounce the word, and words belong to the people who use them, not to the people who created them.

So stand up to corporate hijacking of our language. Sing Happy Birthday in a restaurant. Call your company a Mickey Mouse operation. Xerox something on your Canon photocopier. And trade GIFs on the net, pronouncing them any way you like. It's our language, our culture, not theirs.

Representing playing cards in software

There are several different ways to represent playing cards in software, each with its own benefits, drawbacks, and best application. I want to outline these, and explain why I chose the particular integer representation used in the OneJoker card library.

A standard deck from Copag, popular in casino poker rooms.

What is a card?

The set of cards in a standard Anglo-American deck is the Cartesian product of two sets: 13 ranks and 4 “French” suits, each card having one of each. Operations on cards typically involve comparing the ranks of two cards based on an ordering dependent on the game, and comparing suits for equality. Many games also use one or more jokers, which have neither rank nor suit. Decks of cards today are manufactured with two jokers, one of which is typically printed in black only, and the other in color (or distinguished in some similar way). Games that distinguish between these often call them “red” and “black” by analogy to suited cards.

Whatever representation is used, it is useful to be able to get the rank or suit of a card as a small integer that can be used to index lookup tables and compute sums. Direct comparison of ranks and identifying sequences are also common, but ordering varies from one game to another, so this should be done carefully. If an application is designed for one game, it pays to choose a representation suited to that game. For example, poker applications should use a rank encoding that gives the lowest number to deuces and the highest to aces, so that ranks can be compared directly.

Text

While interactive games will display cards to the user as graphical images and accept input from a mouse, other applications that use playing cards must at some point acquire input and produce output as text for humans. A common and effective method is to use one digit or letter for the card's rank and another for the suit: 2c, 9h, Qd, As, etc. The letter T is usually used for tens to keep these strings uniform. This is a good way to save card information in text files, to communicate them over network protocols, and so on. It is common to use JK to represent the joker. It is not common at all to distinguish between the red and black jokers, though some games require it. I recommend using JR for the red joker when the distinction matters, and JK for the black (or when the distinction doesn't matter).

In the spirit of the networking axiom “Be conservative in what you produce, liberal in what you accept”, I recommend that cards be consistently written in this two-character format, uppercase rank and lowercase suit, with a space between cards when representing a list or set. When reading such a list, one can be more liberal by accepting case differences, extra whitespace, no whitespace, or even 10 if uniformity is not required. If such text is for human consumption only (such as running text on a web page or printed book not likely to ever be read by a program), one might use the Unicode suit symbols (♣ ♦ ♥ ♠) as well as red and black text, but these are awkward for use in 8-bit data formats.
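
As a sketch of that convention in Python (the helper names here are my own, not from any particular library, and joker strings are left out for brevity):

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def card_to_text(rank, suit):
    """Format a (rank, suit) pair of small integers as e.g. 'Qd'."""
    return RANKS[rank] + SUITS[suit]

def text_to_card(text):
    """Parse liberally: accept '10' for ten and either case."""
    text = text.strip().replace("10", "T")
    return RANKS.index(text[0].upper()), SUITS.index(text[1].lower())

def parse_cards(line):
    """Parse a whitespace-separated list like 'As Td 2c'."""
    return [text_to_card(tok) for tok in line.split()]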

Using such a text representation of cards internally for code that runs a game or simulation is always a bad idea. There is no programming language or application I know of for which such an internal representation does not lead to loss of performance and excessive memory usage. Converting other representations to strings for output is always trivial and fast. Converting from input strings may be a tiny bit harder, but it is still simple, and even programs using string representations will have these same complications dealing with irregular inputs and such. So it is always better to represent cards internally with a different representation and convert them for input and output as needed.

Objects

In object-oriented languages, using objects to represent cards is reasonably efficient for most uses. Operations on cards often involve comparing ranks and suits separately, so the card object should have two member variables for rank and suit. Rank should be an integer or an integer-like class (such as an enumeration class) that can do ordered comparisons. Suits are generally only compared for equality, so they can be integers, enumerations, or pointers to one of four static suit objects. Identifying a card object as a joker can be done with an additional flag, or else it can be assigned a unique rank.
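
A minimal sketch of such a class in Python; the particular choices here (ace-high integer ranks, integer suits, a joker flag) are just the ones described above, not the only reasonable ones:

class Card:
    """One playing card: integer rank and suit, plus a joker flag."""
    __slots__ = ("rank", "suit", "is_joker")

    def __init__(self, rank, suit, is_joker=False):
        self.rank = rank          # 0 = deuce ... 12 = ace
        self.suit = suit          # 0..3 for clubs, diamonds, hearts, spades
        self.is_joker = is_joker  # jokers have no meaningful rank or suit

    def __eq__(self, other):
        return (self.rank, self.suit, self.is_joker) == \
               (other.rank, other.suit, other.is_joker)

    def __lt__(self, other):      # poker-style comparison by rank only
        return self.rank < other.rank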

Such a representation is fairly compact, so it will not cause excessive memory use. It should be pointed out, though, that even object-oriented languages typically have efficient “primitive” types like integers, and so it might make sense for some applications to forgo objects in favor of one of the integer representations below for extra performance. One might still have a card class with static functions that operate on these integers for clarity. A good example is the Pokerstove application in C++ which uses Card, Rank, and Suit objects for I/O and some functions, but computes different internal representations from them when needed for performance.

If you want to keep extra information in the card object, you can avoid the cost of copying larger objects by keeping a single collection of 52 static card objects and using pointers to these as the cards that get manipulated at runtime.

Bitmaps

If the card games being simulated involve sets of cards with no duplicates, and for which the order of cards in a set is not important, one can represent a set of cards as a single 64-bit integer in which each bit indicates the presence or absence of one particular card in the set. In addition to being the most compact representation for sets, this can speed up many complex calculations. If the bit positions are chosen so that each 16-bit chunk of the value represents one suit, with the low 13 bits of each chunk holding the ranks, then the 16-bit sub-integers can be used directly for comparisons as well, speeding up calculations further.
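
In Python (whose integers can serve as 64-bit masks), a sketch of that layout might look like this; the helper names are mine:

def card_bit(rank, suit):
    """Single-bit mask for one card (rank 0..12, suit 0..3)."""
    return 1 << (16 * suit + rank)

def add_card(hand_mask, rank, suit):
    return hand_mask | card_bit(rank, suit)

def has_card(hand_mask, rank, suit):
    return bool(hand_mask & card_bit(rank, suit))

def suit_ranks(hand_mask, suit):
    """The 13-bit rank pattern of one suit, usable directly in comparisons."""
    return (hand_mask >> (16 * suit)) & 0x1FFF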

As noted, this does not preserve the order of cards, so if you want to do something like shuffle a deck, you'll have to represent the deck as an array of these masks, with each member having one bit set, and then OR them together into a hand as they are dealt. This may be slower than dealing with arrays of machine-size integers. This representation also makes using lookup tables indexed by rank or suit difficult. Also, since no duplicates are allowed, this method cannot be used for games that require duplicate cards such as Pinochle and Canasta.

This representation is most useful for single-purpose applications doing very complex calculations on fixed-sized sets of cards. The venerable pokersource library uses bitmaps to evaluate poker hands, and it is quite effective.

Bitfields

Because the typical 32-bit integer size of most machines is much larger than necessary to identify a card, we can use groups of bits within an integer to store information about the card. Specifically, two bits for suit, four for rank, and the rest for flags or anything else the application might need. This is similar to treating an integer as an object with member variables stored in a very compact way. The well-known Suffecool/Senzee poker hand evaluator uses this method to store along with each card one of 13 prime numbers used in its calculations.

This gives us some of the advantages of the object representation while being more compact. This speeds up applications that need to move and copy many cards from place to place, such as blackjack simulations. A blackjack simulation might use 4 bits to store the 1 to 10 numerical value of a card to avoid some branching in the innermost loop that computes a hand value (though you'll still need to deal with aces specially). Getting ranks and suits out of our numbers requires only fast bit-masking operations to get numbers suitable for indexing lookup tables.
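
A sketch of one such packing in Python, using the field widths mentioned above plus a few spare bits for a blackjack count value; the exact layout is just one reasonable choice, not any particular evaluator's:

def pack_card(rank, suit, bj_value):
    """2 bits of suit, 4 bits of rank, blackjack value in the bits above."""
    return (bj_value << 6) | (rank << 2) | suit

def field_rank(card):
    return (card >> 2) & 0xF

def field_suit(card):
    return card & 0x3

def field_bj_value(card):
    return card >> 6   # 1..10; aces still need special handling in game code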

Integers

Finally, there is what is probably the simplest representation of all, but no less powerful if done correctly: simply assigning a small integer value to each card. One can see software in which cards are ordered the way they are when you open a typical new deck of cards, which is Ac, 2c, 3c, ... Kc, Ad, 2d, ... Qs, Ks. This is a bad idea for two reasons. First, getting a numerical rank and suit from a number requires an expensive division by 13. Second, even after that, aces will usually have to be special-cased to move them to their usual high rank.

Better is to order the cards in the standard poker “high card by suit” ordering, which is 2c, 2d, 2h, 2s, 3c, 3d, ... Ks, Ac, Ad, Ah, As. This has many advantages. First, you can separate rank and suit with fast bit masking (in fact, this ordering is essentially a bitfield representation with suit as the low order bits). Also, one can often compare or sort cards by rank without even separating the ranks just by comparing the values themselves. Likewise, comparing ranges of ranks can be done by comparing ranges of values (the “10 count” cards in blackjack, for example, are the range 32 to 47).
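
In Python, the masking described above looks like the following; the ten-count check assumes the 0-to-51 encoding just described (value = 4 * rank + suit, deuce of clubs = 0):

def rank_of(card):
    return card >> 2      # 0 = deuce ... 12 = ace

def suit_of(card):
    return card & 3       # 0 = clubs, 1 = diamonds, 2 = hearts, 3 = spades

def same_rank(a, b):
    return (a >> 2) == (b >> 2)

def is_ten_count(card):
    return 32 <= card <= 47   # tens, jacks, queens, and kings of any suit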

This representation is ideal for indexing lookup tables. The values that one might store in a bitfield or object, for example, can simply be fetched from a small lookup table with almost no performance hit. Sets of cards (hands, decks, discard piles, etc.) are simply arrays of integers, for which many programming languages are highly optimized. Duplicate values are no problem, so games like Pinochle and things like 6-deck blackjack shoes need no special handling.

The OneJoker card library uses this representation with a minor change: I add one, so cards have the values 1 to 52 rather than 0 to 51 (the values 53 and 54 are used for jokers). The need for an occasional -1 is not a significant performance hit; it can often be avoided entirely by adding one element to lookup tables, and being able to use 0 as a “null” value is very handy in the C language.
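
For illustration, the arithmetic implied by that 1-to-54 scheme is below. These helpers are just my own sketch of the idea, not the library's actual API:

NULL_CARD = 0                 # 0 is left free to mean "no card"
JOKERS = (53, 54)

def is_joker(card):
    return card in JOKERS

def rank_of(card):            # same masking as before, shifted down by one
    return (card - 1) >> 2

def suit_of(card):
    return (card - 1) & 3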

While any one particular application might be faster with a different representation, this simple one is very fast for the vast majority of applications, and can be easily converted to others when needed, so it is probably ideal for a general-purpose library.

What is randomness?

The code project I'm currently working on requires generating random numbers for simulating card games. This is a notoriously perilous task for programmers. It also brings up interesting (at least to some of us) questions about just what randomness is, and defining that is a notoriously perilous task for philosophers.

It turns out (unsurprisingly) that the mathematician's perspective on randomness closely matches the gambler's perspective. “Random”, gamblers and mathematicians will tell you, is the opposite of “predictable”. That is, to the extent that we can predict anything about the outcome of some experiment ahead of time, those results are not random. Or even more toward the gambler's point of view, a random process is one on which we cannot—even in principle—make a bet in our favor.

This differs considerably from what most people informally think of as “random”. Most of us have an intuitive sense that random things are evenly distributed, which is true in the very very long run, but not true at all on the scales we generally experience things. This intuition is called the “gambler's fallacy”, because gamblers who bet expecting things to even out in the short run keep me employed. Just because that seven hasn't been rolled in a while, or we've just seen a run of black on the roulette wheel, that doesn't mean that upcoming rolls or spins will have any bias—any predictability—compared to previous rolls. Dice, it is often said, have no memory.

This leads to some unintuitive results: if we flipped a coin 10 times, and it landed heads all 10 times, we might rightly suspect that it wasn't a fair coin. But if we flipped it a million times, and there were not a single run of 10 heads in a row in the whole sequence, we would also reject that coin as non-random, because its distribution was too even. After billions of rolls of the dice, or billions of cards dealt, or billions of spins of the wheel, we would expect all the possible outcomes to be roughly—but not exactly—even. But in the short run, lopsided results are commonplace, and in fact failing to find such streaks is evidence of non-randomness. And here's an important clue: your lifetime is the short run. If it were true that things “evened out” in the short run, that would be a statistical bias that you could bet on and make money. People do bet on those feelings, but it's the casino who makes money, because they're betting on randomness.
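
A quick simulation makes the point. In a million fair coin flips, streaks of ten or more heads are not a fluke to be explained, they are an expectation (this sketch just uses Python's standard generator):

import random

flips = [random.randrange(2) for _ in range(1000000)]

run, long_runs = 0, 0
for f in flips:
    if f == 1:
        run += 1
        if run == 10:      # count each streak once, when it reaches 10 heads
            long_runs += 1
    else:
        run = 0

print("Streaks of 10 or more heads:", long_runs)   # typically a few hundred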

There is one exception to note in casino games: blackjack. Cards don't have memory either, but the shoe from which cards are dealt does. If you've been watching 40 hands of blackjack and not seen a single face card, you can be sure that the cards remaining in that shoe, until it is reshuffled, are overly rich in face cards, and you can change your bets accordingly. The fact that you will see more face cards in the next few hands than you would expect from a fresh shoe is statistically predictable, and you can therefore make money from that fact, even if each individual card is still randomly chosen from those remaining. Card counting works because short-run clusters of events are predictable.
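
The arithmetic behind that intuition, under simplified assumptions of my own (a six-deck shoe, J, Q, and K counted as face cards, and 40 cards seen with no face among them):

DECKS = 6
total_cards = 52 * DECKS       # 312 cards in the shoe
face_cards = 12 * DECKS        # 72 jacks, queens, and kings

p_fresh = face_cards / total_cards
p_depleted = face_cards / (total_cards - 40)   # 40 non-faces already dealt

print("P(face) from a fresh shoe:  {:.3f}".format(p_fresh))      # ~0.231
print("P(face) after 40 non-faces: {:.3f}".format(p_depleted))   # ~0.265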

Randomness and science

People who misunderstand math and science get randomness wrong all the time. You will hear a silly argument from creationists that creating life from random processes is like a tornado in a junkyard randomly assembling a 747. The mistake here (well, one among many) is that they think a process that merely contains some random element is therefore a random process. Yes, random mutation is an important element in evolution, but less important than the process of natural selection, which is not random at all. Imagine, for example, your parents, and their parents, and their parents, going back thousands of generations. That's a lot of people. Now ask: how many of those people died at birth? How many died before puberty? How many were sterile? How many just didn't have the desire or opportunity to have children? The answer to every one of those questions is zero. 100% of those people, without exception, successfully reproduced. 100%. That's the exact opposite of random. That's why natural selection is such a powerful force. Even with random variation as input, its output is remarkably complex and sometimes gives a magnificent illusion of design—and some things that clearly aren't designed.

Even real scientists get randomness wrong. Doctors who find clusters of people with certain cancers, for example, often mistakenly jump to the conclusion that there's some environmental cause, when in fact a purely random distribution would inevitably produce such clusters. Only if the clusters are much worse than expected by randomness—as determined by a good mathematician—should we start looking for another cause. This is called the “Texas sharpshooter” fallacy, after a supposed shooter who fires shots randomly at the side of a barn, then walks over to the biggest clump of holes and draws a target there. Cancer clusters are just like our 10 heads in a row: if the numbers are big enough, we should expect a certain number of clusters to appear by chance alone.

Even Dr. Steven Novella, host of the Skeptics Guide to the Universe podcast, has said on the air that the digits of pi are random. They are not. First of all, they are 100% predictable by calculating pi. But even outside of that, it is possible to find statistical biases in the digits of pi without actually calculating them out. For example, one can mathematically prove, say, that the 10 billionth digit of pi is a bit more likely to be a 7 than a 6 without calculating pi all the way out to 10 billion digits. Such statistical properties of the digits of pi are quite common, and it means that you could make a bet on one of those digits at fair odds and make money. If the digits were random, this would not be possible.

Programmers get randomness wrong all the time. This includes, unfortunately, the programmers who write the standard function libraries for most programming languages. When you learn to program, you will probably be given a programming exercise of some simple card or dice game, and be taught about the standard function in your language for producing random numbers. For a simple game run a few times, this is probably fine. But if you try to make real industrial-strength simulations of billions of hands or rolls, your built-in random function will probably fail in more than one way. It may be too evenly or too unevenly distributed, or it may repeat itself too soon.

So your language's built-in function is probably bad. And if you try to do it yourself, you'll probably make a worse one. So what should you do? Find an add-on library written by a serious math geek who you trust to get it right. And there's more: even if you have a perfect random number generator, you can use it the wrong way and still get bad results. Shuffling a deck of cards, for example, requires not only that you select cards at random with a good algorithm, but that you correctly rearrange the cards in such a way that each possible arrangement is equally likely, and this is easy to get wrong. There are even issues with the purpose for which the numbers are used: a good algorithm for choosing cards for a simulation might not be the best for choosing a cryptographic key, and vice versa. The well-known algorithms have fancy names: cryptographers use algorithms called RC4, Yarrow, and Fortuna. People writing simulations use algorithms called Mersenne Twister and CMWC.
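
As one example of “using it the right way”, here is a sketch of the standard Fisher-Yates shuffle, which makes every ordering equally likely provided the underlying generator is good (Python's SystemRandom stands in for a serious generator here):

import random

rng = random.SystemRandom()   # stand-in for a well-tested generator

def shuffle(cards):
    # A common bug is to pick j from the full range at every step, which
    # does NOT give a uniform shuffle; j must run over 0..i only.
    for i in range(len(cards) - 1, 0, -1):
        j = rng.randrange(i + 1)
        cards[i], cards[j] = cards[j], cards[i]

deck = list(range(52))
shuffle(deck)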

The moral of this tale is short: if you think you know what randomness is, think again. Maybe consult a mathematician. If you're writing code that uses randomness, definitely consult a mathematician. It's much easier to get it wrong than you think. And test the code. I have as many lines of test code in my library to verify the random number generator as I have to generate the numbers. There's also a good program called “dieharder” that runs a suite of statistical tests on your generator to check its quality (and which will, by the odd nature of the beast, occasionally—randomly—fail even when your code is perfect!)

Just to give you a glimpse of the level of complexity of the problem, this page shows the code from my OneJoker library that generates random numbers and shuffles cards, and does nothing else. Over 150 lines of code to ensure that the next card you get is, in fact, unpredictable.

Why Python?

I recently counseled a friend who wanted to learn about computer programming to start by learning the Python language. I also mentioned that I liked Python to an old friend who is a fellow experienced programmer, but I wasn't very clear about why. Now that I'm in the middle of a project that uses both the Python and C languages, I've come to better understand my reasons for favoring Python both for learning to program and for serious use.

Computer programming is, at its core, communication. At the lowest level, a program instructs a computer how to solve a problem. But at a more important level, a program communicates to people the thought process of the programmer in translating a vague problem into a specific solution. A programming language, then, should be expressive. That is, it should be easy for a programmer to concisely and accurately describe his thoughts, and it should be easy for someone reading the code (often that same programmer years later) to understand the original programmer's intent.

A brief history of programming

Computer hardware is reasonably simple, conceptually (though there are certainly complex details). There is a large memory that stores numbers. Programs move these numbers from one place to another in memory and do math operations on them. Attached devices use numbers to represent dots of color on a screen, letters on a keyboard or printer, the position of a mouse, the movement of a loudspeaker.

The computer also uses numbers in memory to represent instructions to itself. This is called “machine code”, and was how the very first programmers had to program. If I wanted to write the letter “A” to the screen, I had to know that the screen interprets the number 65 as that letter, and that putting it on the screen involved writing that number to a specific address in memory, and that the instruction code for “write this number to this address” is another number, and so on. Then I put the numbers 248, 65, 2, 193 in the right place in memory and press a start button. This worked, but was complex, tedious, and error-prone.

It is no accident that the first programs written by early programmers were tools to simplify programming. One such tool is called an assembler. This is a program that takes computer instructions described as more human-readable text and translates them into the raw numbers. For example, I might give the memory address of the screen the name screen. The number representing the write-to-memory instruction is named write, and so on. Now I can type write screen, "A", and the assembler will translate that into 248, 65, 2, 193 for me. Assembly instructions are an exact one-to-one match for machine instructions. They are just a more convenient—and more expressive—way to write them.

The next tools were real programming languages, which are a level of abstraction above machine code. Instead of describing machine instructions directly, the programmer used expressions like y = (x + 3) / 2, and a program called a compiler would translate that into a string of instructions to load numbers from memory, do the math, and write them back. The details of exactly how that was accomplished were delegated to the compiler program. A computer is meant to solve problems, so why not have it solve problems about how to program itself? FORTRAN was the first popular such language, although the C language overtook it and remains popular today.

In addition to compilers, there are programs called interpreters that do roughly the same job of translating a programming language into machine instructions, but do so on the fly, as the program is running. This makes using them even simpler since a programmer can try things interactively and get immediate feedback. BASIC is the canonical example of this kind of interpreted language.

Modern languages

Programming languages today add one more level of abstraction. Instead of operating only on numbers directly, or even names given to numbers, they allow you to describe “objects”, which are large collections of numbers that represent real-world things like people, places, accounts, documents, and so on. You express actions in terms of these objects, and the compiler or interpreter then decomposes these into the actions needed on the individual numbers and finally into machine code.

The measure of a modern programming language then is twofold: how well does it allow the programmer to clearly describe the problem to be solved, and how well does it translate that solution into machine code? These goals are often at odds: efficient translation to machine code is often accomplished by making special-case exceptions and back doors in the higher-level abstractions that allow the programmer to fiddle with the numbers directly at the expense of clarity and safety.

The best example of doing this badly is the C++ language. It is an extension to the C language that adds some modern abstraction tools, but it retains all of the low-level number-twiddling of C, which allows—indeed encourages—programmers to step outside of the abstractions. The resulting programs are complex, hard to understand, loaded with exceptions, and hard to debug.

The Java language does a better job, producing efficient machine code while maintaining well-defined higher-level abstractions with few exceptions. It achieves this in part by requiring the programmer to be very explicit about a lot of implementation details that don't really express programmer intent, so it tends to be verbose and hard to read and write.

The Python language manages to maintain consistent and clear use of its high-level abstractions without special cases, and yet runs remarkably efficiently. It achieves this goal mostly at the expense of using more memory at runtime than other languages. Python programs internally use dozens of big hash tables to speed up the namespace and associative-array lookups that accomplish its expressiveness. This is a good tradeoff. Today, memory is a lot cheaper than time. Time saved running a program—and time saved writing it—more than make up for the fact that a running Python program takes up twice the memory of a running Java program.

Examples

Python also adds many features that increase expressiveness without sacrificing either efficiency or high-level abstraction. Things like multiple-value function returns, default function arguments, named arguments, and tuple assignment allow a programmer to provide the information he wants to see in the code and eliminate much that isn't really expressive but merely pro forma. Here are some geeky examples for those who want (and understand) details:

Let's say I have a function of two arguments, the second of which is a value from 1 to 100, but almost always a default value the caller may not know about. In C++, I'd have two choices of how to write this. One, and the way you'd do it in C, would be to use a value like 0 to mean “use the default”:

void dosomething(int a, int b) {
    if (b == 0) { b = 53; }
    . . .
}
. . .
dosomething(5, 12);
dosomething(5, 0);

What would a programmer reading this later think on seeing the second call? It looks as if 0 is just an ordinary passed argument, and he might be surprised to find that it gets treated specially. Maybe some change has made 0 a valid argument now, and we'll have to change the default marker. If that happens, and we run across that call in another piece of code, is that call a valid one with 0 or was it the old default?

The other option in C++ uses function overloading:

void dosomething(int a, int b) { . . . }
void dosomething(int a) { dosomething(a, 53); }
. . .
dosomething(5, 12);
dosomething(5);

Now we don't have the problem of confusing a real value with a default marker, but we have a new problem. In C++, overloaded functions are not related in any way and might do completely different things. The first call might draw a line on the screen while the second one plays music. A programmer might reasonably expect the latter call to be a default case of the former, but he can't rely on it. In Python, two function calls with the same name in the same place are guaranteed to call the same function. Omitting an argument means “use the default”, and can mean nothing else.

def dosomething(a, b=53):
    . . .
dosomething(5, 12)
dosomething(5)

Python doesn't need overloading because its arguments aren't typed (more on this later). You treat the arguments of different type differently by checking them explicitly only when necessary. In this way, the code more closely matches the programmer intent. Both languages can do what we want, but in C++ it's easy to get it wrong, while in Python it's easy to get right and is more lucid. Even the much-maligned feature of Python that code indentation is significant helps catch errors by forcing the physical appearance of the code to match its real meaning.
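
Before moving on, here is a glimpse of a few of the other features mentioned earlier (multiple-value returns, tuple assignment, and named arguments with defaults) in a few illustrative lines:

def min_and_max(values):
    return min(values), max(values)      # returns both values as a tuple

low, high = min_and_max([5, 3, 9, 1])    # tuple assignment unpacks them
low, high = high, low                    # swap without a temporary variable

def connect(host, port=80, timeout=30):
    print("connecting to {0}:{1} (timeout {2})".format(host, port, timeout))

connect("example.com", timeout=5)        # named argument, default port kept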

Many of the more expressive idioms in Python come from the world of “functional programming”, a field of study in computer science that uses functions as the overriding abstraction rather than objects—more about verbs, less about nouns. Python is not a functional language itself. It is firmly grounded in the not-so-old school of organizing your problem first by the things involved and then what they do. But its carefully-borrowed features from that world make it capable of expressing complex actions more clearly and effectively than is possible in many other languages.

Let's say you have two lists of things called lista and listb, a function f(), and you want to create another list of f() of things from lista where that thing also appears in listb. In Java, most of your work is specifying implementation (my Java is a little rusty, so forgive me if there are errors; I'm just trying to convey the flavor of the code):

List<Thing> nl = new ArrayList<Thing>();
Iterator<Thing> it = lista.iterator();

while (it.hasNext()) {
    Thing a = it.next();
    if (listb.contains(a)) {
        nl.add(f(a));
    }
}

Here's the equivalent Python:

nl = [ f(x) for x in lista if x in listb ]

The advantage here is not just that it's easier to type, but that it clearly and concisely describes what the programmer wants, and no more. I didn't have to tell the computer exactly how to do what I wanted, just what I wanted. And this code will be easier for me and others to understand later.

“But wait”, I hear Java programmers saying about both of my examples, “Python code isn't type-safe!” That's true. My Python list here might contain non-Things, and the code will still compile. More, it will actually work as long as f(x) succeeds for whatever it finds in lista. It is believed by some that strict type safety catches programmer mistakes. I believe (and some evidence suggests) that this is a myth. Strict typing gets in the way more than it helps. And here's an important point: if I wanted to add strict type-checking to the Python code, I could, making it look more like the Java code. So the Python code is simpler for the more generally useful case, and more complex if that's what the programmer wants.
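
For example, a sketch of what that added check might look like, reusing the names from the snippet above:

# The same one-liner with an explicit type check added, there only because
# the programmer decided it was worth having:
nl = [ f(x) for x in lista if isinstance(x, Thing) and x in listb ]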

Performance and Conclusion

As a final note, I'm sure others will point out that speed of execution is critical in many applications, and that Python may not be suitable for those. That's true, but such cases are fewer than you might think. I've used Python for graphics, sound, number crunching, database access and many other things that might seem performance-hungry, and it's up to the task. It's certainly on par with Java. If you have written a program in Python and it's not fast enough, odds are it's your algorithm, not your language. Even if it is the language, it's likely that you've saved enough time in development that you could translate your slow—but working and clearly defined—code into C and still have spent less total time than it would have taken to develop in C in the first place, and have fewer bugs.

My OneJoker library is a hybrid: the core stuff that needs to run blindingly fast is in C, and Python code links to it. This is so that I can write the code for a simulation in nice readable Python, and still do millions of hands in reasonable time. Let's say I wanted to compare the odds of starting with ace-king in a Texas Hold'em game against pocket deuces. Here's the whole program, using my library:

#!/usr/bin/env python3
import onejoker as oj

# Two sequences sized for 7 cards (2 hole cards plus a 5-card board),
# seeded with each player's hole cards
h1 = oj.Sequence(7, "Ac Kh")
h2 = oj.Sequence(7, "2s 2d")

# A full 52-card deck, minus the four hole cards already dealt
deck = oj.Sequence(52)
deck.fill()
deck.remove(h1)
deck.remove(h2)

wins1, wins2, ties = 0, 0, 0
boards = oj.Iterator(deck, 5)   # every possible 5-card board

for b in boards.all():
    # Add the board to each hand, evaluate the best 5-card poker hand,
    # then truncate back to the 2 hole cards for the next board
    h1.append(b)
    v1, h = oj.poker_best5(h1)
    h1.truncate(2)

    h2.append(b)
    v2, h = oj.poker_best5(h2)
    h2.truncate(2)

    if v1 < v2:
        wins1 += 1
    elif v2 < v1:
        wins2 += 1
    else:
        ties += 1

print("{0:10d} boards".format(boards.total))
print("{0:10d} wins for {1}".format(wins1, h1))
print("{0:10d} wins for {1}".format(wins2, h2))
print("{0:10d} ties".format(ties))

Run this, and it dutifully prints:

1712304 boards
 799119 wins for (Ac Kh)
 903239 wins for (2s 2d)
   9946 ties

in about a minute and a half on my old laptop. And this is actually a pretty bad case for my library: I have run other simulations that complete billions of hands in minutes. If I wanted an approximate answer faster instead of waiting for the exact one, I could replace the line for b in boards.all() with for b in boards.random(10000).

So I repeat my recommendation for others out there who may be looking to get into computer programming: try Python. If you want to learn C or Java after that, go ahead. But if anyone suggests you learn C++, run as fast as you can.