Lies, Damned Lies, and Surveys and Statistics

Given some comments and questions that have arisen of late, I thought I would depart from the norm (is anything here normal?) and provide a brief overview on the subject of surveys/polls. The fact is, I have had to design a couple, and take part in conducting a few, and have had it covered in coursework. Wikipedia has a good section on surveys, and I would strongly recommend you take a look at the advantages/disadvantages sections.

A good poll takes time and a lot of care to create. It is easy to bias a survey without meaning to, and ridiculously easy to skew one with intent. Before I go below the fold: before you give any survey/poll/etc. any validity, you need to see the instrument; that is, the section on who was chosen, who was dropped from the pool, the questions in the order asked, the analysis methodology, and the demographic data. Only then can you determine how much validity to give to the poll.

The first thing to do is look at the sample size. Look not only at the number of responses in the data, but also at how many people were contacted for the survey, as the two are usually very different numbers. From a statistics standpoint, you need in excess of 300 respondents for the data to have any meaning, and for anything on a national basis you are going to need in excess of 1,200 people, if I am remembering my class notes correctly. Along with the sample size, the survey should indicate the method used to randomly choose those contacted, be it a reverse phone directory, census data, or other blanket method. If the poll uses a project, party, or other special interest database -- and the survey is not limited to that group -- well, you know how reliable it is.
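For anyone who wants to see where numbers like 300 and 1,200 come from, here is a minimal sketch using the textbook margin-of-error formula for a proportion. It assumes a simple random sample, a 95 percent confidence level (z = 1.96), and the worst case of a 50/50 split; the function names are mine and purely illustrative. Real polls with weighting or clustering will do worse than this.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of size n; p=0.5 is the worst case,
    and z=1.96 corresponds to roughly 95 percent confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(moe, p=0.5, z=1.96):
    """Smallest n whose margin of error is at most `moe` (same assumptions)."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

for n in (300, 1200):
    print(f"n={n}: about +/- {margin_of_error(n) * 100:.1f} points")

print("n needed for +/- 5 points:", required_sample_size(0.05))
print("n needed for +/- 3 points:", required_sample_size(0.03))
```

Under those assumptions, 300 respondents gets you to a bit under 6 points either way, and 1,200 to a bit under 3. Note that the formula says nothing about whether the sample was chosen fairly; it only covers random error.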

Even within a blanket source, there are ways the selection can become skewed by accident or design. Accidental skewing can be as simple as ending up with a disproportionately large number of respondents in a given age range or other demographic.

Almost any good survey will oversample, because the fact is a good number of participants will not provide complete data. These samples should be dropped, and that should be the only reason for dropping respondent samples. Looking back at the previous paragraph, if you get more responses from a given demographic, there will be a tendency to want to drop some of them -- yet a lot of bias can creep in that way. A good survey will simply do more sampling to try to even things out, if it corrects at all, for there are good arguments for not trying to correct.
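As a rough illustration of "incomplete data is the only reason to drop a response," here is a small sketch. The field names, the records, and the number contacted are all made up for the example; the point is that the drop rule looks only at whether required answers are missing, never at who the respondent is, and that the contacted/responded/kept counts get reported together.

```python
# Hypothetical required fields for a response to count as complete.
REQUIRED_FIELDS = ("age_group", "region", "answer")

def split_complete(responses, required=REQUIRED_FIELDS):
    """Separate complete responses from incomplete ones.

    A response is dropped only if a required field is missing or blank.
    """
    complete, dropped = [], []
    for response in responses:
        if all(response.get(field) not in (None, "") for field in required):
            complete.append(response)
        else:
            dropped.append(response)
    return complete, dropped

# Made-up data: 10 people contacted, 4 responded, 1 left a question blank.
contacted = 10
responses = [
    {"age_group": "18-34", "region": "South", "answer": "yes"},
    {"age_group": "35-64", "region": "West", "answer": "no"},
    {"age_group": "65+", "region": "South", "answer": ""},
    {"age_group": "35-64", "region": "East", "answer": "yes"},
]

complete, dropped = split_complete(responses)
print(f"contacted: {contacted}")
print(f"responded: {len(responses)} ({len(responses) / contacted:.0%})")
print(f"kept:      {len(complete)}")
print(f"dropped as incomplete: {len(dropped)}")
```

If a poll's write-up cannot give you those four numbers, that tells you something too.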

Another key factor is the questions asked, and the order in which they are asked. A good instrument is designed so that initial answers do not bias answers given later. The questions are the easiest way for both inadvertent and deliberate bias to be introduced. Word choice and order are critical.

One of my favorites was a telephone poll done a few years ago that hit my friends the Borzoi and the English Werewolf. One of the first questions was "Are you a Democrat or a Republican?" When informed that neither applied, the person doing the survey responded "Well! You have to be one or the other!" BEEEEEP! Failure of proper survey design, failure of elementary school civics (the U.S. was neither founded as nor intended to be a two-party system), and general failure of execution of the survey. Anything from that survey would have to be considered invalid.

Order and exact wording are important; indeed, critical. Many political polls tend to lead the respondent. There are a number of examples out there where people were given two surveys, and different results were obtained based on how and when the questions were asked. It is incredibly easy to guide the respondent, so pay attention to this area.

Along those lines, was it done by phone, mail, or other means? This matters, as it affects the random selection process and the honesty of the answers. Every method has problems getting truly honest answers; some methods just skew things further or in different ways. Anytime you have an online or phone poll that people opt into on their own, the numbers are bogus, as only those truly interested in/motivated by a subject are going to respond. Remember, if it is not truly a random survey, it is junk.
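To put some arithmetic behind the "only the motivated respond" problem, here is a toy calculation. Every number in it is invented for illustration: a population of 1,000 in which a motivated fifth mostly says yes and responds at a high rate, while everyone else is lukewarm and rarely bothers.

```python
def true_vs_polled(groups):
    """Compare the real share saying yes with what a self-selected poll sees.

    groups: list of (population, share_saying_yes, response_rate) tuples.
    Returns (true_share, polled_share).
    """
    total_population = sum(pop for pop, _, _ in groups)
    true_share = sum(pop * yes for pop, yes, _ in groups) / total_population

    respondents = sum(pop * rate for pop, _, rate in groups)
    polled_share = sum(pop * rate * yes for pop, yes, rate in groups) / respondents
    return true_share, polled_share

# Made-up groups: (population, share saying yes, response rate)
groups = [
    (200, 0.90, 0.50),  # highly motivated, half of them respond
    (800, 0.40, 0.02),  # everyone else, 2 percent respond
]

true_share, polled_share = true_vs_polled(groups)
print(f"true share saying yes:   {true_share:.0%}")
print(f"polled share saying yes: {polled_share:.0%}")
```

With those invented numbers the population is split 50/50, but the poll reports about 83 percent yes, and no amount of extra responses fixes it.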

Then look at the methodology used to compile the statistics. Look for any dropped components. If anything other than straight analysis is done, look at what was done and why. It really does help to understand the difference between mean, median, and mode at this point -- would that more reporters did. How do the data subsets match up to the general conclusions? Are there any pick-and-choose subsets?
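For a quick refresher on mean, median, and mode, here is a small example using Python's standard `statistics` module. The income figures are made up; the one large value is there on purpose to show how a single outlier drags the mean while leaving the median and mode alone.

```python
import statistics

# Made-up household incomes, in thousands of dollars. One outlier at the end.
incomes = [30, 32, 35, 35, 38, 40, 400]

mean_income = statistics.mean(incomes)      # sum divided by count
median_income = statistics.median(incomes)  # middle value when sorted
mode_income = statistics.mode(incomes)      # most frequent value

print(f"mean:   {mean_income:.1f}")
print(f"median: {median_income}")
print(f"mode:   {mode_income}")
```

The mean comes out around 87, while the median and mode are both 35. A report that says "the average respondent earns 87 thousand" would be technically true and would describe nobody in the sample.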

When I taught a basic science course years back, one of my favorites (besides the radiation thing) was to use the Gungjubu example. In short, a scientific survey/research effort had found that the gum disease Gungjubu had a 100 percent mortality rate in those who developed said condition (the fact that anything and everything has a 100 percent mortality rate was another part of the lesson -- remember, we all die), and Company X had come out with a special anti-Gungjubu toothbrush for only $50. Was it worth getting? After pointing out that Gungjubu was limited to a specific subset of an isolated population on a single island in the South Pacific, was linked to a genetic marker for those people (i.e. only they got it), and a few other little facts, the answer was clear. No.

In short, as I have a yard to mow, a shed to paint, and a dog to scritch, I don't take most polls very seriously. Anyone who cites polls and won't say/link to where they come from is not an honest merchant in the marketplace of ideas. Anyone who runs poll data and won't link back to the poll itself is not an honest merchant in the marketplace of ideas. Anyone who runs a poll and won't post the source, methodology, and such just plain isn't honest in my book. Your opinion may vary.

More soon.


Oh yeah, before I forget, always look at the final demographics data. If you see anything, and I do mean anything, from race to political affiliation, skewed one way or another, take the data with a grain or a ton of salt...
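One simple way to eyeball that kind of skew is to line the sample's demographic shares up against a benchmark such as census figures. The sketch below does just that; the age groups, sample counts, and benchmark shares are all invented for illustration, and the function name is mine.

```python
def demographic_gaps(sample_counts, benchmark_shares):
    """Sample share minus benchmark share for each group.

    Positive means the group is overrepresented in the sample,
    negative means it is underrepresented.
    """
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in benchmark_shares.items()
    }

# Made-up numbers: 500 respondents versus hypothetical population shares.
sample_counts = {"18-34": 50, "35-64": 250, "65+": 200}
benchmark_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

gaps = demographic_gaps(sample_counts, benchmark_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap * 100:+.0f} points versus benchmark")
```

In this invented sample the youngest group is 20 points short and the oldest is 20 points over, which is exactly the sort of thing that should send you reaching for the salt.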