When you conduct a medical study, you generate a lot of data, a ton of data, more data than you even know what to do with in some cases. To make sense of that, researchers have defined all sorts of terms that they apply to their data. They try to classify it: what type of data do I have? It's important for you to know what the different types of data are so that when you read a study, you know exactly what they're talking about. So in terms of our goals today, I'm going to introduce you to some of the relevant jargon that is peppered throughout medical research. There are a lot of terms used to describe data that sound like words you've heard before but are being used in completely new ways, so you just need to be aware of what those words mean so that as you're reading through a medical manuscript, you aren't totally lost. We're going to describe how researchers think about data and how they try to put it together, and we're going to explain why knowing what data you have matters from a statistical point of view. But in order to get us started, I want you to imagine that we are out trying to find a new car to buy. What you see before you is a cluster of cars, an amalgamation of automobiles, a slew of sedans. Look at them all, all so different, and yet they are all cars. We can measure a lot of things about a car: its weight, its length, the time it takes to go from 0 to 60 miles per hour. We can record its color too. We can classify it as an SUV or a sports car, or a very cool minivan, which is what I drive. In fact, we could take a single car and measure thousands of things about it. How many screws are in it? What is the composition of the metal in the tailpipe? Anything. We could even describe how the car makes us feel. My minivan makes me feel very young and hip, like I have my whole life ahead of me. You all believe that, right? It's just data. What did we just see there? We saw a lot of data about these cars.
Some of the data is what we call categorical data, categories. For a car, that might be something like the color, the make, or the model. We know that cars come in all sorts of different colors and we can list the different colors, but we can't put those colors in any meaningful order. One color is not necessarily better than another. We just know that there's yellow and blue and red and green, and all of those would have to be analyzed separately. They're all their own category. We have that with people too. You have sex, race, and marital status. It's very difficult to put those in any given order. A study might record you as male or female, as black or white. These are just categories, categorical data. That's data type number one. The other type is called continuous data. These typically are measurements, things that don't fall into discrete categories but that you can measure along a vast spectrum. So if you're thinking about a car, think about the price, or the weight, or the length of the car. A car that costs $20,000 costs twice as much as a car that costs $10,000. That's just true. We can order cars by their price. There's a meaningful relationship there. In humans, you might see something like cholesterol level or age, some continuous metric. You can fall anywhere along the spectrum. Again, we could order people, and we would know that a cholesterol level of 300 is twice a cholesterol level of 150. But let me throw a little wrench in the works. Some data can be ordered, but it doesn't work perfectly mathematically. We call this ordinal data. I'll give you an example. If we asked you how much you liked this car on a scale of 1 to 10 and we made you do that for a bunch of cars, we could put those ratings in order, because a six is better than a five and a seven is better than a six, and so on. But is a 10 twice as good as a five? Is an eight twice as good as a four? Well, when you're talking about a scale like that, probably not.
They have an order that's clear, but the relationships between the numbers aren't exactly precise. In human studies, you might think of the level of education that you achieved. We often break that up into high school, college, and grad school levels, and while we can put those in order and say, "Oh, that's more education," it's harder to say exactly how much more education college is than high school, and it probably depends on the study you're doing. So that's called ordinal data. Now, when you're talking about continuous data, remember, that's that spectrum data, like your cholesterol level. The next question that often gets asked is, is this data normal or not? This is a jargon term. It does not mean normal like usual or regular. It's a specific statistical concept: normal data is data that is distributed in that bell curve shape. We've all seen that bell curve before. There's a central peak, and it slopes off to the sides. The reason we call that normal data goes back into the mists of antiquity in mathematics, but it's normal in the sense that a lot of stuff you measure in the world, in physics and in nature, does actually fall into that pattern. I think a great example is Plinko. You remember Plinko from The Price Is Right. You stand at the top and you drop the little ball down, and it bounces randomly. But you would find, if you dropped the Plinko balls right in the middle every time, that you'd be much more likely to have a ball land in the middle than on the sides. In fact, there's a great simulation of this online, and you can see here in this video that as the Plinko balls come down, they distribute themselves and they tend to line up in the middle. They are finding that normal distribution. There's the central tendency where most things tend to fall, and then it gets a little rarer as you go off to the sides. This is present in all sorts of data, and as a researcher, you just recognize normal data when you see it. So let me give you an example.
Here is the distribution of height among middle-aged US males. I'm a middle-aged US male, so this is very relevant data to me. You can see that they measured a bunch of people's heights, and what you see is this bell curve. It's not perfect, but it's symmetric. There is a peak in the middle around five-foot-eight or so, which is the average height, and then it gets rarer and rarer. We see very few men who are up here at six-foot-five, and few men who are down there at five-foot-one, although of course they're there. So normal data is all over health care. When you read a paper and it says this was normally distributed, this is all they're talking about. They could just say it's a bell curve, but they don't. They say it's normally distributed. But not all data is normally distributed. So this is the population of the United States by age. That does not look like a bell curve to me, at all really. It looks like a ski slope or something like that. This is simply non-normal data, and what this data suggests is that there are a lot of people between the ages of about zero and 60-something in the United States, and then it tails off, and by the time you get to 99 years old, there are very few people of that age. So we would call this a right-skewed distribution, but we can just basically say it's not normal-looking. Why does all this matter? Well, this matters because many of the statistics we use to compare data across groups of people make the assumption that the data is normal. We'll get to this in a little bit more detail in a future lecture, but the math was developed in many cases as long as 100 years ago by people who were really interested in gambling and things like that. It makes the assumption that the data you're looking at has that nice bell-shaped curve. So when your data does have that nice bell-shaped curve, you just plug it into the existing equations and everything works beautifully.
When it doesn't, you need a special new set of equations. There is another type of data that I didn't really get to talk about here, and that's what we call qualitative data. Everything I've been referring to up to now, the categories, the continuous data, the ordinal data, the normal data, the non-normal data, that's all quantifiable. I can measure it to some extent. Qualitative data is different but really valuable. When you think of qualitative data, think of focus groups. You get a bunch of people in a room, you ask them open-ended questions, and you facilitate a discussion. This can be incredibly useful in medical research because you could take a group of six or seven people with a given disease, sit them around a table, and say, "What is bothering you? What is the bad part about this disease? How does it impact your quality of life? What are the most important things you want us researchers to be working on?" So when you see a qualitative study, you want to recognize, "Okay, they're not measuring how much cholesterol this person has. They're actually asking them questions." It's harder to measure, and the statistics are harder too: you can't easily plug someone's response to an open-ended question into some equation. So qualitative data is fundamentally different but really useful. An interesting article that's really relevant to us is this one, which did a series of focus groups to figure out how consumers find health information online. Basically, what they found by asking these very open-ended questions is that people like a website that looks very good and professionally done and has some medical jargon on it, and they want sites that are unbiased. But when the researchers actually looked at what people were clicking on, they found that almost no one in the study clicked on the "About us" page. No one was really able to tell the researchers what financial disclosures were present on the web page.
In fact, the vast majority of people, when they recited a piece of information that they had learned online, could not recall which website they had learned it from. So people want really good, high-quality medical information, but it's clear from a qualitative study that they aren't the best at getting it. Of course, that's part of the reason we're here. So, some take-home points about the types of data. Data can be categorical, continuous, or ordinal. If you've got continuous data, it can be normal, that nice bell shape, or non-normal, something else. Normal data fits nicely into the statistics that gamblers developed long, long ago, but non-normal data is more interesting and requires special tools. Qualitative data gives important new insights but isn't easily amenable to statistics. So keep an eye on the type of data that's in the study and you'll understand how it was analyzed. See you next time.
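The Plinko idea from this lecture can be sketched in a few lines of Python. This is only a minimal simulation under stated assumptions, not the online simulation shown in the video: it assumes every ball bounces left or right with equal chance at each row of pegs, and the function names, the 12 rows, and the 10,000 balls are arbitrary choices made for illustration.

```python
import random
from collections import Counter


def simulate_plinko(num_balls=10_000, num_rows=12, seed=0):
    """Drop balls through rows of pegs and count where each one lands.

    Assumption: at every row a ball bounces left or right with equal chance.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    slots = Counter()
    for _ in range(num_balls):
        # The landing slot is simply the number of rightward bounces.
        slot = sum(rng.randint(0, 1) for _ in range(num_rows))
        slots[slot] += 1
    return slots


def print_histogram(slots, num_rows, width=50):
    """Print one row of '#' marks per slot, scaled to the fullest slot."""
    tallest = max(slots.values())
    for slot in range(num_rows + 1):
        bar = "#" * round(width * slots[slot] / tallest)
        print(f"{slot:2d} | {bar}")


if __name__ == "__main__":
    counts = simulate_plinko()
    print_histogram(counts, num_rows=12)
```

Running it prints a rough text histogram in which the middle slots collect most of the balls and the edge slots collect very few, which is the bell shape. Strictly speaking, these counts follow a binomial distribution, which looks more and more like the normal curve as the number of rows grows.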