-
Your “representative” samples are bad for science
Have you heard? Small Ns are bad in quantitative projects. “Right”, you will say, “I will henceforth try to get as high an N as possible”. Consequently, you also want to generalise your inferences to your population. Because why shouldn’t you? The decision-makers who read your research (the number of which is mostly zero, but let’s just pretend) need to know whether those inferences hold up at the population, or even human, level. So you try to get representative samples for every subsequent project, if your budget allows for it.
Now, I have very bad news for you: Firstly, you might not even be methodologically savvy enough to generalize your inferences, or even build models that allow for generalization in the way you want (see here). But also: you might be wasting your money. There are (imho) not many use cases where representative sampling is actually worth the trouble. But first, let me give you a quick refresher on sampling.
Sampling is a procedure that we do for (mostly) two things:
- To estimate population values without having full population data.
- To introduce randomness in our research, enabling us to leverage probability theory for our estimation models.
In other words: sampling is often the thing that gives us probabilities (in contrast to likelihoods) in the first place, if there is no randomized allocation. If you measure the entire population, you do not need to estimate. You can work descriptively. Alas, we almost never work with entire populations. That’s why we sample and then estimate, combined with a quantification of uncertainty. Thank you for attending stats philosophy 101.
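To make the refresher concrete, here is a minimal R sketch of that workflow: a simulated “population” (all numbers invented for illustration), a random sample, and an estimate with its uncertainty attached.

```r
# Minimal sketch: estimating a population value from a random sample.
# The "population" is simulated here purely for illustration.
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)  # pretend we had full population data
true_mean  <- mean(population)                   # knowable only because we simulated it

smpl <- sample(population, size = 200)  # random sampling is what buys us the probability model
est  <- t.test(smpl)                    # point estimate plus uncertainty quantification

true_mean
est$estimate   # sample mean
est$conf.int   # 95% CI - the "quantification of uncertainty" from above
```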
Knowing all of this, you can see that representativity depends very strongly on some assumptions – and those assumptions are often…let’s say overplayed. There are three questions that will tell you whether representativity is actually worth anything for your project.
Representativity, Shmepresentativity.
The following questions are vital for any project that even thinks about representative sampling.
- What is the nature of my research?
- What is the goal of my research?
- Which context does my research inhabit?
These questions will determine your choice of sample. Let’s go through them one by one.
What is the nature of my research?
Some research is more deserving of a representative sample than other research. Especially when you conduct research with living things, representativity might be interesting (otherwise all you care about is power, or raw N). A very good reason for representativity is when your subject or hypothesis touches the lives or substance of every member of the population. Medicine is the best example: you want to ensure that a new medicine works for the vast majority of patients (whatever “majority” means is your own definition). This is why you need to sample as diversely (representatively, for those allergic to the word) as possible, with strong power, to get effects that count for that majority. Likewise, you need representative samples if you want to explain general tendencies in humans. Cognitive and social psychology are in dire need of such things, because we cannot guarantee that thinking or action is uniform across individuals – or even cultures. But this leads us to the second question.
What is the goal of my research?
Do you want to know whether your medicine works for most humans? Well, then you need to sample from most humans (this is why drug development is so damn expensive, up to a billion dollars per approved drug). But cognitive or social psychologists don’t have that kind of money, often not more than 2k for one study. But they should, right? So in lieu of that, they are often satisfied with representative samples of one country, with rather low N (100-300). But the question remains: what are you estimating then? In most cases, you estimate the population parameters of that specific country. Sure, neighbouring countries might have very close or similar values – if we’re being generous. But people on other continents might not. This paper (finally back to HCI) is an interesting demonstration of the effect. Their models for privacy work (somewhat) in Germany but break down in the US, with close to zero variance explained. This means you cannot generalise the findings across countries at all, despite representative quotas within countries. Which leads to the third question:
Which context does my research inhabit?
Whether representativity is worth anything also depends heavily on the context of your research. In the case of the last paper, I’d argue that representativity is informative, as the work is specifically tailored to inform national decision-makers (legislative and otherwise). But about general tendencies, this paper tells us absolutely nothing. And how could it? As we can see from it, the whole deal is highly culture-specific. Representativity ends at cultural borders here. There is no general tendency.
This is what I mean by my question: do you want to focus on specific aspects of a specific population? Great, representativity is your friend. Do you want to find general tendencies that are present across populations? Then you would have to sample from everyone – and I can tell you right now: “all humans” is almost certainly not a population; for reference, just read cross-cultural psychology papers. In case you already took a nationally representative sample out of habit, be really careful about inference. In most cases, you’re better off (financially and in terms of inference) powering your sample to 80% for a reasonable SESOI (smallest effect size of interest) and recruiting from demographics and cultures as diverse as possible, while ignoring the population kookie talk.
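For what it’s worth, that power calculation is a one-liner in base R. A minimal sketch, assuming a made-up SESOI of d = 0.3 for a two-group comparison – substitute your own justified value:

```r
# Minimal sketch: plan N from a SESOI at 80% power instead of chasing quotas.
# The SESOI of d = 0.3 is an invented placeholder, not a recommendation.
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# about 175 participants per group; spend the rest of the budget on recruiting
# them from demographics and cultures as diverse as you can manage
```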
But what do I tell the reviewers?
While my gut reaction would be to tell them to fuck off with their BS representativity demands, this will neither increase your chance of publication nor foster learning on the other side. So what you need to tell them is the backstory of what sampling and generalizability mean. Argue that representativity does not make sense in your case, or is financially infeasible, or (and this is mostly the case) that narrow population data (!) is borderline useless if one wants to research supposedly general tendencies. If they still fight back, just respond with a recommendation for a book on sampling theory which they will never read anyway, and end the conversation.
What you should not do is yield to stupid demands. Many reviewers, especially in HCI, have close to zero statistical or methodological knowledge and are too ashamed to admit it. While this is just human, it does not serve the peer review process well. Should you encounter such people – and you will be able to tell from the reviews – you are best advised to remain calm and explain the situation to them with plenty of examples or graphs. Like you would explain it to a student. I have had good success with this, and I hope this strategy might be useful to you, too.
In any case, stay (un)representative folks.
~R
-
Age is not a predictor
You have probably encountered this one person who reviews your paper and stumbles upon the small but significant age effect in your demographic regression model. They find it appealing and make it their life’s mission to pester you about it: “WHY DIDN’T U DISCUSS IT, YOU LITTLE SHIT?!” is the feeling one usually gets from such a comment. Sure, my age effect is small, or even medium. But still. Let me point out what is wrong with this demographic variable and why people in most cases should just stop using it:
Aging is something we do without doing much. It just happens. Everybody knows that. It’s one of those universal life constants you cannot change. The earth keeps on spinning, particles keep moving, you age. That’s what that is. The more math-savvy folk will have gasped by now. “Constant”. Yes, aging is a constant process with a precision that is hard to come by. Why, then, would a constant get a random parameter in a model? Well, in some cases it seems to make sense: age does account for some damage to your body, simply because cell division does not work as well over time and chromosomes tend to decay on their own. But that does not make age a predictor. It is a correlate. The predictor for body damage would – more precisely – be the number of divisions the cells went through. If such a basic process does not make age a predictor, then what about all the other times we use it, for example in social science studies?
Well, I’ve got bad news for you: age is always confounded. Because, in essence, age is just time passed. During that time, things happen. If you do not measure these things, their effect gets attributed to the time frame in which they happened. To make it perfectly clear: age says nothing about anything. It is a storage device for the thing that is actually relevant. Age has an effect on liking something? Then some event takes place somewhere in that time span that accounts for the liking – you just didn’t include it in your model. Age “predicts” behavior? Your behavior probably stems from some experience accumulated with age (puberty, for example, or your cells’ reproductive machinery deteriorating). Maybe your participants all had some terrible or great experience with the thing you’re researching. Maybe some general law-of-large-numbers fuckery is happening (fear of disasters, for example, might be correlated with age if you do not take into account the disasters experienced over a lifespan).
I don’t think I need to repeat this, but just in case: just because something is “statistically significant” does not mean that it is a) relevant, b) meaningful or c) real. In 99% of cases, your significant age effect means your model is lacking something. Now that sounds bad, but it might just as well be the best you can do given prior research, theory and data. And that’s fine. Just don’t overblow it. Age is confounded, but it makes for neat statistical control if you know you’re missing stuff but don’t know WHAT it is you’re missing. Outsource that shit onto age, run your model, ignore the estimate, partial out the variance and see what is left for age. But don’t, and I want to stress this, DO NOT EVER complain about an age effect not being talked about in any manner other than the one I just outlined. Age is a nothing-burger in terms of interpretation and it does nobody any good to include it in any reasonable discussion.
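A minimal simulation sketch of the argument, with invented names and numbers: the outcome depends only on an unmeasured variable that accumulates over time (call it “experience”), yet age shows up as a “significant predictor” until the real driver enters the model.

```r
# Minimal sketch: a "significant age effect" that is really a proxy
# for an unmeasured variable. All variables and numbers are invented.
set.seed(1)
n          <- 500
age        <- runif(n, 18, 80)
experience <- 0.5 * age + rnorm(n, sd = 5)         # the thing that actually accumulates over time
liking     <- 0.8 * experience + rnorm(n, sd = 5)  # outcome depends on experience, not age

summary(lm(liking ~ age))               # age looks like a "predictor"
summary(lm(liking ~ age + experience))  # add the real driver and the age estimate collapses
```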
So please: Respect the elderly, but do not overblow the meaning of age. It is the worst proxy imaginable. “With age comes wisdom” and that wisdom will tell you to ignore age estimates. This should be taught in textbooks. But it isn’t. Rather, reviewers use age as a flyswatter argument if they have nothing else to contribute to a manuscript. Don’t be that guy. Do the sensible thing.
-
Your performance indicators are INVALID!
This may come as bad news to you, should you work in any sector, but especially if your job is to evaluate performance for anything related to human activity. Because I am here to tell you that anything you touch will crumble to dust. By default.
This text is about Goodhart’s law and why it invalidates literally any performance measure we use. But let me give you the verbatim statements first. Goodhart’s law is often stated as:
“When a measure becomes a target, it ceases to be a good measure.”
And that’s about the gist of it. Although I prefer the more detailed version, which is:
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
In very simple terms: if you fuck with it, it goes bad. Now, if you experience a really nagging feeling in the depths of your brain, don’t be afraid. That feeling is there for a reason. Let me explain what the things I just wrote actually mean and why that is.
If we want to get statistical insight into phenomena, we need objective measures. As such, measures are descriptive of the phenomenon we are interested in. They quantify the things we want to know more about, so that we can draw inferences about whatever it is we are interested in. Let’s say we have a company that develops software to pay the bills. Now we want to know how productive our employees are – because we have the feeling that our employee Bob is slacking off and pushing all his work onto other people. So we check how many lines of code everybody wrote during the last year. And we can see: Bob wrote 75% fewer lines than the average of the other employees. That’s outrageous – and helpful inference, we think – so we call Bob in for feedback. We tell him he has written much less code than the others and that he has to improve, because otherwise we will have to let him go. A year later, we look at lines of code again and we can see that Bob has improved massively and even gone above and beyond, dwarfing every other employee in lines of code. We are delighted and review his code. Unfortunately, we can’t read any of it, and it is riddled with bugs and security problems. Obviously, Bob just wrote a whole bunch of nonsense to jack up his code metrics. We learn: this metric isn’t apt for performance review anymore. So let’s search for a new one!
I’m gonna stop right here. You will probably have spotted what happened. The measure was targeted and collapsed. Why? Because a measure is only descriptive. If you fuck with the meaning of a descriptive, it doesn’t mean squat anymore. Good for Bob, bad for everyone else. And Goodhart’s law applies to virtually anything that has to do with human activity. The law breaks the one thing that makes a measure good: objectivity. But how? I’ll try to explain.
Anything that is measurable follows a distribution. That means that if you measure a lot of different individuals and look at all the values, some kind of characteristic pattern will show. From this you can get statistical information: what the average is, what the spread is, and a lot of other things that shall remain unnamed right now. They’re not too important for my point. The important thing is: these underlying properties are what make the measure objective. If you measure something, you know objectively what that means by comparing it to its position within the underlying distribution. But: once you increase the importance (or salience) of your measure, people will try to influence it for gain. This shifts the underlying distribution of the measure and fucks up prediction and inference, as those are built upon the old distribution. Think about this: a common human growth chart will be virtually useless for a kid with a genetic growth disorder. You will not be able to predict how tall they will be at which age, because the underlying distribution of height~age is different from that of kids without such a disorder. What happens with measures that become a target is that we punch in a disorder of our own, until the underlying distribution of the data becomes a different one. The measure then collapses in meaning, because the inference framework we built to interpret the measure died alongside its distribution.
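Here is a minimal simulation sketch of that collapse, with entirely made-up numbers: a benchmark calibrated on the old distribution stops flagging anything once people start padding the metric.

```r
# Minimal sketch: a benchmark built on the old distribution dies once the
# measure becomes a target. All numbers are invented for illustration.
set.seed(7)
before <- rpois(1000, lambda = 200)   # weekly lines of code before anyone cared
cutoff <- quantile(before, 0.10)      # "bottom 10% get a stern talking-to"
mean(before < cutoff)                 # roughly 10%, as designed

after <- rpois(1000, lambda = 200) + rpois(1000, lambda = 150)  # padding sets in
mean(after < cutoff)  # close to 0% - nobody is "underperforming" anymore, the benchmark is meaningless
```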
So what should we do in this case? Well, we would have to update our inference framework. And we would have to update it again every time people adapt to the updated version, which means: all the time. Now my question is: is this feasible? The answer is probably a resounding no. And this is why, as I said earlier,
All your performance indicators are invalid.
Mostly because they are not constantly updated – and until somebody does that, they stay invalid. Well, you might say, that’s probably not that many measures. But oh, my friend, it is almost everything you have ever touched. GDP, school grades, KPIs in companies, revenue, scientific citation metrics – even body metrics can become a target with enough makeover ops in the population (I bet in Miami, human body metrics don’t count for much). This law destroys measurements, because we really love to use them as targets.
What do we do with that? Well, hopefully we learn. Don’t use performance measures to shape behavior or to hand out incentives. It will make them meaningless. Rather: measure performance and gain statistical insight into negative influences. Fix those influences. Don’t reward already positive predictors. This creates no incentive to cheat the system, keeps the measure stable, and shows that you care about genuine, sustainable progress.
-
Scientific integrity: A convenient myth
If you ask any layperson today what they think a scientist does, they will probably tell you that a scientist is a person who researches things, discovers new concepts or relations, and develops stuff for the betterment of humanity. If you ask them what kind of character a scientist might have, most would probably tell you that scientists are truthful and truth-seeking people with a code of ethics held to high standards. Unfortunately, this is only partially true. The reality is: scientists are humans just like everybody else. Yet an aura of integrity surrounds scientists – the more senior the scientist, the stronger it is. But it ultimately is a myth. And I would argue it is a myth that favours those scientists who lack integrity, and that this combination in turn fuels science deniers and their arguments. But let’s start slow and look at a brief and – arguably – incomplete history of fraud in science.
Science as we know and understand it today has its roots in the Middle Ages with alchemy, and therefore it basically began as fraud. The alchemists ultimately wanted to be able to transform any element into gold, which was a convenient draw for anybody hoping for a quick and easy buck. If alchemists lived today, they would sell “get rich quick” schemes on Youtube. But back then they would con rich sponsors into paying them in order to “continue their study” and ultimately – of course – provide them with as much gold, silver, or whatever else they desired. Any similarity of this system to modern science is – according to modern scientists – probably a coincidence. Nevertheless, a lot of money was lost to these people, and although one could argue that greed does not deserve a payoff, it still ruined the reputation of people who actually wanted to advance human knowledge, like Georg Brandt, for example. Although people cleaned up their act massively with the emergence of scientists like Lavoisier, Galilei and Copernicus, that unfortunately didn’t mean that frauds and assholes ceased to exist in the field. Especially with the start of “modern” university systems and research funding, fraudsters started to appear on the scene. Charles Babbage, a mathematician famous for his early work on computing, publicly criticized colleagues in the early 1800s for defrauding universities and research funders by doctoring and plagiarizing reports (and getting paid for those reports). But the main ramp-up of scientific fraud was enabled by modern communication and reach (i.e. more potential money and fame for fraudsters). Thomas Alva Edison was famous for being a major dickhead, plagiarizing research and claiming others’ inventions for his own benefit while cutting the real inventors out of the deal. And the 20th century had only just begun. Then came the Piltdown man, the Nazi “science” about occult stuff that was sold to the people and the higher-ups, the scientists who discovered climate change and kept quiet about it for oil money, the scientists who did research for the tobacco industry to skew public discourse, the scientists who still do research for the agrochemical industry, introducing herbicide-resistant GMO plants (a fundamentally stupid idea) and jacking up herbicide use and cancer rates in South America, the scientists who did research for the sugar industry to skew public discourse, the single guy (!) who introduced lead to gasoline despite knowing the dangers of lead and then introduced the CFCs that created the ozone hole, the people who experimented with human genome editing because they gave zero shits about ethics and lives, the famous professor who falsified a plethora of studies in order to get his name up and running in media and academia, and many more. The wiki list (last link) is incomplete; I personally know of a few more examples that I am not disclosing right now, as some matters are still unresolved – and obviously this can only be the tip of the iceberg.
I have seen people actually give advice on questionable research practices to students: during my studies, I encountered a prof who told the entire class it was of the utmost importance to shape data presentation in a favourable way (including omitting data or modifying the scales) to get a publication “through peer review”. Others told me they would fish for significance if their hypothesis wasn’t going anywhere. Some mentioned that they would always write their introduction last to get the best “theory fit” for their experiments. The bottom line is: people either don’t know about integrity in research or they don’t care. Unfortunately, I think those options aren’t mutually exclusive. Both are happening at the same time. Even if they knew, most wouldn’t care.
The (obvious) reasons are wrong incentives, human nature and broken controls. Scientists need to publish to advance their career and maybe get a side gig or two selling books, giving talks or sitting on industry boards. Those side gigs usually pay far better than uni work, leading to somewhat bizarre situations where some people run a chair at a university while sitting on more than three company boards or even running those companies themselves. And why shouldn’t they? They got tenure; firing them is basically impossible unless they really fuck up. You just don’t have to fuck up too badly. And while there are examples of that happening (Wansink), most of it flies under the radar. Human nature supports such behaviour – after all, why shouldn’t we get more money and attention? It feeds us dopamine and secures a better monetary future for our otherwise neglected children. And if peer review consists of at most five people per paper, and discourse and correction/retraction take multiple years (if people care, that is), what happens to the rate of fraud is only natural. To quote a magnificent film (The Big Short): “One of the hallmarks of mania is the rapid rise of complexity and the rates of fraud. And did you know: They’re going up!” We might actually be in an upward spiral of retractions and scientific misconduct – I suspect the harder job market for academics and the allure of increasing (social) media attention for science are doing their thing to “aid” this process. Although it has to be noted that it could also be the case that fraud is a constant in science and only the detection and the publication of detections have recently increased – we would still suck to the same degree, but our “science is self-correcting” narrative would maybe actually begin to work, effectively taking up the time of ethically working scientists to expose fraudsters instead of ethically researching things that would bring real progress.
Instead, right now, fraudulent scientists hide behind the myth of scientific integrity, seemingly flourishing with prestigious publications that aren’t worth the digital submission fee that Elsevier charges them. They are scientists, after all, and science is self-correcting. “If there’s anything wrong with my paper, other scientists will point that out”, they will say. Unfortunately, scientists are often too busy publishing themselves, scraping together money to survive (PhD students, for example), scrambling for data, or locked away in the lab for days at a time doing experiments. Not all of them do meta-research and can spend their time pointing out dirtbag science all day. They will read a fraudulent publication and dismiss it as such – but won’t report it. The base rate is too high to make the effort. I’ve heard from many disciplines that 30% or more of their papers are probably false positives or influenced by other shenanigans. Some even estimate most published research to be false. This obviously is food for people who deny science or want to discredit science they do not like (some of those being scientists themselves…just look at psychological debates about ego depletion, power posing, grit or phenomena adjacent to parapsychology). If you can find fault with one argument, you can always generalize it to the whole field in order to make yourself look better – it’s a PR move as old as history. But gaslighting a field into accepting your viewpoints is not science. An attack on the concept one is trying to establish (or sell) is often seen as an attack on the person – even though it clearly is not. Granted, scientists are not always the best at communicating without sounding condescending, but there should still be some level of professional standards when criticizing work. One can call aspects of someone’s work garbage without insulting the person. After all, we have all produced a lot of garbage in our lives, even while working. There’s nothing new under the sun.
To summarize: scientific integrity is a myth and does not exist. Scientists are neither a more ethical kind of human, nor are their motives different from those of any other employee. Integrity is a term born out of pride in the rigour of the scientific method, but like many other ideologies it ultimately failed to overcome human nature and a broken system of incentives. You want scientists to be ethical and reclaim integrity? Don’t trust them. Publicly criticize them. Call those who deserve it liars and bullshit artists. Be quick to void tenure for scientific misconduct. Teach them to take criticism with dignity and teach them how to criticize properly. And most of all: teach them good science. Don’t tell your students how to defraud journals. Tell them how to create good papers ethically. Teach them the scientific method of replication. Teach them that ethics pay off in the long run. Teach them about retractions and QRPs. And maybe, just maybe: pay them better. Imagine being bombarded with lies and deception most of the time. I think that deserves extra compensation, don’t you?
-
Simple researcher use ANOVA
At a recent academic conference, I listened to a researcher’s talk. They told the audience about the great experiment they had done and how the preliminary analysis uncovered a four-way interaction.
When I heard this and looked at the slide, I first thought I was hallucinating. First, there was a p-value of 0.09. Then there was an N of 53. Then I saw an eta squared of 0.01. But my head was still stuck at “four-way interaction”. I was thinking: what on earth is this? And how did they con themselves into believing that this was real?
People, we have to talk about a severe problem with statistics and a technique that has become a bad default. And yes, it is the ANOVA. Used one of those lately? Yes? Then buckle up. This text will probably be somewhat offensive to your sensibilities. Because I haven’t. And I think you shouldn’t have. And I also think you shouldn’t do it again. But no worries, I will show you something greater – something that is the same, just better.
First, let me tell you that I firmly believe that ANOVA is a legitimate method of analysis in a few use cases – very few. “But factorial data is everywhere”, you might exclaim, thinking about all the experimental conditions and groups you have been conditioned to apply during your training. I’m sorry to inform you: most of your conditions are clever ways to scam yourself into the belief that you’re doing something worthwhile, while your variance sits in the corner and cries because you neglected it. But let me start at the beginning.
The typical ANOVA is a special form of regression. It looks at factorial data and estimates the influence of the factors on the overall variance of a metric outcome. This is why, in R, you can also do an ANOVA with the lm() function – a fact unknown to a surprising number of people.
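A minimal sketch of that equivalence, using R’s built-in ToothGrowth data just for illustration: the same formula fitted with aov() and lm() yields the same model, the regression just shows you more.

```r
# Minimal sketch: the "ANOVA" and the regression are the same model.
# ToothGrowth ships with R; supp and dose are treated as factors here.
data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)

fit_aov <- aov(len ~ supp * dose, data = ToothGrowth)
fit_lm  <- lm(len ~ supp * dose, data = ToothGrowth)

summary(fit_aov)  # the familiar ANOVA table
anova(fit_lm)     # the same sums of squares, obtained from the regression fit (balanced design)
summary(fit_lm)   # plus the coefficients the ANOVA table hides from you
```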
The thing is that, for some reason, psychologists are taught that an ANOVA is a nice thing to do because we often research factorial influences on data. In fact, I myself was taught ANOVA before regression in my stats classes. And after reading your first few ancient social psych papers, you will, as a relatively fresh student, probably assume that ANOVAs are common and a good method for anything that comes your way.
But they are not. Why? Here are a few reasons:
- Most variables in psychology are quasimetric and modern regression estimators can incorporate them well.
- ANOVAs are imho the reason why people still do shit like age groups, or factorizing metric variables in general, which is not good practice.
- Calculating your N for decent power is very dicey once you include interactions.
- Most people cannot safely and easily interpret higher-order (>1) interactions (Occam recommends: don’t do it!), AND
- Most interactions are statistical noise, as they have severely lower power than main effects (see the simulation sketch after this list).
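To illustrate the power point, here is a minimal simulation sketch with invented numbers: a 2×2 design in which the interaction (as a difference-in-differences) is the same size as the main effect, yet is detected far less often.

```r
# Minimal sketch: with equal effect sizes and the same N, the interaction test
# has far less power than the main-effect test. All numbers are invented.
set.seed(123)
sim_once <- function(n_per_cell = 30, b_main = 0.5, b_int = 0.5) {
  d <- expand.grid(a = c(-0.5, 0.5), b = c(-0.5, 0.5), rep = seq_len(n_per_cell))
  d$y <- b_main * d$a + b_main * d$b + b_int * d$a * d$b + rnorm(nrow(d))
  p <- summary(lm(y ~ a * b, data = d))$coefficients[, "Pr(>|t|)"]
  c(main = unname(p["a"] < 0.05), interaction = unname(p["a:b"] < 0.05))
}
rowMeans(replicate(2000, sim_once()))
# roughly 0.8 power for the main effect vs. roughly 0.3 for the interaction
```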
But the last, and probably most controversial, reason is: ANOVAs make quasi-causal analyses (and therefore fraud) so much easier. As a psychologist you learn that, for a causal interpretation, you need experimental conditions, i.e. control groups and treatment groups. If you have those, you can create a factor that indicates the condition and use it to evaluate the cause of your outcome. Unfortunately, a lot of people think the factor is the important thing, which it is not, especially if both conditions rely on a variable that has a baseline in every participant. Let’s say you want to know whether high stress makes you sweat. You sample a thousand people, put 50% in some sort of high-stress condition and leave the others as they are. Obviously, you have to measure stress in both cases to prove that your control group is indeed the control and that the high-stress group has high stress, right? And if your ANOVA tells you the high-stress group is significantly different, you have a causal effect, right?
Answer: No. You uncovered a correlation between stress and sweating and factorized your IV. But what if we do this within-person? We can do an RM ANOVA then, and we don’t compare different people anymore!
Answer: Well, same story, apart from the fact that now you have made every participant uncomfortable. You cannot show causality without the cause being absent in one condition. And by absent, I mean “not even a trace”. This is what makes drug research easy in terms of operationalization. Obviously, if you give medicine to one patient and nothing to another, that fulfils the absence rule. But what about stress? Everybody has a baseline level of stress. All you can do is say “well, we have a significant effect of stress on sweating”. But that’s a correlation.
I hope you can see what I’m trying to get at: an ANOVA makes you believe in things that aren’t there, like stress “levels” or age groups. We as psychologists (or social scientists, for that matter) should know about the biases we all have. And those biases are telling us to do the easy thing, always. Even if we need to factorize our age and stress metrics. Because an ANOVA is so easy to interpret, right? It’s so comfy, and we don’t need to think too much about the whole thing. And I’m not excluding myself here – I’ve done that. But that experience is why I deem it dangerous.
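For the record, factorizing a metric variable is not free. A minimal sketch, with invented numbers, of what a median split costs you compared to just keeping the variable continuous:

```r
# Minimal sketch: dichotomizing a continuous predictor throws variance away.
# The stress/sweat variables and effect size are invented for illustration.
set.seed(99)
n      <- 200
stress <- rnorm(n)
sweat  <- 0.4 * stress + rnorm(n)

summary(lm(sweat ~ stress))$r.squared                     # keep it continuous
summary(lm(sweat ~ (stress > median(stress))))$r.squared  # "high vs. low stress" groups
# the median-split model explains noticeably less variance with the same data
```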
A regression output can be hard to interpret, I know. And you can’t write sexy things about the data compared to an ANOVA, which makes your writing so much easier and more interesting. But we have to remember one thing: statistics relies on models. All they do is look at the data we give them in their own way. It is our job to say “this fits” or “this doesn’t”. And much like with clothing, you decide whether your result ends up looking ugly or not. You can make regressions bloom like an ANOVA, without sacrificing variance, interpretability, or integrity. Think about your designs. Think about your variables. Think about your models. Interpret your results with care. And don’t use ANOVAs. Your readers and the community will be better off.