Whose problem is the "reproducibility crisis" anyway?

Aug 18 2015

One of the absolute best things about Twitter is that there are a bajillion scientists on it, and if I have a random question - what's your favorite cfos antibody, what's the best statistical test for ___, what do you think this blob on my micrograph is - I can just throw it out into the ether and am likely to get a number of super helpful responses almost immediately.

Today, however, my random question started quite the firestorm! Let's go to the highlight reel:

(BTW, the reason that the y-axis seems excessively high is that I am comparing the correlation in this experimental group to that of another experimental group whose y values do go that high, and non-matching axes are one of my biggest pet peeves)

I got lots of helpful responses from my lovely Twitter friends, which you can see if you click through to the conversation. My real question was, "do I include or exclude this data point that is 6 SDs away from the mean from analysis?" but a young mosquito scientist had this recommendation:

After which the following exchange occurred:

Leaving aside the fact that in my mind, a single funky data point out of almost 60 is not a "result," but a...data point...the answer is no, I do not "usually" replicate. Look, I get that in some labs it's super easy to run an experiment in an afternoon for like $5. If this is the situation you're in, by all means replicate away! Knock yourself out, and then give yourself a nice pat on the back. But in the world of mammalian behavioral neuroscience, single experiments can take years and many thousands of dollars. When you finish an experiment, you publish the data, whatever they happen to be. You don't say, let's spend another couple of years and thousands more dollars and do it all again before we tell anyone what we found! So I thought, OK, this guy runs an insect lab, maybe he doesn't know what's involved.

Well, to borrow from Don Draper,

[Screenshot of a tweet, 2015-08-17]

Jason later doubled down, responding to @neuromagician:

"CORRECT." I cannot even with this. What does it mean? His answer was blissfully, elegantly circular in its logic:

OK, then. Let's leave the tweets there (but grumpy subtweets this way lie).

Here's how I see it: different science fields have different experimental conventions that depend on all kinds of things. The "reproducibility crisis" may be real, but asking scientists to fix it by doing everything twice (or more) is naive at best and intentionally, self-righteously ignorant at worst.

As I've said before, the data are the data. You report them, along with the methods that you used to get them and the stats you used to determine your confidence in them. Someone else reads them and maybe decides to build off of them in their own work, which necessitates trying to replicate yours. Maybe their results are the same, maybe they're not. If they're not, are you a bad scientist? Are they? Obviously not. Part of what makes science exciting and fascinating is figuring out why some things don't always replicate. When we define the intricacies that drive our data, that's when we're truly making progress.


67 responses so far

  • Drugmonkey says:

    Preach it Sister!

  • Anonymous says:

    If replicating the experiment is not an option, then you have to include the data point that's 6 SDs away in the analysis. The data are the data, right? You should not exclude data points without having a really good reason for doing so. (Hint: being 6 SDs away is not a good reason.)

    Please don't take this the wrong way, but I'm surprised that someone with your experience is even asking this question. Is this common in your field?

    • Dr Becca says:

      What is the right way to take it, Anonymous?

      • Anonymous says:

        The right way would be that I'm surprised that someone whose writing I respect and whom I had assumed was a good scientist is asking something that in my field has a pretty obvious answer. So I'm trying to figure out if the accepted wisdom in your field is different, as opposed to automatically assuming that you don't know how to analyze your data.

        • Dr Becca says:

          I hope my answer from last night helped clarify the reasoning, here. And as others below who are in my field have indicated, this is indeed an accepted and common practice in behavioral neuro studies.

          • Anonymous says:

            Hmm ... not sure who exactly is in your field here, but I see a number of people encouraging you to keep the outlier and report your stats with and without it, especially since it has a significant impact on your interpretation. So, essentially, include the data point in your analysis. Is that the "accepted and common practice" in your field? Or were you referring to the replication issue?

          • Dr Becca says:

            I was more referring to the idea that the responses here suggest that my initial question - how do you interpret an outlier - was, in fact, a legitimate issue worth discussing, and perhaps the pearl-clutching, "I'm so disappointed in you as a scientist" knee-jerk response you had was a bit premature and narrow-minded.

    • bashir says:

      6 SDs might be a good reason to exclude a data point. I say this noting very loudly that you always report if a data point was excluded and why, and make that available. That being said, the data isn't just the data. If you're doing behavioral research there is some assumption that the data reflect certain conditions, e.g. the participants did the task and didn't fall asleep halfway through. This sort of thing happens all the time in human research. People fall asleep in the fMRI scanner, babies fuss out, children take unannounced potty breaks. You note it and (probably) exclude the data. How you define what is excluded can get tricky, and you can certainly be too loose with that. It's an interesting discussion, and one that requires scientists to think through what their data reflect and what reporting the results means.

      Mindlessly following rules without thinking through the rationale is not useful.

      • Anonymous says:

        There are *always* assumptions, Bashir, not only in your field.

        "People fall asleep in the fMRI scanner, babies fuss out, children take unannounced potty brakes...."

        And these might all be valid reasons for excluding the data point. But from simply looking at a graph and noting that one point is really far from the others, no, in my opinion this is not enough. And that was the question she posed.

        I'm curious, which part of "You should not exclude data points without having a really good reason for doing so" sounded to you like "mindlessly following rules"?

    • drugmonkey says:

      Sure it is a good reason. There are many good reasons. And many reasons not to exclude data. Do you really imagine your rigid viewpoint captures the complexity of this issue? Starter reading.... it has references, so, hint, check those

      http://pareonline.net/getvn.asp?v=9&n=6

      • Anonymous says:

        Um ... no it's not. And it's funny, because that article you linked to makes the same point I made. Of course, I have the advantage on you here, given that I'm a tenured stats prof at an R1. But thanks anyways for your extremely illuminating comment!

        • Matt says:

          I'm sympathetic to your point, but a tenured stats professor at an R1 should be ashamed to make such a blatant appeal to authority.

          • Anonymous says:

            Yeah, I usually don't do that. But I've gotten really good over the years at sniffing out the blowhards that are not worth my time or trouble to educate. (See his comment below for further confirmation.)

        • Drugmonkey says:

          Imagine my surprise. Stats geeks rarely understand anything about biomedical science and should be ignored for the most part. Our cows aren't spherical.

  • babyattachmode says:

    "the data are the data. You report them, along with the methods that you used to get them and the stats you used to determine your confidence in them."

    I've worked with different PIs during my MSc, PhD and post-doc, and people have VERY different opinions when it comes to this. Some PIs do exactly this: write down what you've done, and in the discussion you cite the field and place your results in their context. Other PIs are much more inclined to suggest in the discussion what all of this means and how the rodent results will translate to humans, followed by a press release and the media running away with the story AS IF it were done in humans. And that's when it gets problematic: when your results happen to be true only in your lab, or only that one time you did the experiment (even though it was significant that one time).
    So I indeed think it is our responsibility as scientists to be honest about our data. It IS different when something has been reproduced a billion times over in different labs around the world or when it has only been found/reported once and that is something to acknowledge.

    And as an aside: I have worked with a behavioral neuroscientist who only published things when they had seen them in at least two separate cohorts of animals, run at separate times, etc. This is not necessarily good for the number of publications that you get...

  • DJMH says:

    I'd go back and re-examine all info about that particular n to see if there was anything that in retrospect was strange about it (e.g. mouse was stunted, ganglion was dropped on floor, whatever). If there was, I would look at all the other data points to see if any of them also had the "strange" element, and I would exclude all data points with that problem, and say exactly that in the methods.

    But if there's nothing else weird about it, I'd leave it in. The rest of the data are so lovely that it doesn't really seem to hurt your graph, and who knows when it may turn out to be an interesting low-frequency legit occurrence.
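
    (For what it's worth, a minimal sketch of that kind of bookkeeping in Python/pandas, with entirely invented animal IDs, values, and exclusion reason - the point is just that exclusions get recorded and reported rather than silently dropped:)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per animal, with a column recording any documented reason for
# exclusion (IDs, values, and the reason below are all made up).
df = pd.DataFrame({
    "animal_id": np.arange(1, 61),
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "exclusion_reason": [None] * 60,
})

# e.g., the lab records say animal 17's tissue was mishandled
df.loc[df["animal_id"] == 17, "exclusion_reason"] = "ganglion dropped on floor"

analyzed = df[df["exclusion_reason"].isna()]
excluded = df[df["exclusion_reason"].notna()]
print(f"analyzed n = {len(analyzed)}, excluded n = {len(excluded)}")
print(excluded[["animal_id", "exclusion_reason"]])  # goes in the methods section
```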

  • Anonymous says:

    Wow, maybe you should blog about how to take criticism and not take it personally.

  • Someone who is close to you says:

    Wow, maybe you should blog about how to take criticism and not take it personally.

  • Ben Saunders says:

    A comment for the "do it again" trolls.

    It is very rare that a behavioral neuroscience experiment (or one in any field where the model system lives longer than a few days) is completed in a one-off, single-cohort fashion. This is what people fail to understand. They might run an in vitro prep on a Tuesday afternoon, and THAT is their experiment. So sure, they want to do it again on Wednesday afternoon to make sure it is a consistent effect. Totally reasonable.

    Behavioral neuroscientists do something similar, it's just stretched out in time. You have a final N of 60 rats, say, but that is probably a reflection of multiple cohorts of the same study run across many months or even years. This is done for practical reasons - it's usually not possible to run all 60 rats at one time, you want to see if there is something to the experiment before committing all the resources for 60 rats, other experiments compete for the behavioral equipment, etc - but -- and this is key, listen up -- it also ends up being a STRENGTH in terms of assessing reproducibility, because if you see a similar pattern across your cohorts, which have maybe been run by different grad students or undergrads at different times, then that is good evidence for robustness.

    The difference in approach reflects the very real distinctions between in vitro or nonmammalian neuroscience, and work on rodents, nonhuman primates, and humans. In the latter cases, getting sufficient group sizes for statistical power is a major limiting factor (logistically and $$$-wise). Hence, drawn-out studies.

    That said, even if you could hypothetically run your 60-rat study in a one-off, Tuesday-afternoon fashion, calling for a redo is unreasonable. The cost of 60 rats is several thousand dollars. Housing those rats in an animal facility will run you one to several thousand dollars per month. Say your study, like most, involves some kind of histological/imaging component. Your antibodies, tracers, viruses, reagents, etc., for 60 rats will be another several thousand. These are all costs you would have to pay all over again.

    This should be obvious, and if it is not, take a second to think about it before judging people outside of your field for doing science "incorrectly".

    If you want the system to change so all scientists have to independently replicate every "experiment" X times so you can believe the results are "true", redouble your efforts to change the academic funding and publishing structure to reorient incentives, reduce costs, increase funding, and adjust tenure requirements, THEN come talk to us about who is doing it "right". And be ready for the pace of some areas of science to slow down.

    • PaleoGould says:

      So true. My two latest papers aggregate several cohorts run over the past year. We HAVE replicated the results from one in later cohorts.
      (But tiny shout out for people doing work on mammal models other than rodents, nonhuman primates or people).

  • On the data point: I would not remove it _only_ for being 6SD away (although there are statistical tests for outliers that presumably work more or less that way). But there are many reasons we clean data, and they are not fraud: if that data point was associated with confounding experimental conditions that could plausibly have influenced the expt, for example.

    On the broader issue of replication and reproducibility: We tell each other that reproducibility is important, but we rarely do it, for very good reasons (some of which are nicely identified in your post). This has been true for at least 400 years, and the handwringing then looked amazingly like the handwringing now - and yet science has progressed quite nicely over that time. Hardly meets the definition of "crisis", as I argued here: http://wp.me/p5x2kS-1M. I think progress comes more often from consilience (or its lack) between related studies, and rarely from exact replication of any single study.

  • drugmonkey says:

    The RealSolution is that "replication" should be conducted across labs. People should be publishing frequently, not worrying about who has the most complete story that will squeeze anyone else out of being interested in similar work, and should be riffing off of other people's findings in a replication+extension way. Oh, and journals of normal JIF which exist within a large pool of similar competing journals should not demand absolute "novelty" of workaday publications.

  • becca says:

    I hope Jason comes back as a rat in the next karmic wheel turn.
    It's immoral to run it again.
    Goofy data point should stay on the graph, and I doubt including it in the trend line will hurt, but I'm happy to admit other people with more stats experience may know better.

    About 60 data points is good- though in defense of twitter judgeypants, it's not obvious there are almost 60 little black dots underneath each other.

  • mH says:

    I'll bite... I think people mean a lot of different things by replication and there is a little bit of confusion and a lot of pedantry going on.

    First of all, the broader discussion around "replication" now has nothing to do with this. That is about independently replicating results, usually by different people in a different lab in a different study. Do other scientists get results consistent with yours?

    As for experimental replicates, it is clear that that is what this data shows. This is a behavioral measurement, so the experiment was done many times, probably on different days, and each of the data points shown is an independent measurement of an animal in one of the two groups compared (so biological replicates, not technical replicates).

    What some people seem to be arguing is that you should do this experiment for a while (n=x), analyze the results, and then start over and do the experiment with n=x again. Maybe this time wearing your watch on a different wrist? I dunno. This would be mathematically no different from cutting your current sample size in half (or thirds or quarters) and analyzing them separately. Because whatever you are trying to control by "replicating" a result (usually day-to-day variation, batch of reagent, whatever) is already varying over the time period that you collect this much data. The whole point of collecting data on groups of animals over time is that the little sources of variance you don't care (or know) about should average out.
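
    (To make that concrete, here's a small simulated sketch - all numbers invented - of why pooling cohorts collected over time already does the averaging that a separate "replication" of half the data would do, only with better power:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def run_group(true_mean, n_cohorts=4, n_per_cohort=15,
              cohort_sd=0.3, animal_sd=1.0):
    """Simulate one experimental group collected across several cohorts,
    each with its own day/batch-level shift (all values are made up)."""
    cohorts = []
    for _ in range(n_cohorts):
        shift = rng.normal(0, cohort_sd)          # day-to-day / batch variation
        cohorts.append(rng.normal(true_mean + shift, animal_sd, n_per_cohort))
    return np.concatenate(cohorts)

control = run_group(0.0)
treated = run_group(0.5)

# Pooling all cohorts lets the nuisance variation average out...
print("pooled:      ", stats.ttest_ind(control, treated))

# ...whereas "replicating" by splitting the same data collection in half
# just yields two noisier, lower-powered versions of the same comparison.
print("first half:  ", stats.ttest_ind(control[:30], treated[:30]))
print("second half: ", stats.ttest_ind(control[30:], treated[30:]))
```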

    I see no compelling reason to do this kind of "artificial" replication, whereas putting limited resources into an adequate sample size for robust comparisons is a no-brainer. I never ever ever ever see the same experiment presented twice (or 3x or 4x) in a paper in my field, which is one where animals are cheap and experiments are fast. I do experiments on multiple days and always do controls and experimental groups together, but OF COURSE you pool all the data for analysis.

    Finally, what should count as real scientific replication -- independent verification of a FINDING (phosphorylation of protein X leads to increased expression of gene Y) -- has to be done with a different experiment, e.g. first we inhibited the phosphatase, then we overexpressed the kinase, then we made a phosphomimetic version of X, and every time we saw Y expression go up. That's when you start to believe a RESULT rather than a STATISTIC.

  • lurker says:

    I'm going to guess Anonymous (aka Someone who is close to you) is a mid-career white dood with grant funding, similar to @vectorgen, who is actually not that young a squito biologist and who just got R56 bridge funding. When these guys have $$, they act like they're qualified to sit on a high horse. Show some empathy, you pricks! Don't say something like "I'm surprised that someone with your experience is even asking this question...." which is the kind of criticism I also frequently get from the old white doods able to keep their funding thanks to their laurels and maybe their bullying character. Way to keep kicking the person who is already on the floor trying to survive this funding drought!

    I'm in the same boat as Dr. Becca, young lab outta money and trying to eke out a few more papers by using my sheer will to push these manuscripts through peer review. I know exactly the conundrum that she is facing. She's not trying to cheat or manipulate the data; she is really trying to put together the best story to get past our asinine system we call peer review and the glamour-douche editors ready to toss your manuscript to the side because they only do a cursory glance of your data, your abstract, and their finger to the wind of what currently prevails as sexy. We're the low pegs on the totem pole, and in a very insecure state, not necessarily by our own doing but because the funding situation is tight and we have tremendous toxicity in this hyper-competitive atmosphere that is rife with jerks like Anonymous and @vectorgen.

    Come on people, show a little empathy! Times are tough for everyone without a BSD, the last thing we need to do is go Lord of the Flies on everyone else. We can all take criticism when it's constructive, but we should shame those like Anonymous and @vectorgen when everyone can see their criticism is personal and counterproductive.

    Be Teflon, and keep fighting the good fight, Dr. Becca. May you live to fight it another day.

    • Dr Becca says:

      Thank you, lurker!! Sorry you're in the same boat as me (a luxury liner it is not) - I hope things turn around for you soon.

    • Anonymous says:

      It's probably not clear to you -- or others? -- that "Anonymous @6:13 am" (i.e., me) and the anonymous that posted later (aka Someone who is close to you) are not the same person. And not that it should matter, but I'm not a mid-career white dude. I'm in a very different field (though I do collaborate with neuroscientists on occasion), hence my question, which was sincere.

      I never said that she was trying to cheat or manipulate her data. But your comment, "she is really trying to put together the best story to get past our asinine system we call peer review," ironically, does suggest that some degree of manipulation is precisely what she's after. (Oh, where oh where is that line?) And regardless of where you are on the totem pole or how many pubs it costs you, this is not OK.

      • Dr Becca says:

        OK. I am not trying to get the story "past" anyone, here. However, the end product in data presentation is interpretation, right? The primary question I want to answer with my data is: is there a meaningful relationship between X and Y?

        So let's say hypothetically that when I run a standard Pearson's correlation analysis on this data set, I get a just significant correlation when the outlier is included, and a just non-significant one when it's not. What do I reasonably conclude about the relationship between measures X and Y in this group? If I leave the outlier in and report the relationship as meaningful, a reviewer could look at it and say, this is bullshit, the effect is clearly driven by the outlier and therefore it is misleading to say X and Y are significantly correlated. On the other hand, if I take it out and conclude that there is no significant correlation, am I failing to report what very well may be a meaningful effect?
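
        (For concreteness, a minimal sketch of that with/without comparison, with made-up numbers standing in for the real data set - run the same Pearson correlation twice and report both:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# 59 invented animals with a weak underlying relationship, plus one point
# far from the rest of the cloud (these values are not the real data).
x_main = rng.normal(0, 1, 59)
y_main = 0.2 * x_main + rng.normal(0, 1, 59)
x = np.append(x_main, 4.0)   # the outlying animal
y = np.append(y_main, 6.0)

r_all, p_all = stats.pearsonr(x, y)
r_trim, p_trim = stats.pearsonr(x[:-1], y[:-1])
print(f"with the outlier:    r = {r_all:.2f}, p = {p_all:.4f}")
print(f"without the outlier: r = {r_trim:.2f}, p = {p_trim:.4f}")
```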

        • Drugmonkey says:

          Report it both ways as you do here. Personally I would then interpret it as not a significant correlation and use that as your conclusion for this dataset.

        • A Salty Scientist says:

          I agree with DM and others to report both, with your explanation. Even for systems where it is very easy to "repeat" to exclude "outliers," I think that leads to all-too-perfect data that discards true biological variability and heterogeneity. As commenters have duly noted, repeating until you get "perfect" data does nothing to solve the "reproducibility crisis."

  • David Jentsch says:

    There are a number of ways (not one way) to measure the reproducibility of an effect, or to discriminate a robust effect from a non-robust one.

    Doc Becca already did this.

    She tells us that she studied 60 rats (potentially in two groups). If the group sizes were equal, each experimental group contained 30 rats.

    That means she "replicated" each experimental effect 30 times.

    Granted, repetition of experimental groups over time and context is yet another way to increase a sense of the robustness of an effect separate from other factors, but Doc Becca already replicated her experiment in the most straightforward and scientifically accepted way.
    She hasn't told us for sure, but I take from her comments on the amount of time it took to do this experiment that there may also have been an opportunity to measure the outcomes in each rat repeatedly over time. This is yet another way to ensure reproducibility.

    Now, there are real scientific differences between fields about how often you should repeat a given entire design over and over again. A big difference comes down to model organism, not just cost.

    The prevailing ethos and regulations tell us that unnecessary duplication of experiments involving live vertebrate animals is unethical and impermissible. What is open for debate is which repetitions are necessary and which are unnecessary. But all things being equal, an IACUC is going to side with being conservative and say it's unnecessary. And our opponents are going to scream loudly whenever they see repetitions, since one of their top claims is that animal research is unnecessarily DUPLICATIVE. This poses a huge problem and raises the question of whether those who call for repetition after repetition of experiment are essentially demanding impermissible experimentation and why.

    • Dr Becca says:

      Great point, David. Just to clarify - EACH experimental group had 60 rats. I can't go into more detail without, I think, compromising the pseud, but it was necessary for the experimental design.

    • drugmonkey says:

      And our opponents are going to scream loudly whenever they see repetitions, since one of their top claims is that animal research is unnecessarily DUPLICATIVE. This poses a huge problem and raises the question of whether those who call for repetition after repetition of experiment are essentially demanding impermissible experimentation and why.

      Absolutely. And at the same time they blather on about how one specific study does not perfectly predict the human condition so therefore wrong.

      One possible upside to the current furor over allegedly not-reproducible effects will be a broader set of discussion points about why sometimes repetition of an experiment is necessary.

      • David Jentsch says:

        I entirely agree with you. The push for greater reliability in science and reproducibility of results must directly confront the regulatory pressure to reduce sample sizes within experiments and to discourage or even prevent duplication of studies. This must be a talking point amongst all who argue that steps must be taken to increase reliability.

  • dr24hours says:

    Many animal studies have far too small n's and would benefit from larger initial cohorts. This would make them even more prohibitively expensive to replicate. And of course, if you're doing a "single" experiment over multiple cohorts and long time periods, you can argue that the replication is already built in in some ways, though for truly robust results we'd like to see other scientists being the ones doing the replication. If a single lab does an experiment 20 times and gets the same result 20 times, that's nice, but real replication happens when the outcomes are repeatable in different environments.

    But there's also a deep statistical fault in the "do it again" mindset. Replicating experiments ad nauseam is perfectly possible in my field. I can generate n's of hundreds of millions if I want, under broadly variable conditions. Similarly, people working with big data sets can often find millions of n's for case-control studies. But that is deeply dangerous. Overpowered studies will confer statistical significance on meaningless differences.

    I had a data set in which the control and exposed groups' mean total cholesterol values were 200 and 202. That's not a clinically meaningful difference by any stretch of the imagination. But the P value was <0.00001. Because there were 200,000 people in each group. It was a meaningless statistical artifact. That's an extreme example, of course, but it doesn't take n's on that scale to befuddle P values when dealing with small effect sizes and noisy data.
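
    (A quick simulated sketch of exactly this trap - invented cholesterol values with means of 200 vs 202, an assumed SD of 40, and 200,000 people per group - attaches a vanishingly small p value to a trivial effect size:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated stand-in for the cholesterol example (all numbers assumed).
control = rng.normal(200, 40, 200_000)
exposed = rng.normal(202, 40, 200_000)

t, p = stats.ttest_ind(control, exposed)
pooled_sd = np.sqrt((control.var(ddof=1) + exposed.var(ddof=1)) / 2)
cohens_d = (exposed.mean() - control.mean()) / pooled_sd

print(f"p = {p:.1e}")                 # astronomically small
print(f"Cohen's d = {cohens_d:.3f}")  # a trivially small effect (~0.05)
```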

    All of us doing research that requires statistical analysis need to do a better job of analyzing our power, and designing experiments that actually illuminate the underlying question, and don't just naively throw more samples at a problem that isn't necessarily informed by more data.
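
    (And a minimal example of the kind of up-front power calculation being advocated here, using statsmodels; the assumed effect size of d = 0.5 is doing all the work and would need to be justified for any real study:)

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group for a two-sample t-test at alpha = 0.05 and 80% power,
# assuming a medium standardized effect (Cohen's d = 0.5).
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group under these assumptions
```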

  • Dave says:

    The eyeball test tells me that's an outlier. Even if she leaves it in, doubt it will have much impact on the stats given the large group sizes. Some dickhead reviewer will lose his shit over the optics of it though I'm sure. I wouldn't give it a second look as a reviewer.

    Using live animals, shit happens.

  • neuromusic says:

    am I the only one who finds "Someone who is close to you" super effing creepy?

    • becca says:

      No, it totally shows a professional conscientiousness, an awareness of the optics of behavior on the internet with regard to "stalking", and a mature and finely attuned sense of interpersonal boundaries.
      Or the opposite of all that. You know.

  • Joe Hilgard says:

    I appreciate and recognize that replication involves a lot of time and expense. Here, I don't think that is strictly necessary -- a later researcher might want to replicate it, or you might replicate it later in an extension study -- but for now, the most important question is this: By how much does your inference hinge on this one datapoint?

    It seems to me that the simplest solution, especially if one believes that "the data are the data," is simply to report the analysis both ways: once with the outlier, and once without. That way the reader is apprised of how robust the results and inference are to the inclusion or exclusion of this one rat, and can weight belief accordingly.

    Moreover, as the researcher who performed the experiment, you have the ability to report on your subjective belief that the outlier represents a valid or invalid observation. Per @DJMH's suggestion, you might go through your notes to provide necessary information as to whether this is probably valid or probably invalid.

  • DJMH says:

    Yeah, and I would also check out statistical options that are more robust to outliers--non-parametric stats are usually great at coping with oddities in biological data like this.
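
    (A small made-up example of why rank-based statistics blunt the influence of a single flyer - Pearson uses the raw values, Spearman only their ranks:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# 59 invented points with no real relationship, plus one extreme point.
x = np.append(rng.normal(size=59), 5.0)
y = np.append(rng.normal(size=59), 5.0)

# Pearson works on raw values, so the single flyer gets a lot of leverage;
# Spearman works on ranks, so the same point only counts as "the largest value".
print("Pearson: ", stats.pearsonr(x, y))
print("Spearman:", stats.spearmanr(x, y))
```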

    • drugmonkey says:

      That's p-hacking DJMH. Tch. Tch.

      • PaleoGould says:

        Wait, how is using a non-parametric test where it is warranted by the data p-hacking?

        • Drugmonkey says:

          How is it not p-hacking to select a test after collecting the data and trying to achieve the asterix by any means necessary?

          • Pshrink says:

            It's not p-hacking to select a test based on the distribution of data. Many statistical tests are based on assumptions about the distribution of data (i.e. whether it's normal or not) --- something you won't know until AFTER the data is collected.
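
            (A minimal sketch of that workflow with invented, deliberately skewed data: check the distributional assumption after the data are in hand, then pick the test whose assumptions actually hold:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Deliberately skewed (exponential) data for two made-up groups.
group_a = rng.exponential(scale=2.0, size=30)
group_b = rng.exponential(scale=3.0, size=30)

# The normality assumption can only be checked once the data exist.
for name, g in (("A", group_a), ("B", group_b)):
    stat, p = stats.shapiro(g)
    print(f"group {name}: Shapiro-Wilk p = {p:.3f}")

# If normality looked fine, a t-test would be defensible; given the skew,
# a rank-based test is the better-matched choice here.
print(stats.ttest_ind(group_a, group_b))
print(stats.mannwhitneyu(group_a, group_b))
```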

          • dr24hours says:

            You often have to select the test after collecting the data. You use different tests for differently shaped data and you might not know the data shape until you collect it.

            If you select a t-test, and your data comes out Poisson, it's not p-hacking to switch to a C-test.

          • drugmonkey says:

            She tried one test and now you people want her to try another one. P-hacking!

          • dr24hours says:

            No. It's not p-hacking to use the RIGHT test. DM is absolutely right that you cannot just use a bunch of tests and select the one that gives you the answer you like. That's p-hacking. Using the correct test for the data - even if you originally selected the wrong test - is just doing the stats right.

          • dr24hours says:

            "How is it not p-hacking to select a test after collecting the data and trying to achieve the asterix by any means necessary?"

            That would be p-hacking. Luckily, no one is actually suggesting that.

          • Becca's collaborator says:

            This whole thing seems to be making a mountain out of a molehill. I am quite "familiar" with these data and the question is: Is there a relationship between these two variables? This is a minor issue that is not central to the paper. We don't have any particular feelings about either answer, but we do want the truth. This is group A, and group B has a very strong and steep relationship between the variables. It is very clear if you see group B's data that the relationship between these variables is very different. For group A, there is a weak but significant relationship when the weird guy is included in a Pearson correlation and that disappears when you drop him. My attitude is if we tried to convince people there *was* a meaningful correlation we'd be full of shit. Stats are a tool to describe your data that need to be interpreted in light of biology, your other observations, and common sense. They aren't a straitjacket forcing you to tell lies. As an insider to these data it is absurd what a projection of everybody's unique bugaboo this turned into.

            Either that or this is the sole piece of evidence for our astonishing and media-ready social psychology finding. If we can find a way over 0.05 we're on Dr. Phil, otherwise our careers are destroyed.

      • DJMH says:

        No way. One quick glance at those data screams not normally distributed, so it makes sense to consider non-parametric stats.

        Also, if the outlier is what *makes* the correlation, then I'm doing the opposite of p-hacking. p-lettinggo, maybe.

        • Spiny Norman says:

          Bingo. And if the point is a true flyer, it is deceptive since it's at the tail of the distribution and hence, in a regression, high-leverage.

          The correct procedure is to plot ALL the raw data. As you have done.

          You must then specify which data are used in your analysis, which (if any) are excluded, and what the rationale for your choices is.
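
          (For the leverage point specifically, standard regression influence diagnostics make this concrete; a sketch with made-up data using statsmodels' Cook's distance and hat values:)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Invented data: a weak relationship in 59 points plus one far-out point.
x = np.append(rng.normal(size=59), 4.0)
y = np.append(0.1 * x[:59] + rng.normal(size=59), 6.0)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

cooks_d = influence.cooks_distance[0]   # one value per observation
worst = int(np.argmax(cooks_d))
print(f"most influential point: index {worst}, Cook's D = {cooks_d[worst]:.2f}")
print(f"its leverage (hat value): {influence.hat_matrix_diag[worst]:.2f}")
```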

  • schotz says:

    If I had to run a replication every time I had an outlier, I would spend all my time doing replication. And what happens when my replication has an outlier? Run a third replication? That train of logic is crazy to me. I would run a "sensitivity analysis" to see just how much impact outlier(s) are having on your analyses and results.

    BTW, I am a co-investigator on a number of large-scale education studies that literally cost millions of dollars and multiple years to run. Replication is often simply not an option.

  • namaste_ish says:

    I hate people. Thank you for reminding me.

  • shrew says:

    What a chode.

    Is he trying to argue that the outlier is a real effect (and therefore you should try to replicate an effect that occurs 1/60= 1.67% of the time)?

    Or is he trying to argue that it isn't a real effect, and your graph is stupid because of a single outlier, and you should try to create a magic, outlier-free figure by repeating shit over and over again?

    Because the latter approach, of just repeating experiments over and over until you got a pretty graph, sure wouldn't contribute to a reproducibility crisis. Nope nope nope!

    I'm straight confused, bro. I think if there is one thing we can be sure of, though, it's that one or more of your new anonymous commenters are handling this scientific disagreement in the most mature, collegial manner possible. Pity his departmental colleagues, who must interact with him IRL.

    • Dr24hours says:

      A point 6 SDs from the mean occurs a lot less frequently than 1/60 - under a normal distribution, closer to one in a billion. Either the distribution isn't actually normal, or the sample estimates are off, or doc becca was VERY unlucky/lucky, or (most likely) there's a material reason that element functioned differently than the others. Justifying exclusion.

  • Dennis Eckmeier says:

    IMHO, in that specific figure up there, I'd say the experiment was repeated quite often, and one time something funny happened. Asking to repeat the whole thing until you don't get an outlier anymore is ridiculous.

    You show it in the plot. Mention it in the results along with the information that you ignored it for statistics (because 6SDs away from the cloud is obviously an outlier, it just fucks with every statistical measure), and move on.

  • Dennis Eckmeier says:

    P.S. An N of 60 is *far more* than sufficient to establish a distribution (there are tests for it, but I would be quite surprised if it weren't).

    Let's say the distribution is normal; then the chance of a data point 6 SDs away from the mean is not zero, but crazy small (even >3 SD has a probability of only about 0.1%!!). Thus, if you have one such data point in a well-established normal distribution, leaving it in means
    1. you are insanely over-representing that region of the value space (like, by a factor of tens of thousands), and
    2. you are ignoring the fact that it is far more likely that something was odd in this one case than that this 'result' is part of the real distribution.

    Considering it in the statistical analysis is therefore *completely* unreasonable.

    Of course you can repeat the experiment until you would expect that value to occur at least once. But that would be what? An n on the order of a billion or something?
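
    (For reference, the normal tail probabilities behind these numbers - a couple of lines of scipy gives the expected frequency of a point this far out, which for 6 SDs is on the order of one in a billion:)

```python
from scipy import stats

# Upper-tail probability of a standard normal at 3-6 SDs, and roughly how
# many observations you'd need before expecting to see one such value.
for z in (3, 4, 5, 6):
    p = stats.norm.sf(z)
    print(f">{z} SD: p = {p:.1e}  (about 1 in {1 / p:,.0f})")
```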

  • Anonymous says:

    "Mention it in the results along with the information that you ignored it for statistics (because 6SDs away from the cloud is obviously an outlier, it just fucks with every statistical measure), and move on."

    "Considering it in the statistical analysis is therefore *completely* unreasonable."

    Yeah, see, this is the problem.... And of course, "because 6SDs away from the cloud ... just fucks with every statistical measure" is simply not true.

    As someone already pointed out, in the absence of solid info that this data point is unreliable, the easiest answer is to report it and do your stats with and without it. But I guess the concern is the optics.

  • Anonymous says:

    Dr. Becky,

    I have always enjoyed your discussions of baking, makeup and the foibles of being a lady PI. You should stick to those topics you know well. Leave the glam humping data massage to those of us who have years of experience with that and can use the power responsibly.

    I must warn you that Bob and I have a similar study nearing initiation, so unless you want to be roadblocked by us and our glam editor cronies, your best strategy is to do it all again so your replication of our breakthrough is statistically sound. Stay in your lane.

    I see there are a lot of low-rent imitators here, but I want to be clear that I am the one true Anonymous. I am a much higher caliber and more important form of jerk than those clowns.

    BGB out


  • Paul Thompson says:

    You do the following:
    1) Ensure that an error of some sort has not occurred with this data point
    2) Run the stats with and without
    3) Use a nonparametric method to examine the relationship - rank-based methods are sensitive to the fact that the point deviates, but much less so to how far it deviates

  • Bruce Dick says:

    Dear Dr. Becca,
    When I have seen such issues, I start with a creative problem-solving approach, particularly with respect to attitude.
    I treat 'outliers' as the most important data. They are always 'true'; either part of the population/reality being studied or of a different one (e.g. from error). The trick is to investigate them from numerous angles (even the ridiculous) and keep plugging until you identify the real issue; then your response will be obvious. They invariably teach us something we do not know, be it only poor experimentation or unrecognised bias, or hopefully great discoveries and expansion of knowledge. Looking at the unknown is what gets Nobel Prizes.
    Know the beast before acting. Treat them as your best friend and they make you a better scientist.
    Best wishes,
    Bruce.

  • Ola says:

    So the real question is, what are you doing with those graphs? Judging by the fact there's a line on there, it looks like you're doing linear regressions and chasing after r-squared values. In that case, I would bet good money that the outlier is driving most of the r-squared value for the example shown. Remove it, and your correlation goes away. IMHO, if the removal of less than 10% of the data can have a significant impact on the overall result, then the result is not reliable. In this case, removal of 1/60th of the data would have that effect. Ergo, whatever positive r-squared value you have from this plot is probably not "real" (cue argument about meaning of the word "real" in 3, 2, 1...)

  • Physicist says:

    I would rerun, but then all my `experiments' happen on the computer.
    If for some reason I can't rerun, and there are no obvious reasons to exclude the point, I would keep it in the data.
    However, I would probably remove it from the plot, and add a note below that one outlier is not shown.
    Reason being that any features the rest of your data might show will be invisible with the plot range required to show the outlier.
