If it disagrees with experiment, it is wrong. In that simple statement is the key to science. It doesn't make a difference how beautiful your guess is. It doesn't make a difference how smart you are, who made the guess, or what his name is. If it disagrees with experiment, it's wrong.
That quote from Richard Feynman encapsulates what I love about science. There are no sacred cows. No matter how clever you think you are, your hypothesis can be disproven with experimental evidence. Even the most prestigious scientific organisations recognise this; the motto of the Royal Society est. 1660 is nullius in verba - take nobody’s word for it.
Psychiatry, like all branches of medicine, has an amazing tool for disproving a hypothesis: the randomised controlled trial. Randomly allocating participants to each arm of the trial (for example, drug versus placebo) results in a fair test. By ensuring there is a 50/50 chance of each participant going into either group, we know that any differences between the groups at the start of the study arise by chance rather than from systematic bias.
If the trial is large enough, and enough steps are taken to reduce sources of bias - like blinding the participants and researchers to the group allocations - we can be confident that differences between groups at the end of the trial are caused by the drug under study. A well-conducted trial is the gold standard of evidence; it is the basis on which all new medicines gain approval.
To misquote Feynman - if your hypothesis disagrees with a randomised controlled trial you are wrong!
But is it really so simple? In reality, you need to go through a trial with a fine-tooth comb before deciding whether to throw out your hypothesis. No study is perfect, and decisions made in the design or conduct of the trial can influence results.
Mistakenly accepting a hypothesis (false positive) is known as a Type I error, while mistakenly rejecting a hypothesis (false negative) is a Type II error. Both these error types have to be borne in mind when evaluating the results of a trial.
In this post, I’m going to examine a recent negative randomised controlled trial of raloxifene for schizophrenia spectrum disorders and discuss what the results mean for the oestrogen hypothesis of schizophrenia.
Preamble
The interest in raloxifene as a treatment for schizophrenia makes sense as part of the wider oestrogen hypothesis, discussed in a previous post. Briefly, there are various clinical and epidemiological findings that suggest fluctuations in oestrogen might be important in schizophrenia. There is further evidence (yes, from randomised controlled trials) that oestradiol, the most biologically active form of oestrogen, is effective as a treatment.
I see raloxifene as a medication that has similar actions to oestradiol, without some of the concerning longer-term side-effects. It is classified as a Selective Oestrogen Receptor Modulator, a group of drugs that have variable effects on oestrogen receptors, depending on where the receptor is in the body.
In the case of raloxifene, it stimulates oestrogen receptors in the brain but inhibits them in the uterus. That means, unlike oestradiol, it is not associated with uterine cancer (though in practice, giving oestradiol along with progesterone mitigates this risk).
So it seems like raloxifene could be promising as a treatment for schizophrenia, acting on oestrogen receptors in the brain without affecting them in the uterus.
The trial
The study under question was published open access in Schizophrenia Bulletin by a team from Utrecht spear-headed by Bodyl Brand and Iris Sommer. They recruited male and female participants who had schizophrenia, or a related disorder, and had been on a stable dose of antipsychotic medication for two weeks - they were studying whether raloxifene could ‘augment’ the effect of standard antipsychotics.
Participants then received 120mg of raloxifene (with the dose based on previous research) or a superficially identical placebo for 12 weeks. After the treatment phase, participants were followed up for another two years. The outcomes included a standard schizophrenia scale called the PANSS (Positive and Negative Syndrome Scale) and cognitive tests.
The study took measures to reduce bias, using an independent statistician to randomise participants and ensuring both patients and researchers were blind to treatment allocation. It was pre-registered at ClinicalTrials.gov. All these are features of a modern, high-quality trial and make the results more believable.
Furthermore, the researchers published their protocol in advance. This is reassuring because it’s always tempting to make tweaks to a study’s analysis plan after results are seen, in order to get a positive finding. By comparing the final study with the protocol, we can see what changes were made in the course of the study being conducted (in many cases there are good reasons why protocols might be amended).
The authors make this task even easier by listing all the changes that were made to the protocol. Most of them seem minor. The one that really caught my eye was changing cognition from a secondary outcome to primary (i.e. most important) outcome, which was done in June 2021. It also looks like they used a slightly more complicated statistical analysis (linear mixed-effect modelling) than the one that was planned (repeated measures ANOVA).
102 participants were randomised, with 50 allocated to placebo and 52 to raloxifene. At baseline, both groups had similar scores on the PANSS scales (indicating a similar level of illness severity) - though note the mean scores of 56-58 would correspond to ‘mildly ill’.
The main results for PANSS scores are shown below:
The red solid line represents all participants who were taking placebo while the blue solid line is raloxifene. The dashed lines represent each group split by sex (males and females). I’m a bit confused by this figure. I think it shows that the placebo group had worse/more severe PANSS scores during the treatment phase (baseline to 12 weeks), while the raloxifene group had slightly better scores. By the 38 week follow-up, the groups seem to be converging.
Usually in trials, both placebo and active treatment groups tend to get better over time - which is exactly why we need a placebo arm. In this case, it looks like neither group got much better and the placebo group may actually have got worse. In any case, there were no statistically significant differences between the groups.
I understand the results for working memory even less. See if you can work out what’s going on in the figure below:
From what I can see, there doesn’t look like there’s much between the groups at the end of active treatment (week 12). However, by 38 week follow-up they have diverged widely, with the female raloxifene group doing much better and the male raloxifene group doing worst. I am surprised by the trajectory of the female placebo group (red dashed line with intermittent circles) who showed most improvement at the end of 12 weeks but then had the steepest decline at 38 week follow-up.
The authors interpret this as raloxifene having a positive effect on working memory but only in female patients, though I’m not sure how raloxifene would exert this effect long after the treatment phase had ended.
I could tie myself in knots trying to work out these results but, in essence, I think we can consider this trial as negative - in the whole sample, raloxifene did not beat placebo on the pre-specified outcomes. Let’s now think about why the study was negative and whether we should regard it as a ‘true negative’ or ‘false negative’.
Disclaimer
Before we go on, it is worth taking a moment to consider that conducting a well-designed clinical trial is a big achievement in itself. I have used words like failed and negative but this only relates to the study drug, not the study researchers - they were successful in completing this massive task! Even when trials are negative, they can provide useful information that will help develop future treatments.
It is incredibly important that academic groups like the one in Utrecht conduct trials; it is how we will find new treatments for severe mental illness. Raloxifene is now ‘off-patent’, meaning that no one pharmaceutical company has exclusive rights to sell it. After a new drug is brought to market, there is a time-window in which it is expensive and is sold by the company that developed it. Following this, it becomes ‘generic’, meaning that any company can reproduce it, for a much lower cost.
With pharma companies making the majority of their profits in the first few years after a drug is approved, there is little incentive for them to invest in research for drugs that have become generic, like raloxifene. You will see that this trial was funded by Dutch government agencies.
What this means is that researchers do not have the vast resources of a multinational company that stands to profit from a successful result. Instead, they are working with limited governmental resources, in which every penny counts and is accountable.
They screened almost 500 patients, and assessed about half of those for eligibility to get a final sample of just over 100. Consider that trials of medications are more difficult to recruit to than the average study and you will appreciate what effort was needed.
All trials outside of big pharma are done on the equivalent of a shoestring - so let’s keep that in mind. That being said, in the spirit of critical thinking I will go through a couple of potential reasons why this negative result might not be conclusive.
Sample size
There should be two words on your lips when deciding if a negative trial is truly negative - sample size. Was the sample size of the trial large enough to detect a difference between the drug and placebo? The authors describe their sample as large…but was it large enough? It is smaller than would be typical for a Phase III trial; those I wrote about in a previous post had three times as many participants per arm.
To determine how large a sample is needed to find a difference, we need a ‘power calculation’. Power is another one of those statistical terms, meaning the likelihood of detecting a true difference between drug and placebo. As sample size increases, so does the power to detect a true difference. In other words, increasing power decreases the likelihood of getting a Type II error (false negative). You might hear small studies being described as ‘under-powered’ - this is a major problem in various academic fields.
A power calculation is relatively simple to do - you plug in the effect size you need to detect, the statistical design you will use and, hey presto, it gives you the sample size needed. Not reporting a power calculation for a trial is an instant red flag.
What does this study say about sample size?
Sample size calculation was based on disjunctive power (or minimal power), which is the probability of finding at least one true intervention effect across all of the outcomes. To be able to find an effect size of 0.57, with two sided alpha set at 5% and with 80% power, 50 participants in each group have to be evaluated. Since we expect some 10% dropout, a total of 55 participants per arm are needed, resulting in a total of 110 participants.
OK, so they did a power calculation: we can relax, right? Not quite - let’s go through it in detail. Firstly, a two-sided alpha of 5% equates to setting the significance level at p<0.05, which is very conventional - it refers to the probability of accepting a false positive. Likewise, this level of power is pretty standard, meaning the trial has an 80% probability of detecting a true effect of the drug, should this exist. The key number, though, is the effect size they are able to detect, 0.57. If you know anything about effect sizes (see this recent Scott Alexander post) this should raise an eyebrow.
An effect size of 0.57 is arbitrarily categorised as medium. But the effect size for antidepressants is 0.3 and for antipsychotics is 0.5. Do we really think raloxifene will have a bigger effect size in schizophrenia than antipsychotics? What’s more, the effect size is for adjunctive raloxifene - so for patients who have already been treated with antipsychotics (and who, we know from their baseline PANSS scores, are currently only mildly affected by their symptoms).
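To get a feel for how much the target effect size matters, here is a rough sketch of a power calculation in Python. It uses the textbook normal approximation for comparing two group means, which is my assumption and not the disjunctive-power method the authors describe; the function name n_per_arm is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm for a two-arm trial.

    Normal approximation for a two-sided comparison of two means;
    effect_size is a standardised mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Effect sizes mentioned in the text: the trial's target, antipsychotics,
# and antidepressants.
for d in (0.57, 0.5, 0.3):
    print(f"d = {d}: about {n_per_arm(d)} participants per arm")
```

Under this approximation, an effect size of 0.57 needs about 49 participants per arm, close to the 50 the authors report. Dropping the target to 0.5 pushes that to about 63 per arm, and a target of 0.3 needs about 175 per arm.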
This effect size wasn’t pulled out of thin air; it was taken from a meta-analysis of raloxifene, incidentally by the same team. The main results are in a forest plot here:
The bottom right square is a synthesis of eight studies measuring change in PANSS total scores, which gives the overall effect size of 0.57. What is not shown in that figure is the sample size of each contributing study. All bar one had a total sample size of less than 60. This is a problem because, as well as increasing the possibility of a false negative, small under-powered studies lead to artificially inflated effect sizes.
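The inflation of effect sizes in small studies can be illustrated with a toy simulation. This is a sketch under my own assumptions, not a re-analysis of the meta-analysis: I assume a true effect size of 0.3, 25 participants per arm, normally distributed outcomes, and an approximate significance test. The function name is hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_significant_effects(true_d, n_per_arm, n_trials=2000, seed=1):
    """Simulate small two-arm trials and return the observed effect sizes
    of only those trials that reached (approximate) significance."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_trials):
        placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        drug = [rng.gauss(true_d, 1.0) for _ in range(n_per_arm)]
        pooled_sd = sqrt((stdev(placebo) ** 2 + stdev(drug) ** 2) / 2)
        d_obs = (mean(drug) - mean(placebo)) / pooled_sd
        # Approximate two-sided test at p < 0.05 for two equal-sized arms
        # (normal critical value used in place of the exact t value).
        if abs(d_obs) * sqrt(n_per_arm / 2) > 1.96:
            significant.append(d_obs)
    return significant


# Assumed values: true effect 0.3, 25 per arm (total sample of 50).
sig = simulate_significant_effects(true_d=0.3, n_per_arm=25)
print(f"{len(sig)} of 2000 simulated trials were significant")
print(f"mean observed effect among them: {mean(sig):.2f} (true effect: 0.3)")
```

With so few participants, a trial can only cross the significance threshold if its observed effect happens to be large, so the significant trials overstate the true effect on average. If positive results are more likely to be published, that overestimate can carry over into a meta-analysis.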
There was one well-powered study, with a sample size of 200, by Mark Weiser and colleagues. Guess what? Not only was it negative, but the participants in the placebo group actually did better than those in the raloxifene group. The population of that study was slightly different from others as it included all female, postmenopausal, severely ill patients. Intuitively though, truly effective treatments for an illness tend to have a bigger effect in more severe than milder cases.
To sum up, the study was adequately powered to detect a medium effect of raloxifene, but any true effect is likely to be substantially smaller. The recruitment by Brand et al. was heroic, but was probably still not enough to rule out a small-medium effect of raloxifene - a false negative is still possible.
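To put rough numbers on that, the sketch below estimates the power of a trial with 50 participants per arm for a few effect sizes. It uses the normal approximation for a two-sided comparison of two means and ignores the negligible chance of a significant result in the wrong direction. The function name and the smaller effect sizes of 0.3 and 0.2 are my assumptions, not figures from the paper.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, participants_per_arm, alpha=0.05):
    """Approximate power of a two-arm trial to detect a standardised
    mean difference (Cohen's d), using the normal approximation."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the alternative, minus the threshold.
    return z.cdf(effect_size * sqrt(participants_per_arm / 2) - z_crit)


for d in (0.57, 0.3, 0.2):
    print(f"d = {d}: power of about {approx_power(d, 50):.0%} with 50 per arm")
```

On these assumptions, power is about 81% for an effect size of 0.57, but only about 32% for 0.3 and about 17% for 0.2. In other words, if the true effect were small, a trial of this size would miss it more often than not.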
Study population
The second issue with the trial is the gender mix of patients recruited. The authors highlight this themselves in an unusually honest admission:
A limitation of this study is the relatively small proportion of women in our sample. We expected the use of estrogen-like medication to be more appealing to women than to men and therefore did not make an extra effort to recruit them. In hindsight, we should have made this extra effort. Consequently, the evidence regarding raloxifene’s effect on symptoms in premenopausal women remains limited and requires replication.
Prior to this trial, most trials of raloxifene had been conducted with female participants only. There is good reason to suspect that oestrogen-based medications work differently in females than in males - indeed the authors found divergent effects when they analysed the sample by sex, even if the number of females (n=14 per group) is too small to draw any real conclusions.
Given raloxifene exerts its action via oestrogen receptors, I think it is reasonable to extrapolate from other trials of oestradiol in schizophrenia. Again, the largest is by Mark Weiser’s team. A trial of adjunctive oestradiol patches in 200 female patients with schizophrenia showed a small benefit over placebo. Crucially though, this effect was only seen in those aged over 38 years; in younger women it had no effect whatsoever.
Taken together, I think future studies of oestrogen-based medications would do well to focus on female participants, ideally those approaching the age of perimenopause. Clinically, this also makes sense: it is well recognised that schizophrenia worsens in women around this time.
Is the oestrogen hypothesis wrong?
In a word, no. Despite not agreeing with experiment on this occasion, it is too soon to throw out the entire oestrogen hypothesis - sorry Feynman. Like all good negative trials, Brand’s study has given us new information that will inform future research and modify our understanding of oestrogen in schizophrenia.
My beliefs have shifted from regarding oestrogen-based treatments as potentially beneficial for any patient with schizophrenia, towards selecting participants based on age and sex.
Raloxifene may have failed this time, but with a female-only, perimenopausal sample, there may yet be hope. Another one to watch out for is Weiser’s new trial of oestradiol patches as adjunctive treatment for schizophrenia in women aged 38-48 - the study started in 2019, so the next test of the oestrogen hypothesis may be revealed soon.
I also think nosology is an obstacle here. For years clinicians have noted a subtype of paranoid psychotic disorders in perimenopausal females, call it paraphrenia, involutional psychosis etc, without prominent disorganization and with good response to low dose AP. It was always this population rather than the SCZ population writ large, even including females with more typical onset, in which estrogens stood a good chance of being useful.
I don’t think completely different. The usage of terms hasn’t been stable.