The humble \(p\)-value is much maligned and terribly misunderstood. The problem is that everyone wants to know the answer to the question: “what is the probability that [hypothesis] is true?” But \(p\) answers a different (and not terribly useful) question: “how (un)surprising is this evidence given [hypothesis]?” Can \(p\) shed insight on the question we really care about? Maybe, though there are dangers.

Random assignment provides a justification not just for estimates of effects but also for estimates of uncertainty about effects. The basic approach, due to Neyman, is to estimate the variance in estimates of the difference between outcomes in treatment and in control outcomes using the variability that can be observed among units in control and units in treatment. It’s an ingenious approach and dispenses with the need to make any assumptions about the shape of statistical distributions or about asymptotics. The problem though is that it can sometimes be **upwardly biased**, meaning that it might lead you to maintain null hypotheses when you should be rejecting them. We use design diagnosis to get a handle on how great this problem is and how it matters for different estimands.

In many experiments, random assignment is performed at the level of clusters. Researchers are conscious that in such cases they cannot rely on the usual standard errors and they should take account of this feature by clustering their standard errors. Another, more subtle, risk in such designs is that if clusters are of different sizes, clustering can actually introduce bias, *even if all clusters are assigned to treatment with the same probability*. Luckily, there is a relatively simple fix that you can implement at the design stage.

We usually think that the bigger the study the better. And so huge studies often rightly garner great publicity. But the ability to generate more precise results also comes with a risk. If study designs are at risk of bias and readers (or publicists!) employ a statistical significance filter, then big data might not remove threats of bias and might actually make things worse.

Cluster-robust standard errors are known to behave badly with too few clusters. There is a great discussion of this issue by Berk Özler “Beware of studies with a small number of clusters” drawing on studies by Cameron, Gelbach, and Miller (2008). See also this nice post by Cyrus Samii and a recent treatment by Esarey and Menger (2018). A rule of thumb is to start worrying about sandwich estimators when the number of clusters goes below 40. But here we show that diagnosis of a canonical design suggests that some sandwich approaches fare quite well even with fewer than 10 clusters.

In many experiments, different groups of units get assigned to treatment with different probabilities. This can give rise to misleading results unless you properly take account of possible differences between the groups. How best to do this? The go-to approach is to “control” for groups by introducing “fixed-effects” in a regression set-up. The bad news is that this procedure is prone to bias. The good news is that there’s an even simpler and more intuitive approach that gets it right: estimate the difference-in-means within each group, then average over these group-level estimates weighting according to the size of the group. We’ll use design declaration to show the problem and to compare the performance of this and an array of other proposed solutions.

Most power calculators take a small number of inputs: sample size, effect size, and variance. Some also allow for number of blocks or cluster size as well as the overall sample size. All of these inputs relate to your data strategy. Unless you can control the effect size and the noise, you are left with sample size and data structure (blocks and clusters) as the only levers to play with to try to improve your power.

You can often improve the precision of your randomized controlled trial with blocking: first gather similar units together into groups, then run experiments inside each little group, then average results across experiments. Block random assignment (sometimes called stratified random assignment) can be great—increasing precision with blocking is like getting extra sample size for free. Blocking works because it’s like controlling for a pre-treatment covariate in the “Data Strategy” rather than in the “Answer Strategy.” But sometimes it does more harm than good.

A dangerous fact: it is quite possible to talk in a seemingly coherent way about strategies to answer a research question without ever properly specifying what the research question is. The risk is that you end up with the right solution to the wrong problem. The problem is particularly acute for studies where there are risks of “spillovers.”

Consider an observational study looking at the effect of a non-randomly assigned treatment, \(Z\), on an outcome \(Y\). Say you have a pretreatment covariate, \(X\), that is correlated with both \(Z\) and \(Y\). Should you control for \(X\) when you try to assess the effect of \(Z\) on \(Y\)?

Welcome to the `DeclareDesign`

blog! We have been working on developing the `DeclareDesign`

family of software packages to let researchers easily generate research designs and assess their properties. Our plan over the next six months is to put up weekly blog posts showing off features of the packages or highlighting the kinds of things you can learn about research design using this approach.