An interactive walkthrough of two-stage least squares (2SLS). Understand the intuition, enter your own data, and see the estimation in action.
Suppose you want to know whether X causes changes in Y. The usual approach — ordinary least squares (OLS) — just draws the best-fit line through your scatter plot. But OLS quietly assumes that X is the only systematic thing driving Y, and that nothing lurking in the background is simultaneously pushing both X and Y around.
When that assumption fails, we say X is endogenous. The classic culprits are omitted variables (something you didn't measure affects both X and Y), reverse causality (Y also influences X), or measurement error in X. In any of these cases, the OLS slope is biased — it doesn't tell you the true causal effect, and no amount of additional data will fix it.
Instrumental variables offer an escape. The idea is to find a third variable Z — the "instrument" — that nudges X around but has no direct connection to Y except through X. By isolating the variation in X that comes only from Z, we can estimate the causal effect of X on Y without the contamination that plagues OLS.
The 2SLS procedure works in two steps. In the first stage, we regress X on the instrument Z to get fitted values X̂ — these capture only the "clean" variation in X that comes from Z. In the second stage, we regress Y on those fitted values X̂ instead of on the original (contaminated) X. The resulting slope estimate isolates the causal effect.
Recall that ordinary least squares estimates the slope by projecting Y onto X directly:
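β̂_OLS = (XᵀX)⁻¹XᵀY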
The problem is clear: OLS trusts all the variation in X, but some of that variation is polluted by unobserved confounders. Two-stage least squares fixes this by replacing X with the "cleaned" version X̂ that only contains variation coming from the instrument Z. This happens in two explicit stages:
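First stage: regress X on Z and keep the fitted values, X̂ = Z(ZᵀZ)⁻¹ZᵀX.

Second stage: regress Y on X̂, which gives β̂_2SLS = (X̂ᵀX̂)⁻¹X̂ᵀY.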
Notice the parallel: OLS is (XᵀX)⁻¹XᵀY, and 2SLS is (X̂ᵀX̂)⁻¹X̂ᵀY. The structure is identical — the only difference is that 2SLS swaps in the instrument-predicted X̂ wherever OLS uses the raw X. That one substitution is the entire method.
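To make the substitution concrete, here is a minimal NumPy sketch of both closed forms (illustrative only, not the demo's own implementation; the function names are placeholders):

```python
import numpy as np

def ols_beta(X, Y):
    """OLS: (X'X)^{-1} X'Y, where X is the design matrix (intercept column included)."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def tsls_beta(Z, X, Y):
    """2SLS: project X onto the instrument matrix Z, then apply the OLS formula to the fitted values."""
    # First stage: X_hat = Z (Z'Z)^{-1} Z'X
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Second stage: (X_hat' X_hat)^{-1} X_hat' Y, i.e. the OLS formula with X_hat in place of X
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)

# Usage with one instrument, one endogenous regressor, and an intercept
# (z, x, y are equal-length 1-D arrays):
#   X = np.column_stack([np.ones(len(x)), x])
#   Z = np.column_stack([np.ones(len(z)), z])
#   ols_beta(X, y)      # [intercept, OLS slope]
#   tsls_beta(Z, X, y)  # [intercept, 2SLS slope]
```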
In the simple case with one instrument and one endogenous variable, the two-stage algebra collapses into something even simpler — the Wald estimator:
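β̂_Wald = Cov(Z, Y) / Cov(Z, X) = (slope of Y on Z) / (slope of X on Z)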
This ratio form makes the logic of IV transparent: you don't need to observe the causal effect directly. You just need to know how much the instrument moves the treatment and how much it moves the outcome, and divide one by the other.
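As a quick numerical sanity check, the same ratio can be computed directly from sample covariances; with a single instrument it agrees with the two-stage slope from the sketch above:

```python
import numpy as np

def wald_estimator(z, x, y):
    """Just-identified IV slope: reduced-form covariance over first-stage covariance."""
    z, x, y = (np.asarray(a, dtype=float) for a in (z, x, y))
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
```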
Each row is one observation. Z = instrument, X = endogenous variable, Y = outcome.
| Z (instrument) | X (endogenous) | Y (outcome) |
|---|---|---|
Paste comma- or tab-separated values, three columns per line (Z, X, Y). No header row.
This loads simulated state-level data inspired by a well-known application in health economics: estimating the causal effect of cigarette consumption (X, packs per capita per year) on per-capita healthcare expenditure (Y, in dollars). The instrument is the state cigarette excise tax (Z, dollars per pack).
The logic: higher taxes raise the price of cigarettes, which reduces consumption — but taxes themselves shouldn't directly cause higher healthcare spending through any channel other than smoking behaviour. This makes cigarette taxes a plausible instrument for smoking, though one could debate whether tax levels correlate with other state-level health policies (a potential exclusion-restriction concern).
The 40 observations below are simulated to reflect realistic magnitudes and the expected relationships. They are not real data.
IV estimation only works if the instrument satisfies a set of conditions. Two of these can be partially tested; one fundamentally cannot. Understanding these assumptions is more important than understanding the math — a technically perfect 2SLS estimate with an invalid instrument is worthless.
The instrument Z must actually move X around. If Z and X are only weakly correlated, the first-stage denominator (ZᵀX) is close to zero, and dividing by a near-zero number amplifies any small errors into enormous biases. This is the weak instrument problem.
You can check this by looking at the first-stage F-statistic. The classic Staiger & Stock (1997) rule of thumb says F should exceed 10. More recent work by Lee et al. (2022) argues this is far too lenient, suggesting thresholds on the order of 100 for reliable inference with conventional t-statistics. The demo below reports this F-statistic for your data.
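With a single instrument, the first-stage F is just the squared t-statistic on Z in the regression of X on Z. A minimal sketch, assuming homoskedastic errors (the helper name is illustrative):

```python
import numpy as np

def first_stage_f(z, x):
    """Homoskedastic first-stage F for a single instrument: test pi = 0 in x = a + pi*z + v."""
    z, x = np.asarray(z, dtype=float), np.asarray(x, dtype=float)
    n = len(x)
    Z = np.column_stack([np.ones(n), z])            # first-stage design matrix with intercept
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)    # [a_hat, pi_hat]
    resid = x - Z @ coef
    sigma2 = resid @ resid / (n - 2)                # residual variance
    var_pi = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]  # variance of pi_hat
    return coef[1] ** 2 / var_pi                    # with one instrument, F = t^2
```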
Z must affect Y only through X — never directly, and never through some back door. This is the hardest assumption to defend because it is fundamentally untestable with data alone. You cannot run a regression to verify it; it must be argued from theory and domain knowledge.
For example, if states with high cigarette taxes also tend to invest more in public health programs that directly reduce healthcare costs, then the exclusion restriction fails — the tax is affecting spending through a channel other than smoking. The researcher must convince the reader that, conditional on whatever controls are included, the instrument's only pathway to the outcome runs through the treatment.
The instrument must be independent of the unobserved confounders U. In the DAG above, there should be no dashed arrow from U to Z. This is conceptually similar to the exclusion restriction but focuses on the instrument's own "assignment" being as-good-as-random, at least conditional on observed covariates.
In practice, researchers often argue for independence through institutional features, natural experiments, or random assignment. You can do partial checks — for instance, verifying that Z is uncorrelated with observable pre-treatment characteristics — but full independence is ultimately an assumption.
Even when the assumptions hold, IV estimates come with important caveats that practitioners should keep in mind.
When the instrument is weak, 2SLS doesn't just become imprecise — it becomes systematically biased toward the OLS estimate. In finite samples, this bias can be substantial. With many weak instruments (the "many instruments" problem), 2SLS can actually be worse than OLS. If your first-stage F is below 10, treat the results with serious skepticism.
When treatment effects vary across individuals, IV does not estimate the average treatment effect (ATE) for the whole population. Instead, it estimates the Local Average Treatment Effect (LATE) — the effect specifically for "compliers," the subpopulation whose treatment status is actually shifted by the instrument. If cigarette taxes only change smoking behaviour for price-sensitive smokers (not heavily addicted ones), then the IV estimate reflects the healthcare impact of smoking for that marginal group, not for all smokers.
This demo uses one instrument for one endogenous variable (just-identified). In this case, there is no way to test the exclusion restriction with the data — you can't run a Hansen J or Sargan test because there are no "extra" moment conditions to check. With multiple instruments, overidentification tests become available, but they test whether instruments agree with each other, not whether any individual instrument is valid.
The standard errors reported below assume homoskedastic errors (constant variance). In practice, heteroskedasticity-robust or clustered standard errors are usually preferred. The estimates themselves (the coefficients) are unaffected, but the confidence intervals and significance tests may be misleading if heteroskedasticity is present.
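For reference, here is a sketch of both variance formulas for the 2SLS coefficients, classical and HC0-robust sandwich. It illustrates the algebra only; it is not the demo's implementation, and the function name is a placeholder:

```python
import numpy as np

def tsls_se(Z, X, Y, robust=False):
    """2SLS coefficients with classical (homoskedastic) or HC0-robust standard errors."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # first-stage fitted values
    A = X_hat.T @ X_hat
    beta = np.linalg.solve(A, X_hat.T @ Y)
    u = Y - X @ beta                                  # residuals use the original X, not X_hat
    n, k = X.shape
    Ainv = np.linalg.inv(A)
    if robust:
        meat = X_hat.T @ (X_hat * (u ** 2)[:, None])  # sum_i u_i^2 * xhat_i xhat_i'
        V = Ainv @ meat @ Ainv                        # HC0 sandwich
    else:
        sigma2 = u @ u / (n - k)                      # classical residual variance
        V = sigma2 * Ainv
    return beta, np.sqrt(np.diag(V))
```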
First-stage F-statistic — the single most important diagnostic. Reported in the results panel. If it's low, the instrument is weak and your IV estimates are unreliable. Consider finding a stronger instrument or using weak-instrument-robust methods (Anderson-Rubin test, conditional likelihood ratio test).
Hausman test — compares the OLS and IV estimates. If they're statistically different, it suggests OLS is inconsistent (i.e., endogeneity is present and IV is needed). If they're similar, either there is no endogeneity, or both estimators are biased in the same direction. The demo computes a simplified version of this test; a regression-based variant is sketched below.
Reduced-form regression — regress Y directly on Z. If Z has no detectable effect on Y, then either X has no causal effect on Y, or the instrument is too weak to detect it. A significant reduced form is a necessary (but not sufficient) condition for a meaningful IV estimate.
Placebo and balance checks — verify that Z is uncorrelated with pre-treatment covariates or outcomes it shouldn't predict. These can't prove validity, but failures are red flags.
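The regression-based (control-function) Durbin-Wu-Hausman version mentioned above can be sketched as follows: add the first-stage residual to the structural equation and test its coefficient. This may differ in detail from what the demo computes.

```python
import numpy as np

def dwh_tstat(z, x, y):
    """Control-function Durbin-Wu-Hausman test: t-statistic on the first-stage residual."""
    z, x, y = (np.asarray(a, dtype=float) for a in (z, x, y))
    n = len(y)
    Z = np.column_stack([np.ones(n), z])
    v_hat = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # first-stage residual
    W = np.column_stack([np.ones(n), x, v_hat])           # structural equation augmented with v_hat
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    resid = y - W @ coef
    sigma2 = resid @ resid / (n - W.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(W.T @ W)[2, 2])
    return coef[2] / se   # large |t| suggests X is endogenous and OLS is inconsistent
```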