Instrumental Variables Regression

An interactive walkthrough of two-stage least squares (2SLS). Understand the intuition, enter your own data, and see the estimation in action.

Why ordinary regression can mislead you

Suppose you want to know whether X causes changes in Y. The usual approach — ordinary least squares (OLS) — just draws the best-fit line through your scatter plot. But OLS quietly assumes that X is the only systematic thing driving Y, and that nothing lurking in the background is simultaneously pushing both X and Y around.

When that assumption fails, we say X is endogenous. The classic culprits are omitted variables (something you didn't measure affects both X and Y), reverse causality (Y also influences X), or measurement error in X. In any of these cases, the OLS slope is biased — it doesn't tell you the true causal effect, and no amount of additional data will fix it.

Instrumental variables offer an escape. The idea is to find a third variable Z — the "instrument" — that nudges X around but has no direct connection to Y except through X. By isolating the variation in X that comes only from Z, we can estimate the causal effect of X on Y without the contamination that plagues OLS.

[Causal diagram: Z (instrument) → X (endogenous) → Y (outcome), with unobserved confounders U pushing on both X and Y. The Z → X arrow is the relevance condition; Z affects Y only through X, with no direct path from Z to Y.]

Two-stage least squares, explained

The 2SLS procedure works in two steps. In the first stage, we regress X on the instrument Z to get fitted values X̂ — these capture only the "clean" variation in X that comes from Z. In the second stage, we regress Y on those fitted values X̂ instead of on the original (contaminated) X. The resulting slope estimate isolates the causal effect.

Recall that ordinary least squares estimates the slope by projecting Y onto X directly:

Ordinary Least Squares (for comparison)
β̂OLS = (XᵀX)⁻¹XᵀY
(XᵀX)⁻¹: scales by the total variation in X. OLS uses all variation in X — including the contaminated part correlated with unobservables.
XᵀY: the raw covariation between X and Y. Because X is endogenous, this captures both the causal effect and the bias from confounders.
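
Here is a minimal NumPy sketch of that formula; the helper name ols_slope is illustrative, and x and y are assumed to be one-dimensional arrays of equal length.

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope of y on x via the normal equations (X'X)^(-1) X'y."""
    X = np.column_stack([np.ones(len(x)), x])   # add an intercept column
    # solve the normal equations rather than forming an explicit inverse
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta[1]                              # beta[0] is the intercept
```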

The problem is clear: OLS trusts all the variation in X, but some of that variation is polluted by unobserved confounders. Two-stage least squares fixes this by replacing X with the "cleaned" version X̂ that only contains variation coming from the instrument Z. This happens in two explicit stages:

Stage 1 — Isolate the clean variation
X̂ = Z(ZᵀZ)⁻¹ZᵀX
Z: the instrument — a variable that shifts X around but has no direct effect on Y.
(ZᵀZ)⁻¹ZᵀX: this is just an OLS regression of X on Z. It gives us the coefficient π₁ — how much a one-unit change in Z predicts a change in X.
X̂: the fitted values from regressing X on Z. These capture only the variation in X that comes from Z, discarding the endogenous contamination. This is the key insight of 2SLS.
Stage 2 — Estimate the causal effect using cleaned X
β̂2SLS = (X̂ᵀX̂)⁻¹X̂ᵀY
X̂ᵀX̂: the variation in the predicted (clean) X. Compare with OLS, which uses XᵀX — total variation including contamination.
X̂ᵀY: how much the clean variation in X covaries with Y. Since X̂ is free of confounders (by construction), this covariation reflects the causal relationship.
β̂2SLS: the two-stage least squares estimate. This has the same form as OLS — (somethingᵀ·something)⁻¹(somethingᵀ·Y) — but with X̂ substituted for X. That substitution is what removes the bias.

Notice the parallel: OLS is (XᵀX)⁻¹XᵀY, and 2SLS is (X̂ᵀX̂)⁻¹X̂ᵀY. The structure is identical — the only difference is that 2SLS swaps in the instrument-predicted X̂ wherever OLS uses the raw X. That one substitution is the entire method.
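
The same parallel shows up directly in code. Below is a minimal NumPy sketch of the two stages; the helper name tsls_slope and its inputs are assumptions for illustration, not the demo's actual implementation.

```python
import numpy as np

def tsls_slope(z, x, y):
    """Two-stage least squares with one instrument and an intercept."""
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    # Stage 1: regress X on Z and keep the fitted values X-hat
    x_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)
    # Stage 2: regress Y on X-hat exactly as OLS would regress Y on X
    Xh = np.column_stack([np.ones(n), x_hat])
    beta = np.linalg.solve(Xh.T @ Xh, Xh.T @ y)
    return beta[1]
```

One caveat: running Stage 2 by hand like this yields the correct coefficient but incorrect standard errors, because the residuals must be formed with the original X rather than X̂. Packaged 2SLS routines handle the two stages jointly for exactly this reason.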

In the simple case with one instrument and one endogenous variable, the two-stage algebra collapses into something even simpler — the Wald estimator:

Wald Estimator (simplified 2SLS with one instrument)
β̂ = (ZᵀY) / (ZᵀX)
β̂: the IV estimate of the causal effect of X on Y — the same quantity as β̂2SLS above, just arrived at through a simpler formula.
ZᵀY: the reduced-form effect — how much the instrument Z predicts the outcome Y overall. This captures the total downstream impact of Z on Y, which (if the instrument is valid) flows entirely through X.
ZᵀX: the first-stage effect — how strongly the instrument Z moves the endogenous variable X. This is what "relevance" means in practice. If this is close to zero, the instrument is weak and the estimate becomes unreliable.
(ZᵀY) / (ZᵀX): the ratio rescales the total effect into a per-unit-of-X effect. If a $1 tax hike reduces healthcare spending by $600 (ZᵀY) and reduces cigarette consumption by 4 packs (ZᵀX), then the causal cost of one additional pack is 600 / 4 = $150.

This ratio form makes the logic of IV transparent: you don't need to observe the causal effect directly. You just need to know how much the instrument moves the treatment and how much it moves the outcome, and divide one by the other.
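
In code, this is a one-line ratio. A sketch under the same assumptions as the earlier snippets, where demeaning the arrays stands in for including an intercept:

```python
import numpy as np

def wald_estimator(z, x, y):
    """IV slope as reduced form over first stage: Cov(Z,Y) / Cov(Z,X)."""
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    return (zc @ yc) / (zc @ xc)
```

With a single instrument, this returns exactly the same number as the two-stage computation above.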

Enter observations

Each row is one observation. Z = instrument, X = endogenous variable, Y = outcome.


Paste comma- or tab-separated values, three columns per line (Z, X, Y). No header row.

The example dataset is simulated state-level data inspired by a well-known application in health economics: estimating the causal effect of cigarette consumption (X, packs per capita per year) on per-capita healthcare expenditure (Y, in dollars). The instrument is the state cigarette excise tax (Z, dollars per pack).

The logic: higher taxes raise the price of cigarettes, which reduces consumption — but taxes themselves shouldn't directly cause higher healthcare spending through any channel other than smoking behaviour. This makes cigarette taxes a plausible instrument for smoking, though one could debate whether tax levels correlate with other state-level health policies (a potential exclusion-restriction concern).

The 40 observations below are simulated to reflect realistic magnitudes and the expected relationships. They are not real data.

Estimation output

2SLS vs OLS

Enter data above and press "Run IV Regression" to see results here.

What makes an instrument valid?

IV estimation only works if the instrument satisfies a set of conditions. One is directly testable, one is partially testable, and one fundamentally cannot be tested. Understanding these assumptions is more important than understanding the math — a technically perfect 2SLS estimate with an invalid instrument is worthless.

1. Relevance (testable)

The instrument Z must actually move X around. If Z and X are only weakly correlated, the first-stage denominator (ZᵀX) is close to zero, and dividing by a near-zero number amplifies any small errors into enormous biases. This is the weak instrument problem.

You can check this by looking at the first-stage F-statistic. The classic Staiger & Stock (1997) rule of thumb says F should exceed 10. More recent work by Lee et al. (2022) argues that conventional t-based inference requires a much stricter threshold, with a first-stage F above 100. The demo below reports this F-statistic for your data.
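
To reproduce that diagnostic outside the demo, a minimal sketch follows. The helper name first_stage_F is hypothetical; with a single instrument, the F-statistic is simply the squared t-statistic on the first-stage slope.

```python
import numpy as np

def first_stage_F(z, x):
    """F-statistic for H0: the instrument has no effect in x = a + pi*z + v."""
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ x)
    resid = x - Z @ pi_hat
    sigma2 = resid @ resid / (n - 2)              # homoskedastic error variance
    var_slope = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]
    t = pi_hat[1] / np.sqrt(var_slope)
    return t ** 2                                 # one instrument: F = t^2
```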

2. Exclusion restriction (untestable)

Z must affect Y only through X — never directly, and never through some back door. This is the hardest assumption to defend because it is fundamentally untestable with data alone. You cannot run a regression to verify it; it must be argued from theory and domain knowledge.

For example, if states with high cigarette taxes also tend to invest more in public health programs that directly reduce healthcare costs, then the exclusion restriction fails — the tax is affecting spending through a channel other than smoking. The researcher must convince the reader that, conditional on whatever controls are included, the instrument's only pathway to the outcome runs through the treatment.

3. Independence (partially testable)

The instrument must be independent of the unobserved confounders U. In the DAG above, there should be no dashed arrow from U to Z. This is conceptually similar to the exclusion restriction but focuses on the instrument's own "assignment" being as-good-as-random, at least conditional on observed covariates.

In practice, researchers often argue for independence through institutional features, natural experiments, or random assignment. You can do partial checks — for instance, verifying that Z is uncorrelated with observable pre-treatment characteristics — but full independence is ultimately an assumption.

What can go wrong

Even when the assumptions hold, IV estimates come with important caveats that practitioners should keep in mind.

Weak instruments and finite-sample bias

When the instrument is weak, 2SLS doesn't just become imprecise — it becomes systematically biased toward the OLS estimate. In finite samples, this bias can be substantial. With many weak instruments (the "many instruments" problem), 2SLS can actually be worse than OLS. If your first-stage F is below 10, treat the results with serious skepticism.
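
A small Monte Carlo makes this visible. Everything in the sketch below is invented for illustration (the data-generating process, the sample size, and the weak first-stage coefficient pi); medians are reported because the weak-instrument sampling distribution has heavy tails.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta_true, pi = 200, 2000, 1.0, 0.05    # pi = 0.05: very weak first stage

iv_draws, ols_draws = [], []
for _ in range(reps):
    u = rng.normal(size=n)                        # unobserved confounder
    z = rng.normal(size=n)                        # instrument
    x = pi * z + u + rng.normal(size=n)           # X moved by Z and by U
    y = beta_true * x + u + rng.normal(size=n)    # Y moved by X and by U
    iv_draws.append((z @ y) / (z @ x))            # Wald / IV estimate
    ols_draws.append((x @ y) / (x @ x))           # OLS estimate

print("true beta:   ", beta_true)
print("median OLS:  ", np.median(ols_draws))      # biased up by the confounder
print("median 2SLS: ", np.median(iv_draws))       # pulled toward the OLS value
```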

LATE, not ATE

When treatment effects vary across individuals, IV does not estimate the average treatment effect (ATE) for the whole population. Instead, it estimates the Local Average Treatment Effect (LATE) — the effect specifically for "compliers," the subpopulation whose treatment status is actually shifted by the instrument. If cigarette taxes only change smoking behaviour for price-sensitive smokers (not heavily addicted ones), then the IV estimate reflects the healthcare impact of smoking for that marginal group, not for all smokers.

Just-identification and overidentification

This demo uses one instrument for one endogenous variable (just-identified). In this case, there is no way to test the exclusion restriction with the data — you can't run a Hansen J or Sargan test because there are no "extra" moment conditions to check. With multiple instruments, overidentification tests become available, but they test whether instruments agree with each other, not whether any individual instrument is valid.

Standard error assumptions

The standard errors reported below assume homoskedastic errors (constant variance). In practice, heteroskedasticity-robust or clustered standard errors are usually preferred. The estimates themselves (the coefficients) are unaffected, but the confidence intervals and significance tests may be misleading if heteroskedasticity is present.
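
For reference, here is a sketch of both variance formulas for the just-identified slope: the homoskedastic version described above and an HC0 "sandwich" alternative. The helper name iv_se is hypothetical.

```python
import numpy as np

def iv_se(z, x, y, robust=False):
    """Standard error of the just-identified IV slope."""
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(Z.T @ X, Z.T @ y)      # (Z'X)^(-1) Z'y
    e = y - X @ beta                              # residuals use the raw X
    ZX_inv = np.linalg.inv(Z.T @ X)
    if robust:
        meat = (Z.T * e**2) @ Z                   # sum of e_i^2 * z_i z_i'
        V = ZX_inv @ meat @ ZX_inv.T
    else:
        sigma2 = e @ e / (n - 2)
        V = sigma2 * ZX_inv @ (Z.T @ Z) @ ZX_inv.T
    return np.sqrt(V[1, 1])                       # slope is the second entry
```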

Checking your work

First-stage F-statistic — the single most important diagnostic. Reported in the results panel. If it's low, the instrument is weak and your IV estimates are unreliable. Consider finding a stronger instrument or using weak-instrument-robust methods (Anderson-Rubin test, conditional likelihood ratio test).

Hausman test — compares the OLS and IV estimates. If they're statistically different, it suggests OLS is inconsistent (i.e., endogeneity is present and IV is needed). If they're similar, either there is no endogeneity, or both estimators are biased in the same direction. The demo computes a simplified version of this test; a sketch of that computation appears after this list.

Reduced-form regression — regress Y directly on Z. If Z has no detectable effect on Y, then either X has no causal effect on Y, or the instrument is too weak to detect it. A significant reduced form is a necessary (but not sufficient) condition for a meaningful IV estimate.

Placebo and balance checks — verify that Z is uncorrelated with pre-treatment covariates or outcomes it shouldn't predict. These can't prove validity, but failures are red flags.
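
Finally, a sketch of the simplified Hausman comparison mentioned above, assuming homoskedastic errors; compare the statistic against a chi-squared distribution with one degree of freedom. The helper name hausman_stat is illustrative.

```python
import numpy as np

def hausman_stat(z, x, y):
    """Simplified Durbin-Wu-Hausman statistic for one endogenous regressor."""
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    X = np.column_stack([np.ones(n), x])
    # OLS slope and its variance
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b_ols
    v_ols = (e @ e / (n - 2)) * np.linalg.inv(X.T @ X)[1, 1]
    # IV slope and its variance
    b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
    e = y - X @ b_iv
    ZX_inv = np.linalg.inv(Z.T @ X)
    v_iv = ((e @ e / (n - 2)) * ZX_inv @ (Z.T @ Z) @ ZX_inv.T)[1, 1]
    # Under exogeneity, Var(b_iv - b_ols) reduces to v_iv - v_ols
    # (this difference can turn negative in small samples, in which
    # case the simplified statistic is not meaningful)
    return (b_iv[1] - b_ols[1]) ** 2 / (v_iv - v_ols)
```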