Machine Learning · Spatial Validation · Energy

Where Solar Gets Built (and Why It Isn't About Sun)

Developers don't build solar farms where the sun is strongest. They build where the grid is reachable. This project models utility-scale solar siting across Texas, finds that grid access outweighs sunshine by roughly an order of magnitude, then transfers the model to North Carolina to show the rule is set by policy, not physics.

Python · scikit-learn · Random Forest · XGBoost · SHAP · Google Earth Engine · Spatial Block CV · USPVDB · HIFLD · NSRDB
Figure: Relative solar siting suitability across Texas from a Random Forest model.
Relative siting suitability across Texas. Warm corridors trace grid-accessible east and central Texas; grey areas fall outside the model's Area of Applicability; triangles are existing facilities.
01 The question

Most solar maps rank land by sunshine and slope. They describe physical potential — where panels would generate the most. But that is not where solar actually gets built.

This project models realized siting: using 176 operating utility-scale facilities, it asks what actually separates the land developers chose from the land they didn't. The answer turns out to be economic, not meteorological.

The question isn't "where is the best sun" — it's what actually drives siting decisions, and whether that rule holds in a state with completely different energy policy.

Figure: SHAP feature importance with replicate error bars.
Feature importance (mean |SHAP| over 10 replicates, error bars). Distance to transmission dominates; irradiance sits in a minor secondary cluster.
02 The headline finding
~9× · Grid access vs irradiance importance (permutation)
0.92 · Spatial cross-validated ROC-AUC
More solar in North Carolina, on worse sun

Across Texas's uniformly sunny range (4.5 to 5.87 kWh/m²/day), the marginal sunshine difference between one parcel and the next barely moves siting. Distance to transmission infrastructure dominates instead — about 9× more important than irradiance by permutation importance, 5× by SHAP.

Irradiance is not irrelevant: a Boruta shadow-feature test confirms it is a genuine predictor. It is simply a minor one. In a uniformly sunny state, what separates a built site from an unbuilt one is grid access, not sun.
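
The permutation comparison can be illustrated on synthetic data. The snippet below is a minimal sketch, not the project's pipeline: the two feature names, their distributions, and the coefficients of the simulated siting rule are assumptions chosen so that distance dominates, and the random train/test split is used only for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-ins (assumed distributions, not the project's data).
dist_transmission_km = rng.exponential(5.0, n)
ghi_kwh_m2_day = rng.uniform(4.5, 5.87, n)  # irradiance range quoted in the text

# Assumed siting rule: strong distance effect, weak irradiance effect.
logit = 2.0 - 0.8 * dist_transmission_km + 0.5 * (ghi_kwh_m2_day - 5.2)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X = np.column_stack([dist_transmission_km, ghi_kwh_m2_day])
names = ["dist_transmission_km", "ghi_kwh_m2_day"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in ROC-AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=0
)
importances = dict(zip(names, result.importances_mean))
for name, value in importances.items():
    print(f"{name}: mean AUC drop = {value:.3f}")
```

Because the metric is measured on held-out rows, it reflects how much the model's discrimination depends on each feature, independent of how many distinct values that feature takes.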

Feature                     SHAP importance
Distance to transmission    0.212
Distance to substation      0.100
Distance to road            0.046
Land cover                  0.045
Solar irradiance (GHI)      0.044
Elevation                   0.035
Slope                       0.017

Transmission ranks first in all 10 replicate draws; irradiance sits in a tangled middle cluster with land cover and roads.
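
The "5× by SHAP" figure and the tightness of the middle cluster follow directly from the table values. A short check, using only the numbers reported above:

```python
# Mean |SHAP| values copied from the table above.
shap_importance = {
    "Distance to transmission": 0.212,
    "Distance to substation": 0.100,
    "Distance to road": 0.046,
    "Land cover": 0.045,
    "Solar irradiance (GHI)": 0.044,
    "Elevation": 0.035,
    "Slope": 0.017,
}

# Ratio of the top feature to irradiance.
ratio = (
    shap_importance["Distance to transmission"]
    / shap_importance["Solar irradiance (GHI)"]
)
print(f"transmission / irradiance = {ratio:.1f}x")  # 4.8x

# Spread of the three mid-ranked features that trade places between replicates.
middle = [
    shap_importance[name]
    for name in ("Distance to road", "Land cover", "Solar irradiance (GHI)")
]
spread = max(middle) - min(middle)
print(f"spread of the middle cluster = {spread:.3f}")  # 0.002
```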

03 The transfer test

A finding from a single state describes only that state. So the Texas-trained model was applied, untouched, to North Carolina, a state with the opposite solar economy: many small plants driven by federal PURPA and state renewable-portfolio policy, on lower irradiance, versus Texas's few large merchant-scale plants.

North Carolina built roughly four times as much solar as Texas despite worse sun. The model transfers partially (ROC-AUC 0.76 ± 0.01): the siting logic generalizes in kind — grid access matters everywhere — but shifts in degree. Texas plants hug transmission at a median of 55 m; North Carolina's sit near 850 m, connecting to distribution rather than high-voltage transmission. Same shape, different threshold.
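
The "same shape, different threshold" effect can be reproduced in a toy setting. The sketch below is illustrative only: it assumes exponentially distributed distances, uses the two median distances from the text (55 m and 850 m) to set the presence distributions, and invents the background distribution. Its AUC values are not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def simulate_state(rng, presence_median_km, n_per_class=300, background_scale_km=5.0):
    """Synthetic presence/background distances to the nearest line, in km."""
    # An exponential distribution has median = scale * ln(2).
    presence_scale = presence_median_km / np.log(2)
    presence = rng.exponential(presence_scale, n_per_class)
    background = rng.exponential(background_scale_km, n_per_class)
    X = np.concatenate([presence, background]).reshape(-1, 1)
    y = np.concatenate(
        [np.ones(n_per_class, dtype=int), np.zeros(n_per_class, dtype=int)]
    )
    return X, y


rng = np.random.default_rng(0)
X_tx_train, y_tx_train = simulate_state(rng, presence_median_km=0.055)
X_tx_test, y_tx_test = simulate_state(rng, presence_median_km=0.055)
X_nc, y_nc = simulate_state(rng, presence_median_km=0.85)

# Train on the "Texas" regime only, then score both regimes without refitting.
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
model.fit(X_tx_train, y_tx_train)

auc_tx = roc_auc_score(y_tx_test, model.predict_proba(X_tx_test)[:, 1])
auc_nc = roc_auc_score(y_nc, model.predict_proba(X_nc)[:, 1])
print(f"within-state AUC: {auc_tx:.2f}, transfer AUC: {auc_nc:.2f}")
```

In this toy version the transferred model still ranks nearer sites higher, but its learned cut-off sits at the training regime's scale, so presences a few hundred metres farther out are scored like background.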

Figure: ROC curves for Texas within-state and North Carolina transfer. Within-Texas (0.92) vs North Carolina transfer (0.76).
Figure: Transmission distance histograms for Texas and North Carolina. Mechanism: TX hugs transmission, NC sits farther out.
Figure: The policy paradox: lower sun, more solar in North Carolina. Worse sun, more solar; policy over physics.

The full arc: grid access beats irradiance within a state; worse-sun North Carolina built more (policy); the model transfers partially — siting is universal in kind, regime-specific in degree. Comparable siting papers rarely test transfer at all.

04 Validation that earns the number

Siting features are spatially autocorrelated, and naive random cross-validation leaks that structure and inflates scores. Rather than report one optimistic number, this project reports a full validation ladder — performance degrades smoothly as the model is asked to generalize further, and the large drop appears only at the state-and-policy boundary, not within Texas.

Validation                   ROC-AUC   What it tests
Random CV                    ~0.93     Interpolation (optimistic)
Spatial block CV (130 km)    0.92      Spatial independence
Leave-one-region-out         0.91      Unseen Texas ecoregions
Transfer to North Carolina   0.76      Cross-state, cross-regime

Block size was set from the data's own 12 km residual autocorrelation range, not guessed. The leave-one-region-out test holds out entire EPA ecoregions in turn — and the model still discriminates at 0.91, confirming it generalizes across Texas geography rather than memorizing local quirks.
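
One common way to implement both spatial splits is to turn locations into group labels and let scikit-learn keep each group on one side of every split. The sketch below assumes projected coordinates in metres, square 130 km blocks, and made-up region codes; the project's actual blocking scheme may differ.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)
n = 500

# Hypothetical projected coordinates (metres) on a 1000 km x 1000 km extent.
x_m = rng.uniform(0, 1_000_000, n)
y_m = rng.uniform(0, 1_000_000, n)
features = rng.normal(size=(n, 3))   # placeholder predictors
labels = rng.integers(0, 2, n)       # placeholder presence/background labels

# Spatial block CV: snap each point to a 130 km grid cell and use the cell as its group.
BLOCK_SIZE_M = 130_000
col = np.floor(x_m / BLOCK_SIZE_M).astype(int)
row = np.floor(y_m / BLOCK_SIZE_M).astype(int)
block_id = row * (col.max() + 1) + col

block_folds = list(GroupKFold(n_splits=5).split(features, labels, groups=block_id))
# Number of blocks shared between train and test in each fold (should all be zero).
leaked = [np.intersect1d(block_id[tr], block_id[te]).size for tr, te in block_folds]
print("blocks shared between train and test, per fold:", leaked)

# Leave-one-region-out: hold out one whole region at a time.
# Four vertical strips stand in for ecoregion codes here.
region = np.digitize(x_m, [250_000, 500_000, 750_000])
region_folds = list(LeaveOneGroupOut().split(features, labels, groups=region))
print("held-out regions:", len(region_folds))
```

Each `(train, test)` index pair can then be passed to any estimator in place of a random split.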

The map's Area of Applicability (94.8% of Texas) was validated with local data-point density: 86% of in-domain cells rest on 10 or more supporting training points, not isolated near-duplicates.

05 Robustness: the hard questions, answered
  • "Distance-to-transmission is circular: operating farms are wired to the grid by definition." Bounded directly: with every grid feature removed and only interconnection-immune features left (irradiance, slope, elevation, land cover), the model still discriminates at a ROC-AUC of 0.72. About two-thirds of the signal survives, so grid access amplifies a real pre-existing siting signal rather than being an artifact.
  • "Irradiance only looks unimportant because it's coarsely resolved." The opposite is true: impurity importance is biased toward high-cardinality features, so coarse GHI was, if anything, penalized. Under the fair permutation metric it still ranks far below grid access, and Boruta confirms it is a real but minor predictor.
  • "The ranking could be one lucky run." Across 10 replicate draws, transmission is the top feature in all 10 and substation second in all 10.
  • "The transfer result could be a single-draw fluke." Stable at 0.76 ± 0.01 across 10 North Carolina pseudo-absence draws.
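
The circularity check in the first bullet is a feature ablation: refit without the grid features and see how much discrimination remains. A minimal sketch on synthetic data follows; the feature names, coefficients, and the plain (non-spatial) 5-fold split are assumptions made for brevity, and the resulting scores are not the project's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
feature_names = ["dist_transmission", "dist_substation", "ghi", "slope"]
grid_features = {"dist_transmission", "dist_substation"}

# Standardised synthetic features; the coefficients below are assumptions.
X = rng.normal(size=(n, len(feature_names)))
logit = -2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2] - 0.8 * X[:, 3]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)


def mean_auc(columns):
    """Mean cross-validated ROC-AUC using only the given feature columns."""
    model = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=10, random_state=0
    )
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="roc_auc").mean()


all_cols = list(range(len(feature_names)))
non_grid_cols = [
    i for i, name in enumerate(feature_names) if name not in grid_features
]

auc_full = mean_auc(all_cols)
auc_no_grid = mean_auc(non_grid_cols)
print(f"all features: {auc_full:.2f}, grid features removed: {auc_no_grid:.2f}")
```

If the ablated score stays well above 0.5, the remaining features carry siting signal of their own rather than borrowing it from the grid variables.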
06 Method & limitations

The problem is framed as presence-background (borrowed from species distribution modeling): verified facilities versus pseudo-absences drawn from developable land, with 10 balanced replicate draws averaged for stability. Each facility is summarized over its actual footprint, not a centroid point, so a 500 MW plant is represented by the land it occupies. Features span irradiance (NREL/NSRDB), slope and elevation (USGS 3DEP), land cover (NLCD), and distance to transmission, substations, and roads (HIFLD, TIGER). A developable-land mask excludes steep, wet, forested, and densely built terrain following NREL and peer-reviewed conventions.
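
The replicate scheme described above can be sketched as follows. Only the facility count (176) and the number of replicates (10) come from the text; the background pool size, feature count, and random features are placeholders, and a Random Forest is used here simply because it is one of the listed models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
N_FACILITIES = 176         # facility count from the text
N_REPLICATES = 10          # replicate draws from the text
n_background_pool = 5000   # hypothetical number of developable background cells
n_features = 4             # hypothetical feature count

# Synthetic stand-ins: presences are shifted so there is something to learn.
presence_X = rng.normal(loc=0.7, size=(N_FACILITIES, n_features))
background_pool_X = rng.normal(loc=0.0, size=(n_background_pool, n_features))
grid_X = rng.normal(loc=0.0, size=(1000, n_features))  # cells to score

y = np.concatenate(
    [np.ones(N_FACILITIES, dtype=int), np.zeros(N_FACILITIES, dtype=int)]
)

replicate_scores = []
for replicate in range(N_REPLICATES):
    # Balanced draw: as many pseudo-absences as presences, without replacement.
    idx = rng.choice(n_background_pool, size=N_FACILITIES, replace=False)
    X = np.vstack([presence_X, background_pool_X[idx]])
    model = RandomForestClassifier(n_estimators=100, random_state=replicate)
    model.fit(X, y)
    replicate_scores.append(model.predict_proba(grid_X)[:, 1])

scores = np.vstack(replicate_scores)
mean_suitability = scores.mean(axis=0)   # averaged over replicates
replicate_spread = scores.std(axis=0)    # per-cell disagreement between draws
print(mean_suitability[:5].round(2), replicate_spread[:5].round(2))
```

Averaging over draws reduces the dependence of the map on any one set of pseudo-absences, and the per-cell spread gives a rough view of where the draws disagree.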

  • Single training state and a modest sample (176 facilities). Texas is one regime; the North Carolina transfer probes generalization but does not replace multi-state training.
  • The "irradiance is minor" claim is Texas-specific — within a narrow, uniformly high irradiance range there is little gradient to track. It is not a claim that irradiance is physically irrelevant everywhere.
  • Suitability outputs are relative percentiles, not calibrated probabilities — presence-background models recover relative suitability and discrimination, not absolute build probability.
  • The panhandle (High Plains) is where the model extrapolates least well, reported transparently rather than smoothed over.
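
Reporting relative percentiles rather than probabilities amounts to a rank transform of the raw scores. A minimal sketch, with a hypothetical helper name and made-up scores:

```python
import numpy as np


def to_percentile(scores):
    """Map raw scores to 0-100 percentile ranks (ties broken by input order).

    Requires at least two scores.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(scores.size, dtype=float)
    ranks[order] = np.arange(scores.size)
    return 100.0 * ranks / (scores.size - 1)


raw = np.array([0.20, 0.90, 0.50, 0.55])  # made-up model outputs
pct = to_percentile(raw)
print(pct.round(1))
```

The transform is monotone, so the ordering of cells (and therefore ROC-AUC) is unchanged; only the claim to an absolute probability is dropped.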

Every methodological choice, with its rationale and citations, is recorded in a dated decisions log in the repository.

Open the code & decisions log

Reproducible Python pipeline, eight publication-quality figures, and a dated record of every methodological choice on GitHub.