Why AI Is Running Out of Training Data

The last decade of AI progress ran on one quiet assumption: that the internet was an infinite well of training data. Scrape enough of it, and the models keep getting better. That assumption is now breaking, and it is breaking on a clock.

The training data wall is real, and it has a date

Epoch AI, presenting at ICML 2024, put a number on something the labs had been feeling for a while. The effective stock of high-quality public human text is roughly 300 trillion tokens. At the rate frontier labs are training, and overtraining, that stock runs out somewhere between 2026 and 2032. After that, there is no more internet to scrape.

Market forecasts point the same way: the market for training datasets is projected to grow from 3.2 billion dollars to 16.3 billion dollars across eight years. Data stops being a line item and becomes the strategy.

This is the part most people miss. The bottleneck was never compute or model architecture. Those keep improving. The bottleneck is the supply of fresh, real, human signal, and that supply is finite.
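For a sense of scale, the market projection above can be turned into an annual growth rate. This is a minimal sketch that assumes smooth compound growth across the eight years. That assumption and the function name are ours, not part of the forecast.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value in `years` years.

    Assumes growth compounds evenly every year, which real markets rarely do.
    """
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article, in billions of dollars.
START_BILLION = 3.2
END_BILLION = 16.3
YEARS = 8

rate = implied_annual_growth(START_BILLION, END_BILLION, YEARS)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the quoted figures imply growth of roughly 22 to 23 percent a year, sustained for eight years.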

Synthetic data does not save you

The obvious objection is: just generate more. Let the models produce their own training data. The research says this fails, and it fails in a specific, mathematical way.

In July 2024, Nature published the model collapse result. Train a model recursively on data generated by previous models, and it degrades with probability one. The rare events go first, so diversity collapses, and then the whole distribution converges toward a generic, repetitive mean. It is not a tuning problem you can engineer around. It is a property of the process.

The same body of work tells you the antidote and its price. In some experiments, keeping even 10 percent real human data in the pipeline is enough to stabilize a model against collapse. Stricter analyses find the ratio brutally asymmetric: for a million real samples you can safely add only a few dozen synthetic ones. Either way, synthetic data is a multiplier on real data, never a replacement for it.

Work out of Carnegie Mellon and Microsoft at FAccT 2025 adds the qualitative version: synthetic data manufactures false diversity and quietly bakes in stereotypes, because it can only resample what the model already knows.
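The recursive loop described above is easy to reproduce in miniature. The sketch below is a toy illustration, not the Nature experiments: the "model" is just a Gaussian refit to its own samples each generation, the "real" data is drawn from a standard normal, and the function name, sample sizes, and seed are all our assumptions. The 10 percent real fraction is borrowed from the figure quoted above purely as a parameter.

```python
import random
import statistics


def simulate_recursive_training(generations, samples_per_generation, real_fraction, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy setup (an assumption for illustration): each generation fits a
    Gaussian to a training set made of samples from the previous fit
    ("synthetic") plus fresh draws from N(0, 1) ("real").
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(samples_per_generation * real_fraction)
    n_synthetic = samples_per_generation - n_real
    history = []
    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(n_synthetic)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        training_set = synthetic + real
        # "Training" here is just re-estimating the two Gaussian parameters.
        mean = statistics.fmean(training_set)
        std = statistics.stdev(training_set)
        history.append(std)
    return history


if __name__ == "__main__":
    pure = simulate_recursive_training(1000, 20, real_fraction=0.0)
    mixed = simulate_recursive_training(1000, 20, real_fraction=0.1)
    print(f"synthetic only: final std = {pure[-1]:.2e}")
    print(f"10% real data:  final std = {mixed[-1]:.2e}")
```

In this toy setting the synthetic-only run loses its spread over the generations, while the run that keeps a slice of real data stays anchored near the original distribution. It illustrates the mechanism only and should not be read as evidence for any particular safe ratio in a real training pipeline.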

The web you would scrape is already contaminated

There is a second problem stacked on top of scarcity. The open web is no longer clean. By 2025, an estimated 74 percent of new web pages contained AI-generated content. So the naive plan, scrape harder, now means training on the output of other models, which is exactly the input that triggers collapse. You cannot tell, at scale, which sentence was written by a person and which was generated. Provenance stopped being a nice-to-have. It is the difference between training data and noise.

The data that still matters is human, real, and not online

Here is the conclusion every strand of this research converges on. The data that will actually move models from here is fresh human data, captured with known provenance, and most of it does not exist on the internet at all. It lives in the physical world.

This is clearest in multimodal data. No language model can generate a true photograph of a street it has never seen, a real recording of a specific accent in a specific room, or a video of a hand performing a real task. Image and video already make up around 42 percent of the training data market, and that is the part synthetic generation cannot fake, because there is no ground truth to resample from. For multimodal AI training data, real human capture is not one option among many. It is the only option.

There is a nuance worth being precise about. AI feedback can replace human feedback for alignment: DeepMind showed at ICML 2024 that RLAIF (reinforcement learning from AI feedback) matches RLHF (reinforcement learning from human feedback) for preference tuning. But that is feedback, not raw data. Pretraining and fine-tuning still need human-collected signal, in volume, and increasingly with documented consent.

What Glint AI is building

Glint AI is building the real-world data layer for frontier models. Instead of scraping a contaminated and shrinking web, Glint collects on demand from verified human contributors: photos, video, and audio, captured for a specific purpose and tied to a specific request. Two properties are built in rather than bolted on.

The first is provenance. Every asset is linked to the contributor who produced it, with explicit consent and complete metadata, under European data protection rules. That is the part a scraped corpus can never reconstruct after the fact.

The second is targeting. Because collection happens on demand, it can be aimed at the exact demographics, geographies, languages, and contexts a model is weak in, which is precisely where the long tail of diversity disappears fastest.

Contributors are paid for what they capture. That is the mechanism that makes the supply real and renewable instead of a one-time scrape. It turns training data from a stock that depletes into a flow that refreshes.

The timing

Put the pieces together. Public data runs out between 2026 and 2032. Synthetic data cannot fill the gap without collapsing the model. The web is contaminated, so provenance becomes mandatory. And the highest-value data, multimodal and physical, was never online to begin with. The data budgets of every serious lab are about to move from the margin to the center of the roadmap.

That is the layer Glint is building, on European ground, with consent and provenance from day one. If you are a frontier lab or an applied AI team that will hit the training data wall before most, that is exactly the conversation we want to have.
