Portfolio Artifact 03 · Applied AI · LLM Training Pipeline · Dayo Egonu
Visual Artifact · AI/ML Portfolio · 2025

How Generative AI
LLMs Are Trained

A visual breakdown of the pipeline, infrastructure, costs, and tradeoffs behind the world's most powerful language models — from raw data to deployed product.

Data Collection → Deployment | GPT-4 · Claude · LLaMA · Gemini | Dayo Egonu · Applied AI

From Raw Data to Deployed Model

Every LLM follows a structured training pipeline. Each phase builds on the last — skipping or rushing any step degrades final model quality.

01
🌐

Data Collection & Curation

Massive corpora are assembled from web crawls (Common Crawl), books, code repositories, academic papers, and curated datasets. Quality filtering removes toxicity, duplication, and low-signal content. GPT-4 trained on an estimated 13 trillion tokens.

~Trillions of tokens
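The curation step above can be sketched in miniature: exact-match deduplication via hashing plus a crude quality heuristic. Real pipelines use fuzzy (MinHash) deduplication and learned classifiers; the thresholds below are illustrative assumptions, not production values.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so near-identical copies hash the same.
    return " ".join(text.lower().split())

def quality_ok(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    # Crude low-signal filter: drop very short or symbol-heavy documents.
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def dedup_and_filter(docs):
    # Keep each document the first time its normalized hash appears,
    # and only if it passes the quality heuristic.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen or not quality_ok(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```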
02
🔤

Tokenization & Preprocessing

Raw text is converted into numerical tokens using algorithms like Byte Pair Encoding (BPE). Special tokens are added for structure. Data is shuffled, batched, and formatted as input-output pairs ready for the transformer architecture.

BPE / WordPiece / SentencePiece
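The BPE algorithm named above is simple to state: start from characters and repeatedly merge the most frequent adjacent pair into a new token. A toy-scale training loop (production tokenizers operate on bytes with pre-tokenization rules, which this sketch omits):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent token pair and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair, new_token):
    # Replace every occurrence of the chosen pair with the merged token.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from characters; each merge grows the vocabulary by one entry.
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        tokens = merge_pair(tokens, pair, pair[0] + pair[1])
    return tokens, merges
```

On `"low low lower"`, two merges are enough to produce a single `low` token, illustrating how frequent subwords condense.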
03
🧠

Pre-Training (Self-Supervised)

The transformer learns to predict the next token across billions of examples — the self-supervised next-token prediction objective. This phase runs on thousands of GPUs/TPUs for weeks to months, consuming the majority of the total compute budget. Loss is minimized via the AdamW optimizer.

Most compute-intensive phase
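The objective itself is just cross-entropy between the model's predicted distribution and the actual next token, averaged over positions. Sketched here on raw logit lists rather than tensors, purely to show the math:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, targets):
    # Average cross-entropy: -log(probability assigned to the true next token),
    # averaged over every position in the sequence.
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)
```

A model that is maximally uncertain over a 4-token vocabulary pays a loss of ln 4 ≈ 1.386 per position; training drives this number down.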
04
🎯

Supervised Fine-Tuning (SFT)

The pre-trained model is fine-tuned on high-quality, human-written demonstrations of desired behavior — answering questions, following instructions, writing code. This phase uses far less data and compute than pre-training but meaningfully shifts model behavior.

Thousands of human-labeled examples
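One detail worth making explicit: during SFT the loss is typically computed only on the response tokens, with the prompt serving as context. A minimal sketch of that masking convention (an assumption here; some setups also train on prompt tokens):

```python
def sft_loss_mask(prompt_len, total_len):
    # 0 for prompt positions (context only), 1 for response positions,
    # so gradient updates come only from the demonstrated answer.
    return [0] * prompt_len + [1] * (total_len - prompt_len)

def masked_mean(losses, mask):
    # Average the per-token losses over unmasked (response) positions only.
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept)
```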
05
🤝

RLHF — Reinforcement Learning from Human Feedback

Human raters compare model outputs and rank them by quality, helpfulness, and safety. These rankings train a reward model that scores outputs. The LLM is then optimized via PPO (Proximal Policy Optimization) to maximize the reward — aligning it with human preferences.

Human preference alignment
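The reward model at the heart of this phase is commonly trained with a Bradley–Terry pairwise objective: push the score of the human-preferred output above the rejected one. The loss for a single comparison:

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    # Bradley–Terry objective: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the reward model scores the preferred output higher.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no margin the loss is ln 2; it shrinks toward zero as the reward model learns to separate preferred from rejected outputs, and the resulting scores become the reward signal for PPO.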
06
🚀

Evaluation, Safety Testing & Deployment

Models are evaluated on standardized benchmarks (MMLU, HumanEval, TruthfulQA) and red-teamed for safety vulnerabilities. Deployment involves quantization, distillation, or API-based inference infrastructure. Monitoring continues post-launch to detect regressions and misuse.

Continuous post-deployment monitoring
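Of the deployment techniques listed, quantization is the most mechanical: map floating-point weights to low-bit integers via a scale factor. A symmetric per-tensor int8 sketch (real deployments typically use per-channel scales and calibration data):

```python
def quantize_int8(weights):
    # Symmetric quantization: one scale maps the largest |weight| to ±127.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Recover approximate floats; error is bounded by half the scale.
    return [x * scale for x in q]
```

Storing int8 instead of float32 cuts weight memory by 4×, at the cost of small rounding error per weight.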

The Infrastructure Behind the Intelligence

Training frontier LLMs demands extraordinary resources across four primary dimensions — each carrying significant cost and environmental implications.

💾

Datasets

13T+

Tokens consumed during GPT-4 pre-training. Modern LLMs ingest trillions of tokens sourced from web crawls, books, code, scientific literature, and proprietary datasets. Data quality filters, deduplication, and safety classifiers add significant preprocessing overhead before a single training step runs.

Common Crawl · The Pile · GitHub Code · ArXiv Papers · Wikipedia · Books3
🖥️

Computational Power

25,000+

NVIDIA A100 GPUs deployed for GPT-4 training across Microsoft Azure clusters. Training runs require massive parallelism — tensor parallelism, pipeline parallelism, and data parallelism coordinated across thousands of accelerators. A single training run for a frontier model may cost $50–$100M+ in compute alone.

NVIDIA A100/H100 · Google TPU v4 · Azure/AWS/GCP · InfiniBand
⚡

Energy Consumption

1,287

MWh of electricity estimated for GPT-3 training alone — equivalent to the annual power usage of over 120 US households. Frontier model training leaves a substantial carbon footprint, driving AI labs toward renewable energy commitments. Inference at scale multiplies this footprint many times over.

~502 tCO₂ eq (GPT-3) · Renewable targets · PUE optimization · Carbon offsets
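The household comparison checks out arithmetically, assuming an average US household uses roughly 10.7 MWh of electricity per year (an EIA-style average, taken here as an assumption):

```python
TRAINING_MWH = 1287              # estimated GPT-3 pre-training energy (from above)
HOUSEHOLD_MWH_PER_YEAR = 10.7    # assumed average annual US household usage

households = TRAINING_MWH / HOUSEHOLD_MWH_PER_YEAR   # ≈ 120 household-years
```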
⏱️

Training Time

~90

Days estimated for GPT-4 pre-training on 25,000 A100s. Even with massive parallelism, the sheer volume of forward and backward passes through trillion-parameter models takes months. Fine-tuning and RLHF add further weeks. Total wall-clock time from data preparation to deployment often spans 6–18 months.

Weeks–months pre-training · Days SFT · Weeks RLHF · Ongoing eval

Frontier Models at a Glance

Major AI labs have each taken distinct approaches to training data, model architecture, and alignment strategy — with significant variance in disclosed resource costs.

GPT-4
OpenAI · 2023
Est. Parameters
~1.8 Trillion (MoE)
Training Tokens
~13 Trillion
Est. Training Cost
$78M–$100M+
Alignment Method
RLHF + rule-based reward models (RBRMs)
Claude 3
Anthropic · 2024
Est. Parameters
Undisclosed (frontier scale)
Context Window
200K tokens (Opus)
Est. Training Cost
$50M–$75M est.
Alignment Method
Constitutional AI (CAI) + RLHF
LLaMA 3
Meta · 2024
Parameters
8B / 70B / 405B
Training Tokens
15+ Trillion
Est. Training Cost
$30M–$60M (405B)
Access
Open weights (research & commercial)
Gemini 1.5
Google DeepMind · 2024
Est. Parameters
~1 Trillion (MoE, est.)
Context Window
1M tokens (Pro)
Infrastructure
TPU v4/v5 pods (Google)
Modality
Natively multimodal (text, image, audio, video)

Training Cost Comparison

Estimates sourced from published research, industry reports, and disclosed infrastructure data. Costs reflect pre-training compute only unless noted.

| Model | Organization | Hardware | Est. GPU-Hours | Est. Compute Cost | Energy (MWh) |
| --- | --- | --- | --- | --- | --- |
| GPT-3 (175B) | OpenAI | V100 cluster | 3.64M | ~$4.6M | 1,287 |
| GPT-4 | OpenAI | 25,000× A100 | ~50M (est.) | $78M–$100M+ | Est. 50,000+ |
| LLaMA 2 (70B) | Meta | 2,000× A100-80G | 1.72M | ~$3M–$5M | ~539 |
| LLaMA 3 (405B) | Meta | 16,384× H100 | ~30M (est.) | $30M–$60M est. | Est. 30,000+ |
| Claude 3 Opus | Anthropic | AWS Trainium/A100 | Undisclosed | $50M–$75M est. | Undisclosed |
| Gemini Ultra | Google | TPU v4 pods | Undisclosed | $80M–$100M+ est. | Undisclosed |

* Estimates based on Epoch AI, Semianalysis, and published technical reports. Actual costs vary with hardware pricing, efficiency, and negotiated cloud rates.
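These compute-cost estimates all reduce to GPU-hours times a blended hourly rate. The GPT-3 row, for example, implies a rate of roughly $1.26 per V100 GPU-hour:

```python
def compute_cost(gpu_hours, usd_per_gpu_hour):
    # Pre-training compute cost = GPU-hours × blended hourly rate.
    return gpu_hours * usd_per_gpu_hour

# Implied rate behind the GPT-3 row: ~$4.6M over 3.64M V100 GPU-hours.
implied_rate = 4_600_000 / 3_640_000   # ≈ $1.26 per GPU-hour
```

Swapping in hardware generation, cluster size, and negotiated cloud rates reproduces the rough spread between rows.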

What the Numbers Tell Us

Beyond the raw figures, the LLM training landscape reveals several structural patterns with lasting implications for AI accessibility and governance.

01

Data Quality Outweighs Data Quantity

Chinchilla scaling laws (Hoffmann et al., 2022) showed that most early large models were undertrained relative to their size. More tokens of higher quality — not just bigger models — drive performance gains.
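The Chinchilla result is often summarized as a rule of thumb: roughly 20 training tokens per parameter for compute-optimal training. Applied to GPT-3 (175B parameters, trained on ~300B tokens):

```python
def chinchilla_optimal_tokens(params):
    # Hoffmann et al. (2022) rule of thumb: ~20 training tokens per parameter.
    return 20 * params

gpt3_params = 175e9            # 175B parameters
gpt3_actual_tokens = 300e9     # GPT-3 was trained on ~300B tokens

optimal = chinchilla_optimal_tokens(gpt3_params)     # 3.5e12 (3.5T) tokens
undertrained_factor = optimal / gpt3_actual_tokens   # ≈ 11.7× below optimal
```

By this heuristic GPT-3 saw roughly a twelfth of the compute-optimal token count, which is exactly the undertraining the paper identified.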

02

RLHF is the Key Alignment Lever

Pre-training produces a capable but uncontrolled model. RLHF — pioneered at OpenAI and adopted across the industry — is what transforms raw capability into a model that follows instructions, avoids harm, and behaves helpfully.

03

Cost Creates Concentration Risk

Frontier model training is now a $50–100M+ endeavor, practically limiting it to a handful of well-capitalized organizations. This raises important questions about AI governance, access, and the democratizing potential of open-weight models like LLaMA.

04

Inference Cost Is the Underappreciated Variable

Training is a one-time cost; inference runs continuously at scale. Serving billions of queries daily can dwarf training costs over time, driving intense investment in quantization, speculative decoding, and smaller distilled models.
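A back-of-envelope model makes the point concrete. Every input here is hypothetical — 1B queries per day, 1,000 tokens per query, and a blended $0.50 per million tokens served:

```python
def annual_inference_cost(queries_per_day, tokens_per_query, usd_per_million_tokens):
    # Yearly serving cost: daily token volume × per-token price × 365 days.
    daily_tokens = queries_per_day * tokens_per_query
    return daily_tokens / 1e6 * usd_per_million_tokens * 365

# Hypothetical deployment: 1B queries/day, 1,000 tokens per query, $0.50 per 1M tokens.
cost = annual_inference_cost(1e9, 1000, 0.50)   # ≈ $182.5M per year
```

Under these assumed numbers, a single year of serving already exceeds the largest pre-training cost estimates in the table above — which is why quantization, speculative decoding, and distillation attract so much investment.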