Unveiling the Hidden Costs of Large Language Models at RacketCon 2024
Introduction
As part of the fourteenth RacketCon, one of the talks walked through the expected costs, in terms of energy, water, and time, of training and using large language models (LLMs).
This document serves as a comprehensive resource on the multifaceted costs of LLMs, covering training and inference, environmental impact, and optimization strategies.
Training Costs
Energy Consumption
Training large language models requires massive computational resources. Recent studies have quantified these costs:
- GPT-3 (175B parameters): Estimated 1,287 MWh during training, equivalent to the annual energy consumption of ~130 US homes
- BLOOM (176B parameters): Consumed approximately 433 MWh for training
- PaLM (540B parameters): Training consumed an estimated 2,500 MWh of energy
- Meta's LLaMA-2 70B: Required approximately 1,720 MWh for training
Energy costs scale super-linearly with model size: because training-data volume is typically scaled up alongside parameter count, each doubling of parameters requires significantly more than double the energy.
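To make these figures concrete, here is a minimal Racket sketch of the household-equivalence arithmetic behind the comparison above, assuming a round ~10 MWh of electricity per US home per year (an assumed figure close to, but not exactly, published averages):

```racket
#lang racket
;; Back-of-envelope: convert a training run's energy use into
;; equivalent years of average US household electricity consumption.
;; Assumption: ~10 MWh per US home per year (rounded figure).
(define mwh-per-home-year 10)

(define (training-energy->home-years mwh)
  (/ mwh mwh-per-home-year))

;; Reported training-energy estimates (MWh) from the list above.
(define models
  '(("GPT-3 175B"  1287)
    ("BLOOM 176B"   433)
    ("PaLM 540B"   2500)
    ("LLaMA-2 70B" 1720)))

(for ([m (in-list models)])
  (printf "~a: ~a MWh ≈ ~a home-years\n"
          (first m) (second m)
          (exact->inexact (training-energy->home-years (second m)))))
```

Under this assumption GPT-3's 1,287 MWh works out to roughly 129 home-years, which is where the "~130 US homes" comparison comes from.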
Water Usage
Data centers require substantial water for cooling systems:
- GPT-3 training is estimated to have consumed ~700,000 liters of water for cooling
- A single data center can consume 1-5 million liters of water daily
- Microsoft reported a 34% increase in water consumption from 2021 to 2022, largely attributed to AI infrastructure
- Google's water usage increased by 20% in 2022, with AI workloads being a significant factor
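The cooling-water figure can be roughly reproduced from the energy figure with a site-level water intensity. A minimal Racket sketch, assuming a hypothetical ~0.55 liters of water per kWh; real water usage effectiveness varies widely by facility, climate, and season:

```racket
#lang racket
;; Rough estimate of cooling water for a training run:
;;   water (liters) ≈ energy (kWh) * water intensity (liters per kWh).
;; The 0.55 L/kWh intensity is an assumed round figure, not a measurement.
(define liters-per-kwh 0.55)

(define (training-water-liters mwh)
  (* mwh 1000 liters-per-kwh))   ; MWh -> kWh -> liters

;; GPT-3's reported ~1,287 MWh gives on the order of 700,000 liters,
;; consistent with the estimate quoted above.
(printf "GPT-3-scale run: ~a liters\n" (training-water-liters 1287))
```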
Compute Resources
- GPT-3: ~3,640 petaflop-days of compute
- GPT-4: Estimated 50,000-100,000 petaflop-days (rumored, not officially disclosed)
- Training duration: From several weeks to several months depending on model size and infrastructure
- Hardware costs: NVIDIA A100 GPU clusters can cost $10M-$100M+ for training facilities
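The GPT-3 compute figure is consistent with the widely used rule of thumb that training FLOPs ≈ 6 × parameters × training tokens. A minimal Racket sketch, using GPT-3's reported 175B parameters and ~300B training tokens:

```racket
#lang racket
;; Approximate training compute via the common rule of thumb
;; FLOPs ≈ 6 * N * D, where N = parameters and D = training tokens.
(define (training-flops params tokens)
  (* 6 params tokens))

;; One petaflop-day = 10^15 FLOP/s sustained for 86,400 seconds.
(define flops-per-petaflop-day (* 1e15 86400))

(define (flops->petaflop-days flops)
  (/ flops flops-per-petaflop-day))

;; GPT-3: 175B parameters, ~300B training tokens (reported figures).
(printf "GPT-3: ~a petaflop-days\n"
        (flops->petaflop-days (training-flops 175e9 300e9)))
;; => about 3,646, in line with the ~3,640 petaflop-days quoted above.
```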
Inference Costs
Per-Token Economics
Running inference on LLMs incurs costs for every token generated:
- GPT-4: ~$0.03-0.06 per 1K tokens (input), $0.06-0.12 per 1K tokens (output)
- GPT-3.5: ~$0.0015-0.002 per 1K tokens
- Claude 3 Opus: ~$0.015 per 1K input tokens, $0.075 per 1K output tokens
- Open-source models (self-hosted): $0.001-0.01 per 1K tokens depending on infrastructure
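A small Racket sketch of how per-1K-token prices translate into per-request cost; the prices are the mid-range figures from the list above and should be treated as placeholders, since providers change pricing frequently:

```racket
#lang racket
;; Dollar cost of one request, given token counts and per-1K-token prices.
(define (request-cost in-tokens out-tokens in-price-per-1k out-price-per-1k)
  (+ (* (/ in-tokens 1000.0) in-price-per-1k)
     (* (/ out-tokens 1000.0) out-price-per-1k)))

;; Example: 1,500 input tokens and 500 output tokens.
(printf "GPT-4-class prices:   $~a\n" (request-cost 1500 500 0.03 0.06))
(printf "GPT-3.5-class prices: $~a\n" (request-cost 1500 500 0.0015 0.002))
```

At these assumed prices the same request costs about $0.075 on the larger model versus about $0.003 on the smaller one, which is why the model-selection practices discussed below matter so much at scale.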
Energy Per Query
- Single ChatGPT query: ~0.002-0.003 kWh (approximately 2-3 Wh)
- Daily operational cost for ChatGPT estimated at $100,000-700,000 (energy + infrastructure)
- Scaling estimate: 10 billion queries/day at 2-3 Wh each would consume roughly 20-30 GWh of electricity per day, on the order of 1 GW of continuous power
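The scaling arithmetic is easy to check. A minimal Racket sketch, where the query volume and per-query energy are illustrative assumptions rather than measurements:

```racket
#lang racket
;; Scale per-query energy up to fleet-level totals.
(define (daily-energy-gwh queries-per-day wh-per-query)
  (/ (* queries-per-day wh-per-query) 1e9))       ; Wh/day -> GWh/day

(define (continuous-power-mw queries-per-day wh-per-query)
  (/ (* queries-per-day wh-per-query) 24.0 1e6))  ; Wh/day -> average MW

;; Illustrative scenario: 10 billion queries/day at 2.5 Wh each.
(printf "~a GWh/day, ~a MW of continuous power\n"
        (daily-energy-gwh 1e10 2.5)
        (continuous-power-mw 1e10 2.5))
;; => 25.0 GWh/day, about 1,042 MW continuous.
```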
Environmental Impact
Carbon Footprint
- Training GPT-3: Estimated 552 metric tons CO2e (equivalent to 120 cars driven for a year)
- BLOOM training: ~25 metric tons CO2e (significantly lower due to French nuclear power grid)
- Full lifecycle emissions (including hardware manufacturing): 2-5x higher than training alone
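Operational training emissions are, to a first approximation, energy multiplied by the carbon intensity of the supplying grid. A Racket sketch using assumed round intensities of ~430 g CO2e/kWh (a US-average-like grid) and ~57 g CO2e/kWh (a nuclear-heavy grid such as France's); it roughly reproduces the GPT-3 and BLOOM figures above:

```racket
#lang racket
;; Operational emissions (metric tons CO2e) =
;;   energy (kWh) * grid intensity (kg CO2e per kWh) / 1000.
;; Grid intensities below are assumed round figures.
(define (training-co2e-tons mwh kg-co2e-per-kwh)
  (/ (* mwh 1000 kg-co2e-per-kwh) 1000))

(printf "GPT-3 on a ~~430 g/kWh grid: ~a t CO2e\n"
        (training-co2e-tons 1287 0.430))
(printf "BLOOM on a ~~57 g/kWh grid:  ~a t CO2e\n"
        (training-co2e-tons 433 0.057))
;; => roughly 553 t and 25 t, matching the estimates above.
```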
Comparative Environmental Costs
- Training one large model: Equivalent to 5x the lifetime emissions of an average car
- Daily ChatGPT operations: Comparable to a small town's electricity consumption
- Projected AI sector emissions by 2030: 0.5-1.5% of global greenhouse gas emissions
Academic Studies
Key research findings:
- Strubell et al. (2019): "Energy and Policy Considerations for Deep Learning in NLP"
  - Demonstrated that training a single large transformer model can emit as much carbon as five cars in their lifetimes
  - Highlighted the environmental cost of neural architecture search (NAS)
- Patterson et al. (2021): "Carbon Emissions and Large Neural Network Training"
  - Showed that carbon footprint varies dramatically based on energy grid composition
  - Proposed using carbon-aware computing to reduce emissions by 100-1000x
- Luccioni et al. (2023): "Power Hungry Processing: Watts Driving the Cost of AI Deployment?"
  - First comprehensive study of inference costs across multiple model families
  - Found that image generation models have significantly higher per-query costs than text models
Cost Optimization Strategies
Model Efficiency
- Model distillation: Reduce parameters by 10-100x while maintaining 95%+ performance
- Quantization: 4-bit and 8-bit models reduce memory and compute by 2-4x (see the sketch after this list)
- Sparse models: Mixture-of-Experts (MoE) architectures activate only 10-20% of parameters per query
- Pruning: Remove redundant weights to reduce model size by 30-70%
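The memory side of the quantization claim is simple arithmetic over bytes per parameter. A Racket sketch for a hypothetical 70B-parameter model, counting weights only and ignoring activations and the KV cache:

```racket
#lang racket
;; Approximate weight-memory footprint at different numeric precisions.
(define (weight-memory-gb params bytes-per-param)
  (/ (* params bytes-per-param) 1e9))

(define params 70e9)   ; hypothetical 70B-parameter model

(for ([fmt (in-list '(("fp16" 2) ("int8" 1) ("int4" 0.5)))])
  (printf "~a: ~a GB\n"
          (first fmt)
          (weight-memory-gb params (second fmt))))
;; => 140 GB, 70 GB, and 35 GB: the 2-4x reduction cited above.
```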
Infrastructure Optimization
- Carbon-aware scheduling: Run training jobs when renewable energy is available (see the sketch after this list)
- Geographic optimization: Locate data centers in regions with clean energy grids
- Liquid cooling: Reduce water consumption by 20-30% compared to traditional cooling
- Custom accelerators: Google TPUs, AWS Trainium offer 2-5x better price/performance
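Carbon-aware scheduling reduces to picking the lowest-carbon window in a grid-intensity forecast. A toy Racket sketch with made-up hourly intensities; a real scheduler would pull forecasts from a grid-data provider and handle preemption, deadlines, and regional choice:

```racket
#lang racket
;; Toy carbon-aware scheduler: given forecast grid intensities
;; (g CO2e/kWh) for consecutive hours, choose the start hour that
;; minimizes total emissions for a job of the given duration.
;; The forecast values are invented for illustration.
(define forecast '#(520 480 410 350 300 280 290 340 420 500 540 560))

(define (window-sum vec start len)
  (for/sum ([i (in-range start (+ start len))])
    (vector-ref vec i)))

(define (best-start-hour vec job-hours)
  (argmin (lambda (start) (window-sum vec start job-hours))
          (range 0 (add1 (- (vector-length vec) job-hours)))))

(printf "Best start hour for a 3-hour job: ~a\n"
        (best-start-hour forecast 3))   ; => 4 (hours 4-6 are the cleanest)
```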
Operational Best Practices
- Batch processing: Amortize fixed costs across multiple requests
- Caching: Store and reuse common completions (see the sketch after this list)
- Prompt optimization: Reduce token counts through efficient prompt engineering
- Model selection: Use smallest model sufficient for task (GPT-3.5 vs GPT-4)
- Local deployment: Self-host smaller models for privacy and cost reduction
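Caching is the simplest of these to sketch: identical prompts should hit the model only once. A minimal Racket example in which call-model is a hypothetical stand-in for the real (expensive) inference call:

```racket
#lang racket
;; Minimal completion cache: reuse previous responses for identical prompts.
(define cache (make-hash))

(define (call-model prompt)
  ;; Placeholder for the real API or local-inference call.
  (string-append "response to: " prompt))

(define (cached-completion prompt)
  ;; hash-ref! calls the thunk (and stores its result) only on a cache miss.
  (hash-ref! cache prompt (lambda () (call-model prompt))))

(cached-completion "What is Racket?")   ; pays for inference
(cached-completion "What is Racket?")   ; served from the cache
```

In practice the cache key would normalize the prompt and include the model name and sampling parameters, and entries would carry an expiry.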
Future Directions
Emerging Approaches
- Retrieval-Augmented Generation (RAG): Reduce model size requirements
- Parameter-efficient fine-tuning (PEFT): LoRA and QLoRA can reduce fine-tuning costs by roughly 100x (see the sketch after this list)
- Edge deployment: Move inference to user devices
- Specialized models: Task-specific smaller models replacing general-purpose large ones
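The parameter savings behind LoRA-style PEFT follow from replacing the full update of a d_in x d_out weight matrix with two rank-r factors, so only r * (d_in + d_out) parameters are trained per adapted matrix. A Racket sketch with an illustrative 4096-dimensional projection and rank 8:

```racket
#lang racket
;; Trainable parameters: full fine-tuning of one weight matrix
;; versus a rank-r LoRA adapter for the same matrix.
(define (full-params d-in d-out) (* d-in d-out))
(define (lora-params d-in d-out r) (* r (+ d-in d-out)))

;; Example: a 4096x4096 projection with a rank-8 adapter.
(define d 4096)
(printf "full: ~a  lora (r = 8): ~a  reduction: ~ax\n"
        (full-params d d)
        (lora-params d d 8)
        (exact->inexact (/ (full-params d d) (lora-params d d 8))))
;; => 16,777,216 vs 65,536 trainable parameters, a 256x reduction
;;    for this single matrix.
```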
Sustainability Initiatives
- Green AI movement: Developing energy-efficient architectures
- Renewable energy commitments: Major providers targeting 100% renewable energy
- Carbon offset programs: Companies purchasing carbon credits for AI operations
- Efficiency reporting: Standardized metrics for comparing model environmental costs
References
- https://docs.racket-lang.org/llm/LLM_Cost_Model.html
- "LLM Cost Models and Optimization": https://arxiv.org/abs/2309.14393
- "Environmental Impact of Large Language Models": https://arxiv.org/abs/2304.03271
- "Sustainable AI Development": https://arxiv.org/abs/2304.08485
- Strubell et al. (2019): "Energy and Policy Considerations for Deep Learning in NLP": https://arxiv.org/abs/1906.02243
- Patterson et al. (2021): "Carbon Emissions and Large Neural Network Training": https://arxiv.org/abs/2104.10350
- Luccioni et al. (2023): "Power Hungry Processing: Watts Driving the Cost of AI Deployment?": https://arxiv.org/abs/2311.16863
- Sharir et al. (2020): "The Cost of Training NLP Models": https://arxiv.org/abs/2004.08900
- Wu et al. (2022): "Sustainable AI: Environmental Implications": https://arxiv.org/abs/2111.00364
- de Vries (2023): "The growing energy footprint of AI": https://www.cell.com/joule/fulltext/S2542-4351(23)00365-3
Acknowledgments
This research was presented at RacketCon 2024, the fourteenth annual conference for the Racket programming language community, fostering discussion about responsible AI development and computational sustainability.