Chapter 1 - Python & Data: The Unsexy Foundation
The Crux
You want to learn AI, so you're probably eager to jump into neural networks and transformers. Stop. The real bottleneck isn't fancy algorithms; it's data quality and infrastructure. This chapter is about the unglamorous reality: 90% of AI work is data plumbing.
Why Python Won (And Why It's Imperfect)
Python is the lingua franca of AI. But why? It's not the fastest language. Its type system is weak. Its parallelism story is messy (GIL, anyone?). So why Python?
The Real Reasons
1. NumPy and the Scientific Computing Stack In the mid-1990s, the Numeric library (which later evolved into NumPy) provided array operations that were fast enough (C under the hood) and ergonomic enough (Python on top). This created a beachhead.
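The "fast enough, ergonomic enough" trade-off is visible in a few lines: one vectorized expression replaces an explicit loop, and the actual arithmetic runs in compiled C. A minimal sketch:

```python
import numpy as np

# One vectorized expression over a million elements.
# The loop runs in C inside NumPy, not in the Python interpreter.
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 1.0

# The pure-Python equivalent spells out the loop the interpreter must run:
b_slow = [x ** 0.5 + 1.0 for x in range(1_000_000)]

print(b[:3])       # [1.         2.         2.41421356]
```

The ergonomics come from the notation: `np.sqrt(a) + 1.0` reads like the math it computes, while the speed comes from the C loop underneath.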
2. Ecosystem Network Effects Once researchers built scikit-learn, pandas, and matplotlib on top of NumPy, switching costs became prohibitive. The ecosystem is now massive.
3. Readability for Non-Programmers Many AI researchers aren't software engineers; they're statisticians, physicists, domain experts. Python's readability lowered the barrier.
4. Interactive Development Jupyter notebooks let you experiment cell-by-cell. This matches the exploratory nature of data work.
The Downsides Nobody Talks About
Type Safety: Python's dynamic typing means data bugs hide until runtime. You'll pass a list where a NumPy array was expected, and everything crashes three hours into training.
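Worse than a crash is when the wrong type runs without complaint and produces the wrong numbers. A minimal sketch of the list-versus-array trap, where the same `* 2` means two entirely different things:

```python
import numpy as np

weights = [0.5, 0.5, 0.5]              # meant to be an array, but it's a list

doubled_list = weights * 2             # list repetition: six elements, nothing scaled
doubled_arr = np.array(weights) * 2    # elementwise math: each value doubled

print(doubled_list)   # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(doubled_arr)    # [1. 1. 1.]
```

No exception is raised in either case; the type error surfaces only when the downstream results look wrong. Type hints plus a checker like mypy catch some of this, but only if you opt in.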
Performance: Python is slow. Everything fast is actually C/C++/CUDA underneath. You're writing Python glue code over compiled libraries.
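You can measure the glue-code gap directly with the standard library's `timeit`. A minimal sketch comparing a pure-Python reduction against the compiled NumPy one (the exact ratio varies by machine, but the compiled path reliably wins on arrays this size):

```python
import timeit

import numpy as np

data = list(range(100_000))
arr = np.array(data)

# Pure Python: the interpreter executes the loop, one object at a time.
py_time = timeit.timeit(lambda: sum(data), number=100)

# NumPy: the same reduction runs as a compiled C loop over a contiguous buffer.
np_time = timeit.timeit(lambda: arr.sum(), number=100)

print(f"pure Python: {py_time:.4f}s  NumPy: {np_time:.4f}s")
```

Both lines compute the same sum; the difference is entirely in which runtime executes the loop.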
Packaging Hell: Dependency management is a mess. pip, conda, poetry, virtual environments; it's a fractal of complexity.
The GIL: Python's Global Interpreter Lock means threads can't run CPU-bound Python code in parallel, so true parallelism is painful. You'll learn to live with it.
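The nuance worth knowing: the GIL is released during blocking I/O, so threads still help for I/O-bound work; it's CPU-bound Python code that can't parallelize with threads. A minimal sketch using `time.sleep` as a stand-in for real I/O:

```python
import threading
import time

def io_task():
    time.sleep(0.1)   # stand-in for I/O; the GIL is released while blocked

# Sequential: two waits back to back, roughly 0.2s total.
start = time.perf_counter()
io_task()
io_task()
sequential = time.perf_counter() - start

# Threaded: the waits overlap because sleep releases the GIL, roughly 0.1s.
start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s  threaded: {threaded:.2f}s")
```

Swap the sleep for a tight numeric loop and the threaded version stops winning; that's when people reach for multiprocessing or push the work into a C extension that releases the GIL itself.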
Why We're Stuck: The ecosystem is too valuable to abandon. The industry settled on "Python for glue code, compiled languages for heavy lifting."