Data Scientist Roadmap

Level: Intermediate

How to follow this roadmap

  1. Build the math and statistics foundation first — descriptive stats, hypothesis testing, regression, and basic linear algebra. Skipping this leaves gaps that show up in interviews and modeling decisions.
  2. Get comfortable with the core toolchain: Python (NumPy, Pandas, Polars), Jupyter, SQL, and one visualization library (Plotly or Matplotlib). Build at least three end-to-end notebooks on real datasets.
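  3. Move to classical ML with scikit-learn: regression, classification, trees, and ensembles, plus the gradient-boosting libraries XGBoost and LightGBM (separate packages that offer scikit-learn-compatible estimators). Most production ML is still classical, not deep learning.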
  4. Add deep learning fundamentals (PyTorch, transfer learning, fine-tuning) when classical ML hits a ceiling. For most product work, you'll use pretrained models more often than training from scratch.
  5. Layer on production skills: model serving (FastAPI, BentoML), monitoring (Evidently, Arize), feature stores. The job is increasingly about shipping and operating models, not just notebooks.
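
As a concrete taste of step 1, the sketch below fits a simple linear regression by hand and reports R², using only the Python standard library. The function names (fit_line, r_squared) and the toy data are hypothetical, chosen for illustration; in practice you would likely reach for scikit-learn or statsmodels once the mechanics are clear.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, score)
r2 = r_squared(hours, score, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} r2={r2:.3f}")
```

Writing the fit out once makes it easier to reason about what a library call is doing, for example why an outlier in x can dominate the slope.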

When to choose this path

Choose this roadmap if you want to extract insight from data and build predictive systems, working at the intersection of statistics, software engineering, and the business problem. It's a strong fit for analysts moving up, researchers crossing into industry, and engineers who want to specialize. If your goal is building production AI features with LLMs, the AI Engineer Roadmap is closer to that job. If you want to build the data infrastructure that scientists use, choose the Data Engineer Roadmap.

Frequently asked questions

Data scientist vs ML engineer vs data engineer?
Data scientists frame business problems, build models, and communicate insight; the work is part research, part engineering. ML engineers productionize models — pipelines, serving, monitoring at scale. Data engineers build the pipelines and warehouses that feed both. The roles overlap at small companies and become distinct at scale.
Do I need a PhD to be a data scientist?
No. PhDs are common at research-heavy companies (Google, DeepMind, OpenAI) but not required for most product data science roles. A strong portfolio with shipped projects and clear writeups often carries more weight in hiring loops than a generic master's degree.
R or Python for data science in 2026?
Python by a wide margin. R remains strong in statistical research, biostats, and academic settings, but Python dominates industry — especially for ML, deep learning, and any role that touches production code. Learn Python first; pick up R only if a specific role or domain demands it.
How important is Kaggle for landing a job?
Useful for skill-building, less useful as a hiring signal in 2026. Kaggle competitions exercise modeling skills but don't reflect the rest of a data scientist's real-world work (problem framing, data cleaning, communication). Build a portfolio of end-to-end projects on real, messy data instead.
What are the best portfolio projects?
Three or four projects that show end-to-end thinking: pick a real domain, find or scrape your own data, frame a clear question, build a model, evaluate it honestly, and write up the findings. One strong project beats ten Kaggle notebook clones.
How deep do I need to go in statistics?
Solid fundamentals — distributions, hypothesis testing, confidence intervals, regression assumptions, p-values, A/B testing — are required and tested in interviews. Beyond that, depth depends on the role: causal inference for product DS, advanced Bayesian for research DS, less for ML-focused roles.
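
As an illustration of the A/B testing fundamentals mentioned above, here is a minimal sketch of a two-sided two-proportion z-test using only the standard library. The function name and the conversion counts are hypothetical; a real analysis would also check sample-size assumptions and would typically use a statistics package such as SciPy or statsmodels.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200/2000 conversions in control, 250/2000 in variant.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z={z:.2f} p={p_value:.4f}")
```

A small p-value only says the observed difference would be unlikely under the null; whether the lift matters for the product is a separate question.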
SQL or NoSQL for data science?
SQL. Nearly every working data scientist writes SQL routinely, against Postgres, Snowflake, BigQuery, or Redshift. NoSQL (MongoDB, Cassandra) shows up occasionally in source data but isn't part of the core analytical workflow. Get fast at SQL.
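
To make the SQL answer concrete, the sketch below runs a typical aggregate query (GROUP BY with COUNT and SUM) against an in-memory SQLite database from Python. The orders table, its columns, and the rows are hypothetical; the same query shape carries over to Postgres, Snowflake, BigQuery, or Redshift with minor dialect differences.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT,
        amount     REAL,
        order_date TEXT
    );
    INSERT INTO orders (customer, amount, order_date) VALUES
        ('ana', 30.0, '2026-01-05'),
        ('ana', 45.0, '2026-02-11'),
        ('ben', 20.0, '2026-01-20'),
        ('ben', 25.0, '2026-01-28'),
        ('cy',  80.0, '2026-02-02');
    """
)

# Orders and total spend per customer, highest spenders first.
query = """
    SELECT customer,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC;
"""
rows = conn.execute(query).fetchall()
for customer, n_orders, total_spend in rows:
    print(customer, n_orders, total_spend)
conn.close()
```

Practicing on SQLite keeps the feedback loop short; joins, window functions, and date handling are natural next steps on a real warehouse.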

Last updated: 2026-04-27