python-performance
Repository for the book Fast Python - published by Manning
Fast Python for Data Science
Welcome to the code repository for the book Fast Python

Fast Python is your guide to optimizing every part of your Python-based data analysis process, from the pure Python code you write to managing the resources of modern hardware and GPUs. You'll learn to rewrite inefficient data structures, improve underperforming code with multithreading, and simplify your datasets without sacrificing accuracy.
This repository contains the code for the book, organized in the chapter-oriented roadmap below. Because the book is in early access, the repository is also still under construction.
Introduction
Extracting maximum performance from built-in features
- Profiling applications with both IO and computing workloads
- Profiling code to detect performance bottlenecks
- Optimizing basic data structures for speed: lists, sets, dictionaries
- Finding excessive memory allocation
- Using laziness and generators for big-data pipelining
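As a small taste of the laziness topic, here is an illustrative sketch (the column name `amount` and the inline data are made up for the example) of a generator pipeline that aggregates CSV rows without ever materializing the whole dataset in memory:

```python
import csv
import io

def read_rows(fileobj):
    """Lazily yield CSV rows as dictionaries, one at a time."""
    yield from csv.DictReader(fileobj)

def positive_amounts(rows):
    """Lazily filter rows, keeping only positive amounts."""
    return (float(r["amount"]) for r in rows if float(r["amount"]) > 0)

# Each stage is a generator, so only one row is in flight at a time
# and the whole pipeline runs in constant memory.
data = io.StringIO("amount\n10.0\n-3.5\n2.5\n")
total = sum(positive_amounts(read_rows(data)))
```

The same shape scales to files of arbitrary size: replace the `StringIO` with `open("big_file.csv")` and the memory footprint stays flat.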
Concurrency, parallelism, and asynchronous processing
- Writing the scaffold of an asynchronous server
- Implementing the first MapReduce engine
- Implementing a concurrent version of a MapReduce engine
- Using multi-processing to implement MapReduce
- Tying it all together: an asynchronous multi-threaded and multi-processing MapReduce server
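The chapters above build a MapReduce engine up step by step. Purely as an illustrative sketch (this is not the book's engine), a minimal word-count MapReduce over a process pool might look like:

```python
from collections import Counter
from multiprocessing import Pool

def mapper(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())

def reducer(counters):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

def mapreduce_wordcount(chunks, processes=2):
    """Run the map step in parallel across processes, then reduce."""
    with Pool(processes) as pool:
        return reducer(pool.map(mapper, chunks))

if __name__ == "__main__":
    counts = mapreduce_wordcount(["a b a", "b c"])
    print(counts["a"], counts["b"], counts["c"])  # 2 2 1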
Using NumPy more efficiently
- Understanding NumPy from a performance perspective
- Using array programming
- Tuning NumPy's internal architecture for performance
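Array programming is the core idea of the NumPy chapter: replace per-element Python loops with whole-array operations that run in compiled code. A small sketch of the contrast:

```python
import numpy as np

def norm_loop(xs):
    """Euclidean norm, one Python-level operation per element."""
    total = 0.0
    for x in xs:
        total += x * x
    return total ** 0.5

def norm_vectorized(xs):
    """Same computation, but the loop runs inside NumPy's compiled code."""
    return np.sqrt(np.sum(xs * xs))

xs = np.arange(10_000, dtype=np.float64)
```

Both functions return the same value, but the vectorized version avoids the Python interpreter overhead on every element, which is typically an order-of-magnitude difference or more on large arrays.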
Extracting maximum efficiency of hardware and networks
Re-implementing critical code with Cython
- A whirlwind tour of Cython
- Profiling Cython code
- Optimizing array access with Cython memoryviews
- Writing NumPy generalized universal functions in Cython
- Advanced array access in Cython
- Parallelism in Cython
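To give a flavor of the memoryview topic, here is a hypothetical sketch of a `.pyx` module (not code from the book) that sums a contiguous array through a typed memoryview, so the inner loop compiles down to C-speed indexing:

```cython
# sum1d.pyx -- illustrative sketch only; build with cythonize("sum1d.pyx")
import cython

@cython.boundscheck(False)   # skip per-access bounds checks
@cython.wraparound(False)    # disallow negative indexing
def sum1d(double[::1] xs):   # typed memoryview over a C-contiguous array
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(xs.shape[0]):
        total += xs[i]
    return total
```

The `double[::1]` declaration is what lets Cython generate direct pointer arithmetic instead of going through Python's buffer protocol on every access.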
Memory hierarchy, storage and networking
- How modern hardware architectures impact Python performance
- Efficient data storage with Blosc
- Accelerating NumPy with NumExpr
- The performance implications of using the local network
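As a minimal illustration of the NumExpr topic (assuming the `numexpr` package is installed): where NumPy evaluates a compound expression in steps, allocating a temporary array for each intermediate, NumExpr compiles the whole expression and streams it through memory in cache-sized blocks:

```python
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# NumPy would compute a*b, then 2*a, then the sum -- three full-size arrays.
# NumExpr evaluates the whole expression blockwise, with no large temporaries.
result = ne.evaluate("a * b + 2 * a")
```

The payoff grows with array size, because the blockwise evaluation keeps the working set inside the CPU caches, which is exactly the memory-hierarchy point this chapter develops.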
Optimizing modern data processing libraries
High performance Pandas and Apache Arrow
- Optimizing memory and time when loading data
- Techniques to increase data analysis speed
- Pandas on top of NumPy, Cython and NumExpr
- Reading data into Pandas with Arrow
- Using Arrow interop to delegate work to more efficient languages and systems
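One memory-optimization technique from this chapter can be sketched in a few lines (the column values here are invented for the example): converting a repetitive string column to pandas' categorical dtype, which stores each distinct value once plus small integer codes. The same idea applies at load time via the `dtype` argument of `pd.read_csv`.

```python
import pandas as pd

# A repetitive string column stored as Python objects...
df = pd.DataFrame({"city": ["Lisbon", "Porto", "Lisbon"] * 10_000})
before = df["city"].memory_usage(deep=True)

# ...shrinks dramatically as a categorical: 2 distinct strings + int codes.
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)
```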
Storing big data
- A unified interface for file access: fsspec
- Parquet: an efficient format to store columnar data
- Dealing with larger-than-memory datasets the old-fashioned way
- Zarr for large array persistence
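The "old-fashioned way" of handling larger-than-memory data is chunked streaming: read a piece, update a running aggregate, discard the piece. A stdlib-only sketch (the column name `x` and chunk size are arbitrary choices for the example):

```python
import csv
import io

def chunked_mean(fileobj, column, chunk_size=4):
    """Stream a CSV in small chunks, keeping only a running sum and count."""
    total = 0.0
    count = 0
    reader = csv.DictReader(fileobj)
    while True:
        # Pull at most chunk_size rows; zip stops early when the reader ends.
        rows = [row for _, row in zip(range(chunk_size), reader)]
        if not rows:
            break
        total += sum(float(row[column]) for row in rows)
        count += len(rows)
    return total / count

data = io.StringIO("x\n1\n2\n3\n4\n5\n6\n")
mean = chunked_mean(data, "x")
```

Memory use is bounded by the chunk size regardless of file size; the chapter's later tools (Parquet, Zarr) make the same pattern faster and more convenient.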
Advanced topics
Data analysis using GPU computing
- Using Numba to generate CPU code
- Performance analysis of GPU code: the case of a CuPy application
Analyzing big data with Dask
- Understanding the execution model of Dask
- The computational cost of Dask operations
- Using Dask's distributed scheduler