python-performance
Repository for the book Fast Python - published by Manning
Fast Python for Data Science
Welcome to the code repository for the book Fast Python

Fast Python is your guide to optimizing every part of your Python-based data analysis process, from the pure Python code you write to managing the resources of modern hardware and GPUs. You'll learn to rewrite inefficient data structures, improve underperforming code with multithreading, and simplify your datasets without sacrificing accuracy.
This repository contains the code for the book, organized in the chapter-oriented roadmap below. Because the book is in early access, the repository is also still under construction.
Introduction
Extracting maximum performance from built-in features
- Profiling applications with both IO and computing workloads
- Profiling code to detect performance bottlenecks
- Optimizing basic data structures for speed: lists, sets, dictionaries
- Finding excessive memory allocation
- Using laziness and generators for big-data pipelining
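As a small taste of the laziness topic, here is an illustrative sketch (the column name `amount` and the inline data are made up for the example) of a generator pipeline that aggregates CSV rows without ever materializing the whole dataset in memory:

```python
import csv
import io

def read_rows(fileobj):
    """Lazily yield CSV rows as dictionaries, one at a time."""
    yield from csv.DictReader(fileobj)

def positive_amounts(rows):
    """Lazily filter rows, keeping only positive amounts."""
    return (float(r["amount"]) for r in rows if float(r["amount"]) > 0)

# Each stage is a generator, so only one row is in flight at a time
# and the whole pipeline runs in constant memory.
data = io.StringIO("amount\n10.0\n-3.5\n2.5\n")
total = sum(positive_amounts(read_rows(data)))
```

The same shape scales to files of arbitrary size: replace the `StringIO` with `open("big_file.csv")` and the memory footprint stays flat.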
Concurrency, parallelism, and asynchronous processing
- Writing the scaffold of an asynchronous server
- Implementing the first MapReduce engine
- Implementing a concurrent version of a MapReduce engine
- Using multi-processing to implement MapReduce
- Tying it all together: an asynchronous multi-threaded and multi-processing MapReduce server
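The chapters above build a MapReduce engine up step by step. Purely as an illustrative sketch (this is not the book's engine), a minimal word-count MapReduce over a process pool might look like:

```python
from collections import Counter
from multiprocessing import Pool

def mapper(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())

def reducer(counters):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

def mapreduce_wordcount(chunks, processes=2):
    """Run the map step in parallel across processes, then reduce."""
    with Pool(processes) as pool:
        return reducer(pool.map(mapper, chunks))

if __name__ == "__main__":
    counts = mapreduce_wordcount(["a b a", "b c"])
    print(counts["a"], counts["b"], counts["c"])  # 2 2 1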
Using NumPy more efficiently
- Understanding NumPy from a performance perspective
- Using array programming
- Tuning NumPy's internal architecture for performance
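Array programming is the core idea of the NumPy chapter: replace per-element Python loops with whole-array operations that run in compiled code. A small sketch of the contrast:

```python
import numpy as np

def norm_loop(xs):
    """Euclidean norm, one Python-level operation per element."""
    total = 0.0
    for x in xs:
        total += x * x
    return total ** 0.5

def norm_vectorized(xs):
    """Same computation, but the loop runs inside NumPy's compiled code."""
    return np.sqrt(np.sum(xs * xs))

xs = np.arange(10_000, dtype=np.float64)
```

Both functions return the same value, but the vectorized version avoids the Python interpreter overhead on every element, which is typically an order-of-magnitude difference or more on large arrays.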
Extracting maximum efficiency of hardware and networks
Re-implementing critical code with Cython
- A whirlwind tour of Cython
- Profiling Cython code
- Optimizing array access with Cython memoryviews
- Writing NumPy generalized universal functions in Cython
- Advanced array access in Cython
- Parallelism in Cython
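To give a flavor of the memoryview topic, here is a hypothetical sketch of a `.pyx` module (not code from the book) that sums a contiguous array through a typed memoryview, so the inner loop compiles down to C-speed indexing:

```cython
# sum1d.pyx -- illustrative sketch only; build with cythonize("sum1d.pyx")
import cython

@cython.boundscheck(False)   # skip per-access bounds checks
@cython.wraparound(False)    # disallow negative indexing
def sum1d(double[::1] xs):   # typed memoryview over a C-contiguous array
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(xs.shape[0]):
        total += xs[i]
    return total
```

The `double[::1]` declaration is what lets Cython generate direct pointer arithmetic instead of going through Python's buffer protocol on every access.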
Memory hierarchy, storage and networking
- How modern hardware architectures impact Python performance
- Efficient data storage with Blosc
- Accelerating NumPy with NumExpr
- The performance implications of using the local network
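As a minimal illustration of the NumExpr topic (assuming the `numexpr` package is installed): where NumPy evaluates a compound expression in steps, allocating a temporary array for each intermediate, NumExpr compiles the whole expression and streams it through memory in cache-sized blocks:

```python
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# NumPy would compute a*b, then 2*a, then the sum -- three full-size arrays.
# NumExpr evaluates the whole expression blockwise, with no large temporaries.
result = ne.evaluate("a * b + 2 * a")
```

The payoff grows with array size, because the blockwise evaluation keeps the working set inside the CPU caches, which is exactly the memory-hierarchy point this chapter develops.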
Optimizing modern data processing libraries
High performance Pandas and Apache Arrow
- Optimizing memory and time when loading data
- Techniques to increase data analysis speed
- Pandas on top of NumPy, Cython and NumExpr
- Reading data into Pandas with Arrow
- Using Arrow interop to delegate work to more efficient languages and systems
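One memory-optimization technique from this chapter can be sketched in a few lines (the column values here are invented for the example): converting a repetitive string column to pandas' categorical dtype, which stores each distinct value once plus small integer codes. The same idea applies at load time via the `dtype` argument of `pd.read_csv`.

```python
import pandas as pd

# A repetitive string column stored as Python objects...
df = pd.DataFrame({"city": ["Lisbon", "Porto", "Lisbon"] * 10_000})
before = df["city"].memory_usage(deep=True)

# ...shrinks dramatically as a categorical: 2 distinct strings + int codes.
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)
```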
Storing big data
- A unified interface for file access: fsspec
- Parquet: an efficient format to store columnar data
- Dealing with larger-than-memory datasets the old-fashioned way
- Zarr for large array persistence
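The "old-fashioned way" of handling larger-than-memory data is chunked streaming: read a piece, update a running aggregate, discard the piece. A stdlib-only sketch (the column name `x` and chunk size are arbitrary choices for the example):

```python
import csv
import io

def chunked_mean(fileobj, column, chunk_size=4):
    """Stream a CSV in small chunks, keeping only a running sum and count."""
    total = 0.0
    count = 0
    reader = csv.DictReader(fileobj)
    while True:
        # Pull at most chunk_size rows; zip stops early when the reader ends.
        rows = [row for _, row in zip(range(chunk_size), reader)]
        if not rows:
            break
        total += sum(float(row[column]) for row in rows)
        count += len(rows)
    return total / count

data = io.StringIO("x\n1\n2\n3\n4\n5\n6\n")
mean = chunked_mean(data, "x")
```

Memory use is bounded by the chunk size regardless of file size; the chapter's later tools (Parquet, Zarr) make the same pattern faster and more convenient.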
Advanced topics
Data analysis using GPU computing
- Using Numba to generate CPU code
- Performance analysis of GPU code: the case of a CuPy application
Analyzing big data with Dask
- Understanding the execution model of Dask
- The computational cost of Dask operations
- Using Dask's distributed scheduler