data-engineer-roadmap
Concurrency models are missing
For a modern data engineer, knowledge of concurrency models is important.
- A data engineer should know the difference between concurrency and parallelism.
- A data engineer should know the difference between task parallelism and data parallelism.
- Threads vs. processes. Example in Python: the `threading` vs. `multiprocessing` libraries, what the differences are, and what problems Python has with threading.
- A pretty typical scenario for modern data integration: call n APIs every x seconds / minutes / hours. How can this be done with good performance? One way is to use asynchronous programming.
- The actor model might be good to know as well.
- DAGs (example: Apache Airflow) vs. state machines (example: AWS Step Functions) vs. ... . This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how that knowledge can help a data engineer.
- Parallel programming using techniques like CUDA on GPUs.
- Functional programming is also 'nice to have' (but not obligatory).
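To make the "call n APIs" scenario a bit more concrete, here is a minimal sketch using Python's `asyncio`. The source names, the latencies, and the `fetch`, `poll_all`, and `run_once` helpers are all hypothetical, and `asyncio.sleep` only simulates the network call; a real integration would use an async HTTP client and wrap `poll_all` in whatever scheduling loop fits the polling interval.

```python
import asyncio
import time


async def fetch(name: str, latency: float) -> str:
    # Hypothetical stand-in for a real HTTP request:
    # asyncio.sleep simulates waiting on the network without blocking the thread.
    await asyncio.sleep(latency)
    return f"{name}: ok"


async def poll_all(sources: dict[str, float]) -> list[str]:
    # gather schedules all coroutines concurrently on a single thread and
    # returns the results in the same order the awaitables were passed in.
    return await asyncio.gather(
        *(fetch(name, latency) for name, latency in sources.items())
    )


def run_once() -> tuple[list[str], float]:
    # Made-up API names, each "responding" after 0.2 seconds.
    sources = {"orders": 0.2, "users": 0.2, "payments": 0.2}
    start = time.perf_counter()
    results = asyncio.run(poll_all(sources))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_once()
    print(results, f"{elapsed:.2f}s")
```

Because the three simulated calls wait concurrently, one round should take roughly as long as the slowest call rather than the sum of all of them. This is the case where async pays off: the work is I/O-bound, so a single thread can overlap the waiting.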
If you agree with at least some of the points, I can prepare the text.
Hey, these are really good points! I'll def consider adding these to the image when I update it next time. Feel free to create a PR and add it to the markdown version. Thanks a lot for the contribution!
Hey, thanks for the feedback! I will create the markdown version, sure!