Concurrency models are missing

Open Vlad-Radz opened this issue 4 years ago • 2 comments

For a modern data engineer, knowledge of concurrency models is important.

  1. A data engineer should know the difference between concurrency and parallelism.
  2. A data engineer should know the difference between task parallelism and data parallelism.
  3. Threads vs. processes. Example in Python: the threading and multiprocessing libraries, how they differ, and what problems Python has with threading (for instance, the global interpreter lock in CPython).
  4. A pretty typical scenario for modern data integration: call n APIs every x seconds / minutes / hours. How can this be done with good performance? One way is to use asynchronous programming.
  5. The actor model might be good to know as well.
  6. DAGs (example: Apache Airflow) vs. state machines (example: AWS Step Functions) vs. ... This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how that knowledge can help a data engineer.
  7. Parallel programming on GPUs using techniques like CUDA.
  8. Functional programming is also 'nice to have' (but not obligatory).
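
To illustrate point 4, here is a minimal sketch of the polling scenario with Python's asyncio. It is only an illustration under assumptions: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the API names, latencies, and interval are made up.

```python
import asyncio
import time


async def fetch(name: str, latency: float) -> str:
    # Hypothetical stand-in for an HTTP call; a real client would await network I/O here.
    await asyncio.sleep(latency)
    return f"{name}: ok"


async def poll_once(apis: dict[str, float]) -> list[str]:
    # Start all calls concurrently and wait for all of them.
    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(
        *(fetch(name, latency) for name, latency in apis.items())
    )


async def poll_periodically(
    apis: dict[str, float], interval: float, rounds: int
) -> list[list[str]]:
    # Repeat the concurrent poll a fixed number of times, pausing between rounds.
    results = []
    for i in range(rounds):
        results.append(await poll_once(apis))
        if i < rounds - 1:
            await asyncio.sleep(interval)
    return results


if __name__ == "__main__":
    # Assumed example values: three APIs that each take about 0.2 s to answer.
    apis = {"orders": 0.2, "users": 0.2, "events": 0.2}
    start = time.perf_counter()
    print(asyncio.run(poll_periodically(apis, interval=0.5, rounds=2)))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Because the calls within a round wait concurrently on a single thread, one round should take roughly as long as the slowest call rather than the sum of all calls. This works well for I/O-bound work; CPU-bound work would still need processes or another form of parallelism.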

If you agree on at least some of the points, I can prepare the text.

Vlad-Radz, Feb 19 '21 11:02

Hey, these are really good points! I'll def consider adding these to the image when I update it next time. Feel free to create a PR and add it to the markdown version. Thanks a lot for the contribution!

alexandraabbas, Apr 10 '21 14:04

Hey, thanks for the feedback! I will create the markdown version, sure!

Vlad-Radz, Apr 17 '21 19:04