high-performance parallelism. This talk will present to both data scientists and
library developers the current state of affairs and the recent advances in
parallel computing with Python. The goal is to help practitioners and
developers make better decisions on this matter.
I will first cover how Python can interface with parallelism, from leveraging
the external parallelism of C extensions, especially the BLAS family, to Python's
multiprocessing and multithreading APIs. I will touch upon use cases, e.g. single
vs. multi-machine, as well as the pros and cons of the various solutions for
each use case. Most of these considerations will be backed by benchmarks from
the scikit-learn machine learning library.
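As a rough illustration of relying on a C extension for parallelism, the sketch below uses plain threads from the standard library. It assumes hashlib as a stand-in for a BLAS-backed library, since its C implementation releases the GIL while hashing large buffers. The helper name digest_chunks is hypothetical and not taken from the talk.

```python
import hashlib
import threading


def digest_chunks(chunks):
    """Hash each chunk in its own thread and return the hex digests in order."""
    results = [None] * len(chunks)

    def worker(index, chunk):
        # hashlib releases the GIL while hashing large buffers, so these
        # threads can make progress on several cores at once.
        results[index] = hashlib.sha256(chunk).hexdigest()

    threads = [
        threading.Thread(target=worker, args=(i, chunk))
        for i, chunk in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Pure-Python work inside the worker would not benefit in the same way, because the GIL would serialize it. That is the usual motivation for process-based parallelism.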
From these low-level interfaces emerged higher-level parallel processing
libraries, such as concurrent.futures, joblib and loky (used by dask and
scikit-learn). These libraries make it easy for Python programmers to use safe
and reliable parallelism in their code. They can even work in more exotic
situations, such as interactive sessions, in which Python’s native
multiprocessing support tends to fail. I will describe their purpose as well as
the canonical use-cases they address.
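A minimal sketch of the higher-level style these libraries offer, using only concurrent.futures from the standard library. The helper parallel_map and the function square are hypothetical names chosen for illustration. The sketch defaults to a thread pool so that it runs unchanged in interactive sessions. Passing ProcessPoolExecutor instead should work as long as the function and its arguments can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Toy task standing in for real work."""
    return x * x


def parallel_map(func, items, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Apply func to every item using a pool of workers, preserving order."""
    # The context manager waits for pending tasks and shuts the pool down.
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


if __name__ == "__main__":
    print(parallel_map(square, range(5)))
```

The executor hides worker start-up, task dispatch and result collection, which is the bookkeeping that the lower-level multiprocessing and threading APIs leave to the caller.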
The last part of this talk will focus on the most recent advances in the Python
standard library, addressing one of the principal performance bottlenecks of
multi-core/multi-machine processing, which is data communication. I will
present a new API for shared-memory management between different Python
processes, and performance improvements for the serialization of large Python
objects (PEP 574, pickle extensions). These performance improvements will be
leveraged by distributed data science frameworks such as dask, ray and pyspark.
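The two standard-library features mentioned above can be sketched as follows, assuming Python 3.8 or later. The function names are hypothetical. Both parts run in a single process for brevity. In real use, a second process would attach to the shared block by its name, and the out-of-band buffers would be sent over a separate channel.

```python
import pickle
from multiprocessing import shared_memory


def roundtrip_shared_memory(payload: bytes) -> bytes:
    """Write a non-empty payload to a shared block and read it back by name."""
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        block.buf[:len(payload)] = payload
        # Another process would attach with the same name; done here
        # in-process only to keep the sketch self-contained.
        view = shared_memory.SharedMemory(name=block.name)
        try:
            return bytes(view.buf[:len(payload)])
        finally:
            view.close()
    finally:
        block.close()
        block.unlink()  # free the block once every user has closed it


def pickle_out_of_band(data: bytearray):
    """Pickle a large buffer with protocol 5, keeping its bytes out of the stream."""
    buffers = []
    stream = pickle.dumps(
        pickle.PickleBuffer(data), protocol=5, buffer_callback=buffers.append
    )
    # The stream holds only a small header; the payload travels in `buffers`.
    restored = pickle.loads(stream, buffers=buffers)
    return stream, memoryview(restored)
```

With out-of-band buffers, the pickle stream stays small regardless of the payload size, which avoids copying large arrays into the serialized bytes.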
EVENT:
EuroPython 2019 - Talk - 2019-07-12
SPEAKER:
Pierre Glaser
PUBLICATION PERMISSIONS:
Original video was published with the Creative Commons Attribution license (reuse allowed).
ATTRIBUTION CREDITS:
Original video source: