SciPy US 2021
Here is the link to the repo with the set up instructions for the workshop.
Here is the proposal I submitted for this workshop.
Description
Exploratory data analysis and statistical modeling are two crucial components of the data analytics cycle and while these might be “more easily" done with small(ish) datasets, the story changes as the amount of data we want to analyze increases. With this in mind, the goal of this tutorial is to get participants comfortable with exploring, analyzing, modeling, and visualizing large datasets that don’t fit in memory.
Throughout the tutorial, participants will learn how to analyze large datasets using dask, a Python package built for data analytics at scale, and how to create data visualizations and a dashboard using datashader, panel, and holoviews. This tutorial will use a top-down approach, hence, each section starts with the punchline (i.e. the results) and we then work our way backward with clear explanations on how and why each question was asked, each function was created, and each choice was made. One of the benefits of the top-down approach of this tutorial is that participants will be able to take what they learn throughout the session and use it to analyze or reverse engineer the many large datasets and results, respectively, that they might encounter in the wild.
This tutorial follows the session called, (D)Ask me Anything about Munging, Preprocessing, and Wrangling Data at Scale from SciPy Japan 2020.
The size of datasets in diverse industries keeps increasing by the minute and, luckily, for data professionals using Python as their day-to-day tool there is Dask, a library for large scale analytics and data preprocessing. This tutorial takes advantage of dask and other libraries to teach you how to tackle data analysis at scale while covering techniques for data exploration, hypothesis testing, and dashboard creating using large datasets. In addition, this tutorial uses a top-down approach, meaning, in each of the major sections you will start with the results and work your way backwards through the large scale data analytics process.
Being able to analyze large quantities of data is useful for data professionals from all levels and within different industries, for example, business analysts, developers and data scientists alike who want to showcase some of the insights they have uncovered, to developers wanting to explain the most important metrics stakeholders should pay attention to in their big data applications, to data scientists wanting to explain the results of their experiments, models and the like. With this in mind, the goal of this tutorial is to teach data professionals from diverse fields and at diverse levels, how to analyse large datasets - data that does not fit in memory - using their own machines.
This tutorial follows a top-down approach in all three of its part. This means that each time-block starts with the punchline (i.e. the results) and we will then work our way backwards with clear explanations on how and why each question was asked, each function was created, and each choice was made. One of the added benefits of this approach is that participants will be able to take what they learn throughout the session and use it to analyze and reverse engineer, dataset and results, respectively, that they might encounter in the wild.
Audience
The target audience for this session are analysts of all level, developers, data scientists and engineers wanting to learn how to analyse large amounts of data that don’t fit into the memory of their machines.
Prerequisites (P) and Good To Have's (GTH)
- The target audience for this session includes analysts of all levels, developers, data scientists, and engineers wanting to learn how to analyze large amounts of data that don’t fit into the memory of their machines.
- (P) Attendees for this tutorial are expected to be familiar with Python (1 year of coding). - (P) Participants should be comfortable with loops, functions, lists comprehensions, and if-else statements. - (GTH) While it is not necessary to have knowledge of dask, pandas, NumPy, datashader, and Holoviews, a bit of experience with these libraries would be very beneficial throughout this tutorial. - (P) Participants should have at least 6 GB of free memory in their computers. - (GTH) While it is not required to have experience with an integrated development environment like Jupyter Lab, this would be very beneficial for the session as it is the tool we will be using all throughout
Outline
- - The total time budgeted, including breaks, is ~4 hours
Additional Notes
I work as an educator, researcher, and data scientist, and have taught Python to hundreds of students with backgrounds ranging from complete beginner to advanced user. My lessons are full of metaphors, quotes, funny pictures, and exercises as I love to make sure students leave my sessions with at least a laugh, a new concept learned, a new Python trick, or all of the above.
I have done several short tutorials at meetups and other venues on a variety of topics within the space of data analytics and Python programming. Most recently, I taught one of the 3.5-hour tutorials at SciPy Japan 2020 using a bottom-up approach, and now I am excited to do another session using the reverse approach.