33: Katharine Jarmul - Testing in Data Science
A discussion with Katharine Jarmul, aka kjam, about some of the challenges of data science with respect to testing.
Some of the topics we discuss:
- experimentation vs testing
- testing pipelines and pipeline changes
- automating data validation
- property based testing
- schema validation and detecting schema changes
- using unit test techniques to test data pipeline stages
- testing nodes and transitions in DAGs
- testing expected and unexpected data
- missing data and non-signals
- corrupting a dataset with noise
- fuzz testing for both data pipelines and web APIs
- datafuzz
- hypothesis
- testing internal interfaces
- documenting and sharing domain expertise to build good reasonableness checks
- intermediary data and stages
- neural networks
- speaking at conferences
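A few of the topics above (schema validation, testing expected and unexpected data, and corrupting a dataset with noise) can be sketched with nothing but the standard library. This is a rough illustration, not code from the episode: the schema, the `clean_stage` pipeline stage, and the `corrupt` helper are all hypothetical names made up for the example, and libraries like datafuzz and Hypothesis do this kind of thing far more thoroughly.

```python
import random

# Hypothetical schema for one pipeline stage: column name -> expected type.
SCHEMA = {"user_id": int, "amount": float}


def validate_row(row, schema=SCHEMA):
    """Return a list of schema problems for one row (empty list means valid)."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif row[column] is None:
            problems.append(f"null value: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"wrong type: {column}")
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


def clean_stage(rows):
    """Hypothetical pipeline stage: keep only rows that pass validation."""
    return [row for row in rows if not validate_row(row)]


def corrupt(rows, rate, seed=0):
    """Return a copy of rows with roughly `rate` of them damaged at random."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    damaged = []
    for row in rows:
        row = dict(row)  # never mutate the caller's data
        if rng.random() < rate:
            action = rng.choice(["drop", "null", "retype"])
            column = rng.choice(sorted(row))
            if action == "drop":
                del row[column]
            elif action == "null":
                row[column] = None
            else:
                row[column] = str(row[column])
        damaged.append(row)
    return damaged


def test_stage_survives_noise():
    good = [{"user_id": i, "amount": i * 1.5} for i in range(100)]
    noisy = corrupt(good, rate=0.3)
    cleaned = clean_stage(noisy)
    # Every surviving row satisfies the schema, and the stage invents nothing.
    assert all(validate_row(row) == [] for row in cleaned)
    assert len(cleaned) <= len(good)
    assert all(row in good for row in cleaned)
```

The test function follows the pytest naming convention, so it should be collected like any other unit test. Seeding the noise keeps a failure reproducible, which matters when the bad input is generated rather than written by hand.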
Special Guest: Katharine Jarmul.
Links:
- @kjam on Twitter — Data Magic and Computer Sorcery
- Kjamistan: Data Science
- datafuzz’s Python library — The goal of datafuzz is to give you the ability to test your data science code and models with BAD data.
- Hypothesis Python library — Hypothesis is a Python library for finding edge cases in your code you wouldn’t have thought to look for.