2021 PBS TechCon: Your Data is Disgusting!
How a small team leveraged open source tools to build a data pipeline
I had the great fortune of presenting at this year’s PBS TechCon. My talk introduced attendees to the principles of tidy data and discussed a data pipeline project my team has been working on at Nebraska Public Media. Here’s the session description:
Our station’s data was disgusting! No one wanted to access it, use it, or even see it. Our team dreaded requests to ‘pull numbers’ because of the amount of work it took. Compounding this problem, our data was distributed across many different data lakes and was not always stored in Excel-like files. Our process was also not durable or sustainable. Solutions to one problem could not easily be applied to newer problems. As a result, our station lacked a central source of truth that could be leveraged to generate valid, reliable insights.
Most importantly, stakeholders would question the validity of analyses because of our untidy data and outdated, manual processes, many of which resulted in reporting that lagged behind key business decisions. This was disappointing, since so much emphasis has been put on the role data should play in decision making.
How then did we go from dirty, disgusting data to data we can derive value from in an efficient, painless way? Our station leaned in, applied tidy data principles, and built a flexible, scalable data pipeline.
This session won’t just be a lecture on standardized data practices (snore). Rather, it introduces the concept of tidy data and gives an overview of several open source tools and the tech stack we used to get our station from gut-wrenching, revolting data to tidy data that is ready for analysis and visualization and that supports decision-making processes. We will also discuss our wins, our struggles, and the issues we still face as we maintain and further develop our data pipeline. It’s our hope that this session stimulates interest from other teams across the system and inspires others to contribute to this project.
As part of my talk, I mentioned having put together a curated list of resources others could use to learn more about the topics covered. This list can be found in the following section of this blog post. If you’re interested in discussing these topics further, please reach out.
Resources to learn more
Join a Community
- R for Data Science Online Learning Community
- Join the Slack workspace
- Mention @Collin Berke to get a hold of me in the workspace