Session Name: Sampling in Open Source Distributed Tracing
Distributed tracing is a hard problem to solve. In production systems with even moderate levels of traffic, sampling is necessary, but it introduces gaps in the data collected. In this talk we will explore standards in progress, such as OpenTelemetry, and what is being done to address this concern. Sampling can be done at many levels across the data collection and aggregation pipeline. Commercial products have offered sophisticated sampling for quite some time; open source has yet to catch up, but that will change out of necessity. Looking at popular open source projects such as Zipkin, Jaeger, and SkyWalking, we will analyze their approaches to sampling and how these will evolve to let data collection scale. We will also look at applying machine learning to improve the use of tracing data and to help solve the challenges of data distillation and analysis. This must evolve for these projects to handle scale better and to help users manage their systems more effectively.
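As one illustration of sampling at the data collection stage, here is a minimal sketch of a head-based, ratio-style sampler that makes a deterministic keep/drop decision from the trace ID. The function name `should_sample` and the use of SHA-256 are assumptions made for this example; this is not the implementation used by OpenTelemetry, Zipkin, Jaeger, or SkyWalking.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Hypothetical helper: keep roughly `rate` of all traces.

    The decision depends only on the trace ID, so every service that
    sees the same trace ID reaches the same keep/drop decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace when its bucket falls below the rate threshold.
    return bucket < int(rate * 2**64)
```

Because whole traces are either kept or dropped, the gaps fall between traces rather than inside them. The cost is that the decision is made before anything is known about the trace, so rare errors or slow requests can be dropped.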
Managing application uptime and performance and meeting SLAs requires a larger set of tools that leverage tracing along with metrics and logs. To collect and analyze this data, additional open source technologies are necessary, most often the ELK stack, Grafana, and Prometheus. Managing these stacks is complex when there are so many methods of collecting and storing data.
Finally, the hardest problem is measuring and understanding the overhead that trace, log, and metric collection can add to your application. Agents with auto-instrumentation sometimes report their own overhead, but logging and metric libraries typically do not measure or collect it.
In this talk you’ll learn how to measure and understand the relevant metrics for the overhead these tools introduce, and how to ensure your application performs well with minimal cost and infrastructure use.
Jonah Kowall trained in computer science and co-founded one of the first content filtering companies in the late 1990s. Jonah became a security expert, committing code to the FreeBSD project and helping build the first wireless cracking algorithms. Jonah received his CISSP and CISA along with several infrastructure-related certifications and awards. He spent 15 years as a practitioner and manager across both startups and large enterprises, focusing on infrastructure and operations, security, and performance engineering. In these roles he spearheaded both tactical and strategic operational initiatives, going deep into monitoring and tuning of infrastructure and applications.
In 2011 Jonah changed careers, moving to Gartner to focus on availability and performance monitoring and IT operations management (ITOM), speaking and writing research globally for IT leaders and CIOs. Jonah led Gartner's influential application performance monitoring (APM) Magic Quadrant, created the network performance monitoring and diagnostics (NPMD) Magic Quadrant, and created the Network Packet Broker (NPB) term and market. In 2015 Jonah joined AppDynamics to drive the company's corporate development, product strategy, and vision. Jonah developed new products and solutions with technical partners, including product managing, building, and launching new products with IBM, SAP, and ServiceNow. Cisco acquired AppDynamics in 2017 to create IT and business alignment.
After a successful exit, Jonah joined Kentik as CTO to set and execute the product vision and strategy. Kentik is the leader in AIOps and network analytics for network professionals. In 2020 Jonah joined Logz.io as CTO to drive strategy and ecosystem for the industry’s first open source-based SaaS solutions for the modern digital business. The move brings Jonah back to where his career began: the open source community.