Publication: Using Performance Variation for Instrumentation Placement in Distributed Systems
Date
2020-03-03
Authors
Published Version
Citation
Sturmann, Lilian. 2019. Using Performance Variation for Instrumentation Placement in Distributed Systems. Master's thesis, Harvard Extension School.
Abstract
Distributed systems are now ubiquitous in the infrastructures underpinning our everyday lives, yet diagnosing performance problems in these systems remains extremely challenging. The current state of the art for problem diagnosis in these systems relies on data from instrumentation in the system, but the placement of this instrumentation is an unsolved challenge in systems research and in production environments.
This work presents an implementation and evaluation of a performance variation-based tool that helps developers understand where instrumentation should be placed in a distributed system to better diagnose current and future performance problems. The tool identifies under-instrumented regions in these systems by localizing the performance variation seen in system requests. Contributions of this work include the tool itself; implementations of several methods for localizing performance variation, including a method that prioritizes performance variation deeper in request call graphs; a conversion module that can also function as a stand-alone toolkit, allowing the performance variation-based tool to be used across a variety of systems, including those instrumented using the OpenTracing model as well as those using a more general directed acyclic graph (DAG) model; and several experiments evaluating the tool and these methods on an open-source distributed application.
The key insight informing this work is that similar workflows in the same system should perform similarly. Building on existing workflow-centric tracing tools to profile system behavior, the tools and methods presented have the potential to significantly cut down on time spent diagnosing performance problems in distributed systems. The experiments evaluate their utility both for understanding where to place additional instrumentation for current problems in these systems as they arise, and for guiding informative placement of default system instrumentation to better handle future problems. Potentially, the tools and methods could also be adapted for use in a broader framework that seeks to dynamically tune instrumentation in running systems to the current system state.
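To make the abstract's core idea concrete, the following is a minimal, hypothetical sketch (not the thesis's actual implementation) of variation-based localization: traces of the same workflow are grouped, each call-graph node is scored by its latency variation across traces, and nodes deeper in the graph are weighted more heavily. The trace representation, `depth_weight` parameter, and node names are all illustrative assumptions.

```python
# Hypothetical sketch of performance-variation localization in call graphs.
# Assumes each trace is a dict mapping (node, depth) to observed latency;
# this representation and all names below are illustrative, not from the thesis.

from collections import defaultdict
from statistics import mean, pstdev

def localize_variation(traces, depth_weight=1.5):
    """Score each call-graph node by latency variation across traces of the
    same workflow, weighting nodes deeper in the call graph more heavily."""
    latencies = defaultdict(list)
    for trace in traces:
        for (node, depth), latency in trace.items():
            latencies[(node, depth)].append(latency)

    scores = {}
    for (node, depth), vals in latencies.items():
        if len(vals) < 2 or mean(vals) == 0:
            continue
        # Coefficient of variation: scale-free measure of latency variation.
        cov = pstdev(vals) / mean(vals)
        # Prioritize variation deeper in the request call graph.
        scores[node] = cov * (depth_weight ** depth)

    # Highest-scoring nodes are candidate under-instrumented regions.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Three traces of the same workflow; "db.query" (depth 2) varies most,
# so it would be flagged first as a candidate instrumentation site.
traces = [
    {("frontend", 0): 100, ("auth", 1): 10, ("db.query", 2): 50},
    {("frontend", 0): 102, ("auth", 1): 11, ("db.query", 2): 90},
    {("frontend", 0): 101, ("auth", 1): 10, ("db.query", 2): 20},
]
print(localize_variation(traces)[0][0])  # prints "db.query"
```

In this sketch the depth weighting encodes the intuition that variation originating deep in a call graph is harder to see from coarse, top-level instrumentation, so those regions are ranked first.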
Keywords
distributed systems, tracing, monitoring
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service