The Paper
Understanding and Detecting Software Upgrade Failures in Distributed Systems
Paper Link
https://dl.acm.org/doi/10.1145/3477132.3483577
Format
We start at 6:10, don't be late!
The discussion lasts for about 1 to 1.5 hours, depending upon the paper.
Read the paper (done before you arrive)
Introductions (name, and background)
First impressions (1-2 minutes this is what I thought)
Structured review (we move through the paper in order, everyone gets a chance to ask questions, offer comments, and raise concerns)
Free form discussion
Nominate and vote on the next paper
Abstract
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics.
This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding lights on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework DUPTester that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers DUPChecker that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.