Description
Test flakiness, where a test intermittently passes or fails without any change to the code under test, poses a significant challenge in the validation of distributed control systems. This paper presents an investigation into test flakiness in CSP.LMC (Local Monitoring and Control for the Central Signal Processor), a key subsystem of the SKA (Square Kilometre Array) telescope. CSP.LMC is a Python application built on the TANGO framework and is tested using a multi-level approach that combines unit, component, and integration tests. To achieve scalable and reproducible deployment, the entire SKA control software runs within a Kubernetes environment. We systematically collect test outcomes and execution benchmarks to monitor system stability over time, and apply a data mining approach to uncover correlations and hidden patterns associated with test instability. Our analysis aims to surface subtle software issues that are not easily detected through standard test evaluation, and to explore how the complexity of both the software architecture and its deployment may introduce sources of non-determinism that lead to flaky tests. We discuss the impact of flakiness on the reliability of the SKA control software and propose practical strategies to benchmark, detect, and mitigate flaky tests in complex distributed environments.
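As a minimal sketch of the detection idea described above, the Python fragment below flags a test as flaky when its recorded outcomes disagree for the same revision of the code, i.e. it both passed and failed with nothing changed under test. The record layout, function name, and sample test names are hypothetical assumptions for illustration, not CSP.LMC's actual tooling.

```python
from collections import defaultdict


def find_flaky_tests(records):
    """Flag tests whose outcomes disagree for the same commit.

    records: iterable of (test_name, commit_sha, passed) tuples,
    e.g. as collected from repeated CI runs of the test suite.
    A test is considered flaky when, for at least one commit, it
    both passed and failed, since the code under test was identical.
    """
    outcomes = defaultdict(set)  # (test, commit) -> set of outcomes seen
    for test, commit, passed in records:
        outcomes[(test, commit)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items()
                   if len(seen) > 1})


if __name__ == "__main__":
    sample = [
        ("test_subarray_on", "a1b2c3", True),
        ("test_subarray_on", "a1b2c3", False),  # same commit, different outcome
        ("test_scan_config", "a1b2c3", True),
        ("test_scan_config", "a1b2c3", True),
    ]
    print(find_flaky_tests(sample))  # ['test_subarray_on']
```

In a pipeline like the one described, such a check could run over reruns of the same commit gathered from CI, with each disagreement marking a candidate for the deeper correlation and pattern analysis discussed in the paper.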