Building Starsim: A Disease Modeling Tool

Starsim is many things: a disease-modeling tool, an agent-based model (ABM), a stochastic (random) simulation, and open-source software. It is written in Python, an object-oriented programming language with many powerful libraries for machine learning, scientific and statistical analysis, and visualization. As a contractor, I worked with the Gates Foundation’s Institute for Disease Modeling (IDM) to build, improve, and launch Starsim.

Before diving into the specifics of building Starsim, it’s essential to understand the journey into simulation and the background of modeling, random number generation, and epidemiology. This article explores the complexities and nuances involved in creating a sophisticated tool designed to predict and prevent the spread of diseases.

This article will cover the basics of models and reality, the challenges of disease modeling, the role of computer simulation, the importance of (pseudo) random number generation, and the improvements made to Starsim. Furthermore, it will explore the reasons for choosing Python and the prioritization of critical features in the development process. Join us as we explore the story of Building Starsim.

Models and Reality

A model is essentially a representation of reality. Britannica defines scientific modeling as the generation of “physical, conceptual, or mathematical representation of a real phenomenon that is difficult to observe directly”. Models can range from closed-form (mathematical equation) to open-form (simulation). For instance, in physics, the equation v = x / t calculates velocity, predicting how far a ball will roll in a given time.

During the COVID-19 pandemic, the world gained a deeper understanding of the complexities of disease spread. Mathematical epidemiology was used as far back as the 1700s, when Daniel Bernoulli modeled the impact of smallpox on life expectancy. Today, numerous institutions are dedicated to preventing disease spread. However, gathering experts with the knowledge, tools, and expertise to build accurate models remains a significant challenge. According to Mathematical Models in Infectious Disease Epidemiology, “mathematical models have become essential tools for understanding and predicting the spread of infectious diseases”.

The ability to create accurate and comprehensible models is crucial in epidemiology. This requires a blend of expertise across multiple domains, making the process challenging but essential for global health security.

Black Boxes & Ivy Towers

Throughout history, the world has faced numerous pandemics, from the bubonic plague to COVID-19. Epidemiology is the study of the factors that determine the presence or absence of diseases and disorders. A deep understanding of disease spread requires advanced education in medicine, mathematics, and statistics. Advances in computing power have also made computer science an indispensable tool for disease modelers.

In the 20th century, disease modeling was primarily conducted by academics and consultants, with solutions published in academic journals, which often made the process look like an impenetrable black box. Advances in computer science made faster and more accurate modeling possible, benefiting both closed-form and open-form problems.

Closed-form equations typically adopt a macro approach. Classic models like SEIR (Susceptible-Exposed-Infectious-Recovered) involve setting up rate equations for the whole population and solving them step by step over time. Key inputs include the basic reproduction number (R0) and whether immunity is gained post-recovery. These models project how long an outbreak lasts and when it peaks, providing critical insight into potential outcomes and supporting better decision-making.
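To make the rate-equation idea concrete, here is a minimal sketch of an SEIR model in plain Python. It is an illustration only, not code from Starsim or EMOD: the function name, the parameter values (R0 of 2.5, a 5-day incubation period, a 7-day infectious period), and the simple forward-Euler time stepping are all assumptions chosen for readability, and the sketch assumes permanent immunity after recovery.

```python
def simulate_seir(population=10_000, initial_infected=10, r0=2.5,
                  incubation_days=5.0, infectious_days=7.0,
                  days=200, dt=0.1):
    """Integrate the SEIR rate equations with a forward-Euler scheme.

    All default values are illustrative assumptions, not calibrated estimates.
    """
    sigma = 1.0 / incubation_days   # rate of leaving the exposed state
    gamma = 1.0 / infectious_days   # recovery rate
    beta = r0 * gamma               # transmission rate implied by R0

    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    history = [(0.0, s, e, i, r)]

    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        # Flows between compartments during one small time step
        new_exposed = beta * s * i / population * dt
        new_infectious = sigma * e * dt
        new_recovered = gamma * i * dt

        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered  # recovered agents stay immune in this sketch
        history.append((step * dt, s, e, i, r))
    return history


history = simulate_seir()
peak_day, _, _, peak_infectious, _ = max(history, key=lambda row: row[3])
final_recovered = history[-1][4]
print(f"Peak: {peak_infectious:.0f} infectious on day {peak_day:.1f}")
print(f"Recovered after {history[-1][0]:.0f} days: {final_recovered:.0f}")
```

The peak day and the final recovered count are the kind of outputs the paragraph above refers to: the timing and size of the outbreak under one set of assumptions.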

Computer Simulation

Following 9/11, the NIH Office of Research Services Division of Safety aimed to establish a baseline for understanding risks at the NIH Bethesda Campus, which included 30,000 people, 10,000 cars, and 100 buildings. The primary question was: “How long would it take to fully evacuate the research campus?” My team was tasked with answering this question and later addressed more complex scenarios, such as active shooter responses.

Computer simulation is a powerful tool for modeling complex scenarios. According to the article Mathematical Models in Infectious Disease Epidemiology, “simulations can help explore different intervention strategies and predict their potential impact on disease transmission”.

The application of computer simulation extended to various complex questions, including optimizing strategies for responding to active shooter scenarios, demonstrating its versatility in addressing safety and security concerns within large institutions.

Pseudo Random Number Generation

Pseudo-random number generation (PRNG) is crucial for simulation accuracy and repeatability. Unlike truly random numbers, pseudo-random numbers come from a deterministic algorithm, so analysts can reproduce an outcome exactly. Simulations set a random seed so that the same sequence of random numbers is drawn whenever a run is repeated, which aids model calibration and statistical analysis. Enough trials are then run that the Central Limit Theorem (CLT) applies, so that sums and averages built from non-normal probability distributions become approximately normally distributed.
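The role of the seed is easy to see in a few lines of NumPy. The sketch below is a toy example, not Starsim code: the function run_trial, the agent count, and the infection probability are hypothetical names and values used only to show that the same seed reproduces the same draws.

```python
import numpy as np


def run_trial(seed, n_agents=1_000, infection_prob=0.05):
    """One toy trial: count how many agents are infected in a single step."""
    rng = np.random.default_rng(seed)           # seeded generator
    infected = rng.random(n_agents) < infection_prob
    return int(infected.sum())


first = run_trial(seed=42)
second = run_trial(seed=42)   # same seed, so the same sequence of draws
other = run_trial(seed=43)    # different seed, so an independent trial

print(first == second)        # True: the run is exactly repeatable
print(first, other)           # two independent trials usually differ
```

Repeating a run with the same seed is what makes debugging and calibration tractable, while varying the seed across trials provides the independent samples needed for statistical analysis.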

Simulations often draw from continuous or discrete probability distributions, where processes A(t), B(t), C(t),…Z(t) yield a total time, T(t). The CLT states that the sum of many independent random variables with finite variance is approximately normally distributed when the number of terms is large enough, even if the individual distributions are not normal. The number of simulation trials must therefore be at least the minimum sample size needed for the simulation outcome variables to converge toward a normal distribution. This sample size depends on the process being simulated.
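A small experiment shows the effect. In the sketch below, each process time is drawn from an exponential distribution, which is strongly skewed, and the total time is the sum of 50 such draws. The choice of distribution, the number of processes, and the use of skewness as a rough normality check are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_trials = 20_000
n_processes = 50   # hypothetical number of sequential process steps


def skewness(x):
    """Sample skewness: close to 0 for a symmetric, bell-shaped sample."""
    centered = x - x.mean()
    return float((centered ** 3).mean() / x.std() ** 3)


# A single process time: exponential, so far from normal
single_step = rng.exponential(scale=2.0, size=n_trials)

# Total time per trial: the sum of many independent process times
total_time = rng.exponential(scale=2.0, size=(n_trials, n_processes)).sum(axis=1)

print(f"skewness of one process time: {skewness(single_step):.2f}")
print(f"skewness of the total time:   {skewness(total_time):.2f}")
```

The total time is far more symmetric than any single process time, which is the CLT at work. How many terms or trials are enough depends on how skewed the underlying processes are.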

PRNG is essential for achieving accurate and repeatable results in simulation analysis, ensuring that models can be reliably calibrated and statistically analyzed.

Aerosols and Such

Aerosol transmission is a key concept in epidemiology, illustrated by how particles from an aerosol can spread and hover in the air. Similarly, diseases can be transmitted through sneezes or coughs. After my time at NIH, I consulted at Citibank, focusing on process automation and big data analytics with Python and machine learning. This led to working at a fintech startup and eventually launching my own consulting company. Later, I had the opportunity to help build Starsim, a project that required my expertise in Python development, simulation, and epidemiology.

Aerosol transmission is crucial in understanding how respiratory diseases spread, impacting public health strategies and preventative measures. The opportunity to contribute to Starsim was an ideal fit, combining my diverse skills in technology and epidemiology.

According to the World Health Organization, studies found that SARS-CoV-2 could remain suspended in the air, having achieved terminal velocity, anywhere from 3 to 16 hours.

Defining Principles

The Gates Foundation’s Institute for Disease Modeling (IDM) had previously developed innovative disease models, specifically EMOD, an agent-based Epidemiological Modeling software package. EMOD required supercomputers to efficiently set up, configure, run, and analyze disease simulations. Agent-based modeling is significant because people and disease vectors are modeled as individual entities within the system. The model used stochastic (random), continuous-time simulation over a period T with smaller time steps.

Key principles such as agent-based simulation, networks, stochasticity, continuous-time simulation, random number generation, and stateful variables are fundamental to creating Starsim. These principles allow for the modeling of complex interactions and the assessment of interventions. The key to agent-based modeling is the ability to use variables that are stateful, that is, variables that possess a state attribute that can change over time.
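These principles can be sketched in a few dozen lines. The toy model below is not Starsim and does not use its API: the class ToyAgentSim, its parameters, and the fixed random contact network are hypothetical and chosen only to show agents, a network, stochastic transmission, and a per-agent state that changes over time.

```python
import numpy as np

SUSCEPTIBLE, INFECTED, RECOVERED = 0, 1, 2


class ToyAgentSim:
    """A toy agent-based SIR model on a random contact network (illustrative only)."""

    def __init__(self, n_agents=500, n_contacts=2_000, beta=0.08,
                 recovery_prob=0.1, n_seed_infections=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.beta = beta                    # per-contact transmission probability
        self.recovery_prob = recovery_prob  # per-step recovery probability
        # Stateful variable: one state per agent, updated every time step
        self.state = np.full(n_agents, SUSCEPTIBLE)
        self.state[:n_seed_infections] = INFECTED
        # Fixed random network: each row is an undirected contact (a, b)
        self.contacts = self.rng.integers(0, n_agents, size=(n_contacts, 2))

    def step(self):
        a, b = self.contacts[:, 0], self.contacts[:, 1]
        transmits = self.rng.random(len(a)) < self.beta
        # Transmission can run in either direction along a contact
        a_to_b = transmits & (self.state[a] == INFECTED) & (self.state[b] == SUSCEPTIBLE)
        b_to_a = transmits & (self.state[b] == INFECTED) & (self.state[a] == SUSCEPTIBLE)
        newly_infected = np.concatenate([b[a_to_b], a[b_to_a]])
        # Recovery is decided from the states at the start of the step
        infected = np.flatnonzero(self.state == INFECTED)
        recovers = infected[self.rng.random(len(infected)) < self.recovery_prob]
        self.state[recovers] = RECOVERED
        self.state[newly_infected] = INFECTED

    def run(self, n_steps=100):
        """Return an array of (susceptible, infected, recovered) counts per step."""
        counts = []
        for _ in range(n_steps):
            self.step()
            counts.append(np.bincount(self.state, minlength=3))
        return np.array(counts)


results = ToyAgentSim(seed=0).run()
print(results[-1])  # final counts of susceptible, infected, recovered agents
```

Because every agent carries its own state, an intervention such as vaccinating a subset of agents can be expressed as a change to those agents' states, which is much harder to represent in an aggregate rate-equation model.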

As Cliff Kerr, Robyn Stuart, Romesh Abeysuriya, Jamie Cohen, Paula Sanz-Leon, Alina Muellenmeister, and Daniel Klein describe in Starsim: A fast, flexible toolbox for agent-based modeling of health and disease, Starsim is a fast, flexible, open-source toolbox that is available for free.

Profiling & Improvement

I joined the Starsim project in February 2024, with a public beta launch planned for the summer. The key tasks involved understanding the codebase, ensuring the code aligned with Python standards, assessing performance, and tackling critical GitHub issues. One of the first jobs was to go through every line of code and then develop and present process and architecture diagrams to my colleagues to “challenge my assumptions”. Python standards were used to keep the code efficient, readable, and maintainable.

Why use Python to build Starsim? I learned a bit about the history of disease modeling at IDM, from EMOD to Covasim. Covasim was built in about 4–6 weeks by a dedicated team of disease modelers and Python experts across the globe. Python is an object-oriented language with plenty of powerful libraries available for scientific and statistical analysis, machine learning, and visualization. Object-oriented code is helpful when dealing with stateful variables and when calculating and maintaining data across simulations. It also makes software more maintainable, more scalable, and easier to read.

According to Kerr, C.C., Stuart, R.M., Mistry, D., Abeysuriya, R.G., Rosenfeld, K., Hart, G.R., Núñez, R.C., Cohen, J.A., Selvaraj, P., Hagedorn, B., George, L., and Klein, D.J. (2021) in Covasim: an agent-based model of COVID-19 dynamics and interventions, Covasim is a fast, flexible tool for agent-based modeling.

Conclusion

Starsim stands as a significant advancement in disease modeling, offering a powerful, open-source tool for simulating and understanding disease dynamics. Its development, rooted in agent-based modeling and leveraging Python’s capabilities, provides a flexible and accessible platform for researchers and public health officials. The emphasis on common random numbers (CRN) supports reproducible simulations and reliable comparisons between scenarios, which is crucial for informing public health strategies and interventions.

Key takeaways from the development of Starsim include the importance of interdisciplinary collaboration, the necessity of rigorous code review and performance optimization, and the value of open-source tools in addressing global health challenges. The tool’s architecture, designed for extensibility, allows for the addition of new disease models, making it a versatile asset in managing various health crises.

Ultimately, Starsim contributes to the global effort to manage and mitigate the impact of infectious diseases, providing a low-cost, high-impact solution for understanding and responding to health uncertainties worldwide. The Starsim ecosystem will also include TBsim, STIsim, MeaslesSim, and many more models in the future. The world needs accurate, powerful computational tools provided at a low cost to manage abundant uncertainties ranging from bird flu to measles outbreaks.
