The Bus Factor of Critical Open-Source Infrastructure

An Analysis of Maintainer Concentration in Major Software Projects — Yohaan Narayanan

Author: Yohaan Narayanan.

Institution: Valley Christian High School.

Date: 8 March 2026.


Abstract

Open-source software forms the backbone of modern digital infrastructure, yet many widely used systems are maintained by small groups of developers, raising concerns about sustainability and risk. This study analyzes the "bus factor" of a major open-source project, NumPy, using contributor data and commit distributions from its GitHub repository. The bus factor is the number of contributors whose loss would significantly disrupt project development; here it is measured as the smallest number of top contributors who together account for at least 50% of all commits. For NumPy, that number is seven. The result illustrates how a piece of critical infrastructure with hundreds of contributors can still depend on a small group of maintainers. The findings highlight structural risks within the open-source ecosystem and suggest the need for broader contributor distribution to improve stability.


Introduction

Modern computing infrastructure depends heavily on open-source software. Many of the tools that power servers, databases, operating systems, and applications are developed collaboratively by global communities. Despite the scale of these projects, the responsibility for maintaining them is often concentrated among a small number of contributors.

One concept used to evaluate this risk is the bus factor: the number of developers whose loss would significantly disrupt a project's development. A low bus factor indicates that knowledge and responsibility are concentrated in a small number of individuals.

This study investigates the bus factor of NumPy and examines what the distribution of commits among its contributors implies for project stability.


Literature Review

The sustainability of open-source software has become an increasingly important research topic. Previous studies have found that many widely used projects depend heavily on small numbers of contributors.

Research into software ecosystems has shown that contributor activity is often unevenly distributed, with a small number of developers responsible for the majority of commits. This phenomenon has implications for software reliability and long-term maintainability.

The concept of the bus factor provides a useful framework for analyzing these risks by quantifying how many contributors are critical to a project's continued development.


Methodology & Calculation

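This project provides a generalized framework for analyzing open-source sustainability. NumPy is the primary case study in this paper, but the included Python tools are repository-agnostic: changing the OWNER and REPO variables in busfactor.py points the analysis at any public GitHub repository. The script retrieves per-contributor commit counts from the GitHub REST API contributors endpoint, 100 entries per page, and writes them to contributors.csv, which graphs.py then plots.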
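As a rough illustration of what "repository-agnostic" means here, the sketch below builds the contributors URL from an `owner/repo` string rather than from module-level constants. The helper name `contributors_url` and the slug-parsing step are assumptions made for illustration and are not part of the included scripts; only the endpoint pattern is taken from busfactor.py.

```python
def contributors_url(slug: str, page: int = 1, per_page: int = 100) -> str:
    """Build the GitHub contributors endpoint URL for an "owner/repo" slug.

    Hypothetical helper: busfactor.py hard-codes OWNER and REPO instead.
    """
    owner, sep, repo = slug.partition("/")
    if not sep or not owner or not repo or "/" in repo:
        raise ValueError(f"expected 'owner/repo', got {slug!r}")
    return (
        f"https://api.github.com/repos/{owner}/{repo}/contributors"
        f"?per_page={per_page}&page={page}"
    )


# First two pages for the case study used in this paper.
print(contributors_url("numpy/numpy"))
print(contributors_url("numpy/numpy", page=2))
```

Passing the repository as an argument, instead of editing constants, would let the same function serve a loop over many repositories.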

The Bus Factor Formula

The "Bus Factor" ($B$) is quantified as the minimum number of contributors ($n$) required to account for at least 50% of the total project commits ($C_{\text{total}}$):

$$B = \min \left\{ n \;\middle|\; \sum_{i=1}^{n} c_i \ge 0.5 \times C_{\text{total}} \right\}$$

where $c_i$ is the commit count of the $i$-th contributor when contributors are ranked by commit count in descending order, and $C_{\text{total}} = \sum_i c_i$ is the sum over all contributors.
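To make the formula concrete, here is a minimal sketch applied to made-up commit counts (the numbers are illustrative only, not NumPy data). The function name `bus_factor` and its `threshold` parameter are assumptions for this example; the logic mirrors the definition above.

```python
from itertools import accumulate


def bus_factor(commit_counts, threshold=0.5):
    """Smallest n such that the top-n contributors reach `threshold` of all commits."""
    ranked = sorted(commit_counts, reverse=True)  # c_1 >= c_2 >= ...
    total = sum(ranked)                           # C_total
    if total == 0:
        return 0  # No commits, so no contributor is critical.
    for n, running in enumerate(accumulate(ranked), start=1):
        if running >= threshold * total:
            return n
    return len(ranked)


# Hypothetical project with five contributors and 100 commits in total.
counts = [10, 40, 15, 25, 10]
# Ranked: 40, 25, 15, 10, 10. Cumulative: 40, 65, ... so two contributors cross 50.
print(bus_factor(counts))  # 2
```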


Results

Commit Distribution

The following figure illustrates how commits are distributed among contributors.

Figure 1: Commit Distribution Among Contributors (numpy/numpy)

The commit distribution follows a "long-tail" pattern, a common phenomenon in open-source ecosystems. Although the project has hundreds of contributors, the vast majority have fewer than 100 commits. The steep spike at the far left of the graph represents a small group of maintainers who have contributed thousands of commits each. The figure shows that the project's commit history, and by implication much of its historical knowledge, is heavily concentrated in a tiny fraction of the total contributor pool.
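The long-tail shape can also be summarized numerically. The sketch below, run on hypothetical commit counts rather than the NumPy data, reports what fraction of contributors fall under a commit cutoff and what fraction of all commits those contributors hold. The function name `tail_summary` and the default cutoff of 100 (borrowed from the paragraph above) are assumptions for illustration.

```python
def tail_summary(commit_counts, cutoff=100):
    """Return (share of contributors below cutoff, share of commits they hold)."""
    if not commit_counts:
        return 0.0, 0.0
    total = sum(commit_counts)
    tail = [c for c in commit_counts if c < cutoff]
    contributor_share = len(tail) / len(commit_counts)
    commit_share = sum(tail) / total if total else 0.0
    return contributor_share, commit_share


# Hypothetical counts: two heavy committers and a tail of occasional ones.
counts = [6000, 3000, 400, 90, 60, 25, 10, 8, 5, 2]
contributor_share, commit_share = tail_summary(counts)
print(f"{contributor_share:.0%} of contributors have fewer than 100 commits")
print(f"they hold {commit_share:.1%} of all commits")
```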


Maintainer Dependency

The following visualization illustrates maintainer concentration within NumPy.

Figure 2: Maintainer Concentration — Top 20 Contributors (numpy/numpy)

The bar chart identifies the individuals who account for most of the project's commit history. The top contributor, charris, accounts for a disproportionate share of the total commits, more than double that of the next highest contributor. The chart also illustrates the bus factor calculation: summing the first seven bars reaches the 50% contribution threshold, which identifies the core group on which the project most depends.
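The "summing the bars" reading of the chart can be reproduced as a small table. The following sketch uses placeholder logins and commit counts (they are not the values from contributors.csv) and marks the row where the running total first reaches half of all commits. The function name `cumulative_table` is an assumption for this example.

```python
def cumulative_table(contributors):
    """Rank (login, commits) pairs and attach each row's cumulative commit share.

    Returns rows of (rank, login, commits, cumulative_share, crosses_half), where
    crosses_half is True only on the first row whose running total reaches 50%.
    """
    ranked = sorted(contributors, key=lambda pair: pair[1], reverse=True)
    total = sum(commits for _, commits in ranked)
    rows, running, crossed = [], 0, False
    for rank, (login, commits) in enumerate(ranked, start=1):
        running += commits
        # Integer comparison avoids floating-point edge cases at exactly 50%.
        crosses = not crossed and total > 0 and 2 * running >= total
        crossed = crossed or crosses
        share = running / total if total else 0.0
        rows.append((rank, login, commits, share, crosses))
    return rows


# Placeholder data for illustration only.
sample = [("dev_a", 400), ("dev_b", 180), ("dev_c", 150),
          ("dev_d", 120), ("dev_e", 100), ("dev_f", 50)]
for rank, login, commits, share, crosses in cumulative_table(sample):
    marker = "  <- 50% reached" if crosses else ""
    print(f"{rank:>2}  {login:<6} {commits:>5}  {share:6.1%}{marker}")
```

In this toy data the marked row is rank 2, so the bus factor would be 2; with the NumPy data the paper reports the threshold at the seventh bar.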


Conclusion

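This research examined the bus factor of NumPy and, using the Python scripts in the accompanying repository, found a value of 7: the seven most active contributors account for at least half of all counted commits. For a project with hundreds of contributors, this is a moderate value that still reflects substantial concentration.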

Understanding the bus factor of critical software projects is essential for assessing the resilience of modern digital infrastructure. Future research should explore automated methods for analyzing bus factors across larger datasets of repositories.


References

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.


Appendices

busfactor.py
import requests
import csv

OWNER = "numpy"
REPO = "numpy"

API_URL = f"https://api.github.com/repos/{OWNER}/{REPO}/contributors?per_page=100"

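def get_contributors():
    """Fetch every page of the contributors endpoint as login/commit-count dicts."""
    contributors = []
    page = 1
    while True:
        # A timeout keeps the script from hanging on a stalled connection.
        response = requests.get(f"{API_URL}&page={page}", timeout=30)
        if response.status_code != 200:
            # Discard partial results: a truncated list would skew the bus factor.
            print("Error fetching data:", response.status_code)
            return []
        data = response.json()
        if not data:
            break  # An empty page means the previous page was the last one.
        for c in data:
            contributors.append({"login": c["login"], "commits": c["contributions"]})
        page += 1
    return contributors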

def calculate_bus_factor(contributors):
    """Return (bus factor, contributors sorted by commits, highest first)."""
    sorted_contribs = sorted(contributors, key=lambda x: x["commits"], reverse=True)
    total_commits = sum(c["commits"] for c in sorted_contribs)
    running_total = 0
    bus_factor = 0
    # Count top contributors until their combined commits reach 50% of the total.
    for c in sorted_contribs:
        running_total += c["commits"]
        bus_factor += 1
        if running_total >= total_commits * 0.5:
            break
    return bus_factor, sorted_contribs

def save_contributors_csv(contributors_sorted):
    with open("contributors.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["login", "commits"])
        writer.writeheader()
        writer.writerows(contributors_sorted)
    print("Saved contributors.csv")

def main():
    contributors = get_contributors()
    if not contributors:
        print("No contributors found or error fetching data.")
        return
    bus_factor, sorted_contribs = calculate_bus_factor(contributors)
    print(f"Estimated Bus Factor for {OWNER}/{REPO}: {bus_factor}")
    save_contributors_csv(sorted_contribs)

if __name__ == "__main__":
    main()

graphs.py
import matplotlib
matplotlib.use('Agg')

import matplotlib.pyplot as plt
import csv

logins = []
commits = []

with open("contributors.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        logins.append(row["login"])
        commits.append(int(row["commits"]))

logins_sorted, commits_sorted = zip(*sorted(zip(logins, commits), key=lambda x: x[1], reverse=True))

plt.figure(figsize=(12,6))
plt.plot(range(1, len(commits_sorted)+1), commits_sorted, marker='o')
# The title names numpy/numpy; edit it when plotting another repository's CSV.
plt.title("Commit Distribution Among Contributors (numpy/numpy)")
plt.xlabel("Contributor Rank")
plt.ylabel("Number of Commits")
plt.grid(True)
plt.tight_layout()
plt.savefig("commit_distribution.png")
plt.close()

plt.figure(figsize=(12,6))
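# Cap at the number of contributors so the bar and label counts always match.
top_n = min(20, len(commits_sorted))
plt.bar(range(1, top_n+1), commits_sorted[:top_n], tick_label=logins_sorted[:top_n])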
plt.title("Maintainer Concentration (Top 20 Contributors)")
plt.xlabel("Contributor")
plt.ylabel("Number of Commits")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig("maintainer_concentration.png")
plt.close()

print("Graphs generated successfully.")

commit_distribution.png (Commit Distribution Graph)
maintainer_concentration.png (Maintainer Concentration Graph)

README.md

The Bus Factor of Critical Open-Source Infrastructure

This repository contains a research paper analyzing the bus factor of critical open-source software projects. The bus factor measures how many contributors a project relies on for most of its development, highlighting potential risks in software sustainability.

Paper

The full research paper is available here: paper.md

Scripts

Included in this repo:

  • busfactor.py – Calculates the bus factor from a GitHub repository.
  • graphs.py – Generates example commit distribution and maintainer concentration graphs.

Images

  • commit_distribution.png – Visual representation of commit distribution.
  • maintainer_concentration.png – Visual representation of maintainer concentration.

For this paper, I used numpy/numpy as the example, a very popular Python library for numerical computing.

Requirements

Dependencies are listed in requirements.txt:

requests
matplotlib

Install them with:

pip install -r requirements.txt

requirements.txt

requests
matplotlib