How I Crashed a Server (and Learned to Prevent It with Grafana, Prometheus, and Telegram)

Keep monitoring and don't get caught with your pants down

Let me tell you a story about how I crashed my workplace’s server by running it out of disk space, how I fixed it, and how I keep it from happening again with automated monitoring and alerts.

The Event

I once worked at a small company that hosted a web application serving around 1,500 daily users. After some time, I was put in charge of the VMs (virtual machines) that hosted the app for both the staging and production environments.

The system was really simple: an Angular frontend, a Node.js backend, Nginx to serve it, and Docker to run everything in containers. I was a complete noob (still am, to be honest) and barely knew anything about Docker beyond docker run and docker-compose up.

How was the application deployed on the on-premise server? I just followed what the previous admin used to do:

git pull origin master
docker-compose down
docker-compose build
docker-compose up

It was that simple. It always worked (with a slow build time, but that was just a minor inconvenience 😅).

Months passed, and everything was running fine… until it wasn’t.

“The application isn’t responding! What happened?! What did you do?!”

I had no idea. There hadn’t been any updates that week. I tried SSH’ing into the machine but got nothing. I tried everything I knew at the time, with no luck.

After a few hours, a senior developer asked the infra guy:

“How much disk space is left on the app host?”

The answer: “None.”

The Cause

The server was on-premise, and we were building the Docker images on the same machine that served the application.

The application didn’t persist much data aside from a bit of cache. So why was the disk full?

Because when Docker builds images, it leaves garbage behind in the overlay2 directory under /var/lib/docker. Even if you run docker system prune -af --volumes, in my experience some leftover data keeps piling up over time, and eventually it eats up all your disk space. The app runs fine until there’s no space left for any filesystem operations, and at that point the OS crashes.
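If you suspect the same thing is happening on your host, it helps to look at the numbers before deleting anything. Here is a minimal sketch of how that check could look. The function names are made up for illustration, /var/lib/docker is only Docker's default data root (yours may differ), and reading it usually requires root.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original setup): print the used-space
# percentage of the filesystem that holds the given path, as a bare integer.
disk_used_percent() {
    # df -P produces stable POSIX output; field 5 of the second line is "NN%".
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a short report. The Docker-specific parts are skipped when Docker is
# not installed or its data directory is not readable by the current user.
docker_disk_report() {
    echo "root filesystem used: $(disk_used_percent /)%"
    if command -v docker >/dev/null 2>&1; then
        # Docker's own summary of space used by images, containers,
        # volumes, and build cache.
        docker system df || true
    fi
    # Assumed default location of the overlay2 layer storage.
    if [ -r /var/lib/docker/overlay2 ]; then
        du -sh /var/lib/docker/overlay2 2>/dev/null || true
    fi
}
```

Running docker_disk_report before and after a build gives a rough idea of how much each build leaves behind.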

At the time (and, as far as I know, even today), this issue hasn’t been fully solved by the Docker community.

If you’re curious, here’s more on the Docker overlay issue.

The Solution

Clean out the entire overlay2 directory and restart the Docker service. But be careful: this will remove all Docker layers, containers, and images!
Here is how to do it.

The Prevention

To avoid getting caught off guard again, I built a simple monitoring setup to track host metrics. Now, if disk usage hits a certain threshold, I get notified early enough to migrate or clean things up before it’s too late.

Here’s how the setup looks:

  • Node Exporter to expose metrics from the host
  • Prometheus to scrape and store those metrics
  • Grafana to visualize metrics and send alerts
  • Telegram for receiving notifications
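In this setup Grafana evaluates the alert rule and talks to Telegram, so none of the following is needed to make it work. Still, the logic behind the alert is small enough to sketch in plain shell: compute the used percentage, compare it to a threshold, send a message. The function names, the two environment variables, and the threshold of 80 in the usage comment are placeholders for illustration, not taken from the demo repository.

```shell
#!/bin/sh
# Hypothetical names throughout. TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are
# assumed environment variables that you would have to set yourself.

# Used-space percentage from two byte counts. This is the same arithmetic as
# the PromQL expression
#   100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
# but with integer division, so the result is only accurate to about 1%.
used_percent() {
    avail_bytes=$1
    size_bytes=$2
    echo $(( 100 - (avail_bytes * 100 / size_bytes) ))
}

# Succeeds (exit status 0) when usage has reached the threshold.
should_alert() {
    [ "$1" -ge "$2" ]
}

# Send a text message through the Telegram Bot API sendMessage method.
send_telegram() {
    curl -fsS "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
        --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
        --data-urlencode "text=$1" >/dev/null
}

# Usage (avail and size would come from your metrics source):
#   pct=$(used_percent "$avail" "$size")
#   should_alert "$pct" 80 && send_telegram "Disk usage is at ${pct}%"
```

Letting Grafana do this instead means the threshold, the evaluation interval, and the message live next to the dashboard rather than in a cron job on the host being watched.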

Grafana Dashboard

Do It Yourself

Want to try it for yourself? Here’s the code:

🔧 Demo repository

This is an easy setup you can try on your own machine.

Monitoring saved me from a total outage and now it can save you too!

Demo Repository’s README

Telegram Host Alert Setup Guide

This project sets up Prometheus and Grafana for monitoring using Node Exporter.

Prerequisites

  • Docker and Docker Compose installed
  • Node Exporter binary (already included in collectors/node-exporter/)

1. Configure Environment

Copy prometheus.yml.sample to prometheus.yml and replace HOST_IP with your host’s current IP address.
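This step can also be scripted. A minimal sketch, assuming the sample file contains the literal placeholder HOST_IP as described above; the function name and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: copy the sample config and replace every HOST_IP
# placeholder with the address passed as the first argument.
configure_prometheus() {
    host_ip=$1
    sample=${2:-prometheus.yml.sample}
    target=${3:-prometheus.yml}
    # sed writes to a new file, so the sample itself is left untouched.
    sed "s/HOST_IP/${host_ip}/g" "$sample" > "$target"
}

# Usage, run from the repository root (192.168.1.10 is only an example):
#   configure_prometheus 192.168.1.10
```

Keeping the sample file unchanged makes it easy to regenerate prometheus.yml when the host's address changes.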

2. Start the Stack

Run the following command to start Prometheus and Grafana:

docker compose up -d

3. Access the Services

Log in to Grafana with:

login: admin
password: admin

4. Import Node Exporter Dashboard

  • In Grafana’s dashboards view, import node-export-full.json from grafana/templates.

5. Stopping the Stack

To stop all services:

docker compose stop

Conclusion

I hope my past mistake helps you in some way. Thank you for reading along!
