Let me tell you a story about how I crashed my workplace’s server due to a lack of disk space and how I solved the issue and prevented it from happening again using an automated monitoring and alerts system.
The Event
I once worked at a small company that hosted a web application serving around 1,500 daily users. After some time, I was put in charge of the VMs (virtual machines) that hosted the app in both the staging and production environments.
The system was really simple: an Angular frontend, a Node.js backend, Nginx to serve it, and Docker to orchestrate it. I was a complete noob (still am, to be honest) and barely knew anything about Docker beyond `docker run` and `docker-compose up`.
How was the application deployed on the on-premise server? I just followed what the previous admin used to do:

```shell
git pull origin master
docker-compose down
docker-compose build
docker-compose up
```
It was that simple. It always worked (with a slow build time, but that was just a minor inconvenience 😅).
Months passed, and everything was running fine… until it wasn’t.
“The application isn’t responding! What happened?! What did you do?!”
I had no idea. There hadn’t been any updates that week. I tried SSH’ing into the machine, but nothing. I tried everything I knew at the time, with no luck.
After a few hours, a senior developer asked the infra guy:
“How much disk space is left on the app host?”
The answer: “None.”
The Cause
The server was on-premise, and we were building the Docker images on the same machine that served the application.
The application didn’t persist much data aside from a bit of cache. So why was the disk full?
Because when Docker builds images, it leaves garbage behind in the `overlay2` directory under `/var/lib/docker`. Even if you run `docker system prune -af --volumes`, some leftover data keeps piling up over time, and eventually it eats up all your disk space. The app runs fine until there’s no space left for any filesystem operation, causing the OS to crash.
At the time (and even today), this issue still hasn’t been fully solved by the Docker community.
If you’re curious, here’s more on the Docker overlay issue.
The Solution
Clean the entire `overlay2` directory and restart the Docker service. But be careful: this will remove all Docker layers, containers, and images!
Here is how to do it.
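The cleanup can be sketched roughly like this. This is a minimal sketch under my own assumptions (a systemd-based host, the default `/var/lib/docker` data root); `clean_overlay2` is a hypothetical helper name, and it is dry-run by default precisely because the real commands wipe all Docker state:

```shell
#!/usr/bin/env bash
# Hypothetical cleanup helper -- the real steps wipe ALL Docker layers,
# containers, and images. Dry-run by default; pass --force to execute.
clean_overlay2() {
  local docker_dir="/var/lib/docker"
  local steps=(
    "systemctl stop docker"
    "rm -rf ${docker_dir}/overlay2"
    "systemctl start docker"
  )
  local step
  for step in "${steps[@]}"; do
    if [ "${1:-}" = "--force" ]; then
      sudo $step                    # really run the step (needs root)
    else
      echo "[dry-run] sudo $step"   # just print what would run
    fi
  done
}

clean_overlay2   # prints the plan; run `clean_overlay2 --force` to apply it
```

Running it without `--force` only prints the plan, which is a cheap way to double-check the paths before pulling the trigger on a production box.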
The Prevention
To avoid getting caught off guard again, I built a simple monitoring setup to track host metrics. Now, if disk usage hits a certain threshold, I get notified early enough to migrate or clean things up before it’s too late.
Here’s how the setup looks:
- Node Exporter to expose metrics from the host
- Prometheus to scrape and store those metrics
- Grafana to visualize the metrics and fire alerts
- Telegram to receive the notifications
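For reference, the stack above can be wired together with a Compose file along these lines. This is a sketch under my own assumptions (stock `prom/prometheus` and `grafana/grafana` images, default ports), not the demo repo’s actual file:

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

Node Exporter is deliberately not a service here: it runs as a binary directly on the host so it sees the real disk, not a container’s view of it.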
Do It Yourself
Want to try it for yourself? Here’s the code:
This is an easy setup you can try on your own machine.
Monitoring saved me from a total outage, and now it can save you too!
Demo Repository’s README
Telegram Host Alert Setup Guide
This project sets up Prometheus and Grafana for monitoring using Node Exporter.
Prerequisites
- Docker and Docker Compose installed
- Node Exporter binary (already included in `collectors/node-exporter/`)
1. Configure Environment
Copy the contents of the `prometheus.yml.sample` file to `prometheus.yml` and replace `HOST_IP` with your current IP.
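The sample boils down to something like this. This is a sketch, not the repo’s exact file; the `node` job name and the `9100` scrape port are my assumptions (they are Node Exporter’s defaults):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["HOST_IP:9100"]  # replace HOST_IP with your host's IP
```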
2. Start the Stack
Run the following command to start Prometheus and Grafana:
```shell
docker compose up -d
```
3. Access the Services
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (login: admin / admin)
4. Import Node Exporter Dashboard
- In Grafana’s Dashboards section, import `node-export-full.json` from `grafana/templates`
5. Stopping the Stack
To stop all services:
```shell
docker compose stop
```
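When you wire up the Grafana-to-Telegram alert (see the references below), the disk condition can be expressed as a PromQL query along these lines, assuming the standard `node_filesystem_*` metric names that Node Exporter exposes; the 80% threshold is just an example:

```promql
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
         / node_filesystem_size_bytes{mountpoint="/"}) > 80
```

This fires when the root filesystem is more than 80% full, which leaves headroom to clean up or migrate before the disk actually runs out.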
Conclusion
I hope my past mistake helps you avoid the same one. Thank you for reading along!
References
- Install Docker Engine on Ubuntu
- Node Exporter
- Node Exporter Full Dashboard
- Set up Telegram alerts with Grafana