BookingLive – Enterprise level monitoring
At BookingLive, we take the quality of our service seriously. Although our support hours are traditionally 09:00-17:30, we monitor our services 24 hours a day.
Over the years we have relied on services such as Pingdom, but found that they only alert you once a server is already down. We were not happy with that scenario, so we set out to implement monitoring akin to that of organisations 100x our size. Our goals were to:
- Continually improve learning about what causes outages
- React to outages before they become client facing
- Log as much data as possible for post analysis
- Closely monitor security events
- Notify all key staff of events
- Reduce downtime
- Automate recovery processes
While we won’t share the tools we use to accomplish the tasks, we can say that they are widely used around the world and considered an industry standard. Below is just a taste of what we do here, with the scope growing each day.
The first task of implementation was to get the systems working at a good base level. This of course started with getting the core applications running across our network and through our firewalls!
- Get basic monitoring configured
- Configure notifications to staff
- Start building a statistical base
Once set up, we were quickly able to build up data from our live systems. Below are some examples of the datasets we record, minute by minute, from all our servers to a secure location.
The initial setup had us monitoring:
- CPU statistics
- Memory usage
- Database utilisation
- Disk space utilisation
- Critical service monitoring (Web service / FTP / Database)
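To give a feel for what a minute-by-minute sample might contain, here is a minimal Python sketch. It is purely illustrative and is not the tooling we use: the function name `collect_sample` and the dictionary keys are assumptions made up for this example, and it only covers CPU load and disk space (memory and database metrics normally need extra tooling).

```python
import os
import shutil
import time


def collect_sample(path="/"):
    """Return one snapshot of basic host metrics as a plain dict.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    try:
        # 1-minute load average; only available on Unix-like systems.
        load_1m = os.getloadavg()[0]
    except (OSError, AttributeError):
        load_1m = None

    disk = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        "load_1m": load_1m,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }


if __name__ == "__main__":
    # A real collector would run this on a schedule (for example once a
    # minute) and ship each sample to central storage.
    print(collect_sample())
```

Keeping each sample as a flat dictionary makes it easy to append to a time series and graph later.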
An example CPU utilisation graph
An example CPU load graph
With this data we could:
- Quickly build up profiles of the servers
- Analyse peak usages and spikes
- Change internal IT practices to smooth the load
- Raise warnings when monitored items hit their thresholds
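Threshold warnings of this kind can be sketched in a few lines of Python. This is a generic illustration, not our actual configuration: the metric names and the warning/critical levels in `THRESHOLDS` are invented for the example.

```python
# Hypothetical (warning, critical) levels per metric; values are assumptions.
THRESHOLDS = {
    "disk_used_pct": (80.0, 90.0),
    "load_1m": (4.0, 8.0),
}


def classify(value, warn, crit):
    """Map a reading to 'ok', 'warning' or 'critical'."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def evaluate(sample, thresholds=THRESHOLDS):
    """Return {metric: state} for every metric that is not 'ok'."""
    alerts = {}
    for metric, (warn, crit) in thresholds.items():
        value = sample.get(metric)
        if value is None:
            continue  # metric missing from this sample, nothing to judge
        state = classify(value, warn, crit)
        if state != "ok":
            alerts[metric] = state
    return alerts


if __name__ == "__main__":
    print(evaluate({"disk_used_pct": 85.0, "load_1m": 9.5}))
```

Having a warning level below the critical level is what allows staff to react before a problem becomes client facing.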
Individual client monitoring
While monitoring an individual server that hosts many BookingLive clients is fantastic, we were not satisfied with that alone. We quickly expanded the setup to monitor individual client sites and key business processes. If a customer's key process was bookings, we sought to verify and report on whether the booking process was active.
It is important that, when a client identifies a problem with their site, we can verify and corroborate their version of events.
The above example is a 7-day track of response times for a client. If they report a slowdown on the site at a certain time, we can use over 400 monitoring points to track exactly why the slowdown occurred.
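As a rough illustration of how a reported slowdown can be checked against recorded response times, the Python sketch below flags samples that are far above the typical value. It is an assumption-laden example rather than our real analysis: the function name `find_slow_samples`, the `(timestamp, response_ms)` layout and the factor of 3 are all hypothetical.

```python
from statistics import median


def find_slow_samples(samples, factor=3.0):
    """Return the samples whose response time exceeds factor x the median.

    samples: list of (timestamp, response_ms) tuples (assumed layout).
    """
    if not samples:
        return []
    baseline = median(ms for _, ms in samples)
    return [(ts, ms) for ts, ms in samples if ms > factor * baseline]


if __name__ == "__main__":
    history = [
        ("09:00", 120),
        ("09:01", 130),
        ("09:02", 125),
        ("09:03", 900),
        ("09:04", 128),
    ]
    print(find_slow_samples(history))
```

Using the median as the baseline means a handful of slow requests does not distort what counts as "normal" for that client.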
Automation of key tasks
In addition to monitoring clients' sites, we must also monitor other key factors, such as:
- Error logs
- Automated script runs
With so many servers, combing through log files and carrying out access audits by hand is not something that can be done every time! We routinely and automatically monitor our servers for things such as server access and rogue internet users, in addition to carrying out manual checks.
Because the system is so flexible, we can react to any new threat and set up monitoring for the same kind of event across all servers with a few clicks of the mouse. This has drastically reduced the time it takes to deploy new processes, and has improved security on our servers as well as on client systems.
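To show the general idea of an automated access audit, here is a small Python sketch that counts failed logins per source address. It is not our production tooling: it assumes log lines in the style OpenSSH writes for failed passwords, and the function name `failed_logins_by_ip` and the `min_attempts` cut-off are made up for this example.

```python
import re
from collections import Counter

# Assumed line style: "Failed password for [invalid user ]NAME from ADDR port N"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_ip(lines, min_attempts=3):
    """Return {address: count} for addresses with repeated failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n >= min_attempts}


if __name__ == "__main__":
    sample_log = [
        "sshd[101]: Failed password for root from 203.0.113.5 port 4001 ssh2",
        "sshd[102]: Failed password for invalid user admin from 203.0.113.5 port 4002 ssh2",
        "sshd[103]: Failed password for root from 203.0.113.5 port 4003 ssh2",
        "sshd[104]: Failed password for bob from 198.51.100.7 port 4004 ssh2",
        "sshd[105]: Accepted password for alice from 192.0.2.10 port 4005 ssh2",
    ]
    print(failed_logins_by_ip(sample_log))
```

The same pattern-and-count approach can be pointed at other log sources by swapping the regular expression.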
BookingLive is continuing to improve our service, and monitoring solutions are playing a crucial role in maintaining the levels of service and expertise this coming year.
No single solution can solve every problem, but our monitoring framework is moving us quickly towards maintaining visibility of our growing client base and keeping quality levels consistent.
2014 is the year of big data, and BookingLive wants to ensure we have the data we need internally to support our clients. You will never appreciate the full value of historical data until you depend upon it to solve a problem…
Hope you found it an interesting read!
Written by Robert Cox