It's time for cloud management with automated fixes

It’s 4:00 in the morning, and you are awoken out of a sound sleep by a phone call from your cloudops center. It seems the inventory application in the public cloud is down.

You know this because some tile on some application monitoring console changed from green to red. Those charged with monitoring public cloud-based applications only know that if this particular application goes down, then they need to call the application admin, which is you.

Through blurry eyes, you find that a middleware process died on one of your cloud machine instances. A quick reset of that instance and you’re back in business and back to sleep, at least for 30 minutes before your alarm goes off.

I often ask those selling cloud management and monitoring solutions, whether for cloudops, secops, or netops, if they have automated capabilities to fix the issues that the monitoring software finds. Pretty much nobody does. Or the management software can kick off external processes that you have to write yourself, but it has no visibility into or control over those processes.

I’m not saying that complex problems need to be, or even can be, corrected automatically. But most application issues are not complex problems and can be fixed quickly with simple reboots or resets, even within public clouds.
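To make the "simple reboots or resets" idea concrete, here is a minimal sketch of what an automated fix could look like. Everything in it is an assumption for illustration: `Remediator`, `check_health`, `restart`, and `FakeMiddleware` are hypothetical names, not part of any real monitoring product or cloud API. The point is only the shape of the logic: probe, try the simple fix a bounded number of times, and wake a human only if that fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Remediator:
    """Tries a simple fix a bounded number of times, then escalates."""

    check_health: Callable[[], bool]  # hypothetical probe: True when the service is up
    restart: Callable[[], None]       # hypothetical fix, e.g. reset the instance
    max_attempts: int = 3
    log: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        # Nothing to do if the service already answers its health check.
        if self.check_health():
            return "healthy"
        # Apply the simple fix, re-checking health after each attempt.
        for attempt in range(1, self.max_attempts + 1):
            self.log.append(f"restart attempt {attempt}")
            self.restart()
            if self.check_health():
                return "fixed"
        # Only now does a human need to get the 4:00 a.m. phone call.
        self.log.append("escalating to on-call admin")
        return "escalated"


class FakeMiddleware:
    """Stand-in for a real service so the sketch runs without a cloud account."""

    def __init__(self) -> None:
        self.up = False  # starts in the failed state

    def is_up(self) -> bool:
        return self.up

    def reset(self) -> None:
        self.up = True  # assume a reset brings the process back
```

In a real deployment the two callables would wrap whatever health endpoint and restart mechanism your provider exposes. The bounded retry count matters: a fix that loops forever is just a different kind of outage.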

What I’m seeing now is a clear demand: a demand that those charged with managing clouds be given the tools not only to find issues with applications, databases, and other cloud-based production processing, but also to automate most simple fixes without bothering humans.

This is a mandate for a few core reasons:

First, putting humans in the process means that responses will be inconsistent. Different humans will be charged with fixing problems at different times and will do so in different ways. In some cases the problems won’t get fixed in a timely manner, considering that humans sleep through the phone ringing or find other ways to ignore issues with their applications deployed on the public cloud.

Second, we now have automation capabilities that can do some pretty remarkable things. With machine learning we can not only automate some fixes, but also learn from experience as things are being fixed. For instance, once you find out that the database should be reset before the middleware server on your public cloud provider, you can store that knowledge for later.
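Storing that knowledge does not have to start with anything fancy. The sketch below is not machine learning; it is a plain success counter standing in for it, and `FixMemory`, the incident label, and the step names are all hypothetical. It simply remembers which ordered sequence of fixes resolved a given kind of incident most often, so the automation can try that order first next time.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Sequence, Tuple


class FixMemory:
    """Remembers which ordered sequence of fixes resolved each kind of incident."""

    def __init__(self) -> None:
        # incident label -> count of each fix sequence that worked
        self._successes: Dict[str, Counter] = defaultdict(Counter)

    def record_success(self, incident: str, steps: Sequence[str]) -> None:
        # Order matters, so the sequence is stored as a tuple.
        self._successes[incident][tuple(steps)] += 1

    def best_known_fix(self, incident: str) -> Optional[Tuple[str, ...]]:
        # Return the sequence that has worked most often, or None if unseen.
        seen = self._successes.get(incident)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


memory = FixMemory()
# Hypothetical history: resetting the database first worked twice.
memory.record_success("inventory-app-down", ["reset-middleware"])
memory.record_success("inventory-app-down", ["reset-database", "reset-middleware"])
memory.record_success("inventory-app-down", ["reset-database", "reset-middleware"])
print(memory.best_known_fix("inventory-app-down"))
```

A real system would persist this history and weigh failures as well as successes, but even a table this simple captures the "database before middleware" lesson so nobody has to rediscover it at 4:00 in the morning.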

I’ve built nothing but management systems with automated fix or self-healing capabilities. Why? People need their sleep, and outages get fixed so fast that most people never even know they happened. Sound better?
