Saturday, May 21, 2011

Cloud Operation Challenges

Infrastructure is the layer that will undergo the most change when you move to the cloud, so operations management at this level will be the most affected area. However, there will be changes at the process level as well. The major areas affected are:
1.       Monitoring
a.       An additional level of monitoring will need to be introduced at the hypervisor layer, since this layer controls your fabric.
b.      You will still need to monitor the hardware sitting beneath the hypervisor, and probably in a stringent manner.
c.       Monitoring will need much more integrated automation so that it can move workloads; merely creating an incident might not be sufficient.
d.      Assets in the public cloud will also need to be monitored.
2.       Discovery
a.       Discovery is another process that will need rethinking, because virtual assets will be created and destroyed every month, if not every week or every day.
b.      Automated discovery will become a necessity.
c.       Another aspect that will need automation is Change and Configuration Management.
d.      Another key aspect that needs attention is “stale assets”, e.g. virtual machines that have been sitting in a shutdown state beyond a specified period.
3.       Public cloud integration
a.       The ability to monitor both on-premises and cloud assets using a single set of tools and processes.
b.      A strategy for migrating between the public cloud and on-premises infrastructure on the fly.
4.       Failure testing
a.       When using cloud infrastructure, you need to plan for failure and repeatedly test your system by shutting down one service or another.
5.       Process Changes
a.       Process integration with the public cloud provider will be required at various levels.
b.      SLA negotiations with the public cloud provider.
c.       Enhanced self-service capabilities given to end users will need an enhanced support structure as well.
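To make the Discovery points (2b to 2d) a little more concrete, here is a minimal Python sketch of one automated discovery pass. It assumes a scan already returns each VM's name, power state and time of its last state change, and that the CMDB can be reduced to a set of names. `DiscoveredVM`, `reconcile`, the state strings and the 30-day threshold are all hypothetical illustrations, not the API of any particular discovery or CMDB product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DiscoveredVM:
    # Hypothetical record produced by one discovery scan.
    name: str
    state: str  # assumed values: "running" or "shutdown"
    last_state_change: datetime


def reconcile(discovered, cmdb_names, now, max_shutdown=timedelta(days=30)):
    """Compare one discovery scan with the CMDB and flag stale VMs.

    Returns (new, missing, stale), each a sorted list of names:
      new     - VMs found by the scan but not yet recorded in the CMDB
      missing - CMDB records whose VM was not found by the scan
      stale   - VMs that have been shut down for longer than max_shutdown
    """
    found = {vm.name for vm in discovered}
    new = sorted(found - cmdb_names)
    missing = sorted(cmdb_names - found)
    stale = sorted(
        vm.name
        for vm in discovered
        if vm.state == "shutdown" and now - vm.last_state_change > max_shutdown
    )
    return new, missing, stale


# Usage with made-up inventory data.
scan = [
    DiscoveredVM("web-01", "running", datetime(2011, 5, 20)),
    DiscoveredVM("build-07", "shutdown", datetime(2011, 3, 1)),
    DiscoveredVM("test-12", "shutdown", datetime(2011, 5, 15)),
]
cmdb = {"web-01", "build-07", "db-03"}
new, missing, stale = reconcile(scan, cmdb, now=datetime(2011, 5, 21))
print(new, missing, stale)  # ['test-12'] ['db-03'] ['build-07']
```

Running a pass like this on a schedule, and feeding `new` and `missing` into change records, is one way the discovery and Change and Configuration Management automation could meet; `stale` is the candidate list for the "stale assets" review.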
The common theme that emerges in the changed operations management is automation. But automation can prove to be a “necessary evil”. Check out the AWS outage (http://aws.amazon.com/message/65648/) that happened recently because of automation.
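Monitoring point 1c and the “necessary evil” caveat can be illustrated together with a small Python sketch: automation that moves workloads off an unhealthy host, but stops and escalates to a person once it has acted too many times within a short sliding window. `GuardedRemediator`, the `migrate` and `open_incident` callables and the limits are hypothetical; this is not how AWS or any specific monitoring tool works, just one possible guard against automation amplifying a failure.

```python
from collections import deque
from datetime import datetime, timedelta


class GuardedRemediator:
    """Migrates workloads off unhealthy hosts automatically, but escalates to
    a human once max_actions migrations have happened within `window`."""

    def __init__(self, migrate, open_incident, max_actions=3,
                 window=timedelta(minutes=10)):
        self.migrate = migrate              # hypothetical: callable(host)
        self.open_incident = open_incident  # hypothetical: callable(host, reason)
        self.max_actions = max_actions
        self.window = window
        self.recent = deque()  # timestamps of automated migrations in the window

    def handle_unhealthy(self, host, now):
        # Drop migrations that are older than the sliding window.
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) >= self.max_actions:
            # Many automated moves in a short time may signal a systemic
            # problem, so hand it to a person instead of acting again.
            self.open_incident(host, "automation limit reached")
            return "escalated"
        self.migrate(host)
        self.recent.append(now)
        # Still record what the automation did, for later review.
        self.open_incident(host, "workloads migrated automatically")
        return "migrated"


# Usage with stand-in callables that just record what happened.
migrated, incidents = [], []
r = GuardedRemediator(migrated.append,
                      lambda host, why: incidents.append((host, why)),
                      max_actions=2)
t0 = datetime(2011, 5, 21, 9, 0)
results = [r.handle_unhealthy(f"host-{i}", t0 + timedelta(minutes=i))
           for i in range(3)]
print(results)   # ['migrated', 'migrated', 'escalated']
print(migrated)  # ['host-0', 'host-1']
```

The design choice here is that the automation still opens an incident for every action it takes, so a person can see what it did, and it deliberately gives up once the rate of actions looks abnormal.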