There are lots of topics in this area, and people in IT have plenty of stories to tell about their operations. I’ll list a few points here and may follow up with another list or update this post later with further points.
An IT shop begins and ends with how well it does basic operations. The services it provides need to work reliably before anyone can begin discussing new services or systems. You need an honest assessment of how things are going, and this should include both a qualitative understanding and a quantitative, metrics-based understanding. When you talk to your peers, what do they tell you about how IT is doing? Do they give you any insights? On the metrics front, do you have real trend lines for uptime, costs, power consumption and outages? Can your team account for outages and explain them down to the root cause? Is the team trying to fix problems at the root cause or just doing a Ctrl-Alt-Del type of fix?
Even if your ‘services’ are outsourced to a 3rd party, you are still accountable for them to the organization. You aren’t off the hook because the system runs in the Amazon cloud, Google Cloud, Salesforce, etc.
So here is a list of some ideas, some of which repeat points from above:
- Power costs and consumption for your data centers.
- System and service downtime events: both frequency (how often) and duration (how long down). Note that these are different things and have different solutions. You need these for all your mission critical services and core or base level services. Remember that email is a mission critical system to your knowledge workers.
- Disaster Recovery testing data.
- Overall IT Operations costs and trends in a fashion where you can drill down to understand what is driving your costs.
- Equipment aging.
- Patch status at all levels of the stack: network, server, OS, database, applications, clients, etc.
- Audit findings, testing and assurance practices.
- Accounts and in particular trusted account review process and frequency.
- Security monitoring and logging.
- Integrated monitoring across your whole stack where you can relate events between systems and services. This includes robust Security Information and Event Management (SIEM) tools and practices.
- Help desk data. Your help desks should be providing you with a rich set of information about which services are causing the most grief to your workforce, suppliers and customers. Why are people calling in for help?
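To make the frequency versus duration point from the list concrete, here is a minimal sketch in Python. The outage log, the reporting period and the `summarize_outages` helper are all made up for illustration; in practice the events would come from your monitoring or ticketing system.

```python
from datetime import datetime, timedelta

# Hypothetical outage log for one service: (start, end) pairs.
# These timestamps are invented for illustration only.
outages = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),      # 30 min
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 15)),  # 15 min
    (datetime(2024, 1, 25, 2, 0), datetime(2024, 1, 25, 5, 0)),     # 180 min
]


def summarize_outages(outages, period_start, period_end):
    """Report frequency and duration separately for a reporting period."""
    period = period_end - period_start
    # Total time the service was down across all events.
    total_downtime = sum((end - start for start, end in outages), timedelta())
    count = len(outages)
    # Average length of a single outage (zero if there were none).
    mean_duration = total_downtime / count if count else timedelta()
    # Fraction of the period the service was up.
    availability = 1 - total_downtime / period
    return {
        "count": count,
        "total_downtime": total_downtime,
        "mean_duration": mean_duration,
        "availability": availability,
    }


summary = summarize_outages(
    outages,
    period_start=datetime(2024, 1, 1),
    period_end=datetime(2024, 2, 1),
)
print(summary)
```

A service with many short outages and a service with one long outage can show the same availability figure, which is why it helps to track the count and the mean duration as separate trend lines.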
There are more ideas and I could probably keep writing. I must highlight an earlier post on Checklists that I think helps you assess the maturity of your operations. Whether you are using checklists, whether your escalation rules are clear and followed, and whether your team is researching problems to root cause are all indicators of maturity.
If I’ve left out some major thoughts or you want to add to this list, please let me know in the comments.