What is an SLA?
According to Oracle documentation, an SLA “specifies minimum performance requirements and, upon failure to meet those requirements, the level and extent of customer support that must be provided.” Essentially, it is a written agreement in which the organization commits to responding to certain problems within a set time frame. SLA items can include same-day disaster recovery, next-day troubleshooting, and a variety of other “perks” for the organizations and clients that use the service. Unfortunately, for a rather dry but necessary topic, I’m not sure I can be “media-gifted” today with relevant images, so I’ll intersperse this all with some fun and lovable, but random, GIFs! In the end, though, an SLA is pretty clear-cut. There’s not much ambiguity or anything to hide about it, which is why I think the following GIF is pretty cat(apt)-worthy!!
Now, although an SLA is typically displayed on an organization’s website, it’s really a document meant for when users and organizations sign on with a service during the “design and deployment” stage. Of course, an SLA can apply anywhere and in any industry, although it may go by different terminology; in healthcare IT, this is the name it goes by. Since it is signed during project approval, it’s important to go with an organization that commits to and develops an SLA around areas such as uptime/downtime, response time, message delivery to clients, and any form of disaster recovery and business continuity. As I mentioned, SLAs are pretty cut and dried, so there is a relatively simple format for laying out the necessary information: an overview of the system, roles and responsibilities, how to measure service levels, change requests, and so forth.
But here are some of the key metrics and indicators usually found:
- System Availability
- Acceptable Data Loss
- Recovery Time
System Availability is a big one because it defines the acceptable amount of downtime that the organization tries to guarantee. No cloud service or organization that holds and provides user information will be 100% perfect, and even the largest organizations and companies have expected downtime; what differs is the amount of capital they have spent trying to minimize it. The rule of “Five Nines” is pretty common and it represents the availability percentage target, so five nines means 99.999% system availability. In a year, this translates to an allowed downtime window of a little over 5 minutes, or just under 0.9 seconds a day. Now how much of a difference would 99% make? It is less than one percentage point lower, but over a year 99% allows about 3.65 days of downtime, which can be drastic for an organization’s performance. Fortunately, clustered servers, redundant network components and Internet connections, mirrored storage devices, and a variety of other rising technologies can help ensure the uptime of the service while providing the flexibility to avoid planned and disruptive outages for maintenance.
HOWEVER, perhaps it’s not about the acceptable downtime or speed of the system that these prospective clients should be researching. Instead, they should be looking within the organization and trying to determine “how much are we prepared to spend? What is an allowable cost to ensure the best performance and highest availability?”
Unfortunately though, despite the best efforts, the SLA sometimes cannot be met, and that won’t be realized until AFTER disaster strikes. In order to feel comfortable with the agreements in place, it’s important to anticipate and plan for disaster. I realize that’s pretty pessimistic, but it’s better to be paranoid in preparation than distressed in disaster. When I took a Cyber Security Management course, there was an entire unit and weeks of lectures on Incident Response, Disaster Recovery, and Business Continuity, and if I learned one thing, the one thing that was pounded into my head was learning how to deal with the potential and risk for disaster. So, you’ve got two options:
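To make the availability arithmetic concrete, here is a minimal Python sketch that turns an availability target into an allowed-downtime budget. The function name `downtime_budget` and the 365-day year are my own assumptions for illustration; a real SLA would spell out its own measurement window.

```python
from datetime import timedelta

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: a non-leap year as the measurement window


def downtime_budget(availability_percent: float, days: float = DAYS_PER_YEAR) -> timedelta:
    """Return the maximum downtime allowed over `days` for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=unavailable_fraction * days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Compare a few common targets, per year and per day.
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year = downtime_budget(target)
        per_day = downtime_budget(target, days=1)
        print(f"{target}% -> {per_year} per year, "
              f"{per_day.total_seconds():.2f} s per day")
```

Run this way, five nines comes out to roughly 5 minutes 15 seconds a year, while 99% comes out to about 3.65 days, which is why that "tiny" difference in percentage matters so much.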
- Expect and Plan for it
- Jump off a plane and hope you stick the landing.
Now by no means is that GIF meant to suggest that you should take the second option, and as terrific as the second option sounds, please don’t. Although I have never had to do it myself, I know through case studies that planning and preparing for disaster can be an expensive and complicated process. In most cases, if I were looking to cut back and save money, those costs would usually be the first to go, especially since a disaster would be “rare” (wink wink hint hint cough cough).
So how should we define a disaster for an SLA?
For most people, the definition of “disaster” falls somewhere in line with cataclysmic events like fires, hurricanes, tornadoes, earthquakes, etc. that could destroy an entire building or environment along with the data and hardware within. Because of these misconceptions, many people simply don’t see the point of spending so much on a plan that might not even be used… ever (depending on where you live, e.g. near fault lines or in dry areas). However, “disasters” can also include smaller and more frequently occurring events such as hardware and software failure, and although that’s not as dramatic as a freak lightning strike hitting the fuel line of a building, it can have just as much of an impact. At the end of the day: assess, plan, control, simulate, modify.
So what? How does an SLA help the home organization?
Although an SLA is typically meant as an understanding between the organization and its clients, it also acts as an innate pressure on the organization to live up to the promises that are so easily found plastered across the internet. (Nothing is ever gone… ever *shudders*). It can help spur internal audits to ensure that the organization has set up appropriate controls to reduce the risk of attacks, problems, or disasters. Those audits can help by educating, training, and making all employees aware of specific processes and procedures in case anything happens, as well as teaching the client organization what can and will be done in those events (look up any incident response, disaster recovery, or business continuity documentation and plans… those are typically public on an organization’s website). Lastly, any plan should be tested on a regular basis, should involve all relevant people, and should ideally occur at a random and unprepared time, like a fire drill or an earthquake drill that we would have in school. Testing on a regular basis will ensure continued accuracy, and if not, well, it’ll give the employees something to talk about when they get off work.
In summary, the idea behind disaster recovery planning is to reduce the surprise factor when it occurs. Anticipation and preparation are key attributes.
Risk management is all about preparation by either reducing the likelihood of a disaster or minimizing the impact. Impact can be cost, service, or any other vital factor in a running organization. As always, documentation is key. (I’ll probably do another post specifically on managing risk and how it’s applied to IR/DR/BC plans).
So after all of that? Why do you even need an SLA? What should there be?
It’s not really for one specific organization or anything; in fact, it’s there to ensure that neither party can plead ignorance. Because an SLA pulls together information on the contracted service and the agreed-upon reliability into a single document, it states the metrics, responsibilities, and expectations. In doing so, it ensures that both sides have agreed to the same understanding. The SLA not only includes a description of the provided services and service levels but also, as we mentioned, metrics, duties and responsibilities, and any possible remedies and/or penalties. It’s important to designate these metrics because they define the acceptable levels of service and so discourage bad behavior from both parties. So in a typical SLA, there should be two components: services and management.
Service is pretty self-explanatory: it includes specifics on the services provided as well as any responsibilities, costs, procedures, conditions, etc. Management covers the reporting method (how to notify the client), the metrics used, any clauses, and a mechanism for updating the agreement. I realize that the term “metrics” is pretty ambiguous itself, but in the end it’s an understanding of how things will be measured, along with a justified reason for choosing each measurement. For the most part, these metrics should motivate the right behavior for optimized processes and should be easy to collect. More data is typically better, but in this case less is more, because it’s important to receive concise data and information to make informed decisions as soon as possible. Lastly, there should be a clause or statement about reviewing the SLA. Just like anything else, it will change, and so will the surrounding environment, so it’s important to keep it updated and ensure that both parties are privy to any new changes.
Although a thorough and comprehensive SLA is good for closing loopholes and whatnot, having too many items can also make tracking and analyzing performance metrics more difficult and confusing. Like I said before, less is more applies here too; just ensure that the most important items are listed and any others are mutually understood. I was practicing writing one by looking at samples and ended up with a 2.5-page SLA just because I kept thinking of things. I didn’t realize just how “off topic” I had gotten and how far I had strayed from “Service Level” into general practices and risk management topics, which should typically be internal, organization-wide practices and not part of an agreement with clients. Simple enough? I THINK NOT! (This should not be a “Hold my beverage, I’ve got this” moment)
Lastly, another thing to consider is offering different levels of products and services for different prices. Oftentimes in larger organizations, it can be difficult to create a custom and unique SLA for every client, so to make it simpler, it might be better to divide it into various performance levels that show potential clients the resources used and the cost trade-offs, and overall make it easier for the client to compare the services they want with the services they need. Especially as cheap labor and outsourcing become more and more prominent, it’s important to convey to clients why costs are higher and what levels of service they include, which keeps the offering competitive.
As Mad-Eye Moody pounded into our brains, “Constant Vigilance”. Keep the SLA up to date with clients to ensure relevance and the expected level of service and focus. SLAs aren’t automated yet, so they are something that organizations, both providers and customers, must continually reinforce and maintain.
Now this man, he has been my grail of knowledge and wisdom. On his blog about being a CIO in healthcare, he talks about SLAs and even attaches one for viewing. Feel free to view the other documents he mentions, but if you are interested, download it and take a gander at what an SLA in healthcare looks like.
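The two components above can be sketched as a small data model. This is only an illustrative sketch in Python: the class names (`Metric`, `ServiceSection`, `ManagementSection`), every field, and the sample values are hypothetical choices of mine, not taken from any real SLA template.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class Metric:
    name: str
    target: float                  # the agreed service level
    unit: str
    rationale: str                 # the justified reason for choosing this metric
    higher_is_better: bool = True  # False for things like recovery time

    def is_met(self, measured: float) -> bool:
        """Check a measured value against the agreed target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


@dataclass
class ServiceSection:
    services_provided: List[str]
    responsibilities: Dict[str, List[str]]  # party name -> duties
    conditions: List[str]


@dataclass
class ManagementSection:
    reporting_method: str          # how the client is notified
    metrics: List[Metric]
    review_interval_days: int      # the "review the SLA" clause
    last_reviewed: date

    def review_due(self, today: date) -> bool:
        """True once the agreed review interval has elapsed."""
        return today >= self.last_reviewed + timedelta(days=self.review_interval_days)


if __name__ == "__main__":
    # Sample values are made up purely for illustration.
    availability = Metric("availability", 99.9, "%", "clients need the system during clinic hours")
    recovery = Metric("recovery time", 4, "hours", "limits disruption after a failure",
                      higher_is_better=False)
    management = ManagementSection("monthly email report", [availability, recovery],
                                   review_interval_days=365, last_reviewed=date(2024, 1, 1))
    print(availability.is_met(99.95), recovery.is_met(6))
    print(management.review_due(date(2025, 6, 1)))
```

Keeping the rationale next to each metric, and keeping the list of metrics short, mirrors the "less is more" point: every number that is tracked should have a reason for being there.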
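As a rough illustration of tiered service levels, the Python sketch below picks the cheapest tier that still covers what a client actually needs. The tier names, service levels, and prices are all made-up numbers for the example, and `cheapest_sufficient_tier` is a hypothetical helper, not a standard function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    availability_percent: float
    max_recovery_hours: float
    monthly_cost: float  # made-up price, for illustration only


# Hypothetical tiers; a real provider would define its own.
TIERS = [
    Tier("Bronze", 99.0, 24, 500),
    Tier("Silver", 99.9, 8, 1500),
    Tier("Gold", 99.99, 1, 5000),
]


def cheapest_sufficient_tier(tiers: List[Tier],
                             needed_availability: float,
                             needed_recovery_hours: float) -> Optional[Tier]:
    """Return the lowest-cost tier meeting both needs, or None if none does."""
    candidates = [
        t for t in tiers
        if t.availability_percent >= needed_availability
        and t.max_recovery_hours <= needed_recovery_hours
    ]
    return min(candidates, key=lambda t: t.monthly_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_sufficient_tier(TIERS, needed_availability=99.5,
                                      needed_recovery_hours=12)
    print(choice.name if choice else "no tier fits; negotiate a custom SLA")
```

Laying the tiers out side by side like this is what lets a client see the cost trade-off between what they want and what they need.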
**Note** I do realize that there is a subset of the IT industry that believes service level agreements can drive the wrong business outcomes. I believe the idea is correct, and that the flaw may lie in an implementation and presentation of the SLA that turns people away.
I firmly believe that an SLA can be structured in a way that helps with cost containment and effectiveness while achieving quick results at high quality and reliability. Some people believe that it is just a legal triviality that lets a company shirk responsibility and have a scapegoat written down in case of the typical “sh*t hits the fan” scenario, but as I mentioned above, the SLA should be an opportunity for both the client and provider to consistently meet service levels, regardless of changing needs. This is why I mentioned that there should be a mechanism for changing service levels in order to adapt to the client’s needs as well as the environment. An SLA should describe minimum levels of service, not be an excuse to provide only minimum levels of service; it should strive for targeted business expectation achievements. Due to costs, organizations can fall back on the SLA and just get by with the bare necessities in order to pull in more contracts, and although that might generate revenue, the organization could also be losing out on trust and customer retention. Static organizations can abuse the SLA by working only towards the “achievable” targets instead of working with clients to drive their needs while ensuring minimum levels of service. Although clients may be seen as cash cows, organizations should “address any issues and stabilize the environment” as well as prevent future issues from happening. It shouldn’t only be the client striving for improvement but the organization as well; in fact, it should be a symbiotic relationship for success. By improving itself, the organization can provide better services for the next client as well, thus increasing trust and reliability. I realize I may not understand the magnitude of the costs, time, and resources invested for the scope, but I believe it should be stated and worked towards.
Instead of a Service Level Agreement, it can spur a Service Level Achievement.