E-Mail Service Level Agreements (SLA’s)—How To Make Managing Your Messaging Systems More Efficient And Effective

Picture this scenario:
08:28 am, PST—San Francisco—A Vice President of Sales at a major worldwide manufacturing company gets an urgent phone call from his Director of European Operations in Dublin: the long-awaited deal with a German company has come through, but they have a few modifications to the contract that are show stoppers and must be approved personally by the CEO in San Francisco. If the deal is closed, an announcement can be made before the European markets close for the day, and the company stock will have the opportunity to jump and lock in a high price for a long holiday weekend. The CEO is about to depart at 9:00 in the corporate jet for another urgent meeting in New York that can push the stock even higher before the New York exchanges close. The VP says, “E-mail the contract over, I’ll get the boss to approve the changes as he’s headed out the door.”
08:29:58 am, PST—The Director of European Operations sends a 3.5MB file attachment (the contract) via the corporate e-mail system. He regularly sends e-mail to the VP of Sales that arrives within a minute or two of when he sends it.
08:30:01 am, PST—A marketing department intern hits the “Send” button on the latest corporate “e-zine,” which has a 200K .pdf attachment and is addressed to all 100,000 existing customers worldwide. This is a routine procedure, which has been done every month for the past six months, and the intern is glad that the job is done, so she can leave early for the long weekend. IS does not know that Marketing has implemented this procedure.
08:45 am, PST—VP of Sales calls Director of European Operations in Dublin back, “Where’s the contract? I didn’t get it yet! He’s leaving in 15 minutes!” Director in Dublin resends 3.5MB file attachment via corporate e-mail system, clicking the “urgent” button to flag the message as very important. Director of European Operations stays on the phone to make sure the message gets through.
08:50 am, PST—Desperation in his voice and nothing in his e-mail inbox, the VP of Sales tells the Director of European Operations to fax the changes, hangs up the phone, and places an urgent call to the IS director: “What’s wrong with the e-mail system? It’s going to cost us millions if I don’t get that e-mail in the next ten minutes!” This is the first the IS director has heard about a problem, and he calls his technicians to find out what’s happening.
08:58 am, PST—Marketing intern finishes packing up her things, satisfied that the first half of the messages have been sent successfully, and heads for the door, ready for the weekend.
08:59 am, PST—Corporate CEO glumly departs the office for the airport, angry, surprised, and upset that the deal with the Germans hasn’t come through, trying to figure out what he’s going to tell the group at his meeting in New York. The marketing intern passes him on the way out, giving him a cheery “Have a great weekend!” as she breezes by, blissfully ignorant of the chaos and destruction she has wrought. “Yeah, great weekend…” he mutters, also oblivious to the reason for his pain. Inside, he knows the stockholders will eat him for lunch because the stock didn’t jump as expected. “Maybe we need some new management around here that can close a deal when it really counts,” he says to himself, glowering menacingly at the VP of Sales following close on his heels.
09:10 am, PST—The IS director discovers that a queue of outgoing messages from marketing has essentially frozen all other outgoing and incoming traffic. After he purges the remaining messages in the queue, the contract from the Dublin office finally arrives in the VP of Sales’ e-mail inbox, too late for the CEO, and definitely too late for the deal to be closed that week. The IS director had noticed a slowdown with the mail every so often, but it had always mysteriously cleared up about 35 minutes after it started, and he had never tracked down the source of the problem. “I wonder what marketing is doing sending all that e-mail?” he asks himself.
This little vignette illustrates the problems inherent in an e-mail system that business people have come to rely upon for critical business functions, and the disasters that can occur when e-mail service levels do not meet users’ expectations. As messaging systems become more integrated with the business process and reach out to customers and suppliers, they become more critical to reaching corporate goals and can have a significant impact on corporate growth and profit, as our story shows.
It is clear that as the reliability of e-mail systems has grown, users have come full circle, and now expect the messaging system to “work like the phone.” You send a message, and the person on the other end is expected to receive it virtually instantly.
Despite the attitude of end users, e-mail isn’t free either. Current research shows the costs of hardware, software, and other infrastructure can be up to $550 per user per year for a typical e-mail installation.
For the e-mail manager, these facts create three critical needs: the need to measure the service level the way the user sees it, the need to make users aware of their role in improving and maintaining those service levels, and the need to manage cost levels carefully and understand their impact on the service levels.

Messaging Magazine, November/December 1999. By Bill McBride, Tally Systems Corporation.

The Messaging Environment
What is the environment in which we are trying to manage? It is frequently a moving target. The infrastructure is based on a set of very rapidly changing technologies. Hardware continues to double in performance and capacity on a regular and frequent basis; Moore’s Law is as true and as strong as ever. Software is moving to client/server architectures, thin clients, Java, Windows NT, and even Linux. Local Area Networks (LANs) and Wide Area Networks (WANs) continue to proliferate with an ever-changing set of protocols and capabilities. Although e-mail applications are moving to standardized systems like Exchange and Notes, issues such as outsourcing are becoming more prevalent. A lack of qualified, trained personnel to manage this change puts added pressure on existing staff. The bottom line: IT management now has an impact on corporate growth and profit, and is therefore held accountable for providing sufficient service levels to reach corporate goals.

Management Methods and Tools
So how do you go about setting and managing e-mail service levels? There’s a wide variety of tools to consider. Large frameworks like Tivoli, UniCenter, and OpenView, which can be enhanced to include messaging management reporting and alerting, are one approach, as are point products that can do reporting and alerting on specific aspects of messaging management. Network and system management tools that provide metrics and alerts on internal aspects of the messaging infrastructure are also available, showing everything from packet throughput to processor activity. The key is figuring out how to use one or more of these tools to do service level management in a way that supports business goals and not arbitrary metrics.

Use of SLA’s—Why They Are Good
By making performance level metrics quantitative and measuring them directly, you eliminate potential arguments over whether you have met your goals, not only with your internal management, but also with outside service providers. If you don’t measure it, you can’t manage it. By agreeing on quantitative service levels, you properly set user expectations, and can also define the level of expected usage, which directly impacts the level of service. Ultimately, the level of usage is completely under the control of the user community, a critical factor to consider when creating these agreements. A well-executed SLA also enables you to determine if your level of service is appropriate for reaching business goals, and allows you to make adjustments as necessary to compensate for inadequate or excess capacity. A Service Level Agreement can often turn out to be a continuous improvement program, rather than a source of anxiety and conflict between the users and those providing service.

How to Base Your SLA
The level of service provided is clearly the first priority, and should be quantitative and accurately measured to be effective. There must be an agreement by the users on the level of messaging traffic that they will generate. Limits on items like maximum attachment sizes and message storage should be well defined. Costs to users should also be defined and measured so they can be related to corporate goals and budgets. Ultimately, it is the business goals that should be driving service level needs.

Service Level Performance
It’s important to realize that there are several ways to measure the service level for a messaging system.
Service availability is the main metric. Mean time between failures (MTBF), mean time to repair (MTTR), and availability can all be measured and directly related to user satisfaction.
Service availability can be mathematically defined as:

Availability (%) = (1 - MTTR / MTBF) × 100
Using this formula, it’s easy to see that the longer it takes to repair a problem, the bigger the hit your availability will take, unless, of course, you have the unlikely situation where your e-mail system never fails.
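To make the formula concrete, here is a minimal Python sketch that applies it exactly as written above. The function name and the sample figures (one failure every 500 hours on average) are illustrative assumptions, not measurements from any real installation.

```python
def availability_percent(mttr_hours: float, mtbf_hours: float) -> float:
    """Availability (%) = (1 - MTTR / MTBF) x 100, as defined in the article."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    return (1.0 - mttr_hours / mtbf_hours) * 100.0


# Hypothetical figures: the system fails once every 500 hours on average.
MTBF_HOURS = 500.0
for mttr in (0.5, 1.0, 4.0):
    pct = availability_percent(mttr, MTBF_HOURS)
    print(f"MTTR {mttr:>4} h -> {pct:.2f}% available")
```

With the failure rate held fixed, every extra hour of repair time lowers the availability figure by the same amount, which is why repair time is usually the first thing to attack.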
Delivery times are another key metric. Metrics on average delivery times, maximum delivery times, and the statistical distribution of delivery times can quantify the service level in a way that not only can be measured, but also corresponds to your users’ perspective.
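As a rough sketch of how those delivery-time figures might be summarized, the Python below computes the mean, median, 90th percentile, and maximum of a list of delivery times. The sample values, the choice of the 90th percentile, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import statistics


def delivery_time_summary(seconds: list[float], percentile: int = 90) -> dict[str, float]:
    """Summarize delivery times (in seconds) for an SLA report."""
    if not seconds:
        raise ValueError("need at least one delivery time")
    ordered = sorted(seconds)
    # Nearest-rank percentile, using integer math to avoid rounding surprises.
    rank = max(1, (len(ordered) * percentile + 99) // 100)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        f"p{percentile}": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical sample: nine ordinary deliveries and one that sat in a queue.
sample = [40, 55, 60, 65, 70, 75, 80, 90, 120, 2400]
for name, value in delivery_time_summary(sample).items():
    print(f"{name:>6}: {value} s")
```

In this sample the single 40-minute delivery drags the mean far above the median, which is one reason to report the distribution and the maximum rather than the average alone.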
Another useful metric is the distribution of outage times (the repair times behind MTTR, as defined above) taken on their own. Surprisingly, a short outage of a mail server can often go completely unnoticed by a group of users, so long as service is restored relatively quickly. The pain grows very rapidly as the downtime lengthens, however, as the formula shows. A one-hour outage will almost always be noticed.

Making SLAs Quantitative, With Metrics You Can Measure
A key point to make about SLAs is that to be effective, all the metrics must be directly measured and openly reported. If you can’t measure it, don’t put it in the SLA, because it will only lead to frustrating arguments. Once again, be sure to relate the metrics directly to business goals: SLAs are designed to support the business (do you get the hint yet?).
So, how do you ensure that the metrics you have chosen are measurable, and support the business?
There are three classes of metrics that relate to messaging service levels, and that can be directly and easily measured:
1. Usage metrics on how much messaging traffic has been handled;
2. Real-time performance metrics that measure delivery performance, availability, etc.;
3. Server-specific metrics like log-in times, disk usage, queue lengths, and so forth.

Metrics—Usage
Usage can be measured in many ways and categorized in multiple dimensions. Some of the key metrics include:

  • Server-to-server traffic. This might even be part of a service level agreement with the network services group.
  • Traffic by department, site, project, or other group.
  • Traffic by individual user.
  • Message sizes and their distribution, including maximum message size.

Further use can be made of this data by storing it over a longer period of time and doing trend analysis. These trends are useful for more than SLAs: they can also be used for infrastructure planning and justifying future messaging budgets.
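The usage breakdowns listed above can be sketched in a few lines of Python. The record layout (group, message size in bytes, recipient count) is hypothetical and not drawn from any real mail server log, and counting one delivered copy per recipient is a simplifying assumption. The two sample records reuse the figures from the opening scenario, treating 200K as 200,000 bytes.

```python
from collections import defaultdict


def traffic_by_group(records):
    """Sum delivered messages and bytes per sending group.

    Each record is (group, message_size_bytes, recipient_count).
    """
    totals = defaultdict(lambda: {"messages": 0, "bytes": 0})
    for group, size_bytes, recipients in records:
        # Assumption: one copy is delivered (and counted) per recipient.
        totals[group]["messages"] += recipients
        totals[group]["bytes"] += size_bytes * recipients
    return dict(totals)


def largest_message(records):
    """Return the size in bytes of the largest single message seen."""
    return max(size_bytes for _, size_bytes, _ in records)


records = [
    ("sales", 3_500_000, 1),          # the 3.5MB contract, one recipient
    ("marketing", 200_000, 100_000),  # the 200K e-zine, 100,000 recipients
]
for group, t in sorted(traffic_by_group(records).items()):
    print(f"{group:<10} {t['messages']:>7} messages {t['bytes'] / 1e9:>8.4f} GB")
print("largest message:", largest_message(records), "bytes")
```

Seen this way, the e-zine is small per message but amounts to roughly 20 GB of outbound traffic, which is the kind of figure a usage report would surface before the queue does.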

Metrics—Real Time Performance Monitoring
As mentioned before, alerts and alarms can help support staff to correct and prevent problems more quickly and efficiently. By monitoring performance in real time, you are not only documenting your performance in a measurable way, but also working directly to improve and assure compliance with the service agreement.
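A threshold check is one simple way such alerts might be raised. The Python sketch below is an illustration only: the metric names, the warning and critical limits, and the sample reading are all invented, and a real monitoring product would supply its own.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # name of the measured value (hypothetical names below)
    warn: float      # raise a warning at or above this value
    critical: float  # raise a critical alert at or above this value


def evaluate(sample: dict[str, float], thresholds: list[Threshold]) -> list[tuple]:
    """Return (severity, metric, value) for every threshold the sample breaches."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric was not reported in this sample
        if value >= t.critical:
            alerts.append(("CRITICAL", t.metric, value))
        elif value >= t.warn:
            alerts.append(("WARNING", t.metric, value))
    return alerts


# Hypothetical limits and a hypothetical reading taken mid-mailing.
limits = [
    Threshold("queue_length", warn=500, critical=5000),
    Threshold("delivery_seconds", warn=120, critical=600),
]
reading = {"queue_length": 50_000, "delivery_seconds": 90}
for severity, metric, value in evaluate(reading, limits):
    print(f"{severity}: {metric} = {value}")
```

A check like this on outbound queue length would have told the IS director in the opening scenario about the backlog well before the phone rang.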

Metrics—Server Specific
Metrics can also be collected from the servers that form parts of the messaging system. These can include things such as:

  • Queue lengths
  • Mailbox sizes
  • Log-in times

While this data is useful from an administrative and infrastructure management perspective, it’s easy to get carried away with numbers that are only of interest to the administrator in charge of the machine. Remember, the key to service is to ensure that, like the old Pony Express, the mail gets through on time, meeting business needs.

SLA Reports
Okay, you know what you’re going to measure, and you’re out there measuring it. Now what do you do with the data? Using your metrics, SLA reports should be produced and published regularly. These should contain, at a minimum, the agreed goals and commitments vs. actual performance, in quantitative terms. Both service level performance and usage should be measured to determine how both parties to the agreement are doing. Reports should be tailored to your specific needs by combining the data from all three of the available sources: usage metrics, real-time performance metrics, and server-specific metrics. Ultimately, the nature of the SLA will determine the mix of data, and the importance placed on each of the sources of data.
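A goals-versus-actuals report of the kind described above could be assembled as in the Python sketch below. The metric names, targets, and measured values are invented for illustration. Each target is marked as a floor ("min") or a ceiling ("max") so that provider commitments and user commitments can share one report.

```python
def sla_report(targets: dict, actuals: dict) -> list[tuple]:
    """Compare agreed targets with measured values.

    targets maps metric -> (kind, goal), where kind is "min" (actual must be
    at least the goal) or "max" (actual must not exceed the goal).
    Returns rows of (metric, goal, actual, status).
    """
    rows = []
    for metric, (kind, goal) in targets.items():
        actual = actuals[metric]
        met = actual >= goal if kind == "min" else actual <= goal
        rows.append((metric, goal, actual, "met" if met else "MISSED"))
    return rows


# Hypothetical agreement: two provider commitments and one user commitment.
targets = {
    "availability_pct": ("min", 99.5),
    "p90_delivery_seconds": ("max", 120),
    "messages_per_day": ("max", 50_000),  # usage the users agreed to stay under
}
actuals = {
    "availability_pct": 99.8,
    "p90_delivery_seconds": 95,
    "messages_per_day": 150_000,
}
for metric, goal, actual, status in sla_report(targets, actuals):
    print(f"{metric:<22} goal {goal:>9} actual {actual:>9} {status}")
```

Here the provider met both of its commitments while the users exceeded their agreed volume, which is the two-sided view the report is meant to give.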
Extensions of these basic SLA reports might include reports derived from the basic data over time, including performance modeling and chargeback reports. These two reports deserve a little further explanation:

Performance Modeling Reports
Performance modeling, in this context, is meant to tie key metrics like delivery time to controllable items that typically cost money, like server capacity and LAN/WAN bandwidth. This can be done through reports like delivery time vs. usage level or queue length. This kind of model can also be used to proactively manage growth. A further analysis that can be done with these kinds of tools and reports is to look at availability vs. usage levels, time of day, and server/configuration. These reports allow you to plug a variety of scenarios into real data from your messaging enterprise, and answer all of those “what if” questions that might arise when planning or budgeting for future infrastructure.
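One very simple form of a delivery time vs. usage model is a straight-line fit, sketched below in Python with invented data points. The linear relationship is an assumption made for illustration; real mail systems tend to degrade much faster than linearly as they approach saturation, so a fit like this is only reasonable well inside the measured range.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations: usage in thousands of messages per hour,
# and the average delivery time in seconds measured at that usage level.
usage = [10, 20, 30, 40]
delivery = [30, 50, 70, 90]

slope, intercept = fit_line(usage, delivery)
what_if_usage = 60  # "what if" traffic grows to 60,000 messages per hour?
predicted = slope * what_if_usage + intercept
print(f"about {predicted:.0f} s average delivery at {what_if_usage}k messages/hour")
```

The same fit can be repeated with queue length or bandwidth on the horizontal axis to see which resource delivery time is most sensitive to.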

Chargeback Reports
Chargeback is a way to have user groups pay for the messaging services they receive and use, either directly or indirectly. This allows users to plan their service levels and budget accordingly, and also allows those responsible for allocating money to messaging resources to determine whether groups or departments are “paying the freight” for their messaging use. It also motivates users to use the systems properly to keep within budget, much in the same way that billing for long-distance telephone use prevents callers from making idle small-talk calls to their friends in Australia.
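As one possible chargeback scheme, the Python sketch below splits a total messaging cost across groups in proportion to their measured usage. Proportional allocation is only one of several ways to do this and is assumed here for illustration. The 1,000-user headcount and the per-department traffic figures are invented, and the $550 per user per year is the upper-end cost quoted earlier in the article.

```python
def chargeback(total_cost: float, usage_by_group: dict[str, float]) -> dict[str, float]:
    """Allocate total_cost to groups in proportion to their usage."""
    total_usage = sum(usage_by_group.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        group: round(total_cost * usage / total_usage, 2)
        for group, usage in usage_by_group.items()
    }


# Hypothetical installation: 1,000 users at up to $550 per user per year.
annual_cost = 1_000 * 550

# Hypothetical annual traffic per department, in gigabytes.
usage_gb = {"sales": 300, "marketing": 600, "engineering": 100}

for group, charge in chargeback(annual_cost, usage_gb).items():
    print(f"{group:<12} ${charge:>12,.2f}")
```

Under this scheme a department that generates 60 percent of the traffic carries 60 percent of the cost, which gives it a direct budget reason to plan large mailings.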

Conclusions
As users have come to expect their messaging systems to perform reliably on demand, there is potentially a gap between user expectations and the ability of the messaging service provider to meet those expectations.
SLAs are an effective way to communicate with users and to develop a joint effort and commitment to manage their messaging systems to meet the need that matters most: the successful conduct of business. As we saw from our scenario in the beginning, the risks of not putting service level agreements in place for e-mail are great, and potentially very costly. The good news is that there are software and tools available today that can help establish and maintain effective SLAs, and prevent your company from experiencing an unexpected divergence between expectations and reality that can cost you business, and maybe your job.