Picture this scenario:
08:28 am, PST, San Francisco: A Vice President of Sales at a major
worldwide manufacturing company gets an urgent phone call from his Director
of European Operations in Dublin. The long-awaited deal with a German company
has come through, but the Germans want a few modifications to the contract.
These are show stoppers that must be approved personally by the CEO in San Francisco.
If the deal is closed, an announcement can be made before the European
markets close for the day, and the company stock will have the opportunity
to jump and lock in a high price for a long holiday weekend. The CEO
is about to depart at 9:00 in the corporate jet for another urgent meeting
in New York that can push the stock even higher before the New York exchanges
close. The VP says, “E-mail the contract over, I’ll get the boss to approve
the changes as he’s headed out the door.”
08:29:58 am, PST: The Director of European Operations sends a 3.5MB
file attachment (the contract) via the corporate e-mail system. His e-mail
to the VP of Sales normally arrives within a minute or two of when he
sends it.
08:30:01 am, PST: A marketing department intern hits the “Send”
button on the latest corporate “e-zine,” which has a 200KB PDF attachment
and is addressed to all 100,000 existing customers worldwide. This is
a routine procedure that has been done every month for the past six
months, and the intern is glad that the job is done so she can leave
early for the long weekend. IS does not know that Marketing has implemented
this procedure.
08:45 am, PST: The VP of Sales calls the Director of European Operations
in Dublin back: “Where’s the contract? I didn’t get it yet! He’s leaving
in 15 minutes!” The Director resends the 3.5MB file attachment via the
corporate e-mail system, clicking the “urgent” button to flag the message
as very important, and stays on the phone to make
sure the message gets through.
08:50 am, PST: Desperation in his voice and nothing in his e-mail
inbox, the VP of Sales tells the Director of European Operations to fax
the changes, hangs up the phone, and places an urgent call to the IS director:
“What’s wrong with the e-mail system? It’s going to cost us millions if
I don’t get that e-mail in the next ten minutes!” This is the first the
IS director has heard about a problem, and he calls his technicians to
find out what’s happening.
08:58 am, PST: The marketing intern finishes packing up her things,
satisfied that the first half of the messages have been sent successfully,
and heads for the door, ready for the weekend.
08:59 am, PST: The CEO glumly departs the office for the airport,
angry, surprised, and upset that the deal with the Germans hasn’t come
through, trying to figure out what he’s going to tell the group at his
meeting in New York. The marketing intern passes him on the way out, giving
him a cheery “Have a great weekend!” as she breezes by, blissfully ignorant
of the chaos and destruction she has wrought. “Yeah, great weekend…” he
mutters, also oblivious to the reason for his pain. Inside, he knows the
stockholders will eat him for lunch because the stock didn’t jump as expected.
“Maybe we need some new management around here that can close a deal
when it really counts,” he says to himself, glowering menacingly at the
VP of Sales following close on his heels.
09:10 am, PST: The IS director discovers that a queue of outgoing
messages from Marketing has essentially frozen all other outgoing and incoming
traffic. After he purges the remaining messages in the queue, the contract
from the Dublin office finally arrives in the VP of Sales’ e-mail inbox,
too late for the CEO, and definitely too late for the deal to be closed
that week. The IS director had noticed a slowdown with the mail every
so often, but it had always mysteriously cleared up about 35 minutes after
it started, and he had never tracked down the source of the problem.
“I wonder what Marketing is doing sending all that e-mail?” he asks himself.
This little vignette illustrates the problems inherent in an e-mail system
that business people have come to rely upon to perform critical business
functions, and the disasters that can occur if e-mail service
levels do not meet users’ expectations. As messaging systems become
more integrated with the business process and reach out to customers and
suppliers, they become more critical to reaching corporate goals and can
have a significant impact on corporate growth and profit, as our story shows.
It is clear that as the reliability of e-mail systems has grown, users have
come full circle, and now expect the messaging system to “work like the
phone.” You send a message, and the person on the other end is expected
to receive it, virtually instantly.
Despite the attitude of end users, e-mail isn’t free either. Current research
shows the costs of hardware, software, and other infrastructure can be
up to $550 per user per year for a typical e-mail installation.
For the e-mail manager, these facts create three critical needs: the
need to measure the service level the way the user sees it, the need
to make users aware of their role in improving and maintaining those service
levels, and the need to manage cost levels carefully and understand their
impact on the service levels.

Messaging Magazine, November/December 1999
E-Mail Service Level Agreements (SLAs): How to Make Managing Your Messaging
Systems More Efficient and Effective
By Bill McBride, Tally Systems Corporation
The Messaging Environment
What is the environment in which we are trying to manage? It is frequently
a moving target. The infrastructure is based on a set of very rapidly
changing technologies. Hardware continues to double in performance and
capacity on a regular and frequent basis; Moore’s Law is as true and as
strong as ever. Software is moving to client/server architectures, thin
clients, Java, Windows NT, and even Linux. Local Area Networks (LANs)
and Wide Area Networks (WANs) continue to proliferate with an ever-changing
set of protocols and capabilities. Although e-mail applications are moving
to standardized systems like Exchange and Notes, issues such as outsourcing
are becoming more prevalent. A lack of qualified, trained personnel to manage
this change puts added pressure on existing staff. The bottom line: IT
management now has an impact on corporate growth and profit, and is therefore
held accountable for providing sufficient service levels to reach corporate
goals.
Management Methods and Tools
So how do you go about setting and managing e-mail service levels? There’s
a wide variety of tools to consider. Large frameworks like Tivoli, UniCenter,
and OpenView, which can be enhanced to include messaging management reporting
and alerting, are one approach, as are point products that report
and alert on specific aspects of messaging management. Network and
system management tools that provide metrics and alerts on internal aspects
of the messaging infrastructure are also available, showing everything
from packet throughput to processor activity. The key is figuring out
how to use one or more of these tools to do service level management in
a way that supports business goals and not arbitrary metrics.
Use of SLAs: Why They Are Good
By making performance level metrics quantitative and measuring them directly,
you eliminate potential arguments over whether you have met your goals,
not only with your internal management, but also with outside service
providers. If you don’t measure it, you can’t manage it. By agreeing
on quantitative service levels, you properly set user expectations, and
can also define the level of expected usage, which directly impacts the
level of service. Ultimately, the level of usage is completely under the
control of the user community, a critical factor to consider when creating
these agreements. A well-executed SLA also enables you to determine whether
your level of service is appropriate for reaching business goals, and
allows you to make adjustments as necessary to compensate for inadequate
or excess capacity. A Service Level Agreement can often turn out to be
a continuous improvement program, rather than a source of anxiety and
conflict between the users and those providing service.
How to Base Your SLA
The level of service provided is clearly the first priority, and should
be quantitative and accurately measured to be effective. There must be
an agreement by the users on the level of messaging traffic that they
will generate. Limits on items like maximum attachment sizes and message
storage should be well defined. Costs to users should also be defined
and measured so they can be related to corporate goals and budgets. Ultimately,
it is the business goals that should be driving service level needs.
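To make such limits concrete, here is a minimal Python sketch of a check that a planned mailing stays within agreed limits. The limit values and the function name check_mailing are hypothetical, chosen only for illustration; the two sample mailings use the sizes from the opening scenario.

```python
# Hypothetical SLA limits; real values would come from the negotiated agreement.
MAX_ATTACHMENT_KB = 5_000            # largest attachment a user may send
MAX_RECIPIENTS_PER_MAILING = 1_000   # larger mailings must be scheduled with IS


def check_mailing(recipients, attachment_kb):
    """Return a list of the agreed limits that a planned mailing would exceed."""
    problems = []
    if attachment_kb > MAX_ATTACHMENT_KB:
        problems.append(
            f"attachment of {attachment_kb:,} KB exceeds {MAX_ATTACHMENT_KB:,} KB"
        )
    if recipients > MAX_RECIPIENTS_PER_MAILING:
        problems.append(
            f"{recipients:,} recipients exceeds {MAX_RECIPIENTS_PER_MAILING:,}"
        )
    return problems


# The two messages from the opening scenario.
contract = check_mailing(recipients=1, attachment_kb=3_500)
ezine = check_mailing(recipients=100_000, attachment_kb=200)

print("contract:", contract or "within limits")
print("e-zine:", ezine or "within limits")
# Total volume the e-zine places on the outgoing queue, in KB.
print(f"e-zine volume: {100_000 * 200:,} KB")
```

Under these assumed limits the contract passes and the e-zine is flagged, which is the kind of early warning the IS director in the scenario never received.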
Service Level Performance
It’s important to realize that there are several ways to measure the
service level for a messaging system.
Service availability is the main metric. Mean time between failures (MTBF),
mean time to repair (MTTR), and availability can all be measured and directly
related to user satisfaction.
Service availability can be mathematically defined as:
Availability (%) = (1 - MTTR/MTBF) × 100
Using this formula, it’s easy to see that the longer it takes to repair
a problem, the bigger the hit your availability will take (unless, of
course, you have the unlikely situation where your e-mail system never
fails).
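As a worked illustration of the formula, the following Python sketch computes availability for several repair times. The 720-hour MTBF (roughly one failure a month) and the function name are assumptions made for the example, not figures from any real system.

```python
def availability_percent(mttr_hours, mtbf_hours):
    """Availability (%) = (1 - MTTR/MTBF) x 100, as defined in the text."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttr_hours < 0 or mttr_hours > mtbf_hours:
        raise ValueError("MTTR must be between 0 and MTBF")
    return (1.0 - mttr_hours / mtbf_hours) * 100.0


# Hypothetical server that fails about once every 30 days (720 hours).
MTBF_HOURS = 720.0
for mttr in (0.5, 1.0, 4.0, 8.0):
    pct = availability_percent(mttr, MTBF_HOURS)
    print(f"MTTR {mttr:>4} h -> {pct:.3f}% available")
```

With the failure rate held fixed, every extra hour of repair time lowers availability by the same amount, so repair speed is the lever the support staff controls most directly.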
Delivery times are another key metric. Metrics on average delivery times,
maximum delivery times, and the statistical distribution of delivery
times can quantify the service level in a way that not only can be measured,
but also corresponds to your users’ perspective.
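The delivery-time metrics described above could be summarized along these lines. This is only a sketch: the sample times are invented, and the nearest-rank 95th percentile is one common convention among several, not something the article prescribes.

```python
import math
import statistics


def delivery_summary(delivery_seconds):
    """Summarize delivery times (in seconds) for one reporting period."""
    if not delivery_seconds:
        raise ValueError("need at least one delivery time")
    ordered = sorted(delivery_seconds)
    # Nearest-rank 95th percentile: the smallest value with at least 95%
    # of messages at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical sample: nine normal deliveries and one stuck behind a bulk mailing.
sample = [40, 55, 60, 62, 70, 75, 90, 95, 120, 2400]
for name, value in delivery_summary(sample).items():
    print(f"{name:>6}: {value:.1f} s")
```

A single delayed message barely moves the median but dominates the mean and the maximum, which is why the distribution says more about what users experience than the average alone.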
Another metric that can be used is the distribution of the outage times
themselves, that is, the individual repair times behind the MTTR defined
above. Surprisingly, a short outage
of a mail server can often go completely unnoticed by a group of
users, so long as service is restored relatively quickly. The pain
grows rapidly as the downtime lengthens, however, as the formula shows.
A one-hour outage will almost always be noticed.
Making SLAs Quantitative, With Metrics You Can Measure
A key point to make about SLAs is that to be effective, all the metrics
must be directly measured and openly reported. If you can’t measure it,
don’t put it in the SLA, because it will only lead to frustrating arguments.
Once again, be sure to relate the metrics directly to business goals:
SLAs are designed to support business (do you get the hint yet?).
So, how do you ensure that the metrics you have chosen are measurable
and support the business?
There are three classes of metrics that relate to messaging service levels
and that can be directly and easily measured:
1. Usage metrics on how much messaging traffic has been handled;
2. Real-time performance metrics that measure delivery performance, availability,
etc.;
3. Server-specific metrics like log-in times, disk usage, queue lengths,
and so forth.
Metrics: Usage
Usage can be measured in many ways and categorized in multiple dimensions.
Some of the key metrics include:
- Server-to-server traffic. This might even be part of a service level agreement
with the network services group.
- Traffic by department, site, project, or other group.
- Measuring by individual user.
- Measuring message sizes and their distribution, including maximum message size.
A further use can be made of this data by storing it over a longer period
of time and doing trend analysis. These trends are useful for more than
SLAs: they can also be used for infrastructure planning and justifying
future messaging budgets.
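As a rough sketch of how usage data might be rolled up by group and by month for trend analysis, consider the Python below. The record layout, department names, and figures are all hypothetical; a real system would read them from the mail server's logs or a reporting tool.

```python
from collections import defaultdict

# Hypothetical usage records: (month, department, messages sent, total bytes).
records = [
    ("1999-09", "Marketing", 101_200, 20_300_000_000),
    ("1999-09", "Sales", 8_400, 1_900_000_000),
    ("1999-10", "Marketing", 103_500, 21_000_000_000),
    ("1999-10", "Sales", 9_100, 2_200_000_000),
]


def traffic_by_department(rows):
    """Total messages and bytes per department across all months."""
    totals = defaultdict(lambda: {"messages": 0, "bytes": 0})
    for _month, dept, messages, size in rows:
        totals[dept]["messages"] += messages
        totals[dept]["bytes"] += size
    return dict(totals)


def monthly_bytes(rows):
    """Total bytes per month, in chronological order, for trend analysis."""
    totals = defaultdict(int)
    for month, _dept, _messages, size in rows:
        totals[month] += size
    return dict(sorted(totals.items()))


for dept, t in traffic_by_department(records).items():
    print(f"{dept:<10} {t['messages']:>9,} messages {t['bytes']:>16,} bytes")
for month, size in monthly_bytes(records).items():
    print(f"{month} {size:>16,} bytes")
```

Stored over many months, the per-month totals become the raw material for the trend analysis and budget justification mentioned above.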
Metrics: Real-Time Performance Monitoring
As mentioned before, alerts and alarms can help support staff correct
and prevent problems more quickly and efficiently. By monitoring performance
in real time, you are not only documenting your performance in a measurable
way, but also working directly to improve and assure compliance with
the service agreement.
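A real-time alert can be as simple as comparing each new measurement against an agreed threshold. The sketch below assumes hypothetical metric names and limits; it is not the interface of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str    # name of the measurement, e.g. "queue_length"
    limit: float   # alert when the measured value rises above this


def check_alerts(sample, thresholds):
    """Return an alert message for every metric in the sample above its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"ALERT: {t.metric}={value} exceeds limit {t.limit}")
    return alerts


# Hypothetical limits and one measurement taken during a bulk mailing.
thresholds = [Threshold("queue_length", 500), Threshold("delivery_seconds", 120)]
sample = {"queue_length": 48_000, "delivery_seconds": 1_800}

for message in check_alerts(sample, thresholds):
    print(message)
```

Run every few minutes, a check like this would have told the IS director about the backed-up queue long before the VP of Sales called.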
Metrics: Server Specific
Metrics can also be collected from servers that form parts of the messaging
system. These can include things such as:
- Queue lengths
- Mailbox sizes
- Log-in times
While this data is useful from an administrative and infrastructure
management perspective, it’s easy to get carried away with numbers that
are only of interest to the administrator in charge of the machine.
Remember, the key to service is to ensure that like the old Pony Express,
the mail gets through on time, meeting business needs.
SLA Reports
Okay, you know what you’re going to measure, and you’re out there measuring
it. Now what do you do with the data? Using your metrics, SLA reports
should be produced and published regularly. These should contain, at a minimum,
the agreed goals and commitments vs. actual performance, in quantitative
terms. Both service level performance and usage should be measured to
determine how both parties to the agreement are doing. Reports should
be tailored to your specific needs by combining the data from all three
of the available sources: usage metrics, real-time performance metrics,
and server-specific metrics. Ultimately, the nature of the SLA will determine
the mix of data, and the importance placed on each of the sources of data.
Extensions of these basic SLA reports might include reports derived from
the basic data over time, including performance modeling and chargeback
reports. These two reports deserve a little further explanation:
Performance Modeling Reports
Performance modeling, in this context, is meant to tie key metrics like
delivery time to controllable items that typically cost money, like server
capacity and LAN/WAN bandwidth. This can be done through reports like
delivery time vs. usage level or queue length. This kind of model can
also be used to proactively manage growth. A further analysis that can
be done with these kinds of tools and reports is to look at availability
vs. usage levels, time of day, and server/configuration. These reports
allow you to plug a variety of scenarios into real data from your messaging
enterprise, and answer all of those “what if” questions that might arise
when planning or budgeting for future infrastructure.
Chargeback Reports
Chargeback is a way to have user groups pay for the messaging services
they receive and use, either directly or indirectly. This allows users
to plan their service levels and budget accordingly, and also lets those
responsible for allocating money to messaging resources determine
whether groups or departments are “paying the freight” for their messaging
use. It also motivates users to use the systems properly to keep within
budget, much in the same way that billing for long-distance telephone
use prevents callers from making idle small-talk calls to their friends
in Australia.
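A simple chargeback scheme splits the total messaging cost among groups in proportion to their measured usage. The sketch below assumes 1,000 users at the $550 per user per year figure cited earlier; the department names and message counts are hypothetical, and proportional allocation is only one of several possible schemes.

```python
def chargeback(total_cost, usage_by_group):
    """Split total_cost among groups in proportion to their measured usage."""
    total_usage = sum(usage_by_group.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        group: total_cost * usage / total_usage
        for group, usage in usage_by_group.items()
    }


# Assumption: 1,000 users at $550 per user per year.
annual_cost = 1_000 * 550

# Hypothetical messages sent per year by each department.
usage = {"Marketing": 600_000, "Sales": 300_000, "Engineering": 100_000}

charges = chargeback(annual_cost, usage)
for group, amount in charges.items():
    print(f"{group:<12} ${amount:>12,.2f}")
```

A per-message allocation like this rewards groups that keep traffic down; allocating by bytes instead would put more weight on large attachments.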
Conclusions
As users have come to expect their messaging systems to perform reliably
on demand, there is potentially a gap between user expectations and the
ability of the messaging service provider to meet those expectations.
SLAs are an effective way to communicate with users to develop a joint
effort and commitment to manage their messaging systems to meet the needs
that are the most important: the successful conduct of business. As we
saw from our scenario in the beginning, the risks of not putting service
level agreements in place for e-mail are great, and potentially very costly.
The good news is that there are software and tools available today that
can help establish and maintain effective SLAs, and prevent your company
from experiencing an unexpected divergence between expectations and reality
that can cost you business, and maybe your job.