A Time Warner Migration Strategy
(Originally published in Messaging Magazine, July/August 1998)

By Raphael Freiwirth, Time Warner, Inc.

The following saga will present our approach to switching over from one e-mail domain to another. Although it is specific in nature, the ideas and concepts we follow are our general format for a migration. We follow this type of process regardless of whether it is a hardware, software or application migration. Instead of presenting the detail in a dry format, I am using a specific migration to highlight the steps we took and the issues we had.

Migration Weekend was approaching. The plans were made: we had a complete schedule and process document. We had tested the process in our test lab environment. We racked our brains to remember every small detail that could "do us in". This particular migration would modify both messaging paths and directory processing, as it seems all changes do these days. Our task was to help a user community migrate from file-based e-mail PC and MAC, connected to our corporation’s messaging infrastructure, to one in which they would be viewed (from our "observation point") as being in a Client/Server messaging cloud. At the same time, we would "enhance" our infrastructure to allow for a single domain connection to that user community by setting up our own Client/Server messaging server. This would simplify the architecture and design, both for their upcoming migration of users and for our goal of focusing on a limited number of messaging services in our messaging infrastructure. You might have already concluded that we have a number of different e-mail messaging environments. The mission is to get to a small number (one would be great, two is likely).

We have set up our current "messaging infrastructure" environment in our test lab. Desktop clients, multiple product servers, Unix servers, routers and other paraphernalia are assembled and configured. This takes time and energy and much grumbling. If you don’t have documentation as to why you had something set up in production in a particular configuration, chances are that you are doomed to follow this configuration for no apparent good reason. Try to change it, and Murphy (Murphy’s Law, we’re on a first name basis) will find you and laugh at you. All messaging paths appear to work; we test them to the extreme with multiple replies between multiple domains. We have gone to the elaborate detail of providing a test Domain Name System (DNS) reachable from outside so that we can test our entire messaging flow. All the addresses are viable and messages to them are delivered.

We are ready, or are we really? To get to this point, we had to get as much done as possible, in as little time as necessary, at a small cost. Not possible, you say! Well, we do it all the time, but don’t really look at it from that angle. Our group constantly approaches these issues with the 80/20 rule firmly in place. We will work hard and efficiently to make sure that 80% of services are tested and functioning. We will give a cursory look at the remaining 20% of services. We also spend time getting our management tools running before and during our pilot phases for these new products. This helps us find problems immediately after cutover, before the users start complaining about them.

We then turned our attention to the "wiring" that was necessary in the production environment to make our systems communicate with each other. Without exception, this area consistently causes us to lose many hours during a cutover while we investigate problems. This can be as simple as routers not allowing certain protocols past them. It can also be extremely complex in that certain protocols require firewalls to open certain ports for them to operate. The complexity is in the way you have to work out how to do this without compromising the very security you are attempting to enforce.
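A quick way to catch this kind of "wiring" problem before the weekend is to probe each required path from the machine that will actually use it. The following is a minimal sketch in Python, not the tooling we used; the function names are hypothetical, and the hosts and ports you feed it would come from your own process document.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A refusal, a timeout or a name lookup failure all count as unreachable,
    which is what a blocking router or firewall typically looks like.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_paths(paths, timeout: float = 3.0) -> dict:
    """Check a list of (label, host, port) tuples; return label -> bool."""
    return {
        label: port_is_reachable(host, port, timeout)
        for label, host, port in paths
    }
```

For example, `check_paths([("SMTP to gateway", "mail-gw.example.com", 25)])` (a made-up host name) returns a dictionary that can be printed and attached to the process plan. Note that this only shows that a TCP connection can be opened; it says nothing about whether the protocol behind the port works end to end.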

Well, all the plans were set. The environment was established and signed off. The process was run through yet again and we determined that addressing styles (aliases) for the various components would work. Our plan was simple: set up another domain and make sure that it worked. Once it was proven, we would migrate users from the file-based e-mail MAC and PC domains at our leisure. In the meantime, our directory product would maintain the directory aliases of all users and make sure that we could have uninterrupted message flow between these domains. Sounds great, but it didn’t happen that way!

The week before the cutover, our partners decided that this process wasn’t good enough. The process was good in that it did what the job required, but it would take too long. Our partners came back to insist that the entire migration be completed within the weekend. We were initially shocked, but we looked over our process, reworked the logistics and decided to do a quick run-through of our process in test. With less than a week remaining, we set up the original environment again, and put the process through a test drill to arrive at our anticipated end state. Less than two days to go! We were happy enough with the result not to attempt to stop the migration.

It is a credit to the entire group that, given that major shift in design at the twelfth hour, they were able to adapt. In large part, two things gave us that capability. One was having a complete, finely detailed and tested process ready. The second was the availability of a staging area, complete with the hardware, software and people who were skilled at knowing their environments. This latter area is overlooked many times, and is the key to the success of a migration.

The one section of our migration that became very neglected in this hurry-up environment was the backout strategy. Our original plan had a number of "backout" spots, all at critical junctures and allowing for a speedy return to the original environment by the normal workday on Monday. As we conferred on this topic the day before our "start", our partners came to a speedy conclusion: there would be no backout! Talk about focusing us to succeed. This truly made us examine this process with an eye to detail. We did not want to fail. This could be a mechanism for management to galvanize a group to succeed, but it probably is not one of the recommended management tools!

One of the areas that we were forced to deal with was that our migration included two groups in different locations doing different things on the process plan. Sometimes we would be working in parallel; sometimes we would have to wait for one group to finish a task before the other group could continue. Communication was critical for this to succeed. To facilitate this, we set up a dial-in conference number that would be continuously available. We also set up specific times for us all to dial in for "check ups". Finally, if a problem was detected, everybody was to be paged with the phone bridge number, both to ensure that the urgency was noted and to allow people to get in on the call regardless of where they were.

Migration weekend was upon us. All lights green. The process started on a Thursday night, with a complete backup of all existing systems. Even though there wasn’t a backout strategy as such, it was still imperative that we had a clean and stable foundation that could be returned to at any time during the migration. Our process flow was being followed and checked off as we went. Our partners started on Friday with their part, and we did our parts as they completed adjoining items on the process plan. In our plan, we had carefully laid out the amount of time that each step would take. We had a directory of about 20,000 total and we were migrating over 3,000 users. Although these totals are not huge, they do take time and it is nice to know when a particular section will take 15 minutes as opposed to an hour. There were also many other e-mail domains involved, so we had to make sure that the directories were synched completely and correctly by the end of the migration. We planned to be in a spot where we could run a complete (and automated) directory synchronization cycle by 9 p.m. The synchronization would have taken a number of hours, but at that point, we would have waited till the following morning to continue. It would also have taken us to the completion point of the migration from a directory standpoint, with only testing and general cleanup left to do.

Given the lead-in I just gave you, I suspect you can imagine the next few sentences. One of the reasons that we worked very hard to get a complete snapshot of the production environment was so that we could get equivalent time stamps through our process. In our case, we did just that, but two items in our process completely collapsed timewise. Our partners ran into difficulty with a key part of their directory update (instead of 20 minutes, it took over 3 hours). We then discovered that one of our steps (deleting unnecessary and unwanted auto-registered names for the file-based e-mail PC gateway) went from 2 hours to 5 hours. We had planned these steps according to where they needed to be, but also with an eye to where the item would fit in nicely with the human metabolism where possible. Needless to say, we were not ready to run the directory synchronization (automated part) till 3 a.m. I suppose the good news was that the process held together timewise for the rest of it! By having the buffer time already built in (from 9 p.m. till 8 a.m. the next morning), we had in essence given ourselves a cushion that we didn’t really want to use, but could if necessary.
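The arithmetic behind that slip is easy to track as you go. Here is a small Python sketch of the idea, using the figures above. It assumes the two late steps ran one after the other on the critical path (my simplification for the example), treats "over 3 hours" as exactly 180 minutes, and uses an arbitrary calendar date; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def slipped_start(planned_start, steps):
    """Return the new start time of a milestone after step overruns.

    steps is a list of (name, planned_minutes, actual_minutes) tuples for
    steps that run one after another ahead of the milestone.
    """
    overrun = sum(actual - planned for _, planned, actual in steps)
    return planned_start + timedelta(minutes=overrun)


def remaining_buffer(new_start, next_checkpoint):
    """Return the time left between the slipped start and the next checkpoint."""
    return next_checkpoint - new_start


# Figures from the migration; the date itself is an arbitrary placeholder.
steps = [
    ("partner directory update", 20, 180),      # planned 20 min, took over 3 hours
    ("delete auto-registered names", 120, 300),  # planned 2 hours, took 5 hours
]
sync_planned = datetime(2000, 1, 7, 21, 0)   # 9 p.m. planned synchronization start
checkpoint = datetime(2000, 1, 8, 8, 0)      # 8 a.m. phone bridge the next morning

sync_actual = slipped_start(sync_planned, steps)
print(sync_actual.strftime("%H:%M"))              # 02:40
print(remaining_buffer(sync_actual, checkpoint))  # 5:20:00
```

Under these assumptions the synchronization starts at about 2:40 a.m., which lines up with the 3 a.m. we actually saw, and leaves a little over five hours before the 8 a.m. call for a synchronization that takes "a number of hours".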

At 8 a.m. the next morning, we assembled on the phone bridge, some of us a little crustier than others. Topics always started with where on the process plan we were, what was successfully completed, what had not been, and what bypasses we had done. It turned out that the automated process was successful and we were ready for the next parts of the process.

This was the last and major part of the cutover. All the new gateways were up and connected. All the directory entries were established and aliases built for the remainder. Our mood was upbeat. It was time to eyeball specific entries in the directory (ones that we had established beforehand as being "tricky") and to verify that the total number of directory entries was what we had anticipated. Another group of people was designated test partners and went off to do that.
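This kind of check lends itself to a small script that can be rerun after every synchronization cycle. The sketch below is only an illustration in Python: it assumes the directory can be exported as a simple name-to-address mapping, and the function name, entry names and addresses are all made up.

```python
def verify_directory(entries, expected_total, expected_tricky):
    """Return a list of problems found in a directory export.

    entries:         dict mapping a user name to the address in the directory
    expected_total:  the entry count anticipated in the process plan
    expected_tricky: dict of hand-picked "tricky" names and the address
                     each one should have after the migration
    """
    problems = []
    if len(entries) != expected_total:
        problems.append(
            f"entry count is {len(entries)}, expected {expected_total}"
        )
    for name, expected_address in expected_tricky.items():
        actual = entries.get(name)
        if actual is None:
            problems.append(f"{name}: missing from directory")
        elif actual != expected_address:
            problems.append(
                f"{name}: address is {actual}, expected {expected_address}"
            )
    return problems
```

An empty list means that the count and every spot-checked entry matched. Anything else is a short list to read out on the phone bridge.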

Within a half-hour we were all back on the tie line. It appeared that our partners had installed a gateway that we had no knowledge of and had never tested, and that it had modified the manner of message routing and the directory entries. Because of the nature of the process that we had left for Saturday, we had resolved that we could be telecommuters on that day to finish. In the case of a problem, a person was designated to be the on-site "helper" and would have to go in. We reached this conclusion by 10 a.m. and had a person on-site and working with our partners who were also on-site (at their end). The rest of us were spending our time either on the tie line or dialed into the systems. The conclusion is that, regardless of the technology, it is still very useful to have somebody on-site to eyeball multiple situations and assess the situation. This "command" center would have all the connections to systems needed, as many monitors as necessary and a person whom we could "talk" through a situation or who would be able to see a situation occurring. Another key point for this center would be that all alerts for management of the systems would be funneled into this area.

By 7 p.m. that evening, we had the situation under control and felt comfortable enough to leave the remainder of any cleanup to our partners (their downstream post offices). Around 11 a.m. the next day (Sunday), there was a page for a tie line conference that we all attended. We confirmed that the systems were still "operationally" ready for Monday, and quickly cleaned up one minor issue that had surfaced through the Saturday transition. At this time the process plan was completed and we were able to check off all items. Testing was done and we could confirm that our test messages had succeeded and that all recipients were able to identify and reply to all other recipients. Of course, the 80/20 rule applies strongly here.

Monday morning was extremely quiet for the first few hours. This is very normal on a "successful" migration that incorporates the 80% rule. Around lunchtime, we started getting calls and discovered a configuration issue that affected a small group of users. We fixed that by 4 p.m. The point is that it is very difficult to make a user understand that messaging is working for everybody else, but not for them in this particular instance. They are an unfortunate statistic. The key to this is to establish expectations for the move that clearly indicate to users that messaging is expected to work for the vast majority, but that some users will experience problems. The other point is to make sure the message gets out so that users do call in with problems. There is nothing worse in the world than a user who has sat on a problem for 2 days and finally calls in to report it. Usually, this user is more aggravated and is in one-way communication mode (you listen, I talk). Indicating that you know about the problem and that it is being corrected can calm down even these users. Users don’t want to hear what is wrong; they simply want to be reassured that you know about the problem, are working on it and have a relative idea about when it might be fixed.

This leads to another very important area: if you don’t have the tools to know when you have a problem, your users will be the first to report it to you. You want the users who are calling in with problems to be the odd one or two where directory synchronization has failed for a particular user rather than for the entire environment.

Ultimately, it is very difficult to have a migration that is 100% successful. If you have had this occurrence, savor it well. The real key to having a successful migration is knowing when you are having a problem, having the right resources available to fix it, and having a plan that allows for timing issues, failures and backout strategies.