The emergence of cloud computing is a fundamental shift towards new on-demand business models, together with new implementation models for the applications portfolio, the infrastructure, and the data, all of which are provisioned as virtual services using the cloud. These technological and commercial changes have an impact on current working practices. Businesses need to understand the impact of the new combinations of technology layers, and how they work together. A crucial part of this is analyzing and assessing the risks involved.
For example, the use of shared resources, in multi-tenanted cloud systems and across multiple organizations seeking economies of scale, results in companies relying upon a common cloud service or platform. What attendant risks might this bring to the tenant consumer of the service, and to the sellers and providers of the cloud services? How will it impact their expectations of service levels and performance?
This is a fundamental issue for any enterprise that considers using the cloud. As the Proposed Security Assessment and Authorization for US Government Cloud Computing points out: “The decision to embrace cloud computing technology is a risk-based decision, not a technology-based decision” (see [FEDRAMP]). All organizations, large and small, need to establish the right decision framework and governance mechanisms to use cloud computing successfully. These rely on an ability to analyze and assess the risks.
This chapter is about how to understand the main risks associated with cloud computing. Its approach is based on the Mosaic approach to risk management that was developed at the Carnegie-Mellon University Software Engineering Institute (CMU SEI). More information can be found on the CMU SEI Risk website [SEI RISK] and, in particular, in the excellent tutorial Rethinking Risk Management [D&A].
Risk management is a core business activity of all enterprises, large and small. You can find a summary of the basic principles in, for example, the briefing document on Risk Management for SMEs produced by the Institute of Chartered Accountants in England and Wales [ICAEW]. The Committee of Sponsoring Organizations of the Treadway Commission [COSO] is the source of some of the important work on Enterprise Risk Management (ERM), which has developed into a significant discipline. This work, together with work done by national standards bodies (particularly in New Zealand and Australia), is the basis of ISO 31000:2009: Risk Management – Principles and Guidelines, which is now the standard risk management framework [ISO 31000].
Risk management is sometimes thought of as an unnecessary expense, a chore that can be neglected. This is the opposite pole of folly to the idea that risk can be avoided. A well-run business avoids the toxic risks, but accepts the manageable ones, takes care to minimize its exposure, and makes a healthy profit. Similarly, a well-run non-profit organization must know which risks to accept, and which to avoid, if it is to achieve its objectives. Good risk analysis often demonstrates that pre-existing worries may not actually turn out to be real risks. Properly applied, risk management creates and protects value.
The Mosaic approach builds on and extends traditional risk management to provide a framework for managing complex, systemic risks. It takes a holistic view of risk to objectives by examining the aggregate effects of multiple conditions and potential events. Cloud risk is essentially a matter of collaboration and shared tenancy. When an enterprise uses cloud computing, risks can arise in inter-related services that may not be under the complete control of the internal IT department. As [D&A] points out, traditional, tactical risk management is designed for environments with low uncertainty and few interconnections, but today’s networked technologies operate in an environment of high uncertainty and dynamically changing, interconnected systems. This is particularly true for modern enterprises operating within cloud-based ecosystems.
When a program to develop and deploy a new business solution is considered, you assess its mission risks: the systemic risks that affect the program’s ability to achieve its key objectives. A mission risk arises from a factor that has a strong influence on the eventual outcome or result. Such a factor is called a driver. Drivers enable a systemic approach to risk management by aggregating the effects of conditions and potential events.
You assess these risks initially, and reassess them periodically during solution development and operation. When the solution is in operation, the assessment incorporates measurements of solution parameters that affect the risk.
Assessing a mission risk typically means considering a number of complex, inter-related factors. The use of cloud computing affects many of these factors, sometimes negatively and sometimes positively.
Assessment of mission risks is not a simple process. It may, for example, require interviews, documentation reviews, and group meetings. This clearly takes significant effort. Nevertheless, putting this effort into a systematic approach to risk management for a large and complex project will pay off. Even at the stage of preliminary assessment, where a large effort is not justified, it is worth following the general lines of the Mosaic approach when thinking through the risks involved.
This chapter discusses factors where the impact may be negative. There are many factors where cloud computing has a positive effect. For example, it can lower the risk that an enterprise cannot meet business demand due to lack of IT capacity. These advantages have already been covered in Why Cloud.
Company executives are receiving requests for investment in projects with cloud computing and want to understand the ROI and investment risk. They seek to make value judgments and comparisons with available known performance benchmark data to assess the probability of success for a project. They want to understand any gaps or areas of risk that might need to be taken into account in decisions to approve a cloud project or in assessing the use of the cloud service.
In a large enterprise, risk decisions are often taken jointly by several people, who then must explain them to the rest of the executive team. This implies a need to communicate risk information in a way that is clear and easy to understand. It may also be important to communicate risk information outside the organization; for example, to potential shareholders.
Risk management is an important technique used when implementing an enterprise architecture project (see Chapter 31: Risk Management of [TOGAF]). The enterprise architect will seek to satisfy the executives’ needs for risk information with an outline analysis when setting the architecture vision, and a detailed analysis when developing the business architecture.
You may be conducting the risk analysis, and be asked to explain it to others. More likely, the detailed analysis will be done by others, and you will want them to explain it to you. You need to understand how to communicate risk information, either to do it yourself, or to insist that others do it in a way that you can easily understand. The approach followed here, based on Mosaic, uses simple scorecards that communicate the risks very clearly.
The impact and probability of each risk are assessed separately, and then combined to give an indication of exposure. A scale with five levels is used for each of these quantities:
This enables a clear and direct graphical representation of project risks. For example, the risk exposure summary in the figure below is taken from the initial risk analysis on the ViWi project. (This is one particular example. The impact, probability, and exposure values will differ in other cases.)
ViWi Risk Exposure
It shows major exposures to financial, disaster recovery, and system quality risks. The financial and system quality risks are inherent in the project because no-one can be sure that people will buy the virtual widgets or find them useful. The disaster recovery risk can, however, be mitigated by data back-up arrangements. The analysis makes it clear to the decision-makers what risks they are taking, which of them can be mitigated, and which must be accepted. That is its value.
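The way impact and probability combine into exposure can be sketched in a few lines of Python. This is an illustration only: the level names, the numeric scores, and the rule of combining the two scores by a rounded geometric mean are assumptions made for this example, not the scale or the rule used in the figures of this chapter.

```python
import math

# Assumed five-level scale, shared by impact and probability for simplicity.
LEVELS = ["minimal", "low", "moderate", "high", "critical"]


def score(level: str) -> int:
    """Map a level name to a score from 1 (minimal) to 5 (critical)."""
    return LEVELS.index(level) + 1


def exposure(impact: str, probability: str) -> str:
    """Combine impact and probability into an exposure level.

    Assumed rule: take the geometric mean of the two scores and round it
    to the nearest level. Equal inputs give the same level back, and a
    minimal probability pulls even a critical impact down.
    """
    combined = math.sqrt(score(impact) * score(probability))
    return LEVELS[round(combined) - 1]


# ViWi rates organization and culture as critical impact, minimal probability.
print(exposure("critical", "minimal"))  # low
```

Whatever combination rule is chosen, it should be written down and applied unchanged in every reassessment, so that successive scorecards can be compared.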
The main cloud-related mission risks to consider are the ones shown in the ViWi example above:
It is important to have a clear definition of each risk, and also to explain the factors involved and provide rationale for each assessment. The figure below shows an outline of the rationale for Sam Pan Engineering’s initial assessment. A full assessment should have more detail. While the scorecard chart communicates an overview of the risk clearly, people will often need more information before taking specific action. And it is crucial that repeated risk assessments are conducted on the same basis. This is impossible without clear explanations of the factors and decision rationale.
Sam Pan Engineering Initial Risk Assessment
Note that Sam Pan has not assessed the service integration and external service risks, because it is not using any external cloud services.
Note also that system quality is a complex risk that is the subject of its own separate probability assessment. The result of this is carried forward into the overall assessment.
One mission risk is that the solution will not meet its financial objectives. Many promising solutions have been abandoned because they failed to meet their revenue and profit objectives. The impact of financial risk is almost always critical.
Business models generally look great on paper, but the figures can be very different when they are put to the test. By enabling costs to be related directly to workload, and hence to revenue, cloud computing reduces some of the risk. But it affects other factors differently, sometimes in a negative way.
The key factors to consider when assessing cloud ROI risk probability are the leading indicators discussed in the next chapter: utilization, speed, scale, and quality. These factors are built into most ROI models, and affect the headline figures for investment, revenue, cost, and time to return. Deviations of these figures from what was planned indicate likely deviations of the headline figures, and consequent failure to achieve the desired ROI.
The impact of mission risks is usually critical. Sam Pan Engineering, however, assesses the impact of the financial risk as only high. Although there will be serious consequences if the system does not meet its financial objectives, it may still play a useful role within the corporation.
Introducing cloud computing may require major changes to an enterprise’s organization and culture. Can you persuade a traditionally-run company with a “we do it our way” culture to use external, standardized services? Will you have to make the entire IT department redundant? What will the unions do?
A company adopting new cloud-enabled business processes must have a clear executive vision and direction for business transformation. The program must have the support of top-level executives, and they must share a clear vision of what is to be achieved, and define a clear strategy for realizing it. This should include:
The organization must be able to make the change to adopt on-demand provisioning fast enough to gain the consequent business benefits. Cloud aims to be a step change in total speed of delivery. The organization must be able to move up a gear to take advantage of this.
The organization’s financial structure, and in particular its charging methods, may need to be redefined to adapt to cloud service business models.
Traditional working practices may need to change. This does not just affect the IT department. The fact that the infrastructure or applications used by business departments are not directly controlled by the organization may mean changes to the way that those departments work. This will require education and training, and may meet with resistance from the workforce.
All of our example companies rate the impact of this risk as critical. ViWi rate the probability as minimal; as a start-up company created for this project, they have no organizational or cultural barriers. The other two companies believe they have moderate organization and cultural risk probabilities.
As use of cloud computing develops, it is increasingly likely that an enterprise will use not one but several cloud services, and will need to integrate them with each other and with in-house systems. For example, in Case 21: Brand Unification in Cloud Computing in Use, the enterprise wants to use cloud-based collaboration services together with other services replacing some of its existing applications. These services must be integrated with each other and with those existing applications that are not replaced.
There is a risk that it will not be possible to integrate the cloud services with the existing system and with each other. This risk is critical; if the system cannot be built, it cannot be used. The service integration risk can be assessed by considering interface conversion cost, ability to change the existing system, and available skills.
The interface conversion cost is the cost of providing “glue” software to connect the services if their interfaces are not completely compatible. The interfaces may be compatible if the services are designed as part of a common suite, or to fit within a particular industry reference model. Otherwise, you should look at how far they have compatible syntax, compatible semantics, and fit within a common process model, or support a suitable interoperability protocol and architecture.
Usually, syntactic conversion is relatively straightforward, and semantic conversion is possible but expensive. If the services have radically different process models, however, integration may be extremely difficult.
Similar considerations apply to the ability to change the existing system. You should assess whether it can be used with little change, or whether radical redevelopment is required. In the latter case, the risks are high.
Significant skills are required to assemble and customize multiple cloud services from different providers in a flexible, adaptable way, while maintaining security, back-up, and governance mechanisms. This assembly of services will mean that applications using them will need to become more “loosely coupled” – programmed to act with an integration layer, not the underlying infrastructure. You should assess how far you have these skills in-house, and the cost of employing specialist consultants. If major changes are needed to legacy systems, you may find that the necessary skills no longer exist, either inside or outside your organization.
The requirement to meet compliance obligations was discussed under Establishing Requirements. For each of your obligations, you should assess the risk that the obligation will not be met.
In addition to assessing risks associated with your own systems, you should assess risks associated with external services that you use. For projects based on cloud services, this is often the most significant source of compliance risk.
Regulations, or company policy, may require data to be located in particular geographical areas or legal jurisdictions, to be kept secure, with its integrity and confidentiality maintained, or to be kept online or archived for specific periods. This applies particularly to personal and financial data.
The impact of failing to conform to these regulations varies considerably. For legal requirements, it depends on the prescribed penalties and the enforcement regime. For moral obligations, the impact can include loss of reputation and standing, which may in turn be reflected in market share.
Dependence on an external cloud supplier can increase the probability of non-compliance. Even if you have contracts that provide the necessary assurances on location and confidentiality, force majeure may prevent the supplier from honouring them. For example, what would be the result of legal action to subpoena data in a cloud environment when that data is not even held under your tenancy, but has been placed on the same system by other tenants? And what would then be the impact on your corporate reputation?
Because of questions like this, compliance is an important cloud risk area. All of our example companies rate the impact as critical, but Konsort-Prinz are the only ones that give it a significant probability. They have to conform to European legislation. They believe that they can do this, but it is a complex area, and they rate the risk probability as moderate.
It is sometimes necessary to react to and recover from unplanned events. As with compliance, you should assess risks associated with external services that you use, as well as assessing risks associated with your own systems.
Physical disasters such as fires, floods, and hurricanes that result in a major loss of IT capability are examples of events that can affect in-house systems or the systems of cloud suppliers. And there are other kinds of event, such as mergers and acquisitions of suppliers, unforeseen bankruptcy, or cancellation of contract, that can affect a supplier and require a similar response.
Using cloud computing can make it harder to respond, because you have less control or less visibility over the IT resources concerned.
As part of your risk analysis, you should identify the unplanned events that could harm you, and assess their probabilities and impacts. You may also wish to make general provision for unforeseen events that disrupt the cloud services that you use, or damage their data.
Having identified the risks, you can build into your system design elements that will reduce their probability or mitigate their effects. For example, an effective back-up and restore process, with the back-up copy held in a different location from the data, or on your own rather than the cloud supplier’s system, can change the impact of a disaster from fatal to merely serious.
This is the risk that the solution does not meet its users’ needs. The importance of system quality is discussed under Quality: Improved Margin from Better Service. The impact of a shortfall is reduced margin and loss of ROI.
As with compliance and disaster recovery risks, you should assess system quality risks associated with external services that you use, as well as assessing the risk associated with your own systems.
The risk factors for system quality are illustrated in the first system quality assessment carried out by ViWi. Unsurprisingly, since the product has not yet been tried in the market, this shows high failure probabilities for the high-impact factors of functionality and user satisfaction. (Distance from the center of the chart indicates the size of the impact or probability.) If these risks materialize, and the product does not satisfy its users, then the project will fail, and its backers will lose their money.
ViWi Initial System Quality Assessment
System quality risk factors are discussed in more detail in a separate section.
If you are using external cloud services, then you should assess the risks that they will not be adequate.
The impact can be critical if your solution depends on an external cloud service for essential infrastructure or functionality. But the impact is less if you have an exit strategy, as discussed under Selection.
The factors to consider include the system quality of the external service, and the adequacy of its supplier.
The system quality of an external service can be assessed using the same factors as for the system quality of your own solution. The assessment made by ViWi of the risks associated with using an external PaaS service is illustrated below.
ViWi External Cloud Service System Quality Assessment
Although the factors are the same as for the assessment of the overall ViWi solution, the values of those factors are different. They are often quite unrelated. For example, the users of the PaaS service are the ViWi developers, and it is not so critical for them to be satisfied with the development platform as it is for the virtual widget users to be satisfied with the end product. Also, the PaaS platform is well-established, and the probability that it will be unsatisfactory is low.
Enterprises that use cloud computing rely on cloud suppliers for many things that traditionally would be done in-house. You should assess supplier quality because there is a risk that a cloud supplier is not adequate; for example, that it proves untrustworthy, or goes bankrupt.
Many of the factors involved – such as financial stability, size, reputation, and track record – are similar to those for any other kind of supplier on which you rely. In the context of cloud computing, you should look particularly at the supplier’s track record in meeting SLAs, responding to complaints, and being willing to share information about service operation and system architecture.
Remember, also, that “possession is nine points of the law” when it comes to information. Providers can, for example, use or sell subscriber information for marketing purposes in ways that can be hard to detect or prove. Supplier quality assessments should include trustworthiness.
System quality as defined above depends on the factors of functionality, performance, manageability, security, and user satisfaction. The requirements for the first four of these factors are discussed under Establishing Requirements, and user satisfaction depends on those four factors.
The risk that the system does not have the necessary functionality depends on the similar risk for external cloud-based systems that you use, as well as on non-cloud-related factors, such as the quality of the specification.
If you are relying on PaaS or SaaS services with complex functionality, this may be a significant consideration.
ViWi are relying on significant functionality in the cloud platform that they use. They have already found that there is no PaaS service that meets their requirements, so the risk probability is certain. The impact is, however, only moderate; they can overcome the lack of platform functionality by writing additional software in their own system.
Performance covers availability and reliability, recoverability, throughput, and responsiveness.
Availability depends on reliability. From a risk analysis perspective, the Mean Time Between Failures (MTBF) reliability factor determines the risk probability, and the Mean Time To Repair (MTTR) – the probable value of the time needed for repair of the system – determines its impact.
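The link between these two quantities and availability can be illustrated with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is a minimal illustration; the function names and the figures of 999 hours and 1 hour are hypothetical and are not taken from any of the example companies.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in a year at the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
print(f"availability: {availability(999.0, 1.0):.3%}")
print(f"downtime per year: {expected_annual_downtime_hours(999.0, 1.0):.2f} hours")
```

Raising the MTBF makes failures less frequent, which lowers the risk probability. Lowering the MTTR shortens each outage, which lowers the impact.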
Fault tolerance is a factor that affects the availability risk probability. A fully fault-tolerant design has no Single Point of Failure (SPOF), and often accommodates multiple failures within a service window. Software fault-tolerant designs include exception handling and task rollback. For such systems, the failure probability will be low.
Recoverability is the ability to recover from failure. If a system fails, there is likely to be some data loss. If the failure is to a processing unit, the amount of data lost will depend on how the program handles the data; with well-designed software, the amount can be small. If the failure is in a storage system, the loss may be significant. Use of redundant storage can minimize the probability of this.
Many companies would go out of business if they lost all of the data that they keep on IT systems. It is normal practice to limit this impact by taking back-ups. For example, a daily back-up limits the impact to loss of 24 hours’ data.
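The effect of the back-up interval on the potential data loss can be sketched as follows. The worst case follows directly from the text above. The average figure rests on an added assumption, which is that a failure is equally likely at any moment between two back-ups. The function names are hypothetical.

```python
def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """Longest period of data that can be lost: a failure just before the next back-up."""
    return backup_interval_hours


def average_loss_hours(backup_interval_hours: float) -> float:
    """Average loss, assuming failures are equally likely at any time in the interval."""
    return backup_interval_hours / 2.0


# Compare a daily back-up with an hourly one.
for interval in (24.0, 1.0):
    print(
        f"back-up every {interval:g} h: "
        f"worst case {worst_case_loss_hours(interval):g} h, "
        f"average {average_loss_hours(interval):g} h of data lost"
    )
```

Shortening the interval reduces the impact of a failure but increases the cost of taking and storing back-ups, so the choice is itself a risk decision.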
If the system is performing adequately, its throughput is the same as the offered workload. But there is a risk that it cannot handle the load required. In assessing this risk, you should consider the range of throughput levels that the system should support, the degree of predictability of the variations in level, and how changes in load will be handled.
If you are using cloud services, you can make a trade-off between risk and cost. Configuring your system with more resources than are needed to meet the expected load reduces the probability that the load cannot be handled, but increases the cost. If you can configure the system dynamically, you can keep this over-provisioning to a minimum, but must consider how quickly the load might change, and how quickly additional resources can be brought online.
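The trade-off between over-provisioning cost and overload risk can be explored with a simple calculation on observed load data. The following is a rough sketch under stated assumptions: the load samples, the unit cost, and the candidate over-provisioning factors are all hypothetical, and the fraction of past samples that exceed capacity is used as a crude estimate of overload probability.

```python
from statistics import mean


def overload_fraction(load_samples, capacity):
    """Fraction of observed load samples that exceed the provisioned capacity."""
    exceeded = sum(1 for load in load_samples if load > capacity)
    return exceeded / len(load_samples)


def provisioning_options(load_samples, factors, unit_cost):
    """Return (factor, capacity, cost, overload fraction) for each candidate factor.

    Capacity is the expected (mean) load multiplied by the factor, and cost
    is assumed to be proportional to the capacity provisioned.
    """
    expected = mean(load_samples)
    options = []
    for factor in factors:
        capacity = expected * factor
        options.append(
            (factor, capacity, capacity * unit_cost,
             overload_fraction(load_samples, capacity))
        )
    return options


# Hypothetical load measurements, in transactions per second.
samples = [80, 90, 100, 110, 120, 100, 95, 105, 150, 50]
for factor, capacity, cost, risk in provisioning_options(samples, [1.0, 1.25, 1.5], 2.0):
    print(f"factor {factor}: capacity {capacity}, cost {cost}, overload fraction {risk}")
```

With dynamic provisioning, the same calculation can be repeated over shorter time windows, provided that additional resources can be brought online faster than the load changes.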
The probability of an overload can be reduced by apportioning the offered load between systems. Cloud providers normally have some means of doing this when resources are provisioned, and Sam Pan Engineering will have this facility in its private cloud platform. Konsort-Prinz uses a single central database, and cannot partition the processing. ViWi can easily do so, because different virtual widgets can run on different systems.
In assessing the risk that a system is not sufficiently responsive, consider the response times that it should have, the permissible variability in response time, and the need for predictability of response.
Poor response time is often correlated with throughput overload. We will see an example of this in the next chapter.
Manageability includes the factors of configurability, reporting, and fault management. The requirements for them were discussed under Establishing Requirements. Failure to meet these requirements leads to user frustration and delay at best, and at worst renders a system completely unusable.
Provisioning management is particularly important for cloud services. Konsort-Prinz assesses the manageability risk impact of the cloud service in its solution as high, principally because of the needs of its twice-daily provisioning exercise. ViWi rates manageability of the platform as critical, because of its need for dynamic resource provisioning.
Having your own information, on your own hardware, and between your own four walls, provides a level of comfort that you lose in the cloud. Cloud services are often accessed over the public Internet and this must be considered when assessing security. This is not to say that cloud computing is necessarily insecure, just that new considerations need to be taken into account and more modern security models developed and applied. You must adapt traditional security models to suit cloud computing needs and consider end-to-end security, including your own internal policies for access control and user provisioning.
The analysis of risks associated with security threats is an important and specialized discipline. This Guide does not attempt such an analysis. It provides general advice on assessing cloud security risks for the purpose of business decision-making, but contains no information on identifying or taking countermeasures against specific security threats. You can find detailed information on risk in the context of data security in The Open Group Risk Taxonomy [RISK] and The Open Group FAIR – ISO/IEC 27005 Cookbook [FAIR]. The Cloud Security Alliance [CSA] has produced some excellent material relating to security in the context of cloud computing.
The requirements for end-user access control, supplier access control, resource partitioning, security logging, and threat response were discussed under Establishing Requirements. Failure to meet these requirements can lead to system and data unavailability, financial loss, leakage of sensitive information, failure to meet privacy regulations, and damage to reputation. It can also result in the integrity of the system being compromised, leading to sabotage or data leakage. Such a compromise can occur, for instance, through the insertion of malicious software or the abuse of existing software vulnerabilities.
There are two factors that can increase the probability of a security breach for consumers of cloud services. One is the need to rely on the service provider for part of your security arrangements. The other is that the security arrangements in a system that uses services from different providers are likely to be complex, and this introduces the possibility of unconsidered areas where there can be gaps in the defences.
Konsort-Prinz assesses the impact of a security breach as high, in view of the possible disruption to its operations and loss of customer information, but not critical. It assesses the probability as low, because it trusts the supplier, only its staff can control operations and access customer information, and it will have standard access control mechanisms in place. If it decides to give its suppliers access to the system, to boost ecosystem productivity, the security risk must be reassessed.
Logically, user satisfaction is a consequence of the other system qualities considered above. Practically, it should be considered and assessed separately. There is a definite risk that the users are not satisfied with a system, even though it appears to have adequate functionality, performance, manageability, and security.
At design time, assess the probability that users will not be satisfied by looking at whether human factor experts have been consulted, and whether user trials have been done. When the system is in service, take proactive steps to measure user satisfaction. The importance of this as an indicator of quality is discussed under Measuring and Tracking ROI.
An initial risk assessment is made when taking a decision to use cloud computing, or deciding which form of cloud computing to use. The examples given so far in this chapter have been of such assessments. But the risk analysis is not just done once at the start of system development and then forgotten. It should be repeated throughout the life of the system as circumstances change.
The importance of risk management as an integral part of enterprise architecture is described in Chapter 31: Risk Management of [TOGAF]. This section illustrates the application of this principle to the Konsort-Prinz cloud solution.
The initial risk assessment for this project indicated significant exposure to risks in the areas of organization and culture, compliance, and disaster recovery.
Konsort-Prinz Initial Risk Assessment
The organization and culture risks arise partly because of a generally conservative culture within the company, and partly because national legal constraints make it difficult to make people redundant without their consent. However, the project has the backing of top-level management. The CEO will make it plain that opposition will not be tolerated. At the same time, the company will offer financial terms that will encourage the IT staff who are no longer needed to leave.
The compliance risks relate to European data protection legislation. This is complex and still evolving. (There is a central EU Directive, but it is its interpretation in each member state that counts.) Konsort-Prinz decides to keep all data within the EU, to ensure that all customer and employee data is kept confidential, and to enable customers and employees to check their personal data. The executives believe that this is a moral obligation, and also that it will mean that the company complies with all applicable laws. The implications for the IaaS service that the solution will use are that it must keep all data within the EU, and make that data accessible only to people authorized by Konsort-Prinz. These are made mandatory requirements in the selection process.
To provide for disaster recovery, there will be a back-up IaaS provider. Copies of all the data will be stored periodically on this provider’s system, and there will be a contingency plan to transfer operation to that provider if the main provider suffers a disaster. The impact of a disaster will then be at a non-critical level, although there may still be disruption to operations and some data loss.
The system quality risk is in the performance area. A failure just before the daily back-up during the seasonal peak could cause a data loss that would affect 5% of the year’s sales – which would be reflected quite visibly in the bottom line. The designers decide to modify the system to write a copy of the transaction data to a separate storage resource within the cloud provider’s system. This is a good example of how an effective risk assessment can find a small “weakest link” within a large, and seemingly strong, chain.
The effect of this risk mitigation process is to reduce all risk exposures to an acceptable level.
Konsort-Prinz Risk Mitigation
The assessment will be repeated at significant stages of the architecture development to ensure that the levels of risk exposure continue to be acceptable.
When cloud services are selected, the risks associated with each should be reviewed. Negotiation of contract terms should seek to reduce the risk to the purchaser. The level of risk remaining after this negotiation is a consideration to take into account when making the choice.
Konsort-Prinz has identified three IaaS suppliers that meet the requirements and whose prices give reasonable cost models. (The prices were taken from the public websites of three real IaaS providers in October 2010.)
Konsort-Prinz Supplier Cost Models
The first is a large multi-national corporation. The second is a small local company that offers extremely cheap rates on monthly contracts (hence the need for a larger over-provisioning factor). The third is a medium-sized company based in another EU country.
Konsort-Prinz makes a risk assessment for each potential supplier, and compares the results. The system quality risks are roughly equal, but the risk that the supplier is inadequate differs: minimal for the large company, moderate for the small local company, and low for the medium-sized EU company. They assess the risk for the local supplier (IaaS 2) as being too great, in spite of its cost advantage. The medium-sized EU company (IaaS 3) is slightly riskier than the multi-national, but is selected on cost grounds.
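The logic of this selection can be sketched as a filter on risk followed by a choice on cost. In the sketch below, the supplier risk levels are those stated above, but the relative cost ranks are placeholders that only preserve the ordering implied by the text (IaaS 2 cheapest, IaaS 3 cheaper than IaaS 1). The five-level scale and the acceptance threshold of "low" are assumptions made for this example.

```python
# Assumed five-level scale for supplier risk probability.
LEVELS = ["minimal", "low", "moderate", "high", "critical"]

# Risk levels are from the text; relative_cost values are placeholders
# (1 = cheapest), not the prices in the supplier cost models.
SUPPLIERS = [
    {"name": "IaaS 1", "supplier_risk": "minimal", "relative_cost": 3},
    {"name": "IaaS 2", "supplier_risk": "moderate", "relative_cost": 1},
    {"name": "IaaS 3", "supplier_risk": "low", "relative_cost": 2},
]


def select_supplier(suppliers, max_acceptable_risk):
    """Return the cheapest supplier whose risk is within the acceptable level."""
    limit = LEVELS.index(max_acceptable_risk)
    acceptable = [
        s for s in suppliers if LEVELS.index(s["supplier_risk"]) <= limit
    ]
    if not acceptable:
        return None  # no supplier meets the risk threshold
    return min(acceptable, key=lambda s: s["relative_cost"])


choice = select_supplier(SUPPLIERS, "low")
print(choice["name"])  # IaaS 3
```

Treating risk as a threshold and cost as the deciding factor among acceptable suppliers is one possible policy. An organization could equally weight the two, as long as the rule is stated and applied consistently.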
Risks should also be reassessed as appropriate during operation. Of our example companies, the best example of the value of this comes from ViWi.
ViWi Six-Month Risk Assessment
The risks are reassessed six months into the project. The product has been launched, and the question marks over service integration, system quality, and the external service have been resolved. In addition, although sales are lower than expected, the ROI factors have improved. (This will be explained in the next chapter.) The overall risk picture is very different, and much more positive.
In this case the assessment is crucial for a decision on whether to continue the project. More usually, the stakes are lower. Nevertheless, risks do change over project lifetimes. This may mean system modifications and changes in operation, even though cancellation is not in question.