illustrate the existing cloud network problem posted in the user forum for a prominent IaaS provider. The provider uses this forum to facilitate troubleshooting. The forum spans three years (Aug 06 to Dec 09) and contains over 9575 message threads. Several of the threads contain descriptive information about the problem users faced; we analyze such threads in this work. Each thread starts with symptoms the user observed. This is followed up by suggestions (by operators or other users) for debugging actions to perform and a description of the results of these actions. Finally, the threads end with the resolution of the problem and an explanation of the root cause either by an operator or the user.

To analyze the message threads we use techniques from Information Retrieval to cluster the tickets based on the data they contain. We then examined trends common to each of the discovered clusters, thereby unearthing various "classes of common problems". We describe our approach next.

3.1 Extraction of problem clusters

Our first goal is to automatically cluster message threads into a small number of "problem classes" based on the nature of the underlying problems that users faced. There are several challenges in deriving such clusters. First, the number of tickets is large, hence the grouping must be performed automatically. Second, the number and nature of clusters is not known to us beforehand. Third, the tickets are specified using (unstructured) English text, which makes automatic clustering hard.

To address these challenges, we leverage Information Retrieval algorithms for imposing structure on, and extracting clusters from, unstructured text. We evaluated several such algorithms (e.g. [4]) on our dataset and we found that the Lemur IR package [5] provided us with the best accuracy on several small random test sets of problem tickets. Briefly, the Lemur tool supports indexing of each document as a set of descriptive words; Lemur supports clustering operations on this index using the k-means algorithm. To avoid using generic words as descriptions of the problem ticket and to obtain meaningful clusters, Lemur, like most other IR approaches, uses a variety of tricks such as eliminating words that match a "stop word" list (including frequently used nouns, pronouns and verbs), and word standardization (eliminating tenses and plurals).

From the 9575 message tickets the Lemur tool discovered 194 clusters. We limit our attention to 27 of these clusters, each of which had at least 50 problems (containing a total of 8684 problems); roughly 91% of all problems reported mapped to the 27 clusters we studied. To understand these 27 clusters, we generated a "summary" of each cluster, which consists of the top 20 words (in terms of the frequency counts) in the cluster. These summaries help us analyze the problems reported. Using these summaries, we eliminated 7 clusters pertaining to non-technical questions such as billing queries, questions about future releases, and feature requests.

A key limitation of our study is that the range of problems we examine is limited to problems reported by users through the forum. In particular, it excludes problems that are reported directly to the IaaS provider by customers with the premium service. While it is difficult to quantify the impact of this limitation on our study, we do believe that our preliminary evaluation sheds light on the most common problems faced by a typical user of IaaS clouds.

4 Problems Faced by Users

In this section, we analyze the problems users faced based on the clusters of problem tickets derived by the aforementioned approach. We start by grouping the 20 clusters into dominant higher-level problem classes. We then dig deeper within each class to answer questions pertaining to the prevalence and evolution of problems observed over time and across categories, and the level of assistance needed and offered. Our goal is to develop an understanding of the nature of problems that can guide the design of appropriate support mechanisms that decrease problem resolution times.

Our key observations are: (i) Users encounter many problems in trying to set up their instance and to keep the instance running. (ii) Of the many types of problems faced by the users, we observe that those related to managing virtual resources and instance performance grow over time. In addition, we find that these two types of problems require the most involvement from cloud administrators because users are ill-equipped with the appropriate tools to debug these two classes. (iii) The addition of new features results in a temporary increase in the number of problems reported; the number of problems subsides as users become familiar and the provider perfects the method of delivery for these features. (iv) Our investigation of the cloud support model shows that the support staff grows in proportion to the increasing number of customers posting in the forum. Administrators usually respond to posts within 10-12 hours, but problem resolution can take days.

4.1 Problem categories

We manually inspected the summaries for the 20 mined clusters mentioned above to further group these into 5 logical problem classes based on their similarity, either being related to the same set of functional components or having similar problem semantics.

We assign each cluster to one of the following classes:
Application-related, Virtual Infrastructure-related, Image Management-related, Performance-related, and Connectivity-related.

Figure 1: A tree depicting the taxonomy of the problem classes.

Figure 1 illustrates the logical grouping of the mined clusters into these 5 classes of problems. With each cluster under a class we include two numbers: the first number inside each cluster denotes the total number of corresponding problem threads, and the second number denotes problems that required assistance from the cloud operator to find a resolution. For example, there are 828 message threads that are related to storage volume attach and detach, and out of these, 458 problem threads required operator involvement.

Figure 2: Percentage of problems discovered in each class over time.

From this, we observe that users may be afflicted by a variety of problems spanning these five categories. In a significant fraction of cases, operator involvement was needed to resolve the problem.

We now examine the relative prevalence of problems across the categories. The top boxes in Figure 1 illustrate the breakdown of the 5 problem classes. We find that 93% or 4716 out of the 5111 problem threads are roughly equally split between the four classes of Performance, Image Maintenance, Connectivity, and Virtualization-related problems. Only a small number of problem threads are Application-related.
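The cluster summaries behind this taxonomy come from the pipeline of Section 3.1: stop-word removal, word standardization, k-means clustering, and a top-words summary per cluster. The following is a minimal sketch of those steps in plain Python, not the Lemur toolkit's API. The stop-word list, the suffix-stripping rule, the use of cosine similarity, the choice of the first k documents as seeds, and the four sample threads are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "after", "cannot", "in", "is", "my", "not", "the", "to"}


def standardize(word):
    """Crude word standardization: strip one common tense or plural suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case, drop stop words, and standardize the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [standardize(w) for w in words if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Average term weights over a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def kmeans(vectors, k, iterations=10):
    """k-means with cosine similarity; seeds are the first k documents."""
    centroids = [dict(vec) for vec in vectors[:k]]
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


def summarize(token_lists, labels, cluster, top_n=20):
    """Most frequent words in one cluster, as in the per-cluster summaries."""
    counts = Counter()
    for tokens, label in zip(token_lists, labels):
        if label == cluster:
            counts.update(tokens)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical forum threads, invented for this sketch.
threads = [
    "Cannot attach EBS volume to my instance",
    "Instance is not reachable, ssh connection timeout",
    "Detaching the volume hangs, volume stuck in detaching state",
    "SSH connection refused after reboot",
]
token_lists = [tokenize(t) for t in threads]
vectors = [Counter(tokens) for tokens in token_lists]
labels = kmeans(vectors, k=2)
```

On this toy input the storage threads and the connectivity threads end up in separate clusters, and `summarize(token_lists, labels, 0, top_n=2)` gives the two most frequent words of the storage cluster. With real data the number of clusters is not known in advance, so k would have to be chosen or searched for.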
One possible explanation for the relatively low frequency of application problems is that they are related to higher-layer configuration settings, licensing issues, and installation problems; these issues are typically out of scope for an IaaS provider, and are better dealt with by the application's support staff.

Figure 3: Percentage of problems in each class that required operator intervention.

4.2 Evolution of problems

We now turn our attention to the evolution of user problems over time. Specifically, we study how the relative frequency of different categories of problems changes over time, and whether any specific events contribute to any of the observed changes. This analysis also provides some insight into problem classes that have persisted over time, and hence require additional effort to help resolve them.

Figure 2 illustrates the evolution of different classes of problems between the 3rd quarter of 2006 and 4th quarter of 2009, where the number of tickets in each class is normalized by the total number of problems in the corresponding quarter. We observe several interesting trends:

- Problems in the Virtual Infrastructure category increase over time.
- Problems in the Maintenance category decrease significantly with time.
- Connectivity problems are relatively stable and persistent.

We examined the feature release history for the cloud provider and found that sharp increases in problems in the Virtualized Infrastructure class are correlated with the release of new features in the cloud. For example, we observed that the increase in Virtual Infrastructure problems in the 3rd quarter of 2008 coincided with the introduction of a new virtual storage service. Following this release, another infrastructure-related feature was released in the 2nd quarter of 2009 and was followed by another significant increase in the set of Virtual Infrastructure problems. To further verify these observations, we went back to the problem threads in the Virtual Infrastructure class and saw that they were largely related to the new features. Similarly, with the Connectivity and Performance categories, we were also able to correlate certain increases with the release of new features or modification of existing features.

The significant decrease in the Image Maintenance problem class is explained by the fact that better APIs and tooling were released over time to better manage the images. For example, automatic image capture and reboot tooling was made available in the 1st quarter of 2007.

4.3 Problems with Operator Involvement

We now examine which problems needed operator involvement for problem resolution, and to what extent. Figure 3 illustrates the percentage of problems in each problem class that required operator involvement for resolution. Note that operator involvement ranges between 20% and 60% of the problems within a class.

Among the 5 classes, we find that Virtualized Infrastructure and Performance require operator involvement at least 50% of the time. We examined the clusters for these two classes of problems and found that, under Virtualized Infrastructure, attaching and detaching storage volumes requires the greatest degree of operator involvement. Similarly, we find that "unresponsive instance" was the major cause of operator involvement in the Performance problem class. This observation is significant in that users do not have enough information or control to identify and resolve these problem scenarios. For example, without any internal access or information about cloud resources, it is not possible to determine why an instance is unresponsive. Likewise, without the ability to inspect or change the internal state of the virtualized storage, users are constrained to API calls which are ineffective in changing their storage volume state.

Figure 4: Fraction of problems in each class which required cloud operator intervention over time.

We also analyze operator involvement in various classes as a function of time. Our intent is to see if there is evidence that users are able to gradually understand and diagnose certain problems, and if there are some problems that are always difficult to self-diagnose due to lack of visibility into the cloud (arising from virtualization). From Figure 4 we find that in general operator interventions decrease over time, with the exception of the Virtual Infrastructure class. Although intervention decreases for the rest of the categories, it never disappears altogether and gradually settles to a stable level. This suggests that while users become more familiar with the system and accumulate a knowledge-base of solutions, there is a significant fraction of problems that persistently require provider involvement to resolve.

Figure 5: Max threads closed divided by median threads closed by an operator.

4.4 Cloud Support model

For traditional IT service providers, the support model consists of a staffing plan, which includes the number of support staff and their skills, hours of availability, and their locations. In addition, the model includes targets such as the time to respond to reported problems of different severity levels, and the time to resolve problems. In this section, we analyze the support forum messages to see what can be observed about the support model for IaaS cloud providers.

We first examine the number of administrators responding to forum threads, based on their unique usernames across all messages. We observed that over the full 3-year period, 166 administrators participated in the message threads. A small number of administrators (around 10) were involved in over 150 message threads, while 100 answered fewer than 20 threads. The degree to which there are a few dominant administrators may be an indication of the skills distribution in the staffing plan. To investigate further, we examined the ratio of the maximum number of threads answered to the median over the observation period. Figure 5 shows this on a quarterly basis, and we see that in some periods there are administrators who participate in many more threads than the median, but there is significant variability in the amount of skew in the forum participation.

In Figure 6 we can see the evolution of the number of customers and administrators active in the forum over time. The size of the support team grows proportionally with the number of customers posting in the forums. The number of administrators remains roughly an order of magnitude smaller than customers over the lifetime of the cloud we observed.

Figure 6: Number of cloud operators and cloud customers on the forums.

In Figure 7 we examine the number of messages posted by the support staff over the course of 24 hours for each day of the week. The cloud provider's support team is active 24x7, with the main peak of activity roughly matching the peak times of customer postings (Figure 8). The support team activity on weekends (Saturday and Sunday) is relatively low, likely due to smaller staffing. Interestingly, there is also a smaller peak of administrator activity in the early morning hours (around 2am PST), with no corresponding peak in customer posting activity. This may indicate that a global support team is being used, with most administrators located in a timezone appropriate for peak North America business hours, and a smaller number deployed to staff the forum during off-peak weekday hours. The number of unique administrator usernames observed during each hourly period also indicates that there are more operators online during this secondary peak period.

Figure 7: Number of forum posts by cloud operators (by hours of day).

Figure 8: Number of forum posts by cloud customers (by hours of day).

We also examined the amount of time taken for resolution of support threads, estimated as the time difference between the first and last timestamp of the messages in a thread. If the last message appears well after the problem was actually resolved, this would overestimate the resolution time. If the problem is [...], for example, this would be an underestimate of the resolution time. Nevertheless, the statistics still provide useful estimates of the resolution time. In Figure 9 we show the CDF of the resolution time across all collected threads and find that about 60% of the problems are resolved in 20 hours, while the next 20% of the threads can take as much as an additional 100 hours.

We also considered the initial response time for administrators to post an answer in Figure 10. We find that administrators respond to 60% of the problems in less than 9 hours, while for the next 20%, administrators may take as long as an additional 20 hours to respond. We observe that problems seem to be resolved within 11-110 hours of the administrator's first response. It is likely that much of the time after the first response from an administrator is spent in an iterative trial and error process as customers explore possible root causes.

In trying to understand the support model for this large cloud provider, we find that although the number of support staff increased over time and forums are manned 24x7, there is still considerable variability in problem resolution times.
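The two timing estimates used in this section can be sketched as follows: resolution time as the gap between the first and last message of a thread, and initial response time as the gap between the first message and the first administrator message. This is an illustrative sketch only. The thread representation (a list of timestamp and role pairs), the role labels, and the sample data are assumptions, not the format of the forum data set.

```python
from datetime import datetime


def resolution_hours(thread):
    """Hours between the first and last message of a thread."""
    times = [ts for ts, _ in thread]
    return (max(times) - min(times)).total_seconds() / 3600.0


def first_response_hours(thread):
    """Hours from the first message to the first administrator message.

    Returns None when no administrator posted in the thread.
    """
    start = min(ts for ts, _ in thread)
    admin_times = [ts for ts, role in thread if role == "admin"]
    if not admin_times:
        return None
    return (min(admin_times) - start).total_seconds() / 3600.0


def empirical_cdf(values, limit):
    """Fraction of values that are less than or equal to limit."""
    return sum(1 for v in values if v <= limit) / len(values)


# Hypothetical threads: each message is (timestamp, role of the poster).
threads = [
    [
        (datetime(2009, 1, 1, 0, 0), "customer"),
        (datetime(2009, 1, 1, 6, 0), "admin"),
        (datetime(2009, 1, 1, 10, 0), "customer"),
    ],
    [
        (datetime(2009, 1, 2, 0, 0), "customer"),
        (datetime(2009, 1, 2, 2, 0), "customer"),
    ],
    [
        (datetime(2009, 1, 3, 0, 0), "customer"),
        (datetime(2009, 1, 4, 0, 0), "admin"),
        (datetime(2009, 1, 5, 12, 0), "customer"),
    ],
]

resolution = [resolution_hours(t) for t in threads]
responses = [first_response_hours(t) for t in threads]
answered = [r for r in responses if r is not None]

# Points on the two CDFs, analogous to reading values off Figures 9 and 10.
resolved_within_20h = empirical_cdf(resolution, 20)
answered_within_9h = empirical_cdf(answered, 9)
```

As the text notes, the first-to-last gap is only an estimate: a late follow-up message inflates it, and a thread that stops before the problem is fixed deflates it.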
Users still must often engage in lengthy exchanges to solve their problems, sometimes lasting several days.

Figure 9: Problem Resolution Time (x-axis: Time to Resolve Thread, in Hours).

Figure 10: Time for the operator to respond to the first user query (x-axis: Time Until Provider Response, in Hours).

5 Related work

The idea of cloud computing has been around for several decades in the form of utility computing; however, the lack of mature virtualization tools and powerful processors has prevented its growth.

Recent advancements in both the virtualization and the processor fields have created an environment for the deployment of cloud computing. Although relatively new, a fair amount of work [3, 8] has been done to examine current and future challenges for both users and providers of cloud computing. However, little has been done to understand the range of operational challenges faced by users as they attempt to run applications within the cloud. In particular, work on cloud management [2] has focused on the provisioning and scaling of services within infrastructure clouds. Unlike prior work, we believe that users face a significant challenge in merely trying to keep the instance up and running. To this end we studied the frequency of problems encountered by users, the amount of time required to debug these problems, and the properties of the support model required to resolve these problems.

6 Conclusion

IaaS clouds provide a variety of support models to aid users in debugging the problems that they encounter. In this paper, we conducted a study of the problems encountered by users and examined the effectiveness and the efficiency of the most popular support model, namely best effort, in resolving these problems. Our goal was to understand the missing mechanisms that can be added to allow cloud providers to offer support in a more effective fashion.

We found that of the problems faced by users, performance and virtualization problems are the most persistent and prevalent, owing in part to the fact that users have no visibility into the cloud and are thus forced to consult the cloud operators for help. In examining how the best effort support model handles these problems, we discovered that 10 operators are responsible for resolving most problems and that a significant delay of 20-110 hours exists between the initial operator involvement and the problem resolution. Our measurements indicate that to offer more effective support, clouds should (1) take a proactive approach by developing tools targeted at debugging new features, (2) develop tools to automate operator tasks, and (3) provide a vehicle to gather and transfer information between operator and user.

7 Acknowledgements

This work is supported in part by an NSF FIND grant (CNS-0626889), an NSF CAREER Award (CNS-0746531), an NSF NetSE grant (CNS-0905134), and by grants from the UW-Madison Graduate School. Theophilus Benson is supported by an IBM PhD Fellowship.

References

[1] Amazon EC2. http://aws.amazon.com/ec2
[2] 3tera. http://tera.com
[3] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, et al. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, Feb 2009.
[4] S. Cunningham and G. Holmes. Developing innovative applications of machine learning. In Proc. Southeast Asia Regional Computer Confederation Conference, Singapore, 1999.
[5] K. Fitz, L. Haken, and B. Holloway. Lemur - A Tool for Timbre Manipulation.
[6] P. Mell and T. Grance. Draft NIST working definition of cloud computing.
[7] RightScale. http://rightscale.com/
[8] L. M. Vaquero, L. R. Merino, J. Caceres, and M. Lindner. A break in the clouds: towards a cloud definition. SIGCOMM CCR, (1):50-55, 2009.