Dear all,

Here is a description of the events that took place last Friday, and some lessons learned that we took away from it.

The high-level summary is that an Atlas user was authorised to create an extreme number of measurements involving a large number of probes. This effectively overloaded the back-end machinery in several different ways. Even though this event could only happen because of an exception to the normal resource-use limits, we have implemented workarounds and countermeasures to avoid a repetition in the future, and we will be investigating some of the more fundamental issues in the coming period. More information is available below for those interested.

Observations:

- Problems started shortly before 11:00, when one of the Atlas users created a large number of new measurements, each involving all available probes. The user had been given an exceptional amount of credits as part of a special experiment, so the normal limitations on the impact any individual user can have on the system were not active when the measurements were created and activated.
- The results of the newly created measurements put a lot of strain on the measurement scheduler, which triggered our interest. After some investigation the cause of the overload was identified and the related measurements were ended.
- However, by this time the majority of the results generated up to that moment had already reached our queuing servers, and the consumers were already ingesting them into our Hadoop storage platform.
- At this stage we discovered a capacity problem in the process that consumes the Atlas results, so we doubled the capacity of that component on the fly.
- This exposed the next bottleneck in our platform: the newly created results accumulated on a very small number of processing nodes. Normally, incoming measurement results are distributed over several storage nodes, so this concentration strongly reduced the rate at which new data could be consumed.
- A third contributing factor was that, in an attempt to curb the growth of the Atlas data, we migrated the Atlas data sets to a more efficient compression algorithm earlier this year. This saved us some 40-50% of storage space for the Atlas data, at the expense of some compute power. Under normal circumstances, even at high loads, this compute power is abundantly available on the storage cluster. Under the specific circumstances of last Friday, however, the change of compression algorithm turned out to have increased the processing time of some Hadoop system tasks by up to a factor of 8, which had a direct impact on the data consumption speed.

Immediate actions taken:

- Removed the special privileges of the end user in question.
- Added capacity to the Atlas consumer processes.
- Returned (temporarily) to the less efficient compression on the Atlas data sets.

Lessons learned and further planned action:

- Granting special privileges to some of the Atlas users needs (even) more attention than it already receives.
- We need to better communicate "best practices" to these power users so they can use their extra allowances responsibly.
- Improved compression of the Atlas data has decreased our storage demands, but it has also decreased our processing capacity. This needs further investigation to find the optimum configuration (a small illustration of the trade-off follows below this list).
- Investigate possibilities to better spread incoming results over more worker nodes, i.e. reduce hotspots (a second sketch below illustrates the idea).
- Investigate and quantify reasonable boundaries for the scalability of the whole system, to guide the limits for granting credits to end users.
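For those interested, two small illustrations follow. The first is a toy Python sketch of the storage-versus-CPU trade-off mentioned above: a faster, lighter codec against a slower, denser one on synthetic data. The codecs shown (zlib and bz2) are stand-ins only and are not necessarily the algorithms used on our Hadoop cluster; the numbers it prints depend entirely on the synthetic input.

    # Illustrative only: the actual Atlas codecs are not named here;
    # zlib and bz2 merely stand in for the two ends of the trade-off.
    import bz2
    import time
    import zlib

    # Synthetic "measurement results": repetitive JSON-like text.
    sample = b'{"prb_id": 12345, "rtt": 23.4, "dst": "192.0.2.1"}\n' * 200_000

    for name, compress in (("zlib level 1", lambda d: zlib.compress(d, 1)),
                           ("bz2 level 9", lambda d: bz2.compress(d, 9))):
        start = time.perf_counter()
        out = compress(sample)
        elapsed = time.perf_counter() - start
        ratio = len(out) / len(sample)
        print(f"{name}: {ratio:.1%} of original size in {elapsed:.2f}s")

The denser codec typically wins on disk usage but costs noticeably more CPU per byte, which is exactly the resource that became scarce under last Friday's load.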
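The second sketch illustrates the hotspot issue: if result placement is keyed on the measurement ID alone, a handful of very large measurements land on a handful of nodes, whereas including the probe ID in the key spreads the same burst over all nodes. The node count, key choices and hash below are assumptions for illustration only, not a description of our actual pipeline.

    # Illustrative only: toy placement of results onto storage nodes.
    import hashlib
    from collections import Counter

    NODES = 12  # hypothetical number of storage/worker nodes

    def node_for(key: str) -> int:
        # Hash the placement key and map it onto one of the nodes.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NODES

    # A burst of results: 5 huge measurements, each answered by 10,000 probes.
    results = [(msm, prb) for msm in range(5) for prb in range(10_000)]

    by_msm = Counter(node_for(f"{msm}") for msm, prb in results)
    by_msm_prb = Counter(node_for(f"{msm}/{prb}") for msm, prb in results)

    print("keyed on measurement id only :", dict(by_msm))
    print("keyed on measurement + probe :", sorted(by_msm_prb.values()))

With the first key, all 50,000 results concentrate on at most 5 nodes; with the second, they spread roughly evenly over all 12.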
Kind regards,
Romeo Zwart