Yesterday the attendees of the Global Windows Azure Bootcamp took part in a Global Render Lab that was built on the Windows Azure platform. The lab was adapted from a simple demo I wrote in 2010, and then adapted for a lab that I use on my Windows Azure training courses.
The lab allowed attendees from events all over the globe to participate and compete in rendering frames in a 3D animation. All the processing would take place in a Windows Azure datacenter.
About 750 attendees from 50 locations in 22 countries took part in the lab. During the event a total of 9904 worker role instances were started, with over 4,000 instances running concurrently for the second half of the event. 724,059 3D ray traced animation frames were rendered with a total render time of 4 years 184 days 2 hours and 46 minutes. The overall compute time used by the 9904 worker roles was almost 7 years.
The Global Render Lab website received 3,718 unique visits, with 40,022 page views during the event. At times there were over 100 simultaneous visitors on the site.
The traffic on the website was sustained over the day with over 5,000 page views per hour at its peak. The website was hosted on a single small reserved instance in Windows Azure Websites, with the ASP.NET cache being used to cache the result sets from the queries to the Windows Azure SQL Database.
228 animations were published to the website using Windows Azure Media Services. The peak inbound data was 6.57 GB per hour, and the maximum encoding job queue depth reached 43 jobs.
The worker roles used 4 storage accounts for animating, rendering, encoding, and media storage. The rendering storage account peaked at 2,105,873 queue requests per hour, which is an average of 585 requests per second. The peak for blob storage was 415,435 requests per hour, which is an average of 115 requests per second.
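The per-second figures above follow directly from the hourly peaks. As a quick check, here is a minimal Python sketch of the arithmetic; the function name is only for illustration.

```python
def per_second(requests_per_hour: int) -> float:
    """Convert an hourly request count into an average per-second rate."""
    return requests_per_hour / 3600

# Peak hourly figures quoted above.
queue_peak_per_hour = 2_105_873
blob_peak_per_hour = 415_435

queue_rate = round(per_second(queue_peak_per_hour))
blob_rate = round(per_second(blob_peak_per_hour))

print(queue_rate)  # 585
print(blob_rate)   # 115
```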
Back in 2010 there was a lot of buzz around Windows Azure and Cloud Computing as they were, and still are, new and rapidly evolving technologies. I had sat through a number of presentations where the scalability of cloud based solutions was evangelized, but had never seen anyone demonstrate this scalability on stage. I wanted to create a demo that I could show during a 60 minute conference or user group presentation that would demonstrate this scalability.
My first job in IT was as a 3D animator and I had initially learned to create animations using PolyRay, a 3D text-based ray-tracer. Creating ray-traced animations is very processor intensive, so running PolyRay in a Windows Azure worker role and then scaling the number of worker roles to create an animation would be a great way to demonstrate the scalability of cloud-based solutions. I created a very simple Windows Azure Cloud Service application that used PolyRay to render about 200 frames that I could use to create an animation.
The first time I showed the demo was in Göteborg, Sweden in October 2010. As I was using the Windows Azure benefits in my MSDN subscription I had 20 cores available, and I demoed the application scaling to 16 cores. As the compute costs at the time were $0.12 per hour for each core, running 16 cores for an hour would cost $1.92, so I joked with the audience that it was my two-dollar demo.
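The two-dollar figure is easy to reproduce. The sketch below assumes the $0.12 rate applies to each core for each hour, which is what the $1.92 total implies; the function and variable names are only for illustration.

```python
from decimal import Decimal

# Assumed: the quoted rate is charged per core, per hour.
RATE_PER_CORE_HOUR = Decimal("0.12")

def demo_cost(cores: int, hours: int = 1) -> Decimal:
    """Compute cost in dollars for running `cores` cores for `hours` hours."""
    return RATE_PER_CORE_HOUR * cores * hours

sixteen_core_cost = demo_cost(16)
print(sixteen_core_cost)  # 1.92
```

The same function can be used to estimate larger runs by changing the core count.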
Running on 16 instances was fine, but I really wanted to make the demo a little more impressive. Scaling to 256 instances seemed like the next logical step, and this would cost a little over $30 to run for an hour. With 256 instances I really needed a more impressive way to be able to create animations. I hit on the idea of using the depth camera in a Kinect sensor to capture depth data that could be used to create a 3D animation.
The image below of my daughter and me was taken using a Kinect depth camera.
For the animation I chose to model one of those pin-board desktop toys that were popular in the 1980s. I used a simple C# application to do this; it created a scene file for the PolyRay ray-tracer, using the pixel values of the image to determine the position of the pins. The image below shows the frame that would be rendered using the image above.
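To give a feel for the idea, here is a minimal Python sketch of turning depth pixel values into pin positions. It is not the C# application I used and it does not produce PolyRay scene syntax. The 0 to 255 pixel range, the pin spacing, and the maximum pin travel are all assumptions made for illustration.

```python
def pin_positions(depth, spacing=1.0, max_travel=5.0):
    """Map a grid of depth pixel values (assumed 0-255) to pin positions.

    Returns one (x, y, z) tuple per pixel, where x and y come from the
    pixel's column and row, and z is how far the pin is pushed out.
    """
    pins = []
    for row, line in enumerate(depth):
        for col, value in enumerate(line):
            # Scale the pixel value linearly into the pin's travel range.
            z = value * max_travel / 255
            pins.append((col * spacing, row * spacing, z))
    return pins

# A tiny 2x2 "depth image" standing in for a real Kinect frame.
sample_depth = [
    [0, 255],
    [51, 102],
]
pins = pin_positions(sample_depth)
for pin in pins:
    print(pin)
```

Each tuple could then be written out as one pin object in whatever scene format the renderer expects.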
I also added Windows Azure Media Services and Windows Azure Websites into the demo so that the completed animation would be encoded into MP4 format and published on a website.
Scaling an application to 256 worker roles live on stage is an exciting demo to do, but I do get a little nervous every time I do it, as I am heavily reliant on the Windows Azure datacenter I am using being able to allocate the resources on-demand when I need them. I have delivered the Grid Computing with 256 Worker Roles and Kinect demo a number of times at various user groups and conferences and, usually, the demo works fine. It typically takes about 10-20 minutes for the application to scale from 4 roles to 256 roles.
I have adapted the demo to use as a lab in my Windows Azure training courses. The class is divided into two teams and they compete with each other to render the most frames. The lab involves some coding on the solution, and then creating and deploying a deployment package to a Windows Azure Cloud Service. The students are free to scale the number of worker roles they are using to compete with the other team.
I found that the lab really encourages teamwork and cooperation, as when one student gets their solution deployed they will help the others on their team to complete the lab and get more worker roles running. I use a simple WPF application to keep the score.
If you are interested in attending one of my courses, details are here.
In early 2013 Magnus Mårtensson and I had discussed the idea of running an Azure bootcamp in Stockholm. We decided it would be a great idea to involve some of the other MVPs in Europe and the US and ask if they were interested in running bootcamps in their regions on the same day. This would make for a great community spirit, and allow us to share ideas and publicize the events.
We set a date for Saturday 27th April, and started to get others involved. At the MVP summit in February we invited Azure MVPs and MVPs from other technologies to organize and run their own bootcamp events on the same day. We got a great response, and it resulted in close to 100 events planned in almost 40 countries, with over 7,000 people registered to attend.
Another thing we discussed at the MVP summit was the idea of having some kind of lab, or project that all the events could participate in. This would really help to drive the community spirit and connect the groups in different regions. It would also help to make the event truly global, by having participants around the world cooperating to achieve one goal.
As I had the worker role animation rendering lab for my course ready to go, I suggested it would make a great lab for the Global Windows Azure Bootcamp. It should be fairly easy to convert the lab to work with the different locations working as teams, and create a website that would display the scores.
It would be great fun to have all the different countries and locations competing with each other to render the most animation frames. The challenge would be to ensure that the application would scale to a global level and be able to handle the load that the attendees would place on it.
The challenge I had with creating the lab was to make something that every student could participate in. We originally anticipated that we would have about 10 events, with an average of 50 people at each event, so we would have a maximum of 500 participants. As the event drew nearer we realized that we had been very conservative in our estimates: the event would be ten times larger than we originally planned, with close to 100 events and over 7,000 attendees.
The challenge for me when I was creating the lab was to make something that could scale to tens of thousands of worker role instances if required. What made this even more challenging was that there would be no way to test this scalability before the event, and I had no control over all the instances that were running, as they would be deployed by attendees in different locations around the world. The lab would last for 26 hours, starting in Sydney and Melbourne, Australia, and ending in San Diego, California, meaning if I was going to be monitoring the lab over the event, I was not going to get much sleep.
Another potential issue with the event being a lot more popular than we expected was the load that it would place on Windows Azure. As I was hosting the storage services in the North Europe datacenter, the attendees would be deploying their worker roles there. We asked the Azure team if there would be any problems if the bootcamp attendees tried to create over 10,000 worker roles in one datacenter, and were assured that it would not be an issue.
The students would have a deployment package that they could deploy to Windows Azure using their own subscriptions. The events would also be able to use a Windows application and a Kinect controller to create and upload animations that would be processed by the global render farm.
There is a webcast with an overview of the lab here.
The event kicked off in Sydney and Melbourne, Australia on the morning of Saturday 27th April. In Sweden it was midnight, and I was at home monitoring the progress. The students would hopefully start deploying the worker roles early on, so I could see that everything was running smoothly and then get some sleep.
To monitor the render lab I added code to the worker roles to send messages to a Windows Azure Storage queue and I used a simple C# console application that would receive the messages. This meant I would receive notifications when worker roles started or stopped, when animations were completed, and when any exceptions were thrown by worker roles. This let me keep track of the thousands of worker roles that were running. I even used Console.Beep() so that I would know when things were happening if I was not sat at my PC.
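The monitoring pattern is simple enough to sketch. The Python below is not the code from the lab: it uses the standard library's in-memory `queue.Queue` as a stand-in for the Windows Azure Storage queue, and the message fields (`kind`, `role`, `detail`) are hypothetical names chosen for illustration.

```python
import json
import queue
from collections import Counter

def send_event(q, kind, role_id, detail=""):
    """Worker side: post a small JSON status message to the queue."""
    q.put(json.dumps({"kind": kind, "role": role_id, "detail": detail}))

def drain(q, running, counts):
    """Monitor side: read every waiting message and update the tallies."""
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            return
        event = json.loads(raw)
        counts[event["kind"]] += 1
        if event["kind"] == "started":
            running.add(event["role"])
        elif event["kind"] == "stopped":
            running.discard(event["role"])
        print(f"{event['kind']:<10} {event['role']} {event['detail']}")

# Simulate a few worker roles reporting in.
events = queue.Queue()
send_event(events, "started", "role-1")
send_event(events, "started", "role-2")
send_event(events, "completed", "role-1", "animation finished")
send_event(events, "exception", "role-2", "render failed")
send_event(events, "stopped", "role-1")

running = set()
counts = Counter()
drain(events, running, counts)
print(f"running now: {len(running)}")
```

A real monitor would poll the storage queue in a loop and delete each message after processing it, but the bookkeeping is the same.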
At 01:14 Swedish time Alex Thomas at the Melbourne event in Australia started the first worker role instance, closely followed by other attendees at that event. A total of 93 worker role instances were created at the Melbourne event, which rendered almost 9,000 frames of animation. At 04:00 on Saturday morning the lab was running smoothly, and so I decided to get a few hours sleep before the events in Europe started.
I woke up at 07:00 and some of the Eastern Europe events, and events in India had started. Things were still running fine, and we had a few hundred instances running. The event in Stockholm that I was running started at 10:00 and I got there early to open up and started the event with a short presentation about the Global Render Lab. Robert Folkesson and Chris Klug did a fantastic job delivering sessions and assisting the students with labs, whilst I spent most of the event monitoring the Render lab. Germany, Belgium, Denmark, UK and The Netherlands really got into the spirit of the lab, with Germany creating over 1,200 worker roles in total.
At 10:30 Central European Time we had over 1,000 running worker roles, and by 11:30 over 2,000. By 16:00 we had over 4,000 instances running, and this was maintained for the rest of the event as attendees in Europe deleted their deployments and attendees in the USA deployed and scaled up theirs.
I had also included a “Sharks with Freakin Lasers” easter egg in the animation creator that some of the attendees discovered.
By the time the events in Europe were closing, the events in the USA and Brazil had started. I got home from the Stockholm event at 17:00 and after some family time I was back monitoring the lab. USA had about 14 events competing in the render lab, and were trying to catch up with Germany.
USA had a total of 2477 worker roles deployed during the event, compared to Germany’s 1260, so by the end of the event they had taken first place in the countries, with Berlin taking first place in the locations.
Two or three days before the event I was pretty sure that the Global Render Lab would be a failure, and was seriously considering cancelling the lab at the last minute. About three days before the event I was hitting a lot of issues with reliability. I have not had time to diagnose exactly what caused these, but will hopefully include them in a later report. 12 hours before the event kicked off I hit a major potential show-stopper in my code, but with help from Maarten Balliauw I was able to resolve it quickly.
Thanks to the time invested by some of the event organizers during the testing phases of the lab I was able to detect a number of other issues that could have been potential show-stoppers on the day. The need to be able to deploy a new version if needed, and to put all the running worker roles into an idle state was quickly identified, as was the need to be able to reduce the load on storage accounts by disabling worker roles by country, location, attendee or specific role instance. I had no control over the deployment and deleting of the worker roles, but I needed some control over how they ran against the storage accounts.
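As an illustration of that kind of control, here is a minimal Python sketch of the check a worker role could make before taking on work. It is not the lab's actual implementation; the class names, field names, and the idea of a single shared settings record are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RoleIdentity:
    """Who a worker role instance belongs to (hypothetical fields)."""
    country: str
    location: str
    attendee: str
    instance_id: str

@dataclass
class ControlSettings:
    """Central switches the lab operator could change during the event."""
    all_idle: bool = False
    disabled_countries: set = field(default_factory=set)
    disabled_locations: set = field(default_factory=set)
    disabled_attendees: set = field(default_factory=set)
    disabled_instances: set = field(default_factory=set)

def may_process(role: RoleIdentity, control: ControlSettings) -> bool:
    """Return True if this role instance is allowed to take work."""
    if control.all_idle:
        return False
    return not (
        role.country in control.disabled_countries
        or role.location in control.disabled_locations
        or role.attendee in control.disabled_attendees
        or role.instance_id in control.disabled_instances
    )

role = RoleIdentity("Sweden", "Stockholm", "attendee-42", "instance-0")
control = ControlSettings()
print(may_process(role, control))  # True

control.disabled_locations.add("Stockholm")
print(may_process(role, control))  # False
```

Each worker would re-read the settings periodically, so a change made centrally takes effect without redeploying anything.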
A number of animations failed to be completed and got stuck in the render queue with a status of Encoding. This was mostly due to the way I had implemented the encoding process in the worker role, but also due to the way the students created and deleted deployments. Worker roles were being deleted throughout the event, sometimes at a rate of over 100 per minute, and this meant that some long-running tasks would fail to complete.
Overall I felt that the lab was a great success. From the photos captured by the attendees who uploaded animations it looked like they were enjoying using the application. Many of the events took part in the lab, with some of them taking the competition aspects seriously. It would have been great to have more of the locations taking part. More effort could have been made to promote the lab and make sure that content was provided to attendees in their native languages.
On the whole the application stood up to the load that we placed on it. Some attendees had to wait a long time for their animations to be rendered and encoded. The job queue on the Media Services account indicates that increasing the capacity available there would have reduced this time.
There were a few reliability issues that meant that some animations never got encoded, so there is scope for improvement here. Also the range of different animations that could be selected and rendered from the depth data could be extended.
The project started out as a simple demo in 2010 and has been extended and improved to make the solution we used for the Global Windows Azure Bootcamp. I plan to continue working with the project when I get the time and make more improvements. I have a large project backlog list for the Global Render Lab. There were so many cool things that I wanted to add to it, but only a limited amount of time available.
It was great fun to run the lab, and hopefully there will be opportunities to do something similar in the future. Feel free to contact me via this blog if you have any suggestions or questions about the lab. I’d be happy to deliver sessions detailing the background of the lab at user groups and conferences if there is an opportunity for that.