GeReS (Generic Resource Scheduler) for Windows Azure is now available as a beta release on Codeplex.
It is a simple batch job manager written in C# (the older version was written in Python). It includes:
- Command-line utilities (e.g. qsub, qlist, jobcancel, joblist) to queue tasks for computation, check on their status, and cancel them.
- Three task queues (highq, mediumq, lowq), listed in decreasing order of priority.
- An agent to install on Azure VMs. It will pick tasks off the queues in that order and spawn the required processes.
- A simple notifier application that monitors the status of tasks as reported by the agents.
- An autoscaler, running as an extra-small PaaS worker role, which will deploy or remove worker VMs of the desired size based on waiting time for jobs in the queues.
The agent will run as many tasks on a node as there are cores. If a task is marked "exclusive" at submission, it will be the only one to run on its node. This is useful for applications that consume most of the VM's resources.
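The polling order and the per-node capacity rule can be sketched as follows. This is a minimal in-memory illustration in Python, not the GeReS agent code: the `Task` class and the `can_start` and `fill_node` functions are hypothetical names, and plain deques stand in for the Azure queues. The sketch also assumes that a task which cannot start yet blocks everything behind it, which may differ from the real agent.

```python
from collections import deque
from dataclasses import dataclass

# Queue names come from the post; everything else here is illustrative.
QUEUE_ORDER = ("highq", "mediumq", "lowq")


@dataclass
class Task:
    name: str
    exclusive: bool = False


def can_start(task, running, cores):
    """Capacity rule: one task per core, and an exclusive task runs alone."""
    if any(t.exclusive for t in running):
        return False  # an exclusive task already owns the node
    if task.exclusive:
        return not running  # an exclusive task needs an otherwise idle node
    return len(running) < cores


def fill_node(queues, running, cores):
    """Start tasks in priority order until the node is full or the queues are empty."""
    while True:
        # First non-empty queue in priority order, or None if all are empty.
        qname = next((q for q in QUEUE_ORDER if queues[q]), None)
        if qname is None:
            break
        # Assumption: stop at the first task that does not fit (no skipping ahead).
        if not can_start(queues[qname][0], running, cores):
            break
        running.append(queues[qname].popleft())
    return running


queues = {
    "highq": deque([Task("a")]),
    "mediumq": deque([Task("b", exclusive=True)]),
    "lowq": deque([Task("c")]),
}
running = fill_node(queues, [], cores=2)
print([t.name for t in running])  # ['a']
```

In this example the exclusive task `b` waits until `a` has finished, because it needs the whole node to itself.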
The agent is also responsible for updating the status of the spawned processes: running, failed, completed or cancelled. When a node has been idle for longer than a pre-configured time, the agent will queue it for removal.
The autoscaler only deploys or removes worker nodes - it does not keep track of task status.
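The split of responsibilities between agent and autoscaler can be illustrated with a small Python sketch. It is only a guess at the shape of the logic under stated assumptions: the function names, the thresholds (`idle_limit_s`, `max_wait_s`, `max_workers`) and the one-VM-at-a-time policy are all hypothetical and not taken from GeReS.

```python
def node_is_idle_too_long(last_task_finished_at, now, idle_limit_s=600.0):
    """Agent side: decide whether this node should be queued for removal."""
    return now - last_task_finished_at > idle_limit_s


def plan_scaling(oldest_wait_s, workers, removal_requests,
                 max_wait_s=300.0, max_workers=10):
    """Autoscaler side: return (number of VMs to deploy, list of VMs to remove).

    The decision uses only queue waiting time and removal requests;
    it never looks at the status of individual tasks.
    """
    # Honor removal requests only for nodes that are actually deployed.
    to_remove = [w for w in removal_requests if w in workers]
    remaining = len(workers) - len(to_remove)
    # Assumption: add one VM at a time when the oldest job has waited too long.
    waited_too_long = oldest_wait_s > max_wait_s
    to_deploy = 1 if waited_too_long and remaining < max_workers else 0
    return to_deploy, to_remove
```

For instance, `plan_scaling(600, {"n1", "n2"}, ["n2"])` asks for one new VM and the removal of `n2`, while `plan_scaling(10, {"n1"}, [])` changes nothing.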
Note that this distributed architecture is more resilient than a traditional design in which a single "head node" runs the scheduler and apportions work.
The nodes can fail at any time and the incomplete jobs will pop back into the queues for other nodes to pick up.
The autoscaler can fail at any time and simply be restarted without affecting computations in progress.
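One way a failed node's jobs can return to a queue is the visibility timeout offered by Azure storage queues: a retrieved message stays hidden for a set time and reappears unless the consumer deletes it. Whether GeReS relies on exactly this mechanism is an assumption. The Python sketch below simulates the idea in memory with a hypothetical `LeaseQueue` class and an explicit clock; it does not call the Azure SDK.

```python
class LeaseQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self):
        self._messages = {}  # message id -> [body, time at which it is visible]
        self._next_id = 0

    def put(self, body, now=0.0):
        self._messages[self._next_id] = [body, now]
        self._next_id += 1

    def get(self, now, visibility_timeout):
        """Return (id, body) of the oldest visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Called by a node once the task has finished."""
        self._messages.pop(msg_id, None)


q = LeaseQueue()
q.put("job-1")
first = q.get(now=0, visibility_timeout=30)    # node A takes the job, then crashes
hidden = q.get(now=10, visibility_timeout=30)  # still hidden: returns None
retry = q.get(now=31, visibility_timeout=30)   # timeout elapsed: node B gets the job
q.delete(retry[0])                             # node B completes and deletes it
```

Because node A never deleted the message, no coordinator is needed to notice the failure; the job simply becomes visible again.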
Azure storage tables and queues are used to handle and keep track of tasks. Service Bus topics are used for notifications and commands. GeReS benefits from their built-in redundancy and resilience.
For further details on the architecture, please read the release notes on Codeplex.
This short video will show you a typical usage scenario.