Ankus relies on a workload trace. For each job, the trace records information such as the job submission time, the number of map and reduce tasks, the bytes and records read and written by each task, and the memory allocated to each task. Given a workload trace, Ankus distills its statistical distributions and models, and uses this information to generate a job sequence that simulates a real-world workload. Ankus can be generalized to other application environments simply by replacing the workload trace used to bootstrap it.
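To make the trace-driven idea concrete, the following is a minimal sketch of distilling empirical distributions from a trace and resampling them into a synthetic job sequence. It is not Ankus's actual implementation: the `JobRecord` fields, `distill`, and `synthesize` are hypothetical names chosen for illustration, record counts are omitted for brevity, and each field is resampled independently, which ignores any correlation between fields that a real model may capture.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Hypothetical per-job trace record (field names are assumptions)."""
    submit_time: float   # seconds since the start of the trace
    map_tasks: int
    reduce_tasks: int
    bytes_read: int
    bytes_written: int
    memory_mb: int


def distill(trace):
    """Collect the empirical values of each field and the mean inter-arrival gap."""
    ordered = sorted(trace, key=lambda j: j.submit_time)
    gaps = [b.submit_time - a.submit_time for a, b in zip(ordered, ordered[1:])]
    fields = ("map_tasks", "reduce_tasks", "bytes_read", "bytes_written", "memory_mb")
    stats = {name: [getattr(j, name) for j in ordered] for name in fields}
    stats["mean_interarrival"] = mean(gaps) if gaps else 0.0
    return stats


def synthesize(stats, n, rng):
    """Generate n jobs by drawing each field from its empirical distribution."""
    jobs = []
    for i in range(n):
        jobs.append(JobRecord(
            # Evenly spaced submissions; a stochastic arrival model could replace this.
            submit_time=i * stats["mean_interarrival"],
            map_tasks=rng.choice(stats["map_tasks"]),
            reduce_tasks=rng.choice(stats["reduce_tasks"]),
            bytes_read=rng.choice(stats["bytes_read"]),
            bytes_written=rng.choice(stats["bytes_written"]),
            memory_mb=rng.choice(stats["memory_mb"]),
        ))
    return jobs
```

Swapping in a different trace changes only the input to `distill`, which mirrors how replacing the bootstrap trace adapts the generator to another environment.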
Ankus provides two mechanisms for generating workloads. The first replays a sample of the workload trace, or scales the workload intensity while following the statistical distributions derived from the trace. This mechanism generates a comparable workload that mimics real workloads. The second synthesizes a group of jobs that follows statistics configured by the user. It is often used by Hadoop operators to synthesize diverse job types and mixtures for performance evaluation. Whichever mechanism is used, the generated jobs are submitted to the JobTracker according to a Poisson random process.
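A Poisson arrival process can be simulated by drawing independent, exponentially distributed gaps between consecutive submissions. The sketch below illustrates that technique only; the function names and the `intensity` multiplier are assumptions made for this example and are not taken from Ankus.

```python
import random


def scaled_rate(base_rate, intensity):
    """Scale the arrival rate (jobs per second) by a hypothetical intensity factor."""
    return base_rate * intensity


def poisson_submit_times(rate, n, rng):
    """Return n submission times of a Poisson process with the given rate.

    Inter-arrival gaps are exponential with mean 1 / rate, which is the
    defining property of a homogeneous Poisson process.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    t = 0.0
    times = []
    for _ in range(n):
        t += rng.expovariate(rate)  # gap until the next job submission
        times.append(t)
    return times


if __name__ == "__main__":
    rng = random.Random(42)
    # Double the intensity of a baseline of 0.5 jobs per second.
    for when in poisson_submit_times(scaled_rate(0.5, 2.0), 5, rng):
        print(f"submit job at t = {when:.2f} s")
```

Under this scheme, scaling the workload intensity amounts to multiplying the rate, which shortens the expected gap between jobs without changing the shape of the arrival distribution.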
The source code of Ankus is integrated into WaxElephant.