Perl meets big data and high performance computing with the eHive framework.
By Brandon Walts
Date: Saturday, 3 December 2016 10:10
Duration: 20 minutes
Target audience: Intermediate
Language: English
Tags: 5 cloud clusters dbi hpc oo
eHive is a Perl-based framework for configuring, managing and scheduling computational pipelines that run across one or more compute clusters. One of its most exciting features is that it is self-regulating: it is designed to concurrently handle data of many formats and levels of complexity, and workloads ranging from single jobs that run for two weeks to hundreds of thousands of jobs, without overloading the database server or compute cluster. While the framework has been developed with bioinformatics applications in mind, it is applicable to all calculations and data types.
eHive was designed around a beehive metaphor. Autonomous agents called “workers” claim compute jobs from a central list and take them to an appropriate “meadow” of defined computational resources. Each worker specialises into a particular class to perform the required work. When the work is finished, the worker generates a message to indicate that it is done, and can pass parameters and data on to subsequent jobs, following a predefined workflow structure. In the background, one or more “beekeeper” processes monitor the overall pipeline, submitting workers as required, cleaning up after dead workers, and monitoring the list of required jobs. A unique feature of eHive’s configuration is that it provides conditional workflows, allowing the pipeline to adapt automatically to node failures and unanticipated memory requirements. eHive also supports optimisation through fan/funnel functionality, whereby many small sub-tasks can be processed in parallel and their output collected and combined before proceeding to the next step in the pipeline.
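The fan/funnel idea can be illustrated with a minimal, generic sketch. This is not eHive's API or configuration syntax: it is written in Python purely for illustration, and every name in it (split_input, process_chunk, funnel, run_fan_funnel) is hypothetical. It only shows the shape of the pattern: one input is split into independent sub-tasks, the sub-tasks run in parallel, and a single collecting step runs once all of them have finished.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(numbers, chunk_size):
    """Fan step: cut one large input into independent sub-tasks."""
    return [numbers[i:i + chunk_size]
            for i in range(0, len(numbers), chunk_size)]


def process_chunk(chunk):
    """One sub-task. Here it simply sums its chunk (stand-in for real work)."""
    return sum(chunk)


def funnel(partial_results):
    """Funnel step: combine the outputs of all sub-tasks into one result."""
    return sum(partial_results)


def run_fan_funnel(numbers, chunk_size=3, workers=4):
    """Run the fan, wait for every sub-task, then run the funnel once."""
    chunks = split_input(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the map() iterator with list() waits for every
        # sub-task, so the funnel below cannot start early.
        partials = list(pool.map(process_chunk, chunks))
    return funnel(partials)
```

In this sketch the thread pool stands in for the workers and the wait for all sub-tasks stands in for the barrier between the fan and the funnel; how eHive itself expresses and enforces that barrier is not covered here.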
We chose to implement this framework in Perl for portability, flexibility, and extensibility; in particular, the framework allows inexperienced users to easily add their own modules, functionality, and customisations. Our graphical user interface also provides an intuitive means of monitoring and modifying a pipeline on the fly.
Attended by: Martin Berends (mberends), Lee Johnson, Lance Wicks, David Dorward, Brian Kelly, Richard van Lochem (rvlochem), Clyde Ingram, gaah, Chris Jack, bennythejudge, Peter Mottram (SysPete), Md Anwar Hossain, Hugh Barnard, Katherine Spice, Anatolie Mazur (Mask), Michael Jemmeson (michael), Steffen Winkler (STEFFENW)