Workflow Languages and Large Scale Data Analytics

In many scientific areas, e.g., bioinformatics, growing data volumes and increasing workflow complexity necessitate workflow languages tailored towards parallelism and software integration.

While data-parallel dataflow systems like Hadoop, Spark, or Flink can scale to a large number of nodes, scientific workflow systems like KNIME or Galaxy integrate arbitrary software, including command-line tools and libraries with R or Python interfaces.
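The integration side can be illustrated at a small scale: a workflow step that wraps an arbitrary command-line tool behind a uniform function interface, so external tools compose with steps written in the host language. A minimal sketch in Python; the `run_tool` helper and the `echo` invocation are illustrative, not taken from any particular workflow system:

```python
import subprocess

def run_tool(cmd, stdin_text=None):
    """Wrap an arbitrary command-line tool as a workflow step:
    the step consumes a string and produces a string, so it can
    be composed freely with steps written in Python itself."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout

# A toy two-step workflow: an external tool feeds a native Python step.
raw = run_tool(["echo", "acgt"])   # external command-line tool
print(raw.strip().upper())         # native Python step
```

In a real system the wrapper would additionally manage staging of input and output files, but the composition principle is the same.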

How general can a workflow language that focuses on parallelism and integration be? Can it host conditionals, compound data structures, and unbounded iteration? How should such workflows be composed from smaller parts?
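The three language features named above can be sketched together in a few lines. The following Python toy is not a workflow language, only an illustration of what the features mean: compound data (a list of reads), a conditional (a quality predicate), and unbounded iteration (a fixpoint-style loop whose number of rounds is unknown when the workflow is written). All names are hypothetical:

```python
def trim(reads):
    """Toy step: drop the last character of every read
    (operates on compound data, here a list of strings)."""
    return [r[:-1] for r in reads]

def quality_ok(reads, max_len):
    """Toy predicate standing in for a workflow conditional."""
    return all(len(r) <= max_len for r in reads)

def until(pred, step, data):
    """Unbounded iteration: apply `step` until `pred` holds.
    The number of rounds is data-dependent, not fixed in advance."""
    while not pred(data):
        data = step(data)
    return data

reads = ["ACGTACGT", "ACGTA", "ACG"]
print(until(lambda d: quality_ok(d, 4), trim, reads))  # → ['ACGT', 'A', '']
```

A workflow language that supports these constructs must decide how they interact with parallel scheduling, e.g., whether the iterations of `until` serialize while the list elements inside each `trim` run in parallel.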

Large Scale Reproducible Applications in Next Generation Sequencing

Next Generation Sequencing (NGS) machines generate a growing amount of data while the cost per sequenced base pair decreases. In this setting, applications such as variant calling, ChIP-Seq, or RNA-Seq are particularly important.

Conveniently specifying such workflows and executing them on parallel, distributed systems is a major challenge for current workflow systems. In addition, long-term reproducibility is relevant not only for the workflow itself but also for the data it consumes, independently of the execution infrastructure.
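One common way to make the consumed data reproducible independently of infrastructure is content addressing: a workflow records a cryptographic hash of each input, so the same bytes are identifiable years later on any storage system. A minimal sketch, assuming an in-memory stand-in for a reference file (the names `content_id`, `reference`, and `manifest` are illustrative):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Identify an input by its content, not its location:
    the same bytes yield the same id on any infrastructure."""
    return hashlib.sha256(data).hexdigest()

reference = b">chr1\nACGTACGT\n"   # stand-in for a FASTA reference file
manifest = {"reference.fa": content_id(reference)}

# Before re-running the workflow later, verify the data is unchanged.
assert manifest["reference.fa"] == content_id(reference)
print(manifest["reference.fa"][:16])
```

Archiving such a manifest alongside the workflow pins the data version without copying the data itself.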

Functional Programming Language Interpretation

Functional Programming (FP) languages are particularly well suited to parallel execution. Implementing an interpreter for a lazy, parallel FP language with strong integration features requires a clear understanding of its syntax and semantics.
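The core of lazy evaluation can be captured in a few lines: a thunk delays a computation until its value is demanded and caches the result so it is computed at most once. A minimal sketch in Python (an interpreter would wrap every subexpression this way; the `Thunk` class here is illustrative):

```python
class Thunk:
    """Delay a computation until its value is demanded (laziness),
    and cache the result so it runs at most once (sharing)."""
    def __init__(self, fn):
        self.fn = fn
        self.done = False
        self.value = None

    def force(self):
        if not self.done:
            self.value = self.fn()
            self.done = True
        return self.value

calls = []
expensive = Thunk(lambda: calls.append("ran") or 42)

assert calls == []               # nothing has run yet: work happens on demand
assert expensive.force() == 42
assert expensive.force() == 42   # second force reuses the cached value
assert calls == ["ran"]          # the computation ran exactly once
```

In a parallel setting, forcing a thunk additionally needs synchronization so that concurrent demands for the same value do not duplicate work, which is where the interplay of laziness and parallelism becomes subtle.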