
ADAM for large-scale genomics data processing
Analyzing DNA and RNA sequencing data requires large-scale data processing to interpret the data in its context. Excellent tools and solutions have been developed in academic labs, but they often fall short on scalability and interoperability. To address this, ADAM provides a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet.
Large-scale data processing solutions such as ADAM-Spark can be applied directly to the output of a sequencing pipeline, that is, to single-sample data that has already gone through quality control, mapping, read preprocessing, and variant quantification. Examples of such output are DNA variants for DNA sequencing and read counts for RNA sequencing.
In our study, we use ADAM to build a scalable genomics data analytics platform with support for the VCF file format, so that we can transform a genotype-based RDD into a Spark DataFrame.
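To make the idea of genotype-based records concrete, the following is a minimal sketch in plain Python of how VCF text can be flattened into one row per (variant, sample) genotype call, which is roughly the tabular shape a DataFrame would hold. It does not use ADAM or Spark. The function name `genotype_rows`, the column names, and the tiny VCF fragment are all assumptions made for illustration, not part of ADAM's API or of our data.

```python
from typing import Dict, List

# Made-up VCF fragment for illustration only (not real data).
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
    "1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t0/1\n"
)


def genotype_rows(vcf_text: str) -> List[Dict[str, object]]:
    """Flatten VCF text into one row per (variant, sample) genotype call."""
    samples: List[str] = []
    rows: List[Dict[str, object]] = []
    for line in vcf_text.splitlines():
        # Skip blank lines and '##' meta-information lines.
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        # The '#CHROM' header line lists sample names from column 10 onward.
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        # FORMAT (column 9) says where the GT value sits in each sample field.
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            # Treat phased ('|') and unphased ('/') separators alike.
            alleles = gt.replace("|", "/").split("/")
            rows.append({
                "sample": sample,
                "contig": chrom,
                "position": int(pos),
                "ref": ref,
                "alt": alt,
                # Number of called alleles that differ from the reference.
                "non_ref_count": sum(1 for a in alleles if a not in ("0", ".")),
            })
    return rows


if __name__ == "__main__":
    for row in genotype_rows(VCF_TEXT):
        print(row)
```

Each resulting dictionary corresponds to one genotype call. In a distributed setting the same per-genotype layout is what makes the data convenient to query by sample, position, or allele count once it is held in a DataFrame.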