Initially developed at U.C. Berkeley’s AMPLab in 2009, Apache Spark is a “lightning-fast unified analytics engine” for large-scale data processing. It can run on cluster managers such as Hadoop YARN, Apache Mesos, and Kubernetes, or be deployed as a standalone cluster.
It can also access data from a wide variety of sources, including the Hadoop Distributed File System (HDFS), Cassandra, and Hive. In this article, we’ll dive into Spark, its libraries, and why it has grown into one of the most popular distributed processing frameworks in the industry. If you’re new to the world of Big Data, I highly recommend reading up on the Hadoop ecosystem first to get an idea of how Spark fits into a Big Data analytics stack.
Author: Yoshitaka Shiotsu