Apache Spark has proven to be an efficient and accessible platform for distributed computation. In some areas it comes close to the Holy Grail of making parallelization “automagic” – something we human programmers appreciate precisely because we are rarely good at it ourselves.
Nonetheless, although it is easy to get something to run on Spark, it is not always easy to tell whether it is running optimally – nor, if we sense that something isn’t right, how to fix it. For example, a classic Spark puzzler is the batch job that runs the same code on the same sort of cluster against similar data night after night … but every so often takes much longer to finish. What could be going on?
Author: Adam Breindel