For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark you need to pull a lot of data into memory very quickly. Having a file system that supports high burst rates — up to network saturation — is a good thing.
However, Hadoop’s eponymous file system (Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be. What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and how that block matches to a file.
Author: Andrew C. Oliver