HDFS: Big data analytics’ weakest link

For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark you need to pull a lot of data into memory very quickly. Having a file system that supports high burst rates — up to network saturation — is a good thing.

However, Hadoop’s eponymous file system (Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be. What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and how that block matches to a file.

Author: Andrew C. Oliver

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s