High throughput vs low latency in HDFS

I tried to define what high throughput vs. low latency means in HDFS in my own words, and came up with the following definition:

HDFS is optimized to access batches of a data set quickly (high throughput), rather than particular records in that data set (low latency)

Does it make sense? :) Thanks!

asked May 23, 2013 at 15:32 by spacemonkey

2 Answers

I think what you've described is more like the difference between optimizing for different access patterns (sequential, batch vs random access) than the difference between throughput and latency in the purest sense.

When I think of a high-latency system, I'm not thinking about which record I'm accessing, but rather that accessing any record at all carries a high fixed overhead. Accessing even just the first byte of a file from HDFS can take around a second or more.

If you're more quantitatively inclined, you can think of the total time required to access N records as T(N) = aN + b. Here, a represents throughput (per-record cost) and b represents latency (fixed overhead). With a system like HDFS, N is often so large that b becomes irrelevant, so tradeoffs favoring a low a are beneficial. Contrast that with a low-latency data store, where each read often accesses only a single record, and optimizing for a low b is better.
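To make the tradeoff concrete, here is a minimal sketch of that linear cost model. The parameter values are made-up illustrative assumptions (not measured HDFS numbers): the batch store has a low per-record cost but a high fixed overhead, and the random-access store the reverse.

```python
# Sketch of the linear cost model T(N) = a*N + b.
# All parameter values below are hypothetical, for illustration only.

def access_time(n_records, a, b):
    """Total time to read n_records: per-record cost a plus fixed overhead b."""
    return a * n_records + b

# Hypothetical batch store (HDFS-like): tiny per-record cost, ~1s startup.
batch = dict(a=0.00001, b=1.0)            # seconds
# Hypothetical random-access store: larger per-record cost, ~1ms startup.
random_access = dict(a=0.001, b=0.001)    # seconds

for n in (1, 1_000_000):
    t_batch = access_time(n, **batch)
    t_random = access_time(n, **random_access)
    print(f"N={n}: batch={t_batch:.3f}s, random-access={t_random:.3f}s")
```

For N = 1 the random-access store wins easily (its b dominates), while for N = 1,000,000 the batch store's low a makes it far faster, which is exactly the regime HDFS is designed for.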

With that said, your statement isn't incorrect. It is often the case that batch-access stores have high latency and high throughput, whereas random-access stores have low latency and low throughput, but this is not strictly always the case.