Stefan Radtke's Blog: How to access data with different Hadoop versions and distributions simultaneously

05 May 2015

How to access data with different Hadoop versions and distributions simultaneously

Many companies start rolling out or at least think about using Hadoop for data analysis and processing. Hadoop Distributed Filesystem (HDFS) is the underlying filesystem and typically local storage within the compute nodes is used to provide the storage capacity. HDFS has been designed to satisfy the workload characteristics for analytics and has been born during the time where 1 Gigabit Ethernet was the standard networking technology in the datacenter. The idea was to bring the data to the compute nodes in order to minimize network utilization and bandwidth and reduce latency through data locality. (I’ll post another article to discuss this in more detail and show that this requirement is less important these days where we have 10 Gigabit almost everywhere in the datacenter). For the moment we’ll look at some other side effects of this strategy which reminds me somehow to a lot of data silos. Business Intelligence dudes know what I am talking about.

Figure 1: The “Bring Data to Compute” strategy results in a lot of data silos, complex and time consuming workflows.
One thing you may already know is the fact that HDFS is not compatible with POSIX protocols like

NFS or SMB. That means you need special tools to copy your data into the filesystem. That’ll take a long time. For example, if you need to copy 100TB over a 10 GB Ethernet, you’d need more than 24 hours to so so if the network is not occupied with other traffic.

But it gets even worse: You may have multiple HDFS distributions or versions like you have different RDBMS systems

Have you ever thought about how many different RDBMS systems you have in the company ? Typically companies have several relational database systems like Oracle, MS SQL, MySQL, DB2, Sybase etc. But wouldn’t it be easier to have only one system? The answer of course is: yes it would, but that’s not the reality we are facing. In practice we have different RDBMS systems for various reasons:

Application dependencies
Different people or organizations within the company have different preferences
Mergers and acquisitions
Price and licensing models
Functionality
Performance
Historical reasons
Large IT organizations or service providers just have to support what their customers want. They cannot dictate the Hadoop version or setup a new cluster including storage for every customer
New innovative distributions appear in the market. Consider how many Linux distributions we have. It’s not only RedHat, SuSe, Debian, Ubuntu and you name it. Just recently Intel, IBM and Pivotal/EMC have announced that they’ll maintain their additional distributions that are optimized for virtual and cloud environments. The same may happen with HDFS.
…and others

I guess we’ll see the same development with HDFS for quite the same reasons. Furthermore, the development of Hadoop is currently very fast and we’ll see HDFS vendors starting to build their individual strength within different areas.
Now think about how you would use different versions or distributions when you have compute and data tightly integrated? It most probably will end up in more HDFS clusters, more copies of data and big data movements. You may also be stuck at a specific version or distribution because you need your production data to be available for analysis and you cannot just migrate and copy them every day.

Here is the solution: the Scale-out Data Lake Isilon

Fortunately there is a solution to this issue: EMC’s Isilon Scale Out NAS System has a very mature distributed filesystem OneFS. It’s also build on top of commodity hardware and uses internal disks to provide the space for the data. However, it’s much more advanced than HDFS in many regards and it has been built over more than 15 years to serve massive amounts of data with very high throughput and low latency. To serve Hadoop requests, HDFS has been implemented as a protocol rather than a filesystem. As a result, you can access your data over various protocols such as SMB, NFS, FTP, HTTP, Openstack Swift and HDFS simultaneously while consistency, protection, access control and global file locking is provided by OneFS.

Figure 2: Data on the Scale-Out filesystem OneFS can be accessed via multiple protocols

HDFS as a protocol

Instead of storing the data on a new filesystem type, the Isilon team has integrated HDFS as a protocol. A multi-threaded daemon called isi_hdfs_d is running on every Isilon node. It services both Name Node and Data Node protocols and it translates HDFS RPCs to POSIX system calls. As HDFS is stateless, the underlying filesystem handles coherency.

Figure 3: Multi-threaded HDFS daemon runs on every Isilon node.
With this approach, new protocol version can be integrated quickly and data migrations or modifications are not required as they reside on the POSIX scale-out filesystem.

Figure 4: Data Node and Name Node requests are served in a highly available manner.

Access to the data with different Hadoop versions or distributions

This “de-coupling” of compute and storage with Isilon as your “Data Lake”, you can now access the very same data with multiple Hadoop distributions and even different HDFS versions (at the time of writing this, Isilon supports almost everything from HDFS 1.0 to HDFS 2.6 and the development team has a strong focus to have new versions ready right after the major distributions come up with new HDFS versions).
If you think about this for a moment you’ll agree that this is huge! By pooling your data into Isilon, you get complete freedom which version of HDFS you want or need to use. You can test new versions, roll back to previous ones or use another distribution to access the same data simultaneously. Think a moment about the analogy with the different RDBMS versions in your company which I have mentioned above. There is a high probability that you’ll have the same with Hadoop: different Hadoop distributions and versions within the company. That’s no problem with Isilon. Solved.

Other Advantages

But there are further advantages:

No single point of failure. Name node requests are served by all Isilon Nodes in an active/active manner.
Isilon protects data with erasure coding across nodes. That’s much more efficient than just creating multiple copies of each block. The protection level can be set very flexible and dynamically for every directory or pool. If you follow the guidelines, you’ll get a protection overhead between 20% and 30%. That’s much more efficient over native HDFS where you need to provide 300% of DAS capacity for 3 copies of data. See [3] for more details.
You can scale compute and storage independently. Your compute nodes don’t require storage anymore (you might want to use internal disks for the shuffle IO though). If you need compute power, you add servers, if you need capacity, you add Isilon nodes.
Most workloads run faster on Isilon [1,4].
For some data you can eliminate ingest since the data is already present on Isilon
If you need to ingest data, you can do it via POSIX protocols such as NFS, SMB or FTP. IDC found that NFS writes work 4.2 times faster on Isilon and 36 times faster for reads [1,4].
Use existing authentication providers such as Kerberos, Active Directory, LDAP etc. for integrated security
Isilon balances the data equally across all nodes in the cluster. If you need more capacity, you just add a node. New capacity is available immediately, rebalancing takes place in background.
Use parallel synchronization over LAN or WAN for a disaster recovery strategy
Manage a single large scale-out filesystem (today up to 50PB) very easy via a WebUI, CLI (OneFS is based on FreeBSD so Unix/Linux dudes feel home) or API.
Use Data Tiering: you can use different Isilon nodes in one cluster and use policy based and transparent data tiering for optimal performance and cost efficiency. For details see [5].
Data at Rest Encryption on Isilon is done at drive level. There is almost no performance impact compared to evolving software encryption solutions.
Use Isilon de-duplication. It’s running as a background post process and as such doesn’t impact production performance.
Use Snapshots. Currently more than 20000 snapshots are supported.
Use SEC 17-a4 compliant WORM retention
Use certified file system auditing
Use Isilon Access Zones to provide Hadoop as a Service securely to multiple tenants.
Use existing backup mechanisms.

Summary

OneFS is a very mature scale-out filesystem that serves data via multiple protocols including HDFS to hundreds or thousands of clients. The biggest advantage is that you separate compute and storage and you can scale both independently. Most importantly, you can provide access to the data via multiple HDFS protocols and distributions at the same time. No matter which version or distributions your users or customers prefer, they all can be served by Isilon as long as it is a distribution that’s based on the Apache base, such as Hortonworks, Pivotal or Cloudera. Instead of using HDFS data silos, Isilon is a great foundation for your Data Lake with enterprise grade functionalities that integrates well into your datacenter’s infrastructure with respect to security, serviceability, high performance.

References

[1] EMC Isilon Scale-out Data Lake Foundation – Essential Capabilities for Building Big Data Infrastructure, IDC White Paper, October 2014.
[2] EMC Isilon OneFS – A Technical Overview; White Paper, November 2013.
[3] High Availability and Data Protection with EMC Isilon Scale-Out NAS, White Paper, November 2013.
[4] Comparing Hadoop performance on DAS and Isilon, Stefan Radtke, blog post 2015.
[5] Next Generation Storage Tiering with EMC Smartpools, White Paper, April 2013.
The White Papers mentioned here are all available for download at https://support.emc.com

Acknowledgement

Thanks to my colleague Ryan Peterson who brought up the idea of the analogy to RDBMS systems and why you have different ones during a recent discussion.