I’ve had a number of discussions recently around the virtualisation of Hadoop, some with customers, some internal and some with my colleagues at VMware and Pivotal. As always, these conversations get me thinking, and that in turn has spurred me to come out of hiding and write my first blog post in a while.
This train of thought was sparked by a conversation with someone who was looking at introducing Hadoop into their environment. The primary use case was to leverage unstructured data alongside their traditional legacy EDW solutions. The conversation centred around the desire to virtualise Hadoop to facilitate quicker provisioning for POC / test purposes; however, there was also a desire to potentially do the same with a production Hadoop environment from a cost perspective. Sounds pretty straightforward if you ask me, but the main fly in the ointment appears to be the reluctance of some of the Hadoop vendors to give their blessing to, or indeed commit to supporting, a virtualised approach.
So, why are Hadoop vendors so reluctant?
There could be many reasons why a Hadoop vendor would be reluctant to embrace virtualisation, or indeed any abstraction (HDFS from an appliance, or Linux containers). It could simply be a case of them wanting to apply the “keep it simple, stupid” approach and ensure success by building it the way it has always been built in the past. Or it could be that they’ve simply never tested it in a virtual environment, and the unknown is bad in their eyes. Perhaps it’s something else I haven’t even thought of!?
I have my own opinion, of course: I firmly believe that the reluctance to virtualise Hadoop is because it fundamentally disrupts today’s Hadoop licensing models and as a result has a negative impact on the vendors’ revenue streams (per-CPU or per-host licensing). This is in essence the same kind of pushback that EMC were getting previously around supporting Hadoop with EMC’s Isilon scale-out NAS (per-host or per-TB licensing).
Let me give you a quick example of how both disrupt the Hadoop vendor.
Say I have built a Hadoop cluster of 100 commodity servers to provide my required 0.5PB of storage for HDFS (5TB per server x 100 = 0.5PB). Now despite having 100 servers, I actually only utilise approximately 20% of them for Hadoop compute tasks like MapReduce, yet if I want to add more storage I have to add more servers, compute included, even though I’m never going to use that extra compute power. This model fundamentally doesn’t scale well from a data centre footprint point of view.
So, the first disruption: let’s say I separate the compute and storage elements of Hadoop. I introduce 0.5PB of EMC Isilon to provide the HDFS layer. Instantly I have reduced my server footprint requirement from 100 servers to 20 physical servers plus the EMC Isilon for the storage requirement. First major advantage: if I need to scale either compute or storage I can do it independently of the other. Second major advantage: I go from 33% capacity efficiency (Hadoop keeps 3 copies of every file, so 0.5PB usable needs 1.5PB of raw disk) to up to 80% capacity efficiency (N+1, 2, 3 or 4 protection), roughly 0.625PB of raw disk for the same 0.5PB usable.
The second disruption: now let’s say I also introduce virtualisation of the Hadoop compute nodes into the solution. If I achieve a VM to physical server ratio of 2:1, I now only need 10 physical servers plus the EMC Isilon HDFS storage. If I go for a more aggressive 4:1 VM to physical server ratio, I’d only need 5 physical servers plus the Isilon, instead of the original 100 physical servers. I’ve now got the ability to easily scale my compute and storage layers independently of each other and have vastly shrunk my data centre footprint, which is a huge cost saving in itself.
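To make the arithmetic concrete, here’s a minimal sketch in Python that works through the figures from the example above; the 5TB per server, 20% compute utilisation, 80% Isilon efficiency and 2:1 / 4:1 consolidation ratios are the illustrative assumptions already stated, not measured values.

```python
# Back-of-the-envelope sizing for the three scenarios discussed above.
# All inputs are the illustrative figures from the example, not real sizing data.

USABLE_HDFS_TB = 500        # 0.5PB of usable HDFS capacity required
TB_PER_SERVER = 5           # DAS per commodity server
COMPUTE_UTILISATION = 0.20  # only ~20% of nodes busy with MapReduce
HDFS_REPLICAS = 3           # Hadoop keeps 3 copies of every file
ISILON_EFFICIENCY = 0.80    # assumed N+x protection overhead (~80% usable)

# Scenario 1: traditional DAS cluster - server count is driven by storage need
das_servers = USABLE_HDFS_TB // TB_PER_SERVER             # 100 servers
das_raw_tb = USABLE_HDFS_TB * HDFS_REPLICAS               # 1500TB raw for 500TB usable

# Scenario 2: compute/storage separation - only the busy nodes remain
compute_servers = int(das_servers * COMPUTE_UTILISATION)  # 20 servers
isilon_raw_tb = USABLE_HDFS_TB / ISILON_EFFICIENCY        # 625TB raw for 500TB usable

# Scenario 3: virtualise the remaining compute nodes
for ratio in (2, 4):  # VM : physical server consolidation ratios
    physical = compute_servers // ratio
    print(f"{ratio}:1 consolidation -> {physical} physical servers "
          f"(saving {das_servers - physical} servers vs DAS)")

print(f"Raw disk: {das_raw_tb}TB (DAS, 3x replication) vs "
      f"{isilon_raw_tb:.0f}TB (Isilon at {ISILON_EFFICIENCY:.0%} efficiency)")
```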
Theoretically, we’ve just shaved 90–95 servers off the original solution; not only have we saved on physical footprint, but we’ve also taken 90–95 servers’ worth of annual licensing away from a Hadoop vendor! I can see why they wouldn’t be 100% happy about a customer implementing virtualisation and EMC Isilon for Hadoop.
Now granted, the above is a very simplistic example. However, if you’re looking at deploying Hadoop, I’m sure you can see how virtualisation and introducing Isilon for HDFS could massively reduce your footprint in the data centre and save on licensing costs. You of course need to build out a suitable TCO analysis for the various options, but I encourage you to do so as I bet it works out quite favourable compared to the original 100 nodes; the toy comparison below shows the shape of such an analysis.
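Every unit cost in this sketch is a placeholder I’ve made up purely for illustration, so swap in your own hardware, licensing and storage quotes before drawing any conclusions.

```python
# Toy three-year TCO comparison - every cost below is a placeholder, not a quote.

SERVER_COST = 8_000        # hypothetical cost per physical server
LICENSE_PER_NODE = 4_000   # hypothetical annual Hadoop license per node
YEARS = 3
ISILON_COST = 400_000      # hypothetical cost of the 0.5PB Isilon tier

def tco(servers: int, storage_cost: float = 0.0) -> float:
    """Crude TCO: hardware + per-node licensing over YEARS + storage platform."""
    return servers * SERVER_COST + servers * LICENSE_PER_NODE * YEARS + storage_cost

options = {
    "100-node DAS cluster":        tco(100),
    "20 nodes + Isilon":           tco(20, ISILON_COST),
    "10 nodes (2:1 VMs) + Isilon": tco(10, ISILON_COST),
    "5 nodes (4:1 VMs) + Isilon":  tco(5, ISILON_COST),
}
for name, cost in options.items():
    print(f"{name}: ${cost:,.0f}")
```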
What about performance?
One of the other main areas where the Hadoop vendors appear to be focusing their FUD (Fear, Uncertainty and Doubt) is the performance of virtualised Hadoop. Now, I don’t disagree that depending on what you’re doing with Hadoop, performance may well be the key requirement. If it is very important to you that your Hadoop cluster performs well and that your Hadoop vendor will support it without argument, then physical Hadoop compute nodes may be the best way for you to go.
That said, if you deem flexibility and ease of scaling Hadoop (storage or compute) to be your key requirement, then a different approach may be needed. This is where the separation of compute and storage layers adds immediate benefit, and where virtualisation of the Hadoop compute nodes drives increased flexibility.
I should add that I personally don’t think you necessarily have to sacrifice compute performance to gain that increased flexibility. I would highly recommend reading the VMware white paper on Virtualised Hadoop Performance with vSphere 5.1. In it, VMware conduct a very in-depth comparison between the performance of native Hadoop and 100% virtualised Hadoop instances, including HDFS storage on the servers’ DAS. The graph below compares jobs run on the physical nodes and again with 1:1, 2:1 and 4:1 VM to physical server ratios.
What is apparent from the graph is that in the various scenarios tested, the virtualised instances aren’t far away from native performance (depicted by the 1.0 on the Y axis). In fact, in the TeraSort test (sorting generated data and writing out the results) the 4:1 configuration actually performed better, which is quite interesting but may be down to test harness subtleties.
There is only so much you can read into these kinds of whitepapers; in my view the takeaway is a yes, you should definitely consider virtualising Hadoop. On the flip side, it’s essential that you do your own testing to ensure that your virtualised Hadoop solution delivers against your requirements, whatever they may be.
What about Vendor support?
Support for virtualised Hadoop is another pushback I’ve heard on a few occasions, and I have to say I understand that position. Before my time at EMC I was a Solutions Architect implementing solutions at a UK-based investment fund manager. One of my team’s main mantras when designing solutions was to always ensure supportability, so when I hear a customer complain about lack of support for a solution I often have to sympathise with the position they find themselves in. They want to do something, but can’t, because it compromises the supportability of the business application.
Now, this lack of support is a situation I’ve seen before; it used to be virtualising Oracle that got the pushback. Oracle wouldn’t support the virtualisation of their database, as consolidation ultimately eroded their license revenues (per-CPU licensing). I personally think we’re simply seeing a similar thing occurring now with Hadoop, and customers are therefore wary of building a solution that the Hadoop vendors won’t support. I get that; doesn’t mean it’s acceptable though!
What can I do about it though?
We at EMC have seen a bit of this sort of thing recently, specifically with Isilon for HDFS. Hadoop vendors weren’t keen on losing revenue (as per my second example earlier) and weren’t really willing to support Isilon or sanction its usage, but customers were keen to use Isilon instead of the traditional DAS approach. Now, EMC obviously has alliances with all the major Hadoop vendors; that sadly doesn’t necessarily constitute support.
Some of you may have picked up on the recent joint announcement between Cloudera and EMC around supporting Cloudera Enterprise Hadoop and Isilon for HDFS. This is a great step forward in my opinion, and has primarily come about following concerted pressure from Cloudera customers in the global financial industry. Their sheer desire to leverage an enterprise-grade platform for HDFS in tandem with Cloudera’s Enterprise Hadoop capabilities resulted in Cloudera agreeing to work jointly with EMC to build a supported solution.
Never underestimate your power as a customer: most of the great things we come up with at EMC come about from our interactions with customers and them telling us what they need or want. You shouldn’t be afraid to ask your vendors for the support you need for your business.
Cloudera Enterprise Hadoop and EMC Isilon for HDFS
Very briefly, the main plus point from a Cloudera / EMC point of view is the fact that Isilon supports multi-protocol access to the same data, eradicating the need to do major Extract, Transform and Load (ETL) activity. The same data can be interacted with via SMB, NFS, HTTP, SWIFT, REST and HDFS 1.0 and 2.2 (with 2.3 and 2.4 support coming very shortly), which allows you to put data into Isilon via one method (e.g. logs from an application) and then consume it through another (e.g. HDFS). My personal plus point is the fact that you can use Isilon HDFS with Cloudera, Pivotal HD, Hortonworks or Apache Hadoop, as you’re not tied to a traditional DAS stack.
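As a rough illustration of what that multi-protocol access enables (the NFS mount point and paths here are hypothetical, and it assumes a Hadoop client already configured against the Isilon HDFS endpoint): the ingest path and the analytics path touch the same bytes, with no copy in between.

```python
# One file, two protocols: write over NFS, read back over HDFS.
# The mount point and paths are hypothetical examples, not a real config.
import subprocess

# 1. An application writes its logs to an (assumed) Isilon NFS export
with open("/mnt/isilon/analytics/app.log", "a") as log:
    log.write("2014-06-30 12:00:01 user=alice action=login\n")

# 2. The same file is immediately visible over HDFS - no ETL copy step -
#    assuming the Hadoop client points at the Isilon HDFS endpoint
subprocess.run(["hdfs", "dfs", "-cat", "/analytics/app.log"], check=True)
```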
Some of the related links to that announcement are included below, with the Cloudera CTO blog post a particularly good read.
So, for those customers out there who want to virtualise Hadoop but want to ensure they are fully supported: I implore you to put the pressure on your Hadoop vendor and your VMware account team. I know for a fact that VMware have alliance teams working with the Hadoop vendors, but it needs real customer pressure on the Hadoop vendors to fundamentally change the game. It’s what happened with Oracle on VMware, it’s what happened with Cloudera and Isilon HDFS, and it needs to happen for Hadoop on VMware as well.
It’s also worth noting that this isn’t just a Hadoop problem; we’re going to end up in the same situation with loads of the newly popular commercial open source variants. Think of things like MongoDB (NoSQL) or DataStax (Cassandra); it’s only a matter of time before large enterprise customers want to virtualise or use enterprise storage platforms with these technologies.
We at EMC aim to offer choice to customers, so they are free to mix and match our technologies with whatever systems they want to put on top of them. However, I think we as a mainstream vendor need to do more work partnering with and certifying our products against the ISV ecosystem. I’m not saying we do this for all ISVs; we’ll need to be selective, but I think we’ll reap the benefit of having done the work with Cloudera and Isilon. We should ensure that it doesn’t stop there: we should listen to our customers and aim to provide supported ISV and EMC stacks where needed.