There is something great about products and services developed by industry veterans to relieve their own pain points, to find a way around the very problems they face day in and day out, and in the process build something that is a valuable contribution to the industry as a whole. This is where necessity, ingenuity, scarcity and perspicacity hold hands to give birth to something with substantial impact on the work cycle of the entire industry ecosystem.
In May this year, when HybridCluster completed $1 million in fundraising and launched HybridCluster 2.0, I was asked to prepare an interview questionnaire for Luke Marsden, CEO of HybridCluster. I knew little about the product at the time, but somewhere during my research I decided that HybridCluster is not just a very interesting product; it is a success story.
Why? I’ll let the interview do the talking. But first, here is an interesting excerpt from the company blog, where Luke talks about the genesis of HybridCluster:
Running our own hosting company since 2001 exposed all the problems. We were continuously battling hardware, software and network issues. After a few too many late-night trips to the data centre, I thought to myself: there has to be a better way. Studying theoretical computer science at Oxford University helped me crystallize my vision for an ambitious new project — one which uses ZFS, local storage, graph theory, and a perfect combination of open source components to create a platform uniquely aligned to solving the problems faced by hosters and cloud service providers.
Q: Let’s begin with a brief introduction of yourself and a broad overview of HybridCluster.
A: Hi. 🙂 Thanks for inviting me to be interviewed! It’s really great to be on DailyHostNews.
My background is a combination of Computer Science (I was lucky enough to study at Oxford University, where I graduated with a first class degree in 2008) and a bunch of real world experience running a hosting company.
HybridCluster is really a radical new approach to solving some of the tricky problems every hosting company has while trying to manage their infrastructure: it’s an ambitious project to replace storage, hypervisor and control panel with something fundamentally better and more resilient.
In fact I have a bigger vision than that: I see HybridCluster as a new and better approach to cloud infrastructure – but one which is backwardly compatible with shared hosting. Finally, and most importantly – HybridCluster allows hosters to differentiate in the market, sell new services, drive up margins – whilst also reducing the stress and cost of operating a web hosting business. We help sysadmins sleep at night!
Q: Did the idea for a solution like HybridCluster stem from issues you faced first-hand during your decade-long experience in the web hosting industry?
A: Yes, absolutely. Without the real-world pain of having to rush off to the data center in the middle of the night, I wouldn’t have focused my efforts on solving the three real world problems we had:
The first problem is that hardware, software and networks fail resulting in website downtime. This is a pain that every hoster will know well. There’s nothing like the horrible surge of adrenaline you get when you hear the Pingdom or Nagios alert in the middle of the night – or just as you get to the pub on a Friday night – you just know it’s going to ruin the next several hours or your weekend. I found that I had become – like Pavlov’s dog – hard-wired to fear the sound of my phone going off. This was the primary motivation to invent a hosting platform which is automatically more resilient.
Other problems we faced in the hosting company included websites getting spikes in traffic – so we knew we needed to invent a hosting platform which could auto-scale an application up to dedicated capacity – and users making mistakes and getting hacked – so we knew we needed to invent something which exposes granular snapshots to the end user, so they can log in and roll back time themselves if they get hacked or accidentally delete a file.
Q: Can you please shed some light on the modus operandi of HybridCluster? How exactly does it help web hosts with automatic detection and recovery in the event of outages?
A: Sure. I decided early on that a few key design decisions were essential:
Firstly, any system which was going to stop me having to get up in the middle of the night would have to have no single point of failure. This is easy to say but actually quite hard to implement! You need some distributed system smarts in order to be able to make a platform where the servers can make decisions as a co-operative group.
Secondly, I decided that storage belongs near the application, not off on a SAN somewhere. Not only is the SAN itself a single point of failure, but it also adds a lot of cost to the system and can often slow things down.
Thirdly, I decided that full hardware virtualization is too heavy-weight for web application hosting. I could already see the industry going down the route of giving each customer their own VM, but this is hugely wasteful! It means you’re running many copies of the operating system on each server, and that limits you to how many customers you can put on each box. OS level virtualization is a much better idea, which I’ll talk about more later.
Basically, I designed the platform to suit my own needs: as a young hoster, I was scared of outages, I couldn’t afford a SAN, and I knew I couldn’t get the density I needed to make money with virtualization. 🙂
Q: How does the OS-level virtualisation you use differ from the hypervisor-based virtualisation found in other solutions?
A: OS-level virtualization (or “containers”) is simply a better way of hosting web applications. Containers are higher density: because each container shares system memory with all other containers, the memory on the system is more effectively “pooled”. They perform better: there’s no overhead of simulating the whole damn universe just to run an app. And they’re more scalable: each app can use the whole resource of a server, especially when combined with the unique capability that HybridCluster brings to the table: the ability to live-migrate containers between servers in the cluster and between data centers.
Live migration is useful because it allows things to get seamlessly moved around. This has several benefits: administrators can easily cycle servers out of production in order to perform maintenance on them simply by moving the applications off onto other servers, but also, perhaps more excitingly, it allows applications to get auto-scaled – the HybridCluster software can detect a spike in traffic, and rather than throttling the spike (like CloudLinux), it can burst that application to a full dedicated server by moving other busy things on that server onto quieter servers. This is also a unique feature.
Q: How does HybridCluster enable an end user to self-recover lost files and data from even less than a minute ago? This feature, if I’m not wrong, isn’t available with any other solution out there.
A: It’s quite simple really. Every time that website, database or email data changes, down to 30 second resolution or less, we take a new ZFS snapshot and also replicate the history to other nodes in the cluster. ZFS is a core enabling technology for HybridCluster, and we’ve built a smart partition-tolerant distributed filesystem on top of it! Each website, database or mailbox gets its own independently replicated and snapshotted filesystem.
Anyway, these replicas act both as a user-facing backup and a hot spare. It’s a simple idea, but this is actually a revolution in backup technology. Rather than having a backup separate from your RAID or other replication system (the problem with a replication system like RAID is that it will happily replicate a failure, and the problem with backups is that they take ages to restore), our “hybrid” approach to replicated snapshots kills two birds with one stone: it brings backup restore times down to seconds, and it lets users fetch files/emails/database records out of snapshots taken at very fine-grained intervals.
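The snapshot-and-replicate cycle described above can be sketched with plain ZFS commands. This is a simplified, hypothetical illustration (the pool, dataset, snapshot names and `node2` host are assumptions; HybridCluster automates all of this per website, database and mailbox):

```shell
# Take a timestamped snapshot of one site's dataset (the cluster does this
# roughly every 30 seconds, or less, whenever the data changes)
SNAP="tank/sites/example.com@$(date -u +%Y%m%dT%H%M%S)"
zfs snapshot "$SNAP"

# Incrementally replicate the new snapshot to a hot-spare node, sending only
# the blocks that changed since the previous snapshot
zfs send -i tank/sites/example.com@previous "$SNAP" | \
    ssh node2 zfs receive -F tank/replicas/example.com

# The user-facing "roll back time" feature is then just a rollback (or a
# read-only browse of the snapshot directory) to any point in the history
zfs rollback tank/sites/example.com@20130901T120000
```

Because `zfs send -i` only ships the changed blocks, keeping a near-continuous replicated history stays cheap even at 30-second granularity.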
Indeed, HybridCluster is the first hosting platform to expose this feature to the end user and we have seen a number of clients adopt our technology for this benefit alone!
Q: Is the low-cost storage system able to deliver the efficiency of high-end SANs? Also, what additional value does ZFS data replication bring into the picture?
A: I’m glad you mentioned ZFS again 🙂 One of the really nice things about being backed onto ZFS is that hosters using HybridCluster can choose how powerful they want to make their hosting boxes. Remember, with HybridCluster, the idea is that every server has local storage and uses that to keep the data close and fast. But because ZFS is the same awesome technology which powers big expensive SANs from Oracle, you can also chuck loads of disks in your hosting boxes and suddenly every one of your servers is as powerful as a SAN in terms of IOPS. In fact, one of our recent hires, a fantastic chap by the name of Andrew Holway, did some hardcore benchmarking of ZFS versus LVM and found that ZFS completely floored the Linux Volume Management system when you throw lots of spindles at it.
I won’t go into too much detail about how ZFS achieves awesome performance, but if you’re interested, try Googling “ARC”, “L2ARC” and “ZIL”. 🙂
The other killer feature in ZFS is that it checksums all the data that passes through it – this means the end to bit-rot. Combined with our live backup system across nodes, that makes for a radically more resilient data storage system than you’ll get with Ext4 on a bunch of web servers, or even a SAN solution.
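For the curious, the performance features Luke hints at (ARC, L2ARC, ZIL) and the checksumming both map onto a handful of standard ZFS administration commands. A rough sketch, with a hypothetical pool name `tank` and device names:

```shell
# Add a fast SSD as a dedicated intent log (ZIL) to accelerate
# synchronous writes
zpool add tank log /dev/ada2

# Add another SSD as an L2ARC read cache, extending the in-RAM ARC
zpool add tank cache /dev/ada3

# Walk every block in the pool, verify its checksum, and repair any
# bit-rot from redundant copies
zpool scrub tank
zpool status -v tank   # shows scrub progress and any checksum errors
```

A scheduled scrub plus per-block checksums is what makes silent corruption visible (and repairable) long before a user ever notices a damaged file.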
There’s lots more – call us on +44 (0)20 3384 6649 and ask for Andrew who would love to tell you more about how ZFS + HybridCluster makes for awesome storage.
Q: How does HybridCluster achieve fault-tolerant DNS?
A: Something I haven’t mentioned yet is that HybridCluster supports running a cluster across multiple data centers, so you can even have a whole data center fail and your sites can stay online!
So quite simply the cluster allocates nameservers across its data centers, so if you have DC A and B, with nodes A1, A2, B1, B2, the ns1 and ns2 records will be A1 and B1 respectively. That gives you resilience at the data center level (because DNS resolvers support failover between nameservers). Then, if a node fails, or even if a data center fails, the cluster has self-reorganising DNS as a built-in feature.
We publish records with a low TTL, and we publish multiple A records for each site: our AwesomeProxy layer turns HybridCluster into a true distributed system – you can send any request for anything (website, database, mailbox, or even an FTP or SSH session) to any node and it’ll get reverse-proxied correctly to the right backend node (which might dynamically change, e.g. during a failover or an auto-scaling event). So basically, under all failure modes (server, network, data center) we maximize the chances that the user will quickly – if not immediately – get a valid A record which points to a server capable of satisfying that request.
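As a rough illustration, the records published for a site hosted across the two data centers in Luke’s example might look like this BIND-style zone fragment (the names, IP addresses and the 60-second TTL are all illustrative assumptions, not HybridCluster’s actual values):

```
$TTL 60
example.com.    IN  NS  ns1.cluster.example.   ; node A1, data center A
example.com.    IN  NS  ns2.cluster.example.   ; node B1, data center B
example.com.    IN  A   192.0.2.10             ; any-node entry point in DC A
example.com.    IN  A   198.51.100.10          ; any-node entry point in DC B
```

The low TTL lets the cluster republish records quickly after a failure, the split NS records survive a whole-data-center outage, and because any node can reverse-proxy any request, every published A record is a valid entry point.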
In other words HybridCluster makes the servers look after themselves so that you can get a good night’s sleep.
Q: How do you see the future of data center industry?
A: That’s an interesting question 🙂 I’ll answer it for web applications (and databases + email), specifically.
Personally I see cloud infrastructure as a broken promise. Ask the man (or woman) on the street what they think cloud means, and they’ll probably tell you about increased reliability, better quality of service, etc. But really, all that cloud does today is provide less powerful, unreliable infrastructure on top of which software engineers are expected to build reliable software. That’s a big ask!
My vision is for a fundamentally more reliable way of provisioning web applications – one where the underlying platform takes responsibility for implementing resilience well, once, at the platform level. Developers are then free to deploy applications knowing that they’ll scale well under load, and get failed over to another server if the physical server fails, or even if the whole data center goes pop.
I think that’s the promise of PaaS, and my vision is for a world where deploying web applications gets these benefits by default, without millions of sysadmins in hosting companies all over the world having to get paged in the middle of the night to go fix stuff manually. Computers can be smarter than that, it’s just up to us to teach them how. 🙂
Q: Tell our readers a bit about the team at HybridCluster?
A: Since we got funded in December 2012 we’ve been lucky enough to be able to grow the team to 9 people, and I’m really proud of the team we’ve pulled together.
We’re a typical software company, and so unfortunately our Dave to female ratio is 2:0. That is, we have two Daves and no females (but we’re working on that!). Anyway, some highlights in the team are Jean-Paul Calderone, who’s the smartest person I’ve ever met, and the founder of the Twisted project. Twisted is an awesome networking framework and without Twisted – and JP’s brain – we wouldn’t be where we are today. Also on the technical side, we’ve got Rob Haswell, our CTO, who’s a legend, and doing a great job of managing the development of the project as we make it even more awesome. We’ve also just hired one of JP’s side-kicks on Twisted, Itamar Turner-Trauring, who once built a distributed airline reservation system and sold it to Google.
We’ve also got Andriy Gapon, FreeBSD kernel hacker extraordinaire, without whom we wouldn’t have a stable ZFS/VFS layer to play with. Dave Gebler is our Control Panel guru and we’re getting him working on our new REST API soon, so he’ll become a Twisted guru soon 😉 And our latest hire on support, Marcus Stewart Hughes, is a younger version of me – a hosting geek – he bought his first WHMCS license when he was 15, so I knew we had to hire him.
On the bizdev side, we’ve got Dave Benton, a legend in his own right, who comes from an enterprise sales background with IBM, Accenture and Star Internet; he’s extremely disciplined and great at bringing process into our young company. Andrew Holway is our technical pre-sales guy, who previously built thousand-node clusters for the University of Frankfurt, and he loves chatting about ZFS and distributed systems. He’s also great at accents and can do some pretty awesome card tricks.
Q: To wrap up, with proper funding in place for development of the products, what’s in the bag for Q3 and Q4 of 2013?
A: We’re working on a few cool features for the 2.5 release later this year: we’re going to have first class Ruby on Rails and Python/Django support, mod_security to keep application exploits out of the containers, Memcache and Varnish support. We’re also working on properly supporting IP-based failover so we don’t have to rely on DNS, and there are some massive improvements to our Control Panel on their way.
It’s an exciting time to be in hosting 😉 and an even more exciting time to be a HybridCluster customer!
Thanks for the interview and the great questions.