EMC RecoverPoint and Axxana – Async replication with Zero Data Loss
I come into contact with a lot of IT products in my day job. Some are introduced to me by customers, some by colleagues and some by EMC partners. Monday was no different: I got chatting to an EMC partner who was sitting opposite me in the office, and naturally the subject turned to the product his company makes. The company in question is Axxana and the product is called Phoenix System RP™. It is designed to deliver zero data loss, but in a very different way to the traditional Recovery Point Objective (RPO) = zero infrastructure you’d expect.
Zero Data Loss = Synchronous Replication
Traditionally zero data loss is delivered using synchronous replication technology and, due to the costs involved, it tends to be reserved for the most mission critical systems. With synchronous replication, when an application writes data to the storage, that data has to be written to both storage locations before the application receives a write acknowledgement (see below). As you can imagine, when doing this between two physical sites application latency becomes a key consideration, and as such these setups are usually backed by expensive low latency inter-site fibre connections. Not cheap!
It’s also worth noting that this latency consideration usually restricts the distance between the production and secondary site. This can often still leave you exposed to possible outages, e.g. natural disasters that could impact both sites. To mitigate this, companies often replicate to a third site asynchronously, which means more leased lines, more storage, more management overhead and generally more expense!
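To put rough numbers on the latency point, here is a back-of-the-envelope sketch. It assumes light in optical fibre covers about 200 km per millisecond and ignores switch, array and protocol overheads, so treat the results as a lower bound rather than a measurement. The function name and the `round_trips` parameter are my own, purely for illustration.

```python
# Rough estimate of the latency that synchronous replication adds per write.
# Assumption: light in optical fibre covers about 200 km per millisecond
# (roughly two thirds of its speed in a vacuum). Equipment and protocol
# overheads are ignored, so real figures will be higher.

FIBRE_KM_PER_MS = 200.0


def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Minimum extra latency per write caused by the inter-site round trip."""
    one_way_ms = distance_km / FIBRE_KM_PER_MS
    return 2 * one_way_ms * round_trips


if __name__ == "__main__":
    for km in (10, 100, 1000):
        penalty = sync_write_penalty_ms(km)
        print(f"{km:>5} km: at least {penalty:.2f} ms added to every write")
```

Every acknowledged write pays this penalty, which is why the distance between synchronous sites tends to stay small.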
That extra expense however can often be justified! In my previous role in the finance sector, trading systems or back end pricing warehouses were usually set up in this manner due to the potential cost of service loss or data corruption. Data consistency and RPO were always the key requirements when recovering from an outage, with RTO being the obvious runner up. When talking to application owners the message I was often given was “as long as the data is correct I don’t care how long it takes you to get it back”. Obviously they did care about RTO, but recovering a system in 1 hour only to find that the data is inconsistent was not an acceptable outcome post outage.
Zero Data Loss = Asynchronous Replication
So how is Axxana different? What is it they do that allows for zero data loss while using asynchronous replication? Well first of all it’s important to point out that this product integrates with EMC’s RecoverPoint replication product to provide the asynchronous replication. RecoverPoint is a product that works by splitting the write I/O for any protected LUNs, journaling it, then compressing and de-duping it before replicating it with write order fidelity to the target storage location.
Axxana adds another level to this process which you can see in the diagram below. The first step is to mark a RecoverPoint consistency group as Axxana protected. Once this has been selected, the writes that are usually just synchronously written to the local RecoverPoint appliance (RPA) are also written synchronously to the Black Box via the Axxana collector servers. The collector takes the block stream, adds some consistency checking metadata and then encrypts and writes the data out to the Black Box for safe keeping. An acknowledgement is only sent back to the application once the write has been committed to both the RPA and the Black Box, in order to guarantee zero data loss. At the same time the RecoverPoint appliance is replicating the data asynchronously across to the DR site as normal. The key point here is that the asynchronously replicated data at the second site and the data held within the Axxana Black Box on the primary site can be merged to create the equivalent of a synchronous data set at DR.
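The write path described above can be sketched as a toy model. This is not Axxana or RecoverPoint code. The class and method names are hypothetical, encryption and metadata are left out, and it assumes blocks leave the Black Box once they have reached DR so that it only ever holds the lag.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedGroup:
    """Toy model of one protected consistency group (all names are made up)."""

    rpa_journal: list = field(default_factory=list)  # local RPA copy
    black_box: list = field(default_factory=list)    # writes not yet at DR
    dr_site: list = field(default_factory=list)      # asynchronous copy

    def write(self, block: bytes) -> str:
        # Both local commits complete before the application gets its ack.
        self.rpa_journal.append(block)
        self.black_box.append(block)
        return "ack"

    def replicate(self, count: int) -> None:
        """Asynchronously ship up to `count` lagging blocks to DR, in order."""
        for _ in range(min(count, len(self.black_box))):
            self.dr_site.append(self.black_box.pop(0))

    def recover_at_dr(self) -> list:
        """After a disaster: the DR copy plus the Black Box delta."""
        return self.dr_site + self.black_box


if __name__ == "__main__":
    group = ProtectedGroup()
    for i in range(5):
        group.write(f"block-{i}".encode())
    group.replicate(3)  # the WAN only kept up with three of the five writes
    # Merging the DR copy with the Black Box delta rebuilds every write.
    print(group.recover_at_dr() == group.rpa_journal)
```

The point of the model is the last line: the DR copy on its own is two writes behind, but combined with the Black Box contents nothing acknowledged is missing.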
The Phoenix Black Box can contain an SSD of up to 300GB, so it is capable of storing a lot of RecoverPoint data. This capacity makes it a good fit for protecting against WAN failure scenarios as well as data centre / application / storage failure scenarios. While the WAN link is down the RecoverPoint data is still being synchronously written into the Black Box, thus maintaining your zero data loss DR protection.
The disk capacity raised an interesting question for me: how does the Axxana solution know when to expire data from the disk inside the Black Box? I dug a little deeper and spoke to someone at Axxana and they told me the following.
In the initial configuration of an Axxana protected CG, Axxana gets an initial lag size from RP and configures an initial buffer of the same size (+ 10%) on the Black Box SSD. The blocks received from the RP are written cyclically to this buffer. This way we maintain only the last blocks (the delta) in any given moment. The Black Box buffer size is adjusted dynamically according to the changes of the RP lag.
So it decides the space allocation based on the RecoverPoint lag, i.e. the amount of data waiting to be replicated to the secondary site. Dynamically expanding that space allocation allows it to deal effectively with replication lag spikes or WAN link loss, pretty impressive stuff.
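Here is a minimal sketch of that sizing and cyclic-write behaviour, counting in blocks rather than bytes. The names `buffer_size` and `DeltaBuffer` are hypothetical, and the resize logic is my assumption about how such a buffer could behave, not a description of Axxana's implementation.

```python
from collections import deque


def buffer_size(lag_blocks: int, headroom_pct: int = 10) -> int:
    """Lag plus 10% headroom, rounded up with integer maths (minimum 1)."""
    return max(1, lag_blocks + (lag_blocks * headroom_pct + 99) // 100)


class DeltaBuffer:
    """Keeps only the newest blocks, like a cyclically written buffer."""

    def __init__(self, lag_blocks: int):
        self.blocks = deque(maxlen=buffer_size(lag_blocks))

    def write(self, block) -> None:
        # When the deque is full, appending drops the oldest block.
        self.blocks.append(block)

    def resize(self, lag_blocks: int) -> None:
        # deque(iterable, maxlen=n) keeps the last n items, so growing
        # preserves everything and shrinking keeps the newest blocks.
        self.blocks = deque(self.blocks, maxlen=buffer_size(lag_blocks))


if __name__ == "__main__":
    buf = DeltaBuffer(lag_blocks=5)   # capacity is 6 blocks
    for seq in range(10):
        buf.write(seq)
    print(list(buf.blocks))           # only the newest six remain
    buf.resize(lag_blocks=10)         # lag spike: capacity grows to 11
    for seq in range(10, 15):
        buf.write(seq)
    print(list(buf.blocks))
```

In this sketch a shrink simply discards the oldest blocks. A real system would only want to do that for blocks already confirmed at the secondary site.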
So that’s how it functions at a high level for the protection, the next question is what happens if my production site is hit by a disaster?
Axxana Black Box Construction
So I’m guessing that you’re now thinking: how on earth am I going to recover the data if it’s stored on a piece of infrastructure in the data centre that has just been burnt down / hit by a plane / flooded / insert disaster here? Surely putting it in the primary data centre goes against all data protection logic! Technically you would be right, but you need to understand how this thing is constructed to see why that isn’t going to be a problem.
Axxana describe the Phoenix system as an Enterprise Data Recorder (EDR) based on technology from the aviation industry, i.e. plane black box flight recorders. It’s built as a hardened disaster proof storage device to ensure that the synchronous data held within it remains intact no matter what disaster befalls your data centre.
It’s constructed of 3 main layers. The protection each layer provides is listed here, and pictures of the layers can be seen below.
Electronic Box – water protection
Cylinder – shock protection
Fire protection box – Well the name says it all really!
So while the rest of the data centre is a smouldering wreck you can quite happily set about retrieving your data for recovery at the DR site.
So how do you get the data back in the event of a disaster? Well, that’s where things get interesting, and as someone who loves technology I think this next part is pretty cool.
First of all you need to physically locate the system, which you do by tracking the homing signal installed within the Black Box. Once you have found it you can connect a laptop with an Axxana software component installed and extract the data. Now a physical connection is all well and good, but what if the police or fire brigade won’t let you anywhere near the site, let alone dig through the rubble looking for your Black Box? Well, that is where the 3G – 3.5G phone transmitter comes in handy, allowing you to transfer the data from the Black Box using mobile phone technology.
The Black Box obtains an IP address from the nearest mobile phone base station and uses it to communicate over the internet with the Axxana Recovery servers. The Recoverers can be connected to the internet by either a wired or a wireless connection. Every interaction between the Black Box and the Recoverer is mutually authenticated using 1024-bit RSA. All data that is sent over is encrypted using 128-bit AES, with a dynamic key exchange mechanism that automatically changes the key for every 32MB block.
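To get a feel for how often the key changes during a transfer, here is a small sketch of the bookkeeping only. It performs no encryption. It assumes 32MB means 32 × 1024 × 1024 bytes, and it uses locally generated random keys as a stand-in for the dynamic key exchange, which I don’t have details of. The function name is hypothetical.

```python
import secrets

SEGMENT_BYTES = 32 * 1024 * 1024  # assumption: 32MB read as 32 MiB
KEY_BYTES = 16                    # 128-bit key


def key_schedule(total_bytes: int):
    """Yield (start, end, key) per segment; `end` is exclusive.

    The random key is a placeholder for whatever the real key exchange
    negotiates. No data is encrypted here.
    """
    for start in range(0, total_bytes, SEGMENT_BYTES):
        end = min(start + SEGMENT_BYTES, total_bytes)
        yield start, end, secrets.token_bytes(KEY_BYTES)


if __name__ == "__main__":
    transfer = 100 * 1024 * 1024  # a 100 MiB recovery transfer
    for start, end, key in key_schedule(transfer):
        size_mib = (end - start) // (1024 * 1024)
        print(f"bytes {start}-{end - 1}: {size_mib} MiB under a fresh {len(key) * 8}-bit key")
```

Under those assumptions a 100 MiB transfer would use four keys, with the final key covering only the last 4 MiB.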
It’s all very clever stuff. I have to admit I am impressed with both the concept and the end product itself, and I would love to speak to someone who has used it in anger.
So this is a product I saw a few years back when I was a customer, it was shown to me by EMC as part of a RecoverPoint sales pitch. I remember at the time thinking it was a pretty cool idea and a pretty full on cast iron way to guarantee the protection of critical data, however I couldn’t see a use case for it outside large enterprises. After talking about it for a while the other day I realised that back then I was potentially missing one of the key selling points.
You utilise Axxana so you don’t have to do expensive synchronous replication, so you don’t have to introduce unnecessary application latency, and so you don’t have to keep that second site within ~100 km. The reason this product is built to withstand every feasible disaster is so that you can safely use cheaper asynchronous replication over large distances and still guarantee the synchronous replication RPO that the business or application owner demands.
I swear one of those imaginary light bulbs went on above my head while I was discussing it!
If you want to know more about this product check out the Axxana website or please speak to your EMC account manager about the product. Alternatively you can drop me an email and I’ll find someone to talk to you about it.