Stefan Radtke's Blog: Isilon as a TSM Backup Target

19 June 2014

Isilon as a TSM Backup Target – Analysis of a Real Deployment

In my recent blog post “Using Isilon as a Backup Target” I explained why Isilon is a perfect target for TSM backups (the same surely applies to other backup and archive solutions as well, but we want to show a real-world example here, and this one used TSM). Besides the simple administration and the fact that you can get rid of much of the SAN complexity, one of the most appealing advantages is that your whole backup process becomes much faster.
Why? Because Isilon allows enormous throughput. With just a five-node NL400 cluster (5x 144 TB raw) we typically measure a maximum single-stream throughput of ~600 MB/s for reads and something above 400 MB/s for writes in an ideal environment. With a multi-threaded IO profile we get more than 2500 MB/s read and 1400 MB/s write throughput. All of this was measured using iozone with a TSM-like workload profile: 100% sequential IO at 256 kB block size over NFSv3 and a 10 Gigabit Ethernet link with an MTU size of 1500. Even though you may not reach these maximum throughput values in a TSM environment with thousands of threads (too many threads per node cause processing overhead), we see all TSM processes benefit from the new level of performance that Isilon delivers in real deployments:

Management of active data pools can be eliminated, as we don’t need them anymore. If Isilon is used as a primary pool, everything is accessible quickly and easily.
Backups and restores will accelerate. You can configure your TSM clients to run many more threads than before.
Migrations will run much faster.
Reclamations will run much faster.
Extension of your capacity gets much easier. You just add one or more nodes to Isilon and that's it.
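To make the benchmark profile mentioned above more concrete, here is a minimal sketch of a single-stream sequential write and read test with 256 kB records. It is not iozone and not the tool used for the numbers in this post; the function name, the file name and the 16 MiB test size are assumptions chosen only for illustration.

```python
import os
import tempfile
import time

BLOCK_SIZE = 256 * 1024  # 256 kB records, matching the block size quoted above


def sequential_write_read(path, total_bytes, block_size=BLOCK_SIZE):
    """Write, then read back, total_bytes sequentially; return (write, read) in MB/s."""
    block = b"\0" * block_size
    n_blocks = total_bytes // block_size

    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        for _ in range(n_blocks):
            f.write(block)
        os.fsync(f.fileno())  # flush to storage so the write timing is honest
    write_secs = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    read_secs = max(time.perf_counter() - start, 1e-9)

    megabytes = n_blocks * block_size / 1e6
    return megabytes / write_secs, megabytes / read_secs


if __name__ == "__main__":
    # Hypothetical target: a temp directory; point this at an NFS mount to test a share.
    with tempfile.TemporaryDirectory() as target_dir:
        path = os.path.join(target_dir, "seqio.bin")
        write_mbps, read_mbps = sequential_write_read(path, 16 * 1024 * 1024)
        print(f"write: {write_mbps:.0f} MB/s, read: {read_mbps:.0f} MB/s")
```

Keep in mind that with a small test file the read pass is most likely served from the client's page cache, so a realistic test needs a file much larger than the client's memory and, for the multi-threaded case, several such streams in parallel.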

The real world before and after the Isilon deployment

Now let’s look at how this works by exploring the results before and after a customer deployment. Figure 1 illustrates the existing setup:

4x TSM Server instances running on Windows 2012
2x 18 TB FC NetApp block storage used as disk pool
2x TS3500 Tape Libraries with 8x LTO4 drives each

Figure 1: Existing customer setup before Isilon was deployed

Figure 2 illustrates the setup after the Isilon implementation:

4x TSM Server instances running on Windows 2012
1x 3-node Isilon cluster with NL nodes, ~260 TiB usable capacity
1x TS3500 Tape Library with 8x LTO4 drives

Figure 2: Setup after the Isilon deployment

The results

Figure 3 shows the measured results of one of the four TSM servers from May 3rd 2014 until May 16th 2014.

Figure 3: TSM1 Server processes from 03.05.2014 – 16.05.2014.

Observations from the 3rd to the 12th (before the Isilon deployment):

Before the Isilon deployment, backup, archive and migration jobs ran at 100-150 MB/s throughput until the next day, sometimes until noon. There were very short peak rates of up to 400 MB/s on the 7th and 9th.
Archive jobs (light blue) ran between 8 and 16 hours.
Backups (dark violet) had approximately the same run times, but this is not easy to see because they are covered by the archive graphs.

Observations after the Isilon deployment:

Isilon went into production during the 12th. Throughput dropped on the 13th due to a misconfigured EtherChannel.
The EtherChannel issue was corrected on the 14th and throughput increased to ~400 MB/s. You can see clearly that the archive throughput (light blue) as well as the backup and restore throughput (violet and dark blue, but covered by the archive graphs) already increased, and as a result the jobs finished several hours earlier.
Between the 14th and the 16th, the team started to modify the TSM clients to run with more threads. This was not possible before due to the tape limitations, but Isilon can ‘eat’ much more in parallel. As a result, you can see that the throughput increased to 800 MB/s on the 16th.
Due to the gigantic throughput, backup (dark red, mainly fileserver backup) and archive (databases) runs have been shortened dramatically. Compare for example the archive (light blue) of the 3rd with the one of the 16th (see figure 4, which zooms in on some interesting parts of figure 3). The throughput increased from ~150 MB/s to ~750 MB/s and the runtime went down from ~16 hours to ~2.5 hours. Be aware that the area under the graphs would be the same in both cases, assuming that the amount of data has not changed.
Migration (dark violet) has been reduced dramatically since we now store much more on Isilon and don’t need to migrate as much data to tape.
Reclamation (pink, see figure 4) has been eliminated totally (well, some days after the 16th we did see some reclamation activity reoccurring on Isilon as well, but this is much faster because of the Isilon throughput and because the volumes are much smaller: 128 GB instead of 1.5 TB for a single tape).
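The "area under the graph" remark can be checked with a few lines of arithmetic: throughput multiplied by runtime gives the amount of data moved. The sketch below uses only the approximate values quoted above for the archive runs of the 3rd and the 16th; the function name is just an illustration.

```python
def data_moved_tb(throughput_mb_s, runtime_hours):
    """Area under a flat throughput curve: MB/s * seconds, converted to TB."""
    return throughput_mb_s * runtime_hours * 3600 / 1e6


before_tb = data_moved_tb(150, 16)   # archive run on the 3rd (approximate values)
after_tb = data_moved_tb(750, 2.5)   # archive run on the 16th (approximate values)

throughput_gain = 750 / 150          # factor by which throughput went up
runtime_gain = 16 / 2.5              # factor by which runtime went down

print(f"before: {before_tb:.2f} TB, after: {after_tb:.2f} TB")
print(f"throughput x{throughput_gain:.1f}, runtime /{runtime_gain:.1f}")
```

With these rounded figures the two areas come out at roughly 8.6 TB and 6.8 TB. They are in the same range but not identical, which is expected since the values are read off the graph and the amount of data is not exactly the same every day.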

Figure 4: TSM1 Server processes from 03.05-05.05.2014 and 16.05.2014 (zoomed + stretched from Fig.3).

Be aware that we only discussed the results of one out of four TSM instances here. The other three showed similar results, and since the 16th the customer has been running with up to 1.4 GB/s against the three-node Isilon cluster.

Conclusion

This real world example shows the large improvement for backup, archive, reclamation and migration processes when using Isilon as a TSM target:

The average backup and archive throughput has increased by a factor of ~5.
As a result, the runtimes have been reduced by the same factor (12 hours to 2.5 hours).
Migrations and reclamations have been massively reduced.
Complexity has been reduced since all TSM Servers share a single file system, which is very easy to maintain and extend. If you add a new Isilon node to the cluster, the capacity is available immediately for all TSM Servers with no configuration changes. Adding an Isilon node to the cluster takes about a minute (assuming the node has already been installed in the rack and cabled).
No more SAN components between TSM Servers and Isilon (so no more volumes, LUN-masking, SAN-Zoning, device class definition changes,…).
The customer was able to reduce the number of tape libraries from two to one and the number of LTO drives from 16 to 8.
Restores will be much faster in general.

Acknowledgement

Many thanks to Lars Henningsen and General Storage who again did an awesome job for the customer. They did the TSM Analysis and provided the data and insight while the Concat team deployed the Isilon System.

Mini FAQ

During the discussion and review of this text I got a few questions which I would like to post here:

Q: Where is the TSM DB hosted in this setup? I heard storing it on Isilon is not a good idea due to latency.
A: That's correct. The databases were left untouched. Putting them on an NFS store is not the best idea. Especially if you want to enable TSM deduplication, you need to consider putting the database on flash disks, since the database will typically grow dramatically in size: instead of having an entry for every file, you have an entry for every block (if I remember correctly, for every 8 or 32 kB block).
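To get a feeling for why the database grows so much with deduplication, here is a rough back-of-the-envelope sketch. The 100 TB pool and the 1 MB average file size are purely hypothetical, and the 32 kB block size is just the figure recalled above, not a confirmed TSM parameter.

```python
def index_entries(total_bytes, avg_file_bytes, block_bytes):
    """Return (entries with one per file, entries with one per block)."""
    files = total_bytes // avg_file_bytes
    blocks_per_file = -(-avg_file_bytes // block_bytes)  # ceiling division
    return files, files * blocks_per_file


# Hypothetical pool: 100 TB of data in files of 1 MB on average, 32 kB blocks.
per_file, per_block = index_entries(100 * 10**12, 10**6, 32 * 1024)
print(f"entries per file:  {per_file:,}")
print(f"entries per block: {per_block:,}")
print(f"growth factor:     {per_block // per_file}x")
```

Under these assumptions the number of entries grows by about a factor of 30, and with smaller blocks or larger files the factor is even higher.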

Q: How can a mirrored DR solution be achieved? I heard Isilon provides asynchronous mirroring?
A: Yes, asynchronous mirroring is the typical way of syncing data between two scale-out clusters. Be aware that we sometimes have hundreds of nodes and a potential throughput of dozens of gigabytes per second. Copying this amount of data synchronously would introduce massive, unacceptable latency. So for a TSM DR concept you would not use Isilon's SyncIQ. What you could do instead is use a second Isilon cluster for a TSM copy pool. But be aware that TSM cannot migrate a copy pool (for example to tape).

Q: Could SyncIQ be used for TSM node replication?
A: Absolutely yes, and it offloads the data copy process from the TSM server.

Discussion

If you have similar experiences (perhaps using different backup software such as EMC NetWorker or CommVault Simpana), questions or comments, feel free to enter them below and start the discussion.

Find me on LinkedIn: http://de.linkedin.com/in/drstefanradtke/