TeamQuest ITSO News

 

Build a Better Architecture for Backup and Recovery

Backup and recovery is a multi-dimensional computing problem, so a successful architecture must account for every factor involved. Testing and measurement are also vital to developing an architecture that fits a specific customer.

Factors Affecting Backup and Recovery
Corporations today are intent on squeezing every penny out of every computing asset. They are putting off equipment refreshes and trying to spend less on new infrastructure, which means that aging gear stays in place longer. All of this affects backup and recovery, as does the fact that backup/recovery windows are shrinking steadily in a global economy.

As a result, backup problems have become more prevalent. Unanticipated bottlenecks are appearing that are difficult to diagnose. Backups are failing to complete within the available windows. And the volume of data to be backed up is spiraling.

In many cases, companies are engaging in reactive purchasing of backup and related technology to solve what in reality are non-issues. Failing to spot the real source of the problem, they end up spending a lot of money fixing the wrong things or making technical changes that only solve part of the problem.

Backup/recovery issues can be difficult to diagnose because many hardware and software elements interact in complex ways. A typical environment includes a master policy server/scheduler, a media management server, direct attached storage (DAS), Storage Area Network (SAN) attached storage, a variety of high and lower speed network components, differing network topologies and configurations, and both LAN and WAN clients. All this makes backup a far from simple proposition.

Case Study - Wells Fargo
Wells Fargo brought in Sun Professional Services (SunPS) several years ago to help build a monolithic backup site to service the whole organization. The company purchased a single Sun E6900 server, which it split into two domains. Each domain had:

  • 12 UltraSPARC IV 1.3 GHz processors (active)
  • 12 UltraSPARC IV 1.3 GHz processors (on-demand)
  • 24 GB RAM
  • 4 x 4 Gb HBAs for connectivity over a Hitachi-based SAN
  • 10 x 4 Gb HBAs for connectivity to Sun StorageTek T10K tape drives
  • 4 x 4-port GigaSwift QuadFast Ethernet cards (16 ports)

Each domain was to act as a separate media server supporting more than 5,000 clients for nightly backup. The sixteen ports on the four 4-port NICs assigned to each domain were configured into a single trunk of sixteen members.

This configuration was expected to drive data to tape at a rate of 1,200 MB per second, reduce the backup window below 8 hours, and reduce the monthly cost per megabyte of backups. The architecture could achieve all of these goals, but not for every type of client.

On the plus side, backup of SAN-based data reached 1,400 MB per second. In addition, the T10K tape drives backed up SAN data at a maximum speed of 100 MB per second each, thereby reducing the cost per MB. However, network-based client backup throughput only reached 350 MB per second. While the backups still completed in less than 8 hours, this throughput was below the SLA threshold. The shortfall was traced to a large collection of Windows, Linux and UNIX clients with slow NICs. With the existing architecture, nothing could be done to speed these clients up, as their internal networks wouldn't allow it.
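To put the throughput figures above in perspective, the following minimal Python sketch converts a sustained rate into the volume of data that fits in a backup window, and the reverse. The function names are illustrative and not taken from any tool mentioned in this article, and decimal units (1 TB = 1,000,000 MB) are an assumption.

```python
SECONDS_PER_HOUR = 3600
MB_PER_TB = 1_000_000  # assumption: decimal units


def data_moved_tb(rate_mb_per_s, window_hours):
    """Terabytes moved at a sustained rate over a window of the given length."""
    return rate_mb_per_s * window_hours * SECONDS_PER_HOUR / MB_PER_TB


def hours_needed(data_tb, rate_mb_per_s):
    """Hours required to move a given volume at a sustained rate."""
    return data_tb * MB_PER_TB / rate_mb_per_s / SECONDS_PER_HOUR


# Rates quoted in the case study, applied to an 8-hour window.
for label, rate in [("target", 1200),
                    ("SAN backup, observed", 1400),
                    ("network clients, observed", 350)]:
    print(f"{label}: {data_moved_tb(rate, 8):.2f} TB in 8 hours")
```

At 350 MB per second an 8-hour window moves roughly 10 TB, against roughly 35 TB at the 1,200 MB per second target. Moving the target volume at the network-client rate would take more than 27 hours.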

As a result, SunPS came in again to quantify the effect of various re-configuration options on the performance of the overall configuration. The first thing they did was to install TeamQuest Performance Monitoring Software on each domain supporting backup software and media services. One domain was configured to support a single 8-way TCP/IP link aggregation, and the second domain was configured to support two 4-way TCP/IP link aggregations.

SunPS engineers identified one test client to represent each O/S type to be backed up, generated TCP/IP load from each test client to each of the domains, and measured and documented the results on each domain. This led to the reconfiguring of one domain to support a single 16-way TCP/IP link aggregation, and the second domain to support two 8-way TCP/IP link aggregations. Further retesting was conducted to ensure the optimal link aggregation was isolated for both throughput and performance.

Further work using TeamQuest determined that Sun trunking problems existed in addition to the network bandwidth constraints already found at the client level.

Resolving these problems began with TeamQuest Model's Stretch Factor diagrams. Stretch Factor is the ratio of service time plus queuing time to service time, that is, (service + queuing) / service. In other words, it compares the total time a request spends at a resource, waiting included, with the time spent doing actual work. It is available for each workload/process accessing an active or passive resource. An ideal score is 1, meaning no time is spent waiting, whereas a score greater than 1.8 indicates a constraint. TeamQuest Model can show Stretch Factor based on growth projections, for example, graphed on a monthly basis to highlight how long a particular system or application can last before becoming constrained. This was used to see how proposed changes would affect the infrastructure given expected growth rates.
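The Stretch Factor definition above can be sketched in a few lines of Python. This is only an illustration of the ratio as described in this article, not TeamQuest Model's implementation. The function names and sample timings are hypothetical, and the 1.8 threshold is the figure quoted above.

```python
CONSTRAINT_THRESHOLD = 1.8  # threshold quoted in the article


def stretch_factor(service_time, queue_time):
    """Return (service + queuing) / service for one workload at one resource."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    if queue_time < 0:
        raise ValueError("queue_time must not be negative")
    return (service_time + queue_time) / service_time


def is_constrained(factor, threshold=CONSTRAINT_THRESHOLD):
    """True when the Stretch Factor exceeds the constraint threshold."""
    return factor > threshold


# Hypothetical timings in milliseconds, for illustration only.
for service, queue in [(10.0, 0.0), (4.0, 1.0), (10.0, 10.0)]:
    sf = stretch_factor(service, queue)
    print(f"service={service} queue={queue} stretch={sf:.2f} "
          f"constrained={is_constrained(sf)}")
```

A workload that never waits scores exactly 1, and one that waits as long as it works scores 2, which is past the 1.8 threshold.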

SunPS also constructed models based on the current workload and then added 30 percent, 60 percent and 90 percent to see how well the backup infrastructure coped. This immediately pointed out a problem with CPU utilization when the workload increased by 90 percent. The modeling also revealed network constraints.
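A what-if exercise of this kind can be approximated with a very simple queueing model. The sketch below assumes a single-server M/M/1 queue, where the ratio of response time to service time is 1 / (1 - utilization). This is a textbook stand-in and not the algorithm TeamQuest Model uses. The baseline utilization of 25 percent is a hypothetical value chosen for illustration, not a measurement from Wells Fargo.

```python
def mm1_stretch_factor(utilization):
    """Response time / service time for an M/M/1 queue (an assumed model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return 1.0 / (1.0 - utilization)


def project_growth(baseline_utilization,
                   growth_steps=(0.0, 0.3, 0.6, 0.9),
                   threshold=1.8):
    """Return (growth, utilization, stretch factor, constrained) per step."""
    rows = []
    for growth in growth_steps:
        utilization = baseline_utilization * (1 + growth)
        # A saturated resource has an unbounded stretch factor.
        if utilization < 1:
            factor = mm1_stretch_factor(utilization)
        else:
            factor = float("inf")
        rows.append((growth, utilization, factor, factor > threshold))
    return rows


# Hypothetical baseline: the resource is 25 percent busy today.
for growth, utilization, factor, constrained in project_growth(0.25):
    print(f"+{growth:.0%} workload: utilization={utilization:.3f} "
          f"stretch={factor:.2f} constrained={constrained}")
```

Because the ratio grows non-linearly with utilization, a resource that looks comfortable today can cross the threshold after a seemingly modest increase in workload, which is why modeling several growth steps is useful.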

Based upon TeamQuest results, SunPS recommended a reconfiguration of NICs. Previously, NICs had been gathered into a single trunk and this had constricted bandwidth. The new configuration cut down on trunking and eliminated bandwidth restrictions.

Additional recommendations derived from TeamQuest data were to group backup clients based on client performance and only multiplex to multiple tapes for like clients. In particular, slower Windows clients were segregated so other clients could be backed up much faster.
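The grouping idea can be illustrated with a short Python sketch that sorts clients into tiers by measured throughput so that only like clients share a multiplexed stream. The client names, throughput figures and tier boundaries are all hypothetical. The article does not state the thresholds that were used.

```python
from bisect import bisect_right

TIER_NAMES = ("slow", "medium", "fast")
TIER_BOUNDS = (10.0, 50.0)  # hypothetical MB/s boundaries between tiers


def tier_for(rate_mb_per_s, bounds=TIER_BOUNDS):
    """Map a measured client throughput to a tier name."""
    return TIER_NAMES[bisect_right(bounds, rate_mb_per_s)]


def group_clients(measured):
    """Group client names by tier; measured maps client name to MB/s."""
    groups = {name: [] for name in TIER_NAMES}
    for client in sorted(measured):
        groups[tier_for(measured[client])].append(client)
    return groups


# Hypothetical measurements, for illustration only.
measured = {"win-01": 4.0, "win-02": 8.5, "linux-01": 35.0, "sol-01": 90.0}
for tier, clients in group_clients(measured).items():
    print(f"{tier}: {clients}")
```

Each tier can then be given its own policy or schedule, so a handful of slow clients no longer sets the pace for the faster ones sharing the same tape.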

Enabling jumbo frames on the TCP/IP network yielded a gain of 20 percent. Employing disk staging instead of backing up directly to tape also accelerated backups (i.e., they now ran at disk speed rather than tape speed).

SunPS also recommended a reduction in the ratio of media to master servers, and a reduction in the number of policies and schedules. TeamQuest showed that the optimum arrangement of media servers to master servers was a 5:1 ratio. For every additional media server, performance slumped by 20 percent.

At Wells Fargo, the ratio of media servers to master servers was about 15:1.

Proposed Alternative Architecture
SunPS proposed an alternative architecture, which TeamQuest modeling results validated. It added Sun T2000 media servers, a Sun StorageTek SL8500 tape library and a Sun StorageTek 9310 tape library. Each T2000 can run 2 T10K tape drives.

Using Stretch Factor, TeamQuest showed that the T2000 performed much better than the earlier E6900 configuration. While the older box had a Stretch Factor beyond 2 under a high client workload, the T2000's Stretch Factor remained around 1 even as more clients were added.

The networking constraints vanished using this new network design. TeamQuest demonstrated via modeling that the T2000, in a new architecture, would eliminate networking issues. This was proved out in the real world.
