Scaling SD-WAN Deployments
SteelConnect v2.11 or later
Designing and scaling your SD-WAN deployment is a complex project.
This topic provides design considerations and best practices to help ensure that your deployment matches your needs and that you can plan for changes in scale.
Riverbed SD-WAN software releases include incremental changes to address scale. Any numbers in this section are for comparison only.
Design Considerations for Scale
A large number of factors affect network performance. Site count is a popular index, but other factors can have just as much impact. When designing for scale, consider each of these parameters and their impact on your resources and overhead:
•Number of sites and zones.
•Rules - traffic path rules, allow/deny rules, and inbound (NAT) rules.
•Routes. Overlay routes require more resources than underlay routes.
•Number of peers and tunnels per gateway and site.
•Throughput limits.
•Routing, route addition and withdrawal, and route flapping.
•Routing convergence time, which should measure less than 30 seconds.
•Connections per second.
•Network path quality fluctuations.
•Users.
•Devices.
•Traffic flow patterns.
Understanding SD-WAN throughput
A key part of designing for scale is optimizing throughput while minimizing resource contention. To do so, you need to understand traffic on the WAN interfaces. SteelConnect throughput falls into these categories:
•Direct to Internet - Unencrypted data going to and from the internet. Sources are either local LAN or remote LANs using this site for internet breakout.
•Overlay (AutoVPN) - Encrypted data going over overlay tunnels to and from other SteelConnect sites. This overlay traffic can also use this site to transit from one overlay tunnel to another.
•Security (Zscaler, Palo Alto, ClassicVPN) - Encrypted data exchanged with third-party point-of-presence peers, such as cloud security enforcement nodes or on-premise security firewalls.
•Underlay - Unencrypted data sent to and from the underlay network (such as MPLS) without any tunneling overhead. Sources are either the local LAN or remote LANs using this site for “transit” to legacy non-SteelConnect sites.
Throughput depends on the type of tasks performed. The following operations can have a significant impact on throughput:
•Encryption
•Application Identification
•Rule match/lookup
•Route lookup
•QoS marking/shaping
•Packet Fragmentation
These factors also impact SD-WAN throughput:
•Packet size (the smaller the packet sizes, the lower the throughput)
•Flow count and lifetime
•Rule table size
•Route table size
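To see why packet size matters, consider a rough model in which the forwarding path is limited by a fixed number of packets per second, since most of the operations listed above are performed once per packet. The sketch below assumes a hypothetical budget of 100,000 packets per second. That figure is illustrative only and is not a SteelConnect specification.

```python
def throughput_mbps(packets_per_second: int, packet_size_bytes: int) -> float:
    """Throughput in Mbps for a path that forwards a fixed number of packets per second."""
    return packets_per_second * packet_size_bytes * 8 / 1_000_000


# Hypothetical per-packet processing budget (an assumption for illustration only).
PPS_BUDGET = 100_000

for size in (64, 512, 1400):
    print(f"{size:>5} bytes -> {throughput_mbps(PPS_BUDGET, size):8.1f} Mbps")
```

Under this model, the same appliance moves about 51 Mbps of 64-byte packets but about 1,120 Mbps of 1400-byte packets. Real appliances are also bounded by link speed and by per-byte work such as encryption, so treat the result as a trend rather than a prediction.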
Best practices for scale
Riverbed recommends the following best practices for deploying scalable SD-WAN solutions.
Use hub-and-spoke instead of full-mesh
When designing your network, consider whether you need every site to connect to every other site. Hub-and-spoke models scale better than full-mesh models because they require fewer tunnels, which means less probing, fewer computations, and lower resource requirements.
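The difference in tunnel count can be estimated with simple arithmetic. The sketch below assumes one tunnel per pair of connected sites and assumes that hubs are fully meshed with each other. Actual counts are higher when sites have multiple uplinks or gateways, so use it only to compare growth rates.

```python
def full_mesh_tunnels(sites: int) -> int:
    """Every pair of sites shares one tunnel: n * (n - 1) / 2."""
    return sites * (sites - 1) // 2


def hub_and_spoke_tunnels(sites: int, hubs: int = 1) -> int:
    """Each spoke builds one tunnel to each hub; hubs are meshed with each other."""
    if not 1 <= hubs <= sites:
        raise ValueError("hubs must be between 1 and the number of sites")
    spokes = sites - hubs
    return spokes * hubs + full_mesh_tunnels(hubs)


for n in (10, 50, 250):
    print(n, full_mesh_tunnels(n), hub_and_spoke_tunnels(n, hubs=2))
```

Full-mesh tunnel count grows with the square of the site count, while hub-and-spoke grows linearly. At 250 sites with two hubs, this model gives 31,125 tunnels for full mesh and 497 for hub-and-spoke.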
Work with your Riverbed sales team to understand and compare the number of sites, zones, and rules that you can have in a full-mesh deployment and how those numbers change for a hub-and-spoke deployment.
When configuring hub-and-spoke deployments, choose your hub locations wisely. Ensure there is enough available capacity and few transit delays.
For configurations where you need any-to-any connectivity, such as voice, consider whether the end-to-end latency and jitter on an overlay transit through a hub are acceptable. If not, configure the application to use the MPLS underlay network.
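One way to reason about the transit question is to add up the one-way latency of each leg and compare it to a budget. The sketch below is a minimal model: the function names are hypothetical, the default budget of 150 ms is an assumption based on the commonly cited one-way delay target for voice (ITU-T G.114), and jitter is not modeled. Substitute measurements from your own network.

```python
def transit_latency_ms(spoke_a_to_hub_ms: float,
                       hub_to_spoke_b_ms: float,
                       hub_processing_ms: float = 0.0) -> float:
    """One-way latency for traffic that transits a hub between two spokes."""
    return spoke_a_to_hub_ms + hub_processing_ms + hub_to_spoke_b_ms


def overlay_transit_acceptable(spoke_a_to_hub_ms: float,
                               hub_to_spoke_b_ms: float,
                               hub_processing_ms: float = 0.0,
                               budget_ms: float = 150.0) -> bool:
    """True if the hub transit path fits within the one-way latency budget.

    budget_ms defaults to 150 ms, an assumed voice target; adjust per application.
    """
    total = transit_latency_ms(spoke_a_to_hub_ms, hub_to_spoke_b_ms, hub_processing_ms)
    return total <= budget_ms


# Example with made-up measurements: 40 ms and 55 ms legs plus 2 ms at the hub.
print(overlay_transit_acceptable(40, 55, hub_processing_ms=2))
```

If the check fails for a latency-sensitive application, that application is a candidate for the MPLS underlay instead of overlay transit.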
Deploy SDI-2030 appliances
As a best practice, use SDI-2030 appliances instead of SDI-1030 appliances. The SDI-2030s have more resources, including an x86-based architecture and 1 Gbps WAN throughput.
The SDI-2030 appliances also provide native and full routing protocol support (BGP and OSPF) and support advanced routing policies, including route maps and filtering.
The SDI-2030 appliances also support active-active high-availability deployments, which allow for better performance when both appliances are active.
Use high-availability designs
Whenever possible, use high-availability designs for optimal scalability, including:
•For hubs, deploy either dual hub or split data center designs.
•For appliances, use (1+1) active-active device HA or use (n+k) clustered device HA.
•For uplinks, use multiple uplinks as either active-active or active-backup.
•For WAN, use multiple WANs (MPLS and Internet) and configure intelligent traffic steering (PQPS).
Migrate to a dedicated SCM instance
A standard SteelConnect deployment is part of a shared instance and does not have dedicated hardware resources, but a dedicated, single-tenant SCM instance gets dedicated CPU, memory, I/O, and other hardware resources.
Consider a dedicated SCM instance if you have a large number of sites (more than 300), an unusually large number of rules, zones, routes, or tunnels, or contractually binding SLA requirements.
When to upgrade to a dedicated SCM instance
•Site count exceeds thresholds (note that inactive and offline sites and appliances count toward the threshold):
–250 sites for full mesh overlay topology
–300 sites for hub-and-spoke overlay topology
•Unusually long delays for standard SCM configuration operations (web interface or REST API), for example, several minutes to create a site, add a zone, or add routing configuration.
•Unusually long delays for loading typical reports (web interface or REST API), for example, several minutes for dashboard and health-check reports.
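The site-count criterion can be expressed as a simple check. The sketch below is illustrative: the function and dictionary names are hypothetical, and the thresholds are the comparison figures quoted in this section, which can change between releases.

```python
# Site-count thresholds quoted in this section (for comparison only).
SITE_THRESHOLDS = {"full-mesh": 250, "hub-and-spoke": 300}


def exceeds_site_threshold(topology: str,
                           active_sites: int,
                           inactive_or_offline_sites: int = 0) -> bool:
    """True if the total site count exceeds the threshold for the topology.

    Inactive and offline sites count toward the threshold, so they are added in.
    """
    total = active_sites + inactive_or_offline_sites
    return total > SITE_THRESHOLDS[topology]


# 240 active sites plus 15 offline sites is over the full-mesh threshold.
print(exceeds_site_threshold("full-mesh", 240, 15))
```

The check covers only the site-count criterion. Slow configuration operations or slow reports are separate signals that need to be observed in the web interface or REST API.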
When you upgrade, Riverbed changes the EC2 instance type. The standard EC2 type is m4.4xlarge, but with a dedicated, single-tenant SCM, you move to either an m4.10xlarge or an m4.16xlarge instance.
When you move to a dedicated instance, you also get advanced monitoring of SCM and its components by the Riverbed Cloud Operations team and special tuning of the SCM configuration for better performance. Contact Riverbed Support for more information about a dedicated SCM instance.
Test in a staging environment
Many variables in individual networks affect real-world performance, such as the unique traffic blend and the features in use. To accurately predict the impact of any network configuration changes, it is best to test in a similar environment, under expected conditions.
We strongly recommend creating a staging environment that duplicates the production environment. This environment should have similar hardware, software, and configurations so that any changes can be tested before deploying into production.
For best results, ensure your staging environment:
•Closely resembles your production environment and includes similar appliance models.
•Includes representative applications and traffic profiles.
•Provides coverage of critical use cases.
Contact Riverbed Support for more information and best practices on building a staging network.
Centralize and focus configurations
When possible, use centralized configuration management through the SCM and do not simply connect legacy site configurations. Design and implement global policies instead of customized and unique local configurations.
If a rule is needed only for a specific site, deploy it only at that site, and limit the scope of rules to specific sites and zones whenever possible. Often, the default settings do not limit scope, so a rule applies to all sites and zones and can cause unnecessary bloat. In particular, consider limiting the scope of:
•Outbound/Internal Rules
•Traffic Rules
As a best practice, audit your configuration regularly and clean up unnecessary or redundant configurations.
Plan and distribute the appliance upgrade schedule
The SteelConnect software upgrade involves many concurrent activities (such as appliance image download, configuration update, node inventory, and tunnel recreation) and these upgrade activities can overload the system.
As a best practice, upgrade appliances in staggered batches to limit disruptions, and use site groups and tags to schedule the upgrades. Also, upgrade data centers before branches. At each step, verify the upgrade and functionality before proceeding to the next stage.
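The batching logic can be sketched as follows. This is a planning aid under stated assumptions, not an SCM feature: the `Site` class, the `is_data_center` flag, and the `upgrade` and `verify` callbacks are all hypothetical stand-ins for whatever site groups, tags, and checks you use.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Site:
    name: str
    is_data_center: bool


def upgrade_batches(sites: List[Site], batch_size: int) -> Iterator[List[Site]]:
    """Yield data center batches first, then branch batches, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    data_centers = [s for s in sites if s.is_data_center]
    branches = [s for s in sites if not s.is_data_center]
    for group in (data_centers, branches):
        for start in range(0, len(group), batch_size):
            yield group[start:start + batch_size]


def run_staggered_upgrade(sites: List[Site],
                          batch_size: int,
                          upgrade: Callable[[Site], None],
                          verify: Callable[[Site], bool]) -> bool:
    """Upgrade one batch at a time and stop at the first batch that fails verification."""
    for batch in upgrade_batches(sites, batch_size):
        for site in batch:
            upgrade(site)  # hypothetical callback that triggers the upgrade
        if not all(verify(site) for site in batch):
            return False  # do not proceed to the next stage
    return True
```

Keeping data centers and branches in separate batches means a failed data center upgrade halts the rollout before any branch is touched.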
Note the differences between types of upgrades:
•SCM upgrade vs. appliance upgrade
•SDI vs. SteelHead SD image download behavior
Turn off flow-record processing
Flow-record processing in SCM can cause inaccuracies in reports when flow rates are high, because flow records can be dropped or their processing delayed. (Without flow processing in SCM, you lose the Traffic Timeline report and SteelCentral Insights for SteelConnect.)
As an alternative:
•Use on-premise SteelCentral NetProfiler
•Export NetFlow/IPFIX records directly from the appliance to NetProfiler (v2.12)