Tags
7010, ccie data center, cisco, data center, Fryguy, fryguy_pa, LCS, nexus, nexus 7000, nexus 7010, nxos, troubleshoot, vPC
OK, the Nexus switches have been installed and running for over a week now with no further problems, and there has been no fallout to address before writing this post. Everyone at work took the change well and understood some of the issues we ran into, as well as how we addressed them.
Just to recap what we did and the thoughts around why…
– Location had two Cisco 6509 switches running Sup2/MSFC2 along with 6548 line cards, in service for over 8 years
– Switches had started to show line-card failures on a more regular basis, chalked up to the age of the equipment
– When line cards failed, spanning-tree loops were introduced, which could severely impact the site
– Recently installed a large VM environment at the location, with the understanding that DMZ requirements were coming in the near future
– This is a data center location, so data center-level hardware was required (10 Gb capabilities and beyond)
– In a single night, removed both Cisco 6509 switches, reconnected about 250 servers, and moved to LACP EtherChannel on all rack switches, with everything in STP forwarding mode (see the configuration sketch after this list)
– Also built a temporary network to maintain customer traffic through the site
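For context, each rack-switch uplink landed on the Nexus pair as a vPC-attached LACP port-channel. As a rough sketch only (the domain, port-channel, and interface numbers plus the keepalive addressing below are made up for illustration, not our production values), the relevant configuration on one N7K peer looks something like this:

    feature lacp
    feature vpc
    !
    vpc domain 10
      peer-keepalive destination 10.255.255.2 source 10.255.255.1
    !
    ! vPC peer link between the two N7Ks
    interface port-channel1
      switchport
      switchport mode trunk
      vpc peer-link
    !
    ! Downlink to one rack switch, bundled with LACP and tied to vPC 20
    interface port-channel20
      switchport
      switchport mode trunk
      vpc 20
    !
    interface Ethernet1/20
      switchport
      switchport mode trunk
      channel-group 20 mode active
      no shutdown

The mirror-image configuration (same vPC and port-channel numbers) goes on the other N7K peer, and the rack switch just sees one ordinary LACP EtherChannel.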
Now, these requirements might not scream Nexus 7000 hardware, but we do not change core data center hardware very often and wanted to install a switch with more “future proofing” built in than the alternatives. The Cisco 6509E chassis in VSS mode has many of these features, but Cisco is investing its money in the Nexus line and we felt this was the proper way to go. Also, with a potential web presence imminent, the VDC and OTV capabilities of the Nexus are a perfect fit.
To be honest, the installation went really well. We were able to remove both Cisco 6509 switches in about an hour (they were DC-powered, so an electrician was required) and get the new Nexus 7010s shoe-horned into their place. The Nexus are heavy beasts, north of 500 lbs each, so I highly recommend removing the power supplies if possible. Credit to Cisco, though: the way these are designed, they rack easily. Just like the Cisco 6500, the Nexus sits on a shelf for support and then gets screwed to the rack at the face. We had the new network up and running, ready for cut-over, by around 5 AM, about 5 hours after we started.
So, what problems did we encounter? That is where the fun begins. What is funny is that only one of the problems would I consider a network design issue; the others were the typical “oh, we did not know that” or “whoops, wrong default gateway IP address” variety.
So, the first problem, the one that was actually a design issue, was L3 neighbor routing over vPC. Even though Cisco's documentation does not come right out and say it does not work, trust me, it does not. Per Cisco's doc ( http://www.cisco.com/en/US/docs/switches/datacenter/sw/5_x/nx-os/interfaces/configuration/guide/if_vPC.pdf ):
Configuring VLAN Interfaces for Layer 3 connectivity
You can use VLAN network interfaces on the vPC peer devices to link to Layer 3 of the network for such applications as HSRP and PIM. However, we recommend that you configure a separate Layer 3 link for routing from the vPC peer devices, rather than using a VLAN network interface for this purpose.
Now, as you can see from the above picture, we have an EIGRP neighbor relationship between all the routers and the core switches (R1-N7K1, R1-N7K2, R2-N7K2, R2-N7K1). Typically this is fine in a normal spanning-tree network, but what happens when using a vPC is something different. When a packet is received on R1 and R1 decides that the next hop should be N7K2, but the device R1 is trying to reach is attached to N7K1 (directly or via EtherChannel), the packet is dropped because it would need to traverse the vPC peer link twice. N7K2 sees the packet and just drops it: a bit is set on the packet when it is received over the vPC peer link so that it is not re-transmitted back across that link. This is a loop-prevention mechanism, and as you can guess, that is a good thing.
To fix this design flaw, we just had to make the link between R1 and N7K1 a routed L3 interface, and likewise the link between R2 and N7K2. We had actually talked about doing this before the Nexus install, but chose to wait as we did not want to change too many things at one time. The final design looks like this.
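For reference, a minimal sketch of that change on the N7K1 side, assuming EIGRP AS 100 and /30 point-to-point addressing purely for illustration (neither is from our actual config), would be along these lines:

    feature eigrp
    !
    router eigrp 100
    !
    ! Former trunk-facing port toward R1, now a routed point-to-point link
    interface Ethernet1/1
      no switchport
      ip address 192.0.2.1/30
      ip router eigrp 100
      no shutdown

R1 then forms an ordinary EIGRP adjacency across that /30, and the equivalent routed interface goes on the R2-to-N7K2 link to match the final design in the diagram.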
Now, one of the other problems we encountered was hardware-related: a bad supervisor module (the backup supervisor, actually) that was causing high CPU usage on the box. The first Nexus was running at 10-20% CPU whereas the second was running at 90-100%. This was a little more difficult to track down as there were no errors in the log, but the way we figured it out was that one line card was stuck showing a “downgrade in progress” message on some of its ports. That message showed up because we had downgraded from 5.0.3 to 5.0.2 to see if a bug in code was causing the CPU issues. I will admit, Cisco had a new supervisor as well as a line card to us about 2-3 hours after we figured that out.
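For anyone chasing a similar symptom, nothing exotic is needed to see it; the standard NX-OS health checks are the place to start (shown here as a generic checklist rather than a transcript of our session):

    show system resources      (overall supervisor CPU and memory)
    show processes cpu sort    (which processes are actually burning the CPU)
    show module                (supervisor and line-card status)
    show logging logfile       (whatever the modules did manage to log)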
The last two problems we encountered were out of our control. One was bad default gateways on devices: I do not know how they were working prior to the upgrade, as they had a non-existent IP address configured as the default gateway. Perhaps they had a static route and it disappeared when the network link went down; that is the only logical explanation I can come up with. We had a few devices that did this, so it may actually be a “feature” in their code. Luckily, those devices have now been fixed.
The last one took us a bit longer to figure out, and it turned out that a “socks and sandals” person from the vendor had to get on the phone. We had a device that plays audio messages to end users, and since the installation of the Nexus that feature was no longer working. It struck us as odd and out of character for this to be related to the new core switches, but since it broke after the install, we kept at it until we figured it out. It turns out the Nexus was receiving the packet (it is the default gateway) and dropping it. Why, you ask? Because the vendor wrote their application to use the default gateway to loop the packet, i.e. the same source and destination address are in the packet and it is simply routed back through the default gateway. The Nexus, like most any other security-conscious device, drops that packet as it is viewed as a spoofed packet. Reviewing the logs in the main VDC, I can see the following error message: 2010 Aug 26 12:49:07 N7K1 %EEM_ACTION-6-INFORM: Packets dropped due to IDS check address reserved on module 9. Once we disabled that check, everything started to work.
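For reference, those drops come from the NX-OS “hardware ip verify” family of IDS checks, and you can look at their counters before deciding to touch anything. A rough sketch only, and obviously disable a check solely once you understand exactly why it is firing, since these exist for good reasons:

    show hardware forwarding ip verify
      (shows the IDS checks along with per-module drop statistics)
    configure terminal
      no hardware ip verify address reserved
      (disables the specific check named in the log message above)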
So what is the moral of all of this? Well, it is good to know how all the software on your network is configured, but that is honestly almost impossible to do. What does help is speaking with the vendors before the change so they are aware of what you are doing, and we did: we actually had proactive tickets open with all vendors for the change. I will also say that getting all the parties on the phone (TAC, the vendor, etc.) makes a huge difference. We had TAC on the phone (they usually had two TAC Nexus engineers on the calls) along with the vendor, and they worked through all the communications between the devices and figured it out. But in the end, what did it was the “socks and sandals” person from the vendor saying “you know, it does this…” to get the light bulb to click.
I just want to say, all in all this installation went very well. There were a few bumps in the road, but they were to be expected. It helps to have a good team of people you can count on when you are doing this, and thankfully I have that.
yostie said:
That's too funny. I was doing my first redundant N7K install for a customer and I ran into the same problem running a routing protocol over a VLAN involved in vPC. It took three TAC engineers to figure it out. The guy from Australia worked on it for like 5 hours, a guy from Brussels worked on it for about 8 hours, then the guy from the US figured it out in 5 minutes.
I had a similar project going on where my customer was moving everything from one data center to another. All in all, things went really well except for that problem. The routing did work to some degree; certain networks worked and others didn't.
Curtis said:
Great write-up, I wish I had taken better notes on some of my previous projects and you have inspired me to do so.
One possible explanation for the issue with hosts configured with incorrect gateway address is proxy arp. Typically enabled by default on Cat 6K SVIs, this feature is disabled by default on Nexus SVIs.
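If you wanted to match the old Cat 6K behaviour, proxy ARP can be turned back on per SVI on the Nexus; a quick sketch with a made-up VLAN number:

    interface Vlan100
      ip proxy-arp

Fixing the hosts' gateway configuration is obviously the better long-term answer.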
Andrea said:
Hello..
About the vPC loop-prevention algorithm: it is now widely known, but it is still an issue for the TAC.
I reviewed and deployed a big network with Nexus and ASR, and the Cisco Advanced Services guys needed something like 3 weeks just to figure out what Cisco presales had suggested.
I recommended some further improvements, but they dropped them since “it's the presales design, we need to follow it”.
After some months we had an issue and the TAC, after 2 months of reviews, said “Can you consider these improvements?”. Needless to say, they were the ones I had already suggested.
Given the complexity of the scenario and the vPC behaviour (despite the very simple idea), even Cisco is having difficulty getting everything right.
About the loss of the default gateway on a lot of equipment: it might be related to the proxy-ARP behaviour on the Cisco 6K, which is enabled by default; this might not be the case on the Nexus (by default and/or per your config).
If devices are configured with no default gateway, some of them (namely Windows) will send an ARP request for each host they need to connect to.
Ciao,
A.