Recently we had a situation where ports 1-4 on an 8-port blade in a 6509 (cat6000-supk9_6-3-5/(C6MSFC2-PK2SV-M), Version 12.1(8b)E9) failed. Each of the ports was the first port in a GEC to one of four other 6509s running the same software. On the other boxes UDLD saw the neighbor disappear and disabled the port, and the errdisable timeout then re-enabled it:

%UDLD-3-AGGRDISABLE:Neighbor(s) of port 3/6 disappeared on bidirectional link. Port disabled
%MGMT-5-ERRDISPORTENABLED:Port 3/6 err-disabled by udld enabled by errdisable timeout

This recovery looks correct, but the four GECs were now unusable: CDP couldn't see the neighbor port, and OSPF on the four connected boxes looped in different ways depending on the neighbor. Either:

*Jul 15 09:49:35: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan26 from EXSTART to DOWN, Neighbor Down: Dead timer expired
*Jul 15 09:49:48: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan26 from EXSTART to DOWN, Neighbor Down: Dead timer expired
*Jul 15 09:50:02: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan26 from EXSTART to DOWN, Neighbor Down: Dead timer expired

or:

Jul 15 09:35:03: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan33 from LOADING to FULL, Loading Done
*Jul 15 09:35:08: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan33 from FULL to DOWN, Neighbor Down: Dead timer expired
*Jul 15 09:35:14: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan33 from LOADING to FULL, Loading Done
*Jul 15 09:35:30: %OSPF-5-ADJCHG: Process 1, Nbr 1.111.11.111 on Vlan33 from FULL to DOWN, Neighbor Down: Dead timer expired

Aside from the fact that we seem to have run into something the software couldn't handle, it seems to me that in future it would be better to turn UDLD recovery off on GECs and hope that the channel will recover if a failing link stays down.
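
For reference, the change I have in mind would be roughly the following on the CatOS side. This is only a sketch: it assumes the errdisable timeout is configured globally per reason rather than per port or per channel on this release, and I have not checked the exact syntax against 6.3(5).

```
! Assumption: CatOS global errdisable-timeout commands, unverified on 6.3(5).
! Stop the timeout from automatically re-enabling ports that UDLD has err-disabled.
set errdisable-timeout disable udld

! Check which err-disable reasons still have automatic recovery enabled.
show errdisable-timeout
```

With this in place a port disabled by UDLD would stay down until someone re-enables it by hand, so the remaining links in the channel would carry the traffic in the meantime.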
I'd like to hear some best practices or opinions on this.
Thanks.