To start, here is an overview of our network. Without going into specifics, we are running a VSS core with two 6509s. These are fiber-connected to about 15 distro routers, mostly 45xx's, with a couple of stacked 3750s.
Then we have numerous access switches. Core switches are connected via various tunnels, uplinks, etc to the "cloud".
Within the LAN/WAN boundary, we have a stratum 1 device: a GPS-based time server in the data center.
NTP is working correctly on the network. We are using NTP authentication, and we are using mgmt vlans.
We are required to have dual NTP sources on each device. To me, that means two separate and independent devices/locations to obtain time from; to others, it means an NTP source (like our GPS device) and a next-hop device.
Unfortunately, we have a non-CCNA with some limited networking experience who runs the show, and is making a mess.
To keep network NTP traffic down, the "prefer" source was set to the next-hop device (i.e., access switches use the distro switch as the preferred source and the GPS as backup; distro switches use the core switch as primary and the GPS as backup).
To me, we'll still have a lot of NTP traffic on the network anyway; big deal. Have everything go to the one GPS device as stratum 1, and then perhaps another IP address as a secondary source if need be. I think it's all relative. But here's the issue we are having:
All of our distro switches have redundancy, some with multiple paths to the core. So, one distro may have 1 link to Core 2, 1 link to Core 1, and then another link to another distro switch (or even 2 distro switches). Those links are mostly layer 3 EtherChannels with 2-3 links per channel, so you have redundancy built into the port channel. But the access switches have no redundancy. They usually connect via a single fiber link to their local distro switch. If that one fiber link gets cut, eaten, etc., it's down.
Now, almost all of our access switches are 3750s. MOST have UPS devices, but the UPS devices are getting worn out, need batteries, etc. Our power is unreliable here. So when an access switch goes down and comes back up, as long as the link is good, it should resync its time with no issue.
The problem is that several of our distro switches are the 3750 stacks. Power goes out, UPS kicks in...UPS battery fails before power is restored. The 3750 restarts...and then the time is off. We go back to March 1, 1993. Well, we use EIGRP authentication on the distro switches, with EIGRP routing over the port channels that connect to the core switches. No time...no routing. We tried creating an infinite key on the distro switches, with a lifetime that covers March 1, 1993. But that didn't work, because the core will always use the current correct key; it won't go down the list to find a match. The time/date check is local to each switch. So distro1 has 3 keys. Key 1 is in the correct time range. The core is also using key 1 in the correct time range. Distro1 fails and comes back online. Now key 3, the one covering the 1993 time range, is the only key distro1 considers valid. But that key does not match what the core is using as its current key. Therefore, routing fails.
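To make the key mismatch concrete, here is a rough Python sketch of how time-based key selection plays out. It is a simplified model, not IOS behavior verified line by line: it assumes a sender uses the lowest-numbered key that is valid by its own clock, and a receiver accepts a packet only if the key with the same ID is valid by the receiver's own clock. The key IDs follow the post; the secrets, dates, and function names are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Key:
    key_id: int
    secret: str
    start: datetime
    end: Optional[datetime]  # None models an "infinite" lifetime

    def active(self, now: datetime) -> bool:
        """True if this key is inside its lifetime according to the given clock."""
        return self.start <= now and (self.end is None or now < self.end)


def sending_key(chain: Sequence[Key], now: datetime) -> Optional[Key]:
    """Assumed rule: send with the lowest-numbered key valid by the local clock."""
    for key in sorted(chain, key=lambda k: k.key_id):
        if key.active(now):
            return key
    return None


def accepts(chain: Sequence[Key], now: datetime, key_id: int, secret: str) -> bool:
    """Assumed rule: accept only if the same key ID is valid locally and matches."""
    return any(
        k.key_id == key_id and k.secret == secret and k.active(now) for k in chain
    )


def adjacency_forms(chain: Sequence[Key], clock_a: datetime, clock_b: datetime) -> bool:
    """Both sides share one chain; each must accept what the other sends."""
    a, b = sending_key(chain, clock_a), sending_key(chain, clock_b)
    if a is None or b is None:
        return False
    return accepts(chain, clock_b, a.key_id, a.secret) and accepts(
        chain, clock_a, b.key_id, b.secret
    )


# Hypothetical chain: keys 1 and 2 rotate yearly, key 3 is the "infinite" 1993 key.
CHAIN = [
    Key(1, "k1", datetime(2024, 1, 1), datetime(2025, 1, 1)),
    Key(2, "k2", datetime(2025, 1, 1), datetime(2026, 1, 1)),
    Key(3, "k3", datetime(1993, 3, 1), None),
]
CORE_CLOCK = datetime(2024, 6, 1)            # core has correct time
REBOOTED_CLOCK = datetime(1993, 3, 1, 0, 5)  # distro just cold-booted

if __name__ == "__main__":
    print("core sends key", sending_key(CHAIN, CORE_CLOCK).key_id)
    print("distro sends key", sending_key(CHAIN, REBOOTED_CLOCK).key_id)
    print("adjacency forms:", adjacency_forms(CHAIN, CORE_CLOCK, REBOOTED_CLOCK))
```

Under these assumptions the core keeps sending key 1 because it is the lowest-numbered valid key on its side, and the rebooted distro rejects it because key 1 is not valid in 1993, so the infinite key never gets a chance to help.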
We were driving over to the facility and manually resetting the time to bring up routing. It didn't occur to us until recently that even though routing is down, the port channels are directly connected. So we can remote into an adjacent distro switch, SSH to the failed switch via the port channel IP, and tada! We have access and can reset the clock.
But we are trying to come up with a work-around to that. One thought was to use the core's address on the uplink port channel as the NTP server. Say the distro is connected to the core via a layer 3 port channel: distro is .2, core is .1. The distro points at .1 for NTP. Even though routing is down due to the key mismatch, NTP can still get across the port channel because the two are directly connected, so the distro gets its time from the core.
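The reason this idea should work is that a directly connected subnet needs no routing protocol at all. A small Python sketch of that check, using only the standard `ipaddress` module: the addresses below (a /30 on each uplink, a GPS server at 10.10.10.10) are invented for the example, and the function names are hypothetical.

```python
import ipaddress
from typing import Iterable


def on_link(local_interface: str, target: str) -> bool:
    """True if target sits in the same subnet as the local interface address."""
    iface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(target) in iface.network


def reachable_without_routing(local_interfaces: Iterable[str], target: str) -> bool:
    """A target is reachable after a cold boot only if some interface is on-link with it."""
    return any(on_link(iface, target) for iface in local_interfaces)


# Hypothetical distro uplinks: the distro is .2 and the core is .1 on each /30.
DISTRO_UPLINKS = ["10.255.1.2/30", "10.255.2.2/30"]
CORE_PEER = "10.255.1.1"     # core end of the first port channel
GPS_SERVER = "10.10.10.10"   # stratum 1 box, several hops away

if __name__ == "__main__":
    for server in (CORE_PEER, GPS_SERVER):
        ok = reachable_without_routing(DISTRO_UPLINKS, server)
        print(server, "reachable with EIGRP down:", ok)
```

In this model the core's port channel address is usable as a time source straight after boot, while the GPS server is not until EIGRP (or some static route) is back.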
Just looking for thoughts and advice from others who have experienced this issue.
Sounds like too many cooks have been seasoning the soup. People chose all available features without consideration that running everything available in a fragile physical environment does not enhance resiliency but rather diminishes it.
That aside, have you considered putting in a static /32 route (with higher AD than your IGP) on the access switches for the NTP server? That was reported in another thread to have good results.
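For anyone unsure how a static route with a higher AD behaves, here is a toy Python model of route selection. It assumes the usual two-step logic: per prefix, the route with the lowest administrative distance is installed, and lookups then pick the longest matching prefix. The addresses and the AD of 250 are hypothetical; 90 is the default AD for internal EIGRP routes.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str
    distance: int  # administrative distance, lower wins


def build_rib(candidates: Iterable[Route]) -> Dict[ipaddress.IPv4Network, Route]:
    """Install, for each prefix, only the candidate with the lowest AD."""
    rib: Dict[ipaddress.IPv4Network, Route] = {}
    for route in candidates:
        net = ipaddress.ip_network(route.prefix)
        best = rib.get(net)
        if best is None or route.distance < best.distance:
            rib[net] = route
    return rib


def lookup(rib: Dict[ipaddress.IPv4Network, Route], destination: str) -> Optional[Route]:
    """Longest-prefix match over the installed routes."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in rib if addr in net]
    if not matches:
        return None
    return rib[max(matches, key=lambda net: net.prefixlen)]


NTP_SERVER = "10.10.10.10"  # hypothetical address of the GPS time server
FLOATING_STATIC = Route("10.10.10.10/32", "10.255.1.1", "static", 250)
EIGRP_HOST = Route("10.10.10.10/32", "10.255.2.1", "eigrp", 90)

if __name__ == "__main__":
    normal = build_rib([EIGRP_HOST, FLOATING_STATIC])
    after_reboot = build_rib([FLOATING_STATIC])  # EIGRP adjacency is down
    print("normal:", lookup(normal, NTP_SERVER).source)
    print("after reboot:", lookup(after_reboot, NTP_SERVER).source)
```

One caveat this model makes visible: if the IGP only advertises a shorter prefix for the server (say a /24), the /32 static is installed alongside it and wins on prefix length regardless of AD, so it only "floats" when the IGP carries the same /32.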
We are using NTP authentication
This is your biggest weakness. As you've mentioned, if your switches reboot and they can't get the right time source, they can't join the routing. It's a classic "chicken or egg".
Here are my recommendations:
1. Disable NTP authentication altogether. If you really, really need to make it secure, you can use an ACL that permits only the management VLAN.
2. Raspberry Pi or rPi -- Use a Raspberry Pi as your local or tertiary NTP server. The rPi can be used to synchronize to your distro or core.
I've got an rPi at home and I'm using it to host my Asterisk/FreePBX SIP server. It cost me AU$45 for the rPi, AU$25 for the memory card, and a used power charger from an iPad.