Solved: CUCM 6.1 A Cisco DB not starting after power outage

IT_Luke · ‎06-26-2024

A customer has a (very) old twin CUCM 6.1 cluster on 2 HP DL320 G5 servers. After a power outage the phones wouldn't register and I was unable to access the CM Administration interface (Database communication error). After some digging from the console, the problem is that the "A Cisco DB" service is not starting or running on the primary node, which explains the situation. Apparently something got corrupted during the power outage and needs to be fixed (considering it's a soft RAID with write-back cache enabled, that's not surprising).

I couldn't locate any valid backup tar of this instance (backups were enabled, but the target machine was taken offline about 8 years ago from what I gathered). At the moment there are only 4 phones connected (there were over 40, but everything shrank dramatically), so starting from a new DB would probably be the smarter solution if possible, but before we do that I'd like to see if the DB can be recovered.

I know there is no default root access (at least not without TAC creating one, which is impossible with this unsupported version), so I found a Linux Mint distro with the Intel rg chipset support drivers to recognise the (fake) RAID 1, mounted it (in /PartB under /media/mint/_partB) and added a user with a proper shell and sudo privs without touching the default root user. Poking around I found that the DB is a postgres type, but I found no apparent errors in the logs I have checked so far (they are not in the default directories). As these services are pretty tailored and are not brought up the default init.d way (they are called through a "Service Manager"), any pointers or suggestions as to where to look for more info would be appreciated!

IT_Luke · ‎06-26-2024

After dedicating a couple of hours to the FS and related services (tomcat and java based) I found the culprit: a corrupted prefs.xml file under the /usr/local/cm/conf/dbl subdirectory, which made the preferences.py imported by the startdbl script fail. Emptying this XML file solved the issue, as it was subsequently recreated at runtime (like others) by the DB start-up script. If the file contains inconsistent XML data, on the other hand, the python script aborts and does not recreate it on its own. FYI this script (and others) can be found under /usr/local/cm/bin/. It's a pity no informational logs or error outputs are reported through the usual console (utils service start A Cisco DB just returns "not started" without further info). Only a manual launch of "startdbl start" shows any error output, which took a while to trace back to.
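For anyone landing here with the same symptom, this is a rough sketch of the check and fix described above, not a script that ships with CUCM. It is meant to be run with Python 3 from the live distro against the mounted partition. The function names `find_broken_xml` and `backup_and_empty` are made up for illustration, and skipping empty files relies on my observation that an empty prefs.xml is recreated at start-up.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def find_broken_xml(conf_dir):
    """Return the non-empty *.xml files in conf_dir that fail to parse."""
    broken = []
    for path in sorted(Path(conf_dir).glob("*.xml")):
        # Assumption: an empty file is harmless, since the start-up
        # script recreated the emptied prefs.xml in my case.
        if path.stat().st_size == 0:
            continue
        try:
            ET.parse(str(path))
        except ET.ParseError:
            broken.append(path)
    return broken


def backup_and_empty(path):
    """Copy the file to <name>.bak, then truncate the original to zero bytes."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(str(path), str(backup))
    path.write_text("")
    return backup
```

Point `find_broken_xml` at the conf/dbl directory under wherever the CUCM partition is mounted, review what it reports, and only then call `backup_and_empty` on the file you actually want to reset. The .bak copy is there so the change can be undone.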

FYI this CCM cluster (like the rest of the decaying infrastructure) is in this state for a reason: the company and its assets are undergoing a long legal procedure, and nothing can be updated or purchased in this period. It is far from an ideal situation, which we, as IT consultants, have to (and are paid to) endure and support until the bureaucracy ends, which may take well over a decade from when it started. I am well aware that the systems are EOL and without official support, but this doesn't mean they can't be fixed or maintained wherever and for as long as possible. It is a "closed box" situation, and sometimes upgrading or renewing is not an option.

Thanks for understanding.


b.winter · ‎06-26-2024

Are you really expecting any help on a CUCM version that is probably almost 20 years old?

Nithin Eluvathingal · ‎06-26-2024

The CUCM 6.X version has been at its end of life for quite some time. Currently, we are using version 15. It is crucial that you upgrade to the latest versions as soon as possible to continue receiving support.
