Solved: Re: LMS 2.6: restoredb.pl taking too long?

yjdabear · ‎05-04-2007

It's been almost 75 mins since the last update in the restore log. Is there something going wrong with this restore (from one LMS 2.6 server to another)?

tail -f /var/adm/CSCOpx/log/migration.log

04/May/2007 15:01:29:695 208911 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 135 - Unmapping 717@nms of RME from PIDM

04/May/2007 15:01:29:700 208916 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 160 - UpdatePIDMEntries are done

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 161 - UpdatePIDMEntries are done got to

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 70 - Exiting now... from do mig

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 71 - Exiting doMigration

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.DcmaDataMigrator@1ff7a1e

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.sa.migration.SaDataMigrator@d19bc8

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain 291 - SaDataMigrator::postMigration() - No post migration to be done.

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.SWIMDataMigrator@1630ab9

04/May/2007 15:01:29:704 208920 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.RMEDataHandler@1551f60

Joe Clarke · ‎05-10-2007

Ah-ha! Found it. NMSROOT is reset by the fact that restorebackup.pl uses Perl. Instead of being the non-canonical /opt/CSCOpx, it is /product/CSCO/CSCOpx. Since dmgtd.sol always runs as /opt/CSCOpx/objects/dmgt/dmgtd.sol, the dmgtd script can't find it.

If you modify /etc/init.d/dmgtd and change line 30 to:

if echo ${i} | /usr/bin/cut -d, -f2 | /usr/bin/egrep "^/opt/CSCOpx/objects/dmgt/${DMGRFILE}$" >/dev/null 2>&1 ; then

This problem will go away. I don't like using hardcoded values like, but I think it's safe in this particular case. Sorry for the mess.

View solution in original post

Joe Clarke · ‎05-04-2007

This is the migration.log. Check the restore log to see if there's anything happening in there. At this point, I wouldn't be too worried. The migration will take time (could be multiple hours). The major bugs with migration taking days should all be fixed in LMS 2.6, and I see no indication of problems in this log snippet.

yjdabear · ‎05-04-2007

tail -f restorebackup.log

[Fri May 4 14:57:57 2007] Value of PX_DATA_DIR is: /var/adm/CSCOpx/files

[Fri May 4 14:57:57 2007] CMF Version is : 3.0.5

[Fri May 4 14:57:57 2007] No need to update the incremental ccr

[Fri May 4 14:57:57 2007] No need to update the ccr

[Fri May 4 14:57:57 2007] This is Restore

[Fri May 4 14:57:57 2007] Restoration of the RME data is in progress...

[Fri May 4 14:57:57 2007]

[Fri May 4 14:57:57 2007] See the log file /var/adm/CSCOpx/log/migration.log for progress...

Daemon Management is not running.

INFO : Stopping DBEngine processes registered to Daemon Manager

Just curious: Does LMS 2.5+ implement more/better compression for the backup archive? My LMS 2.2 backup used to eat up >7GB, while LMS 2.6 now only consumes ~1.9GB, though the latter does need >6GB temp space to perform the restore.

Joe Clarke · ‎05-04-2007

Nothing is different in terms of compression. After a migration, the resulting data may actually be smaller since the RME database is not restored as-is, but rather the data is transfered into the new database.

yjdabear · ‎05-05-2007

It started at 14:14 yesterday and is still not finished:

ps -ef | grep restore

casuser 4743 4382 0 11:29:53 pts/2 0:00 grep restore

root 23562 384 0 14:14:03 pts/9 0:07 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/restorebackup.pl -d /product/

-rw-r--r-- 1 root other 6102085 May 4 15:01 migration.log

-rw-r--r-- 1 casuser casusers 1560530 May 4 15:01 restorebackup.log

-rw-r----- 1 casuser casusers 36062 May 4 15:07 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 1730241 May 4 15:07 pidm.log

-rw-rw---- 1 casuser casusers 444056 May 4 15:08 license.log

-rw-r----- 1 casuser casusers 3002 May 4 15:33 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 4 15:33 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 34329 May 4 15:33 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 402895 May 4 15:34 NCTemplateMgr.log

-rw-rw---- 1 root casusers 30787 May 4 15:34 rme_base.log

-rw-rw---- 1 root casusers 8021736 May 4 15:34 rme_ctm.log

-rw-rw---- 1 casuser casusers 5438912 May 4 15:34 IC_Server.log

-rw-rw---- 1 casuser casusers 74597 May 4 15:34 daemons.log

-rw-rw---- 1 casuser casusers 161318 May 4 15:34 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 1650290 May 4 15:34 dmgtd.log

-rw-rw---- 1 casuser casusers 1839918 May 5 03:04 dbbackup.log

-rw-rw---- 1 casuser casusers 11458 May 5 11:27 diskWatcher.log

-rw-rw---- 1 casuser casusers 9197825 May 5 11:28 dmgtDbg.log

Two points of curiosity: 1) The nightly backup ran as scheduled at 3:05 this morning. The archive comes out to 280MB, compared to ~1.9GB on the other server. I don't know it backed up what's originally on this server, or the other one's db which I was trying to restore here. 2) dmgtDbg.log has gotten a lot more chatty ever since 15:01 yesterday, logging every minute, right after migration.log stoped updating:

[Fri May 04 14:58:40 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

[Fri May 04 14:58:40 EDT 2007][dmgrDbg]##### INFO ##### re-evaluate DbgLevel=0x0

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_DEBUG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_TRACE)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg]##### INFO ##### re-evaluate DbgLevel=0x0

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_DEBUG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_TRACE)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

Joe Clarke · ‎05-05-2007

Your migration may be failing due to the symlinks you said you were using previously. Doing a backup of this partially migrated data will not yield any usable results, so I wouldn't trust that backup. The full migration.log will be needed to see if using symlinks is the problem.

Joe Clarke · ‎05-05-2007

I'd also like to know if dmgtd is running at this point? The restore log says it is not, but it looks like the script may be stuck in a loop trying to shutdown the database servers. If dmgtd is still running, it may be trying to respawn these processes.

Please provide the full output of ps -efl.

yjdabear · ‎05-08-2007

Could you take a look at TAC case 605960929 that I just opened for this issue? Thanks.

Joe Clarke · ‎05-08-2007

I assume this is still running now, and this ps output is current. What is the cksum on /etc/init.d/dmgtd? I want to make sure you have the correct version of my patch. It looks like the dmgtd script is not seeing the dmgtd.sol process. Also, if you run /etc/init.d/dmgtd stop now, does it also report that dmgtd is not running?

yjdabear · ‎05-08-2007

cksum dmgtd

333832757 23182 dmgtd

/etc/init.d/dmgtd stop

Daemon Management stopping. This may take a few minutes.

BTW, the only symlink in use is NMSROOT (/opt/CSCOpx/) pointing to /product/CSCO/CSCOpx/, but I think that's supported.

Joe Clarke · ‎05-08-2007

That's the right patch, and that's what you should see when dmgtd goes down. With dmgtd down, can you try the restore again? If it hangs again, I'd like to see the new ps output, and we may need to get a truss.

yjdabear · ‎05-08-2007

Actually the restorebackup.pl script did prompt me to perform "/etc/init.d/dmgtd stop" before it could start, so I suppose it'd probably just hang again. I'll go ahead give it another try and see what happens.

yjdabear · ‎05-08-2007

Looks like this second restore is going down the exact same path as the first one. I should note that the restore target server has two LMS patches the restore source does not: the patched dmgtd for CSCsi74887 and the fix for CSCsg26840 user must be Network Admin to use CiscoView. Would these discrepancies cause this restore failure?

ps -ef | grep restore

casuser 18973 16548 0 19:03:08 pts/2 0:00 grep restore

root 28868 27578 0 14:37:38 pts/4 0:11 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/restorebackup.pl -d /product/

-rw-r--r-- 1 root other 2693888 May 8 14:37 restorebackup.log.old

-rw-rw---- 1 root casusers 68234 May 8 15:19 CCRImport.log

-rw-r----- 1 casuser casusers 36059 May 8 15:23 ess.log

-rw-r----- 1 casuser casusers 981749 May 8 15:23 SyslogCollector.log

-rw-rw---- 1 casuser casusers 131257 May 8 15:25 cri.log

-rw-r----- 1 casuser casusers 4874 May 8 15:26 stpeng.log

-rw-rw---- 1 casuser casusers 1740435 May 8 15:26 EssentialsDM_Server.log

-rw-rw---- 1 casuser casusers 2580742 May 8 15:26 dcmaservice.log

-rw-rw---- 1 casuser casusers 15486 May 8 15:26 ca.log

-rw-r----- 1 casuser casusers 9572 May 8 15:27 CMFOGSServer.log

-rw-rw---- 1 casuser casusers 25736 May 8 15:27 AnalyzerDebug.log

-rw-r----- 1 casuser casusers 49719 May 8 15:27 lwms.log

-rw-rw---- 1 casuser casusers 5419488 May 8 15:29 NetShowServer.log

-rw-r--r-- 1 root other 13007803 May 8 15:31 migration.log

-rw-r--r-- 1 casuser casusers 1755759 May 8 15:31 restorebackup.log

-rw-r----- 1 casuser casusers 67085 May 8 15:36 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 1730657 May 8 15:36 pidm.log

-rw-rw---- 1 casuser casusers 446547 May 8 15:36 license.log

-rw-r----- 1 casuser casusers 3002 May 8 16:02 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 8 16:02 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 35518 May 8 16:02 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 404179 May 8 16:02 NCTemplateMgr.log

-rw-rw---- 1 root casusers 31270 May 8 16:02 rme_base.log

-rw-rw---- 1 root casusers 8124634 May 8 16:02 rme_ctm.log

-rw-rw---- 1 casuser casusers 5440687 May 8 16:02 IC_Server.log

-rw-rw---- 1 casuser casusers 1686819 May 8 16:03 dmgtd.log

-rw-rw---- 1 casuser casusers 75698 May 8 16:03 daemons.log

-rw-rw---- 1 casuser casusers 167922 May 8 16:03 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 2321 May 8 18:41 diskWatcher.log

-rw-rw---- 1 casuser casusers 31417435 May 8 18:56 dmgtDbg.log

Joe Clarke · ‎05-08-2007

These patches should not be causing this problem. Assuming the migration.log looks the same as before, the migration is complete, so things should start happening in the restore log. In that case, please get the output of:

lsof -p 28868

And the /tmp/restore.truss output of the following command (let it run for a minute, then do control+C):

truss -a -f -vall -rall -wall -o /tmp/restore.truss -p 28868

yjdabear · ‎05-09-2007

Please find the outputs in the two new attachments to TAC case 605960929. Thanks.

lsof-p.txt

restore.truss