Solved: DCR internal error in communication channel - Page 2

helexis · ‎10-12-2009

I have reported this error before...and have a TAC case open for it...but have found a workaround that I wanted to share that might shed some light on the issue.

The URL of Common Services > Device Management when I get the error contains the FQDN.

If I modify this URL by removing the domain suffix and attempt the same change in the DCR the change is successful.
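For anyone who wants to apply the same workaround consistently, here is a rough Python sketch of the URL change described above. The function name and the example URL path are made up for illustration; it assumes the host is a DNS name (not an IP address) and that the URL carries no user:password part.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_domain_suffix(url):
    """Rewrite a URL so its host is the short hostname (text before the first dot)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    short = host.split(".", 1)[0]
    # Keep an explicit port if the original URL had one.
    netloc = short if parts.port is None else f"{short}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical URL; only the hostname pattern comes from this thread.
print(strip_domain_suffix("https://ciscoworks-cm.domain.net/example/page"))
```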

Any ideas?

Joe Clarke · ‎10-14-2009

You can escalate the case to severity 1 if you are available and can work on it now. That will queue it to the next engineer. Otherwise, if you have your engineer contact me tomorrow, I can help them find the necessary procedure.

helexis · ‎10-15-2009

I need assistance with the command I was provided for re-registering the applications. There must be a typo...Can't get my engineer to return an e-mail or phone call. Can you advise?

I'm not sure if you want me to paste the command on here or not since it was so hard to come by, but here is the error I get:

Exception in thread "main" java.lang.NoClassDefFoundError: administration/1/0

The filename used was administration.1.0.xml and I was advised to remove the .xml for this command.

Joe Clarke · ‎10-15-2009

Yes, and you're missing a piece. You forgot the actual class name to execute. The class name, which comes after the end of the classpath argument, and before the filename is:

com.cisco.nm.cmf.registry.CMICApplicationRegistry
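To illustrate why the error named administration/1/0: the java launcher treats the first argument that is neither an option nor an option's value as the class to run, and the error message shows class names with slashes instead of dots. The Python sketch below only models that rule. The classpath value is a placeholder (the real command is not reproduced in this thread), the helper names are made up, and only the classpath options are modeled.

```python
# Launcher options that consume the next token as their value.
# Simplification: only the classpath options are modeled here.
OPTIONS_WITH_VALUE = {"-cp", "-classpath"}


def main_class_token(args):
    """Return the token the java launcher would take as the main class.

    args is the argument list that follows the word "java".
    """
    i = 0
    while i < len(args):
        token = args[i]
        if token in OPTIONS_WITH_VALUE:
            i += 2          # skip the option and its value
        elif token.startswith("-"):
            i += 1          # skip a flag that takes no value
        else:
            return token    # first non-option token is treated as the class
    return None


def as_reported(class_name):
    """Class names appear with '/' in place of '.' in NoClassDefFoundError."""
    return class_name.replace(".", "/")


# Placeholder classpath; the real value is not given in this thread.
classpath = "PLACEHOLDER_CLASSPATH"

missing_class = ["-cp", classpath, "administration.1.0"]
with_class = ["-cp", classpath,
              "com.cisco.nm.cmf.registry.CMICApplicationRegistry",
              "administration.1.0"]

print(as_reported(main_class_token(missing_class)))
print(main_class_token(with_class))
```

With the class name in place, the file name becomes an ordinary argument to that class rather than something the launcher tries to load.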

helexis · ‎10-15-2009

Thanks!

helexis · ‎10-25-2009

I don't think we have resolved this issue.

It still occurs from time to time.

I am still not 100% sure about the dos and don'ts of hostname vs. FQDN in a multiserver setup.

I know you recommended using the shortname, but there are times when that doesn't make sense.

For example, let's say I have 2 servers (ciscoworks-cm.domain.net and ciscoworks-rme.domain.net).

I was advised to generate the certificates using the hostname only (ciscoworks-cm and ciscoworks-rme).

In the Homepage settings then I would have to put the short hostnames as well.

When I register applications from a remote server it asks for a server name and display name both of which I assume should be the short hostname.

Then when the apps are registered you see a hostname column for each app that is registered and it apparently reads the FQDN from somewhere and that is what is shown as the hostname for the remote apps. (I'd imagine this could be the md.properties file.)

You also have to provide the server name when you set up SSO and the DCR Master/Slave settings, both of which rely on the imported certificates and therefore must match the short hostname.

Somewhere though the server is told to use the FQDN for URLs and this throws things off when you have your certificate generated with the short hostname.

Out of the box several weeks ago this was the situation, and it was apparently what caused the "internal error in communication channel" issues.

I then began a mission to get everything to reference the FQDN since I couldn't successfully get everything to use the shortname.

Plus it just doesn't seem acceptable to expect users to address the site by its short hostname only. This requires tedious finagling with each user's HOSTS file or DNS settings to make certain that it won't append an alternate domain suffix.

I have chased this goose entirely too long.

I even upgraded to LMS 3.2 hoping that would work out some of these kinks but the "internal communication channel" error still seems to rear its head as it pleases.

Joe Clarke · ‎10-25-2009

The display name of the registered application can be anything you want. The hostname should be the short hostname.

I did some testing with FQDN vs. short hostname internally on my LMS 3.2 servers, and found things to work generally pretty well when using FQDN except when it comes to application/device mapping (PIDM) and Device Center. I have two machines registered with each other by FQDN, and so far I have not had any communications problem with DCR (though Device Center links use the short hostname of the peer server).

On top of that, the logs I have seen thus far don't point to any real root cause of these issues. There also doesn't appear to be any debugging which can be enabled to give more information. At the very least some code changes would be required to get more clues as to what is going on.

For this reason, you will need to work with TAC so patches can be provided to try and isolate what is going on when this error occurs.

helexis · ‎10-27-2009

Just to be certain that the issue wasn't due to a server domain suffix change after installation, I fully formatted and reinstalled LMS 3.2.

I have 2 licensed servers still, but have added a third HUM trial to the mix. So 2 slaves.

I am determined to get this working with FQDN, which may prove to be more trouble than it is worth. :)

All configurable references to the remote servers use the FQDN. The certs were generated with the FQDN and the Homepage Settings reflect the FQDN.

Still getting "internal error in communication channel." It now seems more likely that the slaves initially attach to the master but drop off soon after.

Browsing the attached logs brought me to this theory.

I also noticed that the DCR mode settings report:

Current DCR Settings

Mode: Slave

Master Hostname: [masterhostname.domain.com]

Port: 443

Master Certificate: Valid

Master Server is unreachable.

So I changed the DCR settings to call the short hostname only and that wasn't sufficient.

I still get "Certificate HostName [masterhostname.domain.com] and the URL Host Name [masterhostname] do not match

Before Calling the astandalone to slave

--------------------

I obviously have to generate the certs with the shortname until this issue is addressed further...

Again, the problem I have with having the cert use the shortname is the browser complaints of the URLs not matching when our end users access the server by the FQDN. It doesn't seem plausible to expect users to open their browser and go to https://masterhostname instead of https://masterhostname.domain.com.
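The mismatch message above boils down to comparing the name in the certificate with the host part of the URL used to reach the server. Here is a rough Python sketch of that comparison. It assumes a plain case-insensitive exact match, which may not be exactly what LMS or a browser does (no wildcard or alternate-name handling), and the function names are made up.

```python
from urllib.parse import urlsplit


def url_host(url):
    """Extract the host part of a URL (urlsplit returns it lowercased)."""
    return urlsplit(url).hostname or ""


def cert_matches_url(cert_hostname, url):
    """True when the certificate hostname equals the URL host exactly.

    Assumption: exact, case-insensitive comparison; no wildcard handling.
    """
    return cert_hostname.lower() == url_host(url)


# Certificate generated with the FQDN, server addressed by short name
print(cert_matches_url("masterhostname.domain.com", "https://masterhostname"))
# Certificate generated with the FQDN, server addressed by FQDN
print(cert_matches_url("masterhostname.domain.com", "https://masterhostname.domain.com"))
```

Under that assumption, one certificate name can only satisfy one of the two ways of addressing the server, which is the bind described in this post.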

Any thoughts or additions?

Joe Clarke · ‎10-27-2009

There is not enough information in these logs to determine why your DCRs are unable to sync up.

I recommend you open a TAC service request, and keep it open until this is working. I know it works as I'm currently running in such a configuration. I can only guess that something still has not been done right (or hostname resolution is not working correctly for FQDN).
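One way to rule out the name resolution part is to resolve both the short name and the FQDN on each server and confirm they return the same address. A minimal Python sketch of that check follows. The helper names are made up, the hostnames are the placeholders used in this thread, and the stand-in resolver exists only so the example runs without DNS access; on a real server you would leave the default resolver in place.

```python
import socket


def resolve(name, resolver=socket.gethostbyname):
    """Return the IPv4 address for name, or None if it does not resolve."""
    try:
        return resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def compare_names(short_name, fqdn, resolver=socket.gethostbyname):
    """Resolve both names and report whether they point to the same address."""
    short_ip = resolve(short_name, resolver)
    fqdn_ip = resolve(fqdn, resolver)
    return {
        "short": short_ip,
        "fqdn": fqdn_ip,
        "consistent": short_ip is not None and short_ip == fqdn_ip,
    }


# Stand-in resolver with a made-up address, so the sketch runs offline.
fake_dns = {"masterhostname.domain.com": "192.0.2.10"}


def fake_resolver(name):
    if name not in fake_dns:
        raise OSError("name not found")
    return fake_dns[name]


print(compare_names("masterhostname", "masterhostname.domain.com", fake_resolver))
```

A result where only one of the two names resolves, or where they resolve to different addresses, would be worth mentioning in the service request.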

As to your last point, Device Center will still use short hostnames even if everything else is using FQDN, so your users will still get prompted to accept the cert hostname mismatch (and authenticate again if using SSO).

helexis · ‎10-27-2009

Ok and good point about Device Center's URLs.

Thanks for all your help, anyhow. I have really appreciated it.

helexis · ‎10-28-2009

Update: I have opened a TAC case for this "internal error in communication channel."

I advised my TAC engineer that I had been working with you for a couple of weeks now on this issue. I hope you might find the time to assist. :)

helexis · ‎10-30-2009

Still seeing the internal error in communication channel error...ugh! ;)

I was hoping you could explain the purpose of the Home Page Server Name in the Home Page Settings under CS > Server > Home Page Admin.

In a multiserver environment, should this be your appointed web server and thus match on all servers, or just each server's own local hostname or FQDN?

What does the Provider Group Name do? I assume it is one and the same, but does it affect anything I am seeing?

Right now all of my certs are configured with the shortname as advised; however, this provider group name or home page server name is the FQDN of each server itself. Should I modify that?

Today when I received the internal communication error I was trying to update the device credentials on a few devices none of which were successful. All returned the error.

I then went to the browser address and modified it to just the shortname. Still saw the error.

I then decided to clear my browser cookies, history, and temp files and tried again. Same error with the FQDN url, but when I tried the shortname I was able to modify the credentials of every device in the db. ;)

Any insight here? Does this help? LOL

Joe Clarke · ‎10-30-2009

The homepage name can be anything you want. You could call it "Cowboy Server" if you wanted. It's just a logical name to present to users (though there are some internal uses as well). There used to be some issues with making this something other than the hostname, but those should be fixed now.

No, this still doesn't explain why this error is occurring. Given the transient nature, and the fact that I cannot reproduce on two clusters, perhaps there is something wrong with the server itself (e.g. bad memory). Or, maybe there is some conflict with something else installed on this server. What services are currently running on the master?

helexis · ‎10-30-2009

I assume you mean just LMS services. So I have attached the pdshow.

The servers are all brand new servers with no other "obvious" apps on them. But you never know, I know.

I am starting to wonder if I ever see the error when accessing the DCR directly from the server. I will test that some.

I guess I didn't mention that I usually don't access the server directly when making changes to the devices in DCR. In fact that may be why I couldn't reproduce the exact error while I was with TAC. Hmmm...I'll begin testing that immediately. We definitely have some tight firewalls here that we could be battling with...

However, it is important to note that there are no firewalls between the master and its slaves. They are all in the same subnet. So the "Master Server is unreachable" message should be a separate matter.

helexis · ‎10-30-2009

Oops attachment!

Joe Clarke · ‎10-30-2009

No, I meant non-LMS services. LMS will not conflict with itself. But other services could be hindering it.

The client shouldn't have a bearing on how DCR works. All of the communication happens either internally or between servers.