cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
2244
Views
0
Helpful
4
Replies

Jabber issue with IM & P Server in Failover

Ali Amir
Level 1
Level 1

Hello!

 
I have the below environment:

 
CUCM & IM/Presence 10.5.2
Jabber 11.5.2

IM/Presence Cluster with 1 Publisher and 1 Subscriber in the Site same site.
 

In CUCM we have "Presence Redundancy Group" which contains both IM/Presence nodes the "Enable High Availability" checkbox checked. The status of High Availability could be confirmed on Presence Topology Details.

Regarding load balancing, users are divided to groups and assigned to host03 and host04. Every user on every server could login without problem to the related assigned server.

When the failover option is active on a server, the server gets idle state and the other server go to Running in Backup Mode. The status of servers on topology details could confirm this.

But users on idle server could not get the service from the other server and stay in offline mode. It seems as Jabber sends authentication request to the second server "host04.domain.test.biz", but this request could process and saved on the second server for failover time.


In Jabber Logs I could find below logs:

################
2016-09-18 21:15:30,855 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(654)] [csf.jwcpp] [CLoginContext::SaveHASoapServers] - @LoginMgr: #0, CLoginContext::SaveHASoapServers ha, login, cache HA Soap servers:host03.domain.test.biz;host04.domain.test.biz for key:host03.domain.test.biz
2016-09-18 21:15:30,855 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(894)] [csf.jwcpp] [CLoginContext::InitCupHAHosts] - @LoginMgr: #0, CLoginContext::InitCupHAHosts ha, soap-servers:host03.domain.test.biz
2016-09-18 21:15:30,855 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(894)] [csf.jwcpp] [CLoginContext::InitCupHAHosts] - @LoginMgr: #0, CLoginContext::InitCupHAHosts ha, soap-servers:host04.domain.test.biz
2016-09-18 21:15:30,855 INFO  [0x00001234] [rx\jwcpp\LoginMgr\LoginCUPState.cpp(124)] [csf.jwcpp] [CLoginCup::connectOnRetrievedServerInfo] - @LoginMgr: #0, CLoginCup::connectOnRetrievedServerInfo login, cup:host03.domain.test.biz
...
2016-09-18 21:33:24,110 ERROR [0x00001514] [upSoapClient\CupSoapClientImpl.cpp(1413)] [csf.jwcpp] [CupSoapClientImpl::GetAllConfig] - @CupSoapCli: soap, get-all-config:12
2016-09-18 21:33:24,110 ERROR [0x00001514] [upSoapClient\CupSoapClientImpl.cpp(1421)] [csf.jwcpp] [CupSoapClientImpl::GetAllConfig] - @CupSoapCli: soap, soap::get_all_config fails:12, reason: SOAP 1.2 fault: SOAP-ENV:Sender[no subcode]
"Invalid session-key element in SOAP request header."
Detail: [no detail]

2016-09-18 21:33:24,110 INFO  [0x00001234] [rx\jwcpp\LoginMgr\LoginCUPState.cpp(508)] [csf.jwcpp] [CLoginCup::OnGetAllConfig] - @LoginMgr: #0, CLoginCup::OnGetAllConfig err-code: 12, err-string: SOAP 1.2 fault: SOAP-ENV:Sender[no subcode]
"Invalid session-key element in SOAP request header."
Detail: [no detail]
...
2016-09-18 21:33:24,110 INFO  [0x00001234] [rx\jwcpp\LoginMgr\LoginCUPState.cpp(217)] [csf.jwcpp] [CLoginCup::HandleCupHA] - @LoginMgr: #0, CLoginCup::HandleCupHA ha, HandleCupHA
2016-09-18 21:33:24,110 WARN  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(905)] [csf.jwcpp] [CLoginContext::GetHASoapServer] - @LoginMgr: ha, invalid HA soap server index:2
2016-09-18 21:33:24,110 INFO  [0x00001234] [rx\jwcpp\LoginMgr\LoginCUPState.cpp(235)] [csf.jwcpp] [CLoginCup::HandleDnsSrv] - @LoginMgr: #0, CLoginCup::HandleDnsSrv dns, HandleDnsSrv
2016-09-18 21:33:24,110 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(267)] [csf.jwcpp] [CLoginContext::ChangeState] - @LoginMgr: #0, CLoginContext::ChangeState now:0 auto:1
2016-09-18 21:33:24,110 INFO  [0x00001234] [rx\jwcpp\LoginMgr\LoginStopState.cpp(43)] [csf.jwcpp] [CLoginStop::OnStateChanged] - @LoginMgr: #0, CLoginStop::OnStateChanged CLoginStop::OnStateChanged11
2016-09-18 21:33:24,110 INFO  [0x00001234] [rwerx\jwcpp\xmppsdk\XmppClient.cpp(2028)] [csf.jwcpp] [CXmppClient::FinalCleanData] - @XmppSDK: CXmppClient::FinalCleanData, bReconnect:1
2016-09-18 21:33:24,110 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(627)] [csf.jwcpp] [CLoginContext::NextReconnectAttempt] - @LoginMgr: #0, CLoginContext::NextReconnectAttempt conn, next attempt. count:1, remain:4294967111
2016-09-18 21:33:24,110 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(574)] [csf.jwcpp] [CLoginContext::GetReconnectInterval] - @LoginMgr: #0, CLoginContext::GetReconnectInterval conn, choose a random interval:181 between 20 and 205.
2016-09-18 21:33:24,110 INFO  [0x00001234] [erx\jwcpp\LoginMgr\LoginContext.cpp(296)] [csf.jwcpp] [CLoginContext::StartReconnectTimer] - @LoginMgr: #0, CLoginContext::StartReconnectTimer conn, timer, seconds:181
...
################


In authentication logs from the backup server I could see:


################
21:30:23.173 | error| TokenAuthUtils::executeAuthQuery(): error: 0 Rows Found
21:30:23.173 | error| TokenAuthUtils::authencateUsingOTP(): ExecuteAuthQuery() return authentication failure for userid: [test.user]
21:30:23.183 | error| SXComponent::authenticateVTGToken(): Authentication error: AUTH_SYSTEM_ERROR for userid: [test.user@test.com]
21:30:23.183 | error| SXComponent::_processRequest(): Authentication failed due to internal issue
21:30:23.184 | error| SXComponent::_sendAuthFailure(): Auth component sending Failure
################


In database Logs from backup server I find this:


################
21:33:24.835 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:24.835 |SqlParserBase::processSqlReq SELECT
SELECT * FROM clientsessions WHERE sessionid='6cad03a2-4bdf-bb43-ab9a-f74d4861bbbb';
21:33:24.835 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:24.835 |{
    sqlstmt(SELECT * FROM clientsessions WHERE sessionid='6cad03a2-4bdf-bb43-ab9a-f74d4861bbbb';)
    m_count(0), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>0</2><4>success</4><3></3><0></0></old></msg>
)
}

21:31:40.350 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:31:40.349 |SqlParserBase::processSqlReq SELECT
SELECT * FROM clientsessions WHERE sessionid='1990a498-80a9-3008-8767-a096328af1b7';
21:31:40.350 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:31:40.349 |{
    sqlstmt(SELECT * FROM clientsessions WHERE sessionid='1990a498-80a9-3008-8767-a096328af1b7';)
    m_count(0), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>0</2><4>success</4><3></3><0></0></old></msg>
)
}

21:33:39.194 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:39.192 |SqlParserBase::processSqlReq SELECT
SELECT * FROM ONETIMEPASSWORD WHERE userid='user1';
21:33:39.194 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:39.192 |{
    sqlstmt(SELECT * FROM ONETIMEPASSWORD WHERE userid='user1';)
    m_count(0), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>0</2><4>success</4><3></3><0></0></old></msg>
)
}

21:33:39.283 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:39.283 |SqlParserBase::processSqlReq SELECT
SELECT * FROM clientsessions WHERE sessionid='baf481d4-60e0-331a-c188-b3acba3e069e';
21:33:39.283 HDR|09/18/2016 ttlogin,StandAloneCluster,host04.domain.test.biz,Special,10.5.2.23900-3
21:33:39.283 |{
    sqlstmt(SELECT * FROM clientsessions WHERE sessionid='baf481d4-60e0-331a-c188-b3acba3e069e';)
    m_count(0), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>0</2><4>success</4><3></3><0></0></old></msg>
)
}
################


On the Idle server I could see for authentication these database query:


################
21:20:13.354 HDR|09/18/2016 ttlogin,StandAloneCluster,host03.domain.test.biz,Special,10.5.2.23900-3
21:20:13.337 |SqlParserBase::processSqlReq SELECT
SELECT * FROM onetimepassword WHERE userid='user1';
21:20:13.354 HDR|09/18/2016 ttlogin,StandAloneCluster,host03.domain.test.biz,Special,10.5.2.23900-3
21:20:13.337 |{
    sqlstmt(SELECT * FROM onetimepassword WHERE userid='user1';)
    m_count(0), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>0</2><4>success</4><3></3><0></0></old></msg>
)
}

21:20:13.354 HDR|09/18/2016 ttlogin,StandAloneCluster,host03.domain.test.biz,Special,10.5.2.23900-3
21:20:13.337 |SqlParserBase::processSqlReq INSERT
INSERT INTO oneTimePassword (userid,password,tmstamp) VALUES('user1', '54640222', '1476359383733');
21:20:13.354 HDR|09/18/2016 ttlogin,StandAloneCluster,host03.domain.test.biz,Special,10.5.2.23900-3
21:20:13.337 |{
    sqlstmt(INSERT INTO oneTimePassword (userid,password,tmstamp) VALUES('user1', '54640222', '1476359383733');)
    m_count(1), errmsg(), rc(success)
    TYPESQLRSP(<msg><type>TTLOGIN</type><table>typesqlrsp</table><action>Q</action><time>0</time><old><1>0</1><5></5><6></6><2>1</2><4>success</4><3></3><0></0></old></msg>
)
}


Do you know how does Jabber authentication work(in failover solution)? Do I need to consider of user assignment to servers?

4 Replies 4

davethehedgehog
Level 5
Level 5

I have a very similar issue. Did you get anything back? I'm on the same version, so I'm thinking a bug maybe?

We could manage this issue by changing the Upper Retry Limit.

Actually Jabber clients can login to the backup server after a random time. The random time will be defined between Lower Retry Limit and Upper Retry Limit.

These values could help to reduce the load of backup server and should be defined base of total count of users, CPU/Memory resources and deployment type of IM and Presence cluster.

 

You can configure these parameters by choosing:

Cisco Unified CM IM and Presence Administration > System > Service Parameters >  Cisco Server Recovery Manager

 

More details on:

http://www.cisco.com/c/en/us/td/docs/voice_ip_comm/cucm/im_presence/configAdminGuide/10_5_1/CUP0_BK_CE43108E_00_config-admin-guide-imp-105/CUP0_BK_CE43108E_00_config-admin-guide-imp-105_appendix_010101.html

Tapan Dutt
Cisco Employee
Cisco Employee

This looks like to be a replication issue between IM and P node. Try reload of the nodes starting with the Publisher and then the Subscriber, Once done type in the command on PUB and SUB and the output should show 0 on pub and sub

run pe sql ttlogin select count(*) from typesysreplication

Tapan

I have a non-zero output from that command:

admin:run pe sql ttlogin select count(*) from typesysreplication
sqlRv(t) sqlstmt(select count(*) from typesysreplication;)
***result set start***
count(2577), success(t)
***result set end***

Can you suggest what I should do about it?