UCS Blade B200 M1 Discovery Issue - Bug Found with 1.4(2b)

MaherAlAsfar
Level 1

Hello

I'm testing some UCS hardware that consists of two 6120XP Fabric Interconnects and two chassis. Each chassis has two blades: one B250 M1 in slots 1 and 2, and one B200 M1 in slot 3.

At one point I was running the B200 M1 blades in each chassis with no issues, until the time came to test XenServer. After I was done with the testing, I unassociated the service profile, I believe before powering off the blade.

From that point, strangely enough, the service profile was unassociated but I could not power off either of the B200 M1 blades. I tried to reset them and even had to reseat both blades. After that, these two servers were never available: always on, unable to be powered off, and showing the following status when a discovery happens.

The Overall Status is "discovery-failed" and the Current Task under Status Details shows:

"Configuring primary fabric interconnect access to server 2/3 (FSM-STAGE:sam:dme:ComputeBladeDiscover:ConfigFeLocal)"

From the FSM I get the following :

Remote Invocation Result: end-point-unavailable

Remote Invocation Error Code : 1002

Remote Invocation Description :  Remote eth port (2/1/2) is not created

Any help is appreciated. I spent a lot of time trying to find something on the Cisco support site or the forums with no luck.

Thanks

Maher


14 Replies

abbharga
Level 4

Hi Maher,

A few points here:

1) The blades are powered off after the service profile is removed from the server. There is no explicit action required to shut them down.

2) As for the discovery issue, can you:

i) Verify the firmware version you are running on the cluster.

ii) Try moving this blade to a different slot (other than 3) and then check the discovery again.

Did you try opening a TAC case for this?

./Abhinav

Thanks Abhinav

When I removed the service profile, it clearly stated that no service profile was associated, but if you opened the KVM console, XenServer was still running on the server until I reset it; that's when the discovery issue appeared.

Firmware is 1.4(2b)

The blade was originally in slot 4 and I moved it to slot 3 yesterday. I even reseated the daughter cards on both blades.

I'm seeing the exact same behavior on both B200 M1 blades.

Thanks

Maher

Maher,

Can you try decommissioning the blade (give it 10 minutes to complete the task) and then re-commissioning it?

I assume your XenServer is installed to a local disk, which might explain why it was still booting despite the discovery issue.
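If it is easier to drive from the CLI, the decommission step looks roughly like this (a sketch from memory, with chassis 2 / slot 3 and the UCS-A prompt purely illustrative; please verify the syntax against the UCSM CLI guide for your release):

UCS-A# scope chassis 2

UCS-A /chassis # decommission server 3

UCS-A /chassis* # commit-buffer

Once the blade appears as decommissioned, re-commissioning it from the Equipment > Decommissioned tab in the GUI should send it back through discovery.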

Thanks,

Robert

Mark Buckley
Level 1
Level 1

Hi Maher,

Did you find a solution to this problem? We are experiencing exactly the same problem.

We upgraded to firmware version 1.4(2b), unassociated a profile from a B200 blade, and it is now in an unusable state. Even though the profile was showing as unassociated, we could connect to the blade using the KVM and still see ESXi running. The profile is now showing as not associated, but the blade does not have a shutdown option (it is greyed out) and the FSM is failing at about 29% of the way through the disassociate process. We currently have a TAC case open but are not getting anywhere very quickly.

Hope you can help

Thanks

Mark

Thanks Robert

I did try decommissioning both the B200 M1 blades a couple of times, but I never waited 10 minutes.

The other thing is that once they are decommissioned, it always shows me a slot mismatch issue and asks me to click to accept this server in this slot. It's weird because I didn't move the server, and I did acknowledge the slot change when I moved it from slot 4 to slot 3 initially.

Even after I waited about 30 minutes, there was still no change.
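For the record, when I accept the slot I am using the GUI prompt; as far as I understand, the CLI equivalent is roughly the following (chassis 2 / slot 3 in my case, prompt illustrative, and the exact syntax may differ on your release):

UCS-A# acknowledge slot 2/3

UCS-A* # commit-buffer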

The servers are booted from SAN to enable service profile mobility, so I can attach the same profile to other similar blades, which I tested before on these exact blades.

But you kind of gave me an idea now and made me curious whether the server is still trying to boot XenServer from the SAN.

I'm not sure it is even possible for the server to boot with no service profile attached to it.

In my case I don't even see XenServer booting in the KVM console to begin with. On top of all that, I also unmasked the boot LUNs from these two blades, and we still see the error during discovery.

Hi Mark,

No, we haven't found a solution yet. We are still experiencing the issue where the blade is stuck at the discovery operation, and in our case we don't even see an attempt to boot XenServer.

I just opened a TAC case for the same issue. It is caused by a bug that is fixed in 1.4(3i): "Background processes no longer cause clogging of the MTS queue" (CSCtq03411).

The temporary workaround until you can upgrade is to SSH into the interconnect, connect to local-mgmt, and issue the commands pmon stop and then pmon start. This will cause a service interruption. No configuration changes can be made until this is done.
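For anyone who wants the exact keystrokes, the session looks roughly like this (run on the primary fabric interconnect; the UCS-A prompt is illustrative, and show pmon state is just one way to confirm the services came back up):

UCS-A# connect local-mgmt

UCS-A(local-mgmt)# pmon stop

UCS-A(local-mgmt)# pmon start

UCS-A(local-mgmt)# show pmon state

Expect the management plane to be briefly unavailable while the processes restart.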

Hi Crystal,

That workaround has fixed the problem, thanks a lot! We just need to upgrade again now.

Thanks

Mark

Thanks Crystal,

I was just about to post how I managed to fix it before I read your post. What I did to make it work was reboot the primary interconnect node; after that, the discovery process for the blade completed successfully.
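For reference, the reboot I did was roughly the following from the primary FI's CLI (prompt illustrative; the subordinate FI takes over while the primary reboots, so check the cluster state first):

UCS-A# connect local-mgmt

UCS-A(local-mgmt)# show cluster state

UCS-A(local-mgmt)# reboot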

But now that you have posted the temporary workaround, I don't have to go that far.

We will be downloading the new firmware 1.4(3i) to avoid any future issues caused by the bug in 1.4(2b).

I marked this thread as Answered.

Thanks

Maher

Hi Maher,

Well, does rebooting the active FI cause a service interruption in your setup? I'm asking because I received a similar reply from TAC (I found the bug right after we opened a case for our customer), which states that restarting portAG would fix the issue temporarily, until upgrading to a newer version.

I'm a bit worried, though, as I've seen Crystal Diaz's post showing a solution similar to the TAC answer, which is said to cause an interruption.

I'd say losing the primary FI _should_not_ cause any interruption as long as the system is fully redundant (hardware- and link-wise), but a verification would be nice. Also, checking the current state of the MTS buffers individually on each FI, I see that both are affected in the same way.
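(For what it's worth, I am checking the buffers with something along these lines on each FI; connect nxos and the MTS summary command are standard NX-OS, though the output format can vary by release:

UCS-A# connect nxos

UCS-A(nxos)# show system internal mts buffers summary

Repeat on the subordinate FI to compare.)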

Hi Emre,

I would advise you not to use the pmon stop/start workaround for this issue; we found certain problems with that workaround and have even updated the release notes with a different workaround of restarting portAG using the dplug (with TAC assistance).

I will help you with the workaround tomorrow, since I am the owner of your case!

./Abhinav

Hi Abhinav,

We'll surely go with your proposed solution before the upgrade, but I was just trying to find out, before talking with you again, whether that procedure would cause a UCS-wide service interruption.

My customer is using this system for testing, but even that seems to need change management and scheduling to be able to withstand interruptions.

Anyway, I guess we'll be talking more tomorrow.

(Thanks for the interest!)

98712454098g
Level 1

connect local-mgmt

pmon stop

#and then

pmon start

helped me resolve the following issue too:

Remote Invocation Error Code : 1003

UCS 6248 running 2.0(3b)

So thanks to Crystal Diaz!

Amit23
Level 4

I am also facing the same issue.

I have FI version 2.2(1c), and I am just now adding a second chassis with five new blades.

It is showing discovery failed for all the servers.

Warm Regards,
Amit Sahrma
