Wednesday, December 10, 2014

SQL Agent Connection Failure after Cluster Reconfiguration

We had some unfortunate event befall one of our SQL clusters lately, and it’s led to some interesting fallout.

First, the TempDB LUN was unpresented accidentally during some SAN maintenance.  As I’m not a SAN admin, I’m not yet sure what all happened with that, but of course SQL Server crashed immediately.  It also appears that the other DB LUNs were briefly unavailable to the server, which is where things get interesting.

We had a new TempDB LUN presented in reasonably short order, and SQL Server was running again without any apparent problems.

Then patch week came.  When the active node rebooted after patching, SQL Server would not start on the secondary node.  Which meant SQL Server was offline pending manual intervention.

It took awhile to figure out why SQL Server wasn’t starting:  all of the role resources were online, and everything looked fine.  These errors in the event log gave us some clues:

[sqsrvres] CheckAndChangeVirtualServerName: Could not obtain the Virtual Server Name (status 138f)
[sqsrvres] GetVirtualServerName, Unable to obtain next Resource from Cluster Resource Enumerator. Error: 0.


Those are both event ID 19019.  So I wondered:  what might have changed about the cluster resources?

The answer?  Dependencies.  When the LUNs were unpresented, all of the dependencies for those also disappeared.  Add those dependencies back, such that the SQL Server service resource depends on the appropriate resources (drives, for instance):

image

You’ll also want to make sure the appropriate network names are set as dependencies.