Sql Server Faq

January 15, 2009

SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE

Filed under: Uncategorized — @ 7:26 am

We have observed a number of issue now where in clustered instance of sql server 2005 fails to come online after performing changing on shared disk on a cluster.

As mentioned in my previous posts if the sql server fails to come online we need to start troubleshooting by looking at the sql server error log. So we checked the sql server error log and  observed the following error

2009-01-11 21:10:06.96 Server      Error: 26054, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server      Could not find any IP address that this SQL Server instance depends upon.  Make sure that the cluster service is running, that the dependency relationship between SQL Server and Network Name resources is correct, and that the IP addresses on which this SQL Server instance depends are available.  Error code: 0×103.
2009-01-11 21:10:06.96 Server      Error: 17182, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server      TDSSNIClient initialization failed with error 0×103, status code 0xa.
2009-01-11 21:10:06.96 Server      Error: 17182, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server      TDSSNIClient initialization failed with error 0×103, status code 0×1.
2009-01-11 21:10:06.96 Server      Error: 17826, Severity: 18, State: 3.
2009-01-11 21:10:06.96 Server      Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
2009-01-11 21:10:06.96 Server      Error: 17120, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server      SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

As seen from the above Error 26054 message which says that cluster service is unable to find the dependency between the sql server instance and the sql network resource name.

We checked the Cluster Administrator for the sql server resource and found that the dependency exists between the sql server resource and sql network name but still we receive the error.

Cause
=====

We checked the Registry Location HKLM\Cluster\Resources\<SQL Server GUID>\Parameters

The values InstanceName and VirtualServerName were missing from the keys below:

HKLM\Cluster\<SQL Server GUID>\Parameters
HKLM\Cluster\<SQL Agent GUID>\Parameters

The above registry keys are used by the cluster administrator to connect and start the sql server service. So the Cluster Administrator was not able to connect and start the sql server  resource and hence we receive the above error

Resolution
========

IMPORTANT : This resolution contains information about modifying the registry. Before you modify the registry, make sure to back it up and make sure that you understand how to restore the registry if a problem occurs. For information about how to back up, restore, and edit the registry, click the following article number to view the article in the Microsoft Knowledge Base:

256986 Description of the Microsoft Windows Registry

We added the following registry values in the Parameters registry key:

For a named instance of SQL Server:

• InstanceName
Value Name: InstanceName
Value Type: REG_SZ
Value Data: <instance name of the named instance>

• VirtualServerName
Value Name: VirtualServerName
Value Type: REG_SZ
Value Data: <Virtual server name of the instance>

Once we added the registry keys we were able to bring the sql server online on Node1. 

We failed the instance over to Node2 we checked the above registry keys and found they were missing, so we added missing keys and rebooted both the servers of the clusters.

After rebooting the server the sql server resources came online fine without any errors

 

Note:

This post should not be treated as the Microsoft’s Recommendation or Resolution to the problem, it is only a workaround which worked in our environment and so we would like to share it.

 

Share/Save/Bookmark

SQL SERVER 2000 Cluster FAILS TO COME ONLINE ON OTHER NODE

Filed under: Uncategorized — @ 7:26 am

 

Recently I encountered a case wherein clustered instance of sql server 2000 fails to come online on other node of a 2 Node Cluster. 

Usually for such cases we need to start troubleshooting by checking the Error log of the sql server from the server where it fails to come online. The Sql Server Error log can give you the cause if the sql server service itself is not able to start.

In our case we observed that the sql server error log was clean and the sql server was able to start without any errors.

So it appears that cluster service was not able to communicate with the sql server and made an assumption that sql server service could not be started.  Cluster service tries to connect to the sql server until it gets timed out after which cluster service marks the sql resources as failed.

The next obvious step in this cases would be to check the Application event log. We checked the Application Event log and found the following error reported.

Event Type:    Error
Event Source:    MSSQL$TEST
Event Category:    (3)
Event ID:    17052
Date:        1/11/2009
Time:        1:26:01 AM
User:        N/A
Computer:    Node02
Description:
[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 11; message = [Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]SQL Server does not exist or access denied.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 9c 42 00 40 01 00 00 00   ?B.@….
0008: 0a 00 00 00 4b 00 44 00   ….K.D.
0010: 43 00 44 00 53 00 51 00   C.D.S.Q.
0018: 4c 00 31 00 34 00 00 00   L.1.4…
0020: 00 00 00 00               ….   

Event Type:    Error
Event Source:    MSSQL$TEST
Event Category:    (3)
Event ID:    17052
Date:        1/11/2009
Time:        1:26:01 AM
User:        N/A
Computer:    Node02
Description:
[sqsrvres] checkODBCConnectError: sqlstate = 01000; native error = 274d; message = [Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]ConnectionOpen (Connect()).

From the above Error we observe that Either Sql server is listening on wrong port or cluster service is trying to connect to sql on wrong port.

We checked the following location of the registry and found that sql server was listening on port 2068.

HKLM/SOFTWARE/MICROSOFT/MICROSOFT SQL SERVER/TEST/MSSQLSERVER/SuperSocketNetLib/TCP    

We also confirmed by checking the sql server error log which states as following

2009-01-11 01:28:44.07 server    SQL server listening on 10.10.0.194: 2068

We checked the port on the working Node1 and it was the same i.e 2068

So it appears that cluster was trying to connect to sql server to wrong port.

We checked the Client Network Config Utility and found that there was TCP/IP Alias created  for the sql server instance SQLCLUSTER\TEST on port 1433 while sql was listening on port 2068.

Cause
=====

Whenever we have alias created to connect to the sql server instance using tcp/ip or np protocol on clustered instance of sql server. Cluster service will use only the alias to connect to the sql instance and so if the alias has wrong definition, the cluster service will fail to connect to sql server instance and thereby will fail to bring the sql resource online.

Resolution
=======

So in our case we checked on the box “Dynamically determine port” in the client network utility and removed the static port address of 1433 in the TCP/IP alias and the sql resource came online fine on Node 2 without any errors.

 

 

 

Share/Save/Bookmark

December 17, 2008

SQL Server 2005 Setup Failing to Install on Cluster with Unexpected Failure while Performing the SCC checks

 

We have observed a number of cases where the installation of the sql server 2005 instance on the cluster fails with the following unexpected Failure Error Message while performing SCC checks.

ERROR MESSAGE:

====================

There is unexpected failure.

MSGID 50000

 

In order to troubleshoot the issue we followed the following steps

1)       We checked the Setup Bootstraps Logs which are located in c:\Program files\Microsoft sql server\90\Setup BootStrap\LOGS\Files.

2)       We checked the Core(Local).log and found the following error was reported in the error log

 

<Func Name=’PerformDetections’>

1

Loaded DLL:C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\sqlsval.dll Version:2005.90.3042.0

Failed to enumerate disks.

HRESULT: 0×80070015

The device is not ready.

Error: Action “InvokeSqlSetupDllAction” threw an exception during execution.  Error information reported during run:

  Source File Name: datastore\clusterinfocollector.cpp

Compiler Timestamp: Sat Oct  7 09:43:48 2006

     Function Name: ClusterInfoCollector::collectClusterGroups

Source Line Number: 391

———————————————————-

Failed to determine disk information

              Source File Name: datastore\clusterinfocollector.cpp

            Compiler Timestamp: Sat Oct  7 09:43:48 2006

                 Function Name: ClusterInfoCollector::collectClusterGroups

            Source Line Number: 390

            ———————————————————-

            Enum*Disks() returned : 2147942421

WinException : -2147024875

        Error Code: -2147024875

Windows Error Text: The device is not ready.

 

  Source File Name: datastore\clusterinfocollector.cpp

Compiler Timestamp: Sat Oct  7 09:43:48 2006

     Function Name: ClusterInfoCollector::collectClusterGroups

Source Line Number: 390

 

Error: Failed to add file :”C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\Files\SQLSetup0020_WPSQL1_.NET Framework 2.0.log” to cab file : “C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\SqlSetup0020.cab” Error Code : 2

Error: Failed to add file :”C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\Files\SQLSetup0020_WPSQL1_SCC.log” to cab file : “C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\SqlSetup0020.cab” Error Code : 32

Running: UploadDrWatsonLogAction at: 2008/8/30 10:16:17

Message pump returning: 2147942421

 

 

3)       From the above error log it was clear that setup.exe was not able to query the information about the Physical Disks.

 

Failed to enumerate disks.

HRESULT: 0×80070015

The device is not ready.

Error: Action “InvokeSqlSetupDllAction” threw an exception during execution.  Error information reported during run:

  Source File Name: datastore\clusterinfocollector.cpp

Compiler Timestamp: Sat Oct  7 09:43:48 2006

     Function Name: ClusterInfoCollector::collectClusterGroups

Source Line Number: 391

———————————————————-

Failed to determine disk information

 

Cause

======

The above error is reported when either some of the physical Disk resource is Offline or is not Online from the Active Node where setup was run.

 

Resolution

========

So in order to resolve the issue we need to ensure the Disk Resource is Online from the Node from which we are installing the sql server instance in the cluster.

Once the Disk resource is brought online, the sql server 2005 installation runs fine on the cluster

 

 

 

 

Share/Save/Bookmark

November 13, 2008

Log Reader Agent Fails when the Sql server instance is failed over to Node 2

 

We have observed a number of case where we use sql server replication on a clustered instance of sql server and when we try to failover from Node 1 to Node 2 the Log Reader agent may fail to start with the following error

 

Error messages:

The process could not execute ’sp_MSpub_adjust_identity’ on ‘CRPSCSMSQ69V1\PUB’. (Source: MSSQL_REPL, Error number: MSSQL_REPL20011)

Get help: <<http://help/MSSQL_REPL20011>>

 

An error occurred during decryption. (Source: MSSQLServer, Error number: 15466)

Get help: <<http://help/15466>>

 

The process could not execute ’sp_MSpub_adjust_identity’ on ‘CRPSCSMSQ69V1\PUB’. (Source: MSSQL_REPL, Error number: MSSQL_REPL22037)

Get help: <<http://help/MSSQL_REPL22037>>

 

As seen above the root cause of the issue is the error 

An error occurred during decryption”

Which mean when the sql server instance failover to Node 2 it is not able to decrypt and run the stored procedure sp_MSpub_adjust_identity and hence it is not able to execute the stored procedure sp_MSpub_adjust_identity which is causing he Log Reader to fail on Node 2

It was not able to decrypt the stored procedure the Service Master Key on Node 2 was different then the Service Master Key on Node 1 . So one of the way to resolve the above error is to backup the service master key from Node 1 and restore the same on Node 2

BACKUP SERVICE MASTER KEY TO FILE = ‘path_to_file’ ENCRYPTION BY PASSWORD = ‘password’

RESTORE SERVICE MASTER KEY FROM FILE = ‘path_to_file’
DECRYPTION BY PASSWORD = ‘password’ [FORCE]

 Another resolution can be to regenerate the service master key on Node 2 with Force option but in case when you fail back to Node 1 the Log Reager agent will fail again with the same error.

ALTER SERVICE MASTER KEY REGENERATE FORCE

However we regenerate the Service Master Key the Linked servers on the server can fail and we may have recreate the Linked servers or change the password for the Linked servers which can tedious and undesirable

 

 

The cause of the above issue occurs when we change the service account of the clustered instance of sql server using the services console. For a clustered instance of the sql server we recommend to change the service account using the sql server configuration manager because it performs a number of operation in the background which takes care of the service master key when you failover to Node 2.

 

So to resolve the above error we need to change the Service Account to another account using the Configuration Manager on Node 1 and try to failover on Node 2 and check whether the Log Reader Agent worked fine.

We can then change the Service Account back to the original account using the Configuration manager and try to failover again and the replication should work fine.

 

 

Share/Save/Bookmark

October 13, 2008

Sql server performing slow after changing the Cost Threshold for parallelism

We have observed a few cases now where sql server has started performing slow after changing the Cost Threshold Of Parallelism server option of the sql server.

Ironically in such cases we observe that sql server is performing slow even when there is no high CPU utilization and no high I/O operations on the server. Increasing the Cost Threshold Of Parallelism decreasing your CPU utilization as it reduces the Parallelism in the execution plan of some of the queries which are fired on the database and in turn slows their response time and hence the performance of sql server deprecates.

As defined in the BOL,

Cost threshold for parallelism option is used to specify the threshold at which Microsoft SQL Server creates and runs parallel plans for queries. SQL Server creates and runs a parallel plan for a query only when the estimated cost to run a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to run the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors.
The Default Value of the Cost Threshold Of Parallelism is 5 which means when Optimizer creates and estimates the cost of the query if the cost exceeds 5, optimizer considers this query as a candidate for the parallel execution plan and hence it selects the Parallel plan for this query. However the Degree of Parallelism depends on the Number of Processors in the System and the MAX DOP setting of the sql server. Based on these values optimizer selects a parallel execution plan which consumes higher CPU but with much better response time of the queries and hence the performance of sql server increases.

Cost Threshold of parallelism can also be used to reduce the parallelism in the sql server if the parallel queries consume excessive CPU so we can increase the Cost Degree of Parallelism and change the behavior of sql server to use serial execution plans.

Usually it is always a good option to go with default value for the Cost Threshold Of Parallelism and avoid manipulating the behavior of optimizer unless it is tested on your environment with the same workload as production.

Share/Save/Bookmark

« Older PostsNewer Posts »

Powered by WordPress