Where to host my SQL Server instance? (On-premises vs. Azure)

Enterprises now have four options for hosting SQL Server in their environment:

  • SQL Server on Physical Machines
  • SQL Server on VMs in DataCenters
  • SQL Server on Azure VMs (IaaS)
  • SQL Server on Azure SQL Database (PaaS)

The following presentation can be used as a reference when deciding where to host your SQL Server instance.

SQL Server AlwaysOn Group Failover in Cluster causes Node to lose network connectivity

Last week, I worked on an issue which might sound strange when you hear it for the first time.

“Whenever we fail the SQL Server AlwaysOn group over in the cluster, Node A loses its network connectivity.” (Sounds weird, right?)

Here is the email I received from my customer:

We’re able to re-create the issue on demand now, as every time we try to fail the SQL cluster group over from A to B, as soon as the SQL network name resource goes offline on the A server, the default gateway on the public NIC gets deleted and the box drops off the network. This only happens when failing over from A to B (B to A does not cause any issues), and also only explicitly occurs when the SQL Network resource is taken offline on the A server when being failed over to B (the core cluster name resource and group can fail over between nodes without issue).

The first obvious thought that comes to mind is that it can’t be the SQL Server application causing this, since it sits on top of the Windows platform & the failover mechanism doesn’t have anything to do with the default gateway setting on the NICs. But as Sherlock Holmes says,

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

After spending a few days on troubleshooting & research, we found that the root cause of the issue was incorrect IP routing on the cluster nodes.

More specifically:

The issue was caused by the SQL resource IP being registered in a persistent route on the public NIC of Node A. Because this persistent route was not bound to the interface, the active route & the persistent route were removed whenever the SQL IP resource went offline or was failed over.

More details on this issue can be found in these articles:

http://blogs.technet.com/b/networking/archive/2009/05/21/active-route-gets-removed-on-windows-server-2008-offline-cluster-ip-address.aspx

http://support.microsoft.com/kb/2161341

To resolve the issue, the Windows networking team was engaged to rectify the persistent route setting & the IP routing table.

Hope this helps anyone who hears of or faces this strange issue with SQL AlwaysOn on a Windows Server 2008 or later cluster.

Parikshit Savjani
Premier Field Engineer

Considerations for changing the default port of a default SQL instance on a multi-subnet failover cluster

Security hardening or security audits in an enterprise are some of the motivations for DBAs to run a SQL Server default instance on a non-default port. For standalone instances, there are no challenges in changing the default instance's port to a non-default port; however, for a clustered instance there are some considerations, since the SQL Server resource DLL performs its IsAlive check by logging into the SQL Server instance.

First, to clear up a myth, if any: when a client connects to a default instance of SQL Server, whether it listens on the default port (1433) or a non-default port, the SQL Browser service is not involved & has no role to play. This is the same reason you will observe the SQL Browser service is disabled when you install only a default instance of SQL Server on a given server. SQL Browser has a role to play only for named instances, not when a default instance is listening on a non-default port.
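
A quick way to verify which port your default instance is actually listening on is to check the endpoint of your own connection (a minimal sketch; the columns return NULL for shared memory connections):

-- Verify the TCP address & port of the current connection
SELECT local_net_address, local_tcp_port
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;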

For a clustered default instance of SQL Server, when we change the instance's port to a non-default port, the SQL Server resource might fail to come online & you might see the following error messages in the cluster log:

<< The following excerpt is for a clustered default instance of SQL 2012 running on a non-default port >>
00002840.00003ef0::2015/01/16-23:25:19.637 INFO [RES] SQL Server : [sqsrvres] Server name is SQLFCI
00002840.00003ef0::2015/01/16-23:25:19.637 INFO [RES] SQL Server : [sqsrvres] Service name is MSSQLSERVER
00002840.00003ef0::2015/01/16-23:25:19.637 INFO [RES] SQL Server : [sqsrvres] Dependency expression for resource ‘SQL Network Name (SQLFCI)’ is ‘([5c22a982-9400-416b-b5bd-95da07998601] or [50660f59-4f33-4cd1-9dc0-84f6b32ebb76])’
00002840.00003ef0::2015/01/16-23:25:19.637 INFO [RES] SQL Server : [sqsrvres] This instance is running with IP OR dependency. Resource DLL Will enable MultiSubnetFailover=yes in connection string
00002840.00003ef0::2015/01/16-23:25:19.637 WARN [RES] SQL Server : [sqsrvres] This is default instance. If non-default TCP/IP listening port number is specified, and SQL Server fails to come online, configure an alias on each possible owner node and try again. If it does not work, please contact customer support
00002840.00003ef0::2015/01/16-23:25:19.637 INFO [RES] SQL Server : [sqsrvres] Starting service MSSQLSERVER…
0000125c.00004990::2015/01/16-23:25:19.886 INFO [NM] Received request from client address SQLFCINodeA.
00002840.00003ef0::2015/01/16-23:25:20.842 INFO [RES] SQL Server : [sqsrvres] Service status checkpoint was changed from 0 to 1 (wait hint 20000). Pid is 19508
0000125c.00003fe8::2015/01/16-23:25:23.639 INFO [NM] Received request from client address SQLFCINodeA.
00002840.00003ef0::2015/01/16-23:25:23.842 INFO [RES] SQL Server : [sqsrvres] Service is started. SQL Server pid is 19508
00002840.00003ef0::2015/01/16-23:25:23.842 INFO [RES] SQL Server : [sqsrvres] Connect to SQL Server …
00002840.00003ef0::2015/01/16-23:25:54.855 ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08001] [Microsoft][SQL Server Native Client 11.0]TCP Provider: The remote computer refused the network connection.
00002840.00003ef0::2015/01/16-23:25:54.855 ERR [RES] SQL Server : [sqsrvres] ODBC Error: [HYT00] [Microsoft][SQL Server Native Client 11.0]Login timeout expired (0)
00002840.00003ef0::2015/01/16-23:25:54.855 ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08001] [Microsoft][SQL Server Native Client 11.0]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online. (1225)
00002840.00003ef0::2015/01/16-23:25:54.855 INFO [RES] SQL Server : [sqsrvres] Could not connect to SQL Server (rc -1)
00002840.00003ef0::2015/01/16-23:25:54.855 INFO [RES] SQL Server : [sqsrvres] SQLDisconnect returns following information
00002840.00003ef0::2015/01/16-23:25:54.855 ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08003] [Microsoft][ODBC Driver Manager] Connection not open (0)
00002840.00003ef0::2015/01/16-23:25:59.857 INFO [RES] SQL Server : [sqsrvres] Connect to SQL Server …
00002840.00003ef0::2015/01/16-23:26:30.860 ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08001] [Microsoft][SQL Server Native Client 11.0]TCP Provider: The remote computer refused the network connection.

With the enhanced verbose logging in SQL 2012, as seen above, we now get a message saying that if the default instance is running on a non-default TCP/IP port, we should create an alias (TCP or Named Pipes) on each possible owner node to bring the SQL Server resource online.

00002840.00003ef0::2015/01/16-23:25:19.637 WARN  [RES] SQL Server <SQL Server>: [sqsrvres] This is default instance. If non-default TCP/IP listening port number is specified, and SQL Server fails to come online, configure an alias on each possible owner node and try again. If it does not work, please contact customer support

In the case of a multi-subnet failover cluster, we observed that creating the alias brought the SQL Server resource online; however, the SQL Agent resource still failed to come online. In the SQLAGENT.OUT log we see the following error messages:


2015-01-16 17:36:54 – ! [150] SQL Server does not accept the connection (error: 1225). Waiting for Sql Server to allow connections. Operation attempted was: Verify Connection On Start.
2015-01-16 17:36:54 – ! [000] Unable to connect to server ‘(local)’; SQLServerAgent cannot start
2015-01-16 17:36:59 – ! [298] SQLServer Error: 1225, TCP Provider: The remote computer refused the network connection. [SQLSTATE 08001]
2015-01-16 17:36:59 – ! [165] ODBC Error: 0, Login timeout expired [SQLSTATE HYT00]
2015-01-16 17:36:59 – ! [298] SQLServer Error: 1225, A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online. [SQLSTATE 08001]
2015-01-16 17:36:59 – ! [382] Logon to server ‘(local)’ failed (DisableAgentXPs)
2015-01-16 17:36:59 – ? [098] SQLServerAgent terminated (normally)

This issue was specific to multi-subnet failover clustering: we observed that if we disabled the cluster checkpoints & modified the following registry value from 1 to 0, the SQL Agent resource would come online.

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.MSSQLSERVER\Cluster\RunWithIPORDependency

However, it would remain online only until the SQL Server resource was restarted. When the SQL Server resource comes online again, SQL Server detects that it is a multi-subnet cluster & flips the bit back from 0 to 1 (note: this is not caused by cluster checkpointing).

This prevents the SQL Agent resource from coming online. To work around the issue, we can set the ServerHost registry value for SQL Agent to point to the alias we created, which ensures SQL Agent can connect to SQL Server & the resource comes online:

  1. Remove the cluster checkpoint:

Cluster res "SQL Network Name (SQLFCI)" /removecheckpoint:"Software\Microsoft\Microsoft SQL Server\MSSQL11.MSSQLSERVER\SQLServerAgent"

  2. Set the ServerHost registry value (under the same SQLServerAgent key referenced by the checkpoint above) to the alias name.
  3. Add the cluster checkpoint back:

Cluster res "SQL Network Name (SQLFCI)" /addcheckpoint:"Software\Microsoft\Microsoft SQL Server\MSSQL11.MSSQLSERVER\SQLServerAgent"

After pointing the ServerHost value to the alias, which resolves to the non-default port, SQL Agent is able to connect to SQL Server & the resource comes online.

I have filed a Connect bug for this issue; feel free to vote for it to bump up the priority:

https://connect.microsoft.com/SQLServer/feedback/details/1090514/sql-agent-fails-to-come-online-in-multi-subnet-clustered-default-instance-of-sql-server-running-on-non-default-port

Parikshit Savjani
Premier Field Engineer

What is your role in Cloud?

As I interact with more & more customers who are onboarding onto Azure, Microsoft's cloud platform, some of the common myths or assumptions I hear are:

"We just need to deploy the application or database in the cloud & Microsoft will take care of the OS, storage, patching, etc."

"After applications move to the cloud, we no longer need an infrastructure team or DBAs to manage the platform."

The above statements may hold true to some extent when organizations adopt the cloud in Software as a Service (SaaS) or Platform as a Service (PaaS) scenarios. However, in Infrastructure as a Service (IaaS) & hybrid cloud scenarios, they are not completely true, & in this blog post I would like to (re)define the roles & responsibilities in an enterprise as it starts its journey toward the cloud.

In my opinion, most enterprise organizations will not be able to migrate completely to public cloud infrastructure, for multiple reasons, so hybrid scenarios will be the most common form of cloud adoption, which is where Microsoft has a clear edge over its competition.

These roles & responsibilities are my personal imagination of future roles, based on my experience in the field.

But before I start, a disclaimer

“This is a personal weblog. The opinions expressed here represent my own and not those of my employer. In addition, my thoughts and opinions change from time to time…I consider this a necessary consequence of having an open mind. This weblog is intended to provide a semi-permanent point in time snapshot and manifestation of the various memes running around my brain, and as such any thoughts and opinions expressed within out-of-date posts may not be the same, nor even similar, to those I may hold today.”

System Administrators

In cloud scenarios, the System Administrator will be responsible for the following:

  • Building VM images that adhere to company standards & can be leveraged by DBAs or application teams to spin up their servers within a few minutes; this is a one-time effort that can be reused.
  • Updating the VM images with the latest patches & anti-virus updates.
  • Using or configuring agents (runbooks, Chef & Puppet add-ins) to update & patch the server VMs already in use in the organization's environment.
  • Monitoring system performance & resource utilization.

Network Administrators

In hybrid cloud scenarios, this will be the most important role, since an enterprise cloud should be designed as an extension of the on-premises data centers, & that extension is made possible by Network Administrators.

  • Configure site-to-site VPN to ensure seamless connectivity from on-premises to cloud data centers.
  • Configure IPsec, network security & firewalls for the cloud data centers.
  • Configure DNS servers & assign IP addresses to the VMs running in the cloud.
  • Monitor network performance & optimize it over the internet by using network compression techniques or other options to get the most out of the bandwidth.

Domain Administrators

In hybrid cloud scenarios, the servers need to be added to a domain, which allows users, groups & service accounts to secure & access resources based on the policies governed by Domain Administrators.

  • Configure Azure Active Directory Sync to extend the on-premises AD & Exchange to Azure.
  • Ensure the group policies are applied to the VM images.
  • Ensure the same organizational account is integrated & used by users to log in to VMs, Office 365 or databases in the cloud.

Storage Administrators

In cloud scenarios, this role can be thinned a bit & merged with the System Administrator role; nevertheless, it can have the following responsibilities:

  • Configure Storage Pools to combine multiple Data Disks to obtain higher throughput & IOPS.
  • Configure Azure Storage for backups & BLOB storage.
  • Manage & govern storage: decide which storage should be used to achieve the desired throughput within a given cost budget.
  • Monitor Storage performance

Security Administrators

In cloud scenarios, data security is a big concern, which is why this role will be important. It will have the following responsibilities:

  • Based on the sensitivity of the data, define data classification layers to govern & decide which data can move to the public cloud & which should remain on-premises.
  • Harden the network security & firewalls.
  • Govern & define policies on which data can be geographically replicated & which cannot, for political reasons.

Database Administrators (My Favorite)

In IaaS cloud scenarios, the DBA role is almost unchanged & will have responsibilities similar to on-premises:

  • Setup SQL Instance, configure tempdb, create logins, install database, configure user roles.
  • Configure server properties.
  • Backups & restore of databases.
  • Monitor SQL Performance & identify blocking or bad performing queries.

In PaaS cloud scenarios:

  • Configure database, create users, roles & assign permissions.
  • Backup & Restore Databases.
  • Manage SQL Performance & identify blocking or bad performing queries.

One of the skillsets all of these roles will need to possess is PowerShell, along with Azure Automation for creating Orchestrator runbooks, which will empower each role to automate repetitive jobs when managing a large-scale enterprise cloud environment.

In some organizations the roles defined above might overlap or merge across different teams, while in large, complex enterprise organizations they might be isolated & very well defined.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

Cloud Revolution & our adoption

This is my first attempt at blogging on a non-technical topic (non-technical but still related to technology), since I never felt it was my cup of tea. However, the recent changes & developments I am witnessing around me have provoked a lot of new thoughts, which I wanted to pen down for my own reference & for the benefit of others who think alike.

But before I start, a disclaimer

“This is a personal weblog. The opinions expressed here represent my own and not those of my employer. In addition, my thoughts and opinions change from time to time…I consider this a necessary consequence of having an open mind. This weblog is intended to provide a semi-permanent point in time snapshot and manifestation of the various memes running around my brain, and as such any thoughts and opinions expressed within out-of-date posts may not be the same, nor even similar, to those I may hold today.”

For the past couple of years or more we have been hearing a lot of buzz about the cloud, & there is cloud all around us. The more important fact to realize is that the cloud is no longer a buzzword but a reality, & if we as IT pros ignore this fact we are going to miss the boat. Huge tech companies have made large investments to make it a reality, & businesses have already started to realize the cost savings & benefits of the cloud.

While there is enough proof & content out there on how businesses can benefit from the cloud, I like the hotel analogy, which I can relate to easily. As consultants or tourists, we like to travel, but do we have to buy real estate in every place we visit in order to stay there? Thanks to hotels, we just book the room, check in at the desired time, check out when we leave & pay for the days we spent there. Just imagine how limited our travel or our reach would be if hotels didn't exist & we were supposed to buy a room in every place we visit.

Similarly, businesses can now think of growing their reach by adopting the cloud & not worrying about the cost of data centers, real estate, servers, RAM, CPUs, storage, etc. Businesses can now afford to experiment & innovate even more with IT, since the cost of spinning up VMs in the cloud is negligible & if they fail they aren't going to lose much.

The strongest argument against the cloud is data security. However, I assume the same argument was made when the internet was evolving: was it safe to send data packets outside your organization? But today, can organizations grow or even survive without a presence on the internet? There are, & will continue to be, processes & protocols developed for data security in the cloud.

But are IT pros ready for this cloud revolution? We are so much in love with our on-premises technologies; that is our comfort zone, which has improved & matured over time to become more & more polished & easy to work with, while the road to the cloud is not so mature & is still evolving.

An IT pro like myself might be reluctant to move to the cloud, since we are going out of our comfort zone: we have invested years of our time to learn & become experts in a product or technology, & now we need to move on to something else. Further, there are some gaps between on-premises technology, where features are available out of the box, & the cloud, where we are expected to code them.

The bad news is that whether we like it or not, it is coming our way, & at a rapid pace. For instance, features which were not available 6 months back are available now, & as I see it, cloud offerings may mature at a faster pace than their respective on-premises offerings did, since the product teams know exactly which features consumers like the most.

As I see it, we are living in the times of the cloud revolution, & we should take every opportunity to learn the cloud before it's too late, rather than ignoring it on the assumption that it is not mature yet. In the 2000s, people who could spell .NET were hired, & it was a skillset which lasted long & was worth the investment of time, one which gave & is still giving people handsome returns. Cloud skills are definitely one such investment, one that is going to give returns to IT pros for a very long time.

To conclude, I will leave you with the following thought

You can either witness this change or “Be the change”.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

The mysterious case of data load in a partitioned table

I have chosen this title for my blog post since troubleshooting this issue was like solving a mystery. One of my customers recently faced slow performance while loading data into a partitioned table. Whenever there is a performance issue with an insert statement, the usual suspects that come to mind are blocking or disk IO, so we started investigating the waits for the query in sys.dm_exec_requests & sys.dm_os_waiting_tasks, but to our surprise the insert query was always in the running or runnable state, which didn't justify its slow performance.
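
For reference, the checks looked roughly like the following sketch (the session id 53 is hypothetical; substitute the session running the insert):

-- State & current wait of the insert request
SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time, r.last_wait_type
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 53;

-- Any waiting task & its blocker for the same session
SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.blocking_session_id
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id = 53;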

Further, tracking the disk IO activity, we found that the actual insert operation completed within a few minutes; however, before the writes were performed there was increased read activity on the SQL instance & some kind of wait which could not be identified from the sys.dm_os_waiting_tasks & sys.dm_exec_requests DMVs, since the requests appeared to be in the running or runnable state.

The table is a partitioned table around 1 TB in size, & during the data load there was internal memory pressure on the SQL Server, with PLE & Free Pages going down while Lazy Writes/sec increased. To some extent this is expected during a data load, but we didn't have any performance baseline to compare against.

Normally, I wouldn't check the query plan for slow insert performance, since unlike a select query, which can be optimized by indexing or a stats refresh, there is limited scope for tuning inserts, updates or deletes. However, I captured the query plan to understand the execution path chosen by the query, & that is what helped me solve the mystery.

[Figure: the bad execution plan]

In the execution plan, we see a sorted Merge Join operator & an Index Scan operation, which are used to perform the unique key validation before the actual insert of data. Performing an index scan on a 1 TB table is definitely not a good plan.

First question: why did the optimizer choose a Merge Join & Index Scan for unique key validation? The answer is the yellow exclamation warning shown on the Index Scan operator, which is a missing-statistics indication.

When we look at more details on the Index Scan operator, we find missing statistics on PtnId, a hidden column added to partitioned indexes starting with SQL Server 2008. More details on PtnId can be found in Craig Freedman's blog:

http://blogs.msdn.com/b/craigfr/archive/2008/08/05/partitioned-indexes-in-sql-server-2008.aspx

[Figure: Index Scan operator details showing the missing PtnId statistics]

The only supported way to rebuild the missing statistics for PtnId is to rebuild the partitioned index on the partitioned table. Although rebuilding a large partitioned index is not an easy task & can run for multiple days, it was worth the time & resources spent.
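
The rebuild itself is a plain index rebuild on the partitioned table, something along these lines (a sketch with hypothetical table & index names; ONLINE & SORT_IN_TEMPDB depend on your edition & tempdb capacity):

-- Rebuilding the entire partitioned index regenerates its statistics,
-- including those on the hidden PtnId column
ALTER INDEX IX_FactLoad_Partitioned ON dbo.FactLoad
REBUILD WITH (SORT_IN_TEMPDB = ON, ONLINE = ON);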

After rebuilding the indexes on the partitioned table, the performance of the data load improved as expected & we saw the following execution plan:

[Figure: the good execution plan]

The good execution plan uses a Nested Loops join, which is a non-blocking operator, & performs an Index Seek operation on the partitioned table, which reduces the memory footprint & disk IO & improves the performance of the insert statement drastically.

Hope you enjoyed the mysterious case of data loading in a partitioned table

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

Sociometer Using Power BI

Recently I presented a demo on Power BI for an internal audience at Microsoft, which was very well received & appreciated. I thought it would make a good demonstration for an external audience as well.

Hope you all like the demo & Power BI !!!

Parikshit Savjani
Premier Field Engineer

Impact of Pre-grown Data & Log Files on the Restore Time of the Database

Today, I was asked the following question by one of my colleagues, who specializes in SharePoint:

If the actual data is 50 GB in size & we create a pre-grown data file of 200 GB to avoid autogrow & fragmentation, shouldn't the restore time for the pre-grown database be much larger than the restore time of the same database without pregrowth, i.e. with a data file size of 50 GB?

My initial guess was that the restore time shouldn't vary too much, since the data file creation should be fast due to instant file initialization, while the restore operation only restores the active pages with data, which should be the same in both scenarios. However, this was purely based on my theoretical knowledge, & it is always important to validate & test.

So I used a sample database with an initial primary data file size of 4 GB & a log file size of 1 GB, with around 3 GB of active data within the database.

To see the impact of pre-grown data files, I kept the initial log file size constant at 1 GB while increasing the initial data file size in increments of 5 GB, keeping the active data in the database constant at 3 GB.
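
Each test iteration looked roughly like this (a sketch; the logical file name, backup path & size are illustrative):

-- Grow the data file to the size under test
ALTER DATABASE SampleDB MODIFY FILE (NAME = SampleDB_data, SIZE = 20GB);

-- Compressed backup, then a timed restore
BACKUP DATABASE SampleDB TO DISK = N'D:\Backup\SampleDB.bak' WITH COMPRESSION, INIT;
RESTORE DATABASE SampleDB FROM DISK = N'D:\Backup\SampleDB.bak' WITH REPLACE, STATS = 5;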

[Table: backup size & restore time as the initial data file size increases]

For my testing I used SQL 2012 with SP1, & I used backup compression when taking the backup each time I pre-grew the data file. As noted in the table above, irrespective of the data file size, the backup size was consistent at 266 MB, since the data in the database was consistent & static. This was as per expectation.

However, the restore time increased proportionally as the initial data file size increased, which didn't make much sense.

So I turned on trace flags 3004 & 3605 to trace the restore events & identify why the restore time varied between restores.
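
If you want to trace this yourself, the trace flags can be enabled instance-wide (trace flag 3004 reports backup/restore internals & 3605 redirects the output to the errorlog):

DBCC TRACEON (3004, 3605, -1);   -- enable before the restore
-- ... run the restore ...
DBCC TRACEOFF (3004, 3605, -1);  -- disable afterwards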

After turning on the trace flags, I discovered the following events in the errorlog, which justified the time difference between the restores.

Errorlog for Restore with 5GB data file size

12:57:46 PM spid56 Zeroing completed on C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SampleDB_log.ldf 0:00:05
12:57:54 PM spid56 Zeroing completed on C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SampleDB.mdf 0:00:08
12:58:04 PM spid56 Restore: Waiting for log zero on SampleDB 0:00:10

Errorlog for Restore with 20GB data file size

12:51:35 PM spid56 Zeroing completed on C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SampleDB_log.ldf 0:00:06
12:52:38 PM spid56 Zeroing completed on C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SampleDB.mdf 0:01:02
12:52:49 PM spid56 Restore: Waiting for log zero on SampleDB 0:00:11

As seen in the data above, the additional time spent during each restore was due to zeroing of the data files, which means instant file initialization was not working for me. I realized my SQL Server was running under the virtual service account (NT Service\MSSQLSERVER), which does not possess the Perform Volume Maintenance Tasks privilege by default. I assigned the privilege to NT Service\MSSQLSERVER, restarted SQL Server & performed the above tests again. The following results were found:

[Table: restore times with instant file initialization enabled]

The above results do make sense & meet our expectation, since with instant file initialization the zeroing of the data file is avoided, which keeps the restore time more or less the same.

However, instant file initialization doesn't apply to log files, & we see the following results when we test the restore keeping the initial size of the data file constant while increasing the initial size of the log file.

[Table: restore time as the initial log file size increases]

So the restore time does increase with the log file size. Again, when I turned on the tracing to identify where the majority of the time was spent, I found that most of it went to zeroing of the log files.

However, in the case of log files, the performance advantage of pre-growing is significant, because the number of VLFs created by every autogrow can proportionally slow down database recovery.

Hence, with instant file initialization for data files & some compromise on restore time for log files, it is still recommended to pre-grow the data & log files of a database to avoid autogrow & fragmentation on the disk.

Autogrow should be treated as insurance: something you want to have, but hope never to need.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

Troubleshooting SQL Server Hang using system_health XEvents

Recently my customer reported they were seeing intermittent server hangs, connection timeouts & slow performance on their SQL Server instance. The first question I had was: do we have PSSDIAG running on the SQL instance, so I can understand the cause of the hang? However, since we don't run PSSDIAG proactively on any of the servers, we didn't have any logs to look at. So we started looking at the system_health XEvents session data from the time of the slow performance.

If you are not familiar with the default system_health XEvent session, I would suggest you read the blog post from Bob Dorr here.

Reading from the system_health XEvents session is not the most convenient way of troubleshooting, so I used a solution developed by my fellow PFE colleague Denzil, who has blogged about it here.
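
If you just need the raw data, the session's file target can also be read directly with T-SQL (a minimal sketch; the wildcard path picks up the default system_health file target in the LOG directory on SQL 2012 & later):

SELECT CAST(event_data AS xml) AS event_data
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL);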

We observed that the issue occurs intermittently: the application hangs & no further connections to the SQL Server are allowed.

Resource Utilization

We do not see CPU or memory pressure on the SQL Server instance. The system CPU utilization is around 40% on average, with some spikes causing it to rise to 100%; however, during those times the SQL CPU utilization is consistent at 30-35% on average.

The available memory on the server is unchanged & consistent, & no low-memory alerts were observed on the SQL instance.

[Chart: system & SQL CPU utilization]

[Chart: available memory]

Query Processing & THREADPOOL Waits

We see very high waits on the THREADPOOL wait type, which suggests we are running out of worker threads; this is why we observed slowness & hangs on the SQL instance. From the following query-processing logs, we see a sudden surge in worker threads from 198 to 806 in the 10-minute interval from 7:10 AM to 7:20 AM (server time 2:10 AM to 2:20 AM).
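
THREADPOOL waits also accumulate in the wait stats, so a quick check like the following sketch shows whether the instance has been starved for worker threads since startup:

SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';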

[Screenshot: query_processing events showing the surge in worker threads]

When we check the sys.dm_exec_requests data captured during the same time, we see that concurrent requests to the SQL instance increased from 80 to 500 in the same interval, which explains the surge in worker threads created to serve the 500 concurrent requests.

However, we need to understand that the THREADPOOL wait type here is not the cause but a victim, since the assigned worker threads are not completing their assigned tasks quickly. The root cause in this case was blocking on the SQL instance, which prevented the worker threads from completing their tasks quickly & being released back to the worker pool.

Blocking

In this case, we found that the head blocker was waiting on a non-IO buffer latch, PAGELATCH_XX (PAGELATCH_EX, PAGELATCH_UP, PAGELATCH_SH), caused by sequential inserts into a large table (last-page contention). We suggested partitioning the table & building the index on an alternate column to avoid the last-page contention (further info on resolving latch contention can be found in this white paper).
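
A live check for this pattern might look like the sketch below: a pile-up of sessions waiting on page latches behind one blocker, with resource_description repeatedly pointing at the same (last) page:

SELECT session_id, wait_type, wait_duration_ms,
       blocking_session_id, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%';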

 

[Screenshot: blocking chain with PAGELATCH waits from the system_health data]

Using the system_health XEvents session & the solution developed by Denzil, we were able to nail down the issue on the SQL instance without capturing any additional data.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

Can the number of worker threads in SQL Server exceed the max worker threads setting?

Recently, while troubleshooting a performance issue on a SQL instance experiencing very high THREADPOOL waits, I observed something I was unaware of until I hit this issue.

While I was looking at the system_health .xel files captured at the time of the issue, specifically at the query_processing event trace, the following highlighted line caught my attention.

The server has 16 x64 cores, & as per the max worker threads calculation for 64-bit systems, the worker threads on the server should not exceed 512 + (16 - 4) * 16 = 704. In the query_processing event we saw maxWorkers correctly reported as 704; however, workerCreated exceeded maxWorkers, which was unusual.
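
The computed limit & the current worker count can be compared straight from the DMVs (a minimal sketch):

SELECT max_workers_count FROM sys.dm_os_sys_info;          -- reports 704 on this server
SELECT COUNT(*) AS current_workers FROM sys.dm_os_workers; -- can exceed 704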

[Screenshot: query_processing event showing workerCreated exceeding maxWorkers]

I checked with Denzil R & Suresh Kandoth, & they helped me understand that this is expected behavior in the SQL Server code, since the max worker threads setting only controls user tasks (in other words, user batches) & doesn't control system or internal tasks. The same is documented in the MSDN article below:

http://msdn.microsoft.com/en-us/library/ms190219.aspx

The max worker threads server configuration option does not take into account threads that are required for system tasks such as Availability Groups, Service Broker, the Lock Manager & others. If the configured number of threads is being exceeded, the following query will provide information about the system tasks that have spawned the additional threads:

-- List the system tasks & the workers serving them
SELECT
    s.session_id,
    r.command,
    r.status,
    r.wait_type,
    r.scheduler_id,
    w.worker_address,
    w.is_preemptive,
    w.state,
    t.task_state,
    t.session_id,
    t.exec_context_id,
    t.request_id
FROM sys.dm_exec_sessions AS s
INNER JOIN sys.dm_exec_requests AS r
    ON s.session_id = r.session_id
INNER JOIN sys.dm_os_tasks AS t
    ON r.task_address = t.task_address
INNER JOIN sys.dm_os_workers AS w
    ON t.worker_address = w.worker_address
WHERE s.is_user_process = 0;

Don't be surprised if you see the number of worker threads in sys.dm_os_workers exceed the max worker threads setting; you can use the above query to identify the system tasks that are using the additional worker threads.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer