Wednesday 6 March 2019

SharePoint application pool stops automatically

Issue Definition:
We have multiple web applications in our farm, and all of them were affected intermittently: sometimes a site took more than 2 minutes to load, and sometimes it timed out with an error message.

There can be multiple reasons for site slowness, but the fix we ended up applying was not one we expected.

Troubleshooting done:
This performance issue is very intermittent.
After the reboot, we noticed that all SharePoint application pools were stopping intermittently and users were being prompted for authentication, which is not expected.

Temp fix: We added the application pool account to the local admin group on one of the servers for the time being [to resolve the application pools getting stopped with access issues]. Group policies need to be looked into.

We confirmed that even though Kerberos authentication is selected, we do not have any SPNs set, so authentication falls back to NTLM.
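To make the SPN gap concrete, here is a minimal sketch that lists the HTTP SPNs a Kerberos zone URL would normally need and reports which ones are missing from a registered list. The function names and the example URL in the comment are hypothetical, and the registered list is assumed to be copied by hand from something like `setspn -L <account>` output; this is an illustration, not the check we ran.

```python
from urllib.parse import urlsplit


def expected_http_spns(url):
    """Return the HTTP SPNs a Kerberos web application URL would normally need.

    Example (hypothetical URL): "http://intranet.contoso.com"
    """
    parts = urlsplit(url)
    host = parts.hostname  # urlsplit lower-cases the host name
    if host is None:
        raise ValueError(f"no host name in URL: {url!r}")
    spns = [f"HTTP/{host}"]
    # Assumption: for a non-default port, an SPN with the port is
    # registered in addition to the port-less one.
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    if parts.port is not None and parts.port != default_port:
        spns.append(f"HTTP/{host}:{parts.port}")
    return spns


def missing_spns(urls, registered):
    """Return expected SPNs that are absent from `registered`.

    The comparison ignores case, since SPN matching is case-insensitive.
    """
    have = {spn.lower() for spn in registered}
    missing = []
    for url in urls:
        for spn in expected_http_spns(url):
            if spn.lower() not in have and spn not in missing:
                missing.append(spn)
    return missing
```

If `missing_spns` returns anything for the Kerberos zone URLs, clients cannot get a service ticket for those names and will negotiate down to NTLM, which matches what we saw.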

Every web application has 2 zones: one for NTLM (for crawling) and an extended one for Kerberos [for end users browsing].

On BIG-IP [load balancer] we noticed that the NTLM-enabled zone was pointed to the APP servers (which do not cater to user requests).

There are 2 WFEs:
WFE01 - server 1, non-working
WFE02 - server 2, working

On WFE01 the application pools were getting stopped, and users were repeatedly prompted for authentication while browsing.

To identify the cause of the issue, we need to focus on one particular operation, such as loading the site.

Here are the logs to collect and the steps to perform to isolate this issue:

  1. As an initial step, do a health check of the SharePoint servers.
  2. Check the URL, logon name and load time for the site on which the user reported the issue.
  3. Does the issue disappear on the second load, or does it only happen for a period of time?
  4. Review the last 5 days to see at what times the issue was reported and whether it coincides with a scheduled activity. Focus on the periods when the largest number of users reported the same issue.
  5. Collect client-side logging using the IE developer tools or Fiddler.
  6. Collect VerboseEx logs for any successful client trace with the issue.
  7. Add a manual hosts entry on a client machine and point it to each WFE in turn to see whether the URL that was reported slow loads slowly on any particular server.
  8. Check the database health report from the primary AG.
  9. Check the database nodes' performance and drive space, and whether any backups are running.
  10. You can also collect a memory dump if you find any application slowness.
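Step 4 can be done by hand, but a small script makes time-of-day clustering easier to spot. The sketch below counts incident reports by hour of day; the function names and the timestamp format are assumptions, and the report times would come from your own ticketing data.

```python
from collections import Counter
from datetime import datetime


def reports_per_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count incident reports by hour of day (0-23).

    `timestamps` is an iterable of strings in `fmt`
    (an assumed format; adjust it to match your ticket export).
    """
    return Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)


def busiest_hours(timestamps, top=3, fmt="%Y-%m-%d %H:%M"):
    """Return up to `top` (hour, count) pairs, busiest hour first."""
    return reports_per_hour(timestamps, fmt).most_common(top)
```

If most reports across the 5 days fall into the same hour, compare that hour against timer jobs, crawls, backups and patching windows before looking elsewhere.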

Resolution:


During the reboot for OS patching, the crashonauditfail registry key became enabled on one of the web front-ends, which left the application pools unable to start as a self-protection measure.

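The crashonauditfail value should not be 2. A value of 2 means the protection has engaged, and only administrators can log on; a value of 1 means it is enabled but has not been triggered; a value of 0 means it is disabled.

We set it to 0 per Microsoft's suggestion. The change requires a reboot of the server to take effect.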
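For checking a server quickly, here is a minimal sketch that reads the value and explains it. It assumes the standard location of the setting, `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa`, and uses Python's Windows-only `winreg` module; the function names are hypothetical and this is not the tool we used.

```python
LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "CrashOnAuditFail"

MEANINGS = {
    0: "disabled",
    1: "enabled, not triggered",
    2: "triggered: only administrators can log on",
}


def describe_crash_on_audit_fail(value):
    """Translate a CrashOnAuditFail value into a short description."""
    if value is None:
        return "value not present"
    return MEANINGS.get(value, f"unexpected value {value!r}")


def read_crash_on_audit_fail():
    """Read CrashOnAuditFail from the local registry (Windows only).

    Returns the integer value, or None if the value does not exist.
    """
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
        except FileNotFoundError:
            return None
    return value
```

On a WFE, `describe_crash_on_audit_fail(read_crash_on_audit_fail())` gives a one-line answer. A result of "triggered" would explain why the application pools only started once the service account was a local administrator.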

Later, a group policy change that forcibly disables crashonauditfail was implemented. This should prevent the registry key from being enabled again on reboot, which is what left the application pools unable to start unless the service account was in the local admin group.
