Monday, 15 April 2019

After SharePoint CU upgrade, WSP package fails to deploy on some servers

Issue Description: This issue started after the Sep 2017 CU upgrade and affects only one server, where the solution does not get deployed. The other server gets the solution deployed correctly.

  1. We started seeing this issue after patching. This is common behavior in SharePoint on-premises environments: the logical SharePoint Timer Service instance goes into Offline/Disabled mode on one or more servers. Ideally, once PSConfig completes successfully, it moves the Timer Service instance back to the Online state. However, when PSConfig hits a problem during one of its steps, it can fail to bring all the Timer Service instances back online. Timer jobs on the server where the Timer Service instance did not come back online are then not processed.
    • The issue is not obvious because OWSTIMER.exe (the SharePoint Timer service) keeps running in the Windows Services.msc console, so no errors are thrown in the Application event log or in the ULS logs. However, because the logical Timer Service instance is Offline on the server, the timer jobs are not processed. This applies to all timer jobs, not only to WSP-related ones.
  2. Following from the point above, I don’t believe the issue is with the WSP solution, as we were able to deploy it successfully right after bringing the Offline Timer Service instance back Online.
Review:
  • The logs show that the timer job for the solution deployment was created, but the deployment itself never ran.
02/20/2018 10:23:10.54  PowerShell.exe (0x0A2C)  0x0A5C  SharePoint Foundation  PowerShell  6tf0  Medium  Entering ProcessRecord Method of install-spsolution.  9a36ccdf-5135-4057-b3a7-90a009e136f4

02/20/2018 10:23:10.88  PowerShell.exe (0x0A2C)  0x0A5C  SharePoint Foundation  Topology  8uav  Verbose  Solution Deployment : Created timer job for branding.wsp, id : 8b0ebc5b-c659-4105-8e7a-cf62a3458360  9a36ccdf-5135-4057-b3a7-90a009e136f4

02/20/2018 10:23:10.88  PowerShell.exe (0x0A2C)  0x0A5C  SharePoint Foundation  Topology  8ucn  Verbose  Solution Deployment : Deleting OperationStatus object for solution branding.wsp  9a36ccdf-5135-4057-b3a7-90a009e136f4

02/20/2018 10:23:10.88  PowerShell.exe (0x0A2C)  0x0A5C  SharePoint Foundation  PowerShell  6tf0  Medium  Leaving ProcessRecord Method of install-spsolution.  9a36ccdf-5135-4057-b3a7-90a009e136f4
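
One way to confirm what the log excerpt suggests (the job was created but never ran) is to look at the solution's job status and at the deployment jobs still queued in the farm. This is a minimal sketch, assuming it runs on a SharePoint server with the SharePoint snap-in installed. Select-PendingDeploymentJob is a hypothetical helper name, and the job-name pattern solution-deployment-* is an assumption about how these jobs are named in your farm.

```powershell
# Hypothetical helper: keep only solution deployment timer jobs.
# Assumption: deployment jobs are named "solution-deployment-<package>-<n>".
function Select-PendingDeploymentJob {
    param([object[]]$Jobs)
    $Jobs | Where-Object { $_.Name -like 'solution-deployment-*' }
}

# Load the SharePoint cmdlets when the snap-in mechanism is available.
if (Get-Command Add-PSSnapin -ErrorAction SilentlyContinue) {
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
}

# Only query the farm when the SharePoint cmdlets are actually present.
if (Get-Command Get-SPSolution -ErrorAction SilentlyContinue) {
    # State of the package named in the log excerpt.
    Get-SPSolution -Identity 'branding.wsp' |
        Select-Object Name, Deployed, JobExists, JobStatus, LastOperationResult, LastOperationDetails

    # Deployment jobs that were created but are still waiting to run.
    Select-PendingDeploymentJob -Jobs (Get-SPTimerJob) |
        Select-Object Name, Server, Status, LastRunTime
}
```

A solution that reports JobExists as True for a long time, together with a deployment job that never gets a LastRunTime, points at the timer service rather than at the package.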

Explanation:

  • Each SharePoint server has a physical Timer Service (OWSTIMER.exe), a Windows-level service responsible for processing all the timer jobs. Each SharePoint server also has a logical SharePoint Timer Service instance, which is the piece that lives in the SharePoint code base.
  • The logical Timer Service instances must be Online on all servers for SharePoint to be able to process the timer jobs.
  • We ran the following script to check whether the Timer Service instances were Online on all servers, and found that the affected server’s Timer Service instance was set to Disabled.
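# Load the SharePoint cmdlets when not running in the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$farm = Get-SPFarm
# Collect every Timer Service instance whose status is anything other than Online.
$disabledTimers = $farm.TimerService.Instances | Where-Object { $_.Status -ne "Online" }
if ($null -ne $disabledTimers)
{
    foreach ($timer in $disabledTimers)
    {
        Write-Host "Timer service instance on server $($timer.Server.Name) is not Online. Current status: $($timer.Status)"
        Write-Host "Attempting to set the status of the service instance to Online"
        # Bring the logical instance back online and save the change to the farm configuration.
        $timer.Status = [Microsoft.SharePoint.Administration.SPObjectStatus]::Online
        $timer.Update()
    }
}
else
{
    Write-Host "All Timer Service Instances in the farm are online! No problems found"
}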
  • We enabled the Timer Service instance with the script above, then deployed the solution again from the Central Admin site, and it succeeded on both servers.
Important Note:
  • When the Timer Service instance is not enabled on a server, all timer jobs on that server are affected. You won’t see any errors in the ULS logs, but the script above is a good way to check whether all Timer Service instances in the farm are online.
Reference:

Add this step to the patching activity:
As an additional safeguard, I recommend adding the following step to your patching process so that we explicitly make sure all Timer Service instances in the farm are Online once patching is complete. You can add it to your patching process documentation; it will avoid issues with Timer Service instances.

After the patching process is complete, run the script from the Explanation section above on one of the SharePoint servers. It checks for any Offline/Disabled Timer Service instances and enables them.

Wednesday, 6 March 2019

Collect memory dump if you find any application slowness

As part of troubleshooting SharePoint application slowness, we might have to collect memory dumps.

If a site is still slow when you add a manual hosts entry on the client machine and point it at a specific WFE, collect memory dumps from that server. You can use Task Manager or the ProcDump tool to collect the dumps.

a. Download ProcDump and extract it to a folder on a server drive with enough free space (more than 10 GB).

b. Open a command prompt and change to the ProcDump directory.

c. Start loading the slow URL and, at the same time, run the command below in the command prompt on the server. You can collect 2-3 dumps while loading the same site for comparison.

procdump -ma PID
Here PID is the process ID of the w3wp.exe worker process that serves the affected site.
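
To find the right PID, it helps to list the IIS worker processes together with the application pool each one serves. The sketch below is one possible way to do that, assuming an elevated PowerShell session on the WFE. Get-AppPoolNameFromCommandLine is a hypothetical helper that relies on w3wp.exe being started with an -ap "<pool name>" argument.

```powershell
# Hypothetical helper: pull the application pool name out of a w3wp.exe command line.
# IIS starts each worker process with an -ap "<pool name>" argument.
function Get-AppPoolNameFromCommandLine {
    param([string]$CommandLine)
    if ($CommandLine -match '-ap\s+"([^"]+)"') { $Matches[1] } else { $null }
}

# List every IIS worker process with its PID and application pool (Windows only).
if (Get-Command Get-CimInstance -ErrorAction SilentlyContinue) {
    Get-CimInstance Win32_Process -Filter "Name = 'w3wp.exe'" | ForEach-Object {
        [pscustomobject]@{
            ProcessId = $_.ProcessId
            AppPool   = Get-AppPoolNameFromCommandLine -CommandLine $_.CommandLine
        }
    }
}
```

Pass the ProcessId of the pool that serves the slow site to procdump -ma. If the AppPool column comes back empty, the session is probably not elevated.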

SharePoint application pool stops automatically

Issue Definition:
We have multiple web applications in our farm, and all of them are intermittently affected: sometimes a site takes more than 2 minutes to load, and sometimes it times out with an error message.

There can be many reasons for site slowness, but the fix we ended up applying was one we did not expect.

Troubleshooting done:
This performance issue is very intermittent.
After the reboot, we noticed that all SharePoint application pools were stopping intermittently and users were being prompted for authentication, which is not expected.

Temp fix: We added the application pool accounts to the local admin group on one of the servers for the time being [to stop the application pools from stopping with access issues]. Group policies need to be looked into.

We confirmed that even though Kerberos authentication is selected, we do not have any SPNs set, so authentication falls back to NTLM.

Every web application has 2 zones: one for NTLM (for crawling) and an extended one for Kerberos [for end users browsing].

On the BIG-IP [load balancer], we noticed that the NTLM-enabled zone was pointed to the APP servers (which do not serve user requests).

There are 2 WFEs:
WFE01 - server 1, the non-working server
WFE02 - server 2, the working server

On WFE01 the application pools were stopping, and users were repeatedly prompted for authentication while browsing.

To identify the cause of the issue, we need to focus on one particular operation, such as loading the site.

Here are the logs to collect and the steps to perform to isolate the issue:

  1. As an initial step, do a health check of the SharePoint servers.
  2. Check the URL, logon name and load time for the site the user reported.
  3. Does the issue disappear on the second load, or does it happen only for a period of time?
  4. Look at the last 5 days to understand in which time frames the issue was reported and whether it coincides with a scheduled activity. Focus on the periods when the largest number of users reported the same issue.
  5. Collect client-side logs using the IE developer tools or Fiddler.
  6. Collect VerboseEx logs for any client trace that successfully captured the issue.
  7. Add a manual hosts entry on the client machine and point it at each WFE in turn to see whether the reported URL loads slowly on a particular server.
  8. Check the database health report from the primary AG.
  9. Check the database nodes' performance and drive space, and whether any backups are running.
  10. You can also collect a memory dump if you find any application slowness.

Resolution:


On one of the web front-ends, the crashonauditfail registry key became engaged after the OS patching reboot, which left the application pools unable to start as a self-protection measure.

The crashonauditfail value was 2, which means the feature is engaged. It should be 1 at most; we set it to 0 per Microsoft’s suggestion. The change requires a reboot of the server to take effect.

Later, a group policy change forcibly disabling crashonauditfail was implemented. This will hopefully prevent the registry key from being enabled on reboot, which leaves the application pools unable to start unless the service account is in the local admin group.
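
To check the value on each web front-end before and after a reboot, something like the following could be used. It is a sketch under stated assumptions: the value is read from the Lsa key, which is where Windows normally keeps CrashOnAuditFail but is not stated in the notes above, and Get-CrashOnAuditFailMeaning is a hypothetical helper name.

```powershell
# Hypothetical helper: translate a CrashOnAuditFail value into its meaning.
function Get-CrashOnAuditFailMeaning {
    param([int]$Value)
    switch ($Value) {
        0 { 'Disabled' }
        1 { 'Enabled: the system halts if it cannot write security audit events' }
        2 { 'Triggered: only administrators can log on until the value is reset' }
        default { 'Unexpected value' }
    }
}

# Assumed location of the value; verify it on your own servers (Windows only).
$lsaKey = 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa'
if (Test-Path $lsaKey) {
    $value = (Get-ItemProperty -Path $lsaKey).CrashOnAuditFail
    "CrashOnAuditFail = $value ($(Get-CrashOnAuditFailMeaning -Value $value))"
}
```

A value of 2 is consistent with the symptom described above: accounts outside the local admin group cannot log on, so the application pools fail to start.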

Tuesday, 5 March 2019

SharePoint server service account is being locked out.

Issue Definition:
We are getting an Unauthorized error in the SharePoint logs. The credentials are passed by the Secure Store Service, and we were unable to find the user name for the service. From the event logs we could see that the account was locked out.

Temporary Resolution:
The account "searchserviceaccount" was locked. We involved the AD team to unlock the account, and then everything started working as expected.

Logs:
06/04/2018 14:43:44.66  w3wp.exe (0x3878)        0x2054  SharePoint Foundation Topology             e5mc     Medium                WcfSendRequest: RemoteAddress: 'https://app01:32122/7cbc7f8f4e3d42ebae9cfbc72a582cd9/SecureStoreService.svc/https' Channel: 'Microsoft.Office.SecureStoreService.Server.ISecureStoreServiceApplication' Action: 'http://schemas.microsoft.com/sharepoint/2009/06/securestoreservice/ISecureStoreServiceApplication/GetRestrictedCredentials' MessageId: 'urn:uuid:9c94020e-ca09-4972-a63f-2f8b775a668a'   99e46d9e-6a17-a08a-7953-a69226556ff7

06/04/2018 14:43:44.70  w3wp.exe (0x3878)        0x2054  Business Connectivity Services  Business Data    ahe2m  High                Web Exception : System.Net.WebException: The remote server returned an error: (401) Unauthorized.     at System.Net.HttpWebRequest.GetResponse()     at Microsoft.SharePoint.BusinessData.SystemSpecific.OData.ODataHttpClientRequestMessage.GetResponse()     at Microsoft.SharePoint.BusinessData.SystemSpecific.OData.ODataConnection.ExecuteRequest(IBCSODataRequest requestMsg)      99e46d9e-6a17-a08a-7953-a69226556ff7

To find out what is causing the account lockout, more extensive logging is required, so we followed the action plan below:

On the client machine, perform the following:

# Account Lockout and Management Tools
http://www.microsoft.com/downloads/details.aspx?familyid=7AF2E69C-91F3-4E63-8629-B999ADDE0B9E&displaylang=en

# Please also perform a clean boot on the problematic server (and monitor the issue afterwards)
  • Open a command prompt and type msconfig. On the Services tab, select "Hide all Microsoft services" and disable all third-party services.
  • It will ask you for a reboot.
  • Uninstall the antivirus.
If the client doesn’t lock out after performing the steps above, we can conclude that a third-party service or the antivirus is causing the issue.

# Please enable auditing on the primary domain controller in the Domain Controllers OU:

Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit account logon events - Failure
Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit account management - Success & Failure
Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit logon events - Failure

These are the events we need to look for:

4740 - account has been locked out
4771 - bad password attempts (Kerberos pre-authentication failed); we will look for the caller computer ID
4625 - account failed to log on
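
Once auditing is on, the three event IDs can be pulled from the Security log in one query. The sketch below assumes an elevated session on the domain controller. New-LockoutEventFilter is a hypothetical helper, and the 24-hour window is an arbitrary choice for illustration.

```powershell
# Event IDs from the list above and what each one indicates.
$lockoutEvents = @{
    4740 = 'Account locked out'
    4771 = 'Kerberos pre-authentication failed (bad password)'
    4625 = 'Account failed to log on'
}

# Hypothetical helper: build a Get-WinEvent filter covering the last N hours.
function New-LockoutEventFilter {
    param([int]$Hours = 24)
    @{
        LogName   = 'Security'
        Id        = [int[]]@($lockoutEvents.Keys)
        StartTime = (Get-Date).AddHours(-$Hours)
    }
}

# Run on a domain controller (Windows only, elevated) to list matching events.
if (Get-Command Get-WinEvent -ErrorAction SilentlyContinue) {
    Get-WinEvent -FilterHashtable (New-LockoutEventFilter -Hours 24) -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, MachineName, Message
}
```

The Message text of each event carries the account name and the machine the request came from, which is the detail needed for the next step.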

# With the originating machine identified, we can check it for the following:

a. Check whether any service is running under the locked domain account. If so, an old password may be cached on the client machine; re-configure the service's logon account with the new password. (The lockout is happening on the Search server as well as the app server. Since it is a service account, we have never changed the password.)

b. Check whether any network drive mapping on the client uses the locked account. If so, an old password may be cached on the client machine; re-configure the drive mapping with the new password. (This is not happening on a client machine; it is happening on the SharePoint server.)

c. Check whether there are any cached entries under “stored username and password” in Control Panel. If so, delete them.

d. Check whether any scheduled task on the client runs under the domain account or performs authentication. If so, remove the task temporarily. (We don’t see any task configured with this account, since it is our crawler account.)

e. Check whether any manually created script is running in the background on the client. If so, stop the process temporarily. (I don’t think we have any script running in the background.)

#Enable logs:

1) Enable netlogon logging on all the DCs:
Nltest /DBFlag:2080FFFF
https://support.microsoft.com/en-in/kb/109626
2) Enable auditing on all DCs:

  • Account logon - success and failure
  • Logon events - success and failure
  • Account management - success and failure

3) Use the lockout status tool to see which DC the bad passwords are being sent to.
4) Once the bad password count increases, review the logs of that DC to see where the bad passwords are coming from.
5) Once the source has been identified, enable auditing on that machine accordingly.

If the source machine is Windows 7 / Server 2008 R2 or later, run the following on all the servers:

auditpol /set /subcategory:"Kerberos Authentication Service" /failure:enable
auditpol /set /subcategory:"logon" /failure:enable
auditpol /set /subcategory:"Account Lockout" /success:enable /failure:enable
auditpol /set /subcategory:"User Account Management" /success:enable /failure:enable
auditpol /set /subcategory:"Credential Validation" /failure:enable
auditpol /set /subcategory:"Process Creation" /success:enable


Once you have collected the logs, disable auditing:

auditpol /set /subcategory:"Kerberos Authentication Service" /failure:disable
auditpol /set /subcategory:"logon" /failure:disable
auditpol /set /subcategory:"Account Lockout" /success:enable /failure:disable
auditpol /set /subcategory:"User Account Management" /success:enable /failure:disable
auditpol /set /subcategory:"Credential Validation" /failure:disable
auditpol /set /subcategory:"Process Creation" /success:disable

Root Cause:
The issue was traced to a mismatch in LmCompatibilityLevel on the DC side. The action plan is to match the setting across the domain and forest. If the issue persists after that is implemented on their end, we need the same compatibility level on the SharePoint servers, but that is challenging because applications that still rely on the old setting might break.

So the account lockouts come from an LmCompatibilityLevel mismatch.

Action plan and Fix:
Implement the setting on the DCs so that it is consistent across the board. This helps avoid inconsistent results when connecting to different domain controllers. The DC setting does not require a reboot, but it may take a short time to be picked up. Best practice is to set it at GPO level so that it is enforced across all DCs.

Here is a screenshot showing where to find the policy. This one sets the DC to 5.

This is the screenshot of the policy set to 3.

Action plan for the SharePoint servers:
If this does not work, you will need to set either a registry key or a group policy on the SharePoint servers. THIS WILL REQUIRE A REBOOT ON CLIENTS TO BE EFFECTIVE.

Microsoft did not recommend matching the settings on the servers the way you did on the DCs.

Here are a few links on LmCompatibilityLevel:

Lmcompatabilitylevel most misunderstood setting.

Security guidance for NTLMv1 and LM network authentication

simple breakdown of how ntlm works

Common sharepoint issues with NTLM authentication

We should not be running into this, but when you do a lot of NTLM authentication you may need to watch the MaxConcurrentApi setting.

Here are a few articles on MaxConcurrentApi:

Optimizing NTLM auth through multi-domain environment.

Performance tuning for ntlm auth using maxconcurrentapi

management pack

script to check for maxconcurrentapi issues

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters\
REG_DWORD MaxConcurrentApi
Value in decimal: 2 - 150

All DCs need to be at the same level, and it is a good idea to have trusts at the same level as well.

Member servers may need to be bumped up as well, but not as high as the domain controllers.
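
To compare the two settings across DCs and SharePoint servers, each machine can report its own values. This is a rough sketch. Get-RegistryValueOrNull is a hypothetical helper, the MaxConcurrentApi path is the one given above, and the Lsa path for LmCompatibilityLevel is the usual Windows location rather than something stated in these notes.

```powershell
# Hypothetical helper: read one registry value, returning $null when it is not set.
function Get-RegistryValueOrNull {
    param([string]$Path, [string]$Name)
    if (-not (Test-Path $Path)) { return $null }
    $item = Get-ItemProperty -Path $Path -ErrorAction SilentlyContinue
    if ($null -ne $item -and $item.PSObject.Properties[$Name]) { $item.$Name } else { $null }
}

# Report both settings for the local machine; an empty column means the value is not set.
[pscustomobject]@{
    Computer             = [Environment]::MachineName
    LmCompatibilityLevel = Get-RegistryValueOrNull -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' -Name 'LmCompatibilityLevel'
    MaxConcurrentApi     = Get-RegistryValueOrNull -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters' -Name 'MaxConcurrentApi'
}
```

Running it on every DC and SharePoint server and comparing the output side by side makes a mismatch easy to spot. A value enforced through group policy shows up in the same registry location.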