Wednesday, 6 March 2019

Collect memory dump if you find any application slowness

As part of troubleshooting SharePoint application slowness, we may need to collect memory dumps.

If a site is still slow after you add a manual hosts entry on the client machine to point at a specific WFE, collect memory dumps from that server. You can use Task Manager or the ProcDump tool to collect the dumps.

a. Download the ProcDump tool and extract it to a folder on the server that has enough free space (more than 10 GB).

b. Open a command prompt and change directory to the ProcDump location.

c. Start loading the slow URL and, at the same time, run the command below at the command prompt on the server. You can collect 2-3 dumps while loading the same site for comparison.

procdump -ma PID

Here PID is the process ID of the w3wp.exe worker process that serves the affected site.
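As a minimal sketch of step c, the helper below builds a single ProcDump command line that writes several full dumps a few seconds apart, instead of running the command by hand 2-3 times. The function name, the default count and the interval are assumptions for illustration; `-n` (number of dumps) and `-s` (seconds between dumps) are standard ProcDump switches, but check them against the version you downloaded.

```python
def build_procdump_command(pid, dumps=3, seconds_between=10):
    """Return the argument list for ProcDump to write `dumps` full dumps of `pid`.

    Hypothetical helper: the defaults are illustrative, not from the original post.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive process ID")
    if dumps < 1:
        raise ValueError("dumps must be at least 1")
    return [
        "procdump",
        "-ma",                      # full memory dump, as in the command above
        "-s", str(seconds_between), # seconds to wait between dumps
        "-n", str(dumps),           # number of dumps to write before exiting
        str(pid),
    ]


if __name__ == "__main__":
    # 1234 is a placeholder PID for the w3wp.exe process of the slow site.
    print(" ".join(build_procdump_command(1234)))
```

The printed line can be pasted into the command prompt opened in step b while the slow page is loading.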

SharePoint application pool stops automatically

Issue Definition:
We have multiple web applications in our farm, and all of them were intermittently affected: a site would sometimes take more than 2 minutes to load, or would time out with an error message.

There can be multiple reasons for site slowness, but the fix we ended up applying was not one we expected.

Troubleshooting done:
This performance issue is very intermittent.
After the reboot, we noticed that all SharePoint application pools were stopping intermittently and users were being prompted for authentication, which is not expected.

Temp fix: We added the application pool accounts to the local admin group on one of the servers for the time being (to resolve the application pools stopping with access issues). Group policies need to be looked into.

We did confirm that even though we have so called Kerberos authentication selected, we do not have any SPNs set. So basically authentication falls back to NTLM

Every web application has 2 zones – one for NTLM (for crawling) and extended one for Kerberos [for end users browsing]

On the BIG-IP [load balancer], we noticed that the NTLM-enabled zone was pointed to the APP servers (which do not cater to user requests).

There are 2 WFEs:
WFE01 - non-working server
WFE02 - working server

On WFE01, the application pools were stopping and users were repeatedly prompted for authentication while browsing.

To identify the cause, we need to focus on one particular operation, such as loading the site.

Here are the checks to perform and the logs to collect to isolate this issue:

  1. As an initial step, do a health check of the SharePoint servers.
  2. Note the URL, logon name and load time for the site the user reported.
  3. Does the issue disappear on the second load, or does it only happen for a period of time?
  4. Review the last 5 days to see at what times the issue was reported, in case it is caused by a scheduled activity. Focus on the periods when the largest number of users reported the same issue.
  5. Collect client-side logs using the IE developer tools or Fiddler.
  6. Collect VerboseEx logs for any client trace that successfully reproduces the issue.
  7. Add a manual hosts entry on the client machine and point it to each WFE in turn, to see whether the reported URL loads slowly only through a particular server.
  8. Check the database health report from the primary AG.
  9. Check the database nodes for performance, drive space and any running backups.
  10. You can also collect a memory dump if you find any application slowness.

Resolution:


During the reboot for OS patching, the crashonauditfail registry value became engaged on one of the web front ends. As a self-protection measure, this prevented the application pools from starting.

The crashonauditfail value was 2, which means the setting is engaged; it should be 1 rather than 2. We set it to 0 per Microsoft's suggestion. The change requires a reboot of the server to take effect.

Later, a group policy change that forcibly disables crashonauditfail was implemented. This should prevent the registry value from being enabled again on reboot, which would leave the application pools unable to start unless the service account is in the local admin group.
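To make the three values easier to remember, here is a small sketch that maps a crashonauditfail value to its meaning. The descriptions follow Microsoft's documentation for the CrashOnAuditFail value under HKLM\SYSTEM\CurrentControlSet\Control\Lsa, as far as I know, and are not taken from this incident. The function names are hypothetical, and the value is passed in rather than read from the registry so the sketch runs anywhere.

```python
# Meaning of each CrashOnAuditFail value (assumption: per Microsoft's docs).
CRASH_ON_AUDIT_FAIL_STATES = {
    0: "disabled",
    1: "enabled: the system halts if it cannot write a security audit event",
    2: "triggered: only administrators can log on until the value is reset",
}


def describe_crash_on_audit_fail(value):
    """Return a human-readable description for a CrashOnAuditFail value."""
    if value not in CRASH_ON_AUDIT_FAIL_STATES:
        raise ValueError("unexpected CrashOnAuditFail value: %r" % (value,))
    return CRASH_ON_AUDIT_FAIL_STATES[value]


def non_admin_logons_blocked(value):
    """True when the setting has been triggered (value 2)."""
    return value == 2


if __name__ == "__main__":
    for value in (0, 1, 2):
        print(value, "-", describe_crash_on_audit_fail(value))
```

Value 2 also explains why the temporary fix worked: once the application pool accounts were local administrators, they could log on again.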

Tuesday, 5 March 2019

SharePoint server service account is being locked out.

Issue Definition:
We are getting an Unauthorized error in the SharePoint logs. The credentials are passed by the Secure Store Service, and we are unable to find the user name for the service. From the event logs, we could see that the account is locked out.

Temporary Resolution:
The account "searchserviceaccount" was locked and we involved the AD team to unlock the account and then everything started working as expected.

Logs:
06/04/2018 14:43:44.66  w3wp.exe (0x3878)        0x2054  SharePoint Foundation Topology             e5mc     Medium                WcfSendRequest: RemoteAddress: 'https://app01:32122/7cbc7f8f4e3d42ebae9cfbc72a582cd9/SecureStoreService.svc/https' Channel: 'Microsoft.Office.SecureStoreService.Server.ISecureStoreServiceApplication' Action: 'http://schemas.microsoft.com/sharepoint/2009/06/securestoreservice/ISecureStoreServiceApplication/GetRestrictedCredentials' MessageId: 'urn:uuid:9c94020e-ca09-4972-a63f-2f8b775a668a'   99e46d9e-6a17-a08a-7953-a69226556ff7

06/04/2018 14:43:44.70  w3wp.exe (0x3878)        0x2054  Business Connectivity Services  Business Data    ahe2m  High                Web Exception : System.Net.WebException: The remote server returned an error: (401) Unauthorized.     at System.Net.HttpWebRequest.GetResponse()     at Microsoft.SharePoint.BusinessData.SystemSpecific.OData.ODataHttpClientRequestMessage.GetResponse()     at Microsoft.SharePoint.BusinessData.SystemSpecific.OData.ODataConnection.ExecuteRequest(IBCSODataRequest requestMsg)      99e46d9e-6a17-a08a-7953-a69226556ff7
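The two entries above belong to the same request, which is visible from the correlation ID at the end of each line. A minimal sketch of how to group ULS entries by that ID follows. It assumes the layout shown above (timestamp first, correlation ID last); the function names are hypothetical, and the sample lines are shortened copies of the entries above.

```python
import re
from collections import defaultdict

# Timestamp at the start of a ULS line, e.g. 06/04/2018 14:43:44.66
_TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2})")
# Correlation ID: the GUID that closes the line
_CORRELATION = re.compile(
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s*$", re.IGNORECASE
)


def parse_uls_line(line):
    """Return (timestamp, correlation_id), or None if the line does not match."""
    ts = _TIMESTAMP.match(line)
    corr = _CORRELATION.search(line)
    if not ts or not corr:
        return None
    return ts.group(1), corr.group(1).lower()


def group_by_correlation(lines):
    """Map each correlation ID to the timestamps of its log entries, in order."""
    groups = defaultdict(list)
    for line in lines:
        parsed = parse_uls_line(line.strip())
        if parsed:
            timestamp, correlation_id = parsed
            groups[correlation_id].append(timestamp)
    return dict(groups)


SAMPLE = [
    "06/04/2018 14:43:44.66  w3wp.exe (0x3878)  0x2054  SharePoint Foundation Topology  e5mc  Medium  "
    "WcfSendRequest: MessageId: 'urn:uuid:9c94020e-ca09-4972-a63f-2f8b775a668a'   99e46d9e-6a17-a08a-7953-a69226556ff7",
    "06/04/2018 14:43:44.70  w3wp.exe (0x3878)  0x2054  Business Connectivity Services  ahe2m  High  "
    "Web Exception : System.Net.WebException: (401) Unauthorized.      99e46d9e-6a17-a08a-7953-a69226556ff7",
]

if __name__ == "__main__":
    for correlation_id, stamps in group_by_correlation(SAMPLE).items():
        print(correlation_id, stamps)
```

Anchoring the GUID pattern to the end of the line keeps the MessageId GUID inside the first entry from being mistaken for the correlation ID.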

To dig into what is causing the account lockout, an extensive set of logging is required, so we followed the action plan below.

On the client machine, perform the following:

# Account Lockout and Management Tools
http://www.microsoft.com/downloads/details.aspx?familyid=7AF2E69C-91F3-4E63-8629-B999ADDE0B9E&displaylang=en

# Please also perform a clean boot on the problematic server (monitor the issue after performing the steps below)
  • Open a command prompt and type msconfig. Go to Services, select Hide all Microsoft services, then disable all third-party services.
  • It will ask you for a reboot.
  • Uninstall the antivirus.
If the lockout no longer occurs after performing these steps, we can conclude that a third-party service or the antivirus is causing the issue.

# Please enable auditing on the primary domain controller in the Domain Controllers OU:

Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit account logon events - Failure
Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit account management - Success & Failure
Computer Configuration\Windows Settings\Security Settings\Local Policies\Audit Policy\Audit logon events - Failure

Below are the events that we need to look for:

4740 - account has been locked out
4771 - bad password attempts; we will look for the caller computer ID
4625 - account failed to log on
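Once these events have been exported from the DC, counting them per caller machine usually points to the source quickly. The sketch below assumes the events have already been exported into simple records with `id`, `account` and `caller` fields. That record shape, the function name and the sample machine names are hypothetical, not a real export format.

```python
from collections import Counter

# Event IDs listed above and what they indicate.
LOCKOUT_EVENT_IDS = {
    4740: "account has been locked out",
    4771: "bad password attempt (Kerberos pre-authentication failed)",
    4625: "account failed to log on",
}


def summarize_lockout_events(events, account):
    """Count relevant events per (caller, event id) for one account.

    `events` is an iterable of dicts with 'id', 'account' and 'caller' keys
    (a hypothetical export shape used only for this sketch).
    """
    per_caller = Counter()
    for event in events:
        if event["id"] not in LOCKOUT_EVENT_IDS:
            continue
        if event["account"].lower() != account.lower():
            continue
        per_caller[(event["caller"], event["id"])] += 1
    return per_caller


if __name__ == "__main__":
    sample = [
        {"id": 4771, "account": "searchserviceaccount", "caller": "APP01"},
        {"id": 4771, "account": "searchserviceaccount", "caller": "APP01"},
        {"id": 4740, "account": "searchserviceaccount", "caller": "APP01"},
        {"id": 4624, "account": "searchserviceaccount", "caller": "WFE02"},
    ]
    for (caller, event_id), count in summarize_lockout_events(sample, "searchserviceaccount").items():
        print(caller, event_id, LOCKOUT_EVENT_IDS[event_id], count)
```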

# With the originating machine identified, we can check it for the following:

a. Check whether any service is running under the locked domain account. If so, an old password may be cached on the machine. Re-configure the service logon account with the new password. (The lockout is happening on the Search server as well as the app server; since it is a service account, we never changed the password.)

b. Check whether any network drive mapping on the client uses the locked account. If so, an old password may be cached on the client machine. Re-configure the drive mapping with the new password. (This is not happening on a client machine; it is happening on the SharePoint server.)

c. Check whether there are any cached entries in "Stored User Names and Passwords" in Control Panel. If so, delete them.

d. Check whether any scheduled task on the client is launched with the domain account or performs authentication. If so, remove the task temporarily. (We don't see any task configured with this account, since it is our crawler account.)

e. Check whether any manually created script is running in the background on the client. If so, stop the process temporarily. (We don't think we have any script running in the background.)

#Enable logs:

1) Enable Netlogon logging on all the DCs:
Nltest /DBFlag:2080FFFF
https://support.microsoft.com/en-in/kb/109626

2) Enable auditing on all DCs:

  • Account logon - success and failure
  • Logon events - success and failure
  • Account management - success and failure

3) Use the lockout status tool to see which DC the bad passwords are being sent to.
4) Once the bad password count increases, review the logs of that DC to check where the bad passwords are coming from.
5) Once the source has been identified, enable auditing on that machine accordingly.

If the source machine is Windows 7 / Server 2008 R2 or later, run the following on all the servers:

auditpol /set /subcategory:"Kerberos Authentication Service" /failure:enable
auditpol /set /subcategory:"logon" /failure:enable
auditpol /set /subcategory:"Account Lockout" /success:enable /failure:enable
auditpol /set /subcategory:"User Account Management" /success:enable /failure:enable
auditpol /set /subcategory:"Credential Validation" /failure:enable
auditpol /set /subcategory:"Process Creation" /success:enable


Once you collect the logs disable auditing:

auditpol /set /subcategory:"Kerberos Authentication Service" /failure:disable
auditpol /set /subcategory:"logon" /failure:disable
auditpol /set /subcategory:"Account Lockout" /success:enable /failure:disable
auditpol /set /subcategory:"User Account Management" /success:enable /failure:disable
auditpol /set /subcategory:"Credential Validation" /failure:disable
auditpol /set /subcategory:"Process Creation" /success:disable

Root Cause:
The account lockouts were traced to a mismatch in LmCompatibilityLevel across the domain controllers. The action plan is to make the setting consistent across the domain and forest. If the issue persists after that, the same compatibility level needs to be set on the SharePoint servers, but this carries a risk: applications that still rely on the older level might break.

Action plan and Fix:
Set LmCompatibilityLevel consistently on all DCs. This helps avoid inconsistent results when connecting to different domain controllers. The DC setting does not require a reboot, but it may take a short time to be picked up. The best practice is to set it through a GPO so that it is enforced on all DCs.

[Screenshot: where to find the policy; this one sets the DC to level 5]

[Screenshot: the policy set to level 3]

Action plan for the SharePoint servers:
If this does not work you will need to set either a registry key or group policy on the SharePoint servers.  THIS WILL REQUIRE A REBOOT ON CLIENTS TO BE EFFECTIVE. 

Microsoft does not recommend matching the settings on the servers the way you did on the DCs.

Here are a few links on LmCompatibilityLevel:

Lmcompatabilitylevel most misunderstood setting.

Security guidance for NTLMv1 and LM network authentication

simple breakdown of how ntlm works

Common sharepoint issues with NTLM authentication

We should not be running into this, but when you do a lot of NTLM auth you may need to watch out for the MaxConcurrentApi setting.

Here are a few articles on MaxConcurrentApi:

Optimizing NTLM auth through multi-domain environment.

Performance tuning for ntlm auth using maxconcurrentapi

management pack

script to check for maxconcurrentapi issues

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters\
REG_DWORD: MaxConcurrentApi
Value (decimal): 2 - 150

All DCs need to be at the same level, and it is a good idea to have the trusts at the same level as well.

Member servers may need to be bumped up as well, but not as high as the domain controllers.
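For sizing the value, Microsoft's tuning guidance for MaxConcurrentApi gives, as far as I recall, the formula (semaphore acquires + semaphore timeouts) x average semaphore hold time / length of the collection window. The sketch below applies that formula and clamps the result to the 2 - 150 range quoted above. Treat the formula as an assumption to verify against the articles listed earlier; the function name and the sample counter values are hypothetical.

```python
import math


def suggested_max_concurrent_api(acquires, timeouts, avg_hold_seconds,
                                 window_seconds, low=2, high=150):
    """Estimate MaxConcurrentApi from Netlogon semaphore counters.

    Assumed formula (verify against Microsoft's guidance):
        (acquires + timeouts) * avg_hold_seconds / window_seconds
    The result is rounded up and clamped to the range [low, high].
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    raw = (acquires + timeouts) * avg_hold_seconds / window_seconds
    return max(low, min(high, math.ceil(raw)))


if __name__ == "__main__":
    # Illustrative counter values collected over a 90-second window.
    print(suggested_max_concurrent_api(8286, 883, 0.5, 90))
```

As noted above, domain controllers and member servers will generally end up with different values, so run the calculation per server role.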

Search results are not showing up on external application

Issue Definition:

We are having issues with getting search results back from the Intranet zone of the Prod_alispasset web application which has a URL of https://alispasset.sharepoint.com and is using ADFS authentication. The results will display on the same web application when it is accessed using its default zone URL https://spalispasset.sp.com which uses NTLM authentication.

Symptoms:
  1. The SPSiteURL managed property always shows the default zone URL, irrespective of the AAM zone URL used for search.
  2. Search results work fine for existing files with the default zone URL (NTLM).
  3. Newly uploaded files are being crawled, and the crawl logs show success.
  4. The new results appear in the default search but not in the custom search.
  5. We confirmed that we get results using osssearchresults.aspx on each and every one of them.
  6. We noticed that the queries sent when searching with osssearchresults.aspx and with search.aspx are different.
  7. Once the user is in session, bypassing Akamai and F5 with a hosts file entry for the intranet zone does yield search results, but the URLs are from the default zone AAM.
Root cause:
In late 2017 Microsoft pushed a feature called AAM caching from SPO. The main benefit of AAM caching is increased query performance. Unfortunately, in publishing\consuming scenarios where the AAMs are not present, this causes the AAMs to be dropped.

Now, a few caveats (please pay attention):
  • First, this technically isn't a bug. It was an intentional design change to improve the performance of SharePoint. From the affected customer's point of view, the use of the word 'bug' may not matter, but sometimes it helps to know that MS did this intentionally.
  • The Microsoft Product Group is working on a code fix and on whether it can be applied separately to publishing farms or delivered as a hotfix.
  • We can use a PowerShell script to populate the URL Mapping Cache in the search farm, but we CANNOT prevent the cache from being cleared again. As per MS, the cache clears every 15 days, or possibly sooner, so we may see this problem again in 1 day or in 15 days.
  • The fix will be the same each time, and you can script it to run nightly\weekly\etc. if you'd rather not test how often the issue occurs.

Following are the PowerShell commands we ran to populate the URL Mapping Cache in the search farm:

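# Load the SharePoint snap-in (not needed inside the SharePoint Management Shell)
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Uncomment if the UrlMapping type cannot be found:
#Add-Type -Path "C:\Program Files\Common Files\microsoft shared\Web Server Extensions\15\ISAPI\Microsoft.Office.Server.Search.dll"

# Get the existing URL mappings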
$ssa = Get-SPEnterpriseSearchServiceApplication SP2013SSA

$tenantId = [Guid]::Empty

$zoneId = 1

$ssa.GetUrlMapping($tenantid,$zoneId).ForwardUrlMapping

$ssa.GetUrlMapping($tenantid,$zoneId).ReverseUrlMapping

#Set the URL mappings, including the existing ones. This command overwrites existing entries, so we need to include all the existing URL mappings returned by the commands above. The values below reflect the mappings on the day this was written; they may vary, so update them accordingly.

#Populate forward mappings variable
$ssa = Get-SPEnterpriseSearchServiceApplication

$tenantId = [Guid]::Empty

$zoneId = 1 # Intranet zone

$forwardMappings = New-Object 'system.collections.generic.dictionary[string,string]'

$forwardMappings.Add("https://spalispasset.sp.com","https://alispasset.sharepoint.com")

$forwardMappings.Add("https://spcontoso.sp.com","https://contoso.sharepoint.com")

$forwardMappings.Add("https://spwww.sp-america.sp.com","https://www.sp-america.com")

$forwardMappings.Add("https://spwww.sp-london.sp.com","https://www.sp-london.com")

#Populate reverse mappings variable
$reverseMappings = New-Object 'system.collections.generic.dictionary[string,string]'

$reverseMappings.Add("https://alispasset.sharepoint.com","https://spalispasset.sp.com")

$reverseMappings.Add("https://contoso.sharepoint.com","https://spcontoso.sp.com")

$reverseMappings.Add("https://www.sp-america.com","https://spwww.sp-america.sp.com")

$reverseMappings.Add("https://www.sp-london.com","https://spwww.sp-london.sp.com")

#Update URL mappings
$urlMapping = New-Object 'Microsoft.Office.Server.Search.Query.UrlMapping.UrlMapping'

$urlMapping.ForwardUrlMapping=$forwardMappings

$urlMapping.ReverseUrlMapping=$reverseMappings

$ssa.UpdateUrlMapping($tenantId,$zoneId,$urlMapping)

#To update the second zone:
$ssa = Get-SPEnterpriseSearchServiceApplication

$tenantId = [Guid]::Empty

$zoneId = 2 # Internet zone

$forwardMappings = New-Object 'system.collections.generic.dictionary[string,string]'

$forwardMappings.Add("https://spalispasset.sp.com","https://public.alispasset.sharepoint.com")

$forwardMappings.Add("https://spwww.sp-canada.sp.com","https://www.canada.com")

$urlMapping = New-Object 'Microsoft.Office.Server.Search.Query.UrlMapping.UrlMapping'

$urlMapping.ForwardUrlMapping=$forwardMappings

$ssa.UpdateUrlMapping($tenantId,$zoneId,$urlMapping)
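In the script above, every reverse mapping is the forward mapping with its two URLs swapped. When the list of sites grows, it is easy for the two dictionaries to drift apart. The sketch below shows one way to derive the reverse entries from the forward ones and to catch duplicates before they are pasted into the PowerShell script. It is an offline helper under that assumption, not part of the SharePoint API, and the function name is hypothetical.

```python
def build_reverse_mappings(forward):
    """Swap each default-zone URL / zone URL pair of a forward mapping.

    Raises ValueError if two default-zone URLs map to the same zone URL,
    because the reverse mapping would then be ambiguous.
    """
    reverse = {}
    for default_url, zone_url in forward.items():
        if zone_url in reverse:
            raise ValueError("duplicate zone URL: " + zone_url)
        reverse[zone_url] = default_url
    return reverse


# Forward mappings for the Intranet zone, copied from the script above.
FORWARD_INTRANET = {
    "https://spalispasset.sp.com": "https://alispasset.sharepoint.com",
    "https://spcontoso.sp.com": "https://contoso.sharepoint.com",
    "https://spwww.sp-america.sp.com": "https://www.sp-america.com",
    "https://spwww.sp-london.sp.com": "https://www.sp-london.com",
}

if __name__ == "__main__":
    for zone_url, default_url in build_reverse_mappings(FORWARD_INTRANET).items():
        # Emit the line to paste into the PowerShell script.
        print('$reverseMappings.Add("%s","%s")' % (zone_url, default_url))
```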