Distributed Cache errors in the ULS log


Summary

We noticed a large number of Distributed Cache errors in the ULS log: 3,670 of the errors shown below within 30 minutes.

Issue

Out of the box, AppFabric 1.1 contains a bug with garbage collection. AppFabric 1.1 is a prerequisite for SharePoint 2013 as it is the underlying technology used by the Distributed Cache service.

Affects

SharePoint Server 2013 + March Public Update

Symptoms

Due to the bug, some requests to Distributed Cache time out. In our case, users authenticated to a SharePoint site using forms-based authentication were unexpectedly logged out because the check for their logon token timed out. In addition, requests to the search cache timed out after three seconds, which increased the time needed to load search results.

A review of the ULS logs showed a number of Distributed Cache exceptions:

Unexpected error occurred in method ‘GetObject’ , usage ‘SPViewStateCache’ – Exception ‘Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out.. Additional Information : The client was trying to communicate with the server : net.tcp://contoso.com:22233 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody) at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener) at Microsoft.ApplicationServer.Caching.DataCache.<>c_DisplayClass49.b_48() at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)’. e7a6759c-378f-40e7-26a8-be00a48fcde1

Token Cache: Failed to get token from distributed cache for ‘0#.f|provider|username’.(This is expected during the process warm up or if data cache Initialization is getting done by some other thread).
Exception: ‘Microsoft.SharePoint.DistributedCaching.SPDistributedCacheClientRequestTimeOutException: Communications with the cache cluster has experienced a delay past the timeout value,please increase the RequestTimeout of the client. —> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://contoso.com:22233
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key) –
— End of inner exception stack trace —
at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)
at Microsoft.SharePoint.IdentityModel.SPDistributedSecurityTokenCache.GetObject(String key)
at Microsoft.SharePoint.IdentityModel.SPTokenCache.TryGetCachedToken(String cacheKey)’.

Unexpected error occurred in method ‘GetObject’ , usage ‘Distributed Logon Token Cache’ – Exception ‘Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server :

DistributedSearchResultsCache::Get() – Failed due to exception = ‘Microsoft.Office.Server.DistributedCaching.SPDistributedCacheClusterDownException: Cache cluster is down, restart the cache cluster and Retry —> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server
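
To gauge how frequent these exceptions are, a rough count over a recent window can be pulled from ULS with Get-SPLogEvent. This is only a minimal sketch, assuming it is run in the SharePoint 2013 Management Shell on a farm server; the 30-minute window and the text pattern matched against the message are illustrative choices, and the variable names are hypothetical.

```powershell
# Sketch: count recent ULS entries that mention a Distributed Cache exception.
# Assumptions: run on a SharePoint 2013 server; window and pattern are illustrative.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$start = (Get-Date).AddMinutes(-30)

# Get-SPLogEvent reads the ULS trace logs of the local server only.
$cacheErrors = Get-SPLogEvent -StartTime $start |
    Where-Object { $_.Message -like "*DataCacheException*" }

# Total number of matching entries in the window.
($cacheErrors | Measure-Object).Count

# Break the matches down by ULS category to see which component reports them.
$cacheErrors | Group-Object Category | Sort-Object Count -Descending |
    Select-Object Count, Name
```

Because the cmdlet only reads the local server's logs, it would need to be run on each web front end to get a farm-wide picture.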

Resolution
  1. Apply AppFabric Cumulative Update 3, AppFabric Cumulative Update 4, or a later AppFabric CU to all servers in the farm
  2. Add the backgroundGC key to the DistributedCacheService.exe.config file on all cache servers
  3. Restart the AppFabric Windows service on all cache servers
  4. Restart the Distributed Cache SharePoint service on all cache servers
  5. Reset IIS (IISRESET) on all servers in the farm
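
For step 2, the key goes in an appSettings section of DistributedCacheService.exe.config. The fragment below is a sketch that assumes the file has no appSettings element yet (if it does, add only the inner add line). As far as I understand the Microsoft KB article listed in the references, the section should sit directly after the closing configSections tag.

```xml
<!-- Place immediately after </configSections> in DistributedCacheService.exe.config -->
<appSettings>
  <!-- Enables background garbage collection for the cache service -->
  <add key="backgroundGC" value="true" />
</appSettings>
```

The setting is only read at startup, which is why the service restarts in steps 3 and 4 follow it.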

If the issue persists, you may need to increase timeout and connection values:

  1. Increase the Distributed Cache client settings for the affected containers using the Set-SPDistributedCacheClientSetting cmdlet
  2. Increase the security token service cache values on the object returned by Get-SPSecurityTokenServiceConfig
  3. Restart AppFabric and Distributed Cache on the cache servers
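
The first two steps could look roughly like the sketch below. It assumes the logon token container is the one timing out (as in the errors above) and that the March 2013 Public Update is installed, since that is where these cmdlets come from. All numeric values are illustrative starting points rather than recommendations from Microsoft, and the variable names are hypothetical.

```powershell
# Sketch: raise client timeouts for one cache container and enlarge the STS token caches.
# Assumption: all numbers below are illustrative and should be tuned for your farm.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Step 1: read, adjust, and write back the client settings for the affected container.
$containerType = "DistributedLogonTokenCache"
$settings = Get-SPDistributedCacheClientSetting -ContainerType $containerType
$settings.RequestTimeout = 3000          # milliseconds
$settings.ChannelOpenTimeOut = 3000      # milliseconds
$settings.MaxConnectionsToServer = 100
Set-SPDistributedCacheClientSetting -ContainerType $containerType -DistributedCacheClientSettings $settings

# Step 2: allow the security token service to keep more tokens in its caches.
$sts = Get-SPSecurityTokenServiceConfig
$sts.MaxLogonTokenCacheItems = 2000
$sts.MaxServiceTokenCacheItems = 2000
$sts.Update()
```

The same read-modify-write pattern applies to other containers, for example DistributedViewStateCache or DistributedSearchCache, if their usages show up in the ULS errors.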

 

References:

https://www.habaneroconsulting.com/insights/SharePoint-2013-Distributed-Cache-Bug

http://support.microsoft.com/kb/2800726/en-us

http://msdn.microsoft.com/en-us/library/hh351248(v=azure.10).aspx

Optimizing SharePoint 2013 Server Performance – Development Server (single server)


A couple of new services were introduced with SharePoint 2013, and they raised the hardware resource requirements. Let's look only at those processes, and at the steps for controlling their resource consumption on a single-server SharePoint 2013 installation.

  • NodeRunner service
  • Distributed Cache Service
  • Number of web applications
NodeRunner service
  1. Use Set-SPEnterpriseSearchService -PerformanceLevel Reduced to reduce the CPU impact the search service has on your test environment.
  2. Modify C:\Program Files\Microsoft Office Servers\15.0\Search\Runtime\1.0\noderunner.exe.config so that each NodeRunner instance can only consume a fixed amount of RAM.
    Change the memory limit value in that file to the amount of RAM you want to allow, to contain the memory leak; perhaps 250 MB per instance of this service.
Even with this 250 MB limit I experienced some NodeRunner crashes. The general advice is NOT to set the NodeRunner memory limit below 250 MB. And NEVER EVER do this in a production environment!
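
For reference, the setting in question is, to the best of my recollection, the memoryLimitMegabytes attribute on the nodeRunnerSettings element, where a value of 0 means no limit. Treat the element name and the default as an assumption and check them against your own noderunner.exe.config before editing; the fragment is a sketch for dev/test machines only.

```xml
<!-- Assumed shape of the setting in noderunner.exe.config; verify against your file. -->
<!-- Dev/test only: 250 caps each NodeRunner instance at roughly 250 MB. -->
<nodeRunnerSettings memoryLimitMegabytes="250" />
```

Setting the value back to what the file originally contained restores the shipped behaviour.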
Some pain points of the above modifications:
  • Changing this configuration file is not supported. For test/dev deployments it may have the desired effect on memory usage if you are running with less memory than the recommended minimum.
  • It may reduce the initial allocation of memory, but if the application requires more memory than this limit, it will crash. Hence, do not make such a change on a production deployment.
  • You may see errors like: Unable to connect to system client with derived management URIs. Exception: Failed to connect to system manager. Microsoft.Office.Server.Search.Administration.Topology.ApplicationAdminLayer.Reconnect() c80fcf9b-cf6b-2083-a27f-5d57c7dc4ef3. Deeper analysis of the ULS logs showed that the DBConnector created by the NodeRunner process threw an OutOfMemory exception. Removing the memory restriction from noderunner.exe.config and rebooting the server allowed me to submit the topology change.
Distributed Cache Service
SharePoint 2013 adds a new caching service called ‘Distributed Cache Service’, which is built on Windows Server AppFabric Distributed Cache. Many features rely on this service to store data for fast retrieval. It is used by services and features such as the Authentication Token Cache, microblogging, and My Site social feeds.
How to stop this service
This service can be managed from the ‘Services on Server’ page in Central Administration, where it can be started and stopped.
Allocate Less Memory
By default, when SharePoint 2013 preview is installed, the Distributed Cache Service’s memory allocation is set to 10 percent of the total physical memory. We can change the memory allocation for this service with the PowerShell cmdlets below.
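# Run in the SharePoint 2013 Management Shell on the cache host.
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? { ($_.Service.ToString()) -eq $instanceName -and ($_.Server.Name) -eq $env:COMPUTERNAME }
# Stop the Distributed Cache service instance on this server.
$serviceInstance.Unprovision()
# Replace <hostname> and <size in MB> with your cache host name and the desired cache size.
Set-CacheHostConfig -HostName <hostname> -CachePort 22233 -CacheSize <size in MB>
# Start the service instance again.
$serviceInstance.Provision()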
The above cmdlets stop the caching service, set its memory allocation to the specified number of megabytes (MB), and then start the caching service again.
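To confirm that the new size took effect, the cache host configuration can be read back. This is a minimal sketch, assuming it is run on the cache host in a session where the AppFabric caching cmdlets are available (as they are for Set-CacheHostConfig above). It also assumes the host is registered in the cluster under the local computer name; if it is registered under a fully qualified name, use that instead.

```powershell
# Sketch: read back the cache host configuration after the change.
# Assumption: the cache host name matches $env:COMPUTERNAME; adjust if it is an FQDN.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Connect this session to the cache cluster configuration.
Use-CacheCluster

# The Size property in the output is the cache memory allocation in MB.
Get-CacheHostConfig -HostName $env:COMPUTERNAME -CachePort 22233

# Lists the cache hosts and whether each caching service is up.
Get-CacheHost
```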
Web Applications
As the number of web applications on a single-server SharePoint development box grows, the number of application pools tends to grow with it. This is not the case when multiple web applications share one application pool, but if we do not pay attention while creating a new web application, we end up creating a new application pool as well. Each application pool runs in its own memory space, in its own w3wp.exe worker process, and therefore consumes additional RAM. Because these development servers also run Visual Studio and SQL Server on the same box, we need to keep in mind how much memory is available to each application and service.
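A quick way to see how many application pools are in play, and what their worker processes cost, is sketched below. It assumes it is run on the SharePoint server in an elevated SharePoint 2013 Management Shell; the column names AppPool and WorkingSetMB are just illustrative labels.

```powershell
# Sketch: list web applications by application pool, then show worker process memory.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Web applications and the application pool each one runs in.
Get-SPWebApplication -IncludeCentralAdministration |
    Select-Object DisplayName, @{ Name = "AppPool"; Expression = { $_.ApplicationPool.Name } } |
    Sort-Object AppPool

# Memory currently used by each IIS worker process (one per running application pool).
Get-Process w3wp -ErrorAction SilentlyContinue |
    Select-Object Id, @{ Name = "WorkingSetMB"; Expression = { [math]::Round($_.WorkingSet64 / 1MB) } }
```

Web applications that show the same AppPool value share a worker process, which is the cheaper arrangement on a memory-constrained development box.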
This was my attempt to shed some light on how we can restrict the memory usage of these services and still have a server that runs well with 6-8 GB of RAM.