Summary
We noticed a ton of Distributed Cache errors in the ULS logs: 3,670 occurrences of the errors listed below within 30 minutes.
Issue
Out of the box, AppFabric 1.1 contains a garbage collection bug. AppFabric 1.1 is a prerequisite for SharePoint 2013 because it is the underlying technology of the Distributed Cache service.
Affects
SharePoint Server 2013 + March Public Update
Symptoms
Due to the bug, some requests to the Distributed Cache time out. In our case, users authenticated to a SharePoint site with forms-based authentication were unexpectedly logged out because the lookup of their logon token timed out. Requests to the search results cache also timed out after three seconds, which increased the time needed to load search results.
A review of the ULS logs showed a number of Distributed Cache exceptions:
Unexpected error occurred in method 'GetObject' , usage 'SPViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://contoso.com:22233
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)'. e7a6759c-378f-40e7-26a8-be00a48fcde1
Token Cache: Failed to get token from distributed cache for '0#.f|provider|username'.(This is expected during the process warm up or if data cache Initialization is getting done by some other thread).
Exception: 'Microsoft.SharePoint.DistributedCaching.SPDistributedCacheClientRequestTimeOutException: Communications with the cache cluster has experienced a delay past the timeout value,please increase the RequestTimeout of the client. ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://contoso.com:22233
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)
--- End of inner exception stack trace ---
at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)
at Microsoft.SharePoint.IdentityModel.SPDistributedSecurityTokenCache.GetObject(String key)
at Microsoft.SharePoint.IdentityModel.SPTokenCache.TryGetCachedToken(String cacheKey)'.
Unexpected error occurred in method 'GetObject' , usage 'Distributed Logon Token Cache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server :
DistributedSearchResultsCache::Get() - Failed due to exception = 'Microsoft.Office.Server.DistributedCaching.SPDistributedCacheClusterDownException: Cache cluster is down, restart the cache cluster and Retry ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server
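The error counts in the summary came from reviewing the ULS logs. If you want to run a similar tally on your own farm, here is a minimal Python sketch that counts Distributed Cache errors by type within a time window.

It makes two assumptions:
- Log lines are exported as tab-separated text, and the first field is a timestamp such as "08/14/2013 10:15:32.45". Continuation lines may end in "*".
- The three labels, and the message substrings used to recognise them, are hypothetical choices based on the messages above. They are not an official classification.

```python
"""Tally Distributed Cache errors in exported ULS log lines (sketch).

Assumptions (hypothetical, adjust to your own logs):
- each entry is a tab-separated line whose first field is a timestamp
  like "08/14/2013 10:15:32.45", optionally followed by "*";
- the error type can be recognised from substrings of the message text.
"""
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical classification, checked in order. "Cache cluster is down"
# comes first because those messages also contain "temporary failure".
PATTERNS = [
    ("Cache cluster is down", "cluster down"),
    ("There is a temporary failure", "temporary failure"),
    ("The request timed out", "request timed out"),
]


def parse_timestamp(field):
    """Parse a ULS-style timestamp; return None if the field is not one."""
    try:
        return datetime.strptime(field.strip().rstrip("*"), "%m/%d/%Y %H:%M:%S.%f")
    except ValueError:
        return None


def tally(lines, start, window=timedelta(minutes=30)):
    """Count matching lines whose timestamp falls in [start, start + window)."""
    counts = Counter()
    for line in lines:
        ts = parse_timestamp(line.split("\t", 1)[0])
        if ts is None or not (start <= ts < start + window):
            continue  # skip unparsable lines and lines outside the window
        for needle, label in PATTERNS:
            if needle in line:
                counts[label] += 1
                break  # count each line once, under the first matching label
    return counts
```

You can pass any iterable of lines, such as an open exported log file, together with the start of the window you are interested in. The substring approach is deliberately simple. If your logs contain other message variants, extend PATTERNS rather than relying on these three labels.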
Resolution
- Apply AppFabric 1.1 Cumulative Update 3 or a later AppFabric CU (for example, CU4) to all servers in the farm
- Add the backgroundGC key, with the value true, to the DistributedCacheService.exe.config file on all cache servers
- Restart the AppFabric Caching Service (Windows service) on all cache servers
- Restart the Distributed Cache service in SharePoint on all cache servers
- Reset IIS (IISRESET) on all servers in the farm
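Editing DistributedCacheService.exe.config by hand works fine. On farms with several cache servers, a small script can make the change repeatable. The Python sketch below is one way to do it, under stated assumptions.

What it does:
- Adds <add key="backgroundGC" value="true"/> under appSettings.
- Creates appSettings directly after configSections if it is missing.
- Updates the value if the key already exists.
- Leaves other settings alone.

The function names and the .bak backup convention are hypothetical. Run it before restarting the AppFabric Caching Service, and compare the result with the original file before relying on it.

```python
"""Enable background GC in DistributedCacheService.exe.config (sketch).

Hypothetical helper: adds <add key="backgroundGC" value="true"/> under
<appSettings>, creating <appSettings> after <configSections> if needed.
"""
import xml.etree.ElementTree as ET


def enable_background_gc(config_xml: str) -> str:
    # Keep XML comments when parsing (insert_comments requires Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(config_xml, parser=parser)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")

    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.Element("appSettings")
        # .NET requires <configSections> to be the first child element,
        # so place the new <appSettings> right after it when present.
        index = 0
        for i, child in enumerate(list(root)):
            if child.tag == "configSections":
                index = i + 1
                break
        root.insert(index, app_settings)

    # Update an existing backgroundGC entry, or add a new one.
    for entry in app_settings.findall("add"):
        if entry.get("key") == "backgroundGC":
            entry.set("value", "true")
            break
    else:
        ET.SubElement(app_settings, "add", key="backgroundGC", value="true")

    return ET.tostring(root, encoding="unicode")


def patch_file(path: str) -> None:
    """Back up the config to <path>.bak, then write the patched version."""
    with open(path, encoding="utf-8-sig") as f:  # tolerate a UTF-8 BOM
        original = f.read()
    with open(path + ".bak", "w", encoding="utf-8") as f:
        f.write(original)
    updated = enable_background_gc(original)
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="utf-8"?>\n' + updated)
```

For example, the following call assumes AppFabric is installed in its default location:

patch_file(r"C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe.config")

ElementTree rewrites the file from its parsed tree. Attribute quoting and whitespace may therefore differ from the original, even though the settings themselves are preserved.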
If the issue persists, you may need to increase timeout and connection values:
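- Increase the Distributed Cache client settings (for example, requestTimeout, channelOpenTimeOut, and MaxConnectionsToServer) for the affected containers with the Set-SPDistributedCacheClientSetting cmdlet.
- Increase the security token service cache values (for example, MaxLogonTokenCacheItems and MaxServiceTokenCacheItems): retrieve the configuration with Get-SPSecurityTokenServiceConfig, change the properties, and call Update()
- Restart the AppFabric Caching Service and the Distributed Cache service on all cache servers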
References:
https://www.habaneroconsulting.com/insights/SharePoint-2013-Distributed-Cache-Bug
http://support.microsoft.com/kb/2800726/en-us
http://msdn.microsoft.com/en-us/library/hh351248(v=azure.10).aspx