Recently I had a production workflow issue: everything worked in Dev, QA and UAT, but not in production.
In our case, we had enabled retention on the site content type through its information management policy settings. The policy is set up so that declared records start a custom workflow that manages retention on those records.
The problem was that we use the Information Management Policy timer job to start the workflow asynchronously. It worked for some time, and then suddenly the notification emails stopped arriving for newly uploaded files, and no workflow history was created for them either.
I reviewed and compared the configuration of all the individual SharePoint servers (we have 4 web servers and 4 application servers in this landscape) and found one inconsistency. One component, the SharePoint Foundation Workflow Timer Service (SFTS), was enabled on 6 servers and disabled on the other 2.
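The manual comparison above can be sketched in Python as a rough illustration. The server names (WEB1, APP1 and so on) and the hard-coded inventory are hypothetical; in a real farm the data would have to come from Central Administration or a cmdlet such as Get-SPServiceInstance. The idea is simply to group servers by role and report any service whose state differs within the same role.

```python
from collections import defaultdict

# Hypothetical inventory: server name -> (role, {service name: is it running?}).
# It mirrors the situation described above: SFTS on 6 of 8 servers.
inventory = {
    "WEB1": ("web", {"Workflow Timer Service": True}),
    "WEB2": ("web", {"Workflow Timer Service": True}),
    "WEB3": ("web", {"Workflow Timer Service": True}),
    "WEB4": ("web", {"Workflow Timer Service": True}),
    "APP1": ("app", {"Workflow Timer Service": False}),
    "APP2": ("app", {"Workflow Timer Service": False}),
    "APP3": ("app", {"Workflow Timer Service": True}),
    "APP4": ("app", {"Workflow Timer Service": True}),
}


def find_inconsistencies(inventory):
    """Return services whose running state differs between servers of one role."""
    # role -> service -> running state -> list of server names
    by_role = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for server, (role, services) in inventory.items():
        for service, running in services.items():
            by_role[role][service][running].append(server)

    result = {}
    for role, services in by_role.items():
        for service, states in services.items():
            if len(states) > 1:  # both True and False occur within this role
                result[(role, service)] = {
                    state: sorted(names) for state, names in states.items()
                }
    return result


for (role, service), states in find_inconsistencies(inventory).items():
    print(role, service, states)
```

With this sample data only the application servers are flagged, because the web servers all agree with each other.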
Based on my research (see below), this component should be disabled on application servers. When we had 2 web servers and 2 application servers, the service was correctly disabled on the application servers. Later, as part of a new application go-live, we added 2 more web servers and 2 more application servers to the farm, and the service was not disabled on the 2 new application servers. To my knowledge, the first notification failure happened after we added this capacity. After I stopped the service on the 2 newly added application servers, the workflow worked perfectly.
We observed the workflow behavior for a couple of weeks after stopping the Microsoft SharePoint Foundation Workflow Timer Service on the application servers, and we have not seen any issue so far.
No error message in the log files points to this as the root cause, but various technical forums identify it as the cause of intermittent workflow issues.
Summary of the solution:
While the workflow processes a delay activity, the Information Management Policy timer job is scheduled on the servers where the SharePoint Foundation Workflow Timer Service (SFTS) is running. To execute the SFTS job, the server (WFE or APP) tries to run the workflow, and this requires the workflow assembly to be available on that server.
So how does the workflow assembly come to be missing from a server?
This service (SFTS) is automatically configured to run on all web servers in the farm, and the Topologies for SharePoint Server 2010 guidance recommends running it on the web servers.
When we deploy the WSP solution, the workflow assemblies are copied only to the servers that have the WFE role (that is, where the SharePoint Foundation Web Application service is running); see the link below.
In my case, SFTS was running on 2 application servers that do not have the WFE role, so they never received the workflow assemblies. To fix the issue, I stopped SFTS on those application servers.
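The rule behind the fix can be written down as a small check: any server that runs the Workflow Timer Service but not the Web Application service is a candidate for failed workflow jobs. The Python sketch below is only an illustration under assumptions. The server names and the hard-coded farm layout are hypothetical, and in practice the list of running services per server would have to be exported from the farm first.

```python
# Service display names as they are referred to in this post.
WORKFLOW_TIMER = "Microsoft SharePoint Foundation Workflow Timer Service"
WEB_APPLICATION = "Microsoft SharePoint Foundation Web Application"

# Hypothetical farm: server name -> set of services running on it.
# This is the state before the fix: the 2 new app servers still run SFTS.
farm = {
    "WEB1": {WORKFLOW_TIMER, WEB_APPLICATION},
    "WEB2": {WORKFLOW_TIMER, WEB_APPLICATION},
    "WEB3": {WORKFLOW_TIMER, WEB_APPLICATION},
    "WEB4": {WORKFLOW_TIMER, WEB_APPLICATION},
    "APP1": set(),
    "APP2": set(),
    "APP3": {WORKFLOW_TIMER},
    "APP4": {WORKFLOW_TIMER},
}


def timer_without_wfe(farm):
    """Servers that can pick up workflow jobs but lack the WFE role,
    so solution deployment would not have copied the assemblies to them."""
    return sorted(
        name
        for name, services in farm.items()
        if WORKFLOW_TIMER in services and WEB_APPLICATION not in services
    )


print(timer_without_wfe(farm))  # ['APP3', 'APP4']
```

Running a check like this after every capacity change would have caught the 2 new application servers before the first missed notification.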