No need to explain that these infamous two words hide worlds of potential causes, if you happen to run a Siebel implementation. I got involved in a situation at one of my customers, where end-users were reporting end-less occurrences on the above mentioned. User’s reported to be kicked-out for no reason, often loosing in-flight data while having a customer on-call. The volume of reports was more than substantial, though initially real, objective measured figures were not available.
First and foremost focus in such situation should always be: are there any frequent object manager crashes? Unfortunately customer was not running the Server Task Persistence component, part of the Siebel Failure Diagnostics Framework which got introduced as part of Innovation Pack 2012. This out-of-the-box feature helps administrators analyzing crashes, and viewing key characteristics ‘comfortable’ from within the Siebel UI. So without going through the laborious process grabbing manually the call stack from the /bin directory and decoding the FDR files. Added ‘bonus’ is that you immediately know which users got affected by a specific crash.
Luckily, the background utility ‘siebprocdiag’ used by the Server Task Persistence component to gather the key details, can be also be run manually on a specific server using. Siebprocdiag will essentially cycle through all available .fdr files, decode them and relate them back to call stacks reported in the crash.txt file. The way in which siebprocdiag reports the results of its FDR analysis can be solid gold. It reports in the form of “user navigated to view ‘x’, invoked method ‘y’ on applet ‘z’ which in turn invoked business service ‘abc’ leading to a crash”.
Detailed output will be:
- Siebel Server and server component names
- Time of failure
- Process, thread, and task IDs
- Number of affected tasks
- Last set of meaningful business processes that occurred at the time of the failure
- Whether a new process was created, and if so, the list of current users impacted by the failure
- A list of users whose sessions were lost following the failure
- Content of the actual crash.txt file that was logged, which provides the call stack, register contents, and memory information
Well all said and done, we figured out there were crashes and a good number of them across all four Windows servers hosting Call center object managers. The use of siebprocdiag proved vital in quick root-cause analysis, it turned out a buggy business component definition leading to incomplete SQL statements being generated, that business component got introduced in the most recent SRF release.
But guess what, after all the object manager crashes ceased to exists after an emergency release – users kept reporting ‘server busy’. What was going on?
We started to increase ‘EventContext’ tracing, to get a better grip and to hopefully see some type of pattern. Unfortunately, that lead nowhere.
One step further we added ObjMgrSessionLog=5 and ProcessRequest=5 in order to trail the flow of events. At the same time we cranked-up SWSE monitoring per three usual system environment variables (SIEBEL_LOG_EVENTS = 4, SIEBEL_SESSMGR_TRACE = 1, SIEBEL_SISNAPI_TRACE = 1). Although SWSE tracing is flooded with irrelevant (no, what I mean is ‘benign’) traces, it typically can be well filtered based on SARM ID which can always correlate an Object manager task with SWSE tracing.
The SARM ID is printed in every line of a component log. The fact it correlated with SWSE tracing.
TaskConfig TaskCfgParamInit 3 0000003c56710e30:0 2015-12-30 15:28:14 The Parameters for the current task (8405474) are :
Since the customer is running a high-available configuration including a set of hardware load-balanced web servers, we decided to temporarily by-pass the load-balancer. Luckily the vast majority of end-users are on Citrix, hence we able to provision a changed short-cut instantaneously. By-passing the load balancer we both could cancel-out the load-balancer being potential culprit as well as analysing is simply easier on a single web server. Load-wise it should create no issue to run on a single leg temporarily.
Run a couple of hours, the incidents kept flowing in and pretty quickly we noticed this pattern:
ObjMgrSessionLog ObjMgrLogin 3 00005710566f096c:0 2015-12-30 15:43:38 Username: XYZ, Login Status: Attempt, Session Id: bP6I.5OZwhK.IxZSBDlRU3dYBTdk2UeLUkP8fk83cyQcXmEKcPnGWD4mUAYLDnQCtsMEDoCp8E6aDkhoXWRCgX-4QnqEWCECWJHyMZKAcVqMF6MI1szESLb78y05d7O-SIO8bmbJ2W5SFgJvqj8kD-V1Gx.zFc4hjxWshoKQSlg_, IP Address: xx.xx.xx.xx
Usually, if you spot “Login Status: Attempt” in the middle of a web session, it simply means a time-out. But under normal circumstances, the message in the log should be preceded by “SBL-SMI-00126: The session has timed out”. The “Login Status: Attempt” message, would in case of WebSSO ensure the user is logged-in automatically, after a session timout. Moreover, in case of a traditional time-out, you should be easily able to spot a gap in the log corresponding the the SessionTimeout configured on the web server (eapps.cfg).
After collecting during a good deal of a day we confirmed a couple of things:
- The message “Login Status: Attempt” (if not preceded by SBL-SMI-00126) clearly indicated a an ‘affected user’.
- Extrapolating on the occurrences we collected over the period of one hour on one single server, we found the issue to hit approximately every users two times their working day. Wow.
But the most intriguing fact: at the same time, groups of users got affected. Sometimes a little as 2, sometimes as many as 9. And these groups of users were always related to one specific object manger process Id.
This latter piece of information led us to believe a relationship with SISNAPI session multiplexing between SWSE and Siebel Servers. The SWSE multiplexes Siebel web client session over a pool of SISNAPI – which of course are just TCP/IP – connections. SWSE maintains a ‘pool’ of SISNAPI connections for each target object manager process. Multiplexing is defined by the SessPerSisn value on the object manager, and defaults to 20. The size of the pool is calculated by:
Max Tasks of the component / Max number of Object Manager multi-threads processes / SessPerSisn. This figure should lead to an integer value. So take Max Tasks = 200, MaxMTServers = 5 and the default SessPerSisn = 20 would lead to 2 SISNAPI connections for each multi-threaded process. I have to admit, in the past 15+ years I have never been required to tweak this value.
We verified these connections running “netstat -ano | find “2321”. The output contains a trailing column a PID, this PID related to the SiebService process on the target Siebel server. So basically you separate the output, and confirm the number of TCP/IP connections between SWSE and a particular Siebel Server.
netstat -ano | find “2131”
TCP 184.108.40.206:2321 220.127.116.11:52207 ESTABLISHED 3464
TCP 18.104.22.168:2321 22.214.171.124:52458 ESTABLISHED 3464
TCP 126.96.36.199:2321 188.8.131.52:52680 ESTABLISHED 3464
TCP 184.108.40.206:2321 220.127.116.11:52736 ESTABLISHED 3464
Anyways, working with the network team we ran a short while Wireshark between the SWSE and one specific Siebel Server. And since we could indicate second-close when an incident produced itself, Wireshark data could be filtered appropriately without being overwhelmed. And guess what? Spot-on, we found a match. It appeared that the Web server closed a TCP/IP connection with details: [RST, ACK, CWR]. The network team immediately noticed the uncommon “CWR” flag which was set on the Reset (RST) packet. CWR means ‘congestion window reduced’, which essentially is a mechanism for a sender to re-negotiate the window in which packets are being exchanged with the receiving party, to make it smaller because of perceived packet-loss or congestion.
A bit more details here, is that CWR has a related feature named ‘Explicit Congestion Notification’ or ECN. And interestingly, ECN is a feature which on Windows releases prior to Windows 2012 has been available, but default it has been always disabled. But with Windows 2012 the feature has been enabled by default. Since the network and infrastructure teams were not prone, disabling ECN because they felt the applications should be resilient we considered changing SessPerSisn.
Theoretically: lowering SessPerSisn would increase the number of SISNAPI connections, and likely lowering the possibility of congestion on one specific TCP/IP connections. At the same time, we would be reducing the number of users impacted. Side-effect of lowering SessPerSisn would be increased memory foot-print for the SWSE plugin plus additional OS memory overhead for both Web servers as well as Siebel Servers. Additional memory for the SWSE plugin should be approximately 1 Mb / SISNAPI connection. Additional OS memory was estimate to be 100 KB / connection. In the customer’s setup that would not accumulate to huge figures.
Decision was made to lower SessPerSisn from default ’20’ to ’10’ – so basically doubling the number of TCP/IP connections. The change got implemented, and seems to have the looked-for effect. The ‘Server Busy’ messages have not been seen anymore!
Now – the story is not over and done for me. Some questions need still to be answered from Oracle’s side:
- Should we recommend other values for SessPerSisn for Windows 2012+ implementations?
- How to effectively ‘measure’ what an appropriate value for SessPerSisn should be. It’s appreciated that doubling the number of SISNAPI connections seems to work, but it would be good if it could be pro-actively monitored when congestion lingers.