Server busy, server busy, the server is very busy…

No need to explain that these infamous two words hide worlds of potential causes, if you happen to run a Siebel implementation. I got involved in a situation at one of my customers, where end-users were reporting end-less occurrences on the above mentioned. User’s reported to be kicked-out for no reason, often loosing in-flight data while having a customer on-call. The volume of reports was more than substantial, though initially real, objective measured figures were not available.

First and foremost focus in such situation should always be: are there any frequent object manager crashes? Unfortunately customer was not running the Server Task Persistence component, part of the Siebel Failure Diagnostics Framework which got introduced as part of Innovation Pack 2012. This out-of-the-box feature helps administrators analyzing crashes, and viewing key characteristics ‘comfortable’ from within the Siebel UI. So without going through the laborious process grabbing manually the call stack from the /bin directory and decoding the FDR files. Added ‘bonus’ is that you immediately know which users got affected by a specific crash.

Luckily, the background utility ‘siebprocdiag’ used by the Server Task Persistence component to gather the key details, can be also be run manually on a specific server using. Siebprocdiag will essentially cycle through all available .fdr files, decode them and relate them back to call stacks reported in the crash.txt file. The way in which siebprocdiag reports the results of its FDR analysis can be solid gold. It reports in the form of “user navigated to view ‘x’, invoked method ‘y’ on applet ‘z’ which in turn invoked business service ‘abc’ leading to a crash”.

Detailed output will be:

  • Siebel Server and server component names
  • Time of failure
  • Process, thread, and task IDs
  • Number of affected tasks
  • Last set of meaningful business processes that occurred at the time of the failure
  • Whether a new process was created, and if so, the list of current users impacted by the failure
  • A list of users whose sessions were lost following the failure
  • Content of the actual crash.txt file that was logged, which provides the call stack, register contents, and memory information

Well all said and done, we figured out there were crashes and a good number of them across all four Windows servers hosting Call center object managers. The use of siebprocdiag proved vital in quick root-cause analysis, it turned out a buggy business component definition leading to incomplete SQL statements being generated, that business component got introduced in the most recent SRF release.

But guess what, after all the object manager crashes ceased to exists after an emergency release – users kept reporting ‘server busy’. What was going on?

We started to increase ‘EventContext’ tracing, to get a better grip and to hopefully see some type of pattern. Unfortunately, that lead nowhere.

One step further we added ObjMgrSessionLog=5 and ProcessRequest=5 in order to trail the flow of events. At the same time we cranked-up SWSE monitoring per three usual system environment variables (SIEBEL_LOG_EVENTS = 4, SIEBEL_SESSMGR_TRACE = 1, SIEBEL_SISNAPI_TRACE = 1). Although SWSE tracing is flooded with irrelevant (no, what I mean is ‘benign’) traces, it typically can be well filtered based on SARM ID which can always correlate an Object manager task with SWSE tracing.

The SARM ID is printed in every line of a component log. The fact it correlated with SWSE tracing.

TaskConfig    TaskCfgParamInit    3    0000003c56710e30:0    2015-12-30 15:28:14    The Parameters for the current task (8405474) are :

Since the customer is running a high-available configuration including a set of hardware load-balanced web servers, we decided to temporarily by-pass the load-balancer. Luckily the vast majority of end-users are on Citrix, hence we able to provision a changed short-cut instantaneously. By-passing the load balancer we both could cancel-out the load-balancer  being potential culprit as well as analysing is simply easier on a single web server. Load-wise it should create no issue to run on a single leg temporarily.

Run a couple of hours, the incidents kept flowing in and pretty quickly we noticed this pattern:

ObjMgrSessionLog    ObjMgrLogin    3    00005710566f096c:0    2015-12-30 15:43:38    Username: XYZ, Login Status: Attempt,  Session Id: bP6I.5OZwhK.IxZSBDlRU3dYBTdk2UeLUkP8fk83cyQcXmEKcPnGWD4mUAYLDnQCtsMEDoCp8E6aDkhoXWRCgX-4QnqEWCECWJHyMZKAcVqMF6MI1szESLb78y05d7O-SIO8bmbJ2W5SFgJvqj8kD-V1Gx.zFc4hjxWshoKQSlg_, IP Address: xx.xx.xx.xx

Usually, if you spot “Login Status: Attempt” in the middle of a web session, it simply means a time-out. But under normal circumstances, the message in the log should be preceded by “SBL-SMI-00126: The session has timed out”. The “Login Status: Attempt” message, would in case of WebSSO ensure the user is logged-in automatically, after a session timout. Moreover, in case of a traditional time-out, you should be easily able to spot a gap in the log corresponding the the SessionTimeout configured on the web server (eapps.cfg).

After collecting during a good deal of a day we confirmed a couple of things:

  1. The message “Login Status: Attempt” (if not preceded by SBL-SMI-00126) clearly indicated a an ‘affected user’.
  2. Extrapolating on the occurrences we collected over the period of one hour on one single server, we found the issue to hit approximately every users two times their working day. Wow.

But the most intriguing fact: at the same time, groups of users got affected. Sometimes a little as 2, sometimes as many as 9. And these groups of users were always related to one specific object manger process Id.

This latter piece of information led us to believe a relationship with SISNAPI session multiplexing between SWSE and Siebel Servers. The SWSE multiplexes Siebel web client session over a pool of SISNAPI – which of course are just TCP/IP – connections. SWSE maintains a ‘pool’ of SISNAPI connections for each target object manager process. Multiplexing is defined by the SessPerSisn value on the object manager, and defaults to 20. The size of the pool is calculated by:

Max Tasks of the component / Max number of Object Manager multi-threads processes / SessPerSisn. This figure should lead to an integer value. So take Max Tasks = 200, MaxMTServers = 5 and the default SessPerSisn = 20 would lead to 2 SISNAPI connections for each multi-threaded process. I have to admit, in the past 15+ years I have never been required to tweak this value.

We verified these connections running “netstat -ano | find “2321”. The output contains a trailing column a PID, this PID related to the SiebService process on the target Siebel server. So basically you separate the output, and confirm the number of TCP/IP connections between SWSE and a particular Siebel Server.

netstat -ano | find “2131”
  TCP    53.53.53.29:2321      53.53.53.193:52207    ESTABLISHED     3464
  TCP    53.53.53.29:2321      53.53.53.193:52458    ESTABLISHED     3464
  TCP    53.53.53.29:2321      53.53.53.193:52680    ESTABLISHED     3464
  TCP    53.53.53.29:2321      53.53.53.193:52736    ESTABLISHED     3464

Anyways, working with the network team we ran a short while Wireshark between the SWSE and one specific Siebel Server. And since we could indicate second-close when an incident produced itself, Wireshark data could be filtered appropriately without being overwhelmed. And guess what? Spot-on, we found a match. It appeared that the Web server closed a TCP/IP connection with details: [RST, ACK, CWR]. The network team immediately noticed the uncommon “CWR” flag which was set on the Reset (RST) packet. CWR means ‘congestion window reduced’, which essentially is a mechanism for a sender to re-negotiate the window in which packets are being exchanged with the receiving party, to make it smaller because of perceived packet-loss or congestion.

A bit more details here, is that CWR has a related feature named ‘Explicit Congestion Notification’ or ECN. And interestingly, ECN is a feature which on Windows releases prior to Windows 2012 has been available, but default it has been always disabled. But with Windows 2012 the feature has been enabled by default. Since the network and infrastructure teams were not prone, disabling ECN because they felt the applications should be resilient we considered changing SessPerSisn.

Theoretically: lowering SessPerSisn would increase the number of SISNAPI connections, and likely lowering the possibility of congestion on one specific TCP/IP connections. At the same time, we would be reducing the number of users impacted. Side-effect of lowering SessPerSisn would be increased memory foot-print for the SWSE plugin plus additional OS memory overhead for both Web servers as well as Siebel Servers. Additional memory for the SWSE plugin should be approximately 1 Mb / SISNAPI connection. Additional OS memory was estimate to be 100 KB / connection. In the customer’s setup that would not accumulate to huge figures.

Decision was made to lower SessPerSisn from default ’20’ to ’10’ – so basically doubling the number of TCP/IP connections. The change got implemented, and seems to have the looked-for effect. The ‘Server Busy’ messages have not been seen anymore!

Now – the story is not over and done for me. Some questions need still to be answered from Oracle’s side:

  1. Should we recommend other values for SessPerSisn for Windows 2012+ implementations?
  2. How to effectively ‘measure’ what an appropriate value for SessPerSisn should be. It’s appreciated that doubling the number of SISNAPI connections seems to work, but it would be good if it could be pro-actively monitored when congestion lingers.

– Jeroen

 

 

 

 

 

 

 

 

 

Advertisements

4 thoughts on “Server busy, server busy, the server is very busy…

  1. Hi Jeroen

    I did face the same problem for a customer in JAPAC. Here the problem was from EAI Object Manager not in eComms Object Manager. Same they has SessPerSisn as 25. So I suspect that this parameter value should be a problem. So the value was set to 5 compared to 25 in PROD. We did monitor it regularly. Customer warned us that reducing this value will result to drop in request. But luckily we did not have any issues. the system is stable now. And it is perfectly fine.

    Thanks.

    • For EAI Object managers, SessPerSisn typically is lower. If you consider, I never had to touch SessPerSisn in the past – only for EAI Object managers, where it’s more typical to reduce the standard value.

  2. Nice article. The server busy error also appears when the session cookie ‘_sn’ is expired or contains a null value. (or loadbalancer misconfiguration). I experimented this and found out there is a bug which could even max out all siebel sessions from one user request!
    I’ve created POC, https://github.com/sanjeeve/myPy

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s