Thursday, June 14, 2012

Adding Timeout and Congestion Prevention in a WCF/.Net Remoting Multi-tier Server


Background:

Visually Similar is an optional feature on the Bing.com shopping site: starting from a clothing or shoe item, it finds other items that look similar. If our response time exceeds 750 milliseconds, the request times out on the client side. The shopping team was investigating a problem in which all requests against our service failed in one data center. We suspected that occasional timeouts in our service were causing their server to go haywire. They wrote a test program that repeatedly queried our service and found that 13% of requests timed out under a normal load of 7 QPS (queries per second). I did not want to tune our search algorithm, because it was developed by a teammate who had left the team. So, I decided to enforce a timeout on the server side.

In the following graph, the response time was occasionally very high:





Analysis

Architecture: Our service has the following architecture:
WCF client at the shopping server -> WCF service / .Net remoting client -> fan-out connections to 3 .Net remoting servers, which perform the actual search.

The WCF client is beyond our control, so I could insert the timeout at the following places, in order of increasing complexity:
1. Add timeout configuration in the WCF service definition.
2. Add timeout configuration in the .Net remoting client definition.
3. Add custom timeout code in the WCF service.
4. Add custom timeout code in the .Net remoting server.

Attempt 1

I first tried #1, since it is the easiest and the most upstream in the call chain. All I had to do was add the following to the web.config file:
  <bindings>
    <basicHttpBinding>
      <binding name="Binding1" ...
               receiveTimeout="00:00:01"
               sendTimeout="00:00:01"
Unfortunately, this did not work. I found a web forum discussion saying that the WCF service timeout will be 22 seconds even if the configured value is smaller.

Attempt 2

I then tried #2, because it is the next easiest. I only needed to add a global configuration to the .Net remoting client; when the timeout expires, the .Net remoting call throws an exception. The following is the code:
        // Register a TCP client channel with a channel-level "timeout" property
        // (in milliseconds); calls exceeding it throw an exception on the client.
        IDictionary t = new Hashtable();
        SoapClientFormatterSinkProvider sc = new SoapClientFormatterSinkProvider();
        t.Add("timeout", (uint)600);   // 600 ms
        TcpClientChannel clientChannel = new TcpClientChannel(t, sc);
        ChannelServices.RegisterChannel(clientChannel, false);
This is an elegant programming model and simple to implement. However, the actual timeout was 485 milliseconds longer than the configured value. (I tested timeout settings of 200 and 600 milliseconds.) This is the response from the "remoting support" team:
"I don’t think it guarantees the exception will be raised the exact time specified in the timeout value only that it will be hit if the timeout value specified is exceeded."
We need accurate timeout values, so I could not use this approach.
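
For reference, the overshoot can be measured with a small snippet like the one below; the proxy variable and the exception handling are placeholders, not the actual test program.
        // A minimal measurement sketch (not the original test program): time a call
        // that is expected to hit the channel timeout and compare the elapsed time
        // with the configured value. "proxy" is a hypothetical remoting proxy, and
        // the exact exception type thrown on timeout is not pinned down here.
        Stopwatch sw = Stopwatch.StartNew();      // System.Diagnostics
        try
        {
            proxy.SearchVisualSimilarRange(prodID /*, ... */);
        }
        catch (Exception)
        {
            Console.WriteLine("Call failed after {0} ms (timeout configured as 600 ms)",
                              sw.ElapsedMilliseconds);
        }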

Attempt 3

I then tried #3 and #4 with the following sample timeout code:
        delegate QueryResults MethodDelegate2(Decimal prodID, ...);
        public QueryResults SearchVisualSimilarRangeWithTimeout(Decimal prodID, ...)
        {
            MethodDelegate2 dlgt = new MethodDelegate2(client.SearchVisualSimilarRange);

            // Initiate the asynchronous call.
            IAsyncResult ar = dlgt.BeginInvoke(prodID, ..., null, null);

            // Wait for the WaitHandle to become signaled.
            if (!ar.AsyncWaitHandle.WaitOne(timeMsec))
            {
                throw new TimeoutException();
            }

            // Get the results of the asynchronous call.
            return dlgt.EndInvoke(ar);
        }
I made the call asynchronous. When a call times out, I allow it to run to completion in the background, but the calling thread returns immediately after the timeout.
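
One detail worth noting: the sample above never calls EndInvoke on the timeout path. A completion callback can be passed to BeginInvoke so the late result (or exception) is still collected; the following is a sketch of that idea, not the exact production code.
        // Sketch: pass this callback as the AsyncCallback argument of BeginInvoke,
        // e.g. dlgt.BeginInvoke(prodID, ..., OnSearchCompleted, null), so a call that
        // has already timed out for the caller is still drained when it finishes.
        static void OnSearchCompleted(IAsyncResult ar)
        {
            // AsyncResult is System.Runtime.Remoting.Messaging.AsyncResult.
            MethodDelegate2 dlgt = (MethodDelegate2)((AsyncResult)ar).AsyncDelegate;
            try
            {
                dlgt.EndInvoke(ar);   // observe the late result or exception
            }
            catch (Exception)
            {
                // The caller already received a TimeoutException; drop late failures.
            }
        }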

#3 did not work, and #4 partially worked: the timeout rate decreased from 13% to 3%. The problem also got worse as the number of requests per second increased. The following are the response times reported by the test program:




Attempt 4

With #4, a mess-up in the WCF service code could still cause timeouts, so I had to make #3 work. To troubleshoot why #3 did not work, I added tracing with timestamps to the WCF service (a sketch of this kind of tracing follows the list below) and discovered the dynamics of the timeout scenario:
· A large part of the call time perceived by the client is time the request spends in the WCF request queue. My timeout code does not help in this situation.
· When the WCF service has several threads running, it may decide to hold more requests in the queue. This queuing behavior is not consistent, and WCF may decide to send a large number of requests to the service code at once. This creates a wave of congestion in the .Net remoting servers and slow response times.
· These two factors combine into a more severe congestion problem.
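
The following is roughly the kind of timestamped tracing I mean; the method and the DoSearch name are hypothetical stand-ins for the real service code.
        // Tracing sketch (hypothetical names, not the production code). The first
        // trace line marks when WCF actually hands the request to service code;
        // comparing it with the client's send time exposes time spent in the queue.
        public KEXResponse Search(Decimal prodID /*, ... */)
        {
            Trace.WriteLine(string.Format("{0:HH:mm:ss.fff} dequeued prodID={1}",
                                          DateTime.UtcNow, prodID));
            Stopwatch sw = Stopwatch.StartNew();
            KEXResponse response = DoSearch(prodID);   // the actual search work
            Trace.WriteLine(string.Format("{0:HH:mm:ss.fff} done prodID={1} in {2} ms",
                                          DateTime.UtcNow, prodID, sw.ElapsedMilliseconds));
            return response;
        }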

To solve this problem, I added code in the WCF service to drop new requests when the number of concurrent requests exceeds a threshold. Our search algorithm is CPU intensive and uses multiple threads, so allowing more requests through while the system is already overloaded would cause a much larger number of requests to time out. Furthermore, the ripple effect of the queue discussed above can congest our entire system for a long period of time. So dropping early should actually increase the overall success rate of our requests.
    // Count of requests currently inside the service; volatile so the plain read
    // below sees a recent value (the Interlocked calls keep the count itself exact).
    volatile static int concurr = 0;
    try
    {
        Interlocked.Increment(ref concurr);
        if (concurr > InitState.Instance.maxConcurr)
        {   // No locking here since this is heuristic anyway...
            return new KEXResponse();   // shed load: return an empty response immediately
        }
        ...
    }
    finally
    {
        // Always decrement, whether the request was served or dropped.
        Interlocked.Decrement(ref concurr);
    }

Alternatively, I could have investigated different WCF queuing configurations, but the simple fix above did the trick, so there was no need to dig further into WCF.
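
For completeness, the WCF knob I would have looked at first is the serviceThrottling behavior, which caps how many calls WCF dispatches concurrently; the numbers below are placeholders, not values we tuned.
  <behaviors>
    <serviceBehaviors>
      <behavior name="ThrottledBehavior">
        <!-- Caps on concurrent dispatch; placeholder values for illustration only. -->
        <serviceThrottling maxConcurrentCalls="8"
                           maxConcurrentSessions="8"
                           maxConcurrentInstances="8" />
      </behavior>
    </serviceBehaviors>
  </behaviors>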

Result

The result: after two days of work, the combined dropped-request and timeout rate observed by the client fell from 13% to below 1%, with the timeout rate dropping to 1 in 2000. The system is also much more stable under a high requests-per-second load. The feature went back online after the fix was deployed.

Lessons learnt:

· Never take the documented behaviors of WCF and .Net remoting for granted. Test them with a small program before applying them to production code.
· The ripple effect between WCF queuing and the backend service can amplify congestion; a good test program is needed to simulate real traffic and uncover the problem.
· A combination of a good test program, tracing, and problem analysis can sometimes solve problems faster than profiling tools and debugging.
