Background:
Visually
similar is an optional service in Bing.com shopping site which start from a clothing
or shoes item and finds other items that looks similar. If our response time is
beyond 750 milliseconds, the requests will timeout on client side. The shopping
team is investigating a problem that all requests fail against our service in one
center. We suspect that occasional timeouts in our service is causing their
server to go haywire. They wrote a test program which repeatedly query our
service, and found 13% of request timeout under normal load of 7 QPS (query per
second). I don't want to tune our search algorithm because it was developed by
a teammate who left the team. So, I decided to enforce a timeout on server
side.
In
the following graph: The response time occasionally was very high.
Analysis
Architecture:
Our service has the following architecture:
WCF Client at
Shopping server -> WCF service/.Net remoting client -> Fan out connection
with 3 .Net remoting server, which perform actual search.
WCF
client is beyond our control, so I can insert the timeout at the following
places, in order of increasing complexity:
1.
add timeout configuration in WCF service
definition
2.
add timeout configuration in .Net remoting client
definition
3.
add custom timeout code in WCF service
4.
add custom timeout code in .Net remoting server
Attemp 1
I
first tried #1 since it is easy and most up-stream to the call chain. All I had
to do is to add the following to the web.config file:
<bindings>
<basicHttpBinding>
<binding name="Binding1" ...
receiveTimeout="00:00:01"
sendTimeout="00:00:01"
<basicHttpBinding>
<binding name="Binding1" ...
receiveTimeout="00:00:01"
sendTimeout="00:00:01"
Unfortunately,
this did not work. I found the following web forum discussion saying the WCF
service timeout will be 22 seconds even if the value is configured to be
smaller.
Attemp 2
I
then tried #2, because it is next easiest. I just need to add a global
configuration to the .Net remoting client. When timeout expires, the .Net
remoting call will throw exception. The
following is the code:
IDictionary
t = new Hashtable();
SoapClientFormatterSinkProvider sc = new SoapClientFormatterSinkProvider();
t.Add("timeout", (uint)600);
TcpClientChannel clientChannel = new TcpClientChannel(t,
sc);
ChannelServices.RegisterChannel(clientChannel, false);
This
is very elegant programing model, and simple to implement, However, the actual
timeout was 485 milliseconds longer than given in the settings. (I tested
timeout setting of 200 and 600 milliseconds.) This is the response from the
"remoting support" team:
"I
don’t think it guarantees the exception will be raised the exact time specified
in the timeout value only that it will be hit if the timeout value specified is
exceeded."
We
need accurate timeout values. So, I cannot use this approach.
Attempt 3
I
then tried #3 and #4 with the following sample timeout code:
delegate QueryResults
MethodDelegate2(Decimal
prodID, ...);
public QueryResults
SearchVisualSimilarRangeWithTimeout(Decimal
prodID, ...)
{
MethodDelegate2 dlgt = new
MethodDelegate2(client.SearchVisualSimilarRange);
// Initiate the asynchronous call.
IAsyncResult ar = dlgt.BeginInvoke(prodID, ..., null, null);
// Wait for the WaitHandle to become signaled.
if (!ar.AsyncWaitHandle.WaitOne(timeMsec))
{
throw new TimeoutException();
}
// Get the results of the asynchronous call.
return dlgt.EndInvoke(ar);
}
I
made the call to be asynchronous. When a call timeout, I will allow it to run
to completion asynchronously, but the main call thread will return immediately
after the timeout.
#3
did not work, and #4 partially worked in that timeout decreased from 13% to 3%.
Also, the problem is worse when there is more requests per second. The
following are the response time of reported by the test program:
Attempt 4
With
#4, it is possible that mess-up in the WCF service code could cause timeout.
So, I have to make #3 work. To troubleshoot why #3 did not work, I added
tracing with time stamp to the WCF service, and discovered the dynamics of the
timeout scenario:
·
Large part of perceived call time on the client
includes the time when the request is in the WCF request queue. My timeout code
doesn’t help in this situation.
·
When WCF service has several threads running, it
may decide to hold more quests in the queue. This queue behavior is not
consistent, and WCF may decide to send a large number of requests to the
service code at once. This creates a wave of congestion in the .Net remoting
server and slow response time.
·
The above 2 factors add together to become a more
severe congestion problem.
To
solve this problem, I added code in WCF service to drop new requests if the
number of concurrent requests is more than a threshold. Our search algorithm is
CPU intensive and use multiple threads, so if I allow more request through
while our system is already overloaded, it will cause much larger number of
requests to timeout. Furthermore, the ripple effect of the queue discussed
above can congest our entire system for a long period of time. So, this should
actually increase the overall success rate of our request.
volatile static int concurr = 0;
try
{
Interlocked.Increment(ref concurr);
if (concurr > InitState.Instance.maxConcurr)
{ // No locking here since this is heuristic anyway...
return new
KEXResponse();
}
...
}
finally
{
Interlocked.Decrement(ref concurr);
}
Alternatively,
I could investigate ways to solve the problem with different WCF queuing
configurations, but the above simple fix did the trick, so there is no need to
dig more into WCF for a solution.
Result
The
result: after 2 days of work, the
combined dropped request& timeout rate observed by client dropped from 13%
to below 1%, with timeout rate dropping to 1 in 2000. Also, the system is much
more stable at large request per second load. The feature is back online after
the fix is deployed.
Lessons learnt:
·
Never take documented behaviors or WCF and .Net
remoting for granted. Test it with a small program before applying it to
product code.
·
Ripple effect between WCF queuing and backend
service may increase congestion, and a good test program is needed to simulate
real traffic and uncover the problem.
·
Combination of good testing program, tracing, and
problem analysis can sometimes solve problems faster than profiling tools and
debugging.