Squeak SmalltalkJoker Squeak Smalltalk : Network : prevnext Socket Disconnect Problem Heavy Load

Hi,
we've got a production server for an insurance company running a
Squeak 3.5 image on the 3.6b6 VM on Debian Linux. Unfortunately there
are mysterious problems with the sockets when accessing PostgreSQL
from within Squeak. We wrote workarounds, that just retry if a
connection is failing, but this isn't helping much anymore.
As it is a critical business application the client is not only
complaining but even beginning to sue for damage, that we are unable
to afford as a little internet agency! This means that *we won't be
ever able to use squeak again* and that we have to go back to other
solutions like Python or ASP.
To the problem: we've got a couple of connections being used by
different threads executing queries on postgres. During heavy load -
and this is where most of the time the problem shows up - there are
about 300 queries executed sequentially on one connection by one
thread. Most often, the connections are suddenly closed, even before
Squeak starts to wait for the data.
This was the part, to stress the *urgency* and to explain the
background for the following questions:
1. Are the sockets in Squeak at all being able to be used from
different threads?
2. Could it be that the production server is too fast? We didn't have
those problems on the development machines (macintosh and windows).
3. Should we try another VM, another image, another postgres client?
4. Is there anybody in this mailing-list able to help with that kind
of problems? We are lost and are really fed up with those Squeak
sockets! We would be even willing to pay the one that is able to solve
the problem.
Lukas
--
Lukas Renggli
http://renggli.freezope.org

My best guess (from afar) is that the connection is closed immediately
after the reply has been sent and that something is going wrong in the
handling of this situation. E.g., consider the following code:
Socket initializeNetwork.
port := 12346.
"create client and server"
server := Socket newTCP.
server listenOn: port.
client := Socket newTCP.
client connectTo: NetNameResolver localHostAddress port: port.
[client isConnected] whileFalse:[(Delay forSeconds: 1) wait].
"send request"
client sendData: 'Hi there'.
"handle on server"
request := String new: 100.
sz := server receiveDataInto: request.
request := request copyFrom: 1 to: sz.
request = 'Hi there'
 ifFalse:[self error: 'Something went wrong'].
"---> send response and immediately close <---"
server sendData: 'Response'.
server close.
[server isConnected] whileTrue:[(Delay forSeconds: 1) wait].
At this point, the server closed the connection whereas the client
hasn't done anything yet. Depending on the VM implementation this can
cause severe trouble. You may want to check (on varying platforms)
what the results of the following are:
     "Win32"
a) client statusString.   'connected'
b) client getData.   'Response'
d) client statusString.   'otherEndClosedButNotThisEnd'
And to answer your specific questions:
> 1. Are the sockets in Squeak at all being able to be used
> from different threads?
Yes they are. Unless you attempt to use the same socket simultanously
from different threads there is no problem. If you do the latter then
you have the "common" issues of synchronizing access to a shared
resource from different threads.
> 2. Could it be that the production server is too fast? We didn't
> have those problems on the development machines (macintosh and
> windows).
It could be. The above example simulates an infinitely fast server
which replies before client has a chance to look at it.
> 3. Should we try another VM, another image, another postgres client?
First of all, run the above example and see what this gets you. It
may  be that the VM is merely reporting that the
"otherEndClosedButNotThisEnd" and that your code deduces from this
that there is no readable data on the socket (e.g., #isConnected will
return false which in turn may screw #getData and friends). In this
situation you  may  still be able to read data from the socket (can't
say without having tried it; but see what
#primSocket:receiveDataInto:startingAt:count: reports) and simply work
around this problem.
> 4. Is there anybody in this mailing-list able to help with
> that kind of problems? We are lost and are really fed up with
> those Squeak sockets! We would be even willing to pay the one
> that is able to solve the problem.
If the above solves your problem make a donation to Squeak e.V. (heh,
heh) It's even tax-deductable ;-)
Cheers,
  - Andreas
----

Ian is the author of the unix socket stuff. Goodness knows he and I
have tried to ensure it works correctly under heavy load.
a) All unix variations have slightly different behaviour under load.
I  think Ian has tested on os-x and NetBSD.
b) Try disabling the interval timer  -notimer and see if that changes
things?
c) Having a repeatable Sunit will greatly help debugging?
d) You must build a debuggable VM which dumps sufficient state about
the status of the socket to understand why the
socket is being closed.
e) As pointed out they issue could be on the other side with you
violating some rule and they close you down.
Please keep in mind this is not a language issue. The interesting and
troublesome parts of the Socket code are just plain C code. The same
issue could be occurring if the application was written in C or a host
 of other languages.

MacIntosh

On Fri, 12 Sep 2003, Andreas Raab wrote:
> done anything yet. Depending on the VM implementation this can cause
> severe trouble. You may want to check (on varying platforms) what
> the results of the following are:
>      "Win32"
> a) client statusString.   'connected'
> b) client getData.   'Response'
where did 'c)' go?
> d) client statusString.   'otherEndClosedButNotThisEnd'
Modulo #getData having vanished from Socket, in Unix we get
a) client statusString.   'connected'
b) client getData.   'Response'
c) client statusString.   'connected'
d) client getData.  immediate error 'getData timeout'
e) client statusString.   'otherEndClosedButNotThisEnd'
So what I might do is to try moving the check for otherEndClosed
forward into primSocketConnectionStatus to see if the image likes that
better.  (Although we don't know that this is the problem.  Also, the
image really should not rely on this.  Closing the other end of the
connection does not generate [unless you do some very unpleasant
things with signals] any kind of event in the still-connected peer,
who is expected to deal with the closure when the next _read_
operation _fails_.)  Note that this will make Socket>>isConnected (and
friends) measurably slower, but if that's the price to pay, so be it...
(What are you hiding between 'b)' and 'd)'??? ;)
Regards,
Ian

Hi Andreas, Ian,
> My best guess (from afar) is that the connection is closed
> immediately after the reply has been sent and that something
> is going wrong in the handling of this situation.
Right, that seems to be the problem.
The result of the posted code on the production server (Debian Linux,
Squeak Unix VM 3.6-beta6) is the following (with -notimer already
turned on):
 a) client statusString.    --> 'connected'
 b) client getData.         --> 'Response'
 c) client statusString.    --> 'connected'
 d) client getData.         --> 'getData timeout' exception is raised
 e) client statusString.    --> 'otherEndClosedButNotThisEnd'
Is this ok? I've tried to run in several times, but it looks all the
time the same. Should we try it in a loop? Should we write a test that
runs it in parallel with different sockets?
> Yes they are. Unless you attempt to use the same socket
> simultanously from different threads there is no problem.
> If you do the latter then you have the "common" issues of
> synchronizing access to a shared resource from different
> threads.
Ok, lets check that on the production machine:
 PGConnection allInstances collect: [ :each | each instVarAt: 4 ]
 --> #(a Socket[connected] a Socket[connected] a Socket[connected] ...
 PGConnection allInstances collect: [ :each | (each instVarAt: 4)
oopString ]
 --> #('1810' '2601' '1928' '1596' '2804' '2495' '3428' '194' '3276' ..
.
All sockets used by the PGConnection instances seem to be different,
right? So this should be not the problem.
> First of all, run the above example and see what this gets
> you. It may be that the VM is merely reporting that the
> "otherEndClosedButNotThisEnd" and that your code deduces from
> this that there is no readable data on the socket (e.g.,
> #isConnected will return false which in turn may screw
> #getData and friends). In this situation you may still be able
> to read data from the socket (can't say without having tried
> it; but see what #primSocket:receiveDataInto:startingAt:count:
> reports) and simply work around this problem.
So that would rewrite the message reading the data from the Postgres
server? For clarity I removed the log-statements, it basically looks
like this:
 PGConnection>>next
  readIndex >= lastReadIndex ifTrue: [
   (socket waitForDataUntil: Socket standardDeadline)
    ifFalse: [self
error: 'timed out getting data'].
   [(lastReadIndex _ socket
receiveDataInto: readBuffer) = 0
    ifTrue: [(Delay forMilliseconds:
100) wait].
   lastReadIndex = 0] whileTrue.
   readIndex _ 0].

readIndex _ readIndex + 1.
 ^ readBuffer at: readIndex
I do not understand what you actually suggest to change.
Thanks a lot for your help,
Lukas
--
Lukas Renggli
http://renggli.freezope.org

On Fri, 12 Sep 2003, Ian Piumarta wrote:
> So what I might do is to try moving the check for otherEndClosed
> forward into primSocketConnectionStatus to see if the image likes
> that better.
With that change, Andreas' example now gives me:
  self statusString  'connected'
  self getData   'Response'
  self statusString  'otherEndClosedButNotThisEnd'
  self getData  immediate error 'getData timeout'
Lukas: replace sqSocketConnectionStatus() in your sqUnixSocket.c with
the attached version, recompile, and try to break your db connection
again.  If it doesn't break any more then I'll include the changes in
the next release.
Cheers,
Ian

Lukas,
I've run into similar problems using MySQL from Squeak. Usually it was
because some kind of synchronization problem or the Mysql driver not
reacting well to something my app did.
IMO, the first thing you need to find out is whether the server is
closing the connection, and if so, why. Yanni's suggestion about the
post master log level is very good. You might also consider using a
packet sniffer to monitor the connections at the OS level. This might
involve a lot of groveling through hex dumps, but it will definitely
tell you what's going on.
Another thing you might try is switching clients. Diego's ODBC package
works quite well. It makes deployment more difficult though, and might
require some reworking of your application, so I'd only do that as a
last resort.
Hope this helps,
Colin

> The result of the posted code on the production server (Debian
> Linux, Squeak Unix VM 3.6-beta6) is the following (with -notimer
> already turned on):
>
>  a) client statusString.    --> 'connected'
>  b) client getData.         --> 'Response'
>  c) client statusString.    --> 'connected'
>  d) client getData.         --> 'getData timeout'
> exception is raised
>  e) client statusString.    --> 'otherEndClosedButNotThisEnd'
>
> Is this ok?
This looks very reasonable to me. The situation I was thinking about
was that a) would report immediately "otherEndClosedButNotThisEnd"
(which is depending on the implementation quite possible - I had this
on Windows some point in the past). So this doesn't seem to be the
problem I was thinking about.
> I've tried to run in several times, but it looks all the
> time the same. Should we try it in a loop? Should we write a
> test that runs it in parallel with different sockets?
No, I don't think so. The above test tells us that the VM responds "as
if" you were calling recv() directly on the socket which is the correct
behavior.
> > Yes they are. Unless you attempt to use the same socket
> > simultanously from different threads there is no problem.
> > If you do the latter then you have the "common" issues of
> > synchronizing access to a shared resource from different
> > threads.
>
> Ok, lets check that on the production machine:
>
>  PGConnection allInstances collect: [ :each | each instVarAt: 4 ]
>  --> #(a Socket[connected] a Socket[connected] a
> Socket[connected] ...
>
>  PGConnection allInstances collect: [ :each | (each
> instVarAt: 4)
> oopString ]
>  --> #('1810' '2601' '1928' '1596' '2804' '2495' '3428'
> '194' '3276' ...
>
> All sockets used by the PGConnection instances seem to be different,
> right? So this should be not the problem.
Perhaps better:
 | usedSet |
 usedSet := IdentitySet new.
 PGConnection allInstancesDo:[:each|
  | socket |
  socket := each instVarAt: 4.
  (usedSet includes: socket)
   ifTrue:[self error: 'Socket shared by multiple
connections].
  usedSet add: socket.
 ].
>
> > First of all, run the above example and see what this gets
> > you. It may be that the VM is merely reporting that the
> > "otherEndClosedButNotThisEnd" and that your code deduces from
> > this that there is no readable data on the socket (e.g.,
> > #isConnected will return false which in turn may screw
> > #getData and friends). In this situation you may still be able
> > to read data from the socket (can't say without having tried
> > it; but see what #primSocket:receiveDataInto:startingAt:count:
> > reports) and simply work around this problem.
[snip]
> I do not understand what you actually suggest to change.
Nevermind. This was under the assumption that a) reports
otherEndClosedButNotThisEnd. It is of no importance any longer.
That said, it would be interesting to find out where your code actually
fails - from the above you can see that if the server closes the
connection rapidly, you can have time outs very soon (for example in a
loop) and maybe that's your problem. It would be helpful if you could
investigate some of the cases where you get the error and report the
expected and actual state back.
Cheers,
  - Andreas

On Fri, 12 Sep 2003, Colin Putney wrote:
> Another thing you might try is switching clients. Diego's ODBC
> package works quite well. It makes deployment more difficult though,
> and might require some reworking of your application, so I'd only do
> that as a last resort.
Using ODBC (or any external database lib) also means that the entire VM
will block during a query - probably not what you want for a heavily
loaded application