Connection problems


Perhaps the most frustrating occurrence on BBO is to be doing well in a tournament when an opponent seemingly pauses for a minute or two, and suddenly you are removed from the tournament. The director says you weren't playing!

 

It is the nature of the Internet that delays and dropped packets will occur, and BBO seems to be rather clumsy at recovery. What often seems to be the case is that the client believes it is waiting for the server when, in fact, the server is waiting for the client.

 

In this afternoon's example, for whatever reason, I was seeing occasional dropped packets running a ping test, but actually seeing better than usual ping times overall. I have this monitoring running virtually constantly and I could see that I was accessing the Internet successfully at the time I thought I was waiting for the opponent to play to the first trick.

 

Are the messages serialized and prioritized? How long can the client wait without inquiring as to the state of the connection? If there were some kind of keepalive function in the client, users could be alerted to problems sooner and could try to reestablish the connection.

 

The thing I feel sure of is that there is room for improvement in the way sessions are monitored and managed.


The following might help.

 

1. Check your connection to bbo.bridgebase.com; these are the machines your client actually connects to

 

2. The client is supposed to send a heartbeat every xx seconds (15? something like that)

 

3. When you are playing, the server will boot you if it does not detect any activity in 60 seconds (i.e., no heartbeat, nothing else)

 

4. The client will boot itself if it does not detect activity in a much longer period, perhaps 3 mins.
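Taken together, points 2-4 amount to simple timeout logic. A sketch follows; the constants are the approximate intervals mentioned above, and the function names are illustrative, not BBO internals:

```python
HEARTBEAT_INTERVAL = 15   # client heartbeat period (point 2, approximate)
SERVER_TIMEOUT = 60       # server-side silence limit (point 3)
CLIENT_TIMEOUT = 180      # client-side silence limit (point 4, "perhaps 3 mins")

def server_should_boot(last_activity, now):
    # Point 3: any activity counts -- a heartbeat, a bid, a chat, anything.
    return now - last_activity > SERVER_TIMEOUT

def client_should_reconnect(last_server_message, now):
    # Point 4: the client tolerates a much longer silence before giving up.
    return now - last_server_message > CLIENT_TIMEOUT
```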

 

For now, the best thing to do when you suspect someone is stuck (you can keep an eye on the number of players in the system via the Lobby button to the right) is to try to send a chat.


Uday, thanks for the comments. The reason that I placed this thread in the "Suggestions for the Software" forum is that all of the things you suggest are sometimes not enough.

 

2. Client is supposed to send a heartbeat every xx seconds (15? something like that)

 

3. When you are playing, the server will boot you if it does not detect any activity in 60 seconds (i.e., no heartbeat, nothing else)

 

These in particular seem insufficient. The easiest way I see to do better would be to attach, say, a serial byte to each message to and from the server. If the client sees serial 86 and serial 85 is missing, it will instantly know that it could have missed something important, like the play of a card from dummy, and it can notify the server of the missing message.

 

Ditto for the server: if a serial entry is skipped, it can immediately request that the missing action be resent.

 

A small field, like a byte, is probably plenty big enough. Cycle from 0 to 255 then wrap around. Yes, theoretically a block of 256 messages could be missed and go undetected. In that case we'd be no worse off than we are today. Even though this process adds a tiny overhead, I'm guessing it could save BBO bandwidth overall. It appears to me that losing the connection and reconnecting actually represents a pretty substantial chunk of bytes that might be avoided more often.
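A sketch of the serial-byte scheme proposed above (the names are hypothetical; nothing here is BBO's actual protocol):

```python
def next_serial(s):
    # Cycle from 0 to 255, then wrap around, as described above.
    return (s + 1) % 256

def detect_gap(last_seen, received):
    """Return the serial numbers skipped between the last message seen
    and the one just received (empty list if nothing was missed)."""
    missing = []
    expected = next_serial(last_seen)
    while expected != received:
        missing.append(expected)
        expected = next_serial(expected)
    return missing

# Client saw serial 84, then serial 86 arrives: 85 went missing.
assert detect_gap(84, 86) == [85]
# Wraparound works too: after 254, seeing 1 means 255 and 0 were missed.
assert detect_gap(254, 1) == [255, 0]
```

Either side could then ask for the listed serials to be resent, instead of tearing down and rebuilding the whole connection.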

 

I mentioned priority in the first post because if it is discovered that several messages are missing, it is most important to get those messages resent that affect the state of play of a hand. The chat, players, counts, tables, etc., can wait.


Messages never go missing; the typical symptom is that data stops flowing along the pipes and it takes a while to discover this. In your first example, the problem might be, for instance, that your PC is no longer connected (but doesn't know it yet).

Messages never go missing;

 

That doesn't appear to fit situations I've seen repeatedly. (A common instance is where I continue to get chat messages right up to the point that I go through a "connection lost, reconnecting" cycle.) Moreover, it isn't the way the Internet works. It takes some kind of elaborate session protocol of the kind I described to make up for packets that may be dropped at any point along the way. Without a more detailed explanation of how that supposedly works with BBO, I can only infer that it is a popular notion of how somebody thinks it is supposed to work but doesn't.



Messages never go missing;

 

That doesn't appear to fit situations I've seen repeatedly. (A common instance is where I continue to get chat messages right up to the point that I go through a "connection lost, reconnecting" cycle.) Moreover, it isn't the way the Internet works. It takes some kind of elaborate session protocol of the kind I described to make up for packets that may be dropped at any point along the way. Without a more detailed explanation of how that supposedly works with BBO, I can only infer that it is a popular notion of how somebody thinks it is supposed to work but doesn't.

BBO uses a protocol called TCP/IP which includes guarantees that messages don't go missing.

 

Suggest you don't try to give Uday lessons on how the Internet works. If you want to solve this problem you would be better off listening to him.

 

Fred Gitelman

Bridge Base Inc.

www.bridgebase.com


I'm old enough not to be offended by much these days, but I do think we're having a miscommunication.

 

Assuming no bugs in our code for the purposes of the next paragraph:

 

Client and server exchange information via messages, which are simply blobs of information that one side is trying to send to the other. An example is: the server wants to tell the client that John has logged in. So it sends the client a "message" that says "John has logged in". Messages are sent in some order, and are received in the same order. Assuming no bugs in the lower level code that is involved (a safe assumption for now), this means that we can't receive message 2 unless we've already received message 1.

 

Assuming bugs, of course anything is possible. It is possible that messages relevant to the table are being discarded by BBO (or not sent in the first place) while messages relevant to chat are not. I consider this an unlikely scenario but won't rule it out.

 

(A common instance is where I continue to get chat messages right up to the point that I go through a "connection lost, reconnecting" cycle.)

This is not inconsistent with a damaged connection. The messages might be older.

Example:

you appear to be frozen to everyone else at the table, who are peppering you with "are you there?" every (say) minute.

From your perspective, the connection is really slow, and you get the messages (say) every 2 minutes.

So after 10 minutes, you are 5 minutes "behind" everyone else even though you think you are at the table.

 

 

Anyway, back to the real problem, which is, I think, "How can a person who is having conn trouble be informed of this in a more timely fashion?"

 

Unfortunately, I can't think of a cheap way. I suppose I could send a rapid stream of "are you there?" to the client while the client is at a table, so that if I don't get back the expected response in a reasonable timeframe I won't have to wait to boot the client. In an ideal world perhaps the client would do the same thing. But I have to protect more than the odd poor connection -- I have to protect the bandwidth of the network, and I have to protect the machines that are involved.

 

I think the best thing to do for the moment is send a "..." to the table when you see an unusually long delay. If the "..." echoes back to you immediately, either you are not having a conn problem or you have discovered a new bug.

 

u


Unfortunately, I can't think of a cheap way. I suppose I could send a rapid stream of "are you there?" to the client while the client is at a table, so that if I don't get back the expected response in a reasonable timeframe I won't have to wait to boot the client.

If I understand correctly, you send a ping every 15 seconds. If you fail to hear from four pings in a row, you disconnect them.

 

The two things I can think of are...

 

1. If you fail to hear a ping, you put a red dot on the user from HIS perspective, as well as the other users on the table. Obviously, this will only be successful if you've lost one-way communication (he can still 'hear', but not 'talk').

 

2. Put a timer on the PC side of the application, so if it hasn't received a ping in 30 seconds it puts up a message saying 'You appear to be disconnected. Reconnect?', in which your replies would be 'Yes', which would log you out of BBO and try to log you back in, or 'Wait', which would just put you out there as if nothing had happened.

 

#1 is easy, but it wouldn't benefit 90% of the people. #2 would be enormously helpful: you can often disconnect and reconnect and get on successfully, and if you realize you're stuck early enough they won't have subbed you. It would take some coding, but it would have virtually no overhead, since the only part required of the server side is the 15-second ping, which is already happening. The PC would just check every 5 seconds to compare the current time with the time the PC last received a ping, and if it's more than 30 seconds send the message (and reset the ping clock, so it wouldn't give you the message every 5 seconds).
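Suggestion #2 above can be sketched as a small client-side watchdog. The 5- and 30-second values come from the post; the class and method names are illustrative:

```python
CHECK_EVERY = 5       # how often the PC compares clocks (seconds)
PING_TIMEOUT = 30     # silence threshold before prompting the user

class PingWatchdog:
    def __init__(self, now):
        self.last_ping = now

    def on_ping(self, now):
        # Record the arrival time of each server ping.
        self.last_ping = now

    def check(self, now):
        # Called every CHECK_EVERY seconds. Returns True when the
        # "You appear to be disconnected. Reconnect?" prompt should be
        # shown; resets the clock so the prompt isn't repeated every
        # 5 seconds, exactly as the post describes.
        if now - self.last_ping > PING_TIMEOUT:
            self.last_ping = now
            return True
        return False

w = PingWatchdog(now=0)
assert not w.check(10)    # pings still fresh
assert w.check(31)        # 31 s of silence: prompt the user
assert not w.check(35)    # clock was just reset, no immediate re-prompt
```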


:) This thread finally seems to be getting some traction!

 

Inasmuch as I run a tiny ISP, I really do know whereof I speak.

 

I'm old enough not to be offended by much these days, but I do think we're having a miscommunication

 

Uday, I'm just trying to get the facts out here and I'm glad you are not offended.

 

Assuming no bugs in the lower level code that is involved (a safe assumption for now), this means that we can't receive message 2 unless we've already received message 1.

 

At the packet level, your message is dispatched on a best-effort basis. But if congestion, a transient hardware glitch, or anything else goes wrong anywhere along the way, your packet may be silently dropped.

 

So the question comes up about that lower-level code anyway, and it needn't be a question of bugs per se. What I can be sure of is that there is not an end-to-end acknowledgment of each message before the next message is sent. There is too much packet latency (ping time, if you will) between even fast, well-connected machines to wait out each message for confirmation. So BBO sends many packets at once and the client must parse them to see what is missing.

 

As soon as packets are detected as lost, some kind of recovery effort must begin. What I'm saying is that it is clear to me that the recovery efforts currently in place could be substantially improved.

 

Here are some ideas:

  • Make sure that the lower-level code, in fact, immediately detects a missing packet and informs the client of an incomplete message.

  • When an incomplete message is detected, the client should light the red dot on the user's name, just as the server obviously orders up the light for other users with problems to be displayed on my screen.

  • Separate play messages from all others and ensure their delivery first.

 

But I have to protect more than the odd poor connection -- I have to protect the bandwidth of the network, and I have to protect the machines that are involved.

 

What I'd be interested in seeing quantified is the amount of traffic BBO must process in doing a connection from scratch compared to the amount of traffic associated with sessions that are already established. My guess, given the relatively long setup time at the beginning (especially when 10,000+ players are on), is that it is a very significant part of your load.

 

So, I see this discussion as ultimately pointing toward a win-win solution. You can reduce your bandwidth consumption because users with transient problems are saved a full "Connection lost ... reconnecting" cycle and the players get to continue their games.


I used to be a senior technical support engineer at a worldwide ISP (Genuity, which was originally BBN, the company that practically created the Arpanet and Internet), and I know what I'm talking about. And many of our customers were the operators of small ISPs, so I also know how little you can know about TCP/IP and still be able to set up an ISP (I remember having to talk one of them through setting up a mail server).

 

Anyway, TCP performs exactly the type of sequencing scheme that you describe. Every packet that it sends contains a sequence number, and TCP will never send the data in packet N to the application until it has received packets N-1, N-2, and so on. It's not necessary for applications to deal with missing data, because TCP guarantees in-order delivery. If a packet goes missing, all the packets after it will be held up until the missing packet is retransmitted. And low-level acknowledgements are used to detect when packets have gone missing and need to be resent.

 

However, TCP doesn't make acknowledgements (or lack thereof) visible to applications. This is why BBO sends its own periodic messages, and detects that there's a connection problem when it doesn't see these on schedule. That's how the red dot works.

 

TCP treats each direction of transmission independently. This is why it's possible for you to continue to receive chats, even though your bids/plays are not being sent to the server. And TCP uses a "sliding window" mechanism, which basically means that the sender doesn't wait for the acknowledgement of packet N before sending packet N+1 -- the window specifies how much can be sent before waiting for an acknowledgement. Modern systems typically allow 8K-32KB -- higher values allow for faster throughput in large data transfers. This allows quite a bit of chat to be received before the client declares that it has lost the connection.
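The sliding-window accounting described here can be illustrated with a toy model (the numbers are examples only, not measured values):

```python
def can_send(next_seq, last_acked, window, nbytes):
    # TCP's sender may have at most `window` unacknowledged bytes
    # in flight; more data must wait for an acknowledgement.
    in_flight = next_seq - last_acked
    return in_flight + nbytes <= window

# With an 8 KB window and 6 KB already unacknowledged,
# a 1 KB chat message still fits in the window...
assert can_send(6144, 0, 8192, 1024)
# ...but a 4 KB burst must wait for ACKs to slide the window forward.
assert not can_send(6144, 0, 8192, 4096)
```

This is why a client can keep receiving chat for a while even after its own sends have stopped being acknowledged.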

 

It does, however, seem feasible that when the server has detected a problem with the client (making you red), it could send something to the client. If the client is still receiving from the server, and the window hasn't been filled (if you're still seeing chat, these conditions are apparently met) then this should allow the client to display a warning that there's a communication problem.


I think the best thing to do for the moment is send a "..." to the table when you see an unusually long delay. If the "..." echoes back to you immediately, either you are not having a conn problem or you have discovered a new bug.

 

u

 

A slightly superior way, less likely to annoy an opponent who is merely thinking of which card to play, is to observe the number of people in the lobby. It generally ticks up or down every few seconds, and if one is lagged it too stops updating.


I used to be a senior technical support engineer at a worldwide ISP (Genuity, which was originally BBN, the company that practically created the Arpanet and Internet), and I know what I'm talking about. And many of our customers were the operators of small ISPs, so I also know how little you can know about TCP/IP and still be able to set up an ISP (I remember having to talk one of them through setting up a mail server).

Very well. We'll assume for purposes of our discussion that you actually do know something. I didn't ask you to configure my mail server but I'd be happy to put my resume up against yours any day.

 

You do reference a key part of the debate that may come up if we ever get to things that matter.

 

[I failed to see Fred's post, so the comment here made no sense and I deleted/edited it.]

 

With TCP connections, to protect play there should be a minimum of two sessions per client: one to carry the state of the hand in play and another to carry all the nonessential stuff. The QoS for play should then be set to real-time and the rest can be set to minimum cost.

However, TCP doesn't make acknowledgements (or lack thereof) visible to applications. This is why BBO sends its own periodic messages, and detects that there's a connection problem when it doesn't see these on schedule. That's how the red dot works.

That to me sounds like exactly the justification for getting BBO out of the TCP business if that is where it is. Those extra messages BBO and its clients send waste the bandwidth that uday wants to preserve AND the TCP layer is preventing the application from knowing about trouble in a timely fashion.

 

Much of the work with streaming media (see helixcommunity.org for example) is already working with both TCP and UDP, with UDP being the choice that works better more often.

 

BBO's server and every BBO user pay part of the price when disconnections occur. Doesn't it just make good business sense to work to minimize them?


You are assuming that BBO uses TCP and not UDP. That is a likely assumption, given the nature of the errors that occur. However, neither uday nor anyone who knows the internals of BBO has yet stated in this thread that BBO does it that way.

Hmmm...

 

BBO uses a protocol called TCP/IP which includes guarantees that messages don't go missing.

Here are some ideas:

 

  • Make sure that the lower-level code, in fact, immediately detects a missing packet and informs the client of an incomplete message.

  • When an incomplete message is detected, the client should light the red dot on the user's name, just as the server obviously orders up the light for other users with problems to be displayed on my screen.

  • Separate play messages from all others and ensure their delivery first.

Quick comment here:

 

Four job cycles back, I worked for a company called FTP Software. FTP did the first port of the UNIX BSD stack over to personal computers. I remember the bad old days when PCs didn't ship with their own TCP/IP and we had to buy them from 3rd parties for $500.

 

Three job cycles back, I worked for a company called GN Nettest designing test and measurement equipment. We specialized in troubleshooting TCP flow control issues.

 

Two job cycles back I worked for a company called Wind River. I was the product manager for the company's TCP/IP stacks.

 

I've spent WAY too much of my life working with TCP/IP stacks.

 

Here are a few quick comments. The last is (probably) the most useful.

 

1. QOS is a tool of the devil

 

2. Those who fail to learn the lessons of TCP are doomed to re-implement it. I've seen a lot of folks try to implement their own reliable application-layer transport protocol so they can run over UDP instead of TCP. This has always struck me as a complete waste of time. A very large number of very smart people have spent years working on TCP. If your name isn't Van Jacobson, you probably aren't going to be able to improve on what's already built into TCP.

 

3. One of TCP's features is that it preserves sequencing. As far as I can recall, the stacks that I worked with don't do an out-of-sequence application-layer write. If a packet goes missing you don't get that piece of data. You also don't get any of the data that came after it. (I can double-check this if folks really want. Things could have changed.) From the perspective of the application, it doesn't see out-of-sequence data and can't take action based on out-of-sequence data.

 

4. In theory, an upper layer application could start sending queries to a local TCP. For example... you might be able to ask the stack to inform you whether TCP keepalive messages are arriving on the stack. I've never seen anyone use this in an intelligent fashion.
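For what it's worth, most socket APIs do let an application switch on the stack's own TCP keepalives. A Python sketch follows; SO_KEEPALIVE is portable, the tuning constants are Linux-specific (hence the guards), and the interval values are arbitrary examples:

```python
import socket

def enable_keepalive(sock, idle=60, interval=15, probes=4):
    # Portable part: ask the stack to send TCP keepalive probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning knobs, guarded so this runs elsewhere too.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle seconds before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Failed probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
```

As the point above notes, this only makes the stack notice a dead peer; the application still has to be written to react to the resulting error.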

 

5. You already have the option to get all of the information that you want without any need for Fred and Uday to rewrite the BBO application. There is a very nice freeware protocol analyzer out there called Ethereal. You can download it from

 

http://www.ethereal.com/

 

Fire it up at the same time you're running BBO, slap on a couple of filters, and you have a real-time display of the traffic flow between your PC and the BBO servers.


Those extra messages BBO and its clients send waste the bandwidth that uday wants to preserve AND the TCP layer is preventing the application from knowing about trouble in a timely fashion.

BBO doesn't send streaming media; it sends ANSI codes. The loss of one byte wouldn't mean that some pixel would be black instead of red; it would mean that a card would be missing from dummy. UDP would be exactly the wrong way to send this information.

 

You seem to be confused about the timing here. The amount of time it takes for BBO to determine that a user has disconnected has nothing to do with the protocol. Uday deliberately set it to check every 15 seconds, with a 'fail' only being four non-receives in a row. Transmission speed isn't a real factor here; there's nothing stopping Uday from changing that value to 15/100 of a second, except that the processing power required by the server would be enormous, and the odds of four failures in a row would be much higher, especially from Eastern European ISPs.

 

Why you'd want to use a dual session on a piece of software designed to work with dial-up modems is beyond me. I guess I'll have to beef up the ol' resume.


Apologies to Fred and Inquiry, somehow I missed Fred's post in replying to the later one. I noted an edit in the post where I made my error.

Why you'd want to use a dual session on a piece of software designed to work with dial-up modems is beyond me. I guess I'll have to beef up the ol' resume.

Perhaps. Are you aware that the normal operation of a browser, even on dial-up, opens many sessions to get a page with numerous graphics, frames, etc? Using TCP for http sessions is probably the most successful use of TCP/IP sessions. Certainly it is what everyone sees who uses the web and what everyone would find unacceptable if it were run through a single session. (The technical-minded will know that it is typically a number of sessions that are opened and requests go to the session that gets free first so as to avoid the overhead of opening new sessions while a page is loading.) A single TCP session being preserved over an extended period of time, to carry both high- and low-priority information, is definitely problematic. All the more so on the Eastern European ISPs that jtfanclub mentioned.


Apologies to Fred and Inquiry, somehow I missed Fred's post in replying to the later one. I noted an edit in the post where I made my error.
Why you'd want to use a dual session on a piece of software designed to work with dial-up modems is beyond me. I guess I'll have to beef up the ol' resume.

Perhaps. Are you aware that the normal operation of a browser, even on dial-up, opens many sessions to get a page with numerous graphics, frames, etc? Using TCP for http sessions is probably the most successful use of TCP/IP sessions. Certainly it is what everyone sees who uses the web and what everyone would find unacceptable if it were run through a single session. (The technical-minded will know that it is typically a number of sessions that are opened and requests go to the session that gets free first so as to avoid the overhead of opening new sessions while a page is loading.) A single TCP session being preserved over an extended period of time, to carry both high- and low-priority information, is definitely problematic. All the more so on the Eastern European ISPs that jtfanclub mentioned.

A couple of additional points need to be discussed here.

 

TCP is a transport layer protocol. I'm part of the school that argues that using multiple separate TCP connections to try to virtualize the OSI session layer is a severe layering violation and is, ultimately, a lot more trouble than it's worth. (Admittedly, no one really takes the OSI stack that seriously any more... even so, certain points hold true.)

 

As we've already noted, TCP includes a whole bunch of different adaptive algorithms which attempt to optimize the rate at which data gets transmitted between a pair of TCPs. These algorithms have cute little names like Van Jacobson slow start and the like; however, they perform a very important task.

 

Let's assume that we have 30 different TCP connections going between a local HTTP client and a remote HTTP server. All of these connections are running on the same network. If that network is experiencing any kind of problems with congestion or packet loss or what have you, these problems will (typically) impact each and every one of the TCP connections in the same manner. Unfortunately, none of the TCPs share information with one another. Even if one TCP connection has already discovered the optimal data burst size or packet length to connect to the remote host, all of the other TCPs will still need to go through a slow start, path MTU discovery, and a bunch of crap.

 

It would be a lot easier if all the data ran over a single tcp connection. Unfortunately, the OS often acts to limit the rate of speed at which a single tcp is able to transmit, hence silly little hacks like opening multiple tcps to download a single web page.


When browsers open multiple connections, they're often to DIFFERENT servers. Many web sites have one server for static text, another for images, a third for the dynamically-generated content, etc. This is very different from the way BBO uses the network.

 

Also, using separate connections for the application data than the liveness checking could easily cause incorrect behavior. It's possible that some network problem could affect one connection but not the other. If it affects the liveness-test connection, a red dot would show up while you're still working fine. Conversely, if it only affects the bridge connection and not the liveness test, you'll be stuck but no red dot would appear. It's only really correct if your liveness test goes over the connection you care about for the application.

