MegaGlest Forum

Archives (read only) => Vanilla Glest => Linux and other ports => Topic started by: titi on 11 February 2008, 09:47:31

Title: Let's fix the linux multiplayer problem! (new workaround!)
Post by: titi on 11 February 2008, 09:47:31
It looks like a lot of people have trouble running a multiplayer game without a crash. Please, let's all work together to find the bugs!

1. Play a local single-player game to ensure that your setup is OK.
2. Ensure that you all have the same binary and data (ideally 3.0.0).
3. For the moment, please play separate 32-bit and 64-bit Linux games.
4. Start from a console to see errors when it crashes.
5. Report crashes (and their output!) here. If possible, include which Linux distribution and hardware (GFX by ATI or NVIDIA) was used by all players.
6. Please also report successful games!!!
7. Report the compiler version that was used to build the binary (gcc --version).

(Update:
Use this script to start Glest and you get a logfile for every crash:
http://www.titusgames.de/runglest.tar.gz)

(I never had any trouble playing with my son, but we had very similar hardware and the same Linux distribution. I had some successful and some crashed games with others over the internet.)
Title:
Post by: AF on 11 February 2008, 11:15:06
You may be interested in the work tobi did for Spring regarding sync across Windows<->Linux and x86<->x64, and in the streflop library.
Title:
Post by: martiño on 11 February 2008, 11:53:32
Yeah, everything runs perfectly on Windows though. Windows/Linux and 32/64-bit compatibility will be our next focus.
Title:
Post by: titi on 11 February 2008, 12:27:53
That's great to hear!

Would it make sense for us to post our results here?
Title:
Post by: AF on 11 February 2008, 13:02:33
It only works because every working Glest binary is compiled from the same source with the same compiler on the same platform.

As soon as you pit a VS2005 build against a mingw32 build, or a newer gcc build against an older one, you get errors, because different compilers carry out floating point calculations slightly differently and with different precision. This generates tiny differences which desync the game, and as the game continues they compound into huge differences which can crash the game, depending on how network traffic is interpreted.

To deal with this, the Spring developers used streflop to pin down floating point precision, separated synced from unsynced code, and built the Windows release under mingw32 for better compatibility with *nix gcc builds.
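
As a small illustration of how such tiny differences can arise (a sketch only, not Glest or Spring code): assuming IEEE-754 single precision, adding the same three numbers in a different order already gives a different result. A compiler that evaluates or rounds intermediate values differently can have the same kind of effect. All function names here are made up for the example.

```cpp
#include <cassert>

// Adds three floats left to right: (a + b) + c.
// volatile forces each intermediate to be rounded to float precision.
float sumLeftFirst(float a, float b, float c) {
    volatile float partial = a + b;
    volatile float total = partial + c;
    return total;
}

// Adds the same three floats right to left: a + (b + c).
float sumRightFirst(float a, float b, float c) {
    volatile float partial = b + c;
    volatile float total = a + partial;
    return total;
}

// Usage sketch: near 1e8 the spacing between floats is 8, so adding 1.0f
// to -1e8f is rounded away, and the two orders disagree.
void demoOrderDependence() {
    assert(sumLeftFirst(1e8f, -1e8f, 1.0f) == 1.0f);
    assert(sumRightFirst(1e8f, -1e8f, 1.0f) == 0.0f);
}
```

In a lockstep game, a difference like this in a single position or damage value is enough for two machines to diverge from then on.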
Title:
Post by: martiño on 11 February 2008, 13:34:01
Quote from: "titi"
That's great to hear!

Would it make sense for us to post our results here?


Yeah, that would be really useful. I'm especially interested in the 32/64-bit problem and also in gcc 3 vs. gcc 4 issues.

We are aware that floating point is not deterministic across builds. We know this is not an issue on Windows, since we provide our own binaries; the issue arises when people start compiling with different compilers and on different machines. We are thinking about a way of fixing it: we might use streflop or just fixed point maths.

Regards.

Martiño.
Title:
Post by: AF on 11 February 2008, 15:17:45
Fixed point maths may not be wise, as it entails a performance hit. Streflop is not the only piece of code out there for this, especially since it was a pre-existing library that was totally rewritten by the Spring devs, IIRC. Perhaps some research is in order?
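
For readers unfamiliar with the term, here is a minimal sketch of what "fixed point maths" could look like. It is an assumption-laden illustration, not anything from the Glest source: the 16.16 format, the Fixed struct and all function names are hypothetical. The idea is that every value is stored as an integer scaled by 65536, so the arithmetic is plain integer arithmetic and gives the same bits on every compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16.16 fixed-point number: the real value is raw / 65536.
struct Fixed {
    std::int32_t raw;
};

const std::int32_t kOne = 65536;  // 1.0 in 16.16 format

// Converts a small integer (must fit in 16 signed bits) to fixed point.
Fixed fromInt(int value) {
    assert(value >= -32768 && value <= 32767);
    Fixed f = { static_cast<std::int32_t>(value) * kOne };
    return f;
}

// Addition is plain integer addition (assuming the sum does not overflow).
Fixed add(Fixed a, Fixed b) {
    Fixed f = { a.raw + b.raw };
    return f;
}

// Multiplication widens to 64 bits, then rescales. Integer division
// truncates toward zero, so the rounding is the same on every platform.
Fixed mul(Fixed a, Fixed b) {
    std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
    Fixed f = { static_cast<std::int32_t>(wide / kOne) };
    return f;
}
```

The widening multiply and the rescaling division in mul are where the extra cost compared to a hardware float multiply comes from, and range and precision are limited by the chosen format.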
Title:
Post by: martiño on 11 February 2008, 15:50:27
Yeah, we will investigate. I would like to avoid third-party libraries if I can, though; Glest already has a lot of external dependencies.
Title:
Post by: titi on 11 February 2008, 17:27:17
If you use the following start script for Glest, you will have a logfile for every crash.
Unpack this into your Glest installation and start Glest with the script runglest.sh instead of glest.
http://www.titusgames.de/runglest.tar.gz
This will create a logfile in the Glest directory.

If you didn't install Glest in your user directory, see the last lines of the script and uncomment the things you need.
Title:
Post by: titi on 11 February 2008, 17:42:54
So here we go, the first crash :(

glest 3.0.0
Server: Ubuntu 7.10, 32-bit, Nvidia GFX (binary compiled with gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2))
Client: Ubuntu 6.06, 32-bit, Nvidia GFX (binary compiled with gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5))


Everything starts up fine, but after 10 minutes the client crashes.
- No error message, because we started without a console (sorry, the next one will have a log!)
-------------------------

Next game, same computers, but client and server swapped roles.
Now with an error message:
Exception: Can not find command type with id: 8 in unit: daemon
-------------------------
Title:
Post by: martiño on 11 February 2008, 17:47:11
I think one critical thing might be which version of GCC was used.
Title:
Post by: AF on 11 February 2008, 20:25:27
It's not quite a 3rd-party lib; it's not like you can just plug it in, and whatever you do could well have far-reaching effects across the code base. But if you ever want Windows vs. Linux games without Wine, you don't have many other options.
Title:
Post by: titi on 11 February 2008, 21:46:49
AF, why are you always so aggressive? That's probably why no one wants to answer you. Calm down a bit! Choose a gentler way to say things.
Title:
Post by: ttsmj on 12 February 2008, 20:36:33
Quote
gcc (GCC) 4.1.3
Quote
gcc (GCC) 4.0.3


Hey titi, next time we're going to use the same binary, OK? And what about testing the latest svn release?

By the way, there is this option: configure --enable-debug. What does it do when I compile this way? Will the game have more detailed terminal output? Or what's the difference?
Title:
Post by: AF on 13 February 2008, 14:01:11
titi if I wanted to be aggressive I'd use big text and flashy colours!
Title:
Post by: Duke on 13 February 2008, 18:03:44
Agreed. The worst I could say about AF's style is that it is not polite, but it is not impolite either, and far from aggressive.

About the topic: I had this exception with the daemon once, even in single player.
I think the situation was that the unit was slain but did not fall down, and when I tried to move it, the game couldn't find it.

So it could be that, aside from the asynchronisation, there might be some kind of packet loss.
Title:
Post by: AF on 13 February 2008, 18:20:14
Well before we aim at fixing the sync problem, we should be able to prevent it causing crashes, after all a slew of console error messages and warnings is far more useful than a crash message, and it would help track down desync causes too.

As for me, I'd say to the point and perhaps blunt.
Title:
Post by: titi on 13 February 2008, 19:22:44
Duke, did you really have this crash in single-player mode???
My sons and I play on Linux very often and we have never, never, never had a problem like this! But if that's really the case, it's probably just a "simple" bug and not the big sync trouble.

Another thing that we should try here is my binary, built on Ubuntu Dapper 6.06 with glibc 2.3.6. This should (hopefully) run on a lot of systems. Let's all try to use this binary in multiplayer mode.

It's Glest 3.0.0:
http://www.titusgames.de/linuxglest300.tar.gz

@AF: I looked at the code, and it would be easy to ignore these errors and give warnings instead. But let's wait a bit to see what martino and matzeB will do. I think/hope they are on it and there will be some debug sessions soon.
Title:
Post by: AF on 13 February 2008, 19:52:28
hmmm.

My NTai AI project had a buffer overrun bug in it for months, at the time I didn't know how to use a debugger and it never crashed for me, hence why it went undetected. One day I learnt how to debug C++ programs, and found it, and at the next release a large group of people started commenting on how they'd never been able to play before now.

That was a long time ago, but it makes the point that sometimes a crash bug only affects some people despite everyone using the same binary and OS. Weird ^_^
Title:
Post by: martiño on 13 February 2008, 20:18:07
Quote from: "AF"
Well before we aim at fixing the sync problem, we should be able to prevent it causing crashes, after all a slew of console error messages and warnings is far more useful than a crash message, and it would help track down desync causes too.

As for me, I'd say to the point and perhaps blunt.


I completely disagree; the crashes are intentional (it would be as difficult to let the game run). It is far better to get a crash and know that the game is desynchronized than to just keep playing a game which is different on every machine.

As for a way of fixing the problem, we are considering redistributing some kind of "reference binaries", so everybody has the same exe.
Title: IT WORKS!!!!!!!
Post by: titi on 13 February 2008, 21:50:52
NO CRASH WITH SAME BINARY!!!!
We played some games today using my binary and there are no more crashes!!!!!! It simply works!
Really different hardware this time and no more errors!

So please use it and tell me about errors:
http://www.titusgames.de/linuxglest300.tar.gz

My "reference" binary is built against an old glibc (2.3.6), so it should run on most systems! (Built on Ubuntu 6.06 Dapper.)
Martinho, my binary could probably be the Linux reference; it's tested :)


-----------------------------------------
tested gaming:
host:
AMD Athlon(tm) 64 Processor 3400+
NVidia Geforce 6600GT
1GB ram
Ubuntu 6.06 Dapper

client:
Intel(R) Celeron(R) CPU 2.00GHz
GeForce 6200/AGP/SSE2
Title:
Post by: titi on 13 February 2008, 22:48:37
martino, when you release version 3.1.0 you should probably put your old binary check in. This will help ensure that all players use the same binary.

If you find such an error there is no need to quit the whole game (in my opinion). Just give an error message and go back to the main menu. This would be better!!
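
The "error message and back to the main menu" idea could be sketched roughly as below. This is not Glest code: ProgramState, updateGame and runOneUpdate are hypothetical names, and the real program structure is assumed to be more involved. The sketch only shows the exception being caught at the game level instead of terminating the program.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical top-level program states.
enum ProgramState { STATE_MAIN_MENU, STATE_IN_GAME };

// Stand-in for one game update; throws when a desync is detected.
void updateGame(bool inSync) {
    if (!inSync) {
        throw std::runtime_error("Game out of synch");
    }
}

// Runs one update. On error, the message is stored for display and the
// program falls back to the main menu instead of exiting.
ProgramState runOneUpdate(bool inSync, std::string& lastError) {
    try {
        updateGame(inSync);
        return STATE_IN_GAME;
    } catch (const std::runtime_error& e) {
        lastError = e.what();
        return STATE_MAIN_MENU;
    }
}

// Usage sketch: a desynced update ends the game but not the program.
void demoReturnToMenu() {
    std::string error;
    assert(runOneUpdate(true, error) == STATE_IN_GAME);
    assert(runOneUpdate(false, error) == STATE_MAIN_MENU);
    assert(error == "Game out of synch");
}
```

The player still gets a clear notice that the game desynchronized, which keeps the property martiño wants (nobody keeps playing a diverged game) without killing the whole process.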
Title:
Post by: Duke on 14 February 2008, 01:22:29
When I think about it, I'm not sure if it was exactly THIS bug, but the one I'm referring to was definitely in single player, since I haven't played multiplayer yet.

It is just a hint that it could be a very rare bug that becomes much less rare due to the async.
Maybe the client is commanding a unit that does not exist on the server?

Of course reference binaries are a solution to the async caused by slightly different mathematics.
But titi only tested on a LAN, right?

Is there code to catch async that is caused by lag due to high ping over the internet? I somehow doubt it, because in my understanding such code should be able to handle the other async as well.
Title:
Post by: titi on 14 February 2008, 10:00:54
No, it's tested in real life with high ping over the internet. The client had some speedups to get back into sync (when the ping was really bad), but that's all.
We played about 3 hours without a crash. When we used different binaries, we only had to wait a few minutes until it crashed.
Title:
Post by: jrepan on 14 February 2008, 14:28:28
One binary is not a very good solution. It has many problems:
1. Linux and Windows (and FreeBSD) binaries are different anyway
2. You can't add patches if you can't compile
3. Not everybody has the same versions of the libraries, and static binaries are a waste of space
Title:
Post by: Duke on 14 February 2008, 17:47:39
True, but at least we can be relatively sure that there is only just this one problem.
Title:
Post by: AF on 14 February 2008, 22:10:40
It's likely to cause more problems than it solves.

Also, intentional crashes are bad because there's no way for a user to tell bugs from desyncs, and potential bug reports are lost. It's also unprofessional looking and reflects badly on your coding efforts when the end user experiences it.

For example how is the user to know that there's a horrendous bug introduced in a new version when they go "oh its a damned desync" every single time? Or when every crash is immediately labelled a desync when users ask about it? Indeed the game should not continue without any notice, but a crash is not the only option available nor is it the most appropriate.
Title:
Post by: Duke on 15 February 2008, 00:11:46
Agreed. If it is called a release and not a beta, it shouldn't crash every 10 minutes.

It's worse if they don't run it in a terminal, because then there is no message at all and the game is simply ... gone 0o.

My prof once said something along the lines of: "always catch an exception as early as possible to minimise its consequences".

As titi said, if a unit doesn't respond properly it might be wise to stop the current game, but not to crash the entire program.

And even then you could give the user a warning and ask whether they want to see if the game carries on when the unit in question is simply removed.
Title:
Post by: KaSek on 15 February 2008, 15:06:47
Hi.

I think the cause of the problem is the implementation of Glest.

I've ported Glest 3.0 to Mac OS X. As far as I know, multiplayer works when Macs connect to each other.

I found that the original code puts C++ objects (not data) into data packets (noooo!).
So I had to find another way. What's more, it ignores not only endianness but also data alignment. My Mac version resolves these problems and doesn't depend on a specific platform, I think. But it causes a little overhead.
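
To make the difference between "object" and "data" concrete, here is a rough sketch of field-by-field encoding with a fixed byte order. It is not KaSek's actual code, which is not shown in this thread; the CommandMessage struct, its fields and the 6-byte layout are assumptions for illustration. Because each byte is written explicitly, the wire format does not depend on the sender's endianness, struct padding or alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical network message; the fields are illustrative only.
struct CommandMessage {
    std::uint16_t unitId;
    std::uint32_t commandTypeId;
};

// Encodes the message as 6 bytes, most significant byte first.
std::vector<std::uint8_t> encode(const CommandMessage& m) {
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>((m.unitId >> 8) & 0xFF));
    out.push_back(static_cast<std::uint8_t>(m.unitId & 0xFF));
    out.push_back(static_cast<std::uint8_t>((m.commandTypeId >> 24) & 0xFF));
    out.push_back(static_cast<std::uint8_t>((m.commandTypeId >> 16) & 0xFF));
    out.push_back(static_cast<std::uint8_t>((m.commandTypeId >> 8) & 0xFF));
    out.push_back(static_cast<std::uint8_t>(m.commandTypeId & 0xFF));
    return out;
}

// Decodes exactly the 6-byte layout produced by encode().
CommandMessage decode(const std::vector<std::uint8_t>& in) {
    assert(in.size() == 6);
    CommandMessage m;
    m.unitId = static_cast<std::uint16_t>((in[0] << 8) | in[1]);
    m.commandTypeId = (static_cast<std::uint32_t>(in[2]) << 24) |
                      (static_cast<std::uint32_t>(in[3]) << 16) |
                      (static_cast<std::uint32_t>(in[4]) << 8) |
                      static_cast<std::uint32_t>(in[5]);
    return m;
}
```

The per-field shifting is the "little overhead" compared to copying the raw object into the packet, but a real receiver would also have to validate the length and contents of untrusted input rather than assert on them.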

Now I'm preparing the source code of the Mac version and I'm going to post it on my web site. Please take a look at it. I will announce when it's available in the "Glest 3.0 package on Mac OS X" thread.
Title: multiplayer crash with same binary
Post by: ttsmj on 15 February 2008, 15:09:32
Today our multiplayer game crashed even though we were playing with the same binary!

server: ttsmj, Ubuntu 7.10 32 bit (titi's binary, sourceforge data)
client: justWeedy, sidux 32 bit (titi's binary, sourceforge data)

Console message on the client was:
Code:
Exception: Disconnected

If you need some more info, tell me.
Title:
Post by: titi on 15 February 2008, 15:14:52
This is a network timeout. This would happen on Windows too. The connection was too bad or got disconnected!
Title:
Post by: martiño on 15 February 2008, 15:16:39
Glest doesn't make much of an effort to keep connections alive when they die, so if you have a bad connection you will have trouble playing the game. LAN and broadband users should be fine.
Title:
Post by: weedkiller on 15 February 2008, 15:50:32
Hi, I played with him.
Yes, it looks like it was a ping problem.
The bandwidth used was under 10 Mbit (?), but when I went to check on it, the game closed because it was no longer in focus.
I assume that since it was no longer the active program, the commands were processed more slowly, and then the connection was too slow to continue.

I think it's quite hard to cope with big pings, although it felt like the connection was fast enough at the beginning. I'm not sure if there was only one moment when the connection got too slow. Is it possible for the server to find out that it's too slow and wait for some time? But how should the client know... ;)
Title:
Post by: AF on 15 February 2008, 16:49:01
If the actual memory footprints of C++ objects are being copied across the network, then it appears we have found a huge security flaw. Someone could exploit this to install a worm or trojan, or cheat code. It's also a big no-no, as said before, if there's ever to be communication between Glest builds from different compilers.
Title:
Post by: weedkiller on 16 February 2008, 09:26:56
Hi again,
I played again with ttsmj and everything went fine, so it really was just a bad connection.
We played against 2 computer AIs and there was only a little slowdown compared to single player.
It was really playable.

OK, how much traffic does the AI need? Is it like an additional human player, or does it need less bandwidth?
Title:
Post by: ttsmj on 22 February 2008, 14:35:37
I am playing games with different binaries through loopback (127.0.0.1).

binary1: GCC 4.0.3 ---- titi's binary
binary2: GCC 4.1.3 20070929 (prerelease) -- compiled by me

The game starts fine and runs, but sooner or later it crashes with this message:

Code:
Exception: Can not find command type with id: 1 in unit: energy_source. Game out of synch
In game/commander.cpp:
Code:
ct= unit->getType()->findCommandTypeById(networkCommand->getCommandTypeId());

//validate command type
if(ct==NULL){
    throw runtime_error("Can not find command type with id: " + intToStr(networkCommand->getCommandTypeId()) + " in unit: " + unit->getType()->getName() + ". Game out of synch");
}



When does unit->getType()->findCommandTypeById(1) return NULL?
Title:
Post by: ttsmj on 22 February 2008, 14:39:04
In types/unit_type.cpp:

Code:
const CommandType* UnitType::findCommandTypeById(int id) const{
    for(int i=0; i<getCommandTypeCount(); ++i){
        const CommandType* commandType= getCommandType(i);
        if(commandType->getId()==id){
            return commandType;
        }
    }
    return NULL;
}
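
Reading the function above, it returns NULL exactly when none of the unit type's command types has the requested id, for example when the receiving side resolves the network command to a unit type that simply has no such command. The following self-contained model shows that behaviour. The CommandType and UnitType structs are simplified stand-ins with assumed layouts, not the real Glest classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for Glest's CommandType (assumed shape).
struct CommandType {
    int id;
};

// Simplified stand-in for Glest's UnitType (assumed shape).
struct UnitType {
    std::vector<CommandType> commandTypes;

    // Same linear search as the original: first match wins, else NULL.
    const CommandType* findCommandTypeById(int id) const {
        for (std::size_t i = 0; i < commandTypes.size(); ++i) {
            if (commandTypes[i].id == id) {
                return &commandTypes[i];
            }
        }
        return NULL;
    }
};

// Usage sketch: a unit type with commands 0 and 1 finds id 1, while a
// unit type with no commands at all returns NULL for the same id.
void demoLookup() {
    UnitType worker;
    CommandType stop = { 0 };
    CommandType move = { 1 };
    worker.commandTypes.push_back(stop);
    worker.commandTypes.push_back(move);

    UnitType resource;  // no commands registered

    assert(worker.findCommandTypeById(1) != NULL);
    assert(resource.findCommandTypeById(1) == NULL);
}
```

So once two machines disagree about which unit a command refers to, a perfectly valid command id on the sender can be unknown to the unit type the receiver looks it up in, and the validation in commander.cpp throws.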