When Puppets Lose Their Puppeteers

First up, news for pre-alpha testers:

The server is back up for a couple of days! If you have an hour, please log in with a new character and test out the newbie cave. I would love to get your feedback on the game from the start point up to where you leave that first cave. You can email me (ericheimburg@eldergame.com) or post on the forum. Let me know what you liked, what you hated, and what should be added to make it more fun or more intuitive. Anything, really.

If you had an old character, it’s been pseudo-wiped: it won’t show up in your list, but if you make a new character with the same name, it will be partially restored. Don’t make a character with an old name to test the newbie experience! You’ll appear in whatever area you used to be in, and you’ll completely miss the new starting cave. Once you’ve done the newbie experience, though, feel free to reclaim your old character names if you want. The pseudo-restoration feature will only be available for a week or so, so reclaim ‘em if you want ‘em! Only a handful of characters had earned significant levels already, so it’s not a big deal for most people.

If you emailed me to get into the pre-alpha and I haven’t replied, I’m sorry. I was waiting to have a server up and running to direct you to. I’ll be emailing you back soon!

Server Instability

One of the things that’s been keeping me from putting the pre-alpha server back up is that it’s no longer as stable as it used to be. There’s a nasty memory leak in Unity 3.5, somewhere in the guts of the engine.

I’ve been trying to find the memory leak, but it’s very hard. It’s not caused by my code (that would be much easier to find). Instead, the game just slowly leaks heap memory, and the Unity profiler doesn’t show any new object allocations actually happening, so the leak must be somewhere deep inside the engine. My best guess is that it’s some subtle bug in the new pathfinding system: something about the way I’m using it is causing it to leak. But I can’t find any solid clues.

So I’ve been doing the old programmer’s voodoo dance: “If I remove this, does it stop leaking? No… how about if I change this? No… how about…” Just randomly flailing to try to find clues about what causes the leak. I’ve wasted a lot of time on this and still no luck.

Players might hit this crash too if they stay logged in for enough hours, but it’s not a super fast leak, so I don’t even really care about that right now. It’s a much bigger deal on the server. Since the server is up 24/7, a memory leak will eventually crash it. (It stays up between 2 and 6 hours. What determines how long it stays up? I don’t know!)

“The Server” vs. Sub-Servers

But when I say “the server” crashes, that’s not really right. The server is actually pretty complex (surprise?) and has lots of parts, and only one part is crashing. But it’s an important part.

The main game logic is written in Java and runs in SmartFoxServer 2X. This part is still very stable. But the Java code doesn’t understand the physical 3D world. It knows how many Hit Points everybody has, and all their other stats, and it knows their x,y,z world location, but it doesn’t know what’s really at that location. Is it on the ground? In the water? On the edge of a cliff? In a tiny house? It doesn’t know.

To answer those questions, there are separate sub-servers that interpret the 3D world. These sub-servers are written in Unity, and run in Unity’s “headless” mode (meaning they have no graphics output). They just quietly sit there moving NPCs around at the beck and call of the Java server. The Java server calls them “puppeteers”. It tells them, in effect, “Hey, drive Skeleton #511 around. He wants to kill Bob and his AI routines are a, b, and c. Let me know when he’s close enough to attack Bob.”
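To make that concrete, here is a minimal sketch of what such an instruction might carry. It is written in Python purely for illustration (the real servers are Java and Unity), and the field names and the JSON encoding are assumptions, not the actual protocol:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DriveRequest:
    """Hypothetical 'drive this NPC' instruction sent to a puppeteer."""
    npc_id: int            # e.g. 511 for "Skeleton #511"
    target: str            # who the NPC wants to kill
    ai_routines: list      # names of the AI routines to run
    notify_within: float   # report back when this close to the target (assumed unit: meters)


def encode(request: DriveRequest) -> str:
    """Serialize the request as JSON (an assumed wire format)."""
    return json.dumps(asdict(request), sort_keys=True)


def decode(payload: str) -> DriveRequest:
    """Rebuild the request on the puppeteer side."""
    return DriveRequest(**json.loads(payload))


# "Drive Skeleton #511 around. He wants to kill Bob; his AI routines are a, b, and c."
request = DriveRequest(npc_id=511, target="Bob", ai_routines=["a", "b", "c"], notify_within=2.0)
wire = encode(request)
```

The point is only that the game-logic side sends intent (who, what AI, when to report back) and leaves all the 3D movement to the puppeteer.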

A player is talking to some NPCs. How does the server "see" this same scene?

This is what the server "sees". Those three white capsules are actually the elves and the cow from the previous scene. Everybody's a capsule on the server.

There’s a separate puppeteer sub-server for each area of the world, controlling all the monsters in that area. It also spot-checks player behavior to see if they’re cheating. (If a player teleports through a wall, it can tell.)
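A spot-check like this can be sketched with a much simpler stand-in: a speed test. The Python below only catches movement that is too fast to be real (a teleport); detecting that someone passed through a wall would need the actual 3D scene, which this sketch does not model. The function name, `max_speed`, and `slack` values are all made up for illustration:

```python
import math


def is_move_plausible(old_pos, new_pos, elapsed_seconds, max_speed=10.0, slack=1.5):
    """Return True if moving from old_pos to new_pos in the given time is believable.

    max_speed (world units per second) and slack (tolerance for network lag)
    are assumed values, not numbers from the real game.
    """
    if elapsed_seconds <= 0:
        # No time has passed, so only "did not move" is plausible.
        return old_pos == new_pos
    distance = math.dist(old_pos, new_pos)
    return distance <= max_speed * elapsed_seconds * slack


# Walking 5 units in one second is fine; jumping 500 units is not.
normal = is_move_plausible((0, 0, 0), (5, 0, 0), 1.0)
teleport = is_move_plausible((0, 0, 0), (500, 0, 0), 1.0)
```

Because it is only a spot-check, a test like this can run on a sample of position updates rather than on every one.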

What Happens When There’s a Server Crash

Unfortunately, these sub-servers are the ones that are running out of memory and crashing. And when they crash, the Java server is blind.

Players can still move around, but the server can no longer tell if they’re cheating. (They could be using wall hacks or teleport exploits and it won’t know.) But more importantly, all monsters suddenly stop moving. They just stand there, and you can stab them to death and take their stuff. They are puppets without puppeteers.

But the Java server has a pretty good backup plan when this happens. It deputizes players’ computers to act as puppeteers! It looks for a player in that area with a good ping rate, and secretly tells that player’s computer, “okay, now make Skeleton #52 move around. It has these powers and this AI routine. Tell me where it moves to!”
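The selection step might look roughly like the sketch below. It is Python for illustration only (the real logic lives in the Java server), and the `Player` fields, the `pick_backup_puppeteer` name, and the 150 ms cutoff are assumptions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Player:
    name: str
    area: str
    ping_ms: float


def pick_backup_puppeteer(players, area, max_ping_ms=150.0) -> Optional[Player]:
    """Pick the lowest-ping player in the area, or None if nobody qualifies.

    max_ping_ms is an assumed cutoff for what counts as a "good" ping.
    """
    candidates = [p for p in players if p.area == area and p.ping_ms <= max_ping_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.ping_ms)


players = [
    Player("Alice", "cave", 80.0),
    Player("Bob", "cave", 45.0),
    Player("Carol", "forest", 20.0),   # best ping, but in the wrong area
    Player("Dave", "cave", 400.0),     # right area, but too laggy
]
chosen = pick_backup_puppeteer(players, "cave")
```

If nobody in the area qualifies, the function returns `None`, and the monsters there would simply stay frozen until a puppeteer is available again.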

I’m pretty happy with this emergency backup system. When bugs have caused sub-servers to crash, testers mostly haven’t noticed. The monsters just stop moving for a few seconds, then start right back up again.

However, players do perceive this as lag. The monsters seem to get dumber and have more trouble chasing after players. That’s because there’s extra network latency now: instead of the server talking to a sub-server right next door, it has to talk to some computer in who-knows-where.

Plus, the server has to trust that the player’s client isn’t hacked. An evil player could take advantage of this and fling those monsters off a cliff, or into space, or whatever they wanted.

It’s a pretty cool backup system, but it’s not something I want to happen when the game is finished. (But I’ll still use it for some things, like modeling the insides of player housing, where cheating isn’t a concern. I don’t much care if you use wall hacks in your house. There’s nothing to fight!)

Watchdog!

But what I really need to do is get those crashed sub-servers running again. My first thought was to just use one of the many 3rd-party watchdog programs that check to see if your app suddenly dies, and if so, restart it. But I couldn’t use those.

The problem is that the sub-servers don’t just crash and die. Unity wrote special code to avoid that. Instead, they stay alive, but stop working, and they bring up a special message box asking the user to send a crash report to the developer. That’s a pretty nice feature for the game client… but it’s really weird when it happens on a remote computer running with graphics turned off!

So the regular 3rd-party watchdog programs can’t tell when the sub-server crashes, because technically it’s still running and showing that special message. I had to write my own custom watchdog app that can detect when a sub-server has crashed like that and shut it down. Then it starts it up again.

So now when a sub-server crashes, it comes right back up again within a minute.
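Here is a minimal sketch of the general idea in Python. One loud assumption: it detects a hung sub-server through a missing heartbeat, which is one common way to catch a process that is alive but not working. The post doesn’t say how the real watchdog detects the crash dialog, so treat the `Watchdog` class, its timeout, and the heartbeat mechanism as hypothetical:

```python
import time


class Watchdog:
    """Restart a sub-server that is still 'running' but has stopped responding."""

    def __init__(self, start, stop, timeout_seconds=30.0, clock=time.monotonic):
        self._start = start            # callable that launches the sub-server
        self._stop = stop              # callable that force-kills it
        self._timeout = timeout_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._last_beat = clock()
        self.restarts = 0
        self._start()

    def heartbeat(self):
        """Called whenever the sub-server proves it is still doing work."""
        self._last_beat = self._clock()

    def check(self):
        """Kill and relaunch the sub-server if it has been silent too long."""
        if self._clock() - self._last_beat <= self._timeout:
            return False
        self._stop()
        self._start()
        self._last_beat = self._clock()
        self.restarts += 1
        return True


# Demo with a fake clock so no real waiting or real processes are needed.
events = []
now = [0.0]
dog = Watchdog(lambda: events.append("start"), lambda: events.append("stop"),
               timeout_seconds=30.0, clock=lambda: now[0])
now[0] = 10.0
dog.heartbeat()            # sub-server is healthy at t=10
now[0] = 20.0
healthy = dog.check()      # only 10 seconds of silence: leave it alone
now[0] = 100.0
restarted = dog.check()    # 90 seconds of silence: kill and restart
```

In a real deployment, `start` and `stop` would wrap something like `subprocess.Popen` and the process’s `kill()` method, and `check()` would run on a timer.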

Damn You, Memory Leak

The watchdog app is something I needed to have done anyway, so it’s not wasted work. But it’s not what I’d planned to be working on. And I sure didn’t plan on spending 30+ hours looking for a leak that isn’t even in my code.

But it is what it is. I’ll keep looking for it, but at least now it’s not a super huge issue. The game will keep running, more or less.

Next time: cows, pets, and the eventual consequences of death.

This entry was posted in Programming, Project Gorgon.

7 Responses to When Puppets Lose Their Puppeteers

  1. Jason says:

    That is a really mental way to do a world simulation.

    Combined with the horror that is SFS, I’m surprised you managed to get it to work at all.

    We’ve had this debate before, but I just replaced SFS for a client’s project in a week. It’s a giant pile of shit.

  2. Eric says:

    Heh, why is it mental? It’s pretty typical to separate your physics management from your game logic. I didn’t have any luck getting Java-based physics solutions to work well. The physics engine in Unity is very good, so I just use it for both client and server.

  3. Jason says:

    Ya when physics are involved it gets dicey, sadly.

    Even if the physics implementations were good, the problem is getting them to match to the client, so in that case Unity is almost your only option. The most recent work I did with physics on the server side we had Bullet running in a JNI library on both sides. Worked well, but a lot of headache when you start dealing with native code and Java.

    For myself, I am hoping to never have to arse about with physics again.

  4. Michael Kujawa says:

    The challenge is reducing the amount of chatter between the two servers–balancing on-demand asynchronous state queries with bulk state replication–but it’s certainly a viable solution.

    I’m impressed by your fallback of pushing the logic to the clients.

  5. Michael Kujawa says:

    Memory leak tools are invaluable. I’m surprised Unity doesn’t provide an allocator and some corresponding tools for leak/bloat/velocity tracking. Do you remember the stuff we built for AC2?

  6. Espoire says:

    As a fellow hobbyist game developer, I must salute you. Your dynamic sub-server replacement scheme is brilliant. I’m amazed you got it to work.

  7. Anthony says:

    Have you talked to the Unity folks about the leak? If it’s part of the engine they may know what is going on, or at least a way for you to work around it.