By Mike Croft
Story time! I hope you’re all sitting comfortably.
This story begins on a Tuesday, in London, after a particularly hectic first two days at the office. For those unaware, I don’t live in London, so when I’m scheduled to be with this particular customer, I get up very early in the morning on Monday and travel down on the train for the week.
Generally, when I’m particularly busy, I like to get small cups of water regularly to get me away from my screen, help me think, and stretch my legs a little. It’s an easy measure of how busy or stressed I am to count the number of cups of water I fetch. That Monday (despite arriving late thanks to travel time) was a 5 cup Monday.
By Tuesday lunchtime (3 cups), I already felt worn out, particularly when I was told that there was a new set of packages for deployment to the UAT environment for a particularly troublesome project.
This project includes a JBoss domain with 6 servers in 3 groups of two: one server group for the backend services, and two frontend server groups with a single (very large) EAR each. The packages for the backend are all well under 20MB each, whereas the two frontend packages are both well over 150MB.
The unusual thing about the UAT environment is that, unlike Dev or Prod, the domain controller is on the same machine as one of the instances (Inst1 below).
After that, the day rather picked up. My colleagues and I even made plans to go for a few drinks at the pub down the road, which was exactly the sort of thing I needed to unwind a bit and get settled after a busy couple of days.
Just as I was getting ready to leave for the pub, I heard the Outlook “message received” sound. I didn’t think much of it at first but, as I read on, I found that my earlier unease at how smoothly the deployment went was justified: the developers informed me that the packages I had been given were actually the version that was already deployed. They had already uploaded the correct packages to the repository and now wanted me to deploy the new ones ASAP.
“Sure”, I thought. “That shouldn’t take long”. My colleagues were already logged off and waiting to go and get a beer at this point.
“No, no, don’t wait for me, you guys go on and I’ll join you in about 15 minutes.”
I wasn’t worried. The deployment earlier had gone fine, and there was no reason why this one wouldn’t either. I didn’t even need to get a cup of water.
Deploying the packages to the backend servers was flawless.
Deploying the package to the frontend B servers was a little slow, but problem free.
When I saw the PermGen error, I still wasn’t worried. I’d been warned, and even though the solution of “just reboot JBoss” is really not the sort of “workaround” I like, it was still just testing.
We start and stop JBoss with the included init scripts (found in $JBOSS_HOME/bin/init.d/), symlinked into /etc/init.d. These bring down all JBoss processes completely and restart them from scratch (if needed).
In hindsight, using the JBoss CLI to simply execute /host=Inst1:reload on the relevant servers would have been better but, perhaps due to the lack of a fresh cup of water and a clear head, that thought didn’t occur to me. The difference is that a :reload command would reboot the JBoss server instance, but not kill the host controller or process controller.
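For anyone wanting to script that CLI call rather than type it, here is a minimal sketch in Python. It only builds (and optionally runs) a jboss-cli.sh invocation for the operation mentioned above. The CLI path and the controller address are assumptions made up for illustration, not values from this environment, so adjust them before using anything like this.

```python
import subprocess


def build_reload_command(host_name,
                         controller="localhost:9999",            # assumed management address
                         cli_path="/opt/jboss/bin/jboss-cli.sh"):  # assumed install path
    """Return the argv list that asks the domain controller to reload one host."""
    return [
        cli_path,
        "--connect",
        f"--controller={controller}",
        f"--command=/host={host_name}:reload",
    ]


def reload_host(host_name, dry_run=True, **kwargs):
    """Build the command and, unless dry_run is set, execute it."""
    argv = build_reload_command(host_name, **kwargs)
    if not dry_run:
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(argv, check=True)
    return argv
```

Calling `reload_host("Inst1")` with the default `dry_run=True` just returns the command so it can be inspected before anything is actually reloaded.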
Curiouser and curiouser…
Restarting Inst1 caused no problems. I checked the server log and saw that it had started up, and even saw some cluster view information from Hazelcast.
Restarting Inst2, however, did not go as planned. Time and time again, I would see that JBoss failed to start in the time allowed. What was more confusing was that both the Host Controller and Process Controller started without errors.
The server.log for the Inst2 server that the Host Controller should have started also had no errors… what was odd was that it had nothing else either! The last message in the server.log was from when I shut down all the servers, so it looked like JBoss wasn’t encountering any error at all during startup; it just wasn’t even attempting to start!
Eventually, I thought to check the start() function of the init script, and came across the wait loop behind that message.
So this “failed to startup in the time allotted” message was a bit misleading! The script simply waited for 30 seconds (the value of $STARTUP_WAIT) and then grepped the server.log for the code ‘JBAS015874’, the log message ID of successful startup, and if it didn’t find it in time, printed out that error.
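To make the behaviour concrete, here is a rough Python sketch of that wait-and-grep logic. It is not the init script itself (that is shell), just the same idea under stated assumptions. The function names, the polling interval and the "did the log grow?" check are my own additions for illustration. The extra check shows how a script could tell "never tried to start" apart from "started but didn't finish in time".

```python
import time
from pathlib import Path

STARTUP_MARKER = "JBAS015874"  # log message ID for successful startup


def wait_for_marker(log_path, start_offset=0, marker=STARTUP_MARKER,
                    startup_wait=30.0, poll_interval=1.0):
    """Poll the log until the marker appears after start_offset, or time runs out."""
    path = Path(log_path)
    deadline = time.monotonic() + startup_wait
    while True:
        if path.exists():
            with path.open("rb") as fh:
                fh.seek(start_offset)  # ignore markers from earlier startups
                if marker.encode() in fh.read():
                    return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def diagnose_startup(log_path, size_before, **kwargs):
    """Distinguish a clean start, a slow or failed start, and no attempt at all."""
    if wait_for_marker(log_path, start_offset=size_before, **kwargs):
        return "started"
    path = Path(log_path)
    size_after = path.stat().st_size if path.exists() else 0
    if size_after == size_before:
        return "never attempted"  # nothing new was logged at all
    return "attempted but not finished"
```

With only the first function, a server that crashes, a server that is slow, and a server that never launches all produce the same "failed to start" result, which is exactly the trap described here.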
It was then that, from nowhere, the old adage that forms the basis of Duck Typing came to my mind: “if it walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
It looked like Inst2 wasn’t attempting to start, there was no sign that Inst2 was attempting to start, and the init script wasn’t even checking that Inst2 had at least tried to start – therefore Inst2 probably wasn’t even trying to start!
After some thought, it suddenly occurred to me that a lot of the information about JBoss instances in domain mode is held in the domain.xml, such as the server group and details about common JVM settings. The host.xml holds data about the server itself, but is incomplete without the domain.xml.
In fact – I’ve already blogged about this mechanism. Step 3 of how JBoss starts in domain mode says:
- If this host is not the DC:
- The HC tries to connect to the DC and combines the remote domain.xml with the local host.xml to make a single configuration for the machine.
So for all non-domain controller hosts, the configuration is the result of a combination of local and remote files – in other words, the configuration for any JBoss instances on that host is incomplete with only local files!
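As a toy model of why that matters, the sketch below treats the two files as plain Python dictionaries and resolves a server's effective configuration from both. The dictionary layout, the key names and the example values are all hypothetical and far simpler than the real XML. The point is only that the host side names a server group and the domain side says what that group means.

```python
def effective_server_config(host_cfg, domain_cfg):
    """Combine local host data with the domain-wide data fetched from the DC."""
    if domain_cfg is None:
        # Without the domain controller there is nothing to resolve groups against.
        raise RuntimeError("domain configuration unavailable: cannot start servers")
    resolved = {}
    for name, server in host_cfg["servers"].items():
        group = domain_cfg["server_groups"][server["group"]]
        resolved[name] = {
            "group": server["group"],
            "profile": group["profile"],
            "jvm_options": list(group["jvm_options"]),
            "port_offset": server.get("port_offset", 0),
        }
    return resolved


# Hypothetical data standing in for host.xml and domain.xml.
host_cfg = {"servers": {"Inst2": {"group": "frontend-a", "port_offset": 100}}}
domain_cfg = {"server_groups": {"frontend-a": {"profile": "ha",
                                               "jvm_options": ["-Xmx2g"]}}}

print(effective_server_config(host_cfg, domain_cfg))
```

Passing `None` for `domain_cfg` raises immediately, which mirrors a host controller that is up but has no usable domain controller to ask.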
With this likely candidate for the problem in mind, I went back to the domain controller, remembering also that the cluster information I’d seen was from Hazelcast and not from Infinispan. That actually only told me that the Inst1 server had come up and loaded Hazelcast properly, not that the domain controller was healthy as I’d assumed.
A quick restart of JBoss on the Inst1 host meant that the domain controller came up properly and could communicate with the other instances. Once I made sure that the domain controller was behaving, Inst2 – like magic – came up successfully first time. (And even managed to deploy the EARs without any more PermGen errors!)
I kicked myself for not taking time out to clear my head and fetch another cup of water before getting bogged down and making myself late for the pub.
After a swift email to the developers telling them that the deployment had now completed (and that they should test it), I logged off before Outlook could buzz again and got myself to the pub as fast as I could. I arrived to find that my colleagues were (thankfully) still waiting with a drink, and with a sign of how valuable all my cups of water really are…