Wednesday, April 7, 2010

WTF, Ubuntu?

I just finished installing Ubuntu 9.10 server edition on a shiny new Dell PowerEdge R805 box, as part of expanding our malware analysis labs. No big deal - half an hour of babysitting an installer, right?

Wrong.

It took me 5 hours, thanks to some really stupid decisions made by the Ubuntu team surrounding perhaps the most vital part of the installation process: the bootloader.

The actual install itself was nice and easy, just like I've come to expect out of the Ubuntu folks: sane defaults, good explanations when I had to make a relevant choice, and generally minimal requirements for interactivity. Anybody with even the most basic computer experience could fumble their way through it. After finishing, I took my CD out, rebooted...and suddenly found myself at a Busybox shell with a note about GRUB being unable to find the root filesystem.

I figured I'd done something really retarded, because in all of the years I've been installing *NIX operating systems, I've only had one other bootloader failure - an OpenBSD "Bad Magic" issue when I was swapping out hard drives that made immediate sense once I did two seconds' worth of Googling, and that yielded a fun little picture in the process. So I sat down, thought for a second, and then realized I'd installed the 32-bit version of Ubuntu on a box with 8GB of RAM and a terabyte's worth of hard drive - which sure seemed like a good reason for the OS to not be seeing the drive properly.

So I headed back to my desk, burned a copy of the 64-bit version, reinstalled, and got...the exact same Busybox shell. Dammit!

A quick bit of Googling seemed to suggest that there were issues with GRUB recognizing really big disks. Since I'd just used the whole drive with Ubuntu's guided LVM setup, I figured that either my /boot partition was way off past the end of where GRUB could read, or that my / partition was just too big for it to handle. That's what I get for being lazy, I figured, and headed back into installer land, this time manually partitioning things so that /boot was at the very start of the drive, / was 50GB, and /var took up the rest of the space. Another 30-minute installation later, I rebooted, figuring I'd be all set.

Not so much.

Confused, I followed the suggestion at the Busybox shell and did a "cat /proc/modules". Sure enough, mptbase, mptsas, and scsi_transport_sas were all loaded - exactly the modules I needed to be able to see this SAS/MPT BIOS controller. /dev/sda* existed, and inspecting /boot/grub/grub.cfg (side note: Linux people, can we *please* agree on one frikkin' extension for config files?) showed that my root device was set properly. What the hell?
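The module check above can be scripted. This is only a rough sketch, not what I actually ran at the time: check_modules is a made-up helper name, and it assumes a POSIX-style shell and a file laid out like /proc/modules, where the module name is the first field on each line.

```shell
# Report which of the given kernel modules appear in a /proc/modules-style file.
# check_modules is a hypothetical helper, not part of any Ubuntu package.
# Usage: check_modules FILE MODULE...
check_modules() {
    modules_file=$1
    shift
    missing=0
    for mod in "$@"; do
        found=0
        # The module name is the first whitespace-separated field on each line.
        while read -r name _rest; do
            if [ "$name" = "$mod" ]; then
                found=1
            fi
        done < "$modules_file"
        if [ "$found" -eq 1 ]; then
            echo "loaded:  $mod"
        else
            echo "MISSING: $mod"
            missing=1
        fi
    done
    # Non-zero exit status if any requested module was not found.
    return "$missing"
}

# On the affected machine this would be called as:
# check_modules /proc/modules mptbase mptsas scsi_transport_sas
```

In my case all three modules would have shown up as loaded, which is exactly why the failure was so confusing: the driver was there, the disks just hadn't finished appearing yet.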

Getting desperate, I spent some substantial time scouring the web for answers. It seems that a number of people have had problems installing various versions of Ubuntu on the R805 boxes - but in classic Linux style, any time someone popped onto a forum or a mailing list asking how to fix boot issues with this hardware, the thread ended with some variant of "Hey, I figured it out! Thanks guys!", and NO GODDAMNED DESCRIPTION OF HOW THEY FIXED THE PROBLEM. Seriously, people, it takes like two minutes to explain the fix, and it will save countless people countless hours of pain if you just make sure your solution is archived somewhere on the web.

After trying a whole host of possible fixes - setting the SAS controller to be visible to "BIOS only" instead of "BIOS & OS", telling the CD installer to boot off the first hard drive, etc. - I ran across this little nugget of wisdom, which suggested that I set my "rootdelay" value to 35 to give the SAS adapter time to initialize.

Aha! That made perfect sense, I figured. After all, this entire process had been further aggravated by the 30 seconds or so it takes the Dell SAS controller to initialize on each boot (seriously, people, how does it take a hard disk controller 30 f'ing seconds to initialize on a machine with 8 2.5GHz cores?); why wouldn't it want to waste another 30 seconds of my life re-initializing after the operating system loaded?

Optimistic about my prospects for success, I rebooted yet again, held down shift like the article suggested...and got no GRUB menu. I tried again with "e" (which I vaguely remembered using on some other bootloader in years gone by), and again with "Esc". The third time being a charm, I decided to brute-force the issue, popped the installer disc back in the drive, and chose "Rescue Broken System" from the menu.

This is where I started to realize how broken Ubuntu's installation has become.

At first, I thought I'd accidentally chosen "Install Ubuntu" from the menu, because the system proceeded along all of the same steps as a regular install. It even went to the trouble of finding my network hardware, having me choose an interface to do DHCP on, and set a hostname. Seriously, guys, I promise I don't need a fully functional network just to go touch my bootloader, repair a broken partition, or, you know, do anything else that would require me to use a CD to boot. You're just wasting my time.

Once I finally got my shell and headed on over to edit /boot/grub/grub.cfg, I realized the reason I couldn't get into the GRUB menu: the default timeout value had been set to "-1", i.e. "don't wait at all". Gee, guys, that makes so much sense - because, you know, no one will ever need to edit their GRUB config on the fly! That, and setting a delay of 1 second would just be too much hassle for people trying to boot up nice and fast on their shiny new servers with the 90-second delay to get into the bootloader.

With the delay fixed and GRUB reinstalled, I booted up again, and this time actually got to the GRUB menu. Much to my horror, the banner on the top read:

"GRUB version 1.97~beta4"

Really, Ubuntu? Seriously? You're going to put a beta version of a bootloader on the production release of a server operating system? What cutting-edge boot-loading feature could you possibly need that you couldn't use a release version of GRUB?

Cursing the Ubuntu developers under my breath, I added the rootdelay value, hit Ctrl-x to boot, waited...and had a fully operational operating system in under a minute! Hallelujah!

Convinced that I was done, I added the rootdelay value to /boot/grub/grub.cfg, ran "update-grub" as root to make the changes permanent, and rebooted one last time, just to be sure. It's a good thing I did, too, because MY CHANGES WEREN'T SAVED, and I ended right back up at my Busybox shell. I had to go in through the rescue option on the installer CD, make my changes there, and update GRUB from my CD just to get the changes to stick.

With all of the effort the Ubuntu people put into making their installation simple, you'd think they could have gone to the trouble of setting the "rootdelay" variable to a higher value when they saw a SAS card that they probably know takes forever to initialize. Really, would that be so hard, guys?

13 comments:

Jason C said...

Did you happen to update the BIOS at all? I had a similar problem with Ubuntu 8.04 and a Dell R710 that I fixed with a BIOS update.

https://answers.launchpad.net/ubuntu/+question/72629

Kint said...

Unless I really *DO* need the shiniest, latest version of a package I stick with Debian. Less breakage.

Good story, though :)

Alex Kirk said...

Jason: can't say that I did, though I was contemplating it. I have to admit that I generally do BIOS updates as a last resort, because they're generally so painful - horrible memories of DOS boot floppies and all that. I'm sure I could do one off of a USB stick these days, it's just that old habits die hard.

...and Kint, I may have to take your advice on the next machine I build. I still can't believe they'd use a beta bootloader...

Kint said...

@Alex; yeah, and check out the backports project. I used it this very morning to get nmap 5.0 installed on Debian Lenny. Lenny's version is still 4.76.

http://www.backports.org

Al said...

I feel your pain, but this seems a little over the top and kind of like a whinge.

I've never had an issue with a 32-bit OS and lots of RAM, other than not seeing all the RAM, so not sure why you thought that was the problem.

You are prepared to wait a bit of extra time every time the computer boots, but not on the hopefully rare occasion that you have to boot into rescue mode? All of a sudden you are worried about them "wasting" your time?

Why that version of grub? Maybe because it supports big disks... another thing you were worried about.

One of the hardest things in OSes is to get wide, stable hardware support. So when you say "Really, would that be so hard, guys?" yes, it kind of would.


Al

Andy said...

I had a very similar issue installing 9.10 to my laptop after my 4yr old 40gb IBM HDD died and I put a brand new 160gb Samsung in it.
After a "successful" install, GRUB reported "error: no such device: ".

After a bit of scouring, I found this very useful fix, but it took a while!

http://sourceforge.net/apps/mediawiki/bootinfoscript/index.php?title=Boot_Problems:search

(Also worth noting...I ran some system updates after fixing this problem, and they undid all my changes, effectively putting me back to square one).

The ironic thing is, I had only installed 9.10 on my old HDD a week before it failed, and the whole process went swimmingly! The new HDD was a nightmare, but all is working great now, and I must confess I DO love 9.10.

David G. Guimarães said...

i found these days that ubuntu is using grub2, which config file is /etc/default/grub.. when u make changes to this file, u have to run update-grub to make grub read the config file and install it at /boot/grub/grub.cfg

more info can be found at: https://wiki.ubuntu.com/Grub2

Richard Bejtlich said...

This is an awesome post. Funniest thing I've read all month, AND informative.

Noah said...

It sounds like you've really gone out of your way to disparage ubuntu for what is at its core a BIOS bug..

J.L. said...

USE DEBIAN ! If you're on to doing anything serious with Linux. Don't fall back to second best.

Demo said...

Ubuntu has serious problems with the release team. Whinging? Don't be a fanboi. Trolling and asshattery are not helpful.

The latest fuck up by ubuntu is the now infamous nuking of ipv6. Way to go ubuntu. Not only have you lost me as a corporate partner, but I'm slowly removing ubuntu from all my home and business systems.

I'm going with someone who's yet to fuck me over with beta testing a "stable" release on my ass and wasting hundreds of hours of my corporate time. I have TWO people working on this full time thanks to UBUNTU idiocy.

FUCK UBUNTU. FUCK WINDOWS. FUCK FANBOIS who whine about anyone complaining about the distro that they like to suck on.

Benjamin said...

You know what, this post actually presented the solution but for all the just reasons mentioned by Alex and Demo and for plenty I've experienced myself, especially with 10.04 which seems more of a go get the apple mac os style and forget about anything technical, I will NOT be installing ubuntu server 10.04 LTS on my brand new dell server at all, thanks for the great reminder about actual enterprise ready linux distros like debian and centos. Awesome post, thanks a lot, very true!

Johan said...

Thank you! After 12 hours of installing various versions of Ubuntu over and over again, thinking my partitioning of the drive went wacko, THIS solved my problem! Thank you thank you thank you!

For informational purposes, this was successfully installed on a Dell PowerEdge R310 with the Dell SAS 6/iR (_NOT_ PERC 6/i, different card) SAS controller.

From the start I got the initramfs-busybox-prompt, saying that it could not find the root-device. Tried with both LVM and classic partitioning schemes to no avail until I got the rootdelay-parameter in place.

TIP: After installing Ubuntu you'd get the GRUB menu for 2-3 seconds, you have to press "e" to enter the edit-mode, add rootdelay=35 on the blank line, press Ctrl+X to continue (a successful boot). If you're prompted with another initramfs-prompt just type "exit" and it'll continue booting.
When logged in, sudo-edit /etc/default/grub and add rootdelay=35 on the GRUB_CMDLINE_LINUX_DEFAULT-line and then sudo update-grub. Reboot and behold, it works.
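
If you'd rather script the edit than do it by hand, something like the following might work. It's only a sketch: add_kernel_param is a name I made up, it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is on a single line in double quotes (as in the stock file), and the parameter must not contain "/" or "&" since it is passed straight to sed. It keeps a .bak copy next to the file.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB 2 defaults file.
# add_kernel_param is a hypothetical helper, not an Ubuntu or GRUB tool.
# Usage: add_kernel_param FILE PARAM
add_kernel_param() {
    grub_defaults=$1
    param=$2
    # Do nothing if the parameter is already present on that line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$grub_defaults"; then
        return 0
    fi
    # Keep a backup, then rewrite the file from it.
    cp "$grub_defaults" "$grub_defaults.bak"
    # Insert the parameter just before the closing double quote of the value.
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" \
        "$grub_defaults.bak" > "$grub_defaults"
}

# As root, on the real file, followed by regenerating /boot/grub/grub.cfg:
# add_kernel_param /etc/default/grub rootdelay=35 && update-grub
```

The point is to change /etc/default/grub and let update-grub write /boot/grub/grub.cfg, because update-grub regenerates grub.cfg and anything edited there directly gets overwritten.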

Sidenote: it might work just to type exit in the first initramfs-prompt which says the error about the root not being able to be found. Maybe you'd have to wait ~30 seconds at that prompt to allow the controller to init the disks before exiting. Never tried this though. Just a sidenote.