TrueNAS – I hoped, I tried, it’s looking like UnRAID

Well it’s been a VERY long time.  So much has changed.  But enough about me and my gap in posting.

Given the discontinuation of Apple’s Time Capsules a while back, I knew that ultimately I would have to put some other sort of NAS storage system on the network to continue to enjoy the seamless backups.  That time came this summer when the Time Capsule was starting to make some unpleasant sounds which often precede a disk failure. It was time. 

Cue a fair bit of research.  Looking at Synology, Western Digital, Buffalo, Asustor, QNAP, TerraMaster.  Many great products.  All commercial, simple, decently supported.  But pricey for what you get.  I had just been burned by a bad Kickstarter that was promising a good home NAS and server system, and I still wanted all that storage and flexibility at that kind of price.  So I looked at what the state of the DIY sphere was.

Ultimately it came down to TrueNAS and Unraid for being able to have solid disk arrays, flexibility and resilience.  TrueNAS is more of an enterprise offering and also had what seemed to be an amazing ecosystem with TrueCharts.  Unraid was also a great alternative, simpler, perhaps a bit less powerful, but also booted off of a flash drive, which just didn’t quite sit right with me.  So ultimately, I went TrueNAS. 

The config:  AMD Ryzen 7 8700G with 64 GB of DDR5 RAM (not ECC; if you’re into the TrueNAS scene, it’s a debate), 1 TB NVMe boot drive, Asus Prime B650M-A AX motherboard, 750W Seasonic Prime Gold 80+ power supply, Fractal Design Node 804 case (GREAT case for a server with space for eight 3.5” drives, and really easy access), five 8 TB IronWolf NAS drives and three 4 TB IronWolf NAS drives (2 pools and a somewhat constrained budget).  All that was going to get me a lot more storage than the commercial offerings, a lot of performance for running containers/servers at home, transcoding capability if I did put a Plex server together, and 2.5 Gbit wired Ethernet connectivity.  Start the adventure.
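For a rough sense of what those two pools add up to, here is a back-of-the-envelope sketch.  The raw numbers come straight from the drive list above; the single-parity layout (one drive's worth of redundancy per pool) is purely an assumption for illustration, and the helper name `usable_tb` is made up.  Real ZFS pools lose a bit more to metadata and padding.

```python
def usable_tb(drive_count: int, drive_tb: float, parity_drives: int = 1) -> float:
    """Rough usable capacity of one pool, ignoring filesystem overhead."""
    if drive_count <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drive_count - parity_drives) * drive_tb


# (drive count, size in TB) for each pool, taken from the parts list.
pools = {"big": (5, 8.0), "small": (3, 4.0)}

raw = sum(count * size for count, size in pools.values())
# ASSUMPTION: single parity per pool; the actual layout may differ.
usable = sum(usable_tb(count, size) for count, size in pools.values())

print(f"raw: {raw} TB, usable (single parity assumed): {usable} TB")
```

Under that assumption the build lands at 52 TB raw and roughly 40 TB usable before filesystem overhead.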

I put it all together, installed TrueNAS and got things up and running.  Just started with a pure Time Machine backup on an SMB share configured for multi-user Time Machine.  MUCH faster, and worked great.  I made some mistakes though, and the array started getting fatally corrupted.  Cue a fair bit of troubleshooting looking for drive failures, long drive testing, etc.  It seems the motherboard SATA controllers might have been getting overloaded, and ZFS is a bit temperamental about the hardware (again, the ECC debate comes up, but that wasn’t where it was happening).  So add in a PCIe card with 8 SATA connectors on it, the JBOD-configured 9211 card recommended on the TrueNAS forums.  Rebuild the pools.  All is well.  No more disk corruption.  So.  Pretty sure that problem was diagnosed and solved correctly.

After more successful backups and stability, it’s time to add in a few servers.  Simple.  Gitea (a source code control server based on Git) and the Postgres database server to support it.  Easy install, easy startup, all good.  TrueCharts was as promised.  

Then I started to get random reboots.  Zero logs, just the computer would randomly reboot itself.  No warning.  Cue another investigation.  It seemed to correlate with the start of Time Machine backups and disk activity spikes.  Hmmmm.  Again testing disks, nothing.  Power?  750W was WELL in excess of what the disks need when all starting up at once.  Not sure.  One of my sons has use for a power supply on a gaming build, so I grabbed a 1000W Seasonic Gold 80+.  But before that I grabbed a live Ubuntu image on USB, and booted the system clean on that.  Mounted all the drives and did stress tests on the CPU, RAM, and every disk including the NVMe.  For a week.  Perfectly stable.  Tuned the stress tests to maximize the load and minimize wait times on the arrays, and also to burst it all.  Rock solid.  The hardware recommendations from the TrueNAS forums were as promised.
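A hunch like "the reboots line up with backup starts" can be checked with a few lines of Python once you have the two lists of timestamps.  This is only a minimal sketch: the function name, the 10-minute window, and the timestamps are all made up for illustration, and pulling the real times out of your own logs is left out.

```python
from datetime import datetime, timedelta


def reboots_near_backups(reboots, backup_starts, window=timedelta(minutes=10)):
    """Return the reboots that happened within `window` after any backup start."""
    hits = []
    for reboot in reboots:
        if any(timedelta(0) <= reboot - start <= window for start in backup_starts):
            hits.append(reboot)
    return hits


# Hypothetical timestamps, not real data from the system described here.
backup_starts = [datetime(2024, 9, 1, 2, 0), datetime(2024, 9, 1, 3, 0)]
reboots = [datetime(2024, 9, 1, 2, 4), datetime(2024, 9, 1, 14, 30)]

hits = reboots_near_backups(reboots, backup_starts)
print(f"{len(hits)} of {len(reboots)} reboots followed a backup start")
```

If most reboots land inside the window, the correlation is worth chasing; if they are scattered, the backups are probably a red herring.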

So.  Full clean install then.  Wiped everything.  EVERYTHING.  Latest 24.04.2.3 version of TrueNAS, clean install, new arrays, and off we go again.  No load.  How long will it run by itself just doing snapshots?  (DO THESE WITH ZFS.  It’s like Time Machine for your arrays without piles of backup storage, but you DO still need full backups not in the system; that’s a different post.)  Flawless.  No issues, smooth as silk for a week again.  Start the Time Machine backups.  Also all good.  Ran for over a week.  Zero issues.  Then random reboots started up again.  The reboots stopped when I reinstalled the entire system, and now are starting again?  This smells of a bug now.  Zero errors of any kind reported by ZFS or S.M.A.R.T. on any of the disks, NVMe or HDD.  And no disk corruption either.  Before, I had to revert to a snapshot to have Time Machine work again.  Not this time.  Curiouser and curiouser.

Seemed semi-stable now though.  I started the servers again.  Same behaviour for another week.  Still generally stable but random reboots.  Keep looking, keep checking firmware and any other clue or possibility I could find in the forums.  Everything SHOULD be solid. 

And now, corruption of the Time Machine backup again.  The array is again fine, as it has been ever since I put the SATA card in the system.  There’s a bug somewhere.  And all through this the TrueNAS system will randomly reprint the IP addresses for the interfaces despite it being a static address (also tried with a permanently assigned DHCP address, same behaviour).

Well, I KNOW TrueNAS is amazing and solid and brilliant for literally thousands of people and that iXsystems has a great offering for thousands of companies.  But I think I’m throwing the towel in on this.  TrueCharts had a major blow-up with the community and abruptly just UNPUBLISHED all of the repo.  So now there’s an open source war on the ecosystem that was also valuable.  It was all built on Helm Charts on k3s (mini Kubernetes, or k8s if you’re into that) and it was another lift to just take a simple Docker container and get that going, but I was less worried about that.  Now it was a pain.

Add it all up, and Unraid is now supporting ZFS in their 7.0.0-beta-x beta stream.  Well, then I looked into backup options and experience on that flash boot drive.  Nothing alarming, and generally really great reviews of response and support.  It is a paid license.  But a very reasonable amount and it is a lifetime license.  None of the call-home behaviour of some of the mainstream commercial stuff, all very open, open standards, and generally a much more “grass roots” company, but very successful.  Standard Docker support, and again a great and flourishing ecosystem.  Noted as much more user friendly.

To be clear, I’ve done a fair bit of sysadmin work over my career, and nothing TrueNAS is doing is leaving me in the dust, but at some point, I want the systems to work.  I’m not hacking on these things, they are supporting my hacking efforts.  I’m more software than hardware.  

So I’m going to look into pulling it all down and trying out Unraid now.  I still back up everything to detached SSD drives and the Git and DB stuff is all also locally mirrored so nothing is at risk.  And I have offsite backups on separate HDDs.  This is an experiment to do more and enable more and also move past the Time Capsules without having drives dangling everywhere.  

So I still have a pile of respect for the TrueNAS system and community, and they are moving with the upcoming 24.10 Electric Eel releases (in release candidate status as of this posting) to support pure Docker and put the TrueCharts fiasco behind them.  All great things.  It’s just that the foundational reason for this all, the backups, is not reliable FOR ME.  It’s obviously reliable for other people, but I’ve invested a lot of time to make it work, and just haven’t had success, and I can’t swap out every piece of hardware just to try to find some edge case issue in the software after all the mainstream recommended stress tests showed all the hardware to be rock solid.

So hopefully I will actually do some follow up posts for anyone reading this with what I discover.  🙂 

Misguided analyst editorial – update: Called it

Wow.  Rob Enderle has a lot of readers on Computerworld, I expect.  I read the odd article from him.  I had thought he would be offering solid business advice in light of the viral “Comcastic” support call from hell [ http://www.huffingtonpost.com/2014/07/14/the-comcast-call-from-hell_n_5586476.html ].  I stand corrected.

Basically, Enderle shows himself to be a run-of-the-mill, CYA sycophant extraordinaire, advocating analytics to essentially make two classes of customer rather than using it to help monitor and improve your customer relationships as a whole.  His original article I’m ranting about is here: Don’t Be Comcast: Use Analytics, Monitoring to Prevent a Viral Disaster – Computerworld

He starts somewhat sanely, in having a list of the biggest customers available to managers, so that, as he did, you don’t cancel a supply contract you have from someone who happens to be your largest customer.  I get the impression that it wasn’t a healthy business relationship, and may have been grounded more in back-room “you scratch my back I’ll scratch yours” deals than good business if a cancellation resulted in that sort of fallout.   Either that or the company Enderle was working for ALSO wasn’t competitive, and let’s just say what goes around comes around in that case.

Had he then taken social media and proposed using it to monitor when you might have a PR issue on your hands, or to track negative and positive PR, that would just be good sense.  But that’s not what he proposes.  He takes the idea of the “influencers”, people who have larger pull in social media and PR, often celebrities or journalists, and has your real-time analytics alert you when they contact the company for support so you can give them extra-special treatment.  Basically, make them an “elite” customer, and screw the rest of us.

You know what social media does then?  Check the hashtag count on Twitter.  You’ll still get #comcastic from all the rest of your customers relating serious issues and problems, and you’ll have a few media celebrities giving you positive PR.  And it will catch up with you, and if your customer service sucks, I’ll trust my second-cousin’s-friend’s opinion of your shop far more than some privileged A-list celeb’s on where I take my business.

Enderle is a “mover-shaker-fool” who is looking for quick results and his own rep in a corp rather than actually making your business the leader in the category.  Fix the problem.  Treat your customers correctly and instill that in your employees.  And don’t incentivize them disproportionately against that.  I’ll bet good money that the call rep at Comcast is paid good coin on a “save” of a leaving customer, so he will work his rear end off to the point shown in that recording to make that save.  It’s worth it, as I expect his performance review and his compensation are so skewed to making the save that it’s not worth his time to be courteous and walk people through it professionally.  Enderle says the rep should be fired.  If I were the CEO I would start with looking at how the incentive programs for the call centre are set up, especially in the “customer retention” area.  And adjust the attitudes of the people setting that up.

And you know the irony?  I bet the whole behaviour of grinding so hard to keep a customer (and up in Canada Shaw and Telus do it just as much, but I didn’t run into quite the level of zealousness that was in the viral posting) is based on analytics.  Enderle hasn’t yet learned the lesson that people who actually THINK about analytics and their application have, which is that you still need to have a goal in mind when you apply them.  Comcast has a goal when it comes to customer retention, as do all these telecom/internet providers.  Keep the customer at all costs because customer acquisition is very expensive.  The numbers say it’s a lost cause so go all out.  Any win is a great bonus.  Social media is re-empowering the consumer and making the businesses play honest with everyone.  Enderle doesn’t get it.  Make sure you make a better decision than he advocates.

 

UPDATE:  Looks like I called that better than the paid analyst did.   http://venturebeat.com/2014/07/22/comcasts-retention-policies-take-the-blame-for-that-customer-service-call-from-hell/ pretty much outlines what I figured was the core of the issue.   Perhaps new metrics NOT from the accounting department need to be added in?