Project R730 Part 2

Following up on Project R730 - Part 1, it’s time to expand the tale.

It took a while, but the few remaining parts I still needed to finish the R730 finally arrived. I installed, upgraded, or otherwise swapped several components, and I got the server up and running with TrueNAS SCALE. It wasn’t just a job, it was (and still is, really) an adventure!

Tripwires

The first complication I encountered involved the M.2 SATA and NVMe PCIe adapters. Namely, they apparently didn’t exist, or were otherwise ensnared in an inescapable Hell. Perhaps my mistake was ordering anything during the Silicon Valley Bank collapse, but whatever the reason, a couple seemed outright impossible to acquire.

I had no fewer than three separate eBay sellers cancel my order for either the M.2 SATA or M.2 NVMe expansion card. All of them were operating out of Philadelphia, so I assume they were all Chinese drop-shippers having problems with suppliers thanks to the bank crash. I also wonder what’s going on with Philadelphia that makes it such an attractive location to use as an eBay front.

In any case, I eventually got… some parts. Rather than a single 4x card for NVMe and a 4x card for SATA, I ended up with two 1x NVMe expanders and a 2x SATA card. I had actually managed to acquire a 4x SATA card prior to the 2x, but there was something about its ASMedia ASM1064 controller chip that my R730 didn’t like: the server refused to POST while the card was installed. I tested it later in my desktop system, and it worked just fine. The working 2x card uses a JMicron JMB582 controller, but I had to configure the server to boot in UEFI mode rather than legacy BIOS, or it refused to boot from any USB device. It would somewhat defeat the purpose to have working boot drives if I can’t install anything on them!
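
If you run into a similarly picky controller, it’s worth confirming what the OS actually sees once a card is in place. A minimal sketch (the grep pattern is illustrative, not exhaustive) that lists the SATA and NVMe controllers the kernel detected, with their vendor and device IDs:

# Show SATA/NVMe controllers the kernel can see, with PCI vendor:device IDs
lspci -nn | grep -iE 'sata|non-volatile'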

Shell Games

Once I had the 128GB M.2 SATA drives and P1600X Optanes safely nestled in their respective expanders, there was one final thing left to do. The Arctic MX-4 paste and Xeon E5-2690v4 chips had arrived shortly after the drives were finally working, so in they went.

CPU2 runs about 6°C hotter than CPU1 for whatever reason, so I thought I may have used too much thermal paste and caused a mild insulating effect. I went in and scraped a bit off, but the results were the same, so I can only assume it’s an artifact of the airflow design based on the component layout. Even the warmer CPU runs at 37°C, so it’s hardly worthy of concern. Still, I like to be thorough.

I slapped on the lid, shoved it on a precarious ledge in my basement, plugged everything back in again, and called it a day. It was time to fire this baby up for real!

Driving In

I’d already done a test run of TrueNAS in a VM, so I had an idea of what to expect. Unlike Proxmox or other systems, it’s meant to run headless and be configured through a web interface. It does present a kind of console on an attached monitor, but this is mainly for modifying network or other settings so you can reach that web interface.

The first thing I did was create a storage pool. The pool wizard is simple enough; it’s essentially a wrapper for zpool create, but it also incorporates ZFS native encryption settings for the pool’s root dataset. That allowed me to download the encryption key it generated, which I did.
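
For the curious, the wizard’s end result is roughly equivalent to something like the following. This is a sketch, not the exact middleware invocation: the device names are placeholders, and TrueNAS actually partitions the disks and refers to them by GPT ID rather than raw device names.

# Approximate CLI equivalent of the pool wizard (illustrative only)
zpool create -o ashift=12 \
  -O encryption=aes-256-gcm -O keyformat=hex -O keylocation=prompt \
  enctank \
  raidz1 sda sdb sdc sdd \
  raidz1 sde sdf sdg sdh \
  log mirror nvme0n1 nvme1n1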

Unfortunately, this was when I noticed that some of the monitoring output showed “swap” was enabled. I had explicitly disabled this during installation, so I didn’t understand how it happened. Apparently it’s a known issue when creating a new pool with TrueNAS SCALE: TrueNAS Core has a GUI option to control it, but SCALE isn’t so fortunate. It’s an easy fix, though. Just SSH into TrueNAS and run this:

midclt call system.advanced.update '{"swapondrive": 0}'
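
To confirm the change stuck, you can read the setting back through the same middleware client (this assumes the system.advanced.config method returns JSON and that jq is available on the system, which matches my understanding of SCALE):

midclt call system.advanced.config | jq .swapondrive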

So I destroyed and recreated the pool to get rid of all that filthy swap space, and downloaded the freshly generated encryption key. When all was said and done, I ended up with this:

NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
enctank  14.0T  58.9G  13.9T        -         -     0%     0%  1.00x    ONLINE  /mnt

NAME                                      STATE     READ WRITE CKSUM
enctank                                   ONLINE       0     0     0
  raidz1-0                                ONLINE       0     0     0
    d65d054f-31e9-4337-8e9a-36cddca17ec9  ONLINE       0     0     0
    3e9cb342-fcbe-43da-aa83-fb765cf5d655  ONLINE       0     0     0
    a25944be-8b3c-44ef-92ad-68c387962446  ONLINE       0     0     0
    93f47067-b03c-451b-95fe-ddc2abb8d02d  ONLINE       0     0     0
  raidz1-1                                ONLINE       0     0     0
    226b9c77-71f0-4819-8e83-3a852b2d9025  ONLINE       0     0     0
    41a089b6-9414-4df2-9fbb-d1c1511f98b4  ONLINE       0     0     0
    61be2e19-fe4e-459d-8d3a-79b2889cd467  ONLINE       0     0     0
    00c86038-f971-42a9-85aa-b57b7b08ae0d  ONLINE       0     0     0
logs
  mirror-2                                ONLINE       0     0     0
    2b7d440c-611e-418c-a04b-12cc0ca65ba1  ONLINE       0     0     0
    6df39f62-98bf-4796-b57f-663699515cf7  ONLINE       0     0     0

Each RAIDZ1 VDEV consists of four Samsung PM863 drives, and the SLOG is a mirror of the Optane P1600X devices. I managed to run some SMART tests, and it turns out the PM863 drives aren’t as new as I thought; the Dell iDRAC had simply lied to me. This is the real health for most of the drives:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
241 Total_LBAs_Written      -O--CK   099   099   000    -    1457592362319
242 Total_LBAs_Read         -O--CK   098   098   000    -    3204242740726

An LBA is 512 bytes on these drives, so this particular sample has written about 746TB of data. PM863s are rated for 2800TB written, so the drive still has roughly 73% of its rated life remaining. Out of all eight drives, the “youngest” is at 97%, and the “oldest” is 65%. I’ll never run anything that will pound these drives as hard as whatever they were running before, so that should be more than enough.
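
The arithmetic is easy to script. A quick one-liner in the same vein (the drive path is a placeholder, and the $8 field assumes smartctl’s brief attribute format, which matches the output shown above):

# Bytes written = Total_LBAs_Written x 512; print as decimal terabytes
smartctl -A -f brief /dev/sda | awk '/Total_LBAs_Written/ {printf "%.1f TB written\n", $8 * 512 / 1e12}'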

The boot drives fared a bit better. Perhaps because they were basically laptop system drives, they both looked like this:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
233 Media_Wearout_Indicator -O--CK   003   001   000    -    98

In this case, the raw value of 98 indicates the percentage of life remaining on the device. More than enough for what I need.

Bits and Bobs

Next came… everything else. The bulk of this was getting Apps working. TrueNAS SCALE ships with an integrated K3s Kubernetes cluster, and this is how all of the “Apps” work. They’re essentially just container stacks with conveniently simplified deployment driven by a GUI rather than YAML files. All I had to do was choose the pool I wanted to use for application data storage, and everything was mostly done.
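
If you want to peek behind the curtain, the underlying cluster is reachable from a shell. As far as I can tell, SCALE exposes kubectl through the k3s wrapper, so something like this shows every pod the Apps system is running:

# List all app pods across every namespace in the built-in cluster
k3s kubectl get pods -A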

Mostly, because I have to make things difficult for myself. See, TrueNAS binds to all addresses by default. The prevailing wisdom is to install the Traefik App, change the default web ports for the TrueNAS interface, and use Traefik as a reverse proxy to reach it once again. I say nay!

Instead, I created two IP aliases: one for TrueNAS and its web interface, and one for apps. To do that, I had to enter the Network menu and ensure both IP addresses were listed as aliases. Then I went to the System Settings -> General menu and changed the GUI settings so the web interface only bound to its intended address. Then I just had to repeat the process in the Apps -> Settings -> Advanced Settings dialog.
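
The same binding change can also be made from a shell, which is handy if you ever lock yourself out of the GUI. A hedged sketch (the ui_address field name reflects my understanding of the SCALE middleware API, and the address is a placeholder):

# Bind the web interface to a single alias instead of all addresses
midclt call system.general.update '{"ui_address": ["192.168.0.10"]}'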

Why not go the traditional route? I didn’t want the NAS itself to depend on the apps it hosts. I also find it mildly distasteful running the default web interface on non-default ports. If something ever goes wrong, I’d be left trying to remember which port the GUI was running on, and I just don’t trust myself that much. I’ll still use Traefik as a reverse proxy, but for the apps; I consider those an independent entity from the NAS.

I also dropped into the System Settings -> Services menu to enable the SSH and S.M.A.R.T. services. NFS will come later once I consolidate a bit more. The other thing I did was set up automated snapshots in the Data Protection menu. The template defaults to creating a weekly snapshot that persists for two weeks, which seemed fine. I created a second set for hourly snapshots that persist for three days, since I wanted a bit more granularity on recent files; they’re the most likely to change.

Apparently, due to how Kubernetes works, it’s important never to snapshot the Persistent Volume Claims (PVCs) it creates for app storage. I didn’t know this the first time around, so it ended up wreaking havoc with the apps. So for both of my snapshot jobs, I excluded the ix-applications dataset from recursive snapshots, and that fixed everything once I went and purged the snapshots it had already produced.
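
Purging the stragglers is straightforward from a shell. A cautious sketch (this assumes the apps dataset lives at the pool root; swap the -v on zfs destroy for -nv first to dry-run it):

# List every snapshot under ix-applications, then destroy them one by one
zfs list -H -o name -t snapshot -r enctank/ix-applications | xargs -rn1 zfs destroy -v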

Ongoing

Once I get all the apps working the way I want, I’ll post part 3 of this escapade, and boy will that be a doozy. I’ve done what feels like a monumental amount of work for little gain, and I may end up discarding all of it simply to reduce the amount of associated maintenance. But I did learn a lot along the way so that I can get the most out of this monstrosity’s final form.

Until Tomorrow