This is best opened in a companion window to be referenced while reading the following information.
Summary of the Engineering Walden Project
Basically, the goal here is to provide a platform on which I can build any operating system and application implementation combination that I see fit, pursuant to the goals laid forth in Scorpionfish.
This type of platform is known in The Open Group Architecture Framework (TOGAF) as a technology architecture. In the case of this project, it is the infrastructure which provides an extensible service platform capable of supporting yet-unknown services and responding to unpredictable demands quickly, efficiently, securely, and resiliently.
The vision of the architecture is that it will provide a development and operations environment in which those burdens of life which may be borne by information technology can see their solutions developed, tested, and put into production in an accessible, affordable, effective, and academic fashion. The platform allows the owner to make use of cutting edge free and open source software to perform this feat.
The hardware platform will be restricted to a single motherboard and its attached hardware. This will provide for accessibility and affordability, along with elegance and efficiency; it is unlikely that our purposes will call for workloads so demanding that the hardware platform must support the interconnection of separate physical systems. Technology nowadays is amazing, and any consumer with adequate funds may purchase motherboards with multiple processor sockets and support for over a terabyte of RAM.
If you need more than that, then the Electric Mountain likely gives way to the cloud computing model described below.
The OpenStack model has been taken as a reference architecture, as it seeks a similar end although it is designed for a different scope; whereas the Electric Mountain is constructed from a single motherboard and its attached hardware, the OpenStack cloud computing model is designed to unify multiple motherboards and their attached hardware. Because this project is scoped to a single-system solution, the OpenStack software cannot be implemented, but the model, abstracted away from its particularities, is illuminating nonetheless.
The same core capabilities as those featured in the OpenStack model must be met, albeit at a single-system scope. The technology required to do this already exists (as OpenStack is really the expansion in scope of principles already well-understood).
The Engineering Walden Technical Reference Model can be summarized as follows:
- Processing Service capabilities:
  - Logical partitioning of hardware and software
    - Some subset of the single hardware platform will be provided to multiple separate operating systems and application suites
    - Maximum elegance and efficiency must be achieved to optimize hardware utilization despite segmentation
- Networking Service capabilities:
  - Logical partitioning of network interfaces and infrastructure
    - Hypothetical scenarios include:
      - separate internal networks between logical partitions
      - a designated storage network between logical partitions and the storage service
      - high-performance connections between logical partitions and the external network (the home LAN or Internet)
- Storage Service capabilities:
  - Logical partitioning and provisioning of storage to local and remote systems.
  - Archiving (Backup and Compression)
Software Selections for the Technical Reference Model
At minimum, the reference model calls for a Fedora 22 operating system providing a KVM/QEMU/libvirt virtualization platform on which operates a FreeBSD 10.2 guest domain managing the majority of the hard disk devices (using PCI pass-through technology to give the disks entirely over to the FreeBSD guest domain, to OpenZFS’ delight, and for ease of disk recovery and management). OpenZFS is used to administer storage, and a combination of services (NFS, Samba, Plex) can be used to provision storage.
Once that technology architecture is up, we can use the platform to support an application architecture designed to deliver the capabilities demanded by modern life. Guest domains can be created very flexibly, with resources such as processor cores, storage (think LVM thin provisioning), and even RAM capable of being overcommitted in support of the application platforms which may run on nearly any given operating system.
In this way, a single system is managed in a style similar to multi-system cloud architectures, with less extensibility and flexibility, but nearly equivalent versatility in offering the same core capabilities. It is the Electric Mountain TRM (Technical Reference Model).
Less flexible, but powerful, sturdy, fecund, and deceptively dynamic for a single hardware platform.
Explanation by Example: Adding a Capability
If I decide I need OneNote 2013, that means I need Windows 10.
The first problem is determining which hardware to use. The goal of the project is to create a single system (one motherboard and its connected hardware) which can provide a large range of possible hardware requirements to separate operating systems. To do this, I use a stack of software: the Linux kernel and KVM along with QEMU and libvirt. I can use this software in concert to create a virtual machine which will share the single set of hardware I have according to rules I specify.
Because the hypothetical OneNote-bearing Windows 10 system is simply a toolbox (that is, I’m only using it for simple OneNote 2013 computing, storing the data handled thereby, and analogous tasks with other Office products) it doesn’t need much power. I can get away with a little machine using 2 GB of RAM (maybe 4 GB if necessary) and a single processor.
Knowing that, I can create a virtual machine and apportion for its use 4 GB of the 24 GB of RAM on my single system, one of the four processor cores, and some storage space (whose allocation practices I will explain below). The Windows 10 operating system will have these portions of the hardware presented to it as though it is installed on a separate system.
That’s pretty sweet, but what would be better is if I adjust its processor share so that if the whole set of hardware is overly taxed, this particular virtual machine can only ever make use of, say, 10% of my system’s processor time. I could reserve 60% for my file system so that nothing bad ever happens to it on account of a lack of resources (an operating system can crash if it lacks adequate access to a processor) and 30% for my hypervisor (which controls all operating systems’ access to the hardware – you don’t want it to die). That way, if some horrible, horrible set of circumstances leads to all my systems simultaneously attempting to exert full control over the processor (shouldn’t ever happen), my storage service and my hypervisor service (on which the storage service depends) will not crash. Others might, but they are small, lightweight, and easily rebuilt if necessary.
Those processor time shares are not hard ceilings; they are relative weights (I’m using cgroups CPU shares), so if other virtual machines aren’t using the processor, the system with only a 10% share can make use of their idle time. The shares only come into play when the hardware experiences undue stress and must prioritize. I avoid those situations with intelligent management choices and by not overcommitting my hardware too stressfully.
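A toy model of that behavior may help. The sketch below, in Python, treats each guest's share as a relative weight and hands out processor time in proportion to it, letting busy guests absorb whatever idle guests leave unused. The function name `allocate_cpu` and the guest names are hypothetical, and this is a simplified weighted fair-share calculation, not the actual algorithm the kernel scheduler runs.

```python
def allocate_cpu(weights, demands):
    """Split 100% of processor time among guests by relative weight.

    weights: dict of guest name -> relative share (any positive number)
    demands: dict of guest name -> percent of CPU the guest wants right now
    Returns a dict of guest name -> percent of CPU granted.
    """
    granted = {name: 0.0 for name in weights}
    # Only guests that actually want processor time compete for it.
    active = {n for n in weights if demands.get(n, 0) > 0}
    remaining = 100.0
    while active and remaining > 1e-9:
        total_weight = sum(weights[n] for n in active)
        satisfied = set()
        spent = 0.0
        for n in active:
            # Offer each active guest its weighted slice of what is left.
            offer = remaining * weights[n] / total_weight
            need = demands[n] - granted[n]
            take = min(offer, need)
            granted[n] += take
            spent += take
            if granted[n] >= demands[n] - 1e-9:
                satisfied.add(n)
        remaining -= spent
        if not satisfied:
            break  # everyone still wants more: the weights have decided
        active -= satisfied  # leftover time goes around again
    return granted


# Hypothetical guests using the 60/30/10 split described above.
WEIGHTS = {"storage": 60, "hypervisor": 30, "onenote": 10}

if __name__ == "__main__":
    print(allocate_cpu(WEIGHTS, {"storage": 100, "hypervisor": 100, "onenote": 100}))
    print(allocate_cpu(WEIGHTS, {"storage": 20, "hypervisor": 10, "onenote": 100}))
```

In the first call every guest wants the whole processor, so the 60/30/10 split applies. In the second, the storage and hypervisor guests are mostly idle, and the small OneNote guest picks up the slack.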
So now I’ve got my Windows 10 system sharing a small portion of my processor and RAM. As I wrote above, it also needs persistent storage (RAM is volatile, so when it loses power, all of its data is lost – that’s why you have to boot up a system – data is loaded into RAM from the hard disk since computations occur so quickly that they would be insufferably delayed if they had to rely on the slow hard disk).
I give it a little bit of storage on my SSDs, but I have precious little space on those things, so I can’t give out too much storage there. It should be reserved for storage requiring fast and constant access, so as to take the burden off of my centralized storage solution, ’cause that storage service is going to solve the rest of this problem: if I’m going to do all my great workings, I want to have more space than that little 10 GB or so (for the operating system and program files) which my OneNote system will get on the SSDs.
The Storage Service (or: FreeBSD and OpenZFS are awesome)
So I built a storage service. It consists of a FreeBSD 10.2 (it’s a kick-ass UNIX operating system) guest domain (virtual machine) which has been given access to four 3 TB hard disks thanks to a fairly recent and sophisticated technological advance that allows the guest domain to access devices directly, without mediation by the virtualization software. That means the guest domain thinks it’s directly connected to this hardware (and it more or less is), and that solves a ton of problems since OpenZFS, with its advanced design, functions best when it has total control over (and access to) the disks. It can obtain information about the disks that the virtualization platform doesn’t typically allow the guest domain to see, for instance, and while few file systems make important use of that data (so they don’t usually have to worry about this), OpenZFS, the best ever, is different.
Logical Partitioning and Provisioning of Storage
This provides a central storage location made accessible to all those machines (either local virtual machines or other physical machines connected to my home network) which need said access. I can control their access individually. Say my Windows 10 OneNote machine can only access the OneNote data, and maybe it has only read-only access to other important data; then even if it freaks out for whatever reason (or if someone hacks it, God forbid), its activity against my precious data is restricted by solid administrative measures (which could be broken as well, but that would take some serious hacking gymnastics).
By virtue of all my devices storing and accessing their data on the same centralized platform, I can manage all that data from that one location. Rather than having to transfer data around so that my various devices can use it, or having to back up each of my devices individually, I can arrange for multiple systems to share access to data when necessary. Therefore, I can make something in Excel 2013 on my Windows 10 system and then subsequently open it on my Fedora 22 Workstation with Calligra Sheets, all without having to manually transfer it between them or worry about which version of the document I was working on, since there is only one version.
The additional major benefit from this design is that I don’t have to coordinate backups among devices; I simply arrange for backups from within the storage service (the FreeBSD operating system). From this one administrative interface, I ensure nightly backups are made off-site for all of my data, which is accessed, managed, and shared among all of my devices as necessary, so that if my house explodes, I still have yesterday’s data in its entirety with which to rebuild my life.
Of course, this storage service needs to be robust. Obviously, all of these devices accessing data on one system puts a burden on that system which would normally be spread across all the individual devices. So, I optimize the disk performance of my four 3 TB disks by arranging them more intelligently than just being a bunch of independent disks.
Using OpenZFS, I can actually create a pool of my disks such that I arrange them as two pairs of disks. I have put them in a configuration analogous to RAID 10 for performance and redundancy. This is achieved first by creating two mirrored pairs of disks, meaning that any data contained on one disk is synchronized to the other disk. Once these two pairs are set up, they are joined as a pool, and any data written to the pool is striped across those mirrored pairs of disks, meaning that half of any given data set can be expected to be stored on one mirrored pair, with the other half on the other mirrored pair.
Though both disks in a pair contain all of the same data, I do this because hard disks, especially non-SSDs, are typically performance bottlenecks. The processor can meet huge amounts of I/O requests without problem, but the hard disks have to seek out that data, read it, and report it back, and this delays the operation. That’s why people see a huge performance boost with SSDs – for random access they can be orders of magnitude faster than mechanical drives. But they are expensive, and I have four 3 TB disks. So, instead of being limited by the throughput of a single disk, I can combine two of them and access them in parallel to basically double the potential throughput. Where one disk could operate at 60 MB/s, for example, two of them could hit 120 MB/s. This is especially important if many processes (or virtual machines) are accessing the data.
The fact that I have two copies of all my data means that, if any one disk fails, the system is capable of continuing operations while in its degraded state (since one of the remaining three disks is an exact copy of the failed disk, the system can rely on it alone). That gives me time to replace the failed disk safely, after which the system automatically rebuilds all the data that should be on that disk and returns to a healthy state. No service outage occurs even though a hard disk failed.
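The layout can be sketched as a toy model in Python: two mirrors of two disks each, with blocks striped across the mirrors. The class names and the round-robin placement by block number are assumptions made for illustration; OpenZFS decides where blocks land with its own allocator, so treat this as a picture of the idea and not of its internals.

```python
class Disk:
    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block id -> bytes stored on this disk
        self.failed = False


class Mirror:
    """A pair of disks that always hold identical data."""

    def __init__(self, first, second):
        self.disks = [first, second]

    def write(self, block_id, data):
        # Every healthy disk in the mirror receives the same block.
        for disk in self.disks:
            if not disk.failed:
                disk.blocks[block_id] = data

    def read(self, block_id):
        # Any healthy disk holding the block can serve the read.
        for disk in self.disks:
            if not disk.failed and block_id in disk.blocks:
                return disk.blocks[block_id]
        raise IOError("no healthy disk in this mirror holds the block")


class Pool:
    """Stripes blocks across mirrors (round-robin here, as a simplification)."""

    def __init__(self, mirrors):
        self.mirrors = mirrors

    def _mirror_for(self, block_id):
        return self.mirrors[block_id % len(self.mirrors)]

    def write(self, block_id, data):
        self._mirror_for(block_id).write(block_id, data)

    def read(self, block_id):
        return self._mirror_for(block_id).read(block_id)


def build_pool():
    """Four disks arranged as two mirrored pairs, joined into one pool."""
    disks = [Disk(f"disk{i}") for i in range(4)]
    pool = Pool([Mirror(disks[0], disks[1]), Mirror(disks[2], disks[3])])
    return pool, disks


if __name__ == "__main__":
    pool, disks = build_pool()
    for i in range(8):
        pool.write(i, f"block {i}".encode())
    disks[0].failed = True                 # lose one disk
    print([pool.read(i) for i in range(8)])  # every block is still readable
```

Half the blocks land on each mirror, which is where the parallel throughput comes from, and losing any single disk leaves every block readable from its partner.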
Most awesomely, OpenZFS capitalizes on this redundancy in a way seldom seen in other file systems (though some are now catching up). OpenZFS maintains checksum values calculated from every block of disk storage. What this means is that every block of data (a set of a given size) is processed with a mathematical algorithm and the resulting number is stored. This type of mathematical algorithm, known as a hashing algorithm, is designed so that different sets of data almost never produce the same output, allowing the output, known as a hash value, to be used as a sort of fingerprint for the data. The same data put through the hashing algorithm will always return the same result, and you can rest assured with a very high degree of confidence that the result will be different from that of any other data combination.
So, without storing all the data twice in its entirety to have something to check against and ensure the integrity of the data (i.e. that the data has not changed since it was last knowingly altered), I can store the data and then the much smaller hash value. If I want to check the integrity of the data, I need simply read the data, process it with the hashing algorithm, and compare the result I get to the result I stored when the data was first written to disk. If they’re the same, the data hasn’t changed. If they aren’t, the data has somehow become corrupt.
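A minimal sketch of that store-then-verify cycle in Python follows. SHA-256 from the standard library is an assumption picked for illustration (which checksum algorithm a given pool uses is a configuration matter), and the helper names are hypothetical.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Fingerprint a block of data; SHA-256 is an illustrative choice."""
    return hashlib.sha256(block).hexdigest()


def write_block(data: bytes) -> dict:
    """Store the data together with the checksum computed at write time."""
    return {"data": data, "checksum": checksum(data)}


def is_intact(record: dict) -> bool:
    """Re-hash the stored data and compare against the stored checksum."""
    return checksum(record["data"]) == record["checksum"]


if __name__ == "__main__":
    record = write_block(b"tax return 2015")
    print(is_intact(record))              # freshly written data verifies
    record["data"] = b"tax return 2O15"   # simulate a corrupted byte
    print(is_intact(record))              # the mismatch exposes the corruption
```

The stored checksum is a fixed 32 bytes no matter how large the block is, which is why this is so much cheaper than keeping a second full copy just for comparison.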
On its own, this information is at least valuable in that I now know my data is corrupt. But, with the redundancy I described above, OpenZFS leverages the fact that two copies of all data exist at all times (one on each disk of a mirrored pair); if OpenZFS reads a block of data and determines that block has become corrupt (by processing it with the hashing algorithm and observing that the resulting checksum value does not match the stored checksum value), it consults the other disk in the mirrored pair and determines whether its copy of the data is correct by the same process. If so, it copies that data over the corrupt data, logs that it had to do that for you, and is thereby self-healing.
So, minor hard disk errors which occur and wreck up data (known as bit rot in the biz) as a hard disk ages are not a threat to OpenZFS, since it has the capability to administer the data it stores in this fashion (ReFS, offered by Microsoft, and Btrfs, I believe, have recently acquired this behavior, but OpenZFS has had it for much longer).
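The self-healing read can be sketched in a few lines of Python. The function `self_healing_read` and its arguments are hypothetical, and SHA-256 again stands in for whatever checksum the pool is configured with; the point is only the verify, fall back, repair, and log sequence.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies, stored_checksum, log):
    """Return verified data from a mirror, repairing bad copies in place.

    copies: list with one copy of the block per disk in the mirror
    stored_checksum: checksum recorded when the block was written
    log: list that receives a message for every repair performed
    """
    # Find the first copy whose checksum still matches.
    good = None
    for copy in copies:
        if checksum(copy) == stored_checksum:
            good = copy
            break
    if good is None:
        raise IOError("every copy of the block is corrupt")
    # Overwrite any copy that fails verification with the good one.
    for index, copy in enumerate(copies):
        if checksum(copy) != stored_checksum:
            copies[index] = good
            log.append(f"repaired copy {index}")
    return good


if __name__ == "__main__":
    original = b"family photos"
    stored = checksum(original)
    mirror = [b"family phot0s", original]   # first disk has rotted a bit
    repairs = []
    print(self_healing_read(mirror, stored, repairs))
    print(mirror, repairs)
```

If both copies fail verification there is nothing to heal from, which is why the sketch raises an error in that case.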
Pretty sweet, yes?
For security against theft, the disks are encrypted with AES-256. That algorithm is so strong that a billion high-end GPUs working in parallel through the possible keys would need ~6.7e40 times longer than the age of the universe to have a 50% chance of breaking the encryption with current mathematical methods.
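That figure can be reproduced with some back-of-the-envelope arithmetic. In the Python sketch below, the rate of two billion keys per second per GPU and the 13.75-billion-year age of the universe are assumptions; the text does not state them, but they are the values that land on roughly the quoted number.

```python
KEYSPACE = 2 ** 256                      # number of possible AES-256 keys
KEYS_FOR_EVEN_ODDS = KEYSPACE // 2       # try half the keys for a 50% chance
GPUS = 10 ** 9                           # "a billion high-end GPUs"
KEYS_PER_SECOND_PER_GPU = 2 * 10 ** 9    # assumption: not stated in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.75e9          # assumption: approximate estimate


def universe_lifetimes_to_crack():
    """How many ages of the universe a 50% brute-force attempt would take."""
    seconds = KEYS_FOR_EVEN_ODDS / (GPUS * KEYS_PER_SECOND_PER_GPU)
    years = seconds / SECONDS_PER_YEAR
    return years / AGE_OF_UNIVERSE_YEARS


if __name__ == "__main__":
    print(f"{universe_lifetimes_to_crack():.1e}")
```

Changing the assumed GPU speed by a factor of a thousand only moves the exponent by three, so the conclusion does not depend on getting that guess exactly right.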
Seriously. There is no risk of my data being lifted if someone steals my disks. Even if it’s the NSA (unless we SEVERELY underestimate them, and I don’t think we do).
Archiving (and Compression)
Every day at 4:00 AM, OpenZFS takes a snapshot of every dataset. It performs this instantaneously, and exactly what is on the disk at that point in time is permanently saved and stored for future access. I can control when these snapshots occur, and though I currently don’t remove them, I’ll probably keep every day’s snapshot for half the year, and then after that just one a week in perpetuity. This way, I not only have all my data, but I have a readable journal of that data’s evolution. I can recover files lost days ago (and later, years ago) with excellent tools that allow me to search for that data and retrieve it. Hell, users can even do it themselves, since every folder available to them has a hidden directory that exposes all those snapshots (they’re read-only) to the users.
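The retention plan described here (every daily snapshot for half a year, then one per week forever) can be sketched as a small Python function. Treating "half the year" as 182 days and keeping the Sunday snapshot as the weekly one are both assumptions, and `snapshots_to_keep` is a hypothetical helper, not part of any OpenZFS tooling.

```python
from datetime import date, timedelta

DAILY_WINDOW = timedelta(days=182)   # assumption: "half the year"
WEEKLY_KEEP_DAY = 6                  # assumption: keep Sundays (Monday is 0)


def snapshots_to_keep(snapshot_dates, today):
    """Return the snapshot dates the retention plan would preserve."""
    keep = []
    for taken in snapshot_dates:
        age = today - taken
        if age <= DAILY_WINDOW:
            keep.append(taken)       # recent: keep every daily snapshot
        elif taken.weekday() == WEEKLY_KEEP_DAY:
            keep.append(taken)       # older: keep one snapshot per week
    return keep


if __name__ == "__main__":
    today = date(2015, 10, 1)
    history = [today - timedelta(days=n) for n in range(400)]
    kept = snapshots_to_keep(history, today)
    print(len(history), "snapshots taken,", len(kept), "kept")
```

Anything the function does not return would be a candidate for destruction; the sketch only decides, it does not delete.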
The storage service then transmits only the data in these snapshots (just the differences which have been created between now and the last snapshot) to my off-site archive location, minimizing the traffic that needs to occur and optimizing the process. A separate FreeBSD system at a super secret location stores this third copy of my data in a single compressed 4 TB disk (my pool of space is actually 6 TB, when not compressed).
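The idea of sending only the differences between two snapshots can be illustrated with a toy model in Python. Here a snapshot is just a dictionary of file path to content, and the function names are hypothetical; the real mechanism works on blocks inside the file system and not on whole files, so this shows the principle only.

```python
def snapshot_delta(previous, current):
    """Describe what changed between two snapshots (dicts of path -> content)."""
    changed = {path: content for path, content in current.items()
               if previous.get(path) != content}
    deleted = sorted(path for path in previous if path not in current)
    return {"changed": changed, "deleted": deleted}


def apply_delta(replica, delta):
    """Bring an off-site replica up to date using only the delta."""
    updated = dict(replica)
    updated.update(delta["changed"])
    for path in delta["deleted"]:
        updated.pop(path, None)
    return updated


if __name__ == "__main__":
    monday = {"notes.one": "v1", "budget.xlsx": "v1", "old.txt": "junk"}
    tuesday = {"notes.one": "v2", "budget.xlsx": "v1", "photo.jpg": "new"}
    delta = snapshot_delta(monday, tuesday)
    print(delta)                                   # only what changed travels
    print(apply_delta(monday, delta) == tuesday)   # replica now matches
```

The unchanged budget file never crosses the wire, which is the whole appeal when the link to the off-site location is slow.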
Scope of the Storage Service Solution
As noted above, even physically separate devices on my network (e.g. my iPhone, iPad, etc.) have the means to store, or at least synchronize, their data with locations apportioned to them with this storage service on my home server. They can also retrieve data from certain locations, all within my administrative oversight, of course. If my wife or I make a cataclysmic error on her iPad, I don’t want us to be able to accidentally delete photos off of the home server, so the iPad has read-only access to that data while it retains the ability to write data to the server in a safer location.
Even devices on other networks (such as my mother’s laptop on her home network, or my mother-in-law’s desktop on hers) can be given storage on my home server. Using SSH or OpenVPN (or somesuch solution), their devices can establish secure, authenticated connections to the home server and synchronize their computers’ data to it, so they gain all of the benefits of snapshots, integrity, resilience, and archival without my need to manage it separately beyond minimal administrative intervention.
My home server solves many people’s problems, and that’s good, because we want to maximize the benefits of our labor.
Example of Service Extension
So that’s the storage service at work in a very rudimentary fashion. It can also provide storage to other virtual machines, and those virtual machines can host any solution I can think up.
For a basic example, I want to run services such as a media server that allows my PlayStation (and other devices) access to content. To do this, I can create another lightweight virtual machine (2 processor cores, 4 GB of RAM, 30 GB of storage on the SSDs since Plex makes use of a fairly robust local data set in its operations), install the free (though not open source) Plex Media Server on it, provide read-only access to the storage system so it can see the media (again, I’m preventing needless duplication of this data to the Plex system and all the administrative overhead that would come with such a solution), and I can connect to it using the PlayStations and their built-in DLNA capabilities (that’s a protocol for streaming media). The iPads and whatnot can make use of this too, so family members can browse our home pictures and videos on the iPad, downloading any which might be needed for any particular project or email, and it’s all safe from the iPad’s potential mishaps or destruction.
I can extend the solution just like that many times over. If I find some sweet software that does something awesome, no matter what OS it requires, I can create a virtual machine, share the resources necessary to its ends, and run the software.
Some Highlights of the Virtualization Platform (or: KVM and LVM are awesome)
Aside from the excellent administrative boundary provided by the virtualization technology I’m using (one system’s total destruction will not nuke other systems), there are a number of benefits conferred by the software stack I’m using to provide this virtualization platform.
Snapshots and Guest Domain Recovery
One sweet thing about VMs is that I can shut them down and take snapshots of them. I can then turn them on and test whatever solution I have planned. If it blows up in my face – if it absolutely otherwise irrevocably destroys the data on that system somehow (I accidentally n00b out, say), then I can simply revert to the snapshot I took and the system is returned to exactly the state it was in.
I use the logical volume manager on the virtualization platform system to do that. I can even take snapshots of the virtualization platform itself and recover it by the same method. For example, if I have to perform a big upgrade to keep up with Fedora releases (23 is coming out soon and I am on 21, so I’m going to have to upgrade to 22 at least), that’s always more risky than your average day, so a snapshot will be a valuable recovery option, though things should probably go fine.
KVM allows for a single four-core processor to support numerous virtual machines given access to that system. By the processor sharing scheme I introduced at the beginning of this article, five guest domains, all of which have been given all four processor cores with which to operate, can be configured to share time on that processor. The hypervisor (the Linux kernel with KVM loaded) will mediate their access and have them take turns just like multiple processes in a single operating system take turns accessing the processor.
Random Access Memory
Not only can the processor be overcommitted, but I can overcommit RAM. Because the RAM used in any given operating system is likely to contain a fair amount of data which isn’t regularly accessed (unless the system is really highly optimized), the hypervisor can take advantage of this. I may have only 24 GB of RAM, but I can overcommit to systems more than that. When the hypervisor notices the systems aren’t making much use of some of their RAM, it writes the overflow to the hard disk. One must be careful of overcommitting RAM with highly-utilized systems, but with systems separated largely for flexibility and not performance reasons, this can be useful if managed carefully and made necessary by the number of systems desired.
Finally, even the hard disk space can be overcommitted. Hard disk portions assigned to guest domains can be apportioned in a manner known as “thin provisioning.” For a basic example, I could have a thin provisioning pool of 30 GB of space. If I knew two guest domains would only ever make use of 15 GB of space each, but they each required that I have that much hard disk space for some stupid reason (like Windows Home Server 2011 arbitrarily required 160 GB of space for the operating system despite the fact that it took up like 30 GB), I could basically lie and tell the guest domains they each had 30 GB. I need to manage this carefully, because the guests will crash if they need more space and there isn’t any, but as long as I’m correct in my estimation that neither of them will exceed 15 GB of actual use, the thin provisioning pool will contain them both in 30 GB of space even though the total amount given to guest domains is 60 GB.
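The two-guest example can be modeled in Python to show both the trick and the risk. The `ThinPool` class below is a hypothetical toy, not an interface to LVM; it only tracks how much space has been promised versus how much has actually been written.

```python
class ThinPool:
    """Toy model of a thin provisioning pool (sizes in GB)."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.volumes = {}   # name -> {"virtual": promised GB, "used": written GB}

    def create_volume(self, name, virtual_gb):
        # Creating a volume costs nothing up front; it is only a promise.
        self.volumes[name] = {"virtual": virtual_gb, "used": 0}

    @property
    def promised_gb(self):
        return sum(v["virtual"] for v in self.volumes.values())

    @property
    def used_gb(self):
        return sum(v["used"] for v in self.volumes.values())

    def write(self, name, gb):
        volume = self.volumes[name]
        if volume["used"] + gb > volume["virtual"]:
            raise ValueError(f"{name} is full from the guest's point of view")
        if self.used_gb + gb > self.physical_gb:
            # The lie is exposed: the guest thinks it has room, the pool does not.
            raise RuntimeError("thin pool exhausted")
        volume["used"] += gb


if __name__ == "__main__":
    pool = ThinPool(physical_gb=30)
    pool.create_volume("guest-a", virtual_gb=30)
    pool.create_volume("guest-b", virtual_gb=30)
    pool.write("guest-a", 15)
    pool.write("guest-b", 15)
    print(pool.promised_gb, "GB promised,", pool.used_gb, "GB used")
    try:
        pool.write("guest-a", 1)
    except RuntimeError as error:
        print("write failed:", error)
```

As long as both guests stay at or below 15 GB everything fits; the very next gigabyte fails even though the guest believes it has half its disk free.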
This is most useful when you have a horde of, say, virtual desktops. If you have 300 employees and you have a service level agreement with them stipulating that each is offered 30 GB of storage, though you know from statistical sampling and experience that the average employee will only make use of 20 GB, you can thinly provision those virtual desktops, cap them each at 30 GB, but start them at, say, 10 GB, thereby making the initial disk consumption 3 TB for all machines rather than the 9 TB required to meet the SLA. If your statistics and estimations pan out, anyone who needs the space can grow into it and anyone who doesn’t won’t consume it, meaning you’ll only need about 6 TB of space total to provide all the space necessary for your people while still meeting the SLA, even for that small subset of employees who actually make use of all the storage available to them.
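The arithmetic behind those three figures is short enough to write out. The Python sketch below uses the numbers from the paragraph and assumes decimal units (1 TB = 1000 GB), which is what the round totals imply; `total_tb` is a hypothetical helper.

```python
EMPLOYEES = 300
SLA_CAP_GB = 30            # what each employee is promised
INITIAL_GB = 10            # what each thin volume starts at
EXPECTED_AVERAGE_GB = 20   # what the average employee is expected to use


def total_tb(per_desktop_gb, desktops=EMPLOYEES):
    """Total space in TB for a given per-desktop size (1 TB = 1000 GB)."""
    return per_desktop_gb * desktops / 1000


if __name__ == "__main__":
    print("fully provisioned:", total_tb(SLA_CAP_GB), "TB")
    print("initial thin footprint:", total_tb(INITIAL_GB), "TB")
    print("expected steady state:", total_tb(EXPECTED_AVERAGE_GB), "TB")
```

The gap between the fully provisioned total and the expected steady state is the hardware you avoid buying, and also the margin you are betting your estimates on.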
I liken this sort of administration strategy to a hypothetical scenario in which multiple people are given a mansion in which to live. Each believes the mansion to be solely his own possession, but all of them are managed in such a way that they only ever inhabit rooms which are uninhabited by the other mansion dwellers. As one leaves a room, another can enter, and they all end up thinking they have this great and glorious mansion to themselves, never realizing it’s shared by many.
Obviously, this is harder to accomplish with a mansion and (most) people than with guest domains and a hypervisor.
Back to the Diagram
So that’s what’s up. Notice each box in the diagram represents either the virtualization platform system or a guest domain, and there are various hardware components illustrated whose logical partitions are color coded to match the guest domains making use of them and labeled to explain their function.