Bare Metal + Kubernetes = ♡

In this blog post, I'm talking about Metal3 (pronounced "metal-kubed"): a Kubernetes API and a Cluster API provider for bare metal. I'll explain how and why it uses Ironic, and how you can use it to provision bare metal machines.

I assume that you are familiar with Kubernetes and the concept of Custom Resource Definitions (CRD). I also won't spend time explaining what Cluster API is; there are enough good resources on this topic. I will only say that it's a way for Kubernetes to manage itself, in this case by provisioning (and deprovisioning) bare metal machines.

Introduction to Metal3

To put it simply, the Metal3 project provides two components: a bare metal management API and a Cluster API implementation that uses it. The API is a Kubernetes-native wrapper around Ironic that lets you use it in a straightforward, declarative way. On top of that, Metal3 provides container images and tools to deploy Ironic and its components.

Metal3 deliverables

baremetal-operator (BMO)

The heart of the project: the BareMetalHost (BMH) custom resource definition and the controller to manage such resources. The main deployment tools can also be found in this repository.

cluster-api-provider-metal3 (CAPM3)

The Cluster API provider that uses baremetal-operator to provision machines for the cluster.

ironic-image

The container image for Ironic with suitable configuration for Metal3.

ip-address-manager (IPAM)

An optional controller to manage IP addresses. See this blog post for details: Introducing the Metal3 IP Address Manager.

hardware-classification-controller

An optional controller that labels hosts according to their physical properties, which are obtained through inspection (more on that below).

ironic-ipa-downloader

An init container that downloads IPA (ironic-python-agent, the service agent that is booted on the machines during provisioning) from the RDO project.

Note

You can, of course, provide your own images. You can build them with ironic-python-agent-builder or completely from scratch. In a future blog post, I'll explain how to use CoreOS as a foundation for IPA.

metal3-dev-env

Scripts and Ansible playbooks to set up a development environment using virtual machines. If you know OpenStack: metal3-dev-env is to Metal3 what DevStack is to OpenStack.
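
For illustration, bringing it up typically looks like this (assuming the defaults; check the README in the repository for the required configuration and supported platforms first):

$ git clone https://github.com/metal3-io/metal3-dev-env.git
$ cd metal3-dev-env
$ make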

ironic-client

A container image with the bare metal CLI for debugging. Of course, you can also just install it locally:

$ pip install --user python-ironicclient

Note

Normally you don't need the ironic client to use Metal3.

Why Ironic

… or how many people put it nowadays:

Why not Just Rewrite It in Go ™?

Maybe I'm just a bit too old to rewrite things that already work?

Seriously though, isn't it easier to just write bare metal management from scratch in a Kubernetes-native fashion? No. Definitely not.

Software as a shared experience

Ironic is a relatively old project (by the standards of this industry), and it has incorporated a lot of experience with all things bare metal. It is easy to overlook the fact that the real challenge of hardware provisioning is not implementing a bunch of standards (although that's hard enough - hello, UEFI!), but rather deciding what to do when real machines don't behave according to them. On top of the sheer complexity of the standards, you get the even larger complexity of their implementations.

From the very beginning, Ironic has been developed by a diverse group of experts from large software vendors, hardware manufacturers and cloud providers. The software vendors contributed an understanding of how large software is built, the hardware vendors provided their expertise in specific hardware families, and the cloud providers shared their strong passion for security and reliability. This mix created highly user-centric software, written with reliability, security and scalability in mind; software that features great support for popular enterprise-grade hardware, as well as decent support for most hardware on the market. Ironic has grown naturally into what it is; it would be hard to recreate from scratch, and definitely not within a reasonable timeframe.

An additional upside is that any features or bug fixes created for Metal3 can be reused by the wider community, including projects like TripleO, Bifrost and Kayobe, as well as numerous ad-hoc solutions.

All eggs in their baskets

By using a separate project for bare metal management, Metal3 achieves a clean separation of concerns between two areas:

  1. A lower-level bare metal state machine provided by Ironic.

  2. A high-level declarative API provided by baremetal-operator.

Therefore, the bare metal management system doesn't need to know how Kubernetes works, while the Kubernetes API provider doesn't need to be aware of the sometimes awkward properties of real hardware. Both projects can attract experts in the relevant domains of knowledge without requiring them to keep too much context in their heads.

On top of that, Ironic (like all OpenStack software) is API-first: it has been designed around an API rather than some sort of user interface. This is why Ironic can easily be integrated into higher-level projects like Metal3, as well as used as is.

Isn't it OpenStack?

Indeed, Ironic has been and still is developed under the OpenStack umbrella, which includes dozens of projects governed by the OpenInfra Foundation. However, this relationship does not imply a hard dependency! What we call standalone Ironic can be used without the rest of OpenStack, or with only a few chosen services (Identity and Networking are two common examples).

The above-mentioned TripleO and Kayobe don't use the whole of OpenStack with Ironic. And the Ironic project itself provides a standalone tool called Bifrost that can be used for bare metal management (see also Julia's wonderful introduction to Bifrost). Ironic works very well both with and without OpenStack, and Metal3 is great proof of that!

Kubernetes API for bare metal

Now that you understand what Ironic is and why Metal3 uses it, let us take a look at the most important object in Metal3: BareMetalHost (BMH). In Ironic terms, a BMH is a node with its ports and introspection data embedded in it. The only required information is the address of the BMC (the node's management controller) and its credentials.

---
# This is the secret with the BMC credentials (Redfish in this case).
apiVersion: v1
kind: Secret
metadata:
  name: node-1-bmc-secret
type: Opaque
data:
  username: VXNlcg==
  password: UGFzc3dvcmQ=

---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-1
spec:
  bmc:
    # BMC address, the actual format depends on the protocol:
    address: redfish+http://mgmt.node1.example.com/redfish/v1/Systems/1
    # Good old IPMI is supported as well:
    #address: ipmi://192.168.122.1:6233
    # BMC credentials - a link to a secret:
    credentialsName: node-1-bmc-secret
  # MAC address the node boots from. Will eventually be optional for
  # Redfish, but it's better to provide it.
  bootMACAddress: 00:5f:33:20:6b:5f
  # The node will use UEFI for booting (the default).
  bootMode: UEFI
  # Bring it online for further actions
  online: true
  # It's recommended to tell Ironic which device to use as a root device.
  # The `deviceName` hint is not the most reliable, you can use
  # `serialNumber`, `model`, `minSizeGigabytes` and a few others instead.
  rootDeviceHints:
    deviceName: /dev/sda
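
A side note on the secret: the base64 values above simply decode to User and Password. Instead of encoding them by hand, you can let kubectl do it for you:

$ kubectl create secret generic node-1-bmc-secret \
    --from-literal=username=User \
    --from-literal=password=Password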

Once created, the BMH undergoes inspection and cleaning, and reaches the ready state, where it can be used for deployment. Inspection results in a lot of hardware information being collected and saved as part of the status:

status:
  # ...
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 2100
      count: 8
      flags:
      - 3dnowprefetch
      - abm
      # ...
    firmware:
      bios:
        date: 02/06/2015
        vendor: EFI Development Kit II / OVMF
        version: 0.0.0
    hostname: master-0.ostest.test.metalkube.org
    nics:
    - ip: fd2e:6f44:5dd8:c956::14
      mac: 00:5f:33:20:6b:5d
      model: 0x1af4 0x0001
      name: enp2s0
    - ip: fd00:1101::8eb0:ccc2:928e:2a5
      mac: 00:5f:33:20:6b:5b
      model: 0x1af4 0x0001
      name: enp1s0
      pxe: true
    ramMebibytes: 32768
    storage:
    - hctl: "0:0:0:0"
      model: QEMU HARDDISK
      name: /dev/sda
      rotational: true
      serialNumber: drive-scsi0-0-0-0
      sizeBytes: 107374182400
      type: HDD
      vendor: QEMU
    - name: /dev/vda
      rotational: true
      sizeBytes: 8589934592
      type: HDD
      vendor: "0x1af4"
    systemVendor:
      manufacturer: Red Hat
      productName: KVM (8.2.0)

Note

As you can see, my example is taken from a virtual machine.

To deploy, you only need to populate the image information, i.e. add this to the spec:

spec:
   # ...
   image:
     # The image can be in the qcow2 or raw format.
     url: http://images.example.com/images/my-os.qcow2
     # The checksum: either the checksum value itself or the URL of a file
     # with checksums per file name.
     checksum: http://images.example.com/images/my-os.qcow2.md5sum
     # The checksum type must be set if it is not md5; sha256 and sha512
     # are also supported.
     checksumType: md5
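
In case you are wondering where the checksum file comes from: it can be generated next to the image with standard tools, for example (using the hypothetical image from the example above):

$ md5sum my-os.qcow2 > my-os.qcow2.md5sum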

After the image is downloaded, written to the disk and configured, the BMH reaches the provisioned state. After that, it can be used for creating a Kubernetes worker or for any other purpose.

status:
   # ...
   provisioning:
     ID: 7ef3f064-2b86-4d34-8b33-fb5d127a713b
     bootMode: UEFI
     image:
       url: http://images.example.com/images/my-os.qcow2
       checksum: http://images.example.com/images/my-os.qcow2.md5sum
     state: provisioned
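
If you don't want to dump the whole status while waiting, you can query the provisioning state directly (a small convenience using the field shown above):

$ kubectl get baremetalhost node-1 -o jsonpath='{.status.provisioning.state}'
provisioned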

To de-provision a node, simply remove the whole image field. After cleaning, the host will return to the ready state.
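
One way to remove the field is a JSON patch (a minimal sketch; node-1 is the host from the earlier example):

$ kubectl patch baremetalhost node-1 --type=json \
    -p '[{"op": "remove", "path": "/spec/image"}]'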

When writing an image is not enough

Sometimes operators ask for the ability to use a custom installer while keeping all the other benefits of Ironic and Metal3. This is also possible: by setting diskFormat to the special value live-iso, you can request that Ironic boot the provided ISO and consider the installation finished once the ISO starts booting. From that point on, you can use the installation procedure of your choice.

spec:
   # ...
   image:
     diskFormat: live-iso
     # The image URL: an ISO in this case.
     url: http://images.example.com/images/my-installer.iso

Note

Installation is considered done after the ISO is successfully booted on the machine. It is up to you to track the process from that point on.

If you already have bare metal machines provisioned through other means, you can simply add them as they are by marking them as externallyProvisioned:

spec:
   # ...
   externallyProvisioned: true
   image:
     url: http://images.example.com/images/my-os.qcow2
     checksum: http://images.example.com/images/my-os.qcow2.md5sum

Note

The requirement to specify a valid URL will be removed soon.

How it works

When BMO starts, it accepts the Ironic and Inspector endpoints and credentials (only HTTP basic authentication is supported), as well as the deploy kernel/ramdisk URLs, via environment variables. Ironic and Inspector are expected to be started and managed separately, for example using the deployment templates provided in the BMO repository (OpenShift does it via a separate operator, cluster-baremetal-operator).
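
To give you a rough idea, the relevant part of a BMO deployment might look like this (the values are made up, and the exact variable set may change; consult the deployment templates in the BMO repository for the authoritative list):

    env:
    - name: IRONIC_ENDPOINT
      value: http://localhost:6385/v1/
    - name: IRONIC_INSPECTOR_ENDPOINT
      value: http://localhost:5050/v1/
    - name: DEPLOY_KERNEL_URL
      value: http://localhost:6181/images/ironic-python-agent.kernel
    - name: DEPLOY_RAMDISK_URL
      value: http://localhost:6181/images/ironic-python-agent.initramfs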

Once started, the controller manages the BareMetalHost (BMH) CRD by synchronizing changes between Kubernetes and Ironic. BMO expects to completely own the nodes defined as BMHs: it creates them in Ironic if they are missing, updates them if they don't match the information it has, provisions them if an image is set, de-provisions them if the image is removed, and deletes them if the BMH is deleted (or detached).

Nodes are created with names of the form <NAMESPACE>~<BMH NAME>; for example, on OpenShift you can see the following:

$ baremetal node list --fields uuid name provision_state
+--------------------------------------+---------------------------------------+--------------------+
| UUID                                 | Name                                  | Provisioning State |
+--------------------------------------+---------------------------------------+--------------------+
| 7ef3f064-2b86-4d34-8b33-fb5d127a713b | openshift-machine-api~ostest-master-0 | active             |
| 13b00cf0-562b-47ee-8d45-c0f4dba0e074 | openshift-machine-api~ostest-master-2 | active             |
| 8e13a1ad-1029-4106-aec2-ba640eb99a1e | openshift-machine-api~ostest-master-1 | active             |
+--------------------------------------+---------------------------------------+--------------------+
$ baremetal node show openshift-machine-api~ostest-master-0 --fields driver_info instance_info -f json
{
  "driver_info": {
    "deploy_iso": "http://localhost:6181/images/ironic-python-agent.iso",
    "deploy_kernel": "http://localhost:6181/images/ironic-python-agent.kernel",
    "deploy_ramdisk": "http://localhost:6181/images/ironic-python-agent.initramfs",
    "redfish_address": "http://[fd2e:6f44:5dd8:c956::1]:8000",
    "redfish_password": "******",
    "redfish_system_id": "/redfish/v1/Systems/3f32c07a-b060-4d44-b73b-894be044b347",
    "redfish_username": "admin"
  },
  "instance_info": {
    "capabilities": {},
    "image_source": "http://[fd00:1101::3]:6181/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2",
    "image_os_hash_algo": "md5",
    "image_os_hash_value": "http://[fd00:1101::3]:6181/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum",
    "image_checksum": "http://[fd00:1101::3]:6181/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum"
  }
}

One interesting consequence of this approach is that Ironic's database (usually MySQL/MariaDB) becomes transient: it can be removed on container restart, and BMO will re-create the nodes in the right states; any in-progress operations are restarted from scratch, while already deployed nodes are adopted.

Future plans

There is a lot of work planned in Metal3. In the near future, we want to bring BIOS and RAID support to BMO; previews of these features are already available. Another contributor is working on a network configuration proposal, which will eventually allow managing bare metal switches while provisioning the nodes connected to them.

Looking further into the future, I'd like us to consider the possibility of dropping MySQL in favor of SQLite or anything else that is not a full-featured database server. Additionally, we need to develop our multi-master story: currently, the Metal3 pod with all its services is usually deployed on a single master.

Get involved

We're looking for more users, more developers, and more opinions! If you'd like to talk to us, check the metal3 community page; we have a Slack channel, as well as a good old mailing list. If you're interested in Ironic specifically, check out the Ironic community page instead.

If you'd like to experiment with Metal3 or evaluate it before production, try metal3-dev-env. If you prefer OpenShift, check out openshift dev-scripts. Finally, if you want to learn more about Ironic, check the Ironic documentation or start with the Bifrost installation guide.

Good luck with your bare metal journey!