Ironic Deploy Deep Dive

The Pike Release Version



Dmitry Tantsur (Principal Software Engineer, Red Hat)
owlet.today/talks/pike-ironic-deploy-deep-dive

Agenda

  • Overview of Ironic drivers
  • Scheduling on bare metal nodes
  • Initiating the deployment process
  • The iSCSI deploy interface
  • Boot management and PXE boot
  • Connecting networks and VIFs

Tear down is covered by another deep dive.

Ironic drivers overview

Ironic drivers overview

Driver interfaces

Most of the deployment actions are done by drivers. The drivers, in turn, consist of interfaces, each with a different role in hardware management.

Ironic drivers overview

Power and management

The power interface handles powering nodes on and off.

The management interface handles additional vendor-specific actions, like getting or setting the boot device.

Both interfaces are tied to the protocol used to access the BMC.

Examples include ipmitool, redfish, ilo.

Ironic drivers overview

Boot and deploy

The boot interface handles how either the deployment ramdisk or the final instance gets booted on the node. Examples include PXE/iPXE and vendor-specific virtual media approaches.

The deploy interface orchestrates the deployment process, including how exactly an image gets transferred to a node. Currently supported are iSCSI-based and direct deploy methods.

Ironic drivers overview

Network

The network interface handles how networks are connected to and disconnected from a node.

Currently supported are the following implementations:

  • noop networking does nothing, and expects DHCP to be configured externally.
  • flat networking uses Neutron with one flat provision network, also serving as a tenant network.
  • neutron networking also uses Neutron, but it is able to talk to switch-specific ML2 drivers to connect/disconnect different networks to and from nodes.

Ironic drivers overview

Classic drivers

Before the Ocata release, the drivers were always monolithic. All interfaces were hardcoded, and partly reflected in their names.

  • The pxe_ipmitool driver uses ipmitool power and management, pxe boot and iSCSI-based deploy as performed by IPA.
  • The agent_ipmitool driver uses ipmitool power and management, pxe boot and direct deploy as performed by IPA.
  • We are not good at naming, are we?

ironic/drivers/ipmi.py

Ironic drivers overview

Classic drivers drawbacks

The problem is: as of Pike-3 we will have 11 hardware interfaces and 36 drivers.

Every time we add a new implementation for any interface, we need more drivers.

The situation came to a head when we introduced the network interface, every implementation of which was compatible with every driver.

Starting with Ocata, we support a new concept of dynamic drivers.

Ironic drivers overview

Hardware types

A hardware type defines which interface implementations it is compatible with, and which priority they have.

New fields were added to nodes, one per interface: boot_interface, power_interface, etc.

A dynamic driver is built on the fly, based on these fields, interface implementations enabled in the configuration, and the defaults from the node's hardware type.

ironic/drivers/ipmi.py ironic/drivers/generic.py

install-guide on interfaces
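To illustrate the idea (this is not the actual Ironic driver factory code), here is a minimal sketch of how a dynamic driver could be composed from the per-node fields, the implementations enabled in the configuration, and the hardware type defaults. All names and defaults below are hypothetical simplifications.

    # Hypothetical sketch: composing a "dynamic driver" for one node.
    HW_TYPE_DEFAULTS = {           # priority-ordered defaults of a hardware type
        'boot': ['pxe'],
        'deploy': ['iscsi', 'direct'],
        'power': ['ipmitool'],
        'management': ['ipmitool'],
        'network': ['flat', 'noop', 'neutron'],
    }

    ENABLED = {                    # what the operator enabled in the configuration
        'boot': {'pxe'},
        'deploy': {'direct', 'iscsi'},
        'power': {'ipmitool'},
        'management': {'ipmitool'},
        'network': {'neutron', 'flat'},
    }

    def compose_driver(node):
        """Pick an implementation for every interface of this node."""
        driver = {}
        for iface, defaults in HW_TYPE_DEFAULTS.items():
            explicit = node.get('%s_interface' % iface)   # e.g. node['deploy_interface']
            if explicit:
                choice = explicit
            else:
                # first hardware type default that is also enabled by the operator
                choice = next(d for d in defaults if d in ENABLED[iface])
            driver[iface] = choice
        return driver

    print(compose_driver({'deploy_interface': 'direct'}))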

Scheduling on bare metal

Scheduling on bare metal

Exposing resources - before Pike

Historically, we've been exposing bare metal resources similarly to the way virtual resources are exposed: nova/virt/ironic/driver.py.

For example, a node with 16 GiB of RAM and 4 CPUs was represented as a hypervisor with 16 GiB of RAM and 4 CPUs.

When any instance is deployed on it, it's reported as a hypervisor with 16 GiB RAM and 4 CPUs used.

Scheduling on bare metal

Exact filters

This approach is racy. If a user asks for 2 instances with 2 GiB of RAM and 1 CPU each, the scheduler can try placing both of them on the same bare metal node. Only one of the attempts will succeed.

  • One mitigation is to use exact scheduling filters (see the sketch after this list).
  • Another is to have a lot of retries in the RetryFilter. TripleO uses 30. It works, but has a strong negative impact when an actual failure happens.
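As a rough illustration of the exact-filter idea, here is a minimal sketch. It is not Nova's actual filter code (the real ExactRamFilter operates on host state objects), just the essence of the check.

    # Hypothetical sketch of an "exact RAM" scheduling check: the host qualifies
    # only if it is free and its total RAM matches the flavor exactly, so a
    # 16 GiB node never hosts a 2 GiB request.
    def exact_ram_filter(host_total_ram_mb, host_used_ram_mb, requested_ram_mb):
        return host_used_ram_mb == 0 and host_total_ram_mb == requested_ram_mb

    print(exact_ram_filter(16384, 0, 16384))  # True: the whole node matches
    print(exact_ram_filter(16384, 0, 2048))   # False: node would be used "partially"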

Scheduling on bare metal

Exposing resources - Pike

Now every node exposes several resource classes to the scheduler: nova/virt/ironic/driver.py.

This includes traditional memory/disk/CPU resources, as well as a bare metal-specific custom resource class, fetched from the node's resource_class field.

Scheduling on bare metal

Exposing resources - After Pike

At some point in time bare metal nodes will stop exposing memory/disk/CPU resources to the scheduler completely.

Flavors targeting bare metal will have to request a custom resource class instead of them: install-guide.
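A minimal sketch of the mapping involved: the normalization rule follows the documented convention, while the flavor extra specs below are a made-up example in the spirit of the install-guide.

    import re

    def to_custom_resource_class(resource_class):
        """Map a node's resource_class to a placement custom resource class.

        Non-alphanumeric characters become underscores, the result is upper-cased
        and prefixed with CUSTOM_ (e.g. 'baremetal-gold' -> 'CUSTOM_BAREMETAL_GOLD').
        """
        return 'CUSTOM_' + re.sub(r'[^A-Za-z0-9]', '_', resource_class).upper()

    # A bare metal flavor then requests exactly one unit of that class
    # (and, after Pike, zero of the standard resources):
    flavor_extra_specs = {
        'resources:%s' % to_custom_resource_class('baremetal-gold'): '1',
        'resources:VCPU': '0',
        'resources:MEMORY_MB': '0',
        'resources:DISK_GB': '0',
    }
    print(flavor_extra_specs)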

Scheduling on bare metal

Capabilities

Capabilities allow picking nodes based on non-standard properties. Nowadays they are partly replaced by custom resource classes, but can still be useful.

A flavor can have capabilities requested via its extra_specs. They will be matched against capabilities as reported by the Ironic Nova driver: nova/virt/ironic/driver.py
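For illustration, a minimal sketch of that matching, assuming the usual 'key1:value1,key2:value2' format of a node's capabilities string. This is not the actual matching code, and it ignores operator syntax such as "<in>".

    def parse_node_capabilities(caps_string):
        """Parse the 'key1:value1,key2:value2' capabilities string."""
        return dict(item.split(':', 1) for item in caps_string.split(',') if item)

    def capabilities_match(flavor_extra_specs, node_caps_string):
        node_caps = parse_node_capabilities(node_caps_string)
        for key, wanted in flavor_extra_specs.items():
            if key.startswith('capabilities:'):
                if node_caps.get(key[len('capabilities:'):]) != wanted:
                    return False
        return True

    print(capabilities_match({'capabilities:boot_mode': 'uefi'},
                             'boot_mode:uefi,boot_option:local'))   # True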

Initiating the deployment

Initiating the deployment

High-level overview

nova/virt/ironic/driver.py

  1. Add instance_info to the node
  2. Add instance_uuid to lock the node
  3. Validate the final node information
  4. Plug VIFs and start the firewall
  5. Build and store a config drive
  6. Issue a provisioning request
  7. Wait for the inevitable success

Initiating the deployment

Instance information

nova/virt/ironic/patcher.py

  • Image information: Glance source, disk, swap and ephemeral partition sizes
  • Nova host ID (used when binding ports)
  • Flavor details: VCPUs, memory, disk (will probably go away eventually)
  • Requested (and matched) capabilities

Initiating the deployment

Instance UUID

The instance_uuid field is used to lock the chosen node.

Before this point, the node picked by the scheduler can still be used by anything else. This is where potential races can happen.

The instance_uuid field can only be added or removed, but it's not possible to replace an existing value.

So after it's successfully set to an instance UUID, Nova is safe to proceed with deploying on it.
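A minimal sketch of what this looks like against the bare metal API, assuming direct use of requests. Authentication and error handling are omitted, the endpoint URL is made up, and the microversion value is an assumption.

    import requests

    IRONIC = 'http://ironic.example.com:6385'             # hypothetical endpoint
    HEADERS = {'X-OpenStack-Ironic-API-Version': '1.31'}  # assumed recent microversion

    def lock_node(node_uuid, instance_uuid):
        """Try to claim a node by adding instance_uuid; fails if already set."""
        patch = [{'op': 'add', 'path': '/instance_uuid', 'value': instance_uuid}]
        resp = requests.patch('%s/v1/nodes/%s' % (IRONIC, node_uuid),
                              json=patch, headers=HEADERS)
        # Ironic refuses to overwrite an existing instance_uuid, so a second
        # caller racing for the same node gets an error response here.
        return resp.ok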

Initiating the deployment

Plugging VIFs

Ironic has to know VIF IDs to be able to talk to Neutron.

Previously, they were passed via extra[vif_port_id]; now there is a separate API for that.

Nova requests every VIF to be plugged: nova/virt/ironic/driver.py. Everything else is handled by Ironic itself.
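For illustration, a minimal sketch of the VIF attach call against that API, again with raw requests and authentication omitted; the endpoint URL and microversion value are assumptions.

    import requests

    IRONIC = 'http://ironic.example.com:6385'             # hypothetical endpoint
    HEADERS = {'X-OpenStack-Ironic-API-Version': '1.31'}

    def attach_vif(node_uuid, vif_uuid):
        """Ask Ironic to associate a Neutron port (VIF) with the node."""
        resp = requests.post('%s/v1/nodes/%s/vifs' % (IRONIC, node_uuid),
                             json={'id': vif_uuid}, headers=HEADERS)
        resp.raise_for_status()
        # Ironic picks a suitable bare metal port and records the VIF ID on it;
        # the actual Neutron wiring happens later via the network interface.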

Initiating the deployment

Provision state change

The deployment is initiated by requesting provision state active for the node.

A looping call is established to wait for a provision state that indicates either success (active) or a failure. It also accounts for a potential deletion request in the middle of a deployment: nova/virt/ironic/driver.py
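A minimal sketch of such a wait loop, assuming a hypothetical get_provision_state() helper. Nova uses an eventlet looping call rather than a plain loop, and also watches for a concurrent deletion request.

    import time

    def wait_for_active(get_provision_state, timeout=1800, interval=10):
        """Poll until the node becomes active or the deployment fails."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_provision_state()      # e.g. reads node.provision_state
            if state == 'active':
                return
            if state in ('deploy failed', 'error'):
                raise RuntimeError('deployment failed in state %s' % state)
            time.sleep(interval)
        raise RuntimeError('timed out waiting for the node to become active')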

Ironic deployment overview

Ironic deployment overview

Preparation

  1. Plug VIFs
  2. Cache images
  3. Configure boot environment (PXE, iPXE, virtual media)
  4. Connect the provisioning network to the node
  5. Boot the ramdisk (IPA)
  6. Wait for a callback from the ramdisk

Ironic deployment overview

Deployment - iSCSI method

  1. Request IPA to expose the root disk as an iSCSI share
  2. Mount the resulting iSCSI share to the conductor
  3. In case of partition images - partition the target device
  4. Flash the instance image to the target device
  5. Write the config drive, if provided
  6. In case of partition images and local boot - install the bootloader on the target device
  7. Unmount the iSCSI share

Ironic deployment overview

Deployment - direct method

  1. In case of partition images - request IPA to partition the target device
  2. Request IPA to flash the instance image (fetched from a Swift temporary URL or an HTTP location) to the target device
  3. Request IPA to write the config drive, if provided
  4. In case of partition images and local boot - request IPA to install the bootloader on the target device

Ironic deployment overview

Finishing

  1. Request IPA to power off the machine
  2. Disconnect the provisioning network and connect tenant network(s)
  3. Set the boot device as requested
  4. Power on the machine

iSCSI-based deploy

iSCSI-based deploy

Starting the deploy

  1. The deploy starts when a conductor receives a do_node_deploy RPC request.
  2. A few sanity checks are done next: power and deploy interface validations, and a check for maintenance mode.
  3. The node is moved to the deploying provision state, and a new thread is launched for the remaining actions.
  4. There, the prepare and deploy methods of the deploy interface are called.

ironic/conductor/manager.py (1)

ironic/conductor/manager.py (2)

iSCSI-based deploy

Preparation

The deploy interface prepare method is called in several cases: on deployment (or rebuilding), on take over and on adopting a node.

In case of deployment, it

  1. removes tenant networks from the node (if any)
  2. adds the provisioning network (if needed)
  3. orders the boot interface of the node to boot the deployment ramdisk

ironic/drivers/modules/iscsi_deploy.py.
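Paraphrased as a sketch: the method names below are close to the real driver interfaces, but the body is heavily simplified and only covers the deployment case, not take over or adoption.

    def prepare_for_deploy(task):
        """Simplified sketch of the deploy interface prepare() path."""
        node = task.node
        if node.provision_state in ('deploying', 'rebuilding'):   # simplified check
            # 1. make sure no tenant networks are still attached
            task.driver.network.unconfigure_tenant_networks(task)
            # 2. attach the (usually isolated) provisioning network
            task.driver.network.add_provisioning_network(task)
            # 3. ask the boot interface to set up PXE/iPXE/virtual media so that
            #    the next boot loads the deployment ramdisk (IPA)
            deploy_opts = {}   # kernel parameters for the ramdisk, simplified
            task.driver.boot.prepare_ramdisk(task, ramdisk_params=deploy_opts)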

iSCSI-based deploy

Starting the deploy

  1. The actual deploy process starts with caching the instance (user) image on the conductor. Ironic can download it from Glance, as well as from any HTTP(S) location.
  2. The image is (usually) converted to the "raw" format first to ensure it can be dd-ed to the target device (see the example after this list).
  3. The node is rebooted. In the prepare call we already ensured that it will boot the deployment ramdisk.
  4. At this point, the node's provision state changes from deploying to deploy wait, and the conductor idles, awaiting a callback from the ramdisk.
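For example, the conversion to raw can be reproduced with qemu-img; Ironic shells out to it in a similar, though more elaborate, way.

    import subprocess

    def convert_to_raw(source_path, target_path):
        """Convert a cached image (e.g. qcow2) to raw so it can be dd-ed later."""
        subprocess.check_call(
            ['qemu-img', 'convert', '-O', 'raw', source_path, target_path])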

iSCSI-based deploy

IPA start up and lookup

The deployment (also cleaning and inspection) ramdisk for Ironic is based on Ironic Python Agent (IPA), a Python service providing an HTTP API for various provisioning tasks.

On start up, IPA initializes hardware manager(s), plugins handling hardware-specific aspects of provisioning. The default GenericHardwareManager is used in most cases.

Then IPA gets the Ironic API URL from the kernel boot arguments (supplied by the boot interface). It calls the lookup API endpoint to figure out the current node UUID, and a few optional configuration parameters.
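A rough sketch of that start-up sequence: the kernel parameter name ipa-api-url matches what the boot interface passes, but the lookup request below is simplified and its exact parameters are an assumption.

    import re
    import requests

    def get_api_url():
        """Extract the Ironic API URL passed on the kernel command line."""
        with open('/proc/cmdline') as f:
            cmdline = f.read()
        match = re.search(r'ipa-api-url=(\S+)', cmdline)
        return match.group(1) if match else None

    def lookup(api_url, mac_addresses):
        """Ask Ironic which node these MAC addresses belong to (simplified)."""
        resp = requests.get('%s/v1/lookup' % api_url,
                            params={'addresses': ','.join(mac_addresses)},
                            headers={'X-OpenStack-Ironic-API-Version': '1.31'})
        resp.raise_for_status()
        return resp.json()   # contains the node UUID and agent configuration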

iSCSI-based deploy

IPA heart beats

After a successful start up, IPA waits for requests, while periodically calling the Ironic heartbeat API. Ironic assigns tasks to IPA in response to these heart beats.

All IPA-based deploy interface implementations process heart beats in a similar way, defined in the HeartbeatMixin class: ironic/drivers/modules/agent_base_vendor.py.

It detects the required actions by looking at the node's provision state. If it's deploy wait, the continue_deploy method is called. This method differs between different deploy interface implementations.
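Paraphrased, the dispatch looks roughly like this; the state names follow the slides, while the real code uses Ironic's state constants and also handles cleaning, rescue and error states.

    def on_heartbeat(task, deploy_interface):
        """Simplified sketch of HeartbeatMixin-style dispatching."""
        state = task.node.provision_state
        if state == 'deploy wait':
            # the ramdisk is up and waiting: continue with the actual deployment
            deploy_interface.continue_deploy(task)
        elif state == 'deploying':
            # deployment already in progress on the conductor side: nothing to do
            pass
        # (cleaning and failure states are handled similarly in the real code)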

iSCSI-based deploy

Finding root disk

  1. Ironic requests IPA to publish the target disk via iSCSI, providing the complete node information: ironic/drivers/modules/iscsi_deploy.py.
  2. The IPA iscsi extension starts by asking the current hardware manager to pick the target device.
  3. If root device hints were provided on the node, they are used: ironic_python_agent/hardware.py, ironic_lib/utils.py.
  4. Otherwise, the smallest disk that is larger than 4 GiB is used (sketched below): ironic_python_agent/utils.py.
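A minimal sketch of that default selection rule: root device hints are omitted, sizes are in bytes, and the 4 GiB threshold mirrors the default.

    MIN_DISK_BYTES = 4 * 1024 ** 3   # ignore disks of 4 GiB or less

    def pick_root_disk(block_devices):
        """Pick the smallest disk larger than 4 GiB.

        block_devices: list of dicts like {'name': '/dev/sda', 'size': <bytes>}.
        """
        candidates = [d for d in block_devices if d['size'] > MIN_DISK_BYTES]
        if not candidates:
            raise RuntimeError('no suitable disk found for deployment')
        return min(candidates, key=lambda d: d['size'])

    print(pick_root_disk([{'name': '/dev/sda', 'size': 500 * 1024 ** 3},
                          {'name': '/dev/sdb', 'size': 100 * 1024 ** 3}]))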

iSCSI-based deploy

Accessing root disk

  1. The chosen device is published using either tgtd or LIO. For CentOS/RHEL, LIO is used: ironic_python_agent/extensions/iscsi.py.
  2. On receiving a successful result from IPA, the conductor mounts the resulting iSCSI share locally: ironic/drivers/modules/deploy_utils.py.
  3. Then Ironic proceeds with writing the image. It is done differently for partition and whole-disk images.

iSCSI-based deploy

Whole-disk images

In this case, Ironic only needs to copy the image and create a config drive: ironic/drivers/modules/deploy_utils.py.

  1. The image is written using dd (see the sketch after this list), converting it to the raw format first if the conductor has not already done so: ironic_lib/disk_utils.py.
  2. Then the conductor checks whether a config drive partition is present, and creates one if missing: ironic_lib/disk_utils.py.
  3. Finally, the config drive is dd-ed to the resulting partition.
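Roughly what the copy step looks like, reduced to a sketch; Ironic's real helpers add format conversion, size checks and retries.

    import subprocess

    def write_image_to_disk(raw_image_path, target_device):
        """dd a raw image onto the target block device (e.g. the iSCSI share)."""
        subprocess.check_call(
            ['dd', 'if=%s' % raw_image_path, 'of=%s' % target_device,
             'bs=1M', 'oflag=direct'])
        # re-read the partition table so the newly written partitions show up
        subprocess.check_call(['partprobe', target_device])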

iSCSI-based deploy

Partition images [1]

In this case, Ironic also needs to create a partition table: ironic_lib/disk_utils.py.

  1. All metadata on the target disk is destroyed: ironic_lib/disk_utils.py.
  2. A partition table of the requested type (MBR or GPT) is created. The default for UEFI is GPT, otherwise MBR is used by default: ironic_lib/disk_utils.py.

iSCSI-based deploy

Partition images [2]

  1. Then the root, swap, ephemeral and config drive partitions are created. The root partition goes last to allow it to be extended later (e.g. by cloud-init): ironic_lib/disk_utils.py.
  2. The swap, ephemeral (optionally) and EFI (optionally) partitions are formatted; the root and config drive partitions are populated: ironic_lib/disk_utils.py.

iSCSI-based deploy

Final steps

  1. If local boot is requested, the conductor instructs IPA to install the boot loader: ironic_python_agent/extensions/image.py. Also the boot device is changed to "disk".
  2. The boot interface prepare_instance is called.
  3. The conductor asks IPA to perform a soft reboot, unless a hard reboot was explicitly requested for this node: ironic/drivers/modules/agent_base_vendor.py.
  4. The node is moved from the provisioning network to the tenant network(s) via the appropriate network interface calls.
  5. Finally, the node is rebooted. The deployment is done.

Networking

Networking

The boot interface

The boot interface was a relatively late addition to Ironic.

Initially, its logic was contained in the deploy interface, but it became a cause of duplication when more boot methods (e.g. virtual media) were introduced.

Currently, the boot interface is responsible for booting both the deployment (and cleaning) ramdisk, and the final instance.

There is still, however, a lot of boot code in the deploy interface implementations :(

Networking

PXE boot overview

The pxe boot interface is the generic boot interface working with (nearly) all hardware. It works by populating a PXE or iPXE environment for a given node.

It works differently depending on

  • whether PXE or iPXE is configured,
  • whether the instance image is partition or whole-disk,
  • whether local or network boot for the instance is requested,
  • whether BIOS or UEFI boot is used for the node.

Networking

PXE bootstrap

  1. Neutron's DHCP server instructs the node to boot the PXE ROM (pxelinux.0) from the conductor's TFTP server.
  2. The PXE ROM requests the configuration file named pxelinux.cfg/{MAC} from TFTP. This file is generated by the conductor: ironic/drivers/modules/pxe_config.template.
  3. This configuration file boots the kernel/ramdisk pair published by the conductor on TFTP for the node.

Networking

iPXE bootstrap

  1. If the node does not indicate (in its DHCP request) that it's running iPXE, Neutron sends it the iPXE ROM (undionly.kpxe for BIOS, ipxe.efi for UEFI).
  2. When the node runs the iPXE ROM, it is instructed to fetch the iPXE script boot.ipxe from the conductor's HTTP server. This file is auto-generated and is the same for all nodes: ironic/drivers/modules/boot.ipxe.
  3. This script loads another script at pxelinux.cfg/{MAC}, generated by the conductor for this node: ironic/drivers/modules/ipxe_config.template.
  4. The final script boots the kernel/ramdisk pair published by the conductor on HTTP for the node.

Networking

DHCP configuration

  1. The DHCP options are generated for either PXE or iPXE boot: ironic/common/pxe_utils.py.
  2. The update_dhcp_opts method of the DHCP provider is called with these options. It ends up populating extra_dhcp_opts on every VIF (an example follows below): ironic/dhcp/neutron.py.
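For illustration, the options involved might look like this. The option names differ between PXE and iPXE and between DHCP backends, so treat the values as an example rather than the exact list Ironic generates.

    # Example DHCP options for plain PXE boot, in the shape Neutron's
    # extra_dhcp_opts expects (opt_name/opt_value pairs).
    TFTP_SERVER = '192.0.2.1'          # conductor's TFTP server, example address

    pxe_dhcp_opts = [
        {'opt_name': 'bootfile-name', 'opt_value': 'pxelinux.0'},
        {'opt_name': 'server-ip-address', 'opt_value': TFTP_SERVER},
        {'opt_name': 'tftp-server', 'opt_value': TFTP_SERVER},
    ]

    # Updating a VIF then roughly means:
    #   PUT /v2.0/ports/<port_id>  {"port": {"extra_dhcp_opts": pxe_dhcp_opts}}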

Networking

Boot configuration

For a ramdisk boot, the conductor places IPA kernel and ramdisk to the TFTP or HTTP location, and renders a PXE configuration file or an iPXE script pointing at them: ironic/drivers/modules/pxe.py.

For instance local boot (including whole-disk images which always boot locally), all PXE/iPXE configuration is merely removed, and the node's boot device is set to "disk".

For instance network boot, new PXE/iPXE configuration is written, pointing to instance kernel/ramdisk on a TFTP or HTTP location: ironic/drivers/modules/pxe.py.

Networking

Plugging flat networks

When Ironic is used with flat networking, it is assumed that both provisioning and tenant traffic happens on the same flat network.

Attaching the provisioning network boils down to merely setting binding:host_id on all VIFs: ironic/drivers/modules/network/flat.py.

Nothing is required for attaching the tenant network.

Networking

Advanced networking [1]

When Ironic is used with neutron networking, it can support any kind of network. Different networks are used for provisioning (and cleaning) and for tenant traffic.

A compatible ML2 plugin is required in Neutron to be able to configure switches accordingly.

The networking-generic-switch project can be used for many kinds of hardware that accept SSH connections.

Networking

Advanced networking [2]

On attaching the provisioning network, Ironic creates ports on it: ironic/drivers/modules/network/neutron.py.

This boils down to iterating over Ironic ports that have PXE enabled, and passing MAC address and local link information to Neutron: ironic/common/neutron.py.

Networking

Advanced networking [3]

After deployment, ports are plugged into tenant networks: ironic/drivers/modules/network/neutron.py.

For each Neutron port, vnic_type is set to baremetal, and the local link information is passed: ironic/drivers/modules/network/common.py.
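A sketch of the resulting Neutron port update: the field names follow the Neutron port binding extension, while all values are made up.

    # Example body for updating a Neutron port when moving a node to its tenant
    # network; local_link_information comes from the Ironic port's
    # local_link_connection field.
    port_update = {
        'port': {
            'binding:vnic_type': 'baremetal',
            'binding:host_id': 'example-host-or-node-uuid',   # example value
            'binding:profile': {
                'local_link_information': [{
                    'switch_id': 'aa:bb:cc:dd:ee:ff',          # switch MAC, example
                    'port_id': 'Gig0/1',                       # switch port, example
                    'switch_info': 'tor-switch-1',             # optional free-form name
                }],
            },
        },
    }
    # e.g. with python-neutronclient: neutron.update_port(port_id, port_update)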

Questions?