Dmitry Tantsur (Principal Software Engineer, Red Hat)
owlet.today/talks/pike-ironic-deploy-deep-dive
Tear down is covered by another deep dive.
Most of the deployment actions are done by drivers. The drivers, in turn, consist of interfaces, each with a different role in hardware management.
The power interface handles powering nodes on and off.
The management interface handles additional vendor-specific actions, like getting or setting boot device.
These interfaces are tied to the protocol used to access the BMC.
Examples include ipmitool, redfish, ilo.
The boot interface handles how either the deployment ramdisk or the final instance gets booted on the node. Examples include PXE/iPXE and vendor-specific virtual media approaches.
The deploy interface orchestrates the deployment process, including how exactly an image gets transferred to a node. Currently supported are iSCSI-based and direct deploy methods.
The network interface handles how networks are connected to and disconnected from a node.
Currently supported are the flat, neutron, and noop implementations.
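Roughly, this split into interfaces can be pictured as a set of abstract base classes. The sketch below is simplified and does not mirror the real definitions in ironic/drivers/base.py.

```python
# Simplified sketch of how hardware interfaces are structured; the real
# classes in ironic/drivers/base.py have more methods and validation hooks.
import abc


class PowerInterface(abc.ABC):
    """Turns nodes on and off via the BMC protocol (ipmitool, redfish, ...)."""

    @abc.abstractmethod
    def get_power_state(self, task):
        """Return the current power state of task.node."""

    @abc.abstractmethod
    def set_power_state(self, task, power_state):
        """Power the node on or off."""


class BootInterface(abc.ABC):
    """Controls how the deployment ramdisk and the instance get booted."""

    @abc.abstractmethod
    def prepare_ramdisk(self, task, ramdisk_params):
        """Set up PXE/iPXE or virtual media for the deployment ramdisk."""

    @abc.abstractmethod
    def prepare_instance(self, task):
        """Set up booting of the deployed instance (local or network boot)."""
```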
Before the Ocata release, the drivers were always monolithic. All interfaces were hardcoded, and partly reflected in their names.
The problem is: as of Pike-3 we will have 11 hardware interfaces and 36 drivers.
Every time we add a new implementation for any interface, we need more drivers.
The problem came to a head when we introduced the network interface, all implementations of which were compatible with all drivers.
Starting with Ocata, we support a new concept of dynamic drivers.
A hardware type defines which interface implementations it is compatible with, and which priority they have.
New fields were added to nodes for each of the interfaces, e.g. boot_interface, power_interface, etc.
A dynamic driver is built on the fly, based on these fields, interface implementations enabled in the configuration, and the defaults from the node's hardware type.
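An illustrative sketch of how such a composition can work is below; names like enabled_implementations and default_interfaces are made up, the real logic lives in Ironic's driver factory code.

```python
# Illustrative sketch of how a dynamic driver picks one implementation per
# interface; not the actual Ironic driver factory code.

def pick_interface(node, hardware_type, kind, enabled_implementations):
    """Pick one interface implementation (e.g. kind='boot') for a node."""
    # 1. An explicit choice on the node wins (e.g. node.boot_interface).
    name = getattr(node, '%s_interface' % kind, None)
    if name is None:
        # 2. Otherwise fall back to the hardware type's priority-ordered
        #    defaults, filtered by what is enabled in the configuration.
        for candidate in hardware_type.default_interfaces[kind]:
            if candidate in enabled_implementations[kind]:
                name = candidate
                break
        else:
            raise RuntimeError('no enabled %s implementation is compatible '
                               'with this hardware type' % kind)
    elif name not in enabled_implementations[kind]:
        raise RuntimeError('%s is not an enabled %s implementation'
                           % (name, kind))
    return enabled_implementations[kind][name]
```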
Historically, we've been exposing bare metal resources similar to the way virtual resources are exposed: nova/virt/ironic/driver.py.
For example, a node with 16 GiB of RAM and 4 CPUs was represented as a hypervisor with 16 GiB of RAM and 4 CPUs.
When any instance is deployed on it, it's reported as a hypervisor with 16 GiB RAM and 4 CPUs used.
This approach is racy. If a user asks for 2 instances with 2 GiB of RAM and 1 CPU each, the scheduler can try placing both of them on the same bare metal node. Only one of the attempts will succeed.
Now every node exposes several resource classes to the scheduler: nova/virt/ironic/driver.py.
This includes traditional memory/disk/CPU resources, as well as a bare metal specific custom resource class, fetched from a node's resource_class field.
At some point in time bare metal nodes will stop exposing memory/disk/CPU resources to the scheduler completely.
Flavors targeting bare metal will have to request a custom resource class instead of them: install-guide.
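As a rough illustration of the naming convention (the exact normalization code lives in Nova), an Ironic resource_class like baremetal-gold maps to the custom resource class CUSTOM_BAREMETAL_GOLD, which the flavor then requests.

```python
import re


def to_placement_resource_class(ironic_resource_class):
    """Normalize an Ironic node's resource_class for the Placement API.

    E.g. 'baremetal-gold' becomes 'CUSTOM_BAREMETAL_GOLD'. A flavor targeting
    such nodes requests resources:CUSTOM_BAREMETAL_GOLD=1 (and, once standard
    resources are no longer reported, zero VCPU/MEMORY_MB/DISK_GB).
    """
    return 'CUSTOM_' + re.sub(r'[^A-Z0-9_]', '_',
                              ironic_resource_class.upper())


assert to_placement_resource_class('baremetal-gold') == 'CUSTOM_BAREMETAL_GOLD'
```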
Capabilities allow picking nodes based on non-standard properties. Nowadays they are partly replaced by custom resource classes, but can still be useful.
A flavor can have capabilities requested via its extra_specs. They will be matched against capabilities as reported by the Ironic Nova driver: nova/virt/ironic/driver.py
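A simplified picture of the matching; the real filtering happens in the Nova scheduler, not in this form.

```python
# Hedged sketch of capability matching between a flavor and a node.

def parse_capabilities(caps_string):
    """Parse Ironic's 'key1:value1,key2:value2' capabilities format."""
    caps = {}
    for item in (caps_string or '').split(','):
        if ':' in item:
            key, value = item.split(':', 1)
            caps[key.strip()] = value.strip()
    return caps


def flavor_matches_node(extra_specs, node_capabilities):
    """Check flavor capabilities:* extra_specs against node capabilities."""
    wanted = {k.split(':', 1)[1]: v
              for k, v in extra_specs.items()
              if k.startswith('capabilities:')}
    available = parse_capabilities(node_capabilities)
    return all(available.get(k) == v for k, v in wanted.items())


# Example: a UEFI-only flavor against a node advertising UEFI boot.
assert flavor_matches_node({'capabilities:boot_mode': 'uefi'},
                           'boot_mode:uefi,boot_option:local')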
The instance_uuid field is used to lock the chosen node.
Before this point, the node picked by the scheduler can still be used by anything else. This is where potential races can happen.
The instance_uuid field can only be added or removed, but it's not possible to replace an existing value.
So after it's successfully set to an instance UUID, Nova can safely proceed with deploying on the node.
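A minimal sketch of the claim step, assuming a python-ironicclient client; the exact exception raised for an HTTP 409 conflict varies by client version, so a broad catch is used here.

```python
# Minimal sketch of locking a node via instance_uuid; the real code is in
# the Nova Ironic driver. `ironic` is assumed to be a python-ironicclient
# client.

def claim_node(ironic, node_uuid, instance_uuid):
    """Lock a node for an instance by setting its instance_uuid.

    Ironic only allows adding the field, never replacing an existing value,
    so at most one concurrent claim can succeed.
    """
    patch = [{'op': 'add', 'path': '/instance_uuid', 'value': instance_uuid}]
    try:
        ironic.node.update(node_uuid, patch)
        return True
    except Exception:  # typically a 409 Conflict if the node is taken
        return False
```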
Ironic has to know VIF IDs to be able to talk to Neutron.
Previously, they were passed via extra[vif_port_id]; now there is a separate API for that.
Nova requests every VIF to be plugged: nova/virt/ironic/driver.py. Everything else is handled by Ironic itself.
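A minimal sketch of that step, assuming a python-ironicclient client new enough to support the VIF API.

```python
# Minimal sketch of the VIF attachment step; the real code is in
# nova/virt/ironic/driver.py.

def plug_vifs(ironic, node_uuid, network_info):
    for vif in network_info:
        # Ironic stores the VIF ID and later uses it when binding the
        # corresponding bare metal port in Neutron.
        ironic.node.vif_attach(node_uuid, vif['id'])
```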
The deployment is initiated by requesting provision state active for the node.
A looping call is established to wait for a provision state that indicates either success (active) or a failure. It also accounts for a potential deletion request in the middle of a deployment: nova/virt/ironic/driver.py
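A simplified sketch of such a wait loop using oslo.service; the real _wait_for_active logic in nova/virt/ironic/driver.py also checks whether the instance was deleted in the meantime.

```python
# Hedged sketch of waiting for the node to become active.
from oslo_service import loopingcall


def wait_for_active(ironic, node_uuid, interval=2):
    def _check():
        node = ironic.node.get(node_uuid)
        if node.provision_state == 'active':
            raise loopingcall.LoopingCallDone()
        if node.provision_state in ('deploy failed', 'error'):
            raise RuntimeError(node.last_error)

    timer = loopingcall.FixedIntervalLoopingCall(_check)
    timer.start(interval=interval).wait()
```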
The deploy interface prepare method is called in several cases: on deployment (or rebuilding), on take over and on adopting a node.
In case of deployment, it switches the node to the provisioning network and asks the boot interface to prepare booting of the deployment ramdisk (PXE, iPXE or virtual media).
The deployment (also cleaning and inspection) ramdisk for Ironic is based on Ironic Python Agent (IPA) - a Python service providing an HTTP API for various provisioning tasks.
On start up, IPA initializes hardware manager(s) - plugins handling hardware-specific aspects of provisioning. The default GenericHardwareManager is used in most cases.
Then IPA gets the Ironic API URL from the kernel boot arguments (supplied by the boot interface). It calls the lookup API endpoint to figure out the current node UUID, and a few optional configuration parameters.
After a successful start up, IPA waits for requests, while periodically calling the heartbeat API. Ironic assigns tasks to IPA in response to these heartbeats.
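A rough sketch of the agent side of this cycle; the endpoint shapes and headers below are simplified assumptions based on the Pike-era ramdisk API, the real code lives in the ironic-python-agent project.

```python
# Hedged sketch of the lookup/heartbeat cycle from the agent's point of view.
import time

import requests

IRONIC_API = 'http://ironic.example.com:6385'   # taken from kernel boot args
CALLBACK_URL = 'http://192.0.2.10:9999'         # where IPA's own API listens
HEADERS = {'X-OpenStack-Ironic-API-Version': '1.31'}

# Lookup: ask Ironic which node this ramdisk is running on, by MAC address.
resp = requests.get('%s/v1/lookup' % IRONIC_API,
                    params={'addresses': '52:54:00:12:34:56'},
                    headers=HEADERS)
node_uuid = resp.json()['node']['uuid']

# Heartbeat periodically; Ironic assigns work in response to each heartbeat.
while True:
    requests.post('%s/v1/heartbeat/%s' % (IRONIC_API, node_uuid),
                  json={'callback_url': CALLBACK_URL},
                  headers=HEADERS)
    time.sleep(30)
```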
All IPA-based deploy interface implementations process heartbeats in a similar way, defined in the HeartbeatMixin class: ironic/drivers/modules/agent_base_vendor.py.
It detects the required actions by looking at the node's provision state. If it's deploy wait, the continue_deploy method is called. This method differs between different deploy interface implementations.
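A simplified picture of that dispatch; the method and state names follow the real code, everything else is trimmed.

```python
# Hedged sketch of HeartbeatMixin's dispatch; the real method in
# ironic/drivers/modules/agent_base_vendor.py also handles deployment
# completion, cleaning steps and error handling.
class HeartbeatMixinSketch(object):

    def heartbeat(self, task, callback_url):
        node = task.node
        if node.provision_state == 'deploy wait':
            # Implemented differently by the iSCSI and direct deploy
            # interfaces: write the image over iSCSI vs. let IPA download it.
            self.continue_deploy(task)
        elif node.provision_state == 'clean wait':
            self.continue_cleaning(task)
```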
For whole-disk images, Ironic only needs to copy the image and create a config drive: ironic/drivers/modules/deploy_utils.py.
For partition images, Ironic also needs to create a partition table: ironic_lib/disk_utils.py.
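To contrast the two cases, here is a sketch of the high-level steps only; the actual disk operations live in deploy_utils and ironic_lib's disk_utils.

```python
# Hedged sketch of the difference between whole-disk and partition images;
# this only illustrates the decision, not the real disk operations.

def deployment_steps(is_whole_disk_image, has_configdrive):
    """Return the high-level steps for writing an image to the target disk."""
    steps = []
    if is_whole_disk_image:
        # The image already contains a partition table and a bootloader,
        # so it is written to the device as-is.
        steps.append('write image to the whole device')
    else:
        # A partition image needs root (and optionally swap/ephemeral)
        # partitions created first; the image goes into the root partition.
        steps.append('create partition table (root, swap, ephemeral)')
        steps.append('write image into the root partition')
        steps.append('install or configure the bootloader for local boot')
    if has_configdrive:
        steps.append('create a small config drive partition')
    return steps


print(deployment_steps(is_whole_disk_image=False, has_configdrive=True))
```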
The boot interface was a relatively late addition to Ironic.
Initially, its logic was contained in the deploy interface, but it became a cause of duplication when more boot methods (e.g. virtual media) were introduced.
Currently, the boot interface is responsible for booting both the deployment (and cleaning) ramdisk, and the final instance.
There is still, however, a lot of boot code in the deploy interface implementations :(
The pxe boot interface is the generic boot interface working with (nearly) all hardware. It works by populating a PXE or iPXE environment for a given node.
It works differently depending on whether it is booting the deployment ramdisk or the final instance, and whether the instance uses local or network boot:
For a ramdisk boot, the conductor places the IPA kernel and ramdisk in the TFTP or HTTP location, and renders a PXE configuration file or an iPXE script pointing at them: ironic/drivers/modules/pxe.py.
For instance local boot (including whole-disk images which always boot locally), all PXE/iPXE configuration is merely removed, and the node's boot device is set to "disk".
For instance network boot, new PXE/iPXE configuration is written, pointing to instance kernel/ramdisk on a TFTP or HTTP location: ironic/drivers/modules/pxe.py.
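For illustration, a minimal sketch of rendering a per-node iPXE script for the ramdisk boot case; the template here is made up, the real Jinja templates and rendering code live in ironic/drivers/modules/pxe.py and ironic/common/pxe_utils.py.

```python
# Hedged sketch of producing an iPXE script pointing at the IPA ramdisk.
IPXE_TEMPLATE = """#!ipxe
kernel {http_url}/{node_uuid}/deploy_kernel ipa-api-url={ironic_api}
initrd {http_url}/{node_uuid}/deploy_ramdisk
boot
"""


def render_ipxe_script(node_uuid, http_url, ironic_api):
    """Render an iPXE script pointing the node at the IPA kernel/ramdisk."""
    return IPXE_TEMPLATE.format(node_uuid=node_uuid, http_url=http_url,
                                ironic_api=ironic_api)


print(render_ipxe_script('1be26c0b-03f2-4d2e-ae87-c02d7f33c123',
                         'http://192.0.2.1:8080',
                         'http://192.0.2.1:6385'))
```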
When Ironic is used with flat networking, it is assumed that both provisioning and tenant traffic happens on the same flat network.
Attaching the provisioning network boils down to merely setting binding:host_id on all VIFs: ironic/drivers/modules/network/flat.py.
Nothing is required for attaching the tenant network.
When Ironic is used with neutron networking, it can support any kind of network. Different networks are used for provisioning (and cleaning) and tenant traffic.
A compatible ML2 plugin is required in Neutron to be able to configure switches accordingly.
The networking-generic-switch project can be used for many kinds of hardware that accept SSH connections.
On attaching the provisioning network, Ironic creates ports on it: ironic/drivers/modules/network/neutron.py.
This boils down to iterating over Ironic ports that have PXE enabled, and passing MAC address and local link information to Neutron: ironic/common/neutron.py.
After deployment, ports are plugged into tenant networks: ironic/drivers/modules/network/neutron.py.
For each Neutron port, vnic_type is set to baremetal, and the local link information is passed: ironic/drivers/modules/network/common.py.
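A hedged sketch of what such a port update can look like with python-neutronclient; all field values below are illustrative.

```python
# Hedged sketch of binding a Neutron port to a bare metal node; the real code
# is in ironic/drivers/modules/network/common.py and ironic/common/neutron.py.
from neutronclient.v2_0 import client as neutron_client


def bind_port(session, neutron_port_id, node_uuid, local_link_info):
    """Bind one tenant VIF to the physical switch port of an Ironic port."""
    neutron = neutron_client.Client(session=session)
    neutron.update_port(neutron_port_id, {
        'port': {
            'binding:vnic_type': 'baremetal',
            'binding:host_id': node_uuid,
            # Tells the ML2 mechanism driver (e.g. networking-generic-switch)
            # which physical switch port to configure.
            'binding:profile': {
                'local_link_information': local_link_info,
            },
        },
    })
```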