This is a brief summary of bare metal discussions at the OpenInfra Summit & PTG 2019 in Denver.

Keynotes

The Metal3 project got some spotlight during the keynotes. A (successful!) live demo showed Ironic being used through the Kubernetes API to drive provisioning of bare metal nodes.

The official bare metal program was announced to promote managing bare metal infrastructure via OpenStack.

Forum: standalone Ironic

On Monday we had two sessions dedicated to the future development of standalone Ironic (without Nova or any other OpenStack services).

During the standalone roadmap session the audience identified two potential domains where we could provide simple alternatives to depending on OpenStack services:

Alternative authentication. It was mentioned, however, that Keystone is a relatively easy service to install and operate, so adding this to Ironic may not be worth the effort.
Multi-tenant networking without Neutron. We could use networking-ansible directly, since they are planning on providing a Python API independent of their ML2 implementation.

Next, firmware update support was a recurring topic (in hallway conversations and outside the standalone context as well). Related to that, documentation with a driver feature matrix was requested, so that such driver-specific features are easier to discover.

Then we had a separate API multi-tenancy session. Three topics were covered:

Wiring in the existing owner field for access control.

The idea is to allow non-administrator users to operate only on nodes whose owner equals their project (aka tenant) ID. In the non-keystone context this field would stay free-form. We did not agree whether we need an option to enable this feature.

An interesting use case was mentioned: give Nova a non-admin user so that it is allocated only a part of the bare metal pool instead of all nodes.

We did not reach a consensus on using a schema with the owner field, e.g. where keystone://{project ID} represents a Keystone project ID.
Adding a new field (e.g. deployed_by) to track a user that requested deploy for auditing purposes.

We agreed that the owner field should not be used for this purpose, and overall it should never be changed automatically by Ironic.
Adding some notion of node leased to, probably via a new field.

This proposal was not well defined during the session, but we probably would allow lessees some subset of the API using the policy mechanism. It became apparent that implementing a separate deployment API endpoint is required to make such a policy possible.

Creating the deployment API was identified as a potential immediate action item. Wiring the owner field can also be done in the Train cycle, if we find volunteers to push it forward.

PTG: scientific SIG

The PTG started for me with the Scientific SIG discussions of desired features and fixes in Ironic.

The hottest topic was reducing the deployment time by reducing the number of reboots that are done during the provisioning process. Ramdisk deploy was identified as a very promising feature to solve this, as well as enable booting from remote volumes not supported directly by Ironic and/or Cinder. A few SIG members committed to testing it as soon as possible.

Two related ideas were proposed for later brainstorming:

Keeping some proportion of nodes always on and with IPA booted. This builds directly on the fast-track deploy work completed in the Stein cycle. A third party orchestrator would be needed to maintain the percentage, but Ironic will have to provide an API to boot an available node into the ramdisk.
Allow using kexec to instantly switch into a freshly deployed operating system.

Combined together, these features can allow zero-reboot deployments.
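The bookkeeping that a third party orchestrator would need for the first idea can be sketched roughly as follows. This is only an illustration: the node dictionaries, the `prebooted` flag and the `nodes_to_preboot` helper are hypothetical, and the Ironic API to actually boot an available node into the ramdisk did not exist at the time of writing.

```python
def nodes_to_preboot(available_nodes, target_percent):
    """Pick powered-off available nodes to boot into the ramdisk.

    available_nodes: list of dicts with hypothetical keys "uuid" and
    "prebooted" (True if the node is already running IPA).
    target_percent: desired share of pre-booted nodes, 0..100.
    """
    # Integer ceiling of len * percent / 100, avoiding float rounding.
    wanted = (len(available_nodes) * target_percent + 99) // 100
    already = sum(1 for node in available_nodes if node["prebooted"])
    missing = max(0, wanted - already)
    cold = [node for node in available_nodes if not node["prebooted"]]
    return [node["uuid"] for node in cold[:missing]]


pool = [
    {"uuid": "a", "prebooted": True},
    {"uuid": "b", "prebooted": False},
    {"uuid": "c", "prebooted": False},
    {"uuid": "d", "prebooted": False},
]
to_boot = nodes_to_preboot(pool, 50)
```

With four available nodes, one of them already pre-booted, and a 50% target, the orchestrator would ask for one more node to be booted into the ramdisk.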

PTG: Ironic

Community sustainability

We seem to have an imbalance in reviews, with very few people handling the majority of reviews, and some of them are close to burning out.

The first thing we discussed is simplifying the specs process. We considered a single +2 approval for specs and/or documentation. Approving documentation cannot break anyone, and follow-ups are easy, so it seems a good idea. We did not reach a firm agreement on a single +2 approval for specs; I personally feel that it would only move the bottleneck from specs to the code.
Facilitating deprecated feature removals can help clean up the code, and it can often be done by new contributors. We would like to maintain a list of what can be removed when, so that we don't forget it.
We would also like to switch to single +2 for stable backports. This needs changing the stable policy, and Tony volunteered to propose it.

We felt that we're adding cores at a good pace; Julia had been mentoring people who wanted it. We would like people to volunteer, then we can mentor them into core status.

However, we were not so sure we wanted to increase the stable core team. This team is supposed to be a small number of people that know quite a few small details of the stable policy (e.g. requirements changes). We thought it would be better to switch to single +2 approval for the existing team.

Then we discussed moving away from WSME, which is barely maintained by a team of not really interested individuals. The proposal was to follow the example of Keystone and just move to Flask. We can use ironic-inspector as an example, and probably migrate part by part. JSON schema could replace WSME objects, similarly to how Nova does it. I volunteered to come up with a plan to switch, and some folks from Intel expressed interest in participating.

Standalone roadmap

We started with a recap of items from Forum: standalone Ironic.

While discussing creating a driver matrix, we realized that we could keep driver capabilities in the source code (similar to existing iSCSI boot) and generate the documentation from it. Then we could go as far as exposing this information in the API.
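To illustrate the idea of generating the matrix from the source code, here is a minimal sketch. It assumes each driver interface class carries a `capabilities` list; the class names, capability names and the `feature_matrix` helper are made up for this example and are not actual Ironic code.

```python
class FakeBootA:
    # Hypothetical interface advertising two capabilities.
    capabilities = ["iscsi_volume_boot", "ramdisk_boot"]


class FakeBootB:
    # Hypothetical interface advertising a single capability.
    capabilities = ["ramdisk_boot"]


def feature_matrix(interfaces):
    """Render a plain-text matrix: one row per interface, one column per capability."""
    caps = sorted({cap for iface in interfaces for cap in iface.capabilities})
    rows = [["interface"] + caps]
    for iface in interfaces:
        marks = ["yes" if cap in iface.capabilities else "no" for cap in caps]
        rows.append([iface.__name__] + marks)
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


matrix = feature_matrix([FakeBootA, FakeBootB])
print(matrix)
```

The same data structure could back both the generated documentation and a future API field.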

During the multi-tenancy discussion, the idea of owner and lessee fields was well received. Julia volunteered to write a specification for that. We clarified the following access control policies implemented by default:

A user can list or show nodes if they are an administrator, an owner of a node or a lessee of this node.
A user can deploy or undeploy a node (through the future deployment API) if they are an administrator, an owner of this node or a lessee of this node.
A user can update a node or any of its resources if they are an administrator or an owner of this node. A lessee of a node cannot update it.
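The three default rules above can be written down as a small decision function. This is only a sketch of the logic, not the eventual policy implementation: the `Node` and `User` classes, the single `identity` string (a project ID in the Keystone case) and the action names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Simplified, hypothetical node record with only the relevant fields.
    uuid: str
    owner: Optional[str] = None
    lessee: Optional[str] = None


@dataclass
class User:
    # Whatever identifier owner/lessee are compared against (assumption).
    identity: str
    is_admin: bool = False


def is_allowed(user: User, node: Node, action: str) -> bool:
    """Return True if the default rules described above permit the action."""
    if user.is_admin:
        return True
    is_owner = node.owner is not None and node.owner == user.identity
    is_lessee = node.lessee is not None and node.lessee == user.identity
    if action in ("list", "show", "deploy", "undeploy"):
        return is_owner or is_lessee
    if action == "update":
        return is_owner
    # Anything else stays administrator-only in this sketch.
    return False


node = Node(uuid="n1", owner="proj-a", lessee="proj-b")
print(is_allowed(User("proj-b"), node, "deploy"))  # lessee may deploy
print(is_allowed(User("proj-b"), node, "update"))  # lessee may not update
```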

The discussion of recording the user that did a deployment turned into discussing introducing a searchable log of changes to node power and provision states. We did not reach a final consensus on it, and we probably need a volunteer to push this effort forward.

Deploy steps continued

This session was dedicated to making the deploy templates framework more usable in practice.

We need to implement support for in-band deploy steps (other than the built-in deploy.deploy step). We probably need to start IPA before proceeding with the steps, similarly to how it is done with cleaning.
We agreed to proceed with splitting the built-in core step, making it a regular deploy step, as well as removing the compatibility shim for drivers that do not support deploy steps. We will probably separate writing an image to disk, writing a configdrive and creating a bootloader.

The latter could be overridden to provide custom kernel parameters.
To handle potential differences between deploy steps in different hardware types, we discussed the possibility of optionally including a hardware type or interface name in a deploy step. Such steps will only be run for nodes with matching hardware type or interface.

Mark and Ruby volunteered to write a new spec on these topics.

Day 2 operational workflow

For deployments with external health monitoring, we need a way to represent the state when a deployed node looks healthy from our side but is detected as failed by the monitoring.

It seems that we could introduce a new state transition from active to something like failed or quarantined, where a node is still deployed, but explicitly marked as at fault by an operator. On unprovisioning, this node would not become available automatically. We also considered the possibility of using a flag instead of a new state, although the operators in the room were more in favor of using a state. We largely agreed that the already overloaded maintenance flag should not be used for this.

On the Nova side we would probably use the error state to reflect nodes in the new state.

A very similar request had been made for node retirement support. We decided to look for a unified solution.

DHCP-less deploy

We discussed options to avoid relying on DHCP for deploying.

An existing specification proposes attaching IP information to virtual media. The initial contributors had become inactive, so we decided to help this work to go through. Volunteers are welcome.
As an alternative to that, we discussed using IPv6 SLAAC with multicast DNS (routed across WAN for Edge cases). A couple of folks in the room volunteered to help with testing. We need to fix python-zeroconf to support IPv6, which is something I'm planning on.

Nova room

In a cross-project discussion with the Nova team we went through a few topics:

Whether Nova should use the new Ironic API to build config drives. Since Ironic is not the only driver building config drives, we agreed that it probably doesn't make much sense to change that.
We did not come to a conclusion on deprecating capabilities. We agreed that Ironic has to provide alternatives for boot_option and boot_mode capabilities first. These will probably become deploy steps or built-in traits.
We agreed that we should switch Nova to using openstacksdk instead of ironicclient to access Ironic. This work was already in progress.

Faster deploy

We followed up on PTG: scientific SIG with potential action items on speeding up the deployment process by reducing the number of reboots. We discussed an ability to keep all or some nodes powered on and heartbeating in the available state:

Add an option to keep the ramdisk running after cleaning.
- For this to work with multi-tenant networking we'll need an IPA command to reset networking.
Add a provisioning verb going from available to available booting the node into IPA.
Make sure that pre-booted nodes are prioritized for scheduling. We will probably dynamically add a special trait. Then we'll have to update both Nova/Placement and the allocation API to support preferred (optional) traits.
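The notion of preferred (optional) traits from the last item can be illustrated with a short sketch. Neither Placement nor the allocation API supported this at the time; the trait name `CUSTOM_PREBOOTED`, the node dictionaries and the `order_candidates` helper are hypothetical.

```python
def order_candidates(nodes, required=(), preferred=()):
    """Filter nodes by required traits, then put better preferred matches first."""
    required = set(required)
    preferred = set(preferred)
    eligible = [node for node in nodes if required <= set(node["traits"])]
    # sorted() is stable, so nodes with equal scores keep their original order.
    return sorted(
        eligible,
        key=lambda node: len(preferred & set(node["traits"])),
        reverse=True,
    )


nodes = [
    {"uuid": "a", "traits": ["CUSTOM_GPU"]},
    {"uuid": "b", "traits": ["CUSTOM_GPU", "CUSTOM_PREBOOTED"]},
    {"uuid": "c", "traits": ["CUSTOM_PREBOOTED"]},
]
ordered = order_candidates(
    nodes, required=["CUSTOM_GPU"], preferred=["CUSTOM_PREBOOTED"]
)
print([node["uuid"] for node in ordered])
```

Unlike required traits, a preferred trait never excludes a node: a cold node is still a valid candidate, it is just tried after the pre-booted ones.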

We also agreed that we could provide an option to kexec instead of rebooting as an advanced deploy step for operators that really know their hardware. Multi-tenant networking can be tricky in this case, since there is no safe point to switch from deployment to tenant network. We will probably take a best effort approach: command IPA to shut down all its functionality and schedule a kexec after some time. After that, switch to tenant networks. This is not entirely secure, but will probably fit the operators (HPC) who request it.

Asynchronous clean steps

We discussed enhancements for asynchronous clean and deploy steps. Currently running a step asynchronously requires either polling in a loop (occupying a green thread) or creating a new periodic task in a hardware type. We came up with two proposed updates for clean steps:

Allow a clean step to request re-running itself after a certain amount of time. E.g. a clean step would do something like
```
@clean_step(...)
def wait_for_raid(self):
    if not raid_is_ready():
        return RerunAfter(60)
```
and the conductor would schedule re-running the same step in 60 seconds.
Allow a clean step to spawn more clean steps. E.g. a clean step would do something like
```
@clean_step(...)
def create_raid_configuration(self):
    start_create_raid()
    return RunNext([{'step': 'wait_for_raid'}])
```
and the conductor would insert the provided step into node.clean_steps after the current one and start running it.

This would allow for several follow-up steps as well. A use case is a clean step for resetting iDRAC to a clean state that in turn consists of several other clean steps. The idea of sub-steps was deemed too complicated.
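To show how the two proposed return values could interact, here is a toy, self-contained model of the conductor loop. It is not Ironic code: `RerunAfter` and `RunNext` are only the names from the proposal above, while `run_clean_steps`, the step registry and the simulated clock are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class RerunAfter:
    seconds: int


@dataclass
class RunNext:
    steps: list


def run_clean_steps(steps, registry):
    """Run steps in order on a simulated clock; return a (time, step name) log."""
    queue = list(steps)
    clock = 0
    log = []
    while queue:
        step = queue.pop(0)
        log.append((clock, step["step"]))
        result = registry[step["step"]]()
        if isinstance(result, RerunAfter):
            # A real conductor would schedule the re-run instead of waiting.
            clock += result.seconds
            queue.insert(0, step)
        elif isinstance(result, RunNext):
            # Follow-up steps run right after the current one.
            queue[0:0] = result.steps


    return log


polls = {"count": 0}


def create_raid_configuration():
    return RunNext([{"step": "wait_for_raid"}])


def wait_for_raid():
    polls["count"] += 1
    if polls["count"] < 3:  # pretend RAID is ready on the third poll
        return RerunAfter(60)


registry = {
    "create_raid_configuration": create_raid_configuration,
    "wait_for_raid": wait_for_raid,
    "erase_devices": lambda: None,
}
log = run_clean_steps(
    [{"step": "create_raid_configuration"}, {"step": "erase_devices"}], registry
)
```

In this run the waiting step is inserted after RAID creation, polled three times 60 seconds apart, and only then does the originally planned `erase_devices` step proceed.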

PTG: TripleO

We discussed our plans for removing Nova from the TripleO undercloud and moving bare metal provisioning from under control of Heat. The plan from the nova-less-deploy specification, as well as the current state of the implementation, were presented.

The current concerns are:

upgrades from a Nova based deployment (probably just wipe the Nova database),
losing user experience of nova list (largely compensated by metalsmith list),
tracking IP addresses for networks other than ctlplane (solved the same way as for deployed servers).

The next action item is to create a CI job based on the already merged code and verify a few assumptions made above.

PTG: Ironic, Placement, Blazar

We reiterated our plans to allow Ironic to optionally report nodes to Placement. This will be turned off when Nova is present to avoid conflicts with the Nova reporting. We will optionally use Placement as a backend for Ironic allocation API (which is something that had been planned before).

Then we discussed potentially exposing detailed bare metal inventory to Placement. To avoid partial allocations, Placement could introduce a new API to consume the whole resource provider. Ironic would use it when creating an allocation. No specific commitments were made with regards to this idea.

Finally we came up with the following workflow for bare metal reservations in Blazar:

A user requests a bare metal reservation from Blazar.
Blazar fetches allocation candidates from Placement.
Blazar fetches a list of bare metal nodes from Ironic and filters out allocation candidates whose resource provider UUID does not match one of the node UUIDs.
Blazar remembers the node UUID and returns the reservation UUID to the user.

When the reservation time comes:

Blazar creates an allocation in Ironic (not Placement) with the candidate node matching previously picked node and allocation UUID matching the reservation UUID.
When the enhancements in Standalone roadmap are implemented, Blazar will also set the node's lessee field to the user ID of the reservation, so that Ironic allows access to this node.
A user fetches an Ironic allocation corresponding to the Blazar reservation UUID and learns the node UUID from it.
A user proceeds with deploying the node.
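The matching logic in this workflow can be sketched without any service calls. Everything here is illustrative: allocation candidates are reduced to a plain list of resource provider UUIDs, the helper names are hypothetical, and the request body field names (`uuid`, `resource_class`, `candidate_nodes`) are modelled on the Ironic allocation API but should be treated as assumptions.

```python
def pick_reservable_node(candidate_rp_uuids, ironic_nodes):
    """Return the first allocation candidate that is also an Ironic node."""
    node_uuids = {node["uuid"] for node in ironic_nodes}
    for rp_uuid in candidate_rp_uuids:
        if rp_uuid in node_uuids:
            return rp_uuid
    return None


def build_allocation_request(reservation_uuid, node_uuid, resource_class):
    """Build the allocation Blazar would create when the reservation starts."""
    return {
        # Reusing the reservation UUID lets the user look the allocation up.
        "uuid": reservation_uuid,
        "resource_class": resource_class,
        # Restrict the allocation to the node picked at reservation time.
        "candidate_nodes": [node_uuid],
    }


candidates = ["rp-compute-1", "node-2", "node-3"]
nodes = [{"uuid": "node-2"}, {"uuid": "node-3"}]
picked = pick_reservable_node(candidates, nodes)
request = build_allocation_request("res-1", picked, "baremetal")
```

Because the allocation UUID equals the reservation UUID, the user needs no extra mapping to find the node that was reserved for them.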

Side and hallway discussions

We discussed having Heat resources for Ironic. We recommended that the team start with Allocation and Deployment resources (the latter being virtual until we implement the planned deployment API).
We prototyped how Heat resources for Ironic could look, including Node, Port, Allocation and Deployment as a first step.