Memory leak in a Python project

I have recently analyzed and fixed an interesting issue: one of the Metal3 components, Ironic Inspector, started showing linear growth of its memory consumption, while the other containers in the Metal3 pod did not exhibit anything similar. A service that normally requires 100-200 MiB of RAM exceeded 1 GiB within a month of being mostly idle.

This blog post describes what I did to debug the problem, and how an unorthodox approach to resource clean-up can cause unexpected troubles down the road.

How much memory does a container consume?

Should be trivial to figure out, right? Right?..

OpenShift has a metrics API (which the reporter used), but for whatever reason it's not available on dev-scripts (the testing environment we commonly use) :( But that's just Linux, so...

... no, the "ps" command is not installed either ...

... okay, let's resort to the basics:

$ oc exec -c metal3-ironic-inspector metal3-69f467dd6-b9wtl -- grep Rss /proc/1/status
RssAnon:          941804 kB
RssFile:           11248 kB
RssShmem:              0 kB

Here, the problem is already noticeable: this service has no reason to consume more than 900 MiB of RAM.

How does it develop?

To confirm the bug, I need to see that the memory usage actually grows (as opposed to allocating too much memory once). Let's commit a Small Bash Crime, knowing that the process of interest will have PID 1 in the container.

while true; do
    date -u;
    date -u >> ~/report;
    oc exec -c metal3-ironic-inspector metal3-8f9b5d6c8-mpp9v -- \
        grep Rss /proc/1/status >> ~/report;
    sleep 600;
done

Now we have a report that looks like this (although much longer):

Thu May 23 11:56:35 UTC 2024
RssAnon:          941804 kB
RssFile:           11248 kB
RssShmem:              0 kB
Thu May 23 12:06:35 UTC 2024
RssAnon:          942056 kB
RssFile:           11248 kB
RssShmem:              0 kB
Thu May 23 12:16:36 UTC 2024
RssAnon:          942336 kB
RssFile:           11248 kB
RssShmem:              0 kB
Thu May 23 12:26:36 UTC 2024
RssAnon:          942704 kB
RssFile:           11248 kB
RssShmem:              0 kB
Thu May 23 12:36:36 UTC 2024
RssAnon:          942912 kB
RssFile:           11248 kB
RssShmem:              0 kB

The growth is steady, just as reported. Bug confirmed.

Now, what is leaking?

To leak memory in Python, you need to allocate an object and then keep a reference to it somewhere forever - usually a global variable or some sort of cache (spoiler: this case is more interesting).
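
A minimal illustration (a made-up example, nothing to do with Inspector's code):

_CACHE = []  # module-level list, lives as long as the process does


def handle_request(payload):
    result = payload.upper()
    _CACHE.append(result)  # the reference is kept forever, so the string is never freed
    return result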

Checking the Inspector logs, I quickly discovered what exactly it was doing all this time: largely nothing. Just running a periodic task to sync its database with Ironic. This task is therefore the only obvious culprit.

Time to tracemalloc!

The standard tracemalloc module allows taking snapshots of memory, comparing these snapshots, and reporting the difference including the exact line where the memory was allocated.
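
In isolation, the flow looks roughly like this (a standalone sketch, not the actual Inspector patch shown below):

import tracemalloc

tracemalloc.start()
old = tracemalloc.take_snapshot()

data = [bytes(1024) for _ in range(1000)]  # allocate something noticeable

new = tracemalloc.take_snapshot()
for stat in new.compare_to(old, 'lineno')[:5]:
    print(stat)  # file, line, size delta and allocation count delta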

Game plan: take snapshots periodically and compare each new one with the previous one. At which point? Well, I have a periodic task AND this periodic task is our suspect, so why not put this logic at the end of it?

Starting tracemalloc

... is easy:

diff --git a/ironic_inspector/cmd/all.py b/ironic_inspector/cmd/all.py
index 5dcf2a9..942df91 100644
--- a/ironic_inspector/cmd/all.py
+++ b/ironic_inspector/cmd/all.py
@@ -13,6 +13,7 @@
 """The Ironic Inspector service."""

 import sys
+import tracemalloc

 from oslo_config import cfg
 from oslo_service import service
@@ -26,6 +27,8 @@ CONF = cfg.CONF


 def main(args=sys.argv[1:]):
+    tracemalloc.start()
+
     # Parse config file and command line options, then start logging
     service_utils.prepare_service(args)

Collecting samples

Proudly stol^Wadapted from the Python documentation:

diff --git a/ironic_inspector/conductor/manager.py b/ironic_inspector/conductor/manager.py
index 44e2c43..e32511d 100644
--- a/ironic_inspector/conductor/manager.py
+++ b/ironic_inspector/conductor/manager.py
@@ -13,6 +13,7 @@

 import sys
 import traceback as traceback_mod
+import tracemalloc

 from eventlet import semaphore
 from futurist import periodics
@@ -219,7 +220,10 @@ def periodic_clean_up():  # pragma: no cover
         pxe_filter.driver().sync(ir_utils.get_client())


-def sync_with_ironic(conductor):
+_SNAPSHOT = None
+
+
+def _do_sync(conductor):
     if (conductor.coordinator is not None
             and not conductor.coordinator.is_leader):
         LOG.debug('The conductor is not a leader, skipping syncing '
@@ -234,6 +238,22 @@ def sync_with_ironic(conductor):
     node_cache.delete_nodes_not_in_list(ironic_node_uuids)


+def sync_with_ironic(conductor):
+    global _SNAPSHOT
+
+    _do_sync(conductor)
+
+    new_snapshot = tracemalloc.take_snapshot()
+    if _SNAPSHOT is not None:
+        diff = new_snapshot.compare_to(_SNAPSHOT, 'lineno')
+        if diff:
+            for stat in diff[:10]:
+                LOG.info("MEMORY %s", stat)
+        else:
+            LOG.debug("No memory usage difference")
+    _SNAPSHOT = new_snapshot

Looking at the report

Each time the periodic task runs, something like this is logged:

/usr/lib64/python3.9/tracemalloc.py:193: size=0 B (-92.5 KiB), count=0 (-1974)
/usr/lib/python3.9/site-packages/openstack/config/cloud_region.py:193: size=30.6 MiB (+5868 B), count=229614 (+42), average=140 B
/usr/lib/python3.9/site-packages/openstack/config/cloud_region.py:74: size=26.0 MiB (+4987 B), count=355355 (+65), average=77 B
/usr/lib/python3.9/site-packages/openstack/config/_util.py:32: size=24.2 MiB (+4632 B), count=5467 (+1), average=4632 B
/usr/lib/python3.9/site-packages/openstack/resource.py:278: size=30.7 KiB (-2856 B), count=187 (-17), average=168 B
/usr/lib/python3.9/site-packages/openstack/config/cloud_region.py:79: size=14.5 MiB (+2777 B), count=229614 (+42), average=66 B
/usr/lib/python3.9/site-packages/eventlet/semaphore.py:43: size=8832 KiB (+1872 B), count=30717 (+6), average=294 B
/usr/lib/python3.9/site-packages/openstack/utils.py:84: size=9884 KiB (+1752 B), count=54428 (+9), average=186 B
/usr/lib/python3.9/site-packages/openstack/resource.py:574: size=558 KiB (-1008 B), count=3401 (-6), average=168 B
/usr/lib/python3.9/site-packages/openstack/connection.py:395: size=3971 KiB (+744 B), count=10933 (+2), average=372 B

Importantly, with each iteration, certain sizes always increase and never decrease. As you can guess, these are the ones already in the MiB range at the time this sample was taken.

I will ignore the eventlet allocation and hope that it goes away, because the last thing I want to do is dive into eventlet's code. Everything else clearly points at openstacksdk - the library we use to talk to Ironic.

Looking at the source

First, let's take a look at utils.py, line 84. This is not literally the first thing that caught my attention, but it exposes a completely separate, smaller issue. Here is the code:

 def get_string_format_keys(fmt_string, old_style=True):
     if old_style:

         class AccessSaver:
             def __init__(self):
                 self.keys = []

             def __getitem__(self, key):
                 self.keys.append(key)

         a = AccessSaver()
         fmt_string % a

         return a.keys

     # else ...

Ouch. A lesser-known fact: creating new types in Python is not exactly cheap. Many years ago, I ended up creating namedtuples on each API request, which made that API unbelievably slow. Now it seems that they are not necessarily deallocated either?

A naive attempt to verify this idea with a simple Python script did not yield a conclusive result, and it's possible that this leak is simply a byproduct of the larger issue. I have proposed a trivial fix that creates the class only once and left it there. If you understand how classes are deallocated, definitely let me know!
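
For reference, the idea of the fix is to define the helper class once at the module level, roughly like this (a sketch of the approach, not the exact patch):

class _AccessSaver:
    """Records which keys an old-style format string accesses."""

    def __init__(self):
        self.keys = []

    def __getitem__(self, key):
        self.keys.append(key)


def get_string_format_keys(fmt_string, old_style=True):
    if old_style:
        a = _AccessSaver()  # a new instance per call, but the class itself is created only once
        fmt_string % a
        return a.keys
    # else ...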

What else do we have?

The other lines seem pretty boring at first sight:

 if project_name not in conf:
     _disable_service(
         config_dict,
         st,
         reason="No section for project '{project}' (service type "
         "'{service_type}') was present in the config.".format(
             project=project_name, service_type=st
         ),
     )
     continue

 def _make_key(key, service_type):
     if not service_type:
         return key
     else:
         service_type = service_type.lower().replace('-', '_')
         return "_".join([service_type, key])

 def normalize_keys(config):
     new_config = {}
     for key, value in config.items():
         key = key.replace('-', '_')
         if isinstance(value, dict):
             new_config[key] = normalize_keys(value)
         elif isinstance(value, bool):
             new_config[key] = value
         elif isinstance(value, int) and key not in (
             'verbose_level',
             'api_timeout',
         ):
             new_config[key] = str(value)
         elif isinstance(value, float):
             new_config[key] = str(value)
         else:
             new_config[key] = value
     return new_config

At first glance, nothing particularly interesting: allocating some strings in the first two examples, presumably reallocating the dictionary in the last one. The location of the first extract is important though: it's part of the from_conf function that creates a new CloudRegion object, which, in turn, is part of a Connection object. By re-tracing the code paths leading to these seemingly boring allocations, I was able to conclude that connections are somehow not deallocated, which causes their configuration dictionaries to accumulate in memory.
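
I did not go down this route at the time, but one way to double-check such a suspicion is to ask the garbage collector who still references the objects (a hypothetical snippet that would have to run inside the process, e.g. from a debug hook):

import gc

# Find one live CloudRegion and print what still refers to it.
for obj in gc.get_objects():
    if type(obj).__name__ == 'CloudRegion':
        for referrer in gc.get_referrers(obj):
            print(type(referrer), repr(referrer)[:120])
        break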

Do we actually close connections?

Well, we do not:

 def get_client(token=None):
     session = _get_ironic_session()

     try:
         return openstack.connection.Connection(
             session=session, oslo_conf=CONF).baremetal
     except Exception as exc:
         LOG.error('Failed to establish a connection with ironic, '
                   'reason: %s', exc)
         raise

Not closing connections is a potential problem, but usually not a huge one in programming languages with garbage collection. Once the baremetal proxy is no longer used by the caller, it, and thus the connection, will be orphaned, and Python will eventually delete them. We won't get a chance to handle any exceptions raised during clean-up (which is why it's still a bad practice), but memory should not be leaked.

At this point, I could add the missing close call or, better, keep the connection for the lifetime of the service (which is what I'm inclined to do in reality), but first I wanted to understand the issue.
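
For completeness, caching the connection could look roughly like this (a sketch with a hypothetical module-level cache, ignoring locking and error handling):

_CONNECTION = None


def get_client(token=None):
    global _CONNECTION
    if _CONNECTION is None:
        session = _get_ironic_session()
        _CONNECTION = openstack.connection.Connection(
            session=session, oslo_conf=CONF)
    return _CONNECTION.baremetal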

Diving into close

Let's check out the Connection's close call. How many seconds will it take you to go "wait, what"?

def close(self):
    self.config.set_auth_cache()
    if self.__pool_executor:
        self.__pool_executor.shutdown()
    atexit.unregister(self.close)

Okay, okay, I was in a slightly different position: I was going through the git history and stumbled upon the patch that made close look like this. I had some context, and here is the commit message:

finalizers execution order is not guaranteed in python3. This leads to
absence of builtins like "open" in the finalizers. Depending on the
keyring backend used inside of the set_auth_cache access to the file system
may be required.

Replace use of destructor to another builtin library "atexit" to
implement a more reliable exit handler.

Wait, what?

The first paragraph is correct: you should not rely on the __del__ method being called in a normal Python environment. If your process is already exiting, many things may have been deleted. And yes, even the open built-in function may no longer be available. What made me blink in disbelief is the second part. Here is the change:

diff --git a/openstack/connection.py b/openstack/connection.py
index 819e16bd3..5722a01e2 100644
--- a/openstack/connection.py
+++ b/openstack/connection.py
@@ -176,6 +176,7 @@ try to find it and if that fails, you would create it::
 Additional information about the services can be found in the
 :ref:`service-proxies` documentation.
 """
+import atexit
 import concurrent.futures
 import warnings
 import weakref
@@ -472,9 +473,8 @@ class Connection(
                 'additional_metric_tags'
             ] = self.config.config['additional_metric_tags']

-    def __del__(self):
-        # try to force release of resources and save authorization
-        self.close()
+        # Register cleanup steps
+        atexit.register(self.close)

Since the connection's close method is doing something quite non-trivial that may easily fail during the process shutdown, the author moved this logic from the unreliable __del__ to the atexit machinery, so that it runs before anything is deallocated. Problem solved?

But what happens if close is never called? The answer is: exactly the thing described in this blog post.

Grand finale

Now we have all the pieces of the puzzle. Inspector was written long ago under the assumption that connections are pretty lightweight and closing them is not necessary. In fact, I'm not even sure the close call existed back then! A new connection was opened on every run of the periodic task but was never closed. Thanks to Python's garbage collector, everything worked smoothly.

Some time passed, and certain non-trivial logic was added to close. It was still not a problem, because Inspector never used any features that required proper clean-up. But then the patch above landed.

You see, things like self.close are bound methods: they contain a reference to the self object (otherwise, the method body would not be able to use it). Passing self.close to atexit.register makes the latter store the former somewhere in Python's guts, which, in turn, keeps the corresponding object, i.e. the Connection, alive for the entire duration of the service.
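
The effect is easy to reproduce with a toy class (nothing openstacksdk-specific here):

import atexit
import gc
import weakref


class Conn:
    def close(self):
        pass


c = Conn()
ref = weakref.ref(c)
atexit.register(c.close)  # the bound method keeps a reference to the instance

del c
gc.collect()
print(ref())  # still prints the Conn object; without atexit.register it would print None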

I'm not sure which lessons we should learn from this story. Possibly:

  • Always close things that are called Connection, File, Stream, etc.

  • Even better, cache connections if you use them constantly.

  • Don't get too creative with rarely used features, especially if they have a global effect on the process.

  • Really, avoid global state, it will bite you when you expect it the least.

  • tracemalloc rocks (and I somehow did not know about it before)!