Debugging gobject reference count problems
Debugging gobject reference leaks can be difficult, very difficult according to the official documentation. If you google this subject you will find numerous hits. A tool called RefDbg was created to address this specific need. It however appears to have lost effectiveness because (taken from the docs):
Beginning with glib 2.6 the default build does not allow functions to
be overridden using LD_PRELOAD (for performance reasons). This is the
method that refdbg uses to intercept calls to the GObject reference
count related functions and therefore refdbg will not work. The
solution is to build glib 2.6.x with the '--disable-visibility'
configure flag or use an older version of glib (<= 2.4.x). Its often
helpful to build a debug version of glib anyways to get more useful
stack back traces.
I actually tried the tool and got the error “LD_PRELOAD function override not working. Need to build glib with –disable-visibility?” , so I kept looking. Another post lead me to investigate using systemtap which seemed promising, so I looked into it further. This approach eventually got me the information I needed to find and fix reference count problems. I’m outlining what I did in hopes that it will be beneficial to others as well. For the record, I used this approach on storaged, but it should work for any code base that uses gobjects. The overall approach is:
- Identify what is leaking
- For each object that is leaking in step 1, collect the needed information to help figure out why
Seems simple…
Step 1, identification
Step 1 is pretty much straight out of my initial information gathered on this topic and using systemtap. This is my version of the system tap file I used to figure out what was leaking.
global alive
probe gobject.object_new {
alive[type]++
printf ("CREATED:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
probe gobject.object_finalize {
alive[type]--
printf ("DELETED:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
probe end {
printf ("Alive objects: \n")
foreach (a in alive) {
if (alive[a] > 0)
printf ("%d\t%s\n", alive[a], a)
}
}
To utilize this you need to run your application via systemtap, eg.
stap /home/tasleson/alive\_stack.stp \
-d /usr/lib64/libgio-2.0.so.0.4800.2 \
-d /usr/lib64/libglib-2.0.so.0.4800.2 \
-d /usr/lib64/libgudev-1.0.so \
-d /usr/lib64/libc-2.23.so \
-d /usr/lib64/libpthread-2.23.so \
-d /usr/lib64/libpolkit-gobject-1.so \
-d udisks/.libs/libudisks2.so \
-d modules/lvm2/.libs/libudisks2\_lvm2.so \
-d modules/libudisks2\_lsm.so \
-d modules/libudisks2\_btrfs.so \
-d modules/libudisks2\_iscsi.so \
-d modules/libudisks2\_bcache.so \
-d modules/libudisks2\_zram.so \
-d src/.libs/lt-udisksd \
-c "src/udisksd -r --uninstalled commandline --force-load-modules" -o /tmp/identify\_leaks.txt
In this example I’m dumping the output to a file /tmp/identify_leaks.txt. If you look at the output you will see something like:
...
CREATED:GParamObject:16451:0x7f028c005660
g_type_create_instance+0x20b
g_param_spec_internal+0x1e3
g_param_spec_object+0x72
g_socket_connection_class_intern_init+0xe7
g_type_class_ref+0xa37
g_type_class_ref+0x2e5
g_object_new_valist+0x4ac
g_object_new+0xf1
g_socket_client_connect+0x140
g_dbus_address_try_connect_one+0x1fa
g_dbus_address_get_stream_sync+0xf8
initable_init+0xe7
async_init_thread+0x2e
g_task_thread_pool_thread+0x5d
g_thread_pool_thread_proxy+0x9e
g_thread_proxy+0x55
start_thread+0xca
clone+0x6d
CREATED:GUnixConnection:16451:0x7f028c0056f0
g_type_create_instance+0x20b
g_object_new_internal+0x2cb
g_object_new_valist+0x44e
g_object_new+0xf1
g_socket_client_connect+0x140
g_dbus_address_try_connect_one+0x1fa
g_dbus_address_get_stream_sync+0xf8
initable_init+0xe7
async_init_thread+0x2e
g_task_thread_pool_thread+0x5d
g_thread_pool_thread_proxy+0x9e
g_thread_proxy+0x55
start_thread+0xca
clone+0x6d
DELETED:GSocketAddressAddressEnumerator:16451:0x13fc660
g_object_unref+0x1a0
g_socket_client_connect+0x375
g_dbus_address_try_connect_one+0x1fa
g_dbus_address_get_stream_sync+0xf8
initable_init+0xe7
async_init_thread+0x2e
g_task_thread_pool_thread+0x5d
g_thread_pool_thread_proxy+0x9e
g_thread_proxy+0x55
start_thread+0xca
clone+0x6d
DELETED:GSocketClient:16451:0x1406a20
g_object_unref+0x1a0
g_dbus_address_try_connect_one+0x211
g_dbus_address_get_stream_sync+0xf8
initable_init+0xe7
async_init_thread+0x2e
g_task_thread_pool_thread+0x5d
g_thread_pool_thread_proxy+0x9e
g_thread_proxy+0x55
start_thread+0xca
clone+0x6d
...
Alive objects:
53 GParamBoolean
1 GTask
50 GParamObject
58 GParamString
6 GParamFlags
1 GDBusConnection
12 GParamEnum
7 GParamBoxed
1 GUnixSocketAddress
10 GParamUInt
12 GParamInt
1 GSocket
1 GUnixConnection
1 GDBusAuth
1 GSocketInputStream
1 GSocketOutputStream
1 GCancellable
1 GDBusMessage
7 GIOModule
1 GLocalVfs
127 GParamOverride
1 GInotifyFileMonitor
1 GUdevClient
1 UDisksLinuxProvider
1 UDisksObjectSkeleton
1 UDisksLinuxManager
1 UDisksLinuxManagerBTRFS
1 UDisksLinuxManagerZRAM
1 UDisksLinuxManagerLSM
1 UDisksLinuxManagerISCSIInitiator
1 UDisksLinuxManagerBcache
1 UDisksLinuxManagerLVM2
1556 GUdevDevice
1556 UDisksLinuxDevice
17 UDisksLinuxDriveObject
3 GParamVariant
16 GParamUInt64
17 UDisksLinuxDrive
3 GParamDouble
1 GParamInt64
17 UDisksLinuxDriveAta
20 UDisksLinuxBlockObject
20 UDisksLinuxBlock
2 UDisksLinuxPartitionTable
3 UDisksLinuxFilesystem
3 UDisksLinuxPartition
1 UDisksLinuxSwapspace
6 UDisksLinuxPhysicalVolume
107 UDisksLinuxLogicalVolumeObject
107 UDisksLinuxLogicalVolume
The end summary shows you a list of objects that are still present in memory when we exit the process. In this particular case we can see that something looks wrong as we have 1556 instances of GUdevDevice and UDisksLinuxDevice which is unexpected. At this point we have an idea of what’s leaking and where they were created and some destroyed, but not much else. What’s really needed is a full detail of every place an object reference gets incremented and decremented.
Step 2, getting more detailed information
To debug a reference counting leak you ultimately need to identify:
- Where an object was created
- Each place a reference count was incremented and decremented
My first approach was to edit my above systemtap file and simply dump all of this, eg:
global alive
probe gobject.object_new {
alive[type]++
printf ("CREATED:%s tid=%d\n", type, tid())
print_ubacktrace_brief()
print("\n")
}
probe gobject.object_ref {
printf("INC REF:%s\n", type)
print_ubacktrace_brief()
print("\n")
}
probe gobject.object_unref {
printf("DEC REF:%s\n", type)
print_ubacktrace_brief()
print("\n")
}
probe gobject.object_finalize {
alive[type]--
printf ("DELETED:%s tid=%d\n", type, tid())
print_ubacktrace_brief()
print("\n")
}
probe end {
printf ("Alive objects: \n")
foreach (a in alive) {
if (alive[a] > 0)
printf ("%d\t%s\n", alive[a], a)
}
}
But this failed as this dumps far too much information causing errors to be thrown by sytemtap. So instead I changed it to be specific:
global alive
global leaking = "GUdevDevice"
probe gobject.object_new {
if (type == leaking) {
alive[type]++
printf ("CREATED:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
}
probe gobject.object_finalize {
if (type == leaking) {
alive[type]--
printf ("DELETED:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
}
probe gobject.object_ref {
if (type == leaking) {
printf ("INC:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
}
probe gobject.object_unref {
if (type == leaking) {
printf ("DEC:%s:%d:%p\n", type, tid(), object)
print_ubacktrace_brief()
print("\n")
}
}
probe end {
printf ("Alive objects: \n")
foreach (a in alive) {
if (alive[a] > 0)
printf ("%d\t%s\n", alive[a], a)
}
}
However, even with all this data things weren’t totally clear, there was tons of data to look at. I spent time looking at back traces and eventually I realized that some leaking objects had multiple different paths in which they were created. How could I focus my attention to the correct ones? To help with this I wrote a python script that took the output of the first systemtap and summarized it:
#!/usr/bin/env python
import sys
import md5
# Return a list of tuples (symbol, addr) for the stack trace
def process_stack(s):
rc = []
sl = s.readline()
while sl:
if len(sl):
sl = sl.rstrip()
if '+' in sl:
(sym, addr) = sl.split('+')
rc.append((sym, addr))
else:
return rc
sl = s.readline()
return rc
def gen_report(stap_file):
gobjects = {}
with open(stap_file, 'r') as f:
line = f.readline()
while line:
if len(line):
line = line.rstrip()
# Skip the summary to the end
if line == 'Alive objects:':
return gobjects
if ':' in line:
try:
# We have a CREATE|DELETE
(op, obj_type, tid, addr) = line.split(':')
st = process_stack(f)
if op == 'CREATED':
key = "%s:%s" % (obj_type, addr)
assert key not in gobjects
gobjects[key] = (obj_type, tid, addr, st)
else:
# Object deleted
key = "%s:%s" % (obj_type, addr)
assert key in gobjects
if key in gobjects:
del gobjects[key]
except ValueError as e:
print('line in error= "%s"' % line)
raise e
line = f.readline()
return gobjects
def signature(st):
m = md5.new()
for i in st:
m.update(i[0])
return m.hexdigest()
def leak_count(h):
count = 0
for v in h.values():
count += v['count']
return count
def summarize(s):
# We will store a hash of hashes, where first level is object type and
# second is signature of stack trace
t = {}
for r in s.values():
(obj_type, tid, addr, st) = r
st_sig = signature(st)
if obj_type in t:
if st_sig in t[obj_type]:
# We've seen this st before, lets increment count
t[obj_type][st_sig]['count'] += 1
else:
t[obj_type][st_sig] = dict(count=1, meta=st)
else:
t[obj_type] = dict()
t[obj_type][st_sig] = dict(count=1, meta=st)
# Lets build a list of keys sorted by who leaks the most
items = [(leak_count(v), k) for k, v in t.items()]
items.sort()
items.reverse()
for i in items:
# Lets dump what we have
v = t[i[1]]
for values in v.values():
print('LEAK %s %d times (%d total diff. ways)' % (i[1], values['count'], len(v)))
for s in values['meta']:
print("\t%s:%s" % (s[0], s[1]))
items = [(len(v), k) for k, v in t.items()]
items.sort()
items.reverse()
print('Number of unique stack traces for a given type that leaks')
for i in items:
print('%d: %s' % (i[0], i[1]))
if __name__ == '__main__':
result = gen_report('/tmp/identify_leaks.txt')
summarize(result)
The python script ultimately gives us the ability to determine how many different code paths exist for the creation of an object and which path leaks the most. Some sample output:
LEAK UDisksLinuxDevice 1556 times (1 total diff. ways)
g_type_create_instance:0x20b
g_object_new_internal:0x2cb
g_object_newv:0x1dd
g_object_new:0x104
LEAK GUdevDevice 40 times (2 total diff. ways)
g_type_create_instance:0x20b
g_object_new_internal:0x2cb
g_object_newv:0x1dd
g_object_new:0x104
_g_udev_device_new:0x1b
g_udev_client_query_by_subsystem:0xce
LEAK GUdevDevice 1516 times (2 total diff. ways)
g_type_create_instance:0x20b
g_object_new_internal:0x2cb
g_object_newv:0x1dd
g_object_new:0x104
_g_udev_device_new:0x1b
monitor_event:0x29
g_main_context_dispatch:0x15a
g_main_context_iterate.isra.29:0x1f0
g_main_loop_run:0xc2
LEAK GParamOverride 1 times (14 total diff. ways)
g_type_create_instance:0x20b
g_param_spec_internal:0x1e3
g_param_spec_override:0x81
g_object_class_override_property:0xde
udisks_partition_table_override_properties:0x2d
udisks_partition_table_skeleton_class_init:0x6e
udisks_partition_table_skeleton_class_intern_init:0x48
g_type_class_ref:0xa37
g_type_class_ref:0x2e5
g_object_newv:0x278
g_object_new:0x104
...
Number of unique stack traces for a given type that leaks
23: GParamObject
18: GParamString
18: GParamBoolean
14: GParamOverride
7: GParamUInt64
7: GParamEnum
7: GParamBoxed
6: GParamInt
5: GParamUInt
5: GParamFlags
3: GParamVariant
2: GUdevDevice
2: GParamDouble
1: UDisksObjectSkeleton
1: UDisksLinuxSwapspace
...
We suspected that the biggest leaking object was GUdevDevice and based on this information we can see that there are two distinct paths where an object is created which leak, with one of them being the vast majority (lucky for me).
LEAK GUdevDevice 40 times (2 total diff. ways)
g_type_create_instance:0x20b
g_object_new_internal:0x2cb
g_object_newv:0x1dd
g_object_new:0x104
_g_udev_device_new:0x1b
g_udev_client_query_by_subsystem:0xce
LEAK GUdevDevice 1516 times (2 total diff. ways)
g_type_create_instance:0x20b
g_object_new_internal:0x2cb
g_object_newv:0x1dd
g_object_new:0x104
_g_udev_device_new:0x1b
monitor_event:0x29
g_main_context_dispatch:0x15a
g_main_context_iterate.isra.29:0x1f0
g_main_loop_run:0xc2
It’s interesting how many different code paths exist for some objects which are leaking, 23 different code paths, oh fun! With this new information we can look at the output from the second specific systemtap output, find an object instance that was created which matches the same code path we selected from the summary and then proceed to examine all the places where the object reference count was incremented and decremented, an example (back traces omitted for brevity):
CREATED:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
INC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
DEC:GUdevDevice:11766:0x7f82cc005470
At this point I wish I could share some magical way to identify the exact places where we are missing a reference count decrement and fix the code, but unfortunately you just end up looking at a lot of code and evaluating where the increment occurred, but a decrement is missing. Sometimes it will be easy because the incremented lifespan is short, but in others the span is large and across thread boundaries. This is where it would be good for developers to comment on the intended life cycle of the object in question when the code is being written.
Final thoughts
Fixing gobject reference count leaks is indeed very difficult. Even more so when you are working on code that others developed. If you find yourself writing reference counting code that has non-trivial lifespans, please do others a favor and document the intended lifespan. If you can’t document it, it’s a pretty good indication that you don’t know or understand for sure and likely the code will be plagued with leaks and/or segmentation faults during it’s lifetime.