
java native interface – What exactly causes a 'spin on suspend' error in Android?

Posted by: admin June 15, 2020

Questions:

I’m currently having trouble debugging some Android code which relies on a native library. One native call in particular seems prone to this “spin on suspend” error. It generally manifests like so:

threadid=2: spin on suspend #2 threadid=48 (pcf=3)

Thus far I haven’t been able to determine exactly what’s failing here, except that after about 10 of these messages, my application encounters a SIGSTKFLT and exits. Every time, the first thread is the GC, and the second thread is whatever thread is currently executing the native code. The portion of the stack printed along with this message always has a native method at the top of the stack.

What exactly is happening when Dalvik complains about this, and how can I begin to debug the cause so I can fix it?

EDIT: An interesting wrinkle — after the native developer made some changes, I now also see the following error sometimes:

PopFrame missed the break
VM aborting
Fatal signal 11 (SIGSEGV) at 0xdeadd00d (code=1)

It’s also extremely odd to me that the thread dump shows my native method at the top of the stack, yet the thread state is RUNNABLE, not NATIVE — how can that be possible?

Answers:

The basic problem is that Dalvik is a safe-point suspend VM, and uses “stop the world” garbage collection. This means that, for the GC to operate, it has to wait for all threads to reach a point where it can be sure that they won’t be altering the heap.

For some reason, one of your threads isn’t responding to the GC thread’s request to suspend. It’s not actually executing in native code; if it were, the thread would be in NATIVE state, which is considered safe. (All access to the managed heap is gated through JNI calls, and all JNI calls do a suspend check.)

For performance reasons, the JIT is capable of chaining blocks of compiled code together in a way that skips the suspend checks. If a thread takes too long to suspend, the suspending thread will “unchain” the blocks and wait a little longer. If the target still hasn’t suspended, it starts logging the “spin on suspend” complaint, and after enough failed retries it gives up and aborts the VM.
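
To make that escalation concrete, here is a minimal C model of the suspending thread's wait loop. It is a sketch under stated assumptions, not Dalvik source: the struct, the function names, and the complaint limit passed in by the caller are all hypothetical, and real sleeping is replaced by a simple timeout counter.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome codes, used only by this model. */
enum wait_result { SUSPENDED_OK, GAVE_UP };

/* Hypothetical stand-in for the thread being suspended. */
struct target_model {
    int  timeouts_until_safe;  /* negative: never reaches a safe point */
    bool chained;              /* true while JIT blocks are chained together */
    int  complaints;           /* "spin on suspend" warnings logged so far */
};

/* True once the modelled target has hit a suspend check. */
static bool reached_safe_point(const struct target_model *t, int timeouts)
{
    return t->timeouts_until_safe >= 0 && timeouts >= t->timeouts_until_safe;
}

/* Models the suspending thread: wait, unchain, complain, then give up. */
enum wait_result wait_for_suspend(struct target_model *t, int max_complaints)
{
    int timeouts = 0;

    while (!reached_safe_point(t, timeouts)) {
        timeouts++;                   /* one wait interval expired */
        if (timeouts == 1) {
            t->chained = false;       /* unchain so suspend checks run again */
        } else {
            t->complaints++;          /* still stuck: log a warning */
            fprintf(stderr, "spin on suspend #%d\n", t->complaints);
        }
        if (t->complaints >= max_complaints)
            return GAVE_UP;           /* the real VM would abort here */
    }
    return SUSPENDED_OK;
}
```

A target that never reaches a safe point (timeouts_until_safe set to a negative value) runs through every stage and ends in GAVE_UP, which corresponds to the run of warnings followed by the abort described in the question.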

Some devices use a vendor-modified version of Dalvik that gets this wrong, and aborts can happen on tight loops. I wouldn’t expect to see a native method at the top of the stack in this case.

Your best bet for debugging is to attach gdb at the point where things go wrong and try to figure out what the target thread is doing. It’s possible that the native code trashed the VM state or return stack in some way, so that on its return from native code the thread gets jammed up.

Update after EDIT: The dvmPopFrame() function is used to pop a stack frame off the managed stack. When the VM calls your native method it inserts a “break” frame, so that when the stack is unrolled for exception handling the VM doesn’t blow past the call site. (It’s also used for managed-code method calls issued by the VM, e.g. for reflection or <clinit>.) The message “PopFrame missed the break” means that the break frame wasn’t found.

Break frames have a null method pointer. When unrolling the stack, dvmPopFrame() continues as long as it sees a non-null method pointer (meaning it’s not a break frame) and a non-null previous-frame pointer (meaning you haven’t reached the oldest frame, where the stack starts). If you reach the start of the stack without seeing a null method pointer, you’ve missed the break: all Dalvik stacks start with a real method (sometimes a “fake” method if the thread was attached to the VM with JNI).
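
The walk described above can be modelled in a few lines of C. This is a simplified illustration, not the code from the Dalvik tree: struct frame and find_break() are hypothetical names, and a real frame carries far more than these two fields.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for a Dalvik stack frame. */
struct frame {
    const char   *method;  /* NULL marks a break frame */
    struct frame *prev;    /* NULL marks the oldest frame on the stack */
};

/*
 * Walks from the current frame toward older frames, continuing while the
 * method pointer and the previous-frame pointer are both non-null.
 * Returns the break frame, or NULL if the walk ended on a real method,
 * which is the "missed the break" condition.
 */
const struct frame *find_break(const struct frame *cur)
{
    while (cur->method != NULL && cur->prev != NULL)
        cur = cur->prev;

    return (cur->method == NULL) ? cur : NULL;
}
```

With an intact chain (native method, then break frame, then the caller) the function returns the break frame. If the native method's prev field is nulled, or redirected past the break frame, the walk stops on a frame with a non-null method and returns NULL.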

So my guess would be that the native code trashed the stack, nulling the previous-frame pointer. One technique for sorting this out would be to have the VM call a native method that calls the actual native method; the “middle man” allocates some stuff on the stack, sets it to known values, calls the actual method, then verifies that its stack allocations are unchanged before returning.

(It may be necessary to use the values to prevent the compiler from optimizing them away; if you use something like:

if (jniEnv == NULL) {
    printf("my stuff is ...", ...);
}

then it’ll never actually run, since the JNIEnv* is never null… but the compiler doesn’t know that.)
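
A minimal sketch of that “middle man” in plain C follows. Everything named here (guarded_call, count_damaged, example_native_body, the canary size and pattern) is hypothetical. In a real JNI library the wrapper would be the function bound to the Java native method and would take the usual JNIEnv* and object arguments; they are left out so the sketch compiles without jni.h. Declaring the canary volatile is one way to keep the compiler from optimizing the stores and checks away. Where the compiler places locals relative to the callee's frame is not guaranteed, so treat an intact canary as a hint rather than proof.

```c
#include <stddef.h>
#include <stdint.h>

#define CANARY_WORDS 16
#define CANARY_VALUE 0xA5A5A5A5u

/* Counts the words that no longer hold the known pattern. */
size_t count_damaged(const volatile uint32_t *words, size_t n)
{
    size_t damaged = 0;

    for (size_t i = 0; i < n; i++) {
        if (words[i] != CANARY_VALUE)
            damaged++;
    }
    return damaged;
}

/*
 * Hypothetical middle man: fills a stack buffer with known values, calls
 * the real method, then reports how many canary words were changed.
 */
size_t guarded_call(void (*real_method)(void *), void *arg)
{
    volatile uint32_t canary[CANARY_WORDS];

    for (size_t i = 0; i < CANARY_WORDS; i++)
        canary[i] = CANARY_VALUE;

    real_method(arg);   /* the native code under suspicion */

    return count_damaged(canary, CANARY_WORDS);
}

/* Hypothetical well-behaved method body, used only to show the call. */
void example_native_body(void *arg)
{
    *(int *)arg += 1;
}
```

Calling guarded_call(example_native_body, &x) returns 0 because nothing touches the canary. A non-zero result from the real method would suggest it wrote past its own stack frame.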

For a full description of the Dalvik stack layout, see dalvik/vm/interp/Stack.h.

It’s normal for the thread to be in RUNNABLE when returning from native code. Your native method is still at the top because the code that pops it off failed and aborted the VM.