Dubbo反序列化抛出的空指针异常问题

Dubbo线上踩坑问题排查记录

起因

线上出现了一段空指针异常,根据异常栈和经验,第一时间的反应是消费者段在发序列化的时候报错了,觉得应该是返回值在反序列化时出问题。但是前前后后检查了好几次返回值,都没有发现有什么有问题的地方。下面是异常堆栈:

java.lang.NullPointerException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.alibaba.com.caucho.hessian.io.JavaDeserializer.instantiate(JavaDeserializer.java:271)
at com.alibaba.com.caucho.hessian.io.JavaDeserializer.readObject(JavaDeserializer.java:155)
at com.alibaba.com.caucho.hessian.io.SerializerFactory.readObject(SerializerFactory.java:397)
at com.alibaba.com.caucho.hessian.io.Hessian2Input.readObjectInstance(Hessian2Input.java:2070)
at com.alibaba.com.caucho.hessian.io.Hessian2Input.readObject(Hessian2Input.java:2005)
at com.alibaba.com.caucho.hessian.io.Hessian2Input.readObject(Hessian2Input.java:1990)
at com.alibaba.dubbo.common.serialize.support.hessian.Hessian2ObjectInput.readObject(Hessian2ObjectInput.java:88)
at com.alibaba.dubbo.rpc.protocol.dubbo.DecodeableRpcResult.decode(DecodeableRpcResult.java:92)
at com.alibaba.dubbo.rpc.protocol.dubbo.DecodeableRpcResult.decode(DecodeableRpcResult.java:109)
at com.alibaba.dubbo.rpc.protocol.dubbo.DubboCodec.decodeBody(DubboCodec.java:97)
at com.alibaba.dubbo.remoting.exchange.codec.ExchangeCodec.decode(ExchangeCodec.java:129)
at com.alibaba.dubbo.remoting.exchange.codec.ExchangeCodec.decode(ExchangeCodec.java:90)
at com.alibaba.dubbo.rpc.protocol.dubbo.DubboCountCodec.decode(DubboCountCodec.java:46)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.messageReceived(NettyCodecAdapter.java:134)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:49)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

于是,在dubbo的github项目中,我发现了一个类似的issue,https://github.com/apache/dubbo/issues/619,与我遇到的问题完全一致。

原因

详细地异常追踪和源码分析请参见:https://www.yuque.com/zhaixiaoxiang/fmga67/sr041v

下面简要理下为什么会有这样的异常输出,输出的堆栈为什么是这样的。

provider抛出没有堆栈的异常

频繁地打印堆栈信息,会造成大量的IO资源浪费。出于性能优化的角度考虑,如果抛出了大量的内置的”cold”的异常,jvm就会重新编译,不再打印异常堆栈,只会抛出一个没有堆栈的预分配的异常。 https://www.oracle.com/technetwork/java/javase/relnotes-139183.html

The compiler in the server VM now provides correct stack backtraces for all "cold" built-in exceptions. For performance purposes, when such an exception is thrown a few times, the method may be recompiled. After recompilation, the compiler may choose a faster tactic using preallocated exceptions that do not provide a stack trace. To disable completely the use of preallocated exceptions, use this new flag: -XX:-OmitStackTraceInFastThrow.

所以provider返回给consumer的并不是一个正常的result输出,而是一个没有堆栈的异常。

consumer输出堆栈信息时,输出了默认的当前堆栈

consumer在InvokerInvocationHandler动态代理返回结果给业务时,拿到了服务端的异常,throw了出去。(从服务端拿到的异常,所以是没有堆栈的),但是在我们打印日志输出异常堆栈的时候,会调用getOurStackTrace()方法获取堆栈信息以供输出。

这个方法可以会在栈为空的情况下(UNASSIGNED_STACK= new StackTraceElement[0]),为其赋上当前栈的相关信息。所以,provider虽然返回的是一个空的异常,consumer拿到的也是一个空的异常,但是输出的时候,带上了consumer当前栈的相关信息。

private synchronized StackTraceElement[] getOurStackTrace() {
    // Initialize stack trace field with information from
    // backtrace if this is the first call to this method
    if (stackTrace == UNASSIGNED_STACK ||
        (stackTrace == null && backtrace != null) /* Out of protocol state */) {
        int depth = getStackTraceDepth();
        stackTrace = new StackTraceElement[depth];
        for (int i=0; i < depth; i++)
            stackTrace[i] = getStackTraceElement(i);
    } else if (stackTrace == null) {
        return UNASSIGNED_STACK;
    }
    return stackTrace;
}

下面是关于异常的一段测试代码,

public static void main(String[] args) throws NoSuchFieldException, IllegalAccessException {
    Throwable throwable = new Throwable();
    throwable.printStackTrace();
    System.err.println("========================================");

    Field stackTrace = throwable.getClass().getDeclaredField("stackTrace");
    stackTrace.setAccessible(true);
    stackTrace.set(throwable, null);
    try {
        throw throwable;
    } catch (Throwable ex) {
        ex.printStackTrace();
    }
}

debug截图

java.lang.Throwable
	at com.xzy.test.TestMain2.getThrowable(TestMain2.java:31)
	at com.xzy.test.TestMain2.main(TestMain2.java:13)
========================================
java.lang.Throwable
	at com.xzy.test.TestMain2.getThrowable(TestMain2.java:31)
	at com.xzy.test.TestMain2.main(TestMain2.java:13)

debug可以看到catch到的ex,它的stackTrace是null的,但是输出依然是有内容。因为在调用getOurStackTrace()方法时又为异常赋上了当前栈的信息。

资料总结