I have HBase running on a CDH 5.7.0 cluster (1 master and 4 region servers). After several months of running without any problems, the HBase service stopped and now it's impossible to start the HBase master.
When I try to start it, the machine hangs at some point, and the last thing I can see in the master log is:
2016-10-24 12:17:15,150 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log
2016-10-24 12:17:15,152 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log after 2ms
2016-10-24 12:17:15,177 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log
2016-10-24 12:17:15,179 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log after 2ms
2016-10-24 12:17:15,394 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log
2016-10-24 12:17:15,397 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log after 3ms
2016-10-24 12:17:15,405 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log
2016-10-24 12:17:15,409 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log after 3ms
2016-10-24 12:17:15,414 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: No buffer space available
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:465)
at sun.nio.ch.Net.connect(Net.java:457)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,427 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xxx.xxx.xxx.xxx:50010 for block, add to deadNodes and continue. java.net.SocketException: No buffer space available
java.net.SocketException: No buffer space available
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:465)
at sun.nio.ch.Net.connect(Net.java:457)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,436 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /xxx.xxx.xxx.xxx:50010 for BP-813663273-xxx.xxx.xxx.xxx-1460963038761:blk_1079056868_5316127
2016-10-24 12:17:15,442 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log
2016-10-24 12:17:15,444 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log after 2ms
2016-10-24 12:17:15,669 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log
2016-10-24 12:17:15,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log after 2ms
I'm afraid something in the WALProcedureStore is corrupted, but I don't know how to keep digging to find the problem. Judging by the file names in the log (already past state-...5533), the master seems to be replaying thousands of these procedure WAL files, and the "No buffer space available" errors show up while it does. Any clues? Can I start the master fresh, without it trying to load the previous (corrupted) state?
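For reference, this is how I've been counting how many state-*.log files are sitting in that directory (the path is taken from the log lines above, i.e. the default /hbase root dir):

# count the procedure WAL files the master tries to replay on startup
hdfs dfs -ls /hbase/MasterProcWALs | grep -c 'state-'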
EDIT:
I just saw this bug, which I think is the same issue that's happening to me. Can I safely remove everything in /hbase/MasterProcWALs without deleting old data stored in HBase?
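If the answer is yes, this is the sketch I have in mind: move the directory aside instead of deleting it outright, so it can be restored if this turns out to be wrong (again assuming the default /hbase root dir from the log, and with the master stopped first):

# move the procedure WALs aside rather than deleting them
hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.bak
# the master should recreate the directory on startup; if everything comes
# back clean, drop the backup later: hdfs dfs -rm -r /hbase/MasterProcWALs.bak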
Thanks