遇到的死锁问题分析

转载请注明出处: http://www.liubida.com/programming_language/dead_lock_analysis

前几天遇到一个问题, 导致连续3天的虚拟机数据没有更新成功, 后来通过监控发现每天的定时任务会报出大量的异常, 异常信息直指数据库的DeadLock
异常和堆栈的部分信息如下:
com.alibaba.armory.exception.ArmoryServiceException: SqlMapClient operation; SQL [];
--- The error occurred in ibatis/VmServer.xml.
--- The error occurred while applying a parameter map.
--- Check the VmServer.updateVMStateByHostId-InlineParameterMap.
--- Check the statement (update failed).
--- Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction; nested exception is com.ibatis.common.jdbc.exception.NestedSQLException:
--- The error occurred in ibatis/VmServer.xml.
--- The error occurred while applying a parameter map.
--- Check the VmServer.updateVMStateByHostId-InlineParameterMap.
--- Check the statement (update failed).
--- Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
at com.alibaba.armory.exchange.dragoon.data.DragoonDataHandler.refreshVMToHost(DragoonDataHandler.java:432)
at ...

执行的定时任务是对宿主机刷新数据. 任务由很多线程执行, 并且在涉及到数据库update的方法时, 使用了spring的@Transactional来标记为事务处理, 我们定义为事务A.
比较关键的几个sql语句,
1. select lock_state from server where id = ? for update
2. update server set lock_state = ? where id = ?
3. select s.*, d.*, s.id as server_id, ast.po_number as po, ast.in_service_date as buy_time, ast.cost_value as original_cost, v.host_id as host_id, v.vm_index as vm_index, v.vm_resource as vm_resource, v.state as vm_state, v.op_state as op_state, cpu.amount as cpu_amount, cpu.core_num as cpu_core, mssize as mem_size from server s left join device_base d on s.device_base_id = d.id left join device_base_asset ast on ast.tag_number = d.assets_num right join vmserver v on v.server_id = s.id left join cpu on s.id = cpu.server_id left join (select server_id,sum(size) mssize from memory_slot group by server_id) mstmp on mstmp.server_id = s.id where v.host_id in ( ? )
4. update vmserver set state = ? where host_id = ?
5. update server set lock_state = ? where id = ?

我们对着这几个简单的语句研究了半天也没发现有什么不妥, 只是语句1的”for update”有点点风险.
mysql中使用for update, 必须针对InnoDB, 并且在一个事务中, 才能起作用. 它会对查询的数据加上行锁,或者表锁. 在我们的场景中, 是加上了行锁.

后来通过查看show innodb status的输出,
------------------------
LATEST DETECTED DEADLOCK
------------------------
111123 13:30:02
*** (1) TRANSACTION:
TRANSACTION 3 4124628169, ACTIVE 0 sec, process no 3818, OS thread id 1184938304 starting index read
mysql tables in use 3, locked 3
LOCK WAIT 28 lock struct(s), heap size 6752, 429 row lock(s)
MySQL thread id 2015123, query id 2099159826 172.22.24.98 armory Sending data
update vmserver set op_state='UPDATE_FINISH' where host_id in ( select id from server where device_base_id in( select id from device_base where service_tag in ( '030JNA10A2000060' ) ) )
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1857 page no 203 n bits 200 index `PRIMARY` of table `armory`.`server` trx id 3 4124628169 lock mode S locks rec but not gap waiting
Record lock, heap no 122 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
0: len 4; hex 800001b9; asc ;; 1: len 6; hex 0003f5d8d4c5; asc ;; 2: len 7; hex 00000340021a01; asc @ ;; 3: len 4; hex 80000534; asc 4;; 4: len 6; hex 78636e2d7879; asc xcn-xy;; 5: SQL NULL; 6: len 4; hex 4bbc094b; asc K K;; 7: len 4; hex 4ecc84da; asc N ;; 8: len 5; hex 312e322e33; asc 1.2.3;; 9: len 13; hex 5550444154455f46494e495348; asc UPDATE_FINISH;; 10: len 1; hex 01; asc ;; 11: len 1; hex 08; asc ;; 12: len 7; hex 312e312e313135; asc 1.1.115;; 13: len 13; hex 5550444154455f46494e495348; asc UPDATE_FINISH;; 14: len 1; hex 88; asc ;; 15: len 10; hex 4e4f5f494e5354414c4c; asc NO_INSTALL;; 16: len 7; hex 6e6f4572726f72; asc noError;; 17: len 7; hex 6e6f4572726f72; asc noError;; 18: SQL NULL;

*** (2) TRANSACTION:
TRANSACTION 3 4124628165, ACTIVE 0 sec, process no 3818, OS thread id 1224608064 starting index read
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1216, 3 row lock(s), undo log entries 1
MySQL thread id 1998379, query id 2099159869 172.22.24.98 armory Updating
update vmserver set state = 'unknown' where host_id = '441'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1857 page no 203 n bits 200 index `PRIMARY` of table `armory`.`server` trx id 3 4124628165 lock_mode X locks rec but not gap
Record lock, heap no 122 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
0: len 4; hex 800001b9; asc ;; 1: len 6; hex 0003f5d8d4c5; asc ;; 2: len 7; hex 00000340021a01; asc @ ;; 3: len 4; hex 80000534; asc 4;; 4: len 6; hex 78636e2d7879; asc xcn-xy;; 5: SQL NULL; 6: len 4; hex 4bbc094b; asc K K;; 7: len 4; hex 4ecc84da; asc N ;; 8: len 5; hex 312e322e33; asc 1.2.3;; 9: len 13; hex 5550444154455f46494e495348; asc UPDATE_FINISH;; 10: len 1; hex 01; asc ;; 11: len 1; hex 08; asc ;; 12: len 7; hex 312e312e313135; asc 1.1.115;; 13: len 13; hex 5550444154455f46494e495348; asc UPDATE_FINISH;; 14: len 1; hex 88; asc ;; 15: len 10; hex 4e4f5f494e5354414c4c; asc NO_INSTALL;; 16: len 7; hex 6e6f4572726f72; asc noError;; 17: len 7; hex 6e6f4572726f72; asc noError;; 18: SQL NULL;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1693 page no 8 n bits 232 index `PRIMARY` of table `armory`.`vmserver` trx id 3 4124628165 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
0: len 4; hex 000002da; asc ;; 1: len 6; hex 0003f5d8ca57; asc W;; 2: len 7; hex 0000034007089b; asc @ ;; 3: len 4; hex 00003150; asc 1P;; 4: len 4; hex 000001b9; asc ;; 5: len 4; hex 00000001; asc ;; 6: len 2; hex 6f6e; asc on;; 7: len 8; hex 5550444154494e47; asc UPDATING;; 8: len 4; hex 4d2d00ef; asc M- ;; 9: len 4; hex 4ecc84d4; asc N ;; 10: len 4; hex 80000002; asc ;;

*** WE ROLL BACK TRANSACTION (2)

明确了问题的关键是这两条语句
a. update vmserver set op_state='UPDATE_FINISH' where host_id in ( select id from server where device_base_id in( select id from device_base where service_tag in ('030JNA10A2000060')))
b. update vmserver set state = 'unknown' where host_id = '441'

原来在这个刷新数据的方法后面, 还有一个更新操作状态(op_state)的sql, 可恶的它居然也是一个事务处理(叫事务B). 从输出的信息已经明确的发现死锁就是发生在这两个事务当中, 那么这两个事务是肿么就死锁了呢?
先来看语句b, 这个语句b就是方法里的语句4, 基本上是只会加行锁的, 除非vmserver中查不到host_id的的值的记录, 这种情况下该语句会由行锁上升到表锁, 但是问题不在这儿.

这里需要说明下, 虽然输出信息里报的是两个transaction的这两个语句死锁, 但是一般来说, 这只能表明这两个语句是这两个事务(A和B)死锁时的最后的语句. 因为事务B只有语句a, 自然就是语句a导致死锁, 而事务A的语句就多了, 见前面的语句1,2,3. 我们思考之后, 把嫌疑放在语句1上.
对事务B的语句a的分析:
) 这个语句写的很搓, 双层嵌套in查询, 会分别对device_base, server和vmserver分别查询之后再update
) vmserver表中host_id没有建索引, 对vmserver是表锁. (为什么vmserver是表锁, 而不是只对host_id的那几行加锁? 因为host_id in (…) 在查询前是无法确定到底是要锁哪几行; 并且server中的device_base_id没有加索引, 所以server是表锁, 这也导致vmserver直接表锁)
) 这个语句写的很搓, 双层嵌套in查询, 这会导致整条语句的锁会是这三个表的锁中最强的一个.
) 加锁的顺序是先对vmserver加表锁(意向锁), 然后对server表加表锁, 然后去做update

对事务B的语句1和语句4分析:
) 本来select是不需要锁的, 但是由于”for update”, 所以需要对server加上行锁
) 然后后面的语句4, 也就是语句b, 需要对vmserver加行锁

问题来了,
事务A的加锁顺序是vmserver(表锁)->server(表锁)
事务B的加锁顺序是server(行锁)->vmserver(行锁)
fk~!

deadlock_analysis

死锁分析


死锁重现:
1. 线程1完成事务A, 进入事务B, 切换到线程2
2. 线程2进入事务A, 对server加上行锁, 切换到线程1
3. 线程1对vmserver加上表锁, 尝试对server加表锁, 因失败而等待, 切换到线程2
4. 线程2尝试对vmserver加行锁, 因失败而等待, 至此, 死锁已成

可以看见, 死锁的原因:
1. 事务A和B中对vmserver和server表加锁的相反顺序
2. 本来多线程之间都是各自锁自己相关的行, 由于事务B的sql写法不规范, 导致线程在事务B时升级为表锁

修正方法:
1. 修改事务B的update语句
update vmserver vms
left join server s on s.id = vms.host_id
left join device_base db on db.id = s.device_base_id
set vms.op_state= ?
where db.service_tag in (?,?,?...)

这样, 事务B的锁就退化为行锁, 那么多个线程就是各自锁各自的行, 应该不会死锁了.

tips:
1. 假设有表A(id, sid, name, comment), id是主键, sid建有索引, 表数据100条, 一个sid可能对应多个id.
那么当使用sid来对表进行update时, update A set comment = ? where sid = ?
sid=0, 有20条记录,
sid=1, 有30条记录,
sid=2, 有40条记录,
sid=3, 有10条记录,
这时候, 就有可能在sid=2或sid=3时就是对整张表加锁
2. “show innodb status”的输出信息里报的是两个transaction的两个语句死锁, 但是这只能表明这两个语句是这两个事务(A和B)死锁时的最后的语句, 并不能说明死锁一定是由这两个语句造成.
3. update的逻辑代码要加事务, sql语句不能用in
4. 由于对mysql的不熟悉, 导致我们刚开始一直没有找到问题的根源在哪里..工具很重要

后话:
有人问我为什么要在事务A中的select后面加for update, 其实这里是为了防止, 在对某台宿主机刷新数据时有用户会对用户执行操作而导致的状态异常, 和多线程之间的刷新没有关系.