8/30/2017 8:25:08 PM
from master to slave, it is expected to hang at every DB operation and keep retrying to find a correct connection. The MySQL cluster uses a master-slave architecture in which the slave is read-only. While testing the new DB API against our staging AZ cluster and test DB, I ran into the following issues:
* After MySQL fails over (from master to slave), AZ keeps holding the previous MySQL DataSource (the previous master), which has been switched to read-only. In other words, the DataSource remains cached and is not refreshed to point at the new master.
* The first step of a MySQL failover is to force the master into read-only mode. From that point on, AZ cannot write to the DB, and the logs of running jobs went missing. During this stage, AZ keeps throwing SQLException:
> java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement Query: INSERT INTO execution_flows (project_id, flow_id, version, status, submit_time, submit_user, update_time) values (?,?,?,?,?,?,?) Parameters:...
The proposed solution in this PR is:
* Check whether the current MySQL instance is read-only. If it is, keep retrying.
* If AZ cannot find a suitable SQL connection, create a new concrete DataSource rather than reusing the existing one. To do this, we override the createDataSource method.

With this patch applied, we tested extensively on the testing cluster. During the DB failover, the only exception I saw was:
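For reviewers who want the retry idea in isolation, here is a minimal sketch of "probe for read-only, otherwise retry with a fresh connection". It is not the code in this PR: the class `WritableConnectionRetry`, the `Opener` and `ReadOnlyProbe` interfaces, and the `getWritable` helper are hypothetical names made up for illustration. The JDBC probe assumes the demoted master can be detected through the MySQL `read_only` system variable.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, for illustration only (not the class added by this PR).
final class WritableConnectionRetry {

  /** Opens a connection, e.g. from a freshly created DataSource. */
  interface Opener<C> {
    C open() throws SQLException;
  }

  /** Reports whether the server behind a connection rejects writes. */
  interface ReadOnlyProbe<C> {
    boolean isReadOnly(C conn) throws SQLException;
  }

  /** JDBC probe; assumes the MySQL read_only system variable marks the old master. */
  static boolean mysqlIsReadOnly(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT @@global.read_only")) {
      if (rs.next()) {
        return rs.getInt(1) != 0;
      }
      throw new SQLException("Could not read @@global.read_only");
    }
  }

  /**
   * Tries up to maxAttempts times to obtain a write-enabled connection,
   * sleeping waitMillis between attempts and closing rejected connections.
   */
  static <C extends AutoCloseable> C getWritable(
      Opener<C> opener, ReadOnlyProbe<C> probe, int maxAttempts, long waitMillis)
      throws SQLException {
    SQLException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      C conn = null;
      try {
        conn = opener.open();
        if (conn != null && !probe.isReadOnly(conn)) {
          return conn; // write-enabled: hand it to the caller
        }
      } catch (SQLException e) {
        last = e; // remember the failure and retry
      }
      closeQuietly(conn); // do not leak read-only or broken connections
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(waitMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new SQLException("Interrupted while waiting to retry", ie);
        }
      }
    }
    throw new SQLException(
        "Failed to find write-enabled DB connection after " + maxAttempts + " attempts", last);
  }

  private static void closeQuietly(AutoCloseable c) {
    if (c == null) {
      return;
    }
    try {
      c.close();
    } catch (Exception ignored) {
      // best effort: the connection is being discarded anyway
    }
  }
}
```

With a JDBC `DataSource` named `ds` (hypothetical), a call could look like `WritableConnectionRetry.getWritable(ds::getConnection, WritableConnectionRetry::mysqlIsReadOnly, 3, 60_000L)`. Keeping the opener as a parameter lets the caller decide whether each attempt reuses a pool or builds a new DataSource, which is the part this PR handles by overriding createDataSource.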
> 2017/08/29 23:58:11.051 +0000 ERROR [MySQLDataSource] [Azkaban] Failed to find write-enabled DB connection. Wait 1 minutes and retry. No.Attempt = 1
> java.sql.SQLException: Failed to find DB connection Or connection is read only.
>     at azkaban.db.MySQLDataSource.getConnection(MySQLDataSource.java:75)
>     at org.apache.commons.dbutils.AbstractQueryRunner.prepareConnection(AbstractQueryRunner.java:175)
....
During the retries, the AZ UI hangs when I try to access the history page. No other exceptions occur.
After 3 retry attempts, AZ is back to normal, and I do not observe any failures.