Replication - Slave Monitor

 

Replication - Slave Monitor 

Each daisy chain Replication server has a monitor specified in the Slave Monitor field of the Edit Replication Server window. In the event of a failure, users will be told to connect to their database at the other location. It is up to the administrators of the replication service to ensure that training is in place to make sure users know what to do when they see this message.

Trigger Rules: EZ2000 Plus Dental is designed to detect replication failure. The reaction will be triggered according to the following rules:
1. EZ2000 Plus Dental must be running on the designated monitor computer for that slave.
2. Triggered if replication fails or is stopped for any reason.  So do not stop replication unless EZ2000 Plus Dental is shut down on the monitor.
3. Loss of internet connection will not trigger the reaction, and MySQL will gracefully continue replication where it left off when reconnected.
4. Stopping the MySQL service will cause other error messages in EZ2000 Plus Dental, but will not trigger the reaction.

Technical Details:
1. The monitor polls the slave server every 10 seconds with the following query: SHOW SLAVE STATUS;
2. If the Slave_SQL_Running column does not equal "Yes", then the reaction is triggered as follows:
3. For example, if this is slave with server_id of 3, then:
UPDATE preference SET ValueString = '3' WHERE PrefName = 'ReplicationFailureAtServer_id';
4. All users are kicked out of EZ2000 Plus Dental on all servers.
5. When users try to use EZ2000 Plus Dental on the failed slave, they are told, "This database is temporarily unavailable. Please connect instead to your alternate database at the other location."

Get Us Involved:  Look in the replication log and find the query that broke replication.  If it was a custom query that you ran, then we do not need to know about it.  But if it is a query coming from EZ2000 Plus Dental, then we need to know about it so that we can immediately repair it in the code. We have seen the following issues break replication:
1. "CREATE TABLE mytable".  This is the usual culprit.  At a minimum, it must always be prefaced by "DROP TABLE IF EXISTS mytable"  This applies equally to temp tables that you may create.  But even then, it's still dangerous because someone at another node might reference the same table at around the same time.
2. Error: "Duplicate entry '0001-01-01 00:00:00' for key IndexAckTime".  This was due to a corrupted index file.  Dropping and recreating the index solved the problem.

Repair:  Once the reaction has been triggered, it must be repaired before any users can access that slave.  The repair should be run after business hours if possible. For the offices that are down, you can have them directly connect to the next working replication server in the chain if you have VPN setup. This will allow the office to continue working, but with a slower than normal connection.
Make sure to complete each numbered step before continuing on to the next step, or else replication will fail. For the steps that say "on all replication servers..." you must perform the step on all replication servers (master and slaves) before continuing on to the next step.

1. On all replication servers: Block the MySQL service network port within the windows firewall for both incoming and outgoing traffic. The typical port number is 3306. This step will prevent any client machines from connecting to any database while the repair is in progress. The port number is different when using eCW.
2. Pick a database from one replication server that you think is the most up-to-date database. Call the server where this database is located the "master" and all other replication servers "slave".
3. On the master server, in the Main Menu, click Setup, Replication. Click the Clear button at the lower right.  If this button is not visible, the replication failure was not detected in the database, so you can skip this step and proceed to step 4. 

4. On all replication servers:
- Run the following SQL statement: STOP SLAVE; (This stops the replication thread within MySQL.)
- Stop the MySQL service.
- Wipe out all loose files in the mysql data directory which do not reside in a subfolder, except for any files beginning with "ib" (e.g. iblog_file0, ib_logfile1, ibdata1).
5. On all slave servers (not the master): Move or rename the replicated database (typically named 'opendental') from the data directory.
You must fully complete the above steps on all replication servers before proceeding to step 6, or replication will fail.
6. Copy the database folder from the database directory on the master to the data directory on each slave.
7. On all replication servers:
- Start MySQL.
- Run the following SQL statement (make sure to change SERVER_NAME to the master server computer name): CHANGE MASTER TO MASTER_HOST = ‘SERVER_NAME’, MASTER_USER = 'repl',MASTER_PASSWORD = 'od1234';
- Run the following SQL statement (make sure the change the database name from opendental to your actual database name): GRANT CREATE TEMPORARY TABLES ON opendental.* TO 'repl' IDENTIFIED BY 'od1234';
- Run the following SQL statements: GRANT SUPER ON *.* TO 'repl' IDENTIFIED BY 'od1234'; GRANT REPLICATION SLAVE ON *.* TO 'repl' IDENTIFIED BY 'od1234';
- Run the following SQL statement: START SLAVE; (This starts the replication thread within MySQL.)
8. On all replication servers: Verify that replication is running successfully by executing the following SQL statement and checking for the Slave_IO_Running and Slave_SQL_Running columns to both say "Yes": SHOW SLAVE STATUS;
9. On all replication servers: Unblock the MySQL service network port. Make sure to unblock incoming as well as outgoing.

Alternate Repair:  If the database is very large, then the above repair scenario may not be feasible.  If you have a deeper understanding of replication and expertise in restarting failed replication, then you might consider restarting replication and running all the queries below the failure point in the log.

MySQL also provides an enterprise monitor software that will graphically show your replication loop and will let you monitor the status of each server and how many seconds each one is behind its master.  While this is a very useful tool, it is also expensive.

 

EZ2000 Plus Dental Software 800-273-5033