We have a MySQL master-master replicated system running. Tonight a disk behind one of the masters went offline. When the disk was brought back and MySQL was started, the slave IO process wasn't running: one of the local relay logs, the oldest one, was missing. It was proposed that we restore from a snapshot of the disk volume taken before the event. Now the problem is that you are working with two MySQL servers sharing the same data, and it occurred to me that Heisenberg's Uncertainty Principle was at work here.
The concept of a master in MySQL replication is a host that accepts both reads and writes; anything written to the master is propagated to its slaves, which are treated as read only. It is possible to write directly to a slave, but those changes are not sent back to the master. For that you'd need a master-master setup, so let's discuss how MySQL master-master replication works.
First, let's begin with a master-slave scenario. Every modification query run on the master is recorded in the master's binary log. Each slave subscribes to that log, copies new events into its own local relay log, and replays them linearly, recording its current position in a file as it goes. The slaves monitor the master for updates and keep rerunning the queries that arrive in the relay log; the master does not chase its slaves, each slave connects and asks for events starting from its recorded position.
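As a sketch of the moving parts in plain SQL (the statements are standard MySQL; the host, user, and coordinates shown are made up for illustration):

```sql
-- On the master: the binary log file and position slaves will read from.
SHOW MASTER STATUS;

-- On the slave: point at the master and start from a known position.
CHANGE MASTER TO
  MASTER_HOST='master-a.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000042',
  MASTER_LOG_POS=372257059;
START SLAVE;

-- Relay_Log_File / Relay_Log_Pos in this output are the slave's local
-- bookkeeping, the same coordinates that bit us below.
SHOW SLAVE STATUS\G
```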
Now, master-master replication simply makes each server both a master and a slave. Both systems record their modification queries in their binary logs, and each monitors the other for updates. Every server has a unique server id, and when a server sees an event that originated from its own id it silently ignores it and moves on. This ensures that you do not have an endless loop of modifications bouncing between the two.
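A minimal my.cnf sketch of the two halves (the ids and log names are assumptions; the auto_increment settings are the usual guard against both masters handing out the same primary key, not something from the incident above):

```ini
# Master A's my.cnf (values are illustrative)
[mysqld]
server-id                = 1          # must be unique across the topology
log-bin                  = mysql-bin
auto_increment_increment = 2          # step by 2 so A and B never collide
auto_increment_offset    = 1          # A hands out odd ids ...

# Master B's my.cnf
[mysqld]
server-id                = 2
log-bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2          # ... and B hands out even ids
```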
You can continue adding masters to your master-master setup, keeping in mind that MySQL replication is linear and therefore each server can have only one master; a master, of course, can have as many slaves as you want. So let's say you have hosts A, B and C, and you want each to be a master. My analogy for multi-master replication is a möbius river. All data on A is replicated to its slave B, all data on B to C, and all data on C back to A: it's A -> B -> C -> A. Because each server has its own unique id, it ignores any updates it originally made itself, and thereby data travels the circle exactly one revolution. But this setup is beyond the scope of my post.
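One detail worth flagging if you ever build that ring, though it is outside the story above: by default a slave does not write replicated events into its own binary log, so B would never forward A's changes on to C. Each hop in the circle needs, in my.cnf:

```ini
[mysqld]
log-slave-updates   # re-log replicated events so the next hop sees them
```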
Understanding the principle of the setup should give you an understanding of the problem. We could restore the affected master from a snapshot, but its data would then be out of date relative to the other master. The unaffected machine would be asking its master for updates from a position that, on the restored copy, hasn't happened yet, while the affected system would be trying to catch up to the position of its master. Beyond that, we would be assuming that the affected host took no writes of its own in the gap between the snapshot and the crash.
Further, restoring backups can be difficult in its own right. A backup that is not atomic, and that does not record the system's position in its master's log, leaves you knowing what your data is but not where your data is. It is of course possible to perform an atomic backup. One way is to read-lock all your databases and their tables and record the then-stable position in the master's log. The downside is that the host being backed up rejects updates for as long as the dump runs, which can cause problems if it is a slow process. Another way is a file-system-level snapshot of your data, but you must make sure that your position is recorded atomically with the data on the file system.
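A sketch of the read-lock approach in plain SQL; run it all in one session, since the lock is released when the session ends:

```sql
-- Quiesce writes and flush tables to disk.
FLUSH TABLES WITH READ LOCK;

-- Record the coordinates that belong to this exact state of the data.
SHOW MASTER STATUS;   -- note File and Position from the output

-- ... take your dump or file-system snapshot here, from another terminal ...

UNLOCK TABLES;
```

If you are dumping with mysqldump, its --master-data option embeds the equivalent CHANGE MASTER TO coordinates in the dump itself, which removes the manual bookkeeping step.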
The problem I actually encountered was this.
100921 20:48:00 [ERROR] Failed to open the relay log '/var/log/mysql/binary/mysql_relay_log.008169' (relay_log_pos 372257059)
100921 20:48:00 [ERROR] Could not find target log during relay log initialization
100921 20:48:00 [ERROR] Failed to initialize the master info structure
The solution was simply to move on. The file in question was either deleted at the time of the crash or was corrupted and disappeared; I like to think it was deleted and not corrupted... The fix was to alter the relay-log.info file. Always make a backup of files you modify! The first lines of that file hold the slave's current relay log file and position; the lines after that hold the coordinates of its master. I simply stopped MySQL, pointed the file at the next relay log, mysql_relay_log.008170, and set its position to 0. Once MySQL was started again it just worked. Because we were working with the data as it stood at the time of the crash, we could assume that no data was modified afterward. That might not be the case for you; you may have used a backup to get yourself here, or what have you.
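As a concrete sketch of that edit, assuming the classic relay-log.info layout (line 1 is the current relay log file, line 2 the position in it, and the lines after that describe the master); the path is an assumption, so check your own datadir:

```shell
# Hypothetical repair of relay-log.info after losing the oldest relay log.
INFO=/var/lib/mysql/relay-log.info

cp "$INFO" "$INFO.bak"   # always back up files you hand-edit

# Point line 1 at the next surviving relay log and reset line 2 to 0,
# which is what worked in this incident.
sed -i \
  -e '1s|.*|/var/log/mysql/binary/mysql_relay_log.008170|' \
  -e '2s|.*|0|' "$INFO"
```

Stop mysqld before editing and start it afterward; the server re-reads the file on startup.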
Going even further, you could do some trickery to avoid the problem mentioned earlier, where an affected master might have made its own updates during that gap. I have never tried this, but you could possibly restore your atomic backup and change the server id to some unique value other than that of the system that was backed up. When it resumed reading from where it left off, it would also re-read its own former entries in its master's log and rerun queries it had already performed; it would no longer ignore them, because they were recorded under a different id than its new one.
I could probably continue to blab about hypothetical ways to give you more continues, but: (a) I've given you the logic to perform your own voodoo, and (b) my Android Froyo source just finished downloading and I want to cross-compile it tonight.