Man page for mk-table-sync
August 24, 2007 – 5:33 pmMK-TABLE-SYNC
Section: User Contributed Perl Documentation (1)
Updated: 2008-06-01
Index
Return to Main Contents
NAME
mk-table-sync - Synchronize MySQL tables efficiently.
SYNOPSIS
It is a good idea to back up your data and run "mk-table-sync" with “–test”
first, to see what will happen. If you want to see which rows are different
without changing the tables, use “–print” instead of “–execute”.
To sync db.tbl1 from host1 to host2:
mk-table-sync –execute u=user,p=pass,h=host1,D=db,t=tbl host2
Sync all tables in host1 to host2 and host3:
mk-table-sync –execute host1 host2 host3
Resolve differences mk-table-checksum found on this master’s slaves:
mk-table-sync –execute –replicate test.checksum master1
Sync this slave to its replication master:
mk-table-sync –execute –synctomaster slave1
Sync this slave to its replication master, resolving differences
mk-table-checksum found:
mk-table-sync –execute –synctomaster –replicate test.checksum slave1
Sync server2 in a master-master replication configuration, where server2’s copy
of db1.tbl2 is known or suspected to be incorrect:
mk-table-sync –execute –synctomaster h=server2,D=db1,t=tbl1
Note that in the master-master configuration, the following will NOT do what you
want, because it will make changes directly on server2, which will then flow
through replication and change server1’s data:
# Don’t do this in a master-master setup!
mk-table-sync –execute h=server1,D=db1,t=tbl1 h=server2
DESCRIPTION
WARNING this tool is unfinished and could perform slowly. The Chunk
algorithm is great when it can be used, and so is Nibble, but otherwise Stream
is the only choice and it does not perform very well. Please run with
“–test” before subjecting your servers to this tool, and make backups of
your data!
This tool is designed to do one-way synchronization of data (two-way sync is
planned for the future). It finds differences efficiently with one of several
algorithms (see “ALGORITHMS”). It makes changes on the destination table(s)
so it matches the source.
It does not synchronize table structures, indexes, or any other schema
changes. It synchronizes only data.
It can operate through replication by comparing a slave with its master and
making changes on the master. These changes will flow through replication and
correct any differences found on the slave.
It accepts a list of DSNs (see the “–help” output) to tell it where and how
to connect.
There are many ways to invoke it. The following is the abbreviated logic:
if DSN has a t part, sync only that table:
if 1 DSN:
if –synctomaster:
The DSN is a slave. Connect to its master and sync.
if more than 1 DSN:
The first DSN is the source. Sync each DSN in turn.
else if –replicate:
if –synctomaster:
The DSN is a slave. Connect to its master, find records
of differences, and fix.
else:
The DSN is the master. Find slaves and connect to each,
find records of differences, and fix.
else:
if only 1 DSN and –synctomaster:
The DSN is a slave. Connect to its master, find tables and
filter with –databases etc, and sync each table to the master.
else:
find tables, filtering with –databases etc, and sync each
DSN to the first.
ALGORITHMS
This tool has a generic data-syncing framework, within which it is possible to
use any number of different algorithms to actually find differences. It chooses
the best algorithm automatically. While I plan to add more algorithms in the
future, the following are implemented now:
- Chunk
Finds an index whose first column is numeric (including date and time types),
and divides the column’s range of values into chunks of approximately
“–chunksize” rows. Syncs a chunk at a time by checksumming the entire
chunk. If the chunk differs on the source and destination, checksums each
chunk’s rows individually to find the rows that differ.It is efficient when the column has sufficient cardinality to make the chunks
end up about the right size.The initial per-chunk checksum is quite small and results in minimal network
traffic and memory consumption. If a chunk’s rows must be examined, only the
primary key columns and a checksum are sent over the network, not the entire
row. If a row is found to be different, the entire row will be fetched, but not
before.
- Nibble
Finds an index and ascends the index in fixed-size nibbles of “–chunksize”
rows, using a non-backtracking algorithm (see mk-archiver for more on this
algorithm). It is very similar to “Chunk”, but instead of pre-calculating
the boundaries of each piece of the table based on index cardinality, it uses
"LIMIT" to define each nibble’s upper limit, and the previous nibble’s upper
limit to define the lower limit.It works in steps: one query finds the row that will define the next nibble’s
upper boundary, and the next query checksums the entire nibble. If the nibble
differs between the source and destination, it examines the nibble row-by-row,
just as “Chunk” does.
- Stream
Selects the entire table in one big stream and compares all columns. Selects
all columns. Much less efficient than the other algorithms, but works when
there is no suitable index for them to use.
- Future Plans
Possibilities for future algorithms are TempTable (what I originally called
bottom-up in earlier versions of this tool), DrillDown (what I originallly
called top-down), and GroupByPrefix (similar to how SqlYOG Job Agent works).
Each algorithm has strengths and weaknesses. If you’d like to implement your
favorite technique for finding differences between two sources of data on
possibly different servers, I’m willing to help. The algorithms adhere to a
simple interface that makes it pretty easy to write your own.
DOWNLOADING
You can download Maatkit from the Sourceforge website at
<http://sourceforge.net/projects/maatkit>, or you can get any of the tools
easily with a command like the following:
wget http://www.maatkit.org/get/toolname
or
wget http://www.maatkit.org/trunk/toolname
Where "toolname" can be replaced with the name (or fragment of a name) of any
of the Maatkit tools. Once downloaded, they’re ready to run; no installation is
needed. The first URL gets the latest released version of the tool, and the
second gets the latest trunk code from Subversion.
OPTIONS
- –algorithm
The algorithm to use when comparing the tables. This is a suggestion. The tool
will auto-detect the best algorithm, and if your chosen algorithm can’t be used,
will use the best available one instead. See “ALGORITHMS”.
- –askpass
Prompt for a password when connecting to MySQL.
- –bufferresults
Fetch all rows from MySQL before comparing. This is disabled by default. If
enabled, all rows will be fetched into memory for comparing. This may result in
the results “cursor” being held open for a shorter time on the server, but if
the tables are large, it could take a long time anyway, and eat all your memory.
For most non-trivial data sizes, you want to leave this disabled.
- –chunksize
The size of each chunk of rows for the “Chunk” and “Nibble” algorithms.
The size can be either a number of rows, or a data size. Data sizes are
specified with a suffix of k=kibibytes, M=mebibytes, G=gibibytes. Data sizes
are converted to a number of rows by dividing by the average row length.
- –columns
Comma-separated list of columns to compare when looking for differences.
- –databases
Sync only this comma-separated list of databases.
- –engine
Sync only tables whose storage engine is in this comma-separated list.
- –execute
After finding differences, execute the queries required to sync the tables.
- –function
Which hash function you’d like to use for row checksums. Good choices include
"MD5" and "SHA1". If you have installed the "FNV_64" user-defined function,
"mk-table-sync" will detect it and prefer to use it, because it is much faster
than the others. See mk-table-checksum for more information and benchmarks.
- –help
Show a brief help message.
- –ignoredb
Use this option to skip a comma-separated list of databases.
- –ignoreengine
Use this option to skip a comma-separated list of storage engines (table types).
- –ignoretbl
Use this option to skip a comma-separated list of tables. Table names may be
qualified with the database name.
- –lock
Lock tables when beginning work. This uses "LOCK TABLES". This can help
prevent tables being changed while you’re examining them.The argument is an integer, default value 0. The possible values are as
follows:
VALUE MEANING
===== =======================================================
0 Never lock tables.
1 Lock and unlock one time per sync cycle (as implemented
by the syncing algorithm). This is the most granular
level of locking available. For example, the Chunk
algorithm will lock each chunk of C<N> rows, and then
unlock them if they are the same on the source and the
destination, before moving on to the next chunk.
2 Lock and unlock before and after each table.
3 Lock and unlock once for every server (DSN) synced, with
C<FLUSH TABLES WITH READ LOCK>.A replication slave is never locked if “–replicate” or “–synctomaster”
is specified, since in theory locking the table on the master should prevent any
changes from taking place. (You are not changing data on your slave, right?)
If “–wait” is given, the master (source) is locked and then the tool waits
for the slave to catch up to the master before continuing.If “–transaction” is specified, "LOCK TABLES" is not used. Instead, lock
and unlock are implemented by beginning and committing transactions. The exception
is if “–lock” is 3.
Print queries that will resolve the differences between the tables. If you
don’t trust "mk-table-sync", or just want to see what it will do, this is a
good way to be safe. These queries are valid SQL and you can run them yourself
if you want to sync the tables manually.
- –replace
Write all "INSERT" and "UPDATE" statements as "REPLACE". This is
automatically switched on as needed when there are unique index violations.
- –replicate
Specifies that "mk-table-sync" should examine the specified table to find data
that differs. The table is exactly the same as the argument of the same name to
mk-table-checksum. That is, it contains records of which tables (and ranges
of values) differ between the master and slave.For each table and range of values that shows differences between the master and
slave, "mk-table-checksum" will sync that table, with the appropriate "WHERE"
clause, to its master.This automatically sets “–wait” to 60 and causes changes to be made on the
master instead of the slave.If “–synctomaster” is specified, the tool will assume the server you
specified is the slave, and connect to the master as usual to sync.Otherwise, it will try to use "SHOW PROCESSLIST" to find slaves of the server
you specified. If it is unable to find any slaves via "SHOW PROCESSLIST", it
will inspect "SHOW SLAVE HOSTS" instead. You must configure each slave’s
"report-host", "report-port" and other options for this to work right. After
finding slaves, it will inspect the specified table on each slave to find data
that needs to be synced, and sync it.The tool examines the master’s copy of the table first, assuming that the master
is potentially a slave as well. Any table that shows differences there will
NOT be synced on the slave(s). For example, suppose your replication is set
up as A->B, B->C, B->D. Suppose you use this argument and specify server B.
The tool will examine server B’s copy of the table. If it looks like server B’s
data in table "test.tbl1" is different from server A’s copy, the tool will not
sync that table on servers C and D.
- –setvars
Specify any variables you want to be set immediately after connecting to MySQL.
These will be included in a "SET" command.
- –skipbinlog
Do not log statements to the binary log ("SET SQL_LOG_BIN=0").
- –skipforeignkey
Disable foreign key checks ("SET FOREIGN_KEY_CHECKS=0").
- –skipuniquekey
Disable unique index checks ("SET UNIQUE_CHECKS=0").
- –synctomaster
Treat the server you specified as a slave. Inspect "SHOW SLAVE STATUS",
connect to the server’s master, and treat the master as the source and the slave
as the destination. Causes changes to be made on the master. Sets “–wait”
to 60 by default, sets “–lock” to 1 by default, and disables
“–transaction” by default. See also “–replicate”, which changes this
option’s behavior.
- –tables
Restrict syncing to this comma-separated list of tables. Table names may be
qualified with the database name.
- –test
Analyze tables and decide which sync algorithm to use, then bail out before
doing any work. Implies “–verbose” so you can see the results.
- –timeoutok
Keep going if “–wait” fails. If you specify “–wait” and the slave
doesn’t catch up to the master’s position before the wait times out, the default
behavior is to abort. This option makes the tool keep going anyway.
Warning: if you are trying to get a consistent comparision between the two
servers, you probably don’t want to keep going after a timeout.
- –transaction
Use transactions instead of "LOCK TABLES". The granularity of beginning and
committing transactions is controlled by “–lock”. This is enabled by
default, but since “–lock” is disabled by default, it has no effect.Most options that enable locking also disable transactions by default, so if
you want to use transactional locking (via "LOCK IN SHARE MODE" and "FOR
UPDATE", you must specify –transaction explicitly.If you don’t specify –transaction explicitly, "mk-table-sync" will decide on
a per-table basis whether to use transactions or table locks. It currently
uses transactions on InnoDB tables, and table locks on all others.
- –trim
This option adds a "TRIM()" to "VARCHAR" columns in "BIT_XOR" and "ACCUM"
modes.This is useful when you don’t care about the trailing space differences between
MySQL versions which vary in their handling of trailing spaces. MySQL 5.0 and
later all retain trailing spaces in "VARCHAR", while previous versions would
remove them.
- –utf8
This option is deprecated. Pass the "A" option in a DSN instead. For
backwards compatibility, this option adds "A=utf8" to all DSNs.Enables character set settings in Perl and MySQL. If the value is "utf8", sets
Perl’s binmode on STDOUT to utf8, passes the "mysql_enable_utf8" option to
DBD::mysql, and runs "SET NAMES UTF8" after connecting to MySQL. Any other
value sets binmode on STDOUT without the utf8 layer, and runs "SET NAMES" after
connecting to MySQL.
- –verbose
Explain the differences found while comparing the tables. See “OUTPUT” for
more details about the output.
- –version
Output version information and exit.
- –wait
Make the master wait for the slave to catch up in replication before comparing
the tables. The value is the number of seconds to wait before timing out (see
also “–timeoutok”). Sets “–lock” to 1 and “–transaction” to 0 by
default.
- –where
A "WHERE" clause to restrict the portion of the table being synchronized.
EXIT STATUS
STATUS MEANING
====== =======================================================
0 Success.
1 Internal error.
2 At least one table differed on the destination.
3 Combination of 1 and 2.
OUTPUT
If you specify the “–verbose” option, you’ll see information about the
differences between the tables. There is one row per table. Each server is
printed separately. For example,
# Syncing D=test,t=test2
# DELETE REPLACE INSERT UPDATE ALGORITHM DATABASE.TABLE
# 0 0 2 0 Chunk test.test1
This table required 2 "UPDATE" statements to synchronize.
There are cases where no combination of "INSERT", "UPDATE" or "DELETE"
statements can resolve differences without violating some unique key. For
example, suppose there’s a primary key on column a and a unique key on column b.
Then there is no way to sync these two tables with straightforward UPDATE
statements:
+—+—+ +—+—+
| a | b | | a | b |
+—+—+ +—+—+
| 1 | 2 | | 1 | 1 |
| 2 | 1 | | 2 | 2 |
+—+—+ +—+—+
The tool rewrites queries to "DELETE" and "REPLACE" in this case. This is
automatically handled after the first index violation, so you don’t have to
worry about it.
ENVIRONMENT
The environment variable MKDEBUG enables verbose debugging output in all of the
Maatkit tools:
MKDEBUG=1 mk-….
HISTORY AND ACKNOWLEDGEMENTS
My work is based in part on Giuseppe Maxia’s work on distributed databases,
<http://www.sysadminmag.com/articles/2004/0408/> and code derived from that
article. There is more explanation, and a link to the code, at
<http://www.perlmonks.org/?node_id=381053>.
Another programmer extended Maxia’s work even further. Fabien Coelho changed
and generalized Maxia’s technique, introducing symmetry and avoiding some
problems that might have caused too-frequent checksum collisions. This work
grew into pg_comparator, <http://www.coelho.net/pg_comparator/>. Coelho also
explained the technique further in a paper titled “Remote Comparison of Database
Tables” (<http://cri.ensmp.fr/classement/doc/A-375.pdf>).
This existing literature mostly addressed how to find the differences between
the tables, not how to resolve them once found. I needed a tool that would not
only find them efficiently, but would then resolve them. I first began thinking
about how to improve the technique further with my article
<http://www.xaprb.com/blog/2007/03/05/an-algorithm-to-find-and-resolve-data-differences-between-mysql-tables/>,
where I discussed a number of problems with the Maxia/Coelho “bottom-up”
algorithm. After writing that article, I began to write this tool. I wanted to
actually implement their algorithm with some improvements so I was sure I
understood it completely. I discovered it is not what I thought it was, and is
considerably more complex than it appeared to me at first. Fabien Coelho was
kind enough to address some questions over email.
The first versions of this tool implemented a version of the Coelho/Maxia
algorithm, which I called “bottom-up”, and my own, which I called “top-down.”
Those algorithms are considerably more complex than the current algorithms and
I have removed them from this tool, and may add them back later. The
improvements to the bottom-up algorithm are my original work, as is the
top-down algorithm. The techniques to actually resolve the differences are
also my own work.
Another tool that can synchronize tables is the SQLyog Job Agent from webyog.
Thanks to Rohit Nadhani, SJA’s author, for the conversations about the general
techniques. There is a comparison of mk-table-sync and SJA at
<http://www.xaprb.com/blog/2007/04/05/mysql-table-sync-vs-sqlyog-job-agent/>
Thanks to the following people and organizations for helping in many ways:
The Rimm-Kaufman Group <http://www.rimmkaufman.com/>,
MySQL AB <http://www.mysql.com/>,
Blue Ridge InternetWorks <http://www.briworks.com/>,
Percona <http://www.percona.com/>,
Fabien Coelho,
Giuseppe Maxia and others at MySQL AB,
Kristian Koehntopp (MySQL AB),
Rohit Nadhani (WebYog),
The helpful monks at Perlmonks,
And others too numerous to mention.
SYSTEM REQUIREMENTS
You need Perl, DBI, DBD::mysql, and some core packages that ought to be
installed in any reasonably new version of Perl.
BUGS
Please use the Sourceforge bug tracker, forums, and mailing lists to request
support or report bugs: <http://sourceforge.net/projects/maatkit/>.
Please include the complete command-line used to reproduce the problem you are
seeing, the version of all MySQL servers involved, the complete output of the
tool when run with “–version”, and if possible, debugging output produced by
running with the "MKDEBUG=1" environment variable.
COPYRIGHT, LICENSE AND WARRANTY
This program is copyright (c) 2007 Baron Schwartz.
Feedback and improvements are welcome.
THIS PROGRAM IS PROVIDED “AS IS” AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, version 2; OR the Perl Artistic License. On UNIX and similar
systems, you can issue `man perlgpl’ or `man perlartistic’ to read these
licenses.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA.
AUTHOR
VERSION
This manual page documents Ver 1.0.7 Distrib 1972 $Revision: 1970 $.
Index
- NAME
- SYNOPSIS
- DESCRIPTION
- ALGORITHMS
- DOWNLOADING
- OPTIONS
- EXIT STATUS
- OUTPUT
- ENVIRONMENT
- HISTORY AND ACKNOWLEDGEMENTS
- SYSTEM REQUIREMENTS
- BUGS
- COPYRIGHT, LICENSE AND WARRANTY
- AUTHOR
- VERSION