Man page for mk-parallel-dump
August 24, 2007 – 5:38 pmMK-PARALLEL-DUMP
Section: User Contributed Perl Documentation (1)
Updated: 2008-06-01
Index
Return to Main Contents
NAME
mk-parallel-dump - Dump sets of MySQL tables in parallel.
SYNOPSIS
mk-parallel-dump
mk-parallel-dump –tab –basedir /path/to/backups/
mk-parallel-dump –sets order,profile,session –settable meta.backupset
DESCRIPTION
mk-parallel-dump connects to a MySQL server, finds database and table names,
and dumps them in parallel for speed. It can be used in several pre-packaged
ways, or as a generic wrapper to call some program in parallel, passing it
parameters for each table. It supports backup sets and dumping only tables that
have changed since the last dump.
To dump all tables to gzipped files in the current directory, each database with
its own directory, with a global read lock, flushing and recording binary log
positions, each table in a single file:
mk-parallel-dump
To dump tables elsewhere:
mk-parallel-dump –basedir /path/to/elsewhere
To dump to tab-separated files with "SELECT INTO OUTFILE", each table with
separate data and SQL files:
mk-parallel-dump –tab
To dump one or more backup sets (see “BACKUP SETS”):
mk-parallel-dump –sets set1,set2,set3 –settable meta.backupset
To “write your own command line,” use "–" to indicate where the arguments for
mk-parallel-dump stop and where the arguments for "mysqldump" (or any other
program) begin. The following example shows "mysqldump", and aside from
simpler options to "mysqldump", is basically what happens when you specify no
arguments at all:
mk-parallel-dump — mysqldump –skip-lock-tables ‘%D’ ‘%N’ \
\| gzip –fast -c - \> ‘%D.%N.gz’
The "%" modifiers are macros (see “MACROS”). The "–skip-lock-tables"
argument is very important in that last example, because otherwise both
mk-parallel-dump and "mysqldump" will lock tables, so "mysqldump" will hang,
waiting for the locks. Notice the shell metacharacters "|" and ">" are
escaped so the shell won’t interpret them, and they’ll get passed through to the
generated command-line.
There’s no reason you can’t use mk-parallel-dump to do other tasks in
parallel, such as "OPTIMIZE TABLE":
mk-parallel-dump –noflushlock –nolocktables — mysqlcheck –optimize ‘%D’ ‘%N’
When you use built-in defaults, mk-parallel-dump will relay these arguments
on to every forked copy of "mysqldump": “–defaults-file”, “–host”,
“–port”, “–socket”, “–user”, “–password”. If you write your own
command-line, you will need to specify them manually.
If you specify the “–tab” option, mk-parallel-dump creates separate files
that hold views and triggers, so they can be restored correctly (this is not
currently possible with the "mysqldump" from MySQL AB, which will restore
triggers before restoring data). Otherwise it does not back up your entire
database; it dumps tables and data only. It does not dump view definitions
or stored routines. However, if you dump the "mysql" database, you’ll be
dumping the stored routines anyway.
Exit status is 0 if everything went well, 1 if any chunks failed, and any
other value indicates an internal error.
mk-parallel-dump doesn’t clean out any destination directories before
dumping into them. You can move away the old destination, then remove it
after a successful dump, with a shell script like the following:
#!/bin/sh
CNT=`ls | grep -c old`;
if [ -d default ]; then mv default default.old.$CNT;
mk-parallel-dump
if [ $? != 0 ]
then
echo "There were errors, not purging old sets."
else
echo "No errors during dump, purging old sets."
rm -rf default.old.*
fi
BACKUP SETS
Backup sets are groups of logically related tables you want to backup together.
You specify a set by inserting the table names into a table in the MySQL server
from which you’re dumping, and then naming it in the “–sets” option.
mk-parallel-dump always works a set at a time; if you don’t specify a set, it
auto-discovers tables, filters them with the various command-line options
(“–databases”, etc) and considers them the default set.
The table that stores backup sets should have at least these columns: setname,
priority, db, tbl. The following is a suggested table structure:
CREATE TABLE backupset (
setname CHAR(10) NOT NULL,
priority INT NOT NULL DEFAULT 0,
db CHAR(64) NOT NULL,
tbl CHAR(64) NOT NULL,
ts TIMESTAMP NOT NULL,
PRIMARY KEY(setname, db, tbl),
KEY(setname, priority, db, tbl)
);
Entries are ordered by priority, db, and tbl. Priority 0 tables are dumped
first, not last. If it looks like tables are dumped in the wrong order, it’s
probably because they’re being dumped asynchronously. The output is printed
when the dump finishes, not when it starts.
If you specify “–age”, mk-parallel-dump expects the "ts" column to
exist, and will update the column to the current date and time when it
successfully dumps a table.
Don’t use "default" as a set name. It is used when you don’t specify any
sets and when you want all tables not explicitly assigned to a set to be
dumped (see “–defaultset”).
Set names may contain only lowercase letters, numbers, and underscores.
CHUNKS
mk-parallel-dump can break your tables into chunks when dumping, and put
approximately the amount of data you specify into each chunk. This is useful to
avoid enormous files for restoration, which can not only take a long time but
may be a lot of extra work for transactional storage engines like InnoDB. A
huge file can create a huge rollback segment in your tablespace.
To dump in chunks, specify the “–chunksize” option. This option is an
integer with an optional suffix. Without the suffix, it’s the number of rows
you want in each chunk. With the suffix, it’s the approximate size of the data.
mk-parallel-dump tries to use index statistics to calculate where the
boundaries between chunks should be. If the values are not evenly distributed,
some chunks can have a lot of rows, and others may have very few or even none.
Some chunks can exceed the size you want.
When you specify the size with a suffix, the allowed suffixes are k, M and G,
for kibibytes, mebibytes, and gibibytes, respectively. mk-parallel-dump
doesn’t know anything about data size. It asks MySQL (via "SHOW TABLE STATUS")
how long an average row is in the table, and converts your option to a number
of rows.
Not all tables can be broken into chunks. mk-parallel-dump looks for an
index whose leading column is numeric (integers, real numbers, and date and time
types). It prefers the primary key if its first column is chunk-able.
Otherwise it chooses the first chunk-able column in the table.
Generating a series of "WHERE" clauses to divide a table into evenly-sized
chunks is difficult. If you have any ideas on how to improve the algorithm,
please write to the author (see “BUGS”).
MACROS
mk-parallel-dump can insert "%" variables into arguments. The available macros
are as follows:
MACRO MEANING
===== =================
%S The backup set
%D The database name
%N The table name
%C The chunk number
%W The WHERE clause
You can place a number between the "%" and the letter. The macro replacement
then assumes it’s a digit and pads it with leading zeroes (in practice, this is
only useful for %C).
OUTPUT
Output depends on verbosity. When “–test” is given, output includes
commands that would be executed.
When “–verbose” is 0, there is normally no output unless there’s an error.
When “–verbose” is 1, there is one line of output for each backup set,
showing the set, how many tables and chunks were dumped with what status, how
much time elapsed, and how much time the parallel dump jobs added up to. A
final line shows sums for all sets, unless there is only one set.
When “–verbose” is 2, there is also one line of output for each table.
Each line is printed when a forked “child” process ends and is removed from
the list of children. The output shows the backup set, database, table,
seconds spent dumping, the exit status of the forked dump process, and number
of current processes (including the one just reaped; so this typically shows
“how many are running in parallel”). A status of 0 indicates success:
SET DATABASE TABLE TIME STATUS THREADS
default mysql db 0 0 4
default mysql columns_priv 0 0 4
default mysql help_category 0 0 3
SPEED OF PARALLEL DUMPS
How much faster is it to dump in parallel? That depends on your hardware and
data. You may be able dump files twice as fast, or more if you have lots of
disks and CPUs. Here are some user-contributed figures.
The following table is for a 3.6GHz Xeon machine with 4 processors and a RAID-10
array of 15k disks, directly attached to the server with a fibre channel. Most
of the space is in one huge table that wasn’t dumped in parallel:
COMMAND SIZE TIME
————————– —– —-
mk-parallel-dump 1.4GB 269
mysqldump 1.4GB 345
On the same machine, in a database with lots of roughly equal-sized tables:
COMMAND SIZE TIME
————————– —– —-
mk-parallel-dump 117MB 7
mysqldump 117MB 37
It doesn’t always work that well. A dual 2.80GHz Xeon server with a RAID-5
array of three 7200RPM SATA disk drives running MySQL 5.0.38 on GNU/Linux
achieved the following dump times:
COMMAND SIZE TIME
————————– —– —-
mk-parallel-dump 3.0GB 2596
mysqldump | gzip –fast 3.0GB 3195
While dumping two threads in parallel, this machine was at an average of 74%
CPU utilization and 12% I/O wait. This machine doesn’t have enough disks and
CPUs to do that many things at once, so it’s not going to speed up much.
Dumping lots of tiny tables by forking of lots of "mysqldump" processes isn’t
usually much faster, because of the overhead of starting "mysqldump",
connecting, inspecting the table, and dumping it. Note that tab-separated
dumps are typically much faster and don’t suffer as much from the effects of
many tiny tables, because they’re not done via "mysqldump".
See also <http://www.paragon-cs.com/wordpress/?p=52> for a test of parallel
dumping and restoring.
DOWNLOADING
You can download Maatkit from the Sourceforge website at
<http://sourceforge.net/projects/maatkit>, or you can get any of the tools
easily with a command like the following:
wget http://www.maatkit.org/get/toolname
or
wget http://www.maatkit.org/trunk/toolname
Where "toolname" can be replaced with the name (or fragment of a name) of any
of the Maatkit tools. Once downloaded, they’re ready to run; no installation is
needed. The first URL gets the latest released version of the tool, and the
second gets the latest trunk code from Subversion.
OPTIONS
Some options can be disabled by prefixing them with "–no", such as
"–no-gzip".
- –age
Specifies how ‘old’ a table must be before mk-parallel-dump will consider
it.When “–sets” is not specified, mk-parallel-dump uses "SHOW TABLE STATUS"
instead of "SHOW TABLES" to get a list of tables in each database, and compares
the time to the "Update_time" column in the output. If the "Update_time"
column is not "NULL" and is older than the specified interval ago, it will not
be dumped. Thus, it means “dump tables that have changed since X amount of
time” (presumably the last regular backup). This means the table will always be
dumped if it uses InnoDB or another storage engine that doesn’t report the
"Update_time".When “–sets” is specified, the “–settable” table determines when a table
was last dumped, and the meaning of "–age" reverses; it becomes “dump tables
not dumped in X amount of time.”
- –basedir
The directory in which files will be stored. If you use pre-canned options,
such as “–tab”, mk-parallel-dump knows what the eventual filenames will
be, and can place all the files in this directory. It will also create any
parent directories that don’t exist, if needed (see also “–umask”).The default is the current working directory.
If you write your own command line, mk-parallel-dump cannot know which
arguments in the command line are filenames, and thus doesn’t know the
eventual destination of the dump files. It does not try to create parent
directories in this case.
- –binlogpos
Dump binary log positions from both "SHOW MASTER STATUS" and "SHOW SLAVE
STATUS", whichever can be retrieved from the server. The data is dumped to a
file named 00_master_data.sql. This is done for each backup set.The file also contains details of each table dumped, including the WHERE clauses
used to dump it in chunks.This option is enabled by default.
- –charset
Enables character set settings in Perl and MySQL. If the value is "utf8", sets
Perl’s binmode on STDOUT to utf8, passes the "mysql_enable_utf8" option to
DBD::mysql, and runs "SET NAMES UTF8" after connecting to MySQL. Any other
value sets binmode on STDOUT without the utf8 layer, and runs "SET NAMES" after
connecting to MySQL.
- –chunksize
Specifies that the table should be dumped in segments of approximately the size
given. The syntax is either a plain integer, which is interpreted as a number
of rows per chunk, or an integer with a suffix of G, M, or k, which is
interpreted as the size of the data to be dumped in each chunk. See “CHUNKS”
for more details.
- –csv
Changes “–tab” options so the dump file is in comma-separated values
(CSV) format. The SELECT INTO OUTFILE statement looks like the following, and
can be re-loaded with the same options:
SELECT * INTO OUTFILE %D.%N.%6C.txt
FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ‘\"’
LINES TERMINATED BY ‘\n’ FROM %D.%N;This option implies “–tab”.
- –databases
Dump this comma-separated list of databases.
- –dbregex
Dump only databases whose names match this Perl regular expression.
- –defaultset
When “–sets” is given, this option makes mk-parallel-dump dump a
"default" set consisting of tables not explicitly included in any set.
- –defaults-file
Only read default options from the given file. You must give an absolute
pathname.
- –flushlock
Lock all tables globally with "FLUSH TABLES WITH READ LOCK". This is enabled
by default, unless you’re dumping sets (see “–sets”). This lock is taken
once, at the beginning of the whole process, and is never released.If you want to lock only the tables you’re dumping, use “–locktables”.
- –flushlog
Execute "FLUSH LOGS" after locking and before dumping master/slave binary log
positions. This is done for each backup set.This option is NOT enabled by default because it causes the MySQL server to
rotate its error log, potentially overwriting error messages.
- –gzip
Compresses files with gzip. This is enabled by default unless your platform is
Win32. By default, this causes the standard SQL dumps to be piped to gzip’s
"STDIN" and the result is redirected to the destination file. If this option
isn’t enabled, by default "mysqldump"’s "–result-file" parameter is used to
direct the dump to the destination file. When using “–tab”, this option
causes gzip to be called separately on each resulting file after it is dumped
(because "SELECT INTO OUTFILE" cannot be directed to a pipe).
- –help
Displays a help message.
- –host
Connect to host.
- –ignoredb
Do not dump this comma-separated list of databases.
- –ignoreengine
Do not dump any data for this comma-separated list of storage engines. The
schema file will be dumped as usual.The default value is "FEDERATED,MRG_MyISAM". This prevents dumping data for
Federated tables and Merge tables.
- –ignoretbl
Do not dump this comma-separated list of table names. Table names may be
qualified with the database name.
- –locktables
Disables “–flushlock” (unless it was explicitly set) and locks tables with
"LOCK TABLES READ". Enabled by default when “–sets” is specified. The
lock is taken and released with every set of tables dumped.
- –losslessfp
Wraps double and float types with a call to "FORMAT()" with 17 digits of
precision. According to the comments in Google’s patches, this will give
lossless dumping and reloading in most cases. (I shamelessly stole this
technique from them. I don’t know enough about floating-point math to have an
opinion).This works only with “–tab”.
- –numthread
Specifies the number of parallel processes to run. The default is 2 (this is
mk-parallel-dump, after all — 1 is not parallel). On GNU/Linux machines,
the default is the number of times ‘processor’ appears in /proc/cpuinfo. On
Windows, the default is read from the environment. In any case, the default is
at least 2, even when there’s only a single processor.
- –password
Password to use when connecting.
- –port
Port number to use for connection.
- –quiet
Sets “–verbose” to 0.
- –sets
Dump this comma-separated list of backup sets, in order. Requires
“–settable”. See “BACKUP SETS”. The special "default" set is
reserved; don’t use it as a set name.
- –setperdb
Specifies that each database is a separate backup set. Each set is named the
same as the database. Implies “–locktables”.
- –settable
Specifies the table in which backup sets are kept. It may be given in
database.table form.
- –setvars
Specify any variables you want to be set immediately after connecting to MySQL.
These will be included in a "SET" command.
- –socket
Socket file to use for connection.
- –stopslave
Issue "STOP SLAVE" on server before dumping data. This ensures that the data
is not changing during the dump. Issues "START SLAVE" after the dump is
complete.If the slave is not running, throws an error and exits. This is to prevent
possibly bad things from happening if the slave is not running because of a
problem, or because someone intentionally stopped the slave for maintenance or
some other purpose.
- –tab
Dump via "SELECT INTO OUTFILE", which is similar to what "mysqldump" does with
the "–tab" option, but you’re not constrained to a single database at a time.Before you use this option, make sure you know what "SELECT INTO OUTFILE" does!
I recommend using it only if you’re running mk-parallel-dump on the same
machine as the MySQL server, but there is no protection if you don’t.The files will be gzipped after dumping if “–gzip” is enabled. This option
sets “–umask” to zero so auto-created directories are writable by the MySQL
server.Triggers are dumped into ".trg" files, and views are postponed until the end of
the dump, then dumped all together into the "00_views.sql" file. This allows
restoring data before the triggers, which is important for restoring data
accurately. Views must be postponed until the end and dumped together so they
can be restored correctly; interdependencies between views and tables may
prevent correct restoration otherwise.
- –tables
Dump this comma-separated list of table names. Table names may be qualified
with the database name.
- –tblregex
Dump only tables whose names match this Perl regular expression.
- –test
Print commands instead of executing them.
- –umask
Set the program’s "umask" to this octal value. This is useful when you want
created files and directories to be readable or writable by other users (for
example, the MySQL server itself).
- –user
User for login if not current user.
- –verbose
Sets the verbosity; repeatedly specifying it increments the verbosity.
Default is 1 if not specified. See “OUTPUT”.
- –version
Output version information and exit.
- –wait
If the MySQL server crashes during dumping, waits until the server comes back
and then continues with the rest of the tables. "mk-parallel-dump" will
check the server every second until this time is exhausted, at which point it
will give up and exit.This implements Peter Zaitsev’s “safe dump” request: sometimes a dump on a
server that has corrupt data will kill the server. mk-parallel-dump will
wait for the server to restart, then keep going. It’s hard to say which table
killed the server, so no tables will be retried. Tables that were being
concurrently dumped when the crash happened will not be retried. No additional
locks will be taken after the server restarts; it’s assumed this behavior is
useful only on a server you’re not trying to dump while it’s in production.
ENVIRONMENT
The environment variable "MKDEBUG" enables verbose debugging output in all of
the Maatkit tools:
MKDEBUG=1 mk-….
SYSTEM REQUIREMENTS
You need Perl, DBI, DBD::mysql, and some core packages that ought to be
installed in any reasonably new version of Perl.
This program works best on GNU/Linux. Filename quoting might not work well on
Microsoft Windows if you have spaces or funny characters in your database or
table names.
BUGS
Please use the Sourceforge bug tracker, forums, and mailing lists to request
support or report bugs: <http://sourceforge.net/projects/maatkit/>.
Please include the complete command-line used to reproduce the problem you are
seeing, the version of all MySQL servers involved, the complete output of the
tool when run with “–version”, and if possible, debugging output produced by
running with the "MKDEBUG=1" environment variable.
COPYRIGHT, LICENSE AND WARRANTY
This program is copyright (c) 2007 Baron Schwartz. Feedback and improvements
are welcome.
THIS PROGRAM IS PROVIDED “AS IS” AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, version 2; OR the Perl Artistic License. On UNIX and similar
systems, you can issue `man perlgpl’ or `man perlartistic’ to read these
licenses.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA.
AUTHOR
SEE ALSO
VERSION
This manual page documents Ver 1.0.8 Distrib 1972 $Revision: 1970 $.
Index
- NAME
- SYNOPSIS
- DESCRIPTION
- BACKUP SETS
- CHUNKS
- MACROS
- OUTPUT
- SPEED OF PARALLEL DUMPS
- DOWNLOADING
- OPTIONS
- ENVIRONMENT
- SYSTEM REQUIREMENTS
- BUGS
- COPYRIGHT, LICENSE AND WARRANTY
- AUTHOR
- SEE ALSO
- VERSION