Thursday, December 29, 2016

Slow MySQL Replication with MySQL 5.6

I can't believe that this is not common knowledge... MySQL 5.6 (and even Percona's 5.6 XTRADB Server, can have poor replication performance.

We built a beefy slave machine for the sole purpose of backing up production database.  We had the high hopes of keeping them in sync ~100% of the time.  With the replication performance present in MySQL 5.6, that idea seemed to be impossible at first.

The server itself had 128G of RAM and was using a Samsung EVO 950 Pro drive in the M2 slot for the MySQL data directory. Replication was completely CPU bound with little disk IO.   In fact, the amount of disk IO would not have even choked old IDE ATA drives.

Turns out that a Facebook engineer blogged about it in Dec 2013.  His conclusion was that MySQL 5.6 was unsuitable for Facebook's replication needs and that they could use MySQL 5.1 instead as he showed that MySQL 5.1's replication performance is much better than 5.6.

Switching to MySQL 5.1 was out of the question so  I was left with a bit of a problem...  I didn't like the idea of replication running from a MySQL 5.6 Master to MySQL 5.1 slave.  I gave MySQL 5.7 a go but I found that was unstable in this configuration.  I had two trial upgrades of the master's database go awry and a few times in production/testing with the end user I had weird user account issues crop up.

I ended up filtering some of the tables out  with the 'replicate_ignore_table' option.  I picked tables that received a high number of writes but where of little value for business purposes or reporting.

Further tuning of my.cnf helped.  Below are parts of this setup's MySQL configuration:


lower_case_table_names = 1
skip_name_resolve
innodb_file_per_table=1
binlog_format=row
default_storage_engine = innodb
max_allowed_packet = 64M
binlog_cache_size = 1M
max_heap_table_size = 1024M
sort_buffer_size = 256M
join_buffer_size = 256M
tmp_table_size = 8192M
innodb_doublewrite=0
innodb_support_xa=ON
innodb_checksum_algorithm=none
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_buffer_pool_instances=8
innodb_buffer_pool_size = 64G
innodb_data_file_path = ibdata1:10M:autoextend
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_io_capacity=2000
innodb_io_capacity_max=2500
innodb_thread_concurrency = 0
innodb_flush_log_at_trx_commit = 0
innodb_log_buffer_size = 32M
innodb_log_files_in_group = 3
innodb_max_dirty_pages_pct = 90
innodb_flush_method=O_DSYNC
innodb_lock_wait_timeout = 120
innodb_flush_neighbors=0
innodb_log_file_size = 4G
innodb_lru_scan_depth=2500
innodb_open_files=65536
innodb_print_all_deadlocks

sync_binlog=0
sync_master_info=0
sync_relay_log=0
sync_relay_log_info=0
relay_log=relay-bin
relay_log_index=relay-bin.index
master_info_repository=FILE
relay_log_info_repository=FILE
relay_log_recovery=0

Please only use this for reference and read up on each of these options before enabling them on your production servers.  If you have a test server that you can replicate the same work load, even better. Some of the options enabled here have some serious side effects that in the event of a crash, may corrupt the database.

Wednesday, May 11, 2016

Pro Tip: Do not remove 'kmod' in Ubuntu LXC Container. It removes '/sbin/init/'

Long story short: Do not remove the package 'kmod' on Ubuntu systems.  Various packages depend on this package, including cron and upstart.  Since upstart provides /sbin/init, the container will fail to start.

I was in the process of trimming down an LXC container and in doing so I noticed the 'kmod' package.  The description makes it seem like the perfect candidate for removal in an LXC container.

justin@deimos:~$ apt-cache show kmod

Description-en: tools for managing Linux kernel modules
 This package contains a set of programs for loading, inserting, and
 removing kernel modules for Linux.
 It replaces module-init-tools.
After removing this package, and the packages cron and upstart.  I then restarted my container... and after it failed to start, I enabled the log feature and found this error:

      lxc-start 1462981210.791 NOTICE   lxc_start - start.c:start:1152 - exec'ing '/sbin/init'      lxc-start 1462981210.791 ERROR    lxc_start - start.c:start:1155 - No such file or directory - failed to exec /sbin/init      lxc-start 1462981210.791 ERROR    lxc_sync - sync.c:__sync_wait:51 - invalid sequence number 1. expected 4      lxc-start 1462981210.791 ERROR    lxc_start - start.c:__lxc_start:1080 - failed to spawn 'lxc_container'

I thought I might have been fighting with previously closed bugs, but it turned out just to be be my own over zealousness in saving a few bytes of disk space. Oops!



Monday, May 9, 2016

Embedding Redmine within an IFrame

In this post I'm going to describe how I used NGINX to remove the X-Frame-Options so that we could embed Redmine inside another webapp.  

Our dev team at Indosoft, Inc had created a set of Work Flow tools to help process tickets in our issue tracking software.  The workflow tools run on one server and they embed the Redmine webapp inside an IFrame. Redmine is running in an LXC container and is proxied by NGINX.  The NGINX container forces all the clients to use a HTTPS connection.  The workflow tool is also using a valid SSL certificate.

The problem is that Redmine, which is written in Ruby, was setting the X-Frame-Options header in the HTTP response to 'X-Frame-Options SAMEORIGIN'.   Since the workflow tools and Redmine had different URL's, the web browsers where refusing to load Redmine in the IFrame.

I knew that I should be able to use NGINX to solve this problem.  It turns out that the 'ALLOW-FROM uri' option should have been able to solve the problem, however it is not and will not be implemented in Chrome.  If you want to read that conversation that the Chrome Dev's had, follow this link:  https://bugs.webkit.org/show_bug.cgi?id=94836

So it seems like the easiest and most straight forward way is to remove the 'X-Frame-Options' is the response header as it passes through NGINX.  NGINX's proxy module gives us the tool we need; using the proxy_hide_header option we can strip any header out of the response from the proxied server before sending it off to the client.

        location / {
                proxy_pass http://x.x.x.x:80/;
                proxy_set_header Host            $host;
                proxy_set_header X-Forwarded-For $remote_addr;
                proxy_hide_header X-Frame-Options ;
        }
This approach is at best only a partial fix.  This is weakens the security of our embedded webapp as the users could be left vulnerable to clickjacking and XSS exploits. We have a few options to explore if this is unacceptable.  NGINX could be configured to only strip this header when the client IP is on a white list with a bit of system administration overhead but again, that is not ideal.  The proper solution to improve the security of this setup seems to be to implement Content Security Policy (CSP).

In my follow up article to this one,  I'll tackle how to inject the CSP directives into the Response Headers.

Thursday, May 5, 2016

Canadian Cell Tower Map

I stumbled upon a neat mashup that Steven Nikkel put together.   It's neat to see where the cell towers are and which bands they handle.

http://www.ertyu.org/steven_nikkel/cancellsites.html
This is good to know ahead of time when you are travelling through rural and very remote  areas.

What usually happens when I've been out in the woods while ATV'ing, we'll stop and we'll find that we have picked up the odd signal from a tower and a few text messages have come in.  This might not sound like a big deal, but I'm talking about about locations that are so far from the official dead zone that it's pretty cool to find these odd locations that do make contact with a tower.

Next ATV trip I'm going to try figuring out where more sweet spots are in case of emergency.   I'll have to admit, this  may end up being a  fun and geeky hobby.

Tuesday, April 26, 2016

City of Fredericton Eliminates a Large Portion of it's Monthly Parking Spots for the Summer of 2016

I sent this email to all the current City Council and the candidates in this upcoming election. At the time of posting this, I've had some feed back from some of the candidates.



To the current City of Fredericton Council and the Candidates in this upcoming election,

As of May 1st many office workers in the downtown area will not have any parking until further notice as both the Upper Queen Lot and the Frederick Sq. Parking Garage are closing to the general public. The government parking permit holders have been given priority and those of us that work, eat and shop downtown are penalized.  
These actions hurt  businesses and will impact the decisions of any business looking to operate in the downtown area.  The small businesses that we frequent on a nearly daily basis will see reduced foot traffic and reduced revenues.  
I work at a company that has an office downtown and about half of our team relies on these two parking locations.   Hourly parking is a expensive option suggested by the City's Parking Services Division. It's truly not an option unless we can park at those locations for an equivalent price and for the entire day.  As it currently stands, there is a two hour maximum to the hourly parking and it would cost at least $190 per month if we followed the suggested alternative option.  This of course, assumes that there is enough unoccupied parking spaces in the downtown core.
I wish that this decision would have been communicated earlier and with real alternatives so that the alternative solutions could have been explored.  For something that must have been decided months in advance, the City of Fredericton has given only 3 business days' notice -- this really shows the amount of disregard the City must have for the non government users of those two parking facilities and businesses in the downtown core.
Our offices' lease is currently up for renewal in the next year --  this action and the periodical closures of the Upper Queen Lot will be heavily influencing our decision to stay within Fredericton city limits.
Attached is the copy of the notice found on our vehicles today.

Sincerely,
Justin Traer 
General Manager,
Indosoft, Inc




Monday, January 25, 2016

Stuttering Sound Problems with KOTOR 2 in Linux

Sometimes I just have to shake my head in disbelief -- how do some things pass through QA?

I recently picked up Knights of the Old Republic 2 on Steam.  It's been on my list of games to play for a while and it being on sale at the time and also being Linux compatible made me pull the trigger on it.

Pretty quickly I noticed that the sound was stuttering really badly. After putting up with it for way too long, some google fu lead me to a Steam forum where the Windows users where saying that they had the same problems on multicore CPU's.   Apparently, setting the CPU affinity resolves the issue.  I can confirm that this is easy to do on Linux and it does solve the problem.

Start up KOTOR2, then Alt-Tab to a xterm and run the following:


ps -ef |grep KOTOR| grep -v grep|grep -v 'bin/sh' |awk '{print $2}'` | xarg taskset -p 03

This will figure out the PID of KOTOR2 process and lock it to the 4th CPU.  You can close the xterm at this point.  Now Alt-Tab back to KOTOR2 and the sound stuttering should be fixed.  The CPU's are counted from 0, so if you have a quad core, you can lock the process to CPU's 0 through 3.   Based on your rig, there might be a better CPU to lock too. One thing to look at is which CPU is handling various IRQ requests.  I didn't have to look into this even with an AMD FX-6300 so I would suspect any recent chipset wouldn't require any further tuning either.



https://steamcommunity.com/app/208580/discussions/0/648814842278312559/

http://manpages.ubuntu.com/manpages/lucid/man1/taskset.1.html

Monday, January 18, 2016

SSD Tuning for Linux Servers




I got a call from a client that was having issues with really slow database queries.   Part of the solution was to upgrade to the latest version of Percona Server 5.6 but while we where in, we took the time to look into the tuning of the machine.

The machine was a beast -- over a 100G of RAM, a few SSD's in a RAID controller but it's reported IO was dismal in production.   UPDATE queries that should have only taken microseconds where taking minutes.

We optimized tables, removed duplicate indexes -- pretty typical stuff.   During the optimization of tables, the SSD array was getting over 500Mb/s for writes.  I was reasonably sure the hardware wasn't failing given those numbers.

We found that is was running EXT4 with barriers.   I have found that MySQL performs poorly with EXT4's barrier feature enabled. We disabled that in fstab and rebooted.

After that was all done, I looked into what we can do with respect to kernel tuning for this workload. I found that these settings should help in theory for this type of situation.

echo 0 > /sys/block/sda/queue/rotational
echo 0 > /sys/block/sda/queue/add_random
 echo 2 > /sys/block/sda/queue/rq_affinity  
echo noop > /sys/block/sda/queue/scheduler
echo 2 > /sys/block/sda/queue/nomerges
/sbin/blockdev --setra 0  /dev/sda

Given that it is an array of SSD's attached to RAID controller card, I went with the idea that the kernel should just let MySQL and the RAID controller do as much of the work as possible and stop second guessing it.  This is why the scheduler is set to 'noop' and I've set the read ahead on this logical drive to 0. 

I found that the OS was not able to ID the drive as an array of SSD's so we had to set the 'rotational' setting to 0 which is supposed to trigger certain optimizations with in the kernel for SSD's.

Next, the 'add_random' setting is disabled so that we can stop collecting entropy from this device.   Since it is an SSD, I would suspect that the entropy provided would be poor and it's adding a small amount of overhead to each IO.  The node also doesn't do much crypto work either so this was a obvious choice.

Alas, the client was keen on getting this node back into production so I was not able to do any real benchmarking on these configuration changes but in the end we did improve the situation considerably so the client was happy about that.

Monday, January 11, 2016

An alternative to 'top' for Mysql

I think that one of my most used tools at work is 'top'.   It helps me figure out what is going on with a server very quickly.  While it's not perfect, it is a good all purpose tool. Unfortunately, there is no viable alternative that I've found yet that is prepackaged so I resort to using the following commands run from the shell:

for i in `seq 1 1000`;do clear; uptime;free -m ;mysql -uxxx -pxxx information_schema  -B -e "select * from PROCESSLIST where command != 'Sleep';" 2>/dev/null | grep -v 'from PROCESSLIST where ' ;sleep 4;done

Modify these commands to suit your needs -- I decided that updating every 4 seconds was reasonable enough for my needs and that displaying the memory info in MB works for me.

This allows me to monitor what a particular MySQL server is doing, hopefully helping me spot any long running queries at a glance with out relying any application debugging support or MySQL's general_log.

It may not be as pretty or polished as iotop or top, but it can give an insight into what is going with a MySQL server



The problem with relying on any particular applications debugging logs is that any access done directly by other MySQL users will not be logged.  MySQL's general log does capture everything and definitely has it's purpose both during application testing as well as during fault resolution in the field but it only reports on completed queries.

This solution does have some short comings, it doesn't capture every single query, nor does it show if a user is hammering the database with thousands of tiny queries every second but as an equivalent tool to top for peering into MySQL, this can get the job done.