Be aware of latency in distributed systems

I want to begin by saying that this specific example is very niche problem, but real.

A client moved their LAMP application to a new hosting provider and noticed that simple operations were taking several seconds longer, yet profiled queries were quicker if anything.

I setup a test database server with identical specs with the same provider and did some simple profiling by setting long_query_time=0 and turning the slow query log on. I then ran the simple operation, took a copy of the slow query log and filtered out just the queries; ignoring SET TIMESTAMP.

grep -vP "^(#|SET timestamp)" /var/lib/mysql/mysql-slow.log > ~/test.sql

I added USE database_name; to the top of test.sql and then tested the execution speed from the database server itself:

# time mysql >/dev/null < ~/test.sql

real	0m0.475s
user	0m0.351s
sys	0m0.028s

Half a second for the entire operation is not something users are likely to notice, so the problem is elsewhere. Next step was to copy test.sql onto the server the web application is running from, and run the same test. For comparison purposes I also tested from the old webserver to the old database server:

oldserver:~$ time mysql -uprofileuser -h olddatabase -p$PASS >/dev/null < test.sql

real	0m1.805s
user	0m0.385s
sys	0m0.051s

oldserver:~$ ping olddatabase
64 bytes from x.x.x.x: icmp_seq=1 ttl=64 time=0.120 ms

newserver:~$ time mysql -uprofileuser -h newdatabase -p$PASS >/dev/null < test.sql

real	0m5.563s
user	0m0.817s
sys	0m0.049s

newserver:~$ ping newdatabase
64 bytes from x.x.x.x: icmp_seq=1 ttl=57 time=1.34 ms

So there is a 3.758 second difference between execution times on the two remote servers, and a 5.088 second increase when comparing locally to over the network.

The latency difference was only 1.22ms so it's surprising. Checking the SQL file:

$ wc -l test.sql
4198 test.sql

This "simple operation" runs 4198 database queries. Most of them are cached so it only takes half a second for the server to execute them all, but the data needs to get to the other server, and it takes 1.34ms each time. . 4198 * 1.34ms = 5.625 seconds.

Latency is important.