So I’ve really been digging Kevin Closson’s blog lately. Back at the beginning of this month he had another post that caught my attention, about running Oracle on Opteron, in which he made the point that these boxes should always be run in NUMA mode (not SUMA). This grabbed my eye because I’ve been delving a bit deeper than usual into CPU issues recently. In particular, on both of my past two tuning engagements we’ve looked pretty closely at CPU utilization. At the first we wanted to see if Oracle was effectively utilizing Hyper-Threading. At the second we were investigating high CPU wait events from the database (which turned out not to be CPU-related!). I worked up some quick scripts to help analyze the CPU patterns in both of these situations. But before I get into that – let me go on a quick tangent about what originally got me interested in this. :)
Hobby Stuff
Kevin’s post mainly caught my attention because I recently spent a lot of time researching CPUs while considering an upgrade for my server at home [pictured], where I have about 30 virtual machines running Red Hat, SUSE, Solaris x86, CentOS, OEL, and Oracle 8i, 9i and 10g – both single instance and RAC. (Of course I can’t run all of these VMs at once – I can usually run about four at a time.) I was looking at AMD’s dual-core Athlon X2 processors because after Intel’s Core 2 Duo started blowing away the Athlons in benchmark tests, AMD started slashing prices like crazy to stay competitive.
I’m a pretty big fan of virtualization. It’s interesting to me how everybody always talks about dynamic provisioning of resources when they’re selling RAC – like it’s some new thing. Yet we’ve had this and more for quite a while – with technologies like pSeries DLPARs, ESX, or Xen. Of course it’s not a panacea for every resource management challenge – but it is very solid technology. The only thing I’m not entirely convinced of just yet is that x86 virtualization technologies can really isolate problems in one VM and guarantee performance to the others. Some guys at Clarkson University in New York just published a research paper in May about virtualization and performance, and VMware seems to be the best technology at the moment. But I’m not sure I’d run my production Oracle system on an x86 VM just yet, although I suspect that it’s just a matter of time. (Especially with Vanderpool and Pacifica available in all of the latest processors now.)
Anyway, all of that is just to say that I’ve been digging a bit into CPU issues lately!
Monitoring SMP and Hyper-Threading
So – moving on – one thing I’ve always been curious about is how well Linux handles Hyper-Threading. With Hyper-Threading each physical CPU appears to be two CPUs – however there aren’t two cores; there are just hardware optimizations that store two sets of state information and take better advantage of spare cycles after conditions such as a cache miss, branch misprediction, or data dependency. So naturally the process scheduler in your operating system needs to be aware that it’s dealing with hyper-threaded CPUs so it can schedule processes accordingly. Actually there are some similarities to NUMA platforms, just as Kevin has pointed out for multi-core chips.
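If you want to see which virtual CPUs are actually Hyper-Threading siblings on your own box, a quick way to check is to pull the “physical id” field out of /proc/cpuinfo – siblings report the same physical id. Here’s a rough sketch (it assumes your kernel exposes that field, which HT-aware 2.6 kernels generally do):

# Sketch: map each logical (virtual) CPU to its physical package.
# Hyper-Threading siblings share the same "physical id" in /proc/cpuinfo.
awk -F': ' '
  /^processor/   { cpu = $2 }
  /^physical id/ { print "logical CPU " cpu " -> physical package " $2 }
' /proc/cpuinfo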
As it turns out, Oracle and RHEL4 running on four Xeon processors (eight virtual processors) balanced the workload across the physical CPUs very well – at least on the system I was looking at two weeks ago. Here are some graphs of CPU utilization for three consecutive days with both batch and user workloads:
The really cool thing about these graphs is how you can see a clear pattern in how Linux’s process scheduler assigns the workload. It always starts with the sixth virtual processor (third physical processor). It then places work on the fourth virtual processor (second physical). Next it utilizes the eighth (fourth physical). And lastly it goes to number two (first physical). After this it goes to the other half of each physical processor in reverse order: one, seven, three, five.
I haven’t read through the source code to see what is causing this pattern or whether it’s intentional – but there’s at least one observable advantage: it’s being smart about Hyper-Threading. Linux seems to be utilizing all four physical processors pretty well, balancing the total load across them fairly evenly – it places work on one virtual processor from each physical processor before doubling up on any of them.
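If you want to watch the scheduler doing this in real time, one simple trick (a sketch – adjust the grep for your own instance’s process names) is to poll the psr column from ps, which shows the processor each process last ran on:

# Sketch: every couple of seconds, show which virtual CPU each oracle
# process last ran on (psr), along with its recent CPU usage (pcpu).
while true; do
  date
  ps -eo pid,psr,pcpu,comm | grep '[o]racle'
  sleep 2
done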
Making Per-CPU Utilization Graphs
It’s not hard at all to generate these graphs yourself – from either sar or mpstat. What I actually measured was idle time – in Excel I subtracted it from 100 to get CPU usage. And all I did was write a short awk script to take the idle time from each processor and list it horizontally (one column per CPU) rather than vertically.
[root@testbox1 ~]# mpstat -P ALL 2|awk '{
> if($3>=0) {tt=$1; cput[$3]=$10;}
> if($3=='7') {print tt" "cput[0]" "cput[1]" "cput[2]" "cput[3]" "cput[4]" "cput[5]" "cput[6]" "cput[7];}
> }'
02:55:35 96.48 100.00 100.50 101.01 100.50 100.50 100.50 100.50
02:55:37 66.83 90.59 99.01 81.68 99.01 99.01 94.06 98.51
02:55:39 70.35 100.50 100.50 100.50 100.50 100.00 100.50 100.50
02:55:41 82.00 99.50 100.00 100.00 100.00 100.50 96.50 99.00
02:55:43 80.60 97.51 99.50 100.00 99.50 99.50 99.50 99.50
02:55:45 82.09 85.07 99.50 92.54 99.50 97.51 94.53 97.51
02:55:47 94.47 100.50 100.50 100.50 100.50 100.50 100.50 100.50
02:55:49 93.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
02:55:51 87.50 98.00 100.50 98.50 96.00 93.00 100.00 100.00
02:55:53 94.00 100.00 99.50 100.50 100.00 100.00 100.50 100.00
02:55:55 86.07 99.50 99.50 99.50 99.50 99.50 99.00 99.50
[root@testbox1~]# sar -P ALL|awk '{
> if($3>=0) {tt=$1; cput[$3]=$8;}
> if($3=='7') {print tt" "cput[0]" "cput[1]" "cput[2]" "cput[3]" "cput[4]" "cput[5]" "cput[6]" "cput[7];}
> }'
12:10:01 96.11 66.38 89.88 56.50 95.83 39.14 94.36 59.26
12:20:02 87.93 28.84 72.04 33.25 97.90 8.88 66.89 48.87
12:30:01 27.91 24.80 31.20 49.60 94.17 0.65 28.67 36.21
12:40:01 51.79 34.29 61.36 31.19 92.90 1.60 42.30 48.16
12:50:01 51.21 44.47 57.79 40.21 95.27 3.78 46.61 50.98
01:00:01 50.20 47.89 53.62 45.47 97.22 2.15 47.77 50.96
01:10:01 47.83 47.87 52.70 45.54 98.99 0.56 47.07 50.43
01:20:01 52.08 41.39 60.32 39.85 95.09 7.33 44.89 55.40
01:30:01 73.70 11.95 78.59 25.64 98.17 16.72 61.79 40.10
01:40:01 57.65 38.74 59.51 35.93 95.22 3.91 49.07 49.16
01:50:01 87.06 65.59 85.22 43.18 97.93 26.45 76.62 63.46
02:00:01 78.71 92.01 80.81 60.39 96.83 46.06 86.33 77.20
02:10:01 94.93 94.91 95.56 90.55 99.15 95.00 97.96 96.74
02:20:01 99.20 99.49 99.69 99.81 99.78 99.83 99.94 99.91
02:30:01 99.54 99.73 99.76 99.84 99.81 99.84 99.86 99.76
02:40:01 99.10 98.62 99.64 99.46 99.77 99.70 99.84 99.83
02:50:01 98.92 99.55 99.34 99.68 99.50 99.85 99.90 99.93
03:00:01 99.39 99.71 99.41 99.68 99.87 99.86 99.85 99.77
03:10:01 99.21 98.88 99.22 99.69 99.86 99.89 99.85 99.92
03:20:01 99.39 99.73 99.26 99.69 99.92 99.92 99.77 99.77
03:30:01 98.29 98.74 99.69 99.72 99.94 99.92 99.80 99.83
03:40:01 98.16 99.31 93.73 96.59 94.49 98.26 99.51 99.09
03:50:01 93.54 97.21 80.01 87.16 91.36 88.81 92.49 95.92
04:00:01 98.65 99.45 94.80 99.36 99.79 99.65 99.86 99.92
04:10:01 94.03 88.50 98.26 94.96 99.36 99.01 99.91 99.90
04:20:01 93.88 93.47 99.50 98.94 99.75 99.70 99.94 99.10
04:30:01 92.14 88.39 99.19 97.81 98.85 99.60 99.72 99.80
04:40:01 87.21 69.08 98.45 96.75 99.77 97.62 99.07 99.88
04:50:01 88.14 85.09 98.50 98.45 99.64 99.24 99.67 99.40
05:00:01 94.25 93.42 98.40 98.18 99.54 99.68 99.90 99.28
05:10:01 87.82 77.37 97.30 82.34 95.75 86.99 88.39 93.80
05:20:01 85.92 66.98 91.13 73.96 91.27 81.73 95.41 86.62
05:30:01 89.94 58.90 92.43 53.22 95.05 56.58 79.38 60.81
05:40:01 93.17 90.69 98.46 78.62 99.74 83.38 97.34 97.84
05:50:01 93.04 90.57 91.94 91.53 99.47 87.84 99.35 96.89
06:00:01 89.86 92.43 99.27 90.49 99.58 88.30 97.24 97.72
06:10:02 90.15 94.42 98.75 95.06 98.69 94.76 99.17 98.77
06:20:01 86.69 94.35 99.08 81.11 99.73 77.60 98.31 93.25
06:30:02 88.98 93.90 98.92 92.05 99.27 94.65 98.45 97.26
06:40:01 86.98 94.58 98.43 89.63 99.46 85.81 97.43 97.50
06:50:01 86.09 92.47 96.88 88.99 99.62 88.06 97.06 93.53
07:00:01 84.59 92.04 97.51 89.49 99.34 89.93 96.92 97.65
07:10:01 84.75 92.47 98.64 86.17 99.44 87.83 98.01 93.79
07:20:01 86.29 95.04 97.79 87.89 99.18 88.66 96.69 96.65
07:30:01 83.34 94.12 98.43 87.12 99.38 90.90 97.14 92.63
07:40:01 82.78 93.06 98.45 83.36 98.68 90.06 97.52 93.63
07:50:01 83.11 93.55 98.33 84.85 99.44 86.97 96.10 95.83
08:00:01 82.38 92.70 96.20 80.62 99.65 80.24 91.48 94.26
08:10:01 80.02 93.47 94.10 78.48 99.31 81.05 93.96 95.58
08:20:01 84.17 93.49 96.10 81.34 99.69 86.06 95.84 94.75
08:30:01 79.17 91.22 98.32 86.08 99.46 91.30 97.45 96.88
08:40:01 82.04 92.34 95.21 81.49 97.23 87.68 92.38 77.70
08:50:01 82.81 95.17 96.65 86.56 99.29 87.29 98.41 96.40
09:00:01 84.06 95.45 98.51 78.78 99.41 86.66 98.98 95.17
[root@testbox1 ~]#
I like that output a lot better anyway; it makes it easy to see at a glance something like a single process hogging a specific CPU for an extended period of time. And this data is easy to pull into Excel to generate some quick graphs.
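If you wanted to take that one step further, here’s a rough sketch of a variation (untested, and it assumes the same sar column layout as above – $3 is the CPU number and $8 is %idle) that works out the CPU count on its own, flips idle into busy (100 minus idle), and writes a comma-separated file with a header row so Excel can chart it directly; cpu_busy.csv is just a name I made up:

sar -P ALL | awk '
  # keep only the per-CPU sample lines ($3 is a CPU number, $8 is %idle)
  $3 ~ /^[0-9]+$/ {
    t = $1 " " $2                       # keep AM/PM so samples stay distinct
    if (!(t in seen)) { seen[t] = ++n; times[n] = t }
    busy[t, $3] = 100 - $8              # idle -> busy
    if ($3 + 0 > maxcpu) maxcpu = $3 + 0
  }
  END {
    # header row for Excel
    printf "time"
    for (c = 0; c <= maxcpu; c++) printf ",cpu%d", c
    printf "\n"
    # one row per sample interval, one column per virtual CPU
    for (i = 1; i <= n; i++) {
      printf "%s", times[i]
      for (c = 0; c <= maxcpu; c++) printf ",%s", busy[times[i], c]
      printf "\n"
    }
  }' > cpu_busy.csv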
Wish I had more time to write these. :) I’ve got two more good battle stories from this week… one about CPU wait events that weren’t CPU problems and another about a new CBO optimization in 10g that was killing some queries on a PeopleSoft system. But I’ll have to save those for some other posts… it’s time to call it a night!
Resources
A few other good links for further reading on virtualization…
- A Comparison of Software and Hardware Techniques for x86 Virtualization – Aug ’06 – paper explaining VMware virtualization internals and why current processor-based x86 virtualization often gets outperformed by software implementations.
- Enforcing Performance Isolation Across Virtual Machines in Xen – May ’06 – paper describing how Xen implements performance isolation.