3dDeconvolve Benchmark Results

See also: SUMA Rendering Benchmark

The following data shows the time required to execute a typical 3dDeconvolve analysis. This particular analysis is provided on the AFNI web site as a standardized benchmark, so that AFNI users throughout the world can compare computer performance. The benchmark runs a short but intensive 3dDeconvolve analysis requiring about 250 megabytes of RAM at its peak.

Note that results may vary for larger or smaller analyses. For example, the Athlon 64 3400+ (2.4GHz 1M) outperformed the Mac G5 2.5GHz machines (using 1 CPU) for very large analyses.

Also, the results below are skewed for machines that have less than 256 MB of RAM, since the benchmark analysis causes a great deal of swapping in these cases. Hence, eva, seven, and ida would perform much better with more RAM, although they would still not be competitive with the newer machines.

Memory speed

Memory speed is a critical factor with 3dDeconvolve, as demonstrated by comparing neelix (PC2700) and moe (PC2100). Although they run different operating systems, the comparison is straightforward, because the operating system plays no measurable role in 3dDeconvolve performance. (This was verified when upgrading a system from Caldera Linux to FreeBSD.) 3dDeconvolve is almost entirely CPU- and memory-bound, so the process makes very little use of the OS, and the important factors are the CPU, the memory, and the quality of the compiler's optimizer. Some commercial compilers can offer a significant improvement over gcc. (See results for Dale, the Itanium 2 system, below.)

Cache Size

As of Sep 25, 2006, we can finally see the effect of an increased cache size on performance directly. Compare the results for zsazsa and deepthought. These two systems use the same motherboard, the same type of RAM (the difference in RAM size is irrelevant, since both have more than enough for the afni_speedo test, which requires only about 300 MB), and essentially the same CPU, but with different cache sizes. The 4200+ has a 512k L2 cache, vs. 1 MB for the 4400+.
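As a rough illustration of the size of this effect, the -jobs 1 times listed for zsazsa and deepthought in the results tables can be compared directly. This is only a sketch: the helper name percent_slower is hypothetical, and it assumes the two times are otherwise comparable, as argued above.

```python
def percent_slower(slow_time, fast_time):
    """Return how much longer slow_time is than fast_time, in percent."""
    if fast_time <= 0:
        raise ValueError("fast_time must be positive")
    return (slow_time - fast_time) / fast_time * 100.0


# -jobs 1 elapsed times copied from the results tables
ZSAZSA = 37.376       # listed with a 1M cache
DEEPTHOUGHT = 42.367  # listed with a 512k cache

if __name__ == "__main__":
    extra = percent_slower(DEEPTHOUGHT, ZSAZSA)
    print(f"deepthought takes {extra:.1f}% longer than zsazsa with -jobs 1")
```

With the table values, the smaller-cache system takes roughly 13% longer on a single job.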

AFNI Version, etc.

Results can also be affected by the version of AFNI, the compile options, and the compiler used. Improvements to the 3dDeconvolve code and the gcc object code optimizer will bring about gradual improvements over the years. For example, comparing the results below for "moe" and "oldmoe", we see that 3dDeconvolve from AFNI_2006_06_30_1332 runs quite a bit faster than the version from a year earlier on the exact same hardware. Some commercial compilers, such as cc from SCO Unix and the Intel compiler ICC, can also produce far better results than gcc. Be sure to consider this when comparing results from this site to your own.

AMD64 Results

Note that running a 64-bit operating system DOES NOT improve performance of 3dDeconvolve on AMD systems. This is because 3dDeconvolve is floating-point intensive, and the floating point hardware hasn't changed much from the old 32-bit Pentium/Athlon architectures, which already had a 64-bit data bus and SSE instructions.

This is good news for AMD64 users. The AMD64 chip has a VERY good price/performance ratio, and there is no need to immediately upgrade your OS. Just slip in a new motherboard, CPU, and RAM (be sure to use the fastest RAM available for your motherboard) under your 32-bit OS, and you'll get the full performance benefit.

Although more than 98% of the open source software we run on our FreeBSD workstations *is* ported to FreeBSD-AMD64, there are a few critical closed-source applications that are not yet available (mainly the commercial GeForce drivers). For this reason we are choosing to run FreeBSD i386 on our AMD64 systems for now. Although it is possible to run most i386 binaries on FreeBSD AMD64, drivers are a different story, and it's not that desirable anyway. The only major benefit to running in pure AMD64 mode is the ability to address more than 4GB of memory, which is not yet an issue for most fMRI analyses. Linux users may encounter similar limitations, so I recommend verifying availability of all your software before installing a 64-bit version of the OS.

Hyperthreading

While falling somewhat short of dual-core processor performance, Intel's hyperthreading technology, primarily present in Pentium 4 and Xeon processors, can offer substantial performance gains where multiple threads are a possibility. (See results for Tamo below.)

However, hyperthreading should be used with caution. The technology has a serious security problem for which there is currently no software-based workaround; i.e., the only way to close the security hole is to disable hyperthreading or replace the CPU. See this article for details. The security hole allows LOCAL users to gain access to privileged information. On FreeBSD, hyperthreading is disabled by default. If you trust all local users on a system, you can enable hyperthreading using two simple steps:

  1. Enable SMP (this is already enabled in most recent FreeBSD kernels). On PC-BSD, this can be done in the PC-BSD System Manager. On stock FreeBSD systems, consult the FreeBSD Handbook for instructions on adding SMP to the kernel config.
  2. Enable hyperthreading by adding "machdep.hyperthreading_allowed=1" to /etc/sysctl.conf. This will take effect upon the next reboot. To enable it immediately without rebooting, run the command "sysctl machdep.hyperthreading_allowed=1".

The Data

All results were obtained using AFNI_2006_06_30_1332, except where there are two entries for the same machine. In these cases, older results are from an earlier version of AFNI. Note that there have been significant improvements to 3dDeconvolve, so direct comparison of results from different AFNI versions will not provide an accurate view of hardware speed.

The first table is sorted according to elapsed time for the 3dDeconvolve benchmark using only 1 CPU. This should indicate the preferred machines for running intensive analyses other than 3dDeconvolve that cannot utilize both CPUs. This assumes that 3dDeconvolve is a reasonably good benchmark for predicting the performance of other CPU intensive programs, which is often true, but results can vary wildly depending on the exact behavior of the particular program.

The second table ranks computers by their elapsed time for the 3dDeconvolve benchmark using -jobs 2, which utilizes both CPUs where available. This should indicate the preferred machines to run 3dDeconvolve analyses. Note that you MUST use the -jobs 2 option with 3dDeconvolve to achieve this performance. Also note that using -jobs 2 on a machine with only one CPU does not generally hurt performance, so you can safely include it in analysis scripts without regard for where they will be run.
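The gain from -jobs 2 can be expressed as a speedup factor and a per-job efficiency. The sketch below uses three rows copied from the tables; the function names speedup and efficiency are hypothetical helpers, not part of AFNI.

```python
def speedup(t_jobs1, t_jobsN):
    """Ratio of the -jobs 1 time to the time with more jobs (higher is better)."""
    return t_jobs1 / t_jobsN


def efficiency(t_jobs1, t_jobsN, jobs):
    """Speedup divided by the number of jobs; 1.0 would be perfect scaling."""
    return speedup(t_jobs1, t_jobsN) / jobs


# (hostname, CPUs, -jobs 1 time, -jobs 2 time), copied from the results tables
SAMPLES = [
    ("tessa", 2, 25.824, 14.128),
    ("moe", 2, 166.905, 88.842),
    ("zsazsa", 1, 37.376, 37.283),  # single CPU: -jobs 2 neither helps nor hurts
]

if __name__ == "__main__":
    for host, cpus, t1, t2 in SAMPLES:
        print(f"{host:8s} CPUs={cpus} speedup={speedup(t1, t2):.2f} "
              f"efficiency={efficiency(t1, t2, 2):.2f}")
```

For the two dual-CPU rows the speedup comes out a little above 1.8, while the single-CPU row stays at essentially 1.0, in line with the note that -jobs 2 is safe to leave in scripts.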

The third table shows results from a 4-CPU Itanium server loaned to us by the physiology department for evaluation.
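The two rankings are the same rows sorted on different columns, so a machine's position can change considerably between them. A minimal sketch with three rows copied from the tables (the rank helper is hypothetical):

```python
from operator import itemgetter

# (hostname, CPUs, -jobs 1 time, -jobs 2 time), copied from the results tables
ROWS = [
    ("zsazsa", 1, 37.376, 37.283),
    ("mach", 2, 48.100, 23.111),
    ("tamo", 2, 43.124, 32.895),
]

JOBS1, JOBS2 = 2, 3  # tuple positions of the two timing columns


def rank(rows, column):
    """Return hostnames ordered from fastest to slowest on the given column."""
    return [row[0] for row in sorted(rows, key=itemgetter(column))]


if __name__ == "__main__":
    print("by -jobs 1:", rank(ROWS, JOBS1))
    print("by -jobs 2:", rank(ROWS, JOBS2))
```

Here the single-CPU machine is fastest with -jobs 1 but slowest with -jobs 2, which is why the two tables should be read for different purposes.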

Computers ranked by speed with -jobs 1

Hostname | CPU/Speed/Cache/OS[/Motherboard]/Memory | CPUs | -jobs 1 | -jobs 2
bilbo | Mac Pro Quad Core Xeon 2.8GHz Leopard 4G | 4 | 15.725 | 8.639
tessa | iMac Core Duo 2.1GHz Tiger 3G | 2 | 25.824 | 14.128
smokey | iMac Core Duo 2.1GHz Tiger 1G | 2 | 26.191 | 14.426
stevemac | iMac Core Duo 2.3GHz Tiger 2G | 2 | 26.284 | 15.002
fmricourse | Mac Mini Core Duo 2.0GHz 2G DDR2 667MHz | 2 | 27.609 | 15.143
cheech | Athlon 64 X2 6000+ 3.0GHz 1M FreeBSD 6.3 M2NE 4G 4xDDR800 | 2 | 29.747 | 17.382
maggie | Athlon 64 X2 6000+ 3.0GHz 1M FreeBSD 6.3 M2NE 4G 4xDDR800 | 2 | 31.549 | 18.567
apu | Athlon X2 5200+ 2.6GHz 1M FreeBSD 6.2-AMD64 M2N4 SLI 4G 4xPC8000 | 2 | 31.958 | 18.785
maui | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.895 | 21.172
maverick | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.948 | 21.177
eden | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.963 | 21.087
zsazsa | Athlon 64 3700+ 2.2GHz 1M FreeBSD 5.4 A8N SLI 1G 2xPC3200 DDR400 | 1 | 37.376 | 37.283
bart | Athlon 64 4600+ 2.4GHz 512K FreeBSD 5.4 M2N4 SLI 4G 4xPC2-5300 | 2 | 38.284 | 22.597
gollum | Athlon 64 4000+ 2.4GHz 1M FreeBSD 5.4 A8N 1G 2xPC3200 DDR400 | 1 | 38.785 | 39.001
sahag | MacBook Core Duo 2.0GHz Tiger | 2 | 40.263 | 21.639
ruth | Mac G5 2.5GHz Tiger 1G | 2 | 42.339 | 26.692
deepthought | Athlon 64 X2 4200+ 2.2GHz 512k FreeBSD 5.4 A8N SLI 1G 2xPC3200 DDR400 A8N SLI | 2 | 42.367 | 27.319
tamo | Pentium 4 3.0GHz PC-BSD 1.3 1G Hyperthreading enabled (See notes above) | 2 | 43.124 | 32.895
lisa | Athlon 64 3700+ 2.4GHz 1M FreeBSD 5.4 1G PC3200 | 1 | 43.574 | 43.549
fmricourse | Mac Mini Core Duo 1.66GHz 1G | 2 | 44.076 | 27.845
mame | Mac G5 2.5GHz Tiger 2G | 2 | 44.221 | 26.926
cairo | Athlon 64 3400+ 2.2GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 44.684 | 44.475
wiggum | Athlon 64 4000+ 2.4GHz 1M A21G FreeBSD 5.4 2G 2xPC3200 DDR400 | 1 | 45.098 | 45.068
heron-pcbsd | PC-BSD 1.3 512M under Parallels on a MacBook 2GHz 2G RAM | 1 | 45.451 | 47.471
edmund | Athlon 64 3400+ 2.4GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 45.606 | 45.626
mach | MacBook Pro Core Duo 2.0GHz 2G DDR2 667 | 2 | 48.100 | 23.111
lasiked | Pentium 4 3.2GHz 1mb cache SuSE 9.3 1G PC3200 | 2 | 48.190 | 35.450
lappy | Pentium IV 2 GHz, Ubuntu 6 i386 2GB | 1 | 52.821 | 51.624
gandalf | Mac G5 1.8GHz Tiger 1G | 2 | 54.178 | 32.069
knuth | Athlon 64 3400+ 2.4GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 54.617 | 55.006
birdie | Mac G5 1.8GHz Tiger 1G | 2 | 58.583 | 34.543
yoshi | iMac G5 2.0GHz Tiger | 1 | 72.364 | 71.723
neelix | Athlon XP 2400+ 2.0GHz FreeBSD 5.4 2G PC2700 | 1 | 111.970 | 112.021
bagua | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 1G | 1 | 129.932 | 129.759
gaea | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 1G | 1 | 158.476 | 158.090
moe | Athlon MP 2800+ 2.1GHz FreeBSD 5.4 1G PC2100 | 2 | 166.905 | 88.842
gizmo | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 2G | 1 | 178.742 | 182.744
garcia | Athlon 1.2GHz PC-BSD 1.3 512M | 1 | 201.768 | 202.654
america | Athlon 857Mhz FreeBSD 5.4 1G | 1 | 213.791 | 213.856
krusty | Mac G4 1.0GHz Panther 1.5G | 2 | 239.993 | 140.211
oldmoe | Athlon MP 2800+ 2.1GHz FreeBSD 5.2 1G PC2100 | 2 | 241.547 | 126.772
apu | Pentium III 1.0GHz FreeBSD 5.4 2G | 2 | 241.640 | 166.106
winston | Mac Mini G4 1.5GHz Tiger 512M | 1 | 261.337 | 267.475
sabrina | Mac Powerbook G4 1.5GHz Panther 512M | 1 | 273.163 | 278.657
uhura | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 280.314 | 188.156
tuvok | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 282.713 | 189.227
bombadil | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 283.914 | 189.640
belanna | Pentium III 800Mhz FreeBSD 5.2 1G | 2 | 315.913 | 200.014
sebo | Pentium III 600Mhz FreeBSD 5.4 1G PC100 | 2 | 336.670 | 227.926
loraine | Pentium III 666Mhz FreeBSD 5.4 768M PC133 | 1 | 355.953 | 348.199
lursa | Pentium III Xeon 500Mhz FreeBSD 5.2 1G | 2 | 408.718 | 265.617
deb | Pentium II 450Mhz FreeBSD 5.4 512M | 2 | 422.513 | 274.307
bart | Pentium III 450Mhz FreeBSD 5.4 512M | 2 | 431.547 | 274.178

Computers ranked by speed with -jobs 2

Hostname | CPU/Speed/Cache/OS[/Motherboard]/Memory | CPUs | -jobs 1 | -jobs 2
bilbo | Mac Pro Quad Core Xeon 2.8GHz Leopard 4G | 4 | 15.725 | 8.639
tessa | iMac Core Duo 2.1GHz Tiger 3G | 2 | 25.824 | 14.128
smokey | iMac Core Duo 2.1GHz Tiger 1G | 2 | 26.191 | 14.426
stevemac | iMac Core Duo 2.3GHz Tiger 2G | 2 | 26.284 | 15.002
fmricourse | Mac Mini Core Duo 2.0GHz 2G DDR2 667MHz | 2 | 27.609 | 15.143
cheech | Athlon 64 X2 6000+ 3.0GHz 1M FreeBSD 6.3 M2NE 4G 4xDDR800 | 2 | 29.747 | 17.382
maggie | Athlon 64 X2 6000+ 3.0GHz 1M FreeBSD 6.3 M2NE 4G 4xDDR800 | 2 | 31.549 | 18.567
apu | Athlon X2 5200+ 2.6GHz 1M FreeBSD 6.2-AMD64 M2N4 SLI 4G 4xPC8000 | 2 | 31.958 | 18.785
eden | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.963 | 21.087
maui | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.895 | 21.172
maverick | iMac Core Duo 2.0GHz Tiger 2G | 2 | 36.948 | 21.177
sahag | MacBook Core Duo 2.0GHz Tiger | 2 | 40.263 | 21.639
bart | Athlon 64 4600+ 2.4GHz 512K FreeBSD 5.4 M2N4 SLI 4G 4xPC2-5300 | 2 | 38.284 | 22.597
mach | MacBook Pro Core Duo 2.0GHz 2G DDR2 667 | 2 | 48.100 | 23.111
ruth | Mac G5 2.5GHz Tiger 1G | 2 | 42.339 | 26.692
mame | Mac G5 2.5GHz Tiger 2G | 2 | 44.221 | 26.926
deepthought | Athlon 64 X2 4200+ 2.2GHz 512k FreeBSD 5.4 A8N SLI 1G 2xPC3200 DDR400 A8N SLI | 2 | 42.367 | 27.319
fmricourse | Mac Mini Core Duo 1.66GHz 1G | 2 | 44.076 | 27.845
gandalf | Mac G5 1.8GHz Tiger 1G | 2 | 54.178 | 32.069
tamo | Pentium 4 3.0GHz PC-BSD 1.3 1G Hyperthreading enabled (See notes above) | 2 | 43.124 | 32.895
birdie | Mac G5 1.8GHz Tiger 1G | 2 | 58.583 | 34.543
lasiked | Pentium 4 3.2GHz 1mb cache SuSE 9.3 1G PC3200 | 2 | 48.190 | 35.450
zsazsa | Athlon 64 3700+ 2.2GHz 1M FreeBSD 5.4 A8N SLI 1G 2xPC3200 DDR400 | 1 | 37.376 | 37.283
gollum | Athlon 64 4000+ 2.4GHz 1M FreeBSD 5.4 A8N 1G 2xPC3200 DDR400 | 1 | 38.785 | 39.001
lisa | Athlon 64 3700+ 2.4GHz 1M FreeBSD 5.4 1G PC3200 | 1 | 43.574 | 43.549
cairo | Athlon 64 3400+ 2.2GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 44.684 | 44.475
wiggum | Athlon 64 4000+ 2.4GHz 1M A21G FreeBSD 5.4 2G 2xPC3200 DDR400 | 1 | 45.098 | 45.068
edmund | Athlon 64 3400+ 2.4GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 45.606 | 45.626
heron-pcbsd | PC-BSD 1.3 512M under Parallels on a MacBook 2GHz 2G RAM | 1 | 45.451 | 47.471
lappy | Pentium IV 2 GHz, Ubuntu 6 i386 2GB | 1 | 52.821 | 51.624
knuth | Athlon 64 3400+ 2.4GHz 1M FreeBSD i386 5.4 1G PC3200 | 1 | 54.617 | 55.006
yoshi | iMac G5 2.0GHz Tiger | 1 | 72.364 | 71.723
moe | Athlon MP 2800+ 2.1GHz FreeBSD 5.4 1G PC2100 | 2 | 166.905 | 88.842
neelix | Athlon XP 2400+ 2.0GHz FreeBSD 5.4 2G PC2700 | 1 | 111.970 | 112.021
oldmoe | Athlon MP 2800+ 2.1GHz FreeBSD 5.2 1G PC2100 | 2 | 241.547 | 126.772
bagua | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 1G | 1 | 129.932 | 129.759
krusty | Mac G4 1.0GHz Panther 1.5G | 2 | 239.993 | 140.211
gaea | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 1G | 1 | 158.476 | 158.090
apu | Pentium III 1.0GHz FreeBSD 5.4 2G | 2 | 241.640 | 166.106
gizmo | Athlon XP 1800+ 1.5GHz FreeBSD 5.4 2G | 1 | 178.742 | 182.744
uhura | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 280.314 | 188.156
tuvok | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 282.713 | 189.227
bombadil | Pentium III 800Mhz FreeBSD 5.4 1G | 2 | 283.914 | 189.640
belanna | Pentium III 800Mhz FreeBSD 5.2 1G | 2 | 315.913 | 200.014
garcia | Athlon 1.2GHz PC-BSD 1.3 512M | 1 | 201.768 | 202.654
america | Athlon 857Mhz FreeBSD 5.4 1G | 1 | 213.791 | 213.856
sebo | Pentium III 600Mhz FreeBSD 5.4 1G PC100 | 2 | 336.670 | 227.926
lursa | Pentium III Xeon 500Mhz FreeBSD 5.2 1G | 2 | 408.718 | 265.617
winston | Mac Mini G4 1.5GHz Tiger 512M | 1 | 261.337 | 267.475
bart | Pentium III 450Mhz FreeBSD 5.4 512M | 2 | 431.547 | 274.178
deb | Pentium II 450Mhz FreeBSD 5.4 512M | 2 | 422.513 | 274.307
sabrina | Mac Powerbook G4 1.5GHz Panther 512M | 1 | 273.163 | 278.657
loraine | Pentium III 666Mhz FreeBSD 5.4 768M PC133 | 1 | 355.953 | 348.199

Results from special 4-CPU systems

Dale

The results from Dale require some additional explanation. These results were obtained by using an expensive commercial compiler from Intel called "icc", which performs some automatic parallelization when -O3 is used. Hence, regardless of how many 3dDeconvolve jobs were used, all 4 CPUs played a role in the analysis. With -O2 instead of -O3, icc's auto-parallelization is disabled, and the results of changing -jobs follow the expected pattern more closely.
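The effect of auto-parallelization shows up when each of Dale's builds is scaled against its own -jobs 1 time. The sketch below uses Dale's rows from the 4-CPU table in this section; the scaling helper and the DALE dictionary are hypothetical names used only for illustration.

```python
# Elapsed times for -jobs 1, 2, 3, 4, copied from Dale's rows in the 4-CPU table
DALE = {
    "icc -O3": [34.991, 22.409, 19.884, 17.480],
    "icc -O2": [129.962, 67.371, 47.288, 37.921],
    "gcc -O3": [196.608, 100.212, 68.595, 52.987],
}


def scaling(times):
    """Return the speedup relative to the first (-jobs 1) entry for each job count."""
    base = times[0]
    return [base / t for t in times]


if __name__ == "__main__":
    for build, times in DALE.items():
        factors = ", ".join(f"{s:.2f}" for s in scaling(times))
        print(f"{build}: {factors}")
```

With these numbers, the icc -O3 build gains only about 2x going from -jobs 1 to -jobs 4, while the icc -O2 and gcc -O3 builds gain roughly 3.4x and 3.7x. That is consistent with the -O3 build already using the extra CPUs at -jobs 1.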

Note also that at the time of this writing (12-20-2004), one of Dale's CPUs costs $3500, about the same cost as a complete Mac G5 system. Multiply that by 4, add the cost of the rest of the hardware, and the commercial compiler from Intel, and it's hardly a bargain for the kind of performance benefit we're seeing.

Power5

Results as of 3-18-2005 for an IBM Power5 server with 4 Power5 (PPC 64) CPUs. This machine was on loan from IBM, and currently has only gcc installed.

Leviathan

Results as of 11-9-2005 for a dual-CPU, dual-core Opteron system, using gcc. Note that the system uses DDR333 RAM. It would likely perform significantly better with DDR400.

Hostname | Machine type | Processors | -jobs 1 | -jobs 2 | -jobs 3 | -jobs 4
bilbo | Mac Pro Quad Core Xeon 2.8GHz Leopard 4G | 4 | 15.725 | 8.639 | 6.508 | 5.303
power5 | IBM Power5 1.6GHz Linux gcc -O3 | 4 | 41.224 | 21.099 | 17.235 | 13.511
power5 | IBM Power5 1.6GHz Linux gcc -O2 | 4 | 40.706 | 21.579 | 18.992 | 14.851
dale | Itanium 1.4GHz RedHat icc -O3 | 4 | 34.991 | 22.409 | 19.884 | 17.480
anencka | Mac G5 quad 2.5GHz 6G PC4200 (32-bit binary!) | 4 (2 dual core) | 44.363 | 24.305 | 20.646 | 19.987
leviathan | Opteron 2.0GHz SuSE 9.3 Pro 1MB cache/core, DDR333 RAM | 4 (2 dual-core) | 69.237 | 37.822 | 28.067 | 24.919
dale | Itanium 1.4GHz RedHat icc -O2 | 4 | 129.962 | 67.371 | 47.288 | 37.921
dale | Itanium 1.4GHz RedHat gcc -O3 | 4 | 196.608 | 100.212 | 68.595 | 52.987