to attempt to get back to the original question, this (click me) i would say sums up for me why controlled blind testing is the only way, and why we really can't 'trust our ears' when trying to decide if one thing is better than another. it's a bit long-winded, so i will quote what i think are the main things standing in the way of just sitting down and listening.
1) LEVEL DIFFERENCES. If one amplifier plays louder than the other, it will sound better. Louder music sounds better to us. That is why we like to listen to our music loudly.
The gain and power of amplifiers varies. Therefore, for a specific volume control setting on the preamp used in the test, different amplifiers will play at slightly different loudness levels.
But the audiophile in the example above probably didn't even attempt to set the preamp level at exactly the same level for both amplifiers. He probably just turned up the level to where it sounded good to him. He made no attempt to match the levels at all because he was unaware that this was an uncontrolled variable.
In any case, the amps probably would have had different loudness levels even if the preamp setting was identical. This is because amplifiers have different gain and power levels.
Note that human hearing is extremely sensitive to loudness. Scientific tests show that we can hear and accurately detect very tiny differences in loudness (1/4 dB is possible). At the same time, with music we don't recognize a level difference as a difference in loudness until the levels are a couple of dB apart. This is due to the transient and dynamic nature of music, which makes subtle level differences hard to recognize.
Therefore when music is just a little louder, we hear it as "better" rather than as "louder." It is essential that you understand that two identical components will sound different if one simply plays a little louder than the other. The louder one will sound better to us even if the two actually sound identical.
This is a serious problem in listening tests. Consider the amplifier test above and for purposes of this discussion, let's assume that both amplifiers sound exactly the same, but that the new one will play a bit louder because it has slightly more gain. This means that the new amp will sound better than the old one in an open loop test even though the two actually sound identical.
The audiophile will then draw the conclusion that the new amp is better and will spend $10,000 to buy it. But in fact, the new amp didn't really sound any better and it was the difference in loudness that caused the listener to perceive that it was better.
So the audiophile would have drawn a false conclusion about the new amp sounding better. This erroneous conclusion cost him $10,000. I think you can see from this example that you absolutely, positively must not have more than one uncontrolled variable in your tests.
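The level-matching point above can be put in numbers. Here is a minimal sketch, assuming you can measure the RMS voltage at the speaker terminals of each amp while playing the same test tone at the same preamp setting. The voltage readings and the 0.1 dB matching tolerance are illustrative assumptions, not figures from the article.

```python
import math

def level_difference_db(v_a: float, v_b: float) -> float:
    """Level of amp A relative to amp B in dB, from RMS output voltages."""
    return 20.0 * math.log10(v_a / v_b)

# Hypothetical readings: same test tone, same preamp setting.
v_old = 2.83  # volts RMS from the old amp (assumed)
v_new = 3.00  # volts RMS from the new amp (assumed)

# Assumed matching tolerance, chosen to sit below the 1/4 dB
# detection figure quoted in the text.
TOLERANCE_DB = 0.1

diff_db = level_difference_db(v_new, v_old)
matched = abs(diff_db) <= TOLERANCE_DB

print(f"new amp is {diff_db:+.2f} dB relative to the old amp")
print("levels matched" if matched else f"trim the new amp by {-diff_db:+.2f} dB")
```

With these made-up readings the new amp comes out roughly half a dB hotter, which is too small to be heard as "louder" but big enough to be heard as "better".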
2) TIME DELAY. Humans can only remember SUBTLE differences in sound for about two seconds. Oh sure, you can tell the difference between your mother's and your father's voices after many years. But those differences aren't subtle.
Most audiophiles are seeking differences like "air", "clarity", "imaging", "dynamics", etc. that are elusive and rather hard to hear and define. They are not obvious. We cannot remember them for more than a few seconds. To be able to really hear subtle differences accurately and reliably requires that you be able to switch between the amplifiers immediately.
Equally important is that you should make many comparisons between the components as this will greatly improve the reliability of your testing. This is particularly important when dealing with music as different types of music have a big influence on the sensitivity of what you can hear during your testing. You really need to test with many types of music using many comparisons.
Open loop testing only provides a single comparison, which is separated by a relatively long delay while components are changed. This makes it very difficult to determine with certainty if subtle differences in sound are present.
3) PSYCHOLOGICAL BIAS. Humans harbor biases. These prejudices influence what we hear. In other words, if you EXPECT one component to sound better than another -- it will.
It doesn't matter what causes your bias. The audiophile in the previous test had a bias towards the new amp, which is why he brought it home for testing. He expected it to sound better than his old amp, so it did. It was especially easy for his bias to influence him due to the time delay involved as he changed cables.
That bias may have been because he expects tubes to sound better (or worse) than transistors, or that the new amp had (or didn't have) feedback, or it was more expensive than his old amp, or that it looked better, or that he read a great review on it, or that it had a particular class of operation, etc. Bias is bias regardless of the cause and it will affect the performance that an audiophile perceives. It must be eliminated from the test.
Don't think you are immune from the effects of bias. Even if you try hard to be fair and open-minded in a test, you simply can't will your biases away. You are human. You have biases. Accept it.
4) CLIPPING. Clipping is when an amplifier is being driven beyond its power and voltage abilities. This produces massive amounts of distortion, compression of the dynamic range, loss of clarity and detail, a sense of strain, harshness, and generally bad performance.
It doesn't matter what good features an amplifier has -- if it is clipping, it is performing horribly and any potentially subtle improvements in sound due to a particular feature will be totally swamped by the massive distortion and general misbehavior of an amplifier when clipping. Therefore no test is valid if either amplifier is clipping.
If one amplifier in the above test was clipping, while the other wasn't, then of course the two will sound different from each other. The amp that is clipping will sound worse than the one that isn't. But you must not compare a clipping amp (that is grossly misbehaving) with one that isn't clipping (and is performing well). That is not a valid test at all and doesn't tell you how an amp sounds when it is performing properly and within its design parameters.
Most audiophiles simply don't recognize when their amps are clipping. This is because the clipping usually occurs only on musical peaks, where it is very brief, and not at the average power level. Most listeners don't recognize transient clipping as clipping because the music spends far more time at average levels than at peaks. Since the average levels aren't obviously distorted, the listeners think the amp is performing within its design parameters -- even when it is not.
Peak clipping really messes up the performance of the amplifier as its power supply voltages and circuits take several milliseconds to recover from clipping. During that time, the amp is operating far outside its design parameters, has massive distortion, and it will not sound good, even though it doesn't sound grossly distorted to the listener.
Instead of distortion, the listener will describe an amp that is clipping peaks as sounding "dull" (due to compressed dynamics), "muddy" (due to high transient distortion and compressed dynamics), "congested", "harsh", "strained", etc. In other words, the listener will recognize that the amp doesn't sound good, but he won't recognize the cause as simple amplifier clipping. Instead, he will likely assume that the differences in sound he hears are due to some minor feature like feedback, capacitors, type of tubes, bias level, class of operation, etc. rather than simply lack of power.
But his opinion would be just that -- an assumption that is totally unsupported and unproven by any evidence. Most likely his guess would not be the actual cause of the problem.
Because different audiophiles will make different assumptions about the causes of the differences they hear, it is easy to see why there is so much confusion and inaccuracy about the performance of components when open loop testing is used.
It is easy to show that most speaker systems require about 500 watts to play musical peaks cleanly. Most audiophiles use amps with far less power. Therefore audiophiles are comparing clipping amps most of the time. This variable must be eliminated if you want to compare amplifiers operating as their designers intended.
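The article does not show its working for the 500 watt figure, but the arithmetic behind a claim like this can be sketched as follows. Every number below (5 W average power, a 20 dB peak-to-average ratio, a 100 W amp rating) is an illustrative assumption, not a value from the article. Real figures depend on the speakers, the room, the recording, and the listening level.

```python
def peak_power_watts(average_watts: float, peak_to_average_db: float) -> float:
    """Power needed on peaks, given average power and peak-to-average ratio in dB."""
    return average_watts * 10 ** (peak_to_average_db / 10)

# Illustrative assumptions only.
average_watts = 5.0        # average power while listening (assumed)
peak_to_average_db = 20.0  # how far musical peaks rise above average (assumed)
amp_rating_watts = 100.0   # rated power of a hypothetical amp

peak_watts = peak_power_watts(average_watts, peak_to_average_db)
will_clip = peak_watts > amp_rating_watts

print(f"peaks need about {peak_watts:.0f} W")
print("amp clips on peaks" if will_clip else "amp has enough headroom")
```

A 20 dB ratio is a factor of 100 in power, so under these assumptions a modest 5 W average demand turns into 500 W on peaks, and the hypothetical 100 W amp clips.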
5) The last uncontrolled variable is the amplifier. This is the one variable that we want to test. So we do not need to control it.
The above information should make it clear why open loop testing is fraught with error and confusion. It is easy to see why we can easily be tricked by open loop testing, particularly when there is a significant time delay which will allow bias to strongly influence what we hear and make it difficult to recognize level differences. All these uncontrolled variables simply make it impossible to draw valid conclusions from open loop testing, even though we may be doing our best and being totally sincere in our attempt to determine how the two components sound.
a controlled ABX test would be a right pain to set up, but given the above i would trust the results of one every time over the usual 'subjective' listening test.
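for anyone wondering how the results of such a test get scored, here is a minimal sketch. It assumes each trial is an independent guess with a 50% chance of being right if the listener hears no difference, and it works out how likely a given score would be from guessing alone (a one-sided binomial test). The function name and the example scores are mine, not from the article.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

# Hypothetical scores from a 16-trial session.
for score in (9, 12, 16):
    print(f"{score}/16 correct: p = {abx_p_value(score, 16):.4f}")
```

the smaller the number, the harder it is to put the score down to luck. it also shows why many trials matter: a handful of comparisons can't separate a real difference from a lucky streak.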