Friday, 1 February 2013

High Bitrate MP3 Internet Blind Test: Part 1 - PROCEDURE (Set B = MP3)

The survey closed today (February 1, 2013). Over the next few days, I will post write-ups on the Procedure (released today), followed by Results, and finally a Discussion section.

As you'll see in the description below, "Set B" was the MP3 encoded collection of music.

For the survey participants - now that you know, consider how you voted.  Do you believe MP3 ~320kbps causes significant or serious sonic degradation, or in a significant way impaired your ability to enjoy the music through your system?

--------------------------------------

Procedure:
Over the course of approximately two months (December 10, 2012 - February 1, 2013), an anonymous survey was run on freeonlinesurveys.com to gather feedback on audible differences between 2 "Sets" of FLAC-encoded audio files. One set of files contained segments of music ripped directly from audio CD (PCM 16/44), whereas the other set had the audio converted to MP3, decoded back to 16/44 PCM, and then converted to FLAC. The specific details of this MP3 conversion will be discussed below.

Song selection:
The 3 musical segments selected for the test were:

1. "Time" (2:29) from Pink Floyd off the 2011 re-master of "Dark Side Of The Moon" - a 2.5 minute excerpt with all the ruckus of chimes, bells and clocks in wonderful detail and space. A classic audiophile test track. This segment has a score of DR11 using the Foobar2000 dynamic range meter.

2. "Church" (2:31) from Lyle Lovett off his 1992 record "Joshua Judges Ruth". An acoustic country track with layered vocals, hand clapping, and a choir with which to evaluate sound quality. The DR16 measurement for this segment represents a highly dynamic and natural-sounding track.

3. "Keine Zeit" (1:20) from Megaherz off the recent 2012 album "Götterdämmerung". For those who have used VBR algorithms for lossy encoding, "loud" music tends to demand higher bitrates to encode. This track at DR6 is not particularly dynamic but is representative of modern mastering for music in the hard rock / metal genres.


Participant Invitation for Test:
During the 2 months that this test was conducted, "subjects" were recruited from a number of "audiophile" and music related message forums. The hope was to achieve an adequate number of serious audiophiles and music lovers representing a cohort who would be able to seriously assess sound quality and would own higher quality equipment for audio playback. In principle, this would be the group most likely to be critical of sonic degradation. Invitations were posted on the following forums:
- audioasylum.com "PC Audio"
- forums.slimdevices.com "Audiophile"
- www.head-fi.org "Computer Audio"
- www.computeraudiophile.com "General Forum"
- stereophile.com "MP3 vs AAC vs FLAC vs CD" article comments
- www.hydrogenaudio.org "Listening Tests"
- www.audiocircle.com "The Discless Circle"
- www.stevehoffman.tv "Audio Hardware"
- www.wiredstate.com "Equipment Reviews, Listening Impressions"
- www.xtremeplace.com "Planet Audio" (Singapore)
- vr-zone.com "Audiophile's & HTPC Corner" (Singapore)
- www.lowyat.net "Home Entertainment / Audiophiles" (Malaysia)

Reminder messages were posted on the forums approximately every 2-3 weeks to increase visibility of the invitation, with the last reminder approximately 1 week before the closure of the survey. There should have been plenty of time for all the respondents to listen and make judgments on perceived quality of the samples.

How the MP3 test tracks were produced:
For those with some experience with digital audio editing, it is relatively trivial to detect if a WAV/FLAC file was sourced through a standard MP3 process. Lossy encoders like MP3 will "throw out" frequencies the psychoacoustic model deems inaudible. For example, running an FFT frequency analysis on many MP3s quickly reveals that most encoders will remove frequencies at 18kHz and above. Characteristics like this allow programs like Tau Analyzer to estimate the probability that an audio file has been modified by lossy encoding. Therefore, for the purpose of this test, where the samples are freely available to many likely technologically savvy participants, it was necessary that the MP3-encoded samples be processed in some way that results in sound quality equivalent to a direct MP3 encode around 320kbps, yet masks the file from easy detection.
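To illustrate the kind of frequency check described above, here is a minimal sketch in Python. It is not how Tau Analyzer works (its method is not described here); it simply compares the energy at probe frequencies above and below an assumed 18kHz split using the Goertzel algorithm. The function names, the 500Hz probe spacing, and the synthetic test tones are all assumptions made for illustration.

```python
import math

def goertzel_power(samples, sample_rate, freq_hz):
    """Normalised power at the DFT bin nearest freq_hz (Goertzel recurrence)."""
    n = len(samples)
    k = round(freq_hz * n / sample_rate)            # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
    return power / (n * n)

def high_band_ratio_db(samples, sample_rate=44100, split_hz=18000, step_hz=500):
    """Energy at probes >= split_hz relative to energy below it, in dB."""
    low = high = 0.0
    for freq in range(step_hz, sample_rate // 2, step_hz):
        p = goertzel_power(samples, sample_rate, freq)
        if freq >= split_hz:
            high += p
        else:
            low += p
    floor = 1e-30                                   # avoids log10(0) on silent bands
    return 10.0 * math.log10(max(high, floor) / max(low, floor))

def tone(freq_hz, amplitude, n=4410, sample_rate=44100):
    """Synthetic sine wave standing in for real audio (illustration only)."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# A "full band" signal with some 19kHz content, and a version without it,
# standing in for a file that has been through a strong low-pass filter.
full_band = [a + b for a, b in zip(tone(1000, 1.0), tone(19000, 0.1))]
low_passed = tone(1000, 1.0)

print("full band :", round(high_band_ratio_db(full_band), 1), "dB")
print("low-passed:", round(high_band_ratio_db(low_passed), 1), "dB")
```

With these synthetic tones the full-band signal comes out at roughly -20 dB, while the low-passed one falls far lower. Real music would need a longer analysis window and a sensible threshold, neither of which is attempted here.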

A 2 stage technique was used to create the MP3 test files using LAME 3.99.5 (current version at this time):
Stage 1 - convert to 400kbps
lame.exe --freeformat --lowpass -1 -b400 <file.wav> <file400.mp3>
lame.exe --decode <file400.mp3>

Stage 2 - convert to 350kbps
lame.exe --freeformat --lowpass -1 -b350 <file400.wav> <file350.mp3>
lame.exe --decode <file350.mp3>

Use dBPowerAmp to convert the <file350.wav> to FLAC
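For anyone wanting to script the two stages above, the following Python sketch builds the same LAME command lines in order. The flags are the ones listed above; the helper name `two_stage_commands`, the explicit output filenames passed to `--decode`, and the file naming pattern are assumptions for illustration. The FLAC conversion step is left out since it was done in dBPowerAmp.

```python
from pathlib import Path

def two_stage_commands(source_wav, lame="lame.exe", bitrates=(400, 350)):
    """Return the LAME command lines for each encode/decode stage, in order."""
    commands = []
    current = Path(source_wav)
    stem = current.stem
    for kbps in bitrates:
        mp3 = current.with_name(f"{stem}{kbps}.mp3")
        wav = current.with_name(f"{stem}{kbps}.wav")
        # Encode: free-format bitrate, low-pass filter disabled.
        commands.append([lame, "--freeformat", "--lowpass", "-1",
                         f"-b{kbps}", str(current), str(mp3)])
        # Decode back to WAV; this becomes the input of the next stage.
        commands.append([lame, "--decode", str(mp3), str(wav)])
        current = wav
    return commands

for cmd in two_stage_commands("file.wav"):
    print(" ".join(cmd))
```

Each entry is a plain argument list, so it could be handed to `subprocess.run(cmd, check=True)` on a machine where LAME is installed. The sketch only prints the commands and does not run them.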

This uses LAME's "free format" mode to first create a 400kbps MP3 without the usual lowpass filter in place, then runs the decoded result through the MP3 encoder again at a lower 350kbps bitrate (again with the low-pass turned off), which more closely approximates the 320kbps target bitrate for the test. Even though the final MP3 bitrate is 30kbps higher than the 320kbps target, the degradation in sound quality by objective measures is approximately the same or slightly worse than if the audio were encoded directly at 320kbps, but without the tell-tale sign of the strong low-pass filter.

Since this was not a direct conversion to 320 kbps, to confirm the amount of sonic degradation of this process used to create the test MP3, WavDiff was employed to calculate the variance from the original lossless file vs. the MP3 processed test files and also the variance of the original lossless file vs. MP3 encodes at CBR 320 kbps and 256 kbps (for the sake of brevity, I will just report the RMS Error [RMSE]):
Time - Test file: 105.879  /  MP3 (320): 110.403  /  MP3 (256): 176.337
Church -  Test file: 46.591  /  MP3 (320): 46.915  /  MP3 (256): 76.357
KeineZeit - Test file: 393.914  /  MP3 (320): 372.550  /  MP3 (256): 607.282
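For readers unfamiliar with the metric, RMSE between an original and a processed file can be sketched as below. This is a generic calculation on 16-bit sample values and is only assumed to resemble what WavDiff reports; the helper names are hypothetical, and the two files are assumed to be sample-aligned already (any encoder delay removed).

```python
import array
import math
import sys
import wave

def read_pcm16(path):
    """Read all samples of a 16-bit PCM WAV file (channels left interleaved)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()          # WAV data is little-endian
    return samples

def rmse(reference, processed):
    """Root-mean-square error over the overlapping part of two sample sequences."""
    n = min(len(reference), len(processed))
    if n == 0:
        raise ValueError("no samples to compare")
    total = sum((reference[i] - processed[i]) ** 2 for i in range(n))
    return math.sqrt(total / n)

# Tiny made-up example: four samples, with small differences in three of them.
original = [0, 100, -100, 50]
decoded = [1, 98, -100, 53]
print(rmse(original, decoded))
```

On real files one would call something like `rmse(read_pcm16("original.wav"), read_pcm16("decoded.wav"))`, where both filenames are placeholders.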

From the values above, one can see that the three musical selections objectively are similar in variance to a 320 kbps MP3 and that variance is significantly larger with 256 kbps (as expected). Note that I have also used the above technique to encode test tones to ensure that the distortion characteristics from the encoding method closely represent CBR 320kbps MP3. Something to keep in mind is that because the low-pass filter was turned off, bits are now being used by the MP3 encoder for frequencies beyond the usual threshold of hearing for most people (is there ever a need to allocate any bits for 20-22kHz for example?), taking away from the ability of the encoder to better represent audible frequencies. Theoretically this should worsen the sound quality by worsening the distortion for the frequencies we are more sensitive to.
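To put the comparison in relative terms, the reported RMSE values can be divided by the direct 320 kbps figure for each track. The numbers below are copied from the list above; the dictionary layout and function name are just for illustration.

```python
# RMSE values as reported above (test file, CBR 320 kbps, CBR 256 kbps).
rmse_values = {
    "Time":      {"test": 105.879, "mp3_320": 110.403, "mp3_256": 176.337},
    "Church":    {"test": 46.591,  "mp3_320": 46.915,  "mp3_256": 76.357},
    "KeineZeit": {"test": 393.914, "mp3_320": 372.550, "mp3_256": 607.282},
}

def relative_to_320(row):
    """Express each RMSE as a multiple of the direct 320 kbps encode's RMSE."""
    return {name: value / row["mp3_320"] for name, value in row.items()}

ratios = {track: relative_to_320(row) for track, row in rmse_values.items()}

for track, row in ratios.items():
    print(f"{track:10s} test/320 = {row['test']:.2f}   256/320 = {row['mp3_256']:.2f}")
```

The test files land within about 6% of the direct 320 kbps encode in either direction (ratios of about 0.96, 0.99 and 1.06), while the 256 kbps encodes show roughly 60% more error on all three tracks.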

On a side note, KeineZeit seemed to be the most difficult to encode resulting in the highest amount of error going through the lossy encoding process.

The files were checked with the "Foobar2000 Dynamic Range Monitor" to confirm that the volume after MP3 compression matched the original to within 0.01 dB.
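A similar level check can be approximated outside of Foobar2000 by comparing overall RMS levels. The sketch below is a stand-in for the plugin, not a reproduction of it: the function names are hypothetical and it looks only at whole-file RMS, whereas the plugin's own calculation may differ.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of sample values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def level_difference_db(reference, processed):
    """Overall level of processed relative to reference, in dB."""
    return 20.0 * math.log10(rms(processed) / rms(reference))

# Synthetic example: a 440Hz tone and a copy scaled by 0.999.
reference = [math.sin(2.0 * math.pi * 440 * i / 44100) for i in range(44100)]
processed = [0.999 * x for x in reference]

diff = level_difference_db(reference, processed)
print(f"{diff:.4f} dB", "(within 0.01 dB)" if abs(diff) < 0.01 else "(exceeds 0.01 dB)")
```

A gain change of 0.1% works out to a little under 0.01 dB, which gives a sense of how tight that tolerance is.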

The original lossless files were labelled Time_A, Church_A, and KeineZeit_A, whereas the lossy MP3 encoded/decoded files were labelled Time_B, Church_B, and KeineZeit_B. The files were encoded to FLAC for lossless compression and delivered for download, together with instructions, as a ZIP file of approximately 75MB.

As noted above, freeonlinesurveys.com was utilized to collect the survey results. The primary question asked was for the respondents to choose whether "Set A" or "Set B" sounded INFERIOR; with the implication probably being that the MP3 encoding would deteriorate the sound quality (it is by definition "lossy"). In order to not force the respondents to guess if in their opinion the samples are equivalent, an option was provided to select "no difference". The other questions in the survey pertained to which of the 3 songs was thought to be most revealing of differences, confidence of the respondent ("easy" to "impossible to tell"), approximate cost of the audio system used, and a description of the type of equipment used. The hope is that these other variables can be used to analyze the data to determine if the respondent's level of confidence and cost of the system (presumably the more expensive systems are more revealing) predicted accuracy.

Continue to - Part 2: Results
