VSpace Model Documentation

Overview

VSpace, MNLib and its documentation are Copyright 1999-2000 Richard W.E. Furse (all rights reserved).

This document provides a high-level discussion of the physical model used within VSpace along with some of the logic behind its design. General MN library architecture is not discussed. This text is not intended for the ordinary user.

Philosophy

VSpace is based on a reasonable physical model of an acoustic space. The model uses an image-based approach applied to a simple box-shaped room. The level of modeling detail is largely determined by the significance of the psychoacoustic cues provided. The resulting model is arguably more cumbersome but more `real' than models working directly from psychoacoustics such as Gerzon panpots.

Note that though the image-based approach for a box-shaped room appears similar to the reflected `ray-tracing' model commonly used for graphics, this is an invalid extrapolation for general room shapes; sound sources do not cast shadows to the same degree as light sources. Indeed even the model presented here is not a strict solution to the wave equation in general (see below).

Units

VSpace is based around a number of unit modules from the MN library. The core module in use is the RoomAcousticSpace unit. This takes care of most parts of the model discussed here with the exception of delay, distance gain and microphone angle behaviour. These are delegated to classes descended from Microphone. This allows use of a number of microphone (or `recording device') models, allowing different types of recordings with real and impossible microphone behaviours. Not all microphones support all parts of the model described here. For instance OmnidirectionalMicrophone_Simple (simple omnidirectional in the script language) is a very fast microphone but does not implement the delay line part of the model described below. This can be fine for some applications, particularly previewing.

The skip script keyword is implemented simply by direct calls to Network::skip() with approximation enabled.

All the units used here are available in the MN library, however only the VSpace utility binds them together in this particular way.


Direct Sound

The direct sound is delayed by a time based on the distance from source to microphone at the time the sound was generated. This provides a physically accurate model when the source moves, although it is numerically costly. A conventional multi-tap delay could be used as an alternative, however this provides an inaccurate model: the only straightforward source of delay time is the current location of the sound source rather than the location of the source at the time the sound was generated.
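The difference between the two delay strategies can be illustrated with a minimal sketch. This is not the MN library code: the function name, the speed of sound value and the linear interpolation are all assumptions made for illustration. Each input sample is written into the output at its arrival time, computed from the source distance at the moment of emission.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def render_moving_source(signal, distances, sample_rate):
    """Delay each input sample by the source distance at emission time.

    signal[n] is emitted when the source is distances[n] metres away.
    Each sample is added into the output at its arrival time, split
    between the two nearest output samples (linear interpolation).
    """
    max_delay = max(distances) / SPEED_OF_SOUND * sample_rate
    out = [0.0] * (len(signal) + int(math.ceil(max_delay)) + 2)
    for n, (value, distance) in enumerate(zip(signal, distances)):
        arrival = n + distance / SPEED_OF_SOUND * sample_rate
        index = int(math.floor(arrival))
        frac = arrival - index
        out[index] += value * (1.0 - frac)
        out[index + 1] += value * frac
    return out
```

A conventional read-side delay would instead look up the input at the current time minus the current delay, which is only equivalent when the source is stationary.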

Note that modeling movement of the microphone is not mathematically equivalent to movement of the sound source if the medium remains in place. VSpace does not support moving microphones.

The direct sound is filtered for distance using a one-pole low-pass filter. A sliding cutoff is used based on distance. The formula used is:

cutoff = 100000 / distance

Where cutoff is in Hz and distance is in metres. This formula comes from trial and error although it does seem to fit the little data I've checked it against.
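As a rough illustration, the cutoff rule can be combined with a one-pole low-pass filter as follows. Only the cutoff formula comes from this document; the coefficient formula, the clamp to the Nyquist frequency and the function names are assumptions, and the MN library filter may be designed differently.

```python
import math


def distance_cutoff_hz(distance_m):
    """Cutoff rule from the text: cutoff = 100000 / distance."""
    return 100000.0 / distance_m


def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """y[n] = (1 - a) * x[n] + a * y[n-1], with a = exp(-2*pi*fc/fs).

    The coefficient formula is a common textbook choice and an
    assumption here, as is clamping the cutoff to the Nyquist frequency.
    """
    cutoff_hz = min(cutoff_hz, 0.5 * sample_rate)
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in signal:
        state = (1.0 - a) * x + a * state
        out.append(state)
    return out
```

For example, a source 10 metres away gives a cutoff of 10000 Hz, while a source 100 metres away gives 1000 Hz.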

Generally sound is attenuated according to the Inverse Square Law (which gives a signal amplitude inversely proportional to distance). This model is disregarded within a sphere given by a `core radius'. For the Ambisonic microphone classes provided, the gain of the sound is smoothly limited and directional information is made vague so that transitions across the core sphere are more continuous. For other microphones, the gain simply remains constant within the core radius.
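The simpler of the two behaviours (gain held constant within the core radius) might be sketched as below. The function name and the choice of unit gain at one metre are assumptions; the smooth limiting used by the Ambisonic microphones is not shown.

```python
def distance_gain(distance_m, core_radius_m):
    """Amplitude gain of 1/distance, held constant inside the core radius.

    Assumes core_radius_m > 0 and unit gain at a distance of 1 metre.
    """
    return 1.0 / max(distance_m, core_radius_m)
```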


Early Reflections

Early reflections are constructed by anti-phase reflections off walls with attenuation based on absorption coefficients. Note that use of an infinite set of reflected images with no absorption provides a model of the room which satisfies the wave equation. Reducing the number of reflections or allowing absorption means that the model is no longer a strict solution, however it is a practical one.
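A minimal sketch of the image construction for a box-shaped room follows. The coordinate convention (the room spanning 0 to its length on each axis), the single shared absorption coefficient and the per-reflection amplitude factor of (1 - absorption) are all assumptions made for illustration; the RoomAcousticSpace unit may differ in each of these.

```python
def image_coordinate(source, room_length, cell):
    """Coordinate of the image in room cell `cell` along one axis.

    The real room spans [0, room_length]; cell 0 is the real room.
    Odd cells hold a mirrored copy of the room.
    """
    if cell % 2 == 0:
        return cell * room_length + source
    return cell * room_length + (room_length - source)


def image_gain(cells, absorption):
    """Signed amplitude factor after |cell| reflections on each axis.

    Assumes every wall shares one absorption coefficient, that each
    reflection scales amplitude by (1 - absorption), and that each
    reflection inverts the sign (the anti-phase reflection above).
    """
    reflections = sum(abs(c) for c in cells)
    return (-(1.0 - absorption)) ** reflections
```

Applying image_coordinate to each of the three axes in turn gives the position of the image source for a room image identified by three integer cell offsets.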

Early reflection images are passed through the same algorithms as the direct sound signal discussed above and will be affected by Doppler shift, distance filtering etc. Note that this configuration results in a very real acoustic image and features the same sound colouration issues present in real rooms. In particular, coincident images, though accurate, can weaken the overall psychoacoustic effect by the Craven hypothesis.

Which images will be generated is decided at the beginning of processing and is determined on a room image basis, taking into account sound absorption at walls and the distance between the centres of the relevant room images. Reasonably, this assumes that microphones are using the inverse square law, at least for reflected sound. Quiet images may be discarded depending on the `early reflection minimum gain' tolerance in use.

The distance between the centre of room images is also used to determine which images are considered early enough to be included in this algorithm rather than being approximated by the late reflection engine. This decision is made according to the `early reflection time' in use. If this is set to zero then only the late reflection engine will operate. This will produce a relatively poor quality image. An early reflection time long enough for a reflection off each wall seems a useful rule of thumb for a minimum value for this.
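The selection of reflected room images might look something like the sketch below. The order of the two tests, the gain estimate and every name used are assumptions; in particular it is a guess that the time test is applied before the gain test.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def classify_image(cells, room_size, absorption, min_gain, early_time_s):
    """Return 'early', 'late' or 'discard' for one reflected room image.

    `cells` is the integer offset of the room image on each axis and
    `room_size` the room dimensions in metres. The distance used is the
    distance between the centres of the real room and the room image.
    """
    distance = math.sqrt(sum((c * s) ** 2 for c, s in zip(cells, room_size)))
    if distance / SPEED_OF_SOUND > early_time_s:
        return "late"
    reflections = sum(abs(c) for c in cells)
    gain = (1.0 - absorption) ** reflections
    if distance > 0.0:
        gain /= distance  # 1/distance amplitude law, unit gain at 1 m
    return "early" if gain >= min_gain else "discard"
```

With an early reflection time of zero every reflected image is classified as late, matching the behaviour described above.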


Late Reflections

Late reflections are generated by a set of parallel FeedbackDelay objects followed by a number of AllpassFilter units in series. The FeedbackDelay objects are used in place of the comb filters of conventional literature because both feedback delay and gain attenuation happen before the first output of the delay. A one-pole low-pass filter is used to reduce the high-frequency content of the signal and a delay attempts to ensure late reflections begin around the same time that early reflections end.
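One reading of the FeedbackDelay behaviour is sketched below. This is an interpretation of the description above rather than the MN library code: the first output appears one delay period after the input and has already been attenuated, whereas a conventional comb filter of the form y[n] = x[n - delay] + gain * y[n - delay] passes its first echo at unit gain.

```python
def feedback_delay(signal, delay, gain):
    """Delay with feedback where the gain is applied before the first output.

    y[n] = gain * (x[n - delay] + y[n - delay])
    """
    out = []
    for n in range(len(signal)):
        delayed_in = signal[n - delay] if n >= delay else 0.0
        delayed_out = out[n - delay] if n >= delay else 0.0
        out.append(gain * (delayed_in + delayed_out))
    return out
```

Fed a unit impulse, this produces echoes of amplitude gain, gain squared, gain cubed and so on at successive multiples of the delay.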

The low-pass filter uses a fixed cutoff determined using the distance filtering rule applied for the direct sound, but basing the distance involved on the `early reflection time'. No low-pass filtering is included in the feedback delay objects as it proved too difficult to calibrate cutoff frequencies against the moving parameters of the rest of the model automatically. In practice the high-frequency attenuation provided by the early reflection part of the model provides adequate colouration.

The allpass filters in place use fixed delay times to produce phase smudging of the sound. No attempt is made to infer these times from the model, however an estimate of the echo density implications of these filters is incorporated into the configuration of the FeedbackDelay units.

Note that allpass filters were not used for the parallel part of the reverb. Conventional allpass filters for reverb with large delay times do not produce a flat frequency response when only a part of their impulse response is examined. To the ear the effect is very similar to that of a large delay comb filter, defeating the theoretical `colourless reverb' argument. However comb filters (and the FeedbackDelay filters used here) have an advantage over the allpass when the decay shape is considered: an exponential decay shape is observed throughout the impulse response, whereas the allpass filter produces an exponential decay only after the first spike of the response, and this spike can be wildly out of keeping with the overall decay shape.
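The decay shape argument can be checked with a standard Schroeder allpass section. The sketch below uses the usual textbook difference equation, which is an assumption about what `conventional allpass' means here and is not taken from the AllpassFilter unit.

```python
def schroeder_allpass(signal, delay, gain):
    """y[n] = -gain * x[n] + x[n - delay] + gain * y[n - delay]"""
    out = []
    for n in range(len(signal)):
        delayed_in = signal[n - delay] if n >= delay else 0.0
        delayed_out = out[n - delay] if n >= delay else 0.0
        out.append(-gain * signal[n] + delayed_in + gain * delayed_out)
    return out
```

For a unit impulse the response is -gain at time zero, then (1 - gain * gain) at the first delay period, decaying by a factor of gain at each period after that. With a gain of 0.9 the initial spike has magnitude 0.9 while the next echo has magnitude 0.19, so the first spike sits well outside the exponential envelope of the remainder.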

The comb filters are tuned to produce a specific echo density, frequency density and decay time using a conventional reverberation approach. Account is taken of the echo density induced by the subsequent allpass filters present. Comb filter delay times are chosen to be prime where possible to reduce the risk of delays feeding into each other and delay times are spread so that the shortest to the longest delay times have a ratio of approximately 1:1.5.
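A sketch of this kind of tuning is given below. The even spacing of delays, the search for the next prime above each target and the function names are assumptions; only the prime delay lengths, the 1:1.5 spread and the -60dB decay target come from the text. The feedback gain uses the conventional relation between loop delay and -60dB decay time.

```python
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def nearest_prime_at_or_above(n):
    while not is_prime(n):
        n += 1
    return n


def comb_delays(shortest, count, ratio=1.5):
    """Prime delay lengths in samples, spread from `shortest` to about ratio * shortest."""
    delays = []
    for k in range(count):
        position = k / (count - 1) if count > 1 else 0.0
        target = int(round(shortest * (1.0 + (ratio - 1.0) * position)))
        delays.append(nearest_prime_at_or_above(target))
    return delays


def decay_gain(delay, sample_rate, reverb_time_s):
    """Feedback gain giving a -60 dB decay after reverb_time_s seconds."""
    return 10.0 ** (-3.0 * delay / (sample_rate * reverb_time_s))
```

Longer delays receive a smaller feedback gain, so that all the parallel filters decay at the same rate in decibels per second.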

The echo density value used is derived by differentiating a function approximating the number of room images before a certain time using a simple volume argument. This function is evaluated at a point during the late reflection stage, half way through the late reflection (-60dB) time period. This model is perhaps too literal as it does not take into account the amplitudes of these images.
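The volume argument can be written out as follows, assuming that each room image occupies one room volume and that the images heard before a given time are those within a sphere whose radius is the distance sound travels in that time. The function names and the speed of sound value are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def image_count(time_s, room_volume):
    """Approximate number of room images heard before time_s.

    Volume argument: each image occupies one room volume, so count the
    room volumes that fit inside a sphere of radius c * t.
    """
    radius = SPEED_OF_SOUND * time_s
    return (4.0 / 3.0) * math.pi * radius ** 3 / room_volume


def echo_density(time_s, room_volume):
    """Derivative of image_count with respect to time, in echoes per second."""
    return 4.0 * math.pi * SPEED_OF_SOUND ** 3 * time_s ** 2 / room_volume
```

The echo density grows with the square of time, which is why the point at which it is evaluated matters.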

The late reflection reverb time and the overall gain of the reverberation are generated using algorithms from the early reflection part of the modeling process. The first few reflections that were considered too distant to be handled explicitly by the early reflection engine are examined using a sum-squared approach and the results are used to infer an approximate reverb gain and classical reverb time.

Note that while the term `reverb time' is used in this document, it is used to determine the -60dB decay time for the comb filters that cut in after the early reflections, not a strict estimate of the -60dB decay time for the room. Sabine's equation or relatives are not used at any point in the system. The exponential decay produced by the late reflection engine will not necessarily match whatever decay curve the real room exhibits, however it should join onto the tail of the more accurate early reflection curve with a reasonable level of continuity.


Second-Order Ambisonics

Second order Ambisonic sound files produced by VSpace normally have suffix .fmh. This corresponds to Furse-Malham encoding. These files are RIFF Wave files with nine channels containing the following signals (assuming a mono sound f(t) at <x,y,z> on the unit sphere).

Label  Cartesian Representation
W      0.707107 * f(t)
X      x * f(t)
Y      y * f(t)
Z      z * f(t)
R      (1.5 * z * z - 0.5) * f(t)
S      (2 * z * x) * f(t)
T      (2 * y * z) * f(t)
U      (x * x - y * y) * f(t)
V      (2 * x * y) * f(t)

The first four channels conform to standard B-Format with WXYZ channel ordering.
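The table translates directly into code. The sketch below encodes a single sample and is illustrative only: the function name is an assumption, and writing the nine-channel RIFF Wave file is not shown.

```python
def fmh_encode(sample, x, y, z):
    """Return the nine Furse-Malham channel values (W, X, Y, Z, R, S, T, U, V).

    (x, y, z) must lie on the unit sphere.
    """
    gains = (
        0.707107,           # W
        x,                  # X
        y,                  # Y
        z,                  # Z
        1.5 * z * z - 0.5,  # R
        2.0 * z * x,        # S
        2.0 * y * z,        # T
        x * x - y * y,      # U
        2.0 * x * y,        # V
    )
    return tuple(g * sample for g in gains)
```

For a source directly ahead at <1,0,0>, only W, X, R and U are non-zero.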


HRTFs

MLib3 (MN's predecessor) includes an HRTF-based dummy head within its acoustic space model. This used convolution with an interpolation from a set of HRTF recordings from a real dummy head (the MIT KEMAR set). I found that this approach coloured the sound significantly. This could be compensated for during composition, however the long time periods required to perform the convolutions made this an unacceptable way of working. I am very interested in introducing the dummy head to MN, however I need a better set of dummy head recordings. I would rather have a `real' set than process the set I have to flatten it out. Any offers?


Links

The author Richard Furse can be emailed as richard@muse.demon.co.uk.

"Ambisonics" is a registered trademark of Nimbus Communications International.