CyberEdge  
  Articles & Papers: CEJ Archive

 



 

 

Spatial Sound at SIGGRAPH:
Is it 3D?
By William Martens, Headspace

This is a more extensive version of an article which appears in the September/October 1995 issue of CyberEdge Journal.

© 1995 William Martens and CyberEdge Journal.

There were many immersive attractions at the SIGGRAPH convention in LA this summer, and many of those included spatialized sound. When experiencing the audio component of each, I asked myself the question, "Is it 3D?" I was attempting to evaluate how well coordinated the sound was with the graphics, and whether I heard the same kind of 3D qualities in the sound that I saw in the visuals. The graphic component was, of course, the 3D medium that was featured in these attractions, and I must say that the visuals were consistently as good as you might expect. On the other hand, the audio was not consistently well produced, nor did it typically have the kind of spatial realism that is called for in immersive attractions.

I voiced the same complaint last year when I reviewed for CyberEdge Journal the 3D audio at SIGGRAPH '94. This year, instead of just complaining, I will attempt to explain, by analogy to computer graphic rendering techniques, what features are missing and what their value to these attractions might be.

After waxing pedantic, I continue by comparing the experiences I had at SIGGRAPH both in terms of the quality of the subjective experience, and in terms of what my ear and the providers of the audio rendering technologies told me that they'd included in their 3D audio displays.

Analogy between 3D Graphic and 3D Audio Rendering

An analogy between the technologies of 3D graphic rendering and 3D audio rendering is drawn in order to teach distinctions in spatial sound processing to those more familiar with 3D graphics. In each of four comparisons, a spatial sound processing feature is added to those previously presented so that the complexity of 3D audio rendering can be appreciated relative to the graphic rendering of a simple object such as the sphere. The following respective graphic and audio features are discussed in turn: Gouraud shading vs. directional filtering, shadows vs. discrete reflections, ray tracing vs. reverberation, and motion blur vs. Doppler shift.

In order to teach some of the distinctions important in understanding levels of sophistication in three-dimensional (3D) audio signal processing, I've prepared an analogy between the graphic rendering of a sphere on a visual display such as a CRT screen, and the audio "rendering" of a sound source, say the round tone of a low F on the trombone, via a pair of loudspeakers. In each case, the minimal presentation simply gets the stimuli to their respective displays - the sphere on the screen appears as a solid white circle (no shading) and an identical monaural sound source emanates from both of the speakers (no signal processing). At each increase in the level of sophistication of audio rendering, new spatial sound processing features are added that can be compared to increasingly complex graphic rendering of the sphere.

The first level of sophistication in audio rendering that might be called 3D would be the simulation of the acoustical response of the ears for sound arriving from the direction of the trombone. If the sound is to appear to arrive somewhat from the right, then the signal sent to the right speaker will have more high-frequency energy than the signal sent to the left. This simulates the head-shadow known to exist at high frequencies at the ear on the side of the head opposite the sound source (a dominant feature of the Head-Related Transfer Function, or HRTF). This acoustical phenomenon is analogous to the way in which the shading of the sphere changes as it moves across a space with a simulated light source at a fixed position.

Just as the sphere looks more like an object in 3D space and less like a circle of light on the screen, so does the trombone tone sound less like it is emanating from the speakers and more like it is arriving from a distinct position in space.
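As a rough illustration of this first level, the sketch below sends an unfiltered copy of the signal to the channel on the source's side and a low-passed copy to the opposite channel. It is not a measured HRTF: the one-pole filter, the function names, and the mapping from azimuth to cutoff frequency (16 kHz straight ahead, 2 kHz fully to one side) are all assumptions chosen only to show the head-shadow idea.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """Attenuate content above cutoff_hz with a simple one-pole low-pass filter."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def head_shadow_pan(mono, azimuth_deg, sample_rate=44100):
    """Return (left, right) signals for a source at azimuth_deg (positive = right).

    The channel opposite the source receives a low-passed copy, a crude
    stand-in for the high-frequency head shadow.
    """
    # Hypothetical mapping: the shadow cutoff falls from 16 kHz to 2 kHz
    # as the source moves from straight ahead to fully to one side.
    side = min(abs(azimuth_deg), 90.0) / 90.0
    cutoff_hz = 16000.0 - 14000.0 * side
    shadowed = one_pole_lowpass(mono, cutoff_hz, sample_rate)
    if azimuth_deg >= 0:
        return shadowed, list(mono)   # source on the right: left channel is shadowed
    return list(mono), shadowed       # source on the left: right channel is shadowed
```

With a source 60 degrees to the right, an 8 kHz component loses a noticeable share of its energy in the left channel, while a 200 Hz component passes almost unchanged.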

The second level of sophistication in 3D audio rendering creates realistic impressions of distance for the trombone. Consider first what an observer might experience if the projection of a sphere on the screen gradually grows smaller and larger. Is the shaded sphere moving forward and back, or is its size shrinking and growing while it maintains a fixed position in space? The same ambiguity occurs when the loudness level of a sound source is modulated over time. Did the sound itself get softer and louder, or did it move forward and back while maintaining a constant volume? For both visual and sonic objects, simulating the interaction of the object with a wall provides a frame of reference that can help the observer decide between the two possibilities. If the sphere casts a shadow on the wall, the observer can infer from the behavior of the shadow whether the sphere is moving or not. The same is true of the sound that reflects off the surface of a wall. If a sound source gets softer and louder, then so does a simulated reflection that arrives from a different direction and at a slight delay relative to the arrival of the direct sound. If a sound source moves away from the observer, both the direction and the time delay are modulated in a manner that is characteristic of a moving sound source. The more simulated walls producing sound reflections, the better the effect. The same is true of including more shadows for disambiguating the movement of the sphere.
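One common way to simulate a single wall reflection is to mirror the source in the wall and treat the mirror image as a second, delayed source. The sketch below assumes a flat 2D layout with the listener facing +y, a wall along the line x = wall_x, a speed of sound of 343 m/s, and simple inverse-distance attenuation; the function names are illustrative, not taken from any product described here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air

def arrival(source, listener):
    """Delay, gain, and azimuth of a straight path from source to listener."""
    dx, dy = source[0] - listener[0], source[1] - listener[1]
    dist = math.hypot(dx, dy)
    return {
        "delay_s": dist / SPEED_OF_SOUND,
        "gain": 1.0 / max(dist, 1.0),  # inverse-distance law, clamped inside 1 m
        "azimuth_deg": math.degrees(math.atan2(dx, dy)),  # 0 = ahead, positive = right
    }

def direct_and_reflection(source, listener, wall_x):
    """Model one wall reflection by mirroring the source in the plane x = wall_x."""
    image = (2.0 * wall_x - source[0], source[1])
    return arrival(source, listener), arrival(image, listener)

def reflection_lag(source, listener=(0.0, 0.0), wall_x=3.0):
    """Time by which the reflection trails the direct sound, in seconds."""
    direct, reflected = direct_and_reflection(source, listener, wall_x)
    return reflected["delay_s"] - direct["delay_s"]
```

Under these assumptions, a source 2 m ahead has its reflection trail the direct sound by about 12.6 ms; moved back to 4 m, the lag shrinks to about 9.4 ms and the reflection's arrival angle shifts as well. Merely turning the volume down changes neither, which is the cue that resolves the ambiguity.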

Level Three

The third level of sophistication in 3D audio rendering is the simulation of the reverberation of a sound source throughout a 3D space. This simulation is akin to ray tracing in graphic rendering in that it requires the calculation of how sound leaving the source in all directions eventually arrives at the observer's ears. If the surface of the sphere were made of shiny metal, graphic ray tracing would allow the textured walls of a room to be reflected to the observer's eyes according to the position and shape of the sphere. Including reverberation in 3D audio rendering completes a spatial simulation in a most realistic fashion. When the sound source moves away from the observer, the overall level of the reverberation is hardly modulated at all while the level of the sound arriving directly from the source decreases. As with the single simulated reflection described above, but much more powerfully, ambiguities of changing loudness versus changing sound source distance are easily resolved.
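The distance cue described here can be expressed as a direct-to-reverberant ratio. The following is a minimal sketch under two simplifying assumptions: the direct sound falls off as the inverse of distance, and the reverberant level stays fixed wherever the source is. The reverb level of 0.1 is an arbitrary illustrative value.

```python
import math

def direct_to_reverb_db(distance_m, reverb_level=0.1, reference_m=1.0):
    """Direct-to-reverberant ratio in dB for a source at distance_m.

    Assumes inverse-distance attenuation of the direct sound (unity gain at
    reference_m) and a reverberant level that does not depend on distance.
    """
    direct = reference_m / max(distance_m, reference_m)
    return 20.0 * math.log10(direct / reverb_level)
```

With these values the ratio is 20 dB at 1 m and drops to 0 dB at 10 m. A source that merely gets quieter would lower the direct sound and the reverberation together and leave the ratio unchanged.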

An additional 3D graphic and audio rendering feature that may or may not be included in the preceding levels of sophistication is motion-dependent modification of the graphic or sound object itself. When a graphic object moves quickly enough, it provides a good deal of realism if motion blur is added to the simulation. The image of the object will appear to be blurred along its direction of motion. Similarly, when a sound source moves quickly enough away from the observer, its pitch will be decreased by an amount roughly proportional to the sound source velocity.

This feature provides a very salient cue to changes in sound source distance. In addition, if a sound source takes a circular path around the observer's head, the source is closer to one ear and then the other as it completes each revolution. As the source simultaneously moves toward one ear and away from the other, the pitch of the signal sent to one speaker is increased while the pitch to the other is decreased. This phenomenon adds a great deal to the sense of spatial motion not only when the sound source moves, but also when the observer moves by a stationary sound source.
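The pitch change can be sketched with the standard Doppler formula for a moving source and a stationary listener. In the code below, the speed of sound (343 m/s), the ear offset of 0.09 m from the center of the head, and the function names are assumptions for illustration; the second function gives the velocity of a circling source along the line to each ear, which is what makes the two channels shift in opposite directions.

```python
import math

def doppler_frequency(source_hz, radial_velocity_mps, c=343.0):
    """Frequency heard from a moving source by a stationary listener.

    radial_velocity_mps is positive when the source is receding.
    """
    return source_hz * c / (c + radial_velocity_mps)

def ear_radial_velocities(theta_rad, radius_m, omega_rad_s, ear_offset_m=0.09):
    """Radial velocity (left, right) of a source circling the head, per ear.

    The head is centered at the origin with ears at x = -/+ ear_offset_m;
    the source is at angle theta_rad on a circle of radius radius_m and
    moves counterclockwise at omega_rad_s.
    """
    px, py = radius_m * math.cos(theta_rad), radius_m * math.sin(theta_rad)
    vx = -radius_m * omega_rad_s * math.sin(theta_rad)
    vy = radius_m * omega_rad_s * math.cos(theta_rad)
    velocities = []
    for ear_x in (-ear_offset_m, ear_offset_m):
        dx, dy = px - ear_x, py
        # Project the source velocity onto the line from the ear to the source.
        velocities.append((vx * dx + vy * dy) / math.hypot(dx, dy))
    return tuple(velocities)
```

Taking the trombone's low F to be about 87.31 Hz (an assumed value for illustration), a source receding at 10 m/s is heard at about 84.8 Hz. For a source passing directly in front on a 1 m circle at one revolution per second, the radial velocity is negative at the left ear and positive at the right, so the left channel rises in pitch while the right one falls.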

Spatial Sound Examples at SIGGRAPH

My favorite spatial sound experience was in a simulator pod brought to SIGGRAPH by Virtual World Entertainment (VWE). As I moved through the canals of Mars (nicely texture-mapped I might add), the sound emitters I passed flew from front to back on either side of me; and they didn't simply get louder and softer, they approached and receded. I attributed this to the excellent four-channel reverberation simulation that VWE sound designer Eric Huffman achieved by using two separate stereo reverb processors, one for the two front loudspeakers and one for the two rear loudspeakers. The sound samples were layered so that you never heard the same sound event twice and they had a realistic variation that was quite satisfying. All this complexity (64 channels of 16-bit, 44.1 kHz wavetable synthesis plus effects) was accomplished using two AWE-32 Sound Blaster cards from Creative Labs (at a street price of $200 each). I have to point out one glaring error VWE committed, however, that really grates on me. When I first started in the simulation, I was immediately disappointed to find that the low-frequency portion of laser blasts and collisions was missing from their audio reproduction system. Why they thought it was OK to keep a subwoofer out of their budget is beyond me, unless it's just the same old "photocentricism" that's running rampant in the industry and leads developers to spend an order of magnitude more money on the image generation than on sound. This is particularly ironic in VWE's case, since their own customer testing confirmed how important the sound was to their pod's success: They only had sound in two of their five test pods, and observed the predictable wide gap in the customers' ratings.

Crystal River Engineering (CRE) and Paradigm Simulation teamed up to produce an impressive demo of coordinated 3D graphics (Paradigm's Vega) and 3D sound. The demo integrated CRE's 3D audio hardware (the Acoustetron II) with Paradigm's real-time audio simulation development software for Silicon Graphics workstations (AudioWorks2). Paradigm's software prioritizes sound emitters in the simulation, and then sends them to CRE's hardware to be filtered according to the directional response characteristics of the human head and pinna (outer ear). The Acoustetron II is a $10,000 system that can process 8 simultaneous sources at 44.1 kHz, or 16 simultaneous sources at 22.05 kHz, using stored measurements of these directional response characteristics (compressed 256-coefficient Head-Related Transfer Functions, or HRTFs, mentioned in the above analogy). If the number of concurrent sounds is reduced to 4, the Acoustetron II can also generate 6 simulated reflections per sound source using 18-coefficient filters for each.
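To get a feel for the number crunching behind these figures, the sketch below counts multiply-accumulate operations per second under a strong assumption: that every filter is applied as a direct time-domain convolution, once per ear. The actual Acoustetron II implementation is not described here, and the sample rate used for the reflection mode (44.1 kHz) is also an assumption, so the totals indicate scale only.

```python
def macs_per_second(sources, taps, sample_rate_hz, ears=2):
    """Multiply-accumulates per second for direct FIR filtering.

    Assumes one FIR filter of `taps` coefficients per source per ear.
    """
    return sources * taps * sample_rate_hz * ears

# The two quoted HRTF modes, each with 256-coefficient filters.
full_rate = macs_per_second(sources=8, taps=256, sample_rate_hz=44100)
half_rate = macs_per_second(sources=16, taps=256, sample_rate_hz=22050)

# Reflection mode: 4 sources x 6 reflections, 18 coefficients each
# (the 44.1 kHz sample rate here is assumed, not stated).
reflections = macs_per_second(sources=4 * 6, taps=18, sample_rate_hz=44100)
```

Under these assumptions both HRTF modes cost the same, roughly 181 million multiply-accumulates per second, which is consistent with halving the sample rate buying exactly twice as many sources. The six reflections for four sources would add roughly 38 million more.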

The payoff is clear

When you hear the output of the Acoustetron II over headphones in a head-tracking simulation, the payoff for all this number crunching is clear. For non-head-tracking applications and for loudspeaker reproduction, however, the MIPS used to apply HRTFs to the sounds would probably be better spent on generating proper reverb. The lack of reverb processing in CRE's products has always been a disappointment to me. In the best installations, the output of their system is sent through a reverberation unit, but this doesn't produce a desirable result since the source has already been spatialized (using filters, Doppler shift, attenuation with distance, etc.). To understand why this might be, let's return to the analogy between graphic and audio rendering. In 3D graphic rendering, objects are made to appear farther in the distance by desaturating their color, and sometimes shifting their color toward blue. This simulates the phenomenon known as "aerial perspective." If you added this desaturation to the entire visual scene as a post-process, all the elements in the scene would recede together. This is strictly analogous to running the output of a 3D audio renderer through a reverberator.

Even for conventional commercial audio production the analogy holds: every sound mixing engineer knows that to control distance, the reverb must be applied to the pre-fade sends of the mixing console, rather than the post-fade sends.
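The difference between the two kinds of send can be sketched in a few lines. This is a simplified model of a mixing console channel, assuming linear gains and a reverb whose output level simply follows its input level; the names and the send gain of 0.3 are illustrative.

```python
def mix_levels(fader_gain, send_gain=0.3, pre_fade=True):
    """Return (direct_level, reverb_input_level) for one console channel.

    A pre-fade send taps the signal before the channel fader, so the reverb
    input ignores the fader. A post-fade send taps it after the fader, so
    the reverb input scales along with the direct sound.
    """
    direct = fader_gain
    reverb_input = send_gain if pre_fade else send_gain * fader_gain
    return direct, reverb_input

def direct_to_reverb_ratio(fader_gain, pre_fade):
    """Ratio of direct level to reverb input level for a given fader setting."""
    direct, reverb_input = mix_levels(fader_gain, pre_fade=pre_fade)
    return direct / reverb_input
```

Pulling the fader from 1.0 down to 0.25 with a pre-fade send cuts the direct-to-reverb ratio to a quarter, so the source seems to recede. With a post-fade send the ratio does not change at all, and the source only gets quieter, just as when an already attenuated signal is fed to a reverberator after the fact.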

Just as Paradigm's Vega demo was coupled with CRE's Acoustetron II, Sense8's World Tool Kit (WTK) demo was coupled with the SACS (Sampling Acquisition/Control System) from Visual Synthesis Incorporated (VSI). In both of these integrations, the highlighted feature was ease-of-use. When a graphic object is moved in Vega or WTK, the location of the sound emitter in the simulation is automatically updated. Only a single command is required to move both. It was difficult to evaluate the spatial sound imagery in the Sense8/VSI integration, however, since they had placed the loudspeakers directly underneath the bench I was to sit on. I stuck my face against the BOOM visual display (from Fakespace), which might as well have included earphones or even "nearphones" for how encumbering it was, and though I was sailing past a clanking buoy, all the sound seemed to come from underneath the bench (no surprise).

The gang at Gravity produced an even better "3D-flying-thing" PC-game this year than the one I raved about in my SIGGRAPH report last year. This time instead of a 3D bug, it was a 3D bat in a title called "Dingbats." Once again, the sound design was fun and cute, and featured a very cool, jazzy musical score (with a walking string bass!) composed by Paul Godwin. The audio was spatialized using the Q Sound process, which let sound emitters track left and right with the movement of graphical objects on... and off screen. Yes, when a graphical object flew off screen, the sound it was making continued to fly right off the screen and all the way past the loudspeaker until it was well out to the side of the listener. This was a successful demo of the Q Sound effect, but with such great content, I'd really like to experience the game using more complete 3D audio processing (Sean Burke-Gaffney, technical support for Q Sound, says forthrightly that Q Sound, though often described as 3D, in fact, delivers sound only in an arc around the listener). What the game is calling out for is Doppler shift on all the moving sources. You've got a bat flitting back and forth in front of you, and you ought to be hearing lots of shifting pitch, as well as shifting time delays between the two loudspeaker signals. And did I mention that the bat was flying through a series of caves? Let's at least try the game on a sound card that has reverb on it, so that it sounds like we're in a cave rather than in wide-open sky.

Another spatial audio installation worth noting was presented by Robin Bargar of NCSA's Audio Development Group: "Audio Navigation in a Cyberspace Village." Before arriving at the convention, the participants in the "Interactive Communities" program at SIGGRAPH sent Robin sound samples that were characteristic of the contributions they were bringing to the convention. These samples were spatialized over four loudspeakers surrounding the listener, not according to the actual spatial location of those contributions on the show floor (which might have been expected), but rather according to their position in an abstract space defined by the conceptual dimensions that could be used to distinguish between all the contributions. This was a very cool experiment in spatial data management. It mystified virtually everyone I asked about it. Perhaps this elegant idea was contaminated by all those tetrahedra on the video wall recurring forever through a space that wraps back on itself along all three axes.

Sound and VRML

I'll end with one final note on a topic only tenuously connected to spatial sound... tenuously connected, at least at this time. At the Sunday evening VRML BOF (Virtual Reality Modeling Language, Birds of a Feather), a group of ten volunteers announced to a large group of VRML enthusiasts that they would be working together to add extensions to the VRML 1.0 specification. VRML 1.0 is a file format that provides a standardized 3D-graphic interface to the World Wide Web, but currently provides no means for interactive behavior, no multi-user support, none of the many other wanted and needed features, and last but not least, no sound.

In fact, VRML worlds are all pretty still and silent these days unless your VRML browser can handle custom extensions for features that couldn't be included in the VRML 1.0 spec. So the group of ten volunteers, dubbed the VRML Architecture Group, will be working hard to ensure that proposed extensions to VRML 1.0 are put through a rigorous technical review process and are accompanied by sufficient guidelines for VRML authors and browser developers. VRML Architecture Group working documents, proposals and meeting minutes are freely available to all interested parties at the VAG World Wide Web site, http://vrml.wired.com/VAG. This site is the repository for information related to future enhancements of the VRML language.

Dr. William Martens is a perceptual psychologist specializing in spatial hearing research and the simulation of the acoustical cues used in human sound localization. He holds several patents on spatial sound processing technology and has published many papers on the subject. As a member of the VRML Architecture Group, he is responsible for overseeing the review of VRML audio extensions and establishing guidelines for audio rendering in VRML worlds. Dr. Martens currently holds the position of Headspatial Technologist at Headspace in San Mateo, CA.

