Chapter 1: Audio Fundamentals - Understanding Sound in the Digital World

Introduction

Before we can teach machines to understand audio, we need to understand it ourselves. This chapter covers the fundamental concepts of audio - from the physics of sound waves to how computers represent and process audio data.

Physics Behind Sound: Understanding Wave Types

Before diving into digital audio, let's understand the fundamental physics of sound. Sound is just one type of wave in our universe, and understanding its place in the wave spectrum helps us appreciate its unique properties.

Types of Waves in Nature

Waves are disturbances that transfer energy from one place to another without transferring matter. They are classified into three main categories:

1. Mechanical Waves

Sound is a mechanical wave - it requires a material medium (air, water, solid) to propagate.

  • Examples: Sound waves, water waves, seismic waves, waves on a string
  • Key property: Cannot travel through vacuum
  • How sound works: Air molecules compress and expand, creating pressure variations that travel outward

2. Electromagnetic (EM) Waves

Generated by oscillating electric and magnetic fields.

  • Examples: Light, radio waves, X-rays, microwaves
  • Key property: Can travel through vacuum at the speed of light
  • Difference from sound: No medium required, much faster propagation

3. Matter Waves

Quantum mechanical waves associated with moving particles.

  • Examples: Electron waves, de Broglie waves
  • Key property: Exhibit wave-particle duality
  • Application: Used in electron microscopes, quantum computing

Why Sound Needs a Medium

Sound requires a medium because it's a pressure wave that needs particles to transmit energy. The speed of sound varies dramatically in different media:

  • Vacuum: Cannot propagate (no particles)
  • Air (20°C): 343 m/s
  • Water: 1,480 m/s
  • Steel: 5,960 m/s
  • Diamond: 12,000 m/s

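Key Insight: Sound speed depends on a medium's stiffness relative to its density, v = √(stiffness / density). Water, steel, and diamond are denser than air but far stiffer, so they carry sound faster; higher density on its own would slow the wave down.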
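To make the dependence on the medium concrete, here is a minimal sketch using the Newton-Laplace relation v = √(K / ρ) for fluids. The material constants below are rounded textbook values that are not given in this chapter, so treat them as assumptions.

```python
import math


def speed_in_fluid(bulk_modulus_pa: float, density_kg_m3: float) -> float:
    """Newton-Laplace relation: v = sqrt(K / rho)."""
    return math.sqrt(bulk_modulus_pa / density_kg_m3)


# Air at 20 °C: adiabatic bulk modulus K = gamma * P (assumed gamma = 1.4, P = 101,325 Pa)
air = speed_in_fluid(1.4 * 101_325, 1.204)
# Fresh water near 20 °C: assumed K of about 2.2 GPa, density of about 998 kg/m^3
water = speed_in_fluid(2.2e9, 998.0)

print(f"air: {air:.0f} m/s, water: {water:.0f} m/s")
```

Water is roughly 800 times denser than air, yet sound travels through it more than four times faster because its bulk modulus is larger by a much bigger factor.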

Wave Classification by Particle Motion

Understanding how particles move relative to wave propagation is crucial for understanding sound:

1. Longitudinal Waves

Sound is a longitudinal wave - particles vibrate parallel to the direction of wave propagation.

  • Particle motion: Back and forth along the wave direction
  • Creates: Compressions (high pressure) and rarefactions (low pressure)
  • Examples:
    • Sound waves in air, water, and solids
    • P-waves (primary seismic waves)
    • Pressure waves in fluids
  • Visualization: Like a compressed and stretched spring

2. Transverse Waves

Particles vibrate perpendicular to the direction of wave propagation.

  • Particle motion: Up and down or side to side
  • Creates: Crests (peaks) and troughs (valleys)
  • Examples:
    • Light and all EM waves
    • Waves on a string
    • S-waves (secondary seismic waves)
    • Water surface waves (partially)
  • Note: Mechanical transverse waves cannot propagate through the bulk of a fluid (liquid or gas), because fluids do not resist shear

Key Differences Between Wave Types:

  • Longitudinal Waves (Sound): Particles move parallel to wave direction, creating compressions and rarefactions. Sound needs a medium because particles must push each other.
  • Transverse Waves (Light): The oscillation is perpendicular to the wave direction, creating crests and troughs. Light doesn't need a medium because it is an oscillation of electric and magnetic fields rather than of particles.

Wave Classification by Propagation

Progressive (Traveling) Waves

Waves that move through space, carrying energy from source to destination.

  • Sound waves from a speaker travel to your ears
  • Water ripples spread outward from a dropped stone
  • Energy transfer: Continuous from point A to point B

Standing Waves

Waves that oscillate in place without apparent movement (covered in advanced physics).

  • Guitar string vibrations
  • Organ pipe resonances
  • Energy: Stored rather than transmitted

Fundamental Wave Properties

Every wave, including sound, has these essential properties, shown here for an A4 note (440 Hz):

  • Frequency (ν): 440 Hz
  • Period (T): 2.27 ms
  • Wavelength (λ): 0.78 m (in air at 20°C)
  • Velocity (v): 343.2 m/s
  • Angular frequency (ω): 2764.6 rad/s

These properties are linked by the relation v = νλ (speed = frequency × wavelength), along with T = 1/ν and ω = 2πν.
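The A4 figures listed above can be reproduced from the frequency and the speed of sound alone. The following is a small sketch; the function name `wave_properties` is illustrative and not part of any library.

```python
import math


def wave_properties(frequency_hz: float, speed_m_s: float = 343.2):
    """Return (period in s, wavelength in m, angular frequency in rad/s)."""
    period_s = 1.0 / frequency_hz              # T = 1 / frequency
    wavelength_m = speed_m_s / frequency_hz    # lambda = v / frequency
    omega_rad_s = 2 * math.pi * frequency_hz   # omega = 2 * pi * frequency
    return period_s, wavelength_m, omega_rad_s


period_s, wavelength_m, omega_rad_s = wave_properties(440.0)
print(f"T = {period_s * 1000:.2f} ms, "
      f"wavelength = {wavelength_m:.2f} m, "
      f"omega = {omega_rad_s:.1f} rad/s")
```

Passing a different `speed_m_s` (for example 1480 for water) shows how the wavelength of the same note stretches in a faster medium while its period stays fixed.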

Waves exhibit double periodicity:

  1. Periodic in TIME: Repeats at regular time intervals (the period)
  2. Periodic in SPACE: Repeats at regular spatial intervals (the wavelength)

What is Sound?

Sound is a mechanical wave that propagates through a medium (usually air) as variations in pressure. These pressure variations cause our eardrums to vibrate, which our brain interprets as sound.

Understanding Sound Waves

[Interactive visualizer: Sound Wave Properties. Controls: frequency (20 Hz low, 440 Hz A4, 2000 Hz high), amplitude (silent, normal, loud), phase (up to 360°), and harmonics (timbre, which creates a sound's unique character).]

Try These:

  • Set frequency to 262 Hz for Middle C
  • Add harmonics to create richer sounds
  • Compare sine vs square waves at same frequency
  • Observe how phase affects wave position

Key Concepts:

  • Higher frequency = higher pitch
  • Larger amplitude = louder sound
  • Harmonics create timbre (tone quality)
  • Phase affects how waves combine

Key Properties of Sound Waves

  1. Frequency (Pitch): How many times the wave oscillates per second, measured in Hertz (Hz)

    • Human hearing range: 20 Hz to 20,000 Hz
    • Middle C on a piano: 261.63 Hz
    • Human speech: fundamental frequency mostly 85-255 Hz (harmonics extend much higher)
  2. Amplitude (Loudness): The magnitude of pressure variations

    • Measured in decibels (dB)
    • Whisper: ~30 dB
    • Normal conversation: ~60 dB
    • Rock concert: ~110 dB
  3. Phase: The position of a point in time on a waveform cycle

    • Important for how sounds combine
    • Critical for spatial audio perception
  4. Timbre: The quality that distinguishes different sound sources

    • Why a piano and guitar playing the same note sound different
    • Determined by harmonics and overtones (try adding harmonics in the visualizer above!)
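
The decibel values listed under Amplitude sit on a logarithmic scale. As a rough sketch, assuming the standard 20 µPa reference pressure for dB SPL in air (a convention this chapter does not state explicitly), the conversion between sound pressure and decibels looks like this:

```python
import math

P_REF = 20e-6  # Pa; assumed reference pressure for dB SPL in air


def pressure_to_db_spl(p_rms_pa: float) -> float:
    """Convert RMS sound pressure in pascals to dB SPL."""
    return 20 * math.log10(p_rms_pa / P_REF)


def db_spl_to_pressure(db_spl: float) -> float:
    """Convert dB SPL back to RMS sound pressure in pascals."""
    return P_REF * 10 ** (db_spl / 20)


conversation_pa = db_spl_to_pressure(60)
concert_pa = db_spl_to_pressure(110)
print(f"60 dB: {conversation_pa:.3f} Pa, 110 dB: {concert_pa:.2f} Pa, "
      f"ratio: {concert_pa / conversation_pa:.0f}x")
```

A 50 dB difference corresponds to a pressure ratio of about 316, which is why decibels are more practical than raw pressure values.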

Sound Frequency Classification: What We Can and Cannot Hear

Sound waves are classified by their frequency ranges, and not all of them are audible to humans. Understanding these classifications is crucial for audio engineering and deep learning applications.

The Complete Sound Spectrum

INFRASOUND (< 20 Hz)

  • Human Audible: ❌ No
  • Examples:
    • Earthquakes (0.01-10 Hz)
    • Ocean waves (0.1-1 Hz)
    • Elephant communication (5-20 Hz)
    • Weather systems (0.001-0.1 Hz)
  • Applications: Seismic monitoring, wildlife research

HUMAN AUDIBLE RANGE (20 Hz - 20,000 Hz)

  • Human Audible: ✅ Yes
  • Sub-ranges (Musical/Audio Engineering):
    • Sub-bass: 20-60 Hz
    • Bass: 60-250 Hz
    • Low-mid: 250-500 Hz
    • Midrange: 500-2000 Hz
    • Upper-mid: 2000-4000 Hz
    • Presence: 4000-6000 Hz
    • Brilliance: 6000-20000 Hz

ULTRASOUND (> 20,000 Hz)

  • Human Audible: ❌ No
  • Examples:
    • Dog whistle (23-54 kHz)
    • Bat echolocation (20-200 kHz)
    • Medical ultrasound (2-18 MHz)
    • Ultrasonic cleaning (20-400 kHz)
  • Applications: Medical imaging, sonar, cleaning

KEY FACTS ABOUT HUMAN HEARING:

  • Young adults: Can typically hear 20 Hz - 20,000 Hz
  • Age-related loss: Upper limit decreases ~1 kHz per decade after age 20
  • Most sensitive: 2,000 - 5,000 Hz (speech consonants)
  • Speech range: 85 - 255 Hz (fundamental), up to 8 kHz (harmonics)
  • Piano range (fundamentals): 27.5 Hz (A0) to 4,186 Hz (C8)
  • Pain threshold: ~120-130 dB SPL, varying somewhat with frequency

WAVELENGTH AT DIFFERENT FREQUENCIES (in air at 20°C):

  • 20 Hz (Lower human limit): 17.15 meters
  • 440 Hz (A4 concert pitch): 78.0 cm
  • 1000 Hz (1 kHz reference): 34.3 cm
  • 20000 Hz (Upper human limit): 1.7 cm
  • 40000 Hz (Ultrasound/dog hearing): 0.9 cm

Understanding Frequency Ranges

1. Infrasound (< 20 Hz)

Human Audible: ❌ No (but can be felt as vibration)

Infrasound consists of frequencies below the human hearing threshold. While we can't hear these frequencies, we can often feel them as physical vibrations.

  • Natural Sources: Earthquakes, ocean waves, thunder, wind
  • Animal Communication: Elephants use infrasound for long-distance communication
  • Industrial: Large machinery, ventilation systems
  • Effects on Humans: Can cause uneasiness, dizziness at high intensities
  • Detection: Requires specialized equipment (seismographs, infrasound monitors)

2. Human Audible Range (20 Hz - 20,000 Hz)

Human Audible: ✅ Yes

This is the sweet spot for human perception, though the actual range varies significantly between individuals.

Musical and Audio Engineering Sub-ranges:
  • Sub-bass (20-60 Hz): Felt more than heard, adds "weight" to music

    • Lowest piano notes, kick drum fundamental
    • Home theater subwoofers optimize for this range
  • Bass (60-250 Hz): Foundation of music

    • Bass guitar, low male vocals
    • Most musical fundamentals
  • Low-midrange (250-500 Hz): Body and warmth

    • Lower harmonics of vocals
    • Fullness of instruments
  • Midrange (500-2,000 Hz): Critical for speech intelligibility

    • Most important for human communication
    • Sits just below the 2,000-5,000 Hz region where our ears are most sensitive
  • Upper-midrange (2,000-4,000 Hz): Clarity and definition

    • Consonants in speech (s, t, k sounds)
    • Attack of percussive instruments
  • Presence (4,000-6,000 Hz): Detail and articulation

    • Sibilance in vocals
    • "Bite" of electric guitars
  • Brilliance (6,000-20,000 Hz): Air and sparkle

    • Cymbals, highest harmonics
    • Sense of "openness" in recordings

3. Ultrasound (> 20,000 Hz)

Human Audible: ❌ No

Frequencies above human hearing but extremely useful in technology and nature.

  • Near Ultrasound (20-100 kHz):

    • Dog whistles (23-54 kHz)
    • Cat hearing (up to 64 kHz)
    • Rodent deterrents (30-70 kHz)
  • Mid Ultrasound (100 kHz - 1 MHz):

    • Bat echolocation (20-200 kHz)
    • Dolphin sonar (up to 150 kHz)
    • Ultrasonic cleaning (40-200 kHz)
  • High Ultrasound (1 MHz - 1 GHz):

    • Medical ultrasound imaging (2-18 MHz)
    • Industrial non-destructive testing
    • Ultrasonic welding
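
The band boundaries above translate directly into a lookup. The sketch below is illustrative only: the function name and the treatment of boundary values (each band includes its lower edge; exactly 20 kHz counts as audible) are assumptions, not a standard.

```python
# (upper edge in Hz, band name) for the audible sub-ranges listed above
AUDIBLE_BANDS = [
    (60, "sub-bass"),
    (250, "bass"),
    (500, "low-midrange"),
    (2_000, "midrange"),
    (4_000, "upper-midrange"),
    (6_000, "presence"),
    (20_000, "brilliance"),
]


def classify_frequency(hz: float) -> str:
    """Map a frequency in Hz to the band names used in this chapter."""
    if hz < 20:
        return "infrasound"
    if hz > 20_000:
        return "ultrasound"
    for upper_edge, name in AUDIBLE_BANDS:
        if hz < upper_edge:
            return name
    return AUDIBLE_BANDS[-1][1]  # exactly 20 kHz: top of the audible range


for f in (10, 440, 3_000, 40_000):
    print(f, "Hz ->", classify_frequency(f))
```

A helper like this can be handy when labeling spectrogram bins or deciding which filter bands matter for a task.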

Age and Hearing Loss

Age-Related Hearing Range Changes (approximate; individual variation is large):

  • Child (< 10): 20 Hz - 20,000 Hz
  • Teen (10-19): 20 Hz - 18,000 Hz
  • Young Adult (20-30): 20 Hz - 17,000 Hz
  • Adult (30-40): 25 Hz - 16,000 Hz
  • Middle Age (40-50): 30 Hz - 14,000 Hz (⚠️ Lost: highest harmonics)
  • Older Adult (50-60): 35 Hz - 12,000 Hz (⚠️ Lost: brilliance range)
  • Senior (60-70): 40 Hz - 10,000 Hz
  • Elderly (70+): 50 Hz - 8,000 Hz (⚠️ Lost: some speech clarity)

Standard Audiometry Test Frequencies: 250, 500, 1000, 2000, 4000, 8000 Hz

These frequencies test critical points for:

  • Speech understanding (500-4000 Hz)
  • Music appreciation (250-8000 Hz)
  • Environmental awareness (all frequencies)

Practical Applications in Audio Deep Learning

Understanding frequency ranges is crucial for:

  1. Feature Extraction: Knowing which frequencies contain relevant information
  2. Data Preprocessing: Applying appropriate filters for specific tasks
  3. Model Design: Choosing architectures that capture relevant frequency ranges
  4. Augmentation: Realistic frequency-based data augmentation

For example, for speech recognition, we might:

  • Focus on 80 Hz - 8 kHz (where speech information lives)
  • Apply high-pass filter at 80 Hz to remove rumble
  • Use 16 kHz sampling rate (sufficient for 8 kHz content)
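
The speech steps above could be sketched with SciPy as follows. This is one possible implementation under stated assumptions: the 4th-order Butterworth filter, the zero-phase filtering, and the name `preprocess_speech` are illustrative choices, not something the chapter prescribes.

```python
from math import gcd

import numpy as np
from scipy.signal import butter, resample_poly, sosfiltfilt


def preprocess_speech(audio, sr, target_sr=16_000, cutoff_hz=80.0):
    """High-pass at cutoff_hz to remove rumble, then resample to target_sr."""
    # 4th-order Butterworth high-pass, run forward and backward (zero phase)
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, audio)
    # Polyphase resampling applies its own anti-aliasing low-pass filter
    g = gcd(sr, target_sr)
    return resample_poly(filtered, target_sr // g, sr // g), target_sr


# Synthetic clip: 30 Hz rumble plus a 1 kHz tone, 1 second at 48 kHz
sr = 48_000
t = np.arange(sr) / sr
clip = 0.5 * np.sin(2 * np.pi * 30 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

clean, clean_sr = preprocess_speech(clip, sr)
spectrum = np.abs(np.fft.rfft(clean))  # 1 Hz per bin for a 1-second clip
print(len(clean), clean_sr, spectrum[30] / spectrum[1000])
```

The printed ratio compares what remains of the rumble to the 1 kHz tone; it should be far below 1, since the two started at equal amplitude.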

For music analysis, we need:

  • Full 20 Hz - 20 kHz range
  • 44.1 kHz or 48 kHz sampling rate
  • Careful handling of bass and treble information

Analog to Digital: Capturing Sound

To work with audio on computers, we need to convert continuous analog signals to discrete digital representations through sampling and quantization. This involves three fundamental concepts: waveforms, sample rate, and bit depth.

Key Digital Audio Concepts

What is a Waveform?

A waveform is simply air pressure changing over time. When you speak, your vocal cords push air molecules in waves. A microphone converts that pressure into a continuous electrical signal, which software conventionally normalizes to the range -1 to +1. That curve is the waveform. When a speaker plays it back, it physically pushes air in that same pattern, and your ears interpret it as sound.

Sample Rate (Frequency of Snapshots)

Sample rate is how many "snapshots" of the waveform you take per second. Audio is continuous in the real world, but computers are digital — so we discretize it by measuring the amplitude at regular intervals.

  • 24 kHz = 24,000 samples/sec: One snapshot every 0.000042 seconds
  • The Nyquist theorem tells us the sample rate must be at least 2× the highest frequency in the signal to reconstruct it faithfully
  • Human speech tops out around 8-10 kHz, so 24 kHz gives you plenty of headroom
  • This is why speech synthesis systems like KittenTTS output 24 kHz — it's the sweet spot for speech: higher quality than phone calls, lighter than music

Bit Depth (Measurement Precision)

Bit depth is the precision of each sample measurement:

  • 8 bits: 256 possible amplitude levels — coarse, you can hear the "staircase" rounding error as background hiss (quantization noise)
  • 16 bits: 65,536 levels — the steps are so tiny the ear can't detect them
  • 24 bits: 16,777,216 levels — used in professional recording for maximum headroom

A common configuration for speech synthesis is 24 kHz × 16-bit, which you'll see in code like:

import librosa
audio, sr = librosa.load("speech.wav", sr=24000, mono=True)
# audio.shape == (72000,) for a 3-second clip; sr == 24000

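Three seconds at 24 kHz produces 72,000 samples; stored as 16-bit PCM (2 bytes/sample) that's 144,000 bytes, roughly 141 KB on disk. The float32 array that librosa returns uses 4 bytes/sample, roughly 281 KB in memory.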

The Sampling Process in Detail

[Figure: Digital Audio: Waveform, Sample Rate & Bit Depth. Panel 1, Waveform: air pressure over time on an amplitude axis from -1 to +1, with peaks and troughs that speakers recreate by pushing and pulling air. Panel 2, Sample Rate: a low rate gives a jagged reconstruction, a high rate (24 kHz = 24,000 samples/sec) a smooth one. Panel 3, Bit Depth: 2-bit (4 levels, coarse and noisy), 8-bit (256 levels, decent quality), 16-bit (65,536 levels, CD quality).]

| Format | Sample Rate | Bit Depth | File Size/min |
|---|---|---|---|
| Phone call | 8 kHz | 8-bit | ~0.5 MB |
| KittenTTS / TTS | 24 kHz | 16-bit | ~2.8 MB |
| CD audio | 44.1 kHz | 16-bit | ~10 MB (stereo) |

[Interactive demo: overlays the continuous waveform, sample points, quantized signal, and quantization error, with selectable sample rates (8, 24, 44.1, 96 kHz) and bit depths (4, 8, 16, 24-bit).]

Waveform

Air pressure changing over time. The blue line shows the continuous analog signal that exists in nature.

Sample Rate

How often we measure the waveform. Red dots show where we take snapshots. Higher rate = better quality.

Bit Depth

Precision of each measurement. Green line shows quantized values. More bits = less noise.

File Size: 46.9 KB/second for 24 kHz, 16-bit mono audio (2,812.5 KB/minute)

Formula: (Sample Rate × Bit Depth) ÷ 8 ÷ 1024 = KB/sec
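
The formula above is easy to check in code. This sketch covers uncompressed PCM only and ignores file headers; the helper names are illustrative.

```python
def pcm_size_bytes(sample_rate: int, bit_depth: int, seconds: float,
                   channels: int = 1) -> float:
    """Raw PCM payload size in bytes (no container header)."""
    return sample_rate * (bit_depth / 8) * channels * seconds


def pcm_kb_per_second(sample_rate: int, bit_depth: int, channels: int = 1) -> float:
    """(Sample Rate x Bit Depth) / 8 / 1024, per channel count."""
    return pcm_size_bytes(sample_rate, bit_depth, 1, channels) / 1024


tts_rate = pcm_kb_per_second(24_000, 16)            # mono speech
three_sec_clip = pcm_size_bytes(24_000, 16, 3)      # 3-second mono clip
cd_minute = pcm_size_bytes(44_100, 16, 60, channels=2)  # stereo CD audio

print(f"{tts_rate} KB/s, {three_sec_clip:.0f} bytes, {cd_minute / 1e6:.2f} MB/min")
```

Doubling either the bit depth or the channel count doubles the size, which is why stereo CD audio lands near 10 MB per minute.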

Nyquist Theorem

Maximum frequency that can be accurately captured at a 24 kHz sample rate: 12,000 Hz (half the sample rate). This is why 24 kHz works well for speech: it captures content up to 12 kHz.

Sampling at different rates:

For a 440 Hz signal (A4 note), here's how different sampling rates perform:

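| Sample Rate | Samples in 10 ms | Time Between Samples | Max Frequency (Nyquist) | Captures 440 Hz? |
|---|---|---|---|---|
| 8,000 Hz | 80 | 0.125 ms | 4,000 Hz | Yes |
| 16,000 Hz | 160 | 0.0625 ms | 8,000 Hz | Yes |
| 24,000 Hz | 240 | 0.042 ms | 12,000 Hz | Yes |
| 44,100 Hz | 441 | 0.023 ms | 22,050 Hz | Yes |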

Nyquist-Shannon Sampling Theorem

The sampling theorem states that to represent a signal without aliasing, the sampling rate must be greater than twice the highest frequency present in the signal.

  • Nyquist Frequency: Half the sampling rate
  • Aliasing: Distortion that occurs when sampling below the Nyquist rate
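
Aliasing is easiest to see with a tone above the Nyquist frequency. In this minimal NumPy sketch (helper names are illustrative), a 5 kHz tone sampled at 8 kHz becomes indistinguishable from a 3 kHz tone, because it folds back to 8000 - 5000 = 3000 Hz.

```python
import numpy as np


def sample_tone(freq_hz: float, sr: int, seconds: float = 1.0) -> np.ndarray:
    """Sample a pure sine tone at the given rate."""
    t = np.arange(int(sr * seconds)) / sr
    return np.sin(2 * np.pi * freq_hz * t)


def dominant_frequency(signal: np.ndarray, sr: int) -> float:
    """Frequency of the largest FFT magnitude peak."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(freqs[np.argmax(spectrum)])


sr = 8_000  # Nyquist frequency is 4 kHz
below_nyquist = dominant_frequency(sample_tone(3_000, sr), sr)
aliased = dominant_frequency(sample_tone(5_000, sr), sr)
print(f"3 kHz tone measured at {below_nyquist:.0f} Hz")
print(f"5 kHz tone measured at {aliased:.0f} Hz")
```

This is why recordings are low-pass filtered before sampling or downsampling: once the samples are taken, the aliased component cannot be told apart from a real one.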

Common sampling rates:

  • 8 kHz: Telephone quality (captures up to 4 kHz)
  • 16 kHz: Wideband speech (captures up to 8 kHz)
  • 44.1 kHz: CD quality (captures up to 22.05 kHz)
  • 48 kHz: Professional audio
  • 96 kHz: High-resolution audio

Quantization: Bit Depth

After sampling, we need to represent each sample's amplitude as a digital value. Quantization converts continuous amplitude values to discrete levels.

For example, quantizing a value of 0.7234:

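| Bit Depth | Levels Available | Quantized Value | Error (% of full scale) |
|---|---|---|---|
| 2-bit | 4 | 0.666667 | 5.67% |
| 4-bit | 16 | 0.733333 | 0.99% |
| 8-bit | 256 | 0.721569 | 0.18% |
| 16-bit | 65,536 | 0.723400 | < 0.01% |

(Values assume a uniform quantizer over [0, 1] with levels at k / (2^bits - 1).)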

Notice how more bits = less error!
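
Here is a minimal sketch of quantization for experimenting with other values. It assumes a uniform quantizer over [0, 1] with evenly spaced levels at k / (2^bits - 1); real converters typically work on signed ranges and other conventions give slightly different numbers, so treat the outputs as illustrative.

```python
def quantize_unit(value: float, bits: int) -> float:
    """Snap a value in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    steps = 2 ** bits - 1          # number of intervals between levels
    return round(value * steps) / steps


value = 0.7234
errors = {}
for bits in (2, 4, 8, 16):
    q = quantize_unit(value, bits)
    errors[bits] = abs(q - value)
    print(f"{bits:>2}-bit: {q:.6f}  error = {errors[bits] * 100:.4f}% of full scale")
```

For a fixed input the error does not always shrink by the same factor per bit, but the worst-case error (half a step) halves with every added bit.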

Common bit depths:

| Bit Depth | Levels | Typical Use |
|---|---|---|
| 8-bit | 256 | Telephone quality |
| 16-bit | 65,536 | CD quality |
| 24-bit | 16,777,216 | Professional recording |
| 32-bit float | n/a (floating point) | Scientific computing |

Digital Audio Representation

Time Domain: Waveform

The waveform is the most direct representation of audio — amplitude values over time. A complex waveform can include multiple harmonics (integer multiples of the fundamental frequency) and an envelope (how the amplitude changes over time). For example, a musical note might have a fundamental frequency of 220 Hz (A3) with several harmonics and an exponential decay envelope.
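
The note described above can be synthesized in a few lines of NumPy. The harmonic amplitudes and decay rate below are arbitrary illustrative choices, not measurements of any real instrument.

```python
import numpy as np


def synth_note(f0=220.0, sr=24_000, seconds=1.0,
               harmonic_amps=(1.0, 0.5, 0.25), decay=3.0):
    """Fundamental plus harmonics, shaped by an exponential decay envelope."""
    t = np.arange(int(sr * seconds)) / sr
    # Harmonic k sits at an integer multiple (k + 1) of the fundamental
    tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))
    envelope = np.exp(-decay * t)          # amplitude fades over time
    note = tone * envelope
    return note / np.max(np.abs(note))     # normalize peak to 1.0


note = synth_note()
half = len(note) // 2
print(len(note), float(np.max(np.abs(note))))
print("energy first half vs second half:",
      float(np.sum(note[:half] ** 2)), float(np.sum(note[half:] ** 2)))
```

Changing `harmonic_amps` alters the timbre without changing the pitch, while `decay` controls how quickly the note dies away.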

Audio File Formats

Different formats store audio data in various ways:

| Format | Compression | Quality | File Size | Typical Use Case |
|---|---|---|---|---|
| WAV | None (lossless) | Perfect | Large | Professional audio, archiving |
| MP3 | Lossy (psychoacoustic) | Good (bitrate-dependent) | Small (~10% of WAV) | Music distribution, streaming |
| FLAC | Lossless | Perfect | Medium (~50-70% of WAV) | Audiophile music, archiving |
| OGG | Lossy (Vorbis) | Good to excellent | Small | Open-source apps, games |

Loading and Saving Audio

Different libraries can be used to load and save audio:

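| Library | Behavior |
|---|---|
| Librosa | By default converts to mono, resamples to 22,050 Hz, and returns floats in [-1, 1] |
| Soundfile | Preserves the file's sample rate and channels |
| Scipy | Returns integer values for integer PCM files |
| Torchaudio | Returns PyTorch tensors |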

A minimal example with each:

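# Librosa: mono float32 in [-1, 1], resampled to the requested rate
import librosa
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

# Soundfile: keeps the file's channels and sample rate (returns float64 by default)
import soundfile as sf
audio, sr = sf.read("clip.wav")

# Scipy: returns (sample_rate, data); integer PCM stays integer
from scipy.io import wavfile
sr, audio = wavfile.read("clip.wav")

# Torchaudio: returns a torch.Tensor of shape (channels, samples)
import torchaudio
waveform, sr = torchaudio.load("clip.wav")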

Each library has its own advantages depending on your use case.
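
For the saving side, here is a small round-trip sketch using SciPy. It writes a generated 440 Hz tone to a temporary 16-bit WAV file and reads it back; the file name and scaling convention are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

sr = 16_000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)          # float signal in [-1, 1]

# Convert floats to 16-bit integers before writing PCM
pcm16 = np.round(tone * 32767).astype(np.int16)

path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, sr, pcm16)

# Reading returns the sample rate first, then the integer samples
loaded_sr, loaded = wavfile.read(path)
as_float = loaded.astype(np.float32) / 32768.0    # back to roughly [-1, 1]
print(loaded_sr, loaded.dtype, loaded.shape, float(np.max(np.abs(as_float))))
```

Because the samples were already integers when written, the round trip is exact; the only loss happens at the float-to-int16 rounding step.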

Key Takeaways

  1. Sound is a wave: Understanding wave properties helps us choose appropriate representations
  2. Sampling theorem: Sample at least 2× the highest frequency you want to capture
  3. Digital representation: Balance between quality (sampling rate, bit depth) and file size
  4. Know your format: Different formats suit different applications — lossless for archiving, lossy for distribution

Exercises

Chapter 1 Knowledge Check

Question 1 of 7

Sound waves in air are best classified as which type of wave?

What's Next?

Now that we understand how audio is represented digitally, we're ready to explore how to extract meaningful features from audio signals. In the next chapter, we'll dive into signal processing techniques like the Fourier Transform, spectrograms, and feature extraction methods that form the foundation for audio deep learning.

