Chapter 1: Audio Fundamentals - Understanding Sound in the Digital World

Introduction

Before we can teach machines to understand audio, we need to understand it ourselves. This chapter covers the fundamental concepts of audio - from the physics of sound waves to how computers represent and process audio data.

Physics Behind Sound: Understanding Wave Types

Before diving into digital audio, let's understand the fundamental physics of sound. Sound is just one type of wave in our universe, and understanding its place in the wave spectrum helps us appreciate its unique properties.

Types of Waves in Nature

Waves are disturbances that transfer energy from one place to another without transferring matter. They are classified into three main categories:

1. Mechanical Waves

Sound is a mechanical wave - it requires a material medium (air, water, solid) to propagate.

Examples: Sound waves, water waves, seismic waves, waves on a string
Key property: Cannot travel through vacuum
How sound works: Air molecules compress and expand, creating pressure variations that travel outward

2. Electromagnetic (EM) Waves

Generated by oscillating electric and magnetic fields.

Examples: Light, radio waves, X-rays, microwaves
Key property: Can travel through vacuum at the speed of light
Difference from sound: No medium required, much faster propagation

3. Matter Waves

Quantum mechanical waves associated with moving particles.

Examples: Electron waves, de Broglie waves
Key property: Exhibit wave-particle duality
Application: Used in electron microscopes, quantum computing

Why Sound Needs a Medium

Sound requires a medium because it's a pressure wave that needs particles to transmit energy. The speed of sound varies dramatically in different media:

Vacuum: Cannot propagate (no particles)
Air (20°C): 343 m/s
Water: 1,480 m/s
Steel: 5,960 m/s
Diamond: 12,000 m/s

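Key Insight: Speed depends on how stiff the medium is relative to its density, not on density alone. Steel and diamond are far denser than air, but they are vastly stiffer, so a disturbance is passed from particle to particle much faster.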

Wave Classification by Particle Motion

Understanding how particles move relative to wave propagation is crucial for understanding sound:

1. Longitudinal Waves

Sound is a longitudinal wave - particles vibrate parallel to the direction of wave propagation.

Particle motion: Back and forth along the wave direction
Creates: Compressions (high pressure) and rarefactions (low pressure)
Examples:
- Sound waves in air, water, and solids
- P-waves (primary seismic waves)
- Pressure waves in fluids
Visualization: Like a compressed and stretched spring

2. Transverse Waves

Particles vibrate perpendicular to the direction of wave propagation.

Particle motion: Up and down or side to side
Creates: Crests (peaks) and troughs (valleys)
Examples:
- Light and all EM waves
- Waves on a string
- S-waves (secondary seismic waves)
- Water surface waves (partially)
Note: Mechanical transverse (shear) waves cannot travel through the bulk of a fluid (liquid or gas), because fluids do not resist shearing

Key Differences Between Wave Types:

Longitudinal Waves (Sound): Particles move parallel to the wave direction, creating compressions and rarefactions. Sound needs a medium because particles must push on each other.
Transverse Waves (Light): The oscillation is perpendicular to the wave direction, creating crests and troughs. In light, what oscillates is the electric and magnetic fields rather than particles, so no medium is needed.

Wave Classification by Propagation

Progressive (Traveling) Waves

Waves that move through space, carrying energy from source to destination.

Sound waves from a speaker travel to your ears
Water ripples spread outward from a dropped stone
Energy transfer: Continuous from point A to point B

Standing Waves

Waves that oscillate in place without apparent movement (covered in advanced physics).

Guitar string vibrations
Organ pipe resonances
Energy: Stored rather than transmitted

Fundamental Wave Properties

Every wave, including sound, is described by a handful of related properties: frequency, period, wavelength, velocity, and angular frequency.

For example, an A4 note (440 Hz):

Frequency (ν): 440 Hz
Period (T): 2.27 ms
Wavelength (λ): 0.78 m (in air at 20°C)
Velocity (v): 343.2 m/s
Angular frequency (ω): 2764.6 rad/s

The wave equation relates these properties: v = νλ

Waves exhibit double periodicity:

Periodic in TIME: Repeats at regular time intervals (the period)
Periodic in SPACE: Repeats at regular spatial intervals (the wavelength)
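
The A4 values listed above all follow from the frequency and the speed of sound. The sketch below recomputes them; the function name `wave_properties` and the default speed of 343.2 m/s (air at 20°C) are illustrative choices, not part of any library.

```python
import math

def wave_properties(frequency_hz, speed_m_s=343.2):
    """Derive period, wavelength, and angular frequency from a frequency."""
    return {
        "period_s": 1.0 / frequency_hz,                 # T = 1 / ν
        "wavelength_m": speed_m_s / frequency_hz,       # λ = v / ν
        "angular_rad_s": 2.0 * math.pi * frequency_hz,  # ω = 2πν
    }

a4 = wave_properties(440.0)
print(f"Period: {a4['period_s'] * 1e3:.2f} ms")
print(f"Wavelength: {a4['wavelength_m']:.2f} m")
print(f"Angular frequency: {a4['angular_rad_s']:.1f} rad/s")
```

Passing a different `speed_m_s` (for example 1480 for water) shows how the wavelength of the same note stretches in a faster medium while the period stays the same.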

What is Sound?

Sound is a mechanical wave that propagates through a medium (usually air) as variations in pressure. These pressure variations cause our eardrums to vibrate, which our brain interprets as sound.

Understanding Sound Waves

Key Properties of Sound Waves

Frequency (Pitch): How many times the wave oscillates per second, measured in Hertz (Hz)
- Human hearing range: 20 Hz to 20,000 Hz
- Middle C on a piano: 261.63 Hz
- Human speech (fundamental frequency): roughly 85-255 Hz
Amplitude (Loudness): The magnitude of pressure variations
- Measured in decibels (dB)
- Whisper: ~30 dB
- Normal conversation: ~60 dB
- Rock concert: ~110 dB
Phase: The position of a point in time on a waveform cycle
- Important for how sounds combine
- Critical for spatial audio perception
Timbre: The quality that distinguishes different sound sources
- Why a piano and guitar playing the same note sound different
- Determined by harmonics and overtones
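
The decibel figures above are logarithmic, which is easy to misread. As a rough sketch, assuming the levels are sound pressure levels (dB SPL) relative to the conventional 20 µPa reference, the conversion between pressure and decibels looks like this. The helper names are hypothetical.

```python
import math

P_REF = 20e-6  # assumed reference pressure in pascals (20 µPa = 0 dB SPL)

def pressure_to_db_spl(pressure_pa):
    """Convert an RMS sound pressure in pascals to dB SPL."""
    return 20.0 * math.log10(pressure_pa / P_REF)

def db_spl_to_pressure(level_db):
    """Convert a level in dB SPL back to RMS pressure in pascals."""
    return P_REF * 10.0 ** (level_db / 20.0)

for label, level in [("Whisper", 30), ("Conversation", 60), ("Rock concert", 110)]:
    print(f"{label}: {level} dB SPL is about {db_spl_to_pressure(level):.4f} Pa")
```

Every 20 dB step corresponds to a tenfold change in pressure, so the rock concert in this list involves pressure variations 10,000 times larger than the whisper.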

Sound Frequency Classification: What We Can and Cannot Hear

Sound waves are classified by their frequency ranges, and not all of them are audible to humans. Understanding these classifications is crucial for audio engineering and deep learning applications.

The Complete Sound Spectrum

INFRASOUND (< 20 Hz)

Human Audible: ❌ No
Examples:
- Earthquakes (0.01-10 Hz)
- Ocean waves (0.1-1 Hz)
- Elephant communication (5-20 Hz)
- Weather systems (0.001-0.1 Hz)
Applications: Seismic monitoring, wildlife research

HUMAN AUDIBLE RANGE (20 Hz - 20,000 Hz)

Human Audible: ✅ Yes
Sub-ranges (Musical/Audio Engineering):
- Sub-bass: 20-60 Hz
- Bass: 60-250 Hz
- Low-mid: 250-500 Hz
- Midrange: 500-2000 Hz
- Upper-mid: 2000-4000 Hz
- Presence: 4000-6000 Hz
- Brilliance: 6000-20000 Hz

ULTRASOUND (> 20,000 Hz)

Human Audible: ❌ No
Examples:
- Dog whistle (23-54 kHz)
- Bat echolocation (20-200 kHz)
- Medical ultrasound (2-18 MHz)
- Ultrasonic cleaning (20-400 kHz)
Applications: Medical imaging, sonar, cleaning

KEY FACTS ABOUT HUMAN HEARING:
- Young adults: Can typically hear 20 Hz - 20,000 Hz
- Age-related loss: Upper limit decreases ~1 kHz per decade after 20
- Most sensitive: 2,000 - 5,000 Hz (speech consonants)
- Speech range: 85 - 255 Hz (fundamental), up to 8 kHz (harmonics)
- Music range: 27.5 Hz (A0 piano) to 4,186 Hz (C8 piano)
- Pain threshold: ~120-130 dB at any frequency

WAVELENGTH AT DIFFERENT FREQUENCIES (in air at 20°C):

20 Hz (Lower human limit): 17.15 meters
440 Hz (A4 concert pitch): 78.0 cm
1000 Hz (1 kHz reference): 34.3 cm
20000 Hz (Upper human limit): 1.7 cm
40000 Hz (Ultrasound/dog hearing): 0.9 cm

Understanding Frequency Ranges

1. Infrasound (< 20 Hz)

Human Audible: ❌ No (but can be felt as vibration)

Infrasound consists of frequencies below the human hearing threshold. While we can't hear these frequencies, we can often feel them as physical vibrations.

Natural Sources: Earthquakes, ocean waves, thunder, wind
Animal Communication: Elephants use infrasound for long-distance communication
Industrial: Large machinery, ventilation systems
Effects on Humans: Can cause uneasiness, dizziness at high intensities
Detection: Requires specialized equipment (seismographs, infrasound monitors)

2. Human Audible Range (20 Hz - 20,000 Hz)

Human Audible: ✅ Yes

This is the sweet spot for human perception, though the actual range varies significantly between individuals.

Musical and Audio Engineering Sub-ranges:

Sub-bass (20-60 Hz): Felt more than heard, adds "weight" to music
- Lowest piano notes, kick drum fundamental
- Home theater subwoofers optimize for this range
Bass (60-250 Hz): Foundation of music
- Bass guitar, low male vocals
- Most musical fundamentals
Low-midrange (250-500 Hz): Body and warmth
- Lower harmonics of vocals
- Fullness of instruments
Midrange (500-2,000 Hz): Critical for speech intelligibility
- Most important for human communication
- Just below the 2,000-5,000 Hz band where our ears are most sensitive
Upper-midrange (2,000-4,000 Hz): Clarity and definition
- Consonants in speech (s, t, k sounds)
- Attack of percussive instruments
Presence (4,000-6,000 Hz): Detail and articulation
- Sibilance in vocals
- "Bite" of electric guitars
Brilliance (6,000-20,000 Hz): Air and sparkle
- Cymbals, highest harmonics
- Sense of "openness" in recordings

3. Ultrasound (> 20,000 Hz)

Human Audible: ❌ No

Frequencies above human hearing but extremely useful in technology and nature.

Near Ultrasound (20-100 kHz):
- Dog whistles (23-54 kHz)
- Cat hearing (up to 64 kHz)
- Rodent deterrents (30-70 kHz)
Mid Ultrasound (100 kHz - 1 MHz):
- Bat echolocation (20-200 kHz)
- Dolphin sonar (up to 150 kHz)
- Ultrasonic cleaning (40-200 kHz)
High Ultrasound (1 MHz - 1 GHz):
- Medical ultrasound imaging (2-18 MHz)
- Industrial non-destructive testing
- Ultrasonic welding

Age and Hearing Loss

Age-Related Hearing Range Changes:

Child (< 10): 20 Hz - 20,000 Hz
Teen (10-19): 20 Hz - 18,000 Hz
Young Adult (20-30): 20 Hz - 17,000 Hz
Adult (30-40): 25 Hz - 16,000 Hz
Middle Age (40-50): 30 Hz - 14,000 Hz (⚠️ Lost: highest harmonics)
Older Adult (50-60): 35 Hz - 12,000 Hz (⚠️ Lost: brilliance range)
Senior (60-70): 40 Hz - 10,000 Hz
Elderly (70+): 50 Hz - 8,000 Hz (⚠️ Lost: some speech clarity)

Standard Audiometry Test Frequencies: 250, 500, 1000, 2000, 4000, 8000 Hz

These frequencies test critical points for:
- Speech understanding (500-4000 Hz)
- Music appreciation (250-8000 Hz)
- Environmental awareness (all frequencies)

Practical Applications in Audio Deep Learning

Understanding frequency ranges is crucial for:

Feature Extraction: Knowing which frequencies contain relevant information
Data Preprocessing: Applying appropriate filters for specific tasks
Model Design: Choosing architectures that capture relevant frequency ranges
Augmentation: Realistic frequency-based data augmentation

For example, for speech recognition, we might:

Focus on 80 Hz - 8 kHz (where speech information lives)
Apply high-pass filter at 80 Hz to remove rumble
Use 16 kHz sampling rate (sufficient for 8 kHz content)
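
To make the rumble-removal step concrete, here is a minimal sketch that zeroes everything below 80 Hz in the frequency domain. This brick-wall FFT approach is a stand-in chosen to keep the example dependency-light. A production pipeline would more likely use a proper filter design, and the name `fft_highpass` is hypothetical.

```python
import numpy as np

def fft_highpass(signal, sample_rate, cutoff_hz=80.0):
    """Zero out spectral content below cutoff_hz (crude brick-wall filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sample_rate = 16_000                        # Nyquist = 8 kHz, enough for speech
t = np.arange(sample_rate) / sample_rate    # one second of audio
rumble = 0.5 * np.sin(2 * np.pi * 50 * t)   # low-frequency rumble below the cutoff
voice = 0.5 * np.sin(2 * np.pi * 200 * t)   # stand-in for a voiced speech sound

filtered = fft_highpass(rumble + voice, sample_rate)
print(f"Largest deviation from the clean tone: {np.max(np.abs(filtered - voice)):.2e}")
```

The synthetic tones here fall exactly on FFT bins, so the rumble disappears almost perfectly. Real recordings will not be this clean, which is one reason smoother filters are preferred in practice.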

For music analysis, we need:

Full 20 Hz - 20 kHz range
44.1 kHz or 48 kHz sampling rate
Careful handling of bass and treble information

Analog to Digital: Capturing Sound

To work with audio on computers, we need to convert continuous analog signals to discrete digital representations. This involves three fundamental concepts: waveforms, sample rate, and bit depth.

Converting Analog to Digital

Continuous analog waveforms become digital audio through two steps: sampling and quantization. The sample rate determines how often we take "snapshots" of the wave, and the bit depth controls the precision of each measurement.

Key Digital Audio Concepts

What is a Waveform?

A waveform is simply air pressure changing over time. When you speak, your vocal cords push air molecules in waves. A microphone converts that pressure into a continuous electrical signal, which after digitization is usually scaled to values between -1 and +1. That curve is the waveform. When a speaker plays it back, it physically pushes air in that same pattern, and your ears interpret it as sound.

Sample Rate (Frequency of Snapshots)

Sample rate is how many "snapshots" of the waveform you take per second. Audio is continuous in the real world, but computers are digital — so we discretize it by measuring the amplitude at regular intervals.

24 kHz = 24,000 samples/sec: One snapshot every 0.000042 seconds
The Nyquist theorem tells us the sample rate must be at least 2× the highest frequency in the signal to reconstruct it faithfully
Human speech tops out around 8-10 kHz, so 24 kHz (which captures content up to 12 kHz) gives you plenty of headroom
This is why speech synthesis systems like KittenTTS output 24 kHz — it's the sweet spot for speech: higher quality than phone calls, lighter than music

Bit Depth (Measurement Precision)

Bit depth is the precision of each sample measurement:

8 bits: 256 possible amplitude levels — coarse, you can hear the "staircase" rounding error as background hiss (quantization noise)
16 bits: 65,536 levels — the steps are so tiny the ear can't detect them
24 bits: 16,777,216 levels — used in professional recording for maximum headroom

A common configuration for speech synthesis output is 24 kHz × 16-bit.

When working with digital audio, the data is typically stored as arrays of values. For example, 3 seconds of mono audio at a 24 kHz sampling rate has 72,000 samples. At 16-bit depth (2 bytes per sample), that is 144,000 bytes of raw audio, or approximately 140.6 KB. Held in memory as 32-bit floats, the same clip takes twice as much: 281.25 KB.
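
As a sketch of what 24 kHz, 16-bit audio looks like in code, the example below synthesizes three seconds of a 440 Hz tone and writes it as a mono WAV into an in-memory buffer using only Python's standard library. The tone, the constant names, and the choice of an in-memory buffer are illustrative assumptions.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000   # samples per second
BIT_DEPTH = 16         # bits per sample
DURATION_S = 3
NUM_SAMPLES = SAMPLE_RATE * DURATION_S

# Synthesize a 440 Hz sine at half of full scale as signed 16-bit integers.
peak = 2 ** (BIT_DEPTH - 1) - 1  # 32767, the largest positive 16-bit value
samples = [
    int(0.5 * peak * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(NUM_SAMPLES)
]
pcm_bytes = struct.pack(f"<{NUM_SAMPLES}h", *samples)  # little-endian int16

buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)               # mono
    wav_file.setsampwidth(BIT_DEPTH // 8)  # 2 bytes per sample
    wav_file.setframerate(SAMPLE_RATE)
    wav_file.writeframes(pcm_bytes)

print(f"Samples: {NUM_SAMPLES}")
print(f"Raw PCM size: {len(pcm_bytes)} bytes ({len(pcm_bytes) / 1024:.1f} KiB)")
print(f"WAV size including header: {len(buffer.getvalue())} bytes")
```

The raw size is simply samples × bytes per sample × channels. The WAV container adds only a small header on top.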

The Sampling Process in Detail

Sampling at different rates:

For a 440 Hz signal (A4 note), here's how different sampling rates perform:

8000 Hz:

Samples in 10ms: 80
Time between samples: 0.125ms
Max frequency (Nyquist): 4000 Hz
✓ Can capture 440 Hz signal

16000 Hz:

Samples in 10ms: 160
Time between samples: 0.0625ms
Max frequency (Nyquist): 8000 Hz
✓ Can capture 440 Hz signal

24000 Hz:

Samples in 10ms: 240
Time between samples: 0.042ms
Max frequency (Nyquist): 12000 Hz
✓ Can capture 440 Hz signal

44100 Hz:

Samples in 10ms: 441
Time between samples: 0.023ms
Max frequency (Nyquist): 22050 Hz
✓ Can capture 440 Hz signal
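
The figures in this comparison can be reproduced with a few lines of arithmetic. The following is a small sketch with a hypothetical helper, `sampling_summary`, that derives each quantity from the sampling rate.

```python
def sampling_summary(sample_rate_hz, signal_hz=440.0, window_ms=10.0):
    """Summarize what a sampling rate means for a tone of signal_hz."""
    nyquist_hz = sample_rate_hz / 2
    return {
        "samples_in_window": round(sample_rate_hz * window_ms / 1000),
        "sample_spacing_ms": 1000.0 / sample_rate_hz,
        "nyquist_hz": nyquist_hz,
        "can_capture": signal_hz < nyquist_hz,  # tone must lie below Nyquist
    }

for rate in (8_000, 16_000, 24_000, 44_100):
    s = sampling_summary(rate)
    print(
        f"{rate} Hz: {s['samples_in_window']} samples in 10 ms, "
        f"{s['sample_spacing_ms']:.4f} ms apart, "
        f"Nyquist {s['nyquist_hz']:.0f} Hz, captures 440 Hz: {s['can_capture']}"
    )
```

Changing `signal_hz` to something like 5000 shows the 8000 Hz rate failing the check, since 5 kHz lies above its 4 kHz Nyquist frequency.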

Nyquist-Shannon Sampling Theorem

The sampling theorem states that to accurately represent a signal, we must sample at a rate of at least twice the highest frequency present in the signal.

Nyquist Frequency: Half the sampling rate
Aliasing: Distortion that occurs when a signal contains frequencies above the Nyquist frequency; they fold back and show up as spurious lower frequencies
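
Aliasing is easiest to believe when you see it. In this sketch, a 3 kHz tone and a 5 kHz tone are both sampled at 8 kHz, and the strongest frequency in each is read off with an FFT. The helper `dominant_frequency` is a hypothetical name, and the tones are synthetic.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest FFT bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

sample_rate = 8_000                       # Nyquist frequency = 4 kHz
t = np.arange(sample_rate) / sample_rate  # one second, so bins are 1 Hz apart

ok_tone = np.sin(2 * np.pi * 3_000 * t)       # below Nyquist: captured correctly
aliased_tone = np.sin(2 * np.pi * 5_000 * t)  # above Nyquist: folds back

print(f"3 kHz tone appears at {dominant_frequency(ok_tone, sample_rate):.0f} Hz")
print(f"5 kHz tone appears at {dominant_frequency(aliased_tone, sample_rate):.0f} Hz")
```

Both tones come out at 3000 Hz: once sampled, the 5 kHz tone is indistinguishable from a 3 kHz one (8000 - 5000 = 3000). This is why audio is low-pass filtered before it is sampled or downsampled.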

Common sampling rates:

8 kHz: Telephone quality (captures up to 4 kHz)
16 kHz: Wideband speech (captures up to 8 kHz)
44.1 kHz: CD quality (captures up to 22.05 kHz)
48 kHz: Professional audio
96 kHz: High-resolution audio

Quantization: Bit Depth

After sampling, we need to represent each sample's amplitude as a digital value. Quantization converts continuous amplitude values to discrete levels.

For example, quantizing a value of 0.7234 to evenly spaced levels between 0 and 1 (error shown as the absolute difference, in percent of full scale):

2-bit quantization:

Levels available: 4
Quantized value: 0.666667
Error: 5.67%

4-bit quantization:

Levels available: 16
Quantized value: 0.733333
Error: 0.99%

8-bit quantization:

Levels available: 256
Quantized value: 0.721569
Error: 0.18%

16-bit quantization:

Levels available: 65536
Quantized value: 0.723400
Error: < 0.01%

Notice how more bits = less error!
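
A quantizer of this kind fits in a few lines. The sketch below assumes the levels are evenly spaced across [0, 1] and that each value is rounded to the nearest level. Other conventions (signed ranges, truncation) give slightly different numbers, and `quantize_unit` is a hypothetical name.

```python
def quantize_unit(value, bits):
    """Round a value in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    steps = 2 ** bits - 1  # number of intervals between the lowest and highest level
    return round(value * steps) / steps

value = 0.7234
for bits in (2, 4, 8, 16):
    quantized = quantize_unit(value, bits)
    error_pct = abs(value - quantized) * 100  # absolute error, % of full scale
    print(
        f"{bits:>2}-bit: levels={2 ** bits:>5}  "
        f"quantized={quantized:.6f}  error={error_pct:.4f}%"
    )
```

Each extra bit halves the spacing between levels, so the worst-case rounding error halves as well.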

Common bit depths:

8-bit: 256 levels (telephone quality)
16-bit: 65,536 levels (CD quality)
24-bit: 16,777,216 levels (professional recording)
32-bit float: Scientific computing

Digital Audio Representation

Time Domain: Waveform

The waveform is the most direct representation of audio - amplitude values over time. A complex waveform can include multiple harmonics (multiples of the fundamental frequency) and an envelope (how the amplitude changes over time). For example, a musical note might have a fundamental frequency of 220 Hz (A3) with several harmonics and an exponential decay envelope.
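
The kind of note described here can be synthesized directly. This is a minimal sketch assuming four harmonics with halving amplitudes and a decay rate of 3 per second. Those specific values, and the function name `synthesize_note`, are arbitrary choices for illustration.

```python
import numpy as np

def synthesize_note(fundamental_hz=220.0, duration_s=1.0, sample_rate=24_000,
                    harmonic_amps=(1.0, 0.5, 0.25, 0.125), decay_rate=3.0):
    """Build a decaying tone from a fundamental and its integer harmonics."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    tone = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        tone += amp * np.sin(2 * np.pi * k * fundamental_hz * t)  # k-th harmonic
    envelope = np.exp(-decay_rate * t)       # exponential decay over time
    note = tone * envelope
    return note / np.max(np.abs(note))       # scale so the peak is exactly 1

note = synthesize_note()
print(note.shape, note.dtype, float(np.max(np.abs(note))))
```

Changing `harmonic_amps` alters the timbre without changing the pitch, while `decay_rate` shapes how quickly the note dies away.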

Audio File Formats

Different formats store audio data in various ways:

WAV:

Compression: None (lossless)
Quality: Perfect
File size: Large
Use case: Professional audio, archiving

MP3:

Compression: Lossy (psychoacoustic)
Quality: Good (depends on bitrate)
File size: Small (~10% of WAV)
Use case: Music distribution, streaming

FLAC:

Compression: Lossless
Quality: Perfect
File size: Medium (~50-70% of WAV)
Use case: Audiophile music, archiving

OGG:

Compression: Lossy (Vorbis)
Quality: Good to excellent
File size: Small
Use case: Open-source applications, games

Loading and Saving Audio

Different libraries can be used to load and save audio:

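Librosa: By default resamples to 22,050 Hz, converts to mono, and returns floating-point samples in [-1, 1]
Soundfile: Preserves the file's sample rate and channel layout
Scipy: Returns WAV data in the file's native type (integers for integer PCM files)
Torchaudio: Returns PyTorch tensors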

Each library has its own advantages depending on your use case.

Audio Characteristics for Deep Learning

Dynamic Range

Dynamic range is the ratio between the loudest and quietest parts of an audio signal, usually expressed in decibels.
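
There is more than one way to measure this. The sketch below uses one common approach: split the signal into short frames, compute the RMS level of each, and compare the loudest frame with the quietest. The frame length and the helper name `dynamic_range_db` are assumptions made for the example.

```python
import numpy as np

def dynamic_range_db(signal, frame_length=1024, eps=1e-10):
    """Ratio in dB between the loudest and quietest frame RMS levels."""
    n_frames = len(signal) // frame_length
    frames = signal[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # eps guards against division by zero on fully silent frames
    return float(20.0 * np.log10((rms.max() + eps) / (rms.min() + eps)))

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 400 * t)
gain = np.where(t < 0.5, 1.0, 0.01)   # second half is 100x quieter (-40 dB)

# 800-sample frames hold exactly 20 cycles of the 400 Hz tone
print(f"Dynamic range: {dynamic_range_db(tone * gain, frame_length=800):.1f} dB")
```

Because the second half of the test signal is scaled by 0.01, the measured range comes out at about 40 dB.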

Signal-to-Noise Ratio (SNR)

SNR measures signal quality as the ratio of signal power to noise power, usually expressed in decibels.
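
As an illustration, the sketch below computes SNR from a clean signal and a separate noise signal, and also scales white noise so that the mixture hits a requested SNR. Both helper names are hypothetical. With real recordings the clean signal and the noise are usually not available separately and have to be estimated.

```python
import numpy as np

def snr_db(clean, noise):
    """SNR in dB: 10 * log10(signal power / noise power)."""
    return float(10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2)))

def add_noise_at_snr(clean, target_snr_db, rng):
    """Mix white Gaussian noise into clean audio at the requested SNR."""
    noise = rng.standard_normal(len(clean))
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that signal_power / scaled_noise_power matches the target
    scale = np.sqrt(signal_power / (noise_power * 10 ** (target_snr_db / 10)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy, noise = add_noise_at_snr(clean, 20.0, rng)
print(f"Measured SNR: {snr_db(clean, noise):.2f} dB")
```

The same `add_noise_at_snr` pattern is a common building block for noise augmentation, where training clips are corrupted at a range of SNRs.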

Practical Considerations for Deep Learning

1. Normalization

Always normalize audio data before feeding it to a neural network, so that amplitudes fall in a consistent range such as [-1, 1].
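
Two normalization steps come up constantly: converting integer PCM to floats, and scaling each clip so its peak reaches a fixed level. The following is a minimal sketch of both. The function names are hypothetical, and peak normalization is only one of several schemes (RMS or loudness normalization are alternatives).

```python
import numpy as np

def int16_to_float(pcm):
    """Map signed 16-bit PCM to float32 in [-1, 1)."""
    return pcm.astype(np.float32) / 32768.0

def peak_normalize(signal, target_peak=1.0, eps=1e-9):
    """Scale a signal so its largest absolute value equals target_peak."""
    peak = np.max(np.abs(signal))
    if peak < eps:          # leave (near-)silent clips untouched
        return signal
    return signal * (target_peak / peak)

pcm = np.array([0, 8192, -16384, 16384], dtype=np.int16)
audio = int16_to_float(pcm)          # values: 0, 0.25, -0.5, 0.5
normalized = peak_normalize(audio)   # values: 0, 0.5, -1, 1
print(audio, normalized)
```

The silence check matters in practice: dividing by a peak of zero would fill the clip with NaNs.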

2. Resampling

Match all audio to the same sampling rate, since a model trained at one rate will misinterpret the frequencies in audio recorded at another.
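
To show the idea without extra dependencies, this sketch resamples by plain linear interpolation. That is a deliberate simplification: it applies no anti-aliasing filter, so it is only safe when the signal has no content above the new Nyquist frequency. Dedicated routines such as `scipy.signal.resample_poly` or `librosa.resample` are the usual choice for real data. The name `resample_linear` is hypothetical.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Resample by linear interpolation (no anti-aliasing filter applied)."""
    duration_s = len(signal) / orig_sr
    n_target = int(round(duration_s * target_sr))
    t_orig = np.arange(len(signal)) / orig_sr   # timestamps of input samples
    t_target = np.arange(n_target) / target_sr  # timestamps of output samples
    return np.interp(t_target, t_orig, signal)

orig_sr, target_sr = 44_100, 16_000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)              # one second at 44.1 kHz

resampled = resample_linear(tone, orig_sr, target_sr)
print(len(tone), "->", len(resampled))
```

The duration is preserved while the sample count changes from 44,100 to 16,000.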

3. Padding and Trimming

Ensure consistent input sizes by padding short clips (typically with zeros) and trimming long ones to a fixed number of samples.
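
A fixed-length helper can be sketched as follows. It pads with trailing zeros and trims from the end, which is one simple policy among several (center cropping or random cropping are common alternatives). The name `fix_length` is chosen for this example.

```python
import numpy as np

def fix_length(signal, target_length):
    """Trim or zero-pad a 1-D signal to exactly target_length samples."""
    if len(signal) >= target_length:
        return signal[:target_length]              # keep the beginning
    pad_amount = target_length - len(signal)
    return np.pad(signal, (0, pad_amount))         # append zeros at the end

short_clip = np.ones(3)
long_clip = np.arange(10.0)
print(fix_length(short_clip, 5))   # padded with two zeros
print(fix_length(long_clip, 5))    # first five samples only
```

With a sample rate in hand, `target_length` is just the desired duration in seconds multiplied by that rate.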

Hands-On Exercise: Audio Data Explorer

Let's build a tool to explore audio characteristics:
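
One possible starting point is sketched below: a function that takes samples plus a sampling rate and reports a handful of basic statistics. The choice of statistics and the name `explore_audio` are assumptions. It runs on a synthetic tone here so that it needs no audio file, but the same function can be applied to samples loaded with any of the libraries mentioned earlier.

```python
import numpy as np

def explore_audio(signal, sample_rate):
    """Collect basic descriptive statistics for a mono audio signal."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = float(np.max(np.abs(signal)))
    rms = float(np.sqrt(np.mean(signal ** 2)))
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Count sign changes between neighbouring samples
    sign_changes = int(np.sum(np.signbit(signal[:-1]) != np.signbit(signal[1:])))
    return {
        "duration_s": len(signal) / sample_rate,
        "peak": peak,
        "rms": rms,
        "crest_factor_db": float(20 * np.log10(peak / rms)) if rms > 0 else float("inf"),
        "zero_crossing_rate": sign_changes / len(signal),
        "dominant_hz": float(freqs[np.argmax(spectrum)]),
    }

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
stats = explore_audio(0.8 * np.sin(2 * np.pi * 440 * t), sample_rate)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Running it on speech, music, and environmental recordings and comparing the outputs is a quick way to see how differently those signal types behave.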

Key Takeaways

Sound is a wave: Understanding wave properties helps us choose appropriate representations
Sampling theorem: Sample at least 2× the highest frequency you want to capture
Digital representation: Balance between quality (sampling rate, bit depth) and file size
Normalization is crucial: Always normalize audio for deep learning
Know your format: Different formats suit different applications

Exercises

Experiment with Sampling Rates:
- Record your voice at different sampling rates
- Listen to the differences
- Plot the waveforms and identify aliasing
Build a Format Converter:
- Create a function that converts between WAV, FLAC, and MP3
- Compare file sizes and quality
Analyze Real Audio:
- Download speech, music, and environmental sound samples
- Calculate and compare their statistics
- Identify characteristic differences
Noise Addition:
- Add different types of noise (white, pink, brown) to clean audio
- Calculate SNR for each
- Listen and observe the differences

What's Next?

Now that we understand how audio is represented digitally, we're ready to explore how to extract meaningful features from audio signals. In the next chapter, we'll dive into signal processing techniques like the Fourier Transform, spectrograms, and feature extraction methods that form the foundation for audio deep learning.
