Video Conferencing Systems

Overview

Video conferencing systems enable real-time audio and video communication between multiple participants over IP networks. Modern browser-based implementations rely primarily on WebRTC (Web Real-Time Communication), an open standard providing peer-to-peer media streaming without plugins. The architecture involves complex interplay between signaling servers (for session establishment), STUN/TURN servers (for NAT traversal), and Selective Forwarding Units (SFUs) or Multipoint Control Units (MCUs) for scaling beyond peer-to-peer limits. Key challenges include network adaptation, echo cancellation, bandwidth estimation, and maintaining quality of experience across heterogeneous network conditions.

Background

  • Traditional video conferencing: H.323, SIP protocols (1990s-2000s)
  • Flash-based solutions: Adobe Connect, early browser video
  • WebRTC standardization began 2011, W3C Recommendation 2021
  • Major implementations: Google Meet, Zoom (partial WebRTC), Jitsi, Daily.co
  • COVID-19 pandemic (2020) dramatically accelerated adoption and development
  • Current focus: E2E encryption, AI features, spatial audio, virtual backgrounds

Key Concepts

WebRTC Core APIs

API                 Purpose
getUserMedia()      Capture camera/microphone streams
RTCPeerConnection   Manage peer-to-peer media connections
RTCDataChannel      Arbitrary data transfer between peers
MediaRecorder       Record media streams
getDisplayMedia()   Screen sharing capture

Signaling and Session Establishment

WebRTC requires an external signaling channel (the standard deliberately does not specify one):

  1. Offer/Answer: SDP (Session Description Protocol) exchange
  2. ICE Candidates: Network endpoint discovery
  3. Trickle ICE: Incremental candidate exchange for faster connection
                Signaling Server
                     |
         +-----------+-----------+
         |                       |
     Peer A                  Peer B
         |                       |
         +--- STUN/TURN Server --+
                     |
              Media Streams

NAT Traversal

  • STUN (Session Traversal Utilities for NAT): Discover public IP/port
  • TURN (Traversal Using Relays around NAT): Relay when direct fails
  • ICE (Interactive Connectivity Establishment): Framework combining both
  • ~85% of connections succeed with STUN only
  • TURN required for symmetric NATs, enterprise firewalls
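
To see which traversal path a connection actually found, the gathered ICE candidates can be classified by their `typ` field. `candidateType` below is a hypothetical helper that parses a standard ICE candidate line; to deliberately force relaying through TURN (e.g. to verify TURN credentials), `{ iceTransportPolicy: 'relay' }` can be passed to the `RTCPeerConnection` constructor.

```javascript
// Classify an ICE candidate line by its "typ" field:
// 'host' (local), 'srflx' (STUN-discovered), 'prflx' (peer-reflexive),
// or 'relay' (TURN). Returns null if the line has no typ field.
function candidateType(candidateLine) {
  const match = /\btyp (\w+)/.exec(candidateLine);
  return match ? match[1] : null;
}

// In the browser:
// pc.onicecandidate = ({ candidate }) => {
//   if (candidate) console.log(candidateType(candidate.candidate));
// };
```
A call that only ever yields `host` and `srflx` candidates succeeded without a relay; seeing `relay` candidates chosen indicates TURN is carrying the media.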

Scaling Architectures

Architecture   Description                           Use Case
Mesh           All peers connect to all peers        2-4 participants
SFU            Server forwards streams selectively   5-50 participants
MCU            Server mixes into single stream       Legacy endpoints
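
The mesh limit follows directly from fan-out: each peer uploads a separate copy of its stream to every other peer, so per-peer uplink grows linearly and total streams grow quadratically. A minimal sketch, assuming the ~1000 kbps per-stream figure from the Notes section:

```javascript
// Per-peer uplink in a full mesh: one outgoing stream per other participant.
function meshUplinkKbps(participants, kbpsPerStream = 1000) {
  return (participants - 1) * kbpsPerStream;
}

// Total streams crossing the network in a mesh call: n * (n - 1).
function meshTotalStreams(participants) {
  return participants * (participants - 1);
}
```
At 4 participants each peer already uploads ~3 Mbps; at 10 it would be ~9 Mbps with 90 streams in flight, which is why an SFU (each peer uploads once) takes over beyond small calls.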

Media Processing

  • Codec negotiation: VP8, VP9, H.264, AV1 for video; Opus for audio
  • Simulcast: Send multiple quality layers, SFU selects per recipient
  • SVC (Scalable Video Coding): Single stream with extractable layers
  • Bandwidth estimation: REMB, Transport-CC for congestion control
  • Jitter buffer: Smooth out network timing variations
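
Simulcast is configured when the track is added to the connection, via `sendEncodings`. The sketch below shows three spatial layers; the `rid` names, bitrates, and scale factors are illustrative choices, not values mandated by the spec.

```javascript
// Three quality layers of the same camera track; the SFU forwards
// whichever layer fits each recipient's downlink.
const simulcastEncodings = [
  { rid: 'f', maxBitrate: 900000 },                           // full resolution
  { rid: 'h', maxBitrate: 300000, scaleResolutionDownBy: 2 }, // half resolution
  { rid: 'q', maxBitrate: 100000, scaleResolutionDownBy: 4 }, // quarter resolution
];

// In the browser:
// pc.addTransceiver(stream.getVideoTracks()[0], {
//   direction: 'sendonly',
//   sendEncodings: simulcastEncodings
// });
```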

Implementation

Basic WebRTC Connection

// Get user media
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 1280, height: 720 },
  audio: { echoCancellation: true, noiseSuppression: true }
});

// Create peer connection with STUN server
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    { urls: 'turn:turn.example.com', username: 'user', credential: 'pass' }
  ]
});

// Add local tracks
stream.getTracks().forEach(track => pc.addTrack(track, stream));

// Handle ICE candidates
pc.onicecandidate = ({candidate}) => {
  if (candidate) sendToSignalingServer({type: 'candidate', candidate});
};

// Handle remote stream
pc.ontrack = ({streams}) => {
  remoteVideo.srcObject = streams[0];
};

// Create and send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
sendToSignalingServer({type: 'offer', sdp: offer});
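
The answering peer's side of this exchange can be sketched as below. `handleOffer` and `handleCandidate` are hypothetical helper names; `pc` and `send` are passed in explicitly rather than closed over, which also makes the logic easy to exercise outside a browser.

```javascript
// Answering side of the offer/answer exchange. `send` stands in for the
// sendToSignalingServer() helper used above.
async function handleOffer(pc, send, { sdp }) {
  await pc.setRemoteDescription(sdp);      // apply the remote offer
  const answer = await pc.createAnswer();  // generate the matching answer
  await pc.setLocalDescription(answer);    // commit it locally
  send({ type: 'answer', sdp: answer });   // return it via signaling
}

// Trickle ICE: apply remote candidates as they arrive.
async function handleCandidate(pc, { candidate }) {
  await pc.addIceCandidate(candidate);
}
```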

Signaling Server (Node.js/Socket.io)

io.on('connection', socket => {
  socket.on('join-room', roomId => {
    socket.join(roomId);
    socket.to(roomId).emit('user-joined', socket.id);
  });

  socket.on('offer', ({to, sdp}) => {
    io.to(to).emit('offer', {from: socket.id, sdp});
  });

  socket.on('answer', ({to, sdp}) => {
    io.to(to).emit('answer', {from: socket.id, sdp});
  });

  socket.on('candidate', ({to, candidate}) => {
    io.to(to).emit('candidate', {from: socket.id, candidate});
  });
});

Screen Sharing

const screenStream = await navigator.mediaDevices.getDisplayMedia({
  video: { cursor: 'always' },  // Cursor capture hint (support varies)
  audio: true                   // System audio (browser support varies)
});

// Replace video track in existing connection
const videoSender = pc.getSenders().find(s => s.track?.kind === 'video');
await videoSender.replaceTrack(screenStream.getVideoTracks()[0]);
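
When the user stops sharing through the browser's own UI, the screen track fires an `ended` event; the connection should then fall back to the camera. `wireScreenShareRevert` is a hypothetical helper, written to take its collaborators as arguments so the wiring is explicit.

```javascript
// Revert the sender to the camera track when screen sharing ends.
// `sender` is the RTCRtpSender found above; `cameraTrack` is the video
// track from the original getUserMedia() stream.
function wireScreenShareRevert(screenTrack, sender, cameraTrack) {
  screenTrack.onended = () => sender.replaceTrack(cameraTrack);
}

// Usage with the variables from the snippets above:
// wireScreenShareRevert(
//   screenStream.getVideoTracks()[0], videoSender, stream.getVideoTracks()[0]
// );
```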

Notes

  • WebRTC requires HTTPS (except localhost) for getUserMedia
  • Mobile browser support varies; native SDKs often preferred
  • End-to-end encryption: Insertable Streams API (experimental)
  • Virtual backgrounds: TensorFlow.js BodyPix, MediaPipe
  • Recording: Server-side via SFU or client-side MediaRecorder
  • Common TURN providers: Twilio, Xirsys, Daily, self-hosted coturn
  • Bandwidth typically: 250-1000 kbps per video stream
  • Latency target: <150ms for interactive conversation
  • Quality metrics: MOS (Mean Opinion Score), SRTT, jitter, packet loss
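
The client-side MediaRecorder option mentioned above can be sketched as follows; `collectChunks` is a hypothetical helper, split out so the buffering logic is plain, with the browser wiring shown in comments.

```javascript
// Accumulate non-empty recorded chunks for later assembly into a Blob.
function collectChunks() {
  const chunks = [];
  return {
    push: e => { if (e.data && e.data.size > 0) chunks.push(e.data); },
    all: () => chunks,
  };
}

// In the browser:
// const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
// const sink = collectChunks();
// recorder.ondataavailable = sink.push;
// recorder.onstop = () => {
//   const blob = new Blob(sink.all(), { type: 'video/webm' });
//   // upload the blob, or offer it as a download
// };
// recorder.start(1000);  // emit a chunk roughly every second
```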

Author: Jason Walsh

j@wal.sh

Last Updated: 2026-01-11 11:04:31

build: 2026-01-11 18:31 | sha: eb805a8