Video Conferencing Systems

Overview

Video conferencing systems enable real-time audio and video communication between multiple participants over IP networks. Modern browser-based implementations rely primarily on WebRTC (Web Real-Time Communication), an open standard providing peer-to-peer media streaming without plugins. The architecture involves complex interplay between signaling servers (for session establishment), STUN/TURN servers (for NAT traversal), and Selective Forwarding Units (SFUs) or Multipoint Control Units (MCUs) for scaling beyond peer-to-peer limits. Key challenges include network adaptation, echo cancellation, bandwidth estimation, and maintaining quality of experience across heterogeneous network conditions.

Background

  • Traditional video conferencing: H.323, SIP protocols (1990s-2000s)
  • Flash-based solutions: Adobe Connect, early browser video
  • WebRTC standardization began 2011, W3C Recommendation 2021
  • Major implementations: Google Meet, Zoom (partial WebRTC), Jitsi, Daily.co
  • COVID-19 pandemic (2020) dramatically accelerated adoption and development
  • Current focus: E2E encryption, AI features, spatial audio, virtual backgrounds

Key Concepts

WebRTC Core APIs

API                 Purpose
getUserMedia()      Capture camera/microphone streams
RTCPeerConnection   Manage peer-to-peer media connections
RTCDataChannel      Arbitrary data transfer between peers
MediaRecorder       Record media streams
getDisplayMedia()   Screen sharing capture

Signaling and Session Establishment

WebRTC requires an external signaling channel (the standard deliberately does not specify one):

  1. Offer/Answer: SDP (Session Description Protocol) exchange
  2. ICE Candidates: Network endpoint discovery
  3. Trickle ICE: Incremental candidate exchange for faster connection
                Signaling Server
                     |
         +-----------+-----------+
         |                       |
     Peer A                  Peer B
         |                       |
         +--- STUN/TURN Server --+
                     |
              Media Streams

NAT Traversal

  • STUN (Session Traversal Utilities for NAT): Discover public IP/port
  • TURN (Traversal Using Relays around NAT): Relay when direct fails
  • ICE (Interactive Connectivity Establishment): Framework combining both
  • ~85% of connections succeed with STUN only
  • TURN required for symmetric NATs, enterprise firewalls
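
To see which traversal path a connection actually found, the gathered ICE candidates can be classified by their `typ` field. `candidateType` below is a hypothetical helper that parses a standard ICE candidate line; to deliberately force relaying through TURN (e.g. to verify TURN credentials), `{ iceTransportPolicy: 'relay' }` can be passed to the `RTCPeerConnection` constructor.

```javascript
// Classify an ICE candidate line by its "typ" field:
// 'host' (local), 'srflx' (STUN-discovered), 'prflx' (peer-reflexive),
// or 'relay' (TURN). Returns null if the line has no typ field.
function candidateType(candidateLine) {
  const match = /\btyp (\w+)/.exec(candidateLine);
  return match ? match[1] : null;
}

// In the browser:
// pc.onicecandidate = ({ candidate }) => {
//   if (candidate) console.log(candidateType(candidate.candidate));
// };
```
A call that only ever yields `host` and `srflx` candidates succeeded without a relay; seeing `relay` candidates chosen indicates TURN is carrying the media.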

Scaling Architectures

Architecture   Description                           Use Case
Mesh           All peers connect to all peers        2-4 participants
SFU            Server forwards streams selectively   5-50 participants
MCU            Server mixes into single stream       Legacy endpoints
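
The mesh limit follows directly from fan-out: each peer uploads a separate copy of its stream to every other peer, so per-peer uplink grows linearly and total streams grow quadratically. A minimal sketch, assuming the ~1000 kbps per-stream figure from the Notes section:

```javascript
// Per-peer uplink in a full mesh: one outgoing stream per other participant.
function meshUplinkKbps(participants, kbpsPerStream = 1000) {
  return (participants - 1) * kbpsPerStream;
}

// Total streams crossing the network in a mesh call: n * (n - 1).
function meshTotalStreams(participants) {
  return participants * (participants - 1);
}
```
At 4 participants each peer already uploads ~3 Mbps; at 10 it would be ~9 Mbps with 90 streams in flight, which is why an SFU (each peer uploads once) takes over beyond small calls.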

Media Processing

  • Codec negotiation: VP8, VP9, H.264, AV1 for video; Opus for audio
  • Simulcast: Send multiple quality layers, SFU selects per recipient
  • SVC (Scalable Video Coding): Single stream with extractable layers
  • Bandwidth estimation: REMB, Transport-CC for congestion control
  • Jitter buffer: Smooth out network timing variations
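
Simulcast is configured when the track is added to the connection, via `sendEncodings`. The sketch below shows three spatial layers; the `rid` names, bitrates, and scale factors are illustrative choices, not values mandated by the spec.

```javascript
// Three quality layers of the same camera track; the SFU forwards
// whichever layer fits each recipient's downlink.
const simulcastEncodings = [
  { rid: 'f', maxBitrate: 900000 },                           // full resolution
  { rid: 'h', maxBitrate: 300000, scaleResolutionDownBy: 2 }, // half resolution
  { rid: 'q', maxBitrate: 100000, scaleResolutionDownBy: 4 }, // quarter resolution
];

// In the browser:
// pc.addTransceiver(stream.getVideoTracks()[0], {
//   direction: 'sendonly',
//   sendEncodings: simulcastEncodings
// });
```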

Implementation

Basic WebRTC Connection

// Get user media
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 1280, height: 720 },
  audio: { echoCancellation: true, noiseSuppression: true }
});

// Create peer connection with STUN server
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    { urls: 'turn:turn.example.com', username: 'user', credential: 'pass' }
  ]
});

// Add local tracks
stream.getTracks().forEach(track => pc.addTrack(track, stream));

// Handle ICE candidates
pc.onicecandidate = ({candidate}) => {
  if (candidate) sendToSignalingServer({type: 'candidate', candidate});
};

// Handle remote stream
pc.ontrack = ({streams}) => {
  remoteVideo.srcObject = streams[0];
};

// Create and send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
sendToSignalingServer({type: 'offer', sdp: offer});
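
The answering peer's side of this exchange can be sketched as below. `handleOffer` and `handleCandidate` are hypothetical helper names; `pc` and `send` are passed in explicitly rather than closed over, which also makes the logic easy to exercise outside a browser.

```javascript
// Answering side of the offer/answer exchange. `send` stands in for the
// sendToSignalingServer() helper used above.
async function handleOffer(pc, send, { sdp }) {
  await pc.setRemoteDescription(sdp);      // apply the remote offer
  const answer = await pc.createAnswer();  // generate the matching answer
  await pc.setLocalDescription(answer);    // commit it locally
  send({ type: 'answer', sdp: answer });   // return it via signaling
}

// Trickle ICE: apply remote candidates as they arrive.
async function handleCandidate(pc, { candidate }) {
  await pc.addIceCandidate(candidate);
}
```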

Signaling Server (Node.js/Socket.io)

io.on('connection', socket => {
  socket.on('join-room', roomId => {
    socket.join(roomId);
    socket.to(roomId).emit('user-joined', socket.id);
  });

  socket.on('offer', ({to, sdp}) => {
    io.to(to).emit('offer', {from: socket.id, sdp});
  });

  socket.on('answer', ({to, sdp}) => {
    io.to(to).emit('answer', {from: socket.id, sdp});
  });

  socket.on('candidate', ({to, candidate}) => {
    io.to(to).emit('candidate', {from: socket.id, candidate});
  });
});

Screen Sharing

const screenStream = await navigator.mediaDevices.getDisplayMedia({
  video: { cursor: 'always' },  // Cursor capture hint (support varies)
  audio: true                   // System audio (browser support varies)
});

// Replace video track in existing connection
const videoSender = pc.getSenders().find(s => s.track?.kind === 'video');
await videoSender.replaceTrack(screenStream.getVideoTracks()[0]);
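
When the user stops sharing through the browser's own UI, the screen track fires an `ended` event; the connection should then fall back to the camera. `wireScreenShareRevert` is a hypothetical helper, written to take its collaborators as arguments so the wiring is explicit.

```javascript
// Revert the sender to the camera track when screen sharing ends.
// `sender` is the RTCRtpSender found above; `cameraTrack` is the video
// track from the original getUserMedia() stream.
function wireScreenShareRevert(screenTrack, sender, cameraTrack) {
  screenTrack.onended = () => sender.replaceTrack(cameraTrack);
}

// Usage with the variables from the snippets above:
// wireScreenShareRevert(
//   screenStream.getVideoTracks()[0], videoSender, stream.getVideoTracks()[0]
// );
```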

Notes

  • WebRTC requires HTTPS (except localhost) for getUserMedia
  • Mobile browser support varies; native SDKs often preferred
  • End-to-end encryption: Insertable Streams API (experimental)
  • Virtual backgrounds: TensorFlow.js BodyPix, MediaPipe
  • Recording: Server-side via SFU or client-side MediaRecorder
  • Common TURN providers: Twilio, Xirsys, Daily, self-hosted coturn
  • Bandwidth typically: 250-1000 kbps per video stream
  • Latency target: <150ms for interactive conversation
  • Quality metrics: MOS (Mean Opinion Score), SRTT, jitter, packet loss
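
The client-side MediaRecorder option mentioned above can be sketched as follows; `collectChunks` is a hypothetical helper, split out so the buffering logic is plain, with the browser wiring shown in comments.

```javascript
// Accumulate non-empty recorded chunks for later assembly into a Blob.
function collectChunks() {
  const chunks = [];
  return {
    push: e => { if (e.data && e.data.size > 0) chunks.push(e.data); },
    all: () => chunks,
  };
}

// In the browser:
// const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
// const sink = collectChunks();
// recorder.ondataavailable = sink.push;
// recorder.onstop = () => {
//   const blob = new Blob(sink.all(), { type: 'video/webm' });
//   // upload the blob, or offer it as a download
// };
// recorder.start(1000);  // emit a chunk roughly every second
```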

Author: Jason Walsh

j@wal.sh

Last Updated: 2026-01-11 11:04:31

build: 2026-01-11 18:31 | sha: eb805a8