
PDF Compression Algorithms Explained: A Complete Technical Guide

Understanding the compression technologies inside PDF files, from lossless text encoding to advanced image compression, and how to optimize documents without sacrificing quality.

Technical Summary

PDF files use multiple compression algorithms simultaneously, each optimized for different content types. Understanding these algorithms enables informed decisions about compression settings, balancing file size against quality requirements for specific use cases.

PDF Internal Structure and Compression

A PDF file is not a monolithic entity but a structured container holding diverse content types: text, vector graphics, raster images, fonts, metadata, and interactive elements. Each content type has optimal compression strategies, and modern PDF files typically employ multiple algorithms simultaneously within a single document.

The PDF specification (ISO 32000) defines several standard compression filters that can be applied to data streams within the document. These filters can be chained together, applying multiple compression stages sequentially. Understanding this architecture is essential for effective PDF optimization.

Stream Objects and Filters

In PDF architecture, most content is stored in stream objects. Each stream can specify one or more filters through the /Filter dictionary entry. When multiple filters are used, the /Filter array lists them in the order they are applied during decoding; the encoder applies them in the reverse order. This allows powerful combinations, such as compressing image data with JPEG and then applying additional Flate compression to the result, which a reader undoes by inflating first and then decoding the JPEG.

PDF Stream Filter Example

10 0 obj
<<
  /Type /XObject
  /Subtype /Image
  /Width 1920
  /Height 1080
  /ColorSpace /DeviceRGB
  /BitsPerComponent 8
  /Filter [/FlateDecode /DCTDecode]
  /Length 245789
>>
stream
... compressed binary data ...
endstream
endobj

Example showing a PDF image object with chained Flate and DCT (JPEG) compression filters.
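The decode path for the chained filters above can be sketched in Python. The helper name decode_stream is illustrative, not a real API; only the FlateDecode stage is actually decoded here, since DCTDecode output is a complete JPEG byte stream that a viewer hands to its JPEG codec.

```python
import zlib

def decode_stream(data: bytes, filters: list[str]) -> bytes:
    """Apply PDF stream filters in /Filter array order (decode order)."""
    for f in filters:
        if f == "/FlateDecode":
            data = zlib.decompress(data)
        elif f == "/DCTDecode":
            # data is now a complete JPEG file; a real viewer would
            # pass it to a JPEG decoder at this point.
            pass
        else:
            raise ValueError(f"unsupported filter: {f}")
    return data

# Simulate the chained stream: JPEG bytes that the encoder
# additionally Flate-compressed.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 64 + b"\xff\xd9"  # stand-in JPEG
stream_data = zlib.compress(jpeg_bytes)
recovered = decode_stream(stream_data, ["/FlateDecode", "/DCTDecode"])
```

Inflating first recovers the JPEG stream intact, mirroring the [/FlateDecode /DCTDecode] order in the dictionary.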

Lossless Compression Algorithms

Lossless compression preserves data exactly, making it essential for text, vector graphics, and any content where accuracy is critical. The decompressed output is bit-for-bit identical to the original input.

Flate/DEFLATE Compression

FlateDecode, based on the DEFLATE algorithm (RFC 1951), is the most widely used compression filter in PDF files. It combines LZ77 dictionary compression with Huffman coding to achieve excellent compression ratios for text and structured data.

DEFLATE works in two stages. First, LZ77 identifies repeated sequences in the data and replaces subsequent occurrences with references to earlier positions. Then, Huffman coding assigns shorter bit sequences to more frequent symbols and longer sequences to rare symbols, further reducing the encoded size.
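Both stages are visible in practice through Python's zlib module, which implements DEFLATE: repeated phrases become LZ77 back-references, and frequent symbols get short Huffman codes.

```python
import zlib

# Highly repetitive text: LZ77 replaces repeats with back-references,
# then Huffman coding shortens the codes for frequent symbols.
text = b"the quick brown fox jumps over the lazy dog. " * 200
compressed = zlib.compress(text, 6)
ratio = len(text) / len(compressed)
original = zlib.decompress(compressed)  # lossless round trip
```

For repetitive input like this, the ratio far exceeds the 60-80% typical of real documents; real text compresses less because it repeats less.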

Flate Advantages

  • Excellent compression for text-heavy documents (60-80% reduction)
  • Universal support across all PDF readers
  • Fast decompression for smooth document viewing
  • No patent restrictions

Compression Levels

DEFLATE supports compression levels 1-9. Higher levels achieve better compression but require more processing time. Level 6 offers an optimal balance for most use cases, while level 9 is reserved for maximum compression when processing time is not critical.
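The level trade-off can be measured directly with zlib, whose second argument is the DEFLATE compression level:

```python
import zlib

# Moderately compressible data: numeric text with partial repetition.
data = " ".join(str(n * n) for n in range(5000)).encode()

# Compare output sizes at fast (1), default-ish (6), and maximum (9) levels.
sizes = {level: len(zlib.compress(data, level)) for level in (1, 6, 9)}
```

Level 9 spends more time searching for longer matches, so its output is no larger than level 1's; on already-random data all levels converge and the extra effort buys nothing.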

LZW Compression (Legacy)

LZWDecode implements the Lempel-Ziv-Welch algorithm, historically significant but now largely superseded by Flate compression. LZW builds a dictionary of patterns during compression, assigning codes to increasingly longer sequences. While effective, LZW was encumbered by patents until 2004, leading the PDF ecosystem to favor Flate.

Modern PDF tools rarely generate new LZW-compressed content, but support for decoding remains necessary for backward compatibility with legacy documents.
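The core dictionary-growing mechanism of LZW can be sketched in a few lines. This is a simplification: PDF's LZWDecode additionally uses variable-width codes, Clear (256) and EOD (257) markers, and an EarlyChange parameter, all omitted here.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW: grow a dictionary of sequences, emit integer codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                      # keep extending the current match
        else:
            out.append(dictionary[w])   # emit code for longest known prefix
            dictionary[wc] = next_code  # learn the new, longer sequence
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:         # the classic cScSc corner case
            entry = w + w[:1]
        else:
            raise ValueError("bad LZW code")
        out.append(entry)
        dictionary[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)

data = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(data)
```

The 24-byte input compresses to fewer codes because repeated substrings like "TOBEOR" are learned once and then referenced.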

Run Length Encoding (RLE)

RunLengthDecode implements the simplest compression strategy: replacing consecutive identical bytes with a count and the byte value. While primitive, RLE excels at compressing data with long runs of identical values, such as images with large areas of solid color or whitespace-heavy documents.

RLE is often used as a pre-processing stage before other compression, particularly effective for images that have been posterized or contain large uniform regions. The algorithm's simplicity ensures extremely fast encoding and decoding.
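The RunLengthDecode byte format itself is simple: a length byte of 0-127 means the next length+1 bytes are literal, 129-255 means the next byte repeats 257-length times, and 128 marks end-of-data. A sketch of an encoder/decoder pair for that format:

```python
def rle_encode(data: bytes) -> bytes:
    """Encode in PDF RunLengthDecode format (runs capped at 128 bytes)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i.
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out += bytes([257 - run, data[i]])   # replicated run
            i += run
        else:
            # Collect literals until a run of >= 3 begins (or 128 bytes).
            j = i + 1
            while j < n and j - i < 128:
                if j + 2 < n and data[j] == data[j + 1] == data[j + 2]:
                    break
                j += 1
            out += bytes([j - i - 1]) + data[i:j]  # literal block
            i = j
    out.append(128)  # EOD marker
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:
            break
        if length < 128:
            out += data[i + 1 : i + 2 + length]        # length+1 literals
            i += 2 + length
        else:
            out += bytes([data[i + 1]]) * (257 - length)  # repeated byte
            i += 2
    return bytes(out)

# A page fragment with large uniform regions compresses dramatically.
sample = b"\x00" * 100 + b"PDF" + b"\xff" * 50
encoded = rle_encode(sample)
```

The 153-byte sample shrinks to a handful of bytes because two long runs dominate it; data without runs would expand slightly instead.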

Lossy Image Compression

Lossy compression achieves dramatic size reductions by discarding information that humans are unlikely to perceive. This is essential for photographic images where lossless compression typically achieves only modest reductions.

DCT/JPEG Compression

DCTDecode implements JPEG compression, the dominant format for photographic images in PDFs. JPEG uses the Discrete Cosine Transform to convert spatial image data into frequency components, then quantizes these components based on human visual perception.

The human eye is more sensitive to changes in brightness than color, and more sensitive to gradual changes than high-frequency detail. JPEG exploits these characteristics by allocating more precision to perceptually important components and aggressively quantizing less visible high-frequency information.
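The transform-and-quantize stage can be shown with a pure-Python 8x8 DCT. The quantization table is the standard JPEG luminance example table; a smooth gradient block illustrates how nearly all high-frequency coefficients quantize to zero, which is what makes the subsequent entropy coding so effective.

```python
import math

def dct2_8x8(block):
    """2-D DCT-II on an 8x8 block with standard JPEG scaling."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# Standard JPEG luminance quantization table.
QUANT = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

# A smooth vertical gradient (0..112), level-shifted by -128 as JPEG does.
block = [[x * 16 - 128] * 8 for x in range(8)]
coeffs = dct2_8x8(block)
quantized = [[round(coeffs[u][v] / QUANT[u][v]) for v in range(8)]
             for u in range(8)]
zeros = sum(row.count(0) for row in quantized)
```

For this block only a few low-frequency coefficients survive quantization; the long runs of zeros are then cheap to entropy-code, which is where JPEG's size savings come from.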

JPEG Quality vs File Size

Quality Setting    Approx. Ratio
Quality 95         ~15:1
Quality 85         ~25:1
Quality 75         ~40:1
Quality 50         ~80:1

Typical compression ratios for photographic images at different JPEG quality settings. Quality 75-85 provides excellent balance for most documents.

JPEG2000 Compression

JPXDecode implements JPEG2000 (ISO/IEC 15444), a more advanced image compression standard using wavelet transforms instead of DCT. JPEG2000 offers superior compression efficiency, especially at high compression ratios, and supports both lossy and lossless modes within the same framework.

JPEG2000's key advantages include graceful degradation (no blocking artifacts at high compression), region-of-interest encoding, and progressive transmission. However, its computational complexity is higher than JPEG's, and reader support, introduced with PDF 1.5, is missing from older viewers.

"JPEG2000 typically achieves 20-30% better compression than JPEG at equivalent visual quality, making it the superior choice for image-heavy documents when compatibility is assured."

- ISO 32000-2 PDF 2.0 Specification Notes

Specialized Document Compression

JBIG2 for Scanned Documents

JBIG2Decode implements JBIG2 (ISO/IEC 14492), designed specifically for bi-level (black and white) images like scanned documents. JBIG2 achieves compression ratios of 20:1 to 50:1 on typical scanned pages, dramatically outperforming general-purpose algorithms.

JBIG2's power comes from symbol matching: the encoder identifies repeated shapes (typically characters) and stores them once in a dictionary, replacing occurrences with references. This is extraordinarily effective for text-heavy scans where the same letters appear hundreds of times per page.
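The symbol-matching idea can be sketched without any bitmap processing: store each unique glyph bitmap once in a dictionary and replace every occurrence with a (dictionary index, position) reference. The function name and data layout here are illustrative, not JBIG2's actual segment format.

```python
def build_symbol_dictionary(glyphs):
    """JBIG2-style symbol matching: unique bitmaps stored once,
    occurrences reduced to (dictionary index, x, y) references.

    `glyphs` is a list of (bitmap, x, y); bitmaps are tuples of row
    tuples so they can serve as dictionary keys.
    """
    dictionary = []   # unique bitmaps, each stored exactly once
    index_of = {}     # bitmap -> index in dictionary
    placements = []   # one (index, x, y) reference per occurrence
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements

# Toy page: an "e" bitmap appears 100 times, an "x" bitmap 20 times.
e_glyph = ((0,1,1,0), (1,0,0,1), (1,1,1,1), (1,0,0,0), (0,1,1,1))
x_glyph = ((1,0,0,1), (0,1,1,0), (0,1,1,0), (1,0,0,1), (1,0,0,1))
page = ([(e_glyph, i * 8, 0) for i in range(100)]
        + [(x_glyph, i * 8, 12) for i in range(20)])
dictionary, placements = build_symbol_dictionary(page)
```

120 glyph occurrences collapse to 2 stored bitmaps plus 120 tiny references, which is exactly why text-heavy scans compress so well.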

JBIG2 Lossless Mode

Lossless JBIG2 preserves every pixel exactly. Symbol matching still provides excellent compression (10:1 to 20:1) by storing each unique character shape once. Essential for legal documents where pixel-perfect reproduction is required.

JBIG2 Lossy Mode

Lossy JBIG2 substitutes visually similar symbols with dictionary references, achieving 30:1 to 50:1 compression. However, this can silently cause character substitution errors (e.g., an "8" replaced with a "6"). Avoid it for archival, legal, or numeric documents where exact reproduction matters, and never use it for OCR source documents.

CCITT Fax Compression

CCITTFaxDecode implements compression standards originally developed for fax transmission: CCITT Group 3 and Group 4. These algorithms use run-length encoding combined with Huffman coding optimized for typical document patterns.

Group 4 compression achieves approximately 15:1 ratios on scanned text documents. While superseded by JBIG2 for new documents, CCITT remains important for compatibility with fax-originated PDFs and legacy scanning systems.

Compression Strategy by Content Type

Effective PDF optimization requires matching compression algorithms to content types. A single PDF may contain diverse content requiring different strategies.

Content Type              Recommended Algorithm     Typical Ratio
Text & Vector Graphics    Flate (DEFLATE)           5:1 - 10:1
Photographs               JPEG (Quality 75-85)      20:1 - 40:1
High-Quality Images       JPEG2000                  30:1 - 60:1
Scanned B&W Documents     JBIG2 (Lossless)          15:1 - 25:1
Screenshots / Graphics    PNG/Flate (Lossless)      2:1 - 5:1
Mixed Content             Adaptive per element      Varies

Advanced Optimization Techniques

Image Downsampling

Beyond compression algorithm selection, image resolution is a critical optimization factor. A 300 DPI image intended for print contains 4x the pixels of a 150 DPI version adequate for screen viewing. Downsampling reduces pixel count before compression, multiplicatively reducing file size.

For screen-only documents, 96-150 DPI is typically sufficient. Print documents require 200-300 DPI. Downsampling from print resolution to screen resolution often achieves 4-9x additional size reduction beyond compression gains.
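The arithmetic behind those figures: resolution scales per axis, so pixel count (and thus pre-compression data size) scales with the square of the DPI ratio.

```python
def downsample_factor(src_dpi: float, dst_dpi: float) -> float:
    """Pixel-count reduction when resampling: linear per axis,
    so the total reduction is the square of the DPI ratio."""
    return (src_dpi / dst_dpi) ** 2

# 300 -> 150 DPI quarters the pixels; 300 -> 100 DPI cuts them 9x.
print_to_screen = downsample_factor(300, 150)
print_to_web = downsample_factor(300, 100)
```

These factors multiply with whatever ratio the compression algorithm then achieves on the smaller image.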

Color Space Optimization

Color images store three or four channels (RGB or CMYK). Converting grayscale images incorrectly stored as color to single-channel grayscale reduces data by 66-75%. Similarly, identifying images with limited color palettes and converting to indexed color can dramatically reduce size for graphics and diagrams.
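Detecting a grayscale image stored as RGB is a simple per-pixel check, sketched here on plain pixel tuples (a real optimizer would work on decoded image buffers):

```python
def rgb_to_gray_if_possible(pixels):
    """If every RGB pixel has R == G == B, the image is grayscale
    stored as color; return one channel (a third of the data)."""
    if all(r == g == b for r, g, b in pixels):
        return [r for r, _, _ in pixels]
    return None  # genuinely colored image: leave it alone

# A gray ramp mistakenly stored as three identical RGB channels.
color_stored_gray = [(v, v, v) for v in range(256)]
gray = rgb_to_gray_if_possible(color_stored_gray)
```

Dropping two redundant channels cuts the raw data to a third before any compression is applied.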

Font Subsetting and Embedding

Fonts embedded in PDFs can consume significant space. A full font file may contain thousands of glyphs while the document uses only a fraction. Font subsetting extracts only the glyphs actually used, often reducing font data by 90% or more.
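The subsetting idea reduces to keeping only the glyphs the text references. The glyph table below is hypothetical (real subsetting also rewrites font tables and character maps), but the size arithmetic is the same:

```python
def subset_font(glyph_table: dict, document_text: str) -> dict:
    """Keep only the glyphs the document actually uses."""
    used = set(document_text)
    return {ch: outline for ch, outline in glyph_table.items() if ch in used}

# Hypothetical font with ~8000 glyphs; the document uses a handful.
full_font = {chr(cp): f"outline-{cp}" for cp in range(32, 0x2000)}
text = "Compression saves space."
subset = subset_font(full_font, text)
savings = 1 - len(subset) / len(full_font)
```

For this toy case the subset keeps under 1% of the glyphs, mirroring the 90%+ reductions seen with real font data.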

Object Stream Compression

PDF 1.5 introduced object streams, allowing multiple PDF objects to be compressed together. This improves compression efficiency by enabling the algorithm to find patterns across objects rather than compressing each independently. Object streams are particularly effective for documents with many small objects.
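The benefit is easy to demonstrate with zlib: compressing fifty similar objects individually pays per-stream overhead and cannot share back-references, while one combined stream exploits the redundancy across objects.

```python
import zlib

# Fifty small, structurally similar PDF objects (illustrative content).
objects = [
    f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    f"/Contents {n} 0 R >>".encode()
    for n in range(50)
]

# Compress each object as its own stream...
individually = sum(len(zlib.compress(obj)) for obj in objects)

# ...versus one object stream compressed as a whole, letting DEFLATE
# reference patterns shared across objects.
combined = len(zlib.compress(b"".join(objects)))
```

The combined stream is a fraction of the per-object total, which is the effect object streams exploit for the thousands of small dictionary objects in a typical PDF.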

Client-Side Compression Benefits

Modern browser-based PDF compression offers significant advantages over server-based solutions. Processing occurs entirely on the user's device, eliminating upload/download time for large files and ensuring document privacy.

Privacy Preservation

Documents never leave your device. Compression occurs locally using WebAssembly and JavaScript, providing full functionality without any server involvement.

No Upload Delays

Large PDFs begin processing immediately without waiting for upload. Particularly valuable for documents exceeding 100MB where upload times can be prohibitive.

Offline Capability

Once loaded, client-side compression works without internet connection. Process sensitive documents in air-gapped environments with complete security.

Compression Quality Guidelines

Selecting appropriate compression settings requires balancing file size against quality requirements. The following guidelines apply to common scenarios:

Email Attachments

Target: Under 5MB. Use aggressive image compression (JPEG 60-70), 96-120 DPI resolution, grayscale conversion where appropriate. Prioritize size reduction.

Web Publishing

Target: Under 2MB for quick loading. Use JPEG 70-80, 150 DPI maximum, aggressive font subsetting. Balance quality with page load performance.

Print Production

Maintain quality: JPEG 85-95 or JPEG2000, 300 DPI minimum, CMYK color preservation. Size is secondary to output quality.

Archival Storage

Use lossless compression only: Flate for text, PNG/Flate for images, JBIG2 lossless for scans. Preserve original quality for long-term access.
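The four scenarios above can be summarized as a settings table. The specific numbers are editorial recommendations drawn from the guidelines, not values from any standard:

```python
# Hypothetical compression presets matching the scenarios above.
# jpeg_quality / max_dpi of None means "do not recompress or resample".
PRESETS = {
    "email":   {"jpeg_quality": 65,   "max_dpi": 120,  "lossless_only": False},
    "web":     {"jpeg_quality": 75,   "max_dpi": 150,  "lossless_only": False},
    "print":   {"jpeg_quality": 90,   "max_dpi": 300,  "lossless_only": False},
    "archive": {"jpeg_quality": None, "max_dpi": None, "lossless_only": True},
}
```

A compression tool would pick a preset from the intended use case, then apply the per-content-type algorithms from the earlier table within those limits.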

Conclusion

PDF compression is a nuanced technical domain where understanding the available algorithms enables informed optimization decisions. Different content types demand different approaches: lossless Flate compression for text and vectors, lossy JPEG or JPEG2000 for photographs, and specialized JBIG2 for scanned documents.

Effective compression combines algorithm selection with complementary techniques: appropriate image resolution, optimized color spaces, font subsetting, and object stream utilization. The goal is achieving the smallest file size that meets quality requirements for the intended use case.

Modern client-side compression tools bring these capabilities directly to the browser, enabling sophisticated PDF optimization without server uploads or privacy compromises. Understanding the underlying technology empowers users to make optimal choices for their specific documents and requirements.

Compress PDFs with Advanced Algorithms

HexPdf's compression tool applies intelligent algorithm selection based on your document content. Process files locally with zero uploads and complete privacy.
