Where are images stored inside a PDF?

Images are usually stored as Image XObject stream objects or as inline images inside content streams. They include a dictionary with keys such as Width, Height, ColorSpace, BitsPerComponent and Filter specifying compression.

How are layers implemented in PDF?

PDF uses Optional Content Groups (OCG) and /OCProperties in the Catalog to define layers and configurations. Content can be associated with an OCG so viewers can toggle visibility.

Should I rely on Info or XMP metadata?

XMP is the preferred, structured metadata format and should be treated as authoritative if present. The Info dictionary is legacy and simpler.

How PDF Stores Images, Layers and Metadata — A Full Engineering Guide

Comprehensive engineering-level explanation of image XObjects & inline images, filters & compression, Optional Content Groups (OCG/layers), and metadata (info dictionary vs XMP). Practical examples, extraction tips, and optimization strategies for developers and engineers.

Quick summary: PDF stores visual content in structured objects and streams. Images are usually embedded as Image XObjects or inline images and compressed with filters (DCT, JPX, Flate). Layers are implemented via Optional Content Groups (OCG) and optional content configuration dictionaries, with Optional Content Membership Dictionaries (OCMD) for more complex visibility rules. Metadata exists both as a legacy Info dictionary and a structured XMP packet. This guide explains each part and shows how to inspect, extract, repair and optimize them.

Overview — why PDF images, layers and metadata matter

Images and layered content are central to modern PDF workflows: advertising PDFs, architectural plans, maps, and print proofs use multiple layers and high-resolution images. Metadata is critical for searchability, compliance (PDF/A), and automated processing. For any robust PDF processing pipeline (conversion, OCR, repair, archiving), understanding how PDF stores these elements is non-negotiable.

Image XObjects & inline images — the two main embedding patterns

PDF represents images using either:

Image XObjects — separate stream objects that are referenced from page content resources; they are reusable and preferred for large or repeated graphics.
Inline images — image data embedded directly inside the page content stream using the BI ... ID ... EI operators; useful for small, one-off images.

Image XObject: anatomy

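An Image XObject is a stream object whose dictionary describes the image. Typical keys include /Type (/XObject), /Subtype (/Image), /Width, /Height, /ColorSpace, /BitsPerComponent, /Filter and /Length.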

Important notes:

/Filter indicates how the stream is encoded (see Filters section).
/ColorSpace may be a name (DeviceRGB) or an object (e.g., ICCBased or indexed color spaces).
/Mask specifies an explicit (stencil) mask or a color-key mask; /SMask references a soft-mask image that supplies per-pixel alpha.

Inline images: BI / ID / EI

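Inline images appear inside the page content stream. Their grammar is compact: BI opens the image, abbreviated key/value pairs follow (for example /W, /H, /CS, /BPC and /F for width, height, color space, bits per component and filter), ID marks the start of the image data, and EI closes it. Because the data sits in the middle of the content stream, inline images are not reusable and are sometimes harder to extract programmatically.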

Inline images are convenient for small icons but are discouraged for large images due to duplication and parsing complexity.
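To make the BI / ID / EI grammar concrete, here is a minimal Python sketch that reads the header of the first inline image in a content stream and expands the abbreviated keys and values. It assumes the header holds only simple scalar values (no arrays or nested dictionaries), and the sample content stream is hypothetical.

```python
import re

# Abbreviated inline-image keys and the full names they stand for.
KEY_NAMES = {
    "BPC": "BitsPerComponent", "CS": "ColorSpace", "D": "Decode",
    "DP": "DecodeParms", "F": "Filter", "H": "Height",
    "IM": "ImageMask", "I": "Interpolate", "W": "Width",
}
# Abbreviated color space and filter names allowed in inline images.
VALUE_NAMES = {
    "G": "DeviceGray", "RGB": "DeviceRGB", "CMYK": "DeviceCMYK",
    "AHx": "ASCIIHexDecode", "A85": "ASCII85Decode", "Fl": "FlateDecode",
    "DCT": "DCTDecode", "LZW": "LZWDecode", "RL": "RunLengthDecode",
    "CCF": "CCITTFaxDecode",
}

def parse_inline_image_header(content: bytes) -> dict:
    """Return the key/value pairs found between BI and ID (first image only)."""
    match = re.search(rb"(?:^|\s)BI\s(.*?)\sID\s", content, re.DOTALL)
    if match is None:
        raise ValueError("no inline image found")
    tokens = match.group(1).decode("ascii").split()
    if len(tokens) % 2:
        raise ValueError("header is not a sequence of key/value pairs")
    header = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        key = key.lstrip("/")
        if value.startswith("/"):
            value = "/" + VALUE_NAMES.get(value[1:], value[1:])
        header[KEY_NAMES.get(key, key)] = value
    return header

# Hypothetical content stream fragment containing one tiny inline image.
SAMPLE = b"q 16 0 0 16 72 700 cm BI /W 2 /H 1 /CS /G /BPC 8 /F /AHx ID 00FF> EI Q"
header = parse_inline_image_header(SAMPLE)
```

Locating the end of the image data (EI) reliably is the harder part in practice, because the bytes "EI" can also occur inside binary image data; this sketch does not attempt it.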

Filters & image compression: Flate, DCT, JPX, JBIG2 and more

Image streams in PDF are often compressed using filters. Filters can be applied singly or in arrays (multiple filters). The most common:

Filter	Typical use	Characteristics
`/DCTDecode`	JPEG images	Lossy, small size, ubiquitous
`/JPXDecode`	JPEG 2000	Lossy or lossless, better compression at high quality
`/JBIG2Decode`	Monochrome scanned images	Excellent for bi-level scans, complex decoding
`/FlateDecode`	PNG-like deflate streams	Lossless, used for image data and content streams
`/ASCII85Decode`, `/ASCIIHexDecode`	ASCII-safe encoding	Wraps binary data as ASCII text for transport; increases size

Common pitfalls with filters

- /Length should match the exact number of bytes between stream and endstream. If it does not, strict parsers and extraction tools may fail or read truncated data.
- Chained filters must be applied in the correct order: when decoding, apply the filters in the order they are listed in the /Filter array (the first listed is applied first).
- JBIG2 streams can depend on a shared global segments stream (typically symbol dictionaries) referenced through /JBIG2Globals in /DecodeParms; decoding without it yields gibberish or fails.
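The filter-order rule can be sketched in Python with the standard library alone. The /Filter array lists filters in the order they are applied when decoding, so the loop below simply walks the list from first to last. This is a minimal sketch: it covers only three filters and ignores /DecodeParms (for example PNG predictors), and the function names are illustrative.

```python
import base64
import zlib

def _ascii_hex_decode(data: bytes) -> bytes:
    # Data ends at '>'; whitespace is ignored; an odd final digit is padded with 0.
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))

def _ascii85_decode(data: bytes) -> bytes:
    # PDF ASCII85 data ends with '~>' and normally has no '<~' prefix.
    return base64.a85decode(data.split(b"~>", 1)[0])

DECODERS = {
    "/ASCIIHexDecode": _ascii_hex_decode,
    "/ASCII85Decode": _ascii85_decode,
    "/FlateDecode": zlib.decompress,
}

def decode_stream(data: bytes, filters) -> bytes:
    """Apply each filter in the order it appears in the /Filter entry."""
    if isinstance(filters, str):
        filters = [filters]  # /Filter may be a single name instead of an array
    for name in filters:
        decoder = DECODERS.get(name)
        if decoder is None:
            raise NotImplementedError(f"no decoder for {name}")
        data = decoder(data)
    return data

# Round trip: the data was deflated first, then ASCII85-encoded,
# so decoding applies ASCII85Decode first and FlateDecode second.
raw = b"hello pdf " * 10
encoded = base64.a85encode(zlib.compress(raw)) + b"~>"
decoded = decode_stream(encoded, ["/ASCII85Decode", "/FlateDecode"])
```

DCT, JPX and JBIG2 data need real image decoders and are deliberately left out of the table of decoders.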

Color spaces, masks and soft masks (SMask)

Color handling is integral to image fidelity. The /ColorSpace entry can be:

Device color spaces: /DeviceRGB, /DeviceCMYK, /DeviceGray
Calibrated or ICC-based: /ICCBased
Indexed color spaces for palettes

Masks and transparency

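- /Mask defines binary transparency: either an explicit (stencil) mask image for cut-out shapes or a color-key range of sample values to treat as transparent.
- /SMask references a soft-mask image (DeviceGray) that provides per-pixel alpha values.
- /Matte, an entry on the soft-mask image, indicates that the color data was preblended with a matte color; blend modes are set through the graphics state (/BM in an ExtGState) and control compositing.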

When converting to other formats (e.g., PDF→PNG), soft masks must be applied to the color image to reconstruct alpha transparency correctly.
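As a rough illustration of applying a soft mask, the following Python sketch interleaves decoded RGB samples with the decoded SMask samples to produce RGBA. It assumes both images are already decoded to 8 bits per component, share the same dimensions, and carry no /Matte entry; real files often need resampling or un-premultiplication first.

```python
def apply_soft_mask(rgb: bytes, alpha: bytes, width: int, height: int) -> bytes:
    """Combine packed RGB samples and a grayscale soft mask into packed RGBA."""
    pixels = width * height
    if len(rgb) != pixels * 3 or len(alpha) != pixels:
        raise ValueError("sample counts do not match the image dimensions")
    rgba = bytearray(pixels * 4)
    rgba[0::4] = rgb[0::3]  # red
    rgba[1::4] = rgb[1::3]  # green
    rgba[2::4] = rgb[2::3]  # blue
    rgba[3::4] = alpha      # soft-mask value becomes the alpha channel
    return bytes(rgba)

# Two pixels: half-transparent red, opaque green.
rgba = apply_soft_mask(bytes([255, 0, 0, 0, 255, 0]), bytes([128, 255]), 2, 1)
```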

Layers in PDF: Optional Content Groups (OCG) and configuration

Layers in PDF are implemented via Optional Content (OC) — objects that may be visible or hidden according to a configuration. The main parts:

/OCProperties in the Catalog defines available OCGs and the default configuration.
OCG objects are dictionaries with a /Name and optional properties.
Content is tied to an OCG, or to an Optional Content Membership Dictionary (OCMD), either through marked-content operators tagged /OC in the content stream or through an /OC entry on an XObject or annotation dictionary.

OCG example (simplified)

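Content streams reference OCGs through marked content: /OC /MC0 BDC ... EMC, where /MC0 is a name in the page's /Resources /Properties dictionary that maps to the OCG (for example /MC0 20 0 R), because content streams cannot contain indirect references. Form and image XObjects, as well as annotations, can instead carry an /OC entry in their own dictionaries.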

Use-cases for layers

Architectural drawings (floors, electrical plans)
Map overlays (roads, labels, points of interest)
Multilingual print proofs (alternate language layers)
Optional annotations and review markup (hiding a layer does not remove its content, so layers are not a substitute for true redaction)

Not all PDF viewers respect OC configurations identically. Tools must read /OCProperties to render layer-aware previews and also consider Optional Content Membership dictionaries for complex behaviors.
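How a viewer might derive initial layer visibility from the default configuration can be sketched as follows. The configuration is modeled as a plain Python dict with the /BaseState, /ON and /OFF entries; the object numbers and layer names are hypothetical, and OCMDs, /AS usage rules and the "Unchanged" base state are not handled (Unchanged is simply treated as visible here).

```python
def default_visibility(ocg_ids, config):
    """Map each OCG id to True (visible) or False (hidden).

    The base state is applied to every group first, then the /ON list,
    then the /OFF list, mirroring how a configuration is applied.
    """
    base_visible = config.get("BaseState", "ON") != "OFF"
    state = {ocg: base_visible for ocg in ocg_ids}
    for ocg in config.get("ON", []):
        state[ocg] = True
    for ocg in config.get("OFF", []):
        state[ocg] = False
    return state

# Hypothetical layers keyed by object number.
LAYERS = {20: "Electrical", 21: "Plumbing", 22: "Review notes"}
visibility = default_visibility(LAYERS, {"BaseState": "ON", "OFF": [22]})
visible_names = [LAYERS[n] for n, shown in visibility.items() if shown]
```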

Metadata: Info dictionary vs XMP (XML) packet

PDFs historically carried a simple Info dictionary with fields like /Title, /Author, /Subject. Modern PDFs embed rich metadata using XMP (Extensible Metadata Platform), an XML packet stored in a stream and referenced from the catalog (/Metadata).

Info dictionary example
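As an illustration, the Python sketch below holds a hypothetical Info dictionary as raw bytes and pulls out its literal-string entries with a regular expression. All field values are invented for the example, and the regex assumes simple strings without escapes, nested parentheses or hex/UTF-16 encoding, which real files frequently use.

```python
import re
from datetime import datetime, timezone

# Hypothetical Info dictionary as it might appear in a PDF file body.
SAMPLE_INFO = (
    b"<< /Title (Site Plan, Level 2) /Author (Example Author) "
    b"/Producer (ExampleWriter 1.0) /CreationDate (D:20240115093000Z) >>"
)

# Matches /Key (value) for literal strings without escapes or nesting.
ENTRY = re.compile(rb"/(\w+)\s*\(([^()\\]*)\)")

def parse_info(raw: bytes) -> dict:
    return {key.decode("ascii"): value.decode("latin-1")
            for key, value in ENTRY.findall(raw)}

def parse_pdf_date(value: str) -> datetime:
    # Handles only the UTC form D:YYYYMMDDHHmmSSZ; offsets such as
    # +01'00' and truncated dates are not covered by this sketch.
    return datetime.strptime(value, "D:%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)

info = parse_info(SAMPLE_INFO)
created = parse_pdf_date(info["CreationDate"])
```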

XMP metadata (recommended)

XMP is a namespaced XML block that can contain Dublin Core, PDF-specific, custom schema, and rights/licensing metadata. It is stored in a metadata stream (/Type /Metadata, /Subtype /XML) that the Catalog references through its /Metadata entry.

Why XMP matters

Structured metadata (searchable and machine-readable)
Can embed licensing, provenance, timestamps, content classifications
Used for PDF/A and archival workflows

When migrating or ingesting PDFs, read both Info and XMP. Some generators populate only one; XMP is generally authoritative if present.
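Reading both sources and letting XMP win can be sketched in Python with xml.etree.ElementTree. The sample packet is hypothetical and omits the xpacket wrapper; the sketch only reads dc:title and dc:creator written as child elements, whereas real packets may also express simple properties as attributes on rdf:Description.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical XMP packet body (xpacket processing instructions omitted).
SAMPLE_XMP = b"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Site Plan, Level 2</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Example Author</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

def read_xmp(packet: bytes) -> dict:
    """Extract title and first creator, keyed by the matching Info field names."""
    root = ET.fromstring(packet)
    found = {}
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    if title is not None and title.text:
        found["Title"] = title.text
    creator = root.find(".//dc:creator/rdf:Seq/rdf:li", NS)
    if creator is not None and creator.text:
        found["Author"] = creator.text
    return found

def merge_metadata(info: dict, xmp: dict) -> dict:
    """Start from the Info dictionary and let XMP values override it."""
    merged = dict(info)
    merged.update(xmp)
    return merged

merged = merge_metadata({"Title": "untitled", "Subject": "Plans"}, read_xmp(SAMPLE_XMP))
```

Keeping the Info values as a fallback means fields that exist in only one source are still preserved.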

Where images, layers and metadata live: objects, streams, and the XRef

PDF files are a set of numbered objects. Streams (binary) are where image bytes and XMP XML live. The cross-reference (XRef) maps object numbers to file offsets. Understanding where to find image data requires parsing the XRef and locating XObject streams referenced by Page resources.

Typical lookup flow to find images

Open Catalog, read Page tree to find target Page object.
Inspect Page /Resources /XObject dictionary for image XObject names (e.g., /Im1 10 0 R).
Follow reference to the stream object and decode using filters.
If /SMask or /ColorSpace references exist, decode those streams too.
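The lookup flow above can be sketched against a toy object table, with PDF dictionaries modeled as Python dicts and indirect references as a small Ref class. Everything here (the Ref type, the object numbers, the sample objects) is hypothetical scaffolding; a real implementation would get resolved objects from a PDF library. Images nested inside form XObjects are not visited.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """Stand-in for an indirect reference such as '10 0 R'."""
    num: int
    gen: int = 0

def resolve(objects, value):
    # Follow indirect references until a direct object is reached.
    while isinstance(value, Ref):
        value = objects[value]
    return value

def iter_pages(objects, node, inherited=None):
    """Yield (page, resources) pairs, passing /Resources down the page tree."""
    node = resolve(objects, node)
    resources = node.get("/Resources", inherited)
    if node.get("/Type") == "/Pages":
        for kid in resolve(objects, node.get("/Kids", [])):
            yield from iter_pages(objects, kid, resources)
    else:
        yield node, resolve(objects, resources) or {}

def find_images(objects, catalog_ref):
    """Return (page index, resource name, reference) for each image XObject."""
    catalog = resolve(objects, catalog_ref)
    images = []
    for index, (_page, resources) in enumerate(iter_pages(objects, catalog["/Pages"])):
        xobjects = resolve(objects, resources.get("/XObject", {}))
        for name, target in xobjects.items():
            if resolve(objects, target).get("/Subtype") == "/Image":
                images.append((index, name, target))
    return images

OBJECTS = {
    Ref(1): {"/Type": "/Catalog", "/Pages": Ref(2)},
    Ref(2): {"/Type": "/Pages", "/Kids": [Ref(3)], "/Count": 1},
    Ref(3): {"/Type": "/Page", "/Parent": Ref(2),
             "/Resources": {"/XObject": {"/Im1": Ref(10), "/Fm1": Ref(11)}}},
    Ref(10): {"/Type": "/XObject", "/Subtype": "/Image", "/Width": 640,
              "/Height": 480, "/Filter": "/DCTDecode", "/SMask": Ref(12)},
    Ref(11): {"/Type": "/XObject", "/Subtype": "/Form"},
    Ref(12): {"/Type": "/XObject", "/Subtype": "/Image", "/Width": 640,
              "/Height": 480, "/ColorSpace": "/DeviceGray"},
}
images = find_images(OBJECTS, Ref(1))
```

The soft mask (object 12) is reachable only through the /SMask entry of the image that uses it, which is why it does not appear in the page's /XObject dictionary.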

Extracting images and layer-aware processing

Image extraction must decode filters and apply color profiles and masks. Layer-aware extraction requires interpreting Optional Content Membership and deciding whether to include content from hidden layers.

Step-by-step extraction (programmatic)

Parse XRef to access objects (libraries: qpdf, pdfcpu, pypdf (formerly PyPDF2), PoDoFo).
Find Page → /Resources → /XObject entries.
For each XObject where /Subtype /Image, read dictionary to determine filters, color space, bpc (bits per component).
Decode filters in order to obtain raw image bytes.
Apply ICC profile if /ColorSpace /ICCBased is present to map to sRGB or desired space.
Apply soft mask by compositing the SMask into the image as alpha.
Save as PNG/JPEG depending on original compression and need for alpha.

For JBIG2 and JPX, use native decoders. For chained filters, decode in correct order. Failure to do so results in corrupted extractions.
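For the final save step, decoded samples can be written as PNG with nothing more than zlib and struct. The sketch below is a minimal encoder for 8-bit gray, RGB or RGBA samples; it uses no scanline filtering and writes no color profile, so it illustrates the container format rather than an optimized writer. DCT-encoded images without a mask can usually be saved as .jpg by writing the stream bytes unchanged instead.

```python
import struct
import zlib

# Number of channels -> PNG colour type (grayscale, truecolour, truecolour+alpha).
PNG_COLOR_TYPES = {1: 0, 3: 2, 4: 6}

def _chunk(tag: bytes, payload: bytes) -> bytes:
    # Each chunk is: length, type, data, CRC-32 over type + data.
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)

def encode_png(samples: bytes, width: int, height: int, channels: int) -> bytes:
    """Encode packed 8-bit samples (row-major, no padding) as a PNG file."""
    stride = width * channels
    if len(samples) != stride * height:
        raise ValueError("sample count does not match dimensions")
    # Every scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + samples[row * stride:(row + 1) * stride] for row in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, PNG_COLOR_TYPES[channels], 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )

# Two RGBA pixels: half-transparent red and opaque green.
png = encode_png(bytes([255, 0, 0, 128, 0, 255, 0, 255]), 2, 1, 4)
```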

Repairing broken image streams, masks and metadata

Repairs fall into two categories: automated rebuilding and manual reconstruction.

Automated approaches

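Rebuild the XRef by rewriting the file with a repair-capable tool so that object offsets are correct; qpdf, for example, attempts to reconstruct a damaged cross-reference table automatically when it reads the file (e.g. qpdf damaged.pdf repaired.pdf).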
Validate and correct stream /Length entries so decoders read proper byte ranges.
Attempt filter fallback (e.g., if DCT fails, try treating as JPX if file header suggests JPEG2000).
For damaged masks, reconstruct alpha by sampling nearby pixels or re-running OCR/segmentation for scanned content.
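Filter fallback usually starts by checking what the stream bytes actually look like. Here is a small Python sketch that guesses the likely filter from well-known magic bytes; the function name is illustrative, and a match is only a hint, since a zlib-style header in particular can appear by coincidence.

```python
def guess_filter(data: bytes):
    """Guess a PDF filter name from the leading bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "/DCTDecode"  # JPEG start-of-image marker
    if data.startswith(b"\x00\x00\x00\x0cjP  \r\n\x87\n"):
        return "/JPXDecode"  # JP2 signature box
    if data.startswith(b"\xff\x4f\xff\x51"):
        return "/JPXDecode"  # raw JPEG 2000 codestream (SOC + SIZ markers)
    if (len(data) >= 2 and data[0] & 0x0F == 8
            and (data[0] << 8 | data[1]) % 31 == 0):
        return "/FlateDecode"  # plausible zlib header (deflate, valid check bits)
    return None
```

If the guess disagrees with the /Filter entry in the dictionary, trying the guessed decoder first is a reasonable fallback order.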

Manual & advanced repair

- Use a hex editor to locate stream boundaries and manually extract bytes between stream and endstream. Then try decompressing with zlib (Flate) or JPEG tools.
- For missing XMP, reconstruct metadata using Info dictionary and external records. For archival, repackage metadata into a proper XMP stream and update the Catalog reference.
- For corrupted JBIG2, you may need the original symbol dictionary; without it, consider raster extraction + OCR.
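The hex-editor approach can be partly automated. This Python sketch carves the bytes between stream and endstream without trusting /Length, then inflates them in small chunks so that whatever decodes before the first error is kept. The names are illustrative, and the regular expression is a simplification: it can be fooled by the word "endstream" occurring inside binary data and may clip a trailing CR byte that belongs to the data.

```python
import re
import zlib

# Stream data starts after 'stream' + EOL and ends before EOL + 'endstream'.
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

def carve_streams(pdf_bytes: bytes):
    """Return the raw bytes of every stream found by pattern matching."""
    return [match.group(1) for match in STREAM.finditer(pdf_bytes)]

def inflate_tolerant(data: bytes, chunk: int = 1024) -> bytes:
    """Inflate as much as possible, stopping quietly at the first zlib error."""
    decoder = zlib.decompressobj()
    out = bytearray()
    for start in range(0, len(data), chunk):
        try:
            out += decoder.decompress(data[start:start + chunk])
        except zlib.error:
            break  # keep what was recovered before the corruption
    return bytes(out)

# Hypothetical damaged-file fragment with a deliberately wrong /Length.
payload = b"BT /F1 12 Tf (Hello) Tj ET"
body = zlib.compress(payload)
pdf = (b"1 0 obj\n<< /Length 9999 /Filter /FlateDecode >>\nstream\n"
       + body + b"\nendstream\nendobj\n")
recovered = [inflate_tolerant(raw) for raw in carve_streams(pdf)]
```

Comparing the length of each carved stream with its declared /Length is also a cheap way to flag objects whose dictionaries need correcting.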

SaveFaste's repair pipeline combines XRef rebuilding, stream validation, filter detection heuristics and OCR fallback to maximize recovery success on image-heavy PDFs.

Optimization: balance quality, size and performance

For web delivery and fast viewing, consider:

Image recompression: convert high-resolution images to efficient formats (JPX for photographic content, Flate for lossless line art).
Downsampling: reduce DPI for display PDFs (e.g., 150–200 DPI for on-screen).
Font & image deduplication: reuse XObjects across pages instead of inlining duplicates.
Linearization: enable it to support Fast Web View and progressive page rendering.

Automation pattern

A reliable pipeline:

Analyze image content type (text/line art vs photo vs scan).
Choose compression & target DPI.
Re-encode image stream and update XObject dictionary (new /Filter and /Length).
Regenerate XRef and write new PDF.
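The "choose compression and target DPI" step comes down to simple arithmetic: an image's effective resolution is its pixel width divided by its placed width in inches (points / 72). A minimal Python sketch follows, with illustrative function names and a filter mapping that mirrors the recommendations in this guide rather than any fixed rule.

```python
def effective_dpi(pixel_width: int, placed_width_pt: float) -> float:
    """Resolution of an image as placed on the page (72 points per inch)."""
    return pixel_width / (placed_width_pt / 72.0)

def downsample_size(pixel_width, pixel_height, placed_width_pt, target_dpi):
    """Return new pixel dimensions, never upsampling."""
    current = effective_dpi(pixel_width, placed_width_pt)
    if current <= target_dpi:
        return pixel_width, pixel_height
    scale = target_dpi / current
    return max(1, round(pixel_width * scale)), max(1, round(pixel_height * scale))

# Content type -> filter, following the guidance in this article.
FILTER_BY_CONTENT = {
    "photo": "/JPXDecode",
    "line_art": "/FlateDecode",
    "mono_scan": "/JBIG2Decode",
}

# A 3000 x 2000 px image placed 360 pt (5 in) wide is 600 DPI on the page.
dpi = effective_dpi(3000, 360)
new_size = downsample_size(3000, 2000, 360, 150)
```

Here the image would be reduced to a quarter of its linear size before re-encoding, after which the XObject's /Width, /Height, /Filter and /Length must all be updated together.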

FAQ

Q: Are inline images always smaller than image XObjects?

A: No. Inline images are only smaller when the image is tiny; for repeated or large images, XObjects are more efficient due to reuse and separate storage.

Q: How can I tell if a PDF uses layers?

A: Inspect the Catalog for an /OCProperties entry. Layer names are defined in OCG objects and can be listed in the configuration dictionary.

Q: Why do some extracted images look washed out?

A: Often due to missing ICC profile or incorrect color space handling. Apply embedded ICC profiles (ICCBased) or map DeviceCMYK to sRGB carefully to preserve colors.

Q: Is XMP metadata mandatory?

A: No—XMP is recommended and more expressive, but not mandatory. If present, it should be treated as authoritative over the Info dictionary.