SaveFaste — Technical Documentation

Deep Technical Analysis of Font Systems in PDF: Embedding, Subsetting & Font Mapping

Author: SaveFaste Engineering • Updated: December 2025

1. Why PDF fonts matter

Fonts are the backbone of every PDF. They define how text is stored, displayed, extracted, embedded, and preserved across platforms. For document-processing platforms such as SaveFaste, understanding fonts is crucial for accurate text extraction and reliable conversion.

PDF font handling is more complex than HTML or DOCX because a PDF does not store plain text. It stores character codes that select glyphs through font-specific encodings.

2. Font architecture inside PDFs

Every font inside a PDF has a dictionary object that defines its type, encoding, widths, base font, resources, and (optional) embedded data. The general structure:

<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
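As a rough illustration of how a tool might read such a dictionary, the sketch below parses a flat dictionary of name/name pairs into a Python dict. The function name `parse_flat_font_dict` is hypothetical, and the parser only handles the simple case shown above: no nested dictionaries, arrays, strings, numbers, or indirect references.

```python
import re

def parse_flat_font_dict(text: str) -> dict:
    """Parse a flat PDF dictionary whose keys and values are all names."""
    body = text.strip()
    if not (body.startswith("<<") and body.endswith(">>")):
        raise ValueError("not a PDF dictionary")
    # Collect every /Name token between the << and >> delimiters.
    names = re.findall(r"/([^\s/<>\[\]()]+)", body[2:-2])
    if len(names) % 2:
        raise ValueError("expected key/value pairs of names")
    # Even positions are keys, odd positions are their values.
    return dict(zip(names[0::2], names[1::2]))

font = parse_flat_font_dict("""
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
""")
print(font["Subtype"], font["BaseFont"])
```

A production parser would work on the full PDF object model instead; this sketch only shows which fields a converter typically inspects first.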

In more complex cases (e.g., CIDFonts for CJK languages), the structure may include additional entries such as /DescendantFonts, /CIDSystemInfo, and /CIDToGIDMap.

PDF font systems must accommodate Unicode, legacy encodings, vertical writing, glyph substitution, and device-independent rendering. This is why the PDF specification devotes an entire chapter to text and fonts.

3. Types of fonts in PDF

PDF defines several font categories.

3.1 Type1 fonts

PostScript-based fonts with vector outlines. Common in older PDFs and print documents.

3.2 TrueType & OpenType

The most common modern font type. PDF embeds the entire font program or a subset.

3.3 Type0 (CID) fonts

Composite fonts designed for large character sets, originally for Asian (CJK) languages and now common for Unicode text in general. They use CMaps to map character codes to CIDs (Character IDs).
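To make the code-to-CID step concrete, here is a minimal sketch that assumes the predefined Identity-H CMap, where each two-byte, big-endian character code is used directly as the CID. The function name `identity_h_cids` is hypothetical, and other CMaps need a real lookup table instead of this identity mapping.

```python
from typing import List

def identity_h_cids(data: bytes) -> List[int]:
    """Split a string shown with an Identity-H font into CIDs."""
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    # Each pair of bytes is one big-endian code, and the code equals the CID.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

print(identity_h_cids(bytes.fromhex("00410042")))
```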

3.4 Type3 fonts

User-defined glyphs encoded with PDF drawing operators. Rare but complex for converters.

Font Type          Usage                          Complexity
Type1              Legacy PDFs, print workflows   Medium
TrueType/OpenType  Modern PDFs                    Medium
Type0/CID          Asian scripts, Unicode         High
Type3              Custom shapes                  Very High

4. Font embedding

Embedding ensures that text appears the same on every device, regardless of system fonts.

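There are two types of embedding: full embedding, where the entire font program is included, and subset embedding, where only the glyphs the document uses are included.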
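One way a tool can tell whether a font is embedded at all is to look at its font descriptor: an embedded font program is referenced through /FontFile (Type1), /FontFile2 (TrueType), or /FontFile3 (other formats such as CFF). The sketch below assumes the descriptor has already been parsed into a plain Python dict; the function name and the sample descriptors are hypothetical.

```python
from typing import Optional

# Font descriptor keys that point at an embedded font program.
EMBEDDED_KEYS = ("FontFile", "FontFile2", "FontFile3")

def embedded_program_key(font_descriptor: dict) -> Optional[str]:
    """Return the key that references the embedded program, or None."""
    for key in EMBEDDED_KEYS:
        if key in font_descriptor:
            return key
    return None

# Hypothetical descriptors; the values stand in for indirect references.
embedded = {"FontName": "BTYFRA+Calibri", "FontFile2": "12 0 R"}
not_embedded = {"FontName": "Helvetica"}

print(embedded_program_key(embedded))      # FontFile2
print(embedded_program_key(not_embedded))  # None
```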

Why embedding matters

5. Font subsetting

Subsetting reduces file size by embedding only the glyphs used. A subset font's name is prefixed with a tag of six uppercase letters followed by a plus sign:

BTYFRA+Calibri

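The tag before the plus sign (here BTYFRA) marks the font as a subset of Calibri: only the glyphs the document uses are embedded, and glyph indices are often remapped in the process.

A converter that assumes the original font's glyph order, instead of following the PDF's own mappings, will therefore produce garbage text.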
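The subset prefix described above is easy to detect mechanically. The following sketch, with the hypothetical helper `split_subset_name`, separates the six-letter tag from the underlying font name.

```python
import re
from typing import Optional, Tuple

# Six uppercase letters, a plus sign, then the original font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_name(base_font: str) -> Tuple[Optional[str], str]:
    """Return (tag, font name); tag is None when the name has no subset prefix."""
    match = SUBSET_TAG.match(base_font)
    if match is None:
        return None, base_font
    return match.group(1), match.group(2)

print(split_subset_name("BTYFRA+Calibri"))  # ('BTYFRA', 'Calibri')
print(split_subset_name("Helvetica"))       # (None, 'Helvetica')
```

The tag carries no meaning beyond distinguishing one subset from another, so a converter would normally strip it before matching the name against system fonts.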

6. Character-to-glyph mapping

When text is written in a PDF, it is not stored as Unicode. Instead, the PDF stores character codes that reference glyphs in the font program.

There are three mapping layers:

  1. Character code → CID/GID
  2. CID/GID → glyph outline
  3. Character code → Unicode (optional, via /ToUnicode)

If the third layer is missing, the PDF may be readable visually but impossible to extract correctly.
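The layers can be modeled with three lookup tables. In this sketch all data is hypothetical (a tiny subset font with two glyphs), and the third table is keyed by character code, which is how a /ToUnicode CMap is keyed. It shows why rendering still works when the Unicode layer is missing while extraction does not.

```python
from typing import Dict, List, Optional

# Hypothetical data for a two-glyph subset font.
code_to_gid: Dict[int, int] = {0x21: 1, 0x22: 2}                        # layer 1
gid_to_outline: Dict[int, str] = {1: "outline-of-A", 2: "outline-of-B"}  # layer 2
to_unicode: Dict[int, str] = {0x21: "A", 0x22: "B"}                      # layer 3

def render(codes: bytes) -> List[str]:
    """Rendering needs only layers 1 and 2."""
    return [gid_to_outline[code_to_gid[code]] for code in codes]

def extract(codes: bytes, unicode_map: Optional[Dict[int, str]]) -> str:
    """Extraction needs layer 3; without it every code becomes U+FFFD."""
    if unicode_map is None:
        return "\ufffd" * len(codes)
    return "".join(unicode_map.get(code, "\ufffd") for code in codes)

shown = b"\x21\x22"
print(render(shown))               # the glyphs draw correctly either way
print(extract(shown, to_unicode))  # AB
print(extract(shown, None))        # two replacement characters
```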

7. ToUnicode and CMaps

The /ToUnicode CMap maps the character codes used in the content stream back to Unicode characters.

Example ToUnicode extract:

<< /Length ... >>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<21> <0041>   % A
<22> <0042>   % B
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

Without this, extraction becomes guesswork.
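A minimal sketch of reading such a CMap follows. It handles only bfchar sections like the one in the extract; bfrange sections, which real files also use, are left out. It assumes the stream has already been decompressed and decoded to text, and the function name `parse_bfchar` is hypothetical.

```python
import re
from typing import Dict

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Map each source character code to its Unicode string."""
    mapping: Dict[int, str] = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_PAIR.findall(block):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

sample = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<21> <0041>   % A
<22> <0042>   % B
endbfchar
"""
print(parse_bfchar(sample))  # {33: 'A', 34: 'B'}
```

With this table in hand, the bytes 0x21 0x22 in a content stream extract as "AB" even though neither byte is the ASCII code of those letters.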

8. Common issues in font extraction

SaveFaste’s converters implement fallback heuristics to reconstruct text even when ToUnicode is absent.
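One simple fallback of this kind, shown here as an illustration and not as SaveFaste's actual implementation, is to fall back on the font's named /Encoding when no ToUnicode map exists. The sketch approximates WinAnsiEncoding with Python's cp1252 codec and MacRomanEncoding with mac_roman. These are close but not exact matches for the PDF encodings, and the approach does not help for fonts with custom or remapped encodings.

```python
from typing import Optional

# Approximate Python codecs for PDF's named simple-font encodings (assumption).
APPROX_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}

def fallback_decode(codes: bytes, encoding_name: Optional[str]) -> Optional[str]:
    """Decode via the named encoding, or return None if no guess is possible."""
    codec = APPROX_CODECS.get(encoding_name or "")
    if codec is None:
        return None
    # Bytes the codec leaves undefined become U+FFFD instead of raising.
    return codes.decode(codec, errors="replace")

print(fallback_decode(b"Caf\xe9", "WinAnsiEncoding"))  # Café
print(fallback_decode(b"\x21\x22", None))              # None
```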

9. Repairing damaged fonts

When fonts are corrupted, typical repair steps include:

  1. Rebuilding ToUnicode mapping via heuristics.
  2. Reconstructing widths from embedded font program.
  3. Fixing incorrect /Encoding or /CIDToGIDMap.
  4. Extracting and reassembling font subsets.
  5. Rewriting the PDF (optionally linearized) so the repaired font objects are stored consistently.
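Step 1 ends with writing the recovered mapping back into the file. The sketch below generates the bfchar sections of a ToUnicode CMap from a code-to-text dictionary. The function name is hypothetical, the mapping itself is assumed to come from whatever heuristic recovered it, and sections are capped at 100 entries each, the per-section limit for CMap files.

```python
from typing import Dict

def build_bfchar_sections(mapping: Dict[int, str], code_bytes: int = 1,
                          chunk: int = 100) -> str:
    """Serialize a code -> Unicode mapping as one or more bfchar sections."""
    items = sorted(mapping.items())
    lines = []
    for start in range(0, len(items), chunk):
        part = items[start:start + chunk]
        lines.append(f"{len(part)} beginbfchar")
        for code, text in part:
            src = f"{code:0{code_bytes * 2}X}"             # zero-padded hex code
            dst = text.encode("utf-16-be").hex().upper()   # UTF-16BE in hex
            lines.append(f"<{src}> <{dst}>")
        lines.append("endbfchar")
    return "\n".join(lines)

print(build_bfchar_sections({0x21: "A", 0x22: "B"}))
```

The output still has to be wrapped in the begincmap/endcmap boilerplate and stored as a stream before the font's /ToUnicode entry can point at it.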

10. PDF conversion challenges (PDF → Word/HTML)

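Fonts are one of the most common reasons PDF-to-Word conversions fail. Challenges covered in this document include missing /ToUnicode maps, subset fonts with remapped glyph indices, and Type3 fonts whose glyphs are defined only by drawing operators.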

SaveFaste applies AI-powered reconstruction to map glyphs to their closest textual meaning.

11. Performance & size optimizations

FAQ

Q: Why do some PDFs show garbled text when copied?

A: Because the PDF does not include a proper /ToUnicode mapping.

Q: Can subset fonts cause extraction errors?

A: Yes. Subsets remap glyph indices, making extraction difficult unless ToUnicode exists.

Q: Why do some PDFs embed fonts fully?

A: For archiving, legal documents, or precise printing.

Q: Do font issues affect PDF→Word conversion?

A: Absolutely. Fonts are one of the most common reasons conversions break.

Conclusion

Understanding PDF fonts — embedding, subsetting, encoding, CID systems, and ToUnicode — is essential for accurate extraction and conversion. Platforms like SaveFaste rely on deep parsing of font systems to deliver reliable and high-quality document processing.