SaveFaste — Technical Documentation

Deep Technical Analysis of Font Systems in PDF: Embedding, Subsetting & Font Mapping

Author: SaveFaste Engineering • Updated: December 2025

1. Why PDF fonts matter

Fonts are the backbone of every PDF. They define how text is stored, displayed, extracted, embedded, and preserved across platforms. For document-processing platforms such as SaveFaste, understanding fonts is crucial for:

accurate text extraction
high-quality PDF→Word conversion
OCR post-processing
preserving Unicode characters
repairing corrupted files

PDF font handling is more complex than HTML or DOCX because a PDF content stream does not store plain text. It stores character codes that select glyphs through each font's encoding.

2. Font architecture inside PDFs

Every font inside a PDF is described by a dictionary object that defines its subtype, base font name, encoding, and glyph widths, and that can point to a font descriptor holding the (optional) embedded font program. The general structure:

<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>

In more complex cases (e.g., CIDFonts for CJK languages), the structure may include:

/CIDSystemInfo
/DescendantFonts
/CIDToGIDMap
/ToUnicode

PDF font systems must accommodate Unicode, legacy encodings, vertical writing, glyph substitution, and device-independent rendering. This is why fonts are among the longest and most intricate parts of the PDF specification.
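To make the dictionary structure concrete, here is a minimal Python sketch that reads the simple Type1 dictionary shown earlier into a mapping. The function name `parse_simple_font_dict` is hypothetical, and the parser is deliberately limited: it assumes every value is a PDF name, so it does not handle nested dictionaries, arrays, numbers, or indirect references, all of which real font dictionaries usually contain.

```python
import re


def parse_simple_font_dict(text: str) -> dict[str, str]:
    """Parse a flat PDF dictionary whose values are all names (/Key /Value)."""
    body = text.strip()
    if not (body.startswith("<<") and body.endswith(">>")):
        raise ValueError("not a PDF dictionary")
    # Collect every name token between the << and >> delimiters.
    names = re.findall(r"/([^\s/<>\[\]()]+)", body[2:-2])
    if len(names) % 2:
        raise ValueError("expected /Key /Value pairs")
    # Even positions are keys, odd positions are their values.
    return dict(zip(names[0::2], names[1::2]))


FONT = """<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>"""

font = parse_simple_font_dict(FONT)
print(font["Subtype"], font["BaseFont"], font["Encoding"])
```

A production parser would work on the tokenized object stream of the file instead of a regular expression, but the key/value shape of the result is the same.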

3. Types of fonts in PDF

PDF defines several font categories.

3.1 Type1 fonts

PostScript-based fonts with vector outlines. Common in older PDFs and print documents.

3.2 TrueType & OpenType

The most common font formats in modern PDFs. A PDF can embed the entire font program or only a subset of it.

3.3 Type0 (CID) fonts

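Composite fonts originally designed for large character sets such as Chinese, Japanese, and Korean, and now widely used for Unicode text in general. They use CMaps to map character codes to CIDs (character identifiers).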
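As a small illustration of what a CMap does, the sketch below handles only the predefined Identity-H CMap, which treats every two bytes of a string as one big-endian code and uses that code directly as the CID. The function name `identity_h_cids` is hypothetical, and other CMaps (with variable-length codes or explicit ranges) need a real CMap parser.

```python
def identity_h_cids(data: bytes) -> list[int]:
    """Split a shown string into 2-byte big-endian codes.

    Under Identity-H each 2-byte code is its own CID.
    """
    if len(data) % 2:
        raise ValueError("Identity-H strings have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The hex string <00240025> from a content stream yields two CIDs.
print(identity_h_cids(bytes.fromhex("00240025")))  # [36, 37]
```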

3.4 Type3 fonts

Glyphs defined directly with PDF drawing operators instead of a font program. Rare, but hard for converters because the glyphs often carry no reliable Unicode information.

Font Type	Usage	Complexity
Type1	Legacy PDFs, print workflows	Medium
TrueType/OpenType	Modern PDFs	Medium
Type0/CID	Asian scripts, Unicode	High
Type3	Custom shapes	Very High

4. Font embedding

Embedding ensures that text appears the same on every device, regardless of system fonts.

There are two types of embedding:

Full embedding — the entire font program is stored inside the PDF.
Partial embedding — only used glyphs are stored (subsetting).
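A quick way to see whether a font is embedded at all is to look at its font descriptor: an embedded program is referenced through /FontFile (Type 1), /FontFile2 (TrueType), or /FontFile3 (compact or OpenType data, identified by the stream's own /Subtype). The sketch below assumes the descriptor has already been parsed into a Python dict keyed by name without the leading slash; `embedded_program` is a hypothetical helper, not part of any library.

```python
from typing import Optional

# Font descriptor keys that reference an embedded font program.
FONT_FILE_KEYS = {
    "FontFile": "Type 1",
    "FontFile2": "TrueType",
    "FontFile3": "compact/OpenType (see the stream's /Subtype)",
}


def embedded_program(descriptor: dict) -> Optional[str]:
    """Return the kind of embedded font program, or None if nothing is embedded."""
    for key, kind in FONT_FILE_KEYS.items():
        if key in descriptor:
            return kind
    return None


print(embedded_program({"FontName": "Helvetica"}))  # None
print(embedded_program({"FontName": "Calibri", "FontFile2": "12 0 R"}))  # TrueType
```

This check says only that a font program is present. Telling full embedding from a subset additionally relies on the subset tag in the font name.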

Why embedding matters

prevents missing font substitutions
ensures consistent branding in corporate documents
enables offline printing
increases compatibility across OS platforms

5. Font subsetting

Subsetting reduces file size by embedding only the glyphs used. A subset font's name carries a tag of six uppercase letters followed by a plus sign:

BTYFRA+Calibri

This means:

glyph indexes are remapped
original ordering is lost
Unicode mapping becomes non-trivial

This is why converters that assume a standard encoding often produce garbled text from subset fonts.
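Detecting a subset is straightforward because the tag has a fixed shape. The following sketch splits a /BaseFont value into its tag and the underlying font name; `split_subset_name` is a hypothetical helper name.

```python
import re
from typing import Optional

# Six uppercase ASCII letters, a plus sign, then the real font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(base_font: str) -> tuple[Optional[str], str]:
    """Return (tag, font name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font


print(split_subset_name("BTYFRA+Calibri"))  # ('BTYFRA', 'Calibri')
print(split_subset_name("Helvetica"))       # (None, 'Helvetica')
```

Two subsets of the same font in one file carry different tags, so a converter should treat them as separate fonts with separate code-to-glyph mappings.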

6. Character-to-glyph mapping

When text is written in a PDF, it is not stored as Unicode. Instead, the PDF stores character codes that reference glyphs in the font program.

There are three mapping layers:

Character code → CID/GID
CID/GID → glyph outline
Character code → Unicode (optional, via /ToUnicode)

If the third layer is missing, the PDF may be readable visually but impossible to extract correctly.
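The toy model below illustrates why rendering and extraction can diverge. It is only a sketch: `ToyFont` is a hypothetical class, outlines are stood in for by strings, and the optional Unicode layer is keyed by character code, the way a /ToUnicode CMap is. Rendering needs only the first two layers, so both fonts draw the same glyphs, but only the one with a Unicode layer yields text.

```python
from dataclasses import dataclass, field

REPLACEMENT = "\ufffd"  # U+FFFD marks codes that have no Unicode mapping


@dataclass
class ToyFont:
    code_to_gid: dict[int, int]        # layer 1: character code -> glyph ID
    gid_to_outline: dict[int, str]     # layer 2: glyph ID -> outline (stand-in)
    code_to_unicode: dict[int, str] = field(default_factory=dict)  # layer 3

    def render(self, codes: bytes) -> list[str]:
        """Resolve each code to its outline; Unicode is never consulted."""
        return [self.gid_to_outline[self.code_to_gid[c]] for c in codes]

    def extract(self, codes: bytes) -> str:
        """Resolve each code to text; unmapped codes become U+FFFD."""
        return "".join(self.code_to_unicode.get(c, REPLACEMENT) for c in codes)


outlines = {1: "<outline H>", 2: "<outline i>"}
with_map = ToyFont({1: 1, 2: 2}, outlines, {1: "H", 2: "i"})
without_map = ToyFont({1: 1, 2: 2}, outlines)

shown = bytes([1, 2])
print(with_map.render(shown) == without_map.render(shown))  # True
print(with_map.extract(shown))                               # Hi
print(without_map.extract(shown) == REPLACEMENT * 2)         # True
```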

7. ToUnicode and CMaps

The /ToUnicode CMap maps a font's character codes back to Unicode characters.

Example ToUnicode extract:

<< /Length ... >>   % stream dictionary (length elided); no /Type /ToUnicode entry is defined
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<21> <0041>   % A
<22> <0042>   % B
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

Without this, extraction becomes guesswork.
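Reading such a CMap is mostly a matter of collecting the hex pairs. The sketch below handles only bfchar sections and single-byte codes, matching the extract above. The names `parse_bfchar` and `decode_single_byte` are hypothetical, and bfrange sections, multi-byte codespaces, and comments that happen to contain hex pairs are not handled.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict[bytes, str]:
    """Collect source code -> Unicode text pairs from all bfchar sections."""
    mapping: dict[bytes, str] = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # Destination values are UTF-16BE code units.
            mapping[bytes.fromhex(src)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_single_byte(data: bytes, mapping: dict[bytes, str]) -> str:
    """Decode a string of 1-byte codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(bytes([b]), "\ufffd") for b in data)


CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<21> <0041>   % A
<22> <0042>   % B
endbfchar
"""

table = parse_bfchar(CMAP)
print(decode_single_byte(b"\x21\x22\x21", table))  # ABA
```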

8. Common issues in font extraction

Missing ToUnicode → unreadable text
Subset fonts → glyph remapping problems
Damaged font streams
Encrypted PDFs
CID fonts missing CMaps

SaveFaste’s converters implement fallback heuristics to reconstruct text even when ToUnicode is absent.
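One widely used fallback, shown here only as an illustration and not as a description of SaveFaste's actual implementation, is to derive Unicode from glyph names when the font program still has them. Names of the form uniXXXX and uXXXX follow the Adobe Glyph List naming convention. The tiny `SMALL_AGL` table and the function `glyph_name_to_text` are hypothetical stand-ins for the full glyph list, and ligature names joined with underscores are not handled.

```python
import re
from typing import Optional

# A handful of Adobe Glyph List entries; the real list has thousands.
SMALL_AGL = {"space": " ", "period": ".", "comma": ",", "hyphen": "-", "one": "1"}
UNI_NAME = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
U_NAME = re.compile(r"^u([0-9A-F]{4,6})$")


def glyph_name_to_text(name: str) -> Optional[str]:
    """Guess the text for a glyph name, or return None when no rule applies."""
    base = name.split(".", 1)[0]  # drop suffixes such as ".alt"
    if base in SMALL_AGL:
        return SMALL_AGL[base]
    if len(base) == 1 and base.isascii() and base.isalpha():
        return base  # single-letter names such as "A" map to themselves
    match = UNI_NAME.match(base)
    if match:
        digits = match.group(1)
        return "".join(chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4))
    match = U_NAME.match(base)
    if match and int(match.group(1), 16) <= 0x10FFFF:
        return chr(int(match.group(1), 16))
    return None


print(glyph_name_to_text("uni0041"))  # A
print(glyph_name_to_text("A.alt"))    # A
print(glyph_name_to_text("g123"))     # None
```

Subset fonts frequently rename glyphs to opaque names such as g123, in which case this heuristic returns None and other signals (glyph shapes, widths, OCR) are needed.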

9. Repairing damaged fonts

When fonts are corrupted, typical repair steps include:

Rebuilding ToUnicode mapping via heuristics.
Reconstructing widths from the embedded font program.
Fixing incorrect /Encoding or /CIDToGIDMap.
Extracting and reassembling font subsets.
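Re-saving the PDF with a rebuilt cross-reference table so that font objects resolve reliably (linearization by itself only optimizes a file for incremental web viewing).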
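Once a code-to-text mapping has been recovered by whatever heuristic is available, the first repair step amounts to serializing it as a new ToUnicode CMap. The sketch below does this for single-byte codes only. `build_tounicode_cmap` is a hypothetical name, the CMap name Adobe-Identity-UCS is a commonly seen convention, and the limit of 100 entries per bfchar section follows the usual CMap convention. Writing the result back into the PDF as a stream object is left to a PDF library.

```python
def build_tounicode_cmap(mapping: dict[int, str]) -> str:
    """Serialize 1-byte code -> text pairs as the body of a ToUnicode CMap."""
    if any(not 0 <= code <= 0xFF for code in mapping):
        raise ValueError("this sketch supports single-byte codes only")
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    for start in range(0, len(items), 100):  # at most 100 entries per section
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            target = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{target}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines)


rebuilt = build_tounicode_cmap({0x21: "A", 0x22: "B"})
print(rebuilt)
```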

10. PDF conversion challenges (PDF → Word/HTML)

Fonts are one of the most common reasons PDF-to-Word conversions fail. Challenges include:

subset glyphs not matching Unicode
vertical writing modes (CJK)
symbol fonts with no Unicode equivalents
missing /ToUnicode maps
Type3 handmade glyphs

SaveFaste applies AI-powered reconstruction to map glyphs to their closest textual meaning.

11. Performance & size optimizations

remove unused glyphs
compress embedded fonts
convert Type1 font programs to the more compact CFF (Type1C) representation
reuse resources across pages
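Resource reuse can be sketched as content-based deduplication: if several objects hold byte-identical font programs, every reference can point at one of them. The example assumes the font streams have already been decoded into a dict keyed by object number; `dedupe_font_streams` is a hypothetical helper, and rewriting the references inside the PDF is not shown.

```python
import hashlib


def dedupe_font_streams(streams: dict[int, bytes]) -> dict[int, int]:
    """Map each object number to the lowest-numbered object with identical bytes."""
    first_seen: dict[str, int] = {}
    remap: dict[int, int] = {}
    for obj_num in sorted(streams):
        digest = hashlib.sha256(streams[obj_num]).hexdigest()
        # setdefault keeps the first object number recorded for this digest.
        remap[obj_num] = first_seen.setdefault(digest, obj_num)
    return remap


fonts = {5: b"font-program-A", 7: b"font-program-B", 9: b"font-program-A"}
print(dedupe_font_streams(fonts))  # {5: 5, 7: 7, 9: 5}
```

Subsets of the same font made from different pages usually differ byte for byte, so this only catches exact duplicates. Merging distinct subsets requires rebuilding the font program.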

FAQ

Q: Why do some PDFs show garbled text when copied?

A: Because the PDF does not include a proper /ToUnicode mapping.

Q: Can subset fonts cause extraction errors?

A: Yes. Subsets remap glyph indices, making extraction difficult unless ToUnicode exists.

Q: Why do some PDFs embed fonts fully?

A: For archiving, legal documents, or precise printing.

Q: Do font issues affect PDF→Word conversion?

A: Absolutely. Fonts are one of the most common reasons conversions break.

Conclusion

Understanding PDF fonts — embedding, subsetting, encoding, CID systems, and ToUnicode — is essential for accurate extraction and conversion. Platforms like SaveFaste rely on deep parsing of font systems to deliver reliable and high-quality document processing.