Deep Technical Analysis of Font Systems in PDF: Embedding, Subsetting & Font Mapping
1. Why PDF fonts matter
Fonts are the backbone of every PDF. They define how text is stored, displayed, extracted, embedded, and preserved across platforms. For document-processing platforms such as SaveFaste, understanding fonts is crucial for:
- accurate text extraction
- high-quality PDF→Word conversion
- OCR post-processing
- preserving Unicode characters
- repairing corrupted files
PDF font handling is more complex than in HTML or DOCX because a PDF does not store plain text; it stores character codes that select glyphs through each font's encoding.
2. Font architecture inside PDFs
Every font inside a PDF is described by a dictionary object that defines its type, base font, encoding, and widths, and that points (through a font descriptor) to any embedded font data. The general structure:
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
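As a minimal sketch of how a tool might read such a dictionary, the Python snippet below parses the flat example above into key/value pairs. It assumes every value is a simple name (no nested dictionaries, arrays, or indirect references), which holds for this example but not for real-world fonts. The function name `parse_flat_font_dict` is hypothetical.

```python
import re

def parse_flat_font_dict(text: str) -> dict[str, str]:
    """Parse a flat PDF dictionary whose values are all names (e.g. /Type /Font)."""
    # Strip the << >> delimiters, then collect every /Name token in order.
    body = text.strip().removeprefix("<<").removesuffix(">>")
    names = re.findall(r"/([^\s/<>\[\]()]+)", body)
    if len(names) % 2:
        raise ValueError("expected alternating key/value names")
    # Pair the tokens as key, value, key, value, ...
    return dict(zip(names[0::2], names[1::2]))

FONT_DICT = """<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>"""

font = parse_flat_font_dict(FONT_DICT)
print(font["Subtype"], font["BaseFont"])
```

A production parser would need a full PDF tokenizer; this sketch only illustrates the key/value shape of the font dictionary.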
In more complex cases (e.g., CIDFonts for CJK languages), the structure may include:
- /CIDSystemInfo
- /DescendantFonts
- /CIDToGIDMap
- /ToUnicode
PDF font systems must accommodate Unicode, legacy encodings, vertical writing, glyph substitution, and device-independent rendering. This is why the PDF specification treats fonts at such length.
3. Types of fonts in PDF
PDF defines several font categories.
3.1 Type1 fonts
PostScript-based fonts with vector outlines. Common in older PDFs and print documents.
3.2 TrueType & OpenType
The most common modern font formats. A PDF can embed the entire font program or only a subset.
3.3 Type0 (CID) fonts
Composite fonts built for large character sets, such as CJK scripts and general Unicode text. A CMap maps character codes to CIDs (character identifiers), and a descendant CIDFont maps those CIDs to glyphs.
3.4 Type3 fonts
User-defined glyphs drawn with PDF graphics operators. Rare, but hard on converters because the glyphs often carry no reliable character identity.
| Font Type | Usage | Complexity |
|---|---|---|
| Type1 | Legacy PDFs, print workflows | Medium |
| TrueType/OpenType | Modern PDFs | Medium |
| Type0/CID | Asian scripts, Unicode | High |
| Type3 | Custom shapes | Very High |
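The categories in the table can be recognised from the /Subtype entry of a font dictionary. The sketch below maps the /Subtype names defined for font dictionaries onto the table's rows; the dictionary layout (plain Python dicts keyed without the leading slash) and the name `categorize` are assumptions made for illustration.

```python
# /Subtype values of font dictionaries, mapped to the categories in the table.
SUBTYPE_CATEGORY = {
    "Type1": "Type1",
    "MMType1": "Type1",            # multiple-master variant of Type1
    "TrueType": "TrueType/OpenType",
    "Type0": "Type0/CID",          # composite font that points at a descendant CIDFont
    "CIDFontType0": "Type0/CID",   # descendant CIDFont with Type1/CFF-style outlines
    "CIDFontType2": "Type0/CID",   # descendant CIDFont with TrueType outlines
    "Type3": "Type3",
}

def categorize(font: dict) -> str:
    """Return the table category for a parsed font dictionary."""
    return SUBTYPE_CATEGORY.get(font.get("Subtype", ""), "unknown")

print(categorize({"Subtype": "CIDFontType2"}))
```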
4. Font embedding
Embedding ensures that text appears the same on every device, regardless of system fonts.
There are two types of embedding:
- Full embedding — the entire font program is stored inside the PDF.
- Partial embedding — only used glyphs are stored (subsetting).
Why embedding matters
- prevents font substitution when a font is missing on the viewer's system
- ensures consistent branding in corporate documents
- enables offline printing
- increases compatibility across OS platforms
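One way to check whether a font is embedded is to look for a font file stream in its font descriptor: /FontFile (Type1), /FontFile2 (TrueType), or /FontFile3 (CFF and OpenType). The sketch below assumes fonts have already been parsed into plain Python dicts with indirect references resolved; the helper names are hypothetical.

```python
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")

def font_descriptor(font: dict) -> dict:
    """Return the font descriptor, looking inside the descendant for Type0 fonts."""
    if font.get("Subtype") == "Type0":
        descendants = font.get("DescendantFonts") or [{}]
        font = descendants[0]
    return font.get("FontDescriptor") or {}

def is_embedded(font: dict) -> bool:
    """True if the descriptor carries any embedded font program stream."""
    return any(key in font_descriptor(font) for key in FONT_FILE_KEYS)

standard = {"Subtype": "Type1", "BaseFont": "Helvetica"}
embedded = {
    "Subtype": "TrueType",
    "BaseFont": "BTYFRA+Calibri",
    "FontDescriptor": {"FontFile2": b"<font bytes>"},
}
composite = {
    "Subtype": "Type0",
    "DescendantFonts": [
        {"Subtype": "CIDFontType2", "FontDescriptor": {"FontFile2": b"<font bytes>"}}
    ],
}

print(is_embedded(standard), is_embedded(embedded), is_embedded(composite))
```

Type3 fonts are a special case: they have no font file, but their glyph procedures always live inside the PDF, so a check like this would need a separate branch for them.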
5. Font subsetting
Subsetting reduces file size by embedding only the glyphs used. The name of a subset font is prefixed with a tag of six uppercase letters followed by a plus sign:
BTYFRA+Calibri
As a consequence of subsetting:
- glyph indices are usually renumbered
- the original glyph ordering is lost
- Unicode mapping becomes non-trivial
This is why many low-quality PDF converters produce garbage text.
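The subset tag is easy to detect. The following sketch splits a /BaseFont value into its tag and the underlying font name; the function name `split_subset_name` is hypothetical.

```python
import re

# A subset tag is exactly six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_name(base_font: str):
    """Return (tag, font_name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font

print(split_subset_name("BTYFRA+Calibri"))
print(split_subset_name("Helvetica"))
```

The tag itself is arbitrary, so a converter should use it only to recognise a subset, never to infer which glyphs it contains.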
6. Character-to-glyph mapping
When text is written in a PDF, it is not stored as Unicode. Instead, the PDF stores character codes that reference glyphs in the font program.
There are three mapping layers:
- Character code → CID/GID
- CID/GID → glyph outline
- Character code → Unicode (optional, via /ToUnicode)
If the third layer is missing, the PDF may be readable visually but impossible to extract correctly.
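The layers can be illustrated with plain lookup tables. In this sketch the two tables are invented for a tiny hypothetical subset font that holds only "H" and "i"; rendering needs only the code-to-glyph table, while extraction depends entirely on the optional Unicode table.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character for unmapped codes

def decode_codes(codes, code_to_gid, to_unicode):
    """Return (glyph ids used for rendering, text recovered for extraction)."""
    gids = [code_to_gid[c] for c in codes]
    text = "".join(to_unicode.get(c, REPLACEMENT) for c in codes)
    return gids, text

# Hypothetical subset: codes 1 and 2 were assigned in order of first use.
code_to_gid = {1: 1, 2: 2}
to_unicode = {1: "H", 2: "i"}

print(decode_codes([1, 2], code_to_gid, to_unicode))  # renders and extracts
print(decode_codes([1, 2], code_to_gid, {}))          # renders, but text is lost
```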
7. ToUnicode and CMaps
The /ToUnicode CMap maps a font's character codes back to Unicode characters.
Example ToUnicode extract:
<< /Type /ToUnicode >>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<21> <0041> % A
<22> <0042> % B
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
Without this, extraction becomes guesswork.
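A minimal sketch of reading such a CMap in Python follows. It handles only `beginbfchar` sections (not `beginbfrange`) and assumes the stream has already been decompressed to text; the function name `parse_bfchar` is hypothetical.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map character codes to Unicode strings from the bfchar sections of a CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # Destinations are UTF-16BE and may hold several characters (ligatures).
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

CMAP = """
2 beginbfchar
<21> <0041> % A
<22> <0042> % B
endbfchar
"""

table = parse_bfchar(CMAP)
decoded = "".join(table.get(code, "\ufffd") for code in b"\x21\x22")
print(decoded)
```

Note that the codes 0x21 and 0x22 would read as "!" and a double quote in ASCII; only the CMap reveals that the glyphs shown are "A" and "B".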
8. Common issues in font extraction
- Missing ToUnicode → unreadable text
- Subset fonts → glyph remapping problems
- Damaged font streams
- Encrypted PDFs
- CID fonts missing CMaps
SaveFaste’s converters implement fallback heuristics to reconstruct text even when ToUnicode is absent.
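As an illustration of what a fallback chain can look like (this is a generic sketch, not SaveFaste's actual heuristics), the code below tries /ToUnicode first, then falls back to the font's named encoding, and finally emits replacement characters. It approximates /WinAnsiEncoding with Python's `cp1252` codec, which is close but not guaranteed identical for every code; the function name `extract_text` is hypothetical.

```python
from typing import Optional

REPLACEMENT = "\ufffd"

def extract_text(codes: bytes, to_unicode: Optional[dict], encoding: Optional[str]) -> str:
    """Decode single-byte character codes using the best mapping available."""
    if to_unicode:
        # Preferred: the explicit code -> Unicode table from /ToUnicode.
        return "".join(to_unicode.get(c, REPLACEMENT) for c in codes)
    if encoding == "WinAnsiEncoding":
        # Fallback: assumption that cp1252 is an acceptable stand-in.
        return codes.decode("cp1252", errors="replace")
    # Nothing usable: mark every code as unknown instead of guessing.
    return REPLACEMENT * len(codes)

print(extract_text(b"\x21\x22", {0x21: "A", 0x22: "B"}, None))
print(extract_text(b"Hi", None, "WinAnsiEncoding"))
```

Real converters typically add further stages, such as glyph-name lookup or shape matching, before giving up.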
9. Repairing damaged fonts
When fonts are corrupted, typical repair steps include:
- Rebuilding ToUnicode mapping via heuristics.
- Reconstructing widths from embedded font program.
- Fixing incorrect /Encoding or /CIDToGIDMap.
- Extracting and reassembling font subsets.
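- Rewriting the PDF with a rebuilt cross-reference table so parsers can locate font objects reliably (linearization by itself only reorders objects for fast web viewing).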
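The first step can be sketched as follows: once some heuristic (not shown here) has produced a table from character codes to text, a new /ToUnicode CMap can be serialised from it. The sketch assumes single-byte codes and writes `bfchar` sections of at most 100 entries each; the function name `build_tounicode_cmap` is hypothetical.

```python
def build_tounicode_cmap(mapping: dict[int, str]) -> str:
    """Serialise a code -> text table as the body of a ToUnicode CMap stream."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # Emit the entries in chunks, each prefixed with its own entry count.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines)

cmap = build_tounicode_cmap({0x21: "A", 0x22: "B"})
print(cmap)
```

The resulting text would still have to be stored as a stream object and referenced from the font dictionary's /ToUnicode entry.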
10. PDF conversion challenges (PDF → Word/HTML)
Fonts are among the most common reasons PDF-to-Word conversions fail. Challenges include:
- subset glyphs not matching Unicode
- vertical writing modes (CJK)
- symbol fonts with no Unicode equivalents
- missing /ToUnicode maps
- Type3 handmade glyphs
SaveFaste applies AI-powered reconstruction to map glyphs to their closest textual meaning.
11. Performance & size optimizations
- remove unused glyphs
- compress embedded fonts
- convert Type1 fonts to the more compact CFF (Type1C) representation
- reuse resources across pages
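Compressing an embedded font usually means storing its stream with the FlateDecode filter. The sketch below uses Python's `zlib` for this and builds the matching stream dictionary entries as a plain dict; the function name `flate_encode_font` and the sample bytes are illustrative only.

```python
import zlib

def flate_encode_font(font_program: bytes) -> tuple[bytes, dict]:
    """Compress a font program and build the matching stream dictionary entries."""
    compressed = zlib.compress(font_program, 9)
    stream_dict = {
        "Filter": "FlateDecode",
        "Length": len(compressed),     # bytes actually stored in the file
        "Length1": len(font_program),  # decoded size, as used for TrueType (/FontFile2)
    }
    return compressed, stream_dict

raw = bytes(range(256)) * 40  # stand-in for real font data
data, info = flate_encode_font(raw)
print(info["Length"], "<", info["Length1"])
```

Real font programs compress less dramatically than this repetitive sample, so removing unused glyphs first generally saves more than compression alone.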
FAQ
Q: Why do some PDFs show garbled text when copied?
A: Usually because the font lacks a correct /ToUnicode map, so its character codes cannot be translated back to Unicode.
Q: Can subset fonts cause extraction errors?
A: Yes. Subsets remap glyph indices, making extraction difficult unless a /ToUnicode map exists.
Q: Why do some PDFs embed fonts fully?
A: For archiving, legal documents, or precise printing.
Q: Do font issues affect PDF→Word conversion?
A: Absolutely. Fonts are one of the main reasons conversions break.
Conclusion
Understanding PDF fonts — embedding, subsetting, encoding, CID systems, and ToUnicode — is essential for accurate extraction and conversion. Platforms like SaveFaste rely on deep parsing of font systems to deliver reliable and high-quality document processing.