Docs: usermanual, add Shaping Concepts chapter.

6 years ago · 3a27e8fb97
parent 9aa865dcc6
commit 3a27e8fb97
2 changed files with 371 additions and 2 deletions
--- a/docs/harfbuzz-docs.xml
+++ b/docs/harfbuzz-docs.xml
@ -13,8 +13,8 @@
      <para>
        HarfBuzz is an <ulink url="http://www.microsoft.com/typography/otspec/">OpenType</ulink>
        text shaping engine. Using the HarfBuzz library allows
-	programs to convert a sequence of Unicode input text into
-	properly formatted and positioned output&mdash;for any writing
+	programs to convert a sequence of Unicode input into
+	properly formatted and positioned text output&mdash;for any writing
 	system and language.
      </para>

@ -34,6 +34,7 @@
      <xi:include href="usermanual-what-is-harfbuzz.xml"/>
      <xi:include href="usermanual-install-harfbuzz.xml"/>
      <xi:include href="usermanual-hello-harfbuzz.xml"/>
+      <xi:include href="usermanual-shaping-concepts.xml"/>
      <xi:include href="usermanual-buffers-language-script-and-direction.xml"/>
      <xi:include href="usermanual-fonts-and-faces.xml"/>
      <xi:include href="usermanual-clusters.xml"/>
--- a/docs/usermanual-shaping-concepts.xml
+++ b/docs/usermanual-shaping-concepts.xml
@ -0,0 +1,368 @@
+<chapter id="shaping-concepts">
+  <title>Shaping concepts</title>
+  <section id="text-shaping-concepts">
+    <title>Text shaping</title>
+    <para>
+      Text shaping is the process of transforming a sequence of Unicode
+      codepoints that represent individual characters (letters,
+      diacritics, tone marks, numbers, symbols, etc.) into the
+      orthographically and linguistically correct two-dimensional layout
+      of glyph shapes taken from a specified font.
+    </para>
+    <para>
+      For some writing systems (or <emphasis>scripts</emphasis>) and
+      languages, the process is simple, requiring the shaper to do
+      little more than advance the horizontal position forward by the
+      correct amount for each successive glyph.
+    </para>
+    <para>
+      But, for <emphasis>complex scripts</emphasis>, any combination of
+      several shaping operations may be required, and the rules for how
+      and when they are applied vary from script to script. HarfBuzz and
+      other shaping engines implement these rules.
+    </para>
+    <para>
+      The exact rules and necessary operations for a particular script
+      constitute a shaping <emphasis>model</emphasis>. OpenType
+      specifies a set of shaping models that covers all of
+      Unicode. Other shaping models are available, however, including
+      Graphite and Apple Advanced Typography (AAT). 
+    </para>
+  </section>
+  
+  <section id="complex-scripts">
+    <title>Complex scripts</title>
+    <para>
+      In text-shaping terminology, scripts are generally classified as
+      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
+    </para>
+    <para>
+      Complex scripts are those for which transforming the input
+      sequence into the final layout requires some combination of
+      operations&mdash;such as context-dependent substitutions,
+      context-dependent mark positioning, glyph-to-glyph joining,
+      glyph reordering, or glyph stacking.
+    </para>
+    <para>
+      In some complex scripts, the shaping rules require that a text
+      run be divided into syllables before the operations can be
+      applied. Other complex scripts may apply shaping operations over
+      entire words or over the entire text run, with no subdivision
+      required.
+    </para>
+    <para>
+      Non-complex scripts, by definition, do not require these
+      operations. However, correctly shaping a text run in a
+      non-complex script may still involve Unicode normalization,
+      ligature substitutions, mark positioning, kerning, and applying
+      other font features. The key difference is that a text run in a
+      non-complex script can be processed sequentially and in the same
+      order as the input sequence of Unicode codepoints, without
+      requiring an analysis stage.
+    </para>
+  </section>
+  
+  <section id="shaping-operations">
+    <title>Shaping operations</title>
+    <para>
+      Shaping a complex-script text run involves transforming the
+      input sequence of Unicode codepoints with some combination of
+      operations that is specified in the shaping model for the
+      script.
+    </para>
+    <para>
+      The specific conditions that trigger a given operation for a
+      text run varies from script to script, as do the order that the
+      operations are performed in and which codepoints are
+      affected. However, the same general set of shaping operations is
+      common to all of the complex-script shaping models. 
+    </para>
+    
+    <itemizedlist>
+      <listitem>
+	<para>
+	  A <emphasis>reordering</emphasis> operation moves a glyph
+	  from its original ("logical") position in the sequence to
+	  some other ("visual") position.
+	</para>
+	<para>
+	  The shaping model for a given complex script might involve
+	  more than one reordering step.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A <emphasis>joining</emphasis> operation replaces a glyph
+	  with an alternate form that is designed to connect with one
+	  or more of the adjacent glyphs in the sequence.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A contextual <emphasis>substitution</emphasis> operation
+	  replaces either a single glyph or a subsequence of several
+	  glyphs with an alternate glyph. This substitution is
+	  performed when the original glyph or subsequence of glyphs
+	  occurs in a specified position with respect to the
+	  surrounding sequence. For example, one substitution might be
+	  performed only when the target glyph is the first glyph in
+	  the sequence, while another substitution is performed only
+	  when a different target glyph occurs immediately after a
+	  particular string pattern.
+	</para>
+	<para>
+	  The shaping model for a given complex script might involve
+	  multiple contextual-substitution operations, each applying
+	  to different target glyphs and patterns, and which are
+	  performed in separate steps.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A contextual <emphasis>positioning</emphasis> operation
+	  moves the horizontal and/or vertical position of a
+	  glyph. This positioning move is performed when the glyph
+	  occurs in a specified position with respect to the
+	  surrounding sequence.
+	</para>
+	<para>
+	  Many contextual positioning operations are used to place
+	  <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
+	  signs, and tone markers) with respect to
+	  <emphasis>base</emphasis> glyphs. However, some complex
+	  scripts may use contextual positioning operations to
+	  correctly place base glyphs as well, such as
+	  when the script uses <emphasis>stacking</emphasis> characters.
+	</para>
+      </listitem>
+      
+    </itemizedlist>
+  </section>
+  
+  <section id="unicode-character-categories">
+    <title>Unicode character categories</title>
+    <para>
+      Shaping models are typically specified with respect to how
+      scripts are defined in the Unicode standard.
+    </para>
+    <para>
+      Every codepoint in the Unicode Character Database (UCD) is
+      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
+      which provides the most fundamental information about the
+      codepoint: whether the codepoint represents a
+      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
+      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
+      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
+      or something else (<emphasis>Other</emphasis>).
+    </para>
+    <para>
+      These UGC properties are "Major" categories. Each codepoint is
+      further assigned to a "minor" category within its Major
+      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
+      "Letter, modifier" (<literal>Lm</literal>).
+    </para>
+    <para>
+      Shaping models are concerned primarily with Letter and Mark
+      codepoints. The minor categories of Mark codepoints are
+      particularly important for shaping. Marks can be nonspacing
+      (<literal>Mn</literal>), spacing combining
+      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
+    </para>
+    <para>
+      In addition to the UGC property, codepoints in the Indic and
+      Southeast Asian scripts are also assigned
+      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
+      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
+      property that provides more detailed information needed for
+      shaping.
+    </para>
+    <para>
+      The UISC property sub-categorizes Letters and Marks according to
+      common script-shaping behaviors. For example, UISC distinguishes
+      between consonant letters, vowel letters, and vowel marks. The
+      UIPC property sub-categorizes Mark codepoints by the visual
+      position that they occupy (above, below, right, left, or in
+      multiple positions).
+    </para>
+    <para>
+      Some complex scripts require that the text run be split into
+      syllables, and what constitutes a valid syllable in these
+      scripts is specified in regular expressions of the Letter and
+      Mark codepoints that take the UISC and UIPC properties into account.
+    </para>
+
+  </section>
+  
+  <section id="text-runs">
+    <title>Text runs</title>
+    <para>
+      Real-world text usually contains codepoints from a mixture of
+      different Unicode scripts (including punctuation, numbers, symbols,
+      white-space characters, and other codepoints that do not belong
+      to any script). Real-world text may also be marked up with
+      formatting that changes font properties (including the font,
+      font style, and font size).
+    </para>
+    <para>
+      For shaping purposes, all real-world text streams must be first
+      segmented into runs that have a uniform set of properties. 
+    </para>
+    <para>
+      In particular, shaping models always assume that every codepoint
+      in a text run has the same <emphasis>direction</emphasis>,
+      <emphasis>script</emphasis> tag, and
+      <emphasis>language</emphasis> tag.
+    </para>
+  </section>
+  
+  <section id="opentype-shaping-models">
+    <title>OpenType shaping models</title>
+    <para>
+      OpenType provides shaping models for the following scripts:
+    </para>
+
+    <itemizedlist>
+      <listitem>
+	<para>
+	  The <emphasis>default</emphasis> shaping model handles all
+	  non-complex scripts, and may also be used as a fallback for
+	  handling unrecognized scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Indic</emphasis> shaping model handles the Indic
+	  scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
+	  Malayalam, Oriya, Tamil, Telugu, and Sinhala.
+	</para>
+	<para>
+	  The Indic shaping model was revised significantly in
+	  2005. To denote the change, a new set of <emphasis>script
+	  tags</emphasis> was assigned for Bengali, Devanagari,
+	  Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
+	  Telugu. For the sake of clarity, the term "Indic2" is
+	  sometimes used to refer to the current, revised shaping
+	  model.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Arabic</emphasis> shaping model supports
+	  Arabic, Mongolian, N'Ko, Syriac, and several other connected
+	  or cursive scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Thai/Lao</emphasis> shaping model supports
+	  the Thai and Lao scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Khmer</emphasis> shaping model supports the
+	  Khmer script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Myanmar</emphasis> shaping model supports the
+	  Myanmar (or Burmese) script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Tibetan</emphasis> shaping model supports the
+	  Tibetan script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Hangul</emphasis> shaping model supports the
+	  Hangul script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Hebrew</emphasis> shaping model supports the
+	  Hebrew script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Universal Shaping Engine</emphasis> (USE)
+	  shaping model supports complex scripts not covered by one of
+	  the above, script-specific shaping models, including
+	  Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
+	  Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
+	  Viet, and many others. 
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  Text runs that do not fall under one of the above shaping
+	  models may still require processing by a shaping engine. Of
+	  particular note is <emphasis>Emoji</emphasis> shaping, which
+	  may involve variation-selector sequences and glyph
+	  substitution. Emoji shaping is handled by the default
+	  shaping model.
+	</para>
+      </listitem>
+
+    </itemizedlist>
+    
+  </section>
+  
+  <section id="graphite-shaping">
+    <title>Graphite shaping</title>
+    <para>
+      In contrast to OpenType shaping, Graphite shaping does not
+      specify a predefined set of shaping models or a set of supported
+      scripts.
+    </para>
+    <para>
+      Instead, each Graphite font contains a complete set of rules that
+      implement the required shaping model for the intended
+      script. These rules include finite-state machines to match
+      sequences of codepoints to the shaping operations to perform.
+    </para>
+    <para>
+      Graphite shaping can perform the same shaping operations used in
+      OpenType shaping, as well as other functions that have not been
+      defined for OpenType shaping.
+    </para>
+  </section>
+  
+  <section id="aat-shaping">
+    <title>AAT shaping</title>
+    <para>
+      In contrast to OpenType shaping, AAT shaping does not specify a 
+      predefined set of shaping models or a set of supported scripts.
+    </para>
+    <para>
+      Instead, each AAT font includes a complete set of rules that
+      implement the desired shaping model for the intended
+      script. These rules include finite-state machines to match glyph
+      sequences and the shaping operations to perform.
+    </para>
+    <para>
+      Notably, AAT shaping rules are expressed for glyphs in the font,
+      not for Unicode codepoints. AAT shaping can perform the same
+      shaping operations used in OpenType shaping, as well as other
+      functions that have not been defined for OpenType shaping.
+    </para>
+  </section>
+</chapter>