Usermanual: expand clusters chapter.

7 years ago · 53ac46e974
parent 30cb45b3ea
commit 53ac46e974
1 changed files with 473 additions and 270 deletions
--- a/docs/usermanual-clusters.xml
+++ b/docs/usermanual-clusters.xml
@@ -5,194 +5,363 @@
  <!ENTITY version SYSTEM "version.xml">
 ]>
 <chapter id="clusters">
 <sect1 id="clusters">
  <title>Clusters</title>
  <section id="clusters">
    <title>Clusters</title>
    <para>
      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
      characters that needs to be treated as a single, indivisible
      unit.
    </para>
    <para>
      During the shaping process, some shaping operations may
      merge adjacent characters (for example, when two code points form
      a ligature and are replaced by a single glyph) or split one
      character into several (for example, when performing the Unicode
      canonical decomposition of a code point).
    </para>
    <para>
      HarfBuzz tracks clusters independently from how these
      shaping operations alter the individual glyphs that comprise the
      output HarfBuzz returns in a buffer. Consequently,
      a client program using HarfBuzz can utilize the cluster
      information to implement features such as:
    </para>
    <itemizedlist>
      <listitem>
 	<para>
 	  Correctly positioning the cursor between two characters that
 	  have combined into a single glyph by forming a ligature.
 	</para>
      </listitem>
      <listitem>
 	<para>
 	  Correctly highlighting a text selection that includes some,
 	  but not all, of the characters comprising a ligature. 
 	</para>
      </listitem>
      <listitem>
 	<para>
 	  Applying text attributes (such as color or underlining) to
 	  part, but not all, of a composed base-and-mark combination.
 	</para>
      </listitem>
      <listitem>
 	<para>
 	  Generating output document formats (such as PDF) with
 	  embedded text that can be fully extracted.
 	</para>
      </listitem>
      <listitem>
 	<para>
 	  Performing line-breaking, justification, and other
 	  line-level or paragraph-level operations that must be done
 	  after shaping is complete, but which require character-level
 	  properties.
 	</para>
      </listitem>
    </itemizedlist>
    <para>
      When you add text to a HarfBuzz buffer, each code point is assigned
      a <emphasis>cluster value</emphasis>.
    </para>
    <para>
      This cluster value is an arbitrary number; HarfBuzz uses it only
      to distinguish between clusters. Many client programs will use
      the index of each code point in the input text stream as the
      cluster value, as a matter of convenience; the actual value does
      not matter.
    </para>
    <para>
      Client programs can choose how HarfBuzz handles clusters during
      shaping by setting the
      <literal>cluster_level</literal> of the
      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
      clustering support for this property:
    </para>
    <itemizedlist>
      <listitem>
 	<para><emphasis>Level 0</emphasis> is the default and
 	reproduces the behavior of the old HarfBuzz library.
 	</para>
 	<para>
-    In shaping text, a <emphasis>cluster</emphasis> is a sequence of
+	  The distinguishing feature of level 0 behavior is that, at
-    code points that needs to be treated as a single, indivisible unit.
+	  the beginning of processing the buffer, all code points that
 	  are categorized as <emphasis>marks</emphasis>,
 	  <emphasis>modifier symbols</emphasis>, or
 	  <emphasis>Emoji extended pictographic</emphasis> modifiers,
 	  as well as the <emphasis>Zero Width Joiner</emphasis> and
 	  <emphasis>Zero Width Non-Joiner</emphasis> code points, are
 	  assigned the cluster value of the closest preceding code
 	  point from a <emphasis>different</emphasis> category.
 	</para>
 	<para>
-    When you add text to a HB buffer, each character is associated with
+	  In essence, whenever a base character is followed by a mark
-    a <emphasis>cluster value</emphasis>. This is an arbitrary number as
+	  character or a sequence of mark characters, those marks are
-    far as HB is concerned.
+	  reassigned to the same initial cluster value as the base
 	  character. This reassignment is referred to as
 	  "merging" the affected clusters. This behavior is based on
 	  the Grapheme Cluster Boundary specification in <ulink
 	  url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
 	  Technical Report 29</ulink>.
 	</para>
 	<para>
-    Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
+	  Client programs can specify level 0 behavior for a buffer by
-    actual number does not matter. Moreover, it is not required for the
+	  setting its <literal>cluster_level</literal> to
-    cluster values to be monotonically increasing, but pretty much all
+	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. 
    of HB's tests are performed on monotonically increasing cluster
    numbers. Nevertheless, there is no such assumption in the code
    itself. With that in mind, let's examine what happens with cluster
    values during shaping under each cluster-level.
 	</para>
      </listitem>
      <listitem>
 	<para>
-    HarfBuzz provides three <emphasis>levels</emphasis> of clustering
+	  <emphasis>Level 1</emphasis> tweaks the old behavior
-    support. Level 0 is the default behavior and reproduces the behavior
+	  slightly to produce better results. Therefore, level 1
-    of the old HarfBuzz library. Level 1 tweaks this behavior slightly
+	  clustering is recommended for code that is not required to
-    to produce better results, so level 1 clustering is recommended for
+	  implement backward compatibility with the old HarfBuzz.
    code that is not required to implement backward compatibility with
    the old HarfBuzz.
 	</para>
 	<para>
-    Level 2 differs significantly in how it treats cluster values.
+	  Level 1 differs from level 0 by not merging the 
-    Levels 0 and 1 both process ligatures and glyph decomposition by
+	  clusters of marks and other modifier code points with the
-    merging clusters; level 2 does not.
+	  preceding "base" code point's cluster. By preserving the
 	  cluster values of these marks and modifier code points,
 	  script shaping can perform additional operations that might
 	  lead to improved results (for example, reordering a sequence
 	  of marks).
 	</para>
 	<para>
 	  Client programs can specify level 1 behavior for a buffer by
 	  setting its <literal>cluster_level</literal> to
 	  <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. 
 	</para>
      </listitem>
      <listitem>
 	<para>
 	  <emphasis>Level 2</emphasis> differs significantly in how it
 	  treats cluster values. In level 2, HarfBuzz never merges
 	  clusters.
 	</para>
 	<para>
-    The conceptual model for what the cluster values mean, in levels 0
+	  This difference can be seen most clearly when HarfBuzz processes
-    and 1, is this:
+	  ligature substitutions and glyph decompositions. In level 0 
 	  and level 1, ligatures and glyph decomposition both involve
 	  merging clusters; in level 2, neither of these operations
 	  triggers a merge.
 	</para>
 	<para>
 	  Client programs can specify level 2 behavior for a buffer by
 	  setting its <literal>cluster_level</literal> to
 	  <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. 
 	</para>
      </listitem>
    </itemizedlist>
    <para>
      It is not <emphasis>required</emphasis> that the cluster values
      in a buffer be monotonically increasing. However, if the initial
      cluster values in a buffer are monotonic and the buffer is
      configured to use clustering level 0 or 1, then HarfBuzz
      guarantees that the final cluster values in the shaped buffer
      will also be monotonic. No such guarantee is made for cluster
      level 2.
    </para>
    <para>
      In levels 0 and 1, HarfBuzz implements the following conceptual model for
      cluster values:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
 	<para>
-        the sequence of cluster values will always remain monotone
+          The sequence of cluster values will always remain monotonic.
 	</para>
      </listitem>
      <listitem>
 	<para>
-        each value represents a single cluster
+          Each cluster value represents a single cluster.
 	</para>
      </listitem>
      <listitem>
 	<para>
-        each cluster contains one or more glyphs and one or more
+          Each cluster contains one or more glyphs and one or more
-        characters
+          characters.
 	</para>
      </listitem>
    </itemizedlist>
    <para>
-    Assuming that initial cluster numbers were monotonically increasing
+      In practice, this model offers several benefits. Assuming that
-    and distinct, then all adjacent glyphs having the same cluster
+      the initial cluster values were monotonically increasing
-    number belong to the same cluster, and all characters belong to the
+      and distinct before shaping began, then, in the final output:
-    cluster that has the highest number not larger than their initial
+    </para>
-    cluster number. This will become clearer with an example.
+    <itemizedlist spacing="compact">
      <listitem>
 	<para>
 	  All adjacent glyphs having the same final cluster
 	  value belong to the same cluster.
 	</para>
      </listitem>
      <listitem>
 	<para>
          Each character belongs to the cluster that has the highest
 	  cluster value <emphasis>not larger than</emphasis> its
 	  initial cluster value.
 	</para>
-</sect1>
+      </listitem>
-<sect1 id="a-clustering-example-for-levels-0-and-1">
+    </itemizedlist>
  </section>
  <section id="a-clustering-example-for-levels-0-and-1">
    <title>A clustering example for levels 0 and 1</title>
    <para>
-    Let's say we start with the following character sequence and cluster
+      The guarantees and benefits of level 0 and level 1 can be seen
-    values:
+      with some examples. First, let us examine what happens with cluster
      values when shaping involves cluster merging with ligatures and
      decomposition.
    </para>
    <para>
      Let's say we start with the following character sequence (top row) and
      initial cluster values (bottom row):
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
-</programlisting>
+    </programlisting>
    <para>
-    We then map the characters to glyphs. For simplicity, let's assume
+      During shaping, HarfBuzz maps these characters to glyphs from
-    that each character maps to the corresponding, identical-looking
+      the font. For simplicity, let's assume that each character maps
-    glyph:
+      to the corresponding, identical-looking glyph:
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
-</programlisting>
+    </programlisting>
    <para>
      Now if, for example, <literal>B</literal> and <literal>C</literal>
-    ligate, then the clusters to which they belong &quot;merge&quot;.
+      form a ligature, then the clusters to which they belong
-    This merged cluster takes for its cluster number the minimum of all
+      &quot;merge&quot;. This merged cluster takes for its cluster
-    the cluster numbers of the clusters that went in. In this case, we
+      value the minimum of all the cluster values of the clusters that
-    get:
+      went into the ligature. In this case, we get:
    </para>
    <programlisting>
      A,BC,D,E
      0,1 ,3,4
-</programlisting>
+    </programlisting>
    <para>
-    Now let's assume that the <literal>BC</literal> glyph decomposes
+      because 1 is the minimum of the set {1,2}, which were the
-    into three components, and <literal>D</literal> also decomposes into
+      cluster values of <literal>B</literal> and
-    two. The components each inherit the cluster value of their parent:
+      <literal>C</literal>. 
    </para>
    <para>
      Next, let us say that the <literal>BC</literal> ligature glyph
      decomposes into three components, and <literal>D</literal> also
      decomposes into two components. These components each inherit the
      cluster value of their parent: 
    </para>
    <programlisting>
      A,BC0,BC1,BC2,D0,D1,E
      0,1  ,1  ,1  ,3 ,3 ,4
-</programlisting>
+    </programlisting>
    <para>
-    Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
+      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
-    their clusters (numbers 1 and 3) merge into
+      ligature, then their clusters (cluster values 1 and 3) merge into
      <literal>min(1,3) = 1</literal>:
    </para>
    <programlisting>
      A,BC0,BC1,BC2D0,D1,E
      0,1  ,1  ,1    ,1 ,4
-</programlisting>
+    </programlisting>
    <para>
      At this point, cluster 1 means: the character sequence
      <literal>BCD</literal> is represented by glyphs
      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
      further.
    </para>
-</sect1>
+  </section>
-<sect1 id="reordering-in-levels-0-and-1">
+  <section id="reordering-in-levels-0-and-1">
    <title>Reordering in levels 0 and 1</title>
    <para>
-    Another common operation in the more complex shapers is when things
+      Another common operation in the more complex shapers is glyph
-    reorder. In those cases, to maintain monotone clusters, HB merges
+      reordering. In order to maintain a monotonic cluster sequence
-    the clusters of everything in the reordering sequence. For example,
+      when glyph reordering takes place, HarfBuzz merges the clusters
-    let's again start with the character sequence:
+      of everything in the reordering sequence.
    </para>
    <para>
      For example, let us again start with the character sequence (top
      row) and initial cluster values (bottom row):
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
-</programlisting>
+    </programlisting>
    <para>
      If <literal>D</literal> is reordered before <literal>B</literal>,
-    then the <literal>B</literal>, <literal>C</literal>, and
+      then HarfBuzz merges the <literal>B</literal>,
-    <literal>D</literal> clusters merge, and we get:
+      <literal>C</literal>, and <literal>D</literal> clusters, and we
      get:
    </para>
    <programlisting>
      A,D,B,C,E
      0,1,1,1,4
-</programlisting>
+    </programlisting>
    <para>
      This is clearly not ideal, but it is the only sensible way to
-    maintain monotone indices and retain the true relationship between
+      maintain a monotonic sequence of cluster values and retain the
-    glyphs and characters.
+      true relationship between glyphs and characters.
    </para>
-</sect1>
+  </section>
-<sect1 id="the-distinction-between-levels-0-and-1">
+  <section id="the-distinction-between-levels-0-and-1">
    <title>The distinction between levels 0 and 1</title>
    <para>
-    So, the above is pretty much what cluster levels 0 and 1 do. The
+      The preceding examples demonstrate the main effects of using
-    only difference between the two is this: in level 0, at the very
+      cluster levels 0 and 1. The only difference between the two
-    beginning of the shaping process, we also merge clusters between
+      levels is this: in level 0, at the very beginning of the shaping
-    base characters and all Unicode marks (combining or not) following
+      process, HarfBuzz also merges clusters between any base character
-    them. E.g.:
+      and all Unicode marks (combining or not) that follow it.
    </para>
    <para>
      For example, let us start with the following character sequence
      (top row) and accompanying initial cluster values (bottom row):
    </para>
    <programlisting>
      A,acute,B
      0,1    ,2
-</programlisting>
+    </programlisting>
    <para>
-    will become:
+      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
      using cluster level 0 on this sequence, then the
      <literal>A</literal> and <literal>acute</literal> clusters will
      merge, and the result will become:
    </para>
    <programlisting>
      A,acute,B
      0,0    ,2
-</programlisting>
+    </programlisting>
    <para>
-    This is the default behavior. We do it because Windows did it and
+      This initial cluster merging is the default behavior of the
-    old HarfBuzz did it, so this remained the default. But this behavior
+      Windows shaping engine, and the old HarfBuzz codebase copied
-    makes it impossible to color diacritic marks differently from their
+      that behavior to maintain compatibility. Consequently, it has
-    base characters. That's why in level 1 we do not perform this
+      remained the default behavior in the new HarfBuzz codebase.
    initial merging step.
    </para>
    <para>
-    For clients, level 0 is more convenient if they rely on HarfBuzz
+      But this initial cluster-merging behavior makes it impossible to
-    clusters for cursor positioning. But that's wrong anyway: cursor
+      color diacritic marks differently from their base
-    positions should be determined based on Unicode grapheme boundaries,
+      characters. That is why, in level 1, HarfBuzz does not perform
-    NOT shaping clusters. As such, level 1 clusters are preferred.
+      the initial merging step.
    </para>
    <para>
-    One last note about levels 0 and 1. We currently don't allow a
+      For client programs that rely on HarfBuzz cluster values to
      perform cursor positioning, level 0 is more convenient. But
      relying on cluster boundaries for cursor positioning is wrong: cursor
      positions should be determined based on Unicode grapheme
      boundaries, not on shaping-cluster boundaries. As such, level 1
      clusters are preferred. 
    </para>
    <para>
      One last note about levels 0 and 1. HarfBuzz currently does not allow a
      <literal>MultipleSubst</literal> lookup to replace a glyph with zero
-    glyphs (i.e., to delete a glyph). But in some other situations,
+      glyphs (in other words, to delete a glyph). But, in some other situations,
      glyphs can be deleted. In those cases, if the glyph being deleted is
-    the last glyph of its cluster, we make sure to merge the cluster
+      the last glyph of its cluster, HarfBuzz makes sure to merge the cluster
      with a neighboring cluster.
    </para>
    <para>
-    This is, primarily, to make sure that the starting cluster of the
+      This is done primarily to make sure that the starting cluster of the
      text always has the cluster index pointing to the start of the text
      for the run; more than one client currently relies on this
      guarantee.
@@ -204,107 +373,141 @@
      in the run was deleted. HarfBuzz might do something similar in the
      future.
    </para>
-</sect1>
+  </section>
-<sect1 id="level-2">
+  <section id="level-2">
    <title>Level 2</title>
    <para>
-    Level 2 is a different beast from levels 0 and 1. It is simple to
+      HarfBuzz's level 2 cluster behavior uses a significantly
-    describe, but hard to make sense of. It simply doesn't do any
+      different model from that of level 0 and level 1.
    cluster merging whatsoever. When things ligate or otherwise multiple
    glyphs turn into one, the cluster value of the first glyph is
    retained.
    </para>
    <para>
-    Here are a few examples of why processing cluster values produced at
+      The level 2 behavior is easy to describe, but it may be
-    this level might be tricky:
+      difficult to understand in practical terms. In brief, level 2 
      performs no merging of clusters whatsoever.
    </para>
  <sect2 id="ligatures-with-combining-marks">
    <title>Ligatures with combining marks</title>
    <para>
-      Imagine capital letters are bases and lower case letters are
+      When glyphs form a ligature (or when some other feature
-      combining marks. With an input sequence like this:
+      substitutes multiple glyphs with one glyph), the cluster value
      of the first glyph is retained as the cluster value for the
      ligature. However, no subsequent clusters &mdash; including
      marks and modifiers &mdash; are affected.
    </para>
    <para>
      Level 2 cluster behavior is less complex than level 0 or level
      1, but there are a few cases in which processing cluster values
      produced at level 2 may be tricky. 
    </para>
    <section id="ligatures-with-combining-marks-in-level-2">
      <title>Ligatures with combining marks in level 2</title>
      <para>
 	The first example of how HarfBuzz's level 2 cluster behavior
 	can be tricky is when the text to be shaped includes combining
 	marks attached to ligatures.
      </para>
      <para>
 	Let us start with an input sequence with the following
 	characters (top row) and initial cluster values (bottom row):
      </para>
      <programlisting>
-  A,a,B,b,C,c
+	A,acute,B,breve,C,circumflex
-  0,1,2,3,4,5
+	0,1    ,2,3    ,4,5
-</programlisting>
+      </programlisting>
      <para>
-      if <literal>A,B,C</literal> ligate, then here are the cluster
+	If the sequence <literal>A,B,C</literal> forms a ligature,
-      values one would get under the various levels:
+	then these are the cluster values HarfBuzz will return under
 	the various cluster levels:
      </para>
      <para>
-      level 0:
+	Level 0:
      </para>
      <programlisting>
-  ABC,a,b,c
+	ABC,acute,breve,circumflex
-  0  ,0,0,0
+	0  ,0    ,0    ,0
-</programlisting>
+      </programlisting>
      <para>
-      level 1:
+	Level 1:
      </para>
      <programlisting>
-  ABC,a,b,c
+	ABC,acute,breve,circumflex
-  0  ,0,0,5
+	0  ,0    ,0    ,5
-</programlisting>
+      </programlisting>
      <para>
-      level 2:
+	Level 2:
      </para>
      <programlisting>
-  ABC,a,b,c
+	ABC,acute,breve,circumflex
-  0  ,1,3,5
+	0  ,1    ,3    ,5
-</programlisting>
+      </programlisting>
      <para>
 	Making sense of the level 2 result is the hardest for a client
 	program, because there is nothing in the cluster values that
 	indicates that <literal>B</literal> and <literal>C</literal>
 	formed a ligature with <literal>A</literal>.
      </para>
      <para>
-      Making sense of the last example is the hardest for a client,
+	In contrast, the "merged" cluster values of the mark glyphs
-      because there is nothing in the cluster values to suggest that
+	that are seen in the level 0 and level 1 output are evidence
-      <literal>B</literal> and <literal>C</literal> ligated with
+	that a ligature substitution took place. 
      <literal>A</literal>.
      </para>
-  </sect2>
+    </section>
-  <sect2 id="reordering">
+    <section id="reordering-in-level-2">
-    <title>Reordering</title>
+      <title>Reordering in level 2</title>
      <para>
-      Another tricky case is when things reorder. Under level 2:
+	Another example of how HarfBuzz's level 2 cluster behavior
 	can be tricky is when glyphs reorder. Consider an input sequence
 	with the following characters (top row) and initial cluster
 	values (bottom row):
      </para>
      <programlisting>
 	A,B,C,D,E
 	0,1,2,3,4
-</programlisting>
+      </programlisting>
      <para>
 	Now imagine <literal>D</literal> moves before
-      <literal>B</literal>:
+	<literal>B</literal> in a reordering operation. The cluster
 	values will then be:
      </para>
      <programlisting>
 	A,D,B,C,E
 	0,3,1,2,4
-</programlisting>
+      </programlisting>
      <para>
-      Now, if <literal>D</literal> ligates with <literal>B</literal>, we
+	Next, if <literal>D</literal> forms a ligature with
-      get:
+	<literal>B</literal>, the output is:
      </para>
      <programlisting>
 	A,DB,C,E
 	0,3 ,2,4
-</programlisting>
+      </programlisting>
      <para>
-      In a different scenario, <literal>A</literal> and
+	However, in a different scenario, in which the shaping rules
-      <literal>B</literal> could have ligated
+	of the script instead caused <literal>A</literal> and
-      <emphasis>before</emphasis> <literal>D</literal> reordered; that
+	<literal>B</literal> to form a ligature
-      would have resulted in:
+	<emphasis>before</emphasis> the <literal>D</literal> reordered, the
 	result would be:
      </para>
      <programlisting>
 	AB,D,C,E
 	0 ,3,2,4   
-</programlisting>
+      </programlisting>
-    <para>
+      <para>
-      There's no way to differentiate between these two scenarios based
+	There is no way for a client program to differentiate between
-      on the cluster numbers alone.
+	these two scenarios based on the cluster values
-    </para>
+	alone. Consequently, client programs that use level 2 might
-    <para>
+	need to undertake additional work in order to manage cursor
-      Another problem happens with ligatures under level 2 if the
+	positioning, text attributes, or other desired features.
-      direction of the text is forced to opposite of its natural
+      </para>
-      direction (e.g. left-to-right Arabic). But that's too much of a
+    </section>
-      corner case to worry about.
+    <section id="other-considerations-in-level-2">
-    </para>
+      <title>Other considerations in level 2</title>
-  </sect2>
+      <para>
-</sect1>
+	There may be other problems encountered with ligatures under
 	level 2, such as when the direction of the text is forced to the
 	opposite of its natural direction (for example, left-to-right
 	Arabic). But, generally speaking, these other scenarios are
 	minor corner cases that are too obscure for most client
 	programs to need to worry about.
      </para>
    </section>
  </section>
 </chapter>
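
The merge rules that the chapter describes for levels 0 and 1 can be checked with a small toy model. The sketch below is not HarfBuzz code and does not call the HarfBuzz API. The helper names (merge, ligate, decompose, reorder) are hypothetical, and it assumes the initial cluster values are distinct and monotonically increasing, as in the chapter's examples. It replays the "A,B,C,D,E" walk-through: a merged cluster takes the minimum of the cluster values that went into it, decomposed components inherit the parent's value, and reordering merges everything in the reordered span.

```python
# Toy model of the level 0 / level 1 cluster rules described in the chapter.
# A glyph is a (name, cluster_value) tuple. All helper names are hypothetical.

def merge(glyphs, start, end):
    """Give every cluster touched by glyphs[start:end] the minimum cluster value."""
    touched = {cluster for _, cluster in glyphs[start:end]}
    lowest = min(touched)
    return [(name, lowest if cluster in touched else cluster)
            for name, cluster in glyphs]

def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one ligature glyph, merging their clusters."""
    merged = merge(glyphs, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]

def decompose(glyphs, index, parts):
    """Split one glyph into components that inherit the parent's cluster value."""
    _, cluster = glyphs[index]
    return glyphs[:index] + [(p, cluster) for p in parts] + glyphs[index + 1:]

def reorder(glyphs, src, dst):
    """Move the glyph at src to dst, merging every cluster in the reordered span."""
    lo, hi = min(src, dst), max(src, dst)
    merged = merge(glyphs, lo, hi + 1)
    merged.insert(dst, merged.pop(src))
    return merged

text = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Ligature and decomposition example: B+C ligate, BC and D decompose,
# then BC2 and D0 ligate.
step = ligate(text, 1, 3, "BC")
step = decompose(step, 1, ["BC0", "BC1", "BC2"])
step = decompose(step, 4, ["D0", "D1"])
after_ligature = ligate(step, 3, 5, "BC2D0")

# Reordering example: D moves before B.
reordered = reorder(text, 3, 1)

print(after_ligature)
print(reordered)
```

In this model the final cluster values come out as 0,1,1,1,1,4 for the ligature example and 0,1,1,1,4 for the reordering example, matching the tables in the chapter. A level 2 model would skip the merge step entirely, which is why its output can be harder for a client program to interpret.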