Class ZH


  • public class ZH
    extends java.lang.Object
    ZH (Mandarin Chinese) - Documentation.

    A series of simple helper routines for inspecting the special Unicode (non-Mandarin) characters often used in Mandarin HTML web pages.



    • Field Detail

      • AUC

        public static final java.lang.String AUC
        The complete list of "higher-level" (alternate) Unicode chars. Many of these are alternate punctuation marks used in documents that contain Mandarin Chinese.
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final String AUC = 
                // Special Punctuation characters found in Chinese HTML Pages
                "、 。 · ˉ ˇ ¨ 〃 々 — ~ ‖ … ‘ ’ "             +
                "“ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】"	  +
                "± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠"               +
                "⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵ "           +
                "∴ ♂ ♀ ° ′ ″ ℃ $ ¤ ¢ £ ‰ § № ☆ ★"          +
                "○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓 "            +
                "！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／"      +
        
                // Extra Alphabetic and Numeric Characters sometimes used
                // on web-pages written in Chinese
                "０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？"   +
                "＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ"   +
                "Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿"   +
                "｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ"   +
                "ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣"      +
        
                // Certain "Bullet List" / "Bullet Point" markers
                "⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖"      +
                "⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾"   +
                "⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦"         +
                "⑧ ⑨ ⑩ ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩"               +
                "Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ"               +
        
                // The "Bo Po Mo Fo" Pronunciation Used for Chinese Characters
                "ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ"   +
                "ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ";
        
      • CONSTSpecialQuoteLeft

        public static final char CONSTSpecialQuoteLeft
        Special Quotation Mark, left-side
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final char CONSTSpecialQuoteLeft = (char) 0x201C;
        
      • CONSTSpecialQuoteRight

        public static final char CONSTSpecialQuoteRight
        Special Quotation Mark, right-side
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final char CONSTSpecialQuoteRight = (char) 0x201D;
        
    • Method Detail

      • toneVowelToRegularVowel

        public static char toneVowelToRegularVowel​(char c)
        This simplifies dealing with the tone/accent marks above vowels in Chinese Pin-Yin. It converts a vowel with a tone mark over it into the regular (un-accented) vowel. This can be useful for certain String operations, although the tonal information, and with it the original meaning of the word, is lost.
        Parameters:
        c - any character from ASCII / UTF-8 / UniCode Basic Multi Lingual Plane.
        Returns:
        if this is a UTF-8 character that is an accented vowel, the un-accented version of that vowel is returned. If this is not a PinYin symbol for a tone-vowel, ASCII 0 is returned.
        See Also:
        toneVowelsToRegularVowels(String)
        Code:
        Exact Method Body:
         for (int i=0; i < CV.length; i++) if (CV[i] == c) return CV2RV[i];
         return (char) 0;
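
The lookup above depends on two private parallel arrays, CV and CV2RV, whose contents are not shown on this page. The following is a minimal sketch of the same idea, not the class's actual data. It assumes a hypothetical pair of index-aligned strings covering only the lower-case tone vowels listed under isToneVowel, and it assumes that the toned forms of ü map back to ü, which this page does not confirm.

```java
public class ToneVowelLookupSketch {
    // Hypothetical stand-ins for the private CV / CV2RV arrays (assumption).
    // Each toned vowel sits at the same index as its plain counterpart.
    private static final String TONE =
        "\u0101\u00E1\u01CE\u00E0"      // a with tones 1-4
        + "\u0113\u00E9\u011B\u00E8"    // e with tones 1-4
        + "\u012B\u00ED\u01D0\u00EC"    // i with tones 1-4
        + "\u014D\u00F3\u01D2\u00F2"    // o with tones 1-4
        + "\u016B\u00FA\u01D4\u00F9"    // u with tones 1-4
        + "\u01D6\u01D8\u01DA\u01DC";   // u-umlaut with tones 1-4
    private static final String PLAIN =
        "aaaaeeeeiiiioooouuuu\u00FC\u00FC\u00FC\u00FC";

    // Returns the plain vowel for a toned vowel, or (char) 0 for anything else.
    public static char toRegularVowel(char c) {
        int i = TONE.indexOf(c);
        return (i >= 0) ? PLAIN.charAt(i) : (char) 0;
    }

    public static void main(String[] args) {
        System.out.println(toRegularVowel('\u01CE'));      // a with caron
        System.out.println((int) toRegularVowel('x'));     // not a tone vowel
    }
}
```

Returning (char) 0 as the "no match" value mirrors the convention used throughout this class, so callers can test the result against 0.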
        
      • countToneVowels

        public static int countToneVowels​(java.lang.String pinYinStr)
        Counts the number of tone vowels in a PinYin String.
        Parameters:
        pinYinStr - A String, usually generated by Google Translate, (and scraped from Google Translate) that contains PinYin.
        Returns:
        The number of Mandarin Chinese Pin-Yin "Tone Vowels"
        Code:
        Exact Method Body:
         int count=0;
        
         TOP:
         for (int i = pinYinStr.length()-1; i >= 0; i--)
             for (int j=0; j < CV.length; j++)
                 if (pinYinStr.charAt(i) == CV[j])
                     { count++; continue TOP; }
        
         return count;
        
      • toneVowelsToRegularVowels

        public static java.lang.String toneVowelsToRegularVowels​
                    (java.lang.String s)
        
        This converts all vowels in a String from those with tones over them to the normal (un-accented) equivalent. It uses the single-character version of this method, toneVowelToRegularVowel(char).
        Parameters:
        s - any java.lang.String containing Mandarin Romanizations.
        Returns:
        a String with all accented vowels converted to regular vowels.
        See Also:
        toneVowelToRegularVowel(char)
        Code:
        Exact Method Body:
         int             strlen  = s.length();
         StringBuilder   sb      = new StringBuilder(s.length());
         char            c;
        
         for (int i=0; i < strlen; i++)
             if ((c = toneVowelToRegularVowel(s.charAt(i))) != 0)
                 sb.append(c);
             else
                 sb.append(s.charAt(i));
        
         return sb.toString();
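
As an alternative to a per-character table, the standard java.text.Normalizer can strip tone marks from a whole String. This is a swapped-in technique, not the method used by this class, and it behaves differently: it removes every combining mark, so a toned ü ends up as plain u, and accented letters outside Pin-Yin lose their accents too. The class and method names below are hypothetical.

```java
import java.text.Normalizer;

public class StripTonesSketch {
    public static String stripTones(String pinYin) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(pinYin, Normalizer.Form.NFD);

        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Input is "ni hao" written with third-tone marks on the i and the a.
        System.out.println(stripTones("n\u01D0 h\u01CEo"));
    }
}
```

Whether the loss of the umlaut matters depends on the use: it is harmless for rough searching, but it merges syllables such as "lü" and "lu".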
        
      • HTML2ChineseVowels

        public static java.lang.String HTML2ChineseVowels​(java.lang.String s)
        Google Translate returns some text encoded as "&#num;" numeric character references, also called HTML-escaped codes: instead of the actual characters, their decimal character codes (the "ord(c)") are returned, wrapped in the usual HTML escape sequence &#num;. This method performs the chr(num) conversion and replaces each escape sequence with the actual character.

        NOTE: This method only handles the "Chinese Tone Vowel" characters. The Google Translate module uses this method quite a bit. Here are a few examples of HTML escape sequences and the corresponding characters:

        HTML-Escaped    ASCII/UTF-8 Character
        &#192;          À
        &#225;          á
        &#283;          ě
        &#363;          ū
        &#474;          ǚ
        ... see array below for list


        NOTE: HTML2UTF8(String) does the exact same thing, but does not limit the characters to be converted to only Chinese Tone Vowels. This method only converts HTML-escaped characters from this list:

        private static final int[] H2CV = { 39, 192, 201, 224, 225, 232, 233, 236, 237, 242, 243,
        249, 250, 252, 256, 257, 275, 283, 299, 333, 363, 462, 464, 466, 468, 474, 476 };
        See Also:
        HTML2UTF8(String)
        Code:
        Exact Method Body:
         for (int i=0; i < H2CV.length; i++)
             s = s.replaceAll("&#" + H2CV[i] + ";", "" + (char) H2CV[i]);
        
         return s;
        
      • HTML2UTF8

        public static java.lang.String HTML2UTF8​(java.lang.String s)
        NOTE: This does the same as HTML2ChineseVowels(String), except that it converts any character that has been encoded as &#NUM;, not just the accented characters corresponding to Chinese Tone Vowels.
        See Also:
        HTML2ChineseVowels(String)
        Code:
        Exact Method Body:
         // Build the list of UTF8/ASCII character values (as Ord(c) / int) first.
         HashSet<Integer>    utfList = new HashSet<Integer>();
         Matcher             m       = P1.matcher(s);
        
         while (m.find()) utfList.add(Integer.parseInt(m.group(1)));
        
         // Now convert them.
         // String.replace is literal, so a decoded '$' or '\' cannot be misread as a group reference.
         for (Integer i : utfList) s = s.replace("&#" + i + ";", "" + ((char) i.intValue()));
        
         return s;
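
The same decoding can be done in a single pass over the input. The sketch below is an illustration under stated assumptions, not the body of HTML2UTF8: the private P1 pattern is not shown on this page, so a hypothetical pattern for decimal references is used, and the class and method names are invented for the example. Unlike a (char) cast, Character.toChars also handles code points above 0xFFFF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntitySketch {
    // Hypothetical pattern for decimal references such as &#464; (assumption).
    private static final Pattern ENTITY = Pattern.compile("&#(\\d{1,7});");

    public static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();

        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));

            // Leave the escape untouched if it is not a valid Unicode code point.
            String text = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : m.group();

            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }

        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("n&#464; h&#462;o costs &#36;5"));
    }
}
```

A single pass avoids rescanning the String once per distinct code, and it cannot re-interpret text produced by an earlier replacement.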
        
      • formatUTF8Chinese

        public static java.lang.String formatUTF8Chinese​(char c)
        This converts a Chinese character into a String that includes the character's Unicode code, represented as both a hexadecimal number and a decimal number.
        Parameters:
        c - any ASCII/UniCode/UTF-8 char - but, generally, expected to be a "Chinese Character."

        NOTE: The choice for parameter char c has no actual constraints on its input value.
        Returns:
        A String of this format: 掭(0x63AD, 25517)
        Code:
        Exact Method Body:
         return c + "(0x" + String.format("%x", ((int) c)).toUpperCase() + ", " + ((int) c) + ")";
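
The same output format can be produced with a single format string. This is a small sketch with hypothetical class and method names; it uses Locale.ROOT so that the digits do not vary with the default locale.

```java
import java.util.Locale;

public class FormatCharSketch {
    public static String describe(char c) {
        // %c prints the character, %X its code in upper-case hex, %d in decimal.
        return String.format(Locale.ROOT, "%c(0x%X, %d)", c, (int) c, (int) c);
    }

    public static void main(String[] args) {
        // U+63AD is the character used in the "Returns" example above.
        System.out.println(describe('\u63AD'));
    }
}
```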
        
      • isChinese

        public static boolean isChinese​(char c)
        Helper function that checks whether a character falls in one of the code ranges that contain Mandarin Chinese characters. This is not guaranteed to be accurate: some non-Chinese (Japanese) characters exist in these ranges. For the precise definition of what this function actually does, see the ranges in the method body below.

        The range ((c >= 0x4E00) && (c <= 0x9FFF)) was copied from:
        http://www.khngai.com/chinese/charmap/tbluni.php?page=0

        The remaining ranges (0xB0A0 through 0xF7FF) were copied from:
        http://www.khngai.com/chinese/charmap/tblgb.php?page=1
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' is in the UTF-8/UniCode range for Chinese Characters
        Code:
        Exact Method Body:
         if ((c >= 0x4E00) && (c <= 0x9FFF)) return true;
         if ((c >= 0xB0A0) && (c <= 0xBFFF)) return true;
         if ((c >= 0xC0A0) && (c <= 0xCFFF)) return true;
         if ((c >= 0xD0A0) && (c <= 0xDFFF)) return true;
         if ((c >= 0xE0A0) && (c <= 0xEFFF)) return true;
         if ((c >= 0xF0A0) && (c <= 0xF7FF)) return true;
        
         return false;
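
For comparison, the JDK can classify a character by its Unicode script. The sketch below is a swapped-in technique with hypothetical names, not what isChinese does: it ignores the 0xB0A0 through 0xF7FF ranges tested above, and it cannot tell Chinese text from Japanese kanji, since both belong to the Han script.

```java
public class HanScriptSketch {
    // True when the character belongs to the Unicode Han script.
    public static boolean isHan(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(isHan('\u4E2D'));   // a common Han ideograph
        System.out.println(isHan('\u3042'));   // Hiragana letter A
        System.out.println(isHan('a'));        // Latin letter
    }
}
```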
        
      • isOther

        public static boolean isOther​(char c)
        Checks whether a char is neither alphanumeric nor whitespace.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        ((!isAlphaNumeric(c)) && (!isSpace(c)));
        Code:
        Exact Method Body:
         return ((!isAlphaNumeric(c)) && (!isSpace(c)));
        
      • isAlphaNumeric

        public static boolean isAlphaNumeric​(char c)
        Checks if a char is alphanumeric.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        (isAlpha(c) || isNumber(c));
        Code:
        Exact Method Body:
         return (isAlpha(c) || isNumber(c));
        
      • isAlpha

        public static boolean isAlpha​(char c)
        Checks if a char is Alphabetic.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        (isToneVowel(c) || isRegVowel(c) || isRegLetter(c));
        Code:
        Exact Method Body:
         return (isToneVowel(c) || isRegVowel(c) || isRegLetter(c));
        
      • isToneVowel

        public static boolean isToneVowel​(char c)
        This is a helper function for the Mandarin Chinese accented vowel symbols in UTF-8, ASCII and UniCode. The exact character code numbers are printed below.

        NOTE: In 罗马拼音 (Pin-Yin Romanization), there are a few symbols that should never come up, at least in the 罗马拼音 results provided by the Google Cloud Server Translation API (GCS-TS/TAPI). No word in Pin-Yin ever starts with the letters I or U, or with U with an umlaut, so capitalized versions of these letters ought never to occur unless the entire PinYin were capitalized, which is something GCSTS never does.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' is one of the following:

        Simple ASCII    UTF-8 Tone Vowel
        a               ā (257), á (225), ǎ (462), à (224)
        e               ē (275), é (233), ě (283), è (232)
        i               ī (299), í (237), ǐ (464), ì (236)
        o               ō (333), ó (243), ǒ (466), ò (242)
        u               ū (363), ú (250), ǔ (468), ù (249)
        u               ǖ (470), ǘ (472), ǚ (474), ǜ (476)
        A               Ā (256), Á (193), Ǎ (461), À (192)
        E               Ē (274), É (201), Ě (282), È (200)
        O               Ō (332), Ó (211), Ǒ (465), Ò (210)

        In Mandarin Chinese, PinYin-words cannot start with these letters below. Therefore it would be highly unlikely to see a "capitalized" version of these tone-vowels.

        Simple ASCII    UTF-8 Tone Vowel
        I               Ī (298), Í (205), (there are 2: Ǐ (463), Ĭ (300)), Ì (204)
        U               Ū (362), Ú (218), Ŭ (364), Ù (217)
        U               (Ü (220) - no tone): Ǖ (469), Ǘ (471), Ǚ (473), Ǜ (475)
        Code:
        Exact Method Body:
         // A, ā 257, á 225, ǎ 462, à 224
         if ((c == 257) || (c == 225) || (c == 462) || (c == 224)) return true;
        
         // E, ē 275, é 233, ě 283, è 232
         if ((c == 275) || (c == 233) || (c == 283) || (c == 232)) return true;
                      
         // I, ī 299, í 237, ǐ 464, ì 236 
         if ((c == 299) || (c == 237) || (c == 464) || (c == 236)) return true;
        
         // O, ō 333, ó 243, ǒ	466, ò 242
         if ((c == 333) || (c == 243) || (c == 466) || (c == 242)) return true;
        
         // U, ū 363, ú 250, ǔ 468, ù 249
         if ((c == 363) || (c == 250) || (c == 468) || (c == 249)) return true;
        
         // U, ǖ 470, ǘ 472, ǚ 474, ǜ 476
         if ((c == 470) || (c == 472) || (c == 474) || (c == 476)) return true;
        
         // *******
         // Capital vowels with tone symbols
        
         // Ā 256, Á 193, Ǎ 461, À 192
         if ((c == 256) || (c == 193) || (c == 461) || (c == 192)) return true;
        
         // Ē 274, É 201, Ě 282, È 200
         if ((c == 274) || (c == 201) || (c == 282) || (c == 200)) return true;
        
         // Ō 332, Ó 211, Ǒ 465, Ò 210
         if ((c == 332) || (c == 211) || (c == 465) || (c == 210)) return true;
        
         // Not sure about these - found them on a website
         // **********************************************
         //       1234 5678 9ABC DEF
         // A8A0  āáǎà ēéěè  īíǐì ōóǒ
         //
         //       0 1234 5678 9 A
         // A8B0  ò ūúǔù  ǖǘǚǜ  ü ê
         // **********************************************
         if ((c >= 0xA8A1) && (c <= 0xA8Ba)) return true;
        
         return false;
        
      • isRegVowel

        public static boolean isRegVowel​(char c)
        Checks that a character is a standard vowel.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' EQUALS one of these ten letters: a, e, i, o, u, A, E, I, O, U
        Code:
        Exact Method Body:
         // The normal vowels
        
         // a 97, A 65
         if ((c == 97) || (c == 65))     return true;
        
         // e 101, E 69
         if ((c == 101) || (c == 69))    return true;
        
         // i 105, I 73
         if ((c == 105) || (c == 73))    return true;
        
         // o 111, O 79
         if ((c == 111) || (c == 79))    return true;
        
         // u 117, U 85
         if ((c == 117) || (c == 85))    return true;
        
         return false;
        
      • isRegLetter

        public static boolean isRegLetter​(char c)
        Regular Letters Include: 'A' ... 'Z' (65 - 90), 'a' ... 'z' (97 - 122)
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' is any letter in lower-level ASCII (and not any of the AUC).
        Code:
        Exact Method Body:
         return ((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122));
        
      • isNumber

        public static boolean isNumber​(char c)
        Regular Numbers Include: '0' ... '9'
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' is in the range of ASCII '0' ... '9' (not any of the AUC)
        Code:
        Exact Method Body:
         return ((c >= 48) && (c <= 57));
        
      • isSpace

        public static boolean isSpace​(char c)
        Checks for WhiteSpace: '\t', '\n', '\r', ' '
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        TRUE if the input character 'c' is a whitespace character code from the above list
        Code:
        Exact Method Body:
         // '\t' is 9, '\n' is 10, '\r' is 13, ' ' is 32
         return ((c == 9) || (c == 10) || (c == 13) || (c == 32));
        
      • bulletListAUC

        public static int bulletListAUC​(char c)
        Bullet List characters in upper UniCode / UTF-8. These characters exist in UTF-8 - and they are occasionally used in documents found on Chinese News Websites. They are all "bullet-list" points. An integer is returned for each of these, that is equal to the number represented by the UTF-8/UniCode character here.

        • 0 1 2 3 4 5 6 7 8 9 a b c d e f
        • N ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖
        • ⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾
        • ⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦
        • ⑧ ⑨ ⑩ N N ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ N
        • N Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ
        Parameters:
        c - any character as input
        Returns:
        The number represented by this bullet point, or 0 if the character is not one of the bullet points above.
        Code:
        Exact Method Body:
         // ⒈ ==> ⒛
         if ((c >= 0x2488) && (c <= 0x249B))	return ((int) c) - 0x2487;
        
         // ⑴ ==> ⒇
         if ((c >= 0x2474) && (c <= 0x2487))	return ((int) c) - 0x2473;	
        
         // ① ==> ⑩
         if ((c >= 0x2460) && (c <= 0x2469))	return ((int) c) - 0x245F;
        
         // ㈠ ==> ㈩
         if ((c >= 0x3220) && (c <= 0x3229))	return ((int) c) - 0x321F;
        
         // Ⅰ ==> Ⅻ
         if ((c >= 0x2160) && (c <= 0x216B))	return ((int) c) - 0x215F;
        
         return 0;
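
Each branch above applies the same rule: if the character lies in a run of consecutive bullet characters, its value is its offset from the start of the run plus one. A table-driven sketch of that rule follows, using the ranges from the method body; the class, method, and array names are hypothetical.

```java
public class BulletNumberSketch {
    // {first, last} character code of each run; every run starts at the value 1.
    private static final int[][] RUNS = {
        {0x2488, 0x249B},   // digit followed by full stop, 1 to 20
        {0x2474, 0x2487},   // parenthesized number, 1 to 20
        {0x2460, 0x2469},   // circled number, 1 to 10
        {0x3220, 0x3229},   // parenthesized ideograph, 1 to 10
        {0x2160, 0x216B},   // Roman numeral, 1 to 12
    };

    // Returns the number a bullet character stands for, or 0 if it is not a bullet.
    public static int bulletValue(char c) {
        for (int[] run : RUNS)
            if (c >= run[0] && c <= run[1]) return c - run[0] + 1;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(bulletValue('\u2460'));   // circled one
        System.out.println(bulletValue('\u216B'));   // Roman numeral twelve
    }
}
```

Keeping the ranges in a table makes it easy to add another run without touching the lookup logic.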
        
      • alphaNumericAUC

        public static char alphaNumericAUC​(char c)
        Alpha-Numeric character code from upper UniCode / UTF-8

        These characters exist in UTF-8, but they ARE NOT the usual ASCII characters for the letters 'A' ... 'Z' or the numbers '0' ... '9'. They are, however, sometimes found in documents on Chinese News Websites, etc.

        Copied from:
        http://www.khngai.com/chinese/charmap/tblgb.php?page=0

        • 0 1 2 3 4 5 6 7 8 9 a b c d e f
        • ！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／
        • ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
        • ＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ
        • Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
        • ｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ
        • ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣
        Parameters:
        c - any character as input
        Returns:
        the "lower-level-ASCII" version of that character, or ASCII-0 if the input is not one of the alternate letters or digits above.
        Code:
        Exact Method Body:
         // ASCII 'A' is 65
         if ((c > 0xFF20) && (c < 0xFF3B))	return (char) (65 + (c - 0xFF21));
        
         // ASCII 'a' is 97
         if ((c > 0xFF40) && (c < 0xFF5B))	return (char) (97 + (c - 0xFF41));
        
         // ASCII '0' is 48; the alternate digits end at 0xFF19 (0xFF1A is the alternate colon)
         if ((c >= 0xFF10) && (c <= 0xFF19))	return (char) (48 + (c - 0xFF10));
        
         return 0;
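
The three branches above share one constant: the full-width forms U+FF01 through U+FF5E mirror ASCII 0x21 through 0x7E at a fixed offset of 0xFEE0. The sketch below applies that offset directly. It is an illustration with hypothetical names, and it differs from alphaNumericAUC in two ways: it also converts full-width punctuation, and it returns the input unchanged instead of ASCII-0 when no conversion applies.

```java
public class FullWidthSketch {
    // Converts one full-width character to ASCII; other characters pass through.
    public static char toHalfWidth(char c) {
        if (c >= 0xFF01 && c <= 0xFF5E) return (char) (c - 0xFEE0);
        return c;
    }

    // Applies the single-character conversion to every char of a String.
    public static String toHalfWidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) sb.append(toHalfWidth(s.charAt(i)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC123"
        System.out.println(toHalfWidth("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"));
    }
}
```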
        
      • punctuationAUC

        public static char punctuationAUC​(char c)
        This method, punctuationAUC(char), converts punctuation characters that are common on many Mandarin Chinese websites into a lower-level, more typical ASCII equivalent. This can be very useful when trying to make sense of brackets, parentheses, quotes, commas and other punctuation marks, and quickly convert them into a simple version of the character.

        If the input character has an "Alternate Version" in the lower-level-ASCII range, that lower level ASCII character is returned. If this isn't AUC, ASCII-0 is returned.

        For Instance:

        Input                   Output
        〖 〗 【 】              [ ] [ ]
        。 ○ ● ．               . (ASCII-period)
        ¨ 〃 “ ” ″ ＂           " (ASCII-double-quote)
        , (ASCII-comma)         ASCII-0
        + (ASCII-plus)          ASCII-0
        Parameters:
        c - any character as input
        Returns:
        the "lower-level-ASCII" version of that character

        NOTE: ASCII-0 is returned if this is not a valid "AUC" UTF-8 / UniCode code!
        Code:
        Exact Method Body:
         // Copied from: 
         // *** http://www.khngai.com/chinese/charmap/tblgb.php?page=0
         //
         // 0 2 3 4 5 6 7 8 9 a b c d e f
         // N N 、 。 · ˉ ˇ ¨ 〃 々 — ~ ‖ … ‘ ’ 
         // “ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】
         // ± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠
         // ⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵ 
         // ∴ ♂ ♀ ° ′ ″ ℃ $ ¤ ¢ £ ‰ § № ☆ ★
         // ○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓 
         //
         // 0 1 2 3 4 5 6 7 8 9 a b c d e f
         // ! " # ¥ % & ' ( ) * + , - . /
         // 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
         // @ A B C D E F G H I J K L M N O
         // P Q R S T U V W X Y Z [ \ ] ^ _
         // ` a b c d e f g h i j k l m n o
         // p q r s t u v w x y z { | }  ̄	 
        
         switch (c)
         {
             // 、 ,
             case 0x3001:               // 、
             case 0xFF0C: return ',';   // ,
        
             // 。 ○ ● .
             case 0x3002:               // 。
             case 0x25CB:               // ○
             case 0x25CF:               // ●
             case 0xFF0E: return '.';   // .
        
             // ‘ ’ ′ ' `
             case 0x2018:               // ‘
             case 0x2019:               // ’
             case 0x2032:               // ′
             case 0xFF07:               // '
             case 0xFF40: return '\'';  // `
        
             // ¨ 〃 “ ” ″ "
             case 0x00A8:               // ¨
             case 0x3003:               // 〃
             case 0x201C:               // “
             case 0x201D:               // ”
             case 0x2033:               // ″
             case 0xFF02: return '\"';  // "
        
             // 〔 (
             case 0x3014:               // 〔
             case 0xFF08: return '(';   // (
        
             // 〕 )
             case 0x3015:               // 〕
             case 0xFF09: return ')';   // )
        
             // 〈 <
             case 0x3008:               // 〈
             case 0xFF1C: return '<';   // <
        
             // 〉 >
             case 0x3009:               // 〉
             case 0xFF1E: return '>';   // >
        
             // 「 『 〖 【 [
             case 0x300C:               // 「
             case 0x300E:               // 『
             case 0x3016:               // 〖
             case 0x3010:               // 【
             case 0xFF3B: return '[';   // [
        
             // 」 』 〗】 ]
             case 0x300D:               // 」
             case 0x300F:               // 』
             case 0x3017:               // 〗
             case 0x3011:               // 】
             case 0xFF3D: return ']';   // ]
        
             // ∶ :
             case 0x2236:               // ∶
             case 0xFF1A: return ':';   // :
        
             case 0xFF01: return '!';   // !
             case 0xFF03: return '#';   // #
             case 0xFF05: return '%';   // %
             case 0xFF06: return '&';   // &
             case 0xFF1F: return '?';   // ?
             case 0xFF0F: return '/';   // /
             case 0xFF3E: return '^';   // ^
             case 0xFF5B: return '{';   // {
             case 0xFF5D: return '}';   // }
             case 0xFF5C: return '|';   // |
             case 0xFF0B: return '+';   // +
             case 0xFF3C: return '\\';  // \
             case 0xFF3F: return '_';   // _
        
             // — -
             case 0x2014:               // —
             case 0xFF0D: return '-';   // -
        
             // 〓 =
             case 0x3013:               // 〓
             case 0xFF1D: return '=';   // =
         }
         return 0;
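
A per-character converter like this is usually applied across a whole String. The sketch below shows one way to do that with a lookup map. It is illustrative only: the map holds a handful of entries taken from the switch above, and the class and method names are hypothetical. Characters without an entry are copied through unchanged rather than replaced by ASCII-0.

```java
import java.util.HashMap;
import java.util.Map;

public class PunctuationMapSketch {
    // A small subset of the mappings in the switch above.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u3001', ',');   // ideographic comma
        TABLE.put('\uFF0C', ',');   // full-width comma
        TABLE.put('\u3002', '.');   // ideographic full stop
        TABLE.put('\u3010', '[');   // left black lenticular bracket
        TABLE.put('\u3011', ']');   // right black lenticular bracket
        TABLE.put('\u201C', '"');   // left double quotation mark
        TABLE.put('\u201D', '"');   // right double quotation mark
    }

    // Replaces every mapped character; all other characters are kept as they are.
    public static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character ascii = TABLE.get(c);
            sb.append(ascii != null ? ascii.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bracketed two-character word, then a two-character greeting and a full stop.
        System.out.println(normalize("\u3010\u65B0\u95FB\u3011\u4F60\u597D\u3002"));
    }
}
```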
        
      • isBPMFAUC

        public static boolean isBPMFAUC​(char c)
        Bo Po Mo Fo (注音符號).

        This is a popular pronunciation system for Mandarin characters, used primarily in Taiwan.

        • N N N N N ㄅ ㄆ ㄇ ㄈ ㄉ ㄊ ㄋ ㄌ ㄍ ㄎ ㄏ
        • ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ
        • ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ N N N N N N
        Parameters:
        c - any UTF-8, ASCII or UniCode character available from Plane 0, the Basic Multi-Lingual Plane
        Returns:
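        TRUE if the input character 'c' is in this UTF-8/UniCode range. The HEXADECIMAL / UTF-8 representation of the 'Bo Po Mo Fo' range tested is: 0x3110 ... 0x3129. The first row of the table above (ㄅ ... ㄏ, 0x3105 ... 0x310F) lies outside this range, so FALSE is returned for those symbols.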
        Code:
        Exact Method Body:
         // 0 1 2 3 4 5 6 7 8 9 a b c d e f
         // N N N N N ㄅ ㄆ ㄇ ㄈ ㄉ ㄊ ㄋ ㄌ ㄍ ㄎ ㄏ
         // ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ
         // ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ N N N N N N
        
         return (c >= 0x3110) && (c <= 0x3129);
        
      • endOfSentenceAUC

        public static char endOfSentenceAUC​(char c)
        Checks for end-of-sentence punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese end-of-sentence punctuation mark - then ASCII-zero is returned.

        NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.

        SPECIFICALLY: with '.' '?' and '!' as input to this function, ASCII-0 will be returned.

        USE: endOfSentence(c) to have those punctuation marks included in non-zero results.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        if the input character 'c' is an "alternate UTF-8" version of the punctuation marks:

        • a period ('.')
        • an exclamation-point ('!')
        • a question-mark ('?')


        Then the output to this method shall be determined by the table below:

        Input Character    Output Character
        。 ○ ● ．          '.' (normal period)
        ！                 '!' (regular exclamation point)
        ？                 '?' (usual question mark)


        NOTE: If the normal period, question, or exclamation are passed as input to this function, this function will return ASCII-0
        See Also:
        endOfSentence(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
         // read on this page.  Only the three characters with ASCII 46, 33, and 63 should
         // return non-zero values.
         switch ((int) auc)
         {
             // These characters identify an "End of Sentence" marker.
             case 0x2E: return '.';	// DEC: 46
             case 0x21: return '!';	// DEC: 33
             case 0x3F: return '?';	// DEC: 63
        
             // All other characters should result in a '0'
             default:   return (char) 0;
         }
        
      • endOfSentence

        public static char endOfSentence​(char c)
        Checks for end-of-sentence punctuation marks. This helper function is *almost* identical to the endOfSentenceAUC(c) method.

        endOfSentenceAUC(c) returns ASCII-0 for the usual-punctuation marks - '.', '!' and '?'.

        endOfSentence(c) does not 'leave-out' or 'deny' these lower-level-ASCII punctuation symbols.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        If the input character 'c' is a period ('.'), an exclamation-point ('!'), or a question-mark ('?') - or an AUC version of that punctuation, then that punctuation is returned. Otherwise ASCII-0 is returned.
        See Also:
        endOfSentenceAUC(char)
        Code:
        Exact Method Body:
         char auc = endOfSentenceAUC(c);
        
         if (auc != 0) c = auc;
        
         // These three characters identify an "End of Sentence" Marker
         if ((c == '.') || (c == '!') || (c == '?')) return c;
        
         return (char) 0;
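
A typical use of an end-of-sentence check is splitting mixed Chinese and English text into sentences. The sketch below is self-contained, so it carries its own simplified check with hypothetical names. It recognizes only the ASCII, full-width, and ideographic forms of the three marks, and it leaves out the circle characters that punctuationAUC also treats as periods.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitSketch {
    // Returns '.', '!' or '?' for a sentence-ending mark, or (char) 0 otherwise.
    public static char sentenceEnd(char c) {
        switch (c) {
            case '.': case '\u3002': case '\uFF0E': return '.';
            case '!': case '\uFF01':                return '!';
            case '?': case '\uFF1F':                return '?';
            default:                                return (char) 0;
        }
    }

    // Cuts the text after every sentence-ending mark and trims each piece.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;

        for (int i = 0; i < text.length(); i++) {
            if (sentenceEnd(text.charAt(i)) != 0) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }

        // Keep any trailing text that has no closing mark.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) sentences.add(rest);

        return sentences;
    }

    public static void main(String[] args) {
        // Two Chinese sentences followed by one English sentence.
        for (String s : split("\u4F60\u597D\u3002\u4F60\u597D\u5417\uFF1F Fine."))
            System.out.println(s);
    }
}
```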
        
      • endOfPhraseAUC

        public static char endOfPhraseAUC​(char c)
        Checks for end-of-phrase punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese phrase-delimiting punctuation mark - then ASCII-zero is returned.

        NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.

        SPECIFICALLY: with ',' ':' ';' and other common phrase-ending marks in Mandarin as input to this function, ASCII-0 will be returned.

        USE: endOfPhrase(c) to have those punctuation marks included in non-zero results.
        Parameters:
        c - any UTF-8, ASCII or UniCode character available.
        Returns:
        If the input character 'c' is an "alternate UTF-8" (AUC) version of one of the punctuation marks below, the corresponding ASCII symbol is returned:

        Punctuation      Symbol and ASCII-Code
        semi-colon       ';'   HEX: 0x3B, DEC: 59
        comma            ','   HEX: 0x2C, DEC: 44
        colon            ':'   HEX: 0x3A, DEC: 58
        double-quote     '\"'  HEX: 0x22, DEC: 34
        single-quote     '\''  HEX: 0x27, DEC: 39
        left-bracket     '['   HEX: 0x5B, DEC: 91
        right-bracket    ']'   HEX: 0x5D, DEC: 93
        less-than        '<'   HEX: 0x3C, DEC: 60
        greater-than     '>'   HEX: 0x3E, DEC: 62
        left-paren       '('   HEX: 0x28, DEC: 40
        right-paren      ')'   HEX: 0x29, DEC: 41


        IMPORTANT NOTE: *only* the upper-level-UTF-8/UniCode versions of these punctuation marks will produce a non-zero result. An actual ASCII comma, semi-colon, quote, bracket, or parenthesis (etc...) will cause this method to return ASCII-0. Please use endOfPhrase(char) to include the lower-level (Already down-converted ASCII) with non-zero results.
        See Also:
        endOfPhrase(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
         // read on this page.  Only the characters having ASCII 59, 44, 58, 34, etc... should
         // return non-zero values.
         switch ((int) auc)
         {
             // These characters constitute an "End of Phrase" marker
             case 0x3B: return ';';	// DEC: 59
             case 0x2C: return ',';	// DEC: 44
             case 0x3A: return ':';	// DEC: 58
             case 0x22: return '\"';	// DEC: 34
             case 0x27: return '\'';	// DEC: 39
             case 0x5B: return '[';	// DEC: 91
             case 0x5D: return ']';	// DEC: 93
             case 0x3C: return '<';	// DEC: 60
             case 0x3E: return '>';	// DEC: 62
             case 0x28: return '(';	// DEC: 40
             case 0x29: return ')';	// DEC: 41
        
             // All other results should return '0'
             default: return 0;
         }
        
      • endOfPhrase

        public static char endOfPhrase​(char c)
        Checks for any version of the end-of-phrase markers usually used in Mandarin Chinese text. This method returns the same results as the endOfPhraseAUC(char) method, with one exception.

        EXCEPT: The regular/normal version of a punctuation mark (ASCII semi-colon, comma, quote, etc...) is returned as that exact same semi-colon, comma or quote, instead of ASCII-0.

        Input & Method Called     Result
        endOfPhrase(';')          ';'  // Normal ASCII semi-colon symbol
        endOfPhraseAUC(';')       0    // ASCII-0 returned
        endOfPhrase('】')         ']'  // right-bracket returned
        endOfPhraseAUC('】')      ']'  // right-bracket returned
        endOfPhrase(']')          ']'  // right-bracket returned
        endOfPhraseAUC(']')       0    // ASCII-0 returned


        The list of end-of-phrase characters includes the following:
        ';' ',' ':' '\"' '\'' '[' ']' '<' '>' '(' ')'
        Parameters:
        c - Any character in the entire UniCode range. 0x0000 to 0xFFFF
        Returns:
        If 'c' is an "AUC" version of an end-of-phrase marker - or a regular lower-level ASCII version - then that punctuation mark is returned. Otherwise 0 is returned.
        See Also:
        punctuationAUC(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         if ((c == ';')  ||  (c == ',')  || (c == ':') ||
             (c == '\"') ||  (c == '\'') ||
             (c == '[')  ||  (c == ']')  || 
             (c == '<')  ||  (c == '>')  ||
             (c == '(')  ||  (c == ')'))
             return c;
        
         return (char) 0;
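
        The contrast between endOfPhrase and endOfPhraseAUC can be reproduced with a small self-contained sketch. The class name EndOfPhraseSketch, the PHRASE_MARKS constant, and the five-entry punctuationAUC stand-in are assumptions made for this example only; the library's real punctuationAUC(char) covers many more characters.

```java
class EndOfPhraseSketch
{
    // The eleven end-of-phrase marks listed in the documentation above
    static final String PHRASE_MARKS = ";,:\"'[]<>()";

    // Hypothetical stand-in for the library's punctuationAUC(char)
    static char punctuationAUC(char c)
    {
        switch (c)
        {
            case '\uFF1B': return ';';   // fullwidth semicolon
            case '\uFF0C': return ',';   // fullwidth comma
            case '\u3001': return ',';   // ideographic comma
            case '\u3010': return '[';   // left black lenticular bracket
            case '\u3011': return ']';   // right black lenticular bracket
            default:       return (char) 0;
        }
    }

    // Non-zero only when the INPUT was an alternate (AUC) mark
    static char endOfPhraseAUC(char c)
    {
        char auc = punctuationAUC(c);
        return (auc != 0 && PHRASE_MARKS.indexOf(auc) >= 0) ? auc : (char) 0;
    }

    // Non-zero for alternate marks AND for plain ASCII marks
    static char endOfPhrase(char c)
    {
        char auc = punctuationAUC(c);
        if (auc != 0) c = auc;
        return (PHRASE_MARKS.indexOf(c) >= 0) ? c : (char) 0;
    }

    public static void main(String[] args)
    {
        System.out.println(endOfPhrase(';'));              // ASCII semicolon accepted
        System.out.println((int) endOfPhraseAUC(';'));     // ASCII semicolon rejected
        System.out.println(endOfPhrase('\u3011'));         // alternate bracket down-converted
        System.out.println(endOfPhraseAUC('\u3011'));      // alternate bracket down-converted
    }
}
```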
        
      • quoteAUC

        public static char quoteAUC​(char c)
        Quotes - any version.   AUC or normal-ASCII, (BOTH) single or double quote.
        Parameters:
        c - Any character in the entire UniCode range. 0x0000 to 0xFFFF which is the Basic Multi Lingual Plane.
        Returns:
        If the input character 'c' is an "AUC" version of the single (or double) quote, or the regular-ASCII single/double quote, then the appropriate single or double-quote is returned. Otherwise 0 is returned.
        See Also:
        punctuationAUC(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         switch ((int) c)
         {
             case 0x22:  return '\"';	// DEC: 34
             case 0x27:  return '\'';	// DEC: 39
             default:    return (char) 0;
         }
        
      • commaAUC

        public static char commaAUC​(char c)
        Comma - any version.   AUC or normal-ASCII, (BOTH) comma
        Parameters:
        c - Any character in the entire UniCode range. 0x0000 to 0xFFFF, the Basic Multi-Lingual Plane.
        Returns:
        If the input character 'c' is an "AUC" version of the comma, or the regular-ASCII comma, then the comma is returned. Otherwise 0 is returned.
        See Also:
        punctuationAUC(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         switch ((int) c)
         {
             case 0x2c:  return ',';	// DEC: 44
             default:    return (char) 0;
         }
        
      • bracketAUC

        public static char bracketAUC​(char c)
        Brackets - any version.   AUC or normal-ASCII, (BOTH) brackets
        Parameters:
        c - Any character in the entire UniCode range. 0x0000 to 0xFFFF
        Returns:
        If the input character 'c' is an "AUC" version of a bracket, or a regular-ASCII bracket, then the appropriate bracket is returned. Otherwise 0 is returned.
        See Also:
        punctuationAUC(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         switch ((int) c)
         {
             case 0x5B:  return '[';	// DEC: 91
             case 0x5D:  return ']';	// DEC: 93
             case 0x3C:  return '<';	// DEC: 60
             case 0x3E:  return '>';	// DEC: 62
             default:    return (char) 0;
         }
        
      • parenAUC

        public static char parenAUC​(char c)
        Parentheses - any version.   AUC or normal-ASCII, (BOTH) parentheses
        Parameters:
        c - Any character in the entire UniCode range. 0x0000 to 0xFFFF
        Returns:
        If the input character 'c' is an "AUC" version of a parenthesis, or a regular-ASCII parenthesis, then the appropriate parenthesis is returned. Otherwise 0 is returned.
        See Also:
        punctuationAUC(char)
        Code:
        Exact Method Body:
         char auc = punctuationAUC(c);
        
         if (auc != 0) c = auc;
        
         switch ((int) c)
         {
             case 0x28:  return '(';	// DEC: 40
             case 0x29:  return ')';	// DEC: 41
             default:    return (char) 0;
         }
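
        quoteAUC, commaAUC, bracketAUC and parenAUC all share one shape: down-convert the input with punctuationAUC, then test the result against a short list of ASCII marks. The sketch below folds the four tests into a single hypothetical kindOf method to show that shape. PunctuationKindSketch, normalize and kindOf are names invented for this example, and the punctuationAUC stand-in handles only the fullwidth forms U+FF01 to U+FF5E, which is an assumption and not the library's actual table.

```java
class PunctuationKindSketch
{
    // Hypothetical stand-in for the library's punctuationAUC(char): handles only
    // the fullwidth forms U+FF01..U+FF5E, which sit 0xFEE0 above their ASCII twins.
    static char punctuationAUC(char c)
    {
        if (c < '\uFF01' || c > '\uFF5E') return (char) 0;
        char ascii = (char) (c - 0xFEE0);
        return Character.isLetterOrDigit(ascii) ? (char) 0 : ascii;
    }

    // Shared first step of the four methods: use the ASCII version when one exists
    static char normalize(char c)
    {
        char auc = punctuationAUC(c);
        return (auc != 0) ? auc : c;
    }

    // Labels a character with the category that would report non-zero for it
    static String kindOf(char c)
    {
        switch (normalize(c))
        {
            case '"': case '\'':                      return "Quote";
            case ',':                                 return "Comma";
            case '[': case ']': case '<': case '>':   return "Bracket";
            case '(': case ')':                       return "Paren";
            default:                                  return "";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(kindOf('\uFF08'));   // fullwidth left parenthesis
        System.out.println(kindOf('<'));        // ASCII less-than counts as a bracket
        System.out.println(kindOf('\uFF0C'));   // fullwidth comma
    }
}
```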
        
      • testAUC

        public static java.lang.String testAUC()
        Returns:
        An HTML <TABLE> that contains many tests of the subroutines in this class
        Code:
        Exact Method Body:
         StringBuilder ret = new StringBuilder();
         ret.append( "<TABLE BORDER=\"1\"><TR>"      +
                     "<TD WIDTH=\"30\">&nbsp;</TD>"  +
                     "<TD WIDTH=\"70\">&nbsp;</TD>"  +
                     "<TD WIDTH=\"70\">&nbsp;</TD>"  +
                     "<TD WIDTH=\"30\">&nbsp;</TD>"  );
        
         for (int i=4; i < 12; i++)
             ret.append("<TD WIDTH=\"70\">&nbsp;</TD>");
         ret.append("</TR>");
        
         for (int i=0; i < AUC.length(); i++)
         {
             char c = AUC.charAt(i);
        
             if (c == ' ') continue;
        
             // Check original character (not punctuation-converted cc)
             char    bl          = Integer.toString(bulletListAUC(c)).charAt(0);
             boolean bpmf        = isBPMFAUC(c);
        
             // first, convert the punctuation to normal-ASCII punctuation
             // These are the "translated" characters
             // The "translated character" is where, for example '〗' ==> ']'
             char	newC       = punctuationAUC(c);
        
             // These are used for building <TABLE> & <TD> entry strings
             char    q           = quoteAUC(newC);
             char    es          = endOfSentenceAUC(newC);
             char    ep          = endOfPhraseAUC(newC);
             char    com         = commaAUC(newC);
             char    br          = bracketAUC(newC);
             char    p           = parenAUC(newC);
        
             char    ascii       = punctuationAUC(c);
             if (ascii   == 0)   ascii = alphaNumericAUC(c);
             if (bl      != 0)   ascii = bl;
             if (bpmf)           ascii = c;
             if (ascii   == 0)   ascii = 'x';
        
             // =================================================
             // This is for debugging this test function
             String	tmp =   " newCC = " + newC  + ", q="    + q     +
                             ", es="     + es    + ", ep="   + ep    +
                             ", com="    + com   + ", br="   + br    +
                             ", p="      + p	    + ", bl ="  + bl    +
                             ", bpmf="   + bpmf;
        
             tmp = tmp.replaceAll("<", "&lt;").replaceAll(">", "&gt;");
        
             // Build the HTML Table 
             ret.append("<TR>");
        
             ret.append("<TD>" + c + "</TD>");
             ret.append("<TD>" + ((int) c) + "</TD>");
             ret.append("<TD>" + "0x" + String.format("%x",(int) c).toUpperCase() + "</TD>");
             ret.append("<TD>" + ascii + "</TD>");
        
             ret.append("<TD>" + ((q		== 0)	? "" : "Quote")		+ "</TD>");	
             ret.append("<TD>" + ((es	== 0)	? "" : "Sentence")	+ "</TD>");
             ret.append("<TD>" + ((ep	== 0)	? "" : "Phrase")	+ "</TD>");
             ret.append("<TD>" + ((com	== 0)	? "" : "Comma")		+ "</TD>");
             ret.append("<TD>" + ((br	== 0)	? "" : "Bracket")	+ "</TD>");
             ret.append("<TD>" + ((p		== 0)	? "" : "Paren")		+ "</TD>");
             ret.append("<TD>" + ((bl	== 0)	? "" : "Bullet")	+ "</TD>"); 
             ret.append("<TD>" + (bpmf ? "BPMF" : "") + "</TD>");
        
             // Close the table row that was opened above
             ret.append("</TR>");
        
             // ==========================================================
             // Un-Comment this if you want to debug this print function
             // ret.append("<TR><TD COLSPAN=\"12\">" + tmp + "</TD></TR>");
        
         }
         ret.append("</TABLE>");
         return ret.toString();
        
      • countLeadingLettersAndNumbers

        public static int countLeadingLettersAndNumbers​
                    (java.lang.String chineseSentence)
        
        Checks for any leading alphabetic ('a' ... 'z') and numeric ('0' ... '9') characters in a Chinese String.

        CHANGED: 2018.09.24 - Commas and periods are now left in the String (when situated between digits). These are considered to be part of the "Leading Letters and Numbers".
        Parameters:
        chineseSentence - A sentence that may or may not have leading letters & numbers.
        Returns:
        the String-index of the first non-alphabetic, non-numeric character in the String.

        NOTE: white-space does not count, and the position of the first white-space character will be returned, if white-space is contained in this String.
        See Also:
        isAlphaNumeric(char)
        Code:
        Exact Method Body:
         for (int i = 0; i < chineseSentence.length(); i++)
         {
             char c = chineseSentence.charAt(i);
             if ((! isAlphaNumeric(c)) && (c != '.') && (c != ',')) return i;
         }
        
         return chineseSentence.length(); // This really ought not to happen, but just in case....
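
        A short usage sketch may help show what the returned index means. The class LeadingCountSketch is hypothetical, and its isAlphaNumeric is an assumed stand-in (ASCII letters and digits only) for the library method of the same name, whose exact definition is not shown in this section.

```java
class LeadingCountSketch
{
    // Hypothetical stand-in for the library's isAlphaNumeric(char):
    // ASCII letters and digits only.
    static boolean isAlphaNumeric(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Same loop as the method body shown above
    static int countLeadingLettersAndNumbers(String chineseSentence)
    {
        for (int i = 0; i < chineseSentence.length(); i++)
        {
            char c = chineseSentence.charAt(i);
            if ((! isAlphaNumeric(c)) && (c != '.') && (c != ',')) return i;
        }
        return chineseSentence.length();
    }

    public static void main(String[] args)
    {
        // The two escaped characters are U+4F60 U+597D, a common Chinese greeting
        System.out.println(countLeadingLettersAndNumbers("ABC123\u4F60\u597D"));
        System.out.println(countLeadingLettersAndNumbers("3.5\u4F60\u597D"));
        System.out.println(countLeadingLettersAndNumbers("\u4F60\u597D"));
    }
}
```

        The returned value can be passed directly to String.substring to split the leading ASCII run from the Chinese remainder.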
        
      • convertAnyAUC

        public static java.lang.String convertAnyAUC​(java.lang.String s)
        Checks for higher-Unicode letters and numbers, and converts them into lower-level versions of the appropriate letter or number.

        SPECIFICALLY: This method is just a "for-loop" which makes a call to alphaNumericAUC() and if zero is not returned from that method-call, then the input String is modified at the index which contained such a higher UniCode letter or number.
        Parameters:
        s - This may or may not have "Alternate UniCode" Characters for letters and numbers.
        Returns:
        A copy of the input String in which every "alternate" version of 'A' ... 'Z' or '0' ... '9' has been replaced by its lower-level equivalent. All other characters are left unchanged.
        See Also:
        alphaNumericAUC(char)
        Code:
        Exact Method Body:
         char[] cArr = s.toCharArray();
        
         for (int i = 0; i < cArr.length; i++)
         {
             char auc = alphaNumericAUC(cArr[i]);
             if (auc != 0) cArr[i] = auc;
         }
        
         return new String(cArr);
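
        The conversion loop can be tried out with a stand-in for alphaNumericAUC. The version below is an assumption made for this example: it recognizes only the Unicode fullwidth digits and Latin letters, which are located exactly 0xFEE0 above their ASCII counterparts. The library's own alphaNumericAUC(char) may recognize additional characters.

```java
class ConvertAucSketch
{
    // Hypothetical stand-in for the library's alphaNumericAUC(char).
    // Fullwidth digits and letters are offset from ASCII by 0xFEE0.
    static char alphaNumericAUC(char c)
    {
        boolean digit = (c >= '\uFF10' && c <= '\uFF19');
        boolean upper = (c >= '\uFF21' && c <= '\uFF3A');
        boolean lower = (c >= '\uFF41' && c <= '\uFF5A');
        return (digit || upper || lower) ? (char) (c - 0xFEE0) : (char) 0;
    }

    // Same loop as the method body shown above
    static String convertAnyAUC(String s)
    {
        char[] cArr = s.toCharArray();

        for (int i = 0; i < cArr.length; i++)
        {
            char auc = alphaNumericAUC(cArr[i]);
            if (auc != 0) cArr[i] = auc;
        }

        return new String(cArr);
    }

    public static void main(String[] args)
    {
        // Fullwidth "ABC123" written with escapes so the file stays ASCII-only
        System.out.println(convertAnyAUC("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"));
    }
}
```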
        
      • countSyllablesAndNonChinese

        public static int countSyllablesAndNonChinese​(java.lang.String word,
                                                      java.lang.Appendable DOUT)
                                               throws java.io.IOException
        Counts syllables in a "word" of PinYin. The input String is expected to not have any spaces!

        NOTE: The number of syllables in a Chinese PinYin "word" identifies the number of Chinese Characters that were used to generate the input PinYin String.

        CHANGED: 2018.09.24 - Added a test for periods and commas that are situated directly between two digits. In the String "5.0" the period between 5 and 0 is no longer removed!

        If the String "5.0" were passed as the "word" parameter, the result should be 3!
        Parameters:
        word - A word in the "PinYin" format. (罗马拼音)
        DOUT - This must implement java.lang.Appendable
        Returns:
        the number of syllables (specifically: Chinese Characters) in the input word.
        Throws:
        java.io.IOException - The interface java.lang.Appendable mandates that the IOException must be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.
        Code:
        Exact Method Body:
         int numChinese	= 0;
        
         // Tone-Vowels & Numbers always correspond to a character
         for (int letter = 0; letter < word.length(); letter++)
         {
             char c = word.charAt(letter);
             if (    ZH.isToneVowel(c)   ||
                     ZH.isNumber(c)      ||
                     (c == '.')          ||
                     (c == ',')
                 )
                 numChinese++;
         }
        
         // Checks for vowel-strings that don't contain a tone
         // ==> Checks for "clear tone"
         String copyW = "" + word;
        
         DOUT.append("[" + copyW + "] - ");
        
         for (int letterIndex = 0; letterIndex < copyW.length(); letterIndex++)
             if (    ! ZH.isRegVowel(copyW.charAt(letterIndex))      &&
                     ! ZH.isToneVowel(copyW.charAt(letterIndex))	)
                 copyW =	StringParse.setChar(copyW, letterIndex, ' ');
                    
         DOUT.append("after erasing non-vowels [" + copyW + "]\n");
                 
         String[] syllables = copyW.trim().split(" ");
        
         DOUT.append("Syllables are:");
         for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++	)
             DOUT.append("[" + syllables[sylIndex] + "]");
         DOUT.append("\n");
        
         TOP:
         for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++)
         {
             String	syllable    = syllables[sylIndex].trim();
             boolean	foundTone   = false;
        
             // The split(' ') function sometimes provides blanks
             if (syllable.length() == 0) continue TOP;
        
             for (int vowelIndex = 0; vowelIndex < syllable.length(); vowelIndex++)
                 if (ZH.isToneVowel(syllable.charAt(vowelIndex)))
                     continue TOP;
        
             numChinese++;
             DOUT.append("NOTE: *** FOUND CLEAR TONE\n");
         }
        
         return numChinese;
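
        The counting rule has two parts: every tone-marked vowel, digit, period and comma counts once, and every run of vowels that contains no tone mark counts once more (a "clear tone" syllable). The sketch below restates that rule without the DOUT logging and without StringParse. SyllableCountSketch, its two vowel tables, and the single-pass run detection are assumptions made for this example and are not the library's implementation.

```java
class SyllableCountSketch
{
    // Assumed vowel tables: a, e, i, o, u and u-with-diaeresis, each with the four tone marks
    static final String TONE_VOWELS =
        "\u0101\u00E1\u01CE\u00E0" + "\u0113\u00E9\u011B\u00E8" + "\u012B\u00ED\u01D0\u00EC" +
        "\u014D\u00F3\u01D2\u00F2" + "\u016B\u00FA\u01D4\u00F9" + "\u01D6\u01D8\u01DA\u01DC";
    static final String REG_VOWELS = "aeiou\u00FC";

    static boolean isToneVowel(char c) { return TONE_VOWELS.indexOf(c) >= 0; }
    static boolean isRegVowel(char c)  { return REG_VOWELS.indexOf(c) >= 0; }

    static int countSyllables(String word)
    {
        int count = 0;

        // Part 1: tone vowels, digits, periods and commas each count once
        for (char c : word.toCharArray())
            if (isToneVowel(c) || (c >= '0' && c <= '9') || c == '.' || c == ',') count++;

        // Part 2: a run of vowels with no tone mark is a "clear tone" syllable
        boolean inRun = false, runHasTone = false;
        for (int i = 0; i <= word.length(); i++)
        {
            // A trailing space acts as a sentinel that closes the last run
            char c = (i < word.length()) ? word.charAt(i) : ' ';

            if (isRegVowel(c) || isToneVowel(c))
            {
                inRun = true;
                if (isToneVowel(c)) runHasTone = true;
            }
            else
            {
                if (inRun && !runHasTone) count++;
                inRun = false;
                runHasTone = false;
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        System.out.println(countSyllables("n\u01D0h\u01CEo"));   // two tone-marked syllables
        System.out.println(countSyllables("m\u0101ma"));         // one tone-marked, one clear tone
        System.out.println(countSyllables("5.0"));               // two digits and one period
    }
}
```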
        
      • delAllPunctuationCHINESE

        public static java.lang.String delAllPunctuationCHINESE​
                    (java.lang.String s)
        
        Deletes all punctuation & non-character symbols. The String that is returned will be shortened by precisely the number of punctuation characters that were contained in that String.

        NOTE: '.' and ',' (periods and commas) between number/digits are not removed!
        Parameters:
        s - An input String (in Mandarin - 普通话)
        Returns:
        a String that is the same as the input String - after skipping characters as follows:

         if (isChinese(c) || isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
         (else) s = StringParse.delChar(s, chr--);
         
        
        Code:
        Exact Method Body:
         char[]  cArr        = s.toCharArray();
         int     sourcePos   = 0;
         int     destPos     = 0;
        
         while (sourcePos < cArr.length)
         {
             char c = cArr[sourcePos];
        
             // Check for things like 5.0 or 1,120,987 - SPECIFICALLY commas and periods situated
             // directly between 2 numbers.
        
             if (    ((c == '.') || (c == ','))
                 &&  (((sourcePos-1) == -1)          || isNumber(cArr[sourcePos-1]))
                 &&  (((sourcePos+1) == s.length())  || isNumber(cArr[sourcePos+1]))
             )
                 { cArr[destPos++] = cArr[sourcePos++]; continue; }
        
             // AUC were converted before calling this function ... (alphaNumericAUC(c) != 0)) 
        
             if (isChinese(c) || isAlphaNumeric(c))
                 { cArr[destPos++] = cArr[sourcePos++]; continue; }
        
             sourcePos++;
         }
        
         // Only the first 'destPos' characters of the array were kept
         return new String(cArr, 0, destPos);
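
        The deletion is a two-index compaction: one index reads every character, the other writes only the characters that are kept, and the result is the written prefix of the array. A minimal sketch of that technique follows. DelPunctuationSketch and its helpers are hypothetical; in particular isChinese is approximated here with Character.UnicodeScript.HAN, and a period or comma is kept only when a digit sits on both sides, which is stricter than the string-boundary handling in the method body above.

```java
class DelPunctuationSketch
{
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Assumed stand-in for the library's isAlphaNumeric(char)
    static boolean isAsciiAlphaNumeric(char c)
    {
        return isDigit(c) || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Assumed stand-in for the library's isChinese(char)
    static boolean isChinese(char c)
    {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    static String delAllPunctuation(String s)
    {
        char[] cArr    = s.toCharArray();
        int    destPos = 0;

        for (int sourcePos = 0; sourcePos < cArr.length; sourcePos++)
        {
            char c = cArr[sourcePos];

            // Keeps the '.' in 5.0 and the ',' in 1,120,987
            boolean betweenDigits = (c == '.' || c == ',')
                && sourcePos > 0               && isDigit(cArr[sourcePos - 1])
                && sourcePos + 1 < cArr.length && isDigit(cArr[sourcePos + 1]);

            // Kept characters are copied down; everything else is skipped
            if (betweenDigits || isChinese(c) || isAsciiAlphaNumeric(c))
                cArr[destPos++] = c;
        }

        return new String(cArr, 0, destPos);
    }

    public static void main(String[] args)
    {
        // Four Han characters separated by a fullwidth comma and exclamation mark, then "5.0"
        String in = "\u4F60\u597D\uFF0C\u4E16\u754C\uFF015.0";
        System.out.println(in.length() - delAllPunctuation(in).length());   // characters removed
    }
}
```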
        
      • delAllPunctuationPINYIN

        public static java.lang.String delAllPunctuationPINYIN​(java.lang.String s)
        Deletes all punctuation & non-character symbols from a String of PinYin. The returned String will have the same length as it originally did, but the locations where punctuation existed will have been replaced with a space character.

        NOTE: '.' and ',' (periods and commas) between number/digits are not removed!
        Parameters:
        s - An input String in 罗马拼音
        Returns:
        A String that is the same as the input String - after skipping characters as follows:

         if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
         (else) s = StringParse.setChar(s, chr, ' ');
         
        
        Code:
        Exact Method Body:
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
         char[] cArr = s.toCharArray();
        
         // This loop converts all non-AlphaNumeric unicode to a space
         for (int i = 0; i < cArr.length; i++)
         {
             char c = cArr[i];
        
             if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
        
             // Check for things like 5.0 or 1,120,987 - SPECIFICALLY commas and periods
             // situated directly between 2 numbers.
        
             if (    ((c == '.') || (c == ','))
                 &&  (((i-1) == -1)          || isNumber(cArr[i-1]))
                 &&  (((i+1) == s.length())  || isNumber(cArr[i+1]))
             )
                 continue;
        
             cArr[i] = ' ';
         }
        
         return new String(cArr);
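
        Because punctuation is overwritten with a space instead of being removed, every character index of the PinYin String stays valid after the call. The sketch below illustrates that property. PinYinBlankSketch is a hypothetical class, and Character.isLetterOrDigit is used as an assumed stand-in for the library's isAlphaNumeric / alphaNumericAUC pair.

```java
class PinYinBlankSketch
{
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    static String blankPunctuation(String s)
    {
        char[] cArr = s.toCharArray();

        for (int i = 0; i < cArr.length; i++)
        {
            char c = cArr[i];

            // Assumption: letters (including tone-marked vowels) and digits are kept
            if (Character.isLetterOrDigit(c)) continue;

            // Keeps the '.' in 5.0 and the ',' in 1,120,987
            boolean betweenDigits = (c == '.' || c == ',')
                && i > 0               && isDigit(cArr[i - 1])
                && i + 1 < cArr.length && isDigit(cArr[i + 1]);

            if (!betweenDigits) cArr[i] = ' ';
        }

        return new String(cArr);
    }

    public static void main(String[] args)
    {
        String in  = "ni,hao!";
        String out = blankPunctuation(in);
        System.out.println("[" + out + "]");
        System.out.println(in.length() == out.length());   // length is preserved
    }
}
```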
        
      • GTPPEIndexOf

        public static int GTPPEIndexOf​(java.lang.String s,
                                       char c)
        GTPPE: Google Translate Punctuation Pronunciation Equivalent. This searches through a String to find the location of the "equivalent punctuation mark".
        Parameters:
        s - The input String, expected to be the result of a GCS TS query. This function is totally useless for any Pronunciation String that hasn't been obtained from GCS TS.

        NOTE: The input String is intended to be in "PinYin" (罗马拼音)
        c - The original punctuation character to look for... Generally, this is used to search for higher-level UTF-8 chars that have been "down-converted" by GCS TS
        Returns:
        the indexOf() of the equivalent character in the input String. The actual character is not looked for, BUT RATHER, the Google Cloud Server Translation Services (GCS TS) equivalent character. Specifically, GCS TS has a "substitute punctuation" for many higher-level UTF-8 and UniCode chars. There are 5 different versions of a quote...
        Code:
        Exact Method Body:
         int cc = (int) c;
        
         // if (c == '∶')	return s.indexOf(c);
         if (cc == 0x2236)	return s.indexOf(c);
         // if (c == ':')	return s.indexOf(':');
         if (cc == 0xFF1A)	return s.indexOf(':');	// (0x003A);
         // if (c == ':')	return s.indexOf(c);	// Natural colon
         if (cc == 0x003A)	return s.indexOf(c);
        
         // commas
         // if (c == '、')	return s.indexOf(',');
         if (cc == 0x3001)	return s.indexOf(',');	// (0x002C);
         // if (c == ',')	return s.indexOf(',');
         if (cc == 0xFF0C)	return s.indexOf(',');	// (0x002C);
         // if (c == ',')	return s.indexOf(c);	// natural comma
         if (cc == 0x002C)	return s.indexOf(c);
        
         // periods
         // if (c == '。')	return s.indexOf('.');
         if (cc == 0x3002)	return s.indexOf('.');	// (0x002E);
         // if (c == '○')	return s.indexOf(c);
         if (cc == 0x25CB)	return s.indexOf(c);
         // if (c == '●')	return s.indexOf(c);
         if (cc == 0x25CF)	return s.indexOf(c);
         // if (c == '.')	return s.indexOf('.');
         if (cc == 0xFF0E)	return s.indexOf('.');	// (0x002E);
         // if (c == '.')	return s.indexOf(c);	// natural period
         if (cc == 0x002E)	return s.indexOf(c);
        
        
         // Exclamation & Question
         // if (c == '?')	return s.indexOf(c);	// natural question-mark
         if (cc == 0x003F)	return s.indexOf(c);
         // if (c == '?')	return s.indexOf('?');
         if (cc == 0xFF1F)	return s.indexOf('?');	// (0x003F);
         // if (c == '!')	return s.indexOf('!');
         if (cc == 0xFF01)	return s.indexOf('!');	// (0x0021);
         // if (c == '!')	return s.indexOf(c);	// natural exclamation
         if (cc == 0x0021)	return s.indexOf(c);
        
         // single-quotes
         // if (c == '‘')	return s.indexOf(c);
         if (cc == 0x2018)	return s.indexOf(c);
         // if (c == '’')	return s.indexOf(c);
         if (cc == 0x2019)	return s.indexOf(c);
         // if (c == '′')	return s.indexOf(c);
         if (cc == 0x2032)	return s.indexOf(c);
         // if (c == ''')	return s.indexOf('\'');
         if (cc == 0xFF07)	return s.indexOf('\'');	// (0x0027);
         // if (c == '`')	return s.indexOf('`');
         if (cc == 0xFF40)	return s.indexOf('`');	// (0x0060);
         // if (c == '\'')	return s.indexOf(c);	// natural single-quotes
         if (cc == 0x0027)	return s.indexOf(c);
         
        
         // NOT DETECTED RIGHT NOW.. 
         // if (c == '《')	return s.indexOf('“');
         if (cc == 0x300A)	return s.indexOf(CONSTSpecialQuoteLeft);
         // if (c == '》')	return s.indexOf('”');
         if (cc == 0x300B)	return s.indexOf(CONSTSpecialQuoteRight);
        
         // double-quotes
         // if (c == '¨')	return s.indexOf(c);
         if (cc == 0x00A8)	return s.indexOf(c);
         // if (c == '〃')	return s.indexOf(c);
         if (cc == 0x3003)	return s.indexOf(c);
         // if (c == '“')	return s.indexOf(c);
         if (cc == 0x201C)	return s.indexOf(c);
         // if (c == '”')	return s.indexOf(c);
         if (cc == 0x201D)	return s.indexOf(c);
         // if (c == '″')	return s.indexOf(c);
         if (cc == 0x2033)	return s.indexOf(c);
         // if (c == '"')	return s.indexOf('\"');
         if (cc == 0xFF02)	return s.indexOf('\"');	// (0x0022);
         // if (c == '\"')	return s.indexOf(c);	// natural double quotes
         if (cc == 0x0022)	return s.indexOf(c);
        
        
         // Brackets
         // if (c == '[')	return s.indexOf(c);
         if (cc == 0x005B)	return s.indexOf(c);
         // if (c == ']')	return s.indexOf(c);
         if (cc == 0x005D)	return s.indexOf(c);
         // if (c == '[')	return s.indexOf('[');
         if (cc == 0xFF3B)	return s.indexOf('[');	// (0x005B);
         // if (c == ']')	return s.indexOf(']');
         if (cc == 0xFF3D)	return s.indexOf(']');	// (0x005D);
         // if (c == '【')	return s.indexOf('[');
         if (cc == 0x3010)	return s.indexOf('[');	// (0x005B);
         // if (c == '】')	return s.indexOf(']');
         if (cc == 0x3011)	return s.indexOf(']');	// (0x005D);
         // if (c == '〖')	return s.indexOf(c);
         if (cc == 0x3016)	return s.indexOf(c);
         // if (c == '〗')	return s.indexOf(c);
         if (cc == 0x3017)	return s.indexOf(c);
         // if (c == '『')	return s.indexOf('“');
         if (cc == 0x300E)	return s.indexOf(CONSTSpecialQuoteLeft);
         // if (c == '』')	return s.indexOf('”');
         if (cc == 0x300F)	return s.indexOf(CONSTSpecialQuoteRight);
         // if (c == '「')	return s.indexOf('`');
         if (cc == 0x300C)	return s.indexOf('`');	// (0x0060);
         // if (c == '」')	return s.indexOf('\'');
         if (cc == 0x300D)	return s.indexOf('\'');	// (0x0027);
        
        
         // Parenthesis
         // if (c == '(')	return s.indexOf(c);
         if (cc == 0x0028)	return s.indexOf(c);
         // if (c == ')')	return s.indexOf(c);
         if (cc == 0x0029)	return s.indexOf(c);
         // if (c == '(')	return s.indexOf('(');
         if (cc == 0xFF08)	return s.indexOf('(');	// (0x0028);
         // if (c == ')')	return s.indexOf(')');
         if (cc == 0xFF09)	return s.indexOf(')');	// (0x0029);
         // if (c == '〔')	return s.indexOf(c);
         if (cc == 0x3014)	return s.indexOf(c);
         // if (c == '〕')	return s.indexOf(c);
         if (cc == 0x3015)	return s.indexOf(c);
        
         System.out.println("character not found: \'" + c + "\'\nZH.GTPPEIndexOf(String s, char c)");
         System.exit(0);
         return 0;
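
        The chain of 'if' statements above is effectively a lookup table from an original punctuation mark to the mark that appears in the translated text. A table-driven sketch of the same idea is given below. GtppeSketch, SUBSTITUTE and indexOfEquivalent are hypothetical names. The entries are copied from the substitutions visible in the method body, and the Special-Quote constants are left out because their values are not shown here. Unlike the method above, this sketch searches for the character itself when no substitute is known, instead of terminating the program.

```java
import java.util.HashMap;
import java.util.Map;

class GtppeSketch
{
    // Original mark -> mark expected in the translated text
    static final Map<Character, Character> SUBSTITUTE = new HashMap<>();

    static
    {
        SUBSTITUTE.put('\uFF1A', ':');    // fullwidth colon
        SUBSTITUTE.put('\u3001', ',');    // ideographic comma
        SUBSTITUTE.put('\uFF0C', ',');    // fullwidth comma
        SUBSTITUTE.put('\u3002', '.');    // ideographic full stop
        SUBSTITUTE.put('\uFF0E', '.');    // fullwidth full stop
        SUBSTITUTE.put('\uFF1F', '?');    // fullwidth question mark
        SUBSTITUTE.put('\uFF01', '!');    // fullwidth exclamation mark
        SUBSTITUTE.put('\uFF07', '\'');   // fullwidth apostrophe
        SUBSTITUTE.put('\uFF40', '`');    // fullwidth grave accent
        SUBSTITUTE.put('\uFF02', '"');    // fullwidth quotation mark
        SUBSTITUTE.put('\uFF3B', '[');    // fullwidth left square bracket
        SUBSTITUTE.put('\uFF3D', ']');    // fullwidth right square bracket
        SUBSTITUTE.put('\u3010', '[');    // left black lenticular bracket
        SUBSTITUTE.put('\u3011', ']');    // right black lenticular bracket
        SUBSTITUTE.put('\u300C', '`');    // left corner bracket
        SUBSTITUTE.put('\u300D', '\'');   // right corner bracket
        SUBSTITUTE.put('\uFF08', '(');    // fullwidth left parenthesis
        SUBSTITUTE.put('\uFF09', ')');    // fullwidth right parenthesis
    }

    // Returns the index of the substitute for 'c' in 's', or -1 if it is absent
    static int indexOfEquivalent(String s, char c)
    {
        Character sub = SUBSTITUTE.get(c);
        return s.indexOf((sub != null) ? sub : c);
    }

    public static void main(String[] args)
    {
        String pinYin = "ni hao, shijie.";
        System.out.println(indexOfEquivalent(pinYin, '\uFF0C'));   // position of the ASCII comma
        System.out.println(indexOfEquivalent(pinYin, '\u3002'));   // position of the ASCII period
    }
}
```

        Keeping the substitutions in one table makes it possible to add or correct a mapping without touching the search logic.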