Class ES


  • public class ES
    extends java.lang.Object
    ES (Español) - Documentation.

    This class provides some simple helper routines for working with Spanish language special characters. It deals particularly with accented vowels.



    • Field Detail

      • GRAVE

        public static final int GRAVE
        GRAVE and ACUTE share the "first bit" of this mask. If that bit is '1', the accent is GRAVE; if it is '0', the accent is ACUTE.
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final int GRAVE		= 0b0001;
        
      • UPPERCASE

        public static final int UPPERCASE
        UPPER and LOWER CASE share the "second bit" of this mask. If that bit is '1', the output is UPPER-CASE; if it is '0', the output is LOWER-CASE.
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final int UPPERCASE	= 0b0010;
        
    • Method Detail

      • getAccentedVowel

        public static char getAccentedVowel​(char vowel,
                                            int flags)
        Produces an accented vowel 'on request'. The complete list of characters that this method may return is given below.
        Upper, Grave    Upper, Acute    Lower, Grave    Lower, Acute
        À (192)         Á (193)         à (224)         á (225)
        È (200)         É (201)         è (232)         é (233)
        Ì (204)         Í (205)         ì (236)         í (237)
        Ò (210)         Ó (211)         ò (242)         ó (243)
        Ù (217)         Ú (218)         ù (249)         ú (250)
        Parameters:
        vowel - Any vowel: [A, E, I, O, U] or [a, e, i, o, u]
        If 'vowel' is not one of these 10 characters, it is not converted, and this method returns (char) 0.
        flags - The following values can be OR'd together as a bit mask: ES.GRAVE and ES.UPPERCASE
        In total, there are 4 possible combinations: upper-case or lower-case output, each with an acute or a grave accent.
        NOTE:


        • If ES.GRAVE is not set (bit 0 of the mask), then an acute accented vowel is returned (acute is "the default").
        • If ES.UPPERCASE is not set (bit 1 of the mask), then a lower-case vowel is returned (lower-case is "the default").
        Returns:
        With correct input: one of the twenty accented vowels listed above. Otherwise, (char) 0 is returned.
        Code:
        Exact Method Body:
         int i = 0;
        
         if		((vowel == 'a') || (vowel == 'A'))		i = 192;
         else if	((vowel == 'e') || (vowel == 'E'))		i = 200;
         else if ((vowel == 'i') || (vowel == 'I'))		i = 204;
         else if ((vowel == 'o') || (vowel == 'O'))		i = 210;
         else if ((vowel == 'u') || (vowel == 'U'))		i = 217;
         else									return (char) 0;
        
         if		(((flags & UPPERCASE) > 0) &&
                 ((flags & GRAVE) > 0))			return (char) (i + 0);	// À (192)È (200)Ì (204)Ò (210)Ù (217)
         else if	((flags & UPPERCASE) > 0)		return (char) (i + 1);	// Á (193)É (201)Í (205)Ó (211)Ú (218)
         else if ((flags & GRAVE) > 0)			return (char) (i + 32);	// à (224)è (232)ì (236)ò (242)ù (249)
         else									return (char) (i + 33);	// á (225)é (233)í (237)ó (243)ú (250)
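
        The flag arithmetic above can be restated with two offsets rather than four branches. The following is a minimal sketch, not the library's own code: the class name AccentFlagsSketch is hypothetical, and the constants are copied from the field declarations so that the example runs on its own.

```java
final class AccentFlagsSketch {
    // Values copied from the field declarations documented above.
    static final int GRAVE = 0b0001;
    static final int UPPERCASE = 0b0010;

    // Hypothetical re-statement of the documented method using offsets.
    static char getAccentedVowel(char vowel, int flags) {
        int code;
        switch (vowel) {
            case 'a': case 'A': code = 192; break;
            case 'e': case 'E': code = 200; break;
            case 'i': case 'I': code = 204; break;
            case 'o': case 'O': code = 210; break;
            case 'u': case 'U': code = 217; break;
            default: return (char) 0;              // not a vowel
        }
        if ((flags & GRAVE) == 0) code += 1;       // acute sits one code point above grave
        if ((flags & UPPERCASE) == 0) code += 32;  // lower-case sits 32 code points above upper-case
        return (char) code;
    }

    public static void main(String[] args) {
        // Print the numeric code of each of the four variants of 'e'.
        System.out.println((int) getAccentedVowel('e', 0));                 // lower-case, acute
        System.out.println((int) getAccentedVowel('e', GRAVE));             // lower-case, grave
        System.out.println((int) getAccentedVowel('e', UPPERCASE));         // upper-case, acute
        System.out.println((int) getAccentedVowel('e', GRAVE | UPPERCASE)); // upper-case, grave
    }
}
```

        Passing 0 for flags selects both defaults (lower-case, acute), which is the form most Spanish text needs.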
        
      • toNonAccented

        public static char toNonAccented​(char c,
                                         boolean preserveCase)
        Converts every Spanish accented character into its non-accented equivalent. By default the result is also lower-cased, and plain upper-case letters are lower-cased too. If preserveCase is TRUE, upper-case letters stay upper-case.
        A (65) ... Z (90) ⇒ a ... z (left unchanged when preserveCase is TRUE)
        À (192), Á (193), à (224), á (225) ⇒ A or a
        È (200), É (201), è (232), é (233) ⇒ E or e
        Ì (204), Í (205), ì (236), í (237) ⇒ I or i
        Ò (210), Ó (211), ò (242), ó (243) ⇒ O or o
        Ù (217), Ú (218), ù (249), ú (250) ⇒ U or u
        Ñ (209), ñ (241) ⇒ N or n
        Ü (220), ü (252) ⇒ U or u
        Ý (221), ý (253) ⇒ Y or y
        Parameters:
        c - Any ASCII/Unicode character
        preserveCase - If this is TRUE, then capital letters, accented or not, remain capitalized. If this is FALSE, then all letters are converted to lowercase.
        Returns:
        If this character contained an accent, it will be removed. It will also be in lower-case form, unless preserveCase is TRUE.
        Code:
        Exact Method Body:
         if ((c == 224) || (c == 225))	return 'a';
         if ((c == 232) || (c == 233))	return 'e';
         if ((c == 236) || (c == 237))	return 'i';
         if ((c == 242) || (c == 243))	return 'o';
         if ((c == 249) || (c == 250))	return 'u';
         if (c == 241)					return 'n';
         if (c == 252)					return 'u';
         if (c == 253)					return 'y';
        
         if ((c == 192) || (c == 193))	return (preserveCase ? 'A' : 'a');
         if ((c == 200) || (c == 201))	return (preserveCase ? 'E' : 'e');
         if ((c == 204) || (c == 205))	return (preserveCase ? 'I' : 'i');
         if ((c == 210) || (c == 211))	return (preserveCase ? 'O' : 'o');
         if ((c == 217) || (c == 218))	return (preserveCase ? 'U' : 'u');
         if (c == 209)					return (preserveCase ? 'N' : 'n');
         if (c == 220)					return (preserveCase ? 'U' : 'u');
         if (c == 221)					return (preserveCase ? 'Y' : 'y');
        
         if ((c >= 'A') && (c <= 'Z'))	return (char) (preserveCase ? c : (c -'A' + 'a'));
        
         return c;
        
      • toNonAccented

        public static java.lang.String toNonAccented​(java.lang.String s,
                                                     boolean preserveCase)
        Removes the accent marks from every Spanish accented character in a String.
        Returns:
        a new String, one where toNonAccented(s.charAt(i), preserveCase) has been called for each character in the String. This is just a small for-loop over a String.
        See Also:
        toNonAccented(char, boolean)
        Code:
        Exact Method Body:
         int len = s.length();
         StringBuilder sb = new StringBuilder();
         for (int i=0; i < len; i++)
             sb.append(toNonAccented(s.charAt(i), preserveCase));
         return sb.toString();
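
        For comparison, the JDK can strip accents in a more general way through Unicode decomposition. The sketch below is a different technique from the lookup used above, not this class's method: it uses java.text.Normalizer, removes every combining mark (not only the Spanish ones listed), and never changes case. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class NormalizerStripSketch {
    // Decompose each character into base letter plus combining marks (NFD),
    // then delete the combining marks.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Señor Ángel", written with escapes to stay independent of file encoding.
        System.out.println(stripDiacritics("Se\u00F1or \u00C1ngel"));
    }
}
```

        Because case is left alone, this corresponds most closely to calling toNonAccented with preserveCase set to TRUE, except that plain upper-case and lower-case letters both pass through unchanged.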
        
      • toLowerCaseSpanish

        public static char toLowerCaseSpanish​(char c)
        Produces a lower-case Spanish character if and only if the input parameter is an upper-case Spanish character. This is almost identical to the standard Character.toLowerCase(char) method, but it explicitly covers Spanish vowels and consonants with:

        • accent marks: À, Á, à, and á ... etc.
        • umlauts: Ü and ü
        • tildes: Ñ and ñ
        NOTE: The 'acute' and 'grave' accent marks are not used as prevalently now as in the time of "Don Quijote de la Mancha"; however, they are included here, just in case. Mostly the 'acute' accent mark (running from the top-right corner to the lower-left corner) is used in newspapers around here (Dallas, Texas).
        Parameters:
        c - Any ASCII or Unicode char
        Returns:
        Uppercase letters 'A' .. 'Z' are converted to 'a' .. 'z'
        AND:
        À (192), Á (193) ⇒ à (224), á (225)
        È (200), É (201) ⇒ è (232), é (233)
        Ì (204), Í (205) ⇒ ì (236), í (237)
        Ò (210), Ó (211) ⇒ ò (242), ó (243)
        Ù (217), Ú (218) ⇒ ù (249), ú (250)
        Ñ (209) ⇒ ñ (241)
        Ý (221) ⇒ ý (253)
        Ü (220) ⇒ ü (252)
        See Also:
        toUpperCaseSpanish(char), toLowerCaseSpanish(String)
        Code:
        Exact Method Body:
         if ((c >= 'A') && (c <= 'Z'))
             return (char) (c + 'a' - 'A');
         else if (	(c == 192) || (c == 193) || (c == 200) || (c == 201) ||
                     (c == 204) || (c == 205) || (c == 210) || (c == 211) ||
                     (c == 217) || (c == 218) || (c == 209) || (c == 220) ||
                     (c == 221)	)
             return (char) (c + 32);
         return c;
        
      • toLowerCaseSpanish

        public static java.lang.String toLowerCaseSpanish​(java.lang.String s)
        Cycles through the input String and converts every upper-case letter, including letters with accent marks, tildes, and umlauts, to lower-case. All other characters, such as punctuation, are preserved.
        Returns:
        a new String in which ES.toLowerCaseSpanish(char) has been invoked on each character.
        See Also:
        toLowerCaseSpanish(char)
        Code:
        Exact Method Body:
         StringBuilder ret = new StringBuilder();
         for (int i=0; i < s.length(); i++) ret.append(toLowerCaseSpanish(s.charAt(i)));
         return ret.toString();
        
      • toUpperCaseSpanish

        public static char toUpperCaseSpanish​(char c)
        Produces an upper-case Spanish Character - if and only if the input-parameter is a lower-case Spanish Character. See toLowerCaseSpanish(char) for more notes!
        Parameters:
        c - Any ASCII or Unicode char
        Returns:
        Lowercase letters 'a' .. 'z' are converted to 'A' .. 'Z'
        AND:
        à (224), á (225) ⇒ À (192), Á (193)
        è (232), é (233) ⇒ È (200), É (201)
        ì (236), í (237) ⇒ Ì (204), Í (205)
        ò (242), ó (243) ⇒ Ò (210), Ó (211)
        ù (249), ú (250) ⇒ Ù (217), Ú (218)
        ñ (241) ⇒ Ñ (209)
        ý (253) ⇒ Ý (221)
        ü (252) ⇒ Ü (220)
        See Also:
        toLowerCaseSpanish(char), toUpperCaseSpanish(String)
        Code:
        Exact Method Body:
         if ((c >= 'a') && (c <= 'z'))
             return (char) (c + 'A' - 'a');
         else if (	(c == 224) || (c == 225) || (c == 232) || (c == 233) ||
                     (c == 236) || (c == 237) || (c == 242) || (c == 243) ||
                     (c == 249) || (c == 250) || (c == 241) || (c == 253) ||
                     (c == 252)	)
             return (char) (c - 32);
         return c;
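
        Both toLowerCaseSpanish(char) and toUpperCaseSpanish(char) rely on the fact that each listed upper-case code point sits exactly 32 below its lower-case partner. As a sanity check, the sketch below compares that arithmetic against the JDK's own case mapping for the thirteen listed code points. The class and method names are hypothetical.

```java
final class Latin1CaseOffsetSketch {
    // The thirteen upper-case code points handled by toLowerCaseSpanish(char).
    static final int[] UPPER = {192, 193, 200, 201, 204, 205, 209, 210, 211, 217, 218, 220, 221};

    // True when Character's case mapping agrees with the +32 / -32 arithmetic.
    static boolean offsetMatchesJdk() {
        for (int upper : UPPER) {
            int lower = upper + 32;
            if (Character.toLowerCase(upper) != lower) return false;
            if (Character.toUpperCase(lower) != upper) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(offsetMatchesJdk());
    }
}
```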
        
      • toUpperCaseSpanish

        public static java.lang.String toUpperCaseSpanish​(java.lang.String s)
        Cycles through the input String and converts every lower-case letter, including letters with accent marks, tildes, and umlauts, to upper-case. All other characters, such as punctuation, are preserved.
        Returns:
        a new String in which ES.toUpperCaseSpanish(char) has been invoked on each character.
        See Also:
        toUpperCaseSpanish(char)
        Code:
        Exact Method Body:
         StringBuilder ret = new StringBuilder();
         for (int i=0; i < s.length(); i++) ret.append(toUpperCaseSpanish(s.charAt(i)));
         return ret.toString();
        
      • isLanguageChar

        public static boolean isLanguageChar​(char c)
        Checks if this character could be a Spanish Language Character
        Parameters:
        c - Any ASCII or Unicode character
        Returns:
        TRUE: If and only if 'c' is one of the following char-sets:

        • a ... z
        • A ... Z
        • Á (193), É (201), Í (205), Ó (211), Ú (218), Ý (221), Ü (220), Ñ (209)
        • á (225), é (233), í (237), ó (243), ú (250), ý (253), ü (252), ñ (241)
        and FALSE otherwise...
        Code:
        Exact Method Body:
         if ((c >= 'a') && (c <= 'z')) return true;
         if ((c >= 'A') && (c <= 'Z')) return true;
         // Á 193, É 201, Í 205, Ó 211, Ú 218, Ý 221, Ü 220, Ñ 209
         if ((c == 193) || (c == 201) || (c == 205) || (c == 211) || (c == 218) || (c == 221) || (c == 220) || (c == 209)) return true;
         // á 225, é 233, í 237, ó 243, ú 250, ý 253, ü 252, ñ 241
         if ((c == 225) || (c == 233) || (c == 237) || (c == 243) || (c == 250) || (c == 253) || (c == 252) || (c == 241)) return true;
         return false;
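
        One consequence of the lists above is that only the acute vowels are accepted: the grave vowels that getAccentedVowel can produce are not treated as language characters. The sketch below restates the test with a lookup string to make that visible. The class name and the ACCENTED constant are hypothetical, and the characters are written as escapes to stay independent of file encoding.

```java
final class LanguageCharSketch {
    // Upper-case: 193, 201, 205, 211, 218, 221, 220, 209, then the same letters in lower-case.
    private static final String ACCENTED =
        "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1" +
        "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean isLanguageChar(char c) {
        if (c >= 'a' && c <= 'z') return true;
        if (c >= 'A' && c <= 'Z') return true;
        return ACCENTED.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isLanguageChar((char) 225)); // acute a
        System.out.println(isLanguageChar((char) 224)); // grave a
        System.out.println(isLanguageChar(' '));        // space
    }
}
```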
        
      • onlyLanguageChars

        public static boolean onlyLanguageChars​(java.lang.String s)
        Checks that a String consists only of Spanish-language characters. Utilizes isLanguageChar(char)
        Parameters:
        s - Any String consisting of ASCII & Unicode characters
        Returns:
        TRUE only if isLanguageChar(s.charAt(i)) returns TRUE for every index i.
        FALSE otherwise.
        See Also:
        isLanguageChar(char)
        Code:
        Exact Method Body:
         for (int i=0; i < s.length(); i++) if (! isLanguageChar(s.charAt(i))) return false;
         return true;
        
      • isSpanishVerbInfinitive

        public static boolean isSpanishVerbInfinitive​(java.lang.String s)
        This is a function which identifies Spanish Language Infinitive Form Verbs.
        Parameters:
        s - Any String consisting of ASCII & Unicode characters
        Returns:
        TRUE if and only if:
        input String s ends with: ar, er, ir, arse, erse, irse, ír, írse
        input String s passes the onlyLanguageChars(s) boolean test
        FALSE otherwise
        See Also:
        onlyLanguageChars(String)
        Code:
        Exact Method Body:
         s = toLowerCaseSpanish(s);
         if (onlyLanguageChars(s))
             if (s.endsWith("ar")	|| s.endsWith("er")		|| s.endsWith("ir") ||
                 s.endsWith("arse")	|| s.endsWith("erse")	|| s.endsWith("irse") ||
                 s.endsWith("ír")	|| s.endsWith("írse"))
                 return true;
         return false;
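
        The two conditions above (letters only, plus one of the listed endings) can also be written as a single regular expression. The sketch below is intended to accept the same inputs, under the assumption that the letter set is exactly the one accepted by isLanguageChar. The class name and constants are hypothetical. Because the test looks only at the ending, ordinary nouns such as "mujer" are also reported as infinitives.

```java
import java.util.regex.Pattern;

final class InfinitiveSketch {
    // a-z, A-Z, and the acute vowels, Y-acute, U-umlaut, N-tilde in both cases.
    private static final String LETTERS =
        "a-zA-Z\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1" +
        "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    // Any letters, then a/e/i/i-acute + r, then an optional reflexive "se".
    private static final Pattern INFINITIVE = Pattern.compile(
        "[" + LETTERS + "]*[aeiAEI\u00ED\u00CD][rR](?:[sS][eE])?");

    static boolean looksLikeInfinitive(String s) {
        return INFINITIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInfinitive("hablar"));
        System.out.println(looksLikeInfinitive("ducharse"));
        System.out.println(looksLikeInfinitive("casa"));
        System.out.println(looksLikeInfinitive("mujer")); // a noun that ends in "er"
    }
}
```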
        
      • convertHTML_TO_UTF8

        public static java.lang.String convertHTML_TO_UTF8​(java.lang.String s)
        This method is somewhat redundant, since a complete HTML character escape-sequence class is included in the Torello.HTML package (links to those methods appear at the end of this comment). It was written much earlier and works well, but it converts only the HTML escape sequences commonly used in Spanish text, rather than all HTML character escape sequences. Here is the complete list:
        &aacute; ⇒ á
        &eacute; ⇒ é
        &iacute; ⇒ í
        &oacute; ⇒ ó
        &uacute; ⇒ ú
        &Aacute; ⇒ Á
        &Eacute; ⇒ É
        &Iacute; ⇒ Í
        &Oacute; ⇒ Ó
        &Uacute; ⇒ Ú
        &ntilde; ⇒ ñ
        &laquo; ⇒ «
        &raquo; ⇒ »
        &mdash; ⇒ -
        &uuml; ⇒ ü
        &iuml; ⇒ ï
        &iexcl; ⇒ ¡
        &iquest; ⇒ ¿
        &quot; ⇒ "
        Parameters:
        s - Any ASCII/Unicode String, which may contain Spanish-language HTML escape sequences.
        Returns:
        A String in which every HTML escape sequence listed above has been converted to its actual character equivalent.
        See Also:
        Escape.escHTMLToChar(String), Escape.htmlEsc(char)
        Code:
        Exact Method Body:
         return s.replaceAll("&aacute;", "á")	.replaceAll("&eacute;", "é")	
                 .replaceAll("&iacute;", "í")	.replaceAll("&oacute;", "ó")
                 .replaceAll("&uacute;", "ú")	.replaceAll("&Aacute;", "Á")
                 .replaceAll("&Eacute;", "É")	.replaceAll("&Iacute;", "Í")
                 .replaceAll("&Oacute;", "Ó")		.replaceAll("&Uacute;", "Ú")
                 .replaceAll("&ntilde;", "ñ")	.replaceAll("&laquo;", "«")
                 .replaceAll("&raquo;", "»")		.replaceAll("&mdash;", "-")
                 .replaceAll("&uuml;", "ü")		.replaceAll("&iuml;", "ï")
                 .replaceAll("&iexcl;", "¡")		.replaceAll("&iquest;", "¿")
                 .replaceAll("&quot;", "\"");
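
        String.replaceAll treats its first argument as a regular expression. That is harmless here, because the entity names contain no regex metacharacters, but String.replace performs the same substitution literally. The sketch below shows a table-driven variant under that assumption. The class name, the map, and the method name unescape are hypothetical, and the replacement characters are written as escapes to stay independent of file encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpanishEntitySketch {
    // Same nineteen entities as the list above, in the same order.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&aacute;", "\u00E1"); ENTITIES.put("&eacute;", "\u00E9");
        ENTITIES.put("&iacute;", "\u00ED"); ENTITIES.put("&oacute;", "\u00F3");
        ENTITIES.put("&uacute;", "\u00FA"); ENTITIES.put("&Aacute;", "\u00C1");
        ENTITIES.put("&Eacute;", "\u00C9"); ENTITIES.put("&Iacute;", "\u00CD");
        ENTITIES.put("&Oacute;", "\u00D3"); ENTITIES.put("&Uacute;", "\u00DA");
        ENTITIES.put("&ntilde;", "\u00F1"); ENTITIES.put("&laquo;", "\u00AB");
        ENTITIES.put("&raquo;", "\u00BB");  ENTITIES.put("&mdash;", "-");
        ENTITIES.put("&uuml;", "\u00FC");   ENTITIES.put("&iuml;", "\u00EF");
        ENTITIES.put("&iexcl;", "\u00A1");  ENTITIES.put("&iquest;", "\u00BF");
        ENTITIES.put("&quot;", "\"");
    }

    // Literal (non-regex) replacement of each entity in turn.
    static String unescape(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet())
            s = s.replace(e.getKey(), e.getValue());
        return s;
    }

    public static void main(String[] args) {
        System.out.println(unescape("&iquest;Qu&eacute; tal?"));
    }
}
```

        Keeping the pairs in one table makes a missing or empty replacement easy to spot.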
        
      • setRemoveWordsArr

        public static void setRemoveWordsArr​(java.lang.String[] wordList)
        This just stores a list of "words", and they are removed from certain texts/articles. This program currently uses it to remove certain extremely commonly used words, so they are not repeatedly searched for in the dictionary. It is kind of a hack.
        Parameters:
        wordList - An array of Strings. It is expected to be a list of words that may be removed from Spanish Texts, but it can be any list of words. It is checked to see if 100% of the characters in each word are alphabetic, and throws an IllegalArgumentException if they are not.
        Throws:
        java.lang.IllegalArgumentException - if the wordList parameter contains strings with invalid non-word characters.
        See Also:
        removeList
        Code:
        Exact Method Body:
         removeList = new Vector<String>();
                
         for (int i=0; i < wordList.length; i++)
         {
             String word = wordList[i];
             for (int j=0; j < word.length(); j++)
                 if (! isLanguageChar(word.charAt(j)))
                     throw new IllegalArgumentException
                         ("Contains word:" + word + " which has invalid, non-word, language-characters");
             removeList.addElement(word);
         }
        
      • removeWords

        public static java.lang.String removeWords​(java.lang.String s)
        This function references the words in the "removeList" and removes every occurrence of each word that is present in the "removeList" Vector<String>.
        Parameters:
        s - A String of Spanish Words.
        Returns:
        The same string with each instance of each word that is listed in the "removeList" Vector removed from the String!
        See Also:
        removeList, setRemoveWordsArr(String[])
        Code:
        Exact Method Body:
         // boolean printIt = false;
         // int tpos = s.indexOf(" a ");
         // if (tpos != -1) if (s.indexOf(" a ", tpos + 3) != -1) printIt = true;
         // if (printIt) System.out.println(s + ":");
                
         Enumeration<String> e = removeList.elements();
         // System.out.println("CLEANING: [" + s + "]");
         while (e.hasMoreElements())
         {
             // System.out.print(" HERE! ");
             String lc = toLowerCaseSpanish(s);
             // System.out.print(" <" + lc + ">");
             String word = e.nextElement();
             // System.out.print(" {" + word + "}");
             int pos = 0;
             while ((pos = lc.indexOf(word, pos)) != -1)
             {
                 int startPos = pos;
                 int endPos = pos + word.length();
                 boolean leftEnd = (startPos == 0);
                 boolean rightEnd = (endPos == lc.length());
                 char leftChar = leftEnd ? 0 : lc.charAt(startPos - 1);
                 // endPos is an "off-by-one" thing... I have thought this through!
                 char rightChar = rightEnd ? 0 : lc.charAt(endPos);
                 // if (printIt) System.out.print("(" + leftChar + "," + rightChar + "," + leftEnd + "," + rightEnd + "," + startPos + "," + endPos + ") ");
                 if (isLanguageChar(leftChar)) { pos = endPos; continue; }
                 if (isLanguageChar(rightChar)) { pos = endPos; continue; }
                 // System.out.print("(" + startPos + "," + endPos + ")" );
                 boolean leftSpace = (leftChar == ' ');
                 boolean rightSpace = (rightChar == ' ');
                 if (leftSpace && rightSpace) startPos--;
                 else if (leftSpace && rightEnd) startPos--;
                 else if (leftEnd && rightSpace) endPos++;
                        
                 s = (leftEnd ? "" : s.substring(0, startPos)) + (rightEnd ? "" : s.substring(endPos));
                 // if (printIt) System.out.print("[" + s + "] ");
                 lc = toLowerCaseSpanish(s);
             }
         }
         // if (printIt) System.out.println("\n");
         return s;
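
        The core idea of the loop above is to delete a word only when neither neighbouring character is a language character. A simplified sketch of that idea using regex lookarounds follows. It is not equivalent to the method body: instead of removing exactly one adjacent space, it collapses runs of spaces and trims the result, and it matches case-insensitively rather than lower-casing the text. The class name, method name, and LETTERS constant are hypothetical.

```java
import java.util.regex.Pattern;

final class RemoveWordsSketch {
    // Assumed to be the same letter set that isLanguageChar accepts.
    private static final String LETTERS =
        "a-zA-Z\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1" +
        "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static String removeWords(String text, String... words) {
        for (String word : words) {
            // Match the word only when it is not preceded or followed by a letter.
            Pattern p = Pattern.compile(
                "(?<![" + LETTERS + "])" + Pattern.quote(word) + "(?![" + LETTERS + "])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            text = p.matcher(text).replaceAll("");
        }
        // Tidy the gaps left behind by the removed words.
        return text.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeWords("el gato y el perro", "el"));
    }
}
```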