Class PinYinParse


  • public class PinYinParse
    extends java.lang.Object
    PinYinParse (罗马拼音) - Documentation.

    This class was originally written in JavaScript in the summer of 2016. It parses the output generated by Google's Translate website: it takes a simplified-Mandarin sentence together with its Romanized Pin-Yin as input, and fills a pair of parallel vectors, one holding the Chinese character-words and the other holding their pronunciations.

    Static (Functional) API: Every method in this class is declared with the Java key-word 'static'. Furthermore, there is no way to obtain an instance of this class, because it exposes no public constructors. Java's Spring-Boot MVC feature is *not* utilized, because it runs directly against the light-weight data-classes philosophy. The static approach has several advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static', and the class declares no fields, so there is no internal state. Without any internal state there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API tends to use fewer, lighter-weight data-classes, making it easier to understand and to program against.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be easier to reason about, thanks to the reuse of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) - by using the key-word static on all of its methods, and by sticking to Java's well-understood class Vector.

    Internal-State: A user may click on this class' source code (see link below) to view any and all internally defined fields. A cursory inspection of the code shows that this class has precisely zero internally defined global fields (Spaghetti). All variables used by the methods in this class are local variables only, and therefore this class ought to be thought of as 'state-less'.
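
    The stateless, static-only style described above can be sketched as follows. The class name PinYinWords and its method countWords are hypothetical and are not part of this library. The private constructor is one common Java idiom for preventing instantiation; it is an assumption here, not something this page confirms about PinYinParse.

```java
// Hypothetical illustration only; not part of the Java-HTML library.
final class PinYinWords
{
    // Assumption: a private constructor is used here to block instantiation.
    private PinYinWords() { }

    // Depends only on its argument: no fields are read or written.
    static int countWords(String pinyin)
    {
        String trimmed = pinyin.trim();
        if (trimmed.isEmpty()) return 0;

        // Split on runs of whitespace, so repeated blanks do not create empty words.
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args)
    {
        System.out.println(countWords("  wo xihuan  zhongguo "));   // prints 3
    }
}
```

    Because the method keeps no state, calling it twice with the same argument always gives the same result.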



    • Method Summary

      Modifier and Type   Method
      static boolean      parse(Appendable DOUT, String simpSentence, String pronSentence, Vector<String> characters, Vector<String> pronunciation)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • parse

        public static boolean parse
                    (java.lang.Appendable DOUT,
                     java.lang.String simpSentence,
                     java.lang.String pronSentence,
                     java.util.Vector<java.lang.String> characters,
                     java.util.Vector<java.lang.String> pronunciation)
                throws java.io.IOException
        
        The purpose of this method is to produce the parallel arrays (Vectors) that contain Chinese characters and Chinese PinYin, based on the results of a Google Translate query.

        NOTE: This is of "limited use", since the input to this function is primarily a String scraped from the Google Translate Website, not the result of a query to Google Cloud Server's Translate-API. The API version of Mandarin Translations leaves out the Pin-Yin Romanizations entirely, which makes the whole package a lot less useable. The web-site itself can be scraped, and the Pin-Yin obtained, but that String comes from a web-site that changes from time-to-time.

        NOTE: If scraping Google's Translate Web-site conjures images of the police coming to your door, another web-site that seems to do pretty good Romanization is Pin1Yin1.com. I have another class that scrapes that site.
        Parameters:
        DOUT - This is filled with Debug Information as this method runs. It may be any implementation of Java's java.lang.Appendable interface.
        simpSentence - This is the complete simplified-Mandarin sentence obtained from a news article.
        pronSentence - This is the pronunciation of the simplified-Mandarin sentence. This should have already been obtained from Google Translate.
        characters - This should be an empty vector. It will be populated by the words from the original Mandarin sentence, based on the pronunciation obtained from Google Translate.
        pronunciation - This should also be an empty vector. It will be populated after the words from the pronunciation sentence have been parsed into individual words.
        Returns:
        boolean - true if an error may have occurred along the way. Specifically, the returned value is:
        (cSent.length() != totalChinese) && (totalChinese > 0);
        Throws:
        java.io.IOException - The interface java.lang.Appendable mandates that IOException be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.
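
        Since DOUT accepts any java.lang.Appendable, several standard JDK classes can serve as the debug sink. The sketch below shows this with a hypothetical stand-in method, logSummary, which only mimics the way parse writes to its Appendable; it does not call this library. The DISCARD sink is likewise an illustrative assumption, for callers who want no debug output.

```java
import java.io.IOException;
import java.io.StringWriter;

class AppendableSinks
{
    // Hypothetical stand-in for a method that, like parse, logs to an Appendable.
    // Appendable.append declares IOException, so it must be declared here too.
    static void logSummary(Appendable dout, int found, int expected) throws IOException
    {
        dout.append("FOUND (" + found + ") of (" + expected + ") characters\n");
    }

    // An Appendable that silently drops everything written to it.
    static final Appendable DISCARD = new Appendable()
    {
        public Appendable append(CharSequence csq)                     { return this; }
        public Appendable append(CharSequence csq, int start, int end) { return this; }
        public Appendable append(char c)                               { return this; }
    };

    public static void main(String[] args) throws IOException
    {
        StringBuilder sb = new StringBuilder();

        logSummary(sb, 4, 4);                  // collect the log in memory
        logSummary(System.out, 4, 4);          // PrintStream implements Appendable
        logSummary(new StringWriter(), 4, 4);  // any java.io.Writer works as well
        logSummary(DISCARD, 4, 4);             // ignore the log entirely

        System.out.print(sb);
    }
}
```

        A StringBuilder is convenient when the log should be inspected only after the return value signals a possible mismatch.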
        Code:
        Exact Method Body:
         int totalChinese = 0;
         DOUT.append("********************************************\n");
         DOUT.append("chin = " + simpSentence + "\n");
         DOUT.append("pron = " + pronSentence + "\n");
         // Convert any "alternate" (AUC) versions of A..Z or 0..9 that are present
         String cSent = ZH.convertAnyAUC(simpSentence);
        
         // CHANGED 2018.09.24 - delAllPunctuation does not remove '.' and ',' between numbers!
         String pSent = ZH.delAllPunctuationPINYIN(pronSentence);
        
         cSent = ZH.delAllPunctuationCHINESE(cSent);
        
         DOUT.append("********************************************\n");
         DOUT.append("After Removing non-alphanumeric UniCode, and Alt-UniCode:\n");
         DOUT.append("cSent=" + cSent + "\n");
         DOUT.append("pSent=" + pSent + "\n");
         DOUT.append("********************************************\n");
        
         // Leading or trailing blanks would produce empty words in the split,
         // so trim() the sentence first.
         String[] pWords = pSent.trim().split(" ");
        
         for (int i = 0; i < pWords.length; i++)
         {
             String pronWord = pWords[i].trim();
        
             if (pronWord.length() == 0) continue;
        
             // Sometimes alphabetic characters appear in the chinese string.
             int leading = ZH.countLeadingLettersAndNumbers(cSent.substring(totalChinese));
             if (leading > 0)
             {
                 String alphaNumericASCII = cSent.substring(totalChinese, totalChinese + leading);
        
                 DOUT.append("*** Found English and Numbers ASCII in Chinese Sentence ***\n");
                 DOUT.append("There are " + leading + " leading alpha numeric characters.");
                 DOUT.append(" [" + alphaNumericASCII + "]\n");
                 DOUT.append("pronunciation word is: [" + pronWord + "]\n");
        
                 pronunciation.add(pronWord);
                 characters.add(alphaNumericASCII);
        
                 totalChinese += leading;
             }
             // else - it's just normal characters in the chinese string
             else
             {
                 int numChinese      = ZH.countSyllablesAndNonChinese(pronWord, DOUT);
                 String chineseWord  = cSent.substring(totalChinese, totalChinese + numChinese);
                                  
                 DOUT.append("The word [" + pronWord + "] ");
                 DOUT.append("corresponds to " + numChinese + " Unicode Characters ");
                 DOUT.append("[" + chineseWord + "]\n");
        
                 // Add the new word to the list
                 pronunciation.add(pronWord);
                 characters.add(chineseWord);
        
                 totalChinese += numChinese;
             }
         }
        
         DOUT.append(
             "********************************************\n" +
             "COMPLETED SENTENCE LOOP\n" +
             "SUMMARY:\n" +
             "FOUND (" + totalChinese + ") characters in Chinese String\n" +
             "STRING CONTAINS (" + cSent.length() + ") characters\n" +
             ((totalChinese != cSent.length()) ? "\nPOSSIBLE ERROR MISMATCH\n\n" : "") +
             "********************************************\n"
         );
        
         return (cSent.length() != totalChinese) && (totalChinese > 0);
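
        The alignment loop and the returned mismatch flag can be illustrated with a simplified sketch. It is not this library's implementation: the class AlignmentSketch and its method align are hypothetical, and the syllable count of each word is passed in by the caller, whereas the real method derives it through ZH.countSyllablesAndNonChinese. Unlike the method body above, the sketch clamps substring bounds so that a bad count cannot throw. The sample text, written with Unicode escapes, is the five-character sentence 我喜欢中国.

```java
import java.util.Vector;

class AlignmentSketch
{
    // Pairs each pronunciation word with the next syllableCounts[i] characters
    // of the Chinese text (assumption: one character per Pin-Yin syllable).
    // Returns true when the characters consumed do not match the text length.
    static boolean align(String chinese, String[] pronWords, int[] syllableCounts,
                         Vector<String> characters, Vector<String> pronunciation)
    {
        int total = 0;

        for (int i = 0; i < pronWords.length; i++)
        {
            // Clamp both bounds so an over-large count cannot run past the end.
            int start = Math.min(total, chinese.length());
            int end   = Math.min(total + syllableCounts[i], chinese.length());

            characters.add(chinese.substring(start, end));
            pronunciation.add(pronWords[i]);

            total += syllableCounts[i];
        }

        // Same condition as the method above: a length mismatch after some progress.
        return (chinese.length() != total) && (total > 0);
    }

    public static void main(String[] args)
    {
        String   chinese = "\u6211\u559c\u6b22\u4e2d\u56fd";   // 5 characters
        String[] words   = { "wo", "xihuan", "zhongguo" };     // tone marks omitted

        Vector<String> characters    = new Vector<>();
        Vector<String> pronunciation = new Vector<>();

        // Counts 1 + 2 + 2 = 5 match the text length, so no mismatch is reported.
        boolean mismatch = align(chinese, words, new int[] { 1, 2, 2 }, characters, pronunciation);
        System.out.println(characters.size() + " words, mismatch = " + mismatch);

        // Counts 1 + 2 + 1 = 4 leave one character unconsumed, so the flag is true.
        boolean bad = align(chinese, words, new int[] { 1, 2, 1 }, new Vector<>(), new Vector<>());
        System.out.println("mismatch = " + bad);
    }
}
```

        Note that the flag is false when nothing at all was consumed (total is 0), which mirrors the `totalChinese > 0` clause in the return expression.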