Class ForeignNewsArticle


  • public class ForeignNewsArticle
    extends java.lang.Object
    Foreign News Article - Documentation.

    This class translates the contents of a news article into English from any language supported by the Google Cloud Server Translate API. The translation it produces is a very simple rendition. The class expects its user to "pick out the article content" and to provide that vectorized-HTML sub-page to the processArticle(...) method of this class.

    This class will:

    • Translate the text from the native-language to English.
    • Generate a side-by-side article with both original-language and English article content
    • Save the page as an "index.html" file in the user-specified directory
    • Download any photos present in the HTML
    • Re-name the photo file-names, after downloading them to a local user-specified directory.
    • Update the page HTML <IMG SRC="..."> nodes accordingly with the new image names.
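The renaming and SRC-update steps listed above can be pictured with a small, JDK-only sketch. It is not this class's implementation: the real code works on vectorized HTML and also downloads the files, whereas this sketch uses a regular expression on a plain String and only computes names. The class ImageRenameDemo, the method renumber, and the "number plus original extension" naming scheme are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Java-HTML library.
final class ImageRenameDemo {
    // Captures: (1) everything up to the opening quote of SRC, (2) the SRC value, (3) the closing quote.
    private static final Pattern IMG_SRC =
        Pattern.compile("(<img\\b[^>]*?\\bsrc=\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /**
     * Rewrites every double-quoted IMG SRC value to a numbered local file-name
     * ("1.jpg", "2.png", ...) and appends each new name to 'names'.
     * Assumption: the extension is whatever follows the last '.' in the last
     * path segment; query strings are not handled by this sketch.
     */
    static String renumber(String html, List<String> names) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        int count = 0;
        while (m.find()) {
            String src = m.group(2);
            int dot = src.lastIndexOf('.');
            String ext = (dot > src.lastIndexOf('/')) ? src.substring(dot) : "";
            String name = (++count) + ext;
            names.add(name);
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + name + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        String html = "<p><IMG SRC=\"http://x.com/a/photo.jpg\"> text <img alt=\"y\" src=\"/b/pic.png\"></p>";
        System.out.println(renumber(html, names));
        System.out.println(names);
    }
}
```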


    To translate a foreign-language news article into English or Spanish, this is the only class that is really needed. It does a "simple-translation" using the Google Cloud Server Translate API.

    IMPERATIVE: This class makes calls to the GCSTAPI, and therefore Google will want an "API Key" so that it can bill your account for the translations. This Java package is not going to "eat" your API key, but it does expect one for these classes to work. The class GCSTAPI has a field simply called public static String key that needs to be set to a valid GCS Translate API Key; otherwise the API calls will fail. You may read more about this on Google's website, and in the class Torello.Languages.GCSTAPI.

    FINALLY: This class makes calls to class ImageScraper, which uses a Time-Out monitor-thread to prevent locking up when downloading images. However, when your program exits, it may sit idle for anywhere between 1 second and 1 minute, because the Java runtime does not automatically kill all threads, even after the main program flow has finished.

    To solve this problem immediately, call:
    
    ImageScraper.shutdownTOThreads();
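
The lingering described above is the usual behavior of non-daemon worker threads. The following JDK-only sketch illustrates the general pattern; it is not ImageScraper's actual code. The class TimeoutPoolDemo and its methods are hypothetical. It relies on one documented JDK fact: the idle threads of Executors.newCachedThreadPool() are non-daemon and are kept for up to 60 seconds, so the JVM stays alive until the pool is shut down.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a time-out monitor; names are illustrative only.
final class TimeoutPoolDemo {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    /** Runs 'task' with a time limit; returns null if it does not finish in time. */
    static String runWithTimeout(Callable<String> task, long millis) throws Exception {
        Future<String> future = POOL.submit(task);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the worker that is still running
            return null;
        }
    }

    /** Stops the pool so that its idle threads no longer keep the JVM alive. */
    static boolean shutdown() throws InterruptedException {
        POOL.shutdownNow();
        return POOL.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithTimeout(() -> "done", 1000));
        System.out.println(runWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 100));
        // Without this call, the idle pool threads would delay JVM exit.
        shutdown();
    }
}
```

In a program that uses this class, the equivalent step is to call ImageScraper.shutdownTOThreads() once all articles have been processed, for example in a finally block.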
    


    Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. Java's Spring-Boot, MVC feature is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Field: The methods in this class do not create any internal state that is maintained, but there is a single public static final field defined. This field is initialized only once, when the class is loaded (and only if this class is actually used), and serves as a data 'lookup' field (like a static constant). View this class' source-code in the link provided below to see internally used data.

    The internal field is a public static final String that stores the HTML Header Page portion of the returned String.
    See Also:
    GCSTAPI.key, GCSTAPI.sentence(String, LC, LC), GCSTAPI.wordByWord(Vector, LC, LC)



    • Field Summary

      Fields 
      Modifier and Type Field
      static String HEADER
    • Method Summary

      Modifier and Type Method
      static Ret3<Vector<String>, Vector<String>, String[]>
      processArticle(Vector<HTMLNode> articleBody, URL url, String title, LC srcLang, Appendable log, String targetDirectory)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • HEADER

        public static final java.lang.String HEADER
        This is the HTML page header that is appended to the output page.
        Code:
        Exact Field Declaration Expression:
        public static final String	HEADER =  
                "<HTML>\n"	+
                HTMLHeader.metaTag + "\n"	+
                "<TITLE>Translated, Foreign Language Article</TITLE>\n"	+
                "<SCRIPT type=\"text/javascript\">\n" + HTMLHeader.javaScript + "\n" + "</SCRIPT>" + "\n"	+
                "<STYLE>\n" + HTMLHeader.css + "</STYLE>" + "\n"	+
                "<BODY>" + "\n" + HTMLHeader.popUpDIV + "\n"	+
                HTMLHeader.text2SpeechNote;
        
    • Method Detail

      • processArticle

        public static Ret3<java.util.Vector<java.lang.String>, java.util.Vector<java.lang.String>, java.lang.String[]> processArticle
                    (java.util.Vector<HTMLNode> articleBody,
                     java.net.URL url,
                     java.lang.String title,
                     LC srcLang,
                     java.lang.Appendable log,
                     java.lang.String targetDirectory)
                throws java.io.IOException
        
        This will download and translate a news article from a foreign news website. All that you need to do is provide the main "Article-Body" of the article, and some information - and calls to Google Cloud Server Translate API will be handled by the code.

        IMPORTANT NOTE: This class makes calls to the GCSTAPI, which is an acronym for the Google Cloud Server Translate API. This server expects you to pay Google for the services that it provides. The translations are not free, but they are not too expensive either. You must be sure to set the GCSTAPI.key field (a String) in order for the GCS Translate API queries to succeed.

        Your Directory Will Contain:

        1. Article Photos, stored by number as they appear in the article
        2. index.html - Article Body with Translations
        Parameters:
        articleBody - This should have the content of the article from the vectorized HTML page. Read more about cleaning an HTML news article in the class ArticleGet.


        Example:
         // Generally, retrieving the "Article Body" from a news-article web-page is a fairly simple
         // two-step process.
         //
         // Step 1:  Look at the web-page in your browser and use your browser's "View Source"
         //          option.  Identify the HTML Divider Element that looks something like
         //          <DIV CLASS='article_body'> ... or maybe <DIV CLASS='page_content'>
         //          You will have to find the relevant divider, or article element, once and only
         //          once per website.
         //
         // Step 2:  Grab that content with a simple call to the Inclusive-Get methods in NodeSearch
        
         URL url = new URL("https://some.foreign-news.site/some-article.html");
         Vector<HTMLNode> articlePage = HTMLPage.getPageTokens(url, false);
         Vector<HTMLNode> articleBody = InnerTagGetInclusive.first(articlePage, "div", "class",
                                            TextComparitor.C, "page-content");
                                            // Use whatever tag you have found via the "View Source"
                                            // option in your browser.  You only need to find this tag
                                            // once per website!
        
         // Now pass the 'articleBody' to this 'processArticle' method.
         // You will also have to retrieve the "Article Title" manually.
         // The 'title' could be stored in any number of ways, depending on which site is
         // being viewed.  The title location is usually consistent as long as you're on
         // the same website.
        
         String title = "?";    // you must search the page to retrieve the title
         LC articleLC = LC.es;  // Select the (spoken) language used in the article.
                                // This could be LC.vi (Vietnamese), LC.es (Spanish) etc...
        
         Ret3<Vector<String>, Vector<String>, String[]> response = ForeignNewsArticle.processArticle
                 (articleBody, url, title, articleLC, new StorageWriter(), "outdir/");
        
         // The returned String-Vectors will have the translated sentences and words readily
         // available for use - if you wish to further process the article-content.
         // The output directory 'outdir/' will have a readable 'index.html' file, along
         // with any photos that were found on the page already downloaded so they may be
         // locally included on the output page.
         
        
        url - The URL of the article being scraped. It is used only to include a link to the article's original page in the output index.html file.
        title - This is needed because the title can be obtained in myriad ways. Keeping it as an "external option" gives the coder/programmer more leeway.
        srcLang - This is just the "two character" language code that Google Cloud Server expects to see.
        log - Progress is logged to this Appendable (for example, terminal output). Null may be passed, in which case no output will be displayed. Any implementation of java.lang.Appendable will suffice. Note that the 'Appendable' interface requires handling IOException for its 'append(...)' methods.
        targetDirectory - This is the directory where the image-files and 'index.html' file will be stored.
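
For the log parameter described above, here is a minimal JDK-only sketch of how a null-tolerant Appendable log can behave. It is an illustration, not this class's internal logging code; the class AppendableLogDemo and the method logLine are hypothetical. It shows that a StringBuilder and System.out are both valid Appendable implementations, and that callers must allow for IOException.

```java
import java.io.IOException;

// Hypothetical helper, not part of the Java-HTML library.
final class AppendableLogDemo {
    /** Writes one progress line to 'log', or does nothing when 'log' is null. */
    static void logLine(Appendable log, String message) throws IOException {
        if (log == null) return;            // null means "no logging"
        log.append(message).append('\n');   // append(...) is declared to throw IOException
    }

    public static void main(String[] args) throws IOException {
        StringBuilder captured = new StringBuilder();
        logLine(captured, "Downloading image 1");   // collected in memory
        logLine(null, "ignored");                   // silently skipped
        logLine(System.out, "Translating...");      // PrintStream implements Appendable
        System.out.print(captured);
    }
}
```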
        Returns:
        This will return an instance of: Ret3<Vector<String>, Vector<String>, String[]>

        • ret3.a (Vector<String>)

          This vector contains a list of sentences, or sentence-fragments, in the original language of the news or article.

        • ret3.b (Vector<String>)

          This vector contains a list of sentences, or sentence-fragments, in the target language, which is English.
        • ret3.c (String[])

          This array of strings contains a list of filenames, one for each image that was present on the original news or article page, and therefore downloaded.
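
As a rough illustration of further processing, the sketch below pairs original and translated sentences into side-by-side rows. It assumes that ret3.a and ret3.b are parallel (same order, one translated entry per original entry), which the description above suggests but does not state outright, so the shorter length is used as a guard. The class SideBySideDemo and the method pair are hypothetical, and plain Vector<String> values stand in for the fields of the returned Ret3.

```java
import java.util.Arrays;
import java.util.Vector;

// Hypothetical helper, not part of the Java-HTML library.
final class SideBySideDemo {
    /** Joins entry i of 'original' with entry i of 'english' as "original | english". */
    static Vector<String> pair(Vector<String> original, Vector<String> english) {
        int n = Math.min(original.size(), english.size());   // guard against unequal lengths
        Vector<String> rows = new Vector<>(n);
        for (int i = 0; i < n; i++) rows.add(original.get(i) + " | " + english.get(i));
        return rows;
    }

    public static void main(String[] args) {
        // Stand-ins for response.a and response.b
        Vector<String> a = new Vector<>(Arrays.asList("Hola mundo.", "Buenos días."));
        Vector<String> b = new Vector<>(Arrays.asList("Hello world.", "Good morning."));
        for (String row : pair(a, b)) System.out.println(row);
    }
}
```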
        Throws:
        java.io.IOException