Class BaiDuQuery


  • public class BaiDuQuery
    extends java.lang.Object
    BaiDuQuery (百度搜索) - Documentation.

    Searches the Chinese search engine "百度搜索" (www.BaiDu.com).

    The query methods in this class accept Mandarin Chinese search strings.

    IMPORTANT NOTE: As of September 2020, when this class was written, 百度 (BaiDu) search results are identified by the following:

    HTML Elements:
    <DIV CLASS="result ...">    <!-- **OR ** -->
    <DIV CLASS="result-op ...">  
                                <!--    This divider contains / wraps the result.  There will
                                        likely be 12 such HTML '<DIV CLASS="result">...</DIV>' elements
                                        on a single page of a BaiDu Search.
                                -->
    
                                <!--
                                        When the class is "result-op", instead of "result", it means
                                        there are likely to be "sub-matches" listed in addition to the
                                        primary search-result.  A "sub-match" is where the result-site
                                        has many / multiple search results listed on its site - rather
                                        than just one.
                                -->
    
    <A HREF="...">              <!--    This Anchor 'A' Element and 'H3' Element contain the actual 
                                        link, and the link-text.  If this were to change with BaiDu,
                                        this particular search-engine class would fail.
                                -->
    </A>
    
    <DIV CLASS="c-row">         <!--    If there are "sub-links" available, they are "wrapped" inside
                                        of a divider ('DIV') whose CLASS="c-row".
                                        There is possibly other info buried here.
                                -->
    </DIV>
    </DIV>
    


    THIS CLASS WILL NOT WORK WITHOUT STARTING THE SPLASH-SERVER

    Package Torello.HTML contains a simple, mostly documentation / informational class named Torello.HTML.SplashBridge. The Splash Server was not developed by the same organization as this Java HTML JAR library. Though the Splash Server has a myriad of features, it is used here primarily to execute the script found on a web-page. When querying the Search Bar, the pages that are returned are heavily laden with script and AJAX calls.

    In order to retrieve the HTML of the search being performed, the script calls that use AJAX, JavaScript, TypeScript, jQuery, React JS and Angular JS need to be completed. Currently, the most popular tool available for this task is the Selenium WebDriver package. It "hooks up directly" to an instance of Google Chrome and asks it to execute any script on a given web-page before returning the HTML to be parsed. This can be useful, but since the tool is primarily marketed as a UI (User Interface) testing tool, much of its API is about performing button clicks and scroll-bar movements. Here, all that is needed is to make sure that the HTML intended to be viewed when a page finishes loading is indeed loaded.

    So far, this package has successfully used the Splash Tool to run the initial scripts that may be present on most web-pages.

    To start Splash, you must have access to the Docker Tool to install it on your machine. Splash is just a small, web-server-like piece of software. It will listen on port 8050, and act as a "Proxy" for calls to the Search Engine. When it receives a request for a URL, for instance, it will execute all available script on the page first, and then return the HTML to be parsed by this Java HTML Library.

    UNIX or DOS Shell Command:

    1. Install Docker. Make sure Docker version 17 (or greater) is installed.
    2. Pull the image:
           $ sudo docker pull scrapinghub/splash
    3. Start the container:
           $ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash

    NOTE: sudo is for the UNIX command line (Super User Do), not for MS-DOS.

    The Splash container must be running and listening on port 8050 before the methods in this class can function.
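    Since every query method depends on that listener, it can help to confirm that something accepts connections on port 8050 before calling them. The following is a minimal sketch and is not part of this library: the class name SplashPortCheck and the method isListening are hypothetical, and it assumes Splash runs on localhost. It only checks that a TCP connection can be opened; it does not verify that the listener is actually Splash.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, NOT part of the Java HTML library.
final class SplashPortCheck
{
    // Returns true if a TCP connection to host:port can be opened within timeoutMillis.
    public static boolean isListening(String host, int port, int timeoutMillis)
    {
        try (Socket socket = new Socket())
        {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        }
        catch (IOException e)
        {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    public static void main(String[] argv)
    {
        // Assumption: Splash was started with "-p 8050:8050" on this machine.
        boolean up = isListening("localhost", 8050, 1000);

        System.out.println(up
            ? "Something is listening on port 8050."
            : "Nothing is listening on port 8050; start the Splash container first.");
    }
}
```

    A caller might run this check once at start-up and fail early with a clear message, rather than waiting for an IOException from the first query.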

    Microsoft Windows Users: Please review the class SplashBridge for information on using the Splash HTTP Server in a Windows Environment via the Docker Loading Program which has been ported to Microsoft Windows.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. Java's Spring-Boot MVC feature is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Fields: The methods in this class do not create any internal state that is maintained - however there are a few private & static fields defined. These fields are instantiated only once during the Class Loader phase (and only if this class shall be used), and serve as data 'lookup' fields (static constants). View this class' source-code in the link provided below to see internally used data.

    This class has one private, static field for listing escape characters, and a public, static field that holds the PORT NUMBER and IP ADDRESS of the Splash Server.



    • Field Detail

      • SPLASH_URL

        public static java.lang.String SPLASH_URL
        Using "Splash" is *very simple*: start the Server, and prepend this String to every URL that is to be retrieved:
        Code:
        Exact Field Declaration Expression:
        public static String SPLASH_URL = "http://localhost:8050/render.html?url=";
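        As an illustration of that prefixing step, here is a minimal sketch. The class SplashUrlSketch and both methods are hypothetical names, not library API. The first method does plain concatenation, which is what the query method in this class does. The second percent-encodes the target first; that variant is an assumption on my part, and may be worth considering when the target URL carries its own '&' parameters, since those could otherwise be read as parameters of the Splash request. The "page=2" parameter is purely illustrative.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, NOT part of the Java HTML library.
final class SplashUrlSketch
{
    // Same value as the SPLASH_URL field documented above.
    static final String SPLASH_URL = "http://localhost:8050/render.html?url=";

    // Plain concatenation: prefix the target address with the Splash render endpoint.
    public static String viaSplash(String targetUrl)
    {
        return SPLASH_URL + targetUrl;
    }

    // Assumed alternative: percent-encode the target so that its own '?', '&' and '='
    // characters are carried inside the single "url" parameter.
    public static String viaSplashEncoded(String targetUrl)
    {
        return SPLASH_URL + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
    }

    public static void main(String[] argv)
    {
        System.out.println(viaSplash("https://www.baidu.com/s?wd=java"));
        System.out.println(viaSplashEncoded("https://www.baidu.com/s?wd=java&page=2"));
    }
}
```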
        
    • Method Detail

      • main

        public static void main​(java.lang.String[] argv)
                         throws java.io.IOException
        This class may be invoked at the Command Line. The arguments passed to this class will be sent to the query(Appendable, String[]) method. The results will be printed to the terminal.
        Throws:
        java.io.IOException - If any I/O problem occurs while scraping the site.
        Code:
        Exact Method Body:
         query(System.out, argv);
        
      • query

        public static Ret2<BaiDuQuery.SearchResult[],​java.net.URL[]> query​
                    (java.lang.Appendable log,
                     java.lang.String... argv)
                throws java.io.IOException
        
        This will poll the nearest 百度 (BaiDu.com) Web Server for the results of a search.

        IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.
        Parameters:
        log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.
        argv - This should be the list of keywords that would be typed into a 百度 Search Bar. There may be spaces here, so the entire search String may be sent as a single String, or it may be broken up into tokens and passed individually. The two forms produce the same query, because spaces and token boundaries are both encoded as '+'.
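        The contract of the log parameter described above (null is ignored, and appending may throw the checked IOException) can be sketched as follows. LogSketch, logLine and capture are hypothetical names introduced for illustration; this is not how the library itself is implemented, only one way a caller could write code against the same contract.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, NOT part of the Java HTML library.
final class LogSketch
{
    // Appends one line to the log; a null log is silently ignored.
    public static void logLine(Appendable log, String message) throws IOException
    {
        if (log != null) log.append(message).append('\n');
    }

    // Collects messages in a StringBuilder, which implements Appendable.
    // StringBuilder never actually throws IOException, but the interface
    // still forces the checked exception to be handled.
    public static String capture(String... messages)
    {
        StringBuilder sb = new StringBuilder();

        try
        {
            for (String m : messages) logLine(sb, m);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }

        return sb.toString();
    }

    public static void main(String[] argv) throws IOException
    {
        logLine(System.out, "sent to the terminal");   // PrintStream implements Appendable
        logLine(null, "silently dropped");             // null log: nothing happens
        System.out.print(capture("first", "second"));
    }
}
```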
        Returns:
        This shall return an instance of Ret2. The two elements that will make up the result will be as follows:

        1. Ret2.a (SearchResult[])

          This will be an array of SearchResult that (likely) contains several elements.

        2. Ret2.b (URL[])

          Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
        Throws:
        java.io.IOException - If any I/O problem occurs while scraping the site.
        Code:
        Exact Method Body:
         StringBuilder queryBuilder = new StringBuilder();
        
         for (int i=0; i < argv.length; i++)
         {
             String temp = argv[i].replace("+", "%2B").replace(" ", "+");
        
             temp = StrReplace.r
                 (temp, URL_ESC_CHARS, (int t, char c) -> '%' + Integer.toHexString((int) c));
        
             queryBuilder.append(temp);
             if (i < (argv.length -1)) queryBuilder.append('+');
         }
        
         String queryStr = queryBuilder.toString();
        
         if (log != null) log.append("Query String:\n" + C.BYELLOW + queryStr + C.RESET + '\n');
        
         return query(log, new URL("https://www.baidu.com/s?wd=" + queryStr));
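        For readers without access to the library's StrReplace class or its private URL_ESC_CHARS list, the keyword-to-query-string step can be approximated with the JDK alone. This is a sketch under assumptions, not the method body above: it swaps in java.net.URLEncoder, which escapes a different (and larger) set of characters and emits upper-case hex digits. QueryStringSketch and its method names are hypothetical. Like the original, it turns spaces into '+', turns a literal '+' into "%2B", and joins tokens with '+'.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

// Hypothetical helper, NOT part of the Java HTML library.
final class QueryStringSketch
{
    // Search address taken from the method body above.
    static final String BASE = "https://www.baidu.com/s?wd=";

    // Encodes each keyword and joins the pieces with '+'.
    public static String encodeKeywords(String... keywords)
    {
        StringJoiner joiner = new StringJoiner("+");

        // URLEncoder maps ' ' to '+', '+' to "%2B", and non-ASCII text to UTF-8 escapes.
        for (String k : keywords) joiner.add(URLEncoder.encode(k, StandardCharsets.UTF_8));

        return joiner.toString();
    }

    // Builds the complete search URL as a String.
    public static String searchUrl(String... keywords)
    {
        return BASE + encodeKeywords(keywords);
    }

    public static void main(String[] argv)
    {
        // One String with a space and two separate tokens give the same query.
        System.out.println(searchUrl("java html"));
        System.out.println(searchUrl("java", "html"));

        // "\u767E\u5EA6" is the name BaiDu written in Chinese characters.
        System.out.println(searchUrl("\u767E\u5EA6"));
    }
}
```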
        
      • query

        public static Ret2<BaiDuQuery.SearchResult[],​java.net.URL[]> query​
                    (java.lang.Appendable log,
                     java.net.URL query)
                throws java.io.IOException
        
        This will poll the nearest 百度 Web Server for the results of a search, given a provided URL. The URL provided to this method ought to be one of the URL's retrieved from the "next" button, as returned by a previous search engine query (the Ret2.b list of URL's).

        IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.

        Here is the "core HTML retrieve operation" for a BaiDu.com Search Bar Result:
         // Create a "百度 Results Iterator" - each result is wrapped in an HTML Element
         // that looks like: <DIV CLASS="result"> ... </DIV>  (or CLASS="result-op")
         HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
              (v, "div", "class", TextComparitor.C_OR, "result", "result-op");
         
         while (resultsIter.hasNext())
         {
              // Get the contents of the next <DIV CLASS="result"> ... </DIV>
              Vector<HTMLNode>  result    = resultsIter.next();

              // The first anchor <A HREF=...> will contain the link for this search-result.
              Vector<HTMLNode>  firstLink = TagNodeGetInclusive.first(result, "a");

              // Here is how the URL and Anchor text is collected
              String            url       = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
              String            title     = Util.textNodesString(firstLink).trim();
         }
         
        
        Parameters:
        log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.
        query - This may be a query URL that has been prepared, by 百度, to be used for the "next 10 results" of a particular search.

        Specifically: This URL should have been retrieved from a previous search-results page, and was listed as containing additional (next 10 matches) links.
        Returns:
        This shall return an instance of Ret2. The two elements that will make up the result will be as follows:

        1. Ret2.a (SearchResult[])

          This will be an array of SearchResult that (likely) contains several elements.

        2. Ret2.b (URL[])

          Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
        Throws:
        java.io.IOException - If any I/O problem occurs while scraping the site.
        Code:
        Exact Method Body:
         // Use a java Stream.Builder to save the results to a Java Stream.
         // Streams are easily converted to arrays.
         Stream.Builder<SearchResult> resultsBuilder = Stream.builder();
        
         URL splashQuery = new URL(SPLASH_URL + query.toString());
        
         // Download the HTML, and save it to a java.util.Vector (like an array)
         Vector<HTMLNode> v = HTMLPage.getPageTokens(splashQuery, false, "out.html", null, null);
        
         // Create a "百度 Results Iterator" - each result is wrapped in an HTML Element
         // that looks like: <DIV CLASS="result"> ... </DIV>  (or CLASS="result-op")
         HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
             (v, "div", "class", TextComparitor.C_OR, "result", "result-op");
        
         while (resultsIter.hasNext())
         {
             // Get the <DIV CLASS="result"> ... </DIV> contents.
             Vector<HTMLNode>    result          = resultsIter.next();
        
             // The first anchor <A HREF=...> will contain the link for this search-result.
             Vector<HTMLNode>    firstLink       = TagNodeGetInclusive.first(result, "a");
        
             String url      = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
             String title    = Util.textNodesString(firstLink).trim(); 
        
             // Save the results in a Java Stream, using Stream.Builder.
             Stream.Builder<SearchResult> subResultsBuilder = Stream.builder();
        
             // To get the list of search-result sub-links, retrieve all links that are wrapped in
             // <DIV CLASS="c-row"> ... </DIV>
             HNLIInclusive subLinksIter = InnerTagInclusiveIterator.get
                 (result, "div", "class", TextComparitor.C, "c-row");
        
             // Iterate through any "Sub Links"  Again, a "Sub Link" is hereby being defined
             // as a search result for a particular web-site that would be able to produce
             // many / numerous additional links.  Often times these additional links are more
             // useful than the primary link that was returned.
             while (subLinksIter.hasNext())
             {
                 Vector<HTMLNode>    div             = subLinksIter.next();
                 // System.out.println(Util.pageToString(div) + "\n********* RT **********************\n");
        
                 // The link / search-result itself is the first HTML Anchor Element (<A HREF=...>...</A>)
                 DotPair             subLink         = TagNodeFindInclusive.first(div, "A");
        
                 if (subLink == null) continue;
        
                 // Get the URL
                 String subLinkURL = ((TagNode) div.elementAt(subLink.start)).AV("href").trim();
        
                 // The first URL returned is just the one we have already retrieved.
                 if (subLinkURL.equalsIgnoreCase(url)) continue;
        
                 // Get the text that is wrapped inside the <A HREF=..> "this-text" </A>
                 // HTML Element.  Util.textNodesString(...) simply removes all TagNodes, and
                 // appends the TextNodes together.
                 String subLinkTitle = Util.textNodesString(div, subLink).trim();
        
                 subResultsBuilder.accept(new SearchResult(subLinkURL, subLinkTitle));
             }
        
             // Use Java Stream's to build the SearchResult[] Array.  Call the
             // Stream.Builder.build() method, and then call the Stream.toArray(...) method.
             SearchResult[] subResults = subResultsBuilder.build().toArray(SearchResult[]::new);
        
             SearchResult sr = new SearchResult
                 (url, title, (subResults.length > 0) ? subResults : null);
        
             resultsBuilder.accept(sr);
         }
        
         // Use java's Stream.Builder.build() to create the Stream, then easily convert
         // to an array.
         SearchResult[] srArr = resultsBuilder.build().toArray(SearchResult[]::new);
        
         // If the log is not null, print out the results.
         if (log != null)
             for (SearchResult sr : srArr) log.append(sr.toString() + '\n');
        
         // IMPORTANT NOTE:  This code will retrieve the next available 10 PAGES of 
         // SEARCH RESULTS as a URL.
         DotPair nextResultsDIV = InnerTagFindInclusive.first
             (v, "div", "id", TextComparitor.EQ, "page");
        
         // Use these criteria specifiers to find the HTML '<A HREF...>' NEXT-PAGE in
         // SEARCH-RESULTS links...
         Vector<DotPair> nextPages = InnerTagFindInclusive.all
             (v, nextResultsDIV.start, nextResultsDIV.end, "a", "href");
        
         URL[] urlArr = new URL[nextPages.size()];
        
         /*
         log.append(
             "nextResultsDIV.size(): " + nextResultsDIV.size() + '\n' +
             "urlArr.length: " + urlArr.length + '\n'
         );
         */
        
         // The URL for each of the next pages.  A programmer may expand this
         // answer by investigating more of these links in a loop.
         for (int i=0; i < nextPages.size(); i++)
         {
             DotPair link = nextPages.elementAt(i);
             String  href = ((TagNode) v.elementAt(link.start)).AV("href");
        
             urlArr[i] = Links.resolve(href, query);
             // log.append(urlArr[i].toString() + '\n');
         }
        
         return new Ret2<SearchResult[], URL[]>(srArr, urlArr);
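        The comment in the method body notes that a programmer may follow the returned next-page links in a loop. A self-contained sketch of such a loop is shown below. All names in it are hypothetical: Page stands in for the Ret2 pair (results for Ret2.a, nextPages for Ret2.b), Strings stand in for SearchResult and URL, and the fetcher function stands in for a call to query(log, url) with its IOException handled by the caller. The sketch caps the number of pages fetched and skips links it has already seen, since each results page tends to link back to its neighbours.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, NOT part of the Java HTML library.
final class PagingSketch
{
    // Stand-in for the Ret2 pair: 'results' plays the role of Ret2.a,
    // 'nextPages' plays the role of Ret2.b.
    static final class Page
    {
        final List<String> results;
        final List<String> nextPages;

        Page(List<String> results, List<String> nextPages)
        {
            this.results   = results;
            this.nextPages = nextPages;
        }
    }

    // Fetches at most maxPages pages, starting at firstUrl and following
    // next-page links in the order they are discovered.
    public static List<String> collect
        (String firstUrl, Function<String, Page> fetcher, int maxPages)
    {
        List<String>  all     = new ArrayList<>();
        Set<String>   seen    = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();

        pending.add(firstUrl);
        seen.add(firstUrl);

        int fetched = 0;

        while (!pending.isEmpty() && (fetched < maxPages))
        {
            Page page = fetcher.apply(pending.poll());
            fetched++;

            all.addAll(page.results);

            // Queue only links that have not been seen before.
            for (String next : page.nextPages)
                if (seen.add(next)) pending.add(next);
        }

        return all;
    }
}
```

        In real use each fetch would go through the Splash Server, so a small maxPages value keeps the number of rendered pages under control.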